本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。
统计
今日共更新998篇论文,其中:
- 自然语言处理139篇
- 信息检索34篇
- 计算机视觉187篇
自然语言处理
1. 【2602.15029】Symmetry in language statistics shapes the geometry of model representations
链接:https://arxiv.org/abs/2602.15029
作者:Dhruva Karkada,Daniel J. Korchinski,Andres Nava,Matthieu Wyart,Yasaman Bahri
类目:Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Computation and Language (cs.CL)
关键词:neural networks' success, remain poorly understood, underlie neural networks', fundamental properties remain, properties remain poorly
备注:
点击查看摘要
Abstract:Although learned representations underlie neural networks' success, their fundamental properties remain poorly understood. A striking example is the emergence of simple geometric structures in LLM representations: for example, calendar months organize into a circle, years form a smooth one-dimensional manifold, and cities' latitudes and longitudes can be decoded by a linear probe. We show that the statistics of language exhibit a translation symmetry -- e.g., the co-occurrence probability of two months depends only on the time interval between them -- and we prove that the latter governs the aforementioned geometric structures in high-dimensional word embedding models. Moreover, we find that these structures persist even when the co-occurrence statistics are strongly perturbed (for example, by removing all sentences in which two months appear together) and at moderate embedding dimension. We show that this robustness naturally emerges if the co-occurrence statistics are collectively controlled by an underlying continuous latent variable. We empirically validate this theoretical framework in word embedding models, text embedding models, and large language models.
2. 【2602.15014】Scaling Beyond Masked Diffusion Language Models
链接:https://arxiv.org/abs/2602.15014
作者:Subham Sekhar Sahoo,Jean-Marie Lemercier,Zhihan Yang,Justin Deschenaux,Jingyu Liu,John Thickstun,Ante Jukic
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:Masked diffusion, Masked diffusion models, Diffusion, promising alternative, Masked
备注: code: [this https URL](https://github.com/s-sahoo/scaling-dllms)
点击查看摘要
Abstract:Diffusion language models are a promising alternative to autoregressive models due to their potential for faster generation. Among discrete diffusion approaches, Masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks. In this work, we present the first scaling law study of uniform-state and interpolating discrete diffusion methods. We also show that Masked diffusion models can be made approximately 12% more FLOPs-efficient when trained with a simple cross-entropy objective. We find that perplexity is informative within a diffusion family but can be misleading across families, where models with worse likelihood scaling may be preferable due to faster and more practical sampling, as reflected by the speed-quality Pareto frontier. These results challenge the view that Masked diffusion is categorically the future of diffusion language modeling and that perplexity alone suffices for cross-algorithm comparison. Scaling all methods to 1.7B parameters, we show that uniform-state diffusion remains competitive on likelihood-based benchmarks and outperforms autoregressive and Masked diffusion models on GSM8K, despite worse validation perplexity. We provide the code, model checkpoints, and video tutorials on the project page: this http URL
3. 【2602.15013】xt Style Transfer with Parameter-efficient LLM Finetuning and Round-trip Translation
链接:https://arxiv.org/abs/2602.15013
作者:Ruoxi Liu,Philipp Koehn
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Language Models, Large Language, Text Style Transfer, fine-tuning of Large
备注: 9 pages, 5 figures, 4 tables
点击查看摘要
Abstract:This paper proposes a novel method for Text Style Transfer (TST) based on parameter-efficient fine-tuning of Large Language Models (LLMs). Addressing the scarcity of parallel corpora that map between styles, the study employs roundtrip translation to synthesize such parallel datasets from monolingual corpora. This approach creates 'neutralized' text devoid of stylistic attributes, essentially creating a shared input style at training-time and inference-time. Experimental results demonstrate consistent superiority of this method over zero-shot prompting and fewshot ICL techniques measured by BLEU scores and style accuracy scores across four investigated domains. Furthermore, the integration of retrieval-augmented generation (RAG) for terminology and name knowledge enhances robustness and stylistic consistency.
4. 【2602.15012】Cold-Start Personalization via Training-Free Priors from Structured World Models
链接:https://arxiv.org/abs/2602.15012
作者:Avinandan Bose,Shuyue Stella Li,Faeze Brahman,Pang Wei Koh,Simon Shaolei Du,Yulia Tsvetkov,Maryam Fazel,Lin Xiao,Asli Celikyilmaz
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:user-specific historical data, personalization requires inferring, user-specific historical, inferring user preferences, Cold-start personalization requires
备注: 24 pages, 4 figures, 4 tables
点击查看摘要
Abstract:Cold-start personalization requires inferring user preferences through interaction when no user-specific historical data is available. The core challenge is a routing problem: each task admits dozens of preference dimensions, yet individual users care about only a few, and which ones matter depends on who is asking. With a limited question budget, asking without structure will miss the dimensions that matter. Reinforcement learning is the natural formulation, but in multi-turn settings its terminal reward fails to exploit the factored, per-criterion structure of preference data, and in practice learned policies collapse to static question sequences that ignore user responses. We propose decomposing cold-start elicitation into offline structure learning and online Bayesian inference. Pep (Preference Elicitation with Priors) learns a structured world model of preference correlations offline from complete profiles, then performs training-free Bayesian inference online to select informative questions and predict complete preference profiles, including dimensions never asked about. The framework is modular across downstream solvers and requires only simple belief models. Across medical, mathematical, social, and commonsense reasoning, Pep achieves 80.8% alignment between generated responses and users' stated preferences versus 68.5% for RL, with 3-5x fewer interactions. When two users give different answers to the same question, Pep changes its follow-up 39-62% of the time versus 0-28% for RL. It does so with ~10K parameters versus 8B for RL, showing that the bottleneck in cold-start elicitation is the capability to exploit the factored structure of preference data.
5. 【2602.15005】Learning User Interests via Reasoning and Distillation for Cross-Domain News Recommendation
链接:https://arxiv.org/abs/2602.15005
作者:Mengdan Zhu,Yufan Zhao,Tao Di,Yulan Yan,Liang Zhao
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:discover relevant content, helping users discover, users discover relevant, relevant content, plays a critical
备注:
点击查看摘要
Abstract:News recommendation plays a critical role in online news platforms by helping users discover relevant content. Cross-domain news recommendation further requires inferring user's underlying information needs from heterogeneous signals that often extend beyond direct news consumption. A key challenge lies in moving beyond surface-level behaviors to capture deeper, reusable user interests while maintaining scalability in large-scale production systems. In this paper, we present a reinforcement learning framework that trains large language models to generate high-quality lists of interest-driven news search queries from cross-domain user signals. We formulate query-list generation as a policy optimization problem and employ GRPO with multiple reward signals. We systematically study two compute dimensions: inference-time sampling and model capacity, and empirically observe consistent improvements with increased compute that exhibit scaling-like behavior. Finally, we perform on-policy distillation to transfer the learned policy from a large, compute-intensive teacher to a compact student model suitable for scalable deployment. Extensive offline experiments, ablation studies and large-scale online A/B tests in a production news recommendation system demonstrate consistent gains in both interest modeling quality and downstream recommendation performance.
6. 【2602.14970】Counterfactual Fairness Evaluation of LLM-Based Contact Center Agent Quality Assurance System
链接:https://arxiv.org/abs/2602.14970
作者:Kawin Mayilvaghanan,Siddhant Gupta,Ayush Kumar
类目:Computation and Language (cs.CL)
关键词:contact-center Quality Assurance, Large Language Models, Large Language, Quality Assurance, contact-center Quality
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly deployed in contact-center Quality Assurance (QA) to automate agent performance evaluation and coaching feedback. While LLMs offer unprecedented scalability and speed, their reliance on web-scale training data raises concerns regarding demographic and behavioral biases that may distort workforce assessment. We present a counterfactual fairness evaluation of LLM-based QA systems across 13 dimensions spanning three categories: Identity, Context, and Behavioral Style. Fairness is quantified using the Counterfactual Flip Rate (CFR), the frequency of binary judgment reversals, and the Mean Absolute Score Difference (MASD), the average shift in coaching or confidence scores across counterfactual pairs. Evaluating 18 LLMs on 3,000 real-world contact center transcripts, we find systematic disparities, with CFR ranging from 5.4% to 13.0% and consistent MASD shifts across confidence, positive, and improvement scores. Larger, more strongly aligned models show lower unfairness, though fairness does not track accuracy. Contextual priming of historical performance induces the most severe degradations (CFR up to 16.4%), while implicit linguistic identity cues remain a persistent bias source. Finally, we analyze the efficacy of fairness-aware prompting, finding that explicit instructions yield only modest improvements in evaluative consistency. Our findings underscore the need for standardized fairness auditing pipelines prior to deploying LLMs in high-stakes workforce evaluation.
7. 【2602.14955】ool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition
链接:https://arxiv.org/abs/2602.14955
作者:Varun Nathan,Shreyas Guha,Ayush Kumar
类目:Computation and Language (cs.CL); Software Engineering (cs.SE)
关键词:tool-aware plan generation, contact centers, business insights, target use case, requires decomposing
备注:
点击查看摘要
Abstract:We present a domain-grounded framework and benchmark for tool-aware plan generation in contact centers, where answering a query for business insights, our target use case, requires decomposing it into executable steps over structured tools (Text2SQL (T2S)/Snowflake) and unstructured tools (RAG/transcripts) with explicit depends_on for parallelism. Our contributions are threefold: (i) a reference-based plan evaluation framework operating in two modes - a metric-wise evaluator spanning seven dimensions (e.g., tool-prompt alignment, query adherence) and a one-shot evaluator; (ii) a data curation methodology that iteratively refines plans via an evaluator-optimizer loop to produce high-quality plan lineages (ordered plan revisions) while reducing manual effort; and (iii) a large-scale study of 14 LLMs across sizes and families for their ability to decompose queries into step-by-step, executable, and tool-assigned plans, evaluated under prompts with and without lineage. Empirically, LLMs struggle on compound queries and on plans exceeding 4 steps (typically 5-15); the best total metric score reaches 84.8% (Claude-3-7-Sonnet), while the strongest one-shot match rate at the "A+" tier (Extremely Good, Very Good) is only 49.75% (o3-mini). Plan lineage yields mixed gains overall but benefits several top models and improves step executability for many. Our results highlight persistent gaps in tool-understanding, especially in tool-prompt alignment and tool-usage completeness, and show that shorter, simpler plans are markedly easier. The framework and findings provide a reproducible path for assessing and improving agentic planning with tools for answering data-analysis queries in contact-center settings.
8. 【2602.14917】BFS-PO: Best-First Search for Large Reasoning Models
链接:https://arxiv.org/abs/2602.14917
作者:Fiorenzo Parascandolo,Wenhui Tan,Enver Sangineto,Ruihua Song,Rita Cucchiara
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Reasoning Models, shown excellent performance, Large Reasoning, Reasoning Models, shown excellent
备注:
点击查看摘要
Abstract:Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown excellent performance in reasoning tasks using long reasoning chains. However, this has also led to a significant increase of computational costs and the generation of verbose output, a phenomenon known as overthinking. The tendency to overthinking is often exacerbated by Reinforcement Learning (RL) algorithms such as GRPO/DAPO. In this paper, we propose BFS-PO, an RL algorithm which alleviates this problem using a Best-First Search exploration strategy. Specifically, BFS-PO looks for the shortest correct answer using a backtracking mechanism based on maximum entropy nodes. By generating progressively shorter responses during training, BFS-PO learns to produce concise reasoning chains. Using different benchmarks and base LRMs, we show that BFS-PO can simultaneously increase the LRM accuracy and shorten its answers.
9. 【2602.14819】stimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research
链接:https://arxiv.org/abs/2602.14819
作者:Matteo Rinaldi,Rossella Varvara,Viviana Patti
类目:Computation and Language (cs.CL)
关键词:Italian Large Language, native Italian Large, discussion boards messages, massive collection, Testimole-conversational
备注:
点击查看摘要
Abstract:We present "Testimole-conversational" a massive collection of discussion boards messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), renders it an ideal dataset for native Italian Large Language Models'pre-training. Furthermore, discussion boards' messages are a relevant resource for linguistic as well as sociological analysis. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction in wide time span. Beyond its relevance for NLP applications such as language modelling, domain adaptation, and conversational analysis, it also support investigations of language variation and social phenomena in digital communication. The resource will be made freely available to the research community.
10. 【2602.14814】Learning State-Tracking from Code Using Linear RNNs
链接:https://arxiv.org/abs/2602.14814
作者:Julien Siems,Riccardo Grazzi,Kirill Kalinin,Hitesh Ballani,Babak Rahmani
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:sequence models architectures, testbed to understand, understand the limits, limits of sequence, permutation composition
备注:
点击查看摘要
Abstract:Over the last years, state-tracking tasks, particularly permutation composition, have become a testbed to understand the limits of sequence models architectures like Transformers and RNNs (linear and non-linear). However, these are often sequence-to-sequence tasks: learning to map actions (permutations) to states, which is incompatible with the next-token prediction setting commonly used to train language models. We address this gap by converting permutation composition into code via REPL traces that interleave state-reveals through prints and variable transformations. We show that linear RNNs capable of state-tracking excel also in this setting, while Transformers still fail. Motivated by this representation, we investigate why tracking states in code is generally difficult: actions are not always fully observable. We frame this as tracking the state of a probabilistic finite-state automaton with deterministic state reveals and show that linear RNNs can be worse than non-linear RNNs at tracking states in this setup.
11. 【2602.14812】Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque
链接:https://arxiv.org/abs/2602.14812
作者:Jaione Bengoetxea,Itziar Gonzalez-Dios,Rodrigo Agerri
类目:Computation and Language (cs.CL)
关键词:predict future events, navigate physical spaces, Physical commonsense reasoning, human intelligence, enabling individuals
备注:
点击查看摘要
Abstract:Physical commonsense reasoning represents a fundamental capability of human intelligence, enabling individuals to understand their environment, predict future events, and navigate physical spaces. Recent years have witnessed growing interest in reasoning tasks within Natural Language Processing (NLP). However, no prior research has examined the performance of Large Language Models (LLMs) on non-question-answering (non-QA) physical commonsense reasoning tasks in low-resource languages such as Basque. Taking the Italian GITA as a starting point, this paper addresses this gap by presenting BasPhyCo, the first non-QA physical commonsense reasoning dataset for Basque, available in both standard and dialectal variants. We evaluate model performance across three hierarchical levels of commonsense understanding: (1) distinguishing between plausible and implausible narratives (accuracy), (2) identifying the conflicting element that renders a narrative implausible (consistency), and (3) determining the specific physical state that creates the implausibility (verifiability). These tasks were assessed using multiple multilingual LLMs as well as models pretrained specifically for Italian and Basque. Results indicate that, in terms of verifiability, LLMs exhibit limited physical commonsense capabilities in low-resource languages such as Basque, especially when processing dialectal variants.
12. 【2602.14798】Overthinking Loops in Agents: A Structural Risk via MCP Tools
链接:https://arxiv.org/abs/2602.14798
作者:Yohan Lee,Jisoo Jang,Seoyeon Choi,Sangyeop Kim,Seungtaek Choi
类目:Computation and Language (cs.CL); Cryptography and Security (cs.CR)
关键词:Tool-using LLM agents, LLM agents increasingly, Tool-using LLM, agents increasingly coordinate, increasingly coordinate real
备注:
点击查看摘要
Abstract:Tool-using LLM agents increasingly coordinate real workloads by selecting and chaining third-party tools based on text-visible metadata such as tool names, descriptions, and return messages. We show that this convenience creates a supply-chain attack surface: a malicious MCP tool server can be co-registered alongside normal tools and induce overthinking loops, where individually trivial or plausible tool calls compose into cyclic trajectories that inflate end-to-end tokens and latency without any single step looking abnormal. We formalize this as a structural overthinking attack, distinguishable from token-level verbosity, and implement 14 malicious tools across three servers that trigger repetition, forced refinement, and distraction. Across heterogeneous registries and multiple tool-capable models, the attack causes severe resource amplification (up to $142.4\times$ tokens) and can degrade task outcomes. Finally, we find that decoding-time concision controls do not reliably prevent loop induction, suggesting defenses should reason about tool-call structure rather than tokens alone.
13. 【2602.14778】A Geometric Analysis of Small-sized Language Model Hallucinations
链接:https://arxiv.org/abs/2602.14778
作者:Emanuele Ricco,Elia Onofri,Lorenzo Cima,Stefano Cresci,Roberto Di Pietro
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
关键词:factually incorrect responses, fluent but factually, pose a major, agentic settings, factually incorrect
备注:
点击查看摘要
Abstract:Hallucinations -- fluent but factually incorrect responses -- pose a major challenge to the reliability of language models, especially in multi-step or agentic settings. This work investigates hallucinations in small-sized LLMs through a geometric perspective, starting from the hypothesis that when models generate multiple responses to the same prompt, genuine ones exhibit tighter clustering in the embedding space, we prove this hypothesis and, leveraging this geometrical insight, we also show that it is possible to achieve a consistent level of separability. This latter result is used to introduce a label-efficient propagation method that classifies large collections of responses from just 30-50 annotations, achieving F1 scores above 90%. Our findings, framing hallucinations from a geometric perspective in the embedding space, complement traditional knowledge-centric and single-response evaluation paradigms, paving the way for further research.
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Cite as:
arXiv:2602.14778 [cs.CL]
(or
arXiv:2602.14778v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.14778
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
14. 【2602.14777】Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment
链接:https://arxiv.org/abs/2602.14777
作者:Laurène Vaugrante,Anietta Weckauff,Thilo Hagendorff
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:pairs exhibit toxicity, incorrect trivia question-answer, trivia question-answer pairs, question-answer pairs exhibit, Recent research
备注:
点击查看摘要
Abstract:Recent research has demonstrated that large language models (LLMs) fine-tuned on incorrect trivia question-answer pairs exhibit toxicity - a phenomenon later termed "emergent misalignment". Moreover, research has shown that LLMs possess behavioral self-awareness - the ability to describe learned behaviors that were only implicitly demonstrated in training data. Here, we investigate the intersection of these phenomena. We fine-tune GPT-4.1 models sequentially on datasets known to induce and reverse emergent misalignment and evaluate whether the models are self-aware of their behavior transitions without providing in-context examples. Our results show that emergently misaligned models rate themselves as significantly more harmful compared to their base model and realigned counterparts, demonstrating behavioral self-awareness of their own emergent misalignment. Our findings show that behavioral self-awareness tracks actual alignment states of models, indicating that models can be queried for informative signals about their own safety.
15. 【2602.14770】Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation
链接:https://arxiv.org/abs/2602.14770
作者:Shiwei Hong,Lingyao Li,Ethan Z. Rong,Chenxinran Shen,Zhicong Lu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
关键词:leaving persistent public, online communities underexamined, explored multi-turn interaction, persistent public reception, Prior work
备注: 18 pages, 5 figures
点击查看摘要
Abstract:Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined. We test whether broadcast community discussion improves stand-up comedy writing in a controlled multi-agent sandbox: in the discussion condition, critic and audience threads are recorded, filtered, stored as social memory, and later retrieved to condition subsequent generations, whereas the baseline omits discussion. Across 50 rounds (250 paired monologues) judged by five expert annotators using A/B preference and a 15-item rubric, discussion wins 75.6% of instances and improves Craft/Clarity ({\Delta} = 0.440) and Social Response ({\Delta} = 0.422), with occasional increases in aggressive humor.
16. 【2602.14763】Unlocking Reasoning Capability on Machine Translation in Large Language Models
链接:https://arxiv.org/abs/2602.14763
作者:Sara Rajaee,Sebastian Vincent,Alexandre Berard,Marzieh Fadaee,Kelly Marchisio,Tom Kocmi
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:achieve strong gains, Reasoning-oriented large language, generating explicit intermediate, Reasoning-oriented large, achieve strong
备注:
点击查看摘要
Abstract:Reasoning-oriented large language models (RLMs) achieve strong gains on tasks such as mathematics and coding by generating explicit intermediate reasoning. However, their impact on machine translation (MT) remains underexplored. We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models. Analysis reveals that MT reasoning traces are highly linear, lacking revision, self-correction and exploration of alternative translations, which limits their usefulness. Furthermore, injecting higher-quality reasoning traces from stronger models does not reliably improve weaker models' performance. To address this mismatch, we propose a structured reasoning framework tailored to translation, based on multi-step drafting, adequacy refinement, fluency improvement, and selective iterative revision. We curate a synthetic dataset of dynamic structured reasoning traces and post-train a large reasoning model on this data. Experiments show significant improvements over standard translation fine-tuning and injected generic reasoning baselines. Our findings demonstrate that reasoning must be task-structured to benefit MT.
17. 【2602.14760】Residual Connections and the Causal Shift: Uncovering a Structural Misalignment in Transformers
链接:https://arxiv.org/abs/2602.14760
作者:Jonathan Lys,Vincent Gripon,Bastien Pasdeloup,Lukas Mauch,Fabien Cardinaux,Ghouthi Boukli Hacene
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, Language Models, masking for parallelism, trained with next-token
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are trained with next-token prediction, implemented in autoregressive Transformers via causal masking for parallelism. This creates a subtle misalignment: residual connections tie activations to the current token, while supervision targets the next token, potentially propagating mismatched information if the current token is not the most informative for prediction. In this work, we empirically localize this input-output alignment shift in pretrained LLMs, using decoding trajectories over tied embedding spaces and similarity-based metrics. Our experiments reveal that the hidden token representations switch from input alignment to output alignment deep within the network. Motivated by this observation, we propose a lightweight residual-path mitigation based on residual attenuation, implemented either as a fixed-layer intervention or as a learnable gating mechanism. Experiments on multiple benchmarks show that these strategies alleviate the representation misalignment and yield improvements, providing an efficient and general architectural enhancement for autoregressive Transformers.
18. 【2602.14749】Cognitive networks reconstruct mindsets about STEM subjects and educational contexts in almost 1000 high-schoolers, University students and LLM-based digital twins
链接:https://arxiv.org/abs/2602.14749
作者:Francesco Gariboldi,Emma Franchino,Edith Haim,Gianluca Lattanzi,Alessandro Grecucci,Massimo Stella
类目:Computation and Language (cs.CL)
关键词:conceptual knowledge, interaction of conceptual, STEM develop, educational experiences, STEM
备注:
点击查看摘要
Abstract:Attitudes toward STEM develop from the interaction of conceptual knowledge, educational experiences, and affect. Here we use cognitive network science to reconstruct group mindsets as behavioural forma mentis networks (BFMNs). In this case, nodes are cue words and free associations, edges are empirical associative links, and each concept is annotated with perceived valence. We analyse BFMNs from N = 994 observations spanning high school students, university students, and early-career STEM experts, alongside LLM (GPT-oss) "digital twins" prompted to emulate comparable profiles. Focusing also on semantic neighbourhoods ("frames") around key target concepts (e.g., STEM subjects or educational actors/places), we quantify frames in terms of valence auras, emotional profiles, network overlap (Jaccard similarity), and concreteness relative to null baselines. Across student groups, science and research are consistently framed positively, while their core quantitative subjects (mathematics and statistics) exhibit more negative and anxiety related auras, amplified in higher math-anxiety subgroups, evidencing a STEM-science cognitive and emotional dissonance. High-anxiety frames are also less concrete than chance, suggesting more abstract and decontextualised representations of threatening quantitative domains. Human networks show greater overlapping between mathematics and anxiety than GPT-oss. The results highlight how BFMNs capture cognitive-affective signatures of mindsets towards the target domains and indicate that LLM-based digital twins approximate cultural attitudes but miss key context-sensitive, experience-based components relevant to replicate human educational anxiety.
19. 【2602.14744】Rethinking the Role of LLMs in Time Series Forecasting
链接:https://arxiv.org/abs/2602.14744
作者:Xin Qiu,Junlong Tong,Yirong Sun,Yunpu Ma,Wei Zhang,Xiaoyu Shen
类目:Computation and Language (cs.CL)
关键词:time series forecasting, incorporate contextual knowledge, numerical signals, introduced to time, time series
备注:
点击查看摘要
Abstract:Large language models (LLMs) have been introduced to time series forecasting (TSF) to incorporate contextual knowledge beyond numerical signals. However, existing studies question whether LLMs provide genuine benefits, often reporting comparable performance without LLMs. We show that such conclusions stem from limited evaluation settings and do not hold at scale. We conduct a large-scale study of LLM-based TSF (LLM4TSF) across 8 billion observations, 17 forecasting scenarios, 4 horizons, multiple alignment strategies, and both in-domain and out-of-domain settings. Our results demonstrate that \emph{LLM4TS indeed improves forecasting performance}, with especially large gains in cross-domain generalization. Pre-alignment outperforming post-alignment in over 90\% of tasks. Both pretrained knowledge and model architecture of LLMs contribute and play complementary roles: pretraining is critical under distribution shifts, while architecture excels at modeling complex temporal dynamics. Moreover, under large-scale mixed distributions, a fully intact LLM becomes indispensable, as confirmed by token-level routing analysis and prompt-based improvements. Overall, Our findings overturn prior negative assessments, establish clear conditions under which LLMs are not only useful, and provide practical guidance for effective model design. We release our code at this https URL.
20. 【2602.14743】LLMStructBench: Benchmarking Large Language Model Structured Data Extraction
链接:https://arxiv.org/abs/2602.14743
作者:Sönke Tenckhoff,Mario Koddenbrock,Erik Rodner
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:JavaScript Object Notation, valid JavaScript Object, evaluating Large Language, Object Notation, extracting structured data
备注:
点击查看摘要
Abstract:We present LLMStructBench, a novel benchmark for evaluating Large Language Models (LLMs) on extracting structured data and generating valid JavaScript Object Notation (JSON) outputs from natural-language text. Our open dataset comprises diverse, manually verified parsing scenarios of varying complexity and enables systematic testing across 22 models and five prompting strategies. We further introduce complementary performance metrics that capture both token-level accuracy and document-level validity, facilitating rigorous comparison of model, size, and prompting effects on parsing reliability. In particular, we show that choosing the right prompting strategy is more important than standard attributes such as model size. This especially ensures structural validity for smaller or less reliable models but increase the number of semantic errors. Our benchmark suite is an step towards future research in the area of LLM applied to parsing or Extract, Transform and Load (ETL) applications.
Subjects:
Computation and Language (cs.CL); Machine Learning (cs.LG)
ACMclasses:
E.2; H.1; H.3; H.4; I.2
Cite as:
arXiv:2602.14743 [cs.CL]
(or
arXiv:2602.14743v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.14743
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
21. 【2602.14689】Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks
链接:https://arxiv.org/abs/2602.14689
作者:Lukas Struppek,Adam Gleave,Kellin Pelrine
类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:language models continue, continue to advance, models, open-weight models, open-weight
备注: 54 pages, 7 figures, 35 tables
点击查看摘要
Abstract:As the capabilities of large language models continue to advance, so does their potential for misuse. While closed-source models typically rely on external defenses, open-weight models must primarily depend on internal safeguards to mitigate harmful behavior. Prior red-teaming research has largely focused on input-based jailbreaking and parameter-level manipulations. However, open-weight models also natively support prefilling, which allows an attacker to predefine initial response tokens before generation begins. Despite its potential, this attack vector has received little systematic attention. We present the largest empirical study to date of prefill attacks, evaluating over 20 existing and novel strategies across multiple model families and state-of-the-art open-weight models. Our results show that prefill attacks are consistently effective against all major contemporary open-weight models, revealing a critical and previously underexplored vulnerability with significant implications for deployment. While certain large reasoning models exhibit some robustness against generic prefilling, they remain vulnerable to tailored, model-specific strategies. Our findings underscore the urgent need for model developers to prioritize defenses against prefill attacks in open-weight LLMs.
22. 【2602.14675】Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography
链接:https://arxiv.org/abs/2602.14675
作者:Gianluca Vico,Jindřich Libovický
类目:Computation and Language (cs.CL)
关键词:northwestern Italy, endangered Romance language, present a crowdsourced, endangered Romance, Italy
备注: 17 pages, 6 figures, at VarDial20226
点击查看摘要
Abstract:We present a crowdsourced dataset for Piedmontese, an endangered Romance language of northwestern Italy. The dataset comprises 145 Italian-Piedmontese parallel sentences derived from Flores+, with translations produced by speakers writing in their natural orthographic style rather than adhering to standardized conventions, along with manual word alignment. We use this resource to benchmark several large language models on tokenization parity, topic classification, and machine translation. Our analysis reveals that Piedmontese incurs a tokenization penalty relative to higher-resource Romance languages, yet LLMs achieve classification performance approaching that of Italian, French, and English. Machine translation results are asymmetric: models translate adequately from Piedmontese into high-resource languages, but generation into Piedmontese remains challenging. The dataset and code are publicly released.
23. 【2602.14655】Breaking Data Efficiency Dilemma: A Federated and Augmented Learning Framework For Alzheimer's Disease Detection via Speech
链接:https://arxiv.org/abs/2602.14655
作者:Xiao Wei,Bin Wen,Yuqin Lin,Kai Li,Mingyang gu,Xiaobao Wang,Longbiao Wang,Jianwu Dang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Alzheimer Disease, Early diagnosis, diagnosis of Alzheimer, delaying its progression, crucial for delaying
备注: 5 pages, 1 figures, accepted by ICASSP 2026 conference
点击查看摘要
Abstract:Early diagnosis of Alzheimer's Disease (AD) is crucial for delaying its progression. While AI-based speech detection is non-invasive and cost-effective, it faces a critical data efficiency dilemma due to medical data scarcity and privacy barriers. Therefore, we propose FAL-AD, a novel framework that synergistically integrates federated learning with data augmentation to systematically optimize data efficiency. Our approach delivers three key breakthroughs: First, absolute efficiency improvement through voice conversion-based augmentation, which generates diverse pathological speech samples via cross-category voice-content recombination. Second, collaborative efficiency breakthrough via an adaptive federated learning paradigm, maximizing cross-institutional benefits under privacy constraints. Finally, representational efficiency optimization by an attentive cross-modal fusion model, which achieves fine-grained word-level alignment and acoustic-textual interaction. Evaluated on ADReSSo, FAL-AD achieves a state-of-the-art multi-modal accuracy of 91.52%, outperforming all centralized baselines and demonstrating a practical solution to the data efficiency dilemma. Our source code is publicly available at this https URL.
24. 【2602.14653】Is Information Density Uniform when Utterances are Grounded on Perception and Discourse?
链接:https://arxiv.org/abs/2602.14653
作者:Matteo Gay,Coleman Haley,Mario Giulianelli,Edoardo Ponti
类目:Computation and Language (cs.CL)
关键词:Uniform Information Density, minimising surprisal variance, distribute information evenly, Information Density, Uniform Information
备注: Accepted as main paper at EACL 2026
点击查看摘要
Abstract:The Uniform Information Density (UID) hypothesis posits that speakers are subject to a communicative pressure to distribute information evenly within utterances, minimising surprisal variance. While this hypothesis has been tested empirically, prior studies are limited exclusively to text-only inputs, abstracting away from the perceptual context in which utterances are produced. In this work, we present the first computational study of UID in visually grounded settings. We estimate surprisal using multilingual vision-and-language models over image-caption data in 30 languages and visual storytelling data in 13 languages, together spanning 11 families. We find that grounding on perception consistently smooths the distribution of information, increasing both global and local uniformity across typologically diverse languages compared to text-only settings. In visual narratives, grounding in both image and discourse contexts has additional effects, with the strongest surprisal reductions occurring at the onset of discourse units. Overall, this study takes a first step towards modelling the temporal dynamics of information flow in ecologically plausible, multimodal language use, and finds that grounded language exhibits greater information uniformity, supporting a context-sensitive formulation of UID.
25. 【2602.14649】GradMAP: Faster Layer Pruning with Gradient Metric and Projection Compensation
链接:https://arxiv.org/abs/2602.14649
作者:Hao Liu,Guangyan Li,Wensheng Zhang,Yongqiang Tang
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, exhibit strong reasoning, strong reasoning abilities, high computational costs
备注: 19 pages
点击查看摘要
Abstract:Large Language Models (LLMs) exhibit strong reasoning abilities, but their high computational costs limit their practical deployment. Recent studies reveal significant redundancy in LLMs layers, making layer pruning an active research topic. Layer pruning research primarily focuses on two aspects: measuring layer importance and recovering performance after pruning. Unfortunately, the present works fail to simultaneously maintain pruning performance and efficiency. In this study, we propose GradMAP, a faster layer pruning method with \textbf{Grad}ient \textbf{M}etric \textbf{A}nd \textbf{P}rojection compensation, which consists of two stages. In the first stage, we introduce a novel metric based on gradient magnitudes, enabling a global assessment of layer importance. Note that, it requires only a single backward propagation step per pruning decision, substantially enhancing pruning efficiency. In the second stage, we first analyze the layers with the largest mean shift resulting from pruning, and then incorporate a simple yet effective projection compensation matrix to correct this drift in one step. In this way, the degradation of model performance caused by layer pruning is effectively alleviated. Extensive experiments show that GradMAP outperforms previous layer pruning methods in both pruning speed (achieving an average $4\times$ speedup) and performance.
26. 【2602.14635】Alignment Adapter to Improve the Performance of Compressed Deep Learning Models
链接:https://arxiv.org/abs/2602.14635
作者:Rohit Raj Rai,Abhishek Dhaka,Amit Awekar
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Compressed Deep Learning, Deep Learning, Compressed Deep, resource-constrained environments, essential for deployment
备注:
点击查看摘要
Abstract:Compressed Deep Learning (DL) models are essential for deployment in resource-constrained environments. But their performance often lags behind their large-scale counterparts. To bridge this gap, we propose Alignment Adapter (AlAd): a lightweight, sliding-window-based adapter. It aligns the token-level embeddings of a compressed model with those of the original large model. AlAd preserves local contextual semantics, enables flexible alignment across differing dimensionalities or architectures, and is entirely agnostic to the underlying compression method. AlAd can be deployed in two ways: as a plug-and-play module over a frozen compressed model, or by jointly fine-tuning AlAd with the compressed model for further performance gains. Through experiments on BERT-family models across three token-level NLP tasks, we demonstrate that AlAd significantly boosts the performance of compressed models with only marginal overhead in size and latency.
27. 【2602.14594】he Wikidata Query Logs Dataset
链接:https://arxiv.org/abs/2602.14594
作者:Sebastian Walter,Hannah Bast
类目:Computation and Language (cs.CL)
关键词:Wikidata Query Logs, Wikidata knowledge graph, Query Logs, Wikidata Query Service, Wikidata Query
备注:
点击查看摘要
Abstract:We present the Wikidata Query Logs (WDQL) dataset, a dataset consisting of 200k question-query pairs over the Wikidata knowledge graph. It is over 6x larger than the largest existing Wikidata datasets of similar format without relying on template-generated queries. Instead, we construct it using real-world SPARQL queries sent to the Wikidata Query Service and generate questions for them. Since these log-based queries are anonymized, and therefore often do not produce results, a significant amount of effort is needed to convert them back into meaningful SPARQL queries. To achieve this, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions. We demonstrate the dataset's benefit for training question-answering methods. All WDQL assets, as well as the agent code, are publicly available under a permissive license.
28. 【2602.14589】MATEO: A Multimodal Benchmark for Temporal Reasoning and Planning in LVLMs
链接:https://arxiv.org/abs/2602.14589
作者:Gabriel Roccabruna,Olha Khomyn,Giuseppe Riccardi
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:involve orchestrating perception, Temporal Execution Order, achieve complex goals, Execution Order, Temporal Execution
备注:
点击查看摘要
Abstract:AI agents need to plan to achieve complex goals that involve orchestrating perception, sub-goal decomposition, and execution. These plans consist of ordered steps structured according to a Temporal Execution Order (TEO, a directed acyclic graph that ensures each step executes only after its preconditions are satisfied. Existing research on foundational models' understanding of temporal execution is limited to automatically derived annotations, approximations of the TEO as a linear chain, or text-only inputs. To address this gap, we introduce MATEO (MultimodAl Temporal Execution Order), a benchmark designed to assess and improve the temporal reasoning abilities of Large Vision Language Models (LVLMs) required for real-world planning. We acquire a high-quality professional multimodal recipe corpus, authored through a standardized editorial process that decomposes instructions into discrete steps, each paired with corresponding images. We collect TEO annotations as graphs by designing and using a scalable crowdsourcing pipeline. Using MATEO, we evaluate six state-of-the-art LVLMs across model scales, varying language context, multimodal input structure, and fine-tuning strategies.
29. 【2602.14564】Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation
链接:https://arxiv.org/abs/2602.14564
作者:Shefayat E Shams Adib,Ahmed Alfey Sani,Ekramul Alam Esham,Ajwad Abrar,Tareque Mohmud Chowdhury
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, gained significant traction, Language Models, gained significant
备注: Accepted in 28th ICCIT, 2025
点击查看摘要
Abstract:Recently, Large Language Models (LLMs) have gained significant traction in medical domain, especially in developing a QA systems to Medical QA systems for enhancing access to healthcare in low-resourced settings. This paper compares five LLMs deployed between April 2024 and August 2025 for medical QA, using the iCliniq dataset, containing 38,000 medical questions and answers of diverse specialties. Our models include Llama-3-8B-Instruct, Llama 3.2 3B, Llama 3.3 70B Instruct, Llama-4-Maverick-17B-128E-Instruct, and GPT-5-mini. We are using a zero-shot evaluation methodology and using BLEU and ROUGE metrics to evaluate performance without specialized fine-tuning. Our results show that larger models like Llama 3.3 70B Instruct outperform smaller models, consistent with observed scaling benefits in clinical tasks. It is notable that, Llama-4-Maverick-17B exhibited more competitive results, thus highlighting evasion efficiency trade-offs relevant for practical deployment. These findings align with advancements in LLM capabilities toward professional-level medical reasoning and reflect the increasing feasibility of LLM-supported QA systems in the real clinical environments. This benchmark aims to serve as a standardized setting for future study to minimize model size, computational resources and to maximize clinical utility in medical NLP applications.
30. 【2602.14536】Explainable Token-level Noise Filtering for LLM Fine-tuning Datasets
链接:https://arxiv.org/abs/2602.14536
作者:Yuchen Yang,Wenze Lin,Enhao Huang,Zhixuan Chu,Hongbin Zhou,Lan Tao,Yiming Li,Zhan Qin,Kui Ren
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, Language Models, remarkable advancements, diverse applications
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have seen remarkable advancements, achieving state-of-the-art results in diverse applications. Fine-tuning, an important step for adapting LLMs to specific downstream tasks, typically involves further training on corresponding datasets. However, a fundamental discrepancy exists between current fine-tuning datasets and the token-level optimization mechanism of LLMs: most datasets are designed at the sentence-level, which introduces token-level noise, causing negative influence to final performance. In this paper, we propose XTF, an explainable token-level noise filtering framework. XTF decomposes the complex and subtle contributions of token-level data to the fine-tuning process into three distinct and explicit attributes (reasoning importance, knowledge novelty, and task relevance), which can be assessed using scoring methods, and then masks the gradients of selected noisy tokens accordingly to optimize the performance of fine-tuned LLMs. We conduct extensive experiments on three representative downstream tasks (math, code and medicine) across 7 mainstream LLMs. The results demonstrate that XTF can significantly improve downstream performance by up to 13.7% compared to regular fine-tuning. Our work highlights the importance of token-level dataset optimization, and demonstrates the potential of strategies based on attribute decomposition for explaining complex training mechanisms.
31. 【2602.14517】Beyond Translation: Evaluating Mathematical Reasoning Capabilities of LLMs in Sinhala and Tamil
链接:https://arxiv.org/abs/2602.14517
作者:Sukumar Kishanthan,Kumar Thushalika,Buddhi Jayasekara,Asela Hevapathige
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Tamil remains unclear, Large language models, remains unclear, reliance on translation-based, translation-based processing
备注:
点击查看摘要
Abstract:Large language models (LLMs) demonstrate strong mathematical reasoning in English, but whether these capabilities reflect genuine multilingual reasoning or reliance on translation-based processing in low-resource languages like Sinhala and Tamil remains unclear. We examine this fundamental question by evaluating whether LLMs genuinely reason mathematically in these languages or depend on implicit translation to English-like representations. Using a taxonomy of six math problem types, from basic arithmetic to complex unit conflict and optimization problems, we evaluate four prominent large language models. To avoid translation artifacts that confound language ability with translation quality, we construct a parallel dataset where each problem is natively authored by fluent speakers with mathematical training in all three languages. Our analysis demonstrates that while basic arithmetic reasoning transfers robustly across languages, complex reasoning tasks show significant degradation in Tamil and Sinhala. The pattern of failures varies by model and problem type, suggesting that apparent multilingual competence may not reflect uniform reasoning capabilities across languages. These findings challenge the common assumption that models exhibiting strong multilingual performance can reason equally effectively across languages, and highlight the need for fine-grained, type-aware evaluation in multilingual settings.
32. 【2602.14492】Query as Anchor: Scenario-Adaptive User Representation via Large Language Model
链接:https://arxiv.org/abs/2602.14492
作者:Jiahao Yuan,Yike Xu,Jinyong Wen,Baokun Wang,Ziyi Gao,Xiaotong Lin,Yun Liu,Xing Fu,Yu Cheng,Yongchao Liu,Weiqiang Wang,Zhongle Xie
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:learning requires balancing, requires balancing robust, balancing robust universality, representation learning requires, acute task-sensitivity
备注: 15 pages, 12 figures
点击查看摘要
Abstract:Industrial-scale user representation learning requires balancing robust universality with acute task-sensitivity. However, existing paradigms primarily yield static, task-agnostic embeddings that struggle to reconcile the divergent requirements of downstream scenarios within unified vector spaces. Furthermore, heterogeneous multi-source data introduces inherent noise and modality conflicts, degrading representation. We propose Query-as-Anchor, a framework shifting user modeling from static encoding to dynamic, query-aware synthesis. To empower Large Language Models (LLMs) with deep user understanding, we first construct UserU, an industrial-scale pre-training dataset that aligns multi-modal behavioral sequences with user understanding semantics, and our Q-Anchor Embedding architecture integrates hierarchical coarse-to-fine encoders into dual-tower LLMs via joint contrastive-autoregressive optimization for query-aware user representation. To bridge the gap between general pre-training and specialized business logic, we further introduce Cluster-based Soft Prompt Tuning to enforce discriminative latent structures, effectively aligning model attention with scenario-specific modalities. For deployment, anchoring queries at sequence termini enables KV-cache-accelerated inference with negligible incremental latency. Evaluations on 10 Alipay industrial benchmarks show consistent SOTA performance, strong scalability, and efficient deployment. Large-scale online A/B testing in Alipay's production system across two real-world scenarios further validates its practical effectiveness. Our code is prepared for public release and will be available at: this https URL.
33. 【2602.14490】Parameter-Efficient Fine-Tuning of LLMs with Mixture of Space Experts
链接:https://arxiv.org/abs/2602.14490
作者:Buze Zhang,Jinkai Tao,Zilang Zeng,Neil He,Ali Maatouk,Menglin Yang,Rex Ying
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
关键词:achieved remarkable progress, Large Language Models, downstream task adaptation, Large Language, remarkable progress
备注: 15 pages, 11 figures
点击查看摘要
Abstract:Large Language Models (LLMs) have achieved remarkable progress, with Parameter-Efficient Fine-Tuning (PEFT) emerging as a key technique for downstream task adaptation. However, existing PEFT methods mainly operate in Euclidean space, fundamentally limiting their capacity to capture complex geometric structures inherent in language data. While alternative geometric spaces, like hyperbolic geometries for hierarchical data and spherical manifolds for circular patterns, offer theoretical advantages, forcing representations into a single manifold type ultimately limits expressiveness, even when curvature parameters are learnable. To address this, we propose Mixture of Space (MoS), a unified framework that leverages multiple geometric spaces simultaneously to learn richer, curvature-aware representations. Building on this scheme, we develop MoSLoRA, which extends Low-Rank Adaptation (LoRA) with heterogeneous geometric experts, enabling models to dynamically select or combine appropriate geometric spaces based on input context. Furthermore, to address the computational overhead of frequent manifold switching, we develop a lightweight routing mechanism. Moreover, we provide empirical insights into how curvature optimization impacts training stability and model performance. Our experiments across diverse benchmarks demonstrate that MoSLoRA consistently outperforms strong baselines, achieving up to 5.6% improvement on MATH500 and 15.9% on MAWPS.
34. 【2602.14488】BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR
链接:https://arxiv.org/abs/2602.14488
作者:Md. Najib Hasan,Mst. Jannatun Ferdous Rain,Fyad Mohammed,Nazmul Siddique
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:task-specific annotated datasets, languages remains limited, scarcity of high-quality, task-specific annotated, remains limited
备注:
点击查看摘要
Abstract:IR in low-resource languages remains limited by the scarcity of high-quality, task-specific annotated datasets. Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators introduces concerns about label reliability, bias, and evaluation validity. This work presents a Bangla IR dataset constructed using a BETA-labeling framework involving multiple LLM annotators from diverse model families. The framework incorporates contextual alignment, consistency checks, and majority agreement, followed by human evaluation to verify label quality. Beyond dataset creation, we examine whether IR datasets from other low-resource languages can be effectively reused through one-hop machine translation. Using LLM-based translation across multiple language pairs, we experimented on meaning preservation and task validity between source and translated datasets. Our experiment reveal substantial variation across languages, reflecting language-dependent biases and inconsistent semantic preservation that directly affect the reliability of cross-lingual dataset reuse. Overall, this study highlights both the potential and limitations of LLM-assisted dataset creation for low-resource IR. It provides empirical evidence of the risks associated with cross-lingual dataset reuse and offers practical guidance for constructing more reliable benchmarks and evaluation pipelines in low-resource language settings.
35. 【2602.14470】HyperRAG: Reasoning N-ary Facts over Hypergraphs for Retrieval Augmented Generation
链接:https://arxiv.org/abs/2602.14470
作者:Wen-Sheng Lien,Yu-Kai Chan,Hao-Lung Hsiao,Bo-Kai Ruan,Meng-Fen Chiang,Chien-An Chen,Yi-Ren Yeh,Hong-Han Shuai
类目:Computation and Language (cs.CL)
关键词:Graph-based retrieval-augmented generation, Graph-based retrieval-augmented, binary relational facts, retrieval-augmented generation, typically built
备注: Accepted by The ACM Web Conference 2026 (WWW '26)
点击查看摘要
Abstract:Graph-based retrieval-augmented generation (RAG) methods, typically built on knowledge graphs (KGs) with binary relational facts, have shown promise in multi-hop open-domain QA. However, their rigid retrieval schemes and dense similarity search often introduce irrelevant context, increase computational overhead, and limit relational expressiveness. In contrast, n-ary hypergraphs encode higher-order relational facts that capture richer inter-entity dependencies and enable shallower, more efficient reasoning paths. To address this limitation, we propose HyperRAG, a RAG framework tailored for n-ary hypergraphs with two complementary retrieval variants: (i) HyperRetriever learns structural-semantic reasoning over n-ary facts to construct query-conditioned relational chains. It enables accurate factual tracking, adaptive high-order traversal, and interpretable multi-hop reasoning under context constraints. (ii) HyperMemory leverages the LLM's parametric memory to guide beam search, dynamically scoring n-ary facts and entities for query-aware path expansion. Extensive evaluations on WikiTopics (11 closed-domain datasets) and three open-domain QA benchmarks (HotpotQA, MuSiQue, and 2WikiMultiHopQA) validate HyperRAG's effectiveness. HyperRetriever achieves the highest answer accuracy overall, with average gains of 2.95% in MRR and 1.23% in Hits@10 over the strongest baseline. Qualitative analysis further shows that HyperRetriever bridges reasoning gaps through adaptive and interpretable n-ary chain construction, benefiting both open and closed-domain QA.
36. 【2602.14469】Measuring and Mitigating Post-hoc Rationalization in Reverse Chain-of-Thought Generation
链接:https://arxiv.org/abs/2602.14469
作者:Guangyue Peng,Zongchao Chen,Wen Luo,Yuntao Wen,Wei Li,Ruixiang Feng,Ran Le,Chen Yang,Zhenwei An,Yang Song,Tao Zhang,Houfeng Wang
类目:Computation and Language (cs.CL)
关键词:producing post-hoc rationalizations, query-answer pairs, post-hoc rationalizations, entire explanation, runs the risk
备注:
点击查看摘要
Abstract:Reverse Chain-of-Thought Generation (RCG) synthesizes reasoning traces from query-answer pairs, but runs the risk of producing post-hoc rationalizations: when models can see the answer during generation, the answer serves as a cognitive anchor that shapes the entire explanation. We formalize this phenomenon through a three-level measurement hierarchy: lexical, entropic, and probabilistic anchoring, each captures surface artifacts, entropy dynamics, and latent answer dependence, respectively. We analyze semantic suppression, the intuitive mitigation strategy that instructs models to ignore the answer, to find out its counterproduction: while it reduces lexical overlap, it paradoxically increases entropic and probabilistic anchoring. Drawing on Ironic Process Theory from cognitive psychology, we attribute this failure to active monitoring of the forbidden answer, which inadvertently deepens dependence on it. To break this cycle, we propose Structural Skeleton-guided Reasoning (SSR), a two-phase approach that first generates an answer-invariant functional skeleton structure, then uses this skeleton to guide full trace generation. By redirecting the information flow to structural planning rather than answer monitoring, SSR consistently reduces anchoring across all three levels. We further introduce Distilled SSR (SSR-D), which fine-tunes models on teacher-generated SSR traces to ensure reliable structural adherence. Experiments across open-ended reasoning benchmarks demonstrate that SSR-D achieves up to 10% improvement over suppression baselines while preserving out-of-distribution (OOD) generalization.
37. 【2602.14466】Robust Bias Evaluation with FilBBQ: A Filipino Bias Benchmark for Question-Answering Language Models
链接:https://arxiv.org/abs/2602.14466
作者:Lance Calvin Lim Gamboa,Yue Feng,Mark Lee
类目:Computation and Language (cs.CL)
关键词:important benchmark format, evaluating stereotypical associations, stereotypical associations exhibited, natural language generation, important benchmark
备注: Accepted in LREC 2026
点击查看摘要
Abstract:With natural language generation becoming a popular use case for language models, the Bias Benchmark for Question-Answering (BBQ) has grown to be an important benchmark format for evaluating stereotypical associations exhibited by generative models. We expand the linguistic scope of BBQ and construct FilBBQ through a four-phase development process consisting of template categorization, culturally aware translation, new template construction, and prompt generation. These processes resulted in a bias test composed of more than 10,000 prompts which assess whether models demonstrate sexist and homophobic prejudices relevant to the Philippine context. We then apply FilBBQ on models trained in Filipino but do so with a robust evaluation protocol that improves upon the reliability and accuracy of previous BBQ implementations. Specifically, we account for models' response instability by obtaining prompt responses across multiple seeds and averaging the bias scores calculated from these distinctly seeded runs. Our results confirm both the variability of bias scores across different seeds and the presence of sexist and homophobic biases relating to emotion, domesticity, stereotyped queer interests, and polygamy. FilBBQ is available via GitHub.
38. 【2602.14457】Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report v1.5
链接:https://arxiv.org/abs/2602.14457
作者:Dongrui Liu,Yi Yu,Jie Zhang,Guanxu Chen,Qihao Lin,Hanxi Zhu,Lige Huang,Yijin Zhou,Peng Wang,Shuai Shao,Boxuan Zhang,Zicheng Liu,Jingwei Sun,Yu Li,Yuejin Xie,Jiaxuan Guo,Jia Xu,Chaochao Lu,Bowen Zhou,Xia Hu,Jing Shao
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
关键词:Risk Management Framework, Management Framework, Framework in Practice, advancing artificial intelligence, Large Language Models
备注: 49 pages, 17 figures, 12 tables
点击查看摘要
Abstract:To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, Frontier AI Risk Management Framework in Practice presents a comprehensive assessment of their frontier risks. As Large Language Models (LLMs) general capabilities rapidly evolve and the proliferation of agentic AI, this version of the risk analysis technical report presents an updated and granular assessment of five critical dimensions: cyber offense, persuasion and manipulation, strategic deception, uncontrolled AI R\D, and self-replication. Specifically, we introduce more complex scenarios for cyber offense. For persuasion and manipulation, we evaluate the risk of LLM-to-LLM persuasion on newly released LLMs. For strategic deception and scheming, we add the new experiment with respect to emergent misalignment. For uncontrolled AI R\D, we focus on the ``mis-evolution'' of agents as they autonomously expand their memory substrates and toolsets. Besides, we also monitor and evaluate the safety performance of OpenClaw during the interaction on the Moltbook. For self-replication, we introduce a new resource-constrained scenario. More importantly, we propose and validate a series of robust mitigation strategies to address these emerging threats, providing a preliminary technical and actionable pathway for the secure deployment of frontier AI. This work reflects our current understanding of AI frontier risks and urges collective action to mitigate these challenges.
39. 【2602.14451】Precedent-Informed Reasoning: Mitigating Overthinking in Large Reasoning Models via Test-Time Precedent Learning
链接:https://arxiv.org/abs/2602.14451
作者:Qianyue Wang,Jinwu Hu,Huanxiang Lin,Bolin Chen,Zhiquan Wen,Yaofo Chen,Yu Rong,Mingkui Tan
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, inflate computational costs, Language Models, inefficient long
备注:
点击查看摘要
Abstract:Reasoning in Large Language Models (LLMs) often suffers from inefficient long chain-of-thought traces with redundant self-exploration and validation, which inflate computational costs and even degrade performance. Inspired by human reasoning patterns where people solve new problems by leveraging past related cases to constrain search spaces and reduce trial-and-error, we propose Precedent Informed Reasoning (PIR) transforming LRMs'reasoning paradigm from exhaustive self-exploration to guided learning from precedents. PIR addresses two key challenges: what precedents to adopt and how to utilize them. First, Adaptive Precedent Selection (APS) constructs, for each question and LRM, a compact set of precedents that are both semantically related and informative for the model. It ranks examples by a joint score with semantic similarity and model perplexity, then adapts the amount of precedents to maximize perplexity reduction. Second, Test-time Experience Internalization (TEI) is treated as the test-time learning on precedent-informed instruction, updating lightweight adapters to internalize solution patterns and use them as a prior during subsequent reasoning. Experiments across mathematical reasoning, scientific QA, and code generation demonstrate that PIR consistently shortens reasoning traces while maintaining or improving final accuracy across LLMs, yielding outstanding accuracy-efficiency trade-offs.
40. 【2602.14445】Selective Synchronization Attention
链接:https://arxiv.org/abs/2602.14445
作者:Hasi Hays
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
关键词:modern deep learning, quadratic computational complexity, core self-attention mechanism, self-attention mechanism suffers, biological neural computation
备注:
点击查看摘要
Abstract:The Transformer architecture has become the foundation of modern deep learning, yet its core self-attention mechanism suffers from quadratic computational complexity and lacks grounding in biological neural computation. We propose Selective Synchronization Attention (SSA), a novel attention mechanism that replaces the standard dot-product self-attention with a closed-form operator derived from the steady-state solution of the Kuramoto model of coupled oscillators. In SSA, each token is represented as an oscillator characterized by a learnable natural frequency and phase; the synchronization strength between token pairs, determined by a frequency-dependent coupling and phase-locking condition, serves as the attention weight. This formulation provides three key advantages: (i) natural sparsity arising from the phase-locking threshold, whereby tokens with incompatible frequencies automatically receive zero attention weight without explicit masking; (ii) unified positional-semantic encoding through the natural frequency spectrum, eliminating the need for separate positional encodings; and (iii) a single-pass, closed-form computation that avoids iterative ODE integration, with all components (coupling, order parameter, synchronization) derived from the oscillatory framework. We instantiate SSA within the Oscillatory Synchronization Network (OSN), a drop-in replacement for the Transformer block. Analysis of the synchronization matrices reveals non-uniform, head-diverse coupling patterns even at initialization, demonstrating a stronger architectural inductive bias than the approximately uniform attention produced by randomly initialized Transformers.
41. 【2602.14433】Synthetic Reader Panels: Tournament-Based Ideation with LLM Personas for Autonomous Publishing
链接:https://arxiv.org/abs/2602.14433
作者:Fred Zimmerman
类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
关键词:replaces human focus, LLM-instantiated reader personas, human focus groups, autonomous book ideation, structured tournament competitions
备注: 5 tables, 1 figure
点击查看摘要
Abstract:We present a system for autonomous book ideation that replaces human focus groups with synthetic reader panels -- diverse collections of LLM-instantiated reader personas that evaluate book concepts through structured tournament competitions. Each persona is defined by demographic attributes (age group, gender, income, education, reading level), behavioral patterns (books per year, genre preferences, discovery methods, price sensitivity), and consistency parameters. Panels are composed per imprint to reflect target demographics, with diversity constraints ensuring representation across age, reading level, and genre affinity. Book concepts compete in single-elimination, double-elimination, round-robin, or Swiss-system tournaments, judged against weighted criteria including market appeal, originality, and execution potential. To reject low-quality LLM evaluations, we implement five automated anti-slop checks (repetitive phrasing, generic framing, circular reasoning, score clustering, audience mismatch). We report results from deployment within a multi-imprint publishing operation managing 6 active imprints and 609 titles in distribution. Three case studies -- a 270-evaluator panel for a children's literacy novel, and two 5-person expert panels for a military memoir and a naval strategy monograph -- demonstrate that synthetic panels produce actionable demographic segmentation, identify structural content issues invisible to homogeneous reviewers, and enable tournament filtering that eliminates low-quality concepts while enriching high-quality survivors from 15% to 62% of the evaluated pool.
42. 【2602.14428】LLM-Guided Knowledge Distillation for Temporal Knowledge Graph Reasoning
链接:https://arxiv.org/abs/2602.14428
作者:Wang Xing,Wei Song,Siyu Lin,Chen Wu,Man Wang
类目:Computation and Language (cs.CL)
关键词:time-evolving facts, costly to deploy, computationally heavy, heavy and costly, Temporal knowledge graphs
备注:
点击查看摘要
Abstract:Temporal knowledge graphs (TKGs) support reasoning over time-evolving facts, yet state-of-the-art models are often computationally heavy and costly to deploy. Existing compression and distillation techniques are largely designed for static graphs; directly applying them to temporal settings may overlook time-dependent interactions and lead to performance degradation. We propose an LLM-assisted distillation framework specifically designed for temporal knowledge graph reasoning. Beyond a conventional high-capacity temporal teacher, we incorporate a large language model as an auxiliary instructor to provide enriched supervision. The LLM supplies broad background knowledge and temporally informed signals, enabling a lightweight student to better model event dynamics without increasing inference-time complexity. Training is conducted by jointly optimizing supervised and distillation objectives, using a staged alignment strategy to progressively integrate guidance from both teachers. Extensive experiments on multiple public TKG benchmarks with diverse backbone architectures demonstrate that the proposed approach consistently improves link prediction performance over strong distillation baselines, while maintaining a compact and efficient student model. The results highlight the potential of large language models as effective teachers for transferring temporal reasoning capability to resource-efficient TKG systems.
43. 【2602.14419】WavePhaseNet: A DFT-Based Method for Constructing Semantic Conceptual Hierarchy Structures (SCHS)
链接:https://arxiv.org/abs/2602.14419
作者:Kiyotaka Kasubuchi,Kazuo Fukiya
类目:Computation and Language (cs.CL)
关键词:Large Language Models, paper reformulates Transformer, inevitable structural limitation, reformulates Transformer, Attention mechanisms
备注:
点击查看摘要
Abstract:This paper reformulates Transformer/Attention mechanisms in Large Language Models (LLMs) through measure theory and frequency analysis, theoretically demonstrating that hallucination is an inevitable structural limitation. The embedding space functions as a conditional expectation over a {\sigma}-algebra, and its failure to be isomorphic to the semantic truth set fundamentally causes logical consistency breakdown. WavePhaseNet Method The authors propose WavePhaseNet, which explicitly constructs a Semantic Conceptual Hierarchy Structure (SCHS) using Discrete Fourier Transform (DFT). By applying DFT along the sequence dimension, semantic information is decomposed into frequency bands: low-frequency components capture global meaning and intent, while high-frequency components represent local syntax and expression. This staged separation enables precise semantic manipulation in diagonalized space. Dimensionality Reduction GPT-4's 24,576-dimensional embedding space exhibits a 1/f spectral structure based on language self-similarity and Zipf's law. Through cumulative energy analysis, the authors derive that approximately 3,000 dimensions constitute the lower bound for "complete representation." This demonstrates that reduction from 24,576 to 3,000 dimensions preserves meaning and intent while enabling rigorous reasoning and suppressing hallucination. Cohomological Consistency Control The reduced embedding space, constructed via cohomological regularization over overlapping local windows, allows defining a graph structure and cochain complex. This quantifies inconsistencies among local inferences as coboundary-based losses. Applying harmonic projection based on Hodge theory positions cohomology as a computable regularization principle for controlling semantic consistency, extracting maximally consistent global representations.
44. 【2602.14406】ruthStance: An Annotated Dataset of Conversations on Truth Social
链接:https://arxiv.org/abs/2602.14406
作者:Fathima Ameen,Danielle Brown,Manusha Malgareddy,Amanul Haque
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:online discourse, central to understanding, understanding how opinions, opinions are formed, formed and contested
备注:
点击查看摘要
Abstract:Argument mining and stance detection are central to understanding how opinions are formed and contested in online discourse. However, most publicly available resources focus on mainstream platforms such as Twitter and Reddit, leaving conversational structure on alt-tech platforms comparatively under-studied. We introduce TruthStance, a large-scale dataset of Truth Social conversation threads spanning 2023-2025, consisting of 24,378 posts and 523,360 comments with reply-tree structure preserved. We provide a human-annotated benchmark of 1,500 instances across argument mining and claim-based stance detection, including inter-annotator agreement, and use it to evaluate large language model (LLM) prompting strategies. Using the best-performing configuration, we release additional LLM-generated labels for 24,352 posts (argument presence) and 107,873 comments (stance to parent), enabling analysis of stance and argumentation patterns across depth, topics, and users. All code and data are released publicly.
45. 【2602.14386】Beyond Token-Level Policy Gradients for Complex Reasoning with Large Language Models
链接:https://arxiv.org/abs/2602.14386
作者:Mufan Xu,Kehai Chen,Xuefeng Bai,Zhengyu Niu,Muyun Yang,Tiejun Zhao,Min Zhang
类目:Computation and Language (cs.CL)
关键词:Existing policy-gradient methods, models typically select, typically select subsequent, Existing policy-gradient, select subsequent tokens
备注:
点击查看摘要
Abstract:Existing policy-gradient methods for auto-regressive language models typically select subsequent tokens one at a time as actions in the policy. While effective for many generation tasks, such an approach may not fully capture the structure of complex reasoning tasks, where a single semantic decision is often realized across multiple tokens--for example, when defining variables or composing equations. This introduces a potential mismatch between token-level optimization and the inherently block-level nature of reasoning in these settings. To bridge this gap, we propose Multi-token Policy Gradient Optimization (MPO), a framework that treats sequences of K consecutive tokens as unified semantic actions. This block-level perspective enables our method to capture the compositional structure of reasoning trajectories and supports optimization over coherent, higher-level objectives. Experiments on mathematical reasoning and coding benchmarks show that MPO outperforms standard token-level policy gradient baselines, highlight the limitations of token-level policy gradients for complex reasoning, motivating future research to look beyond token-level granularity for reasoning-intensive language tasks.
46. 【2602.14374】Differentially Private Retrieval-Augmented Generation
链接:https://arxiv.org/abs/2602.14374
作者:Tingting Tang,James Flemings,Yongqin Wang,Murali Annavaram
类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:large language models, Retrieval-augmented generation, support accurate responses, language models, retrieving relevant documents
备注:
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) is a widely used framework for reducing hallucinations in large language models (LLMs) on domain-specific tasks by retrieving relevant documents from a database to support accurate responses. However, when the database contains sensitive corpora, such as medical records or legal documents, RAG poses serious privacy risks by potentially exposing private information through its outputs. Prior work has demonstrated that one can practically craft adversarial prompts that force an LLM to regurgitate the augmented contexts. A promising direction is to integrate differential privacy (DP), a privacy notion that offers strong formal guarantees, into RAG systems. However, naively applying DP mechanisms into existing systems often leads to significant utility degradation. Particularly for RAG systems, DP can reduce the usefulness of the augmented contexts leading to increase risk of hallucination from the LLMs. Motivated by these challenges, we present DP-KSA, a novel privacy-preserving RAG algorithm that integrates DP using the propose-test-release paradigm. DP-KSA follows from a key observation that most question-answering (QA) queries can be sufficiently answered with a few keywords. Hence, DP-KSA first obtains an ensemble of relevant contexts, each of which will be used to generate a response from an LLM. We utilize these responses to obtain the most frequent keywords in a differentially private manner. Lastly, the keywords are augmented into the prompt for the final output. This approach effectively compresses the semantic space while preserving both utility and privacy. We formally show that DP-KSA provides formal DP guarantees on the generated output with respect to the RAG database. We evaluate DP-KSA on two QA benchmarks using three instruction-tuned LLMs, and our empirical results demonstrate that DP-KSA achieves a strong privacy-utility tradeoff.
47. 【2602.14367】InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem
链接:https://arxiv.org/abs/2602.14367
作者:Shuofei Qiao,Yunxiang Wei,Xuehai Wang,Bin Wu,Boyang Xue,Ningyu Zhang,Hossein A. Rahmani,Yanshan Wang,Qiang Zhang,Keyan Ding,Jeff Z. Pan,Huajun Chen,Emine Yilmaz
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, Language Models, evolution of Large, Models has catalyzed
备注: Ongoing Work
点击查看摘要
Abstract:The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation. The fundamental nature of scientific evaluation needs knowledgeable grounding, collective deliberation, and multi-criteria decision-making. However, existing idea evaluation methods often suffer from narrow knowledge horizons, flattened evaluation dimensions, and the inherent bias in LLM-as-a-Judge. To address these, we regard idea evaluation as a knowledge-grounded, multi-perspective reasoning problem and introduce InnoEval, a deep innovation evaluation framework designed to emulate human-level idea assessment. We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We further achieve review consensus with an innovation review board containing reviewers with distinct academic backgrounds, enabling a multi-dimensional decoupled evaluation across multiple metrics. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval. Experiments demonstrate that InnoEval can consistently outperform baselines in point-wise, pair-wise, and group-wise evaluation tasks, exhibiting judgment patterns and consensus highly aligned with human experts.
48. 【2602.14299】Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook
链接:https://arxiv.org/abs/2602.14299
作者:Ming Li,Xirui Li,Tianyi Zhou
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
关键词:populate networked environments, fundamental question arises, large language model, increasingly populate networked, language model agents
备注:
点击查看摘要
Abstract:As large language model agents increasingly populate networked environments, a fundamental question arises: do artificial intelligence (AI) agent societies undergo convergence dynamics similar to human social systems? Lately, Moltbook approximates a plausible future scenario in which autonomous agents participate in an open-ended, continuously evolving online society. We present the first large-scale systemic diagnosis of this AI agent society. Beyond static observation, we introduce a quantitative diagnostic framework for dynamic evolution in AI agent societies, measuring semantic stabilization, lexical turnover, individual inertia, influence persistence, and collective consensus. Our analysis reveals a system in dynamic balance in Moltbook: while global semantic averages stabilize rapidly, individual agents retain high diversity and persistent lexical turnover, defying homogenization. However, agents exhibit strong individual inertia and minimal adaptive response to interaction partners, preventing mutual influence and consensus. Consequently, influence remains transient with no persistent supernodes, and the society fails to develop stable collective influence anchors due to the absence of shared social memory. These findings demonstrate that scale and interaction density alone are insufficient to induce socialization, providing actionable design and analysis principles for upcoming next-generation AI agent societies.
49. 【2602.14285】FMMD: A multimodal open peer review dataset based on F1000Research
链接:https://arxiv.org/abs/2602.14285
作者:Zhenzhen Zhuang,Yuqing Fu,Jing Zhu,Zhangping Zhou,Jialiang Lin
类目:Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Automated scholarly paper, real-world manuscript evaluation, scholarly paper review, artificial intelligence, systems are increasingly
备注: Work in progress
点击查看摘要
Abstract:Automated scholarly paper review (ASPR) has entered the coexistence phase with traditional peer review, where artificial intelligence (AI) systems are increasingly incorporated into real-world manuscript evaluation. In parallel, research on automated and AI-assisted peer review has proliferated. Despite this momentum, empirical progress remains constrained by several critical limitations in existing datasets. While reviewers routinely evaluate figures, tables, and complex layouts to assess scientific claims, most existing datasets remain overwhelmingly text-centric. This bias is reinforced by a narrow focus on data from computer science venues. Furthermore, these datasets lack precise alignment between reviewer comments and specific manuscript versions, obscuring the iterative relationship between peer review and manuscript evolution. In response, we introduce FMMD, a multimodal and multidisciplinary open peer review dataset curated from F1000Research. The dataset bridges the current gap by integrating manuscript-level visual and structural data with version-specific reviewer reports and editorial decisions. By providing explicit alignment between reviewer comments and the exact article iteration under review, FMMD enables fine-grained analysis of the peer review lifecycle across diverse scientific domains. FMMD supports tasks such as multimodal issue detection and multimodal review comment generation. It provides a comprehensive empirical resource for the development of peer review research.
50. 【2602.14281】MCPShield: A Security Cognition Layer for Adaptive Trust Calibration in Model Context Protocol Agents
链接:https://arxiv.org/abs/2602.14281
作者:Zhenhong Zhou,Yuanhe Zhang,Hongwei Cai,Moayad Aloqaily,Ouns Bouachir,Linsey Pang,Prakhar Mehrotra,Kun Wang,Qingsong Wen
类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)
关键词:Model Context Protocol, Context Protocol, Model Context, MCP servers, third-party MCP servers
备注: 21 pages, 5 figures, 6 tables
点击查看摘要
Abstract:The Model Context Protocol (MCP) standardizes tool use for LLM-based agents and enable third-party servers. This openness introduces a security misalignment: agents implicitly trust tools exposed by potentially untrusted MCP servers. However, despite its excellent utility, existing agents typically offer limited validation for third-party MCP servers. As a result, agents remain vulnerable to MCP-based attacks that exploit the misalignment between agents and servers throughout the tool invocation lifecycle. In this paper, we propose MCPShield as a plug-in security cognition layer that mitigates this misalignment and ensures agent security when invoking MCP-based tools. Drawing inspiration from human experience-driven tool validation, MCPShield assists agent forms security cognition with metadata-guided probing before invocation. Our method constrains execution within controlled boundaries while cognizing runtime events, and subsequently updates security cognition by reasoning over historical traces after invocation, building on human post-use reflection on tool behavior. Experiments demonstrate that MCPShield exhibits strong generalization in defending against six novel MCP-based attack scenarios across six widely used agentic LLMs, while avoiding false positives on benign servers and incurring low deployment overhead. Overall, our work provides a practical and robust security safeguard for MCP-based tool invocation in open agent ecosystems.
51. 【2602.14279】Whom to Query for What: Adaptive Group Elicitation via Multi-Turn LLM Interactions
链接:https://arxiv.org/abs/2602.14279
作者:Ruomeng Ding,Tianwei Gao,Thomas P. Zollo,Eitan Bachmat,Richard Zemel,Zhun Deng
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
关键词:latent group-level properties, collective assessments requires, assessments requires allocating, requires allocating limited, allocating limited questioning
备注:
点击查看摘要
Abstract:Eliciting information to reduce uncertainty about latent group-level properties from surveys and other collective assessments requires allocating limited questioning effort under real costs and missing data. Although large language models enable adaptive, multi-turn interactions in natural language, most existing elicitation methods optimize what to ask with a fixed respondent pool, and do not adapt respondent selection or leverage population structure when responses are partial or incomplete. To address this gap, we study adaptive group elicitation, a multi-round setting where an agent adaptively selects both questions and respondents under explicit query and participation budgets. We propose a theoretically grounded framework that combines (i) an LLM-based expected information gain objective for scoring candidate questions with (ii) heterogeneous graph neural network propagation that aggregates observed responses and participant attributes to impute missing responses and guide per-round respondent selection. This closed-loop procedure queries a small, informative subset of individuals while inferring population-level responses via structured similarity. Across three real-world opinion datasets, our method consistently improves population-level response prediction under constrained budgets, including a 12% relative gain on CES at a 10% respondent budget.
52. 【2602.14265】STATe-of-Thoughts: Structured Action Templates for Tree-of-Thoughts
链接:https://arxiv.org/abs/2602.14265
作者:Zachary Bamberger,Till R. Saenger,Gilad Morad,Ofra Amir,Brandon M. Stewart,Amir Feder
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:achieve meaningful output, existing ITC methods, ITC methods offer, interpretable ITC method, fails to achieve
备注: v1, 18 pages main, 55 pages total, 9 tables, 12 figures
点击查看摘要
Abstract:Inference-Time-Compute (ITC) methods like Best-of-N and Tree-of-Thoughts are meant to produce output candidates that are both high-quality and diverse, but their use of high-temperature sampling often fails to achieve meaningful output diversity. Moreover, existing ITC methods offer limited control over how to perform reasoning, which in turn limits their explainability. We present STATe-of-Thoughts (STATe), an interpretable ITC method that searches over high-level reasoning patterns. STATe replaces stochastic sampling with discrete and interpretable textual interventions: a controller selects actions encoding high-level reasoning choices, a generator produces reasoning steps conditioned on those choices, and an evaluator scores candidates to guide search. This structured approach yields three main advantages. First, action-guided textual interventions produce greater response diversity than temperature-based sampling. Second, in a case study on argument generation, STATe's explicit action sequences capture interpretable features that are highly predictive of output quality. Third, estimating the association between performance and action choices allows us to identify promising yet unexplored regions of the action space and steer generation directly toward them. Together, these results establish STATe as a practical framework for generating high-quality, diverse, and interpretable text. Our framework is available at this https URL.
53. 【2602.14259】Detecting LLM Hallucinations via Embedding Cluster Geometry: A Three-Type Taxonomy with Measurable Signatures
链接:https://arxiv.org/abs/2602.14259
作者:Matic Korun
类目:Computation and Language (cs.CL)
关键词:large language model, token embedding cluster, distinct hallucination types, taxonomy of large, large language
备注: 9 pages, 5 figures
点击查看摘要
Abstract:We propose a geometric taxonomy of large language model hallucinations based on observable signatures in token embedding cluster structure. By analyzing the static embedding spaces of 11 transformer models spanning encoder (BERT, RoBERTa, ELECTRA, DeBERTa, ALBERT, MiniLM, DistilBERT) and decoder (GPT-2) architectures, we identify three operationally distinct hallucination types: Type 1 (center-drift) under weak context, Type 2 (wrong-well convergence) to locally coherent but contextually incorrect cluster regions, and Type 3 (coverage gaps) where no cluster structure exists. We introduce three measurable geometric statistics: {\alpha} (polarity coupling), \b{eta} (cluster cohesion), and {\lambda}_s (radial information gradient). Across all 11 models, polarity structure ({\alpha} 0.5) is universal (11/11), cluster cohesion (\b{eta} 0) is universal (11/11), and the radial information gradient is significant (9/11, p 0.05). We demonstrate that the two models failing {\lambda}_s significance -- ALBERT and MiniLM -- do so for architecturally explicable reasons: factorized embedding compression and distillation-induced isotropy, respectively. These findings establish the geometric prerequisites for type-specific hallucination detection and yield testable predictions about architecture-dependent vulnerability profiles.
54. 【2602.14257】AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
链接:https://arxiv.org/abs/2602.14257
作者:Lingxiang Hu,Yiding Sun,Tianle Xia,Wenwei Li,Ming Xu,Liqun Liu,Peng Shu,Huan Yu,Jie Jiang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:Large Language Model, Large Language, achieved remarkable progress, Language Model, critical problem
备注: 15 pages, 11 figures
点击查看摘要
Abstract:While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem. Current benchmarks, however, are largely restricted to idealized simulations, failing to address the practical demands of specialized domains like advertising and marketing analytics. In these fields, tasks are inherently more complex, often requiring multi-round interaction with professional marketing tools. To address this gap, we propose AD-Bench, a benchmark designed based on real-world business requirements of advertising and marketing platforms. AD-Bench is constructed from real user marketing analysis requests, with domain experts providing verifiable reference answers and corresponding reference tool-call trajectories. The benchmark categorizes requests into three difficulty levels (L1-L3) to evaluate agents' capabilities under multi-round, multi-tool collaboration. Experiments show that on AD-Bench, Gemini-3-Pro achieves Pass@1 = 68.0% and Pass@3 = 83.0%, but performance drops significantly on L3 to Pass@1 = 49.4% and Pass@3 = 62.1%, with a trajectory coverage of 70.1%, indicating that even state-of-the-art models still exhibit substantial capability gaps in complex advertising and marketing analysis scenarios. AD-Bench provides a realistic benchmark for evaluating and improving advertising marketing agents, the leaderboard and code can be found at this https URL.
55. 【2602.14238】We can still parse using syntactic rules
链接:https://arxiv.org/abs/2602.14238
作者:Ghaly Hussein
类目:Computation and Language (cs.CL)
关键词:phrase structure grammar, generalized phrase structure, context free grammar, free grammar, structure grammar
备注:
点击查看摘要
Abstract:This research introduces a new parsing approach, based on earlier syntactic work on context free grammar (CFG) and generalized phrase structure grammar (GPSG). The approach comprises both a new parsing algorithm and a set of syntactic rules and features that overcome the limitations of CFG. It also generates both dependency and constituency parse trees, while accommodating noise and incomplete parses. The system was tested on data from Universal Dependencies, showing a promising average Unlabeled Attachment Score (UAS) of 54.5% in the development dataset (7 corpora) and 53.8% in the test set (12 corpora). The system also provides multiple parse hypotheses, allowing further reranking to improve parsing accuracy. This approach also leverages much of the theoretical syntactic work since the 1950s to be used within a computational context. The application of this approach provides a transparent and interpretable NLP model to process language input.
56. 【2602.14234】REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents
链接:https://arxiv.org/abs/2602.14234
作者:Zheng Chu,Xiao Wang,Jack Hong,Huiming Fan,Yuqi Huang,Yue Yang,Guohai Xu,Chenxiao Zhao,Cheng Xiang,Shengchao Hu,Dongdong Kuang,Ming Liu,Bing Qin,Xing Yu
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:realworld problem solvers, Large language models, Large language, tasks remains challenging, generalpurpose knowledge engines
备注: [this https URL](https://redsearchagent.github.io/index/)
点击查看摘要
Abstract:Large language models are transitioning from generalpurpose knowledge engines to realworld problem solvers, yet optimizing them for deep search tasks remains challenging. The central bottleneck lies in the extreme sparsity of highquality search trajectories and reward signals, arising from the difficulty of scalable longhorizon task construction and the high cost of interactionheavy rollouts involving external tool calls. To address these challenges, we propose REDSearcher, a unified framework that codesigns complex task synthesis, midtraining, and posttraining for scalable searchagent optimization. Specifically, REDSearcher introduces the following improvements: (1) We frame task synthesis as a dualconstrained optimization, where task difficulty is precisely governed by graph topology and evidence dispersion, allowing scalable generation of complex, highquality tasks. (2) We introduce toolaugmented queries to encourage proactive tool use rather than passive recall.(3) During midtraining, we strengthen core atomic capabilities knowledge, planning, and function calling substantially reducing the cost of collecting highquality trajectories for downstream training. (4) We build a local simulated environment that enables rapid, lowcost algorithmic iteration for reinforcement learning experiments. Across both textonly and multimodal searchagent benchmarks, our approach achieves stateoftheart performance. To facilitate future research on longhorizon search agents, we will release 10K highquality complex text search trajectories, 5K multimodal trajectories and 1K text RL query set, and together with code and model checkpoints.
57. 【2602.14224】he Interspeech 2026 Audio Reasoning Challenge: Evaluating Reasoning Process Quality for Audio Reasoning Models and Agents
链接:https://arxiv.org/abs/2602.14224
作者:Ziyang Ma,Ruiyang Xu,Yinghao Ma,Chao-Han Huck Yang,Bohan Li,Jaeyeon Kim,Jin Xu,Jinyu Li,Carlos Busso,Kai Yu,Eng Siong Chng,Xie Chen
类目:ound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM)
关键词:Recent Large Audio, Large Audio Language, Recent Large, Audio Language Models, lack transparent reasoning
备注: The official website of the Audio Reasoning Challenge: [this https URL](https://audio-reasoning-challenge.github.io)
点击查看摘要
Abstract:Recent Large Audio Language Models (LALMs) excel in understanding but often lack transparent reasoning. To address this "black-box" limitation, we organized the Audio Reasoning Challenge at Interspeech 2026, the first shared task dedicated to evaluating Chain-of-Thought (CoT) quality in the audio domain. The challenge introduced MMAR-Rubrics, a novel instance-level protocol assessing the factuality and logic of reasoning chains. Featured Single Model and Agent tracks, the competition attracting 156 teams from 18 countries and regions. Results show agent systems currently lead in reasoning quality, utilizing iterative tool orchestration and cross-modal analysis. Besides, single models are rapidly advancing via reinforcement learning and sophisticated data pipeline. We details the challenge design, methodology, and a comprehensive analysis of state-of-the-art systems, providing new insights for explainable audio intelligence.
58. 【2602.14216】Reasoning Language Models for complex assessments tasks: Evaluating parental cooperation from child protection case reports
链接:https://arxiv.org/abs/2602.14216
作者:Dragan Stoll,Brian E. Perron,Zia Qi,Selina Steinmann,Nicole F. Eicher,Andreas Jud
类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:demonstrated significant advances, Reasoning language models, language models, Purpose, demonstrated significant
备注:
点击查看摘要
Abstract:Purpose: Reasoning language models (RLMs) have demonstrated significant advances in solving complex reasoning tasks. We examined their potential to assess parental cooperation during CPS interventions using case reports, a case factor characterized by ambiguous and conflicting information. Methods: A four stage workflow comprising (1) case reports collection, (2) reasoning-based assessment of parental cooperation, (3) automated category extraction, and (4) case labeling was developed. The performance of RLMs with different parameter sizes (255B, 32B, 4B) was compared against human validated data. Two expert human reviewers (EHRs) independently classified a weighted random sample of reports. Results: The largest RLM achieved the highest accuracy (89%), outperforming the initial approach (80%). Classification accuracy was higher for mothers (93%) than for fathers (85%), and EHRs exhibited similar differences. Conclusions: RLMs' reasoning can effectively assess complex case factors such as parental cooperation. Lower accuracy in assessing fathers' cooperation supports the argument of a stronger professional focus on mothers in CPS interventions.
59. 【2602.14209】MAGE: All-[MASK] Block Already Knows Where to Look in Diffusion LLM
链接:https://arxiv.org/abs/2602.14209
作者:Omin Kwon,Yeonjae Kim,Doyeon Kim,Minseo Kim,Yeonhong Park,Jae W. Lee
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:caching makes memory, makes memory access, Block diffusion LLMs, Block diffusion, language generation
备注:
点击查看摘要
Abstract:Block diffusion LLMs are emerging as a promising next paradigm for language generation, but their use of KV caching makes memory access a dominant bottleneck in long-context settings. While dynamic sparse attention has been actively explored, existing methods designed for autoregressive LLMs rely on approximate importance estimation and perform poorly when adapted to block diffusion. This work identifies a key opportunity unique to block diffusion: attention at the first All-[MASK] denoising step reliably predicts important KV entries and budget requirements, enabling MAGE to perform a single exact attention pass per block and reuse it for training-free sparse denoising. Across long-context benchmarks including LongBench and Needle-in-a-Haystack, MAGE achieves near-lossless accuracy with a fraction of the KV budget while delivering up to 3-4x end-to-end speedup, consistently outperforming AR-oriented sparse attention baselines. A lightweight fine-tuning strategy further strengthens [MASK]-guided patterns with minimal cost, requiring only a few hours of training on a single NVIDIA H100 GPU for both 1.5B and 7B models.
60. 【2602.14189】Knowing When Not to Answer: Abstention-Aware Scientific Reasoning
链接:https://arxiv.org/abs/2602.14189
作者:Samir Abdaljalil,Erchin Serpedin,Hasan Kurban
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, Large language, existing evaluations typically, evaluations typically assume, verify scientific claims
备注:
点击查看摘要
Abstract:Large language models are increasingly used to answer and verify scientific claims, yet existing evaluations typically assume that a model must always produce a definitive answer. In scientific settings, however, unsupported or uncertain conclusions can be more harmful than abstaining. We study this problem through an abstention-aware verification framework that decomposes scientific claims into minimal conditions, audits each condition against available evidence using natural language inference (NLI), and selectively decides whether to support, refute, or abstain. We evaluate this framework across two complementary scientific benchmarks: SciFact and PubMedQA, covering both closed-book and open-domain evidence settings. Experiments are conducted with six diverse language models, including encoder-decoder, open-weight chat models, and proprietary APIs. Across all benchmarks and models, we observe that raw accuracy varies only modestly across architectures, while abstention plays a critical role in controlling error. In particular, confidence-based abstention substantially reduces risk at moderate coverage levels, even when absolute accuracy improvements are limited. Our results suggest that in scientific reasoning tasks, the primary challenge is not selecting a single best model, but rather determining when available evidence is sufficient to justify an answer. This work highlights abstention-aware evaluation as a practical and model-agnostic lens for assessing scientific reliability, and provides a unified experimental basis for future work on selective reasoning in scientific domains. Code is available at this https URL .
61. 【2602.14188】GPT-5 vs Other LLMs in Long Short-Context Performance
链接:https://arxiv.org/abs/2602.14188
作者:Nima Esmi(1 and 2),Maryam Nezhad-Moghaddam(3),Fatemeh Borhani(3),Asadollah Shahbahrami(2 and 3),Amin Daemdoost(3),Georgi Gaydadjiev(4) ((1) Bernoulli Institute, RUG, Groningen, Netherlands, (2) ISRC, Khazar University, Baku, Azerbaijan, (3) Department of Computer Engineering, University of Guilan, Rasht, Iran, (4) QCE Department, TU Delft, Delft, Netherlands)
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
关键词:Large Language Models, Large Language, window in Large, Language Models, single pass
备注: 10 pages, 7 figures. Accepted for publication in the 3rd International Conference on Foundation and Large Language Models (FLLM2025). IEEE. The final version will be available in IEEE Xplore
点击查看摘要
Abstract:With the significant expansion of the context window in Large Language Models (LLMs), these models are theoretically capable of processing millions of tokens in a single pass. However, research indicates a significant gap between this theoretical capacity and the practical ability of models to robustly utilize information within long contexts, especially in tasks that require a comprehensive understanding of numerous details. This paper evaluates the performance of four state-of-the-art models (Grok-4, GPT-4, Gemini 2.5, and GPT-5) on long short-context tasks. For this purpose, three datasets were used: two supplementary datasets for retrieving culinary recipes and math problems, and a primary dataset of 20K social media posts for depression detection. The results show that as the input volume on the social media dataset exceeds 5K posts (70K tokens), the performance of all models degrades significantly, with accuracy dropping to around 50-53% for 20K posts. Notably, in the GPT-5 model, despite the sharp decline in accuracy, its precision remained high at approximately 95%, a feature that could be highly effective for sensitive applications like depression detection. This research also indicates that the "lost in the middle" problem has been largely resolved in newer models. This study emphasizes the gap between the theoretical capacity and the actual performance of models on complex, high-volume data tasks and highlights the importance of metrics beyond simple accuracy for practical applications.
62. 【2602.14172】Investigation for Relative Voice Impression Estimation
链接:https://arxiv.org/abs/2602.14172
作者:Keinichi Fujita,Yusuke Ijima
类目:ound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:strongly influence listener, Paralinguistic and non-linguistic, influence listener impressions, speech strongly influence, non-linguistic aspects
备注: 5 pages,3 figures, Accepted to Speech Prosody 2026
点击查看摘要
Abstract:Paralinguistic and non-linguistic aspects of speech strongly influence listener impressions. While most research focuses on absolute impression scoring, this study investigates relative voice impression estimation (RIE), a framework for predicting the perceptual difference between two utterances from the same speaker. The estimation target is a low-dimensional vector derived from subjective evaluations, quantifying the perceptual shift of the second utterance relative to the first along an antonymic axis (e.g., ``Dark--Bright''). To isolate expressive and prosodic variation, we used recordings of a professional speaker reading a text in various styles. We compare three modeling approaches: classical acoustic features commonly used for speech emotion recognition, self-supervised speech representations, and multimodal large language models (MLLMs). Our results demonstrate that models using self-supervised representations outperform methods with classical acoustic features, particularly in capturing complex and dynamic impressions (e.g., ``Cold--Warm'') where classical features fail. In contrast, current MLLMs prove unreliable for this fine-grained pairwise task. This study provides the first systematic investigation of RIE and demonstrates the strength of self-supervised speech models in capturing subtle perceptual variations.
63. 【2602.14169】Deep Dense Exploration for LLM Reinforcement Learning via Pivot-Driven Resampling
链接:https://arxiv.org/abs/2602.14169
作者:Yiran Guo,Zhongjian Qiao,Yingqi Xie,Jie Liu,Dan Ye,Ruiqing Zhang,Shuang Qiu,Lijie Xu
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:large language models, language sequence space, vast natural language, natural language sequence, Effective exploration
备注:
点击查看摘要
Abstract:Effective exploration is a key challenge in reinforcement learning for large language models: discovering high-quality trajectories within a limited sampling budget from the vast natural language sequence space. Existing methods face notable limitations: GRPO samples exclusively from the root, saturating high-probability trajectories while leaving deep, error-prone states under-explored. Tree-based methods blindly disperse budgets across trivial or unrecoverable states, causing sampling dilution that fails to uncover rare correct suffixes and destabilizes local baselines. To address this, we propose Deep Dense Exploration (DDE), a strategy that focuses exploration on $\textit{pivots}$-deep, recoverable states within unsuccessful trajectories. We instantiate DDE with DEEP-GRPO, which introduces three key innovations: (1) a lightweight data-driven utility function that automatically balances recoverability and depth bias to identify pivot states; (2) local dense resampling at each pivot to increase the probability of discovering correct subsequent trajectories; and (3) a dual-stream optimization objective that decouples global policy learning from local corrective updates. Experiments on mathematical reasoning benchmarks demonstrate that our method consistently outperforms GRPO, tree-based methods, and other strong baselines.
64. 【2602.14162】Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering
链接:https://arxiv.org/abs/2602.14162
作者:Tao Xu
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:Existing multimodal document, generate comprehensive descriptions, answering methods universally, methods universally adopt, Existing multimodal
备注: 24 pages, 9 figures, 9 tables
点击查看摘要
Abstract:Existing multimodal document question answering methods universally adopt a supply-side ingestion strategy: running a Vision-Language Model (VLM) on every page during indexing to generate comprehensive descriptions, then answering questions through text retrieval. However, this "pre-ingestion" approach is costly (a 113-page engineering drawing package requires approximately 80,000 VLM tokens), end-to-end unreliable (VLM outputs may fail to be correctly retrieved due to format mismatches in the retrieval infrastructure), and irrecoverable once it fails. This paper proposes the Deferred Visual Ingestion (DVI) framework, adopting a demand-side ingestion strategy: the indexing phase performs only lightweight metadata extraction, deferring visual understanding to the moment users pose specific questions. DVI's core principle is "Index for locating, not understanding"--achieving page localization through structured metadata indexes and BM25 full-text search, then sending original images along with specific questions to a VLM for targeted analysis. Experiments on two real industrial engineering drawings (113 pages + 7 pages) demonstrate that DVI achieves comparable overall accuracy at zero ingestion VLM cost (46.7% vs. 48.9%), an effectiveness rate of 50% on visually necessary queries (vs. 0% for pre-ingestion), and 100% page localization (98% search space compression). DVI also supports interactive refinement and progressive caching, transforming the "QA accuracy" problem into a "page localization" problem--once the correct drawing page is found, obtaining the answer becomes a matter of interaction rounds.
65. 【2602.14158】A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing
链接:https://arxiv.org/abs/2602.14158
作者:Naeimeh Nourmohammadi,Md Meem Hossain, TheAnh Han,Safina Showkat Ara,Zia Ush Shamszaman
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
关键词:Large language models, healthcare question answering, unreliable confidence signalling, insufficient evidence grounding, Large language
备注: 27 pages, 14 figures, 5 tables
点击查看摘要
Abstract:Large language models (LLMs) show promise for healthcare question answering, but clinical use is limited by weak verification, insufficient evidence grounding, and unreliable confidence signalling. We propose a multi-agent medical QA framework that combines complementary LLMs with evidence retrieval, uncertainty estimation, and bias checks to improve answer reliability. Our approach has two phases. First, we fine-tune three representative LLM families (GPT, LLaMA, and DeepSeek R1) on MedQuAD-derived medical QA data (20k+ question-answer pairs across multiple NIH domains) and benchmark generation quality. DeepSeek R1 achieves the strongest scores (ROUGE-1 0.536 +- 0.04; ROUGE-2 0.226 +-0.03; BLEU 0.098 -+ 0.018) and substantially outperforms the specialised biomedical baseline BioGPT in zero-shot evaluation. Second, we implement a modular multi-agent pipeline in which a Clinical Reasoning agent (fine-tuned LLaMA) produces structured explanations, an Evidence Retrieval agent queries PubMed to ground responses in recent literature, and a Refinement agent (DeepSeek R1) improves clarity and factual consistency; an optional human validation path is triggered for high-risk or high-uncertainty cases. Safety mechanisms include Monte Carlo dropout and perplexity-based uncertainty scoring, plus lexical and sentiment-based bias detection supported by LIME/SHAP-based analyses. In evaluation, the full system achieves 87% accuracy with relevance around 0.80, and evidence augmentation reduces uncertainty (perplexity 4.13) compared to base responses, with mean end-to-end latency of 36.5 seconds under the reported configuration. Overall, the results indicate that agent specialisation and verification layers can mitigate key single-model limitations and provide a practical, extensible design for evidence-based and bias-aware medical AI.
66. 【2602.14143】ROAST: Rollout-based On-distribution Activation Steering Technique
链接:https://arxiv.org/abs/2602.14143
作者:Xuanbo Su,Hao Luo,Yingfang Zhang,Lijun Zhang
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:Continuous Soft Scaling, large language models, Rollout-based On-distribution Activation, Activation Steering Technique, inference time
备注:
点击查看摘要
Abstract:Activation steering provides parameter-efficient control over large language models (LLMs) at inference time, but many methods rely on off-distribution supervision and discrete masking, leading to brittle interventions. We propose ROAST (Rollout-based On-distribution Activation Steering Technique), which estimates steering directions from the model's own on-distribution rollouts via ROC and avoids hard sparsification via Continuous Soft Scaling (CSS) and Grouped Mean Normalization. Our empirical analysis reveals that while activation magnitude correlates moderately with directional consistency, the variance in magnitude is significant and often disproportionate to semantic quality. This suggests that high-magnitude activations risk dominating the global steering direction if not properly normalized. To address this, ROAST employs grouped normalization to balance contributions across samples, ensuring a more robust estimation of the consensus steering direction. Across models (0.6B to 32B), ROAST consistently improves performance on diverse tasks (e.g., +9.7% on GSM8K for Qwen3-0.6B and +12.1% on TruthfulQA for GLM4-32B), and analyses show that CSS better preserves activation energy.
67. 【2602.14130】Algebraic Quantum Intelligence: A New Framework for Reproducible Machine Creativity
链接:https://arxiv.org/abs/2602.14130
作者:Kazuo Yano,Jonghyeok Lee,Tae Ishitomi,Hironobu Kawaguchi,Akira Koyama,Masakuni Ota,Yuki Ota,Nobuo Sato,Keita Shimada,Sho Takematsu,Ayaka Tobinai,Satomi Tsuji,Kazunori Yanagi,Keiko Yano,Manabu Harada,Yuki Matsuda,Kazunori Matsumoto,Kenichi Matsumura,Hamae Matsuo,Yumi Miyazaki,Kotaro Murai,Tatsuya Ohshita,Marie Seki,Shun Tanoue,Tatsuki Terakado,Yuko Ichimaru,Mirei Saito,Akihiro Otsuka,Koji Ara
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:outputs remains limited, Large language models, achieved remarkable success, Large language, produce genuinely creative
备注:
点击查看摘要
Abstract:Large language models (LLMs) have achieved remarkable success in generating fluent and contextually appropriate text; however, their capacity to produce genuinely creative outputs remains limited. This paper posits that this limitation arises from a structural property of contemporary LLMs: when provided with rich context, the space of future generations becomes strongly constrained, and the generation process is effectively governed by near-deterministic dynamics. Recent approaches such as test-time scaling and context adaptation improve performance but do not fundamentally alter this constraint. To address this issue, we propose Algebraic Quantum Intelligence (AQI) as a computational framework that enables systematic expansion of semantic space. AQI is formulated as a noncommutative algebraic structure inspired by quantum theory, allowing properties such as order dependence, interference, and uncertainty to be implemented in a controlled and designable manner. Semantic states are represented as vectors in a Hilbert space, and their evolution is governed by C-values computed from noncommutative operators, thereby ensuring the coexistence and expansion of multiple future semantic possibilities. In this study, we implement AQI by extending a transformer-based LLM with more than 600 specialized operators. We evaluate the resulting system on creative reasoning benchmarks spanning ten domains under an LLM-as-a-judge protocol. The results show that AQI consistently outperforms strong baseline models, yielding statistically significant improvements and reduced cross-domain variance. These findings demonstrate that noncommutative algebraic dynamics can serve as a practical and reproducible foundation for machine creativity. Notably, this architecture has already been deployed in real-world enterprise environments.
68. 【2602.14100】Character-aware Transformers Learn an Irregular Morphological Pattern Yet None Generalize Like Humans
链接:https://arxiv.org/abs/2602.14100
作者:Akhilesh Kakolu Ramarao,Kevin Tang,Dinah Baer-Henney
类目:Computation and Language (cs.CL)
关键词:morphological learning remains, open question, neural networks, networks can serve, serve as cognitive
备注:
点击查看摘要
Abstract:Whether neural networks can serve as cognitive models of morphological learning remains an open question. Recent work has shown that encoder-decoder models can acquire irregular patterns, but evidence that they generalize these patterns like humans is mixed. We investigate this using the Spanish \emph{L-shaped morphome}, where only the first-person singular indicative (e.g., \textit{pongo} `I put') shares its stem with all subjunctive forms (e.g., \textit{ponga, pongas}) despite lacking apparent phonological, semantic, or syntactic motivation. We compare five encoder-decoder transformers varying along two dimensions: sequential vs. position-invariant positional encoding, and atomic vs. decomposed tag representations. Positional encoding proves decisive: position-invariant models recover the correct L-shaped paradigm clustering even when L-shaped verbs are scarce in training, whereas sequential positional encoding models only partially capture the pattern. Yet none of the models productively generalize this pattern to novel forms. Position-invariant models generalize the L-shaped stem across subjunctive cells but fail to extend it to the first-person singular indicative, producing a mood-based generalization rather than the L-shaped morphomic pattern. Humans do the opposite, generalizing preferentially to the first-person singular indicative over subjunctive forms. None of the models reproduce the human pattern, highlighting the gap between statistical pattern reproduction and morphological abstraction.
69. 【2602.14081】CCiV: A Benchmark for Structure, Rhythm and Quality in LLM-Generated Chinese \textit{Ci} Poetry
链接:https://arxiv.org/abs/2602.14081
作者:Shangqing Zhao,Yupei Ren,Yuhao Zhou,Xiaopeng Bai,Man Lan
类目:Computation and Language (cs.CL)
关键词:classical Chinese, large language models, rhythmic harmony, poses a significant, demanding a sophisticated
备注: ARR 2025 May and Icassp 2026 submission. Working in progress
点击查看摘要
Abstract:The generation of classical Chinese \textit{Ci} poetry, a form demanding a sophisticated blend of structural rigidity, rhythmic harmony, and artistic quality, poses a significant challenge for large language models (LLMs). To systematically evaluate and advance this capability, we introduce \textbf{C}hinese \textbf{Ci}pai \textbf{V}ariants (\textbf{CCiV}), a benchmark designed to assess LLM-generated \textit{Ci} poetry across these three dimensions: structure, rhythm, and quality. Our evaluation of 17 LLMs on 30 \textit{Cipai} reveals two critical phenomena: models frequently generate valid but unexpected historical variants of a poetic form, and adherence to tonal patterns is substantially harder than structural rules. We further show that form-aware prompting can improve structural and tonal control for stronger models, while potentially degrading weaker ones. Finally, we observe weak and inconsistent alignment between formal correctness and literary quality in our sample. CCiV highlights the need for variant-aware evaluation and more holistic constrained creative generation methods.
70. 【2602.14080】Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality
链接:https://arxiv.org/abs/2602.14080
作者:Nitay Calderon,Eyal Ben-David,Zorik Gekhman,Eran Ofek,Gal Yona
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Standard factuality evaluations, Standard factuality, empty shelves, lost keys, factuality evaluations
备注:
点击查看摘要
Abstract:Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inference-time computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95--98% of facts. However, recall remains a major bottleneck: many errors previously attributed to missing knowledge instead stem from failures to access it. These failures are systematic and disproportionately affect long-tail facts and reverse questions. Finally, we show that thinking improves recall and can recover a substantial fraction of failures, indicating that future gains may rely less on scaling and more on methods that improve how models utilize what they already encode.
71. 【2602.14077】GTS: Inference-Time Scaling of Latent Reasoning with a Learnable Gaussian Thought Sampler
链接:https://arxiv.org/abs/2602.14077
作者:Minghan Wang,Ye Bai,Thuy-Trang Vu,Ehsan Shareghi,Gholamreza Haffari
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:models typically introduces, fixed Gaussian noise, typically introduces stochasticity, typically introduces, dropout or fixed
备注:
点击查看摘要
Abstract:Inference-time scaling (ITS) in latent reasoning models typically introduces stochasticity through heuristic perturbations, such as dropout or fixed Gaussian noise. While these methods increase trajectory diversity, their exploration behavior is not explicitly modeled and can be inefficient under finite sampling budgets. We observe that stronger perturbations do not necessarily translate into more effective candidate trajectories, as unguided noise may disrupt internal decision structure rather than steer it. To provide a more structured alternative, we model latent thought exploration as conditional sampling from learnable densities and instantiate this idea as a Gaussian Thought Sampler (GTS). GTS predicts context-dependent perturbation distributions over continuous reasoning states and is trained with GRPO-style policy optimization while keeping the backbone frozen. Experiments on GSM8K with two latent reasoning architectures show that GTS achieves more reliable inference-time scaling than heuristic baselines. These findings indicate that improving latent ITS requires structured and optimizable exploration mechanisms rather than simply amplifying stochasticity.
72. 【2602.14073】Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework
链接:https://arxiv.org/abs/2602.14073
作者:Grzegorz Statkiewicz,Alicja Dobrzeniecka,Karolina Seweryn,Aleksandra Krasnodębska,Karolina Piosek,Katarzyna Bogusz,Sebastian Cygert,Wojciech Kusa
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:trained on English-centric, limiting their performance, English-centric data, cultural contexts, English-centric
备注:
点击查看摘要
Abstract:Most vision-language models (VLMs) are trained on English-centric data, limiting their performance in other languages and cultural contexts. This restricts their usability for non-English-speaking users and hinders the development of multimodal systems that reflect diverse linguistic and cultural realities. In this work, we reproduce and adapt the LLaVA-Next methodology to create a set of Polish VLMs. We rely on a fully automated pipeline for translating and filtering existing multimodal datasets, and complement this with synthetic Polish data for OCR and culturally specific tasks. Despite relying almost entirely on automatic translation and minimal manual intervention to the training data, our approach yields strong results: we observe a +9.5% improvement over LLaVA-1.6-Vicuna-13B on a Polish-adapted MMBench, along with higher-quality captions in generative evaluations, as measured by human annotators in terms of linguistic correctness. These findings highlight that large-scale automated translation, combined with lightweight filtering, can effectively bootstrap high-quality multimodal models for low-resource languages. Some challenges remain, particularly in cultural coverage and evaluation. To facilitate further research, we make our models and evaluation dataset publicly available.
73. 【2602.14069】Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric
链接:https://arxiv.org/abs/2602.14069
作者:Ruipeng Jia,Yunyi Yang,Yuxin Wu,Yongbo Gai,Siyuan Tao,Mengyu Zhou,Jianhe Lin,Xiaoxi Jiang,Guanjun Jiang
类目:Computation and Language (cs.CL)
关键词:Scalar reward models, single opaque score, models compress multi-dimensional, compress multi-dimensional human, reward models compress
备注:
点击查看摘要
Abstract:Scalar reward models compress multi-dimensional human preferences into a single opaque score, creating an information bottleneck that often leads to brittleness and reward hacking in open-ended alignment. We argue that robust alignment for non-verifiable tasks is fundamentally a principle generalization problem: reward should not be a learned function internalized into a judge, but an explicit reasoning process executed under inspectable principles. To operationalize this view, we present the Open Rubric System (OpenRS), a plug-and-play, rubrics-based LLM-as-a-Judge framework built around Pairwise Adaptive Meta-Rubrics (PAMR) and lightweight Pointwise Verifiable Rubrics (PVRs), which provide both hard-constraint guardrails and verifiable reward components when ground-truth or programmatic checks are available. OpenRS uses an explicit meta-rubric -- a constitution-like specification that governs how rubrics are instantiated, weighted, and enforced -- and instantiates adaptive rubrics on the fly by conditioning on the semantic differences between two candidate responses. It then performs criterion-wise pairwise comparisons and aggregates criterion-level preferences externally, avoiding pointwise weighted scalarization while improving discriminability in open-ended settings. To keep principles consistent yet editable across various domains, we introduce a two-level meta-rubric refinement pipeline (automated evolutionary refinement for general principles and a reproducible human-in-the-loop procedure for domain principles), complemented with pointwise verifiable rubrics that act as both guardrails against degenerate behaviors and a source of verifiable reward for objective sub-tasks. Finally, we instantiate OpenRS as reward supervision in pairwise RL training.
74. 【2602.14062】From Scarcity to Scale: A Release-Level Analysis of the Pashto Common Voice Dataset
链接:https://arxiv.org/abs/2602.14062
作者:Jandad Jahani,Mursal Dawodi,Jawid Ahmad Baktash
类目:Computation and Language (cs.CL); Sound (cs.SD)
关键词:automatic speech recognition, openly licensed speech, building automatic speech, languages remain underrepresented, widely spoken languages
备注:
点击查看摘要
Abstract:Large, openly licensed speech datasets are essential for building automatic speech recognition (ASR) systems, yet many widely spoken languages remain underrepresented in public resources. Pashto, spoken by more than 60 million people, has historically lacked large-scale openly licensed speech data suitable for modern ASR development. This paper presents a release-level analysis of the Pashto component of the Mozilla Common Voice corpus, focusing on version 24.0 (December 2025) and contextualizing trends across major releases. We document rapid growth from 1.49 recorded hours in mid-2023 to 2,768.7 total hours in 2025, including 975.89 validated hours available for supervised ASR training. Beyond scale, we analyze validation throughput, contributor participation inequality, demographic metadata completeness, and sentence-level concentration in the validated subset. We find that participation is extremely concentrated (Gini = 0.941), age representation is strongly skewed toward young adults, and 41.97\% of clips lack self-reported gender labels, limiting subgroup auditing based on metadata. At the textual level, prompt reuse is moderate: 35.88\% of unique sentences account for 50\% of validated clips, suggesting that structural concentration is driven primarily by uneven contributor activity rather than dominance of a small prompt set. These results provide a quantitative audit of a rapidly scaling low-resource speech corpus and highlight practical priorities for improving dataset maturity, including expanded validation capacity and broader demographic participation.
Subjects:
Computation and Language (cs.CL); Sound (cs.SD)
Cite as:
arXiv:2602.14062 [cs.CL]
(or
arXiv:2602.14062v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.14062
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Jawid Ahmad Baktash [view email] [v1]
Sun, 15 Feb 2026 09:22:48 UTC (175 KB)
75. 【2602.14060】LM-Lexicon: Improving Definition Modeling via Harmonizing Semantic Experts
链接:https://arxiv.org/abs/2602.14060
作者:Yang Liu,Jiaye Yang,Weikang Li,Jiahui Liang,Yang Li,Lingyong Yan
类目:Computation and Language (cs.CL)
关键词:incorporates data clustering, innovative definition modeling, definition modeling approach, semantic expert learning, approach that incorporates
备注: EACL 2026 (Oral), 22 pages, 12 figures, 12 tables
点击查看摘要
Abstract:We introduce LM-Lexicon, an innovative definition modeling approach that incorporates data clustering, semantic expert learning, and model merging using a sparse mixture-of-experts architecture. By decomposing the definition modeling task into specialized semantic domains, where small language models are trained as domain experts, LM-Lexicon achieves substantial improvements (+7% BLEU score compared with the prior state-of-the-art model) over existing methods on five widely used benchmarks. Empirically, we demonstrate that 1) the clustering strategy enables fine-grained expert specialization with nearly 10% improvement in definition quality; 2) the semantic-aware domain-level routing mechanism achieves higher expert efficacy (+1%) than conventional token-level routing; and 3) further performance gains can be obtained through test-time compute and semantic expert scaling. Our work advances definition modeling while providing insights into the development of efficient language models for semantic-intensive applications.
76. 【2602.14054】LogitsCoder: Towards Efficient Chain-of-Thought Path Search via Logits Preference Decoding for Code Generation
链接:https://arxiv.org/abs/2602.14054
作者:Jizheng Chen,Weiming Zhang,Xinyi Dai,Weiwen Liu,Kounianhua Du,Yasheng Wang,Ruiming Tang,Yong Yu,Weinan Zhang
类目:Computation and Language (cs.CL)
关键词:Test Time Scaling, Code generation remains, Existing Test Time, Code generation, remains a challenging
备注:
点击查看摘要
Abstract:Code generation remains a challenging task that requires precise and structured reasoning. Existing Test Time Scaling (TTS) methods, including structured tree search, have made progress in exploring reasoning paths but still face two major challenges: (1) underthinking, where reasoning chains tend to be shallow and fail to capture the full complexity of problems; and (2) overthinking, where overly verbose reasoning leads to inefficiency and increased computational costs. To address these issues, we propose LogitsCoder, a novel framework that enhances chain-of-thought reasoning through lightweight, logit-level control mechanisms for code generation. LogitsCoder iteratively generates and refines reasoning steps by first steering token selection toward statistically preferred patterns via Logits Preference Decoding, then selecting and aggregating diverse reasoning paths using Logits Rank Based Path Selection and Thoughts Aggregation. This results in coherent and effective reasoning chains that balance depth and efficiency. Extensive experiments demonstrate that LogitsCoder produces more efficient and higher-quality reasoning chains, leading to superior code generation performance compared to baseline methods.
77. 【2602.14044】Context Shapes LLMs Retrieval-Augmented Fact-Checking Effectiveness
链接:https://arxiv.org/abs/2602.14044
作者:Pietro Bernardelle,Stefano Civelli,Kevin Roitero,Gianluca Demartini
类目:Computation and Language (cs.CL)
关键词:Large language models, show strong reasoning, strong reasoning abilities, contexts remains inconsistent, Large language
备注:
点击查看摘要
Abstract:Large language models (LLMs) show strong reasoning abilities across diverse tasks, yet their performance on extended contexts remains inconsistent. While prior research has emphasized mid-context degradation in question answering, this study examines the impact of context in LLM-based fact verification. Using three datasets (HOVER, FEVEROUS, and ClimateFEVER) and five open-source models accross different parameters sizes (7B, 32B and 70B parameters) and model families (Llama-3.1, Qwen2.5 and Qwen3), we evaluate both parametric factual knowledge and the impact of evidence placement across varying context lengths. We find that LLMs exhibit non-trivial parametric knowledge of factual claims and that their verification accuracy generally declines as context length increases. Similarly to what has been shown in previous works, in-context evidence placement plays a critical role with accuracy being consistently higher when relevant evidence appears near the beginning or end of the prompt and lower when placed mid-context. These results underscore the importance of prompt structure in retrieval-augmented fact-checking systems.
78. 【2602.14039】Geometry-Preserving Aggregation for Mixture-of-Experts Embedding Models
链接:https://arxiv.org/abs/2602.14039
作者:Sajjad Kachuee,Mohammad Sharifkhani
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:weighted linear summation, models combine expert, combine expert outputs, implicitly assuming, embedding models combine
备注:
点击查看摘要
Abstract:Mixture-of-Experts (MoE) embedding models combine expert outputs using weighted linear summation, implicitly assuming a linear subspace structure in the embedding space. This assumption is shown to be inconsistent with the geometry of expert representations. Geometric analysis of a modern MoE embedding model reveals that expert outputs lie on a shared hyperspherical manifold characterized by tightly concentrated norms and substantial angular separation. Under this geometry, linear aggregation induces inward collapse toward the manifold interior, distorting vector magnitude and direction and reducing embedding comparability. To address this inconsistency, Spherical Barycentric Aggregation (SBA) is introduced as a geometry-preserving aggregation operator that separates radial and angular components to maintain hyperspherical structure while remaining fully compatible with existing routing mechanisms. Experiments on selected tasks from the Massive Text Embedding Benchmark (MTEB), including semantic similarity, clustering, and duplicate question detection, demonstrate consistent performance improvements with identical training cost and full stability. Additional geometric analyses confirm that SBA prevents aggregation-induced collapse and preserves hyperspherical consistency, highlighting the importance of geometry-aware aggregation in MoE embedding architectures.
79. 【2602.14028】GRRM: Group Relative Reward Modeling for Machine Translation
链接:https://arxiv.org/abs/2602.14028
作者:Sen Yang,Shanbo Cheng,Lu Xu,Jianbing Zhang,Shujian Huang
类目:Computation and Language (cs.CL)
关键词:Relative Policy Optimization, Machine Translation hinges, LLM post-training, domains like Machine, Scalar Quality Metrics
备注: 19 pages, 6 figures
点击查看摘要
Abstract:While Group Relative Policy Optimization (GRPO) offers a powerful framework for LLM post-training, its effectiveness in open-ended domains like Machine Translation hinges on accurate intra-group ranking. We identify that standard Scalar Quality Metrics (SQM) fall short in this context; by evaluating candidates in isolation, they lack the comparative context necessary to distinguish fine-grained linguistic nuances. To address this, we introduce the Group Quality Metric (GQM) paradigm and instantiate it via the Group Relative Reward Model (GRRM). Unlike traditional independent scorers, GRRM processes the entire candidate group jointly, leveraging comparative analysis to rigorously resolve relative quality and adaptive granularity. Empirical evaluations confirm that GRRM achieves competitive ranking accuracy among all baselines. Building on this foundation, we integrate GRRM into the GRPO training loop to optimize the translation policy. Experimental results demonstrate that our framework not only improves general translation quality but also unlocks reasoning capabilities comparable to state-of-the-art reasoning models. We release codes, datasets, and model checkpoints at this https URL.
80. 【2602.14009】Named Entity Recognition for Payment Data Using NLP
链接:https://arxiv.org/abs/2602.14009
作者:Srikumar Nayak
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Named Entity Recognition, extracting structured information, Conditional Random Fields, Named Entity, Entity Recognition
备注: 14 pages, 8 figures, research paper
点击查看摘要
Abstract:Named Entity Recognition (NER) has emerged as a critical component in automating financial transaction processing, particularly in extracting structured information from unstructured payment data. This paper presents a comprehensive analysis of state-of-the-art NER algorithms specifically designed for payment data extraction, including Conditional Random Fields (CRF), Bidirectional Long Short-Term Memory with CRF (BiLSTM-CRF), and transformer-based models such as BERT and FinBERT. We conduct extensive experiments on a dataset of 50,000 annotated payment transactions across multiple payment formats including SWIFT MT103, ISO 20022, and domestic payment systems. Our experimental results demonstrate that fine-tuned BERT models achieve an F1-score of 94.2% for entity extraction, outperforming traditional CRF-based approaches by 12.8 percentage points. Furthermore, we introduce PaymentBERT, a novel hybrid architecture combining domain-specific financial embeddings with contextual representations, achieving state-of-the-art performance with 95.7% F1-score while maintaining real-time processing capabilities. We provide detailed analysis of cross-format generalization, ablation studies, and deployment considerations. This research provides practical insights for financial institutions implementing automated sanctions screening, anti-money laundering (AML) compliance, and payment processing systems.
81. 【2602.14002】he Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective
链接:https://arxiv.org/abs/2602.14002
作者:Ali Zahedzadeh,Behnam Bahrak
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, step question answering, Models increasingly rely, multi step question, Language Models increasingly
备注: LREC 2026 submission; focuses on LLM self-explanation, interpretability, and information bottleneck analysis
点击查看摘要
Abstract:Large Language Models increasingly rely on self-explanations, such as chain of thought reasoning, to improve performance on multi step question answering. While these explanations enhance accuracy, they are often verbose and costly to generate, raising the question of how much explanation is truly necessary. In this paper, we examine the trade-off between sufficiency, defined as the ability of an explanation to justify the correct answer, and conciseness, defined as the reduction in explanation length. Building on the information bottleneck principle, we conceptualize explanations as compressed representations that retain only the information essential for producing correct this http URL operationalize this view, we introduce an evaluation pipeline that constrains explanation length and assesses sufficiency using multiple language models on the ARC Challenge dataset. To broaden the scope, we conduct experiments in both English, using the original dataset, and Persian, as a resource-limited language through translation. Our experiments show that more concise explanations often remain sufficient, preserving accuracy while substantially reducing explanation length, whereas excessive compression leads to performance degradation.
82. 【2602.13979】Chain-of-Thought Reasoning with Large Language Models for Clinical Alzheimer's Disease Assessment and Diagnosis
链接:https://arxiv.org/abs/2602.13979
作者:Tongze Zhang,Jun-En Ding,Melik Ozolcer,Fang-Ming Hung,Albert Chih-Chieh Yang,Feng Liu,Yi-Rou Ji,Sang Won Bae
类目:Computation and Language (cs.CL)
关键词:neurodegenerative disease worldwide, prevalent neurodegenerative disease, Alzheimer disease, Alzheimer disease assessment, prevalent neurodegenerative
备注:
点击查看摘要
Abstract:Alzheimer's disease (AD) has become a prevalent neurodegenerative disease worldwide. Traditional diagnosis still relies heavily on medical imaging and clinical assessment by physicians, which is often time-consuming and resource-intensive in terms of both human expertise and healthcare resources. In recent years, large language models (LLMs) have been increasingly applied to the medical field using electronic health records (EHRs), yet their application in Alzheimer's disease assessment remains limited, particularly given that AD involves complex multifactorial etiologies that are difficult to observe directly through imaging modalities. In this work, we propose leveraging LLMs to perform Chain-of-Thought (CoT) reasoning on patients' clinical EHRs. Unlike direct fine-tuning of LLMs on EHR data for AD classification, our approach utilizes LLM-generated CoT reasoning paths to provide the model with explicit diagnostic rationale for AD assessment, followed by structured CoT-based predictions. This pipeline not only enhances the model's ability to diagnose intrinsically complex factors but also improves the interpretability of the prediction process across different stages of AD progression. Experimental results demonstrate that the proposed CoT-based diagnostic framework significantly enhances stability and diagnostic performance across multiple CDR grading tasks, achieving up to a 15% improvement in F1 score compared to the zero-shot baseline method.
83. 【2602.13967】Neuromem: A Granular Decomposition of the Streaming Lifecycle in External Memory for LLMs
链接:https://arxiv.org/abs/2602.13967
作者:Ruicheng Zhang,Xinyi Li,Tianyi Xu,Shuhao Zhang,Xiaofei Liao,Hai Jin
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Memory Module assume, External Memory Module, External Memory Modules, External Memory, static setting
备注: 22 pages, 8 figures, 15 tables. Preprint
点击查看摘要
Abstract:Most evaluations of External Memory Module assume a static setting: memory is built offline and queried at a fixed state. In practice, memory is streaming: new facts arrive continuously, insertions interleave with retrievals, and the memory state evolves while the model is serving queries. In this regime, accuracy and cost are governed by the full memory lifecycle, which encompasses the ingestion, maintenance, retrieval, and integration of information into generation. We present Neuromem, a scalable testbed that benchmarks External Memory Modules under an interleaved insertion-and-retrieval protocol and decomposes its lifecycle into five dimensions including memory data structure, normalization strategy, consolidation policy, query formulation strategy, and context integration mechanism. Using three representative datasets LOCOMO, LONGMEMEVAL, and MEMORYAGENTBENCH, Neuromem evaluates interchangeable variants within a shared serving stack, reporting token-level F1 and insertion/retrieval latency. Overall, we observe that performance typically degrades as memory grows across rounds, and time-related queries remain the most challenging category. The memory data structure largely determines the attainable quality frontier, while aggressive compression and generative integration mechanisms mostly shift cost between insertion and retrieval with limited accuracy gain.
84. 【2602.13964】HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
链接:https://arxiv.org/abs/2602.13964
作者:Weiqi Zhai,Zhihai Wang,Jinghang Wang,Boyu Yang,Xiaogang Li,Xiang Xu,Bohan Wang,Peng Wang,Xingzhe Wu,Anfeng Li,Qiyuan Feng,Yuhao Zhou,Shoulin Han,Wenjie Luo,Yiyuan Li,Yaxuan Wang,Ruixian Luo,Guojie Lin,Peiyao Xiao,Chengliang Xu,Ben Wang,Zeyu Wang,Zichao Chen,Jianan Ye,Yijie Hu,Jialong Chen,Zongwen Shen,Yuliang Xu,An Yang,Bowen Yu,Dayiheng Liu,Junyang Lin,Hu Wei,Que Shen,Bing Zhao
类目:Computation and Language (cs.CL)
关键词:Humanity Last Exam, evaluating frontier large, frontier large language, multi-domain questions, evaluating frontier
备注: 14 pages, 10 figures
点击查看摘要
Abstract:Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons. To address this challenge, we introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and fine-grained error taxonomy. Our construction follows a two-stage validation-and-repair workflow resulting in a certified benchmark. In Stage I, each item undergoes binary validation of the problem and final answer through domain-expert review and model-based cross-checks, yielding 641 verified items. In Stage II, flawed but fixable items are revised under strict constraints preserving the original evaluation intent, through dual independent expert repairs, model-assisted auditing, and final adjudication, resulting in 1,170 revised-and-certified items. The remaining 689 items are released as a documented uncertain set with explicit uncertainty sources and expertise tags for future refinement. We evaluate seven state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7--10 percentage points on HLE-Verified. The improvement is particularly pronounced on items where the original problem statement and/or reference answer is erroneous, with gains of 30--40 percentage points. Our analyses further reveal a strong association between model confidence and the presence of errors in the problem statement or reference answer, supporting the effectiveness of our revisions. Overall, HLE-Verified improves HLE-style evaluations by reducing annotation noise and enabling more faithful measurement of model capabilities. Data is available at: this https URL
85. 【2602.13961】MarsRetrieval: Benchmarking Vision-Language Models for Planetary-Scale Geospatial Retrieval on Mars
链接:https://arxiv.org/abs/2602.13961
作者:Shuoyuan Wang,Yiran Wang,Hongxin Wei
类目:Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computation and Language (cs.CL)
关键词:Mars exploration, Data-driven approaches, advancing planetary science, rapidly advancing planetary, approaches like deep
备注:
点击查看摘要
Abstract:Data-driven approaches like deep learning are rapidly advancing planetary science, particularly in Mars exploration. Despite recent progress, most existing benchmarks remain confined to closed-set supervised visual tasks and do not support text-guided retrieval for geospatial discovery. We introduce MarsRetrieval, a retrieval benchmark for evaluating vision-language models for Martian geospatial discovery. MarsRetrieval includes three tasks: (1) paired image-text retrieval, (2) landform retrieval, and (3) global geo-localization, covering multiple spatial scales and diverse geomorphic origins. We propose a unified retrieval-centric protocol to benchmark multimodal embedding architectures, including contrastive dual-tower encoders and generative vision-language models. Our evaluation shows MarsRetrieval is challenging: even strong foundation models often fail to capture domain-specific geomorphic distinctions. We further show that domain-specific fine-tuning is critical for generalizable geospatial discovery in planetary settings. Our code is available at this https URL
86. 【2602.13934】Why Code, Why Now: Learnability, Computability, and the Real Limits of Machine Learning
链接:https://arxiv.org/abs/2602.13934
作者:Zhimin Zhao
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:reinforcement learning, generation has progressed, progressed more reliably, Code generation, reinforcement learning problems
备注:
点击查看摘要
Abstract:Code generation has progressed more reliably than reinforcement learning, largely because code has an information structure that makes it learnable. Code provides dense, local, verifiable feedback at every token, whereas most reinforcement learning problems do not. This difference in feedback quality is not binary but graded. We propose a five-level hierarchy of learnability based on information structure and argue that the ceiling on ML progress depends less on model size than on whether a task is learnable at all. The hierarchy rests on a formal distinction among three properties of computational problems (expressibility, computability, and learnability). We establish their pairwise relationships, including where implications hold and where they fail, and present a unified template that makes the structural differences explicit. The analysis suggests why supervised learning on code scales predictably while reinforcement learning does not, and why the common assumption that scaling alone will solve remaining ML challenges warrants scrutiny.
87. 【2602.13912】From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design
链接:https://arxiv.org/abs/2602.13912
作者:Sha Li,Stefano Petrangeli,Yu Shen,Xiang Chen
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR)
关键词:large language models, equips large language, reinforcement learning framework, content-aware graphic layout, design decision making
备注:
点击查看摘要
Abstract:We introduce LaySPA, a reinforcement learning framework that equips large language models (LLMs) with explicit and interpretable spatial reasoning for content-aware graphic layout design. LaySPA addresses two key challenges: LLMs' limited spatial reasoning and the lack of opacity in design decision making. Instead of operating at the pixel level, we reformulate layout design as a policy learning problem over a structured textual spatial environment that explicitly encodes canvas geometry, element attributes, and inter-element relationships. LaySPA produces dual-level outputs comprising interpretable reasoning traces and structured layout specifications, enabling transparent and controllable design decision making. Layout design policy is optimized via a multi-objective spatial critique that decomposes layout quality into geometric validity, relational coherence, and aesthetic consistency, and is trained using relative group optimization to stabilize learning in open-ended design spaces. Experiments demonstrate that LaySPA improves structural validity and visual quality, outperforming larger proprietary LLMs and achieving performance comparable to specialized SOTA layout generators while requiring fewer annotated samples and reduced latency.
88. 【2602.13905】Pre-Editorial Normalization for Automatically Transcribed Medieval Manuscripts in Old French and Latin
链接:https://arxiv.org/abs/2602.13905
作者:Thibault Clérice,Rachel Bawden,Anthony Glaise,Ariane Pinche,David Smith
类目:Computation and Language (cs.CL)
关键词:Automatic Text Recognition, Text Recognition, Automatic Text, methodological divide persists, Recent advances
备注:
点击查看摘要
Abstract:Recent advances in Automatic Text Recognition (ATR) have improved access to historical archives, yet a methodological divide persists between palaeographic transcriptions and normalized digital editions. While ATR models trained on more palaeographically-oriented datasets such as CATMuS have shown greater generalizability, their raw outputs remain poorly compatible with most readers and downstream NLP tools, thus creating a usability gap. On the other hand, ATR models trained to produce normalized outputs have been shown to struggle to adapt to new domains and tend to over-normalize and hallucinate. We introduce the task of Pre-Editorial Normalization (PEN), which consists in normalizing graphemic ATR output according to editorial conventions, which has the advantage of keeping an intermediate step with palaeographic fidelity while providing a normalized version for practical usability. We present a new dataset derived from the CoMMA corpus and aligned with digitized Old French and Latin editions using passim. We also produce a manually corrected gold-standard evaluation set. We benchmark this resource using ByT5-based sequence-to-sequence models on normalization and pre-annotation tasks. Our contributions include the formal definition of PEN, a 4.66M-sample silver training corpus, a 1.8k-sample gold evaluation set, and a normalization model achieving a 6.7% CER, substantially outperforming previous models for this task.
89. 【2602.13890】Evaluating Prompt Engineering Techniques for RAG in Small Language Models: A Multi-Hop QA Approach
链接:https://arxiv.org/abs/2602.13890
作者:Amir Hossein Mohammadi,Ali Moeinian,Zahra Razavizade,Afsaneh Fatemi,Reza Ramezani
类目:Computation and Language (cs.CL)
关键词:Retrieval Augmented Generation, Retrieval Augmented, Augmented Generation, integrating external knowledge, Small Language Models
备注: 32 Pages, Submitted to Journal of Computing and Security
点击查看摘要
Abstract:Retrieval Augmented Generation (RAG) is a powerful approach for enhancing the factual grounding of language models by integrating external knowledge. While widely studied for large language models, the optimization of RAG for Small Language Models (SLMs) remains a critical research gap, particularly in complex, multi-hop question-answering tasks that require sophisticated reasoning. In these systems, prompt template design is a crucial yet under-explored factor influencing performance. This paper presents a large-scale empirical study to investigate this factor, evaluating 24 different prompt templates on the HotpotQA dataset. The set includes a standard RAG prompt, nine well-formed techniques from the literature, and 14 novel hybrid variants, all tested on two prominent SLMs: Qwen2.5-3B Instruct and Gemma3-4B-It. Our findings, based on a test set of 18720 instances, reveal significant performance gains of up to 83% on Qwen2.5 and 84.5% on Gemma3-4B-It, yielding an improvement of up to 6% for both models compared to the Standard RAG prompt. This research also offers concrete analysis and actionable recommendations for designing effective and efficient prompts for SLM-based RAG systems, practically for deployment in resource-constrained environments.
90. 【2602.13870】ADAB: Arabic Dataset for Automated Politeness Benchmarking -- A Large-Scale Resource for Computational Sociopragmatics
链接:https://arxiv.org/abs/2602.13870
作者:Hend Al-Khalifa,Nadia Ghezaiel,Maria Bounnit,Hend Hamed Alhazmi,Noof Abdullah Alfear,Reem Fahad Alqifari,Ameera Masoud Almasoud,Sharefah Ahmed Al-Ghamdi
类目:Computation and Language (cs.CL)
关键词:capture sociopragmatic phenomena, culturally-aware natural language, natural language processing, language processing systems, Modern Standard Arabic
备注: Paper accepted @ The Fifteenth biennial Language Resources and Evaluation Conference (LREC2026)
点击查看摘要
Abstract:The growing importance of culturally-aware natural language processing systems has led to an increasing demand for resources that capture sociopragmatic phenomena across diverse languages. Nevertheless, Arabic-language resources for politeness detection remain under-explored, despite the rich and complex politeness expressions embedded in Arabic communication. In this paper, we introduce ADAB (Arabic Politeness Dataset), a new annotated Arabic dataset collected from four online platforms, including social media, e-commerce, and customer service domains, covering Modern Standard Arabic and multiple dialects (Gulf, Egyptian, Levantine, and Maghrebi). The dataset was annotated based on Arabic linguistic traditions and pragmatic theory, resulting in three classes: polite, impolite, and neutral. It contains 10,000 samples with linguistic feature annotations across 16 politeness categories and achieves substantial inter-annotator agreement (kappa = 0.703). We benchmark 40 model configurations, including traditional machine learning, transformer-based models, and large language models. The dataset aims to support research on politeness-aware Arabic NLP.
91. 【2602.13867】Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages
链接:https://arxiv.org/abs/2602.13867
作者:Somnath Banerjee,Rima Hazra,Animesh Mukherjee
类目:Computation and Language (cs.CL)
关键词:Large language models, culturally specific norms, involves low-resource languages, Large language, Global South
备注: Accepted to the EGSAI Workshop at AAAI 2026
点击查看摘要
Abstract:Large language models (LLMs) are being deployed across the Global South, where everyday use involves low-resource languages, code-mixing, and culturally specific norms. Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality ''transfer'' across languages. Evidence increasingly shows they do not. We synthesize recent findings indicating that (i) safety guardrails weaken sharply on low-resource and code-mixed inputs, (ii) culturally harmful behavior can persist even when standard toxicity scores look acceptable, and (iii) English-only knowledge edits and safety patches often fail to carry over to low-resource languages. In response, we outline a practical agenda for researchers and students in the Global South: parameter-efficient safety steering, culturally grounded evaluation and preference data, and participatory workflows that empower local communities to define and mitigate harm. Our aim is to make multilingual safety a core requirement-not an add-on-for equitable AI in underrepresented regions.
92. 【2602.13860】utoring Large Language Models to be Domain-adaptive, Precise, and Safe
链接:https://arxiv.org/abs/2602.13860
作者:Somnath Banerjee
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Responsible Intelligence framework, immense generative power, Intelligence framework designed, Large Language
备注: Accepted to the PhD Symposium at Web Conference 2026
点击查看摘要
Abstract:The overarching research direction of this work is the development of a ''Responsible Intelligence'' framework designed to reconcile the immense generative power of Large Language Models (LLMs) with the stringent requirements of real-world deployment. As these models become a transformative force in artificial intelligence, there is an urgent need to move beyond general-purpose architectures toward systems that are contextually aware, inherently safer, and deeply respectful of global cultural nuances. This research navigates three interconnected threads: domain adaptation to ensure technical precision, ethical rigor to mitigate adversarial vulnerabilities, and cultural/multilingual alignment to promote global inclusivity. The methodological trajectory moves from classical supervised adaptation for task-specific demands to decoding-time alignment for safety, finally leveraging human feedback and preference modeling to achieve sociolinguistic acuity.
93. 【2602.13840】PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training
链接:https://arxiv.org/abs/2602.13840
作者:Yuhan Cheng,Hancheng Ye,Hai Helen Li,Jingwei Sun,Yiran Chen
类目:Computation and Language (cs.CL)
关键词:Large language model, tasks involving sensitive, personalized tasks involving, Large language, agents' action due
备注:
点击查看摘要
Abstract:Large language model (LLM) agents are increasingly deployed in personalized tasks involving sensitive, context-dependent information, where privacy violations may arise in agents' action due to the implicitness of contextual privacy. Existing approaches rely on external, inference-time interventions which are brittle, scenario-specific, and may expand the privacy attack surface. We propose PrivAct, a contextual privacy-aware multi-agent learning framework that internalizes contextual privacy preservation directly into models' generation behavior for privacy-compliant agentic actions. By embedding privacy preferences into each agent, PrivAct enhances system-wide contextual integrity while achieving a more favorable privacy-helpfulness tradeoff. Experiments across multiple LLM backbones and benchmarks demonstrate consistent improvements in contextual privacy preservation, reducing leakage rates by up to 12.32% while maintaining comparable helpfulness, as well as zero-shot generalization and robustness across diverse multi-agent topologies. Code is available at this https URL.
94. 【2602.13836】Speculative Decoding with a Speculative Vocabulary
链接:https://arxiv.org/abs/2602.13836
作者:Miles Williams,Young D. Kwon,Rui Li,Alexandros Kouris,Stylianos I. Venieris
类目:Computation and Language (cs.CL)
关键词:offers substantial speedups, accelerating language model, yielding identical outputs, Speculative decoding, rapidly emerged
备注: Under review
点击查看摘要
Abstract:Speculative decoding has rapidly emerged as a leading approach for accelerating language model (LM) inference, as it offers substantial speedups while yielding identical outputs. This relies upon a small draft model, tasked with predicting the outputs of the target model. State-of-the-art speculative decoding methods use a draft model consisting of a single decoder layer and output embedding matrix, with the latter dominating drafting time for the latest LMs. Recent work has sought to address this output distribution bottleneck by reducing the vocabulary of the draft model. Although this can improve throughput, it compromises speculation effectiveness when the target token is out-of-vocabulary. In this paper, we argue for vocabulary speculation as an alternative to a reduced vocabulary. We propose SpecVocab, an efficient and effective method that selects a vocabulary subset per decoding step. Across a variety of tasks, we demonstrate that SpecVocab can achieve a higher acceptance length than state-of-the-art speculative decoding approach, EAGLE-3. Notably, this yields up to an 8.1% increase in average throughput over EAGLE-3.
95. 【2602.13832】Beyond Words: Evaluating and Bridging Epistemic Divergence in User-Agent Interaction via Theory of Mind
链接:https://arxiv.org/abs/2602.13832
作者:Minyuan Ruan,Ziyue Wang,Kaiming Liu,Yunghwei Lai,Peng Li,Yang Liu
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, assist human users, Language Models, developed rapidly
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have developed rapidly and are widely applied to both general-purpose and professional tasks to assist human users. However, they still struggle to comprehend and respond to the true user needs when intentions and instructions are imprecisely conveyed, leading to a divergence between subjective user believes and true environment states. Resolving this epistemic divergence requires Theory of Mind (ToM), yet existing ToM evaluations for LLMs primarily focus on isolated belief inference, overlooking its functional utility in real-world interaction. To this end, we formalize ToM for LLMs as a mechanism for epistemic divergence detection and resolution, and propose a benchmark, \benchname, to assess how models reconcile user beliefs and profiles in practice. Results across 11 leading models reveal a significant limitation to identify underlying cognitive gaps that impede task success. To bridge this gap, we further curate a trajectory-based ToM dataset linking belief tracking with task-related state inference. The model trained on this data via reinforcement learning shows consistent improvement in reasoning about user mental states, leading to enhanced downstream performance. Our work highlights the practical value of ToM as an essential interaction-level mechanism rather than as a standalone reasoning skill.
96. 【2602.13816】he acquisition of English irregular inflections by Yemeni L1 Arabic learners: A Universal Grammar approach
链接:https://arxiv.org/abs/2602.13816
作者:Muneef Y. Alsawsh,Mohammed Q. Shormani
类目:Computation and Language (cs.CL)
关键词:Universal Grammar, utilizing a Universal, English irregular inflections, English, English irregular
备注: 19 pages, 3 Tables
点击查看摘要
Abstract:This study examines the acquisition of English irregular inflections by Yemeni learners of English as a second language (L2), utilizing a Universal Grammar (UG) approach. Within the UG approach, the study considers Feature Reassembly Hypothesis (FRH) (Lardiere, 2008, 2009) part of UG, focusing on the roles of first language (L1) transfer and L2 developmental influence. It analyzes learner errors across two developmental stages. Stage 1 data reveal a dominant influence of L1 transfer, particularly in phonological and structural mismatches, while stage 2 data demonstrate increased learner sensitivity to UG properties and morphological reconfiguration toward the target language. Findings reveal that errors in irregular inflectional morphology are attributed to both interlingual and intralingual sources, with overgeneralization of L2 rules as a common developmental strategy. Statistical analysis, including a one-way ANOVA, indicates significant improvement in the production of well-formed irregular inflections from stage 1 to stage 2, underscoring learners' continued access to UG. However, persistent difficulties with consonant change, zero-morpheme, and -a plural inflections suggest that limited exposure, ineffective input modeling, and insufficient instructional quality constrain full UG access. The study concludes that while L1 transfer and L2 developmental factors influence initial stages of acquisition, appropriate linguistic input and instruction are critical for facilitating UG-driven feature reassembly in adult L2 learners.
97. 【2602.13793】OMGs: A multi-agent system supporting MDT decision-making across the ovarian tumour care continuum
链接:https://arxiv.org/abs/2602.13793
作者:Yangyang Zhang,Zilong Wang,Jianbo Xu,Yongqi Chen,Chu Han,Zhihao Zhang,Shuai Liu,Hui Li,Huiping Zhang,Ziqi Liu,Jiaxin Chen,Jun Zhu,Zheng Feng,Hao Wen,Xingzhu Ju,Yanping Zhong,Yunqiu Zhang,Jie Duan,Jun Li,Dongsheng Li,Weijie Wang,Haiyan Zhu,Wei Jiang,Xiaohua Wu,Shuo Wang,Haiming Li,Qinhao Guo
类目:Computation and Language (cs.CL)
关键词:address treatment complexity, Ovarian tumour management, Ovarian tumour Multidisciplinary, Ovarian tumour, multidisciplinary tumour board
备注: 27 pages, 5 figures, 1 table
点击查看摘要
Abstract:Ovarian tumour management has increasingly relied on multidisciplinary tumour board (MDT) deliberation to address treatment complexity and disease heterogeneity. However, most patients worldwide lack access to timely expert consensus, particularly in resource-constrained centres where MDT resources are scarce or unavailable. Here we present OMGs (Ovarian tumour Multidisciplinary intelligent aGent System), a multi-agent AI framework where domain-specific agents deliberate collaboratively to integrate multidisciplinary evidence and generate MDT-style recommendations with transparent rationales. To systematically evaluate MDT recommendation quality, we developed SPEAR (Safety, Personalization, Evidence, Actionability, Robustness) and validated OMGs across diverse clinical scenarios spanning the care continuum. In multicentre re-evaluation, OMGs achieved performance comparable to expert MDT consensus ($4.45 \pm 0.30$ versus $4.53 \pm 0.23$), with higher Evidence scores (4.57 versus 3.92). In prospective multicentre evaluation (59 patients), OMGs demonstrated high concordance with routine MDT decisions. Critically, in paired human-AI studies, OMGs most substantially enhanced clinicians' recommendations in Evidence and Robustness, the dimensions most compromised when multidisciplinary expertise is unavailable. These findings suggest that multi-agent deliberative systems can achieve performance comparable to expert MDT consensus, with potential to expand access to specialized oncology expertise in resource-limited settings.
98. 【2602.13792】StackingNet: Collective Inference Across Independent AI Foundation Models
链接:https://arxiv.org/abs/2602.13792
作者:Siyang Li,Chenhao Liu,Dongrui Wu,Zhigang Zeng,Lieyun Ding
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:systems remain isolated, transformed language understanding, vision and reasoning, share their capabilities, built on large
备注:
点击查看摘要
Abstract:Artificial intelligence built on large foundation models has transformed language understanding, vision and reasoning, yet these systems remain isolated and cannot readily share their capabilities. Integrating the complementary strengths of such independent foundation models is essential for building trustworthy intelligent systems. Despite rapid progress in individual model design, there is no established approach for coordinating such black-box heterogeneous models. Here we show that coordination can be achieved through a meta-ensemble framework termed StackingNet, which draws on principles of collective intelligence to combine model predictions during inference. StackingNet improves accuracy, reduces bias, enables reliability ranking, and identifies or prunes models that degrade performance, all operating without access to internal parameters or training data. Across tasks involving language comprehension, visual estimation, and academic paper rating, StackingNet consistently improves accuracy, robustness, and fairness, compared with individual models and classic ensembles. By turning diversity from a source of inconsistency into collaboration, StackingNet establishes a practical foundation for coordinated artificial intelligence, suggesting that progress may emerge from not only larger single models but also principled cooperation among many specialized ones.
99. 【2602.13790】How Do Lexical Senses Correspond Between Spoken German and German Sign Language?
链接:https://arxiv.org/abs/2602.13790
作者:Melis Çelikkol,Wei Zhao
类目:Computation and Language (cs.CL)
关键词:German Sign Language, language lexicographers construct, lexicographers construct bilingual, Sign language lexicographers, Sign language
备注: EACL'26 (Student Research Workshop)
点击查看摘要
Abstract:Sign language lexicographers construct bilingual dictionaries by establishing word-to-sign mappings, where polysemous and homonymous words corresponding to different signs across contexts are often underrepresented. A usage-based approach examining how word senses map to signs can identify such novel mappings absent from current dictionaries, enriching lexicographic resources. We address this by analyzing German and German Sign Language (Deutsche Gebärdensprache, DGS), manually annotating 1,404 word use-to-sign ID mappings derived from 32 words from the German Word Usage Graph (D-WUG) and 49 signs from the Digital Dictionary of German Sign Language (DW-DGS). We identify three correspondence types: Type 1 (one-to-many), Type 2 (many-to-one), and Type 3 (one-to-one), plus No Match cases. We evaluate computational methods: Exact Match (EM) and Semantic Similarity (SS) using SBERT embeddings. SS substantially outperforms EM overall 88.52% vs. 71.31%), with dramatic gains for Type 1 (+52.1 pp). Our work establishes the first annotated dataset for cross-modal sense correspondence and reveals which correspondence patterns are computationally identifiable. Our code and dataset are made publicly available.
100. 【2602.13748】RMPL: Relation-aware Multi-task Progressive Learning with Stage-wise Training for Multimedia Event Extraction
链接:https://arxiv.org/abs/2602.13748
作者:Yongkang Jin,Jianwen Luo,Jingjing Wang,Jianmin Yao,Yu Hong
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:aims to identify, text and images, MEE, Event, Relation-aware Multi-task Progressive
备注:
点击查看摘要
Abstract:Multimedia Event Extraction (MEE) aims to identify events and their arguments from documents that contain both text and images. It requires grounding event semantics across different modalities. Progress in MEE is limited by the lack of annotated training data. M2E2 is the only established benchmark, but it provides annotations only for evaluation. This makes direct supervised training impractical. Existing methods mainly rely on cross-modal alignment or inference-time prompting with Vision--Language Models (VLMs). These approaches do not explicitly learn structured event representations and often produce weak argument grounding in multimodal settings. To address these limitations, we propose RMPL, a Relation-aware Multi-task Progressive Learning framework for MEE under low-resource conditions. RMPL incorporates heterogeneous supervision from unimodal event extraction and multimedia relation extraction with stage-wise training. The model is first trained with a unified schema to learn shared event-centric representations across modalities. It is then fine-tuned for event mention identification and argument role extraction using mixed textual and visual data. Experiments on the M2E2 benchmark with multiple VLMs show consistent improvements across different modality settings.
101. 【2602.13713】On Theoretically-Driven LLM Agents for Multi-Dimensional Discourse Analysis
链接:https://arxiv.org/abs/2602.13713
作者:Maciej Uberna,Michał Wawer,Jarosław A. Chudziak,Marcin Koszowy
类目:Computation and Language (cs.CL)
关键词:remains a key, key challenge, discourse remains, detect surface-level similarity, Abstract
备注: 8 pages, 4 figures, 3 tables. This is the accepted version of the paper presented at the 18th International Conference on Agents and Artificial Intelligence (ICAART 2026), Marbella, Spain
点击查看摘要
Abstract:Identifying the strategic uses of reformulation in discourse remains a key challenge for computational argumentation. While LLMs can detect surface-level similarity, they often fail to capture the pragmatic functions of rephrasing, such as its role within rhetorical discourse. This paper presents a comparative multi-agent framework designed to quantify the benefits of incorporating explicit theoretical knowledge for this task. We utilise an dataset of annotated political debates to establish a new standard encompassing four distinct rephrase functions: Deintensification, Intensification, Specification, Generalisation, and Other, which covers all remaining types (D-I-S-G-O). We then evaluate two parallel LLM-based agent systems: one enhanced by argumentation theory via Retrieval-Augmented Generation (RAG), and an identical zero-shot baseline. The results reveal a clear performance gap: the RAG-enhanced agents substantially outperform the baseline across the board, with particularly strong advantages in detecting Intensification and Generalisation context, yielding an overall Macro F1-score improvement of nearly 30\%. Our findings provide evidence that theoretical grounding is not only beneficial but essential for advancing beyond mere paraphrase detection towards function-aware analysis of argumentative discourse. This comparative multi-agent architecture represents a step towards scalable, theoretically informed computational tools capable of identifying rhetorical strategies in contemporary discourse.
102. 【2602.13701】Metaphors' journeys across time and genre: tracking the evolution of literary metaphors with temporal embeddings
链接:https://arxiv.org/abs/2602.13701
作者:Veronica Mangiaterra,Chiara Barattieri di San Pietro,Paolo Canal,Valentina Bambini
类目:Computation and Language (cs.CL)
关键词:metaphor processing demands, remain less studied, studied experimentally, experimentally than everyday, processing demands
备注:
点击查看摘要
Abstract:Metaphors are a distinctive feature of literary language, yet they remain less studied experimentally than everyday metaphors. Moreover, previous psycholinguistic and computational approaches overlooked the temporal dimension, although many literary metaphors were coined centuries apart from contemporary readers. This study innovatively applies tools from diachronic distributional semantics to assess whether the processing costs of literary metaphors varied over time and genre. Specifically, we trained word embeddings on literary and nonliterary Italian corpora from the 19th and 21st centuries, for a total of 124 million tokens, and modeled changes in the semantic similarity between topics and vehicles of 515 19th-century literary metaphors, taking this measure as a proxy of metaphor processing demands. Overall, semantic similarity, and hence metaphor processing demands, remained stable over time. However, genre played a key role: metaphors appeared more difficult (i.e., lower topic-vehicle similarity) in modern literary contexts than in 19th-century literature, but easier (i.e., higher topic-vehicle similarity) in today's nonliterary language (e.g., the Web) than in 19th-century nonliterary texts. This pattern was further shaped by semantic features of metaphors' individual terms, such as vector coherence and semantic neighborhood density. Collectively, these findings align with broader linguistic changes in Italian, such as the stylistic simplification of modern literature, which may have increased metaphor processing demands, and the high creativity of the Web's language, which seems to render metaphor more accessible.
103. 【2602.13680】AllMem: A Memory-centric Recipe for Efficient Long-context Modeling
链接:https://arxiv.org/abs/2602.13680
作者:Ziming Wang,Xiang Wang,Kailong Peng,Lang Qin,Juan Gabriel Kostelec,Christos Sourmpis,Axel Laborieux,Qinghai Guo
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, encounter significant performance, memory overhead inherent, long-sequence tasks due
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) encounter significant performance bottlenecks in long-sequence tasks due to the computational complexity and memory overhead inherent in the self-attention mechanism. To address these challenges, we introduce \textsc{AllMem}, a novel and efficient hybrid architecture that integrates Sliding Window Attention (SWA) with non-linear Test-Time Training (TTT) memory networks. \textsc{AllMem} enables models to effectively scale to ultra-long contexts while mitigating catastrophic forgetting. This approach not only overcomes the representation constraints typical of linear memory models but also significantly reduces the computational and memory footprint during long-sequence inference. Furthermore, we implement a Memory-Efficient Fine-Tuning strategy to replace standard attention layers in pre-trained models with memory-augmented sliding window layers. This framework facilitates the efficient transformation of any off-the-shelf pre-trained LLM into an \textsc{AllMem}-based architecture. Empirical evaluations confirm that our 4k window model achieves near-lossless performance on 37k LongBench with a marginal 0.83 drop compared to full attention. Furthermore, on InfiniteBench at a 128k context, our 8k window variant outperforms full attention, which validates the effectiveness of our parameterized memory in mitigating noise and maintaining robust long-range modeling without the prohibitive costs of global attention.
104. 【2602.13653】Building Autonomous GUI Navigation via Agentic-Q Estimation and Step-Wise Policy Optimization
链接:https://arxiv.org/abs/2602.13653
作者:Yibo Wang,Guangda Huzhang,Yuwei Hu,Yu Xia,Shiyin Lu,Qing-Guo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang,Lijun Zhang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
关键词:Graphical User Interface, Multimodal Large Language, Large Language Models, User Interface, Multimodal Large
备注:
点击查看摘要
Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have substantially driven the progress of autonomous agents for Graphical User Interface (GUI). Nevertheless, in real-world applications, GUI agents are often faced with non-stationary environments, leading to high computational costs for data curation and policy optimization. In this report, we introduce a novel MLLM-centered framework for GUI agents, which consists of two components: agentic-Q estimation and step-wise policy optimization. The former one aims to optimize a Q-model that can generate step-wise values to evaluate the contribution of a given action to task completion. The latter one takes step-wise samples from the state-action trajectory as inputs, and optimizes the policy via reinforcement learning with our agentic-Q model. It should be noticed that (i) all state-action trajectories are produced by the policy itself, so that the data collection costs are manageable; (ii) the policy update is decoupled from the environment, ensuring stable and efficient optimization. Empirical evaluations show that our framework endows Ovis2.5-9B with powerful GUI interaction capabilities, achieving remarkable performances on GUI navigation and grounding benchmarks and even surpassing contenders with larger scales.
105. 【2602.13650】KorMedMCQA-V: A Multimodal Benchmark for Evaluating Vision-Language Models on the Korean Medical Licensing Examination
链接:https://arxiv.org/abs/2602.13650
作者:Byungjin Choi,Seongsu Bae,Sunjun Kweon,Edward Choi
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Korean Medical Licensing, Medical Licensing Examinations, Korean medical, multiple-choice question answering, evaluating vision-language models
备注: 17 pages, 2 figures, 6 tables. (Includes appendix.)
点击查看摘要
Abstract:We introduce KorMedMCQA-V, a Korean medical licensing-exam-style multimodal multiple-choice question answering benchmark for evaluating vision-language models (VLMs). The dataset consists of 1,534 questions with 2,043 associated images from Korean Medical Licensing Examinations (2012-2023), with about 30% containing multiple images requiring cross-image evidence integration. Images cover clinical modalities including X-ray, computed tomography (CT), electrocardiography (ECG), ultrasound, endoscopy, and other medical visuals. We benchmark over 50 VLMs across proprietary and open-source categories-spanning general-purpose, medical-specialized, and Korean-specialized families-under a unified zero-shot evaluation protocol. The best proprietary model (Gemini-3.0-Pro) achieves 96.9% accuracy, the best open-source model (Qwen3-VL-32B-Thinking) 83.7%, and the best Korean-specialized model (VARCO-VISION-2.0-14B) only 43.2%. We further find that reasoning-oriented model variants gain up to +20 percentage points over instruction-tuned counterparts, medical domain specialization yields inconsistent gains over strong general-purpose baselines, all models degrade on multi-image questions, and performance varies notably across imaging modalities. By complementing the text-only KorMedMCQA benchmark, KorMedMCQA-V forms a unified evaluation suite for Korean medical reasoning across text-only and multimodal conditions. The dataset is available via Hugging Face Datasets: this https URL.
106. 【2602.13576】Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges
链接:https://arxiv.org/abs/2602.13576
作者:Ruomeng Ding,Yifei Pang,He Sun,Yizhong Wang,Zhiwei Steven Wu,Zhun Deng
类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:large language models, language models increasingly, models increasingly rely, large language, increasingly rely
备注:
点击查看摘要
Abstract:Evaluation and alignment pipelines for large language models increasingly rely on LLM-based judges, whose behavior is guided by natural-language rubrics and validated on benchmarks. We identify a previously under-recognized vulnerability in this workflow, which we term Rubric-Induced Preference Drift (RIPD). Even when rubric edits pass benchmark validation, they can still produce systematic and directional shifts in a judge's preferences on target domains. Because rubrics serve as a high-level decision interface, such drift can emerge from seemingly natural, criterion-preserving edits and remain difficult to detect through aggregate benchmark metrics or limited spot-checking. We further show this vulnerability can be exploited through rubric-based preference attacks, in which benchmark-compliant rubric edits steer judgments away from a fixed human or trusted reference on target domains, systematically inducing RIPD and reducing target-domain accuracy up to 9.5% (helpfulness) and 27.9% (harmlessness). When these judgments are used to generate preference labels for downstream post-training, the induced bias propagates through alignment pipelines and becomes internalized in trained policies. This leads to persistent and systematic drift in model behavior. Overall, our findings highlight evaluation rubrics as a sensitive and manipulable control interface, revealing a system-level alignment risk that extends beyond evaluator reliability alone. The code is available at: this https URL. Warning: Certain sections may contain potentially harmful content that may not be appropriate for all readers.
107. 【2602.13575】Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment
链接:https://arxiv.org/abs/2602.13575
作者:Jing Zhao,Ting Zhen,Junwei bao,Hongfei Jiang,Yang song
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, human preference data, compressing vast amounts, absolute reward functions
备注:
点击查看摘要
Abstract:Current alignment methods for Large Language Models (LLMs) rely on compressing vast amounts of human preference data into static, absolute reward functions, leading to data scarcity, noise sensitivity, and training instability. We introduce Elo-Evolve, a co-evolutionary framework that redefines alignment as dynamic multi-agent competition within an adaptive opponent pool. Our approach makes two key innovations: (1) eliminating Bradley-Terry model dependencies by learning directly from binary win/loss outcomes in pairwise competitions, and (2) implementing Elo-orchestrated opponent selection that provides automatic curriculum learning through temperature-controlled sampling. We ground our approach in PAC learning theory, demonstrating that pairwise comparison achieves superior sample complexity and empirically validate a 4.5x noise reduction compared to absolute scoring approaches. Experimentally, we train a Qwen2.5-7B model using our framework with opponents including Qwen2.5-14B, Qwen2.5-32B, and Qwen3-8B models. Results demonstrate a clear performance hierarchy: point-based methods static pairwise training Elo-Evolve across Alpaca Eval 2.0 and MT-Bench, validating the progressive benefits of pairwise comparison and dynamic opponent selection for LLM alignment.
108. 【2602.13571】LLM-Confidence Reranker: A Training-Free Approach for Enhancing Retrieval-Augmented Generation Systems
链接:https://arxiv.org/abs/2602.13571
作者:Zhipeng Song,Xiangyu Kong,Xinrui Bao,Yizhi Zhou,Jiulong Jiao,Sitong Liu,Yuhang Zhou,Heng Qi
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, natural language processing, revolutionized natural language, knowledge-intensive tasks remain, Large language
备注: Published by ESWA
点击查看摘要
Abstract:Large language models (LLMs) have revolutionized natural language processing, yet hallucinations in knowledge-intensive tasks remain a critical challenge. Retrieval-augmented generation (RAG) addresses this by integrating external knowledge, but its efficacy depends on accurate document retrieval and ranking. Although existing rerankers demonstrate effectiveness, they frequently necessitate specialized training, impose substantial computational expenses, and fail to fully exploit the semantic capabilities of LLMs, particularly their inherent confidence signals. We propose the LLM-Confidence Reranker (LCR), a training-free, plug-and-play algorithm that enhances reranking in RAG systems by leveraging black-box LLM confidence derived from Maximum Semantic Cluster Proportion (MSCP). LCR employs a two-stage process: confidence assessment via multinomial sampling and clustering, followed by binning and multi-level sorting based on query and document confidence thresholds. This approach prioritizes relevant documents while preserving original rankings for high-confidence queries, ensuring robustness. Evaluated on BEIR and TREC benchmarks with BM25 and Contriever retrievers, LCR--using only 7--9B-parameter pre-trained LLMs--consistently improves NDCG@5 by up to 20.6% across pre-trained LLM and fine-tuned Transformer rerankers, without degradation. Ablation studies validate the hypothesis that LLM confidence positively correlates with document relevance, elucidating LCR's mechanism. LCR offers computational efficiency, parallelism for scalability, and broad compatibility, mitigating hallucinations in applications like medical diagnosis.
109. 【2602.13567】DistillLens: Symmetric Knowledge Distillation Through Logit Lens
链接:https://arxiv.org/abs/2602.13567
作者:Manish Dhakal,Uthman Jinadu,Anjila Budathoki,Rajshekhar Sunderraman,Yi Ding
类目:Computation and Language (cs.CL)
关键词:compresses Large Language, Large Language Models, Standard Knowledge Distillation, Large Language, optimizing final outputs
备注: Knowledge Distillation in LLMs
点击查看摘要
Abstract:Standard Knowledge Distillation (KD) compresses Large Language Models (LLMs) by optimizing final outputs, yet it typically treats the teacher's intermediate layer's thought process as a black box. While feature-based distillation attempts to bridge this gap, existing methods (e.g., MSE and asymmetric KL divergence) ignore the rich uncertainty profiles required for the final output. In this paper, we introduce DistillLens, a framework that symmetrically aligns the evolving thought processes of student and teacher models. By projecting intermediate hidden states into the vocabulary space via the Logit Lens, we enforce structural alignment using a symmetric divergence objective. Our analysis proves that this constraint imposes a dual-sided penalty, preventing both overconfidence and underconfidence while preserving the high-entropy information conduits essential for final deduction. Extensive experiments on GPT-2 and Llama architectures demonstrate that DistillLens consistently outperforms standard KD and feature-transfer baselines on diverse instruction-following benchmarks. The code is available at this https URL.
110. 【2602.13562】Mitigating the Safety-utility Trade-off in LLM Alignment via Adaptive Safe Context Learning
链接:https://arxiv.org/abs/2602.13562
作者:Yanbo Wang,Minzheng Wang,Jian Liang,Lu Wang,Yongcan Yu,Ran He
类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:achieved remarkable success, increasing power necessitates, power necessitates stringent, stringent safety measures, necessitates stringent safety
备注: Preprint. 18 pages, 6 figures
点击查看摘要
Abstract:While reasoning models have achieved remarkable success in complex reasoning tasks, their increasing power necessitates stringent safety measures. For safety alignment, the core challenge lies in the inherent trade-off between safety and utility. However, prevailing alignment strategies typically construct CoT training data with explicit safety rules via context distillation. This approach inadvertently limits reasoning capabilities by creating a rigid association between rule memorization and refusal. To mitigate the safety-utility trade-off, we propose the Adaptive Safe Context Learning (ASCL) framework to improve the reasoning given proper context. ASCL formulates safety alignment as a multi-turn tool-use process, empowering the model to autonomously decide when to consult safety rules and how to generate the ongoing reasoning. Furthermore, to counteract the preference for rule consultation during RL, we introduce Inverse Frequency Policy Optimization (IFPO) to rebalance advantage estimates. By decoupling rule retrieval and subsequent reasoning, our method achieves higher overall performance compared to baselines.
111. 【2602.13551】Small Reward Models via Backward Inference
链接:https://arxiv.org/abs/2602.13551
作者:Yike Wang,Faeze Brahman,Shangbin Feng,Teng Xiao,Hannaneh Hajishirzi,Yulia Tsvetkov
类目:Computation and Language (cs.CL)
关键词:play a central, central role, FLIP, Reward, reward modeling
备注:
点击查看摘要
Abstract:Reward models (RMs) play a central role throughout the language model (LM) pipeline, particularly in non-verifiable domains. However, the dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require reference responses or explicit rubrics, limiting flexibility and broader accessibility. In this work, we propose FLIP (FLipped Inference for Prompt reconstruction), a reference-free and rubric-free reward modeling approach that reformulates reward modeling through backward inference: inferring the instruction that would most plausibly produce a given response. The similarity between the inferred and the original instructions is then used as the reward signal. Evaluations across four domains using 13 small language models show that FLIP outperforms LLM-as-a-Judge baselines by an average of 79.6%. Moreover, FLIP substantially improves downstream performance in extrinsic evaluations under test-time scaling via parallel sampling and GRPO training. We further find that FLIP is particularly effective for longer outputs and robust to common forms of reward hacking. By explicitly exploiting the validation-generation gap, FLIP enables reliable reward modeling in downscaled regimes where judgment methods fail. Code available at this https URL.
112. 【2602.13543】LiveNewsBench: Evaluating LLM Web Search Capabilities with Freshly Curated News
链接:https://arxiv.org/abs/2602.13543
作者:Yunfan Zhang,Kathleen McKeown,Smaranda Muresan
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, complex fact retrieval, capabilities show strong, show strong potential
备注: An earlier version of this work was publicly available on OpenReview as an ICLR 2026 submission in September 2025
点击查看摘要
Abstract:Large Language Models (LLMs) with agentic web search capabilities show strong potential for tasks requiring real-time information access and complex fact retrieval, yet evaluating such systems remains challenging. We introduce \bench, a rigorous and regularly updated benchmark designed to assess the agentic web search abilities of LLMs. \bench automatically generates fresh question-answer pairs from recent news articles, ensuring that questions require information beyond an LLM's training data and enabling clear separation between internal knowledge and search capability. The benchmark features intentionally difficult questions requiring multi-hop search queries, page visits, and reasoning, making it well-suited for evaluating agentic search behavior. Our automated data curation and question generation pipeline enables frequent benchmark updates and supports construction of a large-scale training dataset for agentic web search models, addressing the scarcity of such data in the research community. To ensure reliable evaluation, we include a subset of human-verified samples in the test set. We evaluate a broad range of systems using \bench, including commercial and open-weight LLMs as well as LLM-based web search APIs. The leaderboard, datasets, and code are publicly available at this http URL.
113. 【2602.13540】On Calibration of Large Language Models: From Response To Capability
链接:https://arxiv.org/abs/2602.13540
作者:Sin-Han Yang,Cheng-Kuang Wu,Chieh-Yen Lin,Yun-Nung Chen,Hung-yi Lee,Shao-Hua Sun
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:general-purpose problem solvers, Large language models, Large language, making accurate confidence, problem solvers
备注: preprint
点击查看摘要
Abstract:Large language models (LLMs) are widely deployed as general-purpose problem solvers, making accurate confidence estimation critical for reliable use. Prior work on LLM calibration largely focuses on response-level confidence, which estimates the correctness of a single generated output. However, this formulation is misaligned with many practical settings where the central question is how likely a model is to solve a query overall. We show that this mismatch results from the stochastic nature of modern LLM decoding, under which single-response correctness fails to reflect underlying model capability. To address this issue, we introduce capability calibration, which targets the model's expected accuracy on a query. We formally distinguish capability calibration from response calibration and show that the two differ both theoretically and empirically. We establish an empirical evaluation setup and study a range of confidence estimation methods. Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation, establishing a foundation with potential for diverse applications.
114. 【2602.13529】SecureGate: Learning When to Reveal PII Safely via Token-Gated Dual-Adapters for Federated LLMs
链接:https://arxiv.org/abs/2602.13529
作者:Mohamed Shaaban,Mohamed Elmahallawy
类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)
关键词:enables collaborative training, sharing raw data, enables collaborative, making it attractive, privacy-sensitive applications
备注:
点击查看摘要
Abstract:Federated learning (FL) enables collaborative training across organizational silos without sharing raw data, making it attractive for privacy-sensitive applications. With the rapid adoption of large language models (LLMs), federated fine-tuning of generative LLMs has gained attention as a way to leverage distributed data while preserving confidentiality. However, this setting introduces fundamental challenges: (i) privacy leakage of personally identifiable information (PII) due to LLM memorization, and (ii) a persistent tension between global generalization and local utility under heterogeneous data. Existing defenses, such as data sanitization and differential privacy, reduce leakage but often degrade downstream performance. We propose SecureGate, a privacy-aware federated fine-tuning framework for LLMs that provides fine-grained privacy control without sacrificing utility. SecureGate employs a dual-adapter LoRA architecture: a secure adapter that learns sanitized, globally shareable representations, and a revealing adapter that captures sensitive, organization-specific knowledge. A token-controlled gating module selectively activates these adapters at inference time, enabling controlled information disclosure without retraining. Extensive experiments across multiple LLMs and real-world datasets show that SecureGate improves task utility while substantially reducing PII leakage, achieving up to a 31.66X reduction in inference attack accuracy and a 17.07X reduction in extraction recall for unauthorized requests. Additionally, it maintains 100% routing reliability to the correct adapter and incurs only minimal computational and communication overhead.
115. 【2602.13517】hink Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
链接:https://arxiv.org/abs/2602.13517
作者:Wei-Lin Chen,Liqian Peng,Tian Tan,Chao Zhao,Blake JianHang Chen,Ziqian Lin,Alec Go,Yu Meng
类目:Computation and Language (cs.CL)
关键词:Large language models, demonstrated impressive reasoning, impressive reasoning capabilities, Large language, compute via long
备注: Work in progress
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal "overthinking," leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens -- tokens where internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@n matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.
116. 【2602.13504】From Perceptions To Evidence: Detecting AI-Generated Content In Turkish News Media With A Fine-Tuned Bert Classifier
链接:https://arxiv.org/abs/2602.13504
作者:Ozancan Ozdemir
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:raised urgent questions, large language models, rapid integration, integration of large, large language
备注:
点击查看摘要
Abstract:The rapid integration of large language models into newsroom workflows has raised urgent questions about the prevalence of AI-generated content in online media. While computational studies have begun to quantify this phenomenon in English-language outlets, no empirical investigation exists for Turkish news media, where existing research remains limited to qualitative interviews with journalists or fake news detection. This study addresses that gap by fine-tuning a Turkish-specific BERT model (dbmdz/bert-base-turkish-cased) on a labeled dataset of 3,600 articles from three major Turkish outlets with distinct editorial orientations for binary classification of AI-rewritten content. The model achieves 0.9708 F1 score on the held-out test set with symmetric precision and recall across both classes. Subsequent deployment on over 3,500 unseen articles spanning between 2023 and 2026 reveals consistent cross-source and temporally stable classification patterns, with mean prediction confidence exceeding 0.96 and an estimated 2.5 percentage of examined news content rewritten or revised by LLMs on average. To the best of our knowledge, this is the first study to move beyond self-reported journalist perceptions toward empirical, data-driven measurement of AI usage in Turkish news media.
117. 【2602.13466】Language Model Memory and Memory Models for Language
链接:https://arxiv.org/abs/2602.13466
作者:Benjamin L. Badger
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:hidden layer vector, layer vector embeddings, machine learning models, store input information, ability of machine
备注:
点击查看摘要
Abstract:The ability of machine learning models to store input information in hidden layer vector embeddings, analogous to the concept of `memory', is widely employed but not well characterized. We find that language model embeddings typically contain relatively little input information regardless of data and compute scale during training. In contrast, embeddings from autoencoders trained for input regeneration are capable of nearly perfect memory formation. The substitution of memory embeddings for token sequences leads to substantial computational efficiencies, motivating the introduction of a parallelizable encoder-decoder memory model architecture. Upon causal training these models contain information-poor embeddings incapable of arbitrary information access, but by combining causal and information retention objective functions they learn to form and decode information-rich memories. Training can be further streamlined by freezing a high fidelity encoder followed by a curriculum training approach where decoders first learn to process memories and then learn to additionally predict next tokens. We introduce the perspective that next token prediction training alone is poorly suited for accurate memory formation as the objective itself is non-invertible, motivating the use of combined objective functions for models where the entire input is not exposed.
118. 【2602.13455】Using Machine Learning to Enhance the Detection of Obfuscated Abusive Words in Swahili: A Focus on Child Safety
链接:https://arxiv.org/abs/2602.13455
作者:Phyllis Nabangi,Abdul-Jalil Zakaria,Jema David Ndibwile
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
关键词:necessitating enhanced measures, necessitating enhanced, rise of digital, digital technology, technology has dramatically
备注: Accepted at the Second IJCAI AI for Good Symposium in Africa, hosted by Deep Learning Indaba, 7 pages, 1 figure
点击查看摘要
Abstract:The rise of digital technology has dramatically increased the potential for cyberbullying and online abuse, necessitating enhanced measures for detection and prevention, especially among children. This study focuses on detecting abusive obfuscated language in Swahili, a low-resource language that poses unique challenges due to its limited linguistic resources and technological support. Swahili is chosen due to its popularity and being the most widely spoken language in Africa, with over 16 million native speakers and upwards of 100 million speakers in total, spanning regions in East Africa and some parts of the Middle East. We employed machine learning models including Support Vector Machines (SVM), Logistic Regression, and Decision Trees, optimized through rigorous parameter tuning and techniques like Synthetic Minority Over-sampling Technique (SMOTE) to handle data imbalance. Our analysis revealed that, while these models perform well in high-dimensional textual data, our dataset's small size and imbalance limit our findings' generalizability. Precision, recall, and F1 scores were thoroughly analyzed, highlighting the nuanced performance of each model in detecting obfuscated language. This research contributes to the broader discourse on ensuring safer online environments for children, advocating for expanded datasets and advanced machine-learning techniques to improve the effectiveness of cyberbullying detection systems. Future work will focus on enhancing data robustness, exploring transfer learning, and integrating multimodal data to create more comprehensive and culturally sensitive detection mechanisms.
Comments:
Accepted at the Second IJCAI AI for Good Symposium in Africa, hosted by Deep Learning Indaba, 7 pages, 1 figure
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Cite as:
arXiv:2602.13455 [cs.CL]
(or
arXiv:2602.13455v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.13455
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
119. 【2602.13452】LLM-Powered Automatic Translation and Urgency in Crisis Scenarios
链接:https://arxiv.org/abs/2602.13452
作者:Belu Ticona,Antonis Anastasopoulos
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, preparedness and response, Large language, increasingly proposed, Large
备注:
点击查看摘要
Abstract:Large language models (LLMs) are increasingly proposed for crisis preparedness and response, particularly for multilingual communication. However, their suitability for high-stakes crisis contexts remains insufficiently evaluated. This work examines the performance of state-of-the-art LLMs and machine translation systems in crisis-domain translation, with a focus on preserving urgency, which is a critical property for effective crisis communication and triaging. Using multilingual crisis data and a newly introduced urgency-annotated dataset covering over 32 languages, we show that both dedicated translation models and LLMs exhibit substantial performance degradation and instability. Crucially, even linguistically adequate translations can distort perceived urgency, and LLM-based urgency classifications vary widely depending on the language of the prompt and input. These findings highlight significant risks in deploying general-purpose language technologies for crisis communication and underscore the need for crisis-aware evaluation frameworks.
120. 【2602.13379】Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents
链接:https://arxiv.org/abs/2602.13379
作者:Xu Li,Simon Yu,Minzhou Pan,Yiyou Sun,Bo Li,Dawn Song,Xue Lin,Weiyan Shi
类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
关键词:increasingly capable, multi-turn, LLM-based agents, Attack Success Rate, safety
备注:
点击查看摘要
Abstract:LLM-based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi-turn interactions and employ diverse tools, introducing new risks overlooked by existing benchmarks. To systematically scale safety testing into multi-turn, tool-realistic settings, we propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences. Using this taxonomy, we construct MT-AgentRisk (Multi-Turn Agent Risk Benchmark), the first benchmark to evaluate multi-turn tool-using agent safety. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16% on average across open and closed models in multi-turn settings. To close this gap, we propose ToolShield, a training-free, tool-agnostic, self-exploration defense: when encountering a new tool, the agent autonomously generates test cases, executes them to observe downstream effects, and distills safety experiences for deployment. Experiments show that ToolShield effectively reduces ASR by 30% on average in multi-turn interactions. Our code is available at this https URL.
121. 【2602.13376】An Online Reference-Free Evaluation Framework for Flowchart Image-to-Code Generation
链接:https://arxiv.org/abs/2602.13376
作者:Giang Son Nguyen,Zi Pong Lim,Sarthak Ketanbhai Modi,Yon Shin Teo,Wenya Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:document processing pipelines, Vision-Language Models, document processing, processing pipelines, pipelines to convert
备注: 9 pages, 4 tables. Under review
点击查看摘要
Abstract:Vision-Language Models (VLMs) are increasingly used in document processing pipelines to convert flowchart images into structured code (e.g., Mermaid). In production, these systems process arbitrary inputs for which no ground-truth code exists, making output quality difficult to assess. We propose a reference-free evaluation framework that monitors flowchart image-to-code generation quality at inference time, using only the input image and the generated output. The framework introduces two automated metrics: $\text{Recall}{\text{OCR}}$, which estimates content coverage by extracting text from the input image via OCR as a proxy reference, and $\text{Precision}{\text{VE}}$, which detects hallucinated elements through Visual Entailment against the original image. Their harmonic mean, $\text{F1}{\text{OCR-VE}}$, provides a unified quality score. Validation on the FlowVQA dataset shows strong agreement with ground-truth metrics (average Pearson's $r = 0.97$, $0.91$, and $0.94$ for Recall, Precision, and F1, respectively), confirming the framework's reliability as a practical, reference-free alternative for continuous quality monitoring in production settings.
122. 【2602.13370】G2CP: A Graph-Grounded Communication Protocol for Verifiable and Efficient Multi-Agent Reasoning
链接:https://arxiv.org/abs/2602.13370
作者:Karim Ben Khaled,Davy Monticolo
类目:Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Language Models face, powered by Large, Models face, inefficient token consumption
备注:
点击查看摘要
Abstract:Multi-agent systems powered by Large Language Models face a critical challenge: agents communicate through natural language, leading to semantic drift, hallucination propagation, and inefficient token consumption. We propose G2CP (Graph-Grounded Communication Protocol), a structured agent communication language where messages are graph operations rather than free text. Agents exchange explicit traversal commands, subgraph fragments, and update operations over a shared knowledge graph, enabling verifiable reasoning traces and eliminating ambiguity. We validate G2CP within an industrial knowledge management system where specialized agents (Diagnostic, Procedural, Synthesis, and Ingestion) coordinate to answer complex queries. Experimental results on 500 industrial scenarios and 21 real-world maintenance cases show that G2CP reduces inter-agent communication tokens by 73%, improves task completion accuracy by 34% over free-text baselines, eliminates cascading hallucinations, and produces fully auditable reasoning chains. G2CP represents a fundamental shift from linguistic to structural communication in multi-agent systems, with implications for any domain requiring precise agent coordination. Code, data, and evaluation scripts are publicly available.
123. 【2602.13367】Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts
链接:https://arxiv.org/abs/2602.13367
作者:Chen Yang,Guangyue Peng,Jiaying Zhu,Ran Le,Ruixiang Feng,Tao Zhang,Xiyun Xu,Yang Song,Yiming Jia,Yuntao Wen,Yunzhi Xu,Zekai Wang,Zhenwei An,Zhicong Sun,Zongchao Chen
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:unified generalist language, strong agentic behavior, agentic behavior, generalist language model, unified generalist
备注:
点击查看摘要
Abstract:We present Nanbeige4.1-3B, a unified generalist language model that simultaneously achieves strong agentic behavior, code generation, and general reasoning with only 3B parameters. To the best of our knowledge, it is the first open-source small language model (SLM) to achieve such versatility in a single model. To improve reasoning and preference alignment, we combine point-wise and pair-wise reward modeling, ensuring high-quality, human-aligned responses. For code generation, we design complexity-aware rewards in Reinforcement Learning, optimizing both correctness and efficiency. In deep search, we perform complex data synthesis and incorporate turn-level supervision during training. This enables stable long-horizon tool interactions, allowing Nanbeige4.1-3B to reliably execute up to 600 tool-call turns for complex problem-solving. Extensive experimental results show that Nanbeige4.1-3B significantly outperforms prior models of similar scale, such as Nanbeige4-3B-2511 and Qwen3-4B, even achieving superior performance compared to much larger models, such as Qwen3-30B-A3B. Our results demonstrate that small models can achieve both broad competence and strong specialization simultaneously, redefining the potential of 3B parameter models.
124. 【2602.13352】Using Deep Learning to Generate Semantically Correct Hindi Captions
链接:https://arxiv.org/abs/2602.13352
作者:Wasim Akram Khan,Anil Kumar Vuppala
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:natural language processing, Automated image captioning, Automated image, image, harnessing the capability
备注: 34 pages, 12 figures, 3 tables. Master's thesis, Liverpool John Moores University, November 2022
点击查看摘要
Abstract:Automated image captioning using the content from the image is very appealing when done by harnessing the capability of computer vision and natural language processing. Extensive research has been done in the field with a major focus on the English language which gives the scope for further developments in the same with consideration of popular foreign languages. This research utilizes distinct models for translating the image caption into Hindi, the fourth most popular language across the world. Exploring the multi-modal architectures this research comprises local visual features, global visual features, attention mechanisms, and pre-trained models. Using google cloud translator on the image dataset from Flickr8k, Hindi image descriptions have been generated. Pre-trained CNNs like VGG16, ResNet50, and Inception V3 helped in retrieving image characteristics, while the uni-directional and bi-directional techniques of text encoding are used for the text encoding process. An additional Attention layer helps to generate a weight vector and, by multiplying it, combine image characteristics from each time step into a sentence-level feature vector. Bilingual evaluation understudy scores are used to compare the research outcome. Many experiments that serve as a baseline are done for the comparative analysis of the research. An image with a score of BLEU-1 is considered sufficient, whereas one with a score of BLEU-4 is considered to have fluid image captioning. For both BLEU scores, the attention-based bidirectional LSTM with VGG16 produced the best results of 0.59 and 0.19 respectively. The experiments conclude that researchs ability to produce relevant, semantically accurate image captions in Hindi. The research accomplishes the goals and future research can be guided by this research model.
125. 【2602.13348】Exploring the Performance of ML/DL Architectures on the MNIST-1D Dataset
链接:https://arxiv.org/abs/2602.13348
作者:Michael Beebe,GodsGift Uzor,Manasa Chepuri,Divya Sree Vemula,Angel Ayala
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Convolutional Neural Networks, Temporal Convolutional Networks, historically been instrumental, instrumental in advancing, research by providing
备注:
点击查看摘要
Abstract:Small datasets like MNIST have historically been instrumental in advancing machine learning research by providing a controlled environment for rapid experimentation and model evaluation. However, their simplicity often limits their utility for distinguishing between advanced neural network architectures. To address these challenges, Greydanus et al. introduced the MNIST-1D dataset, a one-dimensional adaptation of MNIST designed to explore inductive biases in sequential data. This dataset maintains the advantages of small-scale datasets while introducing variability and complexity that make it ideal for studying advanced architectures. In this paper, we extend the exploration of MNIST-1D by evaluating the performance of Residual Networks (ResNet), Temporal Convolutional Networks (TCN), and Dilated Convolutional Neural Networks (DCNN). These models, known for their ability to capture sequential patterns and hierarchical features, were implemented and benchmarked alongside previously tested architectures such as logistic regression, MLPs, CNNs, and GRUs. Our experimental results demonstrate that advanced architectures like TCN and DCNN consistently outperform simpler models, achieving near-human performance on MNIST-1D. ResNet also shows significant improvements, highlighting the importance of leveraging inductive biases and hierarchical feature extraction in small structured datasets. Through this study, we validate the utility of MNIST-1D as a robust benchmark for evaluating machine learning architectures under computational constraints. Our findings emphasize the role of architectural innovations in improving model performance and offer insights into optimizing deep learning models for resource-limited environments.
Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:
arXiv:2602.13348 [cs.LG]
(or
arXiv:2602.13348v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2602.13348
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
126. 【2602.13275】Artificial Organisations
链接:https://arxiv.org/abs/2602.13275
作者:William Waites
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Alignment research focuses, research focuses, focuses on making, Alignment research, Perseverance Composition Engine
备注:
点击查看摘要
Abstract:Alignment research focuses on making individual AI systems reliable. Human institutions achieve reliable collective behaviour differently: they mitigate the risk posed by misaligned individuals through organisational structure. Multi-agent AI systems should follow this institutional model using compartmentalisation and adversarial review to achieve reliable outcomes through architectural design rather than assuming individual alignment. We demonstrate this approach through the Perseverance Composition Engine, a multi-agent system for document composition. The Composer drafts text, the Corroborator verifies factual substantiation with full source access, and the Critic evaluates argumentative quality without access to sources: information asymmetry enforced by system architecture. This creates layered verification: the Corroborator detects unsupported claims, whilst the Critic independently assesses coherence and completeness. Observations from 474 composition tasks (discrete cycles of drafting, verification, and evaluation) exhibit patterns consistent with the institutional hypothesis. When assigned impossible tasks requiring fabricated content, this iteration enabled progression from attempted fabrication toward honest refusal with alternative proposals--behaviour neither instructed nor individually incentivised. These findings motivate controlled investigation of whether architectural enforcement produces reliable outcomes from unreliable components. This positions organisational theory as a productive framework for multi-agent AI safety. By implementing verification and evaluation as structural properties enforced through information compartmentalisation, institutional design offers a route to reliable collective behaviour from unreliable individual components.
Subjects:
Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:
arXiv:2602.13275 [cs.AI]
(or
arXiv:2602.13275v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2602.13275
Focus to learn more
arXiv-issued DOI via DataCite</p>
127. 【2602.13274】ProMoral-Bench: Evaluating Prompting Strategies for Moral Reasoning and Safety in LLMs
链接:https://arxiv.org/abs/2602.13274
作者:Rohan Subramanian Thomas,Shikhar Shiromani,Abdullah Chaudhry,Ruizhe Li,Vasu Sharma,Kevin Zhu,Sunishchal Dev
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:http URL introduce, large language models, unified benchmark evaluating, design significantly impacts, empirical comparisons remain
备注:
点击查看摘要
Abstract:Prompt design significantly impacts the moral competence and safety alignment of large language models (LLMs), yet empirical comparisons remain fragmented across datasets and this http URL introduce ProMoral-Bench, a unified benchmark evaluating 11 prompting paradigms across four LLM families. Using ETHICS, Scruples, WildJailbreak, and our new robustness test, ETHICS-Contrast, we measure performance via our proposed Unified Moral Safety Score (UMSS), a metric balancing accuracy and safety. Our results show that compact, exemplar-guided scaffolds outperform complex multi-stage reasoning, providing higher UMSS scores and greater robustness at a lower token cost. While multi-turn reasoning proves fragile under perturbations, few-shot exemplars consistently enhance moral stability and jailbreak resistance. ProMoral-Bench establishes a standardized framework for principled, cost-effective prompt engineering.
128. 【2602.13264】Directional Concentration Uncertainty: A representational approach to uncertainty quantification for generative models
链接:https://arxiv.org/abs/2602.13264
作者:Souradeep Chattopadhyay,Brendan Kennedy,Sai Munikoti,Soumik Sarkar,Karl Pazdernik
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:making generative models, generative models trustworthy, Uncertainty Quantification, trustworthy and robust, making generative
备注:
点击查看摘要
Abstract:In the critical task of making generative models trustworthy and robust, methods for Uncertainty Quantification (UQ) have begun to show encouraging potential. However, many of these methods rely on rigid heuristics that fail to generalize across tasks and modalities. Here, we propose a novel framework for UQ that is highly flexible and approaches or surpasses the performance of prior heuristic methods. We introduce Directional Concentration Uncertainty (DCU), a novel statistical procedure for quantifying the concentration of embeddings based on the von Mises-Fisher (vMF) distribution. Our method captures uncertainty by measuring the geometric dispersion of multiple generated outputs from a language model using continuous embeddings of the generated outputs without any task specific heuristics. In our experiments, we show that DCU matches or exceeds calibration levels of prior works like semantic entropy (Kuhn et al., 2023) and also generalizes well to more complex tasks in multi-modal domains. We present a framework for the wider potential of DCU and its implications for integration into UQ for multi-modal and agentic frameworks.
129. 【2602.13263】Multimodal Consistency-Guided Reference-Free Data Selection for ASR Accent Adaptation
链接:https://arxiv.org/abs/2602.13263
作者:Ligong Lei,Wenwen Lu,Xudong Pang,Zaokere Kadeer,Aishan Wumaier
类目:Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
关键词:Automatic speech recognition, making labeled accent, prosodic shifts induce, accent adaptation costly, labeled accent adaptation
备注:
点击查看摘要
Abstract:Automatic speech recognition (ASR) systems often degrade on accented speech because acoustic-phonetic and prosodic shifts induce a mismatch to training data, making labeled accent adaptation costly. However, common pseudo-label selection heuristics are largely text-centric (e.g., perplexity (PPL) filtering) and can prefer fluent yet acoustically mismatched hypotheses, leading to error amplification when fine-tuning. To address this, we introduce a multimodal consistency-guided, reference-free data selection pipeline for ASR accent adaptation under a transductive, label-free protocol. The pipeline starts with a target-aware preselection step based on submodular mutual information to improve query relevance and reduce downstream computation. It then generates multiple pseudo-transcriptions per utterance via perturbation-based decoding and scores each hypothesis using two reference-free signals: speech--text alignment in a shared embedding space and predicted word error rate (WER). A simple percentile-based selection rule retains reliable pseudo-labels for fine-tuning while discarding noisy utterances. In an in-domain setting, selecting ~1.5k utterances from a 30k pool achieves 10.91% WER, close to 10.45% obtained using 30k supervised labels. In a cross-domain setting with a mismatched candidate pool, consistency-filtered subsets avoid the degradation caused by unfiltered pseudo-labels under strong accent shift, and matched-hour experiments on a stronger ASR backbone further confirm gains over random sampling and recent selection baselines.
130. 【2602.13262】General learned delegation by clones
链接:https://arxiv.org/abs/2602.13262
作者:Darren Li,Meiqi Chen,Chenze Shao,Fandong Meng,Jie Zhou
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:additional test-time computation, uncoordinated parallel sampling, test-time computation, Frontier language models, additional test-time
备注: Code available at [this https URL](https://github.com/SuffixAutomata/SELFCEST)
点击查看摘要
Abstract:Frontier language models improve with additional test-time computation, but serial reasoning or uncoordinated parallel sampling can be compute-inefficient under fixed inference budgets. We propose SELFCEST, which equips a base model with the ability to spawn same-weight clones in separate parallel contexts by agentic reinforcement learning. Training is end-to-end under a global task reward with shared-parameter rollouts, yielding a learned controller that allocates both generation and context budget across branches. Across challenging math reasoning benchmarks and long-context multi-hop QA, SELFCEST improves the accuracy-cost Pareto frontier relative to monolithic baselines at matched inference budget, and exhibits out-of-distribution generalization in both domains.
131. 【2602.13258】MAPLE: A Sub-Agent Architecture for Memory, Learning, and Personalization in Agentic AI Systems
链接:https://arxiv.org/abs/2602.13258
作者:Deepak Babu Piskala
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
关键词:Large language model, remains fundamentally limited, individual users remains, users remains fundamentally, Large language
备注: 12 pages, 5 figures. Accepted to ALA Workshop at AAMAS 2026. Code: []( [this https URL](https://github.com/prdeepakbabu/maple-framework)
点击查看摘要
Abstract:Large language model (LLM) agents have emerged as powerful tools for complex tasks, yet their ability to adapt to individual users remains fundamentally limited. We argue this limitation stems from a critical architectural conflation: current systems treat memory, learning, and personalization as a unified capability rather than three distinct mechanisms requiring different infrastructure, operating on different timescales, and benefiting from independent optimization. We propose MAPLE (Memory-Adaptive Personalized LEarning), a principled decomposition where Memory handles storage and retrieval infrastructure; Learning extracts intelligence from accumulated interactions asynchronously; and Personalization applies learned knowledge in real-time within finite context budgets. Each component operates as a dedicated sub-agent with specialized tooling and well-defined interfaces. Experimental evaluation on the MAPLE-Personas benchmark demonstrates that our decomposition achieves a 14.6% improvement in personalization score compared to a stateless baseline (p 0.01, Cohen's d = 0.95) and increases trait incorporation rate from 45% to 75% -- enabling agents that genuinely learn and adapt.
132. 【2602.13248】X-Blocks: Linguistic Building Blocks of Natural Language Explanations for Automated Vehicles
链接:https://arxiv.org/abs/2602.13248
作者:Ashkan Y. Zadeh,Xiaomeng Li,Andry Rakotonirainy,Ronald Schroeter,Sebastien Glaser,Zishuo Zhu
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
关键词:existing approaches lack, Natural language explanations, Natural language, humans linguistically construct, language explanations play
备注:
点击查看摘要
Abstract:Natural language explanations play a critical role in establishing trust and acceptance of automated vehicles (AVs), yet existing approaches lack systematic frameworks for analysing how humans linguistically construct driving rationales across diverse scenarios. This paper introduces X-Blocks (eXplanation Blocks), a hierarchical analytical framework that identifies the linguistic building blocks of natural language explanations for AVs at three levels: context, syntax, and lexicon. At the context level, we propose RACE (Reasoning-Aligned Classification of Explanations), a multi-LLM ensemble framework that combines Chain-of-Thought reasoning with self-consistency mechanisms to robustly classify explanations into 32 scenario-aware categories. Applied to human-authored explanations from the Berkeley DeepDrive-X dataset, RACE achieves 91.45 percent accuracy and a Cohens kappa of 0.91 against cases with human annotator agreement, indicating near-human reliability for context classification. At the lexical level, log-odds analysis with informative Dirichlet priors reveals context-specific vocabulary patterns that distinguish driving scenarios. At the syntactic level, dependency parsing and template extraction show that explanations draw from a limited repertoire of reusable grammar families, with systematic variation in predicate types and causal constructions across contexts. The X-Blocks framework is dataset-agnostic and task-independent, offering broad applicability to other automated driving datasets and safety-critical domains. Overall, our findings provide evidence-based linguistic design principles for generating scenario-aware explanations that support transparency, user trust, and cognitive accessibility in automated driving systems.
Subjects:
Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Cite as:
arXiv:2602.13248 [cs.AI]
(or
arXiv:2602.13248v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2602.13248
Focus to learn more
arXiv-issued DOI via DataCite
Submission history From: Ashkan Yousefi Zadeh [view email] [v1]
Mon, 2 Feb 2026 07:18:25 UTC (14,248 KB)
133. 【2602.13237】NL2LOGIC: AST-Guided Translation of Natural Language into First-Order Logic with Large Language Models
链接:https://arxiv.org/abs/2602.13237
作者:Rizky Ramadhana Putra,Raihan Sultan Pasha Basuki,Yutong Cheng,Peng Gao
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:law and governance, critical in domains, verifying claims, claims against facts, facts in documents
备注: Accepted to Findings of EACL 2026. 17 pages, 6 figures
点击查看摘要
Abstract:Automated reasoning is critical in domains such as law and governance, where verifying claims against facts in documents requires both accuracy and interpretability. Recent work adopts structured reasoning pipelines that translate natural language into first-order logic and delegate inference to automated solvers. With the rise of large language models, approaches such as GCD and CODE4LOGIC leverage their reasoning and code generation capabilities to improve logic parsing. However, these methods suffer from fragile syntax control due to weak enforcement of global grammar constraints and low semantic faithfulness caused by insufficient clause-level semantic understanding. We propose NL2LOGIC, a first-order logic translation framework that introduces an abstract syntax tree as an intermediate representation. NL2LOGIC combines a recursive large language model based semantic parser with an abstract syntax tree guided generator that deterministically produces solver-ready logic code. Experiments on the FOLIO, LogicNLI, and ProofWriter benchmarks show that NL2LOGIC achieves 99 percent syntactic accuracy and improves semantic correctness by up to 30 percent over state-of-the-art baselines. Furthermore, integrating NL2LOGIC into Logic-LM yields near-perfect executability and improves downstream reasoning accuracy by 31 percent compared to Logic-LM's original few-shot unconstrained translation module.
134. 【2602.13226】Variation is the Key: A Variation-Based Framework for LLM-Generated Text Detection
链接:https://arxiv.org/abs/2602.13226
作者:Xuecong Li,Xiaohong Li,Qiang Hu,Yao Zhang,Junjie Wang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Detecting text generated, Detecting text, crucial but challenging, generated by large, Detecting
备注:
点击查看摘要
Abstract:Detecting text generated by large language models (LLMs) is crucial but challenging. Existing detectors depend on impractical assumptions, such as white-box settings, or solely rely on text-level features, leading to imprecise detection ability. In this paper, we propose a simple but effective and practical LLM-generated text detection method, VaryBalance. The core of VaryBalance is that, compared to LLM-generated texts, there is a greater difference between human texts and their rewritten version via LLMs. Leveraging this observation, VaryBalance quantifies this through mean standard deviation and distinguishes human texts and LLM-generated texts. Comprehensive experiments demonstrated that VaryBalance outperforms the state-of-the-art detectors, i.e., Binoculars, by up to 34.3\% in terms of AUROC, and maintains robustness against multiple generating models and languages.
135. 【2602.13224】A Geometric Taxonomy of Hallucinations in LLMs
链接:https://arxiv.org/abs/2602.13224
作者:Javier Marín
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:large language models, language models conflates, models conflates distinct, conflates distinct phenomena, large language
备注:
点击查看摘要
Abstract:The term "hallucination" in large language models conflates distinct phenomena with different geometric signatures in embedding space. We propose a taxonomy identifying three types: unfaithfulness (failure to engage with provided context), confabulation (invention of semantically foreign content), and factual error (incorrect claims within correct conceptual frames). We observe a striking asymmetry. On standard benchmarks where hallucinations are LLM-generated, detection is domain-local: AUROC 0.76-0.99 within domains, but 0.50 (chance level) across domains. Discriminative directions are approximately orthogonal between domains (mean cosine similarity -0.07). On human-crafted confabulations - invented institutions, redefined terminology, fabricated mechanisms - a single global direction achieves 0.96 AUROC with 3.8% cross-domain degradation. We interpret this divergence as follows: benchmarks capture generation artifacts (stylistic signatures of prompted fabrication), while human-crafted confabulations capture genuine topical drift. The geometric structure differs because the underlying phenomena differ. Type III errors show 0.478 AUROC - indistinguishable from chance. This reflects a theoretical constraint: embeddings encode distributional co-occurrence, not correspondence to external reality. Statements with identical contextual patterns occupy similar embedding regions regardless of truth value. The contribution is a geometric taxonomy clarifying the scope of embedding-based detection: Types I and II are detectable; Type III requires external verification mechanisms.
136. 【2602.13218】Scaling the Scaling Logic: Agentic Meta-Synthesis of Logic Reasoning
链接:https://arxiv.org/abs/2602.13218
作者:Bowen Liu,Zhi Wu,Runquan Xie,Zhanhui Kang,Jia Li
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
关键词:Reinforcement Learning, bottleneck for Reinforcement, Scaling verifiable training, training signals remains, Scaling verifiable
备注: 37 pages, 8 figures, 4 tables in the main body. Project page: [this https URL](https://github.com/AdAstraAbyssoque/Scaling-the-Scaling-Logic)
点击查看摘要
Abstract:Scaling verifiable training signals remains a key bottleneck for Reinforcement Learning from Verifiable Rewards (RLVR). Logical reasoning is a natural substrate: constraints are formal and answers are programmatically checkable. However, prior synthesis pipelines either depend on expert-written code or operate within fixed templates/skeletons, which limits growth largely to instance-level perturbations. We propose SSLogic, an agentic meta-synthesis framework that scales at the task-family level by iteratively synthesizing and repairing executable Generator--Validator program pairs in a closed Generate--Validate--Repair loop, enabling continuous family evolution with controllable difficulty. To ensure reliability, we introduce a Multi-Gate Validation Protocol that combines multi-strategy consistency checks with Adversarial Blind Review, where independent agents must solve instances by writing and executing code to filter ambiguous or ill-posed tasks. Starting from 400 seed families, two evolution rounds expand to 953 families and 21,389 verifiable instances (from 5,718). Training on SSLogic-evolved data yields consistent gains over the seed baseline at matched training steps, improving SynLogic by +5.2, BBEH by +1.4, AIME25 by +3.0, and Brumo25 by +3.7.
137. 【2602.10833】raining-Induced Bias Toward LLM-Generated Content in Dense Retrieval
链接:https://arxiv.org/abs/2602.10833
作者:William Xion,Wolfgang Nejdl
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:information retrieval applications, acquiring relevant context, language processing tasks, retrieval applications, open-domain natural language
备注: Accepted at ECIR 2026
点击查看摘要
Abstract:Dense retrieval is a promising approach for acquiring relevant context or world knowledge in open-domain natural language processing tasks and is now widely used in information retrieval applications. However, recent reports claim a broad preference for text generated by large language models (LLMs). This bias is called "source bias", and it has been hypothesized that lower perplexity contributes to this effect. In this study, we revisit this claim by conducting a controlled evaluation to trace the emergence of such preferences across training stages and data sources. Using parallel human- and LLM-generated counterparts of the SciFact and Natural Questions (NQ320K) datasets, we compare unsupervised checkpoints with models fine-tuned using in-domain human text, in-domain LLM-generated text, and MS MARCO. Our results show the following: 1) Unsupervised retrievers do not exhibit a uniform pro-LLM preference. The direction and magnitude depend on the dataset. 2) Across the settings tested, supervised fine-tuning on MS MARCO consistently shifts the rankings toward LLM-generated text. 3) In-domain fine-tuning produces dataset-specific and inconsistent shifts in preference. 4) Fine-tuning on LLM-generated corpora induces a pronounced pro-LLM bias. Finally, a retriever-centric perplexity probe involving the reattachment of a language modeling head to the fine-tuned dense retriever encoder indicates agreement with relevance near chance, thereby weakening the explanatory power of perplexity. Our study demonstrates that source bias is a training-induced phenomenon rather than an inherent property of dense retrievers.
138. 【2504.18880】Reshaping MOFs text mining with a dynamic multi-agents framework of large language model
链接:https://arxiv.org/abs/2504.18880
作者:Zuhong Lin,Daoyuan Ren,Kai Ran,Jing Sun,Songlin Yu,Xuefeng Bai,Xiaotian Huang,Haiyang He,Pengxu Pan,Ying Fang,Zhanglin Li,Haipu Li,Jingjing Yao
类目:Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL)
关键词:guiding experimental design, Accurately identifying, metal-organic frameworks, experimental design, difficult to interpret
备注:
点击查看摘要
Abstract:Accurately identifying the synthesis conditions of metal-organic frameworks (MOFs) is essential for guiding experimental design, yet remains challenging because relevant information in the literature is often scattered, inconsistent, and difficult to interpret. We present MOFh6, a large language model driven system that reads raw articles or crystal codes and converts them into standardized synthesis tables. It links related descriptions across paragraphs, unifies ligand abbreviations with full names, and outputs structured parameters ready for use. MOFh6 achieved 99% extraction accuracy, resolved 94.1% of abbreviation cases across five major publishers, and maintained a precision of 0.93 +/- 0.01. Processing a full text takes 9.6 s, locating synthesis descriptions 36 s, with 100 papers processed for USD 4.24. By replacing static database lookups with real-time extraction, MOFh6 reshapes MOF synthesis research, accelerating the conversion of literature knowledge into practical synthesis protocols and enabling scalable, data-driven materials discovery.
139. 【2602.13419】Protect$^*$: Steerable Retrosynthesis through Neuro-Symbolic State Encoding
链接:https://arxiv.org/abs/2602.13419
作者:Shreyas Vinaya Sathyanarayana,Shah Rahil Kirankumar,Sharanabasava D. Hiremath,Bharath Ramsundar
类目:Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
关键词:Large Language Models, shown remarkable potential, Large Language, Language Models, navigate complex problem
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have shown remarkable potential in scientific domains like retrosynthesis; yet, they often lack the fine-grained control necessary to navigate complex problem spaces without error. A critical challenge is directing an LLM to avoid specific, chemically sensitive sites on a molecule - a task where unconstrained generation can lead to invalid or undesirable synthetic pathways. In this work, we introduce Protect$^*$, a neuro-symbolic framework that grounds the generative capabilities of Large Language Models (LLMs) in rigorous chemical logic. Our approach combines automated rule-based reasoning - using a comprehensive database of 55+ SMARTS patterns and 40+ characterized protecting groups - with the generative intuition of neural models. The system operates via a hybrid architecture: an ``automatic mode'' where symbolic logic deterministically identifies and guards reactive sites, and a ``human-in-the-loop mode'' that integrates expert strategic constraints. Through ``active state tracking,'' we inject hard symbolic constraints into the neural inference process via a dedicated protection state linked to canonical atom maps. We demonstrate this neuro-symbolic approach through case studies on complex natural products, including the discovery of a novel synthetic pathway for Erythromycin B, showing that grounding neural generation in symbolic logic enables reliable, expert-level autonomy.
信息检索
1. 【2602.15019】Hunt Globally: Deep Research AI Agents for Drug Asset Scouting in Investing, Business Development, and Search Evaluation
链接:https://arxiv.org/abs/2602.15019
作者:Alisa Vinogradova,Vlad Vinogradov,Luba Greenwood,Ilya Yasny,Dmitry Kobyzev,Shoman Kasbekar,Kong Nguyen,Dmitrii Radkevich,Roman Doronin,Andrey Doronichev
类目:Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:United States, Bio-pharmaceutical innovation, Perplexity Deep Research, Deep Research, non-English channels
备注:
点击查看摘要
Abstract:Bio-pharmaceutical innovation has shifted: many new drug assets now originate outside the United States and are disclosed primarily via regional, non-English channels. Recent data suggests 85% of patent filings originate outside the U.S., with China accounting for nearly half of the global total; a growing share of scholarly output is also non-U.S. Industry estimates put China at ~30% of global drug development, spanning 1,200+ novel candidates. In this high-stakes environment, failing to surface "under-the-radar" assets creates multi-billion-dollar risk for investors and business development teams, making asset scouting a coverage-critical competition where speed and completeness drive value. Yet today's Deep Research AI agents still lag human experts in achieving high-recall discovery across heterogeneous, multilingual sources without hallucinations. We propose a benchmarking methodology for drug asset scouting and a tuned, tree-based self-learning Bioptic Agent aimed at complete, non-hallucinated scouting. We construct a challenging completeness benchmark using a multilingual multi-agent pipeline: complex user queries paired with ground-truth assets that are largely outside U.S.-centric radar. To reflect real deal complexity, we collected screening queries from expert investors, BD, and VC professionals and used them as priors to conditionally generate benchmark queries. For grading, we use LLM-as-judge evaluation calibrated to expert opinions. We compare Bioptic Agent against Claude Opus 4.6, OpenAI GPT-5.2 Pro, Perplexity Deep Research, Gemini 3 Pro + Deep Research, and Exa Websets. Bioptic Agent achieves 79.7% F1 versus 56.2% (Claude Opus 4.6), 50.6% (Gemini 3 Pro + Deep Research), 46.6% (GPT-5.2 Pro), 44.2% (Perplexity Deep Research), and 26.9% (Exa Websets). Performance improves steeply with additional compute, supporting the view that more compute yields better results.
Subjects:
Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Cite as:
arXiv:2602.15019 [cs.AI]
(or
arXiv:2602.15019v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2602.15019
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
2. 【2602.15005】Learning User Interests via Reasoning and Distillation for Cross-Domain News Recommendation
链接:https://arxiv.org/abs/2602.15005
作者:Mengdan Zhu,Yufan Zhao,Tao Di,Yulan Yan,Liang Zhao
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:discover relevant content, helping users discover, users discover relevant, relevant content, plays a critical
备注:
点击查看摘要
Abstract:News recommendation plays a critical role in online news platforms by helping users discover relevant content. Cross-domain news recommendation further requires inferring user's underlying information needs from heterogeneous signals that often extend beyond direct news consumption. A key challenge lies in moving beyond surface-level behaviors to capture deeper, reusable user interests while maintaining scalability in large-scale production systems. In this paper, we present a reinforcement learning framework that trains large language models to generate high-quality lists of interest-driven news search queries from cross-domain user signals. We formulate query-list generation as a policy optimization problem and employ GRPO with multiple reward signals. We systematically study two compute dimensions: inference-time sampling and model capacity, and empirically observe consistent improvements with increased compute that exhibit scaling-like behavior. Finally, we perform on-policy distillation to transfer the learned policy from a large, compute-intensive teacher to a compact student model suitable for scalable deployment. Extensive offline experiments, ablation studies and large-scale online A/B tests in a production news recommendation system demonstrate consistent gains in both interest modeling quality and downstream recommendation performance.
3. 【2602.14960】DRAMA: Domain Retrieval using Adaptive Module Allocation
链接:https://arxiv.org/abs/2602.14960
作者:Pranav Kasela,Marco Braga,Ophir Frieder,Nazli Goharian,Gabriella Pasi,Raffaele Perego
类目:Information Retrieval (cs.IR)
关键词:Web-scale Information Retrieval, Web-scale Information, Information Retrieval, Adaptive Module Allocation, Web-scale
备注:
点击查看摘要
Abstract:Neural models are increasingly used in Web-scale Information Retrieval (IR). However, relying on these models introduces substantial computational and energy requirements, leading to increasing attention toward their environmental cost and the sustainability of large-scale deployments. While neural IR models deliver high retrieval effectiveness, their scalability is constrained in multi-domain scenarios, where training and maintaining domain-specific models is inefficient and achieving robust cross-domain generalisation within a unified model remains difficult. This paper introduces DRAMA (Domain Retrieval using Adaptive Module Allocation), an energy- and parameter-efficient framework designed to reduce the environmental footprint of neural retrieval. DRAMA integrates domain-specific adapter modules with a dynamic gating mechanism that selects the most relevant domain knowledge for each query. New domains can be added efficiently through lightweight adapter training, avoiding full model retraining. We evaluate DRAMA on multiple Web retrieval benchmarks covering different domains. Our extensive evaluation shows that DRAMA achieves comparable effectiveness to domain-specific models while using only a fraction of their parameters and computational resources. These findings show that energy-aware model design can significantly improve scalability and sustainability in neural IR.
4. 【2602.14914】Additive Control Variates Dominate Self-Normalisation in Off-Policy Evaluation
链接:https://arxiv.org/abs/2602.14914
作者:Olivier Jeunen,Shashank Gupta
类目:Machine Learning (cs.LG); Information Retrieval (cs.IR)
关键词:costly online interventions, Inverse Propensity Scoring, Self-Normalised Inverse Propensity, online interventions, Propensity Scoring
备注:
点击查看摘要
Abstract:Off-policy evaluation (OPE) is essential for assessing ranking and recommendation systems without costly online interventions. Self-Normalised Inverse Propensity Scoring (SNIPS) is a standard tool for variance reduction in OPE, leveraging a multiplicative control variate. Recent advances in off-policy learning suggest that additive control variates (baseline corrections) may offer superior performance, yet theoretical guarantees for evaluation are lacking. This paper provides a definitive answer: we prove that $\beta^\star$-IPS, an estimator with an optimal additive baseline, asymptotically dominates SNIPS in Mean Squared Error. By analytically decomposing the variance gap, we show that SNIPS is asymptotically equivalent to using a specific -- but generally sub-optimal -- additive baseline. Our results theoretically justify shifting from self-normalisation to optimal baseline corrections for both ranking and recommendation.
5. 【2602.14793】Beyond Retractions: Forensic Scientometrics Techniques to Identify Research Misconduct, Citation Leakage, and Funding Anomalies
链接:https://arxiv.org/abs/2602.14793
作者:Leslie D. McIntosh,Alexandra Sinclair,Simon Linacre
类目:Information Retrieval (cs.IR)
关键词:Neuroscience Research Network, Pharmakon Neuroscience Research, scholarly publishing channels, fabricated research collective, forensic scientometric case
备注:
点击查看摘要
Abstract:This paper presents a forensic scientometric case study of the Pharmakon Neuroscience Research Network, a fabricated research collective that operated primarily between 2019 and 2022 while embedding itself within legitimate scholarly publishing channels.
6. 【2602.14784】Intent-Driven Dynamic Chunking: Segmenting Documents to Reflect Predicted Information Needs
链接:https://arxiv.org/abs/2602.14784
作者:Christos Koutsiaris
类目:Information Retrieval (cs.IR)
关键词:Breaking long, Breaking long documents, smaller segments, fundamental challenge, Breaking
备注: 8 pages, 4 figures. Code available at [this https URL](https://github.com/unseen1980/IDC)
点击查看摘要
Abstract:Breaking long documents into smaller segments is a fundamental challenge in information retrieval. Whether for search engines, question-answering systems, or retrieval-augmented generation (RAG), effective segmentation determines how well systems can locate and return relevant information. However, traditional methods, such as fixed-length or coherence-based segmentation, ignore user intent, leading to chunks that split answers or contain irrelevant noise. We introduce Intent-Driven Dynamic Chunking (IDC), a novel approach that uses predicted user queries to guide document segmentation. IDC leverages a Large Language Model to generate likely user intents for a document and then employs a dynamic programming algorithm to find the globally optimal chunk boundaries. This represents a novel application of DP to intent-aware segmentation that avoids greedy pitfalls. We evaluated IDC on six diverse question-answering datasets, including news articles, Wikipedia, academic papers, and technical documentation. IDC outperformed traditional chunking strategies on five datasets, improving top-1 retrieval accuracy by 5% to 67%, and matched the best baseline on the sixth. Additionally, IDC produced 40-60% fewer chunks than baseline methods while achieving 93-100% answer coverage. These results demonstrate that aligning document structure with anticipated information needs significantly boosts retrieval performance, particularly for long and heterogeneous documents.
7. 【2602.14755】Measuring the relatedness between scientific publications using controlled vocabularies
链接:https://arxiv.org/abs/2602.14755
作者:Emil Dolmer Alnor
类目:Digital Libraries (cs.DL); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
关键词:science policy, areas of bibliometrics, bibliometrics and science, Salton cosine, Salton cosine similarity
备注: Currently under review at Scientometrics (16 February 2026)
点击查看摘要
Abstract:Measuring the relatedness between scientific publications is essential in many areas of bibliometrics and science policy. Controlled vocabularies provide a promising basis for measuring relatedness and are widely used in combination with Salton's cosine similarity. The latter is problematic because it only considers exact matches between terms. This article introduces two alternative methods - soft cosine and maximum term similarities - that account for the semantic similarity between non-matching terms. The article compares the accuracy of all three methods using the assignment of publications to topics in the TREC 2006 Genomics Track and the assumption that accurate relatedness measures should assign high relatedness scores to publication pairs within the same topic and low scores to pairs from separate topics. Results show that soft cosine is the most accurate method, while the most widely used version of Salton's cosine is markedly less accurate than the other methods tested. These findings have implications for how controlled vocabularies should be used to measure relatedness.
8. 【2602.14710】Orcheo: A Modular Full-Stack Platform for Conversational Search
链接:https://arxiv.org/abs/2602.14710
作者:Shaojie Jiang,Svitlana Vakulenko,Maarten de Rijke
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:complex software engineering, Conversational search, integrates query reformulation, software engineering pipeline, requires a complex
备注: Under review at SIGIR 2026
点击查看摘要
Abstract:Conversational search (CS) requires a complex software engineering pipeline that integrates query reformulation, ranking, and response generation. CS researchers currently face two barriers: the lack of a unified framework for efficiently sharing contributions with the community, and the difficulty of deploying end-to-end prototypes needed for user evaluation. We introduce Orcheo, an open-source platform designed to bridge this gap. Orcheo offers three key advantages: (i) A modular architecture promotes component reuse through single-file node modules, facilitating sharing and reproducibility in CS research; (ii) Production-ready infrastructure bridges the prototype-to-system gap via dual execution modes, secure credential management, and execution telemetry, with built-in AI coding support that lowers the learning curve; (iii) Starter-kit assets include 50+ off-the-shelf components for query understanding, ranking, and response generation, enabling the rapid bootstrapping of complete CS pipelines. We describe the framework architecture and validate Orcheo's utility through case studies that highlight modularity and ease of use. Orcheo is released as open source under the MIT License at this https URL.
9. 【2602.14706】Adaptive Autoguidance for Item-Side Fairness in Diffusion Recommender Systems
链接:https://arxiv.org/abs/2602.14706
作者:Zihan Li,Gustavo Escobedo,Marta Moscati,Oleg Lesota,Markus Schedl
类目:Information Retrieval (cs.IR)
关键词:systems achieve strong, achieve strong recommendation, recommender systems achieve, Diffusion recommender systems, strong recommendation accuracy
备注:
点击查看摘要
Abstract:Diffusion recommender systems achieve strong recommendation accuracy but often suffer from popularity bias, resulting in unequal item exposure. To address this shortcoming, we introduce A2G-DiffRec, a diffusion recommender that incorporates adaptive autoguidance, where the main model is guided by a less-trained version of itself. Instead of using a fixed guidance weight, A2G-DiffRec learns to adaptively weigh the outputs of the main and weak models during training, supervised by a popularity regularization that promotes balanced exposure across items with different popularity levels. Experimental results on the MovieLens-1M, Foursquare-Tokyo, and Music4All-Onion datasets show that A2G-DiffRec is effective in enhancing item-side fairness at a marginal cost of accuracy reduction compared to existing guided diffusion recommenders and other non-diffusion baselines.
10. 【2602.14635】Alignment Adapter to Improve the Performance of Compressed Deep Learning Models
链接:https://arxiv.org/abs/2602.14635
作者:Rohit Raj Rai,Abhishek Dhaka,Amit Awekar
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Compressed Deep Learning, Deep Learning, Compressed Deep, resource-constrained environments, essential for deployment
备注:
点击查看摘要
Abstract:Compressed Deep Learning (DL) models are essential for deployment in resource-constrained environments. But their performance often lags behind their large-scale counterparts. To bridge this gap, we propose Alignment Adapter (AlAd): a lightweight, sliding-window-based adapter. It aligns the token-level embeddings of a compressed model with those of the original large model. AlAd preserves local contextual semantics, enables flexible alignment across differing dimensionalities or architectures, and is entirely agnostic to the underlying compression method. AlAd can be deployed in two ways: as a plug-and-play module over a frozen compressed model, or by jointly fine-tuning AlAd with the compressed model for further performance gains. Through experiments on BERT-family models across three token-level NLP tasks, we demonstrate that AlAd significantly boosts the performance of compressed models with only marginal overhead in size and latency.
11. 【2602.14519】DeepMTL2R: A Library for Deep Multi-task Learning to Rank
链接:https://arxiv.org/abs/2602.14519
作者:Chaosheng Dong,Peiyao Xiao,Yijia Wang,Kaiyi Ji
类目:Machine Learning (cs.LG); Information Retrieval (cs.IR)
关键词:open-source deep learning, multiple relevance criteria, deep learning framework, paper presents, optimized simultaneously
备注:
点击查看摘要
Abstract:This paper presents DeepMTL2R, an open-source deep learning framework for Multi-task Learning to Rank (MTL2R), where multiple relevance criteria must be optimized simultaneously. DeepMTL2R integrates heterogeneous relevance signals into a unified, context-aware model by leveraging the self-attention mechanism of transformer architectures, enabling effective learning across diverse and potentially conflicting objectives. The framework includes 21 state-of-the-art multi-task learning algorithms and supports multi-objective optimization to identify Pareto-optimal ranking models. By capturing complex dependencies and long-range interactions among items and labels, DeepMTL2R provides a scalable and expressive solution for modern ranking systems and facilitates controlled comparisons across MTL strategies. We demonstrate its effectiveness on a publicly available dataset, report competitive performance, and visualize the resulting trade-offs among objectives. DeepMTL2R is available at \href{this https URL}{this https URL}.
12. 【2602.14502】Behavioral Feature Boosting via Substitute Relationships for E-commerce Search
链接:https://arxiv.org/abs/2602.14502
作者:Chaosheng Dong,Michinari Momma,Yijia Wang,Yan Gao,Yi Sun
类目:Information Retrieval (cs.IR)
关键词:limited interaction data, interaction data reduces, limited interaction, interaction data, data reduces
备注: 5 pages, 5 figures
点击查看摘要
Abstract:On E-commerce platforms, new products often suffer from the cold-start problem: limited interaction data reduces their search visibility and hurts relevance ranking. To address this, we propose a simple yet effective behavior feature boosting method that leverages substitute relationships among products (BFS). BFS identifies substitutes-products that satisfy similar user needs-and aggregates their behavioral signals (e.g., clicks, add-to-carts, purchases, and ratings) to provide a warm start for new items. Incorporating these enriched signals into ranking models mitigates cold-start effects and improves relevance and competitiveness. Experiments on a large E-commerce platform, both offline and online, show that BFS significantly improves search relevance and product discovery for cold-start products. BFS is scalable and practical, improving user experience while increasing exposure for newly launched items in E-commerce search. The BFS-enhanced ranking model has been launched in production and has served customers since 2025.
13. 【2602.14492】Query as Anchor: Scenario-Adaptive User Representation via Large Language Model
链接:https://arxiv.org/abs/2602.14492
作者:Jiahao Yuan,Yike Xu,Jinyong Wen,Baokun Wang,Ziyi Gao,Xiaotong Lin,Yun Liu,Xing Fu,Yu Cheng,Yongchao Liu,Weiqiang Wang,Zhongle Xie
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:learning requires balancing, requires balancing robust, balancing robust universality, representation learning requires, acute task-sensitivity
备注: 15 pages, 12 figures
点击查看摘要
Abstract:Industrial-scale user representation learning requires balancing robust universality with acute task-sensitivity. However, existing paradigms primarily yield static, task-agnostic embeddings that struggle to reconcile the divergent requirements of downstream scenarios within unified vector spaces. Furthermore, heterogeneous multi-source data introduces inherent noise and modality conflicts, degrading representation. We propose Query-as-Anchor, a framework shifting user modeling from static encoding to dynamic, query-aware synthesis. To empower Large Language Models (LLMs) with deep user understanding, we first construct UserU, an industrial-scale pre-training dataset that aligns multi-modal behavioral sequences with user understanding semantics, and our Q-Anchor Embedding architecture integrates hierarchical coarse-to-fine encoders into dual-tower LLMs via joint contrastive-autoregressive optimization for query-aware user representation. To bridge the gap between general pre-training and specialized business logic, we further introduce Cluster-based Soft Prompt Tuning to enforce discriminative latent structures, effectively aligning model attention with scenario-specific modalities. For deployment, anchoring queries at sequence termini enables KV-cache-accelerated inference with negligible incremental latency. Evaluations on 10 Alipay industrial benchmarks show consistent SOTA performance, strong scalability, and efficient deployment. Large-scale online A/B testing in Alipay's production system across two real-world scenarios further validates its practical effectiveness. Our code is prepared for public release and will be available at: this https URL.
14. 【2602.14367】InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem
链接:https://arxiv.org/abs/2602.14367
作者:Shuofei Qiao,Yunxiang Wei,Xuehai Wang,Bin Wu,Boyang Xue,Ningyu Zhang,Hossein A. Rahmani,Yanshan Wang,Qiang Zhang,Keyan Ding,Jeff Z. Pan,Huajun Chen,Emine Yilmaz
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, Language Models, evolution of Large, Models has catalyzed
备注: Ongoing Work
点击查看摘要
Abstract:The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation. The fundamental nature of scientific evaluation needs knowledgeable grounding, collective deliberation, and multi-criteria decision-making. However, existing idea evaluation methods often suffer from narrow knowledge horizons, flattened evaluation dimensions, and the inherent bias in LLM-as-a-Judge. To address these, we regard idea evaluation as a knowledge-grounded, multi-perspective reasoning problem and introduce InnoEval, a deep innovation evaluation framework designed to emulate human-level idea assessment. We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We further achieve review consensus with an innovation review board containing reviewers with distinct academic backgrounds, enabling a multi-dimensional decoupled evaluation across multiple metrics. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval. Experiments demonstrate that InnoEval can consistently outperform baselines in point-wise, pair-wise, and group-wise evaluation tasks, exhibiting judgment patterns and consensus highly aligned with human experts.
15. 【2602.14358】High Precision Audience Expansion via Extreme Classification in a Two-Sided Marketplace
链接:https://arxiv.org/abs/2602.14358
作者:Dillon Davis,Huiji Gao,Thomas Legrand,Juan Manuel Caicedo Carvajal,Malay Haldar,Kedar Bellare,Moutupsi Paul,Soumyadip Banerjee,Liwei He,Stephanie Moyerman,Sanjeev Katariya
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:highly varied supply, expectations differ widely, price expectations differ, balance a worldwide, highly varied
备注: KDD TSMO 2025: [this https URL](https://sites.google.com/view/tsmo2025/accepted-papers?authuser=0)
点击查看摘要
Abstract:Airbnb search must balance a worldwide, highly varied supply of homes with guests whose location, amenity, style, and price expectations differ widely. Meeting those expectations hinges on an efficient retrieval stage that surfaces only the listings a guest might realistically book, before resource intensive ranking models are applied to determine the best results. Unlike many recommendation engines, our system faces a distinctive challenge, location retrieval, that sits upstream of ranking and determines which geographic areas are queried in order to filter inventory to a candidate set. The preexisting approach employs a deep bayesian bandit based system to predict a rectangular retrieval bounds area that can be used for filtering. The purpose of this paper is to demonstrate the methodology, challenges, and impact of rearchitecting search to retrieve from the subset of most bookable high precision rectangular map cells defined by dividing the world into 25M uniform cells.
16. 【2602.14257】AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
链接:https://arxiv.org/abs/2602.14257
作者:Lingxiang Hu,Yiding Sun,Tianle Xia,Wenwei Li,Ming Xu,Liqun Liu,Peng Shu,Huan Yu,Jie Jiang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:Large Language Model, Large Language, achieved remarkable progress, Language Model, critical problem
备注: 15 pages, 11 figures
点击查看摘要
Abstract:While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem. Current benchmarks, however, are largely restricted to idealized simulations, failing to address the practical demands of specialized domains like advertising and marketing analytics. In these fields, tasks are inherently more complex, often requiring multi-round interaction with professional marketing tools. To address this gap, we propose AD-Bench, a benchmark designed based on real-world business requirements of advertising and marketing platforms. AD-Bench is constructed from real user marketing analysis requests, with domain experts providing verifiable reference answers and corresponding reference tool-call trajectories. The benchmark categorizes requests into three difficulty levels (L1-L3) to evaluate agents' capabilities under multi-round, multi-tool collaboration. Experiments show that on AD-Bench, Gemini-3-Pro achieves Pass@1 = 68.0% and Pass@3 = 83.0%, but performance drops significantly on L3 to Pass@1 = 49.4% and Pass@3 = 62.1%, with a trajectory coverage of 70.1%, indicating that even state-of-the-art models still exhibit substantial capability gaps in complex advertising and marketing analysis scenarios. AD-Bench provides a realistic benchmark for evaluating and improving advertising marketing agents, the leaderboard and code can be found at this https URL.
17. 【2602.14162】Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering
链接:https://arxiv.org/abs/2602.14162
作者:Tao Xu
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:Existing multimodal document, generate comprehensive descriptions, answering methods universally, methods universally adopt, Existing multimodal
备注: 24 pages, 9 figures, 9 tables
点击查看摘要
Abstract:Existing multimodal document question answering methods universally adopt a supply-side ingestion strategy: running a Vision-Language Model (VLM) on every page during indexing to generate comprehensive descriptions, then answering questions through text retrieval. However, this "pre-ingestion" approach is costly (a 113-page engineering drawing package requires approximately 80,000 VLM tokens), end-to-end unreliable (VLM outputs may fail to be correctly retrieved due to format mismatches in the retrieval infrastructure), and irrecoverable once it fails. This paper proposes the Deferred Visual Ingestion (DVI) framework, adopting a demand-side ingestion strategy: the indexing phase performs only lightweight metadata extraction, deferring visual understanding to the moment users pose specific questions. DVI's core principle is "Index for locating, not understanding"--achieving page localization through structured metadata indexes and BM25 full-text search, then sending original images along with specific questions to a VLM for targeted analysis. Experiments on two real industrial engineering drawings (113 pages + 7 pages) demonstrate that DVI achieves comparable overall accuracy at zero ingestion VLM cost (46.7% vs. 48.9%), an effectiveness rate of 50% on visually necessary queries (vs. 0% for pre-ingestion), and 100% page localization (98% search space compression). DVI also supports interactive refinement and progressive caching, transforming the "QA accuracy" problem into a "page localization" problem--once the correct drawing page is found, obtaining the answer becomes a matter of interaction rounds.
18. 【2602.14110】MixFormer: Co-Scaling Up Dense and Sequence in Industrial Recommenders
链接:https://arxiv.org/abs/2602.14110
作者:Xu Huang,Hao Zhang,Zhifang Fan,Yunwen Huang,Zhuoxing Wei,Zheng Chai,Jinan Ni,Yuchao Zheng,Qiwei Chen
类目:Information Retrieval (cs.IR)
关键词:Transformer architectures, scaling-driven regime, recommender systems enter, enter a scaling-driven, increasingly attractive
备注:
点击查看摘要
Abstract:As industrial recommender systems enter a scaling-driven regime, Transformer architectures have become increasingly attractive for scaling models towards larger capacity and longer sequence. However, existing Transformer-based recommendation models remain structurally fragmented, where sequence modeling and feature interaction are implemented as separate modules with independent parameterization. Such designs introduce a fundamental co-scaling challenge, as model capacity must be suboptimally allocated between dense feature interaction and sequence modeling under a limited computational budget. In this work, we propose MixFormer, a unified Transformer-style architecture tailored for recommender systems, which jointly models sequential behaviors and feature interactions within a single backbone. Through a unified parameterization, MixFormer enables effective co-scaling across both dense capacity and sequence length, mitigating the trade-off observed in decoupled designs. Moreover, the integrated architecture facilitates deep interaction between sequential and non-sequential representations, allowing high-order feature semantics to directly inform sequence aggregation and enhancing overall expressiveness. To ensure industrial practicality, we further introduce a user-item decoupling strategy for efficiency optimizations that significantly reduce redundant computation and inference latency. Extensive experiments on large-scale industrial datasets demonstrate that MixFormer consistently exhibits superior accuracy and efficiency. Furthermore, large-scale online A/B tests on two production recommender systems, Douyin and Douyin Lite, show consistent improvements in user engagement metrics, including active days and in-app usage duration.
19. 【2602.13971】DAIAN: Deep Adaptive Intent-Aware Network for CTR Prediction in Trigger-Induced Recommendation
链接:https://arxiv.org/abs/2602.13971
作者:Zhihao Lv,Longtao Zhang,Ailong He,Shuzhi Cao,Shuguang Han,Jufeng Chen
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:e-commerce shopping experiences, personalizing e-commerce shopping, shopping experiences, essential for personalizing, recommendation system overemphasizes
备注:
点击查看摘要
Abstract:Recommendation systems are essential for personalizing e-commerce shopping experiences. Among these, Trigger-Induced Recommendation (TIR) has emerged as a key scenario, which utilizes a trigger item (explicitly represents a user's instantaneous interest), enabling precise, real-time recommendations. Although several trigger-based techniques have been proposed, most of them struggle to address the intent myopia issue, that is, a recommendation system overemphasizes the role of trigger items and narrowly focuses on suggesting commodities that are highly relevant to trigger items. Meanwhile, existing methods rely on collaborative behavior patterns between trigger and recommended items to identify the user's preferences, yet the sparsity of ID-based interaction restricts their effectiveness. To this end, we propose the Deep Adaptive Intent-Aware Network (DAIAN) that dynamically adapts to users' intent preferences. In general, we first extract the users' personalized intent representations by analyzing the correlation between a user's click and the trigger item, and accordingly retrieve the user's related historical behaviors to mine the user's diverse intent. Besides, sparse collaborative behaviors constrain the performance in capturing items associated with user intent. Hence, we reinforce similarity by leveraging a hybrid enhancer with ID and semantic information, followed by adaptive selection based on varying intents. Experimental results on public datasets and our industrial e-commerce datasets demonstrate the effectiveness of DAIAN.
20. 【2602.13868】Agentic Assistant for 6G: Turn-based Conversations for AI-RAN Hierarchical Co-Management
链接:https://arxiv.org/abs/2602.13868
作者:Udhaya Srinivasan,Weisi Guo
类目:Networking and Internet Architecture (cs.NI); Information Retrieval (cs.IR)
关键词:radio access networks, manage in real-time, radio access, increasingly difficult, engineers to manage
备注: submitted to IEEE conference
点击查看摘要
Abstract:New generations of radio access networks (RAN), especially with native AI services are increasingly difficult for human engineers to manage in real-time. Enterprise networks are often managed locally, where expertise is scarce. Existing research has focused on creating Retrieval-Augmented Generation (RAG) LLMs that can help to plan and configure RAN and core aspects only. Co-management of RAN and edge AI is the gap, which creates hierarchical and dynamic problems that require turn-based human interactions. Here, we create an agentic network manager and turn-based conversation assistant that can understand human intent-based queries that match hierarchical problems in AI-RAN. The framework constructed consists of: (a) a user interface and evaluation dashboard, (b) an intelligence layer that interfaces with the AI-RAN, and (c) a knowledge layer for providing the basis for evaluations and recommendations. These form 3 layers of capability with the following validation performances (average response time 13s): (1) design and planning a service (78\% accuracy), (2) operating specific AI-RAN tools (89\% accuracy), and (3) tuning AI-RAN performance (67\%). These initial results indicate the universal challenges of hallucination but also fast response performance success that can really reduce OPEX costs for small scale enterprise users.
21. 【2602.13855】From Fluent to Verifiable: Claim-Level Auditability for Deep Research Agents
链接:https://arxiv.org/abs/2602.13855
作者:Razeen A Rasheed,Somnath Banerjee,Animesh Mukherjee,Rima Hazra
类目:Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
关键词:fluent scientific report, research agent produces, Auditable Autonomous Research, deep research agent, deep research agents
备注:
点击查看摘要
Abstract:A deep research agent produces a fluent scientific report in minutes; a careful reader then tries to verify the main claims and discovers the real cost is not reading, but tracing: which sentence is supported by which passage, what was ignored, and where evidence conflicts. We argue that as research generation becomes cheap, auditability becomes the bottleneck, and the dominant risk shifts from isolated factual errors to scientifically styled outputs whose claim-evidence links are weak, missing, or misleading. This perspective proposes claim-level auditability as a first-class design and evaluation target for deep research agents, distills recurring long-horizon failure modes (objective drift, transient constraints, and unverifiable inference), and introduces the Auditable Autonomous Research (AAR) standard, a compact measurement framework that makes auditability testable via provenance coverage, provenance soundness, contradiction transparency, and audit effort. We then argue for semantic provenance with protocolized validation: persistent, queryable provenance graphs that encode claim--evidence relations (including conflicts) and integrate continuous validation during synthesis rather than after publication, with practical instrumentation patterns to support deployment at scale.
22. 【2602.13830】A Tale of Two Graphs: Separating Knowledge Exploration from Outline Structure for Open-Ended Deep Research
链接:https://arxiv.org/abs/2602.13830
作者:Zhuofan Shi,Ming Ma,Zekun Yao,Fangkai Yang,Jue Zhang,Dongge Han,Victor Rühle,Qingwei Lin,Saravan Rajmohan,Dongmei Zhang
类目:Information Retrieval (cs.IR)
关键词:Open-Ended Deep Research, Deep Research, pushes LLM agents, Open-Ended Deep, existing OEDR agents
备注: 26 pages, 4 figures
点击查看摘要
Abstract:Open-Ended Deep Research (OEDR) pushes LLM agents beyond short-form QA toward long-horizon workflows that iteratively search, connect, and synthesize evidence into structured reports. However, existing OEDR agents largely follow either linear ``search-then-generate'' accumulation or outline-centric planning. The former suffers from lost-in-the-middle failures as evidence grows, while the latter relies on the LLM to implicitly infer knowledge gaps from the outline alone, providing weak supervision for identifying missing relations and triggering targeted exploration. We present DualGraph memory, an architecture that separates what the agent knows from how it writes. DualGraph maintains two co-evolving graphs: an Outline Graph (OG), and a Knowledge Graph (KG), a semantic memory that stores fine-grained knowledge units, including core entities, concepts, and their relations. By analyzing the KG topology together with structural signals from the OG, DualGraph generates targeted search queries, enabling more efficient and comprehensive iterative knowledge-driven exploration and refinement. Across DeepResearch Bench, DeepResearchGym, and DeepConsult, DualGraph consistently outperforms state-of-the-art baselines in report depth, breadth, and factual grounding; for example, it reaches a 53.08 RACE score on DeepResearch Bench with GPT-5. Moreover, ablation studies confirm the central role of the dual-graph design.
23. 【2602.13715】DMESR: Dual-view MLLM-based Enhancing Framework for Multimodal Sequential Recommendation
链接:https://arxiv.org/abs/2602.13715
作者:Mingyao Huang,Qidong Liu,Wenxuan Yang,Moranxin Wang,Yuqi Sun,Haiping Zhu,Feng Tian,Yan Chen
类目:Information Retrieval (cs.IR)
关键词:Sequential Recommender Systems, Recommender Systems, Large Language Models, Multimodal Large Language, Sequential Recommender
备注: 9 pages, 4 figures
点击查看摘要
Abstract:Sequential Recommender Systems (SRS) aim to predict users' next interaction based on their historical behaviors, while still facing the challenge of data sparsity. With the rapid advancement of Multimodal Large Language Models (MLLMs), leveraging their multimodal understanding capabilities to enrich item semantic representation has emerged as an effective enhancement strategy for SRS. However, existing MLLM-enhanced recommendation methods still suffer from two key limitations. First, they struggle to effectively align multimodal representations, leading to suboptimal utilization of semantic information across modalities. Second, they often overly rely on MLLM-generated content while overlooking the fine-grained semantic cues contained in the original textual data of items. To address these issues, we propose a Dual-view MLLM-based Enhancing framework for multimodal Sequential Recommendation (DMESR). For the misalignment issue, we employ a contrastive learning mechanism to align the cross-modal semantic representations generated by MLLMs. For the loss of fine-grained semantics, we introduce a cross-attention fusion module that integrates the coarse-grained semantic knowledge obtained from MLLMs with the fine-grained original textual semantics. Finally, these two fused representations can be seamlessly integrated into the downstream sequential recommendation models. Extensive experiments conducted on three real-world datasets and three popular sequential recommendation architectures demonstrate the superior effectiveness and generalizability of our proposed approach.
24. 【2602.13704】Pailitao-VL: Unified Embedding and Reranker for Real-Time Multi-Modal Industrial Search
链接:https://arxiv.org/abs/2602.13704
作者:Lei Chen,Chen Ju,Xu Chen,Zhicheng Wang,Yuheng Jiao,Hongfeng Zhan,Zhaoyang Li,Shihao Xu,Zhixiang Zhao,Tong Jia,Jinsong Lan,Xiaoyong Zhu,Bo Zheng
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:real-time industrial search, comprehensive multi-modal retrieval, engineered for high-precision, real-time industrial, industrial search
备注:
点击查看摘要
Abstract:In this work, we presented Pailitao-VL, a comprehensive multi-modal retrieval system engineered for high-precision, real-time industrial search. We here address three critical challenges in the current SOTA solution: insufficient retrieval granularity, vulnerability to environmental noise, and prohibitive efficiency-performance gap. Our primary contribution lies in two fundamental paradigm shifts. First, we transitioned the embedding paradigm from traditional contrastive learning to an absolute ID-recognition task. Through anchoring instances to a globally consistent latent space defined by billions of semantic prototypes, we successfully overcome the stochasticity and granularity bottlenecks inherent in existing embedding solutions. Second, we evolved the generative reranker from isolated pointwise evaluation to the compare-and-calibrate listwise policy. By synergizing chunk-based comparative reasoning with calibrated absolute relevance scoring, the system achieves nuanced discriminative resolution while circumventing the prohibitive latency typically associated with conventional reranking methods. Extensive offline benchmarks and online A/B tests on Alibaba e-commerce platform confirm that Pailitao-VL achieves state-of-the-art performance and delivers substantial business impact. This work demonstrates a robust and scalable path for deploying advanced MLLM-based retrieval architectures in demanding, large-scale production environments.
25. 【2602.13647】PT-RAG: Structure-Fidelity Retrieval-Augmented Generation for Academic Papers
链接:https://arxiv.org/abs/2602.13647
作者:Rui Yu,Tianyi Wang,Ruixia Liu,Yinglong Wang
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:Retrieval-augmented generation, long academic papers, academic papers, fixed token budget, increasingly applied
备注:
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) is increasingly applied to question-answering over long academic papers, where accurate evidence allocation under a fixed token budget is critical. Existing approaches typically flatten academic papers into unstructured chunks during preprocessing, which destroys the native hierarchical structure. This loss forces retrieval to operate in a disordered space, thereby producing fragmented contexts, misallocating tokens to non-evidential regions under finite token budgets, and increasing the reasoning burden for downstream language models. To address these issues, we propose PT-RAG, an RAG framework that treats the native hierarchical structure of academic papers as a low-entropy retrieval prior. PT-RAG first inherits the native hierarchy to construct a structure-fidelity PaperTree index, which prevents entropy increase at the source. It then designs a path-guided retrieval mechanism that aligns query semantics to relevant sections and selects high relevance root-to-leaf paths under a fixed token budget, yielding compact, coherent, and low-entropy retrieval contexts. In contrast to existing RAG approaches, PT-RAG avoids entropy increase caused by destructive preprocessing and provides a native low-entropy structural basis for subsequent retrieval. To assess this design, we introduce entropy-based structural diagnostics that quantify retrieval fragmentation and evidence allocation accuracy. On three academic question-answering benchmarks, PT-RAG achieves consistently lower section entropy and evidence alignment cross entropy than strong baselines, indicating reduced context fragmentation and more precise allocation to evidential regions. These structural advantages directly translate into higher answer quality.
26. 【2602.13631】GEMs: Breaking the Long-Sequence Barrier in Generative Recommendation with a Multi-Stream Decoder
链接:https://arxiv.org/abs/2602.13631
作者:Yu Zhou,Chengcheng Guo,Kuo Cai,Ji Liu,Qiang Luo,Ruiming Tang,Han Li,Kun Gai,Guorui Zhou
类目:Information Retrieval (cs.IR)
关键词:possess strong sequential, sequential reasoning capabilities, strong sequential reasoning, face significant challenges, capturing users' lifelong
备注:
点击查看摘要
Abstract:While generative recommendations (GR) possess strong sequential reasoning capabilities, they face significant challenges when processing extremely long user behavior sequences: the high computational cost forces practical sequence lengths to be limited, preventing models from capturing users' lifelong interests; meanwhile, the inherent "recency bias" of attention mechanisms further weakens learning from long-term history. To overcome this bottleneck, we propose GEMs (Generative rEcommendation with a Multi-stream decoder), a novel and unified framework designed to break the long-sequence barrier by capturing users' lifelong interaction sequences through a multi-stream perspective. Specifically, GEMs partitions user behaviors into three temporal streams$\unicode{x2014}$Recent, Mid-term, and Lifecycle$\unicode{x2014}$and employs tailored inference schemes for each: a one-stage real-time extractor for immediate dynamics, a lightweight indexer for cross attention to balance accuracy and cost for mid-term sequences, and a two-stage offline-online compression module for lifelong modeling. These streams are integrated via a parameter-free fusion strategy to enable holistic interest representation. Extensive experiments on large-scale industrial datasets demonstrate that GEMs significantly outperforms state-of-the-art methods in recommendation accuracy. Notably, GEMs is the first lifelong GR framework successfully deployed in a high-concurrency industrial environment, achieving superior inference efficiency while processing user sequences of over 100,000 interactions.
27. 【2602.13581】Climber-Pilot: A Non-Myopic Generative Recommendation Model Towards Better Instruction-Following
链接:https://arxiv.org/abs/2602.13581
作者:Da Guo,Shijia Wang,Qiang Xiao,Yintao Ren,Weisheng Li,Songpei Xu,Ming Yue,Bin Huang,Guanlin Wu,Chuanjiang Luo
类目:Information Retrieval (cs.IR)
关键词:offering superior sequence, traditional dual-tower architectures, superior sequence modeling, sequence modeling capabilities, offering superior
备注:
点击查看摘要
Abstract:Generative retrieval has emerged as a promising paradigm in recommender systems, offering superior sequence modeling capabilities over traditional dual-tower architectures. However, in large-scale industrial scenarios, such models often suffer from inherent myopia: due to single-step inference and strict latency constraints, they tend to collapse diverse user intents into locally optimal predictions, failing to capture long-horizon and multi-item consumption patterns. Moreover, real-world retrieval systems must follow explicit retrieval instructions, such as category-level control and policy constraints. Incorporating such instruction-following behavior into generative retrieval remains challenging, as existing conditioning or post-hoc filtering approaches often compromise relevance or efficiency. In this work, we present Climber-Pilot, a unified generative retrieval framework to address both limitations. First, we introduce Time-Aware Multi-Item Prediction (TAMIP), a novel training paradigm designed to mitigate inherent myopia in generative retrieval. By distilling long-horizon, multi-item foresight into model parameters through time-aware masking, TAMIP alleviates locally optimal predictions while preserving efficient single-step inference. Second, to support flexible instruction-following retrieval, we propose Condition-Guided Sparse Attention (CGSA), which incorporates business constraints directly into the generative process via sparse attention, without introducing additional inference steps. Extensive offline experiments and online A/B testing at NetEase Cloud Music, one of the largest music streaming platforms, demonstrate that Climber-Pilot significantly outperforms state-of-the-art baselines, achieving a 4.24\% lift of the core business metric.
28. 【2602.13573】Unleash the Potential of Long Semantic IDs for Generative Recommendation
链接:https://arxiv.org/abs/2602.13573
作者:Ming Xia,Zhiqin Zhou,Guoxin Ma,Dongmin Huang
类目:Information Retrieval (cs.IR)
关键词:ID-based generative recommendation, generative recommendation represents, recommendation represents items, Optimized Product Quantization, Semantic ID-based generative
备注: 14 pages, 12 figures, conference
点击查看摘要
Abstract:Semantic ID-based generative recommendation represents items as sequences of discrete tokens, but it inherently faces a trade-off between representational expressiveness and computational efficiency. Residual Quantization (RQ)-based approaches restrict semantic IDs to be short to enable tractable sequential modeling, while Optimized Product Quantization (OPQ)-based methods compress long semantic IDs through naive rigid aggregation, inevitably discarding fine-grained semantic information. To resolve this dilemma, we propose ACERec, a novel framework that decouples the granularity gap between fine-grained tokenization and efficient sequential modeling. It employs an Attentive Token Merger to distill long expressive semantic tokens into compact latents and introduces a dedicated Intent Token serving as a dynamic prediction anchor. To capture cohesive user intents, we guide the learning process via a dual-granularity objective, harmonizing fine-grained token prediction with global item-level semantic alignment. Extensive experiments on six real-world benchmarks demonstrate that ACERec consistently outperforms state-of-the-art baselines, achieving an average improvement of 14.40\% in NDCG@10, effectively reconciling semantic expressiveness and computational efficiency.
29. 【2602.13543】LiveNewsBench: Evaluating LLM Web Search Capabilities with Freshly Curated News
链接:https://arxiv.org/abs/2602.13543
作者:Yunfan Zhang,Kathleen McKeown,Smaranda Muresan
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, complex fact retrieval, capabilities show strong, show strong potential
备注: An earlier version of this work was publicly available on OpenReview as an ICLR 2026 submission in September 2025
点击查看摘要
Abstract:Large Language Models (LLMs) with agentic web search capabilities show strong potential for tasks requiring real-time information access and complex fact retrieval, yet evaluating such systems remains challenging. We introduce \bench, a rigorous and regularly updated benchmark designed to assess the agentic web search abilities of LLMs. \bench automatically generates fresh question-answer pairs from recent news articles, ensuring that questions require information beyond an LLM's training data and enabling clear separation between internal knowledge and search capability. The benchmark features intentionally difficult questions requiring multi-hop search queries, page visits, and reasoning, making it well-suited for evaluating agentic search behavior. Our automated data curation and question generation pipeline enables frequent benchmark updates and supports construction of a large-scale training dataset for agentic web search models, addressing the scarcity of such data in the research community. To ensure reliable evaluation, we include a subset of human-verified samples in the test set. We evaluate a broad range of systems using \bench, including commercial and open-weight LLMs as well as LLM-based web search APIs. The leaderboard, datasets, and code are publicly available at this http URL.
30. 【2602.13402】InfoCIR: Multimedia Analysis for Composed Image Retrieval
链接:https://arxiv.org/abs/2602.13402
作者:Ioannis Dravilas,Ioannis Kapetangeorgis,Anastasios Latsoudis,Conor McCarthy,Gonçalo Marcelino,Marcel Worring
类目:Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Multimedia (cs.MM)
关键词:describes desired modifications, Composed Image Retrieval, Composed Image, desired modifications, combining a reference
备注: 9+2 pages, 8 figures. Accepted for publication in IEEE PacificVis 2026 (Conference Track). Interactive composed image retrieval (CIR) and ranking explanation
点击查看摘要
Abstract:Composed Image Retrieval (CIR) allows users to search for images by combining a reference image with a text prompt that describes desired modifications. While vision-language models like CLIP have popularized this task by embedding multiple modalities into a joint space, developers still lack tools that reveal how these multimodal prompts interact with embedding spaces and why small wording changes can dramatically alter the results. We present InfoCIR, a visual analytics system that closes this gap by coupling retrieval, explainability, and prompt engineering in a single, interactive dashboard. InfoCIR integrates a state-of-the-art CIR back-end (SEARLE arXiv:2303.15247) with a six-panel interface that (i) lets users compose image + text queries, (ii) projects the top-k results into a low-dimensional space using Uniform Manifold Approximation and Projection (UMAP) for spatial reasoning, (iii) overlays similarity-based saliency maps and gradient-derived token-attribution bars for local explanation, and (iv) employs an LLM-powered prompt enhancer that generates counterfactual variants and visualizes how these changes affect the ranking of user-selected target images. A modular architecture built on Plotly-Dash allows new models, datasets, and attribution methods to be plugged in with minimal effort. We argue that InfoCIR helps diagnose retrieval failures, guides prompt enhancement, and accelerates insight generation during model development. All source code allowing for a reproducible demo is available at this https URL.
31. 【2602.13345】BLUEPRINT Rebuilding a Legacy: Multimodal Retrieval for Complex Engineering Drawings and Documents
链接:https://arxiv.org/abs/2602.13345
作者:Ethan Seefried,Ran Eldegaway,Sanjay Das,Nathaniel Blanchard,Tirthankar Ghosal
类目:Machine Learning (cs.LG); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
关键词:technical records remain, records remain locked, making retrieval difficult, technical records, records remain
备注: 20 pages 8 main + 12 appendix + references
点击查看摘要
Abstract:Decades of engineering drawings and technical records remain locked in legacy archives with inconsistent or missing metadata, making retrieval difficult and often manual. We present Blueprint, a layout-aware multimodal retrieval system designed for large-scale engineering repositories. Blueprint detects canonical drawing regions, applies region-restricted VLM-based OCR, normalizes identifiers (e.g., DWG, part, facility), and fuses lexical and dense retrieval with a lightweight region-level reranker. Deployed on ~770k unlabeled files, it automatically produces structured metadata suitable for cross-facility search. We evaluate Blueprint on a 5k-file benchmark with 350 expert-curated queries using pooled, graded (0/1/2) relevance judgments. Blueprint delivers a 10.1% absolute gain in Success@3 and an 18.9% relative improvement in nDCG@3 over the strongest vision-language baseline}, consistently outperforming across vision, text, and multimodal intents. Oracle ablations reveal substantial headroom under perfect region detection and OCR. We release all queries, runs, annotations, and code to facilitate reproducible evaluation on legacy engineering archives.
Comments:
20 pages 8 main + 12 appendix + references
Subjects:
Machine Learning (cs.LG); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Cite as:
arXiv:2602.13345 [cs.LG]
(or
arXiv:2602.13345v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2602.13345
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
32. 【2602.13239】CrisiSense-RAG: Crisis Sensing Multimodal Retrieval-Augmented Generation for Rapid Disaster Impact Assessment
链接:https://arxiv.org/abs/2602.13239
作者:Yiming Xiao,Kai Yin,Ali Mostafavi
类目:Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:effective emergency response, spatially resolved disaster, Timely and spatially, resolved disaster impact, emergency response
备注: 27 pages, 4 figures
点击查看摘要
Abstract:Timely and spatially resolved disaster impact assessment is essential for effective emergency response. However, automated methods typically struggle with temporal asynchrony. Real-time human reports capture peak hazard conditions while high-resolution satellite imagery is frequently acquired after peak conditions. This often reflects flood recession rather than maximum extent. Naive fusion of these misaligned streams can yield dangerous underestimates when post-event imagery overrides documented peak flooding. We present CrisiSense-RAG, which is a multimodal retrieval-augmented generation framework that reframes impact assessment as evidence synthesis over heterogeneous data sources without disaster-specific fine-tuning. The system employs hybrid dense-sparse retrieval for text sources and CLIP-based retrieval for aerial imagery. A split-pipeline architecture feeds into asynchronous fusion logic that prioritizes real-time social evidence for peak flood extent while treating imagery as persistent evidence of structural damage. Evaluated on Hurricane Harvey across 207 ZIP-code queries, the framework achieves a flood extent MAE of 10.94% to 28.40% and damage severity MAE of 16.47% to 21.65% in zero-shot settings. Prompt-level alignment proves critical for quantitative validity because metric grounding improves damage estimates by up to 4.75 percentage points. These results demonstrate a practical and deployable approach to rapid resilience intelligence under real-world data constraints.
33. 【2602.10833】raining-Induced Bias Toward LLM-Generated Content in Dense Retrieval
链接:https://arxiv.org/abs/2602.10833
作者:William Xion,Wolfgang Nejdl
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:information retrieval applications, acquiring relevant context, language processing tasks, retrieval applications, open-domain natural language
备注: Accepted at ECIR 2026
点击查看摘要
Abstract:Dense retrieval is a promising approach for acquiring relevant context or world knowledge in open-domain natural language processing tasks and is now widely used in information retrieval applications. However, recent reports claim a broad preference for text generated by large language models (LLMs). This bias is called "source bias", and it has been hypothesized that lower perplexity contributes to this effect. In this study, we revisit this claim by conducting a controlled evaluation to trace the emergence of such preferences across training stages and data sources. Using parallel human- and LLM-generated counterparts of the SciFact and Natural Questions (NQ320K) datasets, we compare unsupervised checkpoints with models fine-tuned using in-domain human text, in-domain LLM-generated text, and MS MARCO. Our results show the following: 1) Unsupervised retrievers do not exhibit a uniform pro-LLM preference. The direction and magnitude depend on the dataset. 2) Across the settings tested, supervised fine-tuning on MS MARCO consistently shifts the rankings toward LLM-generated text. 3) In-domain fine-tuning produces dataset-specific and inconsistent shifts in preference. 4) Fine-tuning on LLM-generated corpora induces a pronounced pro-LLM bias. Finally, a retriever-centric perplexity probe involving the reattachment of a language modeling head to the fine-tuned dense retriever encoder indicates agreement with relevance near chance, thereby weakening the explanatory power of perplexity. Our study demonstrates that source bias is a training-induced phenomenon rather than an inherent property of dense retrievers.
34. 【2602.14335】Predicting New Concept-Object Associations in Astronomy by Mining the Literature
链接:https://arxiv.org/abs/2602.14335
作者:Jinchu Li(1),Yuan-Sen Ting(2),Alberto Accomazzi(3),Tirthankar Ghosal(4),Nesar Ramachandra(5) ((1) Georgia Institute of Technology, (2) The Ohio State University, (3) The Center for Astrophysics | Harvard amp; Smithsonian, (4) Oak Ridge National Laboratory, (5) Argonne National Laboratory)
类目:Instrumentation and Methods for Astrophysics (astro-ph.IM); Information Retrieval (cs.IR)
关键词:full astro-ph corpus, full astro-ph, concept-object knowledge graph, corpus through July, astro-ph corpus
备注: Code, data, and full experimental configurations are available at: [this https URL](https://github.com/JinchuLi2002/astro-link-forecasting)
点击查看摘要
Abstract:We construct a concept-object knowledge graph from the full astro-ph corpus through July 2025. Using an automated pipeline, we extract named astrophysical objects from OCR-processed papers, resolve them to SIMBAD identifiers, and link them to scientific concepts annotated in the source corpus. We then test whether historical graph structure can forecast new concept-object associations before they appear in print. Because the concepts are derived from clustering and therefore overlap semantically, we apply an inference-time concept-similarity smoothing step uniformly to all methods. Across four temporal cutoffs on a physically meaningful subset of concepts, an implicit-feedback matrix factorization model (alternating least squares, ALS) with smoothing outperforms the strongest neighborhood baseline (KNN using text-embedding concept similarity) by 16.8% on NDCG@100 (0.144 vs 0.123) and 19.8% on Recall@100 (0.175 vs 0.146), and exceeds the best recency heuristic by 96% and 88%, respectively. These results indicate that historical literature encodes predictive structure not captured by global heuristics or local neighborhood voting, suggesting a path toward tools that could help triage follow-up targets for scarce telescope time.
计算机视觉
1. 【2602.15031】EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing
链接:https://arxiv.org/abs/2602.15031
作者:Yehonathan Litman,Shikun Liu,Dario Seyb,Nicholas Milef,Yang Zhou,Carl Marshall,Shubham Tulsiani,Caleb Leak
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:video foundation models, High-fidelity generative video, leveraging pre-trained video, pre-trained video foundation, significant quality improvements
备注: Project page: [this https URL](https://yehonathanlitman.github.io/edit_ctrl)
点击查看摘要
Abstract:High-fidelity generative video editing has seen significant quality improvements by leveraging pre-trained video foundation models. However, their computational cost is a major bottleneck, as they are often designed to inefficiently process the full video context regardless of the inpainting mask's size, even for sparse, localized edits. In this paper, we introduce EditCtrl, an efficient video inpainting control framework that focuses computation only where it is needed. Our approach features a novel local video context module that operates solely on masked tokens, yielding a computational cost proportional to the edit size. This local-first generation is then guided by a lightweight temporal global context embedder that ensures video-wide context consistency with minimal overhead. Not only is EditCtrl 10 times more compute efficient than state-of-the-art generative editing methods, it even improves editing quality compared to methods designed with full-attention. Finally, we showcase how EditCtrl unlocks new capabilities, including multi-region editing with text prompts and autoregressive content propagation.
2. 【2602.15030】Image Generation with a Sphere Encoder
链接:https://arxiv.org/abs/2602.15030
作者:Kaiyu Yue,Menglin Jia,Ji Hou,Tom Goldstein
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:efficient generative framework, generative framework capable, single forward pass, efficient generative, generative framework
备注: Technical report
点击查看摘要
Abstract:We introduce the Sphere Encoder, an efficient generative framework capable of producing images in a single forward pass and competing with many-step diffusion models using fewer than five steps. Our approach works by learning an encoder that maps natural images uniformly onto a spherical latent space, and a decoder that maps random latent vectors back to the image space. Trained solely through image reconstruction losses, the model generates an image by simply decoding a random point on the sphere. Our architecture naturally supports conditional generation, and looping the encoder/decoder a few times can further enhance image quality. Across several datasets, the sphere encoder approach yields performance competitive with state of the art diffusions, but with a small fraction of the inference cost. Project page is available at this https URL .
3. 【2602.15018】Neurosim: A Fast Simulator for Neuromorphic Robot Perception
链接:https://arxiv.org/abs/2602.15018
作者:Richeek Das,Pratik Chaudhari
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:dynamic vision sensors, RGB cameras, depth sensors, simulating sensors, vision sensors
备注: 13 pages, 6 figures
点击查看摘要
Abstract:Neurosim is a fast, real-time, high-performance library for simulating sensors such as dynamic vision sensors, RGB cameras, depth sensors, and inertial sensors. It can also simulate agile dynamics of multi-rotor vehicles in complex and dynamic environments. Neurosim can achieve frame rates as high as ~2700 FPS on a desktop GPU. Neurosim integrates with a ZeroMQ-based communication library called Cortex to facilitate seamless integration with machine learning and robotics workflows. Cortex provides a high-throughput, low-latency message-passing system for Python and C++ applications, with native support for NumPy arrays and PyTorch tensors. This paper discusses the design philosophy behind Neurosim and Cortex. It demonstrates how they can be used to (i) train neuromorphic perception and control algorithms, e.g., using self-supervised learning on time-synchronized multi-modal data, and (ii) test real-time implementations of these algorithms in closed-loop. Neurosim and Cortex are available at this https URL .
4. 【2602.14989】hermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery
链接:https://arxiv.org/abs/2602.14989
作者:Ayush Shrivastava,Kirtan Gangani,Laksh Jain,Mayank Goel,Nipun Batra
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:achieve strong performance, RGB imagery, Unlike RGB imagery, achieve strong, strong performance
备注: 8 Pages with 2 figures of main content. 2 pages of References. 10 pages of appendix with 6 figures
点击查看摘要
Abstract:Vision language models (VLMs) achieve strong performance on RGB imagery, but they do not generalize to thermal images. Thermal sensing plays a critical role in settings where visible light fails, including nighttime surveillance, search and rescue, autonomous driving, and medical screening. Unlike RGB imagery, thermal images encode physical temperature rather than color or texture, requiring perceptual and reasoning capabilities that existing RGB-centric benchmarks do not evaluate. We introduce ThermEval-B, a structured benchmark of approximately 55,000 thermal visual question answering pairs designed to assess the foundational primitives required for thermal vision language understanding. ThermEval-B integrates public datasets with our newly collected ThermEval-D, the first dataset to provide dense per-pixel temperature maps with semantic body-part annotations across diverse indoor and outdoor environments. Evaluating 25 open-source and closed-source VLMs, we find that models consistently fail at temperature-grounded reasoning, degrade under colormap transformations, and default to language priors or fixed responses, with only marginal gains from prompting or supervised fine-tuning. These results demonstrate that thermal understanding requires dedicated evaluation beyond RGB-centric assumptions, positioning ThermEval as a benchmark to drive progress in thermal vision language modeling.
5. 【2602.14965】PAct: Part-Decomposed Single-View Articulated Object Generation
链接:https://arxiv.org/abs/2602.14965
作者:Qingming Liu,Xinyue Yao,Shuyuan Zhang,Yueci Deng,Guiliang Liu,Zhen Liu,Kui Jia
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:functional part decomposition, decomposition and kinematic, central to interactive, part decomposition, reliable part decomposition
备注: Technical Report(11 figures, 14 pages), Project Page: [this https URL](https://PAct-project.github.io)
点击查看摘要
Abstract:Articulated objects are central to interactive 3D applications, including embodied AI, robotics, and VR/AR, where functional part decomposition and kinematic motion are essential. Yet producing high-fidelity articulated assets remains difficult to scale because it requires reliable part decomposition and kinematic rigging. Existing approaches largely fall into two paradigms: optimization-based reconstruction or distillation, which can be accurate but often takes tens of minutes to hours per instance, and inference-time methods that rely on template or part retrieval, producing plausible results that may not match the specific structure and appearance in the input observation. We introduce a part-centric generative framework for articulated object creation that synthesizes part geometry, composition, and articulation under explicit part-aware conditioning. Our representation models an object as a set of movable parts, each encoded by latent tokens augmented with part identity and articulation cues. Conditioned on a single image, the model generates articulated 3D assets that preserve instance-level correspondence while maintaining valid part structure and motion. The resulting approach avoids per-instance optimization, enables fast feed-forward inference, and supports controllable assembly and articulation, which are important for embodied interaction. Experiments on common articulated categories (e.g., drawers and doors) show improved input consistency, part accuracy, and articulation plausibility over optimization-based and retrieval-driven baselines, while substantially reducing inference time.
6. 【2602.14941】AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories
链接:https://arxiv.org/abs/2602.14941
作者:Zun Wang,Han Lin,Jaehong Yoon,Jaemin Cho,Yue Zhang,Mohit Bansal
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:long horizons remains, Maintaining spatial world, spatial world consistency, spatial world, long horizons
备注: Project website: [this https URL](https://zunwang1.github.io/AnchorWeave)
点击查看摘要
Abstract:Maintaining spatial world consistency over long horizons remains a central challenge for camera-controllable video generation. Existing memory-based approaches often condition generation on globally reconstructed 3D scenes by rendering anchor videos from the reconstructed geometry in the history. However, reconstructing a global 3D scene from multiple views inevitably introduces cross-view misalignment, as pose and depth estimation errors cause the same surfaces to be reconstructed at slightly different 3D locations across views. When fused, these inconsistencies accumulate into noisy geometry that contaminates the conditioning signals and degrades generation quality. We introduce AnchorWeave, a memory-augmented video generation framework that replaces a single misaligned global memory with multiple clean local geometric memories and learns to reconcile their cross-view inconsistencies. To this end, AnchorWeave performs coverage-driven local memory retrieval aligned with the target trajectory and integrates the selected local memories through a multi-anchor weaving controller during generation. Extensive experiments demonstrate that AnchorWeave significantly improves long-term scene consistency while maintaining strong visual quality, with ablation and analysis studies further validating the effectiveness of local geometric conditioning, multi-anchor control, and coverage-driven retrieval.
7. 【2602.14929】Wrivinder: Towards Spatial Intelligence for Geo-locating Ground Images onto Satellite Imagery
链接:https://arxiv.org/abs/2602.14929
作者:Chandrakanth Gudavalli,Tajuddin Manhar Mohammed,Abhay Yadav,Ananth Vishnu Bhaskar,Hardik Prajapati,Cheng Peng,Rama Chellappa,Shivkumar Chandrasekaran,B. S. Manjunath
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Aligning ground-level imagery, large viewpoint gaps, GPS is unreliable, Aligning ground-level, crucial for mapping
备注:
点击查看摘要
Abstract:Aligning ground-level imagery with geo-registered satellite maps is crucial for mapping, navigation, and situational awareness, yet remains challenging under large viewpoint gaps or when GPS is unreliable. We introduce Wrivinder, a zero-shot, geometry-driven framework that aggregates multiple ground photographs to reconstruct a consistent 3D scene and align it with overhead satellite imagery. Wrivinder combines SfM reconstruction, 3D Gaussian Splatting, semantic grounding, and monocular depth--based metric cues to produce a stable zenith-view rendering that can be directly matched to satellite context for metrically accurate camera geo-localization. To support systematic evaluation of this task, which lacks suitable benchmarks, we also release MC-Sat, a curated dataset linking multi-view ground imagery with geo-registered satellite tiles across diverse outdoor environments. Together, Wrivinder and MC-Sat provide a first comprehensive baseline and testbed for studying geometry-centered cross-view alignment without paired supervision. In zero-shot experiments, Wrivinder achieves sub-30\,m geolocation accuracy across both dense and large-area scenes, highlighting the promise of geometry-based aggregation for robust ground-to-satellite localization.
8. 【2602.14901】Picking the Right Specialist: Attentive Neural Process-based Selection of Task-Specialized Models as Tools for Agentic Healthcare Systems
链接:https://arxiv.org/abs/2602.14901
作者:Pramit Saha,Joshua Strong,Mohammad Alsharid,Divyanshu Mishra,J. Alison Noble
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
关键词:agentic healthcare systems, answer clinical queries, healthcare systems, form the backbone, answer clinical
备注:
点击查看摘要
Abstract:Task-specialized models form the backbone of agentic healthcare systems, enabling the agents to answer clinical queries across tasks such as disease diagnosis, localization, and report generation. Yet, for a given task, a single "best" model rarely exists. In practice, each task is better served by multiple competing specialist models where different models excel on different data samples. As a result, for any given query, agents must reliably select the right specialist model from a heterogeneous pool of tool candidates. To this end, we introduce ToolSelect, which adaptively learns model selection for tools by minimizing a population risk over sampled specialist tool candidates using a consistent surrogate of the task-conditional selection loss. Concretely, we propose an Attentive Neural Process-based selector conditioned on the query and per-model behavioral summaries to choose among the specialist models. Motivated by the absence of any established testbed, we, for the first time, introduce an agentic Chest X-ray environment equipped with a diverse suite of task-specialized models (17 disease detection, 19 report generation, 6 visual grounding, and 13 VQA) and develop ToolSelectBench, a benchmark of 1448 queries. Our results demonstrate that ToolSelect consistently outperforms 10 SOTA methods across four different task families.
9. 【2602.14889】Web-Scale Multimodal Summarization using CLIP-Based Semantic Alignment
链接:https://arxiv.org/abs/2602.14889
作者:Mounvik K,N Harshit
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Neural and Evolutionary Computing (cs.NE)
关键词:combining retrieved text, lightweight framework, framework for generating, web sources, introduce Web-Scale Multimodal
备注:
点击查看摘要
Abstract:We introduce Web-Scale Multimodal Summarization, a lightweight framework for generating summaries by combining retrieved text and image data from web sources. Given a user-defined topic, the system performs parallel web, news, and image searches. Retrieved images are ranked using a fine-tuned CLIP model to measure semantic alignment with topic and text. Optional BLIP captioning enables image-only summaries for stronger multimodal this http URL pipeline supports features such as adjustable fetch limits, semantic filtering, summary styling, and downloading structured outputs. We expose the system via a Gradio-based API with controllable parameters and preconfigured this http URL on 500 image-caption pairs with 20:1 contrastive negatives yields a ROC-AUC of 0.9270, an F1-score of 0.6504, and an accuracy of 96.99%, demonstrating strong multimodal alignment. This work provides a configurable, deployable tool for web-scale summarization that integrates language, retrieval, and vision models in a user-extensible pipeline.
10. 【2602.14879】CT-Bench: A Benchmark for Multimodal Lesion Understanding in Computed Tomography
链接:https://arxiv.org/abs/2602.14879
作者:Qingqing Zhu,Qiao Jin,Tejas S. Mathai,Yin Fang,Zhizheng Wang,Yifan Yang,Maame Sarfo-Gyamfi,Benjamin Hou,Ran Gu,Praveen T. S. Balamuralikrishna,Kenneth C. Wang,Ronald M. Summers,Zhiyong Lu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:radiology report content, generate radiology report, Artificial intelligence, automatically delineate lesions, computed tomography
备注:
点击查看摘要
Abstract:Artificial intelligence (AI) can automatically delineate lesions on computed tomography (CT) and generate radiology report content, yet progress is limited by the scarcity of publicly available CT datasets with lesion-level annotations. To bridge this gap, we introduce CT-Bench, a first-of-its-kind benchmark dataset comprising two components: a Lesion Image and Metadata Set containing 20,335 lesions from 7,795 CT studies with bounding boxes, descriptions, and size information, and a multitask visual question answering benchmark with 2,850 QA pairs covering lesion localization, description, size estimation, and attribute categorization. Hard negative examples are included to reflect real-world diagnostic challenges. We evaluate multiple state-of-the-art multimodal models, including vision-language and medical CLIP variants, by comparing their performance to radiologist assessments, demonstrating the value of CT-Bench as a comprehensive benchmark for lesion analysis. Moreover, fine-tuning models on the Lesion Image and Metadata Set yields significant performance gains across both components, underscoring the clinical utility of CT-Bench.
11. 【2602.14846】Multi-dimensional Persistent Sheaf Laplacians for Image Analysis
链接:https://arxiv.org/abs/2602.14846
作者:Xiang Xiang Wang,Guo-Wei Wei
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:persistent sheaf Laplacian, persistent sheaf Laplacians, MPSL, reduced, principal component analysis
备注:
点击查看摘要
Abstract:We propose a multi-dimensional persistent sheaf Laplacian (MPSL) framework on simplicial complexes for image analysis. The proposed method is motivated by the strong sensitivity of commonly used dimensionality reduction techniques, such as principal component analysis (PCA), to the choice of reduced dimension. Rather than selecting a single reduced dimension or averaging results across dimensions, we exploit complementary advantages of multiple reduced dimensions. At a given dimension, image samples are regarded as simplicial complexes, and persistent sheaf Laplacians are utilized to extract a multiscale localized topological spectral representation for individual image samples. Statistical summaries of the resulting spectra are then aggregated across scales and dimensions to form multiscale multi-dimensional image representations. We evaluate the proposed framework on the COIL20 and ETH80 image datasets using standard classification protocols. Experimental results show that the proposed method provides more stable performance across a wide range of reduced dimensions and achieves consistent improvements to PCA-based baselines in moderate dimensional regimes.
12. 【2602.14837】Integrating Affordances and Attention models for Short-Term Object Interaction Anticipation
链接:https://arxiv.org/abs/2602.14837
作者:Lorenzo Mur Labadia,Ruben Martinez-Cantin,Jose J.Guerrero,Giovanni M. Farinella,Antonino Furnari
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Short Term object-interaction, Term object-interaction Anticipation, object-interaction Anticipation consists, Short Term, Term object-interaction
备注:
点击查看摘要
Abstract:Short Term object-interaction Anticipation consists in detecting the location of the next active objects, the noun and verb categories of the interaction, as well as the time to contact from the observation of egocentric video. This ability is fundamental for wearable assistants to understand user goals and provide timely assistance, or to enable human-robot interaction. In this work, we present a method to improve the performance of STA predictions. Our contributions are two-fold: 1 We propose STAformer and STAformer plus plus, two novel attention-based architectures integrating frame-guided temporal pooling, dual image-video attention, and multiscale feature fusion to support STA predictions from an image-input video pair; 2 We introduce two novel modules to ground STA predictions on human behavior by modeling affordances. First, we integrate an environment affordance model which acts as a persistent memory of interactions that can take place in a given physical scene. We explore how to integrate environment affordances via simple late fusion and with an approach which adaptively learns how to best fuse affordances with end-to-end predictions. Second, we predict interaction hotspots from the observation of hands and object trajectories, increasing confidence in STA predictions localized around the hotspot. Our results show significant improvements on Overall Top-5 mAP, with gain up to +23p.p on Ego4D and +31p.p on a novel set of curated EPIC-Kitchens STA labels. We released the code, annotations, and pre-extracted affordances on Ego4D and EPIC-Kitchens to encourage future research in this area.
13. 【2602.14834】Debiasing Central Fixation Confounds Reveals a Peripheral "Sweet Spot" for Human-like Scanpaths in Hard-Attention Vision
链接:https://arxiv.org/abs/2602.14834
作者:Pengcheng Pan,Yonekura Shogo,Yasuo Kuniyosh
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:visual recognition reflect, Human eye movements, Human eye, visual recognition, recognition reflect
备注:
点击查看摘要
Abstract:Human eye movements in visual recognition reflect a balance between foveal sampling and peripheral context. Task-driven hard-attention models for vision are often evaluated by how well their scanpaths match human gaze. However, common scanpath metrics can be strongly confounded by dataset-specific center bias, especially on object-centric datasets. Using Gaze-CIFAR-10, we show that a trivial center-fixation baseline achieves surprisingly strong scanpath scores, approaching many learned policies. This makes standard metrics optimistic and blurs the distinction between genuine behavioral alignment and mere central tendency. We then analyze a hard-attention classifier under constrained vision by sweeping foveal patch size and peripheral context, revealing a peripheral sweet spot: only a narrow range of sensory constraints yields scanpaths that are simultaneously (i) above the center baseline after debiasing and (ii) temporally human-like in movement statistics. To address center bias, we propose GCS (Gaze Consistency Score), a center-debiased composite metric augmented with movement similarity. GCS uncovers a robust sweet spot at medium patch size with both foveal and peripheral vision, that is not obvious from raw scanpath metrics or accuracy alone, and also highlights a "shortcut regime" when the field-of-view becomes too large. We discuss implications for evaluating active perception on object-centric datasets and for designing gaze benchmarks that better separate behavioral alignment from center bias.
14. 【2602.14788】VIPA: Visual Informative Part Attention for Referring Image Segmentation
链接:https://arxiv.org/abs/2602.14788
作者:Yubin Cho,Hyunwoo Yu,Kyeongbo Kong,Kyomin Sohn,Bongjoon Hyun,Suk-Ju Kang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Referring Image Segmentation, Image Segmentation, Referring Image, visual, natural language expression
备注: Preprint
点击查看摘要
Abstract:Referring Image Segmentation (RIS) aims to segment a target object described by a natural language expression. Existing methods have evolved by leveraging the vision information into the language tokens. To more effectively exploit visual contexts for fine-grained segmentation, we propose a novel Visual Informative Part Attention (VIPA) framework for referring image segmentation. VIPA leverages the informative parts of visual contexts, called a visual expression, which can effectively provide the structural and semantic visual target information to the network. This design reduces high-variance cross-modal projection and enhances semantic consistency in an attention mechanism of the referring image segmentation. We also design a visual expression generator (VEG) module, which retrieves informative visual tokens via local-global linguistic context cues and refines the retrieved tokens for reducing noise information and sharing informative visual attributes. This module allows the visual expression to consider comprehensive contexts and capture semantic visual contexts of informative regions. In this way, our framework enables the network's attention to robustly align with the fine-grained regions of interest. Extensive experiments and visual analysis demonstrate the effectiveness of our approach. Our VIPA outperforms the existing state-of-the-art methods on four public RIS benchmarks.
15. 【2602.14771】GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture
链接:https://arxiv.org/abs/2602.14771
作者:Shih-Fang Chen,Jun-Cheng Chen,I-Hong Jhuo,Yen-Yu Lin
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Neural and Evolutionary Computing (cs.NE)
关键词:human visual system, visual system tracks, previously observed information, system tracks objects, fine granularity
备注: Learning Model Adaptation for Adverse and Dynamic Environments
点击查看摘要
Abstract:The human visual system tracks objects by integrating current observations with previously observed information, adapting to target and scene changes, and reasoning about occlusion at fine granularity. In contrast, recent generic object trackers are often optimized for training targets, which limits robustness and generalization in unseen scenarios, and their occlusion reasoning remains coarse, lacking detailed modeling of occlusion patterns. To address these limitations in generalization and occlusion perception, we propose GOT-JEPA, a model-predictive pretraining framework that extends JEPA from predicting image features to predicting tracking models. Given identical historical information, a teacher predictor generates pseudo-tracking models from a clean current frame, and a student predictor learns to predict the same pseudo-tracking models from a corrupted version of the current frame. This design provides stable pseudo supervision and explicitly trains the predictor to produce reliable tracking models under occlusions, distractors, and other adverse observations, improving generalization to dynamic environments. Building on GOT-JEPA, we further propose OccuSolver to enhance occlusion perception for object tracking. OccuSolver adapts a point-centric point tracker for object-aware visibility estimation and detailed occlusion-pattern capture. Conditioned on object priors iteratively generated by the tracker, OccuSolver incrementally refines visibility states, strengthens occlusion handling, and produces higher-quality reference labels that progressively improve subsequent model predictions. Extensive evaluations on seven benchmarks show that our method effectively enhances tracker generalization and robustness.
16. 【2602.14767】SAILS: Segment Anything with Incrementally Learned Semantics for Task-Invariant and Training-Free Continual Learning
链接:https://arxiv.org/abs/2602.14767
作者:Shishir Muralidhara,Didier Stricker,René Schuster
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Continual learning remains, high computational costs, learning remains constrained, Continual learning, repeated retraining
备注: Accepted at IEEE CAI 2026
点击查看摘要
Abstract:Continual learning remains constrained by the need for repeated retraining, high computational costs, and the persistent challenge of forgetting. These factors significantly limit the applicability of continuous learning in real-world settings, as iterative model updates require significant computational resources and inherently exacerbate forgetting. We present SAILS -- Segment Anything with Incrementally Learned Semantics, a training-free framework for Class-Incremental Semantic Segmentation (CISS) that sidesteps these challenges entirely. SAILS leverages foundational models to decouple CISS into two stages: Zero-shot region extraction using Segment Anything Model (SAM), followed by semantic association through prototypes in a fixed feature space. SAILS incorporates selective intra-class clustering, resulting in multiple prototypes per class to better model intra-class variability. Our results demonstrate that, despite requiring no incremental training, SAILS typically surpasses the performance of existing training-based approaches on standard CISS datasets, particularly in long and challenging task sequences where forgetting tends to be most severe. By avoiding parameter updates, SAILS completely eliminates forgetting and maintains consistent, task-invariant performance. Furthermore, SAILS exhibits positive backward transfer, where the introduction of new classes can enhance performance on previous classes.
17. 【2602.14761】Universal Algorithm-Implicit Learning
链接:https://arxiv.org/abs/2602.14761
作者:Stefano Woerner,Seong Joon Oh,Christian F. Baumgartner
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:narrow task distributions, Current meta-learning methods, limiting applicability, Current meta-learning, current meta-learning literature
备注:
点击查看摘要
Abstract:Current meta-learning methods are constrained to narrow task distributions with fixed feature and label spaces, limiting applicability. Moreover, the current meta-learning literature uses key terms like "universal" and "general-purpose" inconsistently and lacks precise definitions, hindering comparability. We introduce a theoretical framework for meta-learning which formally defines practical universality and introduces a distinction between algorithm-explicit and algorithm-implicit learning, providing a principled vocabulary for reasoning about universal meta-learning methods. Guided by this framework, we present TAIL, a transformer-based algorithm-implicit meta-learner that functions across tasks with varying domains, modalities, and label configurations. TAIL features three innovations over prior transformer-based meta-learners: random projections for cross-modal feature encoding, random injection label embeddings that extrapolate to larger label spaces, and efficient inline query processing. TAIL achieves state-of-the-art performance on standard few-shot benchmarks while generalizing to unseen domains. Unlike other meta-learning methods, it also generalizes to unseen modalities, solving text classification tasks despite training exclusively on images, handles tasks with up to 20$\times$ more classes than seen during training, and provides orders-of-magnitude computational savings over prior transformer-based approaches.
18. 【2602.14751】Depth Completion as Parameter-Efficient Test-Time Adaptation
链接:https://arxiv.org/abs/2602.14751
作者:Bingxin Ke,Qunjie Zhou,Jiahui Huang,Xuanchi Ren,Tianchang Shen,Konrad Schindler,Laura Leal-Taixé,Shengyu Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:test-time optimization framework, parameter-efficient test-time optimization, adapts pre-trained, depth completion, sparse geometric cues
备注:
点击查看摘要
Abstract:We introduce CAPA, a parameter-efficient test-time optimization framework that adapts pre-trained 3D foundation models (FMs) for depth completion, using sparse geometric cues. Unlike prior methods that train task-specific encoders for auxiliary inputs, which often overfit and generalize poorly, CAPA freezes the FM backbone. Instead, it updates only a minimal set of parameters using Parameter-Efficient Fine-Tuning (e.g. LoRA or VPT), guided by gradients calculated directly from the sparse observations available at inference time. This approach effectively grounds the foundation model's geometric prior in the scene-specific measurements, correcting distortions and misplaced structures. For videos, CAPA introduces sequence-level parameter sharing, jointly adapting all frames to exploit temporal correlations, improve robustness, and enforce multi-frame consistency. CAPA is model-agnostic, compatible with any ViT-based FM, and achieves state-of-the-art results across diverse condition patterns on both indoor and outdoor datasets. Project page: this http URL.
19. 【2602.14705】It's a Matter of Time: Three Lessons on Long-Term Motion for Perception
链接:https://arxiv.org/abs/2602.14705
作者:Willem Davison,Xinyue Hao,Laura Sevilla-Lara
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:long-term motion, long-term motion information, motion, motion information, long been considered
备注:
点击查看摘要
Abstract:Temporal information has long been considered to be essential for perception. While there is extensive research on the role of image information for perceptual tasks, the role of the temporal dimension remains less well understood: What can we learn about the world from long-term motion information? What properties does long-term motion information have for visual learning? We leverage recent success in point-track estimation, which offers an excellent opportunity to learn temporal representations and experiment on a variety of perceptual tasks. We draw 3 clear lessons: 1) Long-term motion representations contain information to understand actions, but also objects, materials, and spatial information, often even better than images. 2) Long-term motion representations generalize far better than image representations in low-data settings and in zero-shot tasks. 3) The very low dimensionality of motion information makes motion representations a better trade-off between GFLOPs and accuracy than standard video representations, and used together they achieve higher performance than video representations alone. We hope these insights will pave the way for the design of future models that leverage the power of long-term motion information for perception.
20. 【2602.14682】Exposing Diversity Bias in Deep Generative Models: Statistical Origins and Correction of Diversity Error
链接:https://arxiv.org/abs/2602.14682
作者:Farzan Farnia,Mohammad Jalali,Azim Ospanov
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
关键词:Deep generative models, machine learning applications, achieved great success, Deep generative, producing high-quality samples
备注:
点击查看摘要
Abstract:Deep generative models have achieved great success in producing high-quality samples, making them a central tool across machine learning applications. Beyond sample quality, an important yet less systematically studied question is whether trained generative models faithfully capture the diversity of the underlying data distribution. In this work, we address this question by directly comparing the diversity of samples generated by state-of-the-art models with that of test samples drawn from the target data distribution, using recently proposed reference-free entropy-based diversity scores, Vendi and RKE. Across multiple benchmark datasets, we find that test data consistently attains substantially higher Vendi and RKE diversity scores than the generated samples, suggesting a systematic downward diversity bias in modern generative models. To understand the origin of this bias, we analyze the finite-sample behavior of entropy-based diversity scores and show that their expected values increase with sample size, implying that diversity estimated from finite training sets could inherently underestimate the diversity of the true distribution. As a result, optimizing the generators to minimize divergence to empirical data distributions would induce a loss of diversity. Finally, we discuss potential diversity-aware regularization and guidance strategies based on Vendi and RKE as principled directions for mitigating this bias, and provide empirical evidence suggesting their potential to improve the results.
21. 【2602.14679】Universal Image Immunization against Diffusion-based Image Editing via Semantic Injection
链接:https://arxiv.org/abs/2602.14679
作者:Chanhui Lee,Seunghyun Shin,Donggyu Choi,Hae-gon Jeon,Jeany Son
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:natural language prompts, Recent advances, enabled powerful image, editing capabilities guided, language prompts
备注: Working paper
点击查看摘要
Abstract:Recent advances in diffusion models have enabled powerful image editing capabilities guided by natural language prompts, unlocking new creative possibilities. However, they introduce significant ethical and legal risks, such as deepfakes and unauthorized use of copyrighted visual content. To address these risks, image immunization has emerged as a promising defense against AI-driven semantic manipulation. Yet, most existing approaches rely on image-specific adversarial perturbations that require individual optimization for each image, thereby limiting scalability and practicality. In this paper, we propose the first universal image immunization framework that generates a single, broadly applicable adversarial perturbation specifically designed for diffusion-based editing pipelines. Inspired by universal adversarial perturbation (UAP) techniques used in targeted attacks, our method generates a UAP that embeds a semantic target into images to be protected. Simultaneously, it suppresses original content to effectively misdirect the model's attention during editing. As a result, our approach effectively blocks malicious editing attempts by overwriting the original semantic content in the image via the UAP. Moreover, our method operates effectively even in data-free settings without requiring access to training data or domain knowledge, further enhancing its practicality and broad applicability in real-world scenarios. Extensive experiments show that our method, as the first universal immunization approach, significantly outperforms several baselines in the UAP setting. In addition, despite the inherent difficulty of universal perturbations, our method also achieves performance on par with image-specific methods under a more restricted perturbation budget, while also exhibiting strong black-box transferability across different diffusion models.
22. 【2602.14672】MeFEm: Medical Face Embedding model
链接:https://arxiv.org/abs/2602.14672
作者:Yury Borets,Stepan Botman
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Embedding Predictive Architecture, Joint Embedding Predictive, modified Joint Embedding, Predictive Architecture, Joint Embedding
备注:
点击查看摘要
Abstract:We present MeFEm, a vision model based on a modified Joint Embedding Predictive Architecture (JEPA) for biometric and medical analysis from facial images. Key modifications include an axial stripe masking strategy to focus learning on semantically relevant regions, a circular loss weighting scheme, and the probabilistic reassignment of the CLS token for high quality linear probing. Trained on a consolidated dataset of curated images, MeFEm outperforms strong baselines like FaRL and Franca on core anthropometric tasks despite using significantly less data. It also shows promising results on Body Mass Index (BMI) estimation, evaluated on a novel, consolidated closed-source dataset that addresses the domain bias prevalent in existing data. Model weights are available at this https URL , offering a strong baseline for future work in this domain.
23. 【2602.14662】Advances in Global Solvers for 3D Vision
链接:https://arxiv.org/abs/2602.14662
作者:Zhenjun Zhao,Heng Yang,Bangyan Liao,Yingping Zeng,Shaocheng Yan,Yingdong Gu,Peidong Liu,Yi Zhou,Haoang Li,Javier Civera
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:offering certifiable solutions, problems traditionally addressed, heuristic methods, Global solvers, solutions to nonconvex
备注: Comprehensive survey; 37 pages, 7 figures, 3 tables. Project page with literature tracking and code tutorials: [this https URL](https://github.com/ericzzj1989/Awesome-Global-Solvers-for-3D-Vision)
点击查看摘要
Abstract:Global solvers have emerged as a powerful paradigm for 3D vision, offering certifiable solutions to nonconvex geometric optimization problems traditionally addressed by local or heuristic methods. This survey presents the first systematic review of global solvers in geometric vision, unifying the field through a comprehensive taxonomy of three core paradigms: Branch-and-Bound (BnB), Convex Relaxation (CR), and Graduated Non-Convexity (GNC). We present their theoretical foundations, algorithmic designs, and practical enhancements for robustness and scalability, examining how each addresses the fundamental nonconvexity of geometric estimation problems. Our analysis spans ten core vision tasks, from Wahba problem to bundle adjustment, revealing the optimality-robustness-scalability trade-offs that govern solver selection. We identify critical future directions: scaling algorithms while maintaining guarantees, integrating data-driven priors with certifiable optimization, establishing standardized benchmarks, and addressing societal implications for safety-critical deployment. By consolidating theoretical foundations, practical advances, and broader impacts, this survey provides a unified perspective and roadmap toward certifiable, trustworthy perception for real-world applications. A continuously-updated literature summary and companion code tutorials are available at this https URL.
24. 【2602.14648】SketchingReality: From Freehand Scene Sketches To Photorealistic Images
链接:https://arxiv.org/abs/2602.14648
作者:Ahmed Bourouis,Mikhail Bessmeltsev,Yulia Gryaditskaya
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:witnessed remarkable progress, Recent years, natural language emerging, years have witnessed, witnessed remarkable
备注:
点击查看摘要
Abstract:Recent years have witnessed remarkable progress in generative AI, with natural language emerging as the most common conditioning input. As underlying models grow more powerful, researchers are exploring increasingly diverse conditioning signals, such as depth maps, edge maps, camera parameters, and reference images, to give users finer control over generation. Among different modalities, sketches are a natural and long-standing form of human communication, enabling rapid expression of visual concepts. Previous literature has largely focused on edge maps, often misnamed 'sketches', yet algorithms that effectively handle true freehand sketches, with their inherent abstraction and distortions, remain underexplored. We pursue the challenging goal of balancing photorealism with sketch adherence when generating images from freehand input. A key obstacle is the absence of ground-truth, pixel-aligned images: by their nature, freehand sketches do not have a single correct alignment. To address this, we propose a modulation-based approach that prioritizes semantic interpretation of the sketch over strict adherence to individual edge positions. We further introduce a novel loss that enables training on freehand sketches without requiring ground-truth pixel-aligned images. We show that our method outperforms existing approaches in both semantic alignment with freehand sketch inputs and in the realism and overall quality of the generated images.
25. 【2602.14633】VIGIL: Tackling Hallucination Detection in Image Recontextualization
链接:https://arxiv.org/abs/2602.14633
作者:Joanna Wojciechowicz,Maria Łubniewska,Jakub Antczak,Justyna Baczyńska,Wojciech Gromski,Wojciech Kozłowski,Maciej Zięba
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Generative In-context Lucidity, Visual Inconsistency Generative, Inconsistency Generative In-context, Visual Inconsistency, In-context Lucidity
备注: 10 pages, 6 figures, 4 tables. Code and data are available at: [this https URL](https://github.com/mlubneuskaya/vigil) and [this https URL](https://huggingface.co/datasets/joannaww/VIGIL)
点击查看摘要
Abstract:We introduce VIGIL (Visual Inconsistency Generative In-context Lucidity), the first benchmark dataset and framework providing a fine-grained categorization of hallucinations in the multimodal image recontextualization task for large multimodal models (LMMs). While existing research often treats hallucinations as a uniform issue, our work addresses a significant gap in multimodal evaluation by decomposing these errors into five categories: pasted object hallucinations, background hallucinations, object omission, positional logical inconsistencies, and physical law violations. To address these complexities, we propose a multi-stage detection pipeline. Our architecture processes recontextualized images through a series of specialized steps targeting object-level fidelity, background consistency, and omission detection, leveraging a coordinated ensemble of open-source models, whose effectiveness is demonstrated through extensive experimental evaluations. Our approach enables a deeper understanding of where the models fail with an explanation; thus, we fill a gap in the field, as no prior methods offer such categorization and decomposition for this task. To promote transparency and further exploration, we openly release VIGIL, along with the detection pipeline and benchmark code, through our GitHub repository: this https URL and Data repository: this https URL.
26. 【2602.14615】VariViT: A Vision Transformer for Variable Image Sizes
链接:https://arxiv.org/abs/2602.14615
作者:Aswathi Varma,Suprosanna Shit,Chinmay Prabhakar,Daniel Scholz,Hongwei Bran Li,Bjoern Menze,Daniel Rueckert,Benedikt Wiestler
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Vision Transformers, leveraging self-attention mechanisms, leveraging self-attention, self-attention mechanisms, mechanisms to excel
备注:
点击查看摘要
Abstract:Vision Transformers (ViTs) have emerged as the state-of-the-art architecture in representation learning, leveraging self-attention mechanisms to excel in various tasks. ViTs split images into fixed-size patches, constraining them to a predefined size and necessitating pre-processing steps like resizing, padding, or cropping. This poses challenges in medical imaging, particularly with irregularly shaped structures like tumors. A fixed bounding box crop size produces input images with highly variable foreground-to-background ratios. Resizing medical images can degrade information and introduce artefacts, impacting diagnosis. Hence, tailoring variable-sized crops to regions of interest can enhance feature representation capabilities. Moreover, large images are computationally expensive, and smaller sizes risk information loss, presenting a computation-accuracy tradeoff. We propose VariViT, an improved ViT model crafted to handle variable image sizes while maintaining a consistent patch size. VariViT employs a novel positional embedding resizing scheme for a variable number of patches. We also implement a new batching strategy within VariViT to reduce computational complexity, resulting in faster training and inference times. In our evaluations on two 3D brain MRI datasets, VariViT surpasses vanilla ViTs and ResNet in glioma genotype prediction and brain tumor classification. It achieves F1-scores of 75.5% and 76.3%, respectively, learning more discriminative features. Our proposed batching strategy reduces computation time by up to 30% compared to conventional architectures. These findings underscore the efficacy of VariViT in image representation learning. Our code can be found here: this https URL
27. 【2602.14582】YOLO26: A Comprehensive Architecture Overview and Key Improvements
链接:https://arxiv.org/abs/2602.14582
作者:Priyanto Hidayatullah,Refdinal Tubagus
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Distribution Focal Loss, YOLO, computer vision, YOLO series, YOLO model
备注:
点击查看摘要
Abstract:You Only Look Once (YOLO) has been the prominent model for computer vision in deep learning for a decade. This study explores the novel aspects of YOLO26, the most recent version in the YOLO series. The elimination of Distribution Focal Loss (DFL), implementation of End-to-End NMS-Free Inference, introduction of ProgLoss + Small-Target-Aware Label Assignment (STAL), and use of the MuSGD optimizer are the primary enhancements designed to improve inference speed, which is claimed to achieve a 43% boost in CPU mode. This is designed to allow YOLO26 to attain real-time performance on edge devices or those without GPUs. Additionally, YOLO26 offers improvements in many computer vision tasks, including instance segmentation, pose estimation, and oriented bounding box (OBB) decoding. We aim for this effort to provide more value than just consolidating information already included in the existing technical documentation. Therefore, we performed a rigorous architectural investigation into YOLO26, mostly using the source code available in its GitHub repository and its official documentation. The authentic and detailed operational mechanisms of YOLO26 are inside the source code, which is seldom extracted by others. The YOLO26 architectural diagram is shown as the outcome of the investigation. This study is, to our knowledge, the first one presenting the CNN-based YOLO26 architecture, which is the core of YOLO26. Our objective is to provide a precise architectural comprehension of YOLO26 for researchers and developers aspiring to enhance the YOLO model, ensuring it remains the leading deep learning model in computer vision.
28. 【2602.14577】DriveFine: Refining-Augmented Masked Diffusion VLA for Precise and Robust Driving
链接:https://arxiv.org/abs/2602.14577
作者:Chenxu Dang,Sining Ang,Yongkang Li,Haochen Tian,Jie Wang,Guang Li,Hangjun Ye,Jie Ma,Long Chen,Yan Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:autonomous driving increasingly, driving increasingly adopt, increasingly adopt generative, generative planners trained, adopt generative planners
备注:
点击查看摘要
Abstract:Vision-Language-Action (VLA) models for autonomous driving increasingly adopt generative planners trained with imitation learning followed by reinforcement learning. Diffusion-based planners suffer from modality alignment difficulties, low training efficiency, and limited generalization. Token-based planners are plagued by cumulative causal errors and irreversible decoding. In summary, the two dominant paradigms exhibit complementary strengths and weaknesses. In this paper, we propose DriveFine, a masked diffusion VLA model that combines flexible decoding with self-correction capabilities. In particular, we design a novel plug-and-play block-MoE, which seamlessly injects a refinement expert on top of the generation expert. By enabling explicit expert selection during inference and gradient blocking during training, the two experts are fully decoupled, preserving the foundational capabilities and generic patterns of the pretrained weights, which highlights the flexibility and extensibility of the block-MoE design. Furthermore, we design a hybrid reinforcement learning strategy that encourages effective exploration of refinement expert while maintaining training stability. Extensive experiments on NAVSIM v1, v2, and Navhard benchmarks demonstrate that DriveFine exhibits strong efficacy and robustness. The code will be released at this https URL.
29. 【2602.14552】OmniVTON++: Training-Free Universal Virtual Try-On with Principal Pose Guidance
链接:https://arxiv.org/abs/2602.14552
作者:Zhaotong Yang,Yong Du,Shengfeng He,Yuhui Li,Xinzhe Li,Yangyang Xu,Junyu Dong,Jian Yang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:realistic person imagery, Image-based Virtual Try-On, Image-based Virtual, concerns the synthesis, body constraints
备注:
点击查看摘要
Abstract:Image-based Virtual Try-On (VTON) concerns the synthesis of realistic person imagery through garment re-rendering under human pose and body constraints. In practice, however, existing approaches are typically optimized for specific data conditions, making their deployment reliant on retraining and limiting their generalization as a unified solution. We present OmniVTON++, a training-free VTON framework designed for universal applicability. It addresses the intertwined challenges of garment alignment, human structural coherence, and boundary continuity by coordinating Structured Garment Morphing for correspondence-driven garment adaptation, Principal Pose Guidance for step-wise structural regulation during diffusion sampling, and Continuous Boundary Stitching for boundary-aware refinement, forming a cohesive pipeline without task-specific retraining. Experimental results demonstrate that OmniVTON++ achieves state-of-the-art performance across diverse generalization settings, including cross-dataset and cross-garment-type evaluations, while reliably operating across scenarios and diffusion backbones within a single formulation. In addition to single-garment, single-human cases, the framework supports multi-garment, multi-human, and anime character virtual try-on, expanding the scope of virtual try-on applications. The source code will be released to the public.
30. 【2602.14534】MoRL: Reinforced Reasoning for Unified Motion Understanding and Generation
链接:https://arxiv.org/abs/2602.14534
作者:Hongpeng Wang,Zeyu Zhang,Wenhao Li,Hao Tang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Human motion understanding, Human motion, crucial for vision, vision and robotics, robotics but remain
备注:
点击查看摘要
Abstract:Human motion understanding and generation are crucial for vision and robotics but remain limited in reasoning capability and test-time planning. We propose MoRL, a unified multimodal motion model trained with supervised fine-tuning and reinforcement learning with verifiable rewards. Our task-specific reward design combines semantic alignment and reasoning coherence for understanding with physical plausibility and text-motion consistency for generation, improving both logical reasoning and perceptual realism. To further enhance inference, we introduce Chain-of-Motion (CoM), a test-time reasoning method that enables step-by-step planning and reflection. We also construct two large-scale CoT datasets, MoUnd-CoT-140K and MoGen-CoT-140K, to align motion sequences with reasoning traces and action descriptions. Experiments on HumanML3D and KIT-ML show that MoRL achieves significant gains over state-of-the-art baselines. Code: this https URL. Website: this https URL.
31. 【2602.14525】Cross-view Domain Generalization via Geometric Consistency for LiDAR Semantic Segmentation
链接:https://arxiv.org/abs/2602.14525
作者:Jindong Zhao,Yuan Gao,Yang Xia,Sheng Nie,Jun Yue,Weiwei Sun,Shaobo Xia
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Domain-generalized LiDAR semantic, real-world LiDAR applications, Domain-generalized LiDAR, seeks to train, LiDAR semantic segmentation
备注:
点击查看摘要
Abstract:Domain-generalized LiDAR semantic segmentation (LSS) seeks to train models on source-domain point clouds that generalize reliably to multiple unseen target domains, which is essential for real-world LiDAR applications. However, existing approaches assume similar acquisition views (e.g., vehicle-mounted) and struggle in cross-view scenarios, where observations differ substantially due to viewpoint-dependent structural incompleteness and non-uniform point density. Accordingly, we formulate cross-view domain generalization for LiDAR semantic segmentation and propose a novel framework, termed CVGC (Cross-View Geometric Consistency). Specifically, we introduce a cross-view geometric augmentation module that models viewpoint-induced variations in visibility and sampling density, generating multiple cross-view observations of the same scene. Subsequently, a geometric consistency module enforces consistent semantic and occupancy predictions across geometrically augmented point clouds of the same scene. Extensive experiments on six public LiDAR datasets establish the first systematic evaluation of cross-view domain generalization for LiDAR semantic segmentation, demonstrating that CVGC consistently outperforms state-of-the-art methods when generalizing from a single source domain to multiple target domains with heterogeneous acquisition viewpoints. The source code will be publicly available at this https URL
32. 【2602.14524】Error Patterns in Historical OCR: A Comparative Analysis of TrOCR and a Vision-Language Model
链接:https://arxiv.org/abs/2602.14524
作者:Ari Vesalainen,Eetu Mäkelä,Laura Ruotsalainen,Mikko Tolonen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Optical Character Recognition, remains challenging due, Character Error Rate, Optical Character, degraded print quality
备注:
点击查看摘要
Abstract:Optical Character Recognition (OCR) of eighteenth-century printed texts remains challenging due to degraded print quality, archaic glyphs, and non-standardized orthography. Although transformer-based OCR systems and Vision-Language Models (VLMs) achieve strong aggregate accuracy, metrics such as Character Error Rate (CER) and Word Error Rate (WER) provide limited insight into their reliability for scholarly use. We compare a dedicated OCR transformer (TrOCR) and a general-purpose Vision-Language Model (Qwen) on line-level historical English texts using length-weighted accuracy metrics and hypothesis driven error analysis. While Qwen achieves lower CER/WER and greater robustness to degraded input, it exhibits selective linguistic regularization and orthographic normalization that may silently alter historically meaningful forms. TrOCR preserves orthographic fidelity more consistently but is more prone to cascading error propagation. Our findings show that architectural inductive biases shape OCR error structure in systematic ways. Models with similar aggregate accuracy can differ substantially in error locality, detectability, and downstream scholarly risk, underscoring the need for architecture-aware evaluation in historical digitization workflows.
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2602.14524 [cs.CV]
(or
arXiv:2602.14524v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.14524
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
33. 【2602.14523】Architectural Insights for Post-Tornado Damage Recognition
链接:https://arxiv.org/abs/2602.14523
作者:Robinson Umeike,Thang Dao,Shane Crawford,John van de Lindt,Blythe Johnston,Wanting(Lisa)Wang,Trung Do,Ajibola Mofikoya,Sarbesh Banjara,Cuong Pham
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:optimizing emergency resource, emergency resource allocation, accelerating community recovery, coordinating life-saving search, accurate building damage
备注:
点击查看摘要
Abstract:Rapid and accurate building damage assessment in the immediate aftermath of tornadoes is critical for coordinating life-saving search and rescue operations, optimizing emergency resource allocation, and accelerating community recovery. However, current automated methods struggle with the unique visual complexity of tornado-induced wreckage, primarily due to severe domain shift from standard pre-training datasets and extreme class imbalance in real-world disaster data. To address these challenges, we introduce a systematic experimental framework evaluating 79 open-source deep learning models, encompassing both Convolutional Neural Networks (CNNs) and Vision Transformers, across over 2,300 controlled experiments on our newly curated Quad-State Tornado Damage (QSTD) benchmark dataset. Our findings reveal that achieving operational-grade performance hinges on a complex interaction between architecture and optimization, rather than architectural selection alone. Most strikingly, we demonstrate that optimizer choice can be more consequential than architecture: switching from Adam to SGD provided dramatic F1 gains of +25 to +38 points for Vision Transformer and Swin Transformer families, fundamentally reversing their ranking from bottom-tier to competitive with top-performing CNNs. Furthermore, a low learning rate of 1x10^(-4) proved universally critical, boosting average F1 performance by +10.2 points across all architectures. Our champion model, ConvNeXt-Base trained with these optimized settings, demonstrated strong cross-event generalization on the held-out Tuscaloosa-Moore Tornado Damage (TMTD) dataset, achieving 46.4% Macro F1 (+34.6 points over baseline) and retaining 85.5% Ordinal Top-1 Accuracy despite temporal and sensor domain shifts.
34. 【2602.14514】Efficient Text-Guided Convolutional Adapter for the Diffusion Model
链接:https://arxiv.org/abs/2602.14514
作者:Aryan Das,Koushik Biswas,Swalpa Kumar Roy,Badri Narayana Patro,Vinay Kumar Verma
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Preserving Conditional Generation, Structure Preserving Conditional, conditional image generation, Conditional Generation, Nexus Prime adapter
备注:
点击查看摘要
Abstract:We introduce the Nexus Adapters, novel text-guided efficient adapters to the diffusion-based framework for the Structure Preserving Conditional Generation (SPCG). Recently, structure-preserving methods have achieved promising results in conditional image generation by using a base model for prompt conditioning and an adapter for structure input, such as sketches or depth maps. These approaches are highly inefficient and sometimes require equal parameters in the adapter compared to the base architecture. It is not always possible to train the model since the diffusion model is itself costly, and doubling the parameter is highly inefficient. In these approaches, the adapter is not aware of the input prompt; therefore, it is optimal only for the structural input but not for the input prompt. To overcome the above challenges, we proposed two efficient adapters, Nexus Prime and Slim, which are guided by prompts and structural inputs. Each Nexus Block incorporates cross-attention mechanisms to enable rich multimodal conditioning. Therefore, the proposed adapter has a better understanding of the input prompt while preserving the structure. We conducted extensive experiments on the proposed models and demonstrated that the Nexus Prime adapter significantly enhances performance, requiring only 8M additional parameters compared to the baseline, T2I-Adapter. Furthermore, we also introduced a lightweight Nexus Slim adapter with 18M fewer parameters than the T2I-Adapter, which still achieved state-of-the-art results. Code: this https URL
35. 【2602.14512】MedVAR: Towards Scalable and Efficient Medical Image Generation via Next-scale Autoregressive Prediction
链接:https://arxiv.org/abs/2602.14512
作者:Zhicheng He,Yunpeng Zhao,Junde Wu,Ziwei Niu,Zijun Li,Lanfen Lin,Yueming Jin
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:low-resource clinical tasks, privacy-preserving data sharing, pivotal in applications, augmentation for low-resource, low-resource clinical
备注: 23 pages, 8 figures
点击查看摘要
Abstract:Medical image generation is pivotal in applications like data augmentation for low-resource clinical tasks and privacy-preserving data sharing. However, developing a scalable generative backbone for medical imaging requires architectural efficiency, sufficient multi-organ data, and principled evaluation, yet current approaches leave these aspects unresolved. Therefore, we introduce MedVAR, the first autoregressive-based foundation model that adopts the next-scale prediction paradigm to enable fast and scale-up-friendly medical image synthesis. MedVAR generates images in a coarse-to-fine manner and produces structured multi-scale representations suitable for downstream use. To support hierarchical generation, we curate a harmonized dataset of around 440,000 CT and MRI images spanning six anatomical regions. Comprehensive experiments across fidelity, diversity, and scalability show that MedVAR achieves state-of-the-art generative performance and offers a promising architectural direction for future medical generative foundation models.
36. 【2602.14509】MacNet: An End-to-End Manifold-Constrained Adaptive Clustering Network for Interpretable Whole Slide Image Classification
链接:https://arxiv.org/abs/2602.14509
作者:Mingrui Ma,Chentao Li,Pan Huang,Jing Qin
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:slide images, diagnosis and sub-typing, gold standard, standard for pathological, pathological diagnosis
备注: Our code is available at [this https URL](https://github.com/Prince-Lee-PathAI/MacNet)
点击查看摘要
Abstract:Whole slide images (WSIs) are the gold standard for pathological diagnosis and sub-typing. Current main-stream two-step frameworks employ offline feature encoders trained without domain-specific knowledge. Among them, attention-based multiple instance learning (MIL) methods are outcome-oriented and offer limited interpretability. Clustering-based approaches can provide explainable decision-making process but suffer from high dimension features and semantically ambiguous centroids. To this end, we propose an end-to-end MIL framework that integrates Grassmann re-embedding and manifold adaptive clustering, where the manifold geometric structure facilitates robust clustering results. Furthermore, we design a prior knowledge guiding proxy instance labeling and aggregation strategy to approximate patch labels and focus on pathologically relevant tumor regions. Experiments on multicentre WSI datasets demonstrate that: 1) our cluster-incorporated model achieves superior performance in both grading accuracy and interpretability; 2) end-to-end learning refines better feature representations and it requires acceptable computation resources.
37. 【2602.14501】Prototype Instance-semantic Disentanglement with Low-rank Regularized Subspace Clustering for WSIs Explainable Recognition
链接:https://arxiv.org/abs/2602.14501
作者:Chentao Li,Pan Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:tumor region plays, pathological diagnosis, region plays, plays a key, key role
备注: Our code is available at [this https URL](https://github.com/Prince-Lee-PathAI/PID-LRSC)
点击查看摘要
Abstract:The tumor region plays a key role in pathological diagnosis. Tumor tissues are highly similar to precancerous lesions and non tumor instances often greatly exceed tumor instances in whole slide images (WSIs). These issues cause instance-semantic entanglement in multi-instance learning frameworks, degrading both model representation capability and interpretability. To address this, we propose an end-to-end prototype instance semantic disentanglement framework with low-rank regularized subspace clustering, PID-LRSC, in two aspects. First, we use secondary instance subspace learning to construct low-rank regularized subspace clustering (LRSC), addressing instance entanglement caused by an excessive proportion of non tumor instances. Second, we employ enhanced contrastive learning to design prototype instance semantic disentanglement (PID), resolving semantic entanglement caused by the high similarity between tumor and precancerous tissues. We conduct extensive experiments on multicentre pathology datasets, implying that PID-LRSC outperforms other SOTA methods. Overall, PID-LRSC provides clearer instance semantics during decision-making and significantly enhances the reliability of auxiliary diagnostic outcomes.
38. 【2602.14498】Uncertainty-Aware Vision-Language Segmentation for Medical Imaging
链接:https://arxiv.org/abs/2602.14498
作者:Aryan Das,Tanishq Rachamalla,Koushik Biswas,Swalpa Kumar Roy,Vinay Kumar Verma
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Decoding Attention Block, State Space Mixer, precise medical diagnosis, Modality Decoding Attention, multimodal segmentation framework
备注:
点击查看摘要
Abstract:We introduce a novel uncertainty-aware multimodal segmentation framework that leverages both radiological images and associated clinical text for precise medical diagnosis. We propose a Modality Decoding Attention Block (MoDAB) with a lightweight State Space Mixer (SSMix) to enable efficient cross-modal fusion and long-range dependency modelling. To guide learning under ambiguity, we propose the Spectral-Entropic Uncertainty (SEU) Loss, which jointly captures spatial overlap, spectral consistency, and predictive uncertainty in a unified objective. In complex clinical circumstances with poor image quality, this formulation improves model reliability. Extensive experiments on various publicly available medical datasets, QATA-COVID19, MosMed++, and Kvasir-SEG, demonstrate that our method achieves superior segmentation performance while being significantly more computationally efficient than existing State-of-the-Art (SoTA) approaches. Our results highlight the importance of incorporating uncertainty modelling and structured modality alignment in vision-language medical segmentation tasks. Code: this https URL
39. 【2602.14493】Gaussian Mesh Renderer for Lightweight Differentiable Rendering
链接:https://arxiv.org/abs/2602.14493
作者:Xinpeng Liu,Fumio Okura
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:enabled high-fidelity virtualization, Gaussian Splatting, view synthesis, enabled high-fidelity, high-fidelity virtualization
备注: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026). GitHub: [this https URL](https://github.com/huntorochi/Gaussian-Mesh-Renderer)
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has enabled high-fidelity virtualization with fast rendering and optimization for novel view synthesis. On the other hand, triangle mesh models still remain a popular choice for surface reconstruction but suffer from slow or heavy optimization in traditional mesh-based differentiable renderers. To address this problem, we propose a new lightweight differentiable mesh renderer leveraging the efficient rasterization process of 3DGS, named Gaussian Mesh Renderer (GMR), which tightly integrates the Gaussian and mesh representations. Each Gaussian primitive is analytically derived from the corresponding mesh triangle, preserving structural fidelity and enabling the gradient flow. Compared to the traditional mesh renderers, our method achieves smoother gradients, which especially contributes to better optimization using smaller batch sizes with limited memory. Our implementation is available in the public GitHub repository at this https URL.
40. 【2602.14486】Revisiting the Platonic Representation Hypothesis: An Aristotelian View
链接:https://arxiv.org/abs/2602.14486
作者:Fabian Gröger,Shuo Wen,Maria Brbić
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
关键词:Platonic Representation Hypothesis, Representation Hypothesis suggests, Representation Hypothesis, Platonic Representation, common statistical model
备注:
点击查看摘要
Abstract:The Platonic Representation Hypothesis suggests that representations from neural networks are converging to a common statistical model of reality. We show that the existing metrics used to measure representational similarity are confounded by network scale: increasing model depth or width can systematically inflate representational similarity scores. To correct these effects, we introduce a permutation-based null-calibration framework that transforms any representational similarity metric into a calibrated score with statistical guarantees. We revisit the Platonic Representation Hypothesis with our calibration framework, which reveals a nuanced picture: the apparent convergence reported by global spectral measures largely disappears after calibration, while local neighborhood similarity, but not local distances, retains significant agreement across different modalities. Based on these findings, we propose the Aristotelian Representation Hypothesis: representations in neural networks are converging to shared local neighborhood relationships.
41. 【2602.14482】kArt: Aperture-Guided Observation for Fine-Grained Visual Reasoning via Reinforcement Learning
链接:https://arxiv.org/abs/2602.14482
作者:Hao Ding,Zhichuan Yang,Weijie Ge,Ziqin Gao,Chaoyi Lu,Lei Zhao
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:global image encoding, single global image, multimodal large language, tiny objects, image encoding
备注:
点击查看摘要
Abstract:We address fine-grained visual reasoning in multimodal large language models (MLLMs), where key evidence may reside in tiny objects, cluttered regions, or subtle markings that are lost under a single global image encoding. We introduce TikArt (Thinking Aperture), an aperture-guided agent that casts multi-step vision-language reasoning as a decision process over regions of interest. TikArt follows a Think-Aperture-Observe loop, alternating between language generation and two aperture actions: Zoom extracts rectangular crops, while Segment invokes SAM2 to obtain mask-based crops for irregular targets. After every action, the model must produce an explicit observation, turning local visual cues into persistent linguistic memory. Built on Qwen3-VL-8B, TikArt optimizes its reasoning policy with AGRPO, a GRPO-style reinforcement learning algorithm with a two-stage curriculum: it warms up segmentation actions and then jointly optimizes visual math, fine-grained VQA, and segmentation, using rewards that couple task success with purposeful aperture use. Experiments on V*, HR-Bench-4K/8K, MME-RealWorld-Lite, MMStar, RefCOCO, and ReasonSeg show consistent gains over the backbone and yield interpretable aperture trajectories for high-resolution reasoning.
42. 【2602.14464】CoCoDiff: Correspondence-Consistent Diffusion Model for Fine-grained Style Transfer
链接:https://arxiv.org/abs/2602.14464
作者:Wenbo Nie,Zixiang Li,Renshuai Tao,Bin Wu,Yunchao Wei,Yao Zhao
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Transferring visual style, preserving semantic correspondence, similar objects remains, pixel-wise semantic correspondence, computer vision
备注:
点击查看摘要
Abstract:Transferring visual style between images while preserving semantic correspondence between similar objects remains a central challenge in computer vision. While existing methods have made great strides, most of them operate at global level but overlook region-wise and even pixel-wise semantic correspondence. To address this, we propose CoCoDiff, a novel training-free and low-cost style transfer framework that leverages pretrained latent diffusion models to achieve fine-grained, semantically consistent stylization. We identify that correspondence cues within generative diffusion models are under-explored and that content consistency across semantically matched regions is often neglected. CoCoDiff introduces a pixel-wise semantic correspondence module that mines intermediate diffusion features to construct a dense alignment map between content and style images. Furthermore, a cycle-consistency module then enforces structural and perceptual alignment across iterations, yielding object and region level stylization that preserves geometry and detail. Despite requiring no additional training or supervision, CoCoDiff delivers state-of-the-art visual quality and strong quantitative results, outperforming methods that rely on extra training or annotations.
43. 【2602.14457】Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report v1.5
链接:https://arxiv.org/abs/2602.14457
作者:Dongrui Liu,Yi Yu,Jie Zhang,Guanxu Chen,Qihao Lin,Hanxi Zhu,Lige Huang,Yijin Zhou,Peng Wang,Shuai Shao,Boxuan Zhang,Zicheng Liu,Jingwei Sun,Yu Li,Yuejin Xie,Jiaxuan Guo,Jia Xu,Chaochao Lu,Bowen Zhou,Xia Hu,Jing Shao
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
关键词:Risk Management Framework, Management Framework, Framework in Practice, advancing artificial intelligence, Large Language Models
备注: 49 pages, 17 figures, 12 tables
点击查看摘要
Abstract:To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, Frontier AI Risk Management Framework in Practice presents a comprehensive assessment of their frontier risks. As Large Language Models (LLMs) general capabilities rapidly evolve and the proliferation of agentic AI, this version of the risk analysis technical report presents an updated and granular assessment of five critical dimensions: cyber offense, persuasion and manipulation, strategic deception, uncontrolled AI R\D, and self-replication. Specifically, we introduce more complex scenarios for cyber offense. For persuasion and manipulation, we evaluate the risk of LLM-to-LLM persuasion on newly released LLMs. For strategic deception and scheming, we add the new experiment with respect to emergent misalignment. For uncontrolled AI R\D, we focus on the ``mis-evolution'' of agents as they autonomously expand their memory substrates and toolsets. Besides, we also monitor and evaluate the safety performance of OpenClaw during the interaction on the Moltbook. For self-replication, we introduce a new resource-constrained scenario. More importantly, we propose and validate a series of robust mitigation strategies to address these emerging threats, providing a preliminary technical and actionable pathway for the secure deployment of frontier AI. This work reflects our current understanding of AI frontier risks and urges collective action to mitigate these challenges.
44. 【2602.14443】Controlling Your Image via Simplified Vector Graphics
链接:https://arxiv.org/abs/2602.14443
作者:Lanqing Guo,Xi Liu,Yufei Wang,Zhihao Li,Siyu Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:remarkable visual quality, enabling intuitive modifications, achieved remarkable visual, fundamental challenge remains, Recent advances
备注: Preprint
点击查看摘要
Abstract:Recent advances in image generation have achieved remarkable visual quality, while a fundamental challenge remains: Can image generation be controlled at the element level, enabling intuitive modifications such as adjusting shapes, altering colors, or adding and removing objects? In this work, we address this challenge by introducing layer-wise controllable generation through simplified vector graphics (VGs). Our approach first efficiently parses images into hierarchical VG representations that are semantic-aligned and structurally coherent. Building on this representation, we design a novel image synthesis framework guided by VGs, allowing users to freely modify elements and seamlessly translate these edits into photorealistic outputs. By leveraging the structural and semantic features of VGs in conjunction with noise prediction, our method provides precise control over geometry, color, and object semantics. Extensive experiments demonstrate the effectiveness of our approach in diverse applications, including image editing, object-level manipulation, and fine-grained content creation, establishing a new paradigm for controllable image generation. Project page: this https URL
45. 【2602.14441】D-SECURE: Dual-Source Evidence Combination for Unified Reasoning in Misinformation Detection
链接:https://arxiv.org/abs/2602.14441
作者:Gagandeep Singh,Samudi Amarasinghe,Priyanka Singh
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal misinformation increasingly, misinformation increasingly mixes, increasingly mixes realistic, mixes realistic im-age, Multimodal misinformation
备注: 12 pages, 2 figures
点击查看摘要
Abstract:Multimodal misinformation increasingly mixes realistic im-age edits with fluent but misleading text, producing persuasive posts that are difficult to verify. Existing systems usually rely on a single evidence source. Content-based detectors identify local inconsistencies within an image and its caption but cannot determine global factual truth. Retrieval-based fact-checkers reason over external evidence but treat inputs as coarse claims and often miss subtle visual or textual manipulations. This separation creates failure cases where internally consistent fabrications bypass manipulation detectors and fact-checkers verify claims that contain pixel-level or token-level corruption. We present D-SECURE, a framework that combines internal manipulation detection with external evidence-based reasoning for news-style posts. D-SECURE integrates the HAMMER manipulation detector with the DEFAME retrieval pipeline. DEFAME performs broad verification, and HAMMER analyses residual or uncertain cases that may contain fine-grained edits. Experiments on DGM4 and ClaimReview samples highlight the complementary strengths of both systems and motivate their fusion. We provide a unified, explainable report that incorporates manipulation cues and external evidence.
46. 【2602.14425】Hierarchical Vision-Language Interaction for Facial Action Unit Detection
链接:https://arxiv.org/abs/2602.14425
作者:Yong Li,Yi Ren,Yizhe Zhang,Wenhua Zhang,Tianyi Zhang,Muyun Jiang,Guo-Sen Xie,Cuntai Guan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Action Coding System, Facial Action Unit, Facial Action Coding, Action Unit, Action Coding
备注: Accepted to IEEE Transaction on Affective Computing 2026
点击查看摘要
Abstract:Facial Action Unit (AU) detection seeks to recognize subtle facial muscle activations as defined by the Facial Action Coding System (FACS). A primary challenge w.r.t AU detection is the effective learning of discriminative and generalizable AU representations under conditions of limited annotated data. To address this, we propose a Hierarchical Vision-language Interaction for AU Understanding (HiVA) method, which leverages textual AU descriptions as semantic priors to guide and enhance AU detection. Specifically, HiVA employs a large language model to generate diverse and contextually rich AU descriptions to strengthen language-based representation learning. To capture both fine-grained and holistic vision-language associations, HiVA introduces an AU-aware dynamic graph module that facilitates the learning of AU-specific visual representations. These features are further integrated within a hierarchical cross-modal attention architecture comprising two complementary mechanisms: Disentangled Dual Cross-Attention (DDCA), which establishes fine-grained, AU-specific interactions between visual and textual features, and Contextual Dual Cross-Attention (CDCA), which models global inter-AU dependencies. This collaborative, cross-modal learning paradigm enables HiVA to leverage multi-grained vision-based AU features in conjunction with refined language-based AU details, culminating in robust and semantically enriched AU detection capabilities. Extensive experiments show that HiVA consistently surpasses state-of-the-art approaches. Besides, qualitative analyses reveal that HiVA produces semantically meaningful activation patterns, highlighting its efficacy in learning robust and interpretable cross-modal correspondences for comprehensive facial behavior analysis.
47. 【2602.14413】Understanding Sensor Vulnerabilities in Industrial XR Tracking
链接:https://arxiv.org/abs/2602.14413
作者:Sourya Saha,Md. Nurul Absur
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:Extended Reality, Inertial Odometry, ideal assumptions, environments often involve, deviate from ideal
备注: IEEE VR XRIOS 2026 Workshop
点击查看摘要
Abstract:Extended Reality (XR) systems deployed in industrial and operational settings rely on Visual--Inertial Odometry (VIO) for continuous six-degree-of-freedom pose tracking, yet these environments often involve sensing conditions that deviate from ideal assumptions. Despite this, most VIO evaluations emphasize nominal sensor behavior, leaving the effects of sustained sensor degradation under operational conditions insufficiently understood. This paper presents a controlled empirical study of VIO behavior under degraded sensing, examining faults affecting visual and inertial modalities across a range of operating regimes. Through systematic fault injection and quantitative evaluation, we observe a pronounced asymmetry in fault impact where degradations affecting visual sensing typically lead to bounded pose errors on the order of centimeters, whereas degradations affecting inertial sensing can induce substantially larger trajectory deviations, in some cases reaching hundreds to thousands of meters. These observations motivate greater emphasis on inertial reliability in the evaluation and design of XR systems for real-life industrial settings.
48. 【2602.14409】Learning Proposes, Geometry Disposes: A Modular Framework for Efficient Spatial Reasoning
链接:https://arxiv.org/abs/2602.14409
作者:Haichao Zhu,Zhaorui Yang,Qian Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:physical consistency constraints, problem traditionally addressed, Spatial perception aims, consistency constraints, aims to estimate
备注:
点击查看摘要
Abstract:Spatial perception aims to estimate camera motion and scene structure from visual observations, a problem traditionally addressed through geometric modeling and physical consistency constraints. Recent learning-based methods have demonstrated strong representational capacity for geometric perception and are increasingly used to augment classical geometry-centric systems in practice. However, whether learning components should directly replace geometric estimation or instead serve as intermediate modules within such pipelines remains an open question. In this work, we address this gap and investigate an end-to-end modular framework for effective spatial reasoning, where learning proposes geometric hypotheses, while geometric algorithms dispose estimation decisions. In particular, we study this principle in the context of relative camera pose estimation on RGB-D sequences. Using VGGT as a representative learning model, we evaluate learning-based pose and depth proposals under varying motion magnitudes and scene dynamics, followed by a classical point-to-plane RGB-D ICP as the geometric backend. Our experiments on the TUM RGB-D benchmark reveal three consistent findings: (1) learning-based pose proposals alone are unreliable; (2) learning-proposed geometry, when improperly aligned with camera intrinsics, can degrade performance; and (3) when learning-proposed depth is geometrically aligned and followed by a geometric disposal stage, consistent improvements emerge in moderately challenging rigid settings. These results demonstrate that geometry is not merely a refinement component, but an essential arbiter that validates and absorbs learning-based geometric observations. Our study highlights the importance of modular, geometry-aware system design for robust spatial perception.
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2602.14409 [cs.CV]
(or
arXiv:2602.14409v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.14409
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
49. 【2602.14408】Feature Recalibration Based Olfactory-Visual Multimodal Model for Fine-Grained Rice Deterioration Detection
链接:https://arxiv.org/abs/2602.14408
作者:Rongqiang Zhao,Hengrui Hu,Yijing Wang,Mingchun Sun,Jie Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:exhibit limited capability, extracting fine-grained abnormal, fine-grained abnormal features, exhibit limited, limited capability
备注:
点击查看摘要
Abstract:Multimodal methods are widely used in rice deterioration detection, which exhibit limited capability in representing and extracting fine-grained abnormal features. Moreover, these methods rely on devices, such as hyperspectral cameras and mass spectrometers, increasing detection costs and prolonging data acquisition time. To address these issues, we propose a feature recalibration based olfactory-visual multimodal model for fine-grained rice deterioration detection. The fine-grained deterioration embedding constructor (FDEC) is proposed to reconstruct the labeled multimodal embedded-feature dataset, enhancing sample representation. The fine-grained deterioration recalibration attention network (FDRA-Net) is proposed to emphasize signal variations and increase sensitivity to fine-grained deterioration on the rice surface. Experiments show that the proposed method achieves a classification accuracy of 99.89%. Compared with state-of-the-art methods, the detection accuracy is improved and the procedure is simplified. Furthermore, field detection demonstrates the advantages of accuracy and operational simplicity. The proposed method can also be extended to other agrifood in agriculture and food industry.
50. 【2602.14401】pFedNavi: Structure-Aware Personalized Federated Vision-Language Navigation for Embodied AI
链接:https://arxiv.org/abs/2602.14401
作者:Qingqian Yang,Hao Wang,Sai Qian Zhang,Jian Li,Yang Hua,Miao Pan,Tao Song,Zhengwei Qi,Haibing Guan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:significant privacy concerns, raising significant privacy, private indoor environments, VLN requires large-scale, requires large-scale trajectory
备注: Preprint
点击查看摘要
Abstract:Vision-Language Navigation VLN requires large-scale trajectory instruction data from private indoor environments, raising significant privacy concerns. Federated Learning FL mitigates this by keeping data on-device, but vanilla FL struggles under VLNs' extreme cross-client heterogeneity in environments and instruction styles, making a single global model suboptimal. This paper proposes pFedNavi, a structure-aware and dynamically adaptive personalized federated learning framework tailored for VLN. Our key idea is to personalize where it matters: pFedNavi adaptively identifies client-specific layers via layer-wise mixing coefficients, and performs fine-grained parameter fusion on the selected components (e.g., the encoder-decoder projection and environment-sensitive decoder layers) to balance global knowledge sharing with local specialization. We evaluate pFedNavi on two standard VLN benchmarks, R2R and RxR, using both ResNet and CLIP visual representations. Across all metrics, pFedNavi consistently outperforms the FedAvg-based VLN baseline, achieving up to 7.5% improvement in navigation success rate and up to 7.8% gain in trajectory fidelity, while converging 1.38x faster under non-IID conditions.
51. 【2602.14399】Multi-Turn Adaptive Prompting Attack on Large Vision-Language Models
链接:https://arxiv.org/abs/2602.14399
作者:In Chong Choi,Jiacheng Zhang,Feng Liu,Yiliao Song
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:text-only large language, large language models, introducing malicious content, effective against text-only, large vision-language models
备注:
点击查看摘要
Abstract:Multi-turn jailbreak attacks are effective against text-only large language models (LLMs) by gradually introducing malicious content across turns. When extended to large vision-language models (LVLMs), we find that naively adding visual inputs can cause existing multi-turn jailbreaks to be easily defended. For example, overly malicious visual input will easily trigger the defense mechanism of safety-aligned LVLMs, making the response more conservative. To address this, we propose MAPA: a multi-turn adaptive prompting attack that 1) at each turn, alternates text-vision attack actions to elicit the most malicious response; and 2) across turns, adjusts the attack trajectory through iterative back-and-forth refinement to gradually amplify response maliciousness. This two-level design enables MAPA to consistently outperform state-of-the-art methods, improving attack success rates by 11-35% on recent benchmarks against LLaVA-V1.6-Mistral-7B, Qwen2.5-VL-7B-Instruct, Llama-3.2-Vision-11B-Instruct and GPT-4o-mini.
52. 【2602.14381】Adapting VACE for Real-Time Autoregressive Video Diffusion
链接:https://arxiv.org/abs/2602.14381
作者:Ryan Fosdick(Daydream)
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Creation and Editing, autoregressive video generation, real-time autoregressive video, fixed chunk sizes, video generation
备注: 10 pages, 4 figures, 7 tables
点击查看摘要
Abstract:We describe an adaptation of VACE (Video All-in-one Creation and Editing) for real-time autoregressive video generation. VACE provides unified video control (reference guidance, structural conditioning, inpainting, and temporal extension) but assumes bidirectional attention over full sequences, making it incompatible with streaming pipelines that require fixed chunk sizes and causal attention. The key modification moves reference frames from the diffusion latent space into a parallel conditioning pathway, preserving the fixed chunk sizes and KV caching that autoregressive models require. This adaptation reuses existing pretrained VACE weights without additional training. Across 1.3B and 14B model scales, VACE adds 20-30% latency overhead for structural control and inpainting, with negligible VRAM cost relative to the base model. Reference-to-video fidelity is severely degraded compared to batch VACE due to causal attention constraints. A reference implementation is available at this https URL.
53. 【2602.14376】Event-based Visual Deformation Measurement
链接:https://arxiv.org/abs/2602.14376
作者:Yuliang Wu,Wei Zhai,Yuxin Cui,Tiesong Zhao,Yang Cao,Zheng-Jun Zha
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Visual Deformation Measurement, Visual Deformation, Deformation Measurement, aims to recover, Affine Invariant Simplicial
备注:
点击查看摘要
Abstract:Visual Deformation Measurement (VDM) aims to recover dense deformation fields by tracking surface motion from camera observations. Traditional image-based methods rely on minimal inter-frame motion to constrain the correspondence search space, which limits their applicability to highly dynamic scenes or necessitates high-speed cameras at the cost of prohibitive storage and computational overhead. We propose an event-frame fusion framework that exploits events for temporally dense motion cues and frames for spatially dense precise estimation. Revisiting the solid elastic modeling prior, we propose an Affine Invariant Simplicial (AIS) framework. It partitions the deformation field into linearized sub-regions with low-parametric representation, effectively mitigating motion ambiguities arising from sparse and noisy events. To speed up parameter searching and reduce error accumulation, a neighborhood-greedy optimization strategy is introduced, enabling well-converged sub-regions to guide their poorly-converged neighbors, effectively suppress local error accumulation in long-term dense tracking. To evaluate the proposed method, a benchmark dataset with temporally aligned event streams and frames is established, encompassing over 120 sequences spanning diverse deformation scenarios. Experimental results show that our method outperforms the state-of-the-art baseline by 1.6% in survival rate. Remarkably, it achieves this using only 18.9% of the data storage and processing resources of high-speed video methods.
54. 【2602.14365】Image-based Joint-level Detection for Inflammation in Rheumatoid Arthritis from Small and Imbalanced Data
链接:https://arxiv.org/abs/2602.14365
作者:Shun Kato(Keio University, Japan),Yasushi Kondo(Keio University, Japan),Shuntaro Saito(Keio University, Japan),Yoshimitsu Aoki(Keio University, Japan),Mariko Isogawa(Keio University, Japan)
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:autoimmune disease characterized, Rheumatoid arthritis, systemic joint inflammation, RGB hand images, characterized by systemic
备注:
点击查看摘要
Abstract:Rheumatoid arthritis (RA) is an autoimmune disease characterized by systemic joint inflammation. Early diagnosis and tight follow-up are essential to the management of RA, as ongoing inflammation can cause irreversible joint damage. The detection of arthritis is important for diagnosis and assessment of disease activity; however, it often takes a long time for patients to receive appropriate specialist care. Therefore, there is a strong need to develop systems that can detect joint inflammation easily using RGB images captured at home. Consequently, we tackle the task of RA inflammation detection from RGB hand images. This task is highly challenging due to general issues in medical imaging, such as the scarcity of positive samples, data imbalance, and the inherent difficulty of the task itself. However, to the best of our knowledge, no existing work has explicitly addressed these challenges in RGB-based RA inflammation detection. This paper quantitatively demonstrates the difficulty of visually detecting inflammation by constructing a dedicated dataset, and we propose a inflammation detection framework with global local encoder that combines self-supervised pretraining on large-scale healthy hand images with imbalance-aware training to detect RA-related joint inflammation from RGB hand images. Our experiments demonstrated that the proposed approach improves F1-score by 0.2 points and Gmean by 0.25 points compared with the baseline model.
55. 【2602.14356】A Generative AI Approach for Reducing Skin Tone Bias in Skin Cancer Classification
链接:https://arxiv.org/abs/2602.14356
作者:Areez Muhammed Shabu,Mohammad Samar Ansari,Asra Aslam
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:common cancers worldwide, effective treatment, Skin, Skin Imaging Collaboration, International Skin Imaging
备注:
点击查看摘要
Abstract:Skin cancer is one of the most common cancers worldwide and early detection is critical for effective treatment. However, current AI diagnostic tools are often trained on datasets dominated by lighter skin tones, leading to reduced accuracy and fairness for people with darker skin. The International Skin Imaging Collaboration (ISIC) dataset, one of the most widely used benchmarks, contains over 70% light skin images while dark skins fewer than 8%. This imbalance poses a significant barrier to equitable healthcare delivery and highlights the urgent need for methods that address demographic diversity in medical imaging. This paper addresses this challenge of skin tone imbalance in automated skin cancer detection using dermoscopic images. To overcome this, we present a generative augmentation pipeline that fine-tunes a pre-trained Stable Diffusion model using Low-Rank Adaptation (LoRA) on the image dark-skin subset of the ISIC dataset and generates synthetic dermoscopic images conditioned on lesion type and skin tone. In this study, we investigated the utility of these images on two downstream tasks: lesion segmentation and binary classification. For segmentation, models trained on the augmented dataset and evaluated on held-out real images show consistent improvements in IoU, Dice coefficient, and boundary accuracy. These evalutions provides the verification of Generated dataset. For classification, an EfficientNet-B0 model trained on the augmented dataset achieved 92.14% accuracy. This paper demonstrates that synthetic data augmentation with Generative AI integration can substantially reduce bias with increase fairness in conventional dermatological diagnostics and open challenges for future directions.
56. 【2602.14297】Differential pose optimization in descriptor space -- Combining Geometric and Photometric Methods for Motion Estimation
链接:https://arxiv.org/abs/2602.14297
作者:Andreas L. Teigen,Annette Stahl,Rudolf Mester
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:two-frame relative pose, fundamental problems, computer vision, two-frame relative, pose optimization problem
备注:
点击查看摘要
Abstract:One of the fundamental problems in computer vision is the two-frame relative pose optimization problem. Primarily, two different kinds of error values are used: photometric error and re-projection error. The selection of error value is usually directly dependent on the selection of feature paradigm, photometric features, or geometric features. It is a trade-off between accuracy, robustness, and the possibility of loop closing. We investigate a third method that combines the strengths of both paradigms into a unified approach. Using densely sampled geometric feature descriptors, we replace the photometric error with a descriptor residual from a dense set of descriptors, thereby enabling the employment of sub-pixel accuracy in differential photometric methods, along with the expressiveness of the geometric feature descriptor. Experiments show that although the proposed strategy is an interesting approach that results in accurate tracking, it ultimately does not outperform pose optimization strategies based on re-projection error despite utilizing more information. We proceed to analyze the underlying reason for this discrepancy and present the hypothesis that the descriptor similarity metric is too slowly varying and does not necessarily correspond strictly to keypoint placement accuracy.
57. 【2602.14276】Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision
链接:https://arxiv.org/abs/2602.14276
作者:A. Said Gurbuz,Sunghwan Hong,Ahmed Nassar,Marc Pollefeys,Peter Staar
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Modern computer-use agents, reliably ground instructions, Modern computer-use, computer-use agents, structured state
备注: 28 pages, 15 figures
点击查看摘要
Abstract:Modern computer-use agents (CUA) must perceive a screen as a structured state, what elements are visible, where they are, and what text they contain, before they can reliably ground instructions and act. Yet, most available grounding datasets provide sparse supervision, with insufficient and low-diversity labels that annotate only a small subset of task-relevant elements per screen, which limits both coverage and generalization; moreover, practical deployment requires efficiency to enable low-latency, on-device use. We introduce ScreenParse, a large-scale dataset for complete screen parsing, with dense annotations of all visible UI elements (boxes, 55-class types, and text) across 771K web screenshots (21M elements). ScreenParse is generated by Webshot, an automated, scalable pipeline that renders diverse urls, extracts annotations and applies VLM-based relabeling and quality filtering. Using ScreenParse, we train ScreenVLM, a compact, 316M-parameter vision language model (VLM) that decodes a compact ScreenTag markup representation with a structure-aware loss that upweights structure-critical tokens. ScreenVLM substantially outperforms much larger foundation VLMs on dense parsing (e.g., 0.592 vs. 0.294 PageIoU on ScreenParse) and shows strong transfer to public benchmarks. Moreover, finetuning foundation VLMs on ScreenParse consistently improves their grounding performance, suggesting that dense screen supervision provides transferable structural priors for UI understanding. Project page: this https URL.
58. 【2602.14237】AbracADDbra: Touch-Guided Object Addition by Decoupling Placement and Editing Subtasks
链接:https://arxiv.org/abs/2602.14237
作者:Kunal Swami,Raghu Chittersu,Yuvraj Rathore,Rajeev Irny,Shashavali Doodekula,Alok Shukla
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Instruction-based object addition, Instruction-based object, mask-based inputs, ambiguity of text-only, text-only prompts
备注: Accepted in IEEE ICASSP 2026
点击查看摘要
Abstract:Instruction-based object addition is often hindered by the ambiguity of text-only prompts or the tedious nature of mask-based inputs. To address this usability gap, we introduce AbracADDbra, a user-friendly framework that leverages intuitive touch priors to spatially ground succinct instructions for precise placement. Our efficient, decoupled architecture uses a vision-language transformer for touch-guided placement, followed by a diffusion model that jointly generates the object and an instance mask for high-fidelity blending. To facilitate standardized evaluation, we contribute the Touch2Add benchmark for this interactive task. Our extensive evaluations, where our placement model significantly outperforms both random placement and general-purpose VLM baselines, confirm the framework's ability to produce high-fidelity edits. Furthermore, our analysis reveals a strong correlation between initial placement accuracy and final edit quality, validating our decoupled approach. This work thus paves the way for more accessible and efficient creative tools.
59. 【2602.14236】Dual-Signal Adaptive KV-Cache Optimization for Long-Form Video Understanding in Vision-Language Models
链接:https://arxiv.org/abs/2602.14236
作者:Vishnu Sai,Dheeraj Sai,Srinath B,Girish Varma,Priyesh Shukla
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
关键词:critical memory bottleneck, face a critical, cache with sequence, sequence length, linear growth
备注:
点击查看摘要
Abstract:Vision-Language Models (VLMs) face a critical memory bottleneck when processing long-form video content due to the linear growth of the Key-Value (KV) cache with sequence length. Existing solutions predominantly employ reactive eviction strategies that compute full attention matrices before discarding tokens, resulting in substantial computational waste. We propose Sali-Cache, a novel a priori optimization framework that implements dual-signal adaptive caching through proactive memory management. By integrating a temporal filter based on optical flow analysis for detecting inter-frame redundancy and a spatial filter leveraging saliency detection for identifying visually significant regions, Sali-Cache intelligently manages memory allocation before entering computationally expensive attention operations. Experimental evaluation on the LLaVA 1.6 architecture demonstrates that our method achieves a 2.20x compression ratio in effective memory usage while maintaining 100% accuracy across BLEU, ROUGE-L, and Exact Match metrics. Furthermore, under identical memory budget constraints, Sali-Cache preserves context-rich features over extended temporal durations without degrading model performance, enabling efficient processing of long-form video content on consumer-grade hardware.
60. 【2602.14228】Learning Significant Persistent Homology Features for 3D Shape Understanding
链接:https://arxiv.org/abs/2602.14228
作者:Prachi Kudeshia,Jiju Poovvancheri
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:topology constitute complementary, constitute complementary descriptors, primarily capture geometric, capture geometric information, Geometry and topology
备注: 17 pages, 10 figures, Preprint under review
点击查看摘要
Abstract:Geometry and topology constitute complementary descriptors of three-dimensional shape, yet existing benchmark datasets primarily capture geometric information while neglecting topological structure. This work addresses this limitation by introducing topologically-enriched versions of ModelNet40 and ShapeNet, where each point cloud is augmented with its corresponding persistent homology features. These benchmarks with the topological signatures establish a foundation for unified geometry-topology learning and enable systematic evaluation of topology-aware deep learning architectures for 3D shape analysis. Building on this foundation, we propose a deep learning-based significant persistent point selection method, \textit{TopoGAT}, that learns to identify the most informative topological features directly from input data and the corresponding topological signatures, circumventing the limitations of hand-crafted statistical selection criteria. A comparative study verifies the superiority of the proposed method over traditional statistical approaches in terms of stability and discriminative power. Integrating the selected significant persistent points into standard point cloud classification and part-segmentation pipelines yields improvements in both classification accuracy and segmentation metrics. The presented topologically-enriched datasets, coupled with our learnable significant feature selection approach, enable the broader integration of persistent homology into the practical deep learning workflows for 3D point cloud analysis.
61. 【2602.14226】Freq-DP Net: A Dual-Branch Network for Fence Removal using Dual-Pixel and Fourier Priors
链接:https://arxiv.org/abs/2602.14226
作者:Kunal Swami,Sudha Velusamy,Chandra Sekhar Seelamantula
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:computer vision applications, degrades visual quality, limits downstream computer, downstream computer vision, Removing fence occlusions
备注: Accepted in IEEE ICASSP 2026
点击查看摘要
Abstract:Removing fence occlusions from single images is a challenging task that degrades visual quality and limits downstream computer vision applications. Existing methods often fail on static scenes or require motion cues from multiple frames. To overcome these limitations, we introduce the first framework to leverage dual-pixel (DP) sensors for this problem. We propose Freq-DP Net, a novel dual-branch network that fuses two complementary priors: a geometric prior from defocus disparity, modeled using an explicit cost volume, and a structural prior of the fence's global pattern, learned via Fast Fourier Convolution (FFC). An attention mechanism intelligently merges these cues for highly accurate fence segmentation. To validate our approach, we build and release a diverse benchmark with different fence varieties. Experiments demonstrate that our method significantly outperforms strong general-purpose baselines, establishing a new state-of-the-art for single-image, DP-based fence removal.
62. 【2602.14214】HiVid: LLM-Guided Video Saliency For Content-Aware VOD And Live Streaming
链接:https://arxiv.org/abs/2602.14214
作者:Jiahui Chen,Bo Peng,Lianchen Jia,Zeyu Zhang,Tianchi Huang,Lifeng Sun
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:optimize subjective quality, chunk-level importance weights, chunk-level importance, quality of experience, Large Language Models
备注: ICLR 2026
点击查看摘要
Abstract:Content-aware streaming requires dynamic, chunk-level importance weights to optimize subjective quality of experience (QoE). However, direct human annotation is prohibitively expensive while vision-saliency models generalize poorly. We introduce HiVid, the first framework to leverage Large Language Models (LLMs) as a scalable human proxy to generate high-fidelity weights for both Video-on-Demand (VOD) and live streaming. We address 3 non-trivial challenges: (1) To extend LLMs' limited modality and circumvent token limits, we propose a perception module to assess frames in a local context window, autoregressively building a coherent understanding of the video. (2) For VOD with rating inconsistency across local windows, we propose a ranking module to perform global re-ranking with a novel LLM-guided merge-sort algorithm. (3) For live streaming which requires low-latency, online inference without future knowledge, we propose a prediction module to predict future weights with a multi-modal time series model, which comprises a content-aware attention and adaptive horizon to accommodate asynchronous LLM inference. Extensive experiments show HiVid improves weight prediction accuracy by up to 11.5\% for VOD and 26\% for live streaming over SOTA baselines. Real-world user study validates HiVid boosts streaming QoE correlation by 14.7\%.
63. 【2602.14201】GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery
链接:https://arxiv.org/abs/2602.14201
作者:Fengxiang Wang,Mingshuo Chen,Yueying Li,Yajie Yang,Yifan Zhang,Long Lan,Xue Yang,Hongda Sun,Yulin Wang,Di Wang,Jun Song,Jing Zhang,Bo Du
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:paradigm enables multimodal, enables multimodal large, multimodal large language, actively explore visual, explore visual scenes
备注:
点击查看摘要
Abstract:The "thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools. This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny. However, we observe a consistent failure mode in existing zoom-enabled MLLMs: Tool Usage Homogenization, where tool calls collapse into task-agnostic patterns, limiting effective evidence acquisition. To address this, we propose GeoEyes, a staged training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom interactions. The resulting model learns on-demand zooming with proper stopping behavior and achieves substantial improvements on UHR remote sensing benchmarks, with 54.23% accuracy on XLRS-Bench.
64. 【2602.14193】Learning Part-Aware Dense 3D Feature Field for Generalizable Articulated Object Manipulation
链接:https://arxiv.org/abs/2602.14193
作者:Yue Chen,Muqing Jiang,Kaifeng Zheng,Jiaqi Liang,Chenrui Tie,Haoran Lu,Ruihai Wu,Hao Dong
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:diverse objects remains, Articulated object manipulation, Articulated object, remains a major, diverse object categories
备注: Accept to ICLR 2026, Project page: [this https URL](https://pa3ff.github.io)
点击查看摘要
Abstract:Articulated object manipulation is essential for various real-world robotic tasks, yet generalizing across diverse objects remains a major challenge. A key to generalization lies in understanding functional parts (e.g., door handles and knobs), which indicate where and how to manipulate across diverse object categories and shapes. Previous works attempted to achieve generalization by introducing foundation features, while these features are mostly 2D-based and do not specifically consider functional parts. When lifting these 2D features to geometry-profound 3D space, challenges arise, such as long runtimes, multi-view inconsistencies, and low spatial resolution with insufficient geometric information. To address these issues, we propose Part-Aware 3D Feature Field (PA3FF), a novel dense 3D feature with part awareness for generalizable articulated object manipulation. PA3FF is trained by 3D part proposals from a large-scale labeled dataset, via a contrastive learning formulation. Given point clouds as input, PA3FF predicts a continuous 3D feature field in a feedforward manner, where the distance between point features reflects the proximity of functional parts: points with similar features are more likely to belong to the same part. Building on this feature, we introduce the Part-Aware Diffusion Policy (PADP), an imitation learning framework aimed at enhancing sample efficiency and generalization for robotic manipulation. We evaluate PADP on several simulated and real-world tasks, demonstrating that PA3FF consistently outperforms a range of 2D and 3D representations in manipulation scenarios, including CLIP, DINOv2, and Grounded-SAM. Beyond imitation learning, PA3FF enables diverse downstream methods, including correspondence learning and segmentation tasks, making it a versatile foundation for robotic manipulation. Project page: this https URL
65. 【2602.14186】UniRef-Image-Edit: Towards Scalable and Consistent Multi-Reference Image Editing
链接:https://arxiv.org/abs/2602.14186
作者:Hongyang Wei,Bin Wen,Yancheng Long,Yankai Yang,Yuhang Hu,Tianke Zhang,Wei Chen,Haonan Fan,Kaiyu Jiang,Jiankang Chen,Changyi Liu,Kaiyu Tang,Haojie Ding,Xiao Yang,Jia Sun,Huaiqing Wang,Zhenyu Yang,Xinyu Wei,Xianglong He,Yangguang Li,Fan Yang,Tingting Gao,Lei Zhang,Guorui Zhou,Han Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:high-performance multi-modal generation, multi-modal generation system, unifies single-image editing, high-performance multi-modal, system that unifies
备注:
点击查看摘要
Abstract:We present UniRef-Image-Edit, a high-performance multi-modal generation system that unifies single-image editing and multi-image composition within a single framework. Existing diffusion-based editing methods often struggle to maintain consistency across multiple conditions due to limited interaction between reference inputs. To address this, we introduce Sequence-Extended Latent Fusion (SELF), a unified input representation that dynamically serializes multiple reference images into a coherent latent sequence. During a dedicated training stage, all reference images are jointly constrained to fit within a fixed-length sequence under a global pixel-budget constraint. Building upon SELF, we propose a two-stage training framework comprising supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we jointly train on single-image editing and multi-image composition tasks to establish a robust generative prior. We adopt a progressive sequence length training strategy, in which all input images are initially resized to a total pixel budget of $1024^2$, and are then gradually increased to $1536^2$ and $2048^2$ to improve visual fidelity and cross-reference consistency. This gradual relaxation of compression enables the model to incrementally capture finer visual details while maintaining stable alignment across references. For the RL stage, we introduce Multi-Source GRPO (MSGRPO), to our knowledge the first reinforcement learning framework tailored for multi-reference image generation. MSGRPO optimizes the model to reconcile conflicting visual constraints, significantly enhancing compositional consistency. We will open-source the code, models, training data, and reward data for community research purposes.
66. 【2602.14178】UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model
链接:https://arxiv.org/abs/2602.14178
作者:Shaobin Zhuang,Yuang Ai,Jiaming Han,Weijia Mao,Xiaohui Li,Fangyikang Wang,Xiao Wang,Yan Li,Shanchuan Lin,Kun Xu,Zhenheng Yang,Huaibo Huang,Xiangyu Yue,Hao Chen,Yali Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Multimodal Large Language, Large Language Models, Large Language, supports high-fidelity reconstruction, simultaneously supports high-fidelity
备注: 29 pages, 9 figures, 33 tables
点击查看摘要
Abstract:Unified Multimodal Large Language Models (MLLMs) require a visual representation that simultaneously supports high-fidelity reconstruction, complex semantic extraction, and generative suitability. However, existing visual tokenizers typically struggle to satisfy these conflicting objectives within a single framework. In this paper, we introduce UniWeTok, a unified discrete tokenizer designed to bridge this gap using a massive binary codebook ($\mathit{2^{128}}$). For training framework, we introduce Pre-Post Distillation and a Generative-Aware Prior to enhance the semantic extraction and generative prior of the discrete tokens. In terms of model architecture, we propose a convolution-attention hybrid architecture with the SigLu activation function. SigLu activation not only bounds the encoder output and stabilizes the semantic distillation process but also effectively addresses the optimization conflict between token entropy loss and commitment loss. We further propose a three-stage training framework designed to enhance UniWeTok's adaptability cross various image resolutions and perception-sensitive scenarios, such as those involving human faces and textual content. On ImageNet, UniWeTok achieves state-of-the-art image generation performance (FID: UniWeTok 1.38 vs. REPA 1.42) while requiring a remarkably low training compute (Training Tokens: UniWeTok 33B vs. REPA 262B). On general-domain, UniWeTok demonstrates highly competitive capabilities across a broad range of tasks, including multimodal understanding, image generation (DPG Score: UniWeTok 86.63 vs. FLUX.1 [Dev] 83.84), and editing (GEdit Overall Score: UniWeTok 5.09 vs. OmniGen 5.06). We release code and models to facilitate community exploration of unified tokenizer and MLLM.
67. 【2602.14177】owards Spatial Transcriptomics-driven Pathology Foundation Models
链接:https://arxiv.org/abs/2602.14177
作者:Konstantin Hemker,Andrew H. Song,Cristina Almagro-Pérez,Guillaume Jaume,Sophia J. Wagner,Anurag Vaidya,Nikola Simidjievski,Mateja Jamnik,Faisal Mahmood
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:spatially resolved measurements, pathology foundation models, Spatial transcriptomics, enabling characterization, foundation models
备注:
点击查看摘要
Abstract:Spatial transcriptomics (ST) provides spatially resolved measurements of gene expression, enabling characterization of the molecular landscape of human tissue beyond histological assessment as well as localized readouts that can be aligned with morphology. Concurrently, the success of multimodal foundation models that integrate vision with complementary modalities suggests that morphomolecular coupling between local expression and morphology can be systematically used to improve histological representations themselves. We introduce Spatial Expression-Aligned Learning (SEAL), a vision-omics self-supervised learning framework that infuses localized molecular information into pathology vision encoders. Rather than training new encoders from scratch, SEAL is designed as a parameter-efficient vision-omics finetuning method that can be flexibly applied to widely used pathology foundation models. We instantiate SEAL by training on over 700,000 paired gene expression spot-tissue region examples spanning tumor and normal samples from 14 organs. Tested across 38 slide-level and 15 patch-level downstream tasks, SEAL provides a drop-in replacement for pathology foundation models that consistently improves performance over widely used vision-only and ST prediction baselines on slide-level molecular status, pathway activity, and treatment response prediction, as well as patch-level gene expression prediction tasks. Additionally, SEAL encoders exhibit robust domain generalization on out-of-distribution evaluations and enable new cross-modal capabilities such as gene-to-image retrieval. Our work proposes a general framework for ST-guided finetuning of pathology foundation models, showing that augmenting existing models with localized molecular supervision is an effective and practical step for improving visual representations and expanding their cross-modal utility.
68. 【2602.14162】Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering
链接:https://arxiv.org/abs/2602.14162
作者:Tao Xu
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:Existing multimodal document, generate comprehensive descriptions, answering methods universally, methods universally adopt, Existing multimodal
备注: 24 pages, 9 figures, 9 tables
点击查看摘要
Abstract:Existing multimodal document question answering methods universally adopt a supply-side ingestion strategy: running a Vision-Language Model (VLM) on every page during indexing to generate comprehensive descriptions, then answering questions through text retrieval. However, this "pre-ingestion" approach is costly (a 113-page engineering drawing package requires approximately 80,000 VLM tokens), end-to-end unreliable (VLM outputs may fail to be correctly retrieved due to format mismatches in the retrieval infrastructure), and irrecoverable once it fails. This paper proposes the Deferred Visual Ingestion (DVI) framework, adopting a demand-side ingestion strategy: the indexing phase performs only lightweight metadata extraction, deferring visual understanding to the moment users pose specific questions. DVI's core principle is "Index for locating, not understanding"--achieving page localization through structured metadata indexes and BM25 full-text search, then sending original images along with specific questions to a VLM for targeted analysis. Experiments on two real industrial engineering drawings (113 pages + 7 pages) demonstrate that DVI achieves comparable overall accuracy at zero ingestion VLM cost (46.7% vs. 48.9%), an effectiveness rate of 50% on visually necessary queries (vs. 0% for pre-ingestion), and 100% page localization (98% search space compression). DVI also supports interactive refinement and progressive caching, transforming the "QA accuracy" problem into a "page localization" problem--once the correct drawing page is found, obtaining the answer becomes a matter of interaction rounds.
69. 【2602.14157】When Test-Time Guidance Is Enough: Fast Image and Video Editing with Diffusion Guidance
链接:https://arxiv.org/abs/2602.14157
作者:Ahmed Ghorbel,Badr Moufad,Navid Bagheri Shouraki,Alain Oliviero Durmus,Thomas Hirtz,Eric Moulines,Jimmy Olsson,Yazid Janati
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Text-driven image, inpainting problems, naturally cast, cast as inpainting, masked regions
备注: Preprint
点击查看摘要
Abstract:Text-driven image and video editing can be naturally cast as inpainting problems, where masked regions are reconstructed to remain consistent with both the observed content and the editing prompt. Recent advances in test-time guidance for diffusion and flow models provide a principled framework for this task; however, existing methods rely on costly vector--Jacobian product (VJP) computations to approximate the intractable guidance term, limiting their practical applicability. Building upon the recent work of Moufad et al. (2025), we provide theoretical insights into their VJP-free approximation and substantially extend their empirical evaluation to large-scale image and video editing benchmarks. Our results demonstrate that test-time guidance alone can achieve performance comparable to, and in some cases surpass, training-based methods.
70. 【2602.14153】ARport: An Augmented Reality System for Markerless Image-Guided Port Placement in Robotic Surgery
链接:https://arxiv.org/abs/2602.14153
作者:Zheng Han,Zixin Yang,Yonghao Long,Lin Zhang,Peter Kazanzides,Qi Dou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Precise port placement, port configuration influences, patient body surface, Precise port, patient body
备注:
点击查看摘要
Abstract:Purpose: Precise port placement is a critical step in robot-assisted surgery, where port configuration influences both visual access to the operative field and instrument maneuverability. To bridge the gap between preoperative planning and intraoperative execution, we present ARport, an augmented reality (AR) system that automatically maps pre-planned trocar layouts onto the patient's body surface, providing intuitive spatial guidance during surgical preparation. Methods: ARport, implemented on an optical see-through head-mounted display (OST-HMD), operates without any external sensors or markers, simplifying setup and enhancing workflow integration. It reconstructs the operative scene from RGB, depth, and pose data captured by the OST-HMD, extracts the patient's body surface using a foundation model, and performs surface-based markerless registration to align preoperative anatomical models to the extracted patient's body surface, enabling in-situ visualization of planned trocar layouts. A demonstration video illustrating the overall workflow is available online. Results: In full-scale human-phantom experiments, ARport accurately overlaid pre-planned trocar sites onto the physical phantom, achieving consistent spatial correspondence between virtual plans and real anatomy. Conclusion: ARport provides a fully marker-free and hardware-minimal solution for visualizing preoperative trocar plans directly on the patient's body surface. The system facilitates efficient intraoperative setup and demonstrates potential for seamless integration into routine clinical workflows.
71. 【2602.14147】LaViDa-R1: Advancing Reasoning for Unified Multimodal Diffusion Language Models
链接:https://arxiv.org/abs/2602.14147
作者:Shufan Li,Yuchen Zhu,Jiuxiang Gu,Kangning Liu,Zhe Lin,Yongxin Chen,Molei Tao,Aditya Grover,Jason Kuen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Diffusion language models, Diffusion language, language models, recently emerged, auto-regressive LLMs
备注: 28 pages, 11 figures
点击查看摘要
Abstract:Diffusion language models (dLLMs) recently emerged as a promising alternative to auto-regressive LLMs. The latest works further extended it to multimodal understanding and generation tasks. In this work, we propose LaViDa-R1, a multimodal, general-purpose reasoning dLLM. Unlike existing works that build reasoning dLLMs through task-specific reinforcement learning, LaViDa-R1 incorporates diverse multimodal understanding and generation tasks in a unified manner. In particular, LaViDa-R1 is built with a novel unified post-training framework that seamlessly integrates supervised finetuning (SFT) and multi-task reinforcement learning (RL). It employs several novel training techniques, including answer-forcing, tree search, and complementary likelihood estimation, to enhance effectiveness and scalability. Extensive experiments demonstrate LaViDa-R1's strong performance on a wide range of multimodal tasks, including visual math reasoning, reason-intensive grounding, and image editing.
72. 【2602.14140】Detection of On-Ground Chestnuts Using Artificial Intelligence Toward Automated Picking
链接:https://arxiv.org/abs/2602.14140
作者:Kaixuan Fang,Yuzhen Lu,Xinyang Mu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Traditional mechanized chestnut, Traditional mechanized, mechanized chestnut harvesting, small producers, damaging nuts
备注: 16 pages, 10 figures
点击查看摘要
Abstract:Traditional mechanized chestnut harvesting is too costly for small producers, non-selective, and prone to damaging nuts. Accurate, reliable detection of chestnuts on the orchard floor is crucial for developing low-cost, vision-guided automated harvesting technology. However, developing a reliable chestnut detection system faces challenges in complex environments with shading, varying natural light conditions, and interference from weeds, fallen leaves, stones, and other foreign on-ground objects, which have remained unaddressed. This study collected 319 images of chestnuts on the orchard floor, containing 6524 annotated chestnuts. A comprehensive set of 29 state-of-the-art real-time object detectors, including 14 in the YOLO (v11-13) and 15 in the RT-DETR (v1-v4) families at varied model scales, was systematically evaluated through replicated modeling experiments for chestnut detection. Experimental results show that the YOLOv12m model achieves the best mAP@0.5 of 95.1% among all the evaluated models, while the RT-DETRv2-R101 was the most accurate variant among RT-DETR models, with mAP@0.5 of 91.1%. In terms of mAP@[0.5:0.95], the YOLOv11x model achieved the best accuracy of 80.1%. All models demonstrate significant potential for real-time chestnut detection, and YOLO models outperformed RT-DETR models in terms of both detection accuracy and inference, making them better suited for on-board deployment. Both the dataset and software programs in this study have been made publicly available at this https URL.
73. 【2602.14134】DenseMLLM: Standard Multimodal LLMs are Intrinsic Dense Predictors
链接:https://arxiv.org/abs/2602.14134
作者:Yi Li,Hongze Shen,Lexiang Tang,Xin Li,Xinpeng Ding,Yinsong Liu,Deqiang Jiang,Xing Sun,Xiaomeng Li
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, high-level visual understanding
备注: 25 pages, 9 figures
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in high-level visual understanding. However, extending these models to fine-grained dense prediction tasks, such as semantic segmentation and depth estimation, typically necessitates the incorporation of complex, task-specific decoders and other customizations. This architectural fragmentation increases model complexity and deviates from the generalist design of MLLMs, ultimately limiting their practicality. In this work, we challenge this paradigm by accommodating standard MLLMs to perform dense predictions without requiring additional task-specific decoders. The proposed model is called DenseMLLM, grounded in the standard architecture with a novel vision token supervision strategy for multiple labels and tasks. Despite its minimalist design, our model achieves highly competitive performance across a wide range of dense prediction and vision-language benchmarks, demonstrating that a standard, general-purpose MLLM can effectively support dense perception without architectural specialization.
74. 【2602.14122】EgoSound: Benchmarking Sound Understanding in Egocentric Videos
链接:https://arxiv.org/abs/2602.14122
作者:Bingwen Zhu,Yuqian Fu,Qiaole Dong,Guolei Sun,Tianwen Qian,Yuzheng Wu,Danda Pani Paudel,Xiangyang Xue,Yanwei Fu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, recently achieved remarkable
备注: 17 pages
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in vision-language understanding. Yet, human perception is inherently multisensory, integrating sight, sound, and motion to reason about the world. Among these modalities, sound provides indispensable cues about spatial layout, off-screen events, and causal interactions, particularly in egocentric settings where auditory and visual signals are tightly coupled. To this end, we introduce EgoSound, the first benchmark designed to systematically evaluate egocentric sound understanding in MLLMs. EgoSound unifies data from Ego4D and EgoBlind, encompassing both sighted and sound-dependent experiences. It defines a seven-task taxonomy spanning intrinsic sound perception, spatial localization, causal inference, and cross-modal reasoning. Constructed through a multi-stage auto-generative pipeline, EgoSound contains 7315 validated QA pairs across 900 videos. Comprehensive experiments on nine state-of-the-art MLLMs reveal that current models exhibit emerging auditory reasoning abilities but remain limited in fine-grained spatial and causal understanding. EgoSound establishes a challenging foundation for advancing multisensory egocentric intelligence, bridging the gap between seeing and truly hearing the world.
75. 【2602.14119】GeoFusionLRM: Geometry-Aware Self-Correction for Consistent 3D Reconstruction
链接:https://arxiv.org/abs/2602.14119
作者:Ahmet Burak Yildirim,Tuna Saygin,Duygu Ceylan,Aysegul Dundar
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:large reconstruction models, exhibit geometric inconsistencies, advanced rapidly, large reconstruction, inconsistencies and misaligned
备注:
点击查看摘要
Abstract:Single-image 3D reconstruction with large reconstruction models (LRMs) has advanced rapidly, yet reconstructions often exhibit geometric inconsistencies and misaligned details that limit fidelity. We introduce GeoFusionLRM, a geometry-aware self-correction framework that leverages the model's own normal and depth predictions to refine structural accuracy. Unlike prior approaches that rely solely on features extracted from the input image, GeoFusionLRM feeds back geometric cues through a dedicated transformer and fusion module, enabling the model to correct errors and enforce consistency with the conditioning image. This design improves the alignment between the reconstructed mesh and the input views without additional supervision or external signals. Extensive experiments demonstrate that GeoFusionLRM achieves sharper geometry, more consistent normals, and higher fidelity than state-of-the-art LRM baselines.
76. 【2602.14099】SemanticFeels: Semantic Labeling during In-Hand Manipulation
链接:https://arxiv.org/abs/2602.14099
作者:Anas Al Shikh Khalil,Haozhi Qi,Roberto Calandra
类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:everyday tasks, intelligent behavior, robots become increasingly, increasingly integrated, integrated into everyday
备注: 10 pages, 5 figures
点击查看摘要
Abstract:As robots become increasingly integrated into everyday tasks, their ability to perceive both the shape and properties of objects during in-hand manipulation becomes critical for adaptive and intelligent behavior. We present SemanticFeels, an extension of the NeuralFeels framework that integrates semantic labeling with neural implicit shape representation, from vision and touch. To illustrate its application, we focus on material classification: high-resolution Digit tactile readings are processed by a fine-tuned EfficientNet-B0 convolutional neural network (CNN) to generate local material predictions, which are then embedded into an augmented signed distance field (SDF) network that jointly predicts geometry and continuous material regions. Experimental results show that the system achieves a high correspondence between predicted and actual materials on both single- and multi-material objects, with an average matching accuracy of 79.87% across multiple manipulation trials on a multi-material object.
77. 【2602.14098】ForgeryVCR: Visual-Centric Reasoning via Efficient Forensic Tools in MLLMs for Image Forgery Detection and Localization
链接:https://arxiv.org/abs/2602.14098
作者:Youqi Wang,Shen Chen,Haowei Wang,Rongxuan Peng,Taiping Yao,Shunquan Tan,Changsheng Chen,Bin Li,Shouhong Ding
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Existing Multimodal Large, Multimodal Large Language, Large Language Models, Existing Multimodal, Multimodal Large
备注:
点击查看摘要
Abstract:Existing Multimodal Large Language Models (MLLMs) for image forgery detection and localization predominantly operate under a text-centric Chain-of-Thought (CoT) paradigm. However, forcing these models to textually characterize imperceptible low-level tampering traces inevitably leads to hallucinations, as linguistic modalities are insufficient to capture such fine-grained pixel-level inconsistencies. To overcome this, we propose ForgeryVCR, a framework that incorporates a forensic toolbox to materialize imperceptible traces into explicit visual intermediates via Visual-Centric Reasoning. To enable efficient tool utilization, we introduce a Strategic Tool Learning post-training paradigm, encompassing gain-driven trajectory construction for Supervised Fine-Tuning (SFT) and subsequent Reinforcement Learning (RL) optimization guided by a tool utility reward. This paradigm empowers the MLLM to act as a proactive decision-maker, learning to spontaneously invoke multi-view reasoning paths including local zoom-in for fine-grained inspection and the analysis of invisible inconsistencies in compression history, noise residuals, and frequency domains. Extensive experiments reveal that ForgeryVCR achieves state-of-the-art (SOTA) performance in both detection and localization tasks, demonstrating superior generalization and robustness with minimal tool redundancy. The project page is available at this https URL.
78. 【2602.14071】Bidirectional Temporal Dynamics Modeling for EEG-based Driving Fatigue Recognition
链接:https://arxiv.org/abs/2602.14071
作者:YipTin Po,Jianming Wang,Yutao Miao,Jiayan Zhang,Yunxu Zhao,Xiaomin Ouyang,Zhihong Li,Nevin L. Zhang
类目:Other Computer Science (cs.OH); Computer Vision and Pattern Recognition (cs.CV)
关键词:Bidirectional temporal dynamics, Driving fatigue, SADT driving fatigue, road safety, driving fatigue recognition
备注:
点击查看摘要
Abstract:Driving fatigue is a major contributor to traffic accidents and poses a serious threat to road safety. Electroencephalography (EEG) provides a direct measurement of neural activity, yet EEG-based fatigue recognition is hindered by strong non-stationarity and asymmetric neural dynamics. To address these challenges, we propose DeltaGateNet, a novel framework that explicitly captures Bidirectional temporal dynamics for EEG-based driving fatigue recognition. Our key idea is to introduce a Bidirectional Delta module that decomposes first-order temporal differences into positive and negative components, enabling explicit modeling of asymmetric neural activation and suppression patterns. Furthermore, we design a Gated Temporal Convolution module to capture long-term temporal dependencies for each EEG channel using depthwise temporal convolutions and residual learning, preserving channel-wise specificity while enhancing temporal representation robustness. Extensive experiments conducted under both intra-subject and inter-subject evaluation settings on the public SEED-VIG and SADT driving fatigue datasets demonstrate that DeltaGateNet consistently outperforms existing methods. On SEED-VIG, DeltaGateNet achieves an intra-subject accuracy of 81.89% and an inter-subject accuracy of 55.55%. On the balanced SADT 2022 dataset, it attains intra-subject and inter-subject accuracies of 96.81% and 83.21%, respectively, while on the unbalanced SADT 2952 dataset, it achieves 96.84% intra-subject and 84.49% inter-subject accuracy. These results indicate that explicitly modeling Bidirectional temporal dynamics yields robust and generalizable performance under varying subject and class-distribution conditions.
79. 【2602.14068】CoCoEdit: Content-Consistent Image Editing via Region Regularized Reinforcement Learning
链接:https://arxiv.org/abs/2602.14068
作者:Yuhui Wu,Chenxi Xie,Ruibin Li,Liyi Chen,Qiaosi Yi,Lei Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved impressive results, large-scale generative models, Image editing, editing, achieved impressive
备注:
点击查看摘要
Abstract:Image editing has achieved impressive results with the development of large-scale generative models. However, existing models mainly focus on the editing effects of intended objects and regions, often leading to unwanted changes in unintended regions. We present a post-training framework for Content-Consistent Editing (CoCoEdit) via region regularized reinforcement learning. We first augment existing editing datasets with refined instructions and masks, from which 40K diverse and high quality samples are curated as training set. We then introduce a pixel-level similarity reward to complement MLLM-based rewards, enabling models to ensure both editing quality and content consistency during the editing process. To overcome the spatial-agnostic nature of the rewards, we propose a region-based regularizer, aiming to preserve non-edited regions for high-reward samples while encouraging editing effects for low-reward samples. For evaluation, we annotate editing masks for GEdit-Bench and ImgEdit-Bench, introducing pixel-level similarity metrics to measure content consistency and editing quality. Applying CoCoEdit to Qwen-Image-Edit and FLUX-Kontext, we achieve not only competitive editing scores with state-of-the-art models, but also significantly better content consistency, measured by PSNR/SSIM metrics and human subjective ratings.
80. 【2602.14048】ProAct: A Dual-System Framework for Proactive Embodied Social Agents
链接:https://arxiv.org/abs/2602.14048
作者:Zeyi Zhang,Zixi Kang,Ruijie Zhao,Yusen Feng,Biao Jiang,Libin Liu
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:generating synchronized speech, agents have recently, recently advanced, advanced in generating, generating synchronized
备注: Project Page: [this https URL](https://proactrobot.github.io/)
点击查看摘要
Abstract:Embodied social agents have recently advanced in generating synchronized speech and gestures. However, most interactive systems remain fundamentally reactive, responding only to current sensory inputs within a short temporal window. Proactive social behavior, in contrast, requires deliberation over accumulated context and intent inference, which conflicts with the strict latency budget of real-time interaction. We present \emph{ProAct}, a dual-system framework that reconciles this time-scale conflict by decoupling a low-latency \emph{Behavioral System} for streaming multimodal interaction from a slower \emph{Cognitive System} which performs long-horizon social reasoning and produces high-level proactive intentions. To translate deliberative intentions into continuous non-verbal behaviors without disrupting fluency, we introduce a streaming flow-matching model conditioned on intentions via ControlNet. This mechanism supports asynchronous intention injection, enabling seamless transitions between reactive and proactive gestures within a single motion stream. We deploy ProAct on a physical humanoid robot and evaluate both motion quality and interactive effectiveness. In real-world interaction user studies, participants and observers consistently prefer ProAct over reactive variants in perceived proactivity, social presence, and overall engagement, demonstrating the benefits of dual-system proactive control for embodied social interaction.
81. 【2602.14042】Restoration Adaptation for Semantic Segmentation on Low Quality Images
链接:https://arxiv.org/abs/2602.14042
作者:Kai Guan,Rongyuan Wu,Shuai Li,Wentao Zhu,Wenjun Zeng,Lei Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:clear semantic structures, segmentation, processing low-quality, high-frequency details, restoration
备注:
点击查看摘要
Abstract:In real-world scenarios, the performance of semantic segmentation often deteriorates when processing low-quality (LQ) images, which may lack clear semantic structures and high-frequency details. Although image restoration techniques offer a promising direction for enhancing degraded visual content, conventional real-world image restoration (Real-IR) models primarily focus on pixel-level fidelity and often fail to recover task-relevant semantic cues, limiting their effectiveness when directly applied to downstream vision tasks. Conversely, existing segmentation models trained on high-quality data lack robustness under real-world degradations. In this paper, we propose Restoration Adaptation for Semantic Segmentation (RASS), which effectively integrates semantic image restoration into the segmentation process, enabling high-quality semantic segmentation on the LQ images directly. Specifically, we first propose a Semantic-Constrained Restoration (SCR) model, which injects segmentation priors into the restoration model by aligning its cross-attention maps with segmentation masks, encouraging semantically faithful image reconstruction. Then, RASS transfers semantic restoration knowledge into segmentation through LoRA-based module merging and task-specific fine-tuning, thereby enhancing the model's robustness to LQ images. To validate the effectiveness of our framework, we construct a real-world LQ image segmentation dataset with high-quality annotations, and conduct extensive experiments on both synthetic and real-world LQ benchmarks. The results show that SCR and RASS significantly outperform state-of-the-art methods in segmentation and restoration tasks. Code, models, and datasets will be available at this https URL.
82. 【2602.14041】BitDance: Scaling Autoregressive Generative Models with Binary Tokens
链接:https://arxiv.org/abs/2602.14041
作者:Yuang Ai,Jiaming Han,Shaobin Zhuang,Weijia Mao,Xuefeng Hu,Ziyan Yang,Zhenheng Yang,Huaibo Huang,Xiangyu Yue,Hao Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:scalable autoregressive, codebook indices, predicts binary visual, BitDance, binary visual tokens
备注: Code and models: [this https URL](https://github.com/shallowdream204/BitDance)
点击查看摘要
Abstract:We present BitDance, a scalable autoregressive (AR) image generator that predicts binary visual tokens instead of codebook indices. With high-entropy binary latents, BitDance lets each token represent up to $2^{256}$ states, yielding a compact yet highly expressive discrete representation. Sampling from such a huge token space is difficult with standard classification. To resolve this, BitDance uses a binary diffusion head: instead of predicting an index with softmax, it employs continuous-space diffusion to generate the binary tokens. Furthermore, we propose next-patch diffusion, a new decoding method that predicts multiple tokens in parallel with high accuracy, greatly speeding up inference. On ImageNet 256x256, BitDance achieves an FID of 1.24, the best among AR models. With next-patch diffusion, BitDance beats state-of-the-art parallel AR models that use 1.4B parameters, while using 5.4x fewer parameters (260M) and achieving 8.7x speedup. For text-to-image generation, BitDance trains on large-scale multimodal tokens and generates high-resolution, photorealistic images efficiently, showing strong performance and favorable scaling. When generating 1024x1024 images, BitDance achieves a speedup of over 30x compared to prior AR models. We release code and models to facilitate further research on AR foundation models. Code and models are available at: this https URL.
83. 【2602.14040】Explainability-Inspired Layer-Wise Pruning of Deep Neural Networks for Efficient Object Detection
链接:https://arxiv.org/abs/2602.14040
作者:Abhinav Shukla,Nachiket Tapas
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved remarkable success, increasing complexity poses, complexity poses significant, poses significant challenges, object detection tasks
备注:
点击查看摘要
Abstract:Deep neural networks (DNNs) have achieved remarkable success in object detection tasks, but their increasing complexity poses significant challenges for deployment on resource-constrained platforms. While model compression techniques such as pruning have emerged as essential tools, traditional magnitude-based pruning methods do not necessarily align with the true functional contribution of network components to task-specific performance. In this work, we present an explainability-inspired, layer-wise pruning framework tailored for efficient object detection. Our approach leverages a SHAP-inspired gradient--activation attribution to estimate layer importance, providing a data-driven proxy for functional contribution rather than relying solely on static weight magnitudes. We conduct comprehensive experiments across diverse object detection architectures, including ResNet-50, MobileNetV2, ShuffleNetV2, Faster R-CNN, RetinaNet, and YOLOv8, evaluating performance on the Microsoft COCO 2017 validation set. The results show that the proposed attribution-inspired pruning consistently identifies different layers as least important compared to L1-norm-based methods, leading to improved accuracy--efficiency trade-offs. Notably, for ShuffleNetV2, our method yields a 10\% empirical increase in inference speed, whereas L1-pruning degrades performance by 13.7\%. For RetinaNet, the proposed approach preserves the baseline mAP (0.151) with negligible impact on inference speed, while L1-pruning incurs a 1.3\% mAP drop for a 6.2\% speed increase. These findings highlight the importance of data-driven layer importance assessment and demonstrate that explainability-inspired compression offers a principled direction for deploying deep neural networks on edge and resource-constrained platforms while preserving both performance and interpretability.
84. 【2602.14027】rain Short, Inference Long: Training-free Horizon Extension for Autoregressive Video Generation
链接:https://arxiv.org/abs/2602.14027
作者:Jia Li,Xiaomeng Fu,Xurui Peng,Weifeng Chen,Youwei Zheng,Tianyu Zhao,Jiexi Wang,Fangmin Chen,Xing Wang,Hayden Kwok-Hay So
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Autoregressive video diffusion, Autoregressive video, scalable paradigm, paradigm for long, Antiphase Noise Sampling
备注: 19 pages, 15 figures
点击查看摘要
Abstract:Autoregressive video diffusion models have emerged as a scalable paradigm for long video generation. However, they often suffer from severe extrapolation failure, where rapid error accumulation leads to significant temporal degradation when extending beyond training horizons. We identify that this failure primarily stems from the spectral bias of 3D positional embeddings and the lack of dynamic priors in noise sampling. To address these issues, we propose FLEX (Frequency-aware Length EXtension), a training-free inference-time framework that bridges the gap between short-term training and long-term inference. FLEX introduces Frequency-aware RoPE Modulation to adaptively interpolate under-trained low-frequency components while extrapolating high-frequency ones to preserve multi-scale temporal discriminability. This is integrated with Antiphase Noise Sampling (ANS) to inject high-frequency dynamic priors and Inference-only Attention Sink to anchor global structure. Extensive evaluations on VBench demonstrate that FLEX significantly outperforms state-of-the-art models at 6x extrapolation (30s duration) and matches the performance of long-video fine-tuned baselines at 12x scale (60s duration). As a plug-and-play augmentation, FLEX seamlessly integrates into existing inference pipelines for horizon extension. It effectively pushes the generation limits of models such as LongLive, supporting consistent and dynamic video synthesis at a 4-minute scale. Project page is available at this https URL.
85. 【2602.14021】Flow4R: Unifying 4D Reconstruction and Tracking with Scene Flow
链接:https://arxiv.org/abs/2602.14021
作者:Shenhan Qian,Ganlin Zhang,Shangzhe Wu,Daniel Cremers
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Reconstructing and tracking, remains a fundamental, fundamental challenge, challenge in computer, Reconstructing
备注: Project Page: [this https URL](https://shenhanqian.github.io/flow4r)
点击查看摘要
Abstract:Reconstructing and tracking dynamic 3D scenes remains a fundamental challenge in computer vision. Existing approaches often decouple geometry from motion: multi-view reconstruction methods assume static scenes, while dynamic tracking frameworks rely on explicit camera pose estimation or separate motion models. We propose Flow4R, a unified framework that treats camera-space scene flow as the central representation linking 3D structure, object motion, and camera motion. Flow4R predicts a minimal per-pixel property set-3D point position, scene flow, pose weight, and confidence-from two-view inputs using a Vision Transformer. This flow-centric formulation allows local geometry and bidirectional motion to be inferred symmetrically with a shared decoder in a single forward pass, without requiring explicit pose regressors or bundle adjustment. Trained jointly on static and dynamic datasets, Flow4R achieves state-of-the-art performance on 4D reconstruction and tracking tasks, demonstrating the effectiveness of the flow-central representation for spatiotemporal scene understanding.
86. 【2602.14010】A Deployment-Friendly Foundational Framework for Efficient Computational Pathology
链接:https://arxiv.org/abs/2602.14010
作者:Yu Cai,Cheng Jin,Jiabo Ma,Fengtao Zhou,Yingxue Xu,Zhengrui Guo,Yihui Wang,Zhengyu Zhang,Ling Liang,Yonghao Tan,Pingcheng Dong,Du Cai,On Ki Tang,Chenglong Zhao,Xi Wang,Can Yang,Yali Xu,Jing Cui,Zhenhui Li,Ronald Cheong Kin Chan,Yueping Liu,Feng Gao,Xiuming Zhang,Li Liang,Hao Chen,Kwang-Ting Cheng
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:substantial computational cost, limits clinical accessibility, enabled robust generalization, computational cost, Adaptive Patch Selector
备注:
点击查看摘要
Abstract:Pathology foundation models (PFMs) have enabled robust generalization in computational pathology through large-scale datasets and expansive architectures, but their substantial computational cost, particularly for gigapixel whole slide images, limits clinical accessibility and scalability. Here, we present LitePath, a deployment-friendly foundational framework designed to mitigate model over-parameterization and patch level redundancy. LitePath integrates LiteFM, a compact model distilled from three large PFMs (Virchow2, H-Optimus-1 and UNI2) using 190 million patches, and the Adaptive Patch Selector (APS), a lightweight component for task-specific patch selection. The framework reduces model parameters by 28x and lowers FLOPs by 403.5x relative to Virchow2, enabling deployment on low-power edge hardware such as the NVIDIA Jetson Orin Nano Super. On this device, LitePath processes 208 slides per hour, 104.5x faster than Virchow2, and consumes 0.36 kWh per 3,000 slides, 171x lower than Virchow2 on an RTX3090 GPU. We validated accuracy using 37 cohorts across four organs and 26 tasks (26 internal, 9 external, and 2 prospective), comprising 15,672 slides from 9,808 patients disjoint from the pretraining data. LitePath ranks second among 19 evaluated models and outperforms larger models including H-Optimus-1, mSTAR, UNI2 and GPFM, while retaining 99.71% of the AUC of Virchow2 on average. To quantify the balance between accuracy and efficiency, we propose the Deployability Score (D-Score), defined as the weighted geometric mean of normalized AUC and normalized FLOP, where LitePath achieves the highest value, surpassing Virchow2 by 10.64%. These results demonstrate that LitePath enables rapid, cost-effective and energy-efficient pathology image analysis on accessible hardware while maintaining accuracy comparable to state-of-the-art PFMs and reducing the carbon footprint of AI deployment.
87. 【2602.13994】Inject Where It Matters: Training-Free Spatially-Adaptive Identity Preservation for Text-to-Image Personalization
链接:https://arxiv.org/abs/2602.13994
作者:Guandong Li,Mengxia Ye
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:integrate specific identities, Spatially Uniform Visual, arbitrary contexts, employ Spatially Uniform, aims to integrate
备注:
点击查看摘要
Abstract:Personalized text-to-image generation aims to integrate specific identities into arbitrary contexts. However, existing tuning-free methods typically employ Spatially Uniform Visual Injection, causing identity features to contaminate non-facial regions (e.g., backgrounds and lighting) and degrading text adherence. To address this without expensive fine-tuning, we propose SpatialID, a training-free spatially-adaptive identity modulation framework. SpatialID fundamentally decouples identity injection into face-relevant and context-free regions using a Spatial Mask Extractor derived from cross-attention responses. Furthermore, we introduce a Temporal-Spatial Scheduling strategy that dynamically adjusts spatial constraints - transitioning from Gaussian priors to attention-based masks and adaptive relaxation - to align with the diffusion generation dynamics. Extensive experiments on IBench demonstrate that SpatialID achieves state-of-the-art performance in text adherence (CLIP-T: 0.281), visual consistency (CLIP-I: 0.827), and image quality (IQ: 0.523), significantly eliminating background contamination while maintaining robust identity preservation.
88. 【2602.13993】Elastic Diffusion Transformer
链接:https://arxiv.org/abs/2602.13993
作者:Jiangshan Wang,Zeqiang Lai,Jiarui Chen,Jiayi Guo,Hang Guo,Xiu Li,Xiangyu Yue,Chunchao Guo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:highly computationally expensive, Elastic Diffusion Transformer, remain highly computationally, demonstrated remarkable generative, remarkable generative capabilities
备注:
点击查看摘要
Abstract:Diffusion Transformers (DiT) have demonstrated remarkable generative capabilities but remain highly computationally expensive. Previous acceleration methods, such as pruning and distillation, typically rely on a fixed computational capacity, leading to insufficient acceleration and degraded generation quality. To address this limitation, we propose \textbf{Elastic Diffusion Transformer (E-DiT)}, an adaptive acceleration framework for DiT that effectively improves efficiency while maintaining generation quality. Specifically, we observe that the generative process of DiT exhibits substantial sparsity (i.e., some computations can be skipped with minimal impact on quality), and this sparsity varies significantly across samples. Motivated by this observation, E-DiT equips each DiT block with a lightweight router that dynamically identifies sample-dependent sparsity from the input latent. Each router adaptively determines whether the corresponding block can be skipped. If the block is not skipped, the router then predicts the optimal MLP width reduction ratio within the block. During inference, we further introduce a block-level feature caching mechanism that leverages router predictions to eliminate redundant computations in a training-free manner. Extensive experiments across 2D image (Qwen-Image and FLUX) and 3D asset (Hunyuan3D-3.0) demonstrate the effectiveness of E-DiT, achieving up to $\sim$2$\times$ speedup with negligible loss in generation quality. Code will be available at this https URL.
89. 【2602.13961】MarsRetrieval: Benchmarking Vision-Language Models for Planetary-Scale Geospatial Retrieval on Mars
链接:https://arxiv.org/abs/2602.13961
作者:Shuoyuan Wang,Yiran Wang,Hongxin Wei
类目:Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computation and Language (cs.CL)
关键词:Mars exploration, Data-driven approaches, advancing planetary science, rapidly advancing planetary, approaches like deep
备注:
点击查看摘要
Abstract:Data-driven approaches like deep learning are rapidly advancing planetary science, particularly in Mars exploration. Despite recent progress, most existing benchmarks remain confined to closed-set supervised visual tasks and do not support text-guided retrieval for geospatial discovery. We introduce MarsRetrieval, a retrieval benchmark for evaluating vision-language models for Martian geospatial discovery. MarsRetrieval includes three tasks: (1) paired image-text retrieval, (2) landform retrieval, and (3) global geo-localization, covering multiple spatial scales and diverse geomorphic origins. We propose a unified retrieval-centric protocol to benchmark multimodal embedding architectures, including contrastive dual-tower encoders and generative vision-language models. Our evaluation shows MarsRetrieval is challenging: even strong foundation models often fail to capture domain-specific geomorphic distinctions. We further show that domain-specific fine-tuning is critical for generalizable geospatial discovery in planetary settings. Our code is available at this https URL
90. 【2602.13944】Fusing Pixels and Genes: Spatially-Aware Learning in Computational Pathology
链接:https://arxiv.org/abs/2602.13944
作者:Minghao Han,Dingkang Yang,Linhao Qu,Zizhi Chen,Gang Li,Han Wang,Jiacong Wang,Lihua Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:witnessed remarkable progress, Recent years, years have witnessed, witnessed remarkable, remarkable progress
备注: accepted by ICLR 2026, 34 pages, 10 figures, 7tables
点击查看摘要
Abstract:Recent years have witnessed remarkable progress in multimodal learning within computational pathology. Existing models primarily rely on vision and language modalities; however, language alone lacks molecular specificity and offers limited pathological supervision, leading to representational bottlenecks. In this paper, we propose STAMP, a Spatial Transcriptomics-Augmented Multimodal Pathology representation learning framework that integrates spatially-resolved gene expression profiles to enable molecule-guided joint embedding of pathology images and transcriptomic data. Our study shows that self-supervised, gene-guided training provides a robust and task-agnostic signal for learning pathology image representations. Incorporating spatial context and multi-scale information further enhances model performance and generalizability. To support this, we constructed SpaVis-6M, the largest Visium-based spatial transcriptomics dataset to date, and trained a spatially-aware gene encoder on this resource. Leveraging hierarchical multi-scale contrastive alignment and cross-scale patch localization mechanisms, STAMP effectively aligns spatial transcriptomics with pathology images, capturing spatial structure and molecular variation. We validate STAMP across six datasets and four downstream tasks, where it consistently achieves strong performance. These results highlight the value and necessity of integrating spatially resolved molecular supervision for advancing multimodal learning in computational pathology. The code is included in the supplementary materials. The pretrained weights and SpaVis-6M are available at: this https URL.
91. 【2602.13930】MamaDino: A Hybrid Vision Model for Breast Cancer 3-Year Risk Prediction
链接:https://arxiv.org/abs/2602.13930
作者:Ruggiero Santeramo,Igor Zubarev,Florian Jug
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:programmes increasingly seek, screening programmes increasingly, interval to risk-adapted, personalized strategies, programmes increasingly
备注: 16 pages
点击查看摘要
Abstract:Breast cancer screening programmes increasingly seek to move from one-size-fits-all interval to risk-adapted and personalized strategies. Deep learning (DL) has enabled image-based risk models with stronger 1- to 5-year prediction than traditional clinical models, but leading systems (e.g., Mirai) typically use convolutional backbones, very high-resolution inputs (1M pixels) and simple multi-view fusion, with limited explicit modelling of contralateral asymmetry. We hypothesised that combining complementary inductive biases (convolutional and transformer-based) with explicit contralateral asymmetry modelling would allow us to match state-of-the-art 3-year risk prediction performance even when operating on substantially lower-resolution mammograms, indicating that using less detailed images in a more structured way can recover state-of-the-art accuracy. We present MamaDino, a mammography-aware multi-view attentional DINO model. MamaDino fuses frozen self-supervised DINOv3 ViT-S features with a trainable CNN encoder at 512x512 resolution, and aggregates bilateral breast information via a BilateralMixer to output a 3-year breast cancer risk score. We train on 53,883 women from OPTIMAM (UK) and evaluate on matched 3-year case-control cohorts: an in-distribution test set from four screening sites and an external out-of-distribution cohort from an unseen site. At breast-level, MamaDino matches Mirai on both internal and external tests while using ~13x fewer input pixels. Adding the BilateralMixer improves discrimination to AUC 0.736 (vs 0.713) in-distribution and 0.677 (vs 0.666) out-of-distribution, with consistent performance across age, ethnicity, scanner, tumour type and grade. These findings demonstrate that explicit contralateral modelling and complementary inductive biases enable predictions that match Mirai, despite operating on substantially lower-resolution mammograms.
Comments:
16 pages
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:
arXiv:2602.13930 [cs.CV]
(or
arXiv:2602.13930v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.13930
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Ruggiero Santeramo [view email] [v1]
Sat, 14 Feb 2026 23:56:22 UTC (1,329 KB)
92. 【2602.13909】High-fidelity 3D reconstruction for planetary exploration
链接:https://arxiv.org/abs/2602.13909
作者:Alfonso Martínez-Petersen,Levin Gerdes,David Rodríguez-Martínez,C. J. Pérez-del-Pulgar
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:exploration increasingly relies, Planetary exploration increasingly, increasingly relies, reconstructing their surroundings, absence of global
备注: 7 pages, 3 figures, conference paper
点击查看摘要
Abstract:Planetary exploration increasingly relies on autonomous robotic systems capable of perceiving, interpreting, and reconstructing their surroundings in the absence of global positioning or real-time communication with Earth. Rovers operating on planetary surfaces must navigate under sever environmental constraints, limited visual redundancy, and communication delays, making onboard spatial awareness and visual localization key components for mission success. Traditional techniques based on Structure-from-Motion (SfM) and Simultaneous Localization and Mapping (SLAM) provide geometric consistency but struggle to capture radiometric detail or to scale efficiently in unstructured, low-texture terrains typical of extraterrestrial environments. This work explores the integration of radiance field-based methods - specifically Neural Radiance Fields (NeRF) and Gaussian Splatting - into a unified, automated environment reconstruction pipeline for planetary robotics. Our system combines the Nerfstudio and COLMAP frameworks with a ROS2-compatible workflow capable of processing raw rover data directly from rosbag recordings. This approach enables the generation of dense, photorealistic, and metrically consistent 3D representations from minimal visual input, supporting improved perception and planning for autonomous systems operating in planetary-like conditions. The resulting pipeline established a foundation for future research in radiance field-based mapping, bridging the gap between geometric and neural representations in planetary exploration.
93. 【2602.13901】RPGD: RANSAC-P3P Gradient Descent for Extrinsic Calibration in 3D Human Pose Estimation
链接:https://arxiv.org/abs/2602.13901
作者:Zhanyu Tuo
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
关键词:multi-view RGB cameras, Gradient Descent, robustly aligns MoCap-based, natural human motion, multi-view RGB
备注: Accepted at AAIML 2026. This work is co-funded by the European Union's Horizon Europe research and innovation programme under MSCA with grant agreement No 101081674
点击查看摘要
Abstract:In this paper, we propose RPGD (RANSAC-P3P Gradient Descent), a human-pose-driven extrinsic calibration framework that robustly aligns MoCap-based 3D skeletal data with monocular or multi-view RGB cameras using only natural human motion. RPGD formulates extrinsic calibration as a coarse-to-fine problem tailored to human poses, combining the global robustness of RANSAC-P3P with Gradient-Descent-based refinement. We evaluate RPGD on three large-scale public 3D HPE datasets as well as on a self-collected in-the-wild dataset. Experimental results demonstrate that RPGD consistently recovers extrinsic parameters with accuracy comparable to the provided ground truth, achieving sub-pixel MPJPE reprojection error even in challenging, noisy settings. These results indicate that RPGD provides a practical and automatic solution for reliable extrinsic calibration of large-scale 3D HPE dataset collection.
94. 【2602.13889】Parameter-Efficient Fine-Tuning of DINOv2 for Large-Scale Font Classification
链接:https://arxiv.org/abs/2602.13889
作者:Daniel Chen,Zaria Zinn,Marcus Lowe
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:classification system capable, font classification system, rendered text images, capable of identifying, classification system
备注:
点击查看摘要
Abstract:We present a font classification system capable of identifying 394 font families from rendered text images. Our approach fine-tunes a DINOv2 Vision Transformer using Low-Rank Adaptation (LoRA), achieving approximately 86% top-1 accuracy while training fewer than 1% of the model's 87.2M parameters. We introduce a synthetic dataset generation pipeline that renders Google Fonts at scale with diverse augmentations including randomized colors, alignment, line wrapping, and Gaussian noise, producing training images that generalize to real-world typographic samples. The model incorporates built-in preprocessing to ensure consistency between training and inference, and is deployed as a HuggingFace Inference Endpoint. We release the model, dataset, and full training pipeline as open-source resources.
95. 【2602.13887】Human-Aligned Evaluation of a Pixel-wise DNN Color Constancy Model
链接:https://arxiv.org/abs/2602.13887
作者:Hamed Heidari-Gorji,Raquel Gil Rodriguez,Karl R. Gegenfurtner
类目:Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
关键词:Deep Neural Network, Deep Neural, photorealistic virtual reality, previously investigated color, developed a Deep
备注:
点击查看摘要
Abstract:We previously investigated color constancy in photorealistic virtual reality (VR) and developed a Deep Neural Network (DNN) that predicts reflectance from rendered images. Here, we combine both approaches to compare and study a model and human performance with respect to established color constancy mechanisms: local surround, maximum flux and spatial mean. Rather than evaluating the model against physical ground truth, model performance was assessed using the same achromatic object selection task employed in the human experiments. The model, a ResNet based U-Net from our previous work, was pre-trained on rendered images to predict surface reflectance. We then applied transfer learning, fine-tuning only the network's decoder on images from the baseline VR condition. To parallel the human experiment, the model's output was used to perform the same achromatic object selection task across all conditions. Results show a strong correspondence between the model and human behavior. Both achieved high constancy under baseline conditions and showed similar, condition-dependent performance declines when the local surround or spatial mean color cues were removed.
96. 【2602.13880】VSAL: A Vision Solver with Adaptive Layouts for Graph Property Detection
链接:https://arxiv.org/abs/2602.13880
作者:Jiahao Xie,Guangmo Tong
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:aims to determine, exhibits certain structural, property detection aims, structural properties, Graph
备注: Accepted by The Web Conference (WWW) 2026
点击查看摘要
Abstract:Graph property detection aims to determine whether a graph exhibits certain structural properties, such as being Hamiltonian. Recently, learning-based approaches have shown great promise by leveraging data-driven models to detect graph properties efficiently. In particular, vision-based methods offer a visually intuitive solution by processing the visualizations of graphs. However, existing vision-based methods rely on fixed visual graph layouts, and therefore, the expressiveness of their pipeline is restricted. To overcome this limitation, we propose VSAL, a vision-based framework that incorporates an adaptive layout generator capable of dynamically producing informative graph visualizations tailored to individual instances, thereby improving graph property detection. Extensive experiments demonstrate that VSAL outperforms state-of-the-art vision-based methods on various tasks such as Hamiltonian cycle, planarity, claw-freeness, and tree detection.
97. 【2602.13859】Low-Pass Filtering Improves Behavioral Alignment of Vision Models
链接:https://arxiv.org/abs/2602.13859
作者:Max Wolff,Thomas Klein,Evgenia Rusak,Felix Wichmann,Wieland Brendel
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Deep Neural Networks, Deep Neural, Neural Networks, adequately modeling human, shape bias
备注: 10 pages, 6 figures
点击查看摘要
Abstract:Despite their impressive performance on computer vision benchmarks, Deep Neural Networks (DNNs) still fall short of adequately modeling human visual behavior, as measured by error consistency and shape bias. Recent work hypothesized that behavioral alignment can be drastically improved through \emph{generative} -- rather than \emph{discriminative} -- classifiers, with far-reaching implications for models of human vision. Here, we instead show that the increased alignment of generative models can be largely explained by a seemingly innocuous resizing operation in the generative model which effectively acts as a low-pass filter. In a series of controlled experiments, we show that removing high-frequency spatial information from discriminative models like CLIP drastically increases their behavioral alignment. Simply blurring images at test-time -- rather than training on blurred images -- achieves a new state-of-the-art score on the model-vs-human benchmark, halving the current alignment gap between DNNs and human observers. Furthermore, low-pass filters are likely optimal, which we demonstrate by directly optimizing filters for alignment. To contextualize the performance of optimal filters, we compute the frontier of all possible pareto-optimal solutions to the benchmark, which was formerly unknown. We explain our findings by observing that the frequency spectrum of optimal Gaussian filters roughly matches the spectrum of band-pass filters implemented by the human visual system. We show that the contrast sensitivity function, describing the inverse of the contrast threshold required for humans to detect a sinusoidal grating as a function of spatiotemporal frequency, is approximated well by Gaussian filters of the specific width that also maximizes error consistency.
Comments:
10 pages, 6 figures
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2602.13859 [cs.CV]
(or
arXiv:2602.13859v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.13859
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
98. 【2602.13846】Cardiac Output Prediction from Echocardiograms: Self-Supervised Learning with Limited Data
链接:https://arxiv.org/abs/2602.13846
作者:Adson Duarte,Davide Vitturini,Emanuele Milillo,Andrea Bragagnolo,Carlo Alberto Barbano,Riccardo Renzulli,Michele Cannito,Federico Giacobbe,Francesco Bruno,Ovidio de Filippo,Fabrizio D'Ascenzo,Marco Grangetto
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Cardiac Output, cardiovascular diseases, key parameter, diagnosis and management, management of cardiovascular
备注: Accepted at ISBI 2026
点击查看摘要
Abstract:Cardiac Output (CO) is a key parameter in the diagnosis and management of cardiovascular diseases. However, its accurate measurement requires right-heart catheterization, an invasive and time-consuming procedure, motivating the development of reliable non-invasive alternatives using echocardiography. In this work, we propose a self-supervised learning (SSL) pretraining strategy based on SimCLR to improve CO prediction from apical four-chamber echocardiographic videos. The pretraining is performed using the same limited dataset available for the downstream task, demonstrating the potential of SSL even under data scarcity. Our results show that SSL mitigates overfitting and improves representation learning, achieving an average Pearson correlation of 0.41 on the test set and outperforming PanEcho, a model trained on over one million echocardiographic exams. Source code is available at this https URL.
99. 【2602.13844】Synthetic Dataset Generation and Validation for Robotic Surgery Instrument Segmentation
链接:https://arxiv.org/abs/2602.13844
作者:Giorgio Chiesa,Rossella Borra,Vittorio Lauro,Sabrina De Cillis,Daniele Amparore,Cristian Fiori,Riccardo Renzulli,Marco Grangetto
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:synthetic dataset designed, robotic surgery instrument, Vinci robotic arms, paper presents, presents a comprehensive
备注: Accepted at ISBI 2026
点击查看摘要
Abstract:This paper presents a comprehensive workflow for generating and validating a synthetic dataset designed for robotic surgery instrument segmentation. A 3D reconstruction of the Da Vinci robotic arms was refined and animated in Autodesk Maya through a fully automated Python-based pipeline capable of producing photorealistic, labeled video sequences. Each scene integrates randomized motion patterns, lighting variations, and synthetic blood textures to mimic intraoperative variability while preserving pixel-accurate ground truth masks. To validate the realism and effectiveness of the generated data, several segmentation models were trained under controlled ratios of real and synthetic data. Results demonstrate that a balanced composition of real and synthetic samples significantly improves model generalization compared to training on real data only, while excessive reliance on synthetic data introduces a measurable domain shift. The proposed framework provides a reproducible and scalable tool for surgical computer vision, supporting future research in data augmentation, domain adaptation, and simulation-based pretraining for robotic-assisted surgery. Data and code are available at this https URL.
100. 【2602.13842】Automated Prediction of Paravalvular Regurgitation before Transcatheter Aortic Valve Implantation
链接:https://arxiv.org/abs/2602.13842
作者:Michele Cannito,Riccardo Renzulli,Adson Duarte,Farzad Nikfam,Carlo Alberto Barbano,Enrico Chiesa,Francesco Bruno,Federico Giacobbe,Wojciech Wanha,Arturo Giordano,Marco Grangetto,Fabrizio D'Ascenzo
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Aortic Valve Implantation, Transcatheter Aortic Valve, Severe aortic stenosis, Valve Implantation, treated with Transcatheter
备注: Accepted at ISBI 2026
点击查看摘要
Abstract:Severe aortic stenosis is a common and life-threatening condition in elderly patients, often treated with Transcatheter Aortic Valve Implantation (TAVI). Despite procedural advances, paravalvular aortic regurgitation (PVR) remains one of the most frequent post-TAVI complications, with a proven impact on long-term prognosis. In this work, we investigate the potential of deep learning to predict the occurrence of PVR from preoperative cardiac CT. To this end, a dataset of preoperative TAVI patients was collected, and 3D convolutional neural networks were trained on isotropic CT volumes. The results achieved suggest that volumetric deep learning can capture subtle anatomical features from pre-TAVI imaging, opening new perspectives for personalized risk assessment and procedural optimization. Source code is available at this https URL.
Comments:
Accepted at ISBI 2026
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2602.13842 [cs.CV]
(or
arXiv:2602.13842v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.13842
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
101. 【2602.13837】High-Fidelity Causal Video Diffusion Models for Real-Time Ultra-Low-Bitrate Semantic Communication
链接:https://arxiv.org/abs/2602.13837
作者:Cem Eteke,Batuhan Tosun,Alexander Griessel,Wolfgang Kellerer,Eckehard Steinbach
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:video diffusion model, real-time video generation, diffusion model, semantic communication constraints, video diffusion
备注:
点击查看摘要
Abstract:We introduce a video diffusion model for high-fidelity, causal, and real-time video generation under ultra-low-bitrate semantic communication constraints. Our approach utilizes lossy semantic video coding to transmit the semantic scene structure, complemented by a stream of highly compressed, low-resolution frames that provide sufficient texture information to preserve fidelity. Building on these inputs, we introduce a modular video diffusion model that contains Semantic Control, Restoration Adapter, and Temporal Adapter. We further introduce an efficient temporal distillation procedure that enables extension to real-time and causal synthesis, reducing trainable parameters by 300x and training time by 2x, while adhering to communication constraints. Evaluated across diverse datasets, the framework achieves strong perceptual quality, semantic fidelity, and temporal consistency at ultra-low bitrates ( 0.0003 bpp), outperforming classical, neural, and generative baselines in extensive quantitative, qualitative, and subjective evaluations.
102. 【2602.13831】Prior-guided Hierarchical Instance-pixel Contrastive Learning for Ultrasound Speckle Noise Suppression
链接:https://arxiv.org/abs/2602.13831
作者:Zhenyu Bu,Yuanxin Xie,Guang-Quan Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:mitigating speckle-induced degradations, enhancing image quality, improving diagnostic reliability, speckle-induced degradations, diagnostic reliability
备注:
点击查看摘要
Abstract:Ultrasound denoising is essential for mitigating speckle-induced degradations, thereby enhancing image quality and improving diagnostic reliability. Nevertheless, because speckle patterns inherently encode both texture and fine anatomical details, effectively suppressing noise while preserving structural fidelity remains a significant challenge. In this study, we propose a prior-guided hierarchical instance-pixel contrastive learning model for ultrasound denoising, designed to promote noise-invariant and structure-aware feature representations by maximizing the separability between noisy and clean samples at both pixel and instance levels. Specifically, a statistics-guided pixel-level contrastive learning strategy is introduced to enhance distributional discrepancies between noisy and clean pixels, thereby improving local structural consistency. Concurrently, a memory bank is employed to facilitate instance-level contrastive learning in the feature space, encouraging representations that more faithfully approximate the underlying data distribution. Furthermore, a hybrid Transformer-CNN architecture is adopted, coupling a Transformer-based encoder for global context modeling with a CNN-based decoder optimized for fine-grained anatomical structure restoration, thus enabling complementary exploitation of long-range dependencies and local texture details. Extensive evaluations on two publicly available ultrasound datasets demonstrate that the proposed model consistently outperforms existing methods, confirming its effectiveness and superiority.
103. 【2602.13823】Embed-RL: Reinforcement Learning for Reasoning-Driven Multimodal Embeddings
链接:https://arxiv.org/abs/2602.13823
作者:Haonan Jiang,Yuji Wang,Yongjie Zhu,Xin Lu,Wenyu Qin,Meng Wang,Pengfei Wan,Yansong Tang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Leveraging Multimodal Large, Multimodal Large Language, Large Language Models, Large Language, advancing Universal Multimodal
备注: The project page is [this URL]( [this https URL](https://github.com/ZoengHN/Embed-RL) )
点击查看摘要
Abstract:Leveraging Multimodal Large Language Models (MLLMs) has become pivotal for advancing Universal Multimodal Embeddings (UME) in addressing diverse cross-modal tasks. Recent studies demonstrate that incorporating generative Chain-of-Thought (CoT) reasoning can substantially enhance task-specific representations compared to discriminative methods. However, the generated reasoning CoTs of existing generative embedding methods are limited to the textual analysis of queries and are irrelevant to the retrieval of the targets. To address these limitations, we propose a reasoning-driven UME framework that integrates Embedder-Guided Reinforcement Learning (EG-RL) to optimize the Reasoner to produce evidential Traceability CoT (T-CoT). Our key contributions are threefold: (1) We design an EG-RL framework where the Embedder provides explicit supervision to the Reasoner, ensuring the generated CoT traces are aligned with embedding tasks. (2) We introduce T-CoT, which extracts critical multimodal cues to focus on retrieval-relevant elements and provides multimodal inputs for the Embedder. (3) With limited computational resources, our framework outperforms the pioneering embedding model on both MMEB-V2 and UVRB benchmarks. The integration of multimodal evidence in structured reasoning, paired with retrieval-oriented alignment, effectively strengthens cross-modal semantic consistency and boosts the fine-grained matching capability of the model as well as the generalization across complex scenarios. Our work demonstrates that targeted reasoning optimization can significantly improve multimodal embedding quality, providing a practical and efficient solution for reasoning-driven UME development.
104. 【2602.13818】VAR-3D: View-aware Auto-Regressive Model for Text-to-3D Generation via a 3D Tokenizer
链接:https://arxiv.org/abs/2602.13818
作者:Zongcheng Han,Dongyan Cao,Haoran Sun,Yu Hong
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:achieved remarkable success, Recent advances, transformers have achieved, achieved remarkable, remarkable success
备注:
点击查看摘要
Abstract:Recent advances in auto-regressive transformers have achieved remarkable success in generative modeling. However, text-to-3D generation remains challenging, primarily due to bottlenecks in learning discrete 3D representations. Specifically, existing approaches often suffer from information loss during encoding, causing representational distortion before the quantization process. This effect is further amplified by vector quantization, ultimately degrading the geometric coherence of text-conditioned 3D shapes. Moreover, the conventional two-stage training paradigm induces an objective mismatch between reconstruction and text-conditioned auto-regressive generation. To address these issues, we propose View-aware Auto-Regressive 3D (VAR-3D), which intergrates a view-aware 3D Vector Quantized-Variational AutoEncoder (VQ-VAE) to convert the complex geometric structure of 3D models into discrete tokens. Additionally, we introduce a rendering-supervised training strategy that couples discrete token prediction with visual reconstruction, encouraging the generative process to better preserve visual fidelity and structural consistency relative to the input text. Experiments demonstrate that VAR-3D significantly outperforms existing methods in both generation quality and text-3D alignment.
105. 【2602.13806】Gaussian Sequences with Multi-Scale Dynamics for 4D Reconstruction from Monocular Casual Videos
链接:https://arxiv.org/abs/2602.13806
作者:Can Li,Jie Gu,Jingmin Chen,Fangzhou Qiu,Lei Sun
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:scalable robot learning, remains highly ill-posed, Understanding dynamic scenes, settings remains highly, strictly monocular settings
备注:
点击查看摘要
Abstract:Understanding dynamic scenes from casual videos is critical for scalable robot learning, yet four-dimensional (4D) reconstruction under strictly monocular settings remains highly ill-posed. To address this challenge, our key insight is that real-world dynamics exhibits a multi-scale regularity from object to particle level. To this end, we design the multi-scale dynamics mechanism that factorizes complex motion fields. Within this formulation, we propose Gaussian sequences with multi-scale dynamics, a novel representation for dynamic 3D Gaussians derived through compositions of multi-level motion. This layered structure substantially alleviates ambiguity of reconstruction and promotes physically plausible dynamics. We further incorporate multi-modal priors from vision foundation models to establish complementary supervision, constraining the solution space and improving the reconstruction fidelity. Our approach enables accurate and globally consistent 4D reconstruction from monocular casual videos. Experiments of dynamic novel-view synthesis (NVS) on benchmark and real-world manipulation datasets demonstrate considerable improvements over existing methods.
106. 【2602.13801】Joint Orientation and Weight Optimization for Robust Watertight Surface Reconstruction via Dirichlet-Regularized Winding Fields
链接:https://arxiv.org/abs/2602.13801
作者:Jiaze Li,Daisheng Jin,Fei Hou,Junhui Hou,Zheng Liu,Shiqing Xin,Wenping Wang,Ying He
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:propose Dirichlet Winding, Dirichlet Winding Reconstruction, unoriented point clouds, propose Dirichlet, reconstructing watertight surfaces
备注:
点击查看摘要
Abstract:We propose Dirichlet Winding Reconstruction (DiWR), a robust method for reconstructing watertight surfaces from unoriented point clouds with non-uniform sampling, noise, and outliers. Our method uses the generalized winding number (GWN) field as the target implicit representation and jointly optimizes point orientations, per-point area weights, and confidence coefficients in a single pipeline. The optimization minimizes the Dirichlet energy of the induced winding field together with additional GWN-based constraints, allowing DiWR to compensate for non-uniform sampling, reduce the impact of noise, and downweight outliers during reconstruction, with no reliance on separate preprocessing. We evaluate DiWR on point clouds from 3D Gaussian Splatting, a computer-vision pipeline, and corrupted graphics benchmarks. Experiments show that DiWR produces plausible watertight surfaces on these challenging inputs and outperforms both traditional multi-stage pipelines and recent joint orientation-reconstruction methods.
107. 【2602.13780】Foundation Model-Driven Semantic Change Detection in Remote Sensing Imagery
链接:https://arxiv.org/abs/2602.13780
作者:Hengtong Shen,Li Yan,Hong Xie,Yaxuan Wei,Xinhao Li,Wenfei Shen,Peixian Lv,Fei Tan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Remote sensing, extract critical information, semantic change detection, change detection methods, SCD
备注:
点击查看摘要
Abstract:Remote sensing (RS) change detection methods can extract critical information on surface dynamics and are an essential means for humans to understand changes in the earth's surface and environment. Among these methods, semantic change detection (SCD) can more effectively interpret the multi-class information contained in bi-temporal RS imagery, providing semantic-level predictions that support dynamic change monitoring. However, due to the limited semantic understanding capability of the model and the inherent complexity of the SCD tasks, existing SCD methods face significant challenges in both performance and paradigm complexity. In this paper, we propose PerASCD, a SCD method driven by RS foundation model PerA, designed to enhance the multi-scale semantic understanding and overall performance. We introduce a modular Cascaded Gated Decoder (CG-Decoder) that simplifies complex SCD decoding pipelines while promoting effective multi-level feature interaction and fusion. In addition, we propose a Soft Semantic Consistency Loss (SSCLoss) to mitigate the numerical instability commonly encountered during SCD training. We further explore the applicability of multiple existing RS foundation models on the SCD task when equipped with the proposed decoder. Experimental results demonstrate that our decoder not only effectively simplifies the paradigm of SCD, but also achieves seamless adaptation across various vision encoders. Our method achieves state-of-the-art (SOTA) performance on two public benchmark datasets, validating its effectiveness. The code is available at this https URL.
108. 【2602.13778】Skeleton2Stage: Reward-Guided Fine-Tuning for Physically Plausible Dance Generation
链接:https://arxiv.org/abs/2602.13778
作者:Jidong Jia,Youjian Zhang,Huan Fu,Dacheng Tao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:mesh-level physical constraints, ignore mesh-level physical, skeletal domain, domain and ignore, ignore mesh-level
备注:
点击查看摘要
Abstract:Despite advances in dance generation, most methods are trained in the skeletal domain and ignore mesh-level physical constraints. As a result, motions that look plausible as joint trajectories often exhibit body self-penetration and Foot-Ground Contact (FGC) anomalies when visualized with a human body mesh, reducing the aesthetic appeal of generated dances and limiting their real-world applications. We address this skeleton-to-mesh gap by deriving physics-based rewards from the body mesh and applying Reinforcement Learning Fine-Tuning (RLFT) to steer the diffusion model toward physically plausible motion synthesis under mesh visualization. Our reward design combines (i) an imitation reward that measures a motion's general plausibility by its imitability in a physical simulator (penalizing penetration and foot skating), and (ii) a Foot-Ground Deviation (FGD) reward with test-time FGD guidance to better capture the dynamic foot-ground interaction in dance. However, we find that the physics-based rewards tend to push the model to generate freezing motions for fewer physical anomalies and better imitability. To mitigate it, we propose an anti-freezing reward to preserve motion dynamics while maintaining physical plausibility. Experiments on multiple dance datasets consistently demonstrate that our method can significantly improve the physical plausibility of generated motions, yielding more realistic and aesthetically pleasing dances. The project page is available at: this https URL
109. 【2602.13772】Offline-Poly: A Polyhedral Framework For Offline 3D Multi-Object Tracking
链接:https://arxiv.org/abs/2602.13772
作者:Xiaoyu Li,Yitao Wu,Xian Wu,Haolin Zhuo,Lijun Zhao,Lining Sun
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:critical component, MOT, Offline, multi-object tracking, MOT method based
备注: Based on this work, we achieved 1st place on the KITTI tracking leaderboard
点击查看摘要
Abstract:Offline 3D multi-object tracking (MOT) is a critical component of the 4D auto-labeling (4DAL) process. It enhances pseudo-labels generated by high-performance detectors through the incorporation of temporal context. However, existing offline 3D MOT approaches are direct extensions of online frameworks and fail to fully exploit the advantages of offline setting. Moreover, these methods often depend on fixed upstream and customized architectures, limiting their adaptability. To address these limitations, we propose Offline-Poly, a general offline 3D MOT method based on a tracking-centric design. We introduce a standardized paradigm termed Tracking-by-Tracking (TBT), which operates exclusively on arbitrary off-the-shelf tracking outputs and produces offline-refined tracklets. This formulation decouples offline tracker from specific upstream detectors or trackers. Under the TBT paradigm, Offline-Poly accepts one or multiple coarse tracking results and processes them through a structured pipeline comprising pre-processing, hierarchical matching and fusion, and tracklet refinement. Each module is designed to capitalize on the two fundamental properties of offline tracking: resource unconstrainedness, which permits global optimization beyond real-time limits, and future observability, which enables tracklet reasoning over the full temporal horizon. Offline-Poly first eliminates short-term ghost tracklets and re-identifies fragmented segments using global scene context. It then constructs scene-level similarity to associate tracklets across multiple input sources. Finally, Offline-Poly refines tracklets by jointly leveraging local and global motion patterns. On nuScenes, we achieve SOTA performance with 77.6% AMOTA. On KITTI, it achieves leading results with 83.00% HOTA. Comprehensive experiments further validate the flexibility, generalizability, and modular effectiveness of Offline-Poly.
110. 【2602.13760】SAM4Dcap: Training-free Biomechanical Twin System from Monocular Video
链接:https://arxiv.org/abs/2602.13760
作者:Li Wang,HaoYu Wang,Xi Chen,ZeKun Jiang,Kang Li,Jian Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Quantitative biomechanical analysis, Quantitative biomechanical, essential for clinical, clinical diagnosis, diagnosis and injury
备注:
点击查看摘要
Abstract:Quantitative biomechanical analysis is essential for clinical diagnosis and injury prevention but is often restricted to laboratories due to the high cost of optical motion capture systems. While multi-view video approaches have lowered barriers, they remain impractical for home-based scenarios requiring monocular capture. This paper presents SAM4Dcap, an open-source, end-to-end framework for estimating biomechanical metrics from monocular video without additional training. SAM4Dcap integrates the temporally consistent 4D human mesh recovery of SAM-Body4D with the OpenSim biomechanical solver. The pipeline converts reconstructed meshes into trajectory files compatible with diverse musculoskeletal models. We introduce automated prompting strategies and a Linux-native build for processing. Preliminary evaluations on walking and drop-jump tasks indicate that SAM4Dcap has the potential to achieve knee kinematic predictions comparable to multi-view systems, although some discrepancies in hip flexion and residual jitter remain. By bridging advanced computer vision with established biomechanical simulation, SAM4Dcap provides a flexible, accessible foundation for non-laboratory motion analysis.
111. 【2602.13758】OmniScience: A Large-scale Multi-modal Dataset for Scientific Image Understanding
链接:https://arxiv.org/abs/2602.13758
作者:Haoyi Tao,Chaozheng Huang,Nan Wang,Han Lyu,Linfeng Zhang,Guolin Ke,Xi Fang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Multimodal Large Language, demonstrate strong performance, exhibit limited capability, Models demonstrate strong, Large Language Models
备注:
点击查看摘要
Abstract:Multimodal Large Language Models demonstrate strong performance on natural image understanding, yet exhibit limited capability in interpreting scientific images, including but not limited to schematic diagrams, experimental characterizations, and analytical charts. This limitation is particularly pronounced in open-source MLLMs. The gap largely stems from existing datasets with limited domain coverage, coarse structural annotations, and weak semantic grounding. We introduce OmniScience, a large-scale, high-fidelity multi-modal dataset comprising 1.5 million figure-caption-context triplets, spanning more than 10 major scientific disciplines. To obtain image caption data with higher information density and accuracy for multi-modal large-model training, we develop a dynamic model-routing re-captioning pipeline that leverages state-of-the-art multi-modal large language models to generate dense, self-contained descriptions by jointly synthesizing visual features, original figure captions, and corresponding in-text references authored by human scientists. The pipeline is further reinforced with rigorous quality filtering and alignment with human expert judgments, ensuring both factual accuracy and semantic completeness, and boosts the image-text multi-modal similarity score from 0.769 to 0.956. We further propose a caption QA protocol as a proxy task for evaluating visual understanding. Under this setting, Qwen2.5-VL-3B model finetuned on OmniScience show substantial gains over baselines, achieving a gain of 0.378 on MM-MT-Bench and a gain of 0.140 on MMMU.
112. 【2602.13751】2MBench: A Benchmark for Out-of-Distribution Text-to-Motion Generation
链接:https://arxiv.org/abs/2602.13751
作者:Bin Yang,Rong Ou,Weisheng Xu,Jiaqi Xiong,Xintao Li,Taowen Wang,Luyu Zhu,Xu Jiang,Jing Tan,Renjing Xu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:motion generation capabilities, in-distribution textual inputs, Fine-grained Accuracy Evaluation, systematically assess model, assess model generalization
备注:
点击查看摘要
Abstract:Most existing evaluations of text-to-motion generation focus on in-distribution textual inputs and a limited set of evaluation criteria, which restricts their ability to systematically assess model generalization and motion generation capabilities under complex out-of-distribution (OOD) textual conditions. To address this limitation, we propose a benchmark specifically designed for OOD text-to-motion evaluation, which includes a comprehensive analysis of 14 representative baseline models and the two datasets derived from evaluation results. Specifically, we construct an OOD prompt dataset consisting of 1,025 textual descriptions. Based on this prompt dataset, we introduce a unified evaluation framework that integrates LLM-based Evaluation, Multi-factor Motion evaluation, and Fine-grained Accuracy Evaluation. Our experimental results reveal that while different baseline models demonstrate strengths in areas such as text-to-motion semantic alignment, motion generalizability, and physical quality, most models struggle to achieve strong performance with Fine-grained Accuracy Evaluation. These findings highlight the limitations of existing methods in OOD scenarios and offer practical guidance for the design and evaluation of future production-level text-to-motion models.
113. 【2602.13748】RMPL: Relation-aware Multi-task Progressive Learning with Stage-wise Training for Multimedia Event Extraction
链接:https://arxiv.org/abs/2602.13748
作者:Yongkang Jin,Jianwen Luo,Jingjing Wang,Jianmin Yao,Yu Hong
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:aims to identify, text and images, MEE, Event, Relation-aware Multi-task Progressive
备注:
点击查看摘要
Abstract:Multimedia Event Extraction (MEE) aims to identify events and their arguments from documents that contain both text and images. It requires grounding event semantics across different modalities. Progress in MEE is limited by the lack of annotated training data. M2E2 is the only established benchmark, but it provides annotations only for evaluation. This makes direct supervised training impractical. Existing methods mainly rely on cross-modal alignment or inference-time prompting with Vision--Language Models (VLMs). These approaches do not explicitly learn structured event representations and often produce weak argument grounding in multimodal settings. To address these limitations, we propose RMPL, a Relation-aware Multi-task Progressive Learning framework for MEE under low-resource conditions. RMPL incorporates heterogeneous supervision from unimodal event extraction and multimedia relation extraction with stage-wise training. The model is first trained with a unified schema to learn shared event-centric representations across modalities. It is then fine-tuned for event mention identification and argument role extraction using mixed textual and visual data. Experiments on the M2E2 benchmark with multiple VLMs show consistent improvements across different modality settings.
114. 【2602.13731】Generative Latent Representations of 3D Brain MRI for Multi-Task Downstream Analysis in Down Syndrome
链接:https://arxiv.org/abs/2602.13731
作者:Jordi Malé,Juan Fortea,Mateus Rozalem-Aranha,Neus Martínez-Abadías,Xavier Sevillano
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:synthetic data generation, high-quality synthetic data, anomaly detection, data generation, emerged as powerful
备注:
点击查看摘要
Abstract:Generative models have emerged as powerful tools in medical imaging, enabling tasks such as segmentation, anomaly detection, and high-quality synthetic data generation. These models typically rely on learning meaningful latent representations, which are particularly valuable given the high-dimensional nature of 3D medical images like brain magnetic resonance imaging (MRI) scans. Despite their potential, latent representations remain underexplored in terms of their structure, information content, and applicability to downstream clinical tasks. Investigating these representations is crucial for advancing the use of generative models in neuroimaging research and clinical decision-making. In this work, we develop multiple variational autoencoders (VAEs) to encode 3D brain MRI scans into compact latent space representations for generative and predictive applications. We systematically evaluate the effectiveness of the learned representations through three key analyses: (i) a quantitative and qualitative assessment of MRI reconstruction quality, (ii) a visualisation of the latent space structure using Principal Component Analysis, and (iii) downstream classification tasks on a proprietary dataset of euploid and Down syndrome individuals brain MRI scans. Our results demonstrate that the VAE successfully captures essential brain features while maintaining high reconstruction fidelity. The latent space exhibits clear clustering patterns, particularly in distinguishing individuals with Down syndrome from euploid controls.
115. 【2602.13728】Explore Intrinsic Geometry for Query-based Tiny and Oriented Object Detector with Momentum-based Bipartite Matching
链接:https://arxiv.org/abs/2602.13728
作者:Junpeng Zhang,Zewei Yang,Jie Feng,Yuhui Zheng,Ronghua Shang,Mengxuan Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved remarkable progress, limited texture information, performance remains constrained, capturing limited texture, Recent query-based detectors
备注: 13 pages
点击查看摘要
Abstract:Recent query-based detectors have achieved remarkable progress, yet their performance remains constrained when handling objects with arbitrary orientations, especially for tiny objects capturing limited texture information. This limitation primarily stems from the underutilization of intrinsic geometry during pixel-based feature decoding and the occurrence of inter-stage matching inconsistency caused by stage-wise bipartite matching. To tackle these challenges, we present IGOFormer, a novel query-based oriented object detector that explicitly integrates intrinsic geometry into feature decoding and enhances inter-stage matching stability. Specifically, we design an Intrinsic Geometry-aware Decoder, which enhances the object-related features conditioned on an object query by injecting complementary geometric embeddings extrapolated from their correlations to capture the geometric layout of the object, thereby offering a critical geometric insight into its orientation. Meanwhile, a Momentum-based Bipartite Matching scheme is developed to adaptively aggregate historical matching costs by formulating an exponential moving average with query-specific smoothing factors, effectively preventing conflicting supervisory signals arising from inter-stage matching inconsistency. Extensive experiments and ablation studies demonstrate the superiority of our IGOFormer for aerial oriented object detection, achieving an AP$_{50}$ score of 78.00\% on DOTA-V1.0 using Swin-T backbone under the single-scale setting. The code will be made publicly available.
116. 【2602.13726】RGA-Net: A Vision Enhancement Framework for Robotic Surgical Systems Using Reciprocal Attention Mechanisms
链接:https://arxiv.org/abs/2602.13726
作者:Quanjun Li,Weixuan Li,Han Xia,Junhua Zhou,Chi-Man Pun,Xuhang Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:endoscopic video feeds, energy-based devices significantly, devices significantly degrades, significantly degrades endoscopic, degrades endoscopic video
备注: Accepted by ICRA2026
点击查看摘要
Abstract:Robotic surgical systems rely heavily on high-quality visual feedback for precise teleoperation; yet, surgical smoke from energy-based devices significantly degrades endoscopic video feeds, compromising the human-robot interface and surgical outcomes. This paper presents RGA-Net (Reciprocal Gating and Attention-fusion Network), a novel deep learning framework specifically designed for smoke removal in robotic surgery workflows. Our approach addresses the unique challenges of surgical smoke-including dense, non-homogeneous distribution and complex light scattering-through a hierarchical encoder-decoder architecture featuring two key innovations: (1) a Dual-Stream Hybrid Attention (DHA) module that combines shifted window attention with frequency-domain processing to capture both local surgical details and global illumination changes, and (2) an Axis-Decomposed Attention (ADA) module that efficiently processes multi-scale features through factorized attention mechanisms. These components are connected via reciprocal cross-gating blocks that enable bidirectional feature modulation between encoder and decoder pathways. Extensive experiments on the DesmokeData and LSD3K surgical datasets demonstrate that RGA-Net achieves superior performance in restoring visual clarity suitable for robotic surgery integration. Our method enhances the surgeon-robot interface by providing consistently clear visualization, laying a technical foundation for alleviating surgeons' cognitive burden, optimizing operation workflows, and reducing iatrogenic injury risks in minimally invasive procedures. These practical benefits could be further validated through future clinical trials involving surgeon usability assessments. The proposed framework represents a significant step toward more reliable and safer robotic surgical systems through computational vision enhancement.
117. 【2602.13712】Fine-tuned Vision Language Model for Localization of Parasitic Eggs in Microscopic Images
链接:https://arxiv.org/abs/2602.13712
作者:Chan Hao Sien,Hezerul Abdul Karim,Nouar AlDahoul
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:infections continuously affect, specialized diagnostic expertise, Soil-transmitted helminth, infections continuously, global population
备注:
点击查看摘要
Abstract:Soil-transmitted helminth (STH) infections continuously affect a large proportion of the global population, particularly in tropical and sub-tropical regions, where access to specialized diagnostic expertise is limited. Although manual microscopic diagnosis of parasitic eggs remains the diagnostic gold standard, the approach can be labour-intensive, time-consuming, and prone to human error. This paper aims to utilize a vision language model (VLM) such as Microsoft Florence that was fine-tuned to localize all parasitic eggs within microscopic images. The preliminary results show that our localization VLM performs comparatively better than the other object detection methods, such as EfficientDet, with an mIOU of 0.94. This finding demonstrates the potential of the proposed VLM to serve as a core component of an automated framework, offering a scalable engineering solution for intelligent parasitological diagnosis.
118. 【2602.13704】Pailitao-VL: Unified Embedding and Reranker for Real-Time Multi-Modal Industrial Search
链接:https://arxiv.org/abs/2602.13704
作者:Lei Chen,Chen Ju,Xu Chen,Zhicheng Wang,Yuheng Jiao,Hongfeng Zhan,Zhaoyang Li,Shihao Xu,Zhixiang Zhao,Tong Jia,Jinsong Lan,Xiaoyong Zhu,Bo Zheng
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:real-time industrial search, comprehensive multi-modal retrieval, engineered for high-precision, real-time industrial, industrial search
备注:
点击查看摘要
Abstract:In this work, we presented Pailitao-VL, a comprehensive multi-modal retrieval system engineered for high-precision, real-time industrial search. We here address three critical challenges in the current SOTA solution: insufficient retrieval granularity, vulnerability to environmental noise, and prohibitive efficiency-performance gap. Our primary contribution lies in two fundamental paradigm shifts. First, we transitioned the embedding paradigm from traditional contrastive learning to an absolute ID-recognition task. Through anchoring instances to a globally consistent latent space defined by billions of semantic prototypes, we successfully overcome the stochasticity and granularity bottlenecks inherent in existing embedding solutions. Second, we evolved the generative reranker from isolated pointwise evaluation to the compare-and-calibrate listwise policy. By synergizing chunk-based comparative reasoning with calibrated absolute relevance scoring, the system achieves nuanced discriminative resolution while circumventing the prohibitive latency typically associated with conventional reranking methods. Extensive offline benchmarks and online A/B tests on Alibaba e-commerce platform confirm that Pailitao-VL achieves state-of-the-art performance and delivers substantial business impact. This work demonstrates a robust and scalable path for deploying advanced MLLM-based retrieval architectures in demanding, large-scale production environments.
119. 【2602.13693】A WDLoRA-Based Multimodal Generative Framework for Clinically Guided Corneal Confocal Microscopy Image Synthesis in Diabetic Neuropathy
链接:https://arxiv.org/abs/2602.13693
作者:Xin Zhang,Liangxiu Han,Yue Shi,Yalin Zheng,Uazman Alam,Maryam Ferdousi,Rayaz Malik
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Diabetic Peripheral Neuropathy, Corneal Confocal Microscopy, Confocal Microscopy, Peripheral Neuropathy, Diabetic Peripheral
备注:
点击查看摘要
Abstract:Corneal Confocal Microscopy (CCM) is a sensitive tool for assessing small-fiber damage in Diabetic Peripheral Neuropathy (DPN), yet the development of robust, automated deep learning-based diagnostic models is limited by scarce labelled data and fine-grained variability in corneal nerve morphology. Although Artificial Intelligence (AI)-driven foundation generative models excel at natural image synthesis, they often struggle in medical imaging due to limited domain-specific training, compromising the anatomical fidelity required for clinical analysis. To overcome these limitations, we propose a Weight-Decomposed Low-Rank Adaptation (WDLoRA)-based multimodal generative framework for clinically guided CCM image synthesis. WDLoRA is a parameter-efficient fine-tuning (PEFT) mechanism that decouples magnitude and directional weight updates, enabling foundation generative models to independently learn the orientation (nerve topology) and intensity (stromal contrast) required for medical realism. By jointly conditioning on nerve segmentation masks and disease-specific clinical prompts, the model synthesises anatomically coherent images across the DPN spectrum (Control, T1NoDPN, T1DPN). A comprehensive three-pillar evaluation demonstrates that the proposed framework achieves state-of-the-art visual fidelity (Fréchet Inception Distance (FID): 5.18) and structural integrity (Structural Similarity Index Measure (SSIM): 0.630), significantly outperforming GAN and standard diffusion baselines. Crucially, the synthetic images preserve gold-standard clinical biomarkers and are statistically equivalent to real patient data. When used to train automated diagnostic models, the synthetic dataset improves downstream diagnostic accuracy by 2.1% and segmentation performance by 2.2%, validating the framework's potential to alleviate data bottlenecks in medical AI.
120. 【2602.13689】Symmetry-Aware Fusion of Vision and Tactile Sensing via Bilateral Force Priors for Robotic Manipulation
链接:https://arxiv.org/abs/2602.13689
作者:Wonju Lee,Matteo Grimaldi,Tao Yu
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:robotic manipulation demand, manipulation demand precise, contact-rich interactions, tasks in robotic, robotic manipulation
备注: Accepted By ICRA2026
点击查看摘要
Abstract:Insertion tasks in robotic manipulation demand precise, contact-rich interactions that vision alone cannot resolve. While tactile feedback is intuitively valuable, existing studies have shown that naïve visuo-tactile fusion often fails to deliver consistent improvements. In this work, we propose a Cross-Modal Transformer (CMT) for visuo-tactile fusion that integrates wrist-camera observations with tactile signals through structured self- and cross-attention. To stabilize tactile embeddings, we further introduce a physics-informed regularization that encourages bilateral force balance, reflecting principles of human motor control. Experiments on the TacSL benchmark show that CMT with symmetry regularization achieves a 96.59% insertion success rate, surpassing naïve and gated fusion baselines and closely matching the privileged "wrist + contact force" configuration (96.09%). These results highlight two central insights: (i) tactile sensing is indispensable for precise alignment, and (ii) principled multimodal fusion, further strengthened by physics-informed regularization, unlocks complementary strengths of vision and touch, approaching privileged performance under realistic sensing.
121. 【2602.13681】An Ensemble Learning Approach towards Waste Segmentation in Cluttered Environment
链接:https://arxiv.org/abs/2602.13681
作者:Maimoona Jafar,Syed Imran Ali,Ahsan Saadat,Muhammad Bilal,Shah Khalid
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:critical global issue, Environmental pollution, global issue, viable solutions, critical global
备注:
点击查看摘要
Abstract:Environmental pollution is a critical global issue, with recycling emerging as one of the most viable solutions. This study focuses on waste segregation, a crucial step in recycling processes to obtain raw material. Recent advancements in computer vision have significantly contributed to waste classification and recognition. In waste segregation, segmentation masks are essential for robots to accurately localize and pick objects from conveyor belts. The complexity of real-world waste environments, characterized by deformed items without specific patterns and overlapping objects, further complicates waste segmentation tasks. This paper proposes an Ensemble Learning approach to improve segmentation accuracy by combining high performing segmentation models, U-Net and FPN, using a weighted average method. U-Net excels in capturing fine details and boundaries in segmentation tasks, while FPN effectively handles scale variation and context in complex environments, and their combined masks result in more precise predictions. The dataset used closely mimics real-life waste scenarios, and preprocessing techniques were applied to enhance feature learning for deep learning segmentation models. The ensemble model, referred to as EL-4, achieved an IoU value of 0.8306, an improvement over U-Net's 0.8065, and reduced Dice loss to 0.09019 from FPN's 0.1183. This study could contribute to the efficiency of waste sorting at Material Recovery Facility, facilitating better raw material acquisition for recycling with minimal human intervention and enhancing the overall throughput.
122. 【2602.13669】EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation
链接:https://arxiv.org/abs/2602.13669
作者:Rang Meng,Weipeng Wu,Yingjie Yin,Yuming Li,Chenguang Ma
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent multi-modal video, high visual quality, hinder real-time deployment, achieved high visual, stability hinder real-time
备注:
点击查看摘要
Abstract:Recent multi-modal video generation models have achieved high visual quality, but their prohibitive latency and limited temporal stability hinder real-time deployment. Streaming inference exacerbates these issues, leading to pronounced multimodal degradation, such as spatial blurring, temporal drift, and lip desynchronization, which creates an unresolved efficiency-performance trade-off. To this end, we propose EchoTorrent, a novel schema with a fourfold design: (1) Multi-Teacher Training fine-tunes a pre-trained model on distinct preference domains to obtain specialized domain experts, which sequentially transfer domain-specific knowledge to a student model; (2) Adaptive CFG Calibration (ACC-DMD), which calibrates the audio CFG augmentation errors in DMD via a phased spatiotemporal schedule, eliminating redundant CFG computations and enabling single-pass inference per step; (3) Hybrid Long Tail Forcing, which enforces alignment exclusively on tail frames during long-horizon self-rollout training via a causal-bidirectional hybrid architecture, effectively mitigates spatiotemporal degradation in streaming mode while enhancing fidelity to reference frames; and (4) VAE Decoder Refiner through pixel-domain optimization of the VAE decoder to recover high-frequency details while circumventing latent-space ambiguities. Extensive experiments and analysis demonstrate that EchoTorrent achieves few-pass autoregressive generation with substantially extended temporal consistency, identity preservation, and audio-lip synchronization.
123. 【2602.13662】LeafNet: A Large-Scale Dataset and Comprehensive Benchmark for Foundational Vision-Language Understanding of Plant Diseases
链接:https://arxiv.org/abs/2602.13662
作者:Khang Nguyen Quoc,Phuong D. Dao,Luyl-Da Quach
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:enabling multimodal processing, advanced Vision-Language Models, significantly advanced Vision-Language, Foundation models, vision-language pre-training
备注: 26 pages, 13 figures and 8 tables
点击查看摘要
Abstract:Foundation models and vision-language pre-training have significantly advanced Vision-Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their application in domain-specific agricultural tasks, such as plant pathology, remains limited due to the lack of large-scale, comprehensive multimodal image--text datasets and benchmarks. To address this gap, we introduce LeafNet, a comprehensive multimodal dataset, and LeafBench, a visual question-answering benchmark developed to systematically evaluate the capabilities of VLMs in understanding plant diseases. The dataset comprises 186,000 leaf digital images spanning 97 disease classes, paired with metadata, generating 13,950 question-answer pairs spanning six critical agricultural tasks. The questions assess various aspects of plant pathology understanding, including visual symptom recognition, taxonomic relationships, and diagnostic reasoning. Benchmarking 12 state-of-the-art VLMs on our LeafBench dataset, we reveal substantial disparity in their disease understanding capabilities. Our study shows performance varies markedly across tasks: binary healthy--diseased classification exceeds 90\% accuracy, while fine-grained pathogen and species identification remains below 65\%. Direct comparison between vision-only models and VLMs demonstrates the critical advantage of multimodal architectures: fine-tuned VLMs outperform traditional vision models, confirming that integrating linguistic representations significantly enhances diagnostic precision. These findings highlight critical gaps in current VLMs for plant pathology applications and underscore the need for LeafBench as a rigorous framework for methodological advancement and progress evaluation toward reliable AI-assisted plant disease diagnosis. Code is available at this https URL.
124. 【2602.13658】Optimizing Point-of-Care Ultrasound Video Acquisition for Probabilistic Multi-Task Heart Failure Detection
链接:https://arxiv.org/abs/2602.13658
作者:Armin Saadat,Nima Hashemi,Bahar Khodabakhshian,Michael Y. Tsang,Christina Luong,Teresa S.M. Tsang,Purang Abolmaesumi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:tight bedside time, operator-effort constraints, decision-making under tight, time and operator-effort, support clinical decision-making
备注: Accepted in IJCARS, IPCAI 2026 special issue
点击查看摘要
Abstract:Purpose: Echocardiography with point-of-care ultrasound (POCUS) must support clinical decision-making under tight bedside time and operator-effort constraints. We introduce a personalized data acquisition strategy in which an RL agent, given a partially observed multi-view study, selects the next view to acquire or terminates acquisition to support heart-failure (HF) assessment. Upon termination, a diagnostic model jointly predicts aortic stenosis (AS) severity and left ventricular ejection fraction (LVEF), two key HF biomarkers, and outputs uncertainty, enabling an explicit trade-off between diagnostic performance and acquisition cost. Methods: We model POCUS as a sequential acquisition problem: at each step, a video selector (RL agent) chooses the next view to acquire or terminates acquisition. Upon termination, a shared multi-view transformer performs multi-task inference with two heads, ordinal AS classification, and LVEF regression, and outputs Gaussian predictive distributions yielding ordinal probabilities over AS classes and EF thresholds. These probabilities drive a reward that balances expected diagnostic benefit against acquisition cost, producing patient-specific acquisition pathways. Results: The dataset comprises 12,180 patient-level studies, split into training/validation/test sets (75/15/15). On the 1,820 test studies, our method matches full-study performance while using 32% fewer videos, achieving 77.2% mean balanced accuracy (bACC) across AS severity classification and LVEF estimation, demonstrating robust multi-task performance under acquisition budgets. Conclusion: Patient-tailored, cost-aware acquisition can streamline POCUS workflows while preserving decision quality, producing interpretable scan pathways suited to bedside use. The framework is extensible to additional cardiac endpoints and merits prospective evaluation for clinical integration.
125. 【2602.13653】Building Autonomous GUI Navigation via Agentic-Q Estimation and Step-Wise Policy Optimization
链接:https://arxiv.org/abs/2602.13653
作者:Yibo Wang,Guangda Huzhang,Yuwei Hu,Yu Xia,Shiyin Lu,Qing-Guo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang,Lijun Zhang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
关键词:Graphical User Interface, Multimodal Large Language, Large Language Models, User Interface, Multimodal Large
备注:
点击查看摘要
Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have substantially driven the progress of autonomous agents for Graphical User Interface (GUI). Nevertheless, in real-world applications, GUI agents are often faced with non-stationary environments, leading to high computational costs for data curation and policy optimization. In this report, we introduce a novel MLLM-centered framework for GUI agents, which consists of two components: agentic-Q estimation and step-wise policy optimization. The former one aims to optimize a Q-model that can generate step-wise values to evaluate the contribution of a given action to task completion. The latter one takes step-wise samples from the state-action trajectory as inputs, and optimizes the policy via reinforcement learning with our agentic-Q model. It should be noticed that (i) all state-action trajectories are produced by the policy itself, so that the data collection costs are manageable; (ii) the policy update is decoupled from the environment, ensuring stable and efficient optimization. Empirical evaluations show that our framework endows Ovis2.5-9B with powerful GUI interaction capabilities, achieving remarkable performances on GUI navigation and grounding benchmarks and even surpassing contenders with larger scales.
126. 【2602.13650】KorMedMCQA-V: A Multimodal Benchmark for Evaluating Vision-Language Models on the Korean Medical Licensing Examination
链接:https://arxiv.org/abs/2602.13650
作者:Byungjin Choi,Seongsu Bae,Sunjun Kweon,Edward Choi
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Korean Medical Licensing, Medical Licensing Examinations, Korean medical, multiple-choice question answering, evaluating vision-language models
备注: 17 pages, 2 figures, 6 tables. (Includes appendix.)
点击查看摘要
Abstract:We introduce KorMedMCQA-V, a Korean medical licensing-exam-style multimodal multiple-choice question answering benchmark for evaluating vision-language models (VLMs). The dataset consists of 1,534 questions with 2,043 associated images from Korean Medical Licensing Examinations (2012-2023), with about 30% containing multiple images requiring cross-image evidence integration. Images cover clinical modalities including X-ray, computed tomography (CT), electrocardiography (ECG), ultrasound, endoscopy, and other medical visuals. We benchmark over 50 VLMs across proprietary and open-source categories-spanning general-purpose, medical-specialized, and Korean-specialized families-under a unified zero-shot evaluation protocol. The best proprietary model (Gemini-3.0-Pro) achieves 96.9% accuracy, the best open-source model (Qwen3-VL-32B-Thinking) 83.7%, and the best Korean-specialized model (VARCO-VISION-2.0-14B) only 43.2%. We further find that reasoning-oriented model variants gain up to +20 percentage points over instruction-tuned counterparts, medical domain specialization yields inconsistent gains over strong general-purpose baselines, all models degrade on multi-image questions, and performance varies notably across imaging modalities. By complementing the text-only KorMedMCQA benchmark, KorMedMCQA-V forms a unified evaluation suite for Korean medical reasoning across text-only and multimodal conditions. The dataset is available via Hugging Face Datasets: this https URL.
127. 【2602.13637】DCDM: Divide-and-Conquer Diffusion Models for Consistency-Preserving Video Generation
链接:https://arxiv.org/abs/2602.13637
作者:Haoyu Zhao,Yuang Zhang,Junqi Cheng,Jiaxi Gu,Zenghui Lu,Peng Shu,Zuxuan Wu,Yu-Gang Jiang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent video generative, impressive visual fidelity, demonstrated impressive visual, Recent video, video generative models
备注: 7 pages, 2 figures
点击查看摘要
Abstract:Recent video generative models have demonstrated impressive visual fidelity, yet they often struggle with semantic, geometric, and identity consistency. In this paper, we propose a system-level framework, termed the Divide-and-Conquer Diffusion Model (DCDM), to address three key challenges: (1) intra-clip world knowledge consistency, (2) inter-clip camera consistency, and (3) inter-shot element consistency. DCDM decomposes video consistency modeling under these scenarios into three dedicated components while sharing a unified video generation backbone. For intra-clip consistency, DCDM leverages a large language model to parse input prompts into structured semantic representations, which are subsequently translated into coherent video content by a diffusion transformer. For inter-clip camera consistency, we propose a temporal camera representation in the noise space that enables precise and stable camera motion control, along with a text-to-image initialization mechanism to further enhance controllability. For inter-shot consistency, DCDM adopts a holistic scene generation paradigm with windowed cross-attention and sparse inter-shot self-attention, ensuring long-range narrative coherence while maintaining computational efficiency. We validate our framework on the test set of the CVM Competition at AAAI'26, and the results demonstrate that the proposed strategies effectively address these challenges.
128. 【2602.13636】Layer-Guided UAV Tracking: Enhancing Efficiency and Occlusion Robustness
链接:https://arxiv.org/abs/2602.13636
作者:Yang Zhou,Derui Ding,Ran Sun,Ying Sun,Haohua Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:unmanned aerial vehicle, Visual object tracking, Visual object, plays a pivotal, aerial vehicle
备注:
点击查看摘要
Abstract:Visual object tracking (VOT) plays a pivotal role in unmanned aerial vehicle (UAV) applications. Addressing the trade-off between accuracy and efficiency, especially under challenging conditions like unpredictable occlusion, remains a significant challenge. This paper introduces LGTrack, a unified UAV tracking framework that integrates dynamic layer selection, efficient feature enhancement, and robust representation learning for occlusions. By employing a novel lightweight Global-Grouped Coordinate Attention (GGCA) module, LGTrack captures long-range dependencies and global contexts, enhancing feature discriminability with minimal computational overhead. Additionally, a lightweight Similarity-Guided Layer Adaptation (SGLA) module replaces knowledge distillation, achieving an optimal balance between tracking precision and inference efficiency. Experiments on three datasets demonstrate LGTrack's state-of-the-art real-time speed (258.7 FPS on UAVDT) while maintaining competitive tracking accuracy (82.8\% precision). Code is available at this https URL
129. 【2602.13633】A generalizable foundation model for intraoperative understanding across surgical procedures
链接:https://arxiv.org/abs/2602.13633
作者:Kanggil Park,Yongjun Jeon,Soyoung Lim,Seonmin Park,Jongmin Shin,Jung Yong Kim,Sehyeon An,Jinsoo Rhu,Jongman Kim,Gyu-Seong Choi,Namkee Oh,Kyu-Hwan Jung
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:clinical decisions depend, minimally invasive surgery, real-time visual interpretation, perception varies substantially, intraoperative perception varies
备注:
点击查看摘要
Abstract:In minimally invasive surgery, clinical decisions depend on real-time visual interpretation, yet intraoperative perception varies substantially across surgeons and procedures. This variability limits consistent assessment, training, and the development of reliable artificial intelligence systems, as most surgical AI models are designed for narrowly defined tasks and do not generalize across procedures or institutions. Here we introduce ZEN, a generalizable foundation model for intraoperative surgical video understanding trained on more than 4 million frames from over 21 procedures using a self-supervised multi-teacher distillation framework. We curated a large and diverse dataset and systematically evaluated multiple representation learning strategies within a unified benchmark. Across 20 downstream tasks and full fine-tuning, frozen-backbone, few-shot and zero-shot settings, ZEN consistently outperforms existing surgical foundation models and demonstrates robust cross-procedure generalization. These results suggest a step toward unified representations for surgical scene understanding and support future applications in intraoperative assistance and surgical training assessment.
130. 【2602.13602】owards Sparse Video Understanding and Reasoning
链接:https://arxiv.org/abs/2602.13602
作者:Chenwei Xu,Zhen Ye,Shang Wu,Weijian Li,Zihan Wang,Zhuofan Xia,Lie Lu,Pranav Maneriker,Fan Du,Manling Li,Han Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:underline, multi-round agent, video question answering, revise, question answering
备注:
点击查看摘要
Abstract:We present \revise (\underline{Re}asoning with \underline{Vi}deo \underline{S}parsity), a multi-round agent for video question answering (VQA). Instead of uniformly sampling frames, \revise selects a small set of informative frames, maintains a summary-as-state across rounds, and stops early when confident. It supports proprietary vision-language models (VLMs) in a ``plug-and-play'' setting and enables reinforcement fine-tuning for open-source models. For fine-tuning, we introduce EAGER (Evidence-Adjusted Gain for Efficient Reasoning), an annotation-free reward with three terms: (1) Confidence gain: after new frames are added, we reward the increase in the log-odds gap between the correct option and the strongest alternative; (2) Summary sufficiency: at answer time we re-ask using only the last committed summary and reward success; (3) Correct-and-early stop: answering correctly within a small turn budget is rewarded. Across multiple VQA benchmarks, \revise improves accuracy while reducing frames, rounds, and prompt tokens, demonstrating practical sparse video reasoning.
131. 【2602.13600】AdaVBoost: Mitigating Hallucinations in LVLMs via Token-Level Adaptive Visual Attention Boosting
链接:https://arxiv.org/abs/2602.13600
作者:Jiacheng Zhang,Feng Liu,Chao Du,Tianyu Pang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Vision-Language Models, Vision-Language Models, Large Vision-Language, existing methods primarily, methods primarily focus
备注:
点击查看摘要
Abstract:Visual attention boosting has emerged as a promising direction for mitigating hallucinations in Large Vision-Language Models (LVLMs), where existing methods primarily focus on where to boost by applying a predefined scaling to the attention of method-specific visual tokens during autoregressive generation. In this paper, we identify a fundamental trade-off in these methods: a predefined scaling factor can be too weak at some generation steps, leaving hallucinations unresolved, yet too strong at others, leading to new hallucinations. Motivated by this finding, we propose AdaVBoost, a token-level visual attention boosting framework that adaptively determines how much attention to boost at each generation step. Specifically, we introduce Visual Grounding Entropy (VGE) to estimate hallucination risk, which leverages visual grounding as a complementary signal to capture evidence mismatches beyond entropy. Guided by VGE, AdaVBoost applies stronger visual attention boosting to high-risk tokens and weaker boosting to low-risk tokens, enabling token-level adaptive intervention at each generation step. Extensive experiments show that AdaVBoost significantly outperforms baseline methods across multiple LVLMs and hallucination benchmarks.
132. 【2602.13588】wo-Stream Interactive Joint Learning of Scene Parsing and Geometric Vision Tasks
链接:https://arxiv.org/abs/2602.13588
作者:Guanfeng Tang,Hongbo Zhao,Ziwei Long,Jiayao Li,Bohong Xiao,Wei Ye,Hanli Wang,Rui Fan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:human visual system, bio-inspired joint learning, joint learning framework, learning framework capable, interactive streams
备注:
点击查看摘要
Abstract:Inspired by the human visual system, which operates on two parallel yet interactive streams for contextual and spatial understanding, this article presents Two Interactive Streams (TwInS), a novel bio-inspired joint learning framework capable of simultaneously performing scene parsing and geometric vision tasks. TwInS adopts a unified, general-purpose architecture in which multi-level contextual features from the scene parsing stream are infused into the geometric vision stream to guide its iterative refinement. In the reverse direction, decoded geometric features are projected into the contextual feature space for selective heterogeneous feature fusion via a novel cross-task adapter, which leverages rich cross-view geometric cues to enhance scene parsing. To eliminate the dependence on costly human-annotated correspondence ground truth, TwInS is further equipped with a tailored semi-supervised training strategy, which unleashes the potential of large-scale multi-view data and enables continuous self-evolution without requiring ground-truth correspondences. Extensive experiments conducted on three public datasets validate the effectiveness of TwInS's core components and demonstrate its superior performance over existing state-of-the-art approaches. The source code will be made publicly available upon publication.
133. 【2602.13585】Diff-Aid: Inference-time Adaptive Interaction Denoising for Rectified Text-to-Image Generation
链接:https://arxiv.org/abs/2602.13585
作者:Binglei Li,Mengping Yang,Zhiyu Tan,Junping Zhang,Hao Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved remarkable advancement, descriptions remains challenging, remains challenging due, complex textual descriptions, textual descriptions remains
备注: 18 pages
点击查看摘要
Abstract:Recent text-to-image (T2I) diffusion models have achieved remarkable advancement, yet faithfully following complex textual descriptions remains challenging due to insufficient interactions between textual and visual features. Prior approaches enhance such interactions via architectural design or handcrafted textual condition weighting, but lack flexibility and overlook the dynamic interactions across different blocks and denoising stages. To provide a more flexible and efficient solution to this problem, we propose Diff-Aid, a lightweight inference-time method that adaptively adjusts per-token text and image interactions across transformer blocks and denoising timesteps. Beyond improving generation quality, Diff-Aid yields interpretable modulation patterns that reveal how different blocks, timesteps, and textual tokens contribute to semantic alignment during denoising. As a plug-and-play module, Diff-Aid can be seamlessly integrated into downstream applications for further improvement, including style LoRAs, controllable generation, and zero-shot editing. Experiments on strong baselines (SD 3.5 and FLUX) demonstrate consistent improvements in prompt adherence, visual quality, and human preference across various metrics. Our code and models will be released.
134. 【2602.13555】Privacy-Concealing Cooperative Perception for BEV Scene Segmentation
链接:https://arxiv.org/abs/2602.13555
作者:Song Wang,Lingling Li,Marcus Santos,Guanghui Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Cooperative perception systems, share sensing information, autonomous driving aim, limited perception range, Cooperative perception
备注:
点击查看摘要
Abstract:Cooperative perception systems for autonomous driving aim to overcome the limited perception range of a single vehicle by communicating with adjacent agents to share sensing information. While this improves perception performance, these systems also face a significant privacy-leakage issue, as sensitive visual content can potentially be reconstructed from the shared data. In this paper, we propose a novel Privacy-Concealing Cooperation (PCC) framework for Bird's Eye View (BEV) semantic segmentation. Based on commonly shared BEV features, we design a hiding network to prevent an image reconstruction network from recovering the input images from the shared features. An adversarial learning mechanism is employed to train the network, where the hiding network works to conceal the visual clues in the BEV features while the reconstruction network attempts to uncover these clues. To maintain segmentation performance, the perception network is integrated with the hiding network and optimized end-to-end. The experimental results demonstrate that the proposed PCC framework effectively degrades the quality of the reconstructed images with minimal impact on segmentation performance, providing privacy protection for cooperating vehicles. The source code will be made publicly available upon publication.
135. 【2602.13549】Nighttime Autonomous Driving Scene Reconstruction with Physically-Based Gaussian Splatting
链接:https://arxiv.org/abs/2602.13549
作者:Tae-Kyeong Kim,Xingxin Chen,Guile Wu,Chengjie Huang,Dongfeng Bai,Bingbing Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Neural Radiance Fields, autonomous driving simulation, autonomous driving, Radiance Fields, Neural Radiance
备注: ICRA 2026
点击查看摘要
Abstract:This paper focuses on scene reconstruction under nighttime conditions in autonomous driving simulation. Recent methods based on Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) have achieved photorealistic modeling in autonomous driving scene reconstruction, but they primarily focus on normal-light conditions. Low-light driving scenes are more challenging to model due to their complex lighting and appearance conditions, which often causes performance degradation of existing methods. To address this problem, this work presents a novel approach that integrates physically based rendering into 3DGS to enhance nighttime scene reconstruction for autonomous driving. Specifically, our approach integrates physically based rendering into composite scene Gaussian representations and jointly optimizes Bidirectional Reflectance Distribution Function (BRDF) based material properties. We explicitly model diffuse components through a global illumination module and specular components by anisotropic spherical Gaussians. As a result, our approach improves reconstruction quality for outdoor nighttime driving scenes, while maintaining real-time rendering. Extensive experiments across diverse nighttime scenarios on two real-world autonomous driving datasets, including nuScenes and Waymo, demonstrate that our approach outperforms the state-of-the-art methods both quantitatively and qualitatively.
136. 【2602.13515】SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning
链接:https://arxiv.org/abs/2602.13515
作者:Jintao Zhang,Kai Jiang,Chendong Xiang,Weiqi Feng,Yuezhou Hu,Haocheng Xi,Jianfei Chen,Jun Zhu
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:sparse attention, attention, trainable sparse attention, sparse, Top-k and Top-p
备注:
点击查看摘要
Abstract:Many training-free sparse attention methods are effective for accelerating diffusion models. Recently, several works suggest that making sparse attention trainable can further increase sparsity while preserving generation quality. We study three key questions: (1) when do the two common masking rules, i.e., Top-k and Top-p, fail, and how can we avoid these failures? (2) why can trainable sparse attention reach higher sparsity than training-free methods? (3) what are the limitations of fine-tuning sparse attention using the diffusion loss, and how can we address them? Based on this analysis, we propose SpargeAttention2, a trainable sparse attention method that achieves high sparsity without degrading generation quality. SpargeAttention2 includes (i) a hybrid masking rule that combines Top-k and Top-p for more robust masking at high sparsity, (ii) an efficient trainable sparse attention implementation, and (iii) a distillation-inspired fine-tuning objective to better preserve generation quality during fine-tuning using sparse attention. Experiments on video diffusion models show that SpargeAttention2 reaches 95% attention sparsity and a 16.2x attention speedup while maintaining generation quality, consistently outperforming prior sparse attention methods.
137. 【2602.13507】Benchmarking Video Foundation Models for Remote Parkinson's Disease Screening
链接:https://arxiv.org/abs/2602.13507
作者:Md Saiful Islam,Ekram Hossain,Abdelrahman Abdelkader,Tariq Adnan,Fazla Rabbi Mashrur,Sooyong Park,Praveen Kumar,Qasim Sudais,Natalia Chunga,Nami Shah,Jan Freyberg,Christopher Kanan,Ruth Schneider,Ehsan Hoque
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:video-based assessments offer, Parkinson disease, pathway for Parkinson, video-based assessments, assessments offer
备注:
点击查看摘要
Abstract:Remote, video-based assessments offer a scalable pathway for Parkinson's disease (PD) screening. While traditional approaches rely on handcrafted features mimicking clinical scales, recent advances in video foundation models (VFMs) enable representation learning without task-specific customization. However, the comparative effectiveness of different VFM architectures across diverse clinical tasks remains poorly understood. We present a large-scale systematic study using a novel video dataset from 1,888 participants (727 with PD), comprising 32,847 videos across 16 standardized clinical tasks. We evaluate seven state-of-the-art VFMs -- including VideoPrism, V-JEPA, ViViT, and VideoMAE -- to determine their robustness in clinical screening. By evaluating frozen embeddings with a linear classification head, we demonstrate that task saliency is highly model-dependent: VideoPrism excels in capturing visual speech kinematics (no audio) and facial expressivity, while V-JEPA proves superior for upper-limb motor tasks. Notably, TimeSformer remains highly competitive for rhythmic tasks like finger tapping. Our experiments yield AUCs of 76.4-85.3% and accuracies of 71.5-80.6%. While high specificity (up to 90.3%) suggests strong potential for ruling out healthy individuals, the lower sensitivity (43.2-57.3%) highlights the need for task-aware calibration and integration of multiple tasks and modalities. Overall, this work establishes a rigorous baseline for VFM-based PD screening and provides a roadmap for selecting suitable tasks and architectures in remote neurological monitoring. Code and anonymized structured data are publicly available: this https URL\_video\_benchmarking-A2C5
138. 【2602.13479】GLIMPSE : Real-Time Text Recognition and Contextual Understanding for VQA in Wearables
链接:https://arxiv.org/abs/2602.13479
作者:Akhil Ramachandran,Ankit Arun,Ashish Shenoy,Abhay Harpale,Srihari Jayakumar,Debojeet Chatterjee,Mohsen Moslehpour,Pierce Chuang,Yichao Lu,Vikas Bhardwaj,Peyman Heidari
类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
关键词:Large Language Models, Video Large Language, Large Language, shown remarkable progress, Language Models
备注:
点击查看摘要
Abstract:Video Large Language Models (Video LLMs) have shown remarkable progress in understanding and reasoning about visual content, particularly in tasks involving text recognition and text-based visual question answering (Text VQA). However, deploying Text VQA on wearable devices faces a fundamental tension: text recognition requires high-resolution video, but streaming high-quality video drains battery and causes thermal throttling. Moreover, existing models struggle to maintain coherent temporal context when processing text across multiple frames in real-time streams. We observe that text recognition and visual reasoning have asymmetric resolution requirements - OCR needs fine detail while scene understanding tolerates coarse features. We exploit this asymmetry with a hybrid architecture that performs selective high-resolution OCR on-device while streaming low-resolution video for visual context. On a benchmark of text-based VQA samples across five task categories, our system achieves 72% accuracy at 0.49x the power consumption of full-resolution streaming, enabling sustained VQA sessions on resource-constrained wearables without sacrificing text understanding quality.
139. 【2602.13444】FlowHOI: Flow-based Semantics-Grounded Generation of Hand-Object Interactions for Dexterous Robot Manipulation
链接:https://arxiv.org/abs/2602.13444
作者:Huajian Zeng,Lingyun Chen,Jiaqi Yang,Yuantai Zhang,Fan Shi,Peidong Liu,Xingxing Zuo
类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:plausible end-effector motions, generate plausible end-effector, underlying hand-object interaction, end-effector motions, fail in long-horizon
备注: Project Page: [this https URL](https://huajian-zeng.github.io/projects/flowhoi/)
点击查看摘要
Abstract:Recent vision-language-action (VLA) models can generate plausible end-effector motions, yet they often fail in long-horizon, contact-rich tasks because the underlying hand-object interaction (HOI) structure is not explicitly represented. An embodiment-agnostic interaction representation that captures this structure would make manipulation behaviors easier to validate and transfer across robots. We propose FlowHOI, a two-stage flow-matching framework that generates semantically grounded, temporally coherent HOI sequences, comprising hand poses, object poses, and hand-object contact states, conditioned on an egocentric observation, a language instruction, and a 3D Gaussian splatting (3DGS) scene reconstruction. We decouple geometry-centric grasping from semantics-centric manipulation, conditioning the latter on compact 3D scene tokens and employing a motion-text alignment loss to semantically ground the generated interactions in both the physical scene layout and the language instruction. To address the scarcity of high-fidelity HOI supervision, we introduce a reconstruction pipeline that recovers aligned hand-object trajectories and meshes from large-scale egocentric videos, yielding an HOI prior for robust generation. Across the GRAB and HOT3D benchmarks, FlowHOI achieves the highest action recognition accuracy and a 1.7$\times$ higher physics simulation success rate than the strongest diffusion-based baseline, while delivering a 40$\times$ inference speedup. We further demonstrate real-robot execution on four dexterous manipulation tasks, illustrating the feasibility of retargeting generated HOI representations to real-robot execution pipelines.
140. 【2602.13440】Learning on the Fly: Replay-Based Continual Object Perception for Indoor Drones
链接:https://arxiv.org/abs/2602.13440
作者:Sebastian-Ion Nae,Mihai-Eugen Barbu,Sebastian Mocanu,Marius Leordeanu
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:limiting catastrophic forgetting, Autonomous agents, motivating Class-Incremental Learning, catastrophic forgetting, motivating Class-Incremental
备注: Accepted at European Robotics Forum (ERF) 2026
点击查看摘要
Abstract:Autonomous agents such as indoor drones must learn new object classes in real-time while limiting catastrophic forgetting, motivating Class-Incremental Learning (CIL). However, most unmanned aerial vehicle (UAV) datasets focus on outdoor scenes and offer limited temporally coherent indoor videos. We introduce an indoor dataset of $14,400$ frames capturing inter-drone and ground vehicle footage, annotated via a semi-automatic workflow with a $98.6\%$ first-pass labeling agreement before final manual verification. Using this dataset, we benchmark 3 replay-based CIL strategies: Experience Replay (ER), Maximally Interfered Retrieval (MIR), and Forgetting-Aware Replay (FAR), using YOLOv11-nano as a resource-efficient detector for deployment-constrained UAV platforms. Under tight memory budgets ($5-10\%$ replay), FAR performs better than the rest, achieving an average accuracy (ACC, $mAP_{50-95}$ across increments) of $82.96\%$ with $5\%$ replay. Gradient-weighted class activation mapping (Grad-CAM) analysis shows attention shifts across classes in mixed scenes, which is associated with reduced localization quality for drones. The experiments further demonstrate that replay-based continual learning can be effectively applied to edge aerial systems. Overall, this work contributes an indoor UAV video dataset with preserved temporal coherence and an evaluation of replay-based CIL under limited replay budgets. Project page: this https URL
141. 【2602.13430】Handling Supervision Scarcity in Chest X-ray Classification: Long-Tailed and Zero-Shot Learning
链接:https://arxiv.org/abs/2602.13430
作者:Ha-Hieu Pham,Hai-Dang Nguyen,Thanh-Huy Nguyen,Min Xu,Ulas Bagci,Trung-Nghia Le,Huy-Hieu Pham
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Chest X-ray, missing annotations, extreme long-tailed multi-label, clinical practice, limited by imperfect
备注:
点击查看摘要
Abstract:Chest X-ray (CXR) classification in clinical practice is often limited by imperfect supervision, arising from (i) extreme long-tailed multi-label disease distributions and (ii) missing annotations for rare or previously unseen findings. The CXR-LT 2026 challenge addresses these issues on a PadChest-based benchmark with a 36-class label space split into 30 in-distribution classes for training and 6 out-of-distribution (OOD) classes for zero-shot evaluation. We present task-specific solutions tailored to the distinct supervision regimes. For Task 1 (long-tailed multi-label classification), we adopt an imbalance-aware multi-label learning strategy to improve recognition of tail classes while maintaining stable performance on frequent findings. For Task 2 (zero-shot OOD recognition), we propose a prediction approach that produces scores for unseen disease categories without using any supervised labels or examples from the OOD classes during training. Evaluated with macro-averaged mean Average Precision (mAP), our method achieves strong performance on both tasks, ranking first on the public leaderboard of the development phase. Code and pre-trained models are available at this https URL.
142. 【2602.13378】LAF-YOLOv10 with Partial Convolution Backbone, Attention-Guided Feature Pyramid, Auxiliary P2 Head, and Wise-IoU Loss for Small Object Detection in Drone Aerial Imagery
链接:https://arxiv.org/abs/2602.13378
作者:Sohail Ali Farooqui,Zuhair Ahmed Khan Taha,Mohammed Mudassir Uddin,Shahnawaz Alam
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:applied computer vision, Unmanned aerial vehicles, primary sensing platforms, traffic monitoring, aerial vehicles serve
备注:
点击查看摘要
Abstract:Unmanned aerial vehicles serve as primary sensing platforms for surveillance, traffic monitoring, and disaster response, making aerial object detection a central problem in applied computer vision. Current detectors struggle with UAV-specific challenges: targets spanning only a few pixels, cluttered backgrounds, heavy occlusion, and strict onboard computational budgets. This study introduces LAF-YOLOv10, built on YOLOv10n, integrating four complementary techniques to improve small-object detection in drone imagery. A Partial Convolution C2f (PC-C2f) module restricts spatial convolution to one quarter of backbone channels, reducing redundant computation while preserving discriminative capacity. An Attention-Guided Feature Pyramid Network (AG-FPN) inserts Squeeze-and-Excitation channel gates before multi-scale fusion and replaces nearest-neighbor upsampling with DySample for content-aware interpolation. An auxiliary P2 detection head at 160$\times$160 resolution extends localization to objects below 8$\times$8 pixels, while the P5 head is removed to redistribute parameters. Wise-IoU v3 replaces CIoU for bounding box regression, attenuating gradients from noisy annotations in crowded aerial scenes. The four modules address non-overlapping bottlenecks: PC-C2f compresses backbone computation, AG-FPN refines cross-scale fusion, the P2 head recovers spatial resolution, and Wise-IoU stabilizes regression under label noise. No individual component is novel; the contribution is the joint integration within a single YOLOv10 framework. Across three training runs (seeds 42, 123, 256), LAF-YOLOv10 achieves 35.1$\pm$0.3\% mAP@0.5 on VisDrone-DET2019 with 2.3\,M parameters, exceeding YOLOv10n by 3.3 points. Cross-dataset evaluation on UAVDT yields 35.8$\pm$0.4\% mAP@0.5. Benchmarks on NVIDIA Jetson Orin Nano confirm 24.3 FPS at FP16, demonstrating viability for embedded UAV deployment.
143. 【2602.13376】An Online Reference-Free Evaluation Framework for Flowchart Image-to-Code Generation
链接:https://arxiv.org/abs/2602.13376
作者:Giang Son Nguyen,Zi Pong Lim,Sarthak Ketanbhai Modi,Yon Shin Teo,Wenya Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:document processing pipelines, Vision-Language Models, document processing, processing pipelines, pipelines to convert
备注: 9 pages, 4 tables. Under review
点击查看摘要
Abstract:Vision-Language Models (VLMs) are increasingly used in document processing pipelines to convert flowchart images into structured code (e.g., Mermaid). In production, these systems process arbitrary inputs for which no ground-truth code exists, making output quality difficult to assess. We propose a reference-free evaluation framework that monitors flowchart image-to-code generation quality at inference time, using only the input image and the generated output. The framework introduces two automated metrics: $\text{Recall}{\text{OCR}}$, which estimates content coverage by extracting text from the input image via OCR as a proxy reference, and $\text{Precision}{\text{VE}}$, which detects hallucinated elements through Visual Entailment against the original image. Their harmonic mean, $\text{F1}{\text{OCR-VE}}$, provides a unified quality score. Validation on the FlowVQA dataset shows strong agreement with ground-truth metrics (average Pearson's $r = 0.97$, $0.91$, and $0.94$ for Recall, Precision, and F1, respectively), confirming the framework's reliability as a practical, reference-free alternative for continuous quality monitoring in production settings.
144. 【2602.13361】he Diffusion Duet: Harmonizing Dual Channels with Wavelet Suppression for Image Separation
链接:https://arxiv.org/abs/2602.13361
作者:Jingwei Li,Wei Pu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:restoring multiple independent, unknown mixing mode, Blind image separation, multiple independent source, single observation image
备注:
点击查看摘要
Abstract:Blind image separation (BIS) refers to the inverse problem of simultaneously estimating and restoring multiple independent source images from a single observation image under conditions of unknown mixing mode and without prior knowledge of the source images. Traditional methods relying on statistical independence assumptions or CNN/GAN variants struggle to characterize complex feature distributions in real scenes, leading to estimation bias, texture distortion, and artifact residue under strong noise and nonlinear mixing. This paper innovatively introduces diffusion models into dual-channel BIS, proposing an efficient Dual-Channel Diffusion Separation Model (DCDSM). DCDSM leverages diffusion models' powerful generative capability to learn source image feature distributions and reconstruct feature structures effectively. A novel Wavelet Suppression Module (WSM) is designed within the dual-branch reverse denoising process, forming an interactive separation network that enhances detail separation by exploiting the mutual coupling noise characteristic between source images. Extensive experiments on synthetic datasets containing rain/snow and complex mixtures demonstrate that DCDSM achieves state-of-the-art performance: 1) In image restoration tasks, it obtains PSNR/SSIM values of 35.0023 dB/0.9549 and 29.8108 dB/0.9243 for rain and snow removal respectively, outperforming Histoformer and LDRCNet by 1.2570 dB/0.9272 dB (PSNR) and 0.0262/0.0289 (SSIM) on average; 2) For complex mixture separation, the restored dual-source images achieve average PSNR and SSIM of 25.0049 dB and 0.7997, surpassing comparative methods by 4.1249 dB and 0.0926. Both subjective and objective evaluations confirm DCDSM's superiority in addressing rain/snow residue removal and detail preservation challenges.
145. 【2602.13357】AdaCorrection: Adaptive Offset Cache Correction for Accurate Diffusion Transformers
链接:https://arxiv.org/abs/2602.13357
作者:Dong Liu,Yanxuan Yu,Ben Lengerich,Ying Nian Wu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:iterative denoising structure, expensive inference due, denoising structure, suffer from expensive, iterative denoising
备注:
点击查看摘要
Abstract:Diffusion Transformers (DiTs) achieve state-of-the-art performance in high-fidelity image and video generation but suffer from expensive inference due to their iterative denoising structure. While prior methods accelerate sampling by caching intermediate features, they rely on static reuse schedules or coarse-grained heuristics, which often lead to temporal drift and cache misalignment that significantly degrade generation quality. We introduce \textbf{AdaCorrection}, an adaptive offset cache correction framework that maintains high generation fidelity while enabling efficient cache reuse across Transformer layers during diffusion inference. At each timestep, AdaCorrection estimates cache validity with lightweight spatio-temporal signals and adaptively blends cached and fresh activations. This correction is computed on-the-fly without additional supervision or retraining. Our approach achieves strong generation quality with minimal computational overhead, maintaining near-original FID while providing moderate acceleration. Experiments on image and video diffusion benchmarks show that AdaCorrection consistently improves generation performance.
146. 【2602.13352】Using Deep Learning to Generate Semantically Correct Hindi Captions
链接:https://arxiv.org/abs/2602.13352
作者:Wasim Akram Khan,Anil Kumar Vuppala
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:natural language processing, Automated image captioning, Automated image, image, harnessing the capability
备注: 34 pages, 12 figures, 3 tables. Master's thesis, Liverpool John Moores University, November 2022
点击查看摘要
Abstract:Automated image captioning using the content from the image is very appealing when done by harnessing the capability of computer vision and natural language processing. Extensive research has been done in the field with a major focus on the English language which gives the scope for further developments in the same with consideration of popular foreign languages. This research utilizes distinct models for translating the image caption into Hindi, the fourth most popular language across the world. Exploring the multi-modal architectures this research comprises local visual features, global visual features, attention mechanisms, and pre-trained models. Using google cloud translator on the image dataset from Flickr8k, Hindi image descriptions have been generated. Pre-trained CNNs like VGG16, ResNet50, and Inception V3 helped in retrieving image characteristics, while the uni-directional and bi-directional techniques of text encoding are used for the text encoding process. An additional Attention layer helps to generate a weight vector and, by multiplying it, combine image characteristics from each time step into a sentence-level feature vector. Bilingual evaluation understudy scores are used to compare the research outcome. Many experiments that serve as a baseline are done for the comparative analysis of the research. An image with a score of BLEU-1 is considered sufficient, whereas one with a score of BLEU-4 is considered to have fluid image captioning. For both BLEU scores, the attention-based bidirectional LSTM with VGG16 produced the best results of 0.59 and 0.19 respectively. The experiments conclude that researchs ability to produce relevant, semantically accurate image captions in Hindi. The research accomplishes the goals and future research can be guided by this research model.
147. 【2602.13350】Detecting Brick Kiln Infrastructure at Scale: Graph, Foundation, and Remote Sensing Models for Satellite Imagery Data
链接:https://arxiv.org/abs/2602.13350
作者:Usman Nazir,Xidong Chen,Hafiz Muhammad Abubakar,Hadia Abu Bakar,Raahim Arbaz,Fezan Rasool,Bin Chen,Sara Khalid
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:outdated ground data, South Asia, large-scale monitoring remains, monitoring remains limited, labor in South
备注:
点击查看摘要
Abstract:Brick kilns are a major source of air pollution and forced labor in South Asia, yet large-scale monitoring remains limited by sparse and outdated ground data. We study brick kiln detection at scale using high-resolution satellite imagery and curate a multi city zoom-20 (0.149 meters per pixel) resolution dataset comprising over 1.3 million image tiles across five regions in South and Central Asia. We propose ClimateGraph, a region-adaptive graph-based model that captures spatial and directional structure in kiln layouts, and evaluate it against established graph learning baselines. In parallel, we assess a remote sensing based detection pipeline and benchmark it against recent foundation models for satellite imagery. Our results highlight complementary strengths across graph, foundation, and remote sensing approaches, providing practical guidance for scalable brick kiln monitoring from satellite imagery.
148. 【2602.13349】From Prompt to Production:Automating Brand-Safe Marketing Imagery with Text-to-Image Models
链接:https://arxiv.org/abs/2602.13349
作者:Parmida Atighehchian,Henry Wang,Andrei Kapustin,Boris Lerner,Tiancheng Jiang,Taylor Jensen,Negin Sokhandan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:made significant strides, producing impressive results, significant strides, producing impressive, textual descriptions
备注: 17 pages, 12 figures, Accepted to IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026
点击查看摘要
Abstract:Text-to-image models have made significant strides, producing impressive results in generating images from textual descriptions. However, creating a scalable pipeline for deploying these models in production remains a challenge. Achieving the right balance between automation and human feedback is critical to maintain both scale and quality. While automation can handle large volumes, human oversight is still an essential component to ensure that the generated images meet the desired standards and are aligned with the creative vision. This paper presents a new pipeline that offers a fully automated, scalable solution for generating marketing images of commercial products using text-to-image models. The proposed system maintains the quality and fidelity of images, while also introducing sufficient creative variation to adhere to marketing guidelines. By streamlining this process, we ensure a seamless blend of efficiency and human oversight, achieving a $30.77\%$ increase in marketing object fidelity using DINOV2 and a $52.00\%$ increase in human preference over the generated outcome.
149. 【2602.13347】Visual Foresight for Robotic Stow: A Diffusion-Based World Model from Sparse Snapshots
链接:https://arxiv.org/abs/2602.13347
作者:Lijun Zhang,Nikhil Chacko,Petter Nilsson,Ruinian Xu,Shantanu Thakar,Bai Lou,Harpreet Sawhney,Zhebin Zhang,Mudit Agrawal,Bhavana Chandrashekhar,Aaron Parness
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
关键词:Automated warehouses execute, robots place objects, warehouses execute millions, Automated warehouses, execute millions
备注: 20 pages, 16 figures
点击查看摘要
Abstract:Automated warehouses execute millions of stow operations, where robots place objects into storage bins. For these systems it is valuable to anticipate how a bin will look from the current observations and the planned stow behavior before real execution. We propose FOREST, a stow-intent-conditioned world model that represents bin states as item-aligned instance masks and uses a latent diffusion transformer to predict the post-stow configuration from the observed context. Our evaluation shows that FOREST substantially improves the geometric agreement between predicted and true post-stow layouts compared with heuristic baselines. We further evaluate the predicted post-stow layouts in two downstream tasks, in which replacing the real post-stow masks with FOREST predictions causes only modest performance loss in load-quality assessment and multi-stow reasoning, indicating that our model can provide useful foresight signals for warehouse planning.
150. 【2602.13344】FireRed-Image-Edit-1.0 Techinical Report
链接:https://arxiv.org/abs/2602.13344
作者:Super Intelligence Team:Changhao Qiao,Chao Hui,Chen Li,Cunzheng Wang,Dejia Song,Jiale Zhang,Jing Li,Qiang Xiang,Runqi Wang,Shuang Sun,Wei Zhu,Xu Tang,Yao Hu,Yibo Chen,Yuhao Huang,Yuxuan Duan,Zhiyi Chen,Ziyuan Guo
类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
关键词:instruction-based image editing, evaluation design, diffusion transformer, transformer for instruction-based, image editing pairs
备注:
点击查看摘要
Abstract:We present FireRed-Image-Edit, a diffusion transformer for instruction-based image editing that achieves state-of-the-art performance through systematic optimization of data curation, training methodology, and evaluation design. We construct a 1.6B-sample training corpus, comprising 900M text-to-image and 700M image editing pairs from diverse sources. After rigorous cleaning, stratification, auto-labeling, and two-stage filtering, we retain over 100M high-quality samples balanced between generation and editing, ensuring strong semantic coverage and instruction alignment. Our multi-stage training pipeline progressively builds editing capability via pre-training, supervised fine-tuning, and reinforcement learning. To improve data efficiency, we introduce a Multi-Condition Aware Bucket Sampler for variable-resolution batching and Stochastic Instruction Alignment with dynamic prompt re-indexing. To stabilize optimization and enhance controllability, we propose Asymmetric Gradient Optimization for DPO, DiffusionNFT with layout-aware OCR rewards for text editing, and a differentiable Consistency Loss for identity preservation. We further establish REDEdit-Bench, a comprehensive benchmark spanning 15 editing categories, including newly introduced beautification and low-level enhancement tasks. Extensive experiments on REDEdit-Bench and public benchmarks (ImgEdit and GEdit) demonstrate competitive or superior performance against both open-source and proprietary systems. We release code, models, and the benchmark suite to support future research.
151. 【2602.13339】An Integrated Causal Inference Framework for Traffic Safety Modeling with Semantic Street-View Visual Features
链接:https://arxiv.org/abs/2602.13339
作者:Lishan Sun,Yujia Cheng,Pengfei Cui,Lei Han,Mohamed Abdel-Aty,Yunhan Zheng,Xingchen Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:identify critical risk, critical risk factors, informing targeted policy, Macroscopic traffic safety, safety modeling aims
备注: 34 pages, 13 figures
点击查看摘要
Abstract:Macroscopic traffic safety modeling aims to identify critical risk factors for regional crashes, thereby informing targeted policy interventions for safety improvement. However, current approaches rely heavily on static sociodemographic and infrastructure metrics, frequently overlooking the impacts from drivers' visual perception of driving environment. Although visual environment features have been found to impact driving and traffic crashes, existing evidence remains largely observational, failing to establish the robust causality for traffic policy evaluation under complex spatial environment. To fill these gaps, we applied semantic segmentation on Google Street View imageries to extract visual environmental features and proposed a Double Machine Learning framework to quantify their causal effects on regional crashes. Meanwhile, we utilized SHAP values to characterize the nonlinear influence mechanisms of confounding variables in the models and applied causal forests to estimate conditional average treatment effects. Leveraging crash records from the Miami metropolitan area, Florida, and 220,000 street view images, evidence shows that greenery proportion exerts a significant and robust negative causal effect on traffic crashes (Average Treatment Effect = -6.38, p = 0.005). This protective effect exhibits spatial heterogeneity, being most pronounced in densely populated and socially vulnerable urban cores. While greenery significantly mitigates angle and rear-end crashes, its protective benefit for vulnerable road users (VRUs) remains limited. Our findings provide causal evidence for greening as a potential safety intervention, prioritizing hazardous visual environments while highlighting the need for distinct design optimizations to protect VRUs.
152. 【2602.13335】Meningioma Analysis and Diagnosis using Limited Labeled Samples
链接:https://arxiv.org/abs/2602.13335
作者:Jiamiao Lu,Wei Wu,Ke Gao,Ping Mao,Weichuan Zhang,Tuo Wang,Lingkun Ma,Jiapan Guo,Zanyi Wu,Yuqing Hu,Changming Sun
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:accurate diagnosis essential, treatment response, treatment planning, making an accurate, prognosis assessment
备注: 19 pages,7 figures
点击查看摘要
Abstract:The biological behavior and treatment response of meningiomas depend on their grade, making an accurate diagnosis essential for treatment planning and prognosis assessment. We observed that the weighted fusion of spatial-frequency domain features significantly influences meningioma classification performance. Notably, the contribution of specific frequency bands obtained by discrete wavelet transform varies considerably across different images. A feature fusion architecture with adaptive weights of different frequency band information and spatial domain information is proposed for few-shot meningioma learning. To verify the effectiveness of the proposed method, a new MRI dataset of meningiomas is introduced. The experimental results demonstrate the superiority of the proposed method compared with existing state-of-the-art methods in three datasets. The code will be available at: this https URL
153. 【2602.13334】Ask the Expert: Collaborative Inference for Vision Transformers with Near-Edge Accelerators
链接:https://arxiv.org/abs/2602.13334
作者:Hao Liu,Suhaib A. Fahmy
类目:Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
关键词:Deploying Vision Transformers, Deploying Vision, Vision Transformers, high computational complexity, cloud resources presents
备注:
点击查看摘要
Abstract:Deploying Vision Transformers on edge devices is challenging due to their high computational complexity, while full offloading to cloud resources presents significant latency overheads. We propose a novel collaborative inference framework, which orchestrates a lightweight generalist ViT on an edge device and multiple medium-sized expert ViTs on a near-edge accelerator. A novel routing mechanism uses the edge model's Top-$\mathit{k}$ predictions to dynamically select the most relevant expert for samples with low confidence. We further design a progressive specialist training strategy to enhance expert accuracy on dataset subsets. Extensive experiments on the CIFAR-100 dataset using a real-world edge and near-edge testbed demonstrate the superiority of our framework. Specifically, the proposed training strategy improves expert specialization accuracy by 4.12% on target subsets and enhances overall accuracy by 2.76% over static experts. Moreover, our method reduces latency by up to 45% compared to edge execution, and energy consumption by up to 46% compared to just near-edge offload.
154. 【2602.13332】MedScope: Incentivizing "Think with Videos" for Clinical Reasoning via Coarse-to-Fine Tool Calling
链接:https://arxiv.org/abs/2602.13332
作者:Wenjie Li,Yujie Zhang,Haoran Sun,Xingqi He,Hongcheng Gao,Chenglong Ma,Ming Hu,Guankun Wang,Shiyi Yao,Renhao Yang,Hongliang Ren,Lei Wang,Junjun He,Yankai Jiang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:visual evidence-based decision-making, evidence-based decision-making, related settings, growing importance, importance for applications
备注:
点击查看摘要
Abstract:Long-form clinical videos are central to visual evidence-based decision-making, with growing importance for applications such as surgical robotics and related settings. However, current multimodal large language models typically process videos with passive sampling or weakly grounded inspection, which limits their ability to iteratively locate, verify, and justify predictions with temporally targeted evidence. To close this gap, we propose MedScope, a tool-using clinical video reasoning model that performs coarse-to-fine evidence seeking over long-form procedures. By interleaving intermediate reasoning with targeted tool calls and verification on retrieved observations, MedScope produces more accurate and trustworthy predictions that are explicitly grounded in temporally localized visual evidence. To address the lack of high-fidelity supervision, we build ClinVideoSuite, an evidence-centric, fine-grained clinical video suite. We then optimize MedScope with Grounding-Aware Group Relative Policy Optimization (GA-GRPO), which directly reinforces tool use with grounding-aligned rewards and evidence-weighted advantages. On full and fine-grained video understanding benchmarks, MedScope achieves state-of-the-art performance in both in-domain and out-of-domain evaluations. Our approach illuminates a path toward medical AI agents that can genuinely "think with videos" through tool-integrated reasoning. We will release our code, models, and data.
155. 【2602.13330】Zwitscherkasten -- DIY Audiovisual bird monitoring
链接:https://arxiv.org/abs/2602.13330
作者:Dominik Blum,Elias Häring,Fabian Jirges,Martin Schäffer,David Schick,Florian Schulenberg,Torsten Schön
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:paper presents Zwitscherkasten, presents Zwitscherkasten, multimodal system, edge devices, paper presents
备注: Project Report of the Applied Artificial Intelligence Degree Program at Technische Hochschule Ingolstadt
点击查看摘要
Abstract:This paper presents Zwitscherkasten, a DiY, multimodal system for bird species monitoring using audio and visual data on edge devices. Deep learning models for bioacoustic and image-based classification are deployed on resource-constrained hardware, enabling real-time, non-invasive monitoring. An acoustic activity detector reduces energy consumption, while visual recognition is performed using fine-grained detection and classification pipelines. Results show that accurate bird species identification is feasible on embedded platforms, supporting scalable biodiversity monitoring and citizen science applications.
156. 【2602.13329】HiST-VLA: A Hierarchical Spatio-Temporal Vision-Language-Action Model for End-to-End Autonomous Driving
链接:https://arxiv.org/abs/2602.13329
作者:Yiru Wang,Zichong Gu,Yu Gao,Anqing Jiang,Zhigang Sun,Shuo Wang,Yuwen Heng,Hao Sun
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
关键词:offer promising capabilities, models offer promising, multimodal understanding, offer promising, promising capabilities
备注:
点击查看摘要
Abstract:Vision-Language-Action (VLA) models offer promising capabilities for autonomous driving through multimodal understanding. However, their utilization in safety-critical scenarios is constrained by inherent limitations, including imprecise numerical reasoning, weak 3D spatial awareness, and high sensitivity to context. To address these challenges, we propose HiST-VLA, a novel Hierarchical Spatio-Temporal VLA model designed for reliable trajectory generation. Our framework enhances 3D spatial and temporal reasoning by integrating geometric awareness with fine-grained driving commands and state history prompting. To ensure computational efficiency, we integrate dynamic token sparsification into the VLA architecture. This approach fuses redundant tokens rather than filtering them, effectively reducing redundancy without sacrificing model performance. Furthermore, we employ a hierarchical transformer-based planner to progressively refine coarse VLA waypoints into fine-grained trajectories. Crucially, the planner utilizes dynamic latent regularization to incorporate language commands, ensuring strict spatial grounding and temporal coherence. Extensive evaluation on the NAVSIM v2 benchmark demonstrates state-of-the-art performance on Navtest, achieving an EPDMS of 88.6, and EPDMS of 50.9 on pseudo closed-loop Navhard benchmark.
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Cite as:
arXiv:2602.13329 [cs.CV]
(or
arXiv:2602.13329v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.13329
Focus to learn more
arXiv-issued DOI via DataCite</p>
157. 【2602.13326】MotionWeaver: Holistic 4D-Anchored Framework for Multi-Humanoid Image Animation
链接:https://arxiv.org/abs/2602.13326
作者:Xirui Hu,Yanbo Ding,Jiahao Wang,Tingting Shi,Yali Wang,Guo Zhi Zhi,Weizhan Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:remains largely limited, reference characters driven, diverse humanoid forms, pose sequences, driven by pose
备注:
点击查看摘要
Abstract:Character image animation, which synthesizes videos of reference characters driven by pose sequences, has advanced rapidly but remains largely limited to single-human settings. Existing methods struggle to generalize to multi-humanoid scenarios, which involve diverse humanoid forms, complex interactions, and frequent occlusions. We address this gap with two key innovations. First, we introduce unified motion representations that extract identity-agnostic motions and explicitly bind them to corresponding characters, enabling generalization across diverse humanoid forms and seamless extension to multi-humanoid scenarios. Second, we propose a holistic 4D-anchored paradigm that constructs a shared 4D space to fuse motion representations with video latents, and further reinforces this process with hierarchical 4D-level supervision to better handle interactions and occlusions. We instantiate these ideas in MotionWeaver, an end-to-end framework for multi-humanoid image animation. To support this setting, we curate a 46-hour dataset of multi-human videos with rich interactions, and construct a 300-video benchmark featuring paired humanoid characters. Quantitative and qualitative experiments demonstrate that MotionWeaver not only achieves state-of-the-art results on our benchmark but also generalizes effectively across diverse humanoid forms, complex interactions, and challenging multi-humanoid scenarios.
158. 【2602.13324】Synthesizing the Kill Chain: A Zero-Shot Framework for Target Verification and Tactical Reasoning on the Edge
链接:https://arxiv.org/abs/2602.13324
作者:Jesse Barkley,Abraham George,Amir Barati Farimani
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
关键词:dynamic military environments, scarce domain-specific training, domain-specific training data, Deploying autonomous edge, Deploying autonomous
备注: 8 Pages, 3 Figures
点击查看摘要
Abstract:Deploying autonomous edge robotics in dynamic military environments is constrained by both scarce domain-specific training data and the computational limits of edge hardware. This paper introduces a hierarchical, zero-shot framework that cascades lightweight object detection with compact Vision-Language Models (VLMs) from the Qwen and Gemma families (4B-12B parameters). Grounding DINO serves as a high-recall, text-promptable region proposer, and frames with high detection confidence are passed to edge-class VLMs for semantic verification. We evaluate this pipeline on 55 high-fidelity synthetic videos from Battlefield 6 across three tasks: false-positive filtering (up to 100% accuracy), damage assessment (up to 97.5%), and fine-grained vehicle classification (55-90%). We further extend the pipeline into an agentic Scout-Commander workflow, achieving 100% correct asset deployment and a 9.8/10 reasoning score (graded by GPT-4o) with sub-75-second latency. A novel "Controlled Input" methodology decouples perception from reasoning, revealing distinct failure phenotypes: Gemma3-12B excels at tactical logic but fails in visual perception, while Gemma3-4B exhibits reasoning collapse even with accurate inputs. These findings validate hierarchical zero-shot architectures for edge autonomy and provide a diagnostic framework for certifying VLM suitability in safety-critical applications.
159. 【2602.13322】Diagnostic Benchmarks for Invariant Learning Dynamics: Empirical Validation of the Eidos Architecture
链接:https://arxiv.org/abs/2602.13322
作者:Datorien L. Anderson
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:isolate topological invariance, standard vision benchmarks, maintain structural identity, dominate standard vision, diagnostic benchmarks designed
备注: 8 pages, 3 figures and extra material to help can be found: [this https URL](https://zenodo.org/records/18529180)
点击查看摘要
Abstract:We present the PolyShapes-Ideal (PSI) dataset, a suite of diagnostic benchmarks designed to isolate topological invariance -- the ability to maintain structural identity across affine transformations -- from the textural correlations that dominate standard vision benchmarks. Through three diagnostic probes (polygon classification under noise, zero-shot font transfer from MNIST, and geometric collapse mapping under progressive deformation), we demonstrate that the Eidos architecture achieves 99% accuracy on PSI and 81.67% zero-shot transfer across 30 unseen typefaces without pre-training. These results validate the "Form-First" hypothesis: generalization in structurally constrained architectures is a property of geometric integrity, not statistical scale.
160. 【2602.13318】DECKBench: Benchmarking Multi-Agent Frameworks for Academic Slide Generation and Editing
链接:https://arxiv.org/abs/2602.13318
作者:Daesik Jang,Morgan Lindsay Heisler,Linzi Xing,Yifei Li,Edward Wang,Ying Xiong,Yong Zhang,Zhenan Fan
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Automatically generating, slide decks requires, generating and iteratively, Compliance Kit Benchmark, multi-agent slide generation
备注:
点击查看摘要
Abstract:Automatically generating and iteratively editing academic slide decks requires more than document summarization. It demands faithful content selection, coherent slide organization, layout-aware rendering, and robust multi-turn instruction following. However, existing benchmarks and evaluation protocols do not adequately measure these challenges. To address this gap, we introduce the Deck Edits and Compliance Kit Benchmark (DECKBench), an evaluation framework for multi-agent slide generation and editing. DECKBench is built on a curated dataset of paper to slide pairs augmented with realistic, simulated editing instructions. Our evaluation protocol systematically assesses slide-level and deck-level fidelity, coherence, layout quality, and multi-turn instruction following. We further implement a modular multi-agent baseline system that decomposes the slide generation and editing task into paper parsing and summarization, slide planning, HTML creation, and iterative editing. Experimental results demonstrate that the proposed benchmark highlights strengths, exposes failure modes, and provides actionable insights for improving multi-agent slide generation and editing systems. Overall, this work establishes a standardized foundation for reproducible and comparable evaluation of academic presentation generation and editing. Code and data are publicly available at this https URL .
161. 【2602.13315】IDPruner: Harmonizing Importance and Diversity in Visual Token Pruning for MLLMs
链接:https://arxiv.org/abs/2602.13315
作者:Yifan Tan,Yifu Sun,Shirui Huang,Hong Liu,Guanghua Yu,Jianchen Zhu,Yangdong Deng
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Multimodal Large Language, Large Language Models, Large Language, demonstrated impressive capabilities, encounter significant computational
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities, yet they encounter significant computational bottlenecks due to the massive volume of visual tokens. Consequently, visual token pruning, which substantially reduces the token count, has emerged as a critical technique for accelerating MLLM inference. Existing approaches focus on token importance, diversity, or an intuitive combination of both, without a principled framework for their optimal integration. To address this issue, we first conduct a systematic analysis to characterize the trade-off between token importance and semantic diversity. Guided by this analysis, we propose the \textbf{I}mportance and \textbf{D}iversity Pruner (\textbf{IDPruner}), which leverages the Maximal Marginal Relevance (MMR) algorithm to achieve a Pareto-optimal balance between these two objectives. Crucially, our method operates without requiring attention maps, ensuring full compatibility with FlashAttention and efficient deployment via one-shot pruning. We conduct extensive experiments across various model architectures and multimodal benchmarks, demonstrating that IDPruner achieves state-of-the-art performance and superior generalization across diverse architectures and tasks. Notably, on Qwen2.5-VL-7B-Instruct, IDPruner retains 95.18\% of baseline performance when pruning 75\% of the tokens, and still maintains 86.40\% even under an extreme 90\% pruning ratio. Our code is available at this https URL.
162. 【2602.13314】Sim2Radar: Toward Bridging the Radar Sim-to-Real Gap with VLM-Guided Scene Reconstruction
链接:https://arxiv.org/abs/2602.13314
作者:Emily Bejerano,Federico Tondolo,Aayan Qayyum,Xiaofan Yu,Xiaofan Jiang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:degraded indoor environments, large-scale radar datasets, visually degraded indoor, annotating large-scale radar, low light
备注:
点击查看摘要
Abstract:Millimeter-wave (mmWave) radar provides reliable perception in visually degraded indoor environments (e.g., smoke, dust, and low light), but learning-based radar perception is bottlenecked by the scarcity and cost of collecting and annotating large-scale radar datasets. We present Sim2Radar, an end-to-end framework that synthesizes training radar data directly from single-view RGB images, enabling scalable data generation without manual scene modeling. Sim2Radar reconstructs a material-aware 3D scene by combining monocular depth estimation, segmentation, and vision-language reasoning to infer object materials, then simulates mmWave propagation with a configurable physics-based ray tracer using Fresnel reflection models parameterized by ITU-R electromagnetic properties. Evaluated on real-world indoor scenes, Sim2Radar improves downstream 3D radar perception via transfer learning: pre-training a radar point-cloud object detection model on synthetic data and fine-tuning on real radar yields up to +3.7 3D AP (IoU 0.3), with gains driven primarily by improved spatial localization. These results suggest that physics-based, vision-driven radar simulation can provide effective geometric priors for radar learning and measurably improve performance under limited real-data supervision.
163. 【2602.13313】Agentic Spatio-Temporal Grounding via Collaborative Reasoning
链接:https://arxiv.org/abs/2602.13313
作者:Heng Zhao,Yew-Soon Ong,Joey Tianyi Zhou
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Spatio-Temporal Video Grounding, Video Grounding, text query, object or person, Spatio-Temporal Video
备注:
点击查看摘要
Abstract:Spatio-Temporal Video Grounding (STVG) aims to retrieve the spatio-temporal tube of a target object or person in a video given a text query. Most existing approaches perform frame-wise spatial localization within a predicted temporal span, resulting in redundant computation, heavy supervision requirements, and limited generalization. Weakly-supervised variants mitigate annotation costs but remain constrained by the dataset-level train-and-fit paradigm with an inferior performance. To address these challenges, we propose the Agentic Spatio-Temporal Grounder (ASTG) framework for the task of STVG towards an open-world and training-free scenario. Specifically, two specialized agents SRA (Spatial Reasoning Agent) and TRA (Temporal Reasoning Agent) constructed leveraging on modern Multimoal Large Language Models (MLLMs) work collaboratively to retrieve the target tube in an autonomous and self-guided manner. Following a propose-and-evaluation paradigm, ASTG duly decouples spatio-temporal reasoning and automates the tube extraction, verification and temporal localization processes. With a dedicate visual memory and dialogue context, the retrieval efficiency is significantly enhanced. Experiments on popular benchmarks demonstrate the superiority of the proposed approach where it outperforms existing weakly-supervised and zero-shot approaches by a margin and is comparable to some of the fully-supervised methods.
164. 【2602.13310】Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension
链接:https://arxiv.org/abs/2602.13310
作者:Haoran Xu,Hongyu Wang,Jiaze Li,Shunpeng Chen,Zizhao Tong,Jianzhong Ju,Zhenbo Luo,Jian Luan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Existing LLM test-time, LLM test-time scaling, Existing LLM, test-time scaling laws, scaling laws emphasize
备注:
点击查看摘要
Abstract:Existing LLM test-time scaling laws emphasize the emergence of self-reflective behaviors through extended reasoning length. Nevertheless, this vertical scaling strategy often encounters plateaus in exploration as the model becomes locked into specific thinking pattern. By shifting from depth to parallelism, parallel thinking mitigates the narrowing of exploration. However, the extension of this paradigm to visual domain remains an open research question. In this paper, we first examine the role of visual partitioning in parallelized reasoning and subsequently propose two distinct strategies. Based on the above, we introduce Visual Para-Thinker, representing the inaugural parallel reasoning framework for MLLMs. To maintain path independence and promote diversity in reasoning, our approach integrates Pa-Attention alongside LPRoPE. Leveraging the vLLM framework, we have developed a native multimodal implementation that facilitates high-efficiency parallel processing. Empirical results on benchmark datasets such as V*, CountBench, RefCOCO, and HallusionBench confirm that Visual Para-Thinker successfully extends the benefits of parallel reasoning to the visual domain.
165. 【2602.13306】Fine-Tuning a Large Vision-Language Model for Artwork's Scoring and Critique
链接:https://arxiv.org/abs/2602.13306
作者:Zhehan Zhang,Meihua Qian,Li Luo,Siyu Huang,Chaoyi Zhou,Ripon Saha,Xinxin Song
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Assessing artistic creativity, Torrance Tests, Creative Thinking, Tests of Creative, Assessing artistic
备注:
点击查看摘要
Abstract:Assessing artistic creativity is foundational to creativity research and arts education, yet manual scoring (e.g., Torrance Tests of Creative Thinking) is labor-intensive at scale. Prior machine-learning approaches show promise for visual creativity scoring, but many rely mainly on image features and provide limited or no explanatory feedback. We propose a framework for automated creativity assessment of human paintings by fine-tuning the vision-language model Qwen2-VL-7B with multi-task learning. Our dataset contains 1000 human-created paintings scored on a 1-100 scale and paired with a short human-written description (content or artist explanation). Two expert raters evaluated each work using a five-dimension rubric (originality, color, texture, composition, content) and provided written critiques; we use an 80/20 train-test split. We add a lightweight regression head on the visual encoder output so the model can predict a numerical score and generate rubric-aligned feedback in a single forward pass. By embedding the structured rubric and the artwork description in the system prompt, we constrain the generated text to match the quantitative prediction. Experiments show strong accuracy, achieving Pearson r 0.97 and MAE about 3.95 on the 100-point scale. Qualitative evaluation indicates the generated feedback is semantically close to expert critiques (average SBERT cosine similarity = 0.798). The proposed approach bridges computer vision and art assessment and offers a scalable tool for creativity research and classroom feedback.
166. 【2602.13305】WildfireVLM: AI-powered Analysis for Early Wildfire Detection and Risk Assessment Using Satellite Imagery
链接:https://arxiv.org/abs/2602.13305
作者:Aydin Ayanzadeh,Prakhar Dixit,Sadia Kamal,Milton Halem
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:intensity rising due, human lives, human activities, threat to ecosystems, growing threat
备注:
点击查看摘要
Abstract:Wildfires are a growing threat to ecosystems, human lives, and infrastructure, with their frequency and intensity rising due to climate change and human activities. Early detection is critical, yet satellite-based monitoring remains challenging due to faint smoke signals, dynamic weather conditions, and the need for real-time analysis over large areas. We introduce WildfireVLM, an AI framework that combines satellite imagery wildfire detection with language-driven risk assessment. We construct a labeled wildfire and smoke dataset using imagery from Landsat-8/9, GOES-16, and other publicly available Earth observation sources, including harmonized products with aligned spectral bands. WildfireVLM employs YOLOv12 to detect fire zones and smoke plumes, leveraging its ability to detect small, complex patterns in satellite imagery. We integrate Multimodal Large Language Models (MLLMs) that convert detection outputs into contextualized risk assessments and prioritized response recommendations for disaster management. We validate the quality of risk reasoning using an LLM-as-judge evaluation with a shared rubric. The system is deployed using a service-oriented architecture that supports real-time processing, visual risk dashboards, and long-term wildfire tracking, demonstrating the value of combining computer vision with language-based reasoning for scalable wildfire monitoring.
167. 【2602.13304】Progressive Contrast Registration for High-Fidelity Bidirectional Photoacoustic Microscopy Alignment
链接:https://arxiv.org/abs/2602.13304
作者:Jiahao Qin
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:High-speed optical-resolution photoacoustic, optical-resolution photoacoustic microscopy, backward scan lines, bidirectional raster scanning, raster scanning doubles
备注: 11 pages, 3 figures, 3 tables
点击查看摘要
Abstract:High-speed optical-resolution photoacoustic microscopy (OR-PAM) with bidirectional raster scanning doubles imaging speed but introduces coupled domain shift and geometric misalignment between forward and backward scan lines. Existing methods, constrained by brightness constancy assumptions, achieve limited alignment quality (NCC~$\leq 0.96$). We propose PCReg-Net, a progressive contrast-guided registration framework that performs coarse-to-fine alignment through four lightweight modules: (1)~a registration U-Net for coarse alignment, (2)~a reference feature extractor capturing multi-scale structural cues, (3)~a contrast module that identifies residual misalignment by comparing coarse-registered and reference features, and (4)~a refinement U-Net with feature injection for high-fidelity output. We further propose the Temporal NCC (TNCC) and Temporal NCC Gap (TNCG) for reference-free evaluation of inter-frame temporal consistency. On OR-PAM-Reg-4K (432 test samples), PCReg-Net achieves NCC of 0.983, SSIM of 0.982, and PSNR of 46.96 dB, surpassing the state-of-the-art by over 14 dB at real-time speed. Code is available at this https URL
168. 【2602.13303】Spectral Collapse in Diffusion Inversion
链接:https://arxiv.org/abs/2602.13303
作者:Nicolas Bourriez,Alexandre Verine,Auguste Genovesio
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
关键词:Conditional diffusion inversion, Conditional diffusion, framework for unpaired, powerful framework, diffusion inversion
备注:
点击查看摘要
Abstract:Conditional diffusion inversion provides a powerful framework for unpaired image-to-image translation. However, we demonstrate through an extensive analysis that standard deterministic inversion (e.g. DDIM) fails when the source domain is spectrally sparse compared to the target domain (e.g., super-resolution, sketch-to-image). In these contexts, the recovered latent from the input does not follow the expected isotropic Gaussian distribution. Instead it exhibits a signal with lower frequencies, locking target sampling to oversmoothed and texture-poor generations. We term this phenomenon spectral collapse. We observe that stochastic alternatives attempting to restore the noise variance tend to break the semantic link to the input, leading to structural drift. To resolve this structure-texture trade-off, we propose Orthogonal Variance Guidance (OVG), an inference-time method that corrects the ODE dynamics to enforce the theoretical Gaussian noise magnitude within the null-space of the structural gradient. Extensive experiments on microscopy super-resolution (BBBC021) and sketch-to-image (Edges2Shoes) demonstrate that OVG effectively restores photorealistic textures while preserving structural fidelity.
169. 【2602.13301】DriveMamba: Task-Centric Scalable State Space Model for Efficient End-to-End Autonomous Driving
链接:https://arxiv.org/abs/2602.13301
作者:Haisheng Su,Wei Wu,Feixiang Song,Junjie Zhang,Zhenjie Yang,Junchi Yan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Autonomous Driving, separable Transformer decoders, dense BEV features, integrating modular designs, encode scene representations
备注: Accepted to ICLR2026
点击查看摘要
Abstract:Recent advances towards End-to-End Autonomous Driving (E2E-AD) have been often devoted on integrating modular designs into a unified framework for joint optimization e.g. UniAD, which follow a sequential paradigm (i.e., perception-prediction-planning) based on separable Transformer decoders and rely on dense BEV features to encode scene representations. However, such manual ordering design can inevitably cause information loss and cumulative errors, lacking flexible and diverse relation modeling among different modules and sensors. Meanwhile, insufficient training of image backbone and quadratic-complexity of attention mechanism also hinder the scalability and efficiency of E2E-AD system to handle spatiotemporal input. To this end, we propose DriveMamba, a Task-Centric Scalable paradigm for efficient E2E-AD, which integrates dynamic task relation modeling, implicit view correspondence learning and long-term temporal fusion into a single-stage Unified Mamba decoder. Specifically, both extracted image features and expected task outputs are converted into token-level sparse representations in advance, which are then sorted by their instantiated positions in 3D space. The linear-complexity operator enables efficient long-context sequential token modeling to capture task-related inter-dependencies simultaneously. Additionally, a bidirectional trajectory-guided "local-to-global" scan method is designed to preserve spatial locality from ego-perspective, thus facilitating the ego-planning. Extensive experiments conducted on nuScenes and Bench2Drive datasets demonstrate the superiority, generalizability and great efficiency of DriveMamba.
170. 【2602.13299】KidMesh: Computational Mesh Reconstruction for Pediatric Congenital Hydronephrosis Using Deep Neural Networks
链接:https://arxiv.org/abs/2602.13299
作者:Haoran Sun,Zhanpeng Zhu,Anguo Zhang,Bo Liu,Zhaohua Lin,Liqin Huang,Mingjing Yang,Lei Liu,Shan Lin,Wangbin Ding
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Pediatric congenital hydronephrosis, urinary tract disorder, common urinary tract, Pediatric congenital, renal pelvis-ureter junction
备注:
点击查看摘要
Abstract:Pediatric congenital hydronephrosis (CH) is a common urinary tract disorder, primarily caused by obstruction at the renal pelvis-ureter junction. Magnetic resonance urography (MRU) can visualize hydronephrosis, including renal pelvis and calyces, by utilizing the natural contrast provided by water. Existing voxel-based segmentation approaches can extract CH regions from MRU, facilitating disease diagnosis and prognosis. However, these segmentation methods predominantly focus on morphological features, such as size, shape, and structure. To enable functional assessments, such as urodynamic simulations, external complex post-processing steps are required to convert these results into mesh-level representations. To address this limitation, we propose an end-to-end method based on deep neural networks, namely KidMesh, which could automatically reconstruct CH meshes directly from MRU. Generally, KidMesh extracts feature maps from MRU images and converts them into feature vertices through grid sampling. It then deforms a template mesh according to these feature vertices to generate the specific CH meshes of MRU images. Meanwhile, we develop a novel schema to train KidMesh without relying on accurate mesh-level annotations, which are difficult to obtain due to the sparsely sampled MRU slices. Experimental results show that KidMesh could reconstruct CH meshes in an average of 0.4 seconds, and achieve comparable performance to conventional methods without requiring post-processing. The reconstructed meshes exhibited no self-intersections, with only 3.7% and 0.2% of the vertices having error distances exceeding 3.2mm and 6.4mm, respectively. After rasterization, these meshes achieved a Dice score of 0.86 against manually delineated CH masks. Furthermore, these meshes could be used in renal urine flow simulations, providing valuable urodynamic information for clinical practice.
171. 【2602.13298】Effect of Convolutional Depth on Image Recognition Performance: VGG vs. ResNet vs. GoogLeNet
链接:https://arxiv.org/abs/2602.13298
作者:Manfred M. Fischer,Joshua Pitts
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:uniformly yield higher, Increasing convolutional depth, yield higher accuracy, Increasing convolutional, image recognition
备注:
点击查看摘要
Abstract:Increasing convolutional depth has been central to advances in image recognition, yet deeper networks do not uniformly yield higher accuracy, stable optimization, or efficient computation. We present a controlled comparative study of three canonical convolutional neural network architectures - VGG, ResNet, and GoogLeNet - to isolate how depth influences classification performance, convergence behavior, and computational efficiency. By standardizing training protocols and explicitly distinguishing between nominal and effective depth, we show that the benefits of depth depend critically on architectural mechanisms that constrain its effective manifestation during training rather than on nominal depth alone. Although plain deep networks exhibit early accuracy saturation and optimization instability, residual and inception-based architectures consistently translate additional depth into improved accuracy at lower effective depth and favorable accuracy-compute trade-offs. These findings demonstrate that effective depth, not nominal depth, is the operative quantity governing depth's role as a productive scaling dimension in convolutional networks.
172. 【2602.13297】Conditional Generative Models for High-Resolution Range Profiles: Capturing Geometry-Driven Trends in a Large-Scale Maritime Dataset
链接:https://arxiv.org/abs/2602.13297
作者:Edwyn Brient(CMM),Santiago Velasco-Forero(CMM),Rami Kassab
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:High-resolution range profiles, enable fast onboard, automatic target recognition, fast onboard processing, radar automatic target
备注:
点击查看摘要
Abstract:High-resolution range profiles (HRRPs) enable fast onboard processing for radar automatic target recognition, but their strong sensitivity to acquisition conditions limits robustness across operational scenarios. Conditional HRRP generation can mitigate this issue, yet prior studies are constrained by small, highly specific datasets. We study HRRP synthesis on a largescale maritime database representative of coastal surveillance variability. Our analysis indicates that the fundamental scenario drivers are geometric: ship dimensions and the desired aspect angle. Conditioning on these variables, we train generative models and show that the synthesized signatures reproduce the expected line-of-sight geometric trend observed in real data. These results highlight the central role of acquisition geometry for robust HRRP generation.
173. 【2602.13296】MFN Decomposition and Related Metrics for High-Resolution Range Profiles Generative Models
链接:https://arxiv.org/abs/2602.13296
作者:Edwyn Brient(CMM),Santiago Velasco-Forero(CMM),Rami Kassab
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:High-resolution range profile, automatic target recognition, radar automatic target, High-resolution range, range profile
备注:
点击查看摘要
Abstract:High-resolution range profile (HRRP ) data are in vogue in radar automatic target recognition (RATR). With the interest in classifying models using HRRP, filling gaps in datasets using generative models has recently received promising contributions. Evaluating generated data is a challenging topic, even for explicit data like face images. However, the evaluation methods used in the state-ofthe-art of HRRP generation rely on classification models. Such models, called ''black-box'', do not allow either explainability on generated data or multi-level evaluation. This work focuses on decomposing HRRP data into three components: the mask, the features, and the noise. Using this decomposition, we propose two metrics based on the physical interpretation of those data. We take profit from an expensive dataset to evaluate our metrics on a challenging task and demonstrate the discriminative ability of those.
174. 【2602.13294】VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction
链接:https://arxiv.org/abs/2602.13294
作者:Jiarong Liang,Max Ku,Ka-Hei Hui,Ping Nie,Wenhu Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Multimodal Large Language, Large Language Models, Evaluating whether Multimodal, Multimodal Large, Large Language
备注:
点击查看摘要
Abstract:Evaluating whether Multimodal Large Language Models (MLLMs) genuinely reason about physical dynamics remains challenging. Most existing benchmarks rely on recognition-style protocols such as Visual Question Answering (VQA) and Violation of Expectation (VoE), which can often be answered without committing to an explicit, testable physical hypothesis. We propose VisPhyWorld, an execution-based framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations. By producing runnable code, the inferred world representation is directly inspectable, editable, and falsifiable. This separates physical reasoning from rendering. Building on this framework, we introduce VisPhyBench, comprising 209 evaluation scenes derived from 108 physical templates and a systematic protocol that evaluates how well models reconstruct appearance and reproduce physically plausible motion. Our pipeline produces valid reconstructed videos in 97.7% on the benchmark. Experiments show that while state-of-the-art MLLMs achieve strong semantic scene understanding, they struggle to accurately infer physical parameters and to simulate consistent physical dynamics.
175. 【2602.13293】NutVLM: A Self-Adaptive Defense Framework against Full-Dimension Attacks for Vision Language Models in Autonomous Driving
链接:https://arxiv.org/abs/2602.13293
作者:Xiaoxu Peng,Dong Zhou,Jianwen Zhang,Guanghui Sun,Anh Tu Ngo,Anupam Chattopadhyay
类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
关键词:Vision Language Models, Vision Language, Language Models, advanced perception, perception in autonomous
备注: 12 pages, 6 figures
点击查看摘要
Abstract:Vision Language Models (VLMs) have advanced perception in autonomous driving (AD), but they remain vulnerable to adversarial threats. These risks range from localized physical patches to imperceptible global perturbations. Existing defense methods for VLMs remain limited and often fail to reconcile robustness with clean-sample performance. To bridge these gaps, we propose NutVLM, a comprehensive self-adaptive defense framework designed to secure the entire perception-decision lifecycle. Specifically, we first employ NutNet++ as a sentinel, which is a unified detection-purification mechanism. It identifies benign samples, local patches, and global perturbations through three-way classification. Subsequently, localized threats are purified via efficient grayscale masking, while global perturbations trigger Expert-guided Adversarial Prompt Tuning (EAPT). Instead of the costly parameter updates of full-model fine-tuning, EAPT generates "corrective driving prompts" via gradient-based latent optimization and discrete projection. These prompts refocus the VLM's attention without requiring exhaustive full-model retraining. Evaluated on the Dolphins benchmark, our NutVLM yields a 4.89% improvement in overall metrics (e.g., Accuracy, Language Score, and GPT Score). These results validate NutVLM as a scalable security solution for intelligent transportation. Our code is available at this https URL.
176. 【2602.13289】Evaluating the Impact of Post-Training Quantization on Reliable VQA with Multimodal LLMs
链接:https://arxiv.org/abs/2602.13289
作者:Paul Jonas Kurz,Tobias Jan Wieczorek,Mohamed A. Abdelsalam,Rahaf Aljundi,Marcus Rohrbach
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Multimodal Large Language, Language Models, Large Language, efficiency are critical
备注: Accepted poster at the 1st Workshop on Epistemic Intelligence in Machine Learning (EIML) @ EURIPS 2025
点击查看摘要
Abstract:Multimodal Large Language Models (MLLM) are increasingly deployed in domains where both reliability and efficiency are critical. However, current models remain overconfident, producing highly certain but incorrect answers. At the same time, their large size limits deployment on edge devices, necessitating compression. We study the intersection of these two challenges by analyzing how Post-Training Quantization (PTQ) compression affects both accuracy and reliability in Visual Question Answering (VQA). We evaluate two MLLMs, Qwen2-VL-7B and Idefics3-8B, quantized with data-free (HQQ) and data-aware (MBQ) methods across multiple bit widths. To counteract the reduction in reliability caused by quantization, we adapt the Selector confidence estimator for quantized multimodal settings and test its robustness across various quantization levels and out-of-distribution (OOD) scenarios. We find that PTQ degrades both accuracy and reliability. Data-aware methods soften the effect thereof. The Selector substantially mitigates the reliability impact. The combination of int4 MBQ and the Selector achieves the best efficiency-reliability trade-off, closing in on uncompressed performance at approx. 75% less memory demand. Overall, we present the first systematic study linking quantization and reliability in multimodal settings.
177. 【2602.13287】COOPERTRIM: Adaptive Data Selection for Uncertainty-Aware Cooperative Perception
链接:https://arxiv.org/abs/2602.13287
作者:Shilpa Mukhopadhyay,Amit Roy-Chowdhury,Hang Qiu
类目:Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
关键词:perception enables autonomous, share encoded representations, Cooperative perception enables, live situational awareness, enables autonomous agents
备注: Accepted in ICLR 2026
点击查看摘要
Abstract:Cooperative perception enables autonomous agents to share encoded representations over wireless communication to enhance each other's live situational awareness. However, the tension between the limited communication bandwidth and the rich sensor information hinders its practical deployment. Recent studies have explored selection strategies that share only a subset of features per frame while striving to keep the performance on par. Nevertheless, the bandwidth requirement still stresses current wireless technologies. To fundamentally ease the tension, we take a proactive approach, exploiting the temporal continuity to identify features that capture environment dynamics, while avoiding repetitive and redundant transmission of static information. By incorporating temporal awareness, agents are empowered to dynamically adapt the sharing quantity according to environment complexity. We instantiate this intuition into an adaptive selection framework, COOPERTRIM, which introduces a novel conformal temporal uncertainty metric to gauge feature relevance, and a data-driven mechanism to dynamically determine the sharing quantity. To evaluate COOPERTRIM, we take semantic segmentation and 3D detection as example tasks. Across multiple open-source cooperative segmentation and detection models, COOPERTRIM achieves up to 80.28% and 72.52% bandwidth reduction respectively while maintaining a comparable accuracy. Relative to other selection strategies, COOPERTRIM also improves IoU by as much as 45.54% with up to 72% less bandwidth. Combined with compression strategies, COOPERTRIM can further reduce bandwidth usage to as low as 1.46% without compromising IoU performance. Qualitative results show COOPERTRIM gracefully adapts to environmental dynamics, localization error, and communication latency, demonstrating flexibility and paving the way for real-world deployment.
178. 【2602.13286】Explanatory Interactive Machine Learning for Bias Mitigation in Visual Gender Classification
链接:https://arxiv.org/abs/2602.13286
作者:Nathanya Satriani,Djordje Slijepčević,Markus Schedl,Matthias Zeppelzauer
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Explanatory interactive learning, Explanatory interactive, guide model training, enables users, user perspective
备注: 8 pages, 4 figures, CBMI2025
点击查看摘要
Abstract:Explanatory interactive learning (XIL) enables users to guide model training in machine learning (ML) by providing feedback on the model's explanations, thereby helping it to focus on features that are relevant to the prediction from the user's perspective. In this study, we explore the capability of this learning paradigm to mitigate bias and spurious correlations in visual classifiers, specifically in scenarios prone to data bias, such as gender classification. We investigate two methodologically different state-of-the-art XIL strategies, i.e., CAIPI and Right for the Right Reasons (RRR), as well as a novel hybrid approach that combines both strategies. The results are evaluated quantitatively by comparing segmentation masks with explanations generated using Gradient-weighted Class Activation Mapping (GradCAM) and Bounded Logit Attention (BLA). Experimental results demonstrate the effectiveness of these methods in (i) guiding ML models to focus on relevant image features, particularly when CAIPI is used, and (ii) reducing model bias (i.e., balancing the misclassification rates between male and female predictions). Our analysis further supports the potential of XIL methods to improve fairness in gender classifiers. Overall, the increased transparency and fairness obtained by XIL leads to slight performance decreases with an exception being CAIPI, which shows potential to even improve classification accuracy.
179. 【2602.13267】Beyond Ground: Map-Free LiDAR Relocalization for UAVs
链接:https://arxiv.org/abs/2602.13267
作者:Hengyu Mu,Jianshi Wu,Yuxin Guo,XianLian Lin,Qingyong Hu,Chenglu Wen,Cheng Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
关键词:unmanned aerial vehicle, aerial vehicle, fundamental capability, capability in unmanned, unmanned aerial
备注: 18 pages, 16 figures
点击查看摘要
Abstract:Localization is a fundamental capability in unmanned aerial vehicle (UAV) systems. Map-free LiDAR relocalization offers an effective solution for achieving high-precision positioning in environments with weak or unavailable GNSS signals. However, existing LiDAR relocalization methods are primarily tailored to autonomous driving, exhibiting significantly degraded accuracy in UAV scenarios. In this paper, we propose MAILS, a novel map-free LiDAR relocalization framework for UAVs. A Locality-Preserving Sliding Window Attention module is first introduced to extract locally discriminative geometric features from sparse point clouds. To handle substantial yaw rotations and altitude variations encountered during UAV flight, we then design a coordinate-independent feature initialization module and a locally invariant positional encoding mechanism, which together significantly enhance the robustness of feature extraction. Furthermore, existing LiDAR-based relocalization datasets fail to capture real-world UAV flight characteristics, such as irregular trajectories and varying altitudes. To address this gap, we construct a large-scale LiDAR localization dataset for UAVs, which comprises four scenes and various flight trajectories, designed to evaluate UAV relocalization performance under realistic conditions. Extensive experiments demonstrate that our method achieves satisfactory localization precision and consistently outperforms existing techniques by a significant margin. Our code and dataset will be released soon.
180. 【2602.13239】CrisiSense-RAG: Crisis Sensing Multimodal Retrieval-Augmented Generation for Rapid Disaster Impact Assessment
链接:https://arxiv.org/abs/2602.13239
作者:Yiming Xiao,Kai Yin,Ali Mostafavi
类目:Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:effective emergency response, spatially resolved disaster, Timely and spatially, resolved disaster impact, emergency response
备注: 27 pages, 4 figures
点击查看摘要
Abstract:Timely and spatially resolved disaster impact assessment is essential for effective emergency response. However, automated methods typically struggle with temporal asynchrony. Real-time human reports capture peak hazard conditions while high-resolution satellite imagery is frequently acquired after peak conditions. This often reflects flood recession rather than maximum extent. Naive fusion of these misaligned streams can yield dangerous underestimates when post-event imagery overrides documented peak flooding. We present CrisiSense-RAG, which is a multimodal retrieval-augmented generation framework that reframes impact assessment as evidence synthesis over heterogeneous data sources without disaster-specific fine-tuning. The system employs hybrid dense-sparse retrieval for text sources and CLIP-based retrieval for aerial imagery. A split-pipeline architecture feeds into asynchronous fusion logic that prioritizes real-time social evidence for peak flood extent while treating imagery as persistent evidence of structural damage. Evaluated on Hurricane Harvey across 207 ZIP-code queries, the framework achieves a flood extent MAE of 10.94% to 28.40% and damage severity MAE of 16.47% to 21.65% in zero-shot settings. Prompt-level alignment proves critical for quantitative validity because metric grounding improves damage estimates by up to 4.75 percentage points. These results demonstrate a practical and deployable approach to rapid resilience intelligence under real-world data constraints.
181. 【2602.13235】Lang2Act: Fine-Grained Visual Reasoning through Self-Emergent Linguistic Toolchains
链接:https://arxiv.org/abs/2602.13235
作者:Yuqi Xiong,Chunyi Peng,Zhipeng Xu,Zhenghao Liu,Zulong Chen,Yukun Yan,Shuo Wang,Yu Gu,Ge Yu
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Visual Retrieval-Augmented Generation, enhances Vision-Language Models, Retrieval-Augmented Generation, Vision-Language Models, visual perception
备注:
点击查看摘要
Abstract:Visual Retrieval-Augmented Generation (VRAG) enhances Vision-Language Models (VLMs) by incorporating external visual documents to address a given query. Existing VRAG frameworks usually depend on rigid, pre-defined external tools to extend the perceptual capabilities of VLMs, typically by explicitly separating visual perception from subsequent reasoning processes. However, this decoupled design can lead to unnecessary loss of visual information, particularly when image-based operations such as cropping are applied. In this paper, we propose Lang2Act, which enables fine-grained visual perception and reasoning through self-emergent linguistic toolchains. Rather than invoking fixed external engines, Lang2Act collects self-emergent actions as linguistic tools and leverages them to enhance the visual perception capabilities of VLMs. To support this mechanism, we design a two-stage Reinforcement Learning (RL)-based training framework. Specifically, the first stage optimizes VLMs to self-explore high-quality actions for constructing a reusable linguistic toolbox, and the second stage further optimizes VLMs to exploit these linguistic tools for downstream reasoning effectively. Experimental results demonstrate the effectiveness of Lang2Act in substantially enhancing the visual perception capabilities of VLMs, achieving performance improvements of over 4%. All code and data are available at this https URL.
182. 【2602.14199】Learnable Multi-level Discrete Wavelet Transforms for 3D Gaussian Splatting Frequency Modulation
链接:https://arxiv.org/abs/2602.14199
作者:Hung Nguyen,An Le,Truong Nguyen
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
关键词:Gaussian Splatting, view synthesis, powerful approach, Gaussian, Discrete Wavelet Transform
备注:
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has emerged as a powerful approach for novel view synthesis. However, the number of Gaussian primitives often grows substantially during training as finer scene details are reconstructed, leading to increased memory and storage costs. Recent coarse-to-fine strategies regulate Gaussian growth by modulating the frequency content of the ground-truth images. In particular, AutoOpti3DGS employs the learnable Discrete Wavelet Transform (DWT) to enable data-adaptive frequency modulation. Nevertheless, its modulation depth is limited by the 1-level DWT, and jointly optimizing wavelet regularization with 3D reconstruction introduces gradient competition that promotes excessive Gaussian densification. In this paper, we propose a multi-level DWT-based frequency modulation framework for 3DGS. By recursively decomposing the low-frequency subband, we construct a deeper curriculum that provides progressively coarser supervision during early training, consistently reducing Gaussian counts. Furthermore, we show that the modulation can be performed using only a single scaling parameter, rather than learning the full 2-tap high-pass filter. Experimental results on standard benchmarks demonstrate that our method further reduces Gaussian counts while maintaining competitive rendering quality.
183. 【2602.13522】Frequency-Enhanced Hilbert Scanning Mamba for Short-Term Arctic Sea Ice Concentration Prediction
链接:https://arxiv.org/abs/2602.13522
作者:Feng Gao,Zheng Gong,Wenli Liu,Yanhai Gan,Zhuoran Zheng,Junyu Dong,Qian Du
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:Mamba models offer, vanilla versions struggle, sea ice concentration, Arctic sea ice, models offer efficient
备注: Accepted for publication in IEEE TGRS 2026
点击查看摘要
Abstract:While Mamba models offer efficient sequence modeling, vanilla versions struggle with temporal correlations and boundary details in Arctic sea ice concentration (SIC) prediction. To address these limitations, we propose Frequency-enhanced Hilbert scanning Mamba Framework (FH-Mamba) for short-term Arctic SIC prediction. Specifically, we introduce a 3D Hilbert scan mechanism that traverses the 3D spatiotemporal grid along a locality-preserving path, ensuring that adjacent indices in the flattened sequence correspond to neighboring voxels in both spatial and temporal dimensions. Additionally, we incorporate wavelet transform to amplify high-frequency details and we also design a Hybrid Shuffle Attention module to adaptively aggregate sequence and frequency features. Experiments conducted on the OSI-450a1 and AMSR2 datasets demonstrate that our FH-Mamba achieves superior prediction performance compared with state-of-the-art baselines. The results confirm the effectiveness of Hilbert scanning and frequency-aware attention in improving both temporal consistency and edge reconstruction for Arctic SIC forecasting. Our codes are publicly available at this https URL.
184. 【2602.13414】FUTON: Fourier Tensor Network for Implicit Neural Representations
链接:https://arxiv.org/abs/2602.13414
作者:Pooya Ashtari,Pourya Behmandpoor,Nikos Deligiannis,Aleksandra Pizurica
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
关键词:Implicit neural representations, Implicit neural, Fourier Tensor Network, dominant MLP-based designs, overfitting to noise
备注: 17 pages, 18 figures, 3 tables
点击查看摘要
Abstract:Implicit neural representations (INRs) have emerged as powerful tools for encoding signals, yet dominant MLP-based designs often suffer from slow convergence, overfitting to noise, and poor extrapolation. We introduce FUTON (Fourier Tensor Network), which models signals as generalized Fourier series whose coefficients are parameterized by a low-rank tensor decomposition. FUTON implicitly expresses signals as weighted combinations of orthonormal, separable basis functions, combining complementary inductive biases: Fourier bases capture smoothness and periodicity, while the low-rank parameterization enforces low-dimensional spectral structure. We provide theoretical guarantees through a universal approximation theorem and derive an inference algorithm with complexity linear in the spectral resolution and the input dimension. On image and volume representation, FUTON consistently outperforms state-of-the-art MLP-based INRs while training 2--5$\times$ faster. On inverse problems such as image denoising and super-resolution, FUTON generalizes better and converges faster.
185. 【2602.13346】CellMaster: Collaborative Cell Type Annotation in Single-Cell Analysis
链接:https://arxiv.org/abs/2602.13346
作者:Zhen Wang,Yiming Gao,Jieyuan Liu,Enze Ma,Jefferson Chen,Mark Antkowiak,Mengzhou Hu,JungHo Kong,Dexter Pratt,Zhiting Hu,Wei Wang,Trey Ideker,Eric P. Xing
类目:Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:enables atlas-scale profiling, Single-cell RNA-seq, revealing rare lineages, enables atlas-scale, atlas-scale profiling
备注: Preprint
点击查看摘要
Abstract:Single-cell RNA-seq (scRNA-seq) enables atlas-scale profiling of complex tissues, revealing rare lineages and transient states. Yet, assigning biologically valid cell identities remains a bottleneck because markers are tissue- and state-dependent, and novel states lack references. We present CellMaster, an AI agent that mimics expert practice for zero-shot cell-type annotation. Unlike existing automated tools, CellMaster leverages LLM-encoded knowledge (e.g., GPT-4o) to perform on-the-fly annotation with interpretable rationales, without pre-training or fixed marker databases. Across 9 datasets spanning 8 tissues, CellMaster improved accuracy by 7.1% over best-performing baselines (including CellTypist and scTab) in automatic mode. With human-in-the-loop refinement, this advantage increased to 18.6%, with a 22.1% gain on subtype populations. The system demonstrates particular strength in rare and novel cell states where baselines often fail. Source code and the web application are available at \href{this https URL}{this https URL}.
186. 【2602.13308】Learning to Select Like Humans: Explainable Active Learning for Medical Imaging
链接:https://arxiv.org/abs/2602.13308
作者:Ifrat Ikhtear Uddin,Longwei Wang,Xiao Qin,Yang Zhou,KC Santosh
类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:image analysis requires, analysis requires substantial, requires substantial labeled, Medical image analysis, substantial labeled data
备注: Accepted for publication IEEE Conference on Artificial Intelligence 2026, Granada, Spain
点击查看摘要
Abstract:Medical image analysis requires substantial labeled data for model training, yet expert annotation is expensive and time-consuming. Active learning (AL) addresses this challenge by strategically selecting the most informative samples for the annotation purpose, but traditional methods solely rely on predictive uncertainty while ignoring whether models learn from clinically meaningful features a critical requirement for clinical deployment. We propose an explainability-guided active learning framework that integrates spatial attention alignment into a sample acquisition process. Our approach advocates for a dual-criterion selection strategy combining: (i) classification uncertainty to identify informative examples, and (ii) attention misalignment with radiologist-defined regions-of-interest (ROIs) to target samples where the model focuses on incorrect features. By measuring misalignment between Grad-CAM attention maps and expert annotations using \emph{Dice similarity}, our acquisition function judiciously identifies samples that enhance both predictive performance and spatial interpretability. We evaluate the framework using three expert-annotated medical imaging datasets, namely, BraTS (MRI brain tumors), VinDr-CXR (chest X-rays), and SIIM-COVID-19 (chest X-rays). Using only 570 strategically selected samples, our explainability-guided approach consistently outperforms random sampling across all the datasets, achieving 77.22\% accuracy on BraTS, 52.37\% on VinDr-CXR, and 52.66\% on SIIM-COVID. Grad-CAM visualizations confirm that the models trained by our dual-criterion selection focus on diagnostically relevant regions, demonstrating that incorporating explanation guidance into sample acquisition yields superior data efficiency while maintaining clinical interpretability.
187. 【2602.13270】Deep Learning CNN for Pneumonia Detection: Advancing Digital Health in Society 5.0
链接:https://arxiv.org/abs/2602.13270
作者:Hadi Almohab
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:global health problem, Convolutional Neural Network, limited diagnostic tools, health problem, contributing to high
备注: 7 pages 3 figures in Indonesian language
点击查看摘要
Abstract:Pneumonia is a serious global health problem, contributing to high morbidity and mortality, especially in areas with limited diagnostic tools and healthcare resources. This study develops a Convolutional Neural Network (CNN) based on deep learning to automatically detect pneumonia from chest X-ray images. The method involves training the model on labeled datasets with preprocessing techniques such as normalization, data augmentation, and image quality enhancement to improve robustness and generalization. Testing results show that the optimized model achieves 91.67% accuracy, ROC-AUC of 0.96, and PR-AUC of 0.95, demonstrating strong performance in distinguishing pneumonia from normal images. In conclusion, this CNN model has significant potential as a fast, consistent, and reliable diagnostic aid, supporting Society 5.0 by integrating artificial intelligence to improve healthcare services and public well-being.



