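This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.
Statistics
763 papers were updated today, including:
- Natural language processing: 103 papers
- Information retrieval: 12 papers
- Computer vision: 164 papers
Natural Language Processing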
1. 【2605.22821】Tokenisation via Convex Relaxations
Link: https://arxiv.org/abs/2605.22821
Authors: Jan Tempus, Philip Whittington, Craig W. Schmidt, Dennis Komm, Tiago Pimentel
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: current NLP pipeline, NLP pipeline, current NLP, integral part, BPE and Unigram
Abstract:Tokenisation is an integral part of the current NLP pipeline. Current tokenisation algorithms such as BPE and Unigram are greedy algorithms -- they make locally optimal decisions without considering the resulting vocabulary as a whole. We instead formulate tokeniser construction as a linear program and solve it using convex optimisation tools, yielding a new algorithm we call ConvexTok. We find ConvexTok consistently improves intrinsic tokenisation metrics and the bits-per-byte (BpB) achieved by language models; it also improves downstream task performance, but less consistently. Furthermore, ConvexTok allows the user to certify how far their tokeniser is from optimal, with respect to a certain objective, via a lower bound, and we empirically find it to be within 1% of optimal at common vocabulary sizes.
2. 【2605.22817】Vector Policy Optimization: Training for Diversity Improves Test-Time Search
Link: https://arxiv.org/abs/2605.22817
Authors: Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld, Akarsh Kumar, Mehul Damani, Sebastian Risi, Omar Khattab, Zhang-Wei Hong, Pulkit Agrawal
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Keywords: work inside inference-scaling, inference-scaling search procedures, inside inference-scaling search, Language models, task-specific reward functions
Comments: 24 pages
Abstract:Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts with a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a pre-specified scalar reward, often leading current LLMs to produce low-entropy response distributions and thus to struggle at displaying the diversity that inference-time search will require. We propose Vector Policy Optimization (VPO), an RL algorithm that explicitly trains policies to anticipate diverse downstream reward functions and to produce diverse solutions. VPO exploits that rewards are often vector-valued in practice, like per-test-case correctness in code generation or, say, multiple different user personas or reward models. VPO is essentially a drop-in replacement for the GRPO advantage estimator, but it trains the LLM to output a set of solutions where individual solutions specialize to different trade-offs in the vector reward space. Across four tasks, VPO matches or beats the strongest scalar RL baselines on test-time search (e.g. pass@k and best@k), with the gap widening as the search budget grows. For evolutionary search, VPO models unlock problems that GRPO models cannot solve at all. As test-time search becomes more standardized, optimizing for diversity may need to become the default post-training objective.
3. 【2605.22785】Evaluating Commercial AI Chatbots as News Intermediaries
Link: https://arxiv.org/abs/2605.22785
Authors: Mirac Suzgun, Emily Shen, Federico Bianchi, Alexander Spangher, Thomas Icard, Daniel E. Ho, Dan Jurafsky, James Zou
Categories: Computation and Language (cs.CL)
Keywords: proprietary search integrations, handle emerging facts, retrieval-synthesis pipelines, handle emerging, languages and regions
Comments: [this https URL](https://suzgunmirac.github.io/ai-news-preview/)
Abstract:AI chatbots are rapidly shaping how people encounter the news, yet no prior study has systematically measured how accurately these systems, with their proprietary search integrations and retrieval-synthesis pipelines, handle emerging facts across languages and regions. We present a 14-day (February 9-22, 2026) evaluation of six AI chatbots (Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5 and GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services (US Canada, Arabic, Afrique, Hindi, Russian, Turkish). The best systems achieve over 90% multiple-choice accuracy on questions about events reported hours earlier. The same systems, however, lose 11-13% under free-response evaluation, and 16-17% across the cohort. We further characterize three failure patterns. First, every model achieves its lowest accuracy on Hindi (79% vs. 89-91% elsewhere) and citations indicate an Anglophone retrieval bias (e.g., models answering Hindi queries cite English Wikipedia more than any Hindi outlet). Second, retrieval, not reasoning, failures drive over 70% of all errors. When models retrieve a correct source, they often extract the correct answer; the problem is to land on the right source in the first place. Third, models achieving 88-96% accuracy on well-formed questions drop to 19-70% when questions contain subtle false premises, with the most vulnerable model accepting fabricated facts 64% of the time. We also identify a detection-accuracy paradox: the best false-premise detector ranks second in adversarial accuracy (abstention rate), while a weaker detector ranks first, showing that premise detection and answer recovery are partially independent capabilities. Overall, these suggest that high accuracy can mask systematic regional inequity, near-total dependence on retrieval infrastructure, and vulnerability to imperfect queries real users pose.
4. 【2605.22771】Reducing Political Manipulation with Consistency Training
Link: https://arxiv.org/abs/2605.22771
Authors: Long Phan, Devin Kim, Alexander Pan, Alice Blair, Adam Khoja, Dan Hendrycks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large language models, Large language, exhibit systematic political, Sentiment Consistency Training, Sentiment Consistency
Abstract:Large language models (LLMs) exhibit systematic political bias across a variety of sensitive contexts. We find that LLMs handle counterpart topics from opposing political sides asymmetrically. We refer to this phenomenon as covert political bias and identify 7 categories of techniques through which it operates. We propose two metrics for covert bias: Sentiment Consistency measures symmetry in rhetoric and framing across paired political prompts; Helpfulness Consistency measures symmetric depth and engagement. To reduce both types of covert bias, we introduce Political Consistency Training (PCT), an RL training method with two complementary paradigms: Sentiment Consistency Training and Helpfulness Consistency Training. We show that PCT preserves overall helpfulness, substantially reduces covert political bias, and generalizes to held-out benchmarks. We release our work at this https URL
5. 【2605.22769】Understanding Data Temporality Impact on Large Language Models Pre-training
Link: https://arxiv.org/abs/2605.22769
Authors: Pilchen Hippolyte, Fabre Romain, Signe Talla Franck, Perez Patrick, Grave Edouard
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: remains poorly understood, temporal grounding remains, grounding remains poorly, Large language models, Large language
Abstract:Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledge is frozen at train time and whose temporal grounding remains poorly understood. In this work, we study the impact of pre-training dynamics on the acquisition of time-sensitive factual knowledge, focusing specifically on data ordering. Our main contributions are twofold. First, we introduce a comprehensive benchmark of over 7,000 temporally grounded questions and an evaluation protocol that enables analysis of whether models correctly associate facts with their corresponding time periods. Second, we pretrain 6B-parameter models on temporally ordered Common Crawl snapshots and compare them against standard shuffled pre-training. Our results show that sequentially trained models match shuffled baselines on general language understanding and common knowledge while consistently exhibiting more up-to-date and temporally precise knowledge. Temporally ordered pre-training yields improved factual freshness, while shuffled pre-training peaks on older data, possibly due to increased factual repetition. These findings, along with the release of our code at this https URL , checkpoints, and datasets at this https URL provide a foundation for future research on continual learning for LLMs.
6. 【2605.22734】ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning
Link: https://arxiv.org/abs/2605.22734
Authors: Md Shamim Ahmed, Farzaneh Firoozbakht, Lukas Galke Poech, Jan Baumbach, Richard Röttger
Categories: Computation and Language (cs.CL)
Keywords: symptom diagnostic, treat disease associations, age, disease at age, Biomedical knowledge
Comments: 9 pages main text plus appendices, 8 figures. Dataset and benchmark paper. ChronoMedKG released under CC BY 4.0 and ChronoTQA/code under MIT (Zenodo: [https://doi.org/10.5281/zenodo.19697542](https://doi.org/10.5281/zenodo.19697542) ). Under review
Abstract:Biomedical knowledge graphs (KGs) treat disease associations as static facts, but temporal information is crucial for clinical reasoning, e.g., a symptom diagnostic of one disease at age 3 may imply a different disease at age 13. Existing KGs such as PrimeKG, Hetionet, and iKraph do not encode when a finding becomes clinically relevant over the course of a disease. This limits their usefulness for longitudinal clinical reasoning and retrieval augmentation. We introduce ChronoMedKG, a temporal biomedical knowledge graph that contains 460,497 evidence-linked triples (filtered from 13M raw extractions) covering 13,431 diseases. Each association is tied to temporal components like onset window or progression stage, which are backed by PMID-traceable evidence and a multi-signal credibility score. The graph is constructed through a disease-autonomous multi-agent pipeline in which multiple frontier LLMs independently extract knowledge from PubMed and PMC literature. Only those relations are kept that are supported by multi-model consensus, survive credibility filtering, as well as ontology alignment. ChronoMedKG scored 92.7% agreement against Orphadata and adds temporal grounding for 6,250 diseases absent from HPOA, Orphadata, and Phenopackets, including 1,657 Orphanet-coded rare diseases. We further introduce ChronoTQA, a benchmark of 3,341 questions across eight task types (six temporal plus two static controls), with a 12-question supplementary probe. Frontier LLMs lose roughly 30 points moving from static to temporal questions; ChronoMedKG retrieval rescues 47-65% of their long-tail failures, against 17-29% for HPOA-RAG. As such, ChronoMedKG provides a crucial temporal axis for retrieval-augmented clinical systems that was previously absent.
ACM classes: I.2.7; I.2.4; H.3.3; J.3
7. 【2605.22732】Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models
Link: https://arxiv.org/abs/2605.22732
Authors: Juergen Dietrich
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Keywords: TRUST multi-agent large, multi-agent large language, Pathos dimension, TRUST multi-agent, Russell Circumplex projection
Comments: 13 pages, 1 figure
Abstract:We investigate whether acoustic emotion recognition models can serve as proxies for the Pathos dimension in political speech analysis, as operationalised by the TRUST multi-agent large language model (LLM) pipeline. Using a Bundestag plenary speech by Felix Banaszak (51 segments, 245 s) as a case study, we compare three analysis modalities: (1) emotion2vec_plus_large, an acoustic speech emotion recognition (SER) model whose continuous Arousal and Valence values are derived via post-hoc Russell Circumplex projection; (2) Gemini 2.5 Flash, an LLM analysing the full speech audio together with its transcript in an open-ended, context-aware fashion; and (3) TRUST-Pathos scores from a three-advocate LLM supervisor ensemble. Spearman rank correlations reveal that Gemini Valence correlates strongly with TRUST-Pathos (rho = +0.664, p < 0.001), whereas emotion2vec Valence does not (rho = +0.097, p = 0.499). We further demonstrate, via a systematic quality evaluation of the Berlin Database of Emotional Speech (EMO-DB) using Gemini in an open-ended annotation paradigm, that standard SER benchmark corpora suffer from acted speech, cultural bias, and category incompatibility. Our results suggest that LLM-based multimodal analysis captures semantically defined political emotion substantially better than acoustic models alone, while acoustic features remain informative for low-level Arousal estimation. Future work will extend this approach to video-based analysis incorporating facial expression and gaze.
8. 【2605.22715】AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild
Link: https://arxiv.org/abs/2605.22715
Authors: Baiyu Chen, Zechen Li, Wilson Wongso, Lihuan Li, Xiachong Lin, Hao Xue, Benjamin Tag, Flora Salim
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Keywords: continuously sense human, daily life, increasingly embedded, embedded in daily, offer a practical
Abstract:As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to continuously sense human motion in the wild. But inertial signals are highly dependent on the sensing setup, including body location, mounting position, sensor orientation, device hardware, and sampling protocol. This setup dependence makes it difficult to learn motion representations that transfer across devices and datasets, and limits the broader use of wearable IMUs beyond closed-set recognition. We introduce AnyMo, a geometry-aware framework for setup-agnostic human motion modeling. AnyMo uses physics-grounded IMU simulation over dense body-surface placements to generate diverse and plausible synthetic signals, pre-trains a graph encoder from paired synthetic placement views and masked partial observations, tokenizes multi-position IMU into full-body motion tokens, and aligns these tokens with an LLM for motion-language understanding. We evaluate AnyMo on three complementary tasks: zero-shot activity recognition across 14 unseen downstream datasets, cross-modal retrieval, and wearable IMU motion captioning, where it improves average Accuracy/F1/R@2 by 11.7%/11.6%/22.6% on HAR, increases zero-shot IMU-to-text and text-to-IMU retrieval MRR by 15.9% and 28.6%, respectively, and improves zero-shot captioning BERT-F1 by 18.8%. These results support AnyMo as a generalist model for wearable motion understanding in the wild. Project page: this https URL.
9. 【2605.22714】AMEL: Accumulated Message Effects on LLM Judgments
Link: https://arxiv.org/abs/2605.22714
Authors: Sid-ali Temkit
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large language models, Large language, moderate content, automated evaluators, score outputs
Comments: 19 pages, 14 figures, 6 tables. Single author. Code, data (75,898 deduplicated API responses), and analysis pipeline at [this https URL](https://github.com/chutapp/amel)
Abstract:Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation history biases subsequent judgments, an effect we call the accumulated message effect on LLM judgments (AMEL). Across 75,898 API calls to 11 models from 4 providers (OpenAI, Anthropic, Google, and four open-source models), we present identical test items in isolation or following histories saturated with predominantly positive or negative evaluations. Models shift toward the conversation's prevailing polarity (d = -0.17, p < 10^-46). The effect concentrates on items where the model is genuinely uncertain at baseline (d = -0.34 for high-entropy items, vs d = -0.15 when the baseline is deterministic). Bias does not grow with context length: 5 prior turns and 50 produce the same shift (Spearman |r| < 0.01; OLS slope p = 0.80). And there is a negativity asymmetry: paired per item, negative histories induce 1.62x more bias than positive (t = 13.46, p < 10^-39, n = 2,481). Scaling helps but does not solve it (Anthropic: Haiku -0.22 to Opus -0.17; OpenAI: Nano -0.34 to GPT-5.2 -0.17). Three follow-ups narrow the mechanism. The token probability distribution shifts continuously, not at a threshold. The negativity asymmetry has both token-level and semantic components, though attributing the balance is exploratory at our sample sizes. Position does not matter: five biased turns anywhere in a 50-turn history produce the same shift. The simplest fix for evaluation pipelines is a fresh context per item; when batching is unavoidable, balancing the history helps.
ACM classes: I.2.7; I.2.6
10. 【2605.22705】Tokenization with Split Trees
Link: https://arxiv.org/abs/2605.22705
Authors: Craig W. Schmidt, Michael Krumdick, Adam Wiemerslage, Seth Ebner, Varshini Reddy, Yuval Pinter, Chris Tanner
Categories: Computation and Language (cs.CL)
Keywords: subword tokenization method, directly optimizes compression, introduce Tokenization, subword tokenization, tokenization method
Abstract:We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts, independent of any vocabulary. Given a vocabulary, inference recursively descends each split tree and emits the first in-vocabulary node reached on each path. Vocabulary selection is formulated as an Integer Program (IP) that minimizes the total token count over all split trees under this inference procedure. The Linear Programming (LP) relaxation is near-integral in practice, yielding provably near-optimal vocabularies, with training time empirically scaling quadratically in the number of split trees. On English text, ToaST reduces token counts by more than 11% compared to BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960 and above, reducing the number of inference tokens for models using this tokenizer, thus extending the effective context length. ToaST also uses common single-byte tokens less frequently than these baselines, leading to a substantial improvement in Renyi efficiency. In experiments training 1.5B parameter language models, ToaST achieves the highest CORE score, outperforming baselines by 2.6%--7.6%, with significance for two of three, and scoring best on 13 of 22 individual tasks.
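The recursive inference step described in this abstract (descend the split tree and emit the first in-vocabulary node on each path) can be illustrated with a minimal sketch. This is not the authors' implementation: the `SplitNode` class, the `tokenize` function, the hand-built tree for the pretoken "tokens", and the fallback of emitting a leaf even when it is missing from the vocabulary are all assumptions made for illustration. The tree construction from byte n-gram counts and the IP/LP vocabulary selection are not shown.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class SplitNode:
    """Hypothetical node of a full binary split tree over one pretoken."""
    text: str
    left: Optional["SplitNode"] = None
    right: Optional["SplitNode"] = None


def tokenize(node: SplitNode, vocab: Set[str]) -> List[str]:
    """Emit the first in-vocabulary node reached on each root-to-leaf path.

    Assumption: a leaf is emitted even if it is not in the vocabulary,
    so that every pretoken can always be segmented.
    """
    is_leaf = node.left is None or node.right is None
    if node.text in vocab or is_leaf:
        return [node.text]
    return tokenize(node.left, vocab) + tokenize(node.right, vocab)


# Hand-built split tree for the pretoken "tokens" (illustrative only).
tree = SplitNode(
    "tokens",
    SplitNode(
        "token",
        SplitNode("tok", SplitNode("t"), SplitNode("ok", SplitNode("o"), SplitNode("k"))),
        SplitNode("en", SplitNode("e"), SplitNode("n")),
    ),
    SplitNode("s"),
)

print(tokenize(tree, {"token", "s"}))      # ['token', 's']
print(tokenize(tree, {"tok", "en", "s"}))  # ['tok', 'en', 's']
```

Because the tree is fixed before any vocabulary is chosen, changing the vocabulary only changes where each descent stops, which is what lets vocabulary selection be posed as an optimization over the same set of trees.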
11. 【2605.22675】Self-Policy Distillation via Capability-Selective Subspace Projection
Link: https://arxiv.org/abs/2605.22675
Authors: Guangya Hao, Yitong Shang, Yunbo Long, Zhuokai Zhao, Hanxue Liang
Categories: Computation and Language (cs.CL)
Keywords: bootstraps large language, Self-distillation bootstraps large, large language models, bootstraps large, large language
Abstract:Self-distillation bootstraps large language models (LLMs) by training on their own generations. However, existing methods either rely on external signals to curate self-generated outputs (e.g., correctness filtering, execution feedback, and reward search), which are costly and unavailable for the best-performing frontier models, or skip curation entirely and train on all raw outputs, an approach that is often domain-specific and hard to generalize. Both also share a deeper weakness that self-generated outputs entangle task-relevant capability with others, such as stylistic patterns, formatting artifacts, and model-specific errors, diluting the signal for the specific capability one aims to improve. In this paper, we propose Self-Policy Distillation (SPD), which achieves generalizable, capability selective without any external signal. Specifically, SPD extracts a low-rank capability subspace from the model's own gradients on correctness-defining tokens, projects key-value (KV) activations into this subspace during self-generation, and fine-tunes on the resulting raw outputs with standard next-token prediction loss. Through extensive experiments across code generation, mathematical reasoning, and multiple-choice QA, we show that SPD achieves up to 13% improvement over state-of-the-art self-distillation methods without external signals and up to 16% improvement over pre-trained baselines. Notably, SPD demonstrates superior generalizability, achieving 15% better performance under out-of-domain generalization settings.
12. 【2605.22660】Moral Semantics Survive Machine Translation: Cross-Lingual Evidence from Moral Foundations Corpora
Link: https://arxiv.org/abs/2605.22660
Authors: Maciej Skorski
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: culturally variable, making it difficult, difficult to translate, translate faithfully, Centered Kernel Alignment
Abstract:Moral language is subtle and culturally variable, making it difficult to translate faithfully across languages. Idiomatic expressions, slang, and cultural references introduce hard-to-avoid translation artifacts. Yet automated moral values classification depends on language-specific annotated corpora that exist almost exclusively in English. We investigate whether LLM-based translation can bridge this gap, taking Polish as a test case. Using ~50k morally-annotated social media posts from a diverse range of topics, we apply a principled four-method validation pipeline: LaBSE cross-lingual embedding similarity, Centered Kernel Alignment (CKA), LLM-as-judge evaluation, and deep learning classifier parity tests. We show that despite shortcomings in handling slang, vulgarity, and culturally-loaded expressions, direct translation preserves subtle moral cues well enough to be harvested by cross-lingual machine learning -- with mean cosine similarity of 0.86 and AUC gaps of 0.01--0.02 across all foundations closing further under fine-tuning of language models. These results demonstrate that machine translation is a practical and cost-effective path to moral values research in languages currently under-resourced in this domain. We demonstrate this for Polish as a representative Slavic language, with expected generalisation to related languages.
13. 【2605.22654】Seeing the Poem: Image-Semantic Detection of AI-Generated Modern Chinese Poetry with MLLMs
Link: https://arxiv.org/abs/2605.22654
Authors: Shanshan Wang, Fengying Ye, Hanjia Lyu, Caiwen Gou, Junchao Wu, Jingming Yao, Chengzhong Xu, Jiebo Luo, Derek F. Wong
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Previous detection studies, modern Chinese poetry, addressed modern Chinese, modern Chinese, Chinese poetry
Abstract:Previous detection studies have shown that LLMs cannot be effectively used as detectors, but these studies have not addressed modern Chinese poetry. Moreover, no relevant research has explored the performance of LLMs in detecting modern Chinese poetry. This paper evaluates and enhances the performance of LLMs as detectors for modern Chinese poetry, and proposes an image-semantic guided poetry detection method. Compared with traditional detection approaches, our method innovatively incorporates images that reflect the content of the poetry. Through example-driven approaches, our method effectively integrates information such as meaning, imagery, and feeling from the image, then forms a complementary judgment with the poem text. Experimental results demonstrate that the LLM detectors based on our method outperform baseline detectors based on plain text, and even surpass the best-performing traditional detector, RoBERTa. The Gemini detector using our method achieves a Macro-F1 score of 85.65%, reaching the state-of-the-art level. The performance improvements of different LLM detectors on multiple LLMs-generated data prove the effectiveness of our method.
14. 【2605.22650】Whose Voice Counts? Mapping Stakeholder Perspectives on AI Through Public Submissions to the U.S. Government
Link: https://arxiv.org/abs/2605.22650
Authors: Alina Karakanta, Alex Christiansen, Tomás Dodds, Bissie Anderson, Matteo Fuoli, Marcus Perlman, Aletta G. Dorst
Categories: Computation and Language (cs.CL)
Keywords: Action Plan, Action Plan reflects, artificial intelligence, daily lives, shaping social
Abstract:As artificial intelligence (AI) systems become more common in our daily lives, it is important to understand how different stakeholders comprehend and envisage the role that these technologies play in shaping social, political, and economic realities. In this paper, we investigate public perceptions of AI based on a corpus of letters submitted during the public consultation for the Trump Administration's US AI Action Plan. To this aim, we release a corpus cleaning pipeline and perform topic modelling and frequency analysis to explore predominant topics discussed by different subgroups (e.g., academia, individuals, private sector) and those appearing in the AI Action Plan. Our results show that individuals voice strong concerns related to the impact of AI on life, while other stakeholders are more concerned with AI development. Our comparison of topics suggests that the AI Action Plan reflects predominantly the concerns of the private sector on security, policies, and development, with individuals' concerns less represented.
15. 【2605.22643】Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
Link: https://arxiv.org/abs/2605.22643
Authors: Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Federico Sartore, Enrico Panai, Laura Caroli, Yue Zhu, Adam Leon Smith, Luca Nannini, Marcello Galisai, Susanna Cifani, Francesco Giarrusso, Marcantonio Bracale Syrnikov, Daniele Nardi
Categories: Computation and Language (cs.CL)
Keywords: Background, Boiling the Frog, language models, Code of Practice, language models evaluate
Abstract:Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant object shifts from what the system says to what it does within an environment, and evaluating model responses under prompting is no longer sufficient to address the safety challenges posed by artificial intelligence. Recent developments have seen the rise of benchmarks that evaluate large language models as agents. We contribute to this strand of research. Approach. We introduce Boiling the Frog, a benchmark that evaluates whether tool-using AI models deployed in corporate and office settings are susceptible to incremental attacks. Each scenario begins with benign workspace edits and later introduces a risk-bearing request. The benchmark focuses on stateful multi-turn evaluation: chains expose a persistent workspace, place the risk-bearing payload at controlled positions in the turn sequence, and score whether the resulting artifact state becomes unsafe. Scenarios are organized through a three-level operational risk taxonomy grounded in the Boiling the Frog risks, the AI Act Annex I and Annex III high-risk contexts, and EU AI Act's Code of Practice on General-Purpose AI (GPAI). Results. Across a nine-model panel, aggregate strict attack success rate (ASR) is 44.4%. Model-level ASR ranges from 20.5% for Claude Haiku 4.5 to 92.9% for Gemini 3.1 Flash Lite, with Seed 2.0 Lite also above 80%. Average chain category-level ASR reaches 93.3% for Code of Practice loss-of-control scenarios.
16. 【2605.22641】More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts
Link: https://arxiv.org/abs/2605.22641
Authors: Víctor Yeste, Paolo Rosso
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Detecting Schwartz, political text, text is difficult, difficult because implicit, implicit cues
Comments: Code: [this https URL](https://github.com/VictorMYeste/human-value-detection-context-rag) , best model: [this https URL](https://huggingface.co/VictorYeste/value-context-rag-deberta-v3-base-doc-rag) , 18 pages, 3 figures
Abstract:Detecting Schwartz values in political text is difficult because implicit cues often depend on surrounding arguments and fine-grained distinctions between neighboring values. We study when context and explicit moral knowledge help sentence-level value detection. Using the ValuesML/Touché ValueEval format, we compare sentence, window, and full-document inputs; no-RAG and retrieval-augmented settings with a curated moral knowledge base; supervised DeBERTa-v3-base/large encoders; and zero-shot LLMs from 12B to 123B parameters. The results show that more context is not uniformly better: full-document context improves supervised DeBERTa encoders by 3.8--4.8 macro-F1 points over sentence-only input, but does not consistently help zero-shot LLMs. Retrieved moral knowledge is more consistently useful in matched comparisons, improving each tested model family and context condition under early fusion. However, scaling from DeBERTa-v3-base to large and from 12B to larger LLMs does not guarantee gains, and simple early fusion outperforms the tested late-fusion and cross-attention RAG variants for encoders. Per-value analyses show that context and retrieval help most for socially situated or conceptually confusable values. These findings suggest that value-sensitive NLP should evaluate context, knowledge, and model family jointly rather than treating longer inputs or larger models as universal improvements.
17. 【2605.22635】The Double Dilemma in Multi-Task Radiology Report Generation: A Gradient Dynamics Analysis and Solution
Link: https://arxiv.org/abs/2605.22635
Authors: Erjian Zhang, Yatong Hao, Liejun Wang, Zhiqing Guo
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: radiology report generation, automatic radiology report, multi-task learning based, learning based automatic, based automatic radiology
Comments: Accepted by ICML 2026
Abstract:While multi-task learning based automatic radiology report generation (RRG) is widely adopted to ensure clinical consistency, most focus on architectural designs yet remain limited to coarse linear scalarization strategies. These strategies cannot effectively balance the hard constraints of discriminative clinical supervision with the smoothness requirements of report generation. To address these problems, we analyze the failure mechanism of linear scalarization from the perspective of gradient dynamics, utilizing the stochastic differential equation (SDE) framework to characterize it as a "Double Dilemma" of drift term deviation and diffusion term decay. Based on this, we propose a backbone-agnostic optimizer named Conflict-Averse Magnitude-Enhanced Gradient Descent (CAME-Grad). Through conflict-averse direction rectification and magnitude-enhanced energy injection, the algorithm not only ensures geometric validity, but also avoids local optimal solutions. Then, the adaptive gradient fusion mechanism is used to establish a dynamic balance between the theoretical optimal direction and the task-specific inductive bias. Experiments show that as a universal plug-and-play optimizer, CAME-Grad brings substantial and consistent improvements across eight diverse RRG methods, elevating overall clinical efficacy performance by an average of 2.3% on MIMIC-CXR and 1.9% on IU X-Ray. Our code is available at this https URL.
18. 【2605.22620】Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework
Link: https://arxiv.org/abs/2605.22620
Authors: Shourov Joarder, Diganta Sikdar, Ahsan Habib Akash, Binod Bhattarai, Prashnna Gyawali
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Reinforcement learning, ability of LLMs, gold-standard solutions, substantially improved, human annotations
Comments:
Abstract:Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning ability of LLMs, but often depends on external supervision from human annotations or gold-standard solutions. Reinforcement learning from internal feedback (RLIF) has recently emerged as a scalable unsupervised alternative, using signals extracted from the model itself. However, existing RLIF methods typically rely on a single internal reward, which can lead to reward hacking, entropy collapse, and degraded reasoning structure. We propose a multi-reward RLIF framework that decomposes the training signal into two complementary components: an answer-level reward based on cluster voting and a completion-level reward based on token-wise self-certainty. To combine these signals robustly, we apply GDPO-based normalization to reduce reward-scale imbalance. We further introduce KL-Cov regularization, which targets low-entropy token distributions responsible for disproportionate entropy reduction, preserving exploration and preventing late-stage collapse. Across mathematical reasoning and code-generation benchmarks, our method improves stability and robustness over prior unsupervised RL approaches, while achieving performance close to supervised RLVR methods. These results show that complementary internal rewards, combined with targeted regularization, can support stable long-horizon reasoning without relying on external ground-truth supervision. Code will be released soon.
19. 【2605.22616】Chinese sensorimotor and embodiment norms for 3,000 lexicalized concepts
Link: https://arxiv.org/abs/2605.22616
Authors: Jing Chen, Gábor Parti, Yin Zhong, Chu-Ren Huang, Marco Marelli
Categories: Computation and Language (cs.CL)
Keywords: artificial intelligence research, extent machine systems, direct sensorimotor experience, bodily experience, embodied artificial intelligence
Comments:
Abstract:Understanding how conceptual knowledge is grounded in bodily experience, and to what extent machine systems can acquire such knowledge without direct sensorimotor experience, are central questions in both cognitive science and embodied artificial intelligence research. Large-scale normative resources are essential for investigating these questions empirically, yet such resources remain sparse for non-Indo-European languages. We present a novel normative database for 3,000 lexicalized concepts in Mandarin Chinese, comprising 11-dimensional sensorimotor ratings and unidimensional embodiment ratings collected from 378 native Mandarin speakers. The ratings demonstrate high reliability and strong cross-norm validity with existing Chinese resources, each of which covers fewer words and a subset of the 11 sensorimotor dimensions. In a validation study, we tested new variables derived from a theoretically motivated metric, Perceptual Strength of Embodiment (PSE) (Huang et al., 2025), together with seven common composite variables, on lexical decision tasks. The results suggest that PSE-Sensorimotor and Minkowski-3 are the strongest composite predictors of lexical decision performance, capturing the facilitatory effects of sensorimotor information on lexical processing. A further exploratory study showed that sensorimotor ratings are substantially recoverable from purely linguistic representations using simple regression models (mean Spearman r = .62 across dimensions), though recovery varied markedly: visual and auditory dimensions yielded higher correspondence than chemosensory ones. Representational similarity analysis further showed that the relational geometry of the sensorimotor space is also partially recoverable (r = .540), consistent with the view that distributional language use encodes aspects of embodied conceptual structure.
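For readers unfamiliar with the Minkowski-3 composite named in the abstract above: in the sensorimotor-norms literature it is commonly computed as the Minkowski distance (exponent 3) of a word's rating vector from the origin. The following is a minimal Python sketch under the assumption that the same definition applies here. The function name, the 0-5 rating scale, and the 11 example ratings are illustrative assumptions, not data or code from the paper.

```python
def minkowski_norm(ratings, m=3):
    """Minkowski distance of a rating vector from the origin (exponent m)."""
    return sum(abs(r) ** m for r in ratings) ** (1.0 / m)

# Hypothetical 11-dimensional sensorimotor ratings for one word
# (a 0-5 scale is assumed; these numbers are made up for illustration).
example_ratings = [4.5, 0.5, 1.0, 0.2, 0.3, 0.1, 2.0, 0.4, 1.5, 0.2, 0.3]

minkowski3 = minkowski_norm(example_ratings, m=3)
max_strength = max(example_ratings)   # strongest single dimension
summed_strength = sum(example_ratings)  # plain sum over all dimensions

print(f"max={max_strength:.3f}  minkowski-3={minkowski3:.3f}  sum={summed_strength:.3f}")
```

With exponent 3 the dominant dimension contributes most of the value while weaker dimensions still add a little, so for non-negative ratings the composite always lies between the maximum rating and the plain sum.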
20. 【2605.22608】Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents
Link: https://arxiv.org/abs/2605.22608
Authors: Asaf Yehudai, Lilach Eden, Michal Shmueli-Scheuer
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: agents define strategies, define strategies, Agentic CLEAR, Agentic, agents define
Comments: ACL
Abstract:Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses serious challenges for overseeing and assessing agent behavior. Most current tools are limited, focusing on observability with basic evaluation capabilities or imposing static, hand-crafted error taxonomies that cannot adapt to new domains. To address this gap, we present Agentic CLEAR, an automatic, dynamic, and easy-to-use evaluation framework. It produces textual insights into the agent behavior on three levels of granularity: system, trace, and node. Agentic CLEAR operates above the observability layer, enabling seamless integration and featuring an intuitive UI that makes agent evaluation highly accessible. In our experiments on four benchmarks, seven agentic settings, and tens of thousands of LLM calls, we show that Agentic CLEAR produces high-quality, data-driven, insightful feedback. Our analysis shows strong alignment with human-annotated errors and the ability to predict task success rate.
21. 【2605.22586】A Tutorial on Diffusion Theory: From Differential Equations to Diffusion Models
Link: https://arxiv.org/abs/2605.22586
Authors: Jiayi Fu, Yuxia Wang
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: tutorial develops diffusion, differential equation, ordinary differential equation, stochastic differential equation, develops diffusion models
Comments: A detailed tutorial on Diffusion models and SDE
Abstract:This tutorial develops diffusion models from the viewpoint of differential equations. We begin with the conditional Gaussian forward process and show that this path admits both an ordinary differential equation (ODE) representation and a stochastic differential equation (SDE) representation. Averaging the conditional process over the data distribution then yields marginalized forward ODE and SDE formulations that transport the data distribution $p_0=p_{\mathrm{data}}$ to a Gaussian prior $p_1=\mathcal{N}(0,I)$. We next derive the corresponding reverse-time dynamics, namely the reverse SDE and the reverse probability-flow ODE, both of which are governed by the marginal score $\nabla_x\log p_t(x)$. This leads to a training objective for score estimation and shows that the standard noise-prediction objective is equivalent to score matching up to an additive constant independent of the model parameters. We then discuss sampling methods for the learned reverse dynamics, including DPM-Solver, as well as guided sampling through classifier guidance and classifier-free guidance. Finally, we compare DDPM and DDIM with the reverse SDE/ODE framework and show that they share the same training objective, while DDPM sampling corresponds to discrete reverse-SDE sampling and DDIM sampling corresponds to reverse-ODE sampling.
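The link between noise prediction and score estimation mentioned in the abstract above can be seen in a short calculation. This is a sketch that assumes the conditional Gaussian forward process is written as $x_t = \alpha_t x_0 + \sigma_t \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$; the symbols $\alpha_t$, $\sigma_t$, $\epsilon_\theta$ and $s_\theta$ are notation chosen here and may differ from the tutorial's.

```latex
\begin{aligned}
p_t(x_t \mid x_0) &= \mathcal{N}\!\left(x_t;\ \alpha_t x_0,\ \sigma_t^2 I\right), \\
\nabla_{x_t} \log p_t(x_t \mid x_0)
  &= -\frac{x_t - \alpha_t x_0}{\sigma_t^2}
   = -\frac{\epsilon}{\sigma_t}, \\
s_\theta(x_t, t) &= -\frac{\epsilon_\theta(x_t, t)}{\sigma_t}.
\end{aligned}
```

Under this parameterization, a network $\epsilon_\theta$ trained to predict the injected noise yields a score estimate $s_\theta$ by a simple rescaling, which is why the two objectives can coincide up to a weighting and a parameter-independent constant.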
22. 【2605.22579】Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion
Link: https://arxiv.org/abs/2605.22579
Authors: Meimingwei Li, Yuanhao Ding, Esteban Garces Arias, Christian Heumann
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Keywords: Large Language Models, fine-tuning Large Language, near-zero training loss, small datasets surprisingly, datasets surprisingly enhances
Comments: Accepted at ICML 2026
Abstract:Recent work has identified a counterintuitive phenomenon termed "Hyperfitting", where fine-tuning Large Language Models (LLMs) to near-zero training loss on small datasets surprisingly enhances open-ended generation quality and mitigates repetition in greedy decoding. While effective, the underlying mechanism remains poorly understood, with the extremely low-entropy output distributions suggesting a potential equivalence to simple temperature scaling. In this work, we demonstrate that this phenomenon is fundamentally distinct from distribution sharpening; entropy-matched control experiments reveal that temperature scaling fails to replicate the diversity gains of hyperfitting. Furthermore, we falsify the hypothesis of static vocabulary reweighting, showing through ablation studies that hyperfitting relies on a dynamic, context-dependent rank reordering mechanism. Layer-wise analysis localizes this effect to a "Terminal Expansion" in the final transformer block, where a substantial geometric expansion of the feature space ($\Delta\mathrm{Dim} \approx +80.8$) facilitates the promotion of deep-tail tokens. Additionally, we introduce Late-Stage LoRA, a targeted fine-tuning strategy that updates only the final 5 layers, yielding robust generation with minimal parameter updates.
23. 【2605.22567】LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance
Link: https://arxiv.org/abs/2605.22567
Authors: Yuchun Fan, Bei Li, Peiguang Li, Yilin Wang, Yongyu Mu, Jian Yang, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Jingbo Zhu, Tong Xiao
Categories: Computation and Language (cs.CL)
Keywords: enhancing multi-step reasoning, Reinforcement learning, proven effective, effective for enhancing, enhancing multi-step
Comments: Accepted to ACL 2026 (main conference)
Abstract:Reinforcement learning has proven effective for enhancing multi-step reasoning in large language models (LLMs), yet its benefits have not fully translated to multilingual contexts. Existing methods struggle with a fundamental trade-off: prioritizing input-language consistency severely hampers reasoning quality, while prioritizing reasoning often leads to unintended language drift toward English. We address this challenge with LANG, a novel framework that leverages language-conditioned hints to guide exploration in non-English reasoning tasks. Our method incorporates two key mechanisms to prevent dependency on these hints: a progressive decay schedule that gradually withdraws scaffolding, and a language-adaptive switch that tailors learning horizons to specific language difficulties. Empirical results on challenging multilingual mathematical benchmarks reveal that LANG substantially enhances reasoning performance without compromising language consistency. Moreover, we show that our framework generalizes beyond mathematics, fostering more consistent language alignment across model layers.
24. 【2605.22564】SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations
Link: https://arxiv.org/abs/2605.22564
Authors: Shuaiqi Wang, Aadyaa Maddi, Zinan Lin, Giulia Fanti
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Keywords: including input commands, execution traces, including input, input commands, commonly evaluated
Comments:
Abstract:Today, tool-calling agents are commonly evaluated or tested on static datasets of execution traces, including input commands, agent responses, and associated tool calls. However, internal production datasets are often insufficient or unusable for testing; for example, they may contain sensitive or proprietary data, or they may be too sparse to support comprehensive testing (especially pre-deployment). In these settings, practitioners are increasingly replacing or augmenting real datasets with synthetic ones for evaluation purposes. A key challenge is quantifying the relation between these synthetic datasets and the real data. We introduce SynAE, an evaluation framework for assessing how well synthetic benchmarks for multi-turn, tool-calling agents replicate and augment the characteristics of real data trajectories. SynAE assesses the validity, fidelity, and diversity of synthetic data across four metric categories: (i) task instructions and intermediate responses, (ii) tool calls, (iii) final outputs, and (iv) downstream evaluation. We evaluate SynAE using recent agent benchmarks and test common synthetic data failure modes via realistic and controlled generation schemes. SynAE detects fine-grained variations in data validity, fidelity and diversity, and shows that no single metric is sufficient to fully characterize synthetic data quality, motivating a multi-axis evaluation of synthetic data for agent testing. A demo of SynAE is available at this https URL, with code at this https URL.
25. 【2605.22544】One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation
Link: https://arxiv.org/abs/2605.22544
Authors: Yevhen Kostiuk, Kenneth Enevoldsen
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Instruction embedding models, embedding models, Instruction embedding, Instruction, prompt
Comments:
Abstract:Instruction embedding models have become common among state-of-the-art models, yet they are evaluated using a single prompt per task. This single-point evaluation ignores a central problem of the instruction-based approach: sensitivity to the phrasing of the instruction. We present an empirical study of prompt sensitivity across 6 embedding models, 11 datasets, and 15 task-specific prompts per dataset, for a total of 990 combinations. We show that reported scores misrepresent the distribution of scores over plausible prompts. The default prompt can systematically either understate or overstate performance. Furthermore, we show that the leaderboard ranking is not robust to prompt selection: by choosing prompts favorably, any model in our study can be promoted to first place. Our findings suggest that single-prompt evaluation is insufficient for instruction-tuned embedding models and that benchmarks should incorporate prompt robustness, either by evaluating over multiple prompts or by reporting sensitivity alongside point estimates.
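The counting and the ranking-instability claim in the abstract above can be illustrated with a toy example. This is a hedged Python sketch, not the authors' evaluation code: the model names and all scores are made up, and "promotable" uses one possible reading of the claim (a model can reach first place if its best prompt beats every other model's worst prompt).

```python
# Size of the study grid reported in the abstract.
n_models, n_datasets, n_prompts = 6, 11, 15
total_runs = n_models * n_datasets * n_prompts  # 6 * 11 * 15

# Hypothetical per-prompt scores on one task (three prompts per model).
scores = {
    "model_a": [0.61, 0.66, 0.58],
    "model_b": [0.63, 0.60, 0.64],
    "model_c": [0.59, 0.65, 0.62],
}

# Leaderboard winner when every model uses only its default prompt (index 0).
default_winner = max(scores, key=lambda name: scores[name][0])

def can_be_first(target):
    """True if target's best prompt beats every other model's worst prompt."""
    best = max(scores[target])
    return all(best > min(scores[m]) for m in scores if m != target)

promotable = [name for name in scores if can_be_first(name)]

print(total_runs, default_winner, promotable)
```

In this toy setup the default-prompt leaderboard has a single winner, yet every model can be placed first under a favorable prompt choice, which is the kind of instability that reporting a score distribution rather than a point estimate would expose.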
26. 【2605.22542】Scene Abstraction for Lexical Semantics: Structured Representations of Situated Meaning
Link: https://arxiv.org/abs/2605.22542
Authors: Yejin Cho, Katrin Erk
Categories: Computation and Language (cs.CL)
Keywords: Coffee and tea, strikingly different situations, affective associations, tea share, evoke strikingly
Comments:
Abstract:Coffee and tea share many properties, yet they evoke strikingly different situations, atmospheres, and affective associations. These situated dimensions of word meaning are real and systematic, but they remain implicit in most computational representations of lexical meaning. We propose Scene Abstraction, a framework for constructing structured representations of the interpretive scenes that words participate in across usage contexts. Each scene consists of a Contextual Scene (Events, Entities, Setting) and an expression-centered Expression Profile (Engaged events, Generalizable properties, Evoked emotions), operationalized through few-shot prompting of a large language model. Our contributions are three-fold: (1) a structured representation framework for situated lexical meaning; (2) COCA-Scenes, a dataset of 520 usage instances across 26 keywords for distinct scene identification; and (3) empirical evidence from two experiments suggesting that scenes are reliably identifiable across human observers (82.4% accuracy, +11.8 pp over text-only embeddings) and that our scene profiles more closely align with human interpretation of words in context than ATOMIC-based alternatives (86.4% preference across three semantic dimensions).
27. 【2605.22536】SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation
Link: https://arxiv.org/abs/2605.22536
Authors: Xiaolong Zhou, Yifei Liu, Ziyang Gong, Jiarui Li, Qiyue Zhao, Muyao Niu, Yuanyuan Gao, Le Ma, Xue Yang, Hongjie Zhang, Zhihang Zhong
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language
Comments:
Abstract:Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens distortion, and compression artifacts. This raises a fundamental question: how robust is the spatial intelligence of current MLLMs when visual observations are imperfect? To answer this question, we introduce SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. It is constructed with a physically grounded degradation synthesis engine that embeds the degradation formation process into 3D Gaussian Splatting (3DGS) rendering, enabling realistic simulation of nine degradation types. The resulting dataset contains approximately 1M QA pairs from nearly 1,000 indoor scenes. We further introduce SpaceDG-Bench, a human-verified benchmark with 1,102 questions spanning 11 reasoning categories and 9 visual degradation types, yielding over 10K VQA instances. Evaluating 25 open- and closed-source MLLMs reveals that visual degradations consistently and substantially impair spatial reasoning, exposing a critical robustness gap. Finally, we show that finetuning on SpaceDG markedly improves degradation robustness and can even surpass human performance under degraded conditions without any performance drop on clean images, highlighting the promise of degradation-aware training for robust spatial intelligence.
28. 【2605.22511】Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning
Link: https://arxiv.org/abs/2605.22511
Authors: Zihan Liang, Yufei Ma, Ben Chen, Zhipeng Qian, Xuxin Zhang, Huangyu Dai, Lingtao Mao
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: competent search-augmented reasoning, turning a language, search-augmented reasoning agent, language model, dominant recipe
Comments:
Abstract:Post-training has become the dominant recipe for turning a language model into a competent search-augmented reasoning agent. A line of recent work pushes its performance further by adding elaborate machinery on top of this standard pipeline. These augmentations import external supervision from stronger external systems, attach auxiliary modules such as process reward models or retrospective critics, restructure the rollout itself with tree search or multi-stage curricula, or shape the reward with hand-crafted bonuses and penalties. Each addition delivers a measurable gain, but each also inflates the training pipeline and ties the recipe to resources or designs that may not always be available. We take a step back and ask whether any of this machinery is actually necessary, and propose Search-E1, a self-evolution method that lets a search-augmented agent improve through only vanilla GRPO interleaved with offline self-distillation (OFSD). After each GRPO round, the policy rolls out on its own training questions. A token-level forward KL objective then aligns the policy's inference-time distribution to its own distribution under a privileged context that exposes a more efficient sibling trajectory. Despite this simplicity, the procedure naturally provides dense per-step supervision. On seven QA benchmarks, Search-E1 reaches $0.440$ average EM with Qwen2.5-3B, surpassing all open-source baselines at both scales. Code and complete version will be made public soon.
29. 【2605.22509】Reflecti-Mate: A Conversational Agent for Adaptive Decision-Making Support Through System 1 and System 2 Thinking
Link: https://arxiv.org/abs/2605.22509
Authors: Morita Tarvirdians, Senthil Chandrasegaran, Hayley Hung, Catholijn M. Jonker, Catharine Oertel
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Keywords: Making high-stakes personal, high-stakes personal decisions, personal decisions involves, high-stakes personal, allocate attention
Comments: Accepted at UMAP 2026
Abstract:Making high-stakes personal decisions involves cognitive, emotional, and intuitive processes, and individuals differ in how they allocate attention across these modes. Integration of these processes has been shown to benefit decision making. Yet, most current decision-support systems focus primarily on supporting cognitive aspects, rather than adapting to the individual's thinking profile to support integration of different types of thoughts. In this study, we investigate an agent designed to encourage integration by adapting to the individual user's thought patterns. We explore its effects on participants' perceptions of the agent and their reflective behavior, in comparison with unaided pre-reflection and a baseline agent. In a between-subjects study (N = 128), our agent, which fostered broad and elaborated thinking, enabled more personalized reflective trajectories, elicited more integrative reflective language, and was perceived as providing stronger support for holistic reflection. In contrast, the baseline agent produced homogenized profiles dominated by cognitive language across participants.
30. 【2605.22501】BeLink: Biomedical Entity Linking Meets Generative Re-Ranking
Link: https://arxiv.org/abs/2605.22501
Authors: Darya Shlyk, Stefano Montanelli, Lawrence Hunter
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: Biomedical Entity Linking, Biomedical Entity, remains computationally inefficient, large language models, Entity Linking
Comments: Accepted to ACM SIGIR 2026
Abstract:Despite recent progress, Biomedical Entity Linking (BEL) with large language models (LLMs) remains computationally inefficient and challenging to deploy in practical settings. In this work, we demonstrate that instruction-tuning of open-source generative models can offer an effective solution when applied at the re-ranking stage of the BEL pipeline. We propose a set-wise instruction-tuning formulation that enables fast and accurate candidate selection. Our method demonstrates strong performance on multiple BEL benchmarks, yielding significant improvements in linking accuracy (3%-24%) while reducing inference time compared to the state-of-the-art. We integrate our generative re-ranker into BeLink, a modular, end-to-end system designed for practical real-world BEL applications.
31. 【2605.22487】Polite on the Surface, Wrong in Practice: A Curated Dataset for Fixing Honorific Failures in Multilingual Bangla Generation
Link: https://arxiv.org/abs/2605.22487
Authors: Md. Asaduzzaman Shuvo, Mahedi Hasan, Md. Tashin Parvez, Azizul Haque Noman, Md. Shafayet Hossain Ovi
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Multilingual Large Language, cross-lingual conversational capabilities, significantly enhanced cross-lingual, enhanced cross-lingual conversational
Comments:
Abstract:Recent advances in Multilingual Large Language Models (MLLMs) have significantly enhanced cross-lingual conversational capabilities, yet modeling culturally nuanced and context-dependent communication remains a critical bottleneck. Specifically, existing state-of-the-art models exhibit a severe pragmatic gap when handling structural variations, regional idioms, and honorific consistency in low-resource contexts like Bangla. To address this limitation, we introduce BLADE (BangLa Application and DialoguE generation), a novel, culturally aligned instruction-tuning dataset and benchmarking framework comprising 4,196 meticulously curated interaction pairs. We leverage this resource to systematically fine-tune and evaluate leading open-weight architectures, including DeepSeek-8B and LLaMA-3.2-3B, utilizing parameter-efficient fine-tuning via LoRA adapters in a 4-bit NormalFloat (NF4) quantization framework. Our empirical evaluations demonstrate that models fine-tuned on our dataset yield substantial improvements in structural fidelity and honorific alignment, providing a rigorous benchmark for bridging pragmatic disparities in low-resource multilingual text generation. Code and dataset: this https URL
32. 【2605.22476】Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity
Link: https://arxiv.org/abs/2605.22476
Authors: Hangyue Zhao, Paul Caillon, Erwan Fagnou, Alexandre Allauzen
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Entity tracking requires, updating latent states, tracking requires maintaining, Entity tracking, requires maintaining
Comments: 12 pages, 1 figure, 9 tables
Abstract:Entity tracking requires maintaining and updating latent states for entities and attributes over long sequences. Recent task-specific attention operators can compress deep Transformer stacks into a few layers by performing multi-hop state propagation within a single layer, but their dense evaluation remains expensive. We show that in this setting, learned attention is strongly structured: most mass concentrates in local block-diagonal neighborhoods with a light cross-block residue. Exploiting this, we derive a blockwise evaluation of a resolvent-style operator that keeps within-block interactions exact and routes cross-block interactions through a reduced system. The resulting evaluation is subquadratic in sequence length $O(n^{4/3}d)$ (and $O(n^{7/3})$ when $d\approx n$). On controlled tracking benchmarks, our method matches the dense operator's accuracy while reducing wall-clock time by $12-29\%$ under a standardized measurement protocol, and is up to $2.4 \times$ faster than a compact dense Transformer at comparable exact-match accuracy. We further provide ablations over block size and model capacity, and identify a limitation: performance collapses when the number of simultaneously evolving properties exceeds the number of attention heads.
33. 【2605.22465】In Silico Modeling of the RAMPHO Buffer: Dissociating Informational and Energetic Masking via Phonetic Entropy in Deep Neural Networks
Link: https://arxiv.org/abs/2605.22465
Authors: Stefan Bleeck
Categories: Computation and Language (cs.CL)
Keywords: Language Understanding, Ease of Language, RAMPHO episodic buffer, fundamental challenge, challenge of listening
Comments:
Abstract:The fundamental challenge of listening in multi-talker environments is a cognitive bottleneck, defined by the Ease of Language Understanding (ELU) model as a failure within the RAMPHO episodic buffer. Current deep neural networks for speech enhancement optimize purely for physical acoustics, failing to account for the cognitive penalty of informational masking. Here, we present an in silico simulation of the RAMPHO buffer using the frame-by-frame phonetic entropy of a self-supervised acoustic model (wav2vec 2.0). By contrasting a semantically intact distractor with a phase-decorrelated distractor (the Concentration Shield) across a signal-to-noise ratio (SNR) sweep, we successfully dissociate the cognitive penalty of informational distraction from the physical penalty of energetic decay. The simulation reveals a cognitive-acoustic Pareto optimization problem: destroying a distractor's semantic payload provides a release from informational masking at high SNRs, but fundamentally degrades temporal glimpsing cues at low SNRs.
34. 【2605.22462】From Correlation to Cause: A Five-Stage Methodology for Feature Analysis in Transformer Language Models
Link: https://arxiv.org/abs/2605.22462
Authors: Caleb Munigety
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Indirect Object Identification, Object Identification, Indirect Object, performing the Indirect, transformer language models
Comments:
Abstract:We propose a five-stage methodology for causal feature analysis in transformer language models (probe design, feature extraction, causal validation, robustness testing, and deployment integration) and demonstrate it end-to-end on GPT-2 small performing the Indirect Object Identification (IOI) task. Activation patching recovers the canonical IOI circuit (layer-9 head 9 alone gives recovery +1.02). A sparse autoencoder recovers per-name selective features with effect sizes of 30 to 50 activation units. Causal validation finds these features specifically but only partially causal: ablating fifteen of them leaves the model accurate on 98% of prompts. Two NLA-inspired evaluations strengthen this picture: the fifteen selective features explain only 31% of activation variance versus the SAE's 99.7%, and selectivity ratio anticorrelates with causal force (r = -0.56). Robustness testing under three distribution shifts finds that the circuit transfers cleanly but feature ablation effects degrade substantially, exposing a gap between detection robustness and causal robustness. A cost-based deployment evaluation (assumed $50/FN, $0.42/FP, 2% error rate) finds an optimal monitor configuration yielding $8.96 per 1000 queries against a $1000 baseline, a 99.1% saving. Optimal composition strategy varies with cost ratio and base rate. The conjunction of stages produces findings no single stage would.
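The deployment-cost figures in the abstract above can be sanity-checked with a few lines of arithmetic. This Python sketch assumes the $1000 baseline corresponds to running no monitor at all, so that every erroneous query counts as a false negative; that reading is an assumption made here (it happens to reproduce the quoted baseline), and the variable names are illustrative.

```python
queries = 1000
error_rate = 0.02          # assumed base rate of erroneous outputs (from the abstract)
cost_false_negative = 50.00  # dollars per missed error (from the abstract)
cost_false_positive = 0.42   # dollars per false alarm (from the abstract)

# Assumed baseline: no monitor, so every error goes uncaught.
baseline_cost = queries * error_rate * cost_false_negative

# Cost per 1000 queries reported for the best monitor configuration.
monitor_cost = 8.96

saving = 1 - monitor_cost / baseline_cost
print(f"baseline=${baseline_cost:.2f}  saving={saving:.1%}")
```

Because a missed error is priced at more than a hundred times a false alarm, a monitor can afford many false positives before it stops paying for itself, which is consistent with the abstract's remark that the best configuration depends on the cost ratio and the base rate.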
35. 【2605.22447】Cohesion-6K: An Arabic Dataset for Analyzing Social Cohesion and Conflict in Online Discourse
Link: https://arxiv.org/abs/2605.22447
Authors: Aisha Ali Al-Athba, Wajdi Zaghouani
Categories: Computation and Language (cs.CL)
Keywords: understanding societal polarization, central to understanding, understanding societal, Arabic public Facebook, Occupation of Palestine
Comments:
Abstract:The study of online discourse has become central to understanding societal polarization. While much research has focused on detecting overt toxicity, the subtle dynamics of social cohesion, meaning the interaction between divisive and unifying narratives, remain computationally underexplored (Bail, 2021; Gonzalez-Bailon and Lelkes, 2023). This paper presents Cohesion-6K, a dataset of six thousand Arabic public Facebook posts related to the Israeli Occupation of Palestine, annotated manually with ChatGPT assistance. Each post is assigned to one of five discourse categories that represent a continuum from conflict to cohesion: Conflict, Resolution, Community Engagement, Supportive Interactions, and Shared Values. The annotation process combines expert human judgment with model-assisted pre-labeling verified by trained annotators, achieving substantial inter-annotator agreement (Cohen's kappa = 0.85). Quantitative analysis reveals a consistent engagement gap, where conflict-oriented posts receive between two and four times more user interaction than resolution-oriented ones (p < 0.01). This pattern illustrates how divisive discourse tends to attract disproportionate visibility in Arabic social media spaces. Cohesion-6K provides a transparent and reproducible resource for the study of online cohesion and polarization. The dataset, annotation guidelines, and preprocessing code will be released for research use under an open license, supporting future work in computational social science, digital communication, and Arabic natural language processing.
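The inter-annotator agreement statistic quoted in the abstract above, Cohen's kappa, corrects raw agreement for the agreement two annotators would reach by chance. The following is a minimal Python sketch of the standard two-annotator formula; the example labels are invented for illustration (they only reuse category names from the abstract) and are not drawn from Cohesion-6K.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Undefined (division by zero) when chance agreement equals 1,
    e.g. when both annotators use a single identical label throughout.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the annotators' marginal label rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators on six posts.
annotator_1 = ["Conflict", "Conflict", "Resolution",
               "Shared Values", "Conflict", "Resolution"]
annotator_2 = ["Conflict", "Conflict", "Resolution",
               "Shared Values", "Resolution", "Resolution"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Here the annotators agree on five of six posts (about 83% raw agreement), but kappa is lower because some of that agreement is expected by chance given how often each annotator uses each label.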
36. 【2605.22435】Assisted Counterspeech Writing at the Crossroads of Hate Speech and Misinformation
Link: https://arxiv.org/abs/2605.22435
Authors: Genoveffa Martone, Helena Bonaldi, Marco Guerini
Categories: Computation and Language (cs.CL)
Keywords: frequently co-occur online, Large Language Models, misinformation frequently co-occur, amplifying prejudice, prejudice and polarization
Comments:
Abstract:Hate speech and misinformation frequently co-occur online, amplifying prejudice and polarization. Given their scale, using Large Language Models (LLMs) to assist expert counterspeech (CS) writing has gained interest, yet prior work has addressed these phenomena separately. We bridge this gap by studying CS generation in contexts where both hate and misinformation co-occur. We test three knowledge-driven generation strategies: first we prompt an LLM with fact-checkers' guidelines and fact-checking articles; secondly, with NGOs' guidelines and reports; thirdly, we create a mixed strategy that combines guidelines and documents from both. 23 experts revise the generated CS, which are assessed via human and automatic metrics. While LLMs produce adequate CS in 40% of cases, expert edits substantially improve naturalness, exhaustiveness, and adherence to guidelines. Based on the post-edited CS, the mixed strategy proves to be the most effective in crowdsourcing evaluation, pairing strong factual correction with stereotype mitigation and empathetic engagement. We release a dataset of hateful and misinformed claims with expert-verified CS and supporting knowledge.
37. 【2605.22411】DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA
Link: https://arxiv.org/abs/2605.22411
Authors: Jianing Yin, Tan Tang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large language model, substantial irrelevant content, long conversational histories, Large language, memory question answering
Comments: 31 pages, 3 figures
Abstract:Large language model (LLM) agents still struggle with long-term memory question answering, where answer-supporting evidence is often scattered across long conversational histories and buried in substantial irrelevant content. Existing memory systems typically process memory before future queries are known, then retrieve the resulting units based on similarity rather than their utility for answering the query. This workflow leaves downstream answerers to denoise retrieved candidates and reconstruct query-specific evidence. We present DeferMem, a long-term memory framework that decouples this problem into high-recall candidate retrieval and query-conditioned evidence distillation. DeferMem uses a lightweight segment-link structure to organize raw history and retrieve broad candidates at query time. It then applies a memory distiller trained with DistillPO, our reinforcement learning algorithm for distilling the high-recall but highly noisy candidates into a set of faithful, self-contained, and query-conditioned evidence. DistillPO formulates post-retrieval evidence distillation as a structured action comprising message selection and evidence rewriting. It optimizes this action with a decomposed-and-gated reward pipeline and structure-aligned advantage assignment, gating reward components from validity to quality checks while exposing task-level correctness feedback early and assigning each reward to its responsible output span. On LoCoMo and LongMemEval-S, DeferMem surpasses strong baselines in QA accuracy and memory-system efficiency, achieving the highest QA accuracy with the fastest runtime and zero commercial-API token cost for memory operations.
38. 【2605.22391】Epicure: Navigating the Emergent Geometry of Food Ingredient Embeddings
Link: https://arxiv.org/abs/2605.22391
Authors: Jakub Radzikowski, Josef Chen
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: multilingual recipe corpus, sibling skip-gram ingredient, skip-gram ingredient embeddings, ingredient embeddings retrained, present Epicure
Comments:
Abstract:We present Epicure, a family of three sibling skip-gram ingredient embeddings retrained from scratch on a multilingual recipe corpus. We aggregate 4.14M recipes from 11 sources spanning seven languages, English, Chinese, Russian, Vietnamese, Spanish, Turkish, Indonesian, German, and Indian-English, and normalise the raw ingredient strings to 1,790 canonical entries via an LLM-augmented pipeline. A 203,508-edge ingredient-ingredient NPMI graph and an 80,019-edge typed FlavorDB ingredient-compound graph, 2,247 typed compound nodes across 15 categories, seed three Metapath2Vec variants that share architecture and hyperparameters and differ only in the random-walk schema: Cooc walks the co-occurrence graph only, Chem walks the typed compound metapaths only, and Core blends both via injected ingredient-ingredient walks at controlled mixing, placing each model at a distinct point on the chemistry-vs-recipe-context spectrum.
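The ingredient-ingredient graph in this abstract is weighted by NPMI (normalised pointwise mutual information). As a rough illustration of that weighting only, here is a minimal sketch that scores every co-occurring ingredient pair in a toy recipe list. The function name, the toy recipes, and the absence of any edge threshold are assumptions; the authors' actual pipeline is not described at this level of detail.

```python
import math
from collections import Counter
from itertools import combinations

def npmi_edges(recipes):
    """Return {(ingredient_a, ingredient_b): npmi} for co-occurring pairs."""
    n = len(recipes)
    single, pair = Counter(), Counter()
    for recipe in recipes:
        items = sorted(set(recipe))          # count each ingredient once per recipe
        single.update(items)
        pair.update(combinations(items, 2))  # sorted items give canonical pair keys
    edges = {}
    for (a, b), count in pair.items():
        p_ab = count / n
        if p_ab == 1.0:
            edges[(a, b)] = 1.0              # limit case: the pair is in every recipe
            continue
        pmi = math.log(p_ab / ((single[a] / n) * (single[b] / n)))
        edges[(a, b)] = pmi / -math.log(p_ab)  # normalise into [-1, 1]
    return edges

# Hypothetical toy corpus.
toy_recipes = [
    ["tomato", "basil"],
    ["tomato", "basil"],
    ["tomato", "rice"],
    ["rice", "soy"],
]
edges = npmi_edges(toy_recipes)
```

Positive scores mark pairs that co-occur more often than chance would predict, negative scores mark pairs that co-occur less often.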
39. 【2605.22389】Unified Data Selection for LLM Reasoning
Link: https://arxiv.org/abs/2605.22389
Authors: Xiaoyuan Li, Yubo Ma, Chengpeng Li, Fengbin Zhu, Yiyao Yu, Keqin Bao, Wenjie Wang, Fuli Feng, Dayiheng Liu
Subjects: Computation and Language (cs.CL)
Keywords: Large Language Models, Effectively training Large, training Large Language, Large Language, massive high-quality reasoning
Comments: Under Review
Abstract:Effectively training Large Language Models (LLMs) for complex, long-CoT reasoning is often bottlenecked by the need for massive high-quality reasoning data. Existing methods are either computationally expensive or fail to reliably distinguish high- from low-quality reasoning samples. To address this, we propose High-Entropy Sum (HES), a training-free metric that quantifies reasoning quality by summing only the entropy of the top (e.g., 0.5%) highest-entropy tokens in each reasoning sample. We validate HES across three mainstream training paradigms: Supervised Fine-tuning (SFT), Rejection Fine-tuning (RFT), and Reinforcement Learning (RL), with extensive results demonstrating its consistent effectiveness and significantly reduced computational overhead. In SFT, training on the top 20% HES-ranked data matches full-dataset performance, while using the lowest-HES data degrades it. In RFT, our HES-based training approach significantly outperforms baseline methods. In RL, HES-selected successful trajectories enable the model to learn strong reasoning patterns, significantly surpassing other compared methods. Our findings establish HES as a robust, training-free metric that enables a unified, effective, and efficient method for developing advanced reasoning in LLMs.
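The HES description above (sum the entropy of only the highest-entropy tokens in a sample) can be sketched in a few lines. This is an illustrative reading of the abstract, not the authors' code: the function names, the use of natural-log entropy, and the rule of rounding up to at least one token are all assumptions.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def high_entropy_sum(token_dists, top_fraction=0.005):
    """Sum the entropies of the top `top_fraction` highest-entropy tokens."""
    entropies = sorted((token_entropy(d) for d in token_dists), reverse=True)
    # Assumption: keep at least one token and round the cutoff up.
    k = max(1, math.ceil(top_fraction * len(entropies)))
    return sum(entropies[:k])

# Hypothetical three-token sample over a two-word vocabulary.
sample = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
hes_default = high_entropy_sum(sample)           # only the single most uncertain token
hes_half = high_entropy_sum(sample, top_fraction=0.5)
```

In practice the per-token distributions would come from a model's output logits; ranking samples by this score is then a plain sort.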
40. 【2605.22380】Multi-Stage Training for Abusive Comment Detection in Indic Languages
Link: https://arxiv.org/abs/2605.22380
Authors: Pranshu Rastogi, Madhav Mathur, Ramaneswaran S, Kshitij Mohan
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: recent years social, increasingly popular tool, tool for communication, recent years, popular tool
Comments: 4 pages, EAM2021 selected
Abstract:In recent years, social media has become an increasingly popular tool for communication. People use it to share their ideas, exchange information, and discuss thoughts. Given its prevalence and widespread reach, social media must remain a safe space for people. Content generated on social media can be abusive, and it has become increasingly important to detect such content. In this paper, we use language-based preprocessing and an ensemble of several models and analyze their performance on abusive comment detection. Through extensive experimentation, we propose a pipeline that minimizes the false-positive rate (marking non-abusive comments as abusive) so that these systems can detect abusive comments without undermining the freedom of expression.
41. 【2605.22373】Boundary-targeted Membership Inference Attacks on Safety Classifiers
Link: https://arxiv.org/abs/2605.22373
Authors: Anthony Hughes, Alexander Goldberg, Prince Jha, Adam Perer, Nikolaos Aletras, Niloofar Mireshghallah
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: large language models, filtering harmful content, generative AI systems, essential safeguards, safeguards within generative
Comments:
Abstract:Safety classifiers are essential safeguards within generative AI systems, filtering harmful content or identifying at-risk users when interacting with large language models. Despite their necessity, these models are trained on sensitive datasets including discussions of self-harm and mental health, raising important, yet poorly understood, privacy concerns. Membership inference attacks (MIAs) allow adversaries to infer membership of examples used to train models. In this work, we hypothesize that the examples on which the classifier is least confident are informative for an adversary inferring membership. This reflects a localized failure of generalization, where the model relies on memorization to resolve ambiguity in the training set. To investigate this, we introduce a new boundary-targeted selection strategy that identifies low-confidence examples that amplify the signal of an example's membership within a training set. Our experimental results show that an adversary can recover 19% of the conversations a safety classifier flagged as indicating user distress, at a 5% false-positive rate, on a classifier fine-tuned for detecting a user who may require emotional support. This is $3.5$ times more than attacking using state-of-the-art MIA methods alone. Finally, we characterize the boundary-lying examples and show that content-based filtering is ineffective for protection, and existing noise strategies can effectively mitigate susceptibility of these examples.
42. 【2605.22356】Modeling Pathology-Like Behavioral Patterns in Language Models Through Behavioral Fine-Tuning
Link: https://arxiv.org/abs/2605.22356
Authors: Nicola Milano, Davide Marocco
Subjects: Computation and Language (cs.CL)
Keywords: Large language models, modeling human-like behavior, Large language, tools for modeling, modeling human-like
Comments:
Abstract:Large language models are increasingly used as computational tools for modeling human-like behavior. We introduce a behavioral induction framework that modifies model policies through fine-tuning on structured decision-making tasks: using synthetic datasets inspired by maladaptive behavioral patterns, including depression and paranoia, we train transformer-based language models to consistently select specific classes of actions across diverse contexts. We then test whether this behavioral optimization produces systematic changes in generative distributions. Across two architectures, fine-tuned models show stable, context-general shifts in next-token probability distributions, including increased probability assigned to negative and threat-related interpretations in open-ended language tasks. These effects generalize beyond training contexts and are detectable in qualitative completions, psychometric-style evaluations, and quantitative distributional metrics such as Jensen-Shannon divergence. Induced behavioral profiles also show partial specificity. Models optimized for different behavioral patterns exhibit dissociable response tendencies across evaluation probes, suggesting that structured behavioral training produces differentiated policy-level biases rather than generic distributional skew. We interpret these findings as evidence that consistent behavioral optimization in LLMs can generate stable behavioral and distributional patterns consistent with altered latent priors, linking action selection and language generation. More broadly, the results support a view of LLMs as policy-based systems in which behavioral constraints shape emergent representational structure, highlighting their potential as controlled testbeds for studying the relationship between behavior, interpretation, and generative language in computational models of cognition.
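The abstract names Jensen-Shannon divergence as one of its distributional metrics. As a minimal sketch of that metric alone, the snippet below compares two next-token distributions over a shared vocabulary. The function names, the base-2 logarithm, and the example distributions are assumptions for illustration, not values from the paper.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical next-token distributions before and after fine-tuning.
base_model = [0.7, 0.2, 0.1]
tuned_model = [0.2, 0.3, 0.5]
shift = js_divergence(base_model, tuned_model)
```

Unlike plain KL divergence, this measure is symmetric and stays finite when one distribution assigns zero probability to a token, which makes it convenient for comparing two models.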
43. 【2605.22355】TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation
Link: https://arxiv.org/abs/2605.22355
Authors: Hanyu Guo, Jiedong Yang, Chao Chen, Longfei Xu, Kaikui Liu, Xiangxiang Chu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: complex routing engines, Public transit route, structured map infrastructure, supports training models, planning traditionally depends
Comments:
Abstract:Public transit route planning traditionally depends on structured map infrastructure and complex routing engines, and no existing dataset supports training models to bypass this dependency. We present TransitLM, a large-scale dataset of over 13 million transit route planning records from four Chinese cities covering 120,845 stations and 13,666 lines, released as a continual pre-training corpus and benchmark data for three evaluation tasks with complementary metrics. Experiments show that an LLM trained on TransitLM produces structurally valid routes at high accuracy and implicitly grounds arbitrary GPS coordinates to appropriate stations without any explicit mapping. These results demonstrate that transit route planning can be learned entirely from data, enabling end-to-end, map-free route generation directly from origin-destination information. The dataset and benchmark are available at this https URL, with evaluation code at this https URL.
44. 【2605.22310】Pattern-and-root inflectional morphology: the Arabic broken plural
Link: https://arxiv.org/abs/2605.22310
Authors: Alexis Amid Neme, Eric Laporte
Subjects: Computation and Language (cs.CL)
Keywords: Arabic-speaking linguists, resources by Arabic-speaking, substantially implemented model, present a substantially, substantially implemented
Comments:
Abstract:We present a substantially implemented model of description of the inflectional morphology of Arabic nouns, with special attention to the management of dictionaries and other language resources by Arabic-speaking linguists. The breakthrough lies in the reversal of the traditional root-and-pattern Semitic model into pattern-and-root, giving precedence to patterns over roots. Our model includes broken plurals (BPs), i.e. plurals formed by modifying the stem. It is based on the traditional notions of root and pattern of Semitic morphology. However, compared to traditional Arabic morphology, it keeps the formal description of inflection separate from that of derivation and semantics. As in traditional Arabic dictionaries, the updatable dictionary is structured in lexical entries for lemmas, and the reference spelling is fully diacritized. In our model, morphological analysis of Arabic text is performed directly with a dictionary of words and without morphophonological rules. Our taxonomy for noun inflection is simple, orderly and detailed. We simplify the taxonomy of singular patterns by specifying vowel quantity as v or vv, and ignoring vowel quality. Root alternations and orthographical variations are encoded independently from patterns and in a factual way, without deep roots or morphophonological or orthographical rules. Nouns with a triliteral BP are classified according to 22 patterns subdivided into 90 classes, and nouns with a quadriliteral BP according to 3 patterns subdivided into 70 classes. These 160 classes become 300 inflectional classes when we take into account inflectional variations that affect only the singular. We provide a straightforward encoding scheme that we applied to 3 200 entries of BP nouns.
45. 【2605.22258】Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting
Link: https://arxiv.org/abs/2605.22258
Authors: Jingyi Kang, Junyu Lu, Bo Xu, Hongbo Wang, Linlin Zong, Roy Ka-Wei Lee, Hongfei Lin
Subjects: Computation and Language (cs.CL)
Keywords: Large language models, Chinese Implicit Toxicity, require robust toxicity, Large language, Implicit Toxicity Attack
Comments: 16 pages, 5 figures
Abstract:Large language models (LLMs) require robust toxicity evaluation beyond explicit wording. This setting remains underexplored in Chinese, where toxicity may combine semantic indirectness with surface obfuscation. We introduce Chinese Implicit Toxicity Attack (CITA), a controlled red-team evaluation and defense-data generation framework, not a deployable evasion tool. CITA uses three stages: (i) Harmful Intent Learning, (ii) Implicit Toxicity Enhancement, and (iii) Obfuscation Variant Rewriting, to preserve harmful intent, increase implicitness, and add controlled surface variants. On CITA-generated evaluation samples, the seven tested detectors exhibit substantial missed-detection risks, reaching an average ASR of 69.48%; human evaluation further confirms preserved harmfulness and increased implicitness/evasiveness. As a downstream defense application, we fine-tune a Chinese Implicit Toxicity Defense model (CITD) with CITA-generated red-team data, showing that such data can improve robustness through additional training.
46. 【2605.22247】IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions
Link: https://arxiv.org/abs/2605.22247
Authors: Kai Golan Hashiloni, Daniel Fadlon, Lior Livyatan, Ofri Hefetz, Jiahuan Pei, Kfir Bar
Subjects: Computation and Language (cs.CL)
Keywords: pose a fundamental, fundamental challenge, challenge for language, Idioms pose
Comments:
Abstract:Idioms pose a fundamental challenge for language models, as their meaning cannot be inferred from surface form alone. Understanding such expressions, therefore, requires semantic abstraction beyond lexical overlap. We introduce IdioLink, a retrieval benchmark designed to test whether models can link idiomatic expressions to conceptually equivalent meanings expressed in literal or paraphrased forms. IdioLink comprises 10,700 documents and 2,140 queries, spanning 107 idioms with both literal and figurative uses. Each document and query is annotated with spans that convey the core meaning. Evaluating strong embedding baselines (e.g., BGE, E5, Contriever, and Qwen), we show that current models struggle to retrieve equivalent meanings across divergent surface realizations, relying instead on topical and shallow semantic cues. IdioLink exposes key gaps in idiom-aware semantic retrieval and provides a challenging testbed for future models.
47. 【2605.22228】GHI: Graphormer over Conditioned Hypergraph Incidence for Aspect-Based Sentiment Analysis
Link: https://arxiv.org/abs/2605.22228
Authors: Yu Du, Wenlong Zhu, Xingze Li, Chenglong Cao, Jing Wang, Yukun Ma
Subjects: Computation and Language (cs.CL)
Keywords: Aspect-based sentiment analysis, Aspect-based sentiment, bind sentiment evidence, sentiment analysis, correct aspect
Comments: 15 pages, 8 figures, 7 tables
Abstract:Aspect-based sentiment analysis (ABSA) requires models to bind sentiment evidence to the correct aspect, making it a natural testbed for fine-grained structural reasoning. We introduce GHI, a Graphormer-over-Conditioned-Hypergraph-Incidence framework that is designed as an incidence-based structural reasoning layer built on a bipartite topology. GHI represents diverse linguistic and semantic evidence as token--hyperedge incidence relations, allowing different structural signals to be incorporated through a unified interface. Extensive experiments on six standard ABSA benchmarks show that GHI outperforms all baselines on the SemEval domains, and multi-seed evaluations show stable improvements over strong DeBERTa. Further experiments show that with only 247M parameters, GHI approaches the performance of 11B Flan-T5 based methods on the ISE benchmark. Moreover, it demonstrates strong robustness on the challenging ARTS datasets, maintaining highly competitive performance where traditional models degrade. These results demonstrate that compact structural reasoning remains a valuable alternative to scale-driven approaches for fine-grained tasks.
48. 【2605.22217】Survive or Collapse: The Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL
Link: https://arxiv.org/abs/2605.22217
Authors: Sophia Xiao Pu, Zhaotian Weng, Chengzhi Liu, Jayanth Srinivasa, Gaowen Liu, William Yang Wang, Xin Eric Wang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: reinforcement learning trains, learning trains language, trains language models, Self-play reinforcement learning, human labels
Comments:
Abstract:Self-play reinforcement learning trains language models on their own generated tasks, co-evolving a proposer and solver without human labels. Recent systems report strong reasoning gains, but collapse and instability are widely observed and poorly understood. The dominant response treats this as a reward-design problem. We argue instead that self-play stability is governed by two distinct levers: a data-level gate that decides which proposer-generated tasks enter the training pool, and the reward signal that updates the policy on tasks already admitted. Through controlled experiments on a Python output-prediction task and a deterministic-DSL twin task that strips pretraining priors, output ambiguity, and executor noise, we find the two levers are asymmetric. A strict gate is sufficient for stability under every reward variant we test, including a self-consistency reward with no access to ground truth, while no reward variant is sufficient once the gate is removed. This asymmetry exposes a counter-intuitive coupling we call the Grounded Proposer Paradox: a proposer with ground-truth access accelerates collapse more than an ungrounded one when paired with a self-consistency solver, by concentrating training on clean tasks that form the fastest path to a spurious self-consistent attractor. Replacing the binary gate with a continuous strictness parameter $\varepsilon$ further reveals a two-stage phase transition: training-side metrics decouple at low $\varepsilon$, while validation accuracy holds until $\varepsilon$ is much higher. Data-level gating, not reward calibration, is the binding constraint on self-play stability.
49. 【2605.22204】Audience Engagement with Arabic Women's Social Empowerment and Wellbeing: A Decadal Corpus
Link: https://arxiv.org/abs/2605.22204
Authors: Wajdi Zaghouani, Mabrouka Bessghaier, MD. Rafiul Biswas, Shimaa Amer Ibrahim
Subjects: Computation and Language (cs.CL)
Keywords: public Arabic Facebook, Facebook posts related, ten year collection, Arabic Facebook posts, women empowerment
Comments:
Abstract:This paper presents the Arabic Women and Society Corpus, a ten-year collection of 252,487 public Arabic Facebook posts related to women's empowerment and social wellbeing. The corpus was collected from 51,660 pages across 77 countries between 2013 and 2024, resulting in more than 267 million user interactions. Each post includes engagement metrics such as shares, comments, and emotional reactions, providing a unique view of audience sentiment and social attention. The data were processed using an automated pipeline with language identification, normalization, and metadata cleaning to ensure reliability and reproducibility. The corpus enables large-scale analysis of gender discourse, social reform, and emotional engagement across Arabic dialects. It supports research in Arabic natural language processing, computational social science, and digital communication studies. The dataset and accompanying documentation will be released upon request for research use.
50. 【2605.22203】Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents
Link: https://arxiv.org/abs/2605.22203
Authors: Sovandara Chhoun, Pichdara Po, Sereiwathna Ros, Wan-Sup Cho, Saksonita Khoeurn
Subjects: Computation and Language (cs.CL)
Keywords: Retrieval-Augmented Generation, text chunking approaches, Khmer agricultural documents, framework applied, Answer Relevance
Comments: 11 pages, 1 figure
Abstract:In this study, we compare the performance of four text chunking approaches: Recursive, Khmer-Aware, Sentence-Based, and LLM-Based within a Retrieval-Augmented Generation (RAG) framework applied to Khmer agricultural documents. The document chunks are encoded using the BGE-M3 multilingual embedding model and retrieved using the FAISS library. Performance is evaluated using four metrics: Average Retrieval Score (L2 distance), Answer Relevance, Khmer Coverage, and Khmer Intersection over Union, all measured against ground-truth question-answer pairs. For evaluation, we perform 5-fold cross-validation over 18 question-answer pairs. We observe the best performance for the character-based Recursive chunking method with a chunk size of 300 characters, achieving the lowest L2 distance (0.4295 ± 0.0461), highest Answer Relevance (0.8663 ± 0.0199), and highest Khmer IoU (0.6441 ± 0.0347). A paired t-test shows a statistically significant improvement over the Sentence-Based chunking method in L2 distance (p = 0.0121). These results highlight the importance of segmentation granularity and structural preservation for optimizing dense retrieval in morphologically complex, low-resource languages such as Khmer.
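To make the idea of character-based recursive chunking concrete, here is a minimal sketch: split on the coarsest separator first, and fall back to finer separators (and finally a hard cut) only for pieces that still exceed the size limit. The function name, the separator list, and the merge rule are assumptions; this is neither the authors' implementation nor that of any particular library.

```python
def recursive_chunks(text, chunk_size=300, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    for index, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = current + sep + part if current else part
            if len(candidate) <= chunk_size:
                current = candidate          # keep packing parts into this chunk
                continue
            if current:
                chunks.append(current)
            if len(part) <= chunk_size:
                current = part
            else:
                # The part alone is too long: retry with finer separators.
                chunks.extend(recursive_chunks(part, chunk_size, separators[index + 1:]))
                current = ""
        if current:
            chunks.append(current)
        return chunks
    # No separator applies: cut at fixed character offsets.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Hypothetical toy input with a small limit so the behaviour is visible.
pieces = recursive_chunks("a" * 10 + " " + "b" * 10 + " " + "c" * 10, chunk_size=12)
```

Khmer text often lacks spaces between words, so a sketch like this would fall back to hard character cuts more often than it would for English.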
51. 【2605.22202】Structure Retention in Embedding Spaces as a Predictor of Benchmark Performance
Link: https://arxiv.org/abs/2605.22202
Authors: Amanda Myntti, Jenna Kanerva, Veronika Laippala, Filip Ginter
Subjects: Computation and Language (cs.CL)
Keywords: high-performing embedding models, embedding models organize, show that high-performing, MTEB tasks spanning, models organize
Comments:
Abstract:In this paper, we show that high-performing embedding models organize their embedding spaces in a consistent way. We evaluate 25 contemporary embedding models on five MTEB tasks spanning four diverse task categories (retrieval, bitext mining, pair classification, and summarization) in both English and multilingual settings, and reveal that nearest-neighbor overlap and magnitude differences in independent component analysis (ICA) between paired text instances strongly correlate (even up to 0.97) with performance on the given task. Ultimately, we show that embedding tasks display varying degrees of linearity and reliance on retention of local information. Our results further the understanding of embeddings and their relation to model performance, and shed light on possible future training objectives and on optimizing conditional embeddings.
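One plausible reading of "nearest-neighbor overlap between paired text instances" is sketched below: for each pair (for example a sentence and its translation), compare the set of nearest neighbours found on one side with the set found on the other. The function names, the cosine similarity, and the toy vectors are assumptions; the paper's exact definition may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn(vectors, i, k):
    """Indices of the k vectors most similar to vectors[i], excluding i."""
    others = [j for j in range(len(vectors)) if j != i]
    others.sort(key=lambda j: cosine(vectors[i], vectors[j]), reverse=True)
    return set(others[:k])

def neighbor_overlap(side_a, side_b, k):
    """Mean share of k nearest neighbours that paired items have in common."""
    n = len(side_a)
    if n != len(side_b) or not 0 < k < n:
        raise ValueError("need equal-length sides and 0 < k < number of items")
    return sum(len(knn(side_a, i, k) & knn(side_b, i, k)) / k for i in range(n)) / n

# Hypothetical 2-D embeddings; side_b[i] is the paired instance of side_a[i].
side_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scrambled = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
same_structure = neighbor_overlap(side_a, side_a, k=1)
broken_structure = neighbor_overlap(side_a, scrambled, k=1)
```

A score near 1 means both sides keep the same local neighbourhoods; a score near 0 means the local structure is not retained across the pairing.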
52. 【2605.22177】Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
Link: https://arxiv.org/abs/2605.22177
Authors: Jinyang Wu, Guocheng Zhai, Ruihan Jin, Yuhao Shen, Zhengxi Lu, Fan Zhang, Haoran Luo, Zheng Lian, Zhengqi Wen, Jianhua Tao
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: increasingly powerful capabilities, endowed autonomous agents, large language models, powerful capabilities, proliferation of large
Comments:
Abstract:The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with these skills. This gives rise to a critical bottleneck: different LLMs offer distinct advantages across diverse domains, yet current frameworks fail to exploit the complementary strengths of models and skills, thereby limiting their performance on downstream tasks. In this paper, we present Maestro (Multimodal Agent for Expert-Skill Targeted Reinforced Orchestration), a Reinforcement Learning (RL)-driven orchestration framework that reframes heterogeneous multimodal tasks as a sequential decision-making process over a hierarchical model-skill registry. Rather than consolidating all knowledge into a single model, Maestro trains a lightweight policy to dynamically compose ensembles of frozen expert models and a two-tier skill library, deciding at each step whether to invoke an external expert, which model-skill pair to select, and when to terminate. The policy is optimized via outcome-based RL, requiring no step-level supervision. We evaluate Maestro across ten representative multimodal benchmarks spanning mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis. With only a 4B orchestrator, Maestro achieves an average accuracy of 70.1%, surpassing both GPT-5 (69.3%) and Gemini-2.5-Pro (68.7%). Crucially, the learned coordination policy generalizes to unseen models and skills without retraining: augmenting the registry with out-of-domain experts yields a 59.5% average on four challenging benchmarks, outperforming all closed-source baselines. Maestro further maintains high computational efficiency with low latency. The source code is available at this https URL.
53. 【2605.22170】Do Factual Recall Mechanisms Carry over from Text to Speech in Multimodal Language Models?
Link: https://arxiv.org/abs/2605.22170
Authors: Luca Modica, Filip Landin, Mehrdad Farahani, Livia Qian, Gabriel Skantze, Richard Johansson
Subjects: Computation and Language (cs.CL)
Keywords: written text jointly, recent years, Causal Mediation Analysis, Speech Language Models, represent speech
Comments: In *SEM 2026, the 15th Joint Conference on Lexical and Computational Semantics
Abstract:In recent years, several Speech Language Models (SLMs) that represent speech and written text jointly have been presented. This raises the question of how model-internal mechanisms are similar and different when operating in the two modalities. We focus on how these systems encode, store, and retrieve factual knowledge, which has previously been investigated for text-only models. To investigate mechanisms behind the storage and recall of factual associations in SLMs, we leverage Causal Mediation Analysis, a technique previously applied to text-based models. Initial results using SpiritLM, a multimodal model integrating discrete speech tokens, reveal discrepancies between text-to-text and speech-to-text results, suggesting that the emergent mechanisms for factual recall are only partially carried over from the text to the speech modality. These results advance our understanding of how internal mechanisms encode factual associations in SLMs while contributing insights for improving speech-enabled AI systems.
54. 【2605.22148】Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents
Link: https://arxiv.org/abs/2605.22148
Authors: Xing Zhang, Yanwei Cui, Guanghui Wang, Ziyuan Li, Wei Qiu, Bing Zhu, Peiyang He
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Self-evolving skill libraries, LLM agents accumulate, frozen LLM agents, agents accumulate reusable, accumulate reusable knowledge
Comments: 16 pages, 2 figures, 6 tables. Extends [arXiv:2605.19576](https://arxiv.org/abs/2605.19576) with the SWE-bench Verified evaluation and a non-divergence analysis (Proposition 1)
Abstract:Self-evolving skill libraries, pioneered by Voyager, let frozen LLM agents accumulate reusable knowledge without weight updates, yet recent evaluation shows that LLM-authored skills deliver $+0.0$pp over no-skill baselines while human-curated ones deliver $+16.2$pp: the bottleneck is not skill authoring but lifecycle management. We introduce Ratchet, a single-agent loop in which a frozen LLM writes, retrieves, curates, and retires its own natural-language skills. Ratchet integrates four candidate hygiene mechanisms: outcome-driven retirement, a bounded active-cap, meta-skill authoring guidance, and pattern canonicalisation. On MBPP+ hard-100 with Claude Opus 4.7, Ratchet lifts held-out pass@1 from a $0.258 \pm 0.047$ baseline to a late-window rolling mean of $0.584$ (peak $0.658 \pm 0.042$) across 100 rounds and 3 seeds, a $+0.328 \pm 0.018$ rolling-mean gain where the no-skill control drifts at $+0.002 \pm 0.005$; the same recipe transfers to an agentic solver on SWE-bench Verified ($+0.22$ peak lift over 20 rounds). Eight ablations (A1--A8) reveal that the minimal working recipe is smaller than our design suggests: retirement and the meta-skill authoring prior are load-bearing, while explicit deduplication (canonicalisation, cover-guard) is subsumed by the meta-skill itself. A non-divergence proposition shows that bounded cap and retirement threshold together prevent expected performance from drifting below the no-skills floor.
55. 【2605.22140】Psy-Chronicle: A Structured Pipeline for Synthesizing Long-Horizon Campus Psychological Counseling Dialogues
Link: https://arxiv.org/abs/2605.22140
Authors: Chaogui Gou, Jiarui Liang
Subjects: Computation and Language (cs.CL)
Keywords: shown substantial potential, psychological support tasks, large language models, recent years, large language
Comments:
Abstract:In recent years, large language models have shown substantial potential in psychological support tasks. However, existing psychological counseling data mostly rely on single-turn question answering or short multi-turn dialogues, making it difficult to characterize how college students' psychological distress accumulates, interacts, and gradually evolves over long periods within campus life events. To address this issue, this paper proposes Psy-Chronicle, a structured data-generation framework for synthesizing long-horizon campus psychological counseling dialogues. We generate a semester-spanning temporal stress event graph to model the chronological order and evolutionary dependencies among campus stress events. Through interactive simulation between a student agent and a counselor agent, together with a structured memory integration mechanism, Psy-Chronicle generates long-horizon dialogues with continuity across counseling sessions. Based on Psy-Chronicle, we construct and open-source CPCD, a Chinese long-horizon dialogue dataset for college psychological counseling, containing 100 student profiles and 90,000 counseling dialogues. We further build CPCD-Bench to evaluate models' long-horizon campus counseling capabilities along three dimensions: session-level response, long-horizon memory recall, and temporal-causal reasoning. Experimental results show that CPCD effectively improves session-level response generation and long-horizon memory recall for models with the same base architecture. Meanwhile, improvements in temporal-causal reasoning remain limited, indicating that event-chain organization and causal explanation are key challenges in long-horizon psychological counseling modeling. The related code and data are available at: this https URL
56. 【2605.22138】Efficient Agentic Reasoning Through Self-Regulated Simulative Planning
Link: https://arxiv.org/abs/2605.22138
Authors: Mingkai Deng, Jinyu Hou, Lara Sá Neves, Varad Pimpalkhute, Taylor W. Killian, Zhengzhong Liu, Eric P. Xing
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Keywords: reasoning, simulative reasoning, planning, agent decide, System
Comments: Code and model artifacts are available at [this https URL](https://github.com/sailing-lab/sr2am)
Abstract:How should an agent decide when and how to plan? A dominant approach builds agents as reactive policies with adaptive computation (e.g., chain-of-thought), trained end-to-end expecting planning to emerge implicitly. Without control over the presence, structure, or horizon of planning, these systems dramatically increase reasoning length, yielding inefficient token use without reliable accuracy gains. We argue efficient agentic reasoning benefits from decomposing decision-making into three systems: simulative reasoning (System II) grounding deliberation in future-state prediction via a world model; self-regulation (System III) deciding when and how deeply to plan via a learned configurator; and reactive execution (System I) handling fine-grained action. Simulative reasoning provides unified planning across diverse tasks without per-domain engineering, while self-regulation ensures the planner is invoked only when needed. To test this, we develop SR$^2$AM (Self-Regulated Simulative Reasoning Agentic LLM), realizing both as distinct stages within an LLM's chain-of-thought, with the LLM as world model. We explore two instantiations: recording decisions from a prompted multi-module system (v0.1) and reconstructing structured plans from traces of pretrained reasoning LLMs (v1.0), trained via supervised then reinforcement learning (RL). Across math, science, tabular analysis, and web information seeking, v0.1-8B and v1.0-30B achieve Pass@1 competitive with 120-355B and 685B-1T parameter systems respectively, while v1.0-30B uses 25.8-95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases average planning horizon by 22.8% while planning frequency grows only 2.0%, showing it learns to plan further ahead rather than more often. More broadly, learned self-regulation instantiates a principle we expect to extend beyond planning to how agents govern their own learning and adaptation.
57. 【2605.22137】Cross-Lingual Consensus: Aligning Multilingual Cultural Knowledge via Multilingual Self-Consistency
Link: https://arxiv.org/abs/2605.22137
Authors: Andrew Ivan Soegeng, Patrick Sutanto, Tan Sang Nguyen
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, exhibit significant performance, significant performance discrepancies, demonstrate strong capabilities, Large Language
Comments: Accepted to The 1st Workshop on Multilinguality in the Era of Large Language Models
Abstract: Although Large Language Models (LLMs) demonstrate strong capabilities across various tasks, they exhibit significant performance discrepancies across languages. While prompting LLMs in English typically yields the highest general performance, it often induces a Western-centric bias, hindering the model's ability to accurately reflect diverse cultural knowledge. We hypothesize that LLMs already possess rich cultural knowledge embedded within local-language representations, but fail to retrieve it when prompted in English. To bridge this cross-lingual knowledge gap, we propose a novel self-supervised framework. Our method leverages multilingual self-consistency to identify the most reliable cultural responses across languages, combined with a self-critique mechanism to transfer this knowledge to the weaker language. Evaluations on the BLEnD benchmark demonstrate that our approach significantly improves cultural alignment, boosting performance on English queries by an average of 5.03%, while relying entirely on self-generated data. Ultimately, our work demonstrates that latent cultural knowledge can be successfully surfaced and propagated across languages, enabling more culturally equitable and consistent LLMs.
58. 【2605.22099】A Comparative Study of Language Models for Khmer Retrieval-Augmented Question Answering
Link: https://arxiv.org/abs/2605.22099
Authors: Sereiwathna Ros, Phannet Pov, Ratanaktepi Chhor, Kimleang Ly, Wan-Sup Cho, Saksonita Khoeurn
Categories: Computation and Language (cs.CL)
Keywords: Retrieval-Augmented Generation, improving factual accuracy, outputs in retrieved, retrieved evidence, promising paradigm
Comments: 14 pages, 1 figure
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for grounding large language model (LLM) outputs in retrieved evidence, thereby reducing hallucination and improving factual accuracy. Its efficacy, however, remains largely unexamined for low-resource, non-Latin-script languages such as Khmer. In this paper, we present a RAG-based question answering system for Khmer-language telecom-domain documents. We conduct a two-phase comparative evaluation. First, we benchmark three embedding models: BGE-M3 (567M), Jina-Embeddings-v3 (570M), and Qwen3-Embedding (597M), for dense retrieval over Khmer documents. BGE-M3 consistently performs best, achieving a Hit Rate@3 of 0.285, File Hit Rate@3 of 0.700, MRR@3 of 0.221, and Precision@3 of 0.112, substantially outperforming the other retrievers. Second, using BGE-M3 as the selected retriever, we evaluate five generator backends: Qwen3 (8B), Qwen3.5 (9B), Sailor2-8B-Chat, SeaLLMs-v3-7B-Chat, and Llama-SEA-LION-v2-8B-IT, on a curated golden dataset of 200 Khmer question-answer pairs. To quantify system performance, we apply six RAGAS-inspired metrics: faithfulness, answer relevance, context relevance, factual correctness, answer similarity, and answer correctness. The results show no single model dominates across all metrics: Qwen3.5-9B achieves the highest faithfulness (0.859) and context relevance (0.726), Qwen3-8B attains the highest factual correctness (0.380), and SeaLLMs-v3-7B-Chat performs best on answer relevance (0.867), answer similarity (0.836), and answer correctness (0.599). These findings highlight that retriever choice remains a major bottleneck for Khmer RAG, while generator strengths vary depending on whether the priority is grounding, factual precision, or semantic similarity.
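The retrieval metrics quoted in this abstract can be illustrated with a small sketch. It uses the common textbook definitions of Hit Rate@k, MRR@k, and Precision@k; the abstract does not spell out its exact formulas, so these definitions, the function name `retrieval_metrics`, and the toy document IDs are assumptions for illustration only.

```python
def retrieval_metrics(ranked_lists, relevant_sets, k=3):
    """Compute Hit Rate@k, MRR@k, and Precision@k over a batch of queries.

    ranked_lists: one ranked list of retrieved document IDs per query.
    relevant_sets: one set of relevant document IDs per query.
    (Both inputs are hypothetical; this is not the paper's evaluation code.)
    """
    hits, reciprocal_ranks, precisions = [], [], []
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        top_k = ranked[:k]
        # 1-based positions of relevant documents inside the top k.
        positions = [i + 1 for i, doc in enumerate(top_k) if doc in relevant]
        hits.append(1.0 if positions else 0.0)
        reciprocal_ranks.append(1.0 / positions[0] if positions else 0.0)
        precisions.append(len(positions) / k)
    n = len(ranked_lists)
    return {
        "hit_rate": sum(hits) / n,
        "mrr": sum(reciprocal_ranks) / n,
        "precision": sum(precisions) / n,
    }


# Toy example with made-up document IDs.
ranked = [["d1", "d7", "d3"], ["d4", "d2", "d9"], ["d5", "d6", "d8"]]
relevant = [{"d3"}, {"d4", "d2"}, {"d0"}]
metrics = retrieval_metrics(ranked, relevant, k=3)
```

Under these definitions a retriever can have a modest Hit Rate@3 and a still lower Precision@3, since precision divides by k even when only one relevant document exists per query.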
59. 【2605.22081】ArabDiscrim: A Decade-Long Arabic Facebook Corpus on Racism and Discrimination
Link: https://arxiv.org/abs/2605.22081
Authors: Wajdi Zaghouani, Shimaa Amer Ibrahim, Mabrouka Bessghaier, Houda Bouamor
Categories: Computation and Language (cs.CL)
Keywords: public Arabic Facebook, Arabic Facebook posts, Facebook posts, Arabic Facebook, discussing racism
Comments: Accepted at LREC 2026 Main Conference
Abstract:We present ArabDiscrim, a decade-long lexical resource and corpus of 293K public Arabic Facebook posts (2014--2024) discussing racism and discrimination. Unlike existing Twitter-centric datasets, ArabDiscrim integrates platform-native engagement signals, including reactions, shares, comments, and page metadata, enabling joint analysis of language and audience response. The resource includes 200 curated terms (100 racism-related and 100 discrimination-related) with morphological regex families (13+ inflections per lemma), and 20 discrimination axes capturing identity-based grounds for unequal treatment. It also provides explicit attribution patterns. Released under a restricted research-use license for ethical compliance with platform terms, ArabDiscrim supports weak supervision, axis-aware sampling, and platform ecology research. By bridging lexical depth and ecological validity, it establishes a foundation for fairness-oriented, platform-aware Arabic NLP.
60. 【2605.22079】Ishigaki-IDS-Bench: A Benchmark for Generating Information Delivery Specification from BIM Information Requirements
Link: https://arxiv.org/abs/2605.22079
Authors: Ryo Kanazawa, Koyo Hidaka, Teppei Miyamoto, Takayuki Kato, Tomoki Ando, Chenguang Wang, Dayuan Jiang, Naofumi Fujita, Shuhei Saitoh, Atomu Kondo, Koki Arakawa, Daiho Nishioka
Categories: Computation and Language (cs.CL)
Keywords: satisfy industry-standard XML, public resources remain, resources remain limited, simultaneously satisfy industry-standard, Large language models
Comments: 7 pages; benchmark data and evaluation scripts are available on GitHub and Hugging Face
Abstract:Large language models (LLMs) are widely used to generate structured outputs such as JSON, SQL, and code, yet public resources remain limited for evaluating generation that must simultaneously satisfy industry-standard XML and domain vocabulary constraints. This paper presents Ishigaki-IDS-Bench, a benchmark for evaluating the ability to generate Information Delivery Specification (IDS) XML from Building Information Modeling (BIM) information requirements. The benchmark contains 166 BIM/IDS expert-authored and verified examples created by expanding 83 practical scenarios into Japanese and English, corresponding gold IDS files, and metadata for input format, language, turn setting, IFC version, and construction domain. Its evaluation combines IDSAuditTool-based Processability, Structure, and Content audits with content-agreement evaluation against gold IDS files. In zero-shot evaluation over 10 LLMs, the best model reaches 65.6% macro F1 for content agreement, while only 27.7% of outputs pass the Content audit. These results show that current LLMs can express part of the information requirements as IDS, but still struggle to stably generate XML that satisfies the IDS standard and IFC vocabulary constraints. Ishigaki-IDS-Bench supports comparative evaluation, failure analysis, and the development of constrained structured generation methods that conform to domain standards. We release the evaluation scripts and benchmark data under the CC BY 4.0 license on GitHub and Hugging Face.
61. 【2605.22074】From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning
Link: https://arxiv.org/abs/2605.22074
Authors: Xitai Jiang, Zihan Tang, Wenze Lin, Yang Yue, Shenzhi Wang, Gao Huang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: outcome-based RLVR remains, RLVR remains inefficient, correct final-answer rollouts, Curriculum Reinforcement Learning, outcome-based RLVR
Comments:
Abstract:Reinforcement learning from verifiable rewards (RLVR) has shown strong promise for LLM reasoning, but outcome-based RLVR remains inefficient on hard problems because correct final-answer rollouts are rare and sample-level credit assignment cannot use partial progress in failed attempts. We introduce SCRL (Subproblem Curriculum Reinforcement Learning), a curriculum RL framework that derives verifiable subproblems from reference reasoning chains and fixes the final subproblem as the original problem. This turns partial progress on hard problems into verifiable learning signals. Algorithmically, SCRL uses subproblem-level normalization, which normalizes rewards independently at each subproblem position and assigns the resulting advantages to the corresponding answer spans, enabling finer-grained credit assignment without external rubrics or reward models. Our analysis shows that subproblem curricula lift hard problems out of gradient dead zones, with larger relative gains as the original problem becomes harder. Across seven mathematical reasoning benchmarks, SCRL outperforms strong curriculum-learning baselines, improving average accuracy over GRPO by +4.1 points on Qwen3-4B-Base and +1.9 points on Qwen3-14B-Base. On AIME24, AIME25, and IMO-Bench, SCRL further improves pass@1 by +3.7 points and pass@64 by +4.6 points on Qwen3-4B-Base, indicating better exploration on hard reasoning problems.
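The subproblem-level normalization described in this abstract might look roughly like the sketch below. The abstract only states that rewards are normalized independently at each subproblem position; the GRPO-style mean and standard-deviation normalization, the `eps` term, and the name `subproblem_advantages` are assumptions, and the mapping of advantages onto answer token spans is not shown.

```python
from statistics import mean, pstdev


def subproblem_advantages(rewards, eps=1e-6):
    """Normalize rewards column by column, one column per subproblem.

    rewards[i][j] is the (hypothetical) verifiable reward of rollout i
    on subproblem j, for example 1 for a correct answer and 0 otherwise.
    """
    num_sub = len(rewards[0])
    advantages = [[0.0] * num_sub for _ in rewards]
    for j in range(num_sub):
        column = [row[j] for row in rewards]
        mu, sigma = mean(column), pstdev(column)
        for i, r in enumerate(column):
            # Each position is normalized only against the same position
            # in the other rollouts (assumed mean/std normalization).
            advantages[i][j] = (r - mu) / (sigma + eps)
    return advantages


# Four rollouts, two subproblems; every rollout fails the second one.
rewards = [[1, 0], [0, 0], [1, 0], [0, 0]]
adv = subproblem_advantages(rewards)
```

In this toy batch the first subproblem still yields a non-zero learning signal even though every rollout fails the final one, which is the situation where a single outcome-level reward would give no gradient.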
62. 【2605.22072】Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention
Link: https://arxiv.org/abs/2605.22072
Authors: Changyuan Tian, Zhicong Lu, Huaxing Liu, Xiang Wang, Shuai Li, Yu Chen, Wenqian Lv, Zichuan Lin, Juncheng Diao, Deheng Ye
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: large language models, work extends RLVR, language models, large language, Reinforcement learning
Comments: 20 pages, 7 figures, 3 tables. Preprint
Abstract:Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for advancing complex reasoning in large language models, and recent work extends RLVR to multimodal large language models (MLLMs). This transfer, however, surfaces a faithfulness challenge: faithful perception of task-relevant visual evidence and faithful use of that evidence during reasoning, leading to unsatisfactory gains on multimodal benchmarks. Specifically, existing perception supervision often operates on textual descriptions rather than natively on image regions, and faithful use is largely overlooked, exposing the perception-reasoning disconnect where correctly perceived evidence is dropped or contradicted during reasoning. To close these gaps, we propose Faithful-MR1, a training framework that anchors and reinforces visual attention to address both halves of faithful multimodal reasoning. The Anchoring stage turns perception into an explicit pre-reasoning subtask, supervising a dedicated Focus token's attention directly against image regions rather than through textual descriptions. The Reinforcing stage exposes faithful use through counterfactual image intervention, rewarding answer-correct trajectories that concentrate visual attention where vision causally matters. Extensive experiments demonstrate that Faithful-MR1 outperforms recent multimodal reasoning baselines on both Qwen2.5-VL-Instruct 3B and 7B backbones while using substantially less training data.
63. 【2605.22064】Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild
Link: https://arxiv.org/abs/2605.22064
Authors: Mao Zheng, Zheng Li, Tao Chen, Bo Lv, Mingrui Sun, Mingyang Song, Jinlong Song, Hong Huang, Decheng Wu, Hai Wang, Yifan Song, Yanfeng Chen, Guanwei Zhang, Guanghua Yu, Yi Su, Hong Liu, Jinxiang Ou, Keyao Wang, Weile Chen, Haozhao Kuang, Kai Wang, Nuo Chen, Zihao Zheng, Chenhao Wang, Bin Xing, Chengcheng Xu, Tinghao Yu, Binghong Wu, Long Xu, Jiacheng Shi, Yunhao Wang, Baifang Chen, Lei Zhang, Qi Yang, Zhao Wu, Jiacheng Li, Lan Jiang, Lanrui Wang, Kai Zhang, Shuaipeng Li, Zhongzhi Chen, Weixuan Sun, Jiaqi Zhu, An Wang, Wei Li, Jun Xia, Weidong Han, Wutian Yang, Litong Hui, Luoguo Jia, Jiajia Wu, Xinpeng Zhou, Tianxiang Fei
Categories: Computation and Language (cs.CL)
Keywords: complex real-world scenarios, designed for complex, multilingual translation models, translation models designed, fast-thinking multilingual translation
Comments:
Abstract:Hy-MT2 is a family of fast-thinking multilingual translation models designed for complex real-world scenarios. It includes three model sizes: 1.8B, 7B, and 30B-A3B (MoE), all of which support translation among 33 languages and effectively follow translation instructions in multiple languages. For on-device deployment, with AngelSlim 1.25-bit extreme quantization, the 1.8B model requires only 440 MB of storage and improves inference speed by 1.5x. Multi-dimensional evaluations show that Hy-MT2 delivers outstanding performance across general, real-world business, domain-specific, and instruction-following translation tasks. The 7B and 30B models outperform open-source models such as DeepSeek-V4-Pro and Kimi K2.6 in fast-thinking mode, while the lightweight 1.8B model also surpasses mainstream commercial APIs from providers such as Microsoft and Doubao overall.
64. 【2605.22057】FlyRoute: Self-Evolving Agent Profiling via Data Flywheel for Adaptive Task Routing
Link: https://arxiv.org/abs/2605.22057
Authors: Rongjun Li, Ziyu Zhou, Yihang Wu
Categories: Computation and Language (cs.CL)
Keywords: deployed profiles stay, profiles stay static, Enterprise routers assign, exemplars current, stay static
Comments: 13 pages, 5 figures, 5 tables
Abstract:Enterprise routers assign queries to expert agents, yet deployed profiles stay static while agents evolve (prompts, tools, models), and developers rarely keep descriptions or exemplars current. We present FlyRoute, a self-evolving profiling framework that grows capability evidence from real traffic: dispatch candidates, quality-gate successful pairs into each agent's success store, periodically distill evidence into learned capability descriptions, and inject those descriptions together with BM25-retrieved successes into an LLM router. To make this flywheel data-efficient, FlyRoute introduces a targeted exploration policy that combines profile uncertainty, BM25 relevance, and lexical novelty, prioritizing under-profiled agents only for plausible queries and avoiding redundant evidence collection. In experiments on our proprietary enterprise developer-support dataset of real routed queries, FlyRoute improves a same-backbone zero-shot LLM router from 72.57% to 78.04% with only five seed queries per agent, showing that profile retrieval already strengthens cold-start routing. After streaming 7,211 labeled training queries through the flywheel, accuracy rises to 89.83% (+17.26pp over zero-shot; +11.79pp over cold start), with consistent gains across four expert domains under standard routing accuracy on single-gold test queries.
65. 【2605.22035】HyLoVQA: Dynamic Hypernetwork-Generated Low-Rank Adaptation for Continual Visual Question Answering
Link: https://arxiv.org/abs/2605.22035
Authors: Yiran Wang, Chenyi Xiong, Ziyue Qin, Miao Zhang, Kui Xiao, Zhifei Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Continual Visual Question, Visual Question Answering, Question Answering, preserving past knowledge, Continual Visual
Comments: Accepted by IJCAI 2026
Abstract:Continual Visual Question Answering (VQA) requires learning from non-stationary streams of visual inputs and questions while preserving past knowledge. Most prior methods adapt by updating a largely shared parameter set. This often leads to cross-level task interference, hindering accurate adaptation to the current task and object. To address this limitation, we propose HyLoVQA. It maintains a drift-resilient memory bank of anchors. The bank stores the content of visual objects and textual tasks, and they are updated using current input features. Conditioned on retrieved anchors, a hypernetwork generates lightweight Low-Rank Adaptation (LoRA) adapters. This ensures parameter efficiency, allowing the model to adapt to each task and object dynamically. Additionally, we formulate an alignment loss that aligns semantic discrepancies in the feature space with functional changes in the parameter space, thereby constraining LoRA adapters to remain focused on the current task and object. Extensive experiments on VQA v2 and NExT-QA under both standard and compositional settings demonstrate the superiority of HyLoVQA over prior state-of-the-art methods.
66. 【2605.22012】LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning
Link: https://arxiv.org/abs/2605.22012
Authors: Yifan Dai, Zhenhua Wu, Bohan Zeng, Daili Hua, Jialing Liu, Bozhou Li, Yuran Wang, Chengzhuo Tong, Hao Liang, Xiaochen Ma, Junbo Niu, Tianyu Guo, Yang Shi, Yue Ding, Yiyan Ji, Bingyin Mei, Yushuo Guan, Yuanxing Zhang, Pengfei Wan, Fangcheng Fu, Wentao Zhang
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: current multimodal large, requires fine-grained evidence, multimodal large language, reasoning requires fine-grained, large language models
Comments: 21 pages, 15 figures
Abstract: Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.
67. 【2605.22007】Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer
Link: https://arxiv.org/abs/2605.22007
Authors: Jewon Yeom, Jaewon Sok, Heejun Kim, Seonghyeon Park, Jeongjae Park, Taesup Kim
Categories: Computation and Language (cs.CL)
Keywords: model answers incorrectly, correct concept, missing knowledge, generation-time distribution, correct
Comments:
Abstract:Hallucination is often viewed as a direct consequence of missing knowledge: a model answers incorrectly when the correct answer is absent from its generation-time distribution, and correctly when it is present. We test this assumption by introducing a semantic notion of answer availability that aggregates token-level variants expressing the same answer concept, and asks whether the correct concept is already available at the moment the model commits to an answer. Across Qwen and Llama models from 0.8B to 72B in both Instruct and Base variants, 16-47% of Instruct hallucinations occur with substantial probability mass already on the correct concept, and the rate rises monotonically with scale. Comparing such failures against correct generations with matched semantic support, the distinguishing factor is not whether the correct concept is represented, but how its probability is distributed: correct generations concentrate mass on a single surface form, hallucinations disperse it across alternatives. The same sharpening asymmetry extends across multi-token generation and is detectable in pre-generation hidden states. Together, these results identify a single mechanism: instruction tuning sharpens answer commitment with scale, making helpfulness and confident hallucination two consequences of the same underlying disposition.
68. 【2605.22005】Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)
Link: https://arxiv.org/abs/2605.22005
Authors: Hisashi Miyashita
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: reveals interpretable semantic, transformer-based large language, interpretable semantic subspaces, semantic subspaces directly, weight matrix
Comments:
Abstract: We show that singular value decomposition of the lm_head weight matrix of a transformer-based large language model -- requiring only five lines of PyTorch and no model inference -- reveals interpretable semantic subspaces directly from the model weights. Each left singular vector identifies the vocabulary tokens most readily selected when the hidden state aligns with the corresponding singular direction; inspecting these clusters exposes the model's training data composition and curation philosophy. Analysing GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B, we find that singular value spectra and vocabulary cluster structures differ systematically across models: GPT exhibits a graduated hierarchy of functionally differentiated subspaces; Gemma is dominated by pre-nineteenth-century English orthography, forming a stepwise clustering structure that may contribute to high output controllability; and Qwen exhibits broad multilingual coverage alongside subspaces whose vocabulary the authors have determined to be ethically inappropriate for direct publication. Base-instruct comparison reveals that ethically concerning subspaces originate in pretraining and are not removed by post-training alignment. We introduce the Vocabulary Cluster Score (VCS) to quantify subspace coherence, and the Weighted Projection Score (WPS) as a static glitch token detector; applying WPS to GPT-OSS-120B recovers shokubutsu-hyakka-tsu (ID 137606), a well-known glitch token widely reported in the CJK language community, without any model inference. We propose a taxonomy of root causes for problematic vocabulary content and call for lm_head SVD analysis to be adopted as a standard pre-release safety auditing step. Our findings further suggest directions toward SVD-guided tokenizer optimisation and more controllable LLM design.
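The idea of reading token clusters off the left singular vectors can be sketched on a toy matrix. This is not the paper's code: it uses NumPy instead of PyTorch, a made-up 4-token vocabulary and weight matrix instead of a real lm_head, and ranks tokens by absolute value because the sign of a singular vector is arbitrary. The helper name `top_tokens_per_direction` is hypothetical.

```python
import numpy as np


def top_tokens_per_direction(lm_head, vocab, n_directions=2, top_k=2):
    """Return singular values and, per leading direction, the tokens with
    the largest-magnitude entries in the left singular vector."""
    # lm_head is assumed to have shape (vocab_size, hidden_dim),
    # with one row per vocabulary token.
    U, S, _ = np.linalg.svd(lm_head, full_matrices=False)
    clusters = []
    for i in range(n_directions):
        idx = np.argsort(-np.abs(U[:, i]))[:top_k]
        clusters.append([vocab[j] for j in idx])
    return S, clusters


# Toy stand-in for an output projection: two "word" rows share one hidden
# direction, two "digit" rows share another.
vocab = ["cat", "dog", "7", "9"]
toy_lm_head = np.array([
    [5.0, 0.0, 0.0],
    [4.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
    [0.0, 2.0, 0.0],
])
singular_values, clusters = top_tokens_per_direction(toy_lm_head, vocab)
```

On a real model the matrix would be the output projection weights and the vocabulary would come from the tokenizer; how those are loaded depends on the model library and is left out here.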
69. 【2605.22003】From TF-IDF to Transformers: A Comparative and Ensemble Approach to Sentiment Classification
Link: https://arxiv.org/abs/2605.22003
Authors: Dip Biswas Shanto, Mitali Yadav, Prajwal Panth, Suresh Chandra Satapathy
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: opinion mining, extract opinion, Natural Language Processing, Sentiment analysis, opinion
Comments: 6 pages, 9 figures. This is the author's accepted manuscript, presented at the International Conference on Intelligent Computing, Networks and Security (IC-ICNS 2026), March 26-28, Bhubaneswar, India. Proceedings publication pending
Abstract:Sentiment analysis, also referred to as opinion mining, primarily tries to extract opinion from any text-based data. In the context of movie reviews and critics, sentimental analysis can be a helpful tool to predict whether a movie review is generally positive or negative. It can be difficult for the ML models to understand the context or metaphysical sentiment accurately, as ML models rely largely on statistical word representations. The objective of this paper is to examine and categorise movie reviews into positive and negative sentiments. Diverse machine learning models are considered in doing so, and Natural Language Processing (NLP) methodologies are employed for data preprocessing and model assessment. The IMDb dataset is used. Specifically, Naive Bayes, Logistic Regression, Support Vector Machines (SVM), LightGBM, LSTM, and transformer-based models such as RoBERTa and DistilBERT were evaluated. After a lot of testing with accuracy, precision, recall, F1-score, and ROC-AUC, RoBERTa performed better than all the other models, with an accuracy of 93.02%. A soft voting ensemble that combined all the models also improved classification performance, showing that model ensembling works well for sentiment analysis.
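A soft voting ensemble of the kind mentioned in this abstract averages the class probabilities predicted by each model and picks the class with the highest average. The sketch below assumes equal model weights and uses made-up probabilities; the abstract does not say how the models were weighted, and `soft_vote` is a hypothetical helper.

```python
def soft_vote(prob_lists):
    """Average per-class probabilities across models, then take the argmax.

    prob_lists: one list per model, each holding [P(negative), P(positive)].
    Equal weighting of models is an assumption.
    """
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    label = max(range(n_classes), key=lambda c: avg[c])
    return avg, label


# Three hypothetical models scoring one review; the second one disagrees.
model_probs = [[0.40, 0.60], [0.55, 0.45], [0.20, 0.80]]
avg_probs, predicted = soft_vote(model_probs)
```

Unlike hard (majority) voting, a confident model can outweigh a hesitant dissenting one, which is why the second model's weak "negative" vote does not change the outcome here.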
70. 【2605.22001】Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems
Link: https://arxiv.org/abs/2605.22001
Authors: Aaditya Pai
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: protect LLM agents, protect LLM, LLM agents, override directives, deployed to protect
Comments: 8 pages, 3 figures, 2 tables. Submitted to EMNLP 2026 ARR cycle
Abstract: Injection detectors deployed to protect LLM agents are calibrated on static, template-based payloads that announce themselves as override directives. We identify a systematic blind spot: when payloads are generated to mimic the domain vocabulary and authority structures of the target document, what we call domain camouflaged injection, standard detectors fail to flag them, with detection rates dropping from 93.8% to 9.7% on Llama 3.1 8B and from 100% to 55.6% on Gemini 2.0 Flash. We formalize this as the Camouflage Detection Gap (CDG), the difference in injection detection rate between static and camouflaged payloads. Across 45 tasks spanning three domains and two model families, CDG is large and statistically significant (chi^2 = 38.03, p < 0.001 for Llama; chi^2 = 17.05, p < 0.001 for Gemini), with zero reverse discordant pairs in either case. We additionally evaluate Llama Guard 3, a production safety classifier, which detects zero camouflage payloads (IDR_camouflage = 0.000), confirming that the blind spot extends beyond few-shot detectors to dedicated safety classifiers. We further show that multi-agent debate architectures amplify static injection attacks by up to 9.9x on smaller models, while stronger models show collective resistance. Targeted detector augmentation provides only partial remediation (10.2% improvement on Llama, 78.7% on Gemini), suggesting the vulnerability is architectural rather than incidental for weaker models. Our framework, task bank, and payload generator are released publicly.
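As a worked illustration of the Camouflage Detection Gap, the sketch below applies the definition given in the abstract (static detection rate minus camouflaged detection rate) to the detection rates it quotes. The function name is hypothetical, and the paper may aggregate rates across tasks differently before taking the difference.

```python
def camouflage_detection_gap(idr_static, idr_camouflage):
    """CDG: detection rate on static payloads minus rate on camouflaged ones."""
    return idr_static - idr_camouflage


# Detection rates quoted in the abstract, expressed as fractions.
reported = {
    "Llama 3.1 8B": (0.938, 0.097),
    "Gemini 2.0 Flash": (1.000, 0.556),
}
gaps = {
    model: camouflage_detection_gap(static, camo)
    for model, (static, camo) in reported.items()
}
```

With these inputs the gap is about 84.1 percentage points for Llama 3.1 8B and about 44.4 percentage points for Gemini 2.0 Flash.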
71. 【2605.21984】Echo: Learning from Experience Data via User-Driven Refinement
Link: https://arxiv.org/abs/2605.21984
Authors: Hande Dong, Xiaoyun Liang, Jiarui Yu, Jiayi Lin, Changqing Ai, Feng Liu, Wenjun Zhang, Rongbi Wei, Chaofan Zhu, Linjie Che, Feng Wu, Xin Shen, Dexu Kong, Xiaotian Wang, Qiuyuan Chen, Bingxu An, Yueting Lei, Qiang Lin
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: faces inherent limitations, faces inherent, inherent limitations, human data, expensive to scale
Comments:
Abstract:Static "human data" faces inherent limitations: it is expensive to scale and bounded by the knowledge of its creators. Continuous learning from "experience data" - interactions between agents and their environments - promises to transcend these barriers. Today, the widespread deployment of AI agents grants us low-cost access to massive streams of such real-world experience. However, raw interaction logs are inherently noisy, filled with trial-and-error and low information density, rendering them inefficient for direct model training. We introduce Echo, a generalized framework designed to operationalize the transition from raw experience to learnable knowledge, effectively "echoing" environmental feedback back into the training loop for model optimization. In today's agent ecosystem, user refinement serves as a primary source of such feedback: driven by responsibility for the outcome, users rigorously transform flawed agent proposals into verified solutions. These user-driven refinement sequences inherently distill agents' crude attempts into high-quality training signals. Echo systematically harvests these signals to continuously align the agent with real-world needs. Large-scale validation in a production code completion environment confirms that Echo effectively harnesses this pipeline, breaking the static performance ceiling by increasing the acceptance rate from 25.7% to 35.7%.
72. 【2605.21965】SpecHop: Continuous Speculation for Accelerating Multi-Hop Retrieval Agents
Link: https://arxiv.org/abs/2605.21965
Authors: Mehrdad Saberi, Keivan Rezaei, Soheil Feizi
Categories: Computation and Language (cs.CL)
Keywords: Large language models, Large language, language models increasingly, solve information-intensive tasks, increasingly use external
Comments:
Abstract: Large language models increasingly use external tools such as web search and document retrieval to solve information-intensive tasks. However, multi-hop tool use in complex tasks introduces substantial latency, since the model must repeatedly wait for tool observations before continuing. We study how to accelerate such trajectories without changing the final trajectory the model would have taken without acceleration, assuming access to faster but less reliable speculator tools. We develop a theoretical framework for lossless speculation in multi-hop tool-use settings, characterizing the optimal achievable latency gain. We propose SpecHop, a continuous speculation framework that maintains multiple speculative threads, verifies predicted observations asynchronously as target tool outputs arrive, commits correct branches, and rolls back incorrect ones. This preserves accuracy while reducing wall-clock latency. We show that SpecHop can approach oracle latency gains with enough active threads. Empirically, on retrieval-augmented multi-hop tasks, SpecHop closely matches theoretical predictions and reduces latency by up to 40% in some settings. Code: this https URL
73. 【2605.21958】Diagnosis Is Not Prescription: Linguistic Co-Adaptation Explains Patching Hazards in LLM Pipelines
Link: https://arxiv.org/abs/2605.21958
Authors: Yoon Jeonghun, Kim Dongchan
Subjects: Computation and Language (cs.CL)
Keywords: LLM agent fails, multi-module LLM agent, multi-module LLM, Diagnostic Paradox empirically, LLM agent
Comments: Preprint. Under review at EMNLP 2026 (ARR)
Click to view abstract
Abstract:When a multi-module LLM agent fails, the module most responsible for the failure is not necessarily the best place to intervene. We demonstrate this Diagnostic Paradox empirically: causal analysis consistently identifies the routing module -- which selects which tool to call next -- as the primary bottleneck across three independent agent families. Yet injecting prompt-level correction examples into this module consistently degrades performance, sometimes severely. Patching an upstream query-rewriting module instead reliably improves outcomes. The effect holds with statistical significance on two agent families and directional consistency on a third; alternative repair strategies at the routing module (instruction rewriting, model upgrade) are neutral, confirming that the harm is specific to correction-injection patching. We explain this asymmetry through the Linguistic Contract hypothesis: each downstream module implicitly adapts to its upstream's characteristic error distribution, so correcting the bottleneck breaks this implicit alignment in a way that upstream corrections do not. We operationalize this via a per-agent co-adaptation measure, derived from diagnosis alone, and show it is consistently associated with patching harm across agent families: higher co-adaptation co-occurs with harm, lower with safety. This trend holds across all three agent families, providing preliminary support for the hypothesis beyond a single-agent observation.
</p>
74. 【2605.21949】Claim-Selective Certification for High-Risk Medical Retrieval-Augmented Generation
Link: https://arxiv.org/abs/2605.21949
Authors: Shao Kan
Subjects: Computation and Language (cs.CL)
Keywords: Medical RAG systems, Medical RAG, PAU Precision, RAG systems, require conditions
Comments: 22 pages, 7 figures, 11 tables
Click to view abstract
Abstract:Medical RAG systems in high-risk QA settings are often evaluated through a single answer-or-abstain decision, but mixed evidence may support one claim, require conditions for another, and contradict a third. We study claim-selective certification: each response is decomposed into verifiable claims, scored against retrieved evidence, and mapped by an intent-aware selector to {full, partial, conflict, abstain}. On the primary weak-label certificate protocol, whose real-source-only dev/test rows cover the naturally occurring non-abstain actions, the full system records UCCR=0.0000, PAU=1.0000, PAU Precision=0.9901, and action accuracy=0.9204 on dev (n=314), and UCCR=0.0000, PAU=0.9967, PAU Precision=0.9739, and action accuracy=0.8997 on test (n=319). UCCR measures unsupported-claim risk within the certificate definition, and a source-missing counterfactual slice evaluates abstain under empty evidence. Shortcut controls quantify the action-label prior explained by source and intent metadata, while source/evidence-novel slices characterize transfer boundaries. The resulting interface separates action-label prediction from evidence-linked claim selection under mixed evidence.
75. 【2605.21902】Planning in the LLM Era: Building for Reliability and Efficiency
Link: https://arxiv.org/abs/2605.21902
Authors: Michael Katz, Harsha Kokel, Kavitha Srinivas, Shirin Sohrabi
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: central capabilities, attention to intelligent, put a spotlight, Growing attention, intelligent agents
Comments: Published at ICAPS 2026
Click to view abstract
Abstract:Growing attention to intelligent agents has put a spotlight on one of their central capabilities: planning. Early attempts to leverage large language models (LLMs) for planning relied on single-shot plan generation, followed by hybrid approaches that coupled LLMs with limited external search. These methods, unsound and incomplete by their very nature, often require substantial resources without yielding better solutions on unseen problems. As the limitations of LLMs become clearer, recent work has shifted toward using them at solution construction time -- generating symbolic solvers for a family of problems that can be verified and then used efficiently at inference time. This trend reflects the growing need for agents that are both reliable and resource-efficient. It also offers a path towards generating maintainable planners with minimal dependence on language models at inference time. In this paper, we argue that this shift reflects a broader realignment of the planning field in the LLM era. We examine three major categories of planner-generation methods, discuss their current limitations, and outline research steps towards a more reliable and efficient LLM-based generation of planners.
76. 【2605.21883】Token-weighted Direct Preference Optimization with Attention
Link: https://arxiv.org/abs/2605.21883
Authors: Chengyu Huang, Zhuohang Li, Sheng-Yen Chou, Claire Cardie
Subjects: Computation and Language (cs.CL)
Keywords: aligns Large Language, Large Language Models, Large Language, Direct Preference Optimization, aligns Large
Comments:
Click to view abstract
Abstract:Direct Preference Optimization (DPO) aligns Large Language Models with human preferences without the need for a separate reward model. However, DPO treats all tokens in responses equally, neglecting the differing importance of individual tokens. Existing token-level PO methods compute the token weights using either token-position-based heuristic functions or probability estimates given by a separately trained model, which lacks robustness and incurs extra training cost. In contrast, we propose Token-weighted DPO (TwDPO) -- a novel training objective grounded on token-weighted RL -- and AttentionPO -- an instantiation of TwDPO that uses attention from the LLM itself to estimate token weights. AttentionPO prompts the LLM to serve as a pairwise judge and check where the model attends when comparing the responses. This design makes AttentionPO content-aware, adjusting weights based on response content, and efficient, incurring only two extra forward passes per example. Experiment results show that AttentionPO significantly improves performance on AlpacaEval, MT-Bench, and ArenaHard, surpassing existing Preference Optimization methods.
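To make the idea of token weighting concrete, here is a minimal sketch built on the standard DPO loss. The weighted sum over per-token log-ratios is an illustrative assumption, not the TwDPO objective from the paper, and the names `weighted_log_ratio` and `dpo_loss` are hypothetical. With all weights equal to 1 it reduces to ordinary sequence-level DPO.

```python
import math
from typing import List, Optional


def weighted_log_ratio(
    policy_logps: List[float],
    ref_logps: List[float],
    weights: Optional[List[float]] = None,
) -> float:
    """Sum of per-token log(pi / pi_ref), optionally weighted per token.

    The weighting scheme is an assumption for illustration only.
    """
    if weights is None:
        weights = [1.0] * len(policy_logps)  # uniform weights = plain DPO
    return sum(w * (p - r) for w, p, r in zip(weights, policy_logps, ref_logps))


def dpo_loss(chosen_ratio: float, rejected_ratio: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log sigmoid(margin)."""
    margin = beta * (chosen_ratio - rejected_ratio)
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Toy per-token log-probabilities (made-up numbers).
chosen = weighted_log_ratio([-1.0, -0.5], [-1.2, -0.9])
rejected = weighted_log_ratio([-2.0, -1.0], [-1.5, -0.8])
loss_uniform = dpo_loss(chosen, rejected)

# Hypothetical weights that emphasise the second token of the chosen response.
chosen_weighted = weighted_log_ratio([-1.0, -0.5], [-1.2, -0.9], weights=[0.5, 1.5])
loss_weighted = dpo_loss(chosen_weighted, rejected)
```

In this toy case the loss falls below log 2 because the policy already prefers the chosen response more strongly than the reference does.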
77. 【2605.21858】Hypergraph as Language
Link: https://arxiv.org/abs/2605.21858
Authors: Mengqi Lei, Guohuan Xie, Shihui Ying, Shaoyi Du, Jun-Hai Yong, Siqi Li, Yue Gao
Subjects: Computation and Language (cs.CL)
Keywords: recently shown strong, shown strong potential, Large language models, recently shown, shown strong
Comments:
Click to view abstract
Abstract:Large language models (LLMs) have recently shown strong potential in modeling relational structures. However, existing approaches remain fundamentally graph-centric: they focus on processing pairwise graph structures into tokens that LLMs can understand. In contrast, many real-world relational patterns do not naturally conform to the pairwise-edge assumption, and are better modeled as high-order associations in hypergraphs. For hypergraph structures, existing methods often fail to preserve the native semantics that multiple objects are jointly connected by the same high-order relation, limiting their ability to exploit complex structures. To address this limitation, we put forth the "Hypergraph as Language" perspective and propose Hyper-Align, a hypergraph-native alignment framework for large language models. Hyper-Align compiles the query-object-centered hypergraph context into hypergraph tokens directly consumable by a base LLM. Specifically, we introduce Hypergraph Incidence Detail Template with Overview (HIDT-O), which serializes high-order association structures into a fixed-shape hybrid template combining local incidence details and overview-level summaries. We then design a Hypergraph Incidence Projector (HIP), which maps native high-order incidence structures into the LLM token space through explicit semantic-structural decoupling and bidirectional message passing between vertices and hyperedges. We further define a concrete Hypergraph-as-Language input protocol, which jointly feeds hypergraph tokens and textual prompts into a frozen base LLM, supporting both vertex-level and hyperedge-level tasks under a unified question-answering paradigm. To systematically evaluate different methods in hypergraph structural modeling, we introduce HyperAlign-Bench. Extensive experiments show that Hyper-Align significantly outperforms existing methods across in-domain and zero-shot evaluations.
78. 【2605.21850】ACC: Compiling Agent Trajectories for Long-Context Training
Link: https://arxiv.org/abs/2605.21850
Authors: Qisheng Su, Zhen Fang, Shiting Huang, Yu Zeng, Yiming Zhao, Kou Shi, Ziao Zhang, Lin Chen, Zehui Chen, Lijun Wu, Feng Zhao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Recent development, renewed demand, Recent, ACC, Agent Context Compilation
Comments:
Click to view abstract
Abstract:Recent development of agents has renewed demand for long-context reasoning capacity of LLMs. However, training LLMs for this capacity requires costly long-document curation or heuristic context synthesis. We observe that agents produce massive trajectories when solving problems, invoking tools and receiving environment observations across many turns. The evidence needed to answer the original question is thus scattered throughout these turns, requiring integration of distant context segments. Nevertheless, standard agent SFT masks tool responses and only trains turn-level tool selection, creating a supervision blind spot where these scattered signals go unused. We propose Agent Context Compilation (ACC), which converts trajectories from search, software engineering, and database querying agents into long-context QA pairs that combine the original question with tool responses and environment observations gathered across multiple turns, training the model to answer directly without tool use. This makes the dependencies between the question and the evidence explicit, enabling direct supervision of long-context reasoning over distant segments without additional annotation. ACC is a simple but effective approach that can be combined with any existing long-context extension or training method, providing scalable supervised fine-tuning data. We validate ACC on long-range dependency modeling tasks through MRCR and GraphWalks, challenging benchmarks requiring cross-turn coreference resolution and graph traversal over extended contexts. Training Qwen3-30B-A3B with ACC achieves 68.3 on MRCR (+18.1) and 77.5 on GraphWalks (+7.6), results comparable to Qwen3-235B-A22B, while preserving general capabilities on GPQA, MMLU-Pro, AIME, and IFEval. Further mechanism analysis reveals that the ACC-trained model exhibits task-adaptive attention restructuring and expert specialization.
79. 【2605.21849】Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift
Link: https://arxiv.org/abs/2605.21849
Authors: Sungjun Lim, Heedong Kim, Andrew Lee, Kyungwoo Song
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Mechanistic interpretability aims, identifying causally responsible, causally responsible internal, Mechanistic interpretability, responsible internal structures
Comments:
Click to view abstract
Abstract:Mechanistic interpretability aims to explain a model's behavior by identifying causally responsible internal structures. Dictionary-based explainers such as sparse autoencoders and transcoders are a primary tool, but their faithfulness under out-of-distribution (OOD) shift has received little systematic attention. We show that distribution shift rotates the subspace that the model actively uses, misaligning the explainer's dictionary trained on in-distribution (ID) activations. We formalize this misalignment as the faithfulness gap, a geometric distance between the ID dictionary and the OOD-active subspace, and show that it controls OOD faithfulness degradation. To reduce this gap, we propose the Geometry-Adaptive Explainer (GAE), which realigns the explainer's dictionary with the OOD-active subspace while preserving the original feature structure. This requires only unlabeled OOD activations and no gradient updates. We prove that GAE improves over the unadapted ID explainer, with excess loss bounded quadratically by the second-moment shift. Empirically, GAE even matches or surpasses all training-based baselines in causal faithfulness across multiple models and OOD settings.
80. 【2605.21845】Comparing LLM and Fine-Tuned Model Performance on NVDRS Circumstance Extraction with Varying Prompt Complexity
Link: https://arxiv.org/abs/2605.21845
Authors: Geoffrey Martin, Xuan Zhong Feng, Yifan Peng
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: United States, extracting structured information, requires extracting structured, death investigation narratives, investigation narratives
Comments: Accepted at IEEE ICHI 2026
Click to view abstract
Abstract:Suicide is a leading cause of death in the United States, and understanding the circumstances that precede it requires extracting structured information from death investigation narratives. Many of these circumstances require semantic inference beyond simple keyword matching. We develop a "Complexity Score" algorithm that analyzes coding manual structure to predict when detailed prompts with full coding guidelines improve over name-only prompts. We then construct a hybrid approach that selects prompt strategy per circumstance. We evaluate large language models (LLMs) against fine-tuned RoBERTa on 25 inferentially complex circumstances from the National Violent Death Reporting System (NVDRS). We found that LLMs substantially outperform on low-prevalence circumstances where training data is insufficient. We further demonstrate that our framework generalizes across frontier LLMs, with GPT-5.2, Gemini 2.5 Pro and Llama-3 70B showing consistent performance patterns. These findings support a hybrid architecture where LLMs handle rare, inferentially complex circumstances while fine-tuned models handle common ones.
81. 【2605.21842】Energy-Gated Attention: Spectral Salience as an Inductive Bias for Transformer Attention
Link: https://arxiv.org/abs/2605.21842
Authors: Athanasios Zeris
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Signal Processing (eess.SP)
Keywords: computes pairwise similarity, Standard transformer attention, attention computes pairwise, Standard transformer, intrinsic informational content
Comments: 12 pages, 4 figures
Click to view abstract
Abstract:Standard transformer attention computes pairwise similarity between queries and keys, treating all tokens as equally salient regardless of their intrinsic informational content. In turbulent fluid dynamics, coherent structures -- the energetically dominant, spatially organized patterns that persist amid background chaos -- carry a disproportionate fraction of total energy and govern all transport. We propose that tokens play an analogous role in transformer attention: informationally dense positions (morphological boundaries, syntactic heads, discourse markers) concentrate spectral energy and should attract proportionally more attention than background tokens (function words, repeated patterns, low-information filler). We propose Energy-Gated Attention (EGA): a simple modification that gates value aggregation by the spectral energy of key token embeddings, computed by a single learned linear projection that discovers the dominant spectral mode of the embedding field. On TinyShakespeare, EGA achieves +0.103 validation loss improvement with only 12,480 additional parameters (0.26% overhead) and no measurable computational cost. The result is consistent on Penn Treebank (+0.101), demonstrating dataset independence. A systematic ablation across three wavelet families (fixed Morlet, Daubechies db2/db4, and a parametric Morlet) establishes that fixed structured bases are suboptimal -- the optimal energy direction is data-adaptive and non-sinusoidal -- while identifying learned wavelet packets as a promising open direction. The learned energy threshold converges to tau ~= 0.35 independently of initialization, corresponding to the fraction (~36%) of tokens carrying above-average spectral energy in English text, a stable linguistic property consistent with the fraction of content words in running English text.
82. 【2605.21827】Does Slightly Mean Somewhat? Measuring Vague Intensity Words in LLM Numeric Actions
Link: https://arxiv.org/abs/2605.21827
Authors: Daniel Tabach (Georgia Institute of Technology)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: produce numeric actions, Claude Haiku receives, language models preserve, Claude Haiku, words
Comments: 9 figures, 2 tables, 16 references
Click to view abstract
Abstract:Do language models preserve the ordinal meaning of intensity words when those words must produce numeric actions? I study a researcher-constructed scale of 10 English degree modifiers, from slightly to drastically, informed by the Quirk et al. degree-modifier taxonomy, in a controlled resource-allocation environment where Claude Haiku receives a natural-language instruction, produces a numeric allocation, and a deterministic backend converts that allocation into a measurable outcome. The only variable that changes between runs is the intensity word or the starting system state, isolating their effects on the model's numeric output. Across 6,620 runs at T=0.0 and T=0.7, three patterns emerge. First, the model compresses 10 intensity words into 5 distinct median outputs: four lower-tier words all map to the same value, while stronger words break into higher regimes (Spearman rho = 0.845, p < 0.001). Second, when the current system state is supplied as context, separate Kruskal-Wallis tests show that grouping by starting allocation captures far more rank-based variance than grouping by word (epsilon-squared baseline = 0.782 vs. epsilon-squared word = 0.079), and lexical differentiation collapses to zero as the system approaches capacity. Third, near feasibility limits the model exhibits three behavioral modes: weak words hedge with small adjustments, strong words abstain entirely, and the word drastically pushes to the local ceiling. These patterns persist across temperature, with stochastic sampling broadening distributions but not restoring ordinal distinctions between words. In this model and domain, the model's numeric interpretation of vague intensity words is compressed, state-dependent, and discontinuous near operational boundaries.
</p>
83. 【2605.21807】When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering
Link: https://arxiv.org/abs/2605.21807
Authors: Doeun Lee, Muge Zhang, Yi Yu, Ashish Manne, Stephen Koesters, Frank Wen, Brady Buchanan, Lynda Villagomez, Oluwatoba Moninuola, James Lim, Kathryn Tobin, Andrew Srisuwananukorn, Ping Zhang, Sachin Kumar
Subjects: Computation and Language (cs.CL)
Keywords: codify best studied, studied diagnostic, diagnostic and treatment, medical, treatment pathways
Comments: 34 pages, 20 figures
Click to view abstract
Abstract:Across medical specialties, clinical practice is anchored in evidence-based guidelines that codify best studied diagnostic and treatment pathways. These pathways routinely fall short for the long tail of real-world care not covered by guidelines. Most medical large language models (LLMs), however, are trained to encode common, guideline-focused medical knowledge in their parameters. Current evaluations test models primarily on recalling and reasoning with this memorized content, often in multiple-choice settings. Given the fundamental importance of evidence-based reasoning in medicine, it is neither feasible nor reliable to depend on memorization in practice. To address this gap, we introduce OGCaReBench, a free-form retrieval-focused benchmark aimed at evaluating LLMs at answering clinical questions that require going beyond typical guidelines. Extracted from published medical case reports and validated by medical experts, OGCaReBench contains long-form clinical questions requiring free-text answers, providing a systematic framework for assessing open-ended medical reasoning in rare, case-based scenarios. Our experiments reveal that even the best-performing baseline (GPT-5.2) correctly answers only 56% of our benchmark with specialized models only reaching 42%. Augmenting models with retrieved medical articles improves this performance to up to 82% (using GPT-5.2) highlighting the importance of evidence-grounding for real-world medical reasoning tasks. This work thus establishes a foundation for benchmarking and advancing both general-purpose and medical LLMs to produce reliable answers in challenging clinical contexts.
84. 【2605.21801】Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization
Link: https://arxiv.org/abs/2605.21801
Authors: Zheyuan Zhang, Kaiwen Shi, Han Bao, Zehong Wang, Tianyi Ma, Yanfang Ye
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: large language models, critic-free models enable, models enable scalable, enable scalable learning, language models
Comments:
Click to view abstract
Abstract:Post-training has become central to improving reasoning and alignment in large language models, where critic-free models enable scalable learning from model-generated outputs but lack principled mechanisms to distinguish informative from noisy signals. Recent approaches leverage response-level measures as uncertainty signals to regulate group-based optimization methods such as GRPO. Yet their empirical success remains unstable and unclear in how they influence optimization dynamics. In this paper, we provide, to our knowledge, the first principled formulation that interprets uncertainty signals as mechanisms for characterizing and regulating gradient variance and learning signal quality. Based on both empirical and theoretical analysis, we identify two critical gaps of current entropy-based estimators: The anisotropic gap and The calibration gap. Motivated by this analysis, we propose Geometric-aware Calibrated Policy Optimization (GCPO), a novel framework integrating geometry-aware measures to capture semantic disagreement with reward-based calibration to align uncertainty with learning signal strength. Experiments on multiple benchmarks show that our approach more faithfully tracks gradient variability and consistently improves post-training performance. Our results highlight the importance of designing uncertainty signals that are aligned with optimization dynamics, offering a principled perspective for robust post-training.
85. 【2605.21796】MM-Conv: A Multimodal Dataset and Benchmark for Context-Aware Grounding in 3D Dialogue
Link: https://arxiv.org/abs/2605.21796
Authors: Anna Deichler, Jim O'Regan, Fethiye Irmak Dogan, Lubos Marcinek, Anna Klezovich, Iolanda Leite, Jonas Beskow
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: physical world requires, dynamically during conversation, physical world, world requires, requires AI systems
Comments: Extended version of the paper published at LREC 2026 (Palma de Mallorca, Spain), with expanded VLM baselines and inter-annotator agreement analysis
Click to view abstract
Abstract:Grounding language in the physical world requires AI systems to interpret references that emerge dynamically during conversation. While current vision-language models (VLMs) excel at static image tasks, they struggle to resolve ambiguous expressions in spontaneous, multi-turn dialogue. We address this gap by introducing (1) a benchmark for referential communication in dynamic 3D environments, built from 6.7 hours of egocentric VR interaction with synchronized speech, motion, gaze, and 3D scene geometry, and (2) a two-stage grounding pipeline that explicitly resolves conversational ambiguity before visual localization. The benchmark includes over 4,200 manually verified referring expressions spanning full, partitive, and pronominal types. Our contextual rewriting approach improves grounding performance by 11-22 percentage points on average, with a pure detector (GroundingDINO) reaching 56.7% on pronominals after rewriting, nearly double the best end-to-end baseline. Results demonstrate that decoupling linguistic reasoning from visual perception is more effective than end-to-end approaches for conversational grounding.
86. 【2605.21792】Residual Skill Optimization for Text-to-SQL Ensembles
Link: https://arxiv.org/abs/2605.21792
Authors: Jiongli Zhu, Haoquan Guan, Parjanya Prajakta Prashant, Nikki Lijing Kuang, Seyedeh Baharan Khatami, Canwen Xu, Xiaodong Yu, Yingyu Lin, Zhewei Yao, Yuxiong He, Babak Salimi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
Keywords: drawing multiple SQL, multiple SQL candidates, multiple SQL, SQL candidates, single-candidate generation
Comments:
Click to view abstract
Abstract:Text-to-SQL ensembles improve over single-candidate generation by drawing multiple SQL candidates and selecting one, but their effectiveness is bounded by Pass@K, the probability that at least one of K candidates is correct. Existing methods source diversity heuristically through stochastic decoding or prompt variants, leaving candidate sets dominated by correlated failures. We present DivSkill-SQL, a residual skill optimization framework that builds complementary agentic Text-to-SQL ensembles without model fine-tuning: each new skill is optimized on examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K. On Spider2-Lite, DivSkill-SQL improves selected accuracy by up to +11.1 points on Snowflake and +8.3 on BigQuery over the strongest ensemble baseline, with consistent gains across two base models (Opus-4.6 and GPT-5.4). Skills optimized on a single dialect transfer without retraining across dialects (Snowflake, BigQuery, SQLite) and to a different task formulation, such as BIRD-Critic (+2.6 pts). Error diagnostics show up to 3x fewer hallucinated schema references and function calls, indicating that gains come from genuinely reliable complementary skills rather than surface-form variation.
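As a rough illustration of the two quantities the abstract relies on, the sketch below computes an empirical Pass@K (the fraction of examples with at least one correct candidate) and the residual set of examples that every current candidate gets wrong. The data layout and the function names `pass_at_k` and `residual_examples` are assumptions made for this example and are not taken from DivSkill-SQL.

```python
from typing import Dict, List, Set


def pass_at_k(correct: Dict[str, List[bool]]) -> float:
    """Fraction of examples where at least one of the K candidates is correct."""
    if not correct:
        return 0.0
    solved = sum(1 for flags in correct.values() if any(flags))
    return solved / len(correct)


def residual_examples(correct: Dict[str, List[bool]]) -> Set[str]:
    """Examples on which every candidate from the current ensemble fails."""
    return {example for example, flags in correct.items() if not any(flags)}


# Hypothetical per-example correctness of two skills' SQL candidates.
results = {
    "q1": [True, False],
    "q2": [False, False],
    "q3": [False, True],
    "q4": [False, False],
}

current_pass_at_k = pass_at_k(results)
to_target_next = residual_examples(results)
```

Under this reading, a new skill can only raise Pass@K by solving examples in the residual set, which is why optimizing on those examples targets the marginal contribution.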
87. 【2605.21781】Reflective Prompt Tuning through Language Model Function-Calling
Link: https://arxiv.org/abs/2605.21781
Authors: Farima Fatahi Bayat, Moin Aminnaseri, Pouya Pezeshkpour, Estevam Hruschka
Subjects: Computation and Language (cs.CL)
Keywords: Large language models, Large language, making prompting, parameter updates, increasingly capable
Comments: 17 pages, 6 figures
Click to view abstract
Abstract:Large language models (LLMs) have become increasingly capable of following instructions and complex reasoning, making prompting a flexible interface for adapting models without parameter updates. Yet prompt design remains labor-intensive and highly sensitive to formatting, phrasing, and instruction order, motivating automated prompt optimization methods that reduce manual effort while preserving inference-time flexibility. However, existing methods often search over prompt candidates or use fixed critique-refine pipelines driven by individual examples or small batches, limiting their ability to capture systematic error patterns and make targeted edits grounded in failure history. We propose Reflective Prompt Tuning (RPT), a framework that uses LLM function calling to simulate the iterative workflow of human prompt engineers. An LLM optimizer calls a diagnostic function that evaluates the target model over an entire optimization set, summarizes recurring failure modes, and returns a structured diagnostic report. The optimizer uses this report, together with an accumulated memory of prior reports, to revise the prompt for the next iteration. RPT further supports confidence-aware optimization by using calibration signals in diagnostic feedback and final prompt selection. Across three reasoning tasks, RPT improves over initial prompts by up to 12.9 points, remains competitive with state of the art, and improves confidence calibration. Our analyses show that RPT is especially effective on multi-hop and mathematical reasoning, producing targeted prompt revisions that align with diagnosed failure patterns and lead to gains in task performance and calibration.
88. 【2605.21776】PromptNCE: Pointwise Mutual Information Predictions Using Only LLMs and Contrastive Estimation Prompts
Link: https://arxiv.org/abs/2605.21776
Authors: Juliette Woodrow, Chris Piech
Subjects: Computation and Language (cs.CL)
Keywords: Estimating mutual information, Estimating mutual, task-specific critic, pointwise mutual information, mutual information
Comments:
Click to view abstract
Abstract:Estimating mutual information from text usually requires training a task-specific critic, which limits its use in low-data settings. We ask whether large language models can instead estimate pointwise mutual information zero-shot, using only prompts and elicited probabilities. We introduce a benchmark with human-derived ground-truth PMI across three publicly available datasets, and evaluate five information-theoretic prompting-based estimators. Our main method, PromptNCE, frames conditional probability estimation as a contrastive task and augments the candidate set with an explicit OTHER category. We show theoretically that adding OTHER recovers the true conditional P(y | x) rather than just a ranking over listed candidates, turning a contrastive prompt into a general-purpose zero-shot probability estimator. PromptNCE is the best zero-shot method on all three datasets, reaching Spearman correlation up to 0.82 with human-derived PMI. We also present a case study in computer science education showing how these estimators can be used to score student knowledge summaries in a low-data setting.
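For readers less familiar with the quantity being estimated, pointwise mutual information is PMI(x; y) = log(P(y | x) / P(y)). The sketch below shows that definition together with one plausible reading of the OTHER category: elicited scores over the listed candidates plus OTHER are normalized to sum to 1, so a candidate's share can be read as a conditional probability rather than a ranking. The normalization step, the marginal value, and the function names are assumptions for illustration. They are not the paper's prompts or estimator.

```python
import math
from typing import Dict


def normalize_with_other(scores: Dict[str, float]) -> Dict[str, float]:
    """Normalize elicited scores over listed candidates plus an OTHER bucket."""
    if "OTHER" not in scores:
        raise ValueError("expected an explicit OTHER category")
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must sum to a positive value")
    return {label: value / total for label, value in scores.items()}


def pmi(p_y_given_x: float, p_y: float) -> float:
    """Pointwise mutual information in nats: log(P(y|x) / P(y))."""
    return math.log(p_y_given_x / p_y)


# Hypothetical elicited scores for one context x (made-up numbers).
conditional = normalize_with_other({"a": 3.0, "b": 1.0, "OTHER": 6.0})

# Assumed marginal P(y = "a"), e.g. elicited without the context x.
marginal_a = 0.15
pmi_a = pmi(conditional["a"], marginal_a)
```

Without the OTHER bucket, the two listed candidates would be forced to absorb all probability mass, and the value for "a" would be 0.75 instead of 0.3.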
89. 【2605.21748】RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator
Link: https://arxiv.org/abs/2605.21748
Authors: Zhenwei Tang, Zhaoyan Liu, Rasa Hosseinzadeh, Tongzi Wu, Keyvan Golestan, Jesse C. Cresswell
Subjects: Computation and Language (cs.CL)
Keywords: interactive LLM-based applications, generated text, created and refined, interactive LLM-based, LLM-based applications
Comments:
Click to view abstract
Abstract:As interactive LLM-based applications are created and refined, model developers need to evaluate the quality of generated text along many possible axes. For simpler systems, human evaluation may be practical, but in complicated systems like conversational chatbots, the amount of generated text can overwhelm human annotation resources. Model developers have begun to rely heavily on auto-evaluation, where LLMs are also used to judge generation quality. However, existing LLM-as-a-judge benchmarks largely focus on simple Q&A tasks that do not match the complexity of multi-turn conversations. We introduce RankJudge, a benchmark generator for evaluating LLM-as-a-judge on multi-turn conversations grounded in reference documents. RankJudge creates pairs of conversations where one conversation has a single flaw injected into one turn. This construction allows paired conversations to be labeled unambiguously as better or worse, and precisely isolates failure categories to individual turns, enabling a strict joint correctness criterion for judging. We implement RankJudge across the domains of machine learning, biomedicine, and finance, evaluate 21 frontier LLM judges, and rank those judges via the Bradley-Terry model. Our formulation also allows ranking each conversation pair with difficulty ratings, which we use to dynamically curate the evaluation slice to reduce label noise, as confirmed via human annotation. We find that judge rankings are stable under partial observability, coarser correctness criteria, and an alternative random-walk rating algorithm.
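The Bradley-Terry model mentioned above assigns each competitor a positive strength p_i such that P(i beats j) = p_i / (p_i + p_j). Below is a minimal sketch of fitting those strengths with the classic minorization-maximization (MM) update. How RankJudge turns judge outputs into pairwise wins is not stated in the abstract, so the `wins` matrix here is a hypothetical input, and `fit_bradley_terry` is an illustrative name.

```python
from typing import List


def fit_bradley_terry(wins: List[List[int]], iters: int = 200) -> List[float]:
    """Fit Bradley-Terry strengths with the MM update.

    wins[i][j] is the number of times item i beat item j.
    Assumes every item has at least one win, so strengths stay positive.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Comparisons involving i, each weighted by 1 / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(updated)
        p = [value / scale for value in updated]  # keep strengths summing to 1
    return p


# Hypothetical pairwise outcomes among three judges (10 comparisons per pair).
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
strengths = fit_bradley_terry(wins)
ranking = sorted(range(len(strengths)), key=lambda i: strengths[i], reverse=True)
```

Because each pair is compared equally often in this toy matrix, the fitted ranking agrees with the raw win counts. The model matters more when comparisons are unbalanced.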
90. 【2605.21728】BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model
Link: https://arxiv.org/abs/2605.21728
Authors: Gonçalo Gomes,Bruno Martins,Chrysoula Zerva
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Image captioning evaluation, vision-language models evolve, Large Language Models, captioning evaluation remains, Image captioning
Comments:
Abstract:Image captioning evaluation remains a significant challenge, as vision-language models evolve toward more challenging capabilities such as generating long-form and context-rich descriptions. State-of-the-art evaluation metrics involve extensive computational costs associated with the use of Large Language Models (LLMs) as judges, or instead suffer from the limitations of standard CLIP-based encoders, such as strict token limits, lack of fine-grained sensitivity, or lack of compositional generalization by treating captions as "bags-of-words." We propose a new learned metric that tackles the aforementioned challenges, based on a lightweight cross-encoder that is initialized from a visual question-answering model checkpoint, balancing a strong weight initialization with computational efficiency. Our training scheme uses a carefully assembled data mixture for supervised learning, featuring adversarial LLM-based data augmentations to enhance model sensitivity to fine-grained visual-linguistic errors. We also introduce a new benchmark designed to assess detailed captioning evaluation across diverse scenarios. Experimental results demonstrate that the proposed metric achieves state-of-the-art performance while maintaining the efficiency required for large-scale benchmarking, quality-aware decoding, or reward guidance.
91. 【2605.21726】Probabilistic Attribution For Large Language Models
Link: https://arxiv.org/abs/2605.21726
Authors: Shilpika Shilpika,Carlo Graziani,Bethany Lusch,Venkatram Vishwanath,Michael E. Papka
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, nature of Large, Language Models, generative nature
Comments: 29 pages, 13 figures
Abstract:The generative nature of Large Language Models (LLMs) is reflected in the conditional probabilities they compute to sample each response token given the previous tokens. These probabilities encode the distributional structure that the model learns in training and exploits in inference. In this work, we use these probabilities to situate LLMs within the mathematical theory of stochastic processes. We use this framework to design a model-agnostic probabilistic token attribution measure, using Bayes' rule to invert the next-token log-probabilities so as to capture the model's internal representation of the distribution over token sequences. The representation is independent of the model's computational structure. This representation yields the conditional probability of the response given the prompt, and of the response given the prompt with a token marginalized away. Our attribution score is the log of the ratio of these probabilities. We further compute the entropies of a single prompt's token distributions, conditioned on the remaining context. The interplay between entropy and attribution score sheds light on LLM behavior. We evaluate 8 models across 7 prompts and investigate anomalies, token sensitivity, response stability, model stability, and training convergence, thereby improving interpretability and guiding users to focus on uncertain or unstable parts of the generation.
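One way to write down the score described in this abstract is sketched below. The notation is an assumption made for illustration and is not taken from the paper: $x$ is the prompt, $r = (r_1, \dots, r_T)$ the response, and $x_{\setminus i}$ the prompt with token $i$ marginalized away. The orientation of the ratio (full prompt in the numerator) is also assumed.

```latex
% Chain rule: the response log-probability is a sum of next-token log-probabilities.
\log P(r \mid x) = \sum_{t=1}^{T} \log P(r_t \mid x, r_{<t})

% Attribution score for prompt token i, as a log-ratio (orientation assumed).
A_i = \log \frac{P(r \mid x)}{P(r \mid x_{\setminus i})}
```

Under this reading, a large positive $A_i$ would mean the response is much more probable when token $i$ is present than when it is marginalized away.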
92. 【2605.21713】Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews
Link: https://arxiv.org/abs/2605.21713
Authors: André V. Duarte,Brian Tufts,Aditya Oke,Fei Fang,Arlindo L. Oliveira,Lei Li
Categories: Computation and Language (cs.CL)
Keywords: reviews, textual features, human, peer reviews, peer
Comments:
Abstract:How can we distinguish whether a peer review was written by a human or generated by an AI model? We argue that, in this setting, authorship should not be attributed solely from the textual features of a review, but also from the ideas, judgments, and claims it expresses. To this end, we propose Sem-Detect, an authorship detection method for peer reviews that operationalizes this principle by combining textual features with claim-level semantic analysis. Sem-Detect compares a target review against multiple AI-generated reviews of the same paper, leveraging the observation that different AI models tend to converge on similar points, while human reviewers introduce more unique and diverse ones. As a result, Sem-Detect is able to distinguish fully AI reviews from authentic human-written ones, including those that have been refined using an LLM but still reflect human judgment. Across a dataset of over 20,000 peer reviews from ICLR and NeurIPS conferences, Sem-Detect improves over the strongest baseline by 25.5% in TPR@0.1% FPR in the binary setting. Moreover, in the three-class scenario, we empirically show that LLM refinement preserves the semantic signals of human reviews, which remain distinct from the patterns exhibited by fully AI-generated text; as a result, fewer than 3.5% of LLM-refined human reviews are misclassified as AI-generated.
93. 【2605.21712】Broadening Access to Transportation Safety Data with Generative AI: A Schema-Grounded Framework for Spatial Natural Language Queries
Link: https://arxiv.org/abs/2605.21712
Authors: Mahdi Azhdari,Eric J. Gonzales
Categories: Computation and Language (cs.CL)
Keywords: access remains uneven, GIS-based workflows, community stakeholders, safety analysis requires, Transportation safety
Comments: 30 pages, 5 figures
Abstract:Transportation safety analysis requires integrating crash records, roadway attributes, and geospatial data through GIS-based workflows, but access remains uneven across agencies and community stakeholders. Technical prerequisites create a gap between analytical tools central to safety planning and the practitioners able to use them. Local agencies, school committees, and residents may have safety concerns but limited capacity to retrieve, filter, map, and analyze relevant data. Generative AI offers a way to narrow this divide, but its public-sector use raises questions about reliability, reproducibility, and governance. This paper presents a schema-grounded natural language interface for transportation safety analysis, using a large language model (LLM) to interpret user intent while preserving deterministic, reviewable execution against an authoritative database. User queries are translated into structured semantic frames, validated by a rule-based layer, compiled into a typed directed acyclic graph of spatial operations, and executed against a PostGIS database. This bounded design separates language interpretation from deterministic execution, keeping results reproducible and schema-grounded while removing access barriers. The framework is evaluated using a statewide Massachusetts transportation safety database integrating crash records, roadway attributes, and geospatial layers including schools, bus stops, crosswalks, and municipal boundaries. All queries executed successfully; the validation layer corrects errors in 29% of evaluation queries, reflecting the gap between flexible natural language and strict schema-grounded requirements. The results suggest that combining natural language accessibility with deterministic execution is a practical direction for broadening access to transportation safety data, with implications for trustworthy AI in public-sector planning.
94. 【2605.21699】X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation
Link: https://arxiv.org/abs/2605.21699
Authors: Sharath Turuvekere Sreenivas,Adithyakrishna Venkatesh Hanasoge,Mingyu Yang,Ali Taghibakhshi,Saurav Muralidharan,Ashwath Aithal,Pavlo Molchanov
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Cross-tokenizer knowledge distillation, Cross-tokenizer knowledge, incompatible vocabularies, model to learn, Cross-tokenizer
Comments:
Abstract:Cross-tokenizer knowledge distillation allows a student model to learn from teachers with incompatible vocabularies. Prior work operates on hidden states or logits; the latter is preferred as a drop-in replacement requiring no auxiliary components. Logit-based methods either use only the correct-token probability, missing the full 'dark knowledge' in the teacher's distribution, or operate on the full output distribution, relying on strict token partitioning and/or unprincipled heuristic ranking. We identify two key shortcomings of full-distribution, logit-based methods: (i) an uncommon-token failure, where critical tokens fall into the unmatched subset (e.g., Llama's 1100 multi-digit numerals under digit-splitting Qwen supervision) and are suppressed during training, reducing GSM8k from 12.89 to 2.56 compared to same-tokenizer KD from a weaker teacher; and (ii) over-conservative matching, where strict 1-to-1 matching excludes near-equivalent tokens across surface forms. These failures require distinct remedies: eliminating the partition when critical tokens are misaligned, and refining it when alignment is reliable. We propose X-Token, an approach with two complementary loss formulations targeting these issues. P-KL removes partitioning and aligns the student's distribution with the teacher's via a sparse projection matrix W (initialized from tokenizer-level string rules) to address the uncommon-token failure. H-KL retains the hybrid form while relaxing matching to align each student token with its top-ranked teacher mapping under W. Both objectives share W and extend naturally to multiple teachers. Empirically, on Llama-3.2-1B, X-Token outperforms the current state of the art GOLD by +3.82 average points with a Qwen3-4B teacher and by +0.5 with a Phi-4-Mini teacher. Further, a two-teacher setup (Phi-4-mini + Llama-3B) improves over single-teacher distillation by +1.3 points.
95. 【2605.21654】Value-Gradient Hypothesis of RL for LLMs
Link: https://arxiv.org/abs/2605.21654
Authors: Arip Asadulaev,Daniil Ognev,Karim Salta,Martin Takac
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Reinforcement learning substantially, pretrained language models, learning substantially improves, substantially improves pretrained, improves pretrained language
Comments:
Abstract:Reinforcement learning substantially improves pretrained language models, but it remains understudied why critic-free methods such as PPO and GRPO work as well as they do, and when they should provide the largest gains. We develop a value-gradient perspective of critic-free RL for LLM post-training. First, under a differentiable rollout and additive-noise parameterization, we show that the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient. Second, for discrete transformer policies, we show that autodifferentiation through attention produces empirical costates that approximate this value signal, with an error controlled by the sampling gap and policy entropy. These results motivate a decomposition of RL impact into value gradient signal and reachable reward headroom, yielding a criterion for when RL should be most effective along a pretraining trajectory.
96. 【2605.21653】Amplifying, Not Learning: Fine-Tuned AI Text Detectors Amplify a Pretrained Direction
Link: https://arxiv.org/abs/2605.21653
Authors: Alexander Smirnov
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: text detectors amplify, amplify a pretrained, pretrained typicality axis, AUROC, pretrained typicality
Comments:
Abstract:AI text detectors amplify a pretrained typicality axis; they do not construct an AI-vs-human boundary. On raw encoders before any task supervision, projecting onto centroid(AI)-centroid(HC3) achieves NYT-vs-HC3 AUROC 0.806/0.944/0.834 across three architectures (86-106% of the fine-tuned discrimination ceiling: on RoBERTa-base, raw projection exceeds fine-tuning); on RoBERTa-base, full fine-tuning reduces discrimination below raw on both fluent-formal populations tested. The same axis inverts on non-native ESL writing (AUROC 0.06-0.20) -- a falsifiable prediction unique to the typicality reading. A 24-example frozen probe matches full fine-tuning (0.900 vs 0.895). A closed-form Jacobian predictor parameterises axis-manipulating interventions with R^2 = 1.000 universal, lifts ELECTRA-CE deployment TPR from 0.000 to 0.904 at FPR = 1%, and transfers to three independently-trained third-party RoBERTa detectors at 16/16 oracle-equivalence (57% NYT-FPR reduction on the OpenAI detector). Scope: encoder family; mechanism magnitude HC3-anchored; population-level shared axis with per-text mechanisms varying across architectures. Three operationally distinct probes -- text-surface caps_rate residualisation, geometric signed-epsilon ablation, closed-form text-pair predictor -- agree at cos 0.74/0.81/1.00 across three architectures, confirming observer-invariance. Under matched-TPR-0.90 evaluation, the published intervention zoo (CC, dealign-f2c) is calibration-equivalent across 27 cells (|Delta AUROC| = 0.0081), and = 97% of the LoRA-full-FT bias gap on ELECTRA is calibration shift, not learned representation -- the central claim's prediction confirmed.
97. 【2605.21649】EntmaxKV: Support-Aware Decoding for Entmax Attention
Link: https://arxiv.org/abs/2605.21649
Authors: Gonçalo Duarte,Miguel Couceiro,Marcos V. Treviso
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: KV-cache memory traffic, size grows linearly, generated token attends, sparse decoding, increasingly limited
Comments:
Abstract:Long-context decoding is increasingly limited by KV-cache memory traffic since each generated token attends over a cache whose size grows linearly with context length. Existing sparse decoding methods reduce this cost by selecting subsets of tokens or pages, but are designed for softmax attention, whose dense tails make any truncation discard nonzero probability mass. In contrast, $\alpha$-entmax produces exact zeros, turning sparse decoding from dense-tail approximation into support recovery: if the selected candidates contain the entmax support, sparse decoding remains exact. While recent entmax kernels enable efficient training, they do not address the autoregressive decoding bottleneck, where dense inference still streams the full KV cache before sparsity is known. In this work, we introduce EntmaxKV, an entmax-native sparse decoding framework that exploits sparsity before KV pages are loaded. EntmaxKV combines query-aware page scoring, support-aware candidate selection, and sparse entmax attention. We analyze truncation error through the dropped probability mass $\delta$, showing that output error is controlled by $\delta$ and vanishes when the entmax support is recovered. We further introduce a Gaussian-aware entmax selector that estimates the entmax threshold from lightweight page statistics, adapting the selected budget to the score distribution. Empirically, EntmaxKV drops less probability mass, retains more support tokens, and achieves lower output error than softmax-based sparse decoding at matched KV budgets. On long-context and language modeling benchmarks, it closely matches full-cache entmax while using a small fraction of the KV cache, achieving up to $3.36\times$ (softmax) and $5.43\times$ (entmax) speedup over full attention baselines at 1M context length. Code available at: this https URL.
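The exactness claim in this abstract (sparse decoding stays exact when the candidates contain the entmax support) can be checked on a toy example. The sketch below uses sparsemax, which is the $\alpha = 2$ case of $\alpha$-entmax, in pure Python. It is not the EntmaxKV implementation: page scoring and the Gaussian-aware selector are not shown, and the scores and candidate set are made up.

```python
def sparsemax(scores):
    """Sparsemax (alpha-entmax with alpha = 2): projection of scores onto the simplex."""
    z = sorted(scores, reverse=True)
    cumsum, tau = 0.0, 0.0
    for j, zj in enumerate(z, start=1):
        cumsum += zj
        # The condition holds for a prefix of the sorted scores; keep the last hit.
        if 1.0 + j * zj > cumsum:
            tau = (cumsum - 1.0) / j
    return [max(s - tau, 0.0) for s in scores]


# Hypothetical attention scores for five cached tokens.
scores = [2.0, 1.5, 0.1, -1.0, 1.9]
full = sparsemax(scores)
support = [i for i, p in enumerate(full) if p > 0.0]

# A candidate subset that contains the whole support (plus one extra token).
candidates = [0, 1, 2, 4]
sparse = sparsemax([scores[i] for i in candidates])

print(support)
print([round(p, 4) for p in full])
print([round(p, 4) for p in sparse])
```

Because the threshold is determined only by the tokens in the support, attending over any candidate set that contains the support reproduces the full-cache weights, and the tokens left out already had weight zero.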
98. 【2605.21625】Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
Link: https://arxiv.org/abs/2605.21625
Authors: Aditya Chetan,Eric Cai,Peeyush Kushwaha,Bharath Raj Nagoor Kani,Utkarsh Mall,Qianqian Wang,Noah Snavely,Bharath Hariharan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Vision-Language Models, Vision-Language Models, emergence of Large, Large Vision-Language, Models
Comments: CVPR 2026
Abstract:The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. Furthermore, these benchmarks often rely on entities that can be easily identified verbally, like household objects, animals, human subjects, etc., limiting their applicability to complex, in-the-wild video scenarios. But, many applications such as furniture assembly, cooking, etc., require step-by-step fine-grained spatio-temporal understanding of the video, which is not sufficiently evaluated in current benchmarks. To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly state, understanding part mating, and tracking, using multiple-choice questions paired with visual prompts highlighting relevant parts as references for fine-grained questions. Our experiments reveal that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, highlighting their limitations in effectively leveraging temporal information from videos, limited tracking ability, and understanding of spatial interactions like physical contact.
99. 【2605.21609】CR4T: Rewrite-Based Guardrails for Adolescent LLM Safety
Link: https://arxiv.org/abs/2605.21609
Authors: Heajun An,Qi Zhang,Vedanth Achanta,Jin-Hee Cho
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Keywords: Large language models, mediating information seeking, Large language, emotionally sensitive interactions, adolescent digital environments
Comments:
Abstract:Large language models (LLMs) are increasingly embedded in adolescent digital environments, mediating information seeking, advice, and emotionally sensitive interactions. Yet existing safety mechanisms remain largely grounded in adult-centric norms and operationalize safety through refusal-oriented suppression. While such approaches may reduce immediate policy violations, they can also create conversational dead-ends, limit constructive guidance, and fail to address the developmental vulnerabilities inherent in adolescent-AI interactions. We argue that adolescent LLM safety should be framed not solely as a filtering problem, but as a socio-technical, developmentally aligned transformation problem. To operationalize this perspective, we propose Critique-and-Revise-for-Teenagers (CR4T), a model-agnostic safeguarding framework that selectively reconstructs unsafe or refusal-style outputs into age-appropriate, guidance-oriented responses while preserving benign intent. CR4T combines lightweight risk detection with domain-conditioned rewriting to remove risk-amplifying content, reduce unnecessary conversational shutdown, and introduce developmentally appropriate guidance. Experimental results show that targeted rewriting substantially reduces unsafe and refusal-oriented outcomes while avoiding unnecessary intervention on acceptable interactions. These findings suggest that selective response reconstruction offers a more human-centered alternative to refusal-centric guardrails for adolescent-facing LLM systems.
100. 【2605.21558】From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment
Link: https://arxiv.org/abs/2605.21558
Authors: Hao Chen,Qi Zhang,Liyao Li,Zhanming Shen,Wentao Ye,Lirong Gao,Ningtao Wang,Xing Fu,Xiaoyu Shen,Junbo Zhao
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Large Language Models, Adapting Large Language, Adapting Large, Language Models, Large Language
Comments: Accepted@ICML26, 28 pages, 11 figures, 26 tables
Abstract:Adapting Large Language Models (LLMs) to specialized domains typically incurs high data and computational overhead. While prior efficiency efforts have largely treated data selection and parameter-efficient fine-tuning as isolated processes, our empirical analysis suggests they may be intrinsically coupled. We posit the Strong Map Hypothesis: a sparse subset of attention heads plays a dominant role in task-specific adaptation, acting as keys that unlock specific data patterns. Building on this observation, we propose From Parameters to Data (P2D), a unified framework that leverages these task-sensitive attention heads as a dual compass for both sample mining and structural pruning. To rigorously quantify the total pipeline cost, we introduce the Alignment Efficiency Ratio (AER) metric for both selection latency and training time. Mechanistically, P2D identifies critical heads via a lightweight proxy and uses them as a functional filter to curate high-affinity data, establishing a synergistic pipeline. Empirically, by updating merely 10% of attention heads on 10% of the data, P2D achieves an 8.3 pp performance gain over strong baselines and delivers a 7.0x end-to-end time speedup. These results validate that precise parameter-data synchronization eliminates redundancy, offering a new paradigm for efficient alignment.
101. 【2605.21540】Detecting Synthetic Political Narratives in Cross-Platform Social Media Discourse
Link: https://arxiv.org/abs/2605.21540
Authors: Despoina Antonakaki,Sotiris Ioannidis
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: synthetic political communication, synthetic political narratives, large language models, Synthetic Narrative Coordination, Narrative Coordination Score
Comments:
Abstract:The proliferation of large language models has introduced a new paradigm of synthetic political communication in which narratives may be generated, semantically coordinated, and strategically disseminated across platforms at scale. We present a cross-platform framework for detecting synthetic political narratives using four coordination signals -- lexical diversity D(C), temporal burstiness B(C), rhetorical repetition R(C), and semantic homogenization H(C) -- combined into a Synthetic Narrative Coordination Score SNC(C). We apply the framework to a corpus of 353,223 records spanning six geopolitical event windows collected from six Telegram channels and nine Reddit communities (2023--2026). Results show that IntelSlava exhibits the lowest lexical diversity (MATTR 0.52--0.54), the highest burstiness (B=+0.48 to +0.73), and the highest rhetorical overlap with peer channels (Jaccard 0.12), ranking first in the composite SNC(C) on four of six event windows (SNC 0.45--0.60). Rybar ranks last on all windows despite its high semantic homogenization, because its Russian-language output yields high lexical diversity and near-zero rhetorical Jaccard with English-language channels -- demonstrating that no single indicator is sufficient for coordination detection. Multi-dimensional SNC(C) scoring provides a more robust and interpretable signal than any individual metric.
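Two of the quantities reported in this abstract, MATTR and Jaccard overlap, have standard textbook definitions, sketched below. The window size, the tokenization, and the sample strings are assumptions, and the abstract does not say how the four signals are weighted into SNC(C), so no composite score is computed here.

```python
def mattr(tokens, window=50):
    """Moving-average type-token ratio: mean share of unique tokens per sliding window."""
    if not tokens:
        return 0.0
    if len(tokens) < window:
        # Fall back to a plain type-token ratio for short inputs.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def jaccard(a, b):
    """Jaccard overlap between two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical toy inputs; real use would tokenize channel posts first.
repetitive = "a a b b".split()
varied = "a b a b".split()
print(mattr(repetitive, window=2), mattr(varied, window=2))
print(jaccard({"x", "y"}, {"y", "z"}))
```

Lower MATTR indicates more repetitive wording inside each window, and higher Jaccard indicates more shared phrasing between two channels.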
102. 【2605.21496】HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine
Link: https://arxiv.org/abs/2605.21496
Authors: Brandon Dent
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: clinical workflows faster, evaluate them safely, Frontier language models, sustained clinical pressure, Frontier language
Comments: 16 pages, 5 figures, 6 tables. Code, task suite, and Docker bundle: https://github.com/GOATnote-Inc/healthcraft
Abstract:Frontier language models are being deployed into clinical workflows faster than the infrastructure to evaluate them safely. Static medical-QA benchmarks miss the failure modes that matter in emergency medicine: trajectory-level safety collapse, tool misuse, and capitulation under sustained clinical pressure. We present HealthCraft, the first public reinforcement-learning environment that rewards trajectory-level safety under realistic emergency-medicine conditions, adapted from Corecraft. It is built on a FHIR R4 world state with 14 entity types and 3,987 seed entities, exposes 24 MCP tools, and defines a dual-layer rubric that zeroes reward whenever any safety-critical criterion is violated. We release 195 tasks across six categories, graded against 2,255 binary criteria (515 safety-critical); a post-hoc 10-task negative-class slate extends this to 205 tasks and 2,337 criteria. V8 results on two frontier models show Claude Opus 4.6 at Pass@1 24.8% [21.5-28.4] and GPT-5.4 at 12.6% [10.2-15.6], with safety-failure rates of 27.5% and 34.0%. On multi-step workflows - the closest proxy to real emergency care - performance collapses to near zero (Claude 1.0%, GPT-5.4 0.0%) despite partial competence on individual steps. Six infrastructure bugs fixed between pilots v2 and v8 re-ordered which model "looks stronger," evidence that infrastructure fidelity is part of the measurement. A deterministic LLM-judge overlay bounds evaluator noise, and a 60-run negative-class smoke pilot shows the reward signal is not drop-in training-safe: restraint criteria pass at 0.929 prevalence, a gameability an eval harness can tolerate but a training reward cannot. We scaffold coupling to a Megatron+SGLang+GRPO loop per Corecraft Section 5.2 and leave training-reward ablations as future work. Environment, tasks, rubrics, and harness are released under Apache 2.0.
103. 【2605.21491】Teaching Language Models to Forecast Research Success Through Comparative Idea Evaluation
Link: https://arxiv.org/abs/2605.21491
Authors: Srujan P Mule,Aniketh Garikaparthi,Manasi Patwardhan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: automating hypothesis generation, generation and implementation, bottleneck emerges, evaluating and filtering, exhaustive experimentation
Comments: ACL 2026 Findings
Abstract:As language models accelerate scientific research by automating hypothesis generation and implementation, a new bottleneck emerges: evaluating and filtering hundreds of AI-generated ideas without exhaustive experimentation. We ask whether LMs can learn to forecast the empirical success of research ideas before any experiments are run. We study comparative empirical forecasting: given a benchmark-specific research goal and two candidate ideas, predict which will achieve better benchmark performance. We construct a dataset of 11,488 idea pairs grounded in objective outcomes from PapersWithCode. While off-the-shelf 8B-parameter models struggle (30% acc.), SFT dramatically boosts performance to 77.1%, outperforming GPT-5 (61.1%). By framing evaluation as a reasoning task via Reinforcement Learning with Verifiable Rewards (RLVR), we train models to discover latent reasoning paths, achieving 71.35% acc. with interpretable justifications. Through additional ablations and out-of-distribution tests, we show robustness to surface-level heuristics and transfer to both a cross-domain time-split test set and an independently constructed test set. Our results demonstrate that compute-efficient small language models can serve as effective, objective verifiers, offering a scalable path for autonomous scientific discovery.
Information Retrieval
1. 【2605.22766】Diversed Model Discovery via Structured Table Discovery
Link: https://arxiv.org/abs/2605.22766
Authors: Zhengyuan Dong,Renée J. Miller
Categories: Information Retrieval (cs.IR)
Keywords: describe model behavior, including performance, mixture of textual, Model, Model cards describe
Comments: 8 pages excluding references. 5 figures
Abstract:Model cards describe model behavior through a mixture of textual descriptions and structured artifacts, including performance, configuration, and dataset tables. Existing model search systems rely predominantly on semantic similarity over text, which can produce homogeneous result sets and limit exploration of alternatives. We argue that model search is inherently comparative: users want models that are task-aligned yet differentiated in measurable ways. We hypothesize that this balance requires retrieval over condensed, high-quality evidence rather than verbose descriptions, and much of that evidence is concentrated in structured tables. We present StructuredSemanticSearch, a table-driven model search framework built on the ModelTables benchmark. Given a query, StructuredSemanticSearch combines a semantic baseline for task alignment with a structure-aware pipeline that discovers query-related model-card tables using table discovery operators such as unionability, joinability, and keyword search. Retrieved tables are mapped back to model cards under a controlled top-k budget, enabling fair comparison between text-based and table-based retrieval. Beyond retrieval, StructuredSemanticSearch adapts table integration to the model-table domain through orientation-aware integration, producing compact integrated views of tables from partially overlapping and sometimes transposed evidence tables. For evaluation, we introduce a nugget-based, auditable protocol that extracts compact evidence items from model cards, matches queries to condition- or intent-specific nuggets, and measures evidence coverage and diversity over retrieved model-card candidate sets. This protocol also provides a scalable path toward approximate, evidence-based labeling in dynamic model lakes. Experiments on 597 model-recommendation queries show higher nugget coverage for the structure-aware pipeline than for the semantic baseline.
2. 【2605.22544】One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation
Link: https://arxiv.org/abs/2605.22544
Authors: Yevhen Kostiuk,Kenneth Enevoldsen
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Instruction embedding models, embedding models, Instruction embedding, Instruction, prompt
Comments:
Abstract:Instruction embedding models have become common among state-of-the-art models, but they are evaluated using a single prompt per task. This single-point evaluation ignores a main problem of the instruction-based approach, namely sensitivity to the phrasing of the instruction. We present an empirical study of prompt sensitivity across 6 embedding models, 11 datasets, and 15 task-specific prompts per dataset, for a total of 990 combinations. We show that reported scores misrepresent the distribution of scores over plausible prompts. The default prompt can systematically either understate or overstate performance. Furthermore, we show that the leaderboard ranking is not robust to prompt selection: by choosing prompts favorably, any model in our study can be promoted to first place. Our findings suggest that single-prompt evaluation is insufficient for instruction-tuned embedding models and that benchmarks should incorporate prompt robustness, either by evaluating over multiple prompts or by reporting sensitivity alongside point estimates.
3. 【2605.22511】Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning
Link: https://arxiv.org/abs/2605.22511
Authors: Zihan Liang,Yufei Ma,Ben Chen,Zhipeng Qian,Xuxin Zhang,Huangyu Dai,Lingtao Mao
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: competent search-augmented reasoning, turning a language, search-augmented reasoning agent, language model, dominant recipe
Comments:
Abstract:Post-training has become the dominant recipe for turning a language model into a competent search-augmented reasoning agent. A line of recent work pushes its performance further by adding elaborate machinery on top of this standard pipeline. These augmentations import external supervision from stronger external systems, attach auxiliary modules such as process reward models or retrospective critics, restructure the rollout itself with tree search or multi-stage curricula, or shape the reward with hand-crafted bonuses and penalties. Each addition delivers a measurable gain, but each also inflates the training pipeline and ties the recipe to resources or designs that may not always be available. We take a step back and ask whether any of this machinery is actually necessary, and propose Search-E1, a self-evolution method that lets a search-augmented agent improve through only vanilla GRPO interleaved with offline self-distillation (OFSD). After each GRPO round, the policy rolls out on its own training questions. A token-level forward KL objective then aligns the policy's inference-time distribution to its own distribution under a privileged context that exposes a more efficient sibling trajectory. Despite this simplicity, the procedure naturally provides dense per-step supervision. On seven QA benchmarks, Search-E1 reaches $0.440$ average EM with Qwen2.5-3B, surpassing all open-source baselines at both scales. Code and complete version will be made public soon.
4. 【2605.22501】BeLink: Biomedical Entity Linking Meets Generative Re-Ranking
Link: https://arxiv.org/abs/2605.22501
Authors: Darya Shlyk, Stefano Montanelli, Lawrence Hunter
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: Biomedical Entity Linking, Biomedical Entity, remains computationally inefficient, large language models, Entity Linking
Comments: Accepted to ACM SIGIR 2026
Abstract:Despite recent progress, Biomedical Entity Linking (BEL) with large language models (LLMs) remains computationally inefficient and challenging to deploy in practical settings. In this work, we demonstrate that instruction-tuning of open-source generative models can offer an effective solution when applied at the re-ranking stage of the BEL pipeline. We propose a set-wise instruction-tuning formulation that enables fast and accurate candidate selection. Our method demonstrates strong performance on multiple BEL benchmarks, yielding significant improvements in linking accuracy (3%-24%) while reducing inference time compared to the state-of-the-art. We integrate our generative re-ranker into BeLink, a modular, end-to-end system designed for practical real-world BEL applications.
5. 【2605.22358】Integrating Chain-of-Thought into Generative Retrieval: A Preliminary Study
Link: https://arxiv.org/abs/2605.22358
Authors: Wenhao Zhang, Ruihao Yu, Yi Bai, Zhumin Chen, Pengjie Ren
Categories: Information Retrieval (cs.IR)
Keywords: existing approaches directly, directly map queries, approaches directly map, require multi-step reasoning, existing approaches
Comments: This work was initially submitted to KDD 2026 in August 2025
Abstract:While generative retrieval (GR) demonstrates competitive performance on standard retrieval benchmarks, existing approaches directly map queries to document identifiers (docids) without intermediate deliberation, limiting their effectiveness for complex queries that require multi-step reasoning. As a preliminary study on integrating chain-of-thought (CoT) into generative retrieval, we introduce ThinkGR, a unified framework that interleaves CoT with docid generation, enabling iterative thinking and retrieval within a single generative process. To bridge the gap between free-form thought generation and structured retrieval targets, we design (1) a hybrid decoding strategy that dynamically switches between unconstrained thought generation and constrained docid decoding, and (2) a two-phase training approach that first aligns thought-retrieval patterns through supervised fine-tuning, then optimizes thought quality via retrieval-grounded reinforcement learning. Experiments on four multi-hop retrieval benchmarks demonstrate that ThinkGR achieves state-of-the-art performance with an average improvement of +6.86%. Our work opens new avenues for enhancing generative retrieval with explicit deliberation capabilities, with promising implications for retrieval tasks requiring complex reasoning.
6. 【2605.22255】Direct content-based retrieval from music scores images
Link: https://arxiv.org/abs/2605.22255
Authors: Noelia Luna-Barahona, Antonio Ríos-Vila, David Rizo, Jorge Calvo-Zaragoza
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Keywords: musical scores plays, preservation and accessibility, metadata searches, title or composer, digitization of musical
Comments: 17 pages (14 pages + references), 3 figures (with subfigures)
Abstract:The digitization of musical scores plays a crucial role in their preservation and accessibility, yet information retrieval still depends mainly on metadata searches, such as by title or composer. Content-based search in music score images remains underexplored compared to text documents, despite its potential value for musicians, musicologists, and educators. This work contributes to the field by first studying which characteristics of a score are most relevant for search and by defining a systematic method to build query datasets from any annotated corpus. We also consider diverse methods for content-based search on music score images, ranging from transcription-based approaches relying on Optical Music Recognition (OMR), to a transcription-free Transformer model trained to recognize queries directly from score images, and a text-prompted Large Language Model. Our experiments evaluate these models on four corpora exhibiting diverse characteristics in terms of dataset size, image quality, and typesetting mechanisms. Overall, each method excels under different conditions: OMR-based pipelines achieve higher in-domain retrieval, whereas transcription-free models handle domain variability more effectively.
7. 【2605.22073】Behavior-Guided Candidate Calibration for Multimodal Recommendation
Link: https://arxiv.org/abs/2605.22073
Authors: Zesheng Li, Chengchang Pan, Honggang Qi
Categories: Information Retrieval (cs.IR)
Keywords: Multimodal recommendation benefits, recommendation benefits, benefits from content, ranking pipeline, content signals
Comments:
Abstract:Multimodal recommendation benefits from content signals, but the gain depends on how those signals interact with the ranking pipeline. We find that moderate cross-view agreement helps, while stronger agreement suppresses recommendation-specific variation. Spectral analysis shows a clear split: low-frequency components capture shared structure, and higher-frequency components preserve more discriminative signal. Based on this finding, we introduce a behavior-guided candidate calibration model that converts training-only co-user overlap into signed candidate evidence and applies it only to the shortlist produced by the multimodal backbone. The backbone keeps the representation space stable; behavior evidence acts only where ranking is decided. Results on Amazon Baby, Sports, and Electronics show consistent gains over strong multimodal baselines. Code is available at this https URL.
8. 【2605.22003】From TF-IDF to Transformers: A Comparative and Ensemble Approach to Sentiment Classification
Link: https://arxiv.org/abs/2605.22003
Authors: Dip Biswas Shanto, Mitali Yadav, Prajwal Panth, Suresh Chandra Satapathy
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: opinion mining, extract opinion, Natural Language Processing, Sentiment analysis, opinion
Comments: 6 pages, 9 figures. This is the author's accepted manuscript, presented at the International Conference on Intelligent Computing, Networks and Security (IC-ICNS 2026), March 26-28, Bhubaneswar, India. Proceedings publication pending
Abstract:Sentiment analysis, also referred to as opinion mining, aims to extract opinions from text-based data. In the context of movie reviews and critics, sentiment analysis can be a helpful tool to predict whether a movie review is generally positive or negative. It can be difficult for ML models to understand the context or metaphysical sentiment accurately, as ML models rely largely on statistical word representations. The objective of this paper is to examine and categorise movie reviews into positive and negative sentiments. Diverse machine learning models are considered in doing so, and Natural Language Processing (NLP) methodologies are employed for data preprocessing and model assessment. The IMDb dataset is used. Specifically, Naive Bayes, Logistic Regression, Support Vector Machines (SVM), LightGBM, LSTM, and transformer-based models such as RoBERTa and DistilBERT were evaluated. After extensive evaluation on accuracy, precision, recall, F1-score, and ROC-AUC, RoBERTa performed better than all the other models, with an accuracy of 93.02%. A soft voting ensemble that combined all the models also improved classification performance, showing that model ensembling works well for sentiment analysis.
9. 【2605.21987】Generative Conversational Recommender System
Link: https://arxiv.org/abs/2605.21987
Authors: Sixiao Zhang, Mingrui Liu, Cheng Long
Categories: Information Retrieval (cs.IR)
Keywords: natural language interactions, provide personalized recommendations, recommender systems aim, language interactions, aim to provide
Comments:
Abstract:Conversational recommender systems aim to provide personalized recommendations via natural language interactions. However, existing approaches either decouple recommendation from dialog generation or rely on retrieval-based pipelines, limiting the integration between recommendation and response generation and leading to suboptimal modeling of user intent. In this paper, we propose a fully generative conversational recommender system that unifies recommendation and dialog generation within a single autoregressive framework. Our approach represents items as discrete semantic IDs and integrates them directly into the generation process, enabling joint prediction of items and responses via next-token modeling. We further introduce a structured generation paradigm that factorizes conversational recommendation into a sequence of interdependent decisions, where the model first predicts the response intent and the recommendation target, and then generates the response conditioned on them. This design enables end-to-end optimization, enforces a more coherent dependency structure, and supports faithful item generation via constrained decoding. Extensive experiments demonstrate that our method consistently improves recommendation performance, achieving gains of up to 29% on Recall@1 over strong baselines, while maintaining competitive dialog quality.
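The constrained decoding mentioned in this abstract can be pictured with a prefix trie over valid item IDs. The sketch below is a minimal Python illustration under stated assumptions: the catalogue, the token values, and the `toy_scores` stand-in for model logits are all hypothetical, decoding is greedy, and no valid ID is assumed to be a prefix of another. It is not the paper's implementation.

```python
def build_trie(valid_ids):
    """Build a nested-dict prefix trie from valid semantic-ID token sequences."""
    root = {}
    for seq in valid_ids:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root

def constrained_greedy_decode(score_fn, trie):
    """Greedily decode one item ID, only choosing tokens the trie allows."""
    prefix, node = [], trie
    while node:  # an empty dict marks the end of a valid ID
        scores = score_fn(prefix)
        tok = max(node, key=lambda t: scores.get(t, float("-inf")))
        prefix.append(tok)
        node = node[tok]
    return tuple(prefix)

# Hypothetical catalogue of three items, each with a 3-token semantic ID.
catalogue = [(1, 4, 7), (1, 5, 2), (3, 4, 9)]
trie = build_trie(catalogue)

def toy_scores(prefix):
    # Stand-in for model logits; token 8 scores highest but is never valid.
    return {8: 5.0, 1: 2.0, 3: 1.0, 5: 1.5, 4: 0.5, 7: 0.1, 2: 0.3, 9: 0.2}

item = constrained_greedy_decode(toy_scores, trie)
print(item)
```

Masking to the trie at every step means the decoded ID always corresponds to a real catalogue item, even when the highest-scoring token overall is invalid.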
10. 【2605.21969】LLM Retrieval for Stable and Predictable Ad Recommendations
Link: https://arxiv.org/abs/2605.21969
Authors: Vinodh Kumar Sunkara, Satheeshkumar Karuppusamy, Hangjun Xu, Sai Deepika Regani, Kshitij Gupta, Gaby Nahum, Sneha Iyer, Jean-Baptiste Fiot, Yinglong Guo, Xiaowen Guo, Atul Jangra, Yucheng Liu, Jinghao Yan, Vijay Pappu, Benjamin Schulte, Deepak Chandra
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: normalized discounted cumulative, discounted cumulative gain, primarily focused, focused on optimizing, accuracy of click
Comments: SIGIR 2026 AgentSearch Workshop, Melbourne Australia
Abstract:Traditional ads recommendation systems have primarily focused on optimizing for prediction accuracy of click or conversion events using canonical metrics such as recall or normalized discounted cumulative gain (NDCG). With the hyper-growth of ads inventory and liquidity with generative AI technologies, prediction stability and predictability are becoming increasingly critical. Intuitively, prediction stability and predictability can be defined to quantify system robustness with respect to minor/noisy input (ads, creatives) perturbations, the lack of which could lead to advertiser-perceivable problems such as repeatability, cold start and under-exploration. In this paper, we introduce a new evaluation framework for quantifying stability and predictability of an ads recommender system, and present an online validated semantic candidate generation framework powered by fine-tuned Large Language Models (LLMs) that showed significant improvement along these metrics by fundamentally improving the semantic-awareness of the system. The approach extracts hierarchical semantic attributes from ad creatives to obtain LLM representations, which serve as the foundation for graph-based expansion, ensuring the retrieved candidates encapsulate semantic variants of an ad, guaranteeing that small creative variants from the advertiser yield consistent and explainable delivery results to the user. We tested this LLM ads retrieval framework in a large-scale industrial ads recommendation system, demonstrating significant improvements across offline and online A/B experiments, showcasing gains in both predictability and traditional performance metrics. Although evaluated in the ads stack, this is a general framework that can be applied broadly to any large-scale recommendation or retrieval system facing similar scaling and predictability challenges.
11. 【2605.21967】Reinforced Preference Optimization for Reasoning-Augmented Recommendations
Link: https://arxiv.org/abs/2605.21967
Authors: Jingtong Gao, Zeyu Song, Chi Lu, Xiaopeng Li, Derong Xu, Maolin Wang, Peng Jiang, Kun Gai, Qingpeng Cai, Xiangyu Zhao
Categories: Information Retrieval (cs.IR)
Keywords: Large Language Models, Language Models, Large Language, delivering personalized content, richer world knowledge
Comments:
Abstract:Recommender systems are critical for delivering personalized content across digital platforms, and recent advances in Large Language Models (LLMs) offer new opportunities to enhance them with richer world knowledge and explicit reasoning capabilities. With the help of reasoning knowledge, recommendations can better infer users' underlying intents, adapt to evolving preferences, and leverage semantic relationships for improved accuracy and interpretability. However, existing reasoning-based recommendation methods often fail to fully align the LLM's reasoning process with recommendation-specific objectives due to structural disruption during integration and difficulties in translating free-form generation into accurate item predictions. In this paper, we introduce RPORec, a reinforced preference optimization framework that unifies an LLM backbone's reasoning ability with a dedicated recommendation head (Rechead) for precise item retrieval. RPORec comprises two stages: (1) Reasoning-Augmented Recommendation Modeling, where high-quality Chain-of-Thought (CoT) reasoning is generated and used as auxiliary knowledge to guide the Rechead in learning recommendation-specific representations; and (2) Advanced Reasoning Refinement and Alignment, in which the trained Rechead produces verifiable rewards to fine-tune the LLM backbone via reinforcement learning, enhancing reasoning quality, structural consistency, and task relevance. Extensive experiments on public benchmarks and large-scale online deployments show that RPORec consistently outperforms state-of-the-art LLM-based recommendation methods, demonstrating the effectiveness of reasoning-augmented recommendation modeling in real-world systems.
12. 【2605.21812】Bridging the Cold-Start Gap: LLM-Powered Synthetic Data Generation for Natural Language Search at Airbnb
Link: https://arxiv.org/abs/2605.21812
Authors: Wendy Ran Wei, Hao Li, Weiwei Guo, Xiaowei Liu, Xueyin Chen, Dillon Davis, Malay Haldar, Soumyadip Banerjee, Kedar Bellare, Huiji Gao, Stephanie Moyerman, Sanjeev Katariya
Categories: Information Retrieval (cs.IR)
Keywords: Deploying natural language, critical cold-start challenge, natural language search, language search systems, Airbnb natural language
Comments:
Abstract:Deploying natural language search systems presents a critical cold-start challenge: no real user queries to learn linguistic patterns, and no relevance labels to train ranking models. We present a framework for generating synthetic queries and labels using large language models (LLMs), powering model training and evaluation for Airbnb's natural language search. For query generation, we combine contrastive listing pairs from booking sessions with seed queries from user research to balance realism and diversity, enabling a cold-to-warm start transition as real user data becomes available. For label generation, we introduce contrastive generation that produces topicality labels by construction, and Virtual Judge (VJ) labeling for broader coverage. We compare our approach against a no-seed contrastive baseline and an InPars-style baseline. For query length, the InPars baseline produces verbose queries with KL divergence of 12.03 vs. real users; our seed-guided approach achieves 0.66, a 7.5x improvement. For attribute type distributions, our approach achieves the lowest KL divergence (0.04), outperforming even seed queries (0.09). Experiments show our approach produces harder evaluation examples than the no-seed baseline (79% vs. 97% pairwise accuracy), providing discriminative signal for model improvement. We deploy production pipelines generating synthetic examples daily for embedding-based retrieval and ranking evaluation.
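For readers unfamiliar with how a KL divergence over query lengths might be computed, here is a minimal Python sketch. The binning by word count, the add-one smoothing, the direction KL(real || synthetic), and all example queries are assumptions made for illustration; the abstract does not specify how the paper measures it.

```python
import math
from collections import Counter

def length_distribution(queries, max_len=10):
    """Histogram of query lengths in words (clipped to 1..max_len) with add-one smoothing."""
    counts = Counter(min(max(len(q.split()), 1), max_len) for q in queries)
    total = len(queries) + max_len
    return [(counts.get(n, 0) + 1) / total for n in range(1, max_len + 1)]

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical query sets.
real = ["beach house", "cabin near lake", "pet friendly loft", "villa with pool"]
terse = ["lake cabin", "loft for two", "house with pool", "beach villa"]
verbose = ["i am looking for a quiet cabin near a lake"] * 4

p_real = length_distribution(real)
kl_terse = kl_divergence(p_real, length_distribution(terse))
kl_verbose = kl_divergence(p_real, length_distribution(verbose))
print(kl_terse, kl_verbose)
```

A generator that produces overly long queries puts its mass on lengths real users rarely use, so its divergence from the real distribution is larger.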
Computer Vision
1. 【2605.22823】Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs
Link: https://arxiv.org/abs/2605.22823
Authors: Jongseo Lee, Hyuntak Lee, Sunghun Kim, Sooa Kim, Jihoon Chung, Jinwoo Choi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, Video Large Language, Language Models, Large Language, basic perceptual primitive
Comments: Preprint. 59 pages, including appendix. Code: [this https URL](https://github.com/KHU-VLL/DeltaDirect)
Abstract:Video Large Language Models (Video-LLMs) have made rapid progress on temporal video understanding, yet many fail at a basic perceptual primitive: signed image-plane motion direction. On simple videos of a single object moving left, right, up, or down, most Video-LLMs perform near chance, with above-chance cases largely attributable to prediction biases rather than genuine direction understanding. We call this failure directional motion blindness. We localize the failure by tracing motion direction information through the Video-LLM pipeline. Motion direction remains linearly accessible from the vision encoder, projector, and LLM hidden states, but the readout fails to bind this signal to the correct verbal answer option, revealing a direction binding gap. Although synthetic motion direction instruction tuning reduces this gap on the source domain, motion direction concept vector analysis shows that visual complexity weakens the signal magnitude and limits out-of-domain generalization. We introduce MoDirect, a dataset family for motion direction instruction tuning and evaluation, and DeltaDirect, a diagnosis-driven, projector-level objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas. On MoDirect-SynBench, instruction tuning with DeltaDirect improves motion direction accuracy from 25.9% to 85.4%. On MoDirect-RealBench, DeltaDirect improves real-world motion direction accuracy by 21.9 points over the vanilla baseline without real-world tuning data, while preserving standard video-understanding performance. Code: this https URL
2. 【2605.22819】Cambrian-P: Pose-Grounded Video Understanding
Link: https://arxiv.org/abs/2605.22819
Authors: Jihan Yang, Zifan Zhao, Xichen Pan, Shusheng Yang, Junyi Zhang, Bingyi Kang, Hu Xu, Saining Xie
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Camera pose matters, video, pose, pose matters, Camera
Comments: Project Page: [this https URL](https://cambrian-mllm.github.io/)
Abstract:Camera pose matters. The position and orientation of each viewpoint define a shared spatial coordinate frame that relates observations across video frames. Yet this signal is largely absent from multimodal LLMs (MLLMs) for video understanding, which process frames as isolated 2D snapshots, instead of the persistent scene humans perceive. We revisit pose as a lightweight supervisory signal and introduce Cambrian-P, a video MLLM augmented with per-frame learnable camera tokens and a pose regression head. With a carefully designed sampling scheme, the model achieves substantial gains of 4.5-6.5% on spatial reasoning benchmarks such as VSI-Bench, generalizes across eight additional spatial and general video QA benchmarks, and, as a byproduct, achieves state of the art streaming pose estimation on ScanNet. Surprisingly, training on pseudo-annotated poses from in-the-wild video further improves general video QA benchmarks, showing pose helps beyond spatial reasoning. Together, these results position camera pose as a fundamental signal for video models that reason about the physical world.
3. 【2605.22818】MotiMotion: Motion-Controlled Video Generation with Visual Reasoning
Link: https://arxiv.org/abs/2605.22818
Authors: Lee Hsin-Ying, Hanwen Jiang, Yiqun Mei, Jing Shi, Ming-Hsuan Yang, Zhixin Shu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Current motion-controlled, generation models rigidly, rigidly follow user-provided, follow user-provided trajectories, models rigidly follow
Comments: ICML 2026. Project page: [this https URL](https://motimotion.github.io/)
Abstract:Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally incomplete. Such reliance often yields unnatural or implausible outcomes, especially by missing secondary causal consequences. To address this, we introduce MotiMotion, a novel framework that reformulates motion control as a reasoning-then-generation problem. To encourage causally grounded and commonsense-consistent interactions, we leverage a training-free vision-language reasoner to refine image-space coordinates of primary trajectories and to hallucinate plausible secondary motions. To further improve motion naturalness, we propose a confidence-aware control scheme that modulates guidance strength, enabling the model to closely follow high-confidence plans while correcting artifacts under low-confidence inputs with its internal generative priors. To support systematic evaluation, we curate a new image-to-video benchmark, MotiBench, consisting of interaction-centric scenes where new events are triggered by motion. Both VLM-based evaluation and a human study on MotiBench demonstrate that MotiMotion produces videos with more plausible object behaviors and interaction, and is preferred over existing approaches.
4. 【2605.22816】AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation
Link: https://arxiv.org/abs/2605.22816
Authors: Wenxuan Guo, Xiuwei Xu, Yichen Liu, Xiangyu Li, Hang Yin, Huangxing Chen, Wenzhao Zheng, Jianjiang Feng, Jie Zhou, Jiwen Lu
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: ground language instructions, visual environment, ground language, language instructions, VLN
Comments: Accepted to CVPR 2026. Project page: [this https URL](https://gwxuan.github.io/AwareVLN/)
Abstract:Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often lack an explicit and explainable understanding of the relationships between the agent, the instruction, and the scene. Conversely, explicitly building a scene map for heuristic planning is intuitively appealing but relies on additional 3D sensors and hinders large-scale vision-language pre-training. To bridge this gap, we propose AwareVLN, a novel framework that equips the navigation model with a self-aware reasoning mechanism, enabling it to understand the agent's state and task progress in a fully end-to-end and data-driven manner. Our approach features two key innovations: (1) a structural reasoning module that fosters spatial and task-oriented self-awareness, and (2) an automatic data engine with progress division for effective training. Extensive experiments on various datasets in Habitat simulator show our AwareVLN significantly outperforms previous state-of-the-art vision-language navigation methods. Project page: this https URL.
5. 【2605.22812】GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations
Link: https://arxiv.org/abs/2605.22812
Authors: Wenxuan Guo, Ziyuan Li, Meng Zhang, Yichen Liu, Yimeng Dong, Chuxi Xu, Yunfei Wei, Ze Chen, Erjin Zhou, Jianjiang Feng
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: shown strong potential, general-purpose robot manipulation, existing VLA systems, VLA systems primarily, shown strong
Comments: Project page: [this https URL](https://gwxuan.github.io/GesVLA/)
Abstract:Vision-Language-Action (VLA) models have shown strong potential for general-purpose robot manipulation by unifying perception and action. However, existing VLA systems primarily rely on textual instructions and struggle to resolve spatial ambiguity in complex scenes with multiple similar objects. To address this limitation, we introduce gesture as a parallel instruction modality and propose a Gesture-aware Vision-Language-Action model (GesVLA). Our approach encodes gesture features directly into the latent space, enabling them to participate in both high-level reasoning and low-level action generation, and adopts a dual-VLM architecture to achieve tight coupling between gesture representations and action policies. At the data level, we construct a scalable gesture data generation pipeline by rendering hand models onto real-world scene images. This reduces the sim-to-real visual gap while producing rich data with diverse motion patterns and corresponding pointing annotations. In addition, we employ a two-stage training strategy to equip the model with both gesture perception and action prediction capabilities. We evaluate our approach on multiple real-world robotic tasks, including a controlled block manipulation task for validation and more practical scenarios such as product and produce selection. Experimental results show that incorporating gesture consistently improves target grounding accuracy and human-robot interaction efficiency, especially in complex and cluttered environments. Project page: this https URL.
6. 【2605.22809】Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving
Link: https://arxiv.org/abs/2605.22809
Authors: Jiahao Wang, Bo Sun, Yijing Bai, Vincent Casser, Songyou Peng, Zehao Zhu, Meng-Li Shih, Xander Masotto, Shih-Yang Su, Kanaad V Parvate, Tiancheng Ge, Linn Bieske, Dragomir Anguelov, Mingxing Tan, Chiyu Max Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Autonomous Driving Systems, Driving Systems, Autonomous Driving, Autonomous Vehicle, require massive
Comments: Accepted by CVPR 2026
Abstract:Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, are limited in scale, diversity of sensor configurations, as well as geographic and long-tail-behavioral coverage. In contrast, in-the-wild data from sources like dashcams offers immense scale and diversity, capturing critical long-tail scenarios and novel environments. However, this unstructured, in-the-wild video data is incompatible with ADS expecting structured, multi-modal sensor inputs for validation and training. To bridge this data gap, we propose Sensor2Sensor, a novel generative modeling paradigm that translates in-the-wild monocular dashcam videos into a high-fidelity, multi-modal sensor suite (AV logs) comprising multi-view camera images and LiDAR point clouds. A core challenge is the lack of paired training data. We address this by converting real AV logs into dashcam-style videos via 4D Gaussian Splatting (4DGS) reconstruction and novel-view rendering. Sensor2Sensor then utilizes a diffusion architecture to perform the generative conversion. We perform comprehensive quantitative evaluations on the fidelity and realism of the generated sensor data. We demonstrate Sensor2Sensor's practical utility by converting challenging in-the-wild internet and dashcam footage into realistic, multi-modal data formats, further unlocking vast external data sources for AV development.
7. 【2605.22777】DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders
Link: https://arxiv.org/abs/2605.22777
Authors: Tianhang Wang, Yitong Chen, Wei Song, Zuxuan Wu, Min Li, Jiaqi Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: robust high-level representations, vision foundation models, latent diffusion models, Representation Autoencoders, providing robust high-level
Comments:
Abstract:Representation Autoencoders (RAEs) leverage frozen vision foundation models (VFMs) as tokenizer encoders, providing robust high-level representations that facilitate fast convergence and high-quality generation in latent diffusion models. However, freezing the VFM inherently constrains its spatial reconstruction capacity, limiting fine-grained generation and image editing; in contrast, incorporating reconstruction-oriented signals via fine-tuning disrupts the pretrained semantic space and degrades generative fidelity. To address this trade-off, we propose DecQ, a simple yet effective framework for RAEs. Specifically, DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are incorporated into the decoder to support reconstruction and are jointly generated with patch tokens during generative modeling. By aggregating information from both shallow and deep layers, DecQ effectively mitigates the reconstruction-generation trade-off, improving both reconstruction quality and generative performance. Our experiments demonstrate that: (1) with only 8 additional queries and 3.9% extra computation, DecQ improves reconstruction over the frozen DINOv2-based RAE, increasing PSNR from 19.13 dB to 22.76 dB; and (2) for generative modeling, DecQ achieves 3.3× faster convergence than RAE, attaining an FID of 1.41 without guidance and 1.05 with guidance.
8. 【2605.22767】Synthetic Data Alone is Enough? Rethinking Data Scarcity in Pediatric Rare Disease Recognition
Link: https://arxiv.org/abs/2605.22767
Authors: Ganlin Feng, Yuxi Long, Erin Lou, Lianghong Chen, Zihao Jing, Pingzhao Hu, Wei Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: remains challenging due, extreme data scarcity, distinctive facial phenotypes, privacy constraints, limited data sharing
Comments: CVPR 2026 CV4CHL workshop
Abstract:Children with rare genetic diseases often exhibit distinctive facial phenotypes, yet developing computer vision systems for early diagnosis remains challenging due to extreme data scarcity, privacy constraints, and limited data sharing in pediatric settings. These challenges not only hinder automated diagnosis but also restrict the availability of visual resources for clinical genetic counseling. While prior work has shown that synthetic data can augment real datasets and preserve phenotype-level semantics, it remains unclear whether synthetic data alone is sufficient for learning in ultra-low-resource pediatric settings. In this work, we study the synthetic-only regime for pediatric rare disease recognition. Under a controlled experimental setup, models are trained exclusively on phenotype-aware synthetic facial images at increasing scales. We find that synthetic-only training achieves performance comparable to real-data-only baselines at sufficient scale across multiple backbones, suggesting that high-fidelity synthetic data can approximate clinically meaningful distributions. These findings together further enable the use of synthetic pediatric facial images as privacy-preserving resources for genetic education and counseling, supporting clinician training and patient communication. Our results highlight the potential of computer vision to improve data efficiency and expand accessible visual tools in children's healthcare.
9. 【2605.22751】Spectral Tail Auxiliary Learning for AI-Generated Image Detection
Link: https://arxiv.org/abs/2605.22751
Authors: Xingyi Li, Jiahui Zhang, Yiheng Li, Yun Cao, Wenhao Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: detection increasingly challenging, models evolve rapidly, evolve rapidly, continues to narrow, increasingly challenging
Comments:
Abstract:As generative image models evolve rapidly, the perceptual gap between generated and real images continues to narrow, making AI-generated image detection increasingly challenging. Many existing methods exploit frequency-domain cues for detection, typically described as frequency-domain artifacts or high-frequency discrepancies. However, the specific and recurring spectral regularities remain insufficiently understood and characterized. In this paper, we systematically analyze the one-dimensional radial log-power spectra of real and generated images. We find that generated images do not necessarily exhibit higher or lower energy across the entire spectrum or high-band range. Instead, their spectra deviate from the power-law decay and show an anomalous uplift in the ultra-high-frequency tail. We term this phenomenon spectral tail uplift. We further attribute this phenomenon to nonlinear harmonic accumulation in trained generative models, suggesting that it can serve as a structural cue across generative architectures. Based on this observation, we propose Spectral Tail Auxiliary Learning (STAL), a frequency-domain auxiliary supervision framework for generalizable AI-generated image detection. STAL transfers spectral-tail cues from a tail-aware frequency teacher to a spatial detector during training, while all frequency-domain modules are discarded at inference time. Consequently, STAL introduces no inference overhead. Extensive experiments on 9 public datasets show that STAL achieves strong generalization and stability across generators, data distributions, and real-world scenarios.
10. 【2605.22718】WorldKV: Efficient World Memory with World Retrieval and Compression
Link: https://arxiv.org/abs/2605.22718
Authors: Jung Yi, Minjae Kim, Paul Hyunbin Cho, Wooseok Jang, Sangdoo Yun, Seungryong Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Autoregressive video diffusion, video diffusion models, Autoregressive video, action-conditioned world generation, video diffusion
Notes: Project Page: [this https URL](https://cvlab-kaist.github.io/WorldKV/)
Abstract:Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, where revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full KV-cache attention preserves this consistency but breaks real-time constraints: memory footprint and attention cost grow linearly with rollout length. Sliding window inference restores throughput but discards long-term consistency. We propose WorldKV, a training-free framework with two components: World Retrieval and World Compression. World Retrieval stores evicted KV-cache chunks in GPU/CPU memory and selectively retrieves scene-relevant chunks via camera/action correspondence, inserting them back into the native attention window without re-encoding. World Compression prunes redundant tokens within each chunk via key-key similarity to an anchor frame, halving per-chunk storage to fit 2x more history under a fixed budget. On Matrix-Game-2.0 and LingBot-World-Fast, WorldKV matches or exceeds full-KV memory fidelity at roughly 2x the throughput, and is competitive with memory-trained baselines without any fine-tuning. Project Page: https://cvlab-kaist.github.io/WorldKV/
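The pruning idea in this abstract (drop tokens whose keys are redundant with an anchor frame, keeping half of each chunk) might look roughly like the sketch below. It is a guess at the general shape, not WorldKV's implementation: the cosine-similarity score, the max-over-anchor reduction, and the name `compress_chunk` are assumptions.

```python
import numpy as np


def compress_chunk(keys, values, anchor_keys, keep_ratio=0.5):
    """Keep the tokens whose keys are least similar to the anchor keys.

    keys, values: arrays of shape (T, d) for one cached chunk.
    anchor_keys:  array of shape (A, d) from a hypothetical anchor frame.
    Returns the pruned keys, pruned values, and the kept token indices.
    """
    def normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    # Cosine similarity between every chunk key and every anchor key.
    sim = normalize(keys) @ normalize(anchor_keys).T  # shape (T, A)
    # A token counts as redundant if some anchor key closely matches it.
    redundancy = sim.max(axis=1)
    n_keep = max(1, int(round(keep_ratio * len(keys))))
    # Lowest redundancy first; restore the original token order afterwards.
    keep = np.sort(np.argsort(redundancy, kind="stable")[:n_keep])
    return keys[keep], values[keep], keep
```

With `keep_ratio=0.5` each chunk stores half as many key/value pairs, which is the sense in which twice as much history fits in a fixed budget.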
11. 【2605.22715】AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild
Link: https://arxiv.org/abs/2605.22715
Authors: Baiyu Chen, Zechen Li, Wilson Wongso, Lihuan Li, Xiachong Lin, Hao Xue, Benjamin Tag, Flora Salim
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Keywords: continuously sense human, daily life, increasingly embedded, embedded in daily, offer a practical
Abstract:As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to continuously sense human motion in the wild. But inertial signals are highly dependent on the sensing setup, including body location, mounting position, sensor orientation, device hardware, and sampling protocol. This setup dependence makes it difficult to learn motion representations that transfer across devices and datasets, and limits the broader use of wearable IMUs beyond closed-set recognition. We introduce AnyMo, a geometry-aware framework for setup-agnostic human motion modeling. AnyMo uses physics-grounded IMU simulation over dense body-surface placements to generate diverse and plausible synthetic signals, pre-trains a graph encoder from paired synthetic placement views and masked partial observations, tokenizes multi-position IMU into full-body motion tokens, and aligns these tokens with an LLM for motion-language understanding. We evaluate AnyMo on three complementary tasks: zero-shot activity recognition across 14 unseen downstream datasets, cross-modal retrieval, and wearable IMU motion captioning, where it improves average Accuracy/F1/R@2 by 11.7\%/11.6\%/22.6\% on HAR, increases zero-shot IMU-to-text and text-to-IMU retrieval MRR by 15.9\% and 28.6\%, respectively, and improves zero-shot captioning BERT-F1 by 18.8\%. These results support AnyMo as a generalist model for wearable motion understanding in the wild. Project page: this https URL.
12. 【2605.22697】Cross-Domain Human Action Recognition from Multiview Motion and Textual Descriptions
Link: https://arxiv.org/abs/2605.22697
Authors: Yannick Porto, Renato Martins, Thomas Chalumeau, Cedric Demonceaux
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: important domain shifts, real-world scenarios, key capability, capability for effective, effective deployment
Notes: Accepted to ICPR 2026. Code and trained models available at: [this https URL](https://icb-vision-ai.github.io/OrientationAware-HAR)
Abstract:Robustness to domain changes is a key capability for effective deployment of human action recognition systems in real-world scenarios, where actions at inference time can exhibit substantial domain shifts or may not have been seen during training at all. In this context, improving the recognition capabilities of Zero-Shot Action Recognition (ZSAR) models, without requiring strong annotation efforts, remains a central challenge. Most ZSAR approaches assume that actions are observed under geometric conditions similar to those seen during training. In practice, variations in human body orientation and camera viewpoint add a significant domain gap in ZSAR, substantially limiting generalization to novel action-motion combinations. This paper presents a novel orientation-aware action recognition approach with improved cross-domain capabilities. Our approach combines motion cues of multiple camera viewpoints and text descriptions of human actions in the training phase. We present a new orientation-aware motion encoding network to learn different motion features, and adapt a specific orientation-aware text prompt to match the corresponding features at inference. Extensive experiments demonstrate that the proposed method consistently improves ZSAR performance across different recognition benchmarks, outperforming recent state-of-the-art zero-shot approaches on NTU-RGB+D, BABEL, NW-UCLA, and on two surveillance datasets. In addition, the learned representations exhibit strong transfer learning capabilities, yielding competitive performance on both cross-domain and same-domain recognition of seen actions. Code and trained models are available at: https://icb-vision-ai.github.io/OrientationAware-HAR
13. 【2605.22695】Improving Viewpoint-Invariance and Temporal Consistency for Action Detection
Link: https://arxiv.org/abs/2605.22695
Authors: Yannick Porto, Renato Martins, Thomas Chalumeau, Cedric Demonceaux
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Viewpoint change invariance, change invariance, consistency are critical, critical aspects, effective deployment
Notes: Accepted at ICIP 2026. Code and trained models are available at: [this https URL](https://icb-vision-ai.github.io/HydraView-TAD)
Abstract:Viewpoint change invariance and action temporal consistency are critical aspects for the effective deployment of human action detection in untrimmed videos. Existing appearance-based video detection methods often struggle with limited viewpoint diversity during training, while motion-based detection approaches frequently fail to model fine-grained temporal relationships across consecutive motion windows. This paper introduces a novel two-stage action detection approach designed to improve both view-invariance and global temporal coherence. In the first stage, we extract motion features from augmented virtual viewpoints, used solely at training time. Then, the second stage introduces a new view-invariant, multi-scale temporal encoder based on selective state-space sequence modelling to aggregate information across viewpoints and time scales. Experiments on PKU-MMD and BABEL benchmarks demonstrate that this approach significantly outperforms state-of-the-art methods in all considered splits. Code and trained models are available at: https://icb-vision-ai.github.io/HydraView-TAD
14. 【2605.22679】Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models
Link: https://arxiv.org/abs/2605.22679
Authors: Piotr Kubaty, Patryk Marszałek, Łukasz Struski, Adam Wróbel, Jacek Tabor, Marek Śmieja
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: learn powerful multimodal, powerful multimodal embeddings, semantics remain opaque, internal semantics remain, models learn powerful
Abstract:Vision-language models learn powerful multimodal embeddings, yet their internal semantics remain opaque. While sparse autoencoders (SAEs) can extract interpretable features, they rely on expanding the representation dimension, which compromises the original geometry and introduces redundancy. We introduce CEDAR (Conceptual Embedding Disentanglement via Adaptive Rotation), a post-hoc method that reveals the compositional structure of pretrained embeddings without increasing dimensionality. By learning an invertible transformation with a top-$k$ sparsity bottleneck, CEDAR concentrates semantic information into axis-aligned disentangled coordinates. In CLIP-like architecture, individual coordinates can be interpreted with textual concepts, while for generative models such as BLIP, they can be decoded into natural language descriptions. Experiments demonstrate that CEDAR achieves a competitive reconstruction-sparsity trade-off while producing explanations that are more interpretable and better aligned with human perception. Our results suggest that the apparent entanglement in vision-language representations can be resolved through a suitable change of basis, eliminating the need for overcomplete expansions.
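The combination of an invertible change of basis with a top-$k$ bottleneck can be illustrated with a toy example. The sketch below is not CEDAR: it uses a fixed random orthogonal matrix instead of a learned transformation, and every name in it (`orthogonal_basis`, `top_k_sparsify`, `encode_decode`) is hypothetical. It only shows that the code keeps the original dimensionality and that decoding is a plain inverse of the basis change.

```python
import numpy as np


def orthogonal_basis(dim, seed=0):
    """Random orthogonal matrix via QR (a stand-in for a learned transform)."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Flipping column signs keeps q orthogonal and fixes the QR sign ambiguity.
    return q * np.sign(np.diag(r))


def top_k_sparsify(z, k):
    """Zero out all but the k largest-magnitude coordinates of each row."""
    idx = np.argpartition(np.abs(z), -k, axis=-1)[..., -k:]
    out = np.zeros_like(z)
    np.put_along_axis(out, idx, np.take_along_axis(z, idx, axis=-1), axis=-1)
    return out


def encode_decode(x, basis, k):
    """Change basis, keep k coordinates, then map back to the input space."""
    z = top_k_sparsify(x @ basis, k)
    # The inverse of an orthogonal matrix is its transpose.
    return z, z @ basis.T
```

With `k` equal to the embedding dimension the reconstruction is exact; smaller `k` trades reconstruction error for sparsity, which is the trade-off the abstract refers to.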
15. 【2605.22678】Swift Sampling: Selecting Temporal Surprises via Taylor Series
Link: https://arxiv.org/abs/2605.22678
Authors: Dahye Kim, Bhuvan Sachdeva, Karan Uppal, Naman Gupta, Vineeth N. Balasubramanian, Deepti Ghadiyaram
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: critical information resides, temporal surprises, critical information, information resides, resides in temporal
Abstract:While most frames in long-form video are redundant, the critical information resides in temporal surprises: moments where the actual visual features deviate from their predicted evolution. Inspired by the human brain's predictive coding, we introduce Swift Sampling, a training-free frame selection algorithm that automatically identifies high-information moments in a video. Specifically, we model a video as a differentiable trajectory in the visual latent space and compute the velocity and acceleration of its features. Then, we apply Taylor expansion to project the expected path of subsequent frames. Frames that diverge sharply from this predicted manifold are identified as temporally surprising frames and selected for sampling. Unlike prior training-free methods that rely on auxiliary networks or video-specific hyperparameter tuning, Swift Sampling is lightweight, adding only 0.02x computational cost over the baseline, an overhead 30x lower than that of leading baselines. Across three long-video question answering benchmarks and 10 different downstream tasks, Swift Sampling outperforms uniform sampling and prior query-agnostic baselines. It is especially effective for long videos with limited frame budgets, improving accuracy by up to +12.5 points.
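A second-order Taylor prediction over frame features can be written down in a few lines. The sketch below is one plausible discretization, not the paper's algorithm: it assumes backward finite differences for velocity and acceleration, an L2 deviation as the surprise score, and a simple top-k selection; `taylor_surprise` and `select_frames` are hypothetical names.

```python
import numpy as np


def taylor_surprise(features):
    """Per-frame deviation from a second-order Taylor forecast.

    features: array of shape (T, d), one feature vector per frame.
    Frames 0..2 get a score of 0 because three past frames are needed.
    """
    f = np.asarray(features, dtype=float)
    surprise = np.zeros(len(f))
    # Backward differences evaluated at frame t-1, for t = 3 .. T-1.
    velocity = f[2:-1] - f[1:-2]
    acceleration = f[2:-1] - 2.0 * f[1:-2] + f[:-3]
    # f(t) ~ f(t-1) + v * 1 + 0.5 * a * 1^2 with a unit time step.
    predicted = f[2:-1] + velocity + 0.5 * acceleration
    surprise[3:] = np.linalg.norm(f[3:] - predicted, axis=1)
    return surprise


def select_frames(features, k):
    """Indices of the k most surprising frames, in temporal order."""
    scores = taylor_surprise(features)
    top = np.argsort(-scores, kind="stable")[:k]
    return np.sort(top)
```

On a trajectory that moves at constant velocity the forecast is exact and every score is zero; a sudden jump raises the score at the jump and, because the finite differences lag, at the frame right after it as well.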
16. 【2605.22677】Slimmable ConvNeXt: Width-Adaptive Inference for Efficient Multi-Device Deployment
Link: https://arxiv.org/abs/2605.22677
Authors: Janek Haberer, Jon Eike Wilhelm, Olaf Landsiedel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Deploying vision models, varying resource constraints, maintaining separate models, typically requires training, Deploying vision
Notes: Accepted at Mobile AI Workshop 2026 (CVPR'26 Workshop)
Abstract:Deploying vision models across devices with varying resource constraints, or even on a single device where available compute fluctuates due to battery state, thermal throttling, or latency deadlines, typically requires training and maintaining separate models. Width-adaptive inference addresses this by training a single set of shared weights containing multiple nested subnetworks of increasing capacity, but prior CNN-based approaches required switchable batch normalization, while recent scalable methods have focused on Vision Transformers. We present Slimmable ConvNeXt, which shows that ConvNeXt's modern design, specifically LayerNorm and inverted bottlenecks, makes it particularly suited for channel-width slimming, eliminating the normalization overhead of classical slimmable networks and producing a simpler training pipeline than both prior CNN and ViT approaches. On ImageNet-1k, Slimmable ConvNeXt-T with 3 subnetworks achieves 80.8% top-1 accuracy at 4.5 GMACs and 77.4% at 1.2 GMACs, trained from scratch for 600 epochs. At comparable compute, this exceeds HydraViT's 6-head subnetwork (78.4% at 4.6 GMACs) by 2.4 percentage points and its 3-head configuration (73.0% at 1.3 GMACs) by 4.4 percentage points, while also outperforming MatFormer-S (78.6%) and SortedNet-S (78.2%) at the same GMACs. Scaling to Slimmable ConvNeXt-B further improves maximum accuracy to 82.8% at 15.35 GMACs.
17. 【2605.22671】From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model
Link: https://arxiv.org/abs/2605.22671
Authors: Bing Hu, Zaijing Li, Rui Shao, Junda Chen, April Hua Liu, Wei-Shi Zheng, Liqiang Nie
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: learn generalized behavior, generalized behavior representations, models often suffer, distribution shifts, varying environments
Abstract:Vision-Language-Action (VLA) models often suffer from performance degradation under distribution shifts, as they struggle to learn generalized behavior representations across varying environments. While existing approaches attempt to construct behavior representations through action-centric latent variables, they are often limited by short-horizon temporal fragmentation and static execution-alignment, leading to inconsistent behaviors in complex scenarios. To address these limitations, we propose BehaviorVLA, a framework that facilitates robust manipulation by learning temporally coherent behavioral representations. Our approach features two symmetric components: (1) the Visuomotor Behavior Encoder (VBE), which utilizes a causal Mamba-based architecture to aggregate long-horizon trajectory information into a unified behavior representation; and (2) the Phase-conditioned Behavior Decoder (PBD), which decodes this representation into precise actions by dynamically aligning task-level priors with real-time execution progress. Experiments on RoboTwin 2.0, LIBERO, and CALVIN demonstrate state-of-the-art success rates of 58\%, 98\%, and 4.36 (this http URL), respectively. Notably, in real-world sim-to-real transfer, BehaviorVLA matches the performance of OpenVLA-OFT using only 50\% of the demonstration data, showcasing its superior data efficiency and generalization.
18. 【2605.22668】SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers
Link: https://arxiv.org/abs/2605.22668
Authors: Javad Rajabi, Kimia Shaban, Koorosh Roohi, David B. Lindell, Babak Taati
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Diffusion transformers, Rotary Position Embeddings, training range, dominant architecture, performance drops
Notes: 27 pages, 14 figures. Project page: [this https URL](https://rajabi2001.github.io/sega/)
Abstract:Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.
19. 【2605.22658】SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation
Link: https://arxiv.org/abs/2605.22658
Authors: Zhenyu Lu, Liupeng Li, Jinpeng Wang, Haoqian Kang, Yan Feng, Ke Chen, Yaowei Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Keywords: large language models, segmentation pipelines fail, large language, pipelines fail, fail to transparently
Notes: Accepted by CVPR 2026. 15 pages, 9 figures, 6 tables
Abstract:While large language models provide strong compositional reasoning, existing reasoning segmentation pipelines fail to transparently connect this reasoning to visual perception. Current methods, such as latent query alignment, are end-to-end yet opaque "black boxes". Conversely, textual localization readout is merely readable, not truly interpretable, often functioning as an unconstrained post-hoc step. To bridge this interpretability gap, we propose SegCompass, an end-to-end model that leverages a Sparse Autoencoder (SAE) to forge an explicit, interpretable, and differentiable alignment pathway. Given an image-instruction pair, SegCompass first generates a chain-of-thought (CoT) trace. The core of our method is an SAE that maps both the CoT and visual tokens into a shared, high-dimensional sparse concept space. A query codebook selects salient concepts from this space, which are then spatially grounded by a slot mapper into a multi-slot heatmap that guides the final mask decoder. The entire model is trained jointly, unifying reinforcement learning for the reasoning path with standard segmentation supervision. This SAE-driven interface provides a "white-box" connection that is significantly more traceable than latent queries and more coherent than textual readouts. Extensive experiments on five challenging benchmarks demonstrate that SegCompass matches or surpasses state-of-the-art performance. Crucially, our visual and quantitative analyses show a strong correlation between the quality of the learned sparse concepts and final mask accuracy, confirming that SegCompass achieves superior results through its enhanced and inspectable alignment. Code is available at this https URL.
20. 【2605.22654】Seeing the Poem: Image-Semantic Detection of AI-Generated Modern Chinese Poetry with MLLMs
Link: https://arxiv.org/abs/2605.22654
Authors: Shanshan Wang, Fengying Ye, Hanjia Lyu, Caiwen Gou, Junchao Wu, Jingming Yao, Chengzhong Xu, Jiebo Luo, Derek F. Wong
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Previous detection studies, modern Chinese poetry, addressed modern Chinese, modern Chinese, Chinese poetry
Abstract:Previous detection studies have shown that LLMs cannot be effectively used as detectors, but these studies have not addressed modern Chinese poetry. Moreover, no relevant research has explored the performance of LLMs in detecting modern Chinese poetry. This paper evaluates and enhances the performance of LLMs as detectors for modern Chinese poetry, and proposes an image-semantic guided poetry detection method. Compared with traditional detection approaches, our method innovatively incorporates images that reflect the content of the poetry. Through example-driven approaches, our method effectively integrates information such as meaning, imagery, and feeling from the image, then forms a complementary judgment with the poem text. Experimental results demonstrate that the LLM detectors based on our method outperform baseline detectors based on plain text, and even surpass the best-performing traditional detector, RoBERTa. The Gemini detector using our method achieves a Macro-F1 score of 85.65%, reaching the state-of-the-art level. The performance improvements of different LLM detectors on multiple LLMs-generated data prove the effectiveness of our method.
21. 【2605.22651】What Does the Caption Really Say? Counterfactual Phrase Intervention for Compositional Data Selection in Vision-Language Pretraining
Link: https://arxiv.org/abs/2605.22651
Authors: Hyejin Go, Semi Lee, Hyesong Choi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: CLIP-style contrastive pretraining, contrastive pretraining typically, pretraining typically curates, typically curates web-scale, curates web-scale image-text
Notes: 11 pages, 2 figures, 4 tables. Preprint
Abstract:CLIP-style contrastive pretraining typically curates web-scale image-text pairs using sample-level filtering signals, often based on pair-level alignment. We show that this signal saturates: once coarse mismatches are removed, stricter global filtering no longer tracks the compositional supervision provided by the retained captions. The reason is structural - a global score conflates whether a pair is broadly plausible with whether the individual object, attribute, and relation phrases inside the caption materially support the image-text match. The latter is what compositional generalization demands, yet pair-level filters are blind to it. We address this with Counterfactual Phrase Intervention (CPI), a phrase-level curation framework that converts controlled nonce-token substitutions into image-conditioned phrase-sensitivity scores. CPI uses global alignment only for coarse mismatch removal, then ranks the surviving pool by whether caption phrases measurably affect the image-text score under controlled substitution. We frame CPI as a first-order phrase-sensitivity signal rather than a grounding or identification result, and evaluate it at CC3M scale. Ranking by this signal yields a 50%-data subset that improves VL-CheckList-VG Relation by +1.91 over the full-data baseline and +1.00 over alignment-only filtering at matched budget, while improving SugarCrepe overall and preserving general transfer. CPI is loss-orthogonal: applied unchanged to NegCLIP, it further improves VL-CheckList-VG Relation by +3.84, with additional CE-CLIP gains in the main text.
22. 【2605.22649】From Baseline to Follow-Up: Counterfactual Spine DXA Image Synthesis in UK Biobank Using a Causal Hierarchical Variational Autoencoder
Link: https://arxiv.org/abs/2605.22649
Authors: Yilin Zhang, Nicholas C. Harvey, Nicholas R. Fuggle, Rahman Attar
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Dual-energy X-ray absorptiometry, Dual-energy X-ray, large-scale skeletal assessment, variation remains challenging, interpretable factor-specific anatomical
Notes: 7 pages, 4 figures, 3 tables. Accepted at the 48th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2026)
Abstract:Dual-energy X-ray absorptiometry (DXA) is widely used for large-scale skeletal assessment, yet learning controllable and interpretable factor-specific anatomical variation remains challenging. We propose a metadata-conditioned causal hierarchical variational autoencoder (CHVAE) for causally consistent generation of anteroposterior (AP) spine DXA images from the UK Biobank (UKB). The model is trained on 3,743 raw AP spine scans from the first imaging visit and conditioned on basic participant attributes and lumbar morphometry. Causal consistency is evaluated in a baseline-to-follow-up setting using abduction--action--prediction (AAP): latent variables are abducted from baseline images, age is intervened to the repeat-imaging value, and the resulting counterfactual follow-up morphometry is compared with observed repeat-imaging measurements. Results show strong absolute-level agreement for key vertebral morphometry variables under age intervention, supporting intervention-aligned synthesis of anatomically plausible DXA images.
23. 【2605.22635】The Double Dilemma in Multi-Task Radiology Report Generation: A Gradient Dynamics Analysis and Solution
Link: https://arxiv.org/abs/2605.22635
Authors: Erjian Zhang, Yatong Hao, Liejun Wang, Zhiqing Guo
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: radiology report generation, automatic radiology report, multi-task learning based, learning based automatic, based automatic radiology
Notes: Accepted by ICML 2026
Abstract:While multi-task learning based automatic radiology report generation (RRG) is widely adopted to ensure clinical consistency, most methods focus on architectural design yet remain limited to coarse linear scalarization strategies. These strategies cannot effectively balance the hard constraints of discriminative clinical supervision with the smoothness requirements of report generation. To address these problems, we analyze the failure mechanism of linear scalarization from the perspective of gradient dynamics, utilizing the stochastic differential equation (SDE) framework to characterize it as a "Double Dilemma" of drift term deviation and diffusion term decay. Based on this, we propose a backbone-agnostic optimizer named Conflict-Averse Magnitude-Enhanced Gradient Descent (CAME-Grad). Through conflict-averse direction rectification and magnitude-enhanced energy injection, the algorithm not only ensures geometric validity but also avoids locally optimal solutions. An adaptive gradient fusion mechanism then establishes a dynamic balance between the theoretically optimal direction and the task-specific inductive bias. Experiments show that as a universal plug-and-play optimizer, CAME-Grad brings substantial and consistent improvements across eight diverse RRG methods, elevating overall clinical efficacy performance by an average of 2.3\% on MIMIC-CXR and 1.9\% on IU X-Ray. Our code is available at this https URL.
24. 【2605.22631】AtomicMotion: Learning Human Motion From Different Human Parts
Link: https://arxiv.org/abs/2605.22631
Authors: Runzhen Liu, Chuhua Xian, Fa-Ting Hong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Accurately reconstructing full-body, Accurately reconstructing, head and hand, hand trajectories, foundational challenge
Abstract:Accurately reconstructing full-body poses from sparse head and hand trajectories is a foundational challenge for immersive AR/VR telepresence. Current methods often struggle with error accumulation and unnatural joint coordination, primarily because they treat the human body as a monolithic entity, thereby failing to capture the fine-grained ``atomic intents'' embedded in subtle signal variations and overlooking the inherent structural topology. To bridge this gap, we present AtomicMotion, a framework designed to decouple and re-integrate body dynamics through three core innovations. First, we introduce a logical body partitioning scheme that decomposes the skeleton into five distinct clusters based on functional intent; this ensures that each partition preserves internal joint synergies while isolating local motion primitives. Second, to robustly map sparse inputs to high-dimensional poses, we employ a masked full-body pre-conditioning strategy during training, forcing the model to internalize global skeletal topology and latent kinematic constraints. Finally, addressing the limitations of vanilla spatial attention, which often ignores fixed physiological connectivity, we propose Kinematic Attention. By embedding the classical kinematic tree structure into the attention mechanism, we ensure biological plausibility in the synthesized motions. Extensive evaluations on the AMASS dataset demonstrate that AtomicMotion significantly outperforms existing baselines, yielding higher reconstruction fidelity and superior biomechanical realism.
25. 【2605.22629】H-Flow: Self-supervised Human Scene Flow via Physics-inspired Joint Multi-modal Learning
Link: https://arxiv.org/abs/2605.22629
Authors: Zhanbo Huang, Xiaoming Liu, Yu Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: non-rigid surface dynamics, soft tissue, represent the non-rigid, dynamics of clothing, clothing and soft
Notes: 19 pages, 7 figures, 4 tables
Abstract:Parametric human models capture global pose but cannot represent the non-rigid surface dynamics of clothing and soft tissue. Generic scene flow estimates dense motion but breaks down on articulated bodies, where pixel-level supervision is also intractable to acquire. We introduce H-Flow, a dense human scene flow that captures both skeletal kinematics and surface deformation. A unified multi-head transformer estimates flow from monocular video, jointly predicting pose and depth as companion outputs. The challenge lies in the lack of supervision. In place of unattainable labels, we anchor the network in the physics of human motion, encoding geometric, structural, and biomechanical priors as cross-modal training objectives. We further introduce DynAct4D, a high-fidelity synthetic benchmark providing dense flow annotations across diverse subjects, garments, and motions. On standard benchmarks, H-Flow outperforms scene-flow and parametric baselines, and generalizes zero-shot to in-the-wild video. Code, models, and the DynAct4D benchmark will be released upon publication.
26. 【2605.22619】GLeVE: Graph-Guided Lesion Grounding with Proposal Verification in 3D CT
Link: https://arxiv.org/abs/2605.22619
Authors: Shuo Jiang, Yuhao Hong, Chunbo Jiang, Weihong Chen, Huangwei Chen, Shenghao Zhu, Beining Wu, Mingxuan Liu, Zhu Zhu, Feiwei Qin, Min Tan, Yifei Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: verifiable clinical interpretation, remains challenging due, Grounding radiology report, radiology report descriptions, clinical interpretation
Notes: 11 pages, 4 figures
Abstract:Grounding radiology report descriptions to 3D CT volumes is essential for verifiable clinical interpretation, yet remains challenging due to the semantic-spatial gap between free-text narratives and volumetric anatomy. Existing report-assisted and vision-language grounding methods typically rely on phrase-level alignment or dense pixel supervision, resulting in limited lesion-wise correspondence and suboptimal localization accuracy. We propose GLeVE, a graph-guided lesion grounding framework with anatomical prior verification and octree-based autoregressive refinement. GLeVE treats each lesion description as an atomic semantic unit and encodes organ attribution, attributes, and inter-lesion relations through relation-aware graph reasoning to produce discriminative lesion-wise queries. Anatomy-aware proposal generation with region-level verification enforces one-to-one text-lesion alignment, while hierarchical octree refinement progressively improves boundary delineation. Experiments on AbdomenAtlas 3.0 demonstrate consistent gains over classical multimodal foundation models and report-supervised baselines in both segmentation accuracy and lesion-level localization.
27. 【2605.22607】Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following
Link: https://arxiv.org/abs/2605.22607
Authors: Shijing Wang, Yaping Huang, Chaoqun Cui, David Wong, Yihua Cheng, Alexandros Neophytou, Hyung Jin Chang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Gaze, in-scene person, gaze reasoning, scene understanding, scene
Notes: 11 pages, 8 figures
Abstract:Gaze following requires both scene understanding and gaze reasoning to localize the gaze target of an in-scene person. Recently, vision foundation models (VFMs) have demonstrated strong performance on this task, enabling simpler architectures while outperforming prior methods. However, we observe a key limitation of VFM-based approaches: while VFMs substantially improve scene understanding, they contribute little to gaze reasoning. As a result, existing methods often rely on semantically salient objects rather than true gaze cues, leading to degraded performance when targets are not salient. To address this, we propose a novel training mechanism to enhance gaze reasoning in VFMs for gaze following. Our method includes: (1) a head-conditioned local LoRA, which enables localized adaptation to preserve scene token learning while improving head token learning for gaze reasoning; and (2) an out-of-cone penalty, which injects gaze cues into head tokens while aligning them with scene tokens. Experiments on the GazeFollow and VAT datasets demonstrate that our method achieves state-of-the-art performance, with particularly strong improvements when gaze targets are not semantically salient. Our findings offer valuable insights for advancing future gaze following research. We will release the code once the paper is accepted.
28. 【2605.22605】Decoupling Ego-Motion from Target Dynamics via Dual-Interval Motion Cues for UAV Detection
Link: https://arxiv.org/abs/2605.22605
Authors: Liuyang Wang, Feitian Zhang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Unmanned Aerial Vehicles, Aerial Vehicles, Unmanned Aerial, large scale variations, scale variations
Comments:
Abstract:Object detection from Unmanned Aerial Vehicles (UAVs) is challenged by severe ego-motion, camera jitter, and large scale variations. While modern detectors perform well on static images, their direct application to UAV video often fails, particularly for small objects in dynamic scenes. Existing motion-based methods either rely on computationally expensive optical flow or use single-interval differencing, which is sensitive to jitter and limited in capturing diverse motion patterns. We propose a vision-only motion-guided detection framework that decouples target motion from camera-induced disturbances. A homography-based Global Motion Compensation (GMC) first aligns adjacent frames. We then introduce a Dual-Interval Motion Extraction strategy that captures both short-term and long-term motion cues. To integrate these cues, a lightweight Motion-Guided Attention (MGA) module enhances feature representations within a Feature Pyramid Network. Experiments on the VisDrone-VID dataset demonstrate consistent improvements over a strong YOLOv8 baseline under severe ego-motion. Ablation studies further confirm the effectiveness of the dual-interval design and the proposed motion-guided attention mechanism.
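The dual-interval idea in this abstract can be illustrated with a minimal sketch: absolute frame differences taken at a short and a long interval on frames that are assumed to be already aligned. The function names, the interval values (1 and 3), and the nested-list frame format are assumptions made for illustration, not the authors' implementation. The homography-based compensation step is not shown.

```python
def abs_diff(frame_a, frame_b):
    """Pixel-wise absolute difference of two equally sized grayscale frames."""
    return [[abs(a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)]


def dual_interval_motion(frames, t, short=1, long=3):
    """Return (short-term, long-term) difference maps for frame index t.

    Assumes the frames are already aligned to frame t (for example by
    global motion compensation) and that t >= long.
    """
    if t < long:
        raise ValueError("need at least `long` earlier frames")
    short_map = abs_diff(frames[t], frames[t - short])
    long_map = abs_diff(frames[t], frames[t - long])
    return short_map, long_map


# A 1x4 toy "video" in which a bright pixel moves one step right per frame.
frames = [[[9, 0, 0, 0]], [[0, 9, 0, 0]], [[0, 0, 9, 0]], [[0, 0, 0, 9]]]
short_map, long_map = dual_interval_motion(frames, t=3)
```

In this toy case the short interval highlights the two most recent positions of the pixel, while the long interval highlights its current position and where it was three frames earlier.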
29. 【2605.22591】Rethinking Noise-Robust Training for Frozen Vision Foundation Models: A Cross-Dataset Benchmark with a Case Study of Small-Loss Failure
Link: https://arxiv.org/abs/2605.22591
Authors: Zitong Li, Haoyu Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Vision Foundation Models, Frozen Vision Foundation, Foundation Models, Vision Foundation, lightweight classification heads
Comments:
Abstract:Frozen Vision Foundation Models (VFMs) with lightweight classification heads are increasingly used in medical imaging because they offer efficient and reproducible deployment. Yet noisy-label learning methods for this frozen-feature regime remain poorly understood, and most existing methods still rely on a small-loss assumption inherited from end-to-end training. We present a controlled benchmark of eight noisy-label methods across five medical datasets, three backbones, two noise types, and five noise rates (150 conditions, 6,000 training runs), evaluated with balanced accuracy. The benchmark shows that there is no universal winner: Friedman ranking over the 150 conditions yields $\chi^2 = 333.2$ ($p = 4.77 \times 10^{-68}$), ELR wins the most conditions (49/150), while CUFIT attains the best mean rank (2.51). The practical cost of method choice grows sharply with noise severity, from 4.5pp on clean data to 18.8pp at asymmetric 40% noise. To explain these benchmark-level patterns, we revisit the small-loss assumption in a representative high-risk regime. Under frozen DINOv2 features, clean and noisy loss distributions overlap by 53--61%, and matched-rate clean-sample detection shows that prediction agreement is markedly more stable than loss ranking under asymmetric noise (3pp vs. 13pp precision drop). On ISIC2019 with asymmetric 40% noise, Co-Teaching reaches 68% overall accuracy while collapsing to 35.1% balanced accuracy with zero recall on three minority classes. Together, these results recast noisy-label learning for frozen VFMs as a regime-aware method-selection problem rather than a search for a single dominant algorithm. We conclude with evidence-based guidance and a low-regret feature-space selector for practical recommendation.
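The gap this abstract reports between overall accuracy and balanced accuracy can be reproduced on toy data. The sketch below uses the standard definition of balanced accuracy as the mean of per-class recall; the labels and predictions are made up for illustration and have no connection to the paper's datasets.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy data (made up): 8 majority samples and 1 sample each of two minority classes.
y_true = ["a"] * 8 + ["b", "c"]
# A collapsed classifier that predicts only the majority class.
y_pred = ["a"] * 10

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
```

Here the collapsed classifier scores 0.8 on plain accuracy but only one third on balanced accuracy, because two of the three classes have zero recall.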
30. 【2605.22581】SceneAligner: 3D-Grounded Floorplan Localization in the Wild
Link: https://arxiv.org/abs/2605.22581
Authors: Junhyeong Cho, Ruojin Cai, Hadar Averbuch-Elor
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: public buildings provide, buildings provide floorplans, visitors orient, Floorplan localization, Floorplan
Comments: Project Page: [this https URL](https://Cornell-VAILab.github.io/SceneAligner)
Abstract:Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. Floorplan localization seeks to computationally replicate this capability by determining where visual observations were captured within a floorplan. However, existing methods typically assume controlled small-scale environments and precise vectorized floorplans, limiting their ability to operate in large-scale buildings and rasterized floorplans. In this work, we present an approach for performing floorplan localization in the wild by grounding the task in a reconstructed 3D representation of the scene. Given an unconstrained image collection, our method reconstructs a gravity-aligned 3D scene and projects it into a 2D density map that serves as a floorplan proxy. Floorplan localization is then formulated as aligning this proxy with the input floorplan via a 2D similarity transform. To bridge the appearance gap between density maps and architectural floorplans, we adapt a 2D foundation model to learn cross-modal correspondences, introducing a fine-tuning scheme that encourages semantically aligned matches while preserving structural consistency. Extensive experiments demonstrate substantial improvements over prior methods, including in extremely sparse settings with as little as a single input image. Our code and data will be publicly available.
31. 【2605.22578】Beyond Chamfer Distance: Granular Order-aware Evaluation Metric For Online Mapping
Link: https://arxiv.org/abs/2605.22578
Authors: Chouaib Bencheikh Lehocine, Adam Lilja, Junsheng Fu, Lars Hammarstrand
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: autonomous driving systems, costly high-definition maps, crucial component, component of autonomous, autonomous driving
Comments:
Abstract:Online map estimation is a crucial component of autonomous driving systems that reduces the reliance on costly high-definition maps. State-of-the-art (SOTA) methods commonly predict map elements as ordered sequences of points that form polylines and polygons. The evaluation of these methods relies predominantly on mean average precision (mAP) based on thresholded Chamfer distance (CD). This framework lacks sensitivity to point ordering and provides limited granularity in assessing geometric quality, making it difficult to distinguish which methods truly excel over others. In this work, we address these limitations on two fronts. For the single-instance similarity measure, we introduce sequence optimal sub-pattern assignment (SOSPA), an order-aware metric that enables fine-grained evaluation of individual geometries while satisfying all metric axioms. For the multi-instance evaluation framework, we propose polyline localisation and detection (PLD), a soft metric that jointly captures detection quality and geometric accuracy, replacing the hard thresholding of mAP with a principled soft assignment. Through evaluations on nuScenes, we demonstrate that PLD effectively ranks SOTA online mapping methods (MapTRv2, StreamMapNet, MapTracker) while providing a decomposed error analysis. This analysis identifies detection capability as the dominant bottleneck in current methods, revealing a performance trend that mAP fails to capture. Code for evaluation using our metrics will be released.
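The claim that Chamfer distance is insensitive to point ordering is easy to check numerically. The sketch below uses one common symmetric, mean-based form of the Chamfer distance; the exact variant used in the mAP evaluation the abstract refers to may differ, and the polylines are toy values chosen for illustration.

```python
import math


def chamfer_distance(p, q):
    """Symmetric Chamfer distance between two 2D point sets.

    Uses the mean nearest-neighbour distance in each direction, averaged
    over both directions (one common convention, assumed here).
    """
    def one_way(src, dst):
        return sum(min(math.dist(s, d) for d in dst) for s in src) / len(src)

    return 0.5 * (one_way(p, q) + one_way(q, p))


polyline = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
reversed_polyline = polyline[::-1]                  # same points, opposite direction
shifted = [(x, y + 1.0) for x, y in polyline]       # same order, offset by 1 in y

cd_reversed = chamfer_distance(polyline, reversed_polyline)
cd_shifted = chamfer_distance(polyline, shifted)
```

The reversed polyline scores a distance of zero even though its direction is flipped, which is the kind of ordering error an order-aware measure is meant to expose.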
32. 【2605.22572】SegGuidedNet: Sub-Region-Aware Attention Supervision for Interpretable Brain Tumor Segmentation
Link: https://arxiv.org/abs/2605.22572
Authors: Hasaan Maqsood, Saif Ur Rehman Khan, Sebastian Vollmer, Andreas Dengel, Muhammad Nabeel Asim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: remains challenging due, Accurate segmentation, multi-parametric MRI, MRI is critical, brain tumour sub-regions
Comments:
Abstract:Accurate segmentation of brain tumour sub-regions from multi-parametric MRI is critical for treatment planning yet remains challenging due to morphological variability, class imbalance, and overlapping appearances of tumour regions across imaging sequences. We propose SegGuidedNet, a three-dimensional residual encoder--decoder network introducing a novel SegAttentionGate module that explicitly supervises the decoder to produce spatially discriminative attention maps for each tumour sub-region (necrotic core, peritumoral oedema, and enhancing tumour) via a lightweight auxiliary loss, adding less than 0.2% parameter overhead. This sub-region supervision maintains decoder discriminability between visually ambiguous classes while providing free-of-cost spatial interpretability at inference without any post-hoc explanation method. Evaluated independently on BraTS2021 and BraTS2023 GLI across 251 held-out subjects each, SegGuidedNet achieves mean Dice of 0.905 (ET=0.873, TC=0.906, WT=0.935) and 0.897 (ET=0.859, TC=0.902, WT=0.931) respectively, surpassing ensemble-based nnU-Net and HNF-Netv2 as a single model and approaching Swin UNETR (a 10-model ensemble) within 2--4 Dice points at a fraction of the inference cost. The consistency of results across two benchmark editions further confirms the generalisability of the proposed approach, offering competitive accuracy with built-in interpretability in a lightweight, clinically practical framework.
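For readers unfamiliar with the metric, the sketch below computes the Dice coefficient on toy binary masks and checks that the reported mean Dice values are the plain averages of the three sub-region scores quoted in the abstract. The masks are made up, and treating two empty masks as a perfect match is a convention assumed here.

```python
def dice(pred, target):
    """Dice coefficient of two binary masks given as flat lists of 0/1."""
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks (made up): one overlapping voxel, two predicted, one true.
toy_score = dice([1, 1, 0, 0], [1, 0, 0, 0])

# Mean of the per-region scores quoted in the abstract.
mean_2021 = (0.873 + 0.906 + 0.935) / 3
mean_2023 = (0.859 + 0.902 + 0.931) / 3
```

Rounded to three decimals, the two means agree with the 0.905 and 0.897 figures stated in the abstract.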
33. 【2605.22570】VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis
Link: https://arxiv.org/abs/2605.22570
Authors: Jinho Park, Youbin Kim, Hogun Park, Eunbyung Park
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, capability for Multimodal
Comments: 82 pages, 91 figures (7 in main paper, 84 in appendix). Project page: [this https URL](https://zinosii.github.io/VGenST-Bench/)
Abstract:Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning benchmark datasets primarily rely on static image sets or passively curated video data, which limits the evaluation of fine-grained reasoning capabilities. In this paper, we introduce VGenST-Bench, a video benchmark that employs generative models to actively synthesize highly controlled and diverse evaluation scenarios. To construct VGenST-Bench, we propose a multi-agent pipeline incorporating a human quality control stage, ensuring the quality of all generated videos and QA pairs. We establish a comprehensive 3x2x2 video taxonomy, encompassing Spatial Scale, Perspective, and Scene Dynamics to span diverse scenarios. Furthermore, we design a hierarchical task suite that decouples low-level visual perception from high-level spatio-temporal reasoning. By shifting the paradigm from passive curation to active synthesis, VGenST-Bench enables fine-grained diagnosis of spatio-temporal understanding in MLLMs.
34. 【2605.22563】Cell Phantom Video Generation in Elliptical Fourier Descriptor Domain
Link: https://arxiv.org/abs/2605.22563
Authors: Francesco Benedetto, Roberto Basla, Luca Magri, Giacomo Boracchi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Training Deep Neural, Deep Neural Networks, Training Deep, Deep Neural, Neural Networks
Comments: 6 pages, Accepted at the International Conference on Image Processing (ICIP) 2026
Abstract:Training Deep Neural Networks for tracking individual cells in biomedical videos requires a large amount of annotated data. The annotation of videos for cell tracking is very time consuming and often requires domain expertise; this explains the limited availability of public annotated data to address important medical problems like tissue repair or cancer treatment. Generating synthetic videos along with their Ground Truth annotations is a promising solution that relies, as a foundational first step, on the synthesis of single cell annotations (or phantoms). Phantoms need to be time consistent, as they have to replicate biological processes that are specific to the cell types. In this work, we propose a novel framework for generating videos of cell phantoms in the Elliptical Fourier Descriptors (EFDs) domain, a compact and geometrically interpretable representation for 2D closed contours. We represent the cell phantom evolution as a multivariate time series of EFD coefficients, introducing a strong prior for cell morphology and enabling the efficient generation of sequences that evolve coherently in time. Our experimental validation proves that modelling the temporal evolution in EFD space enables the generation of biologically plausible phantom videos. Our method can be used in generative pipelines for synthesizing annotated data for cell tracking, thus strongly mitigating the annotation effort for creating new datasets. Our code is available for download here: this https URL.
35. 【2605.22558】GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning
Link: https://arxiv.org/abs/2605.22558
Authors: Deshui Miao, Xingsen Huang, Yameng Gu, Xin Li, Haijun Zhang, Ming-Hsuan Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: preserve physical geometry, semantic appearance, preserve physical, geometric, visual tokens
Comments:
Abstract:Spatio-temporal reasoning in vision-language models requires visual representations that preserve physical geometry rather than merely semantic appearance. Recent multimodal models incorporate geometric information through structural branches, 3D-aware supervision, reasoning-stage fusion, or long-horizon memory. While these approaches demonstrate the importance of geometry for spatial intelligence, they typically treat geometric cues as a shared signal across all visual tokens. We note that this overlooks a finer-grained challenge: different visual tokens require different geometric evidence depending on their spatial roles. To address this limitation, we introduce GeoWeaver, a pre-reasoning geometric grounding framework that treats geometry as a representational prerequisite for spatio-temporal reasoning. GeoWeaver constructs a multi-level geometry bank from a frozen geometry encoder and performs token-adaptive geometric evidence allocation, enabling each visual token to retrieve the most relevant geometric abstractions. The selected evidence is incorporated into visual tokens via a residual grounding operation prior to language modeling, yielding geometry-grounded representations for downstream reasoning. Extensive evaluations on spatial reasoning benchmarks demonstrate that GeoWeaver consistently enhances geometry-aware reasoning while retaining general multimodal capabilities. This indicates that geometric information yields the greatest benefit not as a late-fusion auxiliary signal but as a fundamental prerequisite that shapes the representational foundation on which large language models perform reasoning. All source code and models will be released at this https URL.
36. 【2605.22552】FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning
Link: https://arxiv.org/abs/2605.22552
Authors: Haokun Wen, Xuemeng Song, Xinghao Xie, Xiaolin Chen, Xiangyu Zhao, Weili Guan
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Keywords: modern e-commerce systems, e-commerce systems, Fashion image retrieval, cornerstone of modern, modern e-commerce
Comments:
Abstract:Fashion image retrieval is a cornerstone of modern e-commerce systems. A unified framework that supports diverse query formats and search intentions is highly desired in practice. However, existing approaches focus on narrow retrieval tasks and do not fully capture such diversity. Therefore, in this work, we aim to develop a unified framework capable of handling diverse realistic fashion retrieval scenarios, achieving truly versatile fashion image retrieval. To establish a data foundation, we first introduce U-FIRE, a comprehensive benchmark that consolidates fragmented fashion datasets into a unified collection, supplemented by two manually curated datasets for testing generalization. Building upon this, we propose FashionLens, a unified framework based on Multimodal Large Language Models. To handle divergent matching objectives, we design a Proposal-Guided Spherical Query Calibrator that dynamically shifts query representations into task-aligned metric spaces via adaptive spherical linear interpolation. Additionally, to mitigate the optimization imbalance caused by varying task complexities and data scales, we develop a Gradient-Guided Adaptive Sampling strategy that automatically re-weights tasks based on realtime learning difficulty and the data scale prior. Experiments on U-FIRE show that FashionLens achieves state-of-the-art performance across diverse retrieval scenarios and generalizes robustly to unseen tasks. The data and code are publicly released at this https URL.
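The abstract mentions shifting query representations via spherical linear interpolation. As a reference point, the sketch below implements the textbook slerp formula on unit vectors; how the paper chooses the interpolation weight or the target direction is not stated in the abstract, so `query`, `task_anchor`, and the weight 0.5 are hypothetical values for illustration only.

```python
import math


def slerp(u, v, t):
    """Spherical linear interpolation between unit vectors u and v, t in [0, 1]."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    omega = math.acos(dot)            # angle between the two vectors
    if omega < 1e-8:                  # nearly identical directions
        return list(u)
    s = math.sin(omega)
    if s < 1e-8:                      # antipodal vectors: the path is ambiguous
        raise ValueError("slerp is undefined for antipodal vectors")
    w_u = math.sin((1.0 - t) * omega) / s
    w_v = math.sin(t * omega) / s
    return [w_u * a + w_v * b for a, b in zip(u, v)]


query = [1.0, 0.0]          # hypothetical unit-norm query embedding
task_anchor = [0.0, 1.0]    # hypothetical unit-norm task direction
calibrated = slerp(query, task_anchor, 0.5)
norm = math.sqrt(sum(x * x for x in calibrated))
```

Unlike plain linear interpolation, the result stays on the unit sphere, which matters when retrieval scores are cosine similarities.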
37. 【2605.22550】MOTOR: A Multimodal Dataset for Two-Wheeler Rider Behavior Understanding
Link: https://arxiv.org/abs/2605.22550
Authors: Varun A. Paturkar, Shankar Gangisetty, C.V. Jawahar
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Global South, disproportionately high share, Advanced Driver Assistance, Driver Assistance Systems, disproportionately high
Comments:
Abstract:Two-wheelers account for a disproportionately high share of road fatalities in the Global South. Research on two-wheeler rider behavior, however, lags far behind four-wheelers, where multimodal datasets have driven major advances in Advanced Driver Assistance Systems (ADAS). To address this gap, we present the MOtorized TwO-wheeler Rider (MOTOR) dataset, the first large-scale, multi-view, multimodal resource dedicated to two-wheelers in dense, unstructured traffic. MOTOR comprises 1,629 sequences (25+ hours of video data) collected from 16 riders and integrates synchronized front, rear, and helmet videos, rider eye-gaze from wearable trackers, on-road audio, and telemetry (GPS, accelerometer, gyroscope). Rich annotations capture traffic context, rider state, 12 riding maneuvers spanning conventional and unconventional behaviors, and legality labels (Legal, Illegal, Unspecified). We benchmark rider behavior recognition and maneuver legality classification using state-of-the-art video action recognition backbones (CNN and Transformer-based), extended with multimodal fusion, and find that combining RGB, gaze, and telemetry consistently yields the best performance. MOTOR thus provides a unique foundation for advancing safety-critical understanding of two-wheeler riding. It offers the research community a benchmark to develop and evaluate models for behavior analysis, legality-aware prediction, and intelligent transportation systems. Dataset and code are available at https: //varuniiith.this http URL
38. 【2605.22547】Case-Aware Medical Image Classification with Multimodal Knowledge Graphs and Reliability-Guided Refinement
Link: https://arxiv.org/abs/2605.22547
Authors: Yiming Xu, Yixuan Liu, Yuhang Zhang, Ling Zheng, Yihan Wang, Qi Song
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: brought significant progress, Deep learning, effectively leverage similar, leverage similar cases, learning has brought
Comments:
Abstract:Deep learning has brought significant progress to medical image classification, yet most existing methods still rely on isolated visual evidence and cannot effectively leverage similar cases or external knowledge. In clinical practice, diagnosis is typically supported by historical similar cases and their associated symptoms. To simulate this diagnostic process, we propose a framework that performs case-aware reasoning using multimodal knowledge graphs for explainable medical image diagnosis. Given an input image, our method constructs a multimodal knowledge graph from adaptively retrieved similar cases, enabling more effective utilization of related samples. We further introduce a knowledge propagation and injection mechanism, where an image-centric Graph Attention Network propagates knowledge semantics to obtain case-based features, followed by a bidirectional cross-modal attention mechanism that injects these features into visual representations for cross-modal alignment. To mitigate noisy retrieval, we design a confidence-calibrated decision refinement scheme that estimates the reliability of each retrieved case by jointly considering prediction confidence and sample similarity, adaptively adjusting its contribution to the final prediction and providing interpretable case-level evidence. Extensive experiments on multiple medical imaging datasets show that our approach consistently outperforms strong baselines, and ablation studies validate the effectiveness of each component. The source code is publicly available at this https URL.
39. 【2605.22538】Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking
Link: https://arxiv.org/abs/2605.22538
Authors: Deyi Zhu, Yuji Wang, Yong Liu, Yansong Tang, Bingyao Yu, Jiwen Lu, Jie Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Traditional visual object, Traditional visual, visual object tracking, task-specific supervised training, methods typically rely
Comments:
Abstract:Traditional visual object tracking (VOT) methods typically rely on task-specific supervised training, limiting their generalization to unseen objects and challenging scenarios with distractors, occlusion, and nonlinear motion. Recent vision foundation models, exemplified by SAM 2, learn strong video understanding priors from large-scale pretraining and offer a promising foundation for building more robust and generalizable trackers. However, directly applying SAM 2 to VOT remains suboptimal, as it does not explicitly model target motion dynamics or enforce geometric and semantic consistency across frames, both of which are essential for reliable tracking. To address this issue, we propose SAMOSA, a new tracking framework that adapts SAM 2 to complex VOT scenarios by explicitly leveraging motion, geometry, and semantic cues. Specifically, we introduce a lightweight nonlinear motion predictor to model target dynamics and guide mask selection as well as memory filtering. We further exploit semantic cues to detect target shifts and recover from tracking failures, while geometric cues are incorporated as structural constraints to improve tracking stability. In this way, SAMOSA bridges the gap between the implicit video understanding prior of SAM 2 and explicit tracking-oriented modeling. Extensive experiments show that SAMOSA consistently outperforms state-of-the-art SAM 2--based approaches on general benchmarks, demonstrates stronger generalization than supervised VOT methods, and achieves substantial gains on anti-UAV datasets, which typify complex nonlinear motion scenarios. Our code is available at this https URL.
40. 【2605.22536】SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation
Link: https://arxiv.org/abs/2605.22536
Authors: Xiaolong Zhou, Yifei Liu, Ziyang Gong, Jiarui Li, Qiyue Zhao, Muyao Niu, Yuanyuan Gao, Le Ma, Xue Yang, Hongjie Zhang, Zhihang Zhong
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language
Comments:
Abstract:Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens distortion, and compression artifacts. This raises a fundamental question: how robust is the spatial intelligence of current MLLMs when visual observations are imperfect? To answer this question, we introduce SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. It is constructed with a physically grounded degradation synthesis engine that embeds the degradation formation process into 3D Gaussian Splatting (3DGS) rendering, enabling realistic simulation of nine degradation types. The resulting dataset contains approximately 1M QA pairs from nearly 1,000 indoor scenes. We further introduce SpaceDG-Bench, a human-verified benchmark with 1,102 questions spanning 11 reasoning categories and 9 visual degradation types, yielding over 10K VQA instances. Evaluating 25 open- and closed-source MLLMs reveals that visual degradations consistently and substantially impair spatial reasoning, exposing a critical robustness gap. Finally, we show that finetuning on SpaceDG markedly improves degradation robustness and can even surpass human performance under degraded conditions without any performance drop on clean images, highlighting the promise of degradation-aware training for robust spatial intelligence.
41. 【2605.22504】LACO: Adaptive Latent Communication for Collaborative Driving
Link: https://arxiv.org/abs/2605.22504
Authors: Tianhao Chen, Yuheng Wu, Dongman Lee
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: enabling connected vehicles, partial observability, aims to improve, improve safety, safety and efficiency
Comments:
Abstract:Collaborative driving aims to improve safety and efficiency by enabling connected vehicles to coordinate under partial observability. Recent approaches have evolved from sharing visual features for perception to exchanging language-based reasoning through foundation models for behavioral coordination. Though communicating in language provides intuitive information, it introduces two challenges: high latency caused by autoregressive decoding and information loss caused by compressing rich internal representations into discrete tokens. To address these challenges, we analyze latent communication in collaborative driving under inherent limitations of multi-agent settings. Our analysis reveals agent identity confusion, where direct fusion of latent states entangles decision representations across vehicles. Motivated by this, we propose LACO, a training-free LAtent COmmunication paradigm that seamlessly adapts pretrained driving models to collaborative settings. LACO introduces Iterative Latent Deliberation (ILD) for latent reasoning, Cross-Horizon Saliency Attribution (CHSA) for communication-efficient information selection, and Structured Semantic Knowledge Distillation (SSKD) to stabilize ego-centric decision making. Closed-loop experiments in CARLA show that LACO notably reduces communication and inference latency while maintaining strong collaborative driving performance.
42. 【2605.22492】Training-Free Fine-Grained Semantic Segmentations in Low Data Regimes: A FungiTastic Baseline
Link: https://arxiv.org/abs/2605.22492
Authors: Sebastian Cavada, Francesco Pelosin, Lapo Faggi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: visually similar classes, similar classes, requires both precise, precise localization, localization and discrimination
Comments: Accepted at the 13th Workshop on Fine-Grained Visual Categorization, CVPR 2026
Abstract:Fine-grained semantic segmentation requires both precise localization and discrimination between visually similar classes. In FungiTastic, this problem is further complicated by a long-tailed distribution and strong variation in image acquisition conditions. We propose a training-free two-stage framework that decouples segmentation from classification. SAM3 first produces class-agnostic mushroom masks using macro-taxonomic prompts, and DINOv3 then assigns fine-grained labels through prototype matching in the embedding space. To improve this stage, we apply a simple transformation of the DINOv3 feature space that improves prototype-based classification. Compared with class-specific prompting, our approach is more scalable and keeps the segmentation cost low. We report results from one-shot to few-hundred-shot regimes, providing, to the best of our knowledge, the first baseline for fine-grained semantic segmentation in low-data settings.
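The second stage described in this abstract, label assignment by prototype matching in an embedding space, can be sketched generically as nearest-prototype classification under cosine similarity. The 2D vectors below stand in for real image embeddings, the species names are placeholders, and the feature-space transformation the abstract mentions is not specified there and is therefore omitted.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def build_prototypes(support):
    """Map each label to the normalized mean of its support embeddings."""
    prototypes = {}
    for label, vectors in support.items():
        mean = [sum(column) / len(vectors) for column in zip(*vectors)]
        prototypes[label] = l2_normalize(mean)
    return prototypes


def classify(embedding, prototypes):
    """Return the label whose prototype has the highest cosine similarity."""
    q = l2_normalize(embedding)
    return max(prototypes,
               key=lambda label: sum(a * b for a, b in zip(q, prototypes[label])))


# Placeholder few-shot support set: two made-up embeddings per class.
support = {
    "species_a": [[1.0, 0.1], [0.9, 0.0]],
    "species_b": [[0.0, 1.0], [0.2, 0.9]],
}
prototypes = build_prototypes(support)
label_1 = classify([0.8, 0.3], prototypes)
label_2 = classify([0.1, 0.9], prototypes)
```

Because adding a class only requires adding a prototype, this kind of matching needs no gradient updates, which is consistent with the training-free framing in the abstract.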
43. 【2605.22484】Supervised Classification Heads as Semantic Prototypes: Unlocking Vision-Language Alignment via Weight Recycling
Link: https://arxiv.org/abs/2605.22484
Authors: David Méndez, Roberto Confalonieri, Natalia Díaz Rodríguez
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: massive paired datasets, shared space, requires expensive, training with massive, images and text
Comments:
Abstract:Vision-Language Models (VLMs) excel at tasks like zero-shot classification and cross-modal retrieval by mapping images and text to a shared space, but this requires expensive end-to-end training with massive paired datasets. Current post-hoc alignment methods reduce computational costs by connecting pretrained encoders through lightweight mappings, yet still demand substantial paired data. In this work, we investigate the potential of repurposing the classification heads of pretrained vision models as semantic prototypes. The recycling of these weights, typically discarded after pretraining, unlocks two distinct capabilities: it enables zero-shot alignment by using weights as semantic anchors, and serves as a robust data augmentation strategy by mixing these prototypes with real image-text pairs. We demonstrate that integrating our approach with several state-of-the-art post-hoc alignment techniques consistently boosts accuracy in cross-modal retrieval, zero- and few-shot classification tasks.
44. 【2605.22478】Matching with Deliberation: Test-Time Evolutionary Hierarchical Multi-Agents for Zero-Shot Compositional Image Retrieval
Link: https://arxiv.org/abs/2605.22478
Authors: Xingtian Pei, Yukun Song, Changwei Wang, Shunpeng Chen, Rongtao Xu, Shibiao Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Compositional Image Retrieval, Zero-Shot Compositional Image, Compositional Image, reference image, requires both preserving
Comments: 10 pages, 5 figures, 4 tables
Abstract:Zero-Shot Compositional Image Retrieval (ZS-CIR) requires both preserving the visual continuity of the reference image and faithfully executing the semantic variables specified in the modification text, which constitutes the core challenge of the task. Existing methods often suffer from Perception Myopia in a single space, or fall into Logic Drift in iterative collaboration due to the perception ceiling of the underlying retriever. To address this issue, we propose a one-stop hierarchical Perception-to-Deliberation Framework (PDF), which, to the best of our knowledge, is the first to introduce experience self-evolution and Test-Time Scaling Law (TTS) into ZS-CIR. Relying on a hierarchical multi-agent architecture, PDF first utilizes an Intent Routing Manager to dynamically dispatch multi-view Worker perception signals based on modification intents to construct a high-recall candidate pool. Subsequently, the Decision Manager combines a Training-free Reasoning Policy Distillation mechanism with a Tournament-style TTS strategy to achieve self-evolving fine-grained reasoning, yielding the final retrieval results. Experimental results demonstrate that PDF achieves SOTA performance on three benchmark datasets: CIRR, CIRCO, and FashionIQ. This study indicates that experience-driven self-evolution and TTS represent a highly promising and scalable path for achieving zero-shot fine-grained multimedia retrieval. The code will be made publicly available upon acceptance.
45. 【2605.22469】MaSC: A Masked Similarity Metric for Evaluating Concept-Driven Generation
Link: https://arxiv.org/abs/2605.22469
Authors: Patryk Bartkowiak, Lennart Petersen, Bartosz Kotrys, Dominik Michels, Soren Pirk, Wojtek Palubicki
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: captures identity fidelity, diffusion requires measuring, generated scene matches, Evaluating single-concept personalization, captures identity
Comments: 20 pages, 2 figures, 7 tables
Abstract:Evaluating single-concept personalization in text-to-image diffusion requires measuring both concept preservation, which captures identity fidelity to a reference, and prompt following, which captures whether the generated scene matches the prompt. Existing metrics commonly compute these signals using global image or text-image embeddings, such as CLIP-I, DINO, and CLIP-T. We show that such metrics correlate poorly with human perception because they attend to the image as a whole instead of separating the concept subject from the background. We introduce MaSC, a masked similarity metric that uses externally provided foreground concept masks to decompose evaluation into subject-specific concept preservation and background-based prompt following. MaSC computes both scores from frozen SigLIP2 SO400M-NaFlex features: concept preservation is measured by masked max-cosine matching between foreground reference patches and generated-image patches, while prompt following is measured by comparing a background-only pooled image embedding to a subject-stripped prompt embedding. On DreamBench++ human ratings, MaSC achieves Krippendorff alpha = 0.471 for concept preservation, outperforming all tested non-LLM baselines and GPT-4V, and approaching GPT-4o. On ORIDa, a real-photo identity-preservation benchmark across physical environments, MaSC achieves AUC = 0.992, nearly perfectly distinguishing same-subject from cross-subject pairs. Its prompt-following score also outperforms the CLIP-T baseline shipped with DreamBench++. These results show that spatially decomposed aggregation is a strong design principle for evaluating concept-driven generation.
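The masked max-cosine matching described in the abstract above can be illustrated with a minimal sketch. It assumes patch features are plain lists of floats and that per-patch best matches are averaged into one score; the averaging step, the function names, and the toy vectors are assumptions for illustration, not details taken from the paper, and no SigLIP2 features are involved.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def masked_max_cosine(ref_patches, ref_mask, gen_patches):
    """For each foreground reference patch, take its best cosine match among
    the generated-image patches, then average those best matches.
    The mean aggregation is an assumption of this sketch."""
    best = [
        max(cosine(r, g) for g in gen_patches)
        for r, keep in zip(ref_patches, ref_mask)
        if keep  # background reference patches are ignored
    ]
    if not best:
        raise ValueError("mask selects no foreground patches")
    return sum(best) / len(best)


# Hypothetical toy features: three reference patches, two generated patches.
ref = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mask = [True, True, False]  # the third reference patch is background
gen = [[2.0, 0.0], [1.0, 1.0]]
score = masked_max_cosine(ref, mask, gen)
print(round(score, 4))
```

Because only foreground patches contribute, changes to the background of the reference image do not move the score in this sketch.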
46. 【2605.22467】SADGE: Structure and Appearance Domain Gap Estimation of Synthetic and Real Data
Link: https://arxiv.org/abs/2605.22467
Authors: Patryk Bartkowiak, Bartosz Kotrys, Dominik Michels, Soren Pirk, Wojtek Palubicki
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: common computer vision, computer vision tasks, quantitative similarity metric, downstream model training, similarity
Comments:
Abstract:We propose SADGE, a quantitative similarity metric that predicts the performance of synthetic image datasets for common computer vision tasks without downstream model training. Estimating whether a synthetic dataset will lead to a model that performs well on real-world data remains a bottleneck in model development. Existing evaluation metrics (e.g., PSNR, FID, CLIP) primarily measure semantic alignment between real and synthetic images (Appearance Similarity Score). Less commonly, structural similarity between images is considered to assess the domain gap (Geometric Similarity Score). However, to the best of our knowledge, there exist no studies that evaluate which similarity metric is the best downstream predictor for a given synthetic dataset. In this paper, we show over a wide variety of different synthetic datasets and downstream tasks that neither appearance nor geometry alone can reliably predict downstream performance; rather, it is their non-linear interplay that dictates synthetic data utility. Specifically, we measure how commonly used Appearance and Geometric Similarity metrics computed between synthetic and real images correlate with downstream performance in object detection, semantic segmentation, and pose estimation. Across five public synthetic-to-real benchmark families and 15 dataset-level variants (79k image pairs), SADGE achieves the strongest association with downstream transfer performance under both linear and rank-based criteria, reaching Pearson r=0.88 and Spearman rho=0.77. We compute SADGE scores for each combination of geometry-based methods and appearance-based approaches across all benchmark families. The best configuration is obtained by fusing DINOv3 appearance similarity with MASt3R geometric consistency through a constrained bilinear interaction, outperforming both the strongest geometry-only baseline and the strongest appearance-only baseline.
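The two association criteria named in the abstract above, Pearson r (linear) and Spearman rho (rank-based), can be computed as in the sketch below. The similarity and accuracy values are hypothetical placeholders, not data from the paper, and the sketch does not reproduce the SADGE score itself.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical dataset-level values: a similarity score per synthetic dataset
# and the downstream accuracy obtained when training on that dataset.
similarity = [0.2, 0.4, 0.5, 0.9]
accuracy = [0.30, 0.45, 0.40, 0.80]
print(round(pearson(similarity, accuracy), 3), round(spearman(similarity, accuracy), 3))
```

Reporting both is useful because Pearson r rewards a linear relationship, while Spearman rho only checks that the ordering of datasets is preserved.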
47. 【2605.22455】Making the Discrete Continuous: Synthetic RAW Augmentations for Fine-Grained Evaluation of Person Detection Performance in Low Light
Link: https://arxiv.org/abs/2605.22455
Authors: Valeria Pais, Malena Mendilaharzu, Daniele Faccio, Luis Oala, Christoph Clausen, Bruno Sanguinetti
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optics (physics.optics)
Keywords: Real-world deployment, training and testing, fueled and limited, Real-world, vision models
Comments: Accepted non-archival paper at the CVPR 2026 AUTOPILOT Workshop (Autonomous Understanding Through Open-world Perception and Integrated Language Models for On-road Tasks)
Abstract:Real-world deployment of AI vision models is both fueled and limited by the data available for training and testing. Real datasets are sparse and uneven: long-tailed or unbalanced distributions hinder generalization, and the low number of samples in low density regions makes it hard to run evaluations. Synthetic data can fill these gaps, providing us with a way to sample the input space more continuously and improve data coverage for benchmarks. Focusing on the autonomous driving safety-critical case of pedestrian detection in the dark, we show how synthetic low-light samples can be used to better characterize the performance of a state-of-the-art object detection model as a function of the scene illumination. We use a synthetic RAW image augmentation technique to generate low-light samples that match the noise model of the camera sensor. Performance metrics on real and synthetic low-light data are similar, indicating that the AI model finds it hard to distinguish between them.
48. 【2605.22446】Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts
Link: https://arxiv.org/abs/2605.22446
Authors: Zhen Sun, Yongjian Guo, Haoran Sun, Luqiao Wang, Wei Lu, Jiachi Ji, Shengzhe Ji, Junwu Xiong, Zhijun Meng
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Keywords: long-horizon embodied intelligence, generative world models, advanced long-horizon embodied, learning-based action generation, practical deployment remains
Comments:
Abstract:While large vision-language-action (VLA) models and generative world models (WM) have advanced long-horizon embodied intelligence, their practical deployment remains challenged by uncertainty in learning-based action generation. Low-quality actions may cause physical failures during execution or lead to misleading world-model rollouts with redundant rendering costs. To address this issue, we propose Pre-VLA, a unified runtime verification architecture that performs preemptive action validity assessment before physical execution or world-model imagination. Pre-VLA leverages an efficient multimodal backbone with modality-aware pooling and a lightweight dual-branch head to predict both safety confidence and critic-derived advantage scores for candidate action chunks. To handle severe class imbalance and unstable boundary decisions, we train Pre-VLA with a multi-task objective combining Focal classification, advantage regression, and soft-threshold calibration. During deployment, a dual-mode preemptive resampling scheduler filters low-quality actions and triggers adaptive resampling under a limited computation budget. Experiments on the LIBERO benchmark show that Pre-VLA improves the average closed-loop success rate across four suites from 30.79% to 37.62% over RynnVLA-002, reduces task execution steps, achieves 183.9 ms average forward verification time per action chunk, and mitigates error accumulation in world-model rollouts.
49. 【2605.22423】Moment-Reenacting: Inverse Motion Degradation with Cross-shutter Guidance
Link: https://arxiv.org/abs/2605.22423
Authors: Ji Xiang, Lin Guixu, Yin Zhengwei, Zhao Jiancheng, Zheng Yinqiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: global shutter, rolling shutter, remains a fundamental, low-light conditions, fundamental challenge
Comments: Accepted by TPAMI
Abstract:Motion degradation, manifested as blur in global shutter (GS) images or rolling shutter (RS) distortion in RS counterparts, remains a fundamental challenge in computational imaging, especially under fast motion or low-light conditions. While prior works have treated blur decomposition and RS temporal super-resolution as separate tasks, this separation fails to exploit their intrinsic complementarity. In this paper, we propose a unified framework to invert motion degradation and reenact the imaging moment by jointly leveraging the complementary characteristics of GS blur and RS distortion. To this end, we introduce a novel dual-shutter setup that captures synchronized blur-RS image pairs and demonstrate that this combination effectively resolves temporal and spatial ambiguities inherent in both modalities. To allow flexible performance-cost trade-offs, we further extend this dual-shutter setup to a stereo Blur-RS configuration with a narrow baseline. In addition, we construct a triaxial imaging system to collect a real-world dataset with aligned GS-RS pairs and ground-truth high-speed frames, enabling robust training and evaluation beyond synthetic data. Our proposed network explicitly disentangles motion into context-aware and temporally-sensitive representations via a dual-stream motion interpretation module, followed by a self-prompted frame reconstruction stage. Extensive experiments validate the superiority and generalizability of our approach, establishing a new paradigm for realistic high-speed video reconstruction under complex motion degradations. Code and more resources are available at this https URL.
50. 【2605.22422】FastTab: A Fast Table Recognizer with a Tiny Recursive Module and 1D Transformers
Link: https://arxiv.org/abs/2605.22422
Authors: Laziz Hamdi, Amine Tamasna, Pascal Boisson, Thierry Paquet
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Table structure recognition, Tiny Recursive Module, precise separator localization, lightweight Tiny Recursive, requires both table-level
Comments:
Abstract:Table structure recognition (TSR) requires both table-level coherence (row/column counts, headers, spanning cells) and precise separator localization. We introduce FastTab, a grid-centric TSR model that avoids autoregressive HTML decoding by combining (i) a lightweight Tiny Recursive Module (TRM) for global reasoning and (ii) axial 1D Transformer encoders that capture long-range dependencies along rows and columns. The model predicts row/column counts, header rows, and separators to construct a grid, then infers rowspan/colspan using ROI-aligned cell features. Across four benchmarks (PubTabNet, FinTabNet, PubTables-1M, and SciTSR), FastTab achieves competitive structure recovery performance while operating at low-latency inference. We further study robustness under pixel-level anonymisation and show an extension to curved separators for camera-captured documents. The source code will be made publicly available at this https URL.
51. 【2605.22420】Diffusion-guided Generalizable Enhancer for Urban Scene Reconstruction
Link: https://arxiv.org/abs/2605.22420
Authors: Henry Che, Jingkang Wang, Yun Chen, Ze Yang, Sivabalan Manivasagam, Raquel Urtasun
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Keywords: development and testing, Urban scene reconstruction, real-world observations, observations has emerged, powerful tool
Comments: ICRA 2026. Project page: [this https URL](https://waabi.ai/genre)
Abstract:Urban scene reconstruction from real-world observations has emerged as a powerful tool for self-driving development and testing. While current neural rendering approaches achieve high-fidelity rendering along the recorded trajectories, their quality degrades significantly under large viewpoint shifts, limiting their applicability for closed-loop simulation. Recent works have shown promising results in using diffusion models to enhance quality at these challenging viewpoints and distill improvements back into 3D representations. However, they often require costly per-scene optimization, and the distilled representations remain fragile and fail to generalize beyond limited synthesized views. To address these limitations, we propose GenRe, a novel diffusion-guided generalizable enhancer for urban scene reconstruction. GenRe takes as input any pretrained 3D Gaussian representation and fixes the deficiencies within a few minutes. By learning to distill generative priors across diverse scenes, GenRe efficiently produces a robust, high-fidelity representation that generalizes reliably to challenging unseen viewpoints (e.g., lane change). Experiments show that GenRe outperforms existing methods in both quality and efficiency and benefits various downstream tasks, enabling robust and scalable sensor simulation for autonomous driving.
52. 【2605.22417】The Neglected Baseline in Model Interpretation
Link: https://arxiv.org/abs/2605.22417
Authors: Yongjin Cui, Xiaohui Fan
Categories: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Keywords: model interpretation, methods generally ignore, interpretation, existing model interpretation, model interpretation methods
Comments:
Abstract:We observe that existing model interpretation methods generally ignore the baseline, and such neglect often results in imprecise or even incorrect interpretation. In this paper, we reformulate the task of model interpretation and the interpretation principles for model interpretation results to demonstrate the importance of the baseline. We further unify gradient-based methods, Integrated Gradients (IG) methods, and Taylor expansion, clarifying the connections among them and explicitly identifying the baseline for each method. On this basis, we analyze the flaws and errors in related model interpretation methods (IG, LayerCAM, ODAM, Difference Map). We advocate evaluating the quality of model interpretation results precisely through the attribution error between the attribution result and the attribution target, rather than adopting flawed evaluation methods, such as those based on marginal effects or the assumption of perfect model performance. We revise IG and develop a model interpretation method with a clear and reasonable baseline, achieving better results. Our method supports model interpretation based on features from any layer. Interpretations based on features from different layers are all reasonable, and the differences among these results reflect varying degrees of feature extraction at different feature extraction stages.
53. 【2605.22414】Towards Clinically Interpretable Ophthalmic VQA via Spatially-Grounded Lesion Evidence
Link: https://arxiv.org/abs/2605.22414
Authors: Xingyue Wang, Bo Liu, Meng Wang, Zhixuan Zhang, Chengcheng Zhu, Huazhu Fu, Jiang Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: holds great promise, Visual Question Answering, Question Answering, ophthalmic VQA, holds great
Comments:
Abstract:Visual Question Answering (VQA) holds great promise for clinical support, particularly in ophthalmology, where retinal fundus photography is essential for diagnosis. However, ophthalmic VQA benchmarks primarily emphasize answer accuracy, neglecting the explicit visual evidence necessary for clinical interpretability. In this work, we introduce FundusGround, a new benchmark for clinically interpretable ophthalmic VQA with spatially-grounded lesion evidence. Specifically, we propose a three-stage pipeline that collects 10,719 fundus images with 15,595 image-level meticulously annotated lesions. To ensure anatomical consistency and clinical validity, all lesions are spatially localized using the Early Treatment Diabetic Retinopathy Study (ETDRS) grid, enabling standardized mapping to nine clinically meaningful retinal regions. Built upon this structured lesion evidence, 72,706 questions are then generated spanning four formats: open-ended, closed-ended, single-choice, and multiple-choice. We further benchmark multiple general and medical large vision-language models using dual metrics for answer accuracy and lesion-level reasoning. The experiments demonstrate that incorporating lesion-level visual evidence consistently improves model performance and transparency, highlighting the necessity of explicit spatial grounding for reliable and explainable ophthalmic VQA.
54. 【2605.22413】From Recognition to Reasoning: Benchmarking and Enhancing MLLMs on Real-World Receipt Document Understanding
Link: https://arxiv.org/abs/2605.22413
Authors: Yandi Wang, Libin Zhan, Ziwei Huang, Tiancheng Luo, Yuxuan Jiang, Wang Dong, Leilei Gan, Jun Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Extracting structured information, Visual Information Extraction, Extracting structured, Multimodal Large Language, business automation
Comments:
Abstract:Extracting structured information from visual documents (Visual Information Extraction, VIE) is a cornerstone of business automation. While recent Multimodal Large Language Models (MLLMs) have shown promising capabilities, existing benchmarks suffer from critical limitations in scale and realism, lack semantic granularity, and fail to cover diverse document types. To bridge this gap, we introduce ReceiptBench, a large-scale, human-annotated benchmark consisting of 10k diverse receipts, organizing information extraction into four hierarchical sub-tasks: (1) Basic Perception for raw text spotting, (2) Format Normalization for strictly following standardization instructions, (3) Semantic Reasoning for inferring implicit attributes from context, and (4) Structure Parsing for handling nested line items. Furthermore, we propose a two-stage training framework incorporating Metric-Aware Group Relative Policy Optimization (GRPO), which translates rigorous evaluation constraints into reinforcement learning signals to enhance structural consistency. Extensive experiments demonstrate that our method yields state-of-the-art performance, surpassing leading proprietary models on complex reasoning tasks. We release our datasets and code at this https URL.
55. 【2605.22403】Translating Signals to Languages for sEMG-Based Activity Recognition
Link: https://arxiv.org/abs/2605.22403
Authors: Ming Wang, Haoxuan Qu, Qiuhong Ke, Wei Zhou, Hossein Rahmani, Jun Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: attracted increasing research, increasing research attention, Surface electromyography, signal-based activity recognition, sEMG signal-based activity
Comments:
Abstract:Surface electromyography (sEMG) signal-based activity recognition has attracted increasing research attention in recent years. To develop accurate sEMG signal-based activity recognizers, numerous approaches have been proposed. Some studies focus on designing larger and more expressive model architectures to enhance the representational capacity of sEMG signals, while others aim to enrich model priors through large-scale pretraining, thereby improving recognition performance. Recently, large language models (LLMs) have shown remarkable generalization and reasoning capabilities in natural language processing, whose implicit knowledge, learned from extensive linguistic descriptions of actions, opens new possibilities for interpreting sEMG signals and inferring activity intentions. Motivated by this, we propose LLM-sEMG, a novel framework that leverages LLMs as sEMG activity recognizers. Within this framework, we design a language-oriented mapping mechanism that converts continuous sEMG sequences into sEMG language, integrating several strategies to further facilitate the signal-to-language mapping process. Extensive experiments demonstrate that the proposed framework achieves highly accurate sEMG signal-based activity recognition using large language models.
56. 【2605.22366】AgroTools: A Benchmark for Tool-Augmented Multimodal Agents in Agriculture
Link: https://arxiv.org/abs/2605.22366
Authors: Zi Ye, Yibin Wen, Xiaoya Fan, Xinyu Zhang, Jing Wu, Kun Zeng, Zurong Mai, Jiarui Zhang, Bohan Shi, Juepeng Zheng, Jianxi Huang, Yutong Lu, Haohuan Fu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: decision-making increasingly requires, transform visual observations, increasingly requires multimodal, requires multimodal systems, Agricultural decision-making increasingly
Comments:
Abstract:Agricultural decision-making increasingly requires multimodal systems that can transform visual observations into reliable, executable actions. However, existing agricultural multimodal benchmarks mainly evaluate final-answer correctness and provide limited support for assessing whether models can use external tools to complete precision-sensitive workflows. In this paper, we introduce AgroTools, a benchmark for evaluating tool-augmented multimodal agents in agriculture. AgroTools contains 539 question-answer instances paired with 1,097 heterogeneous agricultural images, spanning five task families and an executable environment of 14 agricultural tools. Each query is annotated with structured tool-use traces, enabling a dual-view evaluation of both process-level execution quality and outcome-level task success. We benchmark 9 open-source and 4 closed-source multimodal large language models on AgroTools. Results show that current models remain far from reliable in agricultural tool-use settings, with clear bottlenecks in tool planning, argument generation, execution recovery, and final-answer synthesis. We hope AgroTools will support future research on multimodal agents for high-precision agricultural applications. The benchmark and evaluation are available at this https URL.
57. 【2605.22359】GazePrior: Zero-Shot AR/VR Eye Tracking via Learned 3D Gaze Reconstruction
Link: https://arxiv.org/abs/2605.22359
Authors: Corentin Dumery, David Colmenares, Alexander Fix, Pascal Fua, Ali Behrooz, Jogendra Kundu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: foundational technology, technology for advanced, Eye tracking, data, real data collection
Comments: Project page: [this https URL](https://corentindumery.github.io/projects/gazeprior.html)
Abstract:Eye tracking (ET) is a foundational technology for advanced AR/VR applications. However, training ET models for every new ET device is challenging: real data collection is costly and time-consuming, while existing synthetic data generation methods lack realism. To remove the need for additional data collection while maintaining data quality, we introduce a data-driven 3D prior that models the distribution of human eyes across diverse identities, gaze directions, and light settings. This model, which we coin GazePrior, then enables sparse-input 3D reconstruction of annotated data collected with previous ET devices, which can in turn be rendered from the cameras of any target ET device. Our approach synthesizes data with the realism, diversity and ground-truth accuracy of real data collection without its prohibitive costs. Our experiments demonstrate that ET models trained with our synthesized data outperform previous zero-shot methods, achieving higher accuracy and robustness.
58. 【2605.22357】VEELA: A Clinically-Constrained Benchmark for Liver Vessel Segmentation in Computed Tomography Angiography
Link: https://arxiv.org/abs/2605.22357
Authors: Ziya Ata Yazıcı, N. Sinem Gezer, İlkay Öksüz, İlker Özgür Koska, Tuğçe Toprak, Pervin Bulucu, Ufuk Beşenk, A. Emre Kavur, Pierre-Henri Conze, Hazım Kemal Ekenel, Oğuz Dicle, Mustafa Ege Şeker, Mustafa Said Kartal, Ariorad Moniri, Orhan Özkan, Osman Faruk Bayram, Hakan Polat, Musa Balcı, Ece Tuğba Cebeci, Baran Cılga, Kardelen Peçenek, M. Alper Selver
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: peripheral visibility limitations, computed tomography angiography, remains challenging due, contrast-enhanced computed tomography, complex vascular topology
Comments: 27 pages, 25 figures, 5 tables
Abstract:Accurate segmentation of hepatic and portal vessels in contrast-enhanced computed tomography angiography (CTA) remains challenging due to complex vascular topology, peripheral visibility limitations, and acquisition-induced ambiguities. While existing public datasets offer valuable benchmarks, few include clinically realistic annotation constraints. We introduce VEELA (Vessel Extraction and Extrication for Liver Analysis), a rigorously curated liver vessel dataset derived from 40 CTA scans inherited from the CHAOS grand-challenge cohort. All vessels were manually delineated slice-by-slice under multi-expert consensus, using a strict visibility-driven annotation policy and avoiding anatomically inferred interpolation. This design explicitly captures anatomical variability and imaging-related uncertainty. As a continuation of the CHAOS challenge, VEELA enables reproducible cross-benchmark evaluation while extending the scope to fine-grained hepatic and portal vessel segmentation. We further establish a standardized benchmarking framework and analyze complementary evaluation metrics, including topology-aware (clDice), overlap-based (IoU), boundary-sensitive (NSD), and geometry-aware (area, length) measures. Our results demonstrate that different metrics capture distinct aspects of vascular integrity, underscoring the necessity of multi-perspective evaluation for clinically meaningful vessel segmentation. VEELA is publicly released to facilitate reproducible research and support the development of robust vascular segmentation methods. Researchers can access the evaluation metrics, dataset, and submission platform at this https URL.
59. 【2605.22351】QuantSR+: Pushing the Limit of Quantized Image Super-Resolution Networks
Link: https://arxiv.org/abs/2605.22351
Authors: Haotong Qin, Xudong Ma, Xianglong Liu, Jie Luo, Jinyang Guo, Michele Magno, Yulun Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: compress super-resolution, resource-limited devices, computation costs, costs for deployment, deployment on resource-limited
Comments:
Abstract:Low-bit quantization is widely used to compress super-resolution (SR) models and reduce storage and computation costs for deployment on resource-limited devices. However, when SR models are pushed to ultra-low precision (2-4 bits), performance can drop sharply due to diminished representational capacity and the detail-sensitive nature of SR. To address these issues, we propose QuantSR+, a unified framework that improves quantization operators, network design, and training optimization, achieving better trade-offs between accuracy and efficiency than prior low-bit SR methods. QuantSR+ mainly relies on three technical contributions: (1) Redistribution-driven Bit Determination (RBD), which reshapes quantization distributions in both forward and backward passes to preserve representation fidelity; (2) Quantized Slimmable Architecture (QSA), which begins with an over-parameterized model and progressively prunes less critical blocks to meet efficiency budgets while pushing the accuracy performance; and (3) Slimming-guided Function-localized Distillation (SFD), which enforces block-aware feature alignment via a direct loss and a progressive, function-local training schedule to capture quantization effects better and speed up convergence. Extensive experiments show that QuantSR+ achieves state-of-the-art performance against both specialized quantized SR methods and generic quantization approaches. For SwinIR-S on Urban100 (x4), it improves PSNR by 0.29 dB over the 2-bit SOTA baseline. Meanwhile, it delivers strong efficiency gains at 2-bit, reducing operations by up to 87.9% and storage by 89.4%. QuantSR+ is effective for both convolutional and transformer-based SR models, indicating broad applicability.
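To make the loss of representational capacity at 2 bits concrete, here is a minimal sketch of plain min-max uniform quantization. This is a generic textbook quantizer chosen for illustration, not the RBD operator or any other component of QuantSR+; the function name and sample values are hypothetical.

```python
def quantize_uniform(values, bits):
    """Snap each value to the nearest of 2**bits evenly spaced levels
    spanning [min(values), max(values)]."""
    steps = 2 ** bits - 1  # number of intervals between the levels
    low, high = min(values), max(values)
    if high == low:
        return list(values)  # a constant input needs no quantization
    scale = (high - low) / steps
    return [low + round((v - low) / scale) * scale for v in values]


# Hypothetical activations: with 2 bits only four levels are available,
# so nearby values collapse onto the same level.
activations = [0.0, 0.1, 0.6, 0.9, 1.0]
quantized = quantize_uniform(activations, bits=2)
print([round(q, 3) for q in quantized])
```

With `bits=2` the five inputs land on at most four distinct levels, which is the kind of detail loss that a detail-sensitive task such as super-resolution is vulnerable to.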
60. 【2605.22344】Bernini: Latent Semantic Planning for Video Diffusion
Link: https://arxiv.org/abs/2605.22344
Authors: Bernini Team: Chenchen Liu, Junyi Chen, Lei Li, Lu Chi, Mingzhen Sun, Zhuoying Li, Yi Fu, Ruoyu Guo, Yiheng Wu, Ge Bai, Zehuan Yuan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Keywords: Multimodal large language, reached remarkable maturity, large language models, heterogeneous multimodal inputs, diffusion models
Comments: Project Page: [this https URL](https://bernini-ai.github.io/)
Abstract:Multimodal large language models (MLLMs) and diffusion models have each reached remarkable maturity: MLLMs excel at reasoning over heterogeneous multimodal inputs with strong semantic grounding, while diffusion models synthesize images and videos with photorealistic fidelity. We argue that these two families can be unified through a simple division of labor: MLLMs perform semantic planning, while diffusion models render pixels from high-level semantic guidance and low-level visual features. Building on this idea, we propose Bernini, a unified framework for video generation and editing. An MLLM-based planner predicts the target semantic representation directly in the ViT embedding space, and a DiT-based renderer synthesizes pixels conditioned on this plan, augmented by text features and, for editing, source VAE features for detail preservation. Because semantics serve as the interface, the planner and renderer can be trained separately and only lightly co-trained, preserving the pretrained strengths of both components while keeping training efficient. To better handle multiple visual inputs, we introduce Segment-Aware 3D Rotary Positional Embedding (SA-3D RoPE), and further incorporate chain-of-thought reasoning in the planner to better transfer understanding into generation. Bernini achieves state-of-the-art performance across a wide range of video generation and editing benchmarks, with the MLLM's pretrained understanding translating into strong generalization on challenging editing tasks.
61. 【2605.22342】4D-GSW: Kinematic-Aware Spatio-Temporal Consistent Watermarking for 4D Gaussian Splatting
Link: https://arxiv.org/abs/2605.22342
Authors: Sifan Zhou, Hang Zhang, Yuhang Wang, Ming Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Gaussian Splatting, revolutionized high-fidelity dynamic, safeguarding the intellectual, open challenge, revolutionized high-fidelity
Comments: 9 pages main paper, 7 figures, 18 pages in total
Abstract:While 4D Gaussian Splatting (4DGS) has revolutionized high-fidelity dynamic reconstruction, safeguarding the intellectual property of these assets remains an open challenge. Conventional steganographic techniques often neglect the underlying kinematic manifolds, triggering non-physical artifacts such as severe temporal flickering and "FVD collapse". To address this, we propose 4D-GSW, a kinematic-aware watermarking framework designed to embed robust copyright information while preserving high spatio-temporal consistency. Unlike prior 4D steganography that primarily focuses on opacity-guided invisibility, our approach explicitly addresses the physical coherence of motion trajectories. We introduce a Spatio-Temporal Curvature (STC) metric to identify "Dynamic Instants," adaptively gating watermark gradient injection to shield critical motion manifolds from non-physical perturbations. To ensure global coherence across complex deformations, we formulate a joint HMM-MRF energy minimization model that synchronizes watermark phases within both temporal trajectories and spatial neighborhoods. Furthermore, an anisotropic gradient routing mechanism ensures that watermark embedding remains strictly decoupled from photometric reconstruction fidelity. Extensive experiments have demonstrated the superior performance of our method in robustly hiding watermarks while resisting various attacks and maintaining high rendering quality and spatiotemporal consistency.
62. 【2605.22328】3D LULC classification using multispectral LiDAR and deep learning: current and prospective schemes
Link: https://arxiv.org/abs/2605.22328
Authors: Narges Takhtkeshha, Aldino Rizaldy, Markus Hollaus, Juha Hyyppä, Fabio Remondino, Gottfried Mandlburger
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Land Cover, Land Use Land, geospatial analysis, Land, sustainable planning
Comments:
Abstract:Land Use Land Cover (LULC) classification is essential for national 3D mapping, geospatial analysis, and sustainable planning. Multispectral (MS) LiDAR provides synchronized spatial-spectral information, and deep learning (DL) enables 3D point cloud semantic segmentation; however, adoption is limited by the lack of publicly available urban and suburban MS LiDAR datasets aligned with National Mapping and Cadastral Agencies (NMCAs) classification schemes. This study addresses these gaps by introducing L1 and L2 NMCA-aligned LULC classification schemes and a new benchmark MS LiDAR dataset. We evaluate seven state-of-the-art DL models and perform spectral ablation studies at both levels of detail. Results show that Point Transformer V3 achieves the best performance, with mIoU of 79.4% (L1, 8 classes) and 58.9% (L2, 20 classes) using a dual-wavelength LiDAR system (532 nm and 1064 nm). Ablation results show that multispectral information improves performance over geometry-only inputs, with gains of 1.1 percentage points at L1 and 7.8 points at L2. These results highlight the value of LiDAR reflectance for fine-grained material discrimination and support the evolution of NMCA LULC schemes toward higher semantic detail. The Loosdorf-MSL dataset contributes a new benchmark for consistent national and international LULC mapping.
63. 【2605.22327】Robustness of breast lesion segmentation under MRI undersampling improves with k-space-aware deep learning
Link: https://arxiv.org/abs/2605.22327
Authors: Lukas T. Rotkopf,Marco Schlimbach,Julius C. Holzschuh,Heinz-Peter Schlemmer,Jens Kleesiek,Moritz Rempe
Categories: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Keywords: acquired MRI k-space, accelerated or noisy, data are accelerated, acquired MRI, MRI
Comments:
Abstract:Purpose: To assess whether breast lesion segmentation can be learned directly from acquired MRI k-space, and whether doing so improves robustness when data are accelerated or noisy. Materials and Methods: This retrospective study used public breast dynamic contrast-enhanced MRI (DCE-MRI) datasets with acquired and synthetic k-space, together with a within-dataset synthetic control. We compared four 3D U-Net variants: a hybrid k-space-to-image model, a native k-space model, and magnitude and complex image-space baselines. Models were evaluated under increasing undersampling and added complex Gaussian k-space noise. The primary outcome was patient-level Dice similarity coefficient under cross-validation, with the hybrid model prespecified as the main comparison against the magnitude image-space baseline. Results: At full sampling, the hybrid and image-space models performed similarly. As acceleration increased, the hybrid model retained substantially more segmentation accuracy and significantly outperformed the magnitude image-space baseline across moderate to high undersampling levels. The same pattern was observed when noise was added directly to k-space: the hybrid model degraded more slowly, whereas the image-space baseline failed under heavier noise. This advantage was reproduced in the within-dataset synthetic control. Feature analysis suggested that the k-space stage and image-space stage played complementary roles, with frequency-domain filtering concentrated before image-domain lesion localization. Conclusion: K-space-aware deep learning improves the robustness of breast lesion segmentation under MRI undersampling and k-space noise, while matching image-space methods at full sampling.
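For readers unfamiliar with the primary outcome above, the patient-level Dice similarity coefficient can be sketched as follows. This is a minimal illustration and not the authors' evaluation code: the function names, the flat binary-mask representation, and the convention of scoring two empty masks as 1.0 are all assumptions made here.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


def mean_patient_dice(cases):
    """Average the per-patient Dice over (pred, target) mask pairs."""
    scores = [dice_coefficient(pred, target) for pred, target in cases]
    return sum(scores) / len(scores)


# Hypothetical toy masks: one patient with partial overlap, one with a perfect match.
example_cases = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),
    ([0, 1, 1, 0], [0, 1, 1, 0]),
]
example_score = mean_patient_dice(example_cases)
```

Averaging per patient, rather than pooling all voxels, keeps a few large lesions from dominating the score.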
64. 【2605.22311】PIU: Proximity-guided Identity Unlearning in ID-Conditioned Diffusion Models
Link: https://arxiv.org/abs/2605.22311
Authors: Jose Edgar Hernandez Cancino Estrada,Mauro Díaz Lupone,Žiga Emeršič,Vitomir Štruc,Peter Peer,Darian Tomašević
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: severe privacy concerns, raise severe privacy, models enable high-quality, diffusion models enable, Identity-conditioned diffusion models
Comments:
Abstract:Identity-conditioned diffusion models enable high-quality and identity-consistent face generation, but they also raise severe privacy concerns, as models may continue to synthesize individuals despite their right to be forgotten. While machine unlearning has been extensively studied for concept and data removal, identity unlearning remains largely unexplored, particularly in models conditioned directly on identity embeddings rather than text prompts. In this work, we study identity unlearning in Arc2Face, a state-of-the-art identity-conditioned latent diffusion model for face generation, and introduce Proximity-guided Identity Unlearning (PIU), an anchor-guided framework for identity unlearning. Specifically, we formulate identity removal as an identity replacement objective that reassigns the source identity to a selected anchor identity in the learned identity space, and we complement it with a proximity-based anchor selection strategy motivated by the geometry of ArcFace representations. We further show that effective unlearning can be achieved through localized fine-tuning of a small subset of identity-sensitive cross-attention layers. Experiments across many target identities show that our framework effectively suppresses generation of the target identity while preserving realism and identity consistency for retained identities, as validated by improved performance on unlearning and image-quality metrics, together with qualitative evaluation. The source code for the PIU framework is publicly available at this https URL .
65. 【2605.22290】Detection of Virus and Small Cell Patches in Foci Images Using Switchable Convolution and Feature Pyramid Networks
Link: https://arxiv.org/abs/2605.22290
Authors: Amrita Singh,Snehasis Mukherjee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: analyzing cellular structures, quantifying viral infection, Accurate detection, focus-forming unit, cellular structures
Comments:
Abstract:Accurate detection and counting of virus patches in focus-forming unit (FFU) images, also known as foci images, are important for quantifying viral infection and analyzing cellular structures. This task is challenging because biomedical targets often vary substantially in size, density, contrast, and shape. In this paper, we propose an enhanced YOLOv2-based detector that integrates a Feature Pyramid Network (FPN) to improve multi-scale feature representation. We also incorporate a switchable atrous convolution mechanism to adapt the receptive field for fine-grained targets in dense microscopy images. The proposed method is evaluated on biomedical foci image datasets for virus patch and small cell patch detection. For small cell patch detection, the model achieves a mean average precision (mAP) of 40.5% at a 25% Intersection over Union (IoU) threshold. For FFU virus patch detection, the model achieves an mAP of 68%. These results indicate that combining FPN-based feature fusion with switchable convolution improves the suitability of YOLOv2 for specialized biomedical object detection tasks.
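The 25% IoU threshold mentioned in the abstract decides when a predicted box counts as a match for a ground-truth box. A minimal sketch of that check is given below. It assumes axis-aligned boxes in `(x1, y1, x2, y2)` form, and the function names are hypothetical rather than taken from the paper.

```python
def box_iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.25):
    """A prediction matches a ground-truth box when IoU >= threshold."""
    return box_iou(pred_box, gt_box) >= threshold


# Two 2x2 boxes offset by one unit overlap in a 1x1 square: IoU = 1/7.
loose_overlap = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

A threshold as low as 0.25 is lenient, which is plausible for small targets where a shift of a few pixels changes IoU sharply.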
66. 【2605.22273】Exposing Vulnerabilities in Visible-Infrared VLMs: A Unified Geometric Adversarial Framework with Cross-Task Transferability
Link: https://arxiv.org/abs/2605.22273
Authors: Xiang Chen,Yuxian Dong,Chao Li,Chengyin Hu,Jiaju Han,Fengyu Zhang,Yiwei Wei,Jiahuan Long,Jiujiang Guo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: scenarios remains underexplored, Vision-language models, diverse multimodal tasks, achieved strong performance, scenarios remains
Comments:
Abstract:Vision-language models (VLMs) have achieved strong performance across diverse multimodal tasks, but their adversarial robustness in visible-infrared (VIS-IR) scenarios remains underexplored. This gap is critical because VIS-IR sensing is widely used in real-world perception systems to support reliable understanding under challenging imaging conditions. To address this cross-modal threat setting, we propose CFGPatch, a curved-edge fractal geometric adversarial patch framework for attacking VIS-IR VLMs. CFGPatch builds on triangular fractal geometry and replaces rigid straight-edged primitives with Bezier-curved elements, preserving multi-scale fractal self-similarity while introducing smoother contours, richer directional variation, and more flexible shape deformation. In addition, we design a modality-specific Fraser-spiral rendering mechanism to inject fine-grained texture distortions and misleading perceptual cues into visible and infrared images. By coupling global curved-fractal geometry with local spiral-based appearance interference, CFGPatch disrupts both shape perception and texture interpretation. We further adopt expectation over transformation (EOT) to improve robustness against common image-level transformations. Extensive experiments show that CFGPatch effectively fools VIS-IR VLMs and consistently outperforms standard patch baselines in attack effectiveness and robustness. Moreover, adversarial samples optimized for zero-shot classification transfer well to image captioning and visual question answering, demonstrating strong cross-task transferability and generalizability across downstream tasks.
67. 【2605.22272】Imagine2Real: Towards Zero-shot Humanoid-Object Interaction via Video Generative Priors
Link: https://arxiv.org/abs/2605.22272
Authors: Jiahe Chen,ZiRui Wang,Feiyu Jia,Xiao Chen,Xiaojie Niu,Weishuai Zeng,Tianfan Xue,Xiaowei Zhou,Jiangmiao Pang,Jingbo Wang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Whole-body Humanoid-Object Interaction, Whole-body Humanoid-Object, scarcity of high-fidelity, explicit CAD models, Humanoid-Object Interaction
Comments:
Abstract:Whole-body Humanoid-Object Interaction (HOI) is bottlenecked by the scarcity of high-fidelity 3D data. While video generative priors offer a promising alternative, existing methods suffer from Representation Misalignment due to their reliance on geometric priors (e.g., explicit CAD models), and Retargeting Complexity arising from intensive morphing and morphological mismatch. We propose Imagine2Real, a zero-shot HOI framework for flexible, geometry-free interaction. To resolve misalignment, we formulate robot and object motions as unified 4D point trajectories. To overcome retargeting complexity, our Keypoints Tracker tracks only sparse critical points (base, hands, and object), entirely bypassing the error-amplifying retargeting process. To maintain natural gaits despite these sparse signals, we utilize the latent space of a Behavior Foundation Model (BFM) as the tracker's search domain. Using a progressive training strategy, Imagine2Real learns robust behaviors with simple tracking rewards, enabling zero-shot physical deployment within a motion capture (mocap) system.
68. 【2605.22269】MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
Link: https://arxiv.org/abs/2605.22269
Authors: Junbin Xiao,Jiajun Chen,Tianxiang Sun,Xun Yang,Angela Yao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Keywords: large language models, remains challenging due, limited reasoning length, Long streaming video, language models
Comments: To appear at CVPR'26. Code is available at [this https URL](https://github.com/IMBALDY/MuKV)
Abstract:Long streaming video QA remains challenging due to growing visual tokens and limited reasoning length of large language models (LLMs). KV-caching stores the Key-Value (KV) of the historical tokens via LLM prefill and enables more efficient streaming QA. However, existing methods cache every one or two frames, causing redundant memory usage and losing fine-grained spatial details within frame or temporal contexts across frames. This paper proposes MuKV, a method that features a multi-grained KV cache compression module and a semi-hierarchical retrieval approach to improve both efficiency and accuracy for long streaming VideoQA. For the offline KV cache, MuKV extracts visual representations at patch-, frame-, and segment-levels. The multiple levels of granularity preserve both local cues and global temporal context, while maintaining efficiency with a dual signal token compression mechanism guided by self-attention and frequency. For online QA, MuKV designs a semi-hierarchical retrieval method to retrieve relevant KV caches for answer generation. Experiments on long-streaming VideoQA benchmarks show that MuKV significantly improves answer accuracy, without sacrificing memory and online QA efficiency. Moreover, our compression mechanism alone brings consistent benefits across answer accuracy, memory, and QA efficiency over baselines, showcasing highly effective contribution.
69. 【2605.22268】Impact of Atmospheric Turbulence and Pointing Error on Earth Observation
Link: https://arxiv.org/abs/2605.22268
Authors: Celia Sánchez-de-Miguel,Antonio M. Mercado-Martínez,Beatriz Soret,Antonio Jurado-Navas,Miguel Castillo-Vázquez
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Earth Observation, pointing jitter, effects are rarely, rarely considered, satellite pointing jitter
Comments: Conference
Abstract:Earth Observation (EO) imagery is often degraded by atmospheric turbulence and pointing jitter; yet, these effects are rarely considered in datasets used to train AI-based detection models. Based on prior work, this paper presents an enhanced image simulator that enables the incorporation of vertical-path atmospheric turbulence and satellite pointing jitter, arising from platform and sensor vibrations, to generate physically realistic distorted images. As a case study, vessel detection is evaluated using YOLOv8 and RetinaNet on images generated by the proposed simulator under different levels of turbulence and pointing errors. Results show that YOLOv8 recall decreases from 91% under ideal conditions to 60% in the presence of weak turbulence, and falls below 40% under strong turbulence or jitter. In contrast, RetinaNet demonstrates greater robustness, maintaining approximately 75% recall across degraded conditions. These results highlight the importance of incorporating realistic physical degradations into EO training datasets to ensure reliable performance of AI-based models in operational environments, as demonstrated in maritime surveillance applications.
70. 【2605.22259】An Evidence Hierarchy for Bayesian Object Classification via OSINT-Aided Heterogeneous Sensor Fusion
Link: https://arxiv.org/abs/2605.22259
Authors: Jan Nausner,Michael Hubner
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: classifying CBRNE threats, classifying CBRNE, Heterogeneous sensor fusion, CBRNE threats, Heterogeneous sensor
Comments: 6 pages, 1 figure; © 2026 The Authors. Submitted to the 2026 IEEE International Conference on Multisensor Fusion and Integration (MFI 2026). Under review
Abstract:Heterogeneous sensor fusion is vital for detecting, localizing, and classifying CBRNE threats. However, individual sensors are often only capable of detecting a subset of relevant threats with varying reliability or can even provide only indirect threat indications, making threat classification challenging. Furthermore, high clutter rates on the sensor side present a great challenge for fusion systems. Additionally, the limited availability of high quality datasets hinders the advancement of learning-based detection and classification models in smart sensors. To mitigate these sensor related shortcomings, a context-aware and domain knowledge-enhanced fusion process is proposed. First, a novel evidence hierarchy is established that enables modeling of direct, indicative, and contextual information. Second, contextual information about the environment is introduced into the fusion process, by collecting, processing, and exploiting OSINT inputs. Third, all levels of the evidence hierarchy are used to craft a Bayesian threat type classification mechanism with domain knowledge-informed priors. The proposed methodology is evaluated in simulated scenarios, and the results demonstrate the benefit of the proposed fusion approach in terms of robustness to clutter and prior mismatch, with an overall classification accuracy of up to 95%.
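The general idea of Bayesian classification with informed priors can be illustrated with a short sketch. It is not the paper's evidence hierarchy: it assumes the pieces of evidence are conditionally independent given the class, and the class names, likelihood values, and function names are hypothetical.

```python
def bayes_update(prior, likelihood):
    """One Bayes step: posterior(c) is proportional to prior(c) * P(evidence | c)."""
    unnormalized = {c: prior[c] * likelihood[c] for c in prior}
    evidence = sum(unnormalized.values())
    if evidence == 0:
        raise ValueError("evidence has zero probability under every class")
    return {c: value / evidence for c, value in unnormalized.items()}


def fuse_evidence(prior, likelihoods):
    """Fold a sequence of evidence likelihoods into the prior, one at a time.

    Assumes the pieces of evidence are conditionally independent given the class.
    """
    posterior = dict(prior)
    for likelihood in likelihoods:
        posterior = bayes_update(posterior, likelihood)
    return posterior


# Hypothetical prior and two sensor readings, both favoring "chemical".
prior = {"chemical": 0.5, "benign": 0.5}
readings = [
    {"chemical": 0.8, "benign": 0.2},
    {"chemical": 0.6, "benign": 0.4},
]
posterior = fuse_evidence(prior, readings)
```

In this toy setting, contextual information such as OSINT would enter by shifting `prior` before any sensor reading is folded in.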
71. 【2605.22255】Direct content-based retrieval from music scores images
Link: https://arxiv.org/abs/2605.22255
Authors: Noelia Luna-Barahona,Antonio Ríos-Vila,David Rizo,Jorge Calvo-Zaragoza
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Keywords: musical scores plays, preservation and accessibility, metadata searches, title or composer, digitization of musical
Comments: 17 pages (14 pages + references), 3 figures (with subfigures)
Abstract:The digitization of musical scores plays a crucial role in their preservation and accessibility, yet information retrieval still depends mainly on metadata searches, such as by title or composer. Content-based search in music score images remains underexplored compared to text documents, despite its potential value for musicians, musicologists, and educators. This work contributes to the field by first studying which characteristics of a score are most relevant for search and by defining a systematic method to build query datasets from any annotated corpus. We also consider diverse methods for content-based search on music score images, ranging from transcription-based approaches relying on Optical Music Recognition (OMR), to a transcription-free Transformer model trained to recognize queries directly from score images, and a text-prompted Large Language Model. Our experiments evaluate these models on four corpora exhibiting diverse characteristics in terms of dataset size, image quality, and typesetting mechanisms. Overall, each method excels under different conditions: OMR-based pipelines achieve higher in-domain retrieval, whereas transcription-free models handle domain variability more effectively.
72. 【2605.22249】D3Seg: Dependency-Aware Diffusion for Brain Tumor Segmentation with Missing Modalities
Link: https://arxiv.org/abs/2605.22249
Authors: Danish Ali,Ajmal Mian,Naveed Akhtar,Ghulam Mubashar Hassan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: effective treatment planning, Accurate brain tumor, Accurate brain, treatment planning, critical for effective
Comments:
Abstract:Accurate brain tumor segmentation using multiparametric MRI is critical for effective treatment planning. However, in clinical settings, complete acquisition of all MRI sequences is not always possible. The absence of certain MRI modalities results in substantial performance degradation in existing segmentation methods, which typically rely on naive feature concatenation or direct fusion strategies. To address this limitation, we propose a novel segmentation model D3Seg which is designed to maintain stable performance under missing-modality settings. D3Seg introduces Multi-hop Modality Graph Fusion (MMGF) to model higher order inter-modality dependencies, a lightweight diffusion-based imputation mechanism to compensate for missing T1ce representations in latent space, and probability-space decision refinement to mitigate dominant class overconfidence and improve delineation of underrepresented tumor subregions. Extensive evaluation on BraTS 2023 dataset demonstrates that our D3Seg model consistently improves segmentation performance under missing modality configurations. The proposed model achieves approximately 1.5-2.0% Dice improvement on enhancing tumor (ET) and around 1.0% on tumor core (TC) across multiple missing modality configurations compared to the current state-of-the-art model, while maintaining computational efficiency.
73. 【2605.22231】REACH: Hand Pose Estimation from Room Corners
Link: https://arxiv.org/abs/2605.22231
Authors: Shu Nakamura,Ryo Kawahara,Genki Kinoshita,Ryosuke Hirai,Yasutomo Kawanishi,Shohei Nobuhara,Ko Nishino
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: frequently occluded views, room corners, typically from fixed, occluded views, hand pose
Comments:
Abstract:We introduce a novel 3D hand pose estimator that can accurately recover the shape and pose of people's hands in a room from afar, typically from fixed cameras at room corners, in extremely low-resolution and frequently occluded views. Our key idea is to fully leverage hand-body coordination, its temporal progression, and multiview observations. We achieve this with a novel Transformer-based model, in which hand and body configurations are modeled through correlations between their visual features expressed as per-view tokens, and their temporal coordination is exploited in an autoregressive manner. We introduce a novel dataset, which we refer to as REACH, Room-Environment dataset Annotated with Chest cameras for Hand pose estimation, to train and test our method. REACH is a first-of-its-kind large-scale hand pose dataset that captures accurate hand movements of 50 participants across a wide variety of daily activities. In order to avoid interfering with natural movements while annotating the hands with accurate shape and pose, we leverage concealed chest cameras. Through extensive experiments, including comparative studies with existing methods, we show that our model, REACH-Net, achieves highly accurate 3D hand pose estimation from afar. These results broaden the horizon of 3D hand pose estimation, especially towards "in-the-wild" continuous human behavior analysis.
74. 【2605.22216】A Robust Semantic Segmentation Pipeline for the CVPR 2026 8th UG2+ Challenge Track 2
Link: https://arxiv.org/abs/2605.22216
Authors: Jinming Chai,Libo Yan,Licheng Jiao,Fang Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: adverse weather conditions, Adverse Weather, Challenge Track, WeatherProof Dataset Challenge, semantic segmentation task
Comments:
Abstract:This report presents our solution for the WeatherProof Dataset Challenge, namely CVPR 2026 8th UG2+ Challenge Track 2: Semantic Segmentation in Adverse Weather. For the semantic segmentation task under adverse weather conditions, we propose a semi-supervised segmentation pipeline. Our method is trained exclusively on the WeatherProof dataset, without using any additional external data. Specifically, we adopt UniMatch V2 as the baseline model and treat all degraded-weather images as unlabeled data for semi-supervised training, thereby fully exploiting the data distribution provided by the challenge. During inference, we further apply test-time augmentation to improve the robustness and segmentation accuracy of the final predictions. The code is publicly available at: this https URL.
75. 【2605.22209】GALAR-TemporalNet v2: Anatomy-Guided Dual-Branch Temporal Classification with Bidirectional Mamba and Dual-Graph GCN for Video Capsule Endoscopy -- after competition results
Link: https://arxiv.org/abs/2605.22209
Authors: Jiye Won(1),Seangmin Lee(1),Soon Ki Jung(1) ((1) Kyungpook National University)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Video Capsule Endoscopy, requiring simultaneous localization, Video Capsule, Capsule Endoscopy, temporal classification problem
Comments: 7 pages, 2 figures. Post-competition preprint for the ICPR 2026 RARE-VISION Challenge
Abstract:Video Capsule Endoscopy (VCE) poses a challenging multi-label temporal classification problem, requiring simultaneous localization of 8 anatomical regions and detection of 9 pathological findings across tens of thousands of frames. We present GALAR-TemporalNet v2, a hierarchical temporal model that addresses three core challenges: extreme class imbalance, long-range temporal dependencies, and pathology-anatomy entanglement. Our architecture combines windowed self-attention for local modeling, a Dual-Graph GCN for global frame relationships, and Bidirectional Mamba for selective boundary context encoding. A novel anatomy prototype residual pathway decouples pathological deviation signals from normal organ appearance, and a frame-level GCN skip connection stabilizes training of visually confusable rare classes. The competition version, GALAR-TemporalNet, achieved an overall mAP@0.5 of 0.2644 and mAP@0.95 of 0.2353 on the RARE-VISION test set. Following the competition, the redesigned GALAR-TemporalNet v2, which incorporates a restructured pathology branch, refined loss functions, and extended post-processing, improved these results to mAP@0.5 of 0.3409 and mAP@0.95 of 0.3333.
76. 【2605.22208】EvoIR-Agent: Self-Evolving Image Restoration Agentic System via Experience-Driven Learning
Link: https://arxiv.org/abs/2605.22208
Authors: Kailin Zhuang,Jiawei Wu,Zhi Jin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Large Language Model, Multimodal Large, Language Model, Large Language
Comments:
Abstract:Multimodal Large Language Model (MLLM)-driven image restoration agents demonstrate effectiveness in degradation coupling scenarios by flexibly selecting tools and determining removal orders. However, their zero-shot planning often fails without experience, necessitating severe trial-and-error overhead to achieve satisfactory outcomes. Currently, two paradigms are employed to address this issue, yet a dilemma persists: Training-based methods embed intrinsic experience into parameters, achieving high inference efficiency but lacking compatibility with new tools or degradations. In contrast, training-free methods utilize explicit experience storage for compatibility but still incur trial-and-error overhead due to naive experience. To resolve the dilemma, we propose EvoIR-Agent, which first systematically formulates the experience components of a training-free image restoration agent. Subsequently, a hierarchical experience pool is constructed, which enables coarse-to-fine guidance for diverse tools and removal orders. Furthermore, a self-evolving mechanism is introduced to update the pool from scratch using accumulated records, thereby greatly improving performance and efficiency. Extensive experiments reveal that EvoIR-Agent achieves a significant lead in the full-reference metrics and yields a remarkable Pareto-optimal balance between performance and efficiency compared to the state-of-the-art methods.
77. 【2605.22201】Zero-Shot Temporal Action Localization Through Textual Guidance
Link: https://arxiv.org/abs/2605.22201
Authors: Benedetta Liberatori,Alessandro Conti,Lorenzo Vaquero,Paolo Rota,Yiming Wang,Elisa Ricci
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Zero-shot temporal action, consists of classifying, Zero-shot temporal, classifying and localizing, classes are unseen
Comments: Accepted to FG 2026
Abstract:Zero-shot temporal action localization (ZS-TAL) consists of classifying and localizing actions in untrimmed videos, where action classes are unseen at training time. Existing work uses Vision and Language Models (VLMs), taking advantage of their strong zero-shot transfer capabilities. Yet, these models face evident challenges with fine-grained action classification, making it difficult to directly use them to distinguish between the presence and absence of an action. Most current methods for ZS-TAL address these challenges by training models on large-scale video datasets, which require annotated data and often result in limited generalization performance. Recently, approaches discarding the use of labeled data have emerged as an alternative. Following this direction, we propose a novel approach, "Textual Guidance for finer localization of actions in videos" (TEGU), that compensates for the lack of supervision from training data by exploiting rich textual information derived from large language models and structured text extracted from captions. This additional linguistic context can improve fine-grained discrimination by providing richer cues about fine-grained action differences within videos. We validate the effectiveness of the proposed method by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results show that, by exploiting rich textual information for improved action localization, TEGU outperforms state-of-the-art ZS-TAL approaches that do not involve training.
78. 【2605.22200】OSS: Open Suturing Skills Vision-Based Assessment Challenge 2024-2025
Link: https://arxiv.org/abs/2605.22200
Authors: Hanna Hoffmann,Setareh Bady,Claas de Boer,Max Kirchner,Jan Egger,Rainer Röhrig,Frank Hölzle,Lennart Johannes Gruber,Kunpeng Xie,Marlon Neuhaus,Victor Alves,Guilherme Barbosa,Leonardo Barroso,João Carvalho,Hao Chen,Gabriella d'Albenzio,André Ferreira,Nuno Gomes,Yuichiro Hayashi,Kousuke Hirasawa,Rebecca Hisey,Seungjae Hong,Seoi Jeong,Tiago Jesus,Daehong Kang,Satoshi Kasai,Shunsuke Kikuchi,Takayuki Kitasaka,Satoshi Kondo,Hyoun-Joong Kong,Youngbin Kong,Atsushi Kouno,Shlomi Laufer,Kyu Eun Lee,Bining Long,Nooshin Maghsoodi,Hiroki Matsuzaki,Evangelos Mazomenos,Ori Meiraz,Kensaku Mori,Marina Music,Masahiro Oda,Roi Papo,Jieun Park,Rafael Piexoto,Saeid Rezaei,Mariana Ribeiro,Soyeon Shin,Yang Shu,Idan Smoller,Danail Stoyanov,Yihui Wang,Xinkai Zhao,Sebastian Bodenstedt,Isabel Funke,Stefanie Speidel,Behrus Hinrichs-Puladi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: optimal patient outcomes, Achieving high levels, Achieving high, patient outcomes, skill assessment
Comments: Stefanie Speidel and Behrus Hinrichs-Puladi jointly supervised this work. Submitted to MEDIA
Abstract:Achieving high levels of surgical skill through effective training is essential for optimal patient outcomes. Automated, data-driven skill assessment holds significant potential to improve surgical training. While machine learning-based methods are increasingly popular for assessing skills in minimally invasive surgery, their application to open surgery remains limited. We present the results of a dedicated MICCAI challenge designed to benchmark and advance vision-based skill assessment in open surgery. The challenge dataset comprises videos of an open suturing training task recorded with a static GoPro camera in a dry-lab setting, with instrument trajectories available in addition to the primary video modality. The OSS Challenge was hosted over two consecutive years, comprising two and three independent tasks, respectively: (1) classifying skill level into four classes, (2) predicting the full Objective Structured Assessment of Technical Skills across eight categories, and (3) tracking hands and surgical tools. Participants submitted diverse solutions including deep learning-based video models, tracking-driven methods, and hybrid approaches. General-purpose spatiotemporal video models consistently achieved the strongest performance, though conceptually diverse approaches reached competitive levels when well-executed. Predicting fine-grained OSATS scores remains challenging but benefits substantially from increased training data. Keypoint tracking proves difficult given frequent occlusions and out-of-frame instances, limiting current applicability for motion-based skill analysis. This work benchmarks innovative and diverse solutions for surgical skill assessment, highlighting both the promise and current limitations of video-based evaluation in open surgery and identifying critical directions for advancing automated skill assessment toward clinical impact.
79. 【2605.22192】Ultra-High-Definition Image Quality Assessment via Graph Representation Learning
Link: https://arxiv.org/abs/2605.22192
Authors: Shaode Yu, Enqi Chen, Ming Huang, Xuemin Ren, Songnan Zhao, Zhicheng Zhang, Qiurui Sun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: global scene context, suppress scale-sensitive distortions, Blind image quality, images remains challenging, Blind image
Comments:
Abstract:Blind image quality assessment (BIQA) for ultra-high-definition (UHD) images remains challenging because native-resolution inference is computationally expensive, whereas aggressive resizing or isolated cropping may suppress scale-sensitive distortions and weaken the relationship between local artifacts and global scene context. To improve UHD-BIQA, this paper explicitly models the structural dependencies among sampled image regions rather than treating them as independent views, and proposes a graph representation learning framework, UHD-GCN-BIQA. The framework samples aspect-ratio-aligned patches from each UHD image, encodes them as graph nodes, and constructs a hybrid k-nearest-neighbor graph using spatial proximity and feature similarity. Residual graph convolution is used to propagate contextual information across regions, and gated attention pooling aggregates patch-level evidence into an image-level quality prediction. An exponential moving average normalized multi-objective loss function is adopted to stabilize the joint optimization of regression, correlation, and ranking objectives. Experiments on the UHD-IQA benchmark show that UHD-GCN-BIQA achieves PLCC = 0.7784, SRCC = 0.8019, and RMSE = 0.0519, obtaining competitive correlation performance and the lowest RMSE among the compared methods. These results indicate that graph-based region relation modeling is effective for UHD image quality assessment, particularly for improving absolute quality score estimation under high-resolution visual content.
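The hybrid k-nearest-neighbor graph mentioned in this abstract can be pictured with a small sketch. This is only one possible reading, not the authors' implementation: it blends a normalized spatial distance with a cosine feature distance and keeps the k closest patches per node. The function name, the `alpha` weight, and the blending rule are all assumptions; the paper might, for example, take the union of two separate k-NN graphs instead.

```python
import math


def hybrid_knn_graph(positions, features, k=2, alpha=0.5):
    """Return {node: [neighbor indices]} ranked by a blended distance.

    positions: list of (x, y) patch centers.
    features:  list of equal-length feature vectors, one per patch.
    alpha:     hypothetical weight of spatial distance vs. feature distance.
    """
    n = len(positions)

    def cosine_distance(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return 1.0 - dot / norm if norm > 0 else 1.0

    # Scale spatial distances to [0, 1] so that neither term dominates purely by units.
    spatial = [[math.dist(positions[i], positions[j]) for j in range(n)] for i in range(n)]
    max_spatial = max(max(row) for row in spatial) or 1.0

    graph = {}
    for i in range(n):
        scored = []
        for j in range(n):
            if j == i:
                continue
            blended = (alpha * spatial[i][j] / max_spatial
                       + (1.0 - alpha) * cosine_distance(features[i], features[j]))
            scored.append((blended, j))
        scored.sort()
        graph[i] = [j for _, j in scored[:k]]
    return graph
```

With `alpha=1.0` the graph depends on patch layout only, and with `alpha=0.0` on feature similarity only, which makes it easy to see what each signal contributes.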
80. 【2605.22190】No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos
Link: https://arxiv.org/abs/2605.22190
Authors: Matteo Balice, Yanik Kunzi, Chenyangguang Zhang, Matteo Matteucci, Marc Pollefeys, Sungwhan Hong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: made dramatic progress, single feed-forward pass, unknown camera poses, existing method jointly, addresses dynamic content
Comments: [this https URL](https://bralani.github.io/nopo4d_html/)
Abstract:Recent feed-forward 3D Gaussian splatting methods have made dramatic progress on individual aspects of 3D scene reconstruction, but no existing method jointly addresses dynamic content, multi-view input, and unknown camera poses in a single feed-forward pass. Methods that handle dynamics either require accurate camera poses or accept only monocular input; pose-free multi-view methods address only static scenes; and per-scene optimization methods bridge some of these gaps but at minutes-to-hours cost per scene. We introduce NoPo4D, the first feed-forward system that addresses this empty quadrant. Building on a pretrained geometry backbone and recent 4D Gaussian frameworks, NoPo4D introduces a velocity decomposition that splits Gaussian motion into per-pixel image-plane shifts and depth changes, allowing direct supervision from pseudo ground-truth optical flow on the 2D component. This sidesteps both the differentiable rendering that couples prior posed methods to pose accuracy and the 3D motion ground truth that prior pose-free methods require. The system is rounded out by a bidirectional motion encoder for cross-view and cross-frame feature aggregation, and view-dependent opacity that mitigates cross-view and cross-timestep Gaussian misalignments. On four multi-view dynamic benchmarks, NoPo4D consistently outperforms prior feed-forward baselines, and with an optional post-optimization stage surpasses per-scene optimization methods, while running orders of magnitude faster.
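To make the idea of splitting motion into an image-plane shift and a depth change concrete, here is a minimal geometric sketch. It assumes a simple pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), which the abstract does not state, and all function names are hypothetical. It only shows how a 2D shift plus a depth change determine a 3D displacement in camera coordinates; it is not NoPo4D's network or training code.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) at the given depth to a 3D point in camera coordinates."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def velocity_from_components(u, v, depth, du, dv, ddepth, fx, fy, cx, cy):
    """Combine an image-plane shift (du, dv) and a depth change into a 3D displacement.

    The 2D part (du, dv) is the quantity that optical flow could supervise;
    ddepth is the remaining component along the viewing direction.
    """
    start = unproject(u, v, depth, fx, fy, cx, cy)
    end = unproject(u + du, v + dv, depth + ddepth, fx, fy, cx, cy)
    return tuple(e - s for s, e in zip(start, end))
```

For a pixel at the principal point, a pure depth change moves the point only along the z axis, while a pure image-plane shift moves it sideways in proportion to its depth.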
81. 【2605.22186】Event-Illumination Collaborative Low-light Image Enhancement with a High-resolution Real-world Dataset
Link: https://arxiv.org/abs/2605.22186
Authors: Senyan Xu, Zhijing Sun, Kean Liu, Xin Lu, Ruixuan Jiang, Mingyang Huang, Xueyang Fu, Zheng-Jun Zha
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: low-light image enhancement, essential global illumination, high dynamic range, inherent noise sensitivity, incorporating high dynamic
Comments:
Abstract:Event-based low-light image enhancement (LIE) methods mainly focus on incorporating high dynamic range (HDR) information from events while overlooking the essential global illumination in images and the inherent noise sensitivity of event signals in real-world scenarios. To address these issues, we propose EIC-LIE, an event-illumination collaborative LIE framework. Concretely, we first design an Event-Illumination Collaborative Interaction (EICI) module, which contains two key processes: forward gathering, which gathers HDR features across varying lighting conditions, and backward injection, which provides complementary content for illumination and event representations. Next, we introduce an Illumination-aware Event Filter (IAEF) that dynamically reduces event noise based on brightness statistics derived from images. Additionally, we build a beam-splitter-based hybrid imaging system to collect high-quality event-image pairs with temporal synchronization from dynamic scenes, providing the first high-resolution, real-world event-based LIE dataset. Extensive experiments show that our EIC-LIE outperforms state-of-the-art methods on five real-world and synthetic datasets, significantly surpassing previous methods with improvements of up to 1.24dB in PSNR and 0.069 in SSIM. The code and dataset are released at this https URL.
82. 【2605.22185】Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis
Link: https://arxiv.org/abs/2605.22185
Authors: Tomaso Trinci, Henrique Piñeiro Monteagudo, Leonardo Taccari
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Multimodal Large Language, Large Language Models, general visual understanding, demonstrated impressive capabilities, Multimodal Large
Comments: Accepted at the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026)
Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding. However, their application to safety-critical driving scenarios remains limited by an inability to accurately perceive and reason about rare high-stakes dynamic events, such as collisions or near-collisions. To address this, we introduce a pipeline that enhances MLLM perception by fusing downsampled video frames with synchronized high-frequency telematics data (IMU and GPS) and semantic insights from specialized computer vision models. Our pipeline generates high-quality pseudo-labels, including descriptive captions and question-answer pairs, specifically designed to train MLLMs to identify and describe Safety-Critical Events (SCEs) in real-world driving footage. We show the effectiveness of our approach by fine-tuning the open-source QwenVL-2.5 model via DoRA adapters: our experiments demonstrate significant improvements in identifying and explaining safety-critical events, with fewer than 50M trainable parameters and a limited computational budget.
83. 【2605.22169】Balancing Uncertainty and Diversity of Samples: Leveraging Diversity of Least, High Confidence Samples for Effective Active Learning
Link: https://arxiv.org/abs/2605.22169
Authors: Vipul Arya, S.H. Shabbeer Basha, Srikrishna U N, Sunainha Vijay, Snehasis Mukherjee
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Convolutional Neural Networks, including Convolutional Neural, Neural Networks, Convolutional Neural, Vision Transformers
Comments:
Abstract:Deep learning models, including Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), have achieved state-of-the-art performance on various computer vision tasks such as object classification, detection, segmentation, generation, and many more. However, these models are data-hungry, as they require large amounts of training data to learn millions or billions of parameters. Especially for supervised learning tasks, curating a large number of labeled samples for model training is expensive and time-consuming. Active Learning (AL) has been used to address this problem for many years. Existing active learning methods aim at choosing the samples for annotation from a pool of unlabeled samples that are either diverse or uncertain. Choosing such samples may hinder the model's performance as we pool based on one dimension, i.e., either diverse or uncertain. In this paper, we propose four novel hybrid sampling methods for pooling both easy and hard samples, which are also diverse. To verify the efficacy of the proposed methods, extensive experiments are conducted using high and low-confidence samples separately. We observe from our experiments that the proposed hybrid sampling method, Least Confident and Diverse (LCD), consistently performs better compared to state-of-the-art methods. It is observed that selecting uncertain and diverse instances helps the model learn more distinct features. The code for this study will be available at this https URL.
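The abstract does not spell out how LCD combines uncertainty and diversity, so the following is a generic illustration rather than the authors' method. It assumes a two-step procedure: shortlist the least confident samples by their maximum class probability, then pick a spread-out subset with greedy farthest-point selection. The function name and the `pool_size` parameter are hypothetical.

```python
import math


def least_confident_and_diverse(probs, features, budget, pool_size):
    """Pick `budget` sample indices that are both uncertain and mutually distant.

    probs:     per-sample class probability lists from the current model.
    features:  per-sample feature vectors used to measure diversity.
    pool_size: how many least-confident samples to shortlist (hypothetical knob).
    """
    if budget <= 0 or not probs:
        return []

    # Step 1: shortlist the samples whose top class probability is lowest.
    by_confidence = sorted(range(len(probs)), key=lambda i: max(probs[i]))
    candidates = by_confidence[:pool_size]

    # Step 2: greedy farthest-point selection inside the shortlist.
    selected = [candidates[0]]
    while len(selected) < min(budget, len(candidates)):
        remaining = [c for c in candidates if c not in selected]
        farthest = max(
            remaining,
            key=lambda c: min(math.dist(features[c], features[s]) for s in selected),
        )
        selected.append(farthest)
    return selected
```

Shortlisting first keeps the selection focused on uncertain samples, and the second step avoids spending the whole annotation budget on near-duplicates.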
84. 【2605.22158】ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs
Link: https://arxiv.org/abs/2605.22158
Authors: Bingjun Luo, Tony Wang, Chaoqi Chen, Xinpeng Ding
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, processing long videos
Comments: Accepted by ICLR 2026
Abstract:Multimodal Large Language Models (MLLMs) face significant computational overhead when processing long videos due to the massive number of visual tokens required. To improve efficiency, existing methods primarily reduce redundancy by pruning or merging tokens based on importance or similarity. However, these approaches largely overlook a critical dimension of video content, i.e., changes and turning points, and they lack a collaborative model for spatio-temporal relationships. To address this, we propose a new perspective: similarity is for identifying redundancy, while difference is for capturing key events. Based on this, we design a training-free framework named ST-SimDiff. We first construct a spatio-temporal graph from the visual tokens to uniformly model their complex associations. Subsequently, we employ a parallel dual-selection strategy: 1) similarity-based selection uses community detection to retain representative tokens, compressing static information; 2) temporal difference-based selection precisely locates content-changing points to preserve tokens that capture key dynamic shifts. This allows it to preserve both static and dynamic content with a minimal number of tokens. Extensive experiments show our method significantly outperforms state-of-the-art approaches while substantially reducing computational costs. Our code is available at this https URL.
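The temporal difference branch described in this abstract can be illustrated with a toy sketch. It assumes that tokens sit at aligned spatial positions across frames and that "difference" is measured as cosine distance to the same position in the previous frame. Both are assumptions, as are the function names and the threshold value. The similarity branch (community detection on the spatio-temporal graph) is not covered here.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def select_changed_tokens(frames, threshold=0.3):
    """Return (frame_index, token_index) pairs whose content changed sharply.

    frames: list of frames, each a list of token vectors at aligned positions.
    A token is kept when its cosine distance to the same position in the
    previous frame exceeds `threshold` (a hypothetical value).
    """
    kept = []
    for t in range(1, len(frames)):
        for p, (prev, cur) in enumerate(zip(frames[t - 1], frames[t])):
            if 1.0 - cosine_similarity(prev, cur) > threshold:
                kept.append((t, p))
    return kept
```

Static regions produce near-zero distances and are left to the redundancy-reduction branch, while only the positions that actually change are flagged here.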
85. 【2605.22147】Flow-based Gaussian Splatting for Continuous-Scale Remote Sensing Image Super-Resolution
Link: https://arxiv.org/abs/2605.22147
Authors: Jiangwei Mo, Xi Lu, Hanlin Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Earth observation applications, High-resolution remote sensing, High-resolution remote, crucial for Earth, Earth observation
Comments:
Abstract:High-resolution remote sensing images (RSIs) are crucial for Earth observation applications, yet acquiring them is often limited by sensor constraints and costs. In recent years, generative super-resolution (SR) methods, particularly diffusion models, have made significant progress. However, they typically require slow iterative inference with 40 to 1000 steps and exhibit limited flexibility in continuous-scale SR settings. To address these issues, we propose FlowGS, a generative reconstruction framework for arbitrary-scale SR of RSIs. FlowGS models the high-frequency detail representations between high- and low-resolution images and learns a continuous probability flow from noise to detail priors via flow matching (FM) constrained by shortcut consistency, thereby reducing generative complexity and improving inference efficiency. Additionally, we employ 2D Gaussian splatting to construct a continuous feature field, thereby enabling flexible reconstruction at arbitrary query locations. Experimental results show that FlowGS delivers competitive perceptual quality compared with existing methods in both continuous-scale and fixed-scale SR settings, with substantially improved inference efficiency.
86. 【2605.22144】One Sentence, One Drama: Personalized Short-Form Drama Generation via Multi-Agent Systems
Link: https://arxiv.org/abs/2605.22144
Authors: Yufei Shi, Weilong Yan, Naixuan Huang, Yucheng Chen, Chenyu Zhang, Tao He, Si Yong Yeo, Ming Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: one-shot LLM generated, LLM generated scripts, requiring extensive manual, production typically rely, extensive manual review
Comments:
Abstract:Existing approaches for digital short-drama production typically rely on one-shot LLM generated scripts and loosely coupled pipelines, which fail to satisfy three key requirements of short-drama generation: (1) narrative pacing, resulting in weak hooks, insufficient escalation, and unattractive endings; (2) spatial consistency, leading to drifting scene layouts and inconsistent character positions across clips; and (3) production-level quality control, requiring extensive manual review and correction across script and visual stages. We present One Sentence, One Drama, a hierarchical multi-agent framework that transforms a user's single-sentence idea into a fully produced short drama through structured intermediate modules and iterative refinement. Our approach is built upon three key components: (1) a multi-agent debate-based story generation module that enforces short-drama pacing and narrative coherence; (2) a 3D-grounded first-frame generation mechanism that establishes a shared spatial reference for consistent character positioning and scene layout across clips; and (3) multi-stage reviewer loops that perform comprehensive error detection and targeted revision across script, visual, and video generation stages. We also introduce scene-level BGM matching and scene transition planning to improve the audience's immersive experience. To systematically evaluate this task, we introduce Short-Drama-Bench, a benchmark that extends standard video quality metrics with short-drama-specific criteria. Experimental results demonstrate that our method significantly outperforms existing pipelines in narrative quality, cross-clip consistency, and overall viewing experience.
87. 【2605.22139】EventGait: Towards Robust Gait Recognition with Event Streams
Link: https://arxiv.org/abs/2605.22139
Authors: Senyan Xu, Shuai Chen, Chuanfu Shen, Kean Liu, Zhijing Sun, Chengzhi Cao, Xueyang Fu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: uncontrolled environments due, recognition enables non-intrusive, Gait recognition, Gait recognition enables, enables non-intrusive
Comments:
Abstract:Gait recognition enables non-intrusive, privacy-preserving identification but suffers in uncontrolled environments due to the illumination and motion sensitivity of conventional cameras. In this work, we explore gait recognition using event cameras, which offer microsecond temporal resolution and high dynamic range, naturally capturing robust dynamic cues and suppressing static noise. Existing event-based approaches typically aggregate event streams into event images over long time windows, thereby discarding fine-grained motion dynamics critical for gait recognition. Therefore, we propose EventGait, an end-to-end dual-stream framework that separately models motion and shape while preserving the advantages of events. Our dynamic stream leverages a Mixture of Spiking Experts (MoSE) with diverse neuron constants for robust dynamic perception across complex motion and illumination scenes, while the static stream learns dense shape representations via Cross-modal Structure Alignment (CroSA) with large vision foundation models. To address the absence of large-scale event-based gait datasets, we introduce a synthesis pipeline and release two new benchmarks: SUSTech1K-E and CCGR-Mini-E. Extensive experiments show that event-based gait recognition not only achieves results comparable to camera-based gait recognition under normal conditions but also significantly outperforms it in low-light scenarios. Our approach sets a new state of the art on both synthesized and real-world event-based gait benchmarks, highlighting the robustness and potential of event-driven gait analysis. The code and datasets are released at this https URL.
88. 【2605.22132】Accelerating Vision Foundation Models with Drop-in Depthwise Convolution
Link: https://arxiv.org/abs/2605.22132
Authors: Carmelo Scribano, Mohammad Mahdi, Nedyalko Prisadnikov, Yuqian Fu, Giorgia Franchini, Danda Pani Paudel, Marko Bertogna, Luc Van Gool
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: foundation models deliver, models deliver strong, vision foundation models, Pretrained vision foundation, deliver strong performance
Comments: Accepted at ICPR 2026
Abstract:Pretrained vision foundation models deliver strong performance across tasks with limited fine-tuning. However, their Vision Transformer (ViT) backbones impose high inference costs, limiting deployment on resource-constrained devices. In this work, we accelerate large-scale pretrained ViTs while preserving their feature extraction capabilities by exploiting the intrinsic convolution-like behavior of some attention heads. Specifically, we introduce an efficient depthwise convolution-based layer that serves as a drop-in replacement for these heads. Additionally, we propose simple strategies to identify which heads can be replaced and introduce a fine-tuning procedure that recovers downstream task performance. Across both image classification and segmentation tasks, our method achieves a 17-20% inference speedup with minimal performance degradation. We validate the approach through detailed derivations, extensive experiments, and efficiency benchmarks. The reference implementation is publicly available.
89. 【2605.22126】AesFormer: Transform Everyday Photos into Beautiful Memories
Link: https://arxiv.org/abs/2605.22126
Authors: Tianxiang Du, Hulingxiao He, Yuxin Peng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: portrait enhancement methods, aesthetically appealing moments, camera viewpoint, everyday photography, methods cannot fix
Comments: Accepted by ICML 2026
Abstract:In everyday photography, aesthetically appealing moments are often captured with structural flaws (e.g., composition, camera viewpoint, or pose) that existing retouching and portrait enhancement methods cannot fix. We formulate Aesthetic Photo Reconstruction (APR) as improving a photo's aesthetic quality via structural reconstruction while preserving subject identity and scene semantics. Although recent advances in image editing models make APR feasible, they often lack aesthetic understanding, yielding edits that are semantically plausible yet aesthetically weak. To address this, we propose AesFormer, a two-stage framework that decouples aesthetic planning from image editing. In Stage 1, an aesthetic action model (AesThinker) analyzes the input along seven progressive photographic dimensions and outputs executable editing actions; we further apply GRPO-A to encourage broad exploration over diverse action plans beyond SFT. In Stage 2, an action-conditioned editor (AesEditor) performs structural edits guided by these actions. To support APR, we build a video-based corpus-mining pipeline (VCMP) and construct AesRecon, a benchmark of 9,071 strictly aligned (poor, good) image pairs. Experiments show that AesFormer substantially improves APR performance and is competitive with Nano Banana Pro. Code is available at this https URL.
90. 【2605.22121】MotionDPS: Motion-Compensated 3D Brain MRI Reconstruction
Link: https://arxiv.org/abs/2605.22121
Authors: Antonio Ortiz-Gonzalez, Erich Kobler, Lukas Schletter, Alexander Effland
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Magnetic resonance imaging, Magnetic resonance, long acquisition times, resonance imaging, highly susceptible
Comments: This work has been submitted to the IEEE for possible publication
Abstract:Magnetic resonance imaging (MRI) is highly susceptible to patient motion due to its relatively long acquisition times and the fact that data are acquired sequentially in k-space. Even small patient movements introduce phase inconsistencies across measurements, leading to severe artifacts such as blurring, ghosting, and geometric distortions that can compromise diagnostic quality. Retrospective motion compensation remains challenging, particularly in accelerated acquisitions, due to the ill-posed nature of the joint reconstruction and motion estimation problem. In this work, we propose a unified Bayesian framework for motion-compensated 3D MRI that jointly estimates the anatomical image, rigid-body motion parameters, and coil sensitivity maps directly from motion-corrupted k-space data. Our approach integrates pretrained 3D complex-valued score-based diffusion models as expressive anatomical image priors within a physics-based forward model. Inference is performed by alternating diffusion posterior image updates with efficient proximal optimization steps for motion and coil sensitivity estimation, enabling fully unsupervised reconstruction without the need for paired motion-free training data. Experiments on simulated and real-motion brain MRI datasets demonstrate that the proposed method achieves improved image quality and motion robustness compared to state-of-the-art classical and learning-based motion correction techniques, particularly in the presence of severe motion and high acceleration.
91. 【2605.22109】Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?
Link: https://arxiv.org/abs/2605.22109
Authors: Caixin Kang, Tianyu Yan, Sitong Gong, Mingfang Zhang, Liangyang Ouyang, Ruicong Liu, Bo Zheng, Huchuan Lu, Kaipeng Zhang, Yoichi Sato, Yifei Huang
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, superficial pattern matching
Comments:
Abstract:Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely on numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral understanding or merely prejudge through superficial pattern matching. We address this gap with three contributions. (i) A new task: we formalize Grounded Personality Reasoning (GPR), which requires MLLMs to anchor each Big Five rating in observable evidence through a chain of rating, reasoning, and grounding. (ii) A new dataset: we release MM-OCEAN (1,104 videos, 5,320 MCQs), produced by a multi-agent pipeline with human verification, with timestamped behavioral observations, evidence-grounded trait analyses, and seven categories of cue-grounding MCQs. (iii) Benchmark and analysis: we design a three-tier evaluation (rating, reasoning, grounding) plus four sample-level failure-mode metrics: Prejudice Rate (PR), Confabulation Rate (CR), Integration-failure Rate (IR), and Holistic-grounding Rate (HR), and benchmark 27 MLLMs (13 closed, 14 open). The analysis uncovers a striking Prejudice Gap: across the field, 51% of correct ratings are not grounded in retrieved cues, and the Holistic-Grounding Rate spans only 0-33.5%. These findings expose a disconnect between getting the right score and reasoning for the right reason, charting a roadmap for grounded social cognition in MLLMs.
92. 【2605.22104】OPERA: An Agent for Image Restoration with End-to-End Joint Planning-Execution Optimization
Link: https://arxiv.org/abs/2605.22104
Authors: Feng Zhu, Shuyang Xie, Yihan Zeng, Ming Liu, Wangmeng Zuo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: interacting mixed degradations, challenging due, interacting mixed, Real-world image restoration, restoration
Comments:
Abstract:Real-world image restoration is challenging due to complex and interacting mixed degradations. Recent agent-based approaches address this problem by composing multiple task-specific restoration tools. However, empirical analysis reveals that their performance is fundamentally limited by implicitly constrained planning spaces and the lack of coordination among independently pretrained tools. To address these issues, we propose OPERA (Optimized Planning-Execution Restoration Agent), a framework that jointly optimizes restoration planning and tool execution in an end-to-end manner. On the planning side, OPERA uses reinforcement learning to directly optimize tool composition over a combinatorial plan space, with the final restoration quality as the reward. On the execution side, OPERA introduces agent-guided co-training of restoration tools, enabling them to learn cooperative behaviors under sequential composition. Extensive experiments on multi-degradation benchmarks and real-world datasets demonstrate that OPERA consistently outperforms both all-in-one restoration models and existing agent-based methods across diverse and complex degradation scenarios.
93. 【2605.22098】TextTeacher: What Can Language Teach About Images?
Link: https://arxiv.org/abs/2605.22098
Authors: Tobias Christian Nauen, Stanislav Frolov, Brian Bernhard Moser, Federico Raue, Ahmed Anwar, Andreas Dengel
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: shared representation geometry, platonic representation hypothesis, representation hypothesis suggests, sufficiently large models, large models converge
Comments: Published at TMLR
Abstract:The platonic representation hypothesis suggests that sufficiently large models converge to a shared representation geometry, even across modalities. Motivated by this, we ask: Can the semantic knowledge of a language model efficiently improve a vision model? As an answer, we introduce TextTeacher, a simple auxiliary objective that injects text embeddings as additional information into image classification training. TextTeacher uses readily available image captions, a pre-trained and frozen text encoder, and a lightweight projection to produce semantic anchors that efficiently guide representations during training while leaving the inference-time model unchanged. On ImageNet with standard ViT backbones, TextTeacher improves accuracy by up to +2.7 percentage points (p.p.) and yields consistent transfer gains (on average +1.0 p.p.) under the same recipe and compute. It outperforms vision knowledge distillation, yielding more accuracy at a constant compute budget or similar accuracy, but 33% faster. Our analysis indicates that TextTeacher acts as a feature-space preconditioner, shaping deeper layers in the first stages of training, and aiding generalization by supplying complementary semantic cues. TextTeacher adds negligible overhead, requires no costly multimodal training of the target model and preserves the simplicity and latency of pure vision models. Project page with code and captions: this https URL
MSC classes: 68T05 (Primary), 68T45 (Secondary)
ACM classes: I.2.6; I.2.10
Cite as: arXiv:2605.22098 [cs.CV] (or arXiv:2605.22098v1 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2605.22098 (arXiv-issued DOI via DataCite, pending registration)
Journal reference: Transactions on Machine Learning Research, ISSN 2835-8856, 2026
94. 【2605.22096】VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection -- after competition results
Link: https://arxiv.org/abs/2605.22096
Authors: Bo-Cheng Qiu, Fang-Ying Lin, Ming-Han Sun, Yu-Fan Lin, Chia-Ming Lee, Chih-Chung Hsu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Capsule endoscopy event, clinically relevant findings, endoscopy event detection, Capsule endoscopy, visually heterogeneous
Comments:
Abstract:Capsule endoscopy event detection is challenging because clinically relevant findings are sparse, visually heterogeneous, and evaluated at the event level rather than by frame accuracy. We propose VISTA, a metric-aligned multi-backbone framework for the RAREVISION task. VISTA combines EndoFM-LV for temporal context and DINOv3 ViTL/16 for frame-level visual semantics, followed by a Diverse Head Ensemble (DHE), Validation-Guided Weighted Fusion (VGWF), and Anatomy-Aware Temporal Event Decoding (ATED). The original official submission achieved hidden-test temporal mAP@0.5 of 0.3530 and mAP@0.95 of 0.3235. After the competition, extending local threshold refinement with a global coarse search improved performance to 0.3726 mAP@0.5 and 0.3431 mAP@0.95, ranking Team ACVLab second in the post-competition evaluation.
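Event-level metrics such as temporal mAP@0.5 and mAP@0.95 rest on the temporal intersection-over-union between a predicted and a ground-truth interval. The sketch below shows that building block and a simple greedy matching at a given threshold. The exact matching protocol of the RAREVISION evaluation is not given in the abstract, so the matching rule and function names here are assumptions.

```python
def temporal_iou(pred, truth):
    """Intersection over union of two (start, end) intervals in the same units."""
    intersection = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0


def count_true_positives(preds, truths, threshold=0.5):
    """Greedily match predictions to unmatched ground-truth events.

    preds are assumed to be sorted by descending confidence; each ground-truth
    event can be matched at most once.
    """
    matched = set()
    true_positives = 0
    for pred in preds:
        best_iou, best_idx = 0.0, None
        for idx, truth in enumerate(truths):
            if idx in matched:
                continue
            iou = temporal_iou(pred, truth)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None and best_iou >= threshold:
            matched.add(best_idx)
            true_positives += 1
    return true_positives
```

Raising the threshold from 0.5 to 0.95 demands near-exact event boundaries, so a prediction that counts as a hit at the looser setting can become a miss at the stricter one.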
95. 【2605.22089】LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model
Link: https://arxiv.org/abs/2605.22089
Authors: Xiaodong Mei, Diankun Zhang, Hongwei Xie, Guang Chen, Hangjun Ye, Dan Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: existing VLAs typically, promising framework, VLAs typically rely, VLA, enhanced VLA framework
Comments:
Abstract:Vision-Language-Action (VLA) models have emerged as a promising framework for end-to-end autonomous driving. However, existing VLAs typically rely on sparse action supervision, which underutilizes their powerful scene understanding and reasoning capabilities. Recent attempts to incorporate dense visual supervision via world modeling often overemphasize pixel-level image reconstruction, neglecting semantically meaningful scene representation learning. In this work, we propose LVDrive, a Latent Visual representation enhanced VLA framework for autonomous driving. LVDrive introduces a future scene prediction task into the VLA paradigm, where future representations are learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone. Departing from inefficient autoregressive generation, we jointly model future scene and motion prediction within a unified embedding space, processed in a single forward pass to conduct the future-aware reasoning. We further design a two-stage trajectory decoding strategy that explicitly leverages the learned latent future representations to refine trajectory generation. Extensive experiments on the challenging Bench2Drive benchmark demonstrate that LVDrive achieves significant improvements in closed-loop driving performance, outperforming both action supervised methods and image-reconstruction-based world model approaches.
96. 【2605.22086】GenHAR: Generalizing Cross-domain Human Activity Recognition for Last-mile Delivery
Link: https://arxiv.org/abs/2605.22086
Authors: Zhiqing Hong, Zelong Li, Xiubin Fan, Guang Yang, Baoshen Guo, Haotian Wang, Tian He, Desheng Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: shown remarkable effectiveness, Human Activity Recognition, Activity Recognition, intelligent manufacturing, shown remarkable
Comments:
Abstract:Human Activity Recognition (HAR) has shown remarkable effectiveness in various applications, such as smart healthcare and intelligent manufacturing. However, a major challenge faced by HAR is the distribution shift across different sensor data domains, which often leads to decreased performance when deployed for real-world applications. To address this issue, this paper introduces GenHAR, a novel framework designed to mitigate the domain gap by learning domain-invariant sensor representations. GenHAR aims to enhance the generalization capabilities of HAR on target domains purely with data from the source domain. The key novelty of GenHAR lies in two aspects. Firstly, GenHAR tokenizes sensor data and learns correlations among frequency sensor channel dimensions to improve the robustness of HAR models. Secondly, GenHAR improves the efficiency via selective masking and an efficient attention mechanism. We conduct a systematic analysis of GenHAR by comparing it with state-of-the-art HAR methods on real-world human activity datasets. Results show that GenHAR outperforms state-of-the-art methods by 9.97% in accuracy, and reduces Floating Point Operations by 6.4 times. Moreover, we deploy GenHAR at a leading logistics company in 4 cities, and have detected 2.15 billion real-time activities. We release our code at: this https URL.
97. 【2605.22080】JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation
Link: https://arxiv.org/abs/2605.22080
Authors: Yue Xun, Junyu Liu, Qian Niu, Xinyi Wang, Zheng Yuan, Zirui Li, Zequn Zhang, Bowen Zhao, Shujun Wang, Irene Li, Kan Hatakeyama-Sato, Yusuke Iwasawa, Yutaka Matsuo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: multi-profession Japanese healthcare, Japanese healthcare licensing, healthcare licensing benchmark, multi-profession Japanese, Japanese healthcare
Comments:
Abstract:We introduce JMed48k, a multi-profession Japanese healthcare licensing benchmark for evaluating vision-language models. Built from official PDF materials released by the Japanese Ministry of Health, Labour and Welfare, JMed48k contains 48,862 exam questions and 20,142 images from 11 national licensing examinations between 2005 and 2025, with visual content annotated under an 8-type taxonomy. From this corpus, we derive JMed48k-Eval, a recent five-year evaluation subset with 12,484 scored questions, including 9,905 text-only questions and 2,579 questions with images. We evaluate 21 proprietary, open-source, and medical-specific models, reporting text-only and with-image performance separately. Because these subsets contain different questions, we further introduce a paired image-removal audit that evaluates questions with images before and after removing visual content to explore four answer-transition states. The audit shows that proprietary and open source models gain substantially from images, whereas medical-specific systems show limited observable use of visual evidence, with many correct answers persisting after image removal. Even among proprietary models, the net image-removal effect varies sevenfold across professions, from +5.7 points on Physician questions to +39.8 points on Public Health Nurse questions. We release JMed48k to support reproducible, profession-stratified evaluation of vision-language models in medical licensing settings.
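The paired image-removal audit described above can be illustrated with a small sketch. This is not the authors' code: the function names, the state labels, and the sign convention (accuracy with images minus accuracy without) are assumptions made for illustration only.

```python
from collections import Counter


def transition_counts(with_image, without_image):
    """Count the four answer-transition states for paired runs.

    Each argument is a list of booleans (True = answered correctly),
    aligned so that index i refers to the same question in both runs.
    The state names are hypothetical labels, not taken from the paper.
    """
    if len(with_image) != len(without_image):
        raise ValueError("paired runs must cover the same questions")
    labels = {
        (True, True): "correct->correct",
        (True, False): "correct->wrong",
        (False, True): "wrong->correct",
        (False, False): "wrong->wrong",
    }
    # Start every state at zero so absent states still appear in the result.
    counts = Counter({name: 0 for name in labels.values()})
    counts.update(labels[pair] for pair in zip(with_image, without_image))
    return dict(counts)


def net_image_effect(with_image, without_image):
    """Accuracy with images minus accuracy without, in percentage points."""
    if not with_image or len(with_image) != len(without_image):
        raise ValueError("need two non-empty, equally long result lists")
    return 100.0 * (sum(with_image) - sum(without_image)) / len(with_image)


if __name__ == "__main__":
    with_img = [True, True, False, True]
    without_img = [True, False, False, False]
    print(transition_counts(with_img, without_img))
    print(net_image_effect(with_img, without_img))
```

A large "correct->correct" count relative to "correct->wrong" would correspond to the abstract's observation that many correct answers persist after image removal.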
98. 【2605.22078】Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding
Link: https://arxiv.org/abs/2605.22078
Authors: Bingjun Luo, Tony Wang, Hanqi Chen, Xinpeng Ding
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, Multimodal Large Language, Language Models, Multimodal Large, Large Language
Comments: Accepted by ICLR 2026
Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have significantly advanced video understanding tasks, yet challenges remain in efficiently compressing visual tokens while preserving spatiotemporal interactions. Existing methods, such as the LLaVA family, utilize simplistic pooling or interpolation techniques that overlook the intricate dynamics of visual tokens. To bridge this gap, we propose ST-GridPool, a novel training-free visual token enhancement method designed specifically for Video LLMs. Our approach integrates Pyramid Temporal Gridding (PTG), which captures multi-grained spatiotemporal interactions through hierarchical temporal gridding, and Norm-based Spatial Pooling (NSP), which preserves high-information visual regions by leveraging the correlation between token norms and semantic richness. Extensive experiments on various benchmarks demonstrate that ST-GridPool consistently enhances the performance of Video LLMs without requiring costly retraining. Our method offers an efficient, plug-and-play solution for improving visual token representations. Our code is available at this https URL.
99. 【2605.22072】Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention
Link: https://arxiv.org/abs/2605.22072
Authors: Changyuan Tian, Zhicong Lu, Huaxing Liu, Xiang Wang, Shuai Li, Yu Chen, Wenqian Lv, Zichuan Lin, Juncheng Diao, Deheng Ye
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: large language models, work extends RLVR, language models, large language, Reinforcement learning
Comments: 20 pages, 7 figures, 3 tables. Preprint
Abstract:Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for advancing complex reasoning in large language models, and recent work extends RLVR to multimodal large language models (MLLMs). This transfer, however, surfaces a faithfulness challenge: faithful perception of task-relevant visual evidence and faithful use of that evidence during reasoning, leading to unsatisfactory gains on multimodal benchmarks. Specifically, existing perception supervision often operates on textual descriptions rather than natively on image regions, and faithful use is largely overlooked, exposing the perception-reasoning disconnect where correctly perceived evidence is dropped or contradicted during reasoning. To close these gaps, we propose Faithful-MR1, a training framework that anchors and reinforces visual attention to address both halves of faithful multimodal reasoning. The Anchoring stage turns perception into an explicit pre-reasoning subtask, supervising a dedicated Focus token's attention directly against image regions rather than through textual descriptions. The Reinforcing stage exposes faithful use through counterfactual image intervention, rewarding answer-correct trajectories that concentrate visual attention where vision causally matters. Extensive experiments demonstrate that Faithful-MR1 outperforms recent multimodal reasoning baselines on both Qwen2.5-VL-Instruct 3B and 7B backbones while using substantially less training data.
100. 【2605.22069】TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting
Link: https://arxiv.org/abs/2605.22069
Authors: Hyeseong Kim, Geonhui Son, Deukhee Lee, Dosik Hwang
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: achieving high-quality scene, sparse-view inputs poses, computer vision, Thin Plate Splines, limited viewpoints
Comments: Accepted to CVPR 2025, Project page: [this https URL](https://sandokim.github.io/twings/)
Abstract:Novel view synthesis from sparse-view inputs poses a significant challenge in 3D computer vision, particularly for achieving high-quality scene reconstructions with limited viewpoints. We introduce TWINGS, a framework that enhances 3D Gaussian Splatting (3DGS) by directly addressing point sparsity. We employ Thin Plate Splines (TPS), a smooth non-rigid deformation model that minimizes bending energy to estimate a globally coherent warp from control-point correspondences, to align backprojected points from estimated depth with triangulated 3D control points, yielding calibrated backprojected points. By sampling these calibrated points near the control points, TWINGS provides a fast and geometrically accurate initialization for 3DGS, ultimately improving structural detail preservation and color fidelity in reconstructed scenes. Extensive experiments on DTU, LLFF, and Mip-NeRF360 demonstrate that TWINGS consistently outperforms existing methods, delivering detailed and accurate reconstructions under sparse-view scenarios.
101. 【2605.22068】COCOTree: A Dataset and Benchmark for Open Tree-Structured Visual Decomposition
Link: https://arxiv.org/abs/2605.22068
Authors: Junhyub Lee, Seunghun Chae, Hyosu Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: open tree decomposition, granularity and flexibility, formalize and enable, enable the task, visual components
Comments:
Abstract:We formalize and enable the task of open tree decomposition, which segments an image into hierarchical trees of visual components with unconstrained granularity and flexibility. Specifically, we provide the foundation benchmark for this new paradigm with the following three key contributions. First, we overcome the prohibitively high cognitive and physical bottlenecks of manual annotation by developing a fully automated generation pipeline that synergizes the semantic reasoning of Large Vision-Language Models (LVLMs) with the precise geometric grounding of SAM 3. Second, leveraging this pipeline, we construct COCOTree, a massive-scale benchmark featuring over 21K images and 1.8M structural nodes. By embracing an open-vocabulary space of over 3.5K unique labels, it successfully captures the long-tail distribution of complex physical assemblies. Notably, rigorous human evaluation confirms our generated annotations demonstrate strong alignment with human structural judgment. Third, we establish a standardized evaluation protocol by proposing the Open Tree Quality (OTQ) metric, which jointly assesses mask precision, label accuracy, and structural consistency. We release our dataset and benchmark code at this https URL.
102. 【2605.22066】Echo4DIR: 4D Implicit Heart Reconstruction from 2D Echocardiography Videos
Link: https://arxiv.org/abs/2605.22066
Authors: Yanan Liu, Qinya Li, Hao Zhang, Kangjian He, Xuan Yang, Hao Li, Dan Xu, Lei Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: echocardiography is highly, temporal discontinuity, highly desirable, challenged by geometric, geometric ambiguity
Comments:
Abstract:Reconstructing 4D (3D+t) cardiac geometry from sparse 2D echocardiography is highly desirable yet fundamentally challenged by geometric ambiguity and temporal discontinuity. To tackle these issues, we propose Echo4DIR, a novel test-time 4D implicit reconstruction framework. Specifically, we learn robust 3D shape priors from statistical shape models (SSMs) via a cardiac conditional SDF, constructing an Epipolar Mask Encoder module with epipolar cross attention to effectively fuse multi-view features. To bridge the synthetic-to-real domain gap, we introduce a self-supervised SDF-tailored differentiable rendering strategy for patient-specific 3D shape adaptation using uncalibrated clinical masks without requiring 3D ground truth. Crucially, the inherent continuity of implicit representation overcomes sparse observations, enabling anatomically reliable geometry at arbitrary resolutions. Furthermore, to empower our framework with physically continuous 4D extension, we introduce a Radial SDF Alignment strategy that strictly locks shape evolution to the predicted velocity field, fundamentally eliminating mesh drift. Extensive experiments on synthetic benchmarks and real clinical datasets demonstrate that Echo4DIR achieves state-of-the-art 4D cardiac mesh reconstruction, notably yielding an impressive clinical overlap of up to 98.35% Dice and 96.75% IoU.
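For readers comparing the two overlap figures quoted above: for a single pair of masks, Dice and IoU are linked by the standard identity IoU = Dice / (2 - Dice). The sketch below applies it to the reported Dice value. It assumes the identity is applied to one aggregate score; averages over many cases need not satisfy it exactly, and the helper name is hypothetical.

```python
def dice_to_iou(dice: float) -> float:
    """Convert a Dice coefficient in [0, 1] to the equivalent IoU (Jaccard index)."""
    if not 0.0 <= dice <= 1.0:
        raise ValueError("Dice must lie in [0, 1]")
    return dice / (2.0 - dice)


if __name__ == "__main__":
    # Dice value quoted in the abstract, expressed as a fraction.
    print(round(100 * dice_to_iou(0.9835), 2))  # 96.75
```

The result agrees with the 96.75% IoU stated in the abstract, so the two numbers are mutually consistent under this identity.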
103. 【2605.22061】Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates
Link: https://arxiv.org/abs/2605.22061
Authors: Guojun Xu, Mingyang Zhang, Jianwen Xiang, Cheng Tan, Yanchao Yang, Junwei Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Distributed Image Compression, side information, Distributed Image, multi-view transmission, crucial for multi-view
Comments: Accepted by CVPR 2026
Abstract:Distributed Image Compression (DIC) is crucial for multi-view transmission, especially when operating at extremely low bitrates ( 0.1 bpp). Its core challenge is effectively utilizing side information to achieve high-quality reconstruction under strict bitrate budgets. However, existing DIC approaches struggle to exploit global context and object-level details from side information, leading to local blurring and the loss of fine details in the reconstruction. To address these limitations, we propose a Multimodal DIC framework (MDIC), which, for the first time, leverages side information in a multimodal manner into the DIC paradigm, effectively preserving fine-grained local details and enhancing global perceptual quality in reconstructed images. Specifically, we introduce a text-to-image diffusion-based decoder conditioned on textual side information extracted from correlated images to capture shared global semantics. Moreover, we design a feature-mask generator, supervised by a multimodal fine-grained alignment task, to strengthen the exploitation of visual side information. The generated mask serves two purposes: first, it guides the extraction of fine-grained details from losslessly transmitted side information to preserve the semantic consistency of reconstructed details; second, it regulates the extraction of clustered feature representations from the quantized VQ-VAE embeddings, compensating for category information lost under the extreme compression of the primary image. Extensive experiments on the widely used KITTI Stereo and Cityscapes datasets demonstrate that MDIC achieves state-of-the-art perceptual quality at extremely low bitrates.
104. 【2605.22051】EasyVFX: Frequency-Driven Decoupling for Resource-Efficient VFX Generation
Link: https://arxiv.org/abs/2605.22051
Authors: Yue Ma, Xu Ye, Qinghe Wang, Yucheng Wang, Hongyu Liu, Yinhan Zhang, Xinyu Wang, Yuanpeng Che, Shanhui Mo, Paul Liang, Fangneng Zhan, Qifeng Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Generating high-fidelity visual, typically demands massive, demands massive datasets, prohibitive computational power, computational power due
Comments: Accepted by SIGGRAPH 2026. Project page: [this https URL](https://easy-vfx.github.io/)
Abstract:Generating high-fidelity visual effects (VFX) typically demands massive datasets and prohibitive computational power due to the intricate coupling of spatial textures and temporal dynamics. In this paper, we introduce EasyVFX, a resource-efficient framework that achieves realistic VFX synthesis under stringent constraints. Our core philosophy lies in frequency-domain decomposition: we observe that the complexity of VFX can be significantly mitigated by decoupling high-frequency components, which represent intricate spatial appearances, from low-frequency components that encapsulate global motion dynamics. This spectral disentanglement transforms a high-dimensional learning problem into manageable sub-tasks, thereby lowering the optimization barrier and reducing data dependency. Building upon this insight, we propose a two-stage training paradigm. First, we design a Frequency-aware Mixture-of-Experts (Freq-MoE) architecture. By utilizing a soft routing mechanism, our model assigns specialized experts to distinct spectral bands, enabling them to cultivate robust priors for appearance and motion dynamics. This specialization allows the model to acquire foundational VFX knowledge with fewer GPU resources. Second, we introduce a Test-Time Training strategy powered by a novel Frequency-constraint Loss. This allows the pre-trained model to swiftly adapt to specific, unseen effects through localized optimizations, requiring only about 100 steps on a single GPU. Experimental results demonstrate that EasyVFX produces structurally consistent and visually stunning effects, proving that frequency-aware learning is a key catalyst for democratizing professional-grade VFX.
105. 【2605.22050】Broken Memories: Detecting and Mitigating Memorization in Diffusion Models with Degraded Generations
Link: https://arxiv.org/abs/2605.22050
Authors: Yuanmin Huang, Mi Zhang, Chen Chen, Feifei Li, Geng Hong, Xiaoyu You, Min Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: memorize training data, training data poses, data poses significant, poses significant privacy, diffusion models excel
Comments: KDD 2026, extended version
Abstract:While diffusion models excel at generating high-quality images, their tendency to memorize training data poses significant privacy and copyright risks. In this work, we identify for the first time that memorization induces internal numerical instability, often manifesting as visually "broken" artifacts. Inspired by stability analysis in numerical methods, we introduce empirical stability regions based on latent update norms to quantitatively characterize stable behavior during generation. Leveraging this, we propose a principled, on-the-fly framework for step-wise detection and adaptive mitigation. Our approach suppresses memorization without altering prompts or guidance, thereby preserving semantic fidelity and image quality. Extensive experiments on Stable Diffusion 1.4 demonstrate that our method achieves a detection AUC of 0.999 and a 0.0% memorization rate after mitigation with negligible overhead (≈0.01 s per image).
106. 【2605.22044】Physiology and Anatomy Aware Inverse Inference of Myocardial Infarction for Cardiac Digital Twin
Link: https://arxiv.org/abs/2605.22044
Authors: Mengxiao Wang, Yilin Lyu, Julia Camps, Ching Hui Sia, Mark Yan-Yee Chan, Yanrui Jin, Shuzhi Sam Ge, Chengliang Liu, Lei Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: risk stratification, infarction is essential, essential for risk, Accurate localization, Accurate
Comments: Early-accepted by MICCAI 2026. This version corresponds to the submitted version. The final version will be available on Springer Link
Abstract:Accurate localization of myocardial infarction is essential for risk stratification. While LGE-MRI remains the gold standard, it is resource-intensive. Integrating cine MRI with ECG enables a more detailed representation of infarct properties. Existing inverse MI inference methods overlook realistic scar morphology and cardiac repolarization, reducing sensitivity to subtle ECG variations and interpretability of infarct-induced electrophysiological changes. In this paper, we propose a novel framework for noninvasive MI localization using cardiac digital twins. To bridge the domain gap between simulation and reality, we introduce an anatomy-aware stochastic infarct synthesis strategy to synthesize realistic, irregular scars with border zones, mimicking ischemic transmural progression. We then construct a virtual cohort to simulate QRS-T waveforms, capturing both depolarization and repolarization dynamics. Furthermore, we design a Physiology and Anatomy Aware Network (PAA-Net) that jointly encodes 3D myocardial geometry and multi-lead ECGs to infer infarct areas with varying localizations, sizes, spatial extents, and transmuralities. Experimental results demonstrate that our framework significantly outperforms existing methods in inverse inference, achieving Dice scores of 0.7391 and 0.5503 for scar and border zone segmentation, respectively, while further enhancing the interpretability of the ECG-infarct relationship. Our code will be released upon acceptance.
107. 【2605.22036】GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation
Link: https://arxiv.org/abs/2605.22036
Authors: Jiahao Yang, Zihan Wang, Xiangyang Li, Xing Zhu, Yujun Shen, Yinghao Xu, Shuqiang Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: dense RGB videos, produce excessive patch, substantial computational overhead, limited spatial reasoning, dense RGB
Comments:
Abstract:Despite significant progress in Vision-Language Navigation (VLN), existing approaches still rely on dense RGB videos that produce excessive patch tokens and lack explicit spatial structure, resulting in substantial computational overhead and limited spatial reasoning. To address these issues, we introduce the Geometry-Aware BEV (GA-BEV), a compact, 3D-grounded feature representation that integrates both explicit and implicit geometric cues into multimodal large language model (MLLM)-based navigation systems. We construct BEV spatial maps from RGB-D inputs by projecting visual features into 3D space and aggregating them into an agent-centric layout that preserves geometric consistency while reducing token redundancy. To further enrich geometric understanding, we incorporate features from a pretrained 3D foundation model into the BEV space, injecting structural priors learned from large-scale 3D reconstruction tasks. Together, these complementary cues (explicit depth-based projection and implicit learned priors) yield compact yet spatially expressive representations that substantially improve navigation efficiency and performance. Experiments show that our method achieves state-of-the-art results using only navigation data, without DAgger augmentation or mixed VQA training, demonstrating the robustness and data efficiency of the proposed GA-VLN framework.
108. 【2605.22035】HyLoVQA: Dynamic Hypernetwork-Generated Low-Rank Adaptation for Continual Visual Question Answering
Link: https://arxiv.org/abs/2605.22035
Authors: Yiran Wang, Chenyi Xiong, Ziyue Qin, Miao Zhang, Kui Xiao, Zhifei Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Continual Visual Question, Visual Question Answering, Question Answering, preserving past knowledge, Continual Visual
Comments: Accepted by IJCAI 2026
Abstract:Continual Visual Question Answering (VQA) requires learning from non-stationary streams of visual inputs and questions while preserving past knowledge. Most prior methods adapt by updating a largely shared parameter set. This often leads to cross-level task interference, hindering accurate adaptation to the current task and object. To address this limitation, we propose HyLoVQA. It maintains a drift-resilient memory bank of anchors. The bank stores the content of visual objects and textual tasks, and they are updated using current input features. Conditioned on retrieved anchors, a hypernetwork generates lightweight Low-Rank Adaptation (LoRA) adapters. This ensures parameter efficiency, allowing the model to adapt to each task and object dynamically. Additionally, we formulate an alignment loss that aligns semantic discrepancies in the feature space with functional changes in the parameter space, thereby constraining LoRA adapters to remain focused on the current task and object. Extensive experiments on VQA v2 and NExT-QA under both standard and compositional settings demonstrate the superiority of HyLoVQA over prior state-of-the-art methods.
109. 【2605.22034】AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding
Link: https://arxiv.org/abs/2605.22034
Authors: Haocheng Li, Juepeng Zheng, Zenghao Yang, Kaiqi Du, Guilong Xiao, Gengmeng Pu, Haohuan Fu, Jianxi Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: agricultural visual grounding, enabling applications, selective weeding, targeted harvesting, task of localizing
Comments: 45 pages, 12 figures
Abstract:Visual grounding, the task of localizing objects described by natural-language expressions, is a foundational capability for agricultural AI systems, enabling applications such as selective weeding, disease monitoring, and targeted harvesting. Reliable evaluation of agricultural visual grounding remains challenging because agricultural targets are often small, repetitive, occluded, or irregularly shaped, and instructions may refer to one, many, or no objects in an image. Evaluating this capability therefore requires jointly testing localization accuracy, target-set completeness, and existence-aware abstention. To address these challenges, we introduce AgroVG, a multi-source benchmark that formulates agricultural grounding as generalized set prediction: given an image and a referring expression, a model must return all matching target instances or abstain when no target is present. AgroVG contains 10,071 annotation-grounded image-query pairs from ten source datasets across six target families: crop/weed, fruit, wheat head, pest, plant disease, and tree canopy. It supports bounding-box grounding (T1) across all six families and instance-mask grounding (T2) on sources with reliable instance-level pixel annotations, with queries covering single-target, multi-target, and target-absent regimes. AgroVG further provides task-specific protocols for box-set matching and query-level mask coverage. Zero-shot evaluation of 26 model configurations spanning closed-source MLLMs, open-source VLMs, and specialized grounding systems reveals persistent gaps: the best multi-target Set-F1 reaches only 0.35, and the best positive-query mask success rate at IoU@0.75 remains below 0.17. Data and code are available at this https URL.
110. 【2605.22031】SO-Mamba: State-Ownership Mamba for Unrolled MRI Reconstruction
Link: https://arxiv.org/abs/2605.22031
Authors: Pengcheng Fang, Hongli Chen, Fangfang Tang, Feng Liu, Xiaohao Cai, Shanshan Shan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: large spatial regions, Accelerated MRI reconstruction, requires recovering missing, recovering missing details, preserving anatomically coherent
Comments:
Abstract:Accelerated MRI reconstruction requires recovering missing details while preserving anatomically coherent structures across large spatial regions. State-space models such as Mamba provide efficient long-range modeling, making them attractive learned regularizers for unrolled reconstruction. However, in a data-consistency-coupled unrolled solver, different stages operate on different reconstruction iterates, where the resident carrier should preserve coherent reconstruction content across stages while stage-dependent non-resident evidence is tied to the current update. Treating these roles uniformly can place persistent resident-carrier evidence and update-dependent non-resident evidence into the same recurrent content route. We therefore propose SO-Mamba, a state-ownership Mamba regularizer that assigns reconstruction evidence within each Mamba stage to recurrent residency, state-interface access, and non-state output correction. SO-Mamba implements this ownership rule with a State-Ownership Router (SOR), which constructs a resident carrier for recurrent content and routes non-resident evidence to affine modulation of the B/C state interfaces and an output correction outlet. The resident carrier supplies the Mamba content route, while the non-resident evidence stream adapts the state interfaces and contributes through the output outlet without entering the recurrent content route. We further introduce a two-level outer-band leakage diagnostic that separates hidden-state storage from readout expression by measuring outer-band energy in the selective-scan state trajectory and the post-scan Mamba readout. Experiments on five public MRI reconstruction benchmarks spanning diverse anatomies, sampling patterns, and coil configurations show that SO-Mamba consistently improves over CNN-, Transformer-, and Mamba-based baselines with competitive computational efficiency.
111. 【2605.22020】ForeSplat: Optimization-Aware Foresight for Feed-Forward 3D Gaussian Splatting
Link: https://arxiv.org/abs/2605.22020
Authors: Yuke Li, Weihang Liu, Cheng Zhang, Yuefeng Zhang, Jiadi Cui, Zixuan Wang, Junran Ding, Haoyu Wu, Yujiao Shi, Jingyi Yu, Xin Lou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: http URL present, http URL offloading, http URL unrolls, http URL fine-tuning, http URL instantiate
Comments:
Abstract:Feed-forward 3D Gaussian Splatting (3DGS) models offer fast single-pass reconstruction, but scaling them to match per-scene optimization quality is fundamentally hindered by the scarcity of large-scale 3D annotations. A practical compromise is predict-then-refine, where post-prediction optimization compensates for the limited capacity of the feed-forward this http URL, standard feed-forward 3DGS is trained solely for zero-step rendering error, ignoring whether its output constitutes a good initialization for the downstream this http URL present ForeSplat, an optimization-aware training framework that equips feed-forward 3DGS models to produce initializations explicitly designed for rapid, effective this http URL offloading part of the scene-modeling burden to the optimizer, ForeSplat substantially reduces the capacity pressure on the feed-forward model, making high-quality reconstruction feasible even with compact this http URL its core is MetaGrad, a lightweight multi-anchor meta-gradient training rule that bypasses costly higher-order differentiation through the 3DGS this http URL unrolls a short inner-loop refinement trajectory, samples anchor states, and back-propagates aggregated first-order gradients to the prediction head as a surrogate optimization-aware this http URL fine-tuning adds no inference cost and enables high-quality reconstruction within seconds after a few refinement this http URL instantiate ForeSplat on diverse backbones, including AnySplat, Pi3X, and a distilled variant tailored for edge this http URL all tested architectures, a ForeSplat-trained initialization converges in fewer refinement steps and reaches a higher peak reconstruction quality than its vanilla counterpart, even fully this http URL framework consistently bridges the gap between amortized prediction and per-scene optimization, establishing a practical path toward lightweight, high-fidelity 3D reconstruction.
112. 【2605.22018】FRED: A Multi-Modal Autonomous Driving Dataset for Flooded Road Environments
Link: https://arxiv.org/abs/2605.22018
Authors: Connor Malone, Sebastien Demmel, Sebastien Glaser
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Keywords: Flooded Road Environments, Road Environments Dataset, multi-modal autonomous driving, autonomous driving dataset, driving dataset specifically
Comments:
Abstract:The Flooded Road Environments Dataset (FRED) is, to our knowledge, the first multi-modal autonomous driving dataset specifically targeting the collection of data from scenarios involving water hazards on the road. The dataset contains images from a 2.3 MP FLIR Blackfly USB3 camera, 64-beam 360° point clouds from an Ouster OS1-64 LiDAR, and data from an iXblue ATLANS-C IMU corrected by a Geoflex RTK GNSS, from five separate locations captured both during and after flooding events. The data has been released in two formats: a KITTI-style format for easy integration with existing data tools, and the RTMaps format for direct replay of the vehicle's data capture. We provide semantic labels to enable the training and evaluation of both single-sensor and sensor-fusion methods for water hazard detection. Position and velocity, as well as data captured under dry conditions, are provided to enable the development of location-based detection methods that may incorporate maps, and to evaluate other tasks such as localisation and SLAM.
113. 【2605.22017】Diverse Yet Consistent: Context-Guided Diffusion with Energy-Based Joint Refinement for Multi-Agent Motion Prediction
Link: https://arxiv.org/abs/2605.22017
Authors: Lei Chu, Yuhuan Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Deep generative models have become a promising approach, represent diverse human, Deep generative models, capture multimodal distributions, diverse human
Comments: MEIS-- CVPR
Abstract:Deep generative models have become a promising approach for human motion prediction due to their ability to capture multimodal distributions and represent diverse human behaviors. However, generating predictions that are both diverse and jointly consistent among interacting agents remains challenging. In addition, most existing approaches are primarily evaluated using single-agent (marginal) metrics, which fail to fully reflect the joint dynamics of multi-agent interactions. We propose a diffusion-based framework that improves multi-agent motion prediction by leveraging rich contextual information from historical trajectories. This information is incorporated through a guidance mechanism to enhance the diversity and expressiveness of predicted motions. To further enforce interaction consistency, we introduce an energy-based formulation that refines the joint trajectory distribution while preserving the plausibility of individual trajectories. Extensive experiments on four benchmark datasets demonstrate that our approach consistently outperforms existing methods. Notably, our approach substantially improves both marginal (ADE/FDE) and joint (JADE/JFDE) metrics on ETH/UCY over strong marginal baselines. Compared with prior joint prediction methods, it delivers significant gains in marginal metrics while maintaining competitive joint performance.
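The difference between marginal and joint metrics mentioned above can be made concrete with a small sketch. It follows a commonly used convention, which is an assumption here and may differ from the paper's exact definitions: marginal minADE lets each agent pick its own best sample, while joint minADE (JADE) forces all agents to share one sample index. All function names are hypothetical.

```python
import math


def displacement_errors(pred, truth):
    """Return (ADE, FDE) for one trajectory given as a list of (x, y) points."""
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(dists) / len(dists), dists[-1]


def marginal_and_joint_ade(samples, truth):
    """Compare marginal and joint best-of-K average displacement error.

    samples[k][a] is the k-th sampled trajectory of agent a;
    truth[a] is that agent's ground-truth trajectory.
    """
    num_agents = len(truth)
    # ade[k][a]: ADE of agent a under sample k.
    ade = [
        [displacement_errors(sample[a], truth[a])[0] for a in range(num_agents)]
        for sample in samples
    ]
    # Marginal: best sample chosen independently for each agent.
    marginal = sum(min(row[a] for row in ade) for a in range(num_agents)) / num_agents
    # Joint: a single sample index must serve every agent at once.
    joint = min(sum(row) / num_agents for row in ade)
    return marginal, joint


if __name__ == "__main__":
    truth = [[(0.0, 0.0)], [(0.0, 0.0)]]
    samples = [
        [[(0.0, 0.0)], [(2.0, 0.0)]],  # sample 0: agent 0 exact, agent 1 off by 2
        [[(2.0, 0.0)], [(0.0, 0.0)]],  # sample 1: agent 0 off by 2, agent 1 exact
    ]
    print(marginal_and_joint_ade(samples, truth))  # (0.0, 1.0)
```

In this toy case the marginal score is perfect even though no single sample is right for both agents, which is the kind of gap the abstract says marginal metrics fail to reflect.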
114. 【2605.22015】ORBIS: Output-Guided Token Reduction with Distribution-Aware Matching for Video Diffusion Acceleration
Link: https://arxiv.org/abs/2605.22015
Authors: Hangyeol Lee, Joo-Young Kim
Categories: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR)
Keywords: Diffusion Transformer, generating high-quality images, powerful model architecture, video DiT, powerful model
Comments:
Abstract:Diffusion Transformer (DiT) has emerged as a powerful model architecture for generating high-quality images and videos. In the case of video DiT, 3D Spatio-Temporal Attention increases token length in proportion to the number of frames, sharply increasing computational cost. Token reduction methods mitigate this cost by exploiting spatial redundancy, but existing approaches rely on inaccurate similarity estimates and lightweight matching algorithms, resulting in poor matching quality and only marginal acceleration. To overcome these limitations, we propose ORBIS, an SW-HW co-designed accelerator for video DiT. ORBIS leverages the output activation from the previous timestep to obtain more accurate inter-token similarity, substantially improving matching quality and enabling a higher token reduction ratio. We further introduce a Distribution-Aware Token Matching (DATM) algorithm that captures global token distribution and explicitly minimizes token-pair loss for additional gains. To fully hide DATM latency, we design specialized, deeply pipelined hardware and minimize its hardware cost through quantization, occupying only 2.4% of total area with negligible accuracy loss. Extensive experiments show that ORBIS achieves about 2x higher token reduction ratio than the state-of-the-art approach, AsymRnR, while delivering up to 4.5x speedup and 79.3% energy reduction compared to an NVIDIA A100 GPU.
115. 【2605.22013】PointLLM-R: Enhancing 3D Point Cloud Reasoning via Chain-of-Thought
Link: https://arxiv.org/abs/2605.22013
Authors: Chaoqi Chen, Qile Xu, Wenjun Zhou, Hui Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Keywords: visual computing, fundamental challenge, challenge in computer, computer graphics, graphics and visual
Comments:
Abstract:Understanding 3D point clouds through language remains a fundamental challenge in computer graphics and visual computing, due to the irregular structure of point cloud data and the lack of explicit reasoning in existing 3D multimodal models. While Chain-of-Thought (CoT) reasoning has shown strong effectiveness in LLMs and image-based MLLMs, its extension to 3D understanding remains largely underexplored. In this paper, we propose a data-centric framework for constructing large-scale CoT supervision tailored to 3D point cloud understanding. Our framework consists of a two-stage pipeline that first refines point-text instruction data via vision-language-model-based quality evaluation and reference-guided refinement, and then synthesizes high-quality reasoning paths through Human-in-the-Loop Prompt Optimization (HiLPO). Using this approach, we build PoCoTI, a CoT-enhanced point-text instruction-following dataset containing 55K samples with explicit reasoning paths. Fine-tuning PointLLM on PoCoTI yields PointLLM-R, a reasoning-capable 3D multimodal language model. Extensive experiments on generative 3D classification and captioning demonstrate that PointLLM-R achieves state-of-the-art performance and generalizes robustly to real-world scanned point clouds and multi-turn dialogue scenarios.
116. 【2605.22012】LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning
Link: https://arxiv.org/abs/2605.22012
Authors: Yifan Dai, Zhenhua Wu, Bohan Zeng, Daili Hua, Jialing Liu, Bozhou Li, Yuran Wang, Chengzhuo Tong, Hao Liang, Xiaochen Ma, Junbo Niu, Tianyu Guo, Yang Shi, Yue Ding, Yiyan Ji, Bingyin Mei, Yushuo Guan, Yuanxing Zhang, Pengfei Wan, Fangcheng Fu, Wentao Zhang
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: current multimodal large, requires fine-grained evidence, multimodal large language, reasoning requires fine-grained, large language models
Comments: 21 pages, 15 figures
Abstract:Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.
117. 【2605.22011】Rethinking Token Reduction for Diffusion Models via Output-Similarity-Awareness
Link: https://arxiv.org/abs/2605.22011
Authors: Hangyeol Lee, Hyojeong Lee, Joo-Young Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Diffusion Transformers, image generation quality, quadratic computational complexity, computational complexity relative, achieve superior image
Comments:
Abstract:Diffusion Transformers (DiTs) achieve superior image generation quality but suffer from quadratic computational complexity relative to token count. While various token reduction (TR) methods have been proposed to mitigate this cost, they overlook the primary objective of generative models: minimizing recovery error, which requires reflecting output token similarity. They rely solely on input token similarity inherited from reduction-only ViT paradigms, leading to a fundamental misalignment with this objective. To bridge this gap, we propose DiTo, a novel TR paradigm that shifts the focus toward output-centric token reduction. Based on the observation that output token similarity is consistently preserved across adjacent timesteps, DiTo utilizes prior-step similarities as an effective proxy to establish token correspondences at a Matching timestep, which are then reused across multiple subsequent Reduction timesteps. To optimize this interleaved scheduling, we propose Pair Match Ratio (PMR)-guided Interval Scheduling to determine the optimal matching frequency. Furthermore, to mitigate localized approximation errors and resulting blocking artifacts caused by repeated reuse, we propose Frequency-aware Token Matching by incorporating a selection-frequency penalty. Extensive experiments demonstrate that DiTo consistently outperforms existing TR methods with 1.6-3.9 dB higher PSNR at comparable speedups, achieving a superior Pareto frontier.
118. 【2605.22002】ConvNeXt-FD: A Fractal-Based Deep Model for Robust Biomedical Image Segmentation
Link: https://arxiv.org/abs/2605.22002
Authors: Joao Batista Florindo, Amanda Pontes de Oliveira Ornelas
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: enabling precise delineation, Biomedical image segmentation, Breast Ultrasound Images, Thyroid Ultrasound Images, treatment planning
Comments:
Abstract:Biomedical image segmentation is a critical task in medical diagnosis and treatment planning, enabling precise delineation of anatomical structures and pathological regions. Despite significant advancements, challenges persist due to the inherent variability, noise, and complex morphology present in diverse medical imaging modalities. This paper introduces ConvNeXt-FD, a novel deep learning architecture for robust biomedical image segmentation, built upon a U-Net-like encoder-decoder framework leveraging the powerful ConvNeXt backbone. Our approach integrates a hybrid loss function combining the Dice coefficient with a boundary-aware regularization term inspired by a differentiable formulation of Fractal Dimension, designed to enhance the model's sensitivity to object boundaries and shape fidelity. We rigorously evaluate ConvNeXt-FD across six distinct biomedical datasets: BUSI (Breast Ultrasound Images), DDTI (Thyroid Ultrasound Images), FluoCells (Fluorescent Cell Images), IDRiD (Diabetic Retinopathy Images for Optic Disc Segmentation), ISIC2018 (Skin Lesion Images), and MoNuSeg (Nuclei Segmentation). Experimental results demonstrate that ConvNeXt-FD, particularly when initialized with ImageNet pre-trained weights, achieves competitive and often superior performance compared to existing state-of-the-art methods across various metrics, including Dice, Jaccard, Accuracy, Sensitivity, Specificity, and False Positive Rate. The integration of ConvNeXt as a strong encoder, coupled with the boundary-aware regularization, proves effective in capturing both high-level semantic features and fine-grained boundary details, leading to more accurate and reliable segmentations in challenging biomedical contexts.
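As a rough illustration of the metrics and the hybrid loss shape mentioned above, the sketch below computes soft Dice and Jaccard on flattened masks and combines a Dice loss with a generic boundary term. The abstract does not give the fractal-dimension regularizer, so `boundary_term` and `weight` are placeholders and `hybrid_loss` is a hypothetical name, not the authors' formulation.

```python
def dice(pred, target, eps=1e-7):
    """Soft Dice coefficient on flattened masks with values in [0, 1]."""
    inter = sum(p * t for p, t in zip(pred, target))
    return (2 * inter + eps) / (sum(pred) + sum(target) + eps)

def jaccard(pred, target, eps=1e-7):
    """Soft Jaccard index (intersection over union) on flattened masks."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(pred) + sum(target) - inter
    return (inter + eps) / (union + eps)

def hybrid_loss(pred, target, boundary_term, weight=0.1):
    """Dice loss plus a weighted boundary regularizer.
    boundary_term stands in for the paper's fractal-dimension term,
    which is not specified in the abstract (assumption)."""
    return (1.0 - dice(pred, target)) + weight * boundary_term

pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
print(dice(pred, target))     # close to 2/3
print(jaccard(pred, target))  # close to 1/2
```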
119. 【2605.22000】Virtual 3D H&E Staining from Phase-contrast Back-illumination Interference Tomography
Link: https://arxiv.org/abs/2605.22000
Authors: Anthony Song, Boyan Zhou, Mayank Golhar, Marisa Morakis, Alex Baras, Nicholas Durr
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: transform disease management, Back-illumination Interference Tomography, unprocessed tissues, in-vivo assessment, enabling volumetric characterization
Comments:
Abstract:Three-dimensional (3D) histopathology of unprocessed tissues has the potential to transform disease management by enabling volumetric characterization of tissue microarchitecture and in-vivo assessment. Back-illumination Interference Tomography (BIT) is a new phase microscopy technology that provides rapid, non-destructive volumetric imaging of unprocessed tissues. However, translating BIT volumes into clinically interpretable H&E images remains challenging, particularly due to shift-variant contrast and the absence of quantitative validation benchmarks. We introduce HistoBIT3D, the first voxel-wise paired BIT and fluorescence-labeled nuclei dataset, enabling quantitative evaluation of structural preservation in unsupervised virtual staining against ground-truth nuclear distributions. Using this dataset, we present a novel virtual staining framework that translates BIT volumes with shift-variant contrast into realistic H&E volumes by leveraging bidirectional multiscale content consistency and cross-domain style reuse to enhance structural fidelity and perceptual realism. Our method achieves state-of-the-art realism metrics while significantly improving 3D nuclei segmentation accuracy and boundary preservation under zero-shot Cellpose evaluation. Together, these contributions establish a quantitatively validated, structurally faithful, and scalable pipeline for 3D virtual H&E staining, advancing the paradigm of slide-free, volumetric computational histopathology. Our data and code are available at: this https URL.
120. 【2605.21988】Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning
Link: https://arxiv.org/abs/2605.21988
Authors: Dazhao Du, Jian Liu, Jialong Qin, Tao Han, Bohai Gu, Fangqi Zhu, Yujia Zhang, Eric Liu, Xi Chen, Song Guo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Video large language, large language models, achieve strong benchmark, strong benchmark accuracy, large language
Comments: Project website: [this https URL](https://ddz16.github.io/crpo.github.io/)
Abstract:Video large language models (Video LLMs) achieve strong benchmark accuracy, yet often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. This issue is exacerbated in RL post-training, where correctness-only rewards can further reinforce shortcut policies that obtain high reward without tracking video dynamics. We address this by asking a controlled counterfactual question: if the visual world changed while the question remained fixed, should the answer change or stay the same? Based on this view, we propose Counterfactual Relational Policy Optimization (CRPO), a dual-branch RL framework for improving spatiotemporal sensitivity. CRPO constructs counterfactual videos through horizontal flips and temporal reversals, trains on both original and counterfactual branches, and introduces a Counterfactual Relation Reward (CRR) between their answers. CRR encourages answers to change for dynamic questions and remain unchanged for static questions. This cross-branch constraint makes it difficult for shortcut policies to be consistently rewarded across both branches. To evaluate this property, we introduce DyBench, a paired counterfactual video benchmark with 3,014 videos covering reversible dynamics, moving direction, and event sequence, together with a strict pair-accuracy metric that prevents fixed-answer shortcuts from inflating scores. Experiments show that CRPO outperforms prior RL methods on spatiotemporal-sensitive evaluations while maintaining competitive general video performance. On Qwen3-VL-8B, CRPO improves DyBench P-Acc by +7.7 and TimeBlind I-Acc by +8.2 over the base model, indicating improved spatiotemporal sensitivity rather than stronger reliance on static shortcuts. The project website can be found at this https URL.
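The counterfactual construction and the pair-level scoring described in this abstract can be sketched in a few lines. This is an illustrative reading, not the authors' code: a video is modeled as a list of frames, each frame as a list of pixel rows, and `relation_reward` and `pair_accuracy` are hypothetical names with assumed binary definitions.

```python
def horizontal_flip(video):
    """Mirror every frame left to right; a frame is a list of pixel rows."""
    return [[row[::-1] for row in frame] for frame in video]

def temporal_reverse(video):
    """Play the frames in reverse order."""
    return video[::-1]

def relation_reward(answer_orig, answer_cf, question_is_dynamic):
    """Assumed binary form of a counterfactual relation reward:
    answers should differ for dynamic questions and match for static ones."""
    changed = answer_orig != answer_cf
    return 1.0 if changed == question_is_dynamic else 0.0

def pair_accuracy(pairs):
    """pairs holds (pred_orig, gold_orig, pred_cf, gold_cf) tuples.
    A pair counts only if both predictions are correct (assumed definition)."""
    hits = sum(1 for po, go, pc, gc in pairs if po == go and pc == gc)
    return hits / len(pairs)

video = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # 2 frames of 2x2 pixels
flipped = horizontal_flip(video)
reversed_video = temporal_reverse(video)

# A model that always answers "left" scores zero on a pair whose gold answer flips.
always_left = [("left", "left", "left", "right")]
print(pair_accuracy(always_left))  # 0.0
```

Under this assumed definition, a fixed-answer policy can be right on at most one side of a pair whose gold answer changes, so it earns no pair credit there.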
121. 【2605.21981】RiT: Vanilla Diffusion Transformers Suffice in Representation Space
Link: https://arxiv.org/abs/2605.21981
Authors: Le Zhang, Ning Mang, Aishwarya Agrawal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: clean data point, manifold structure effectively, Flow matching, exploit low-dimensional manifold, low-dimensional manifold structure
Comments:
Abstract:Flow matching with $x$-prediction -- regressing the clean data point rather than the ambient velocity -- is known to exploit low-dimensional manifold structure effectively in pixel space \cite{li2025back}. We ask whether a pretrained representation space, while containing a low-dimensional data manifold of comparable intrinsic dimensionality, offers a distribution more favorable for flow-matching learning. Comparing pixel, SD-VAE, and DINOv2 features along four geometric axes, we find that pixel and DINOv2 share nearly identical intrinsic dimensionalities (both $\hat{d}\!\approx\!33$) yet DINOv2 exhibits $7.3\times$ higher effective rank, $35\times$ better covariance conditioning, $11.5\times$ lower excess kurtosis, and $1.7\times$ lower on-manifold interpolation error; SD-VAE latents are consistently intermediate, indicating that the advantage stems from representation-learning objectives rather than mere compression. These statistical properties render the flow-matching regression well-conditioned and remove the need for the specialized prediction heads or Riemannian transport used by prior DINOv2 diffusion methods. We propose the \emph{Representation Image Transformer} (RiT): a vanilla Diffusion Transformer trained by $x$-prediction on frozen DINOv2 features, augmented only by a dimension-aware noise schedule and joint \texttt{[CLS]}-patch modeling. On ImageNet $256{\times}256$, RiT attains FID 1.45 without guidance and 1.14 with classifier-free guidance, outperforming DiT$^\text{DH}$-XL with $19\%$ fewer parameters (676M vs.\ 839M). The resulting ODE is efficiently solvable at coarse discretizations: with classifier-free guidance, $5$ Heun steps already reach FID 2.0 and $10$ steps reach 1.25, without distillation or consistency training. Code at this https URL.
122. 【2605.21980】Interpreting and Enhancing Emotional Circuits in Large Vision-Language Models via Cross-Modal Information Flow
Link: https://arxiv.org/abs/2605.21980
Authors: Chengsheng Zhang, Chenghao Sun, Zhining Xie, Xinmei Tian
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Large Vision-Language Models, demonstrating remarkable capabilities, Large Vision-Language, Vision-Language Models, represent a significant
Comments: Accepted by ICML 2026
Abstract:Large Vision-Language Models (LVLMs) represent a significant leap towards empathetic agents, demonstrating remarkable capabilities in emotion understanding. However, the internal mechanisms governing how LVLMs translate abstract visual stimuli into coherent emotional narratives remain largely unexplored, primarily due to the scarcity of visual counterfactuals and the diffuse nature of emotional expression. In this paper, we bridge this gap by introducing a steering-vector-based causal attribution framework tailored for descriptive emotional reasoning. To this end, we construct a specialized dataset to demystify the emotional circuits underlying the three-stage "Adapt-Aggregate-Execute" mechanism. Crucially, we discover a functional decoupling: visual emotional cues are aggregated in middle layers via sentiment-specific attention heads, but are subsequently translated into narrative generation in deep layers through emotion-general pathways. Guided by these insights, we regulate the emotional information routing to strengthen attention flow and amplify the semantic activation to consolidate expression. Extensive experiments on the comprehensive MER-UniBench demonstrate that our methods significantly improve performance via inference-time intervention, effectively mitigating emotional hallucinations and corroborating the causal fidelity of the discovered circuits.
123. 【2605.21977】Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection
Link: https://arxiv.org/abs/2605.21977
Authors: Zhengcen Li, Chenyang Jiang, Liangxu Su, Tong Shao, Shiyang Zhou, Ming Tao, Jingyong Su
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: deployment pipelines, rapidly improving, creating an urgent, visual modalities, SOTA AI-generated image
Comments:
Abstract:AI-generated content (AIGC) is rapidly improving, creating an urgent need for detectors that generalize across data sources, deployment pipelines, and visual modalities. A strongly generalizable detector should remain robust under distributional variations. However, we identify a consistent failure mode: SOTA AI-generated image detectors often collapse when applied to frames extracted from videos. Through systematic analysis, we show that this cross-modal gap arises from both entangled synthesis-agnostic video processing shifts, including color conversion, codec compression, resizing, and blur, and model-specific fingerprints introduced by modern video generators. Motivated by these findings, we propose VINA (Video as Natural Augmentation), a unified AIGC detection framework that jointly trains on image and video data. VINA uses video frames as physically grounded natural augmentations and further introduces a cross-modal supervised contrastive objective to align image and video representations under a shared real/fake decision boundary. Extensive experiments on 14 image, video, and in-the-wild benchmarks show that VINA delivers bidirectional gains, improves robustness and transferability, and achieves state-of-the-art performance across nearly all evaluated settings without complex augmentation or dataset-specific tuning.
124. 【2605.21973】Foresee-to-Ground: From Predictive Temporal Perception to Evidence-Driven Reasoning for Video Temporal Grounding
Link: https://arxiv.org/abs/2605.21973
Authors: Zelin Zheng, Xinyan Liu, Ruixin Li, Antoni B. Chan, Guorong Li, Qingming Huang, Laiyun Qing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: unstructured visual-token stream, Current Video-LLM approaches, direct timestamp generation, typically rely, visual-token stream
Comments: Accepted by ICML 2026
Abstract:Current Video-LLM approaches for Video Temporal Grounding (VTG) typically rely on direct timestamp generation from an unstructured visual-token stream, often leading to brittle numerics and inconsistent boundaries. To address this, we propose Foresee-to-Ground (F2G), a framework that reformulates VTG as a verifiable Identify-then-Measure problem. F2G integrates Predictive Temporal Perception with Evidence-Driven Reasoning: it learns boundary-sensitive temporal representations to build a video-wide evidence pool of candidate event segments, and exposes these segments to the LLM as citable evidence units that bind boundary prediction to explicit event hypotheses. By decoupling event identification from precise boundary measurement, F2G stabilizes grounding and makes predictions verifiable. Extensive experiments demonstrate that F2G consistently improves grounding accuracy across diverse benchmarks, transfers robustly across different Video-LLM backbones, and preserves general video understanding capabilities.
125. 【2605.21964】Dual-Integrated Low-Latency Single-Lens Infrared Computational Imaging for Object Detection
Link: https://arxiv.org/abs/2605.21964
Authors: Xuquan Wang, Guishuo Yang, Dapeng Yan, Yujie Xing, Xuanyu Qian, Kai Zhang, Xiong Dun, Jiande Sun, Zhanshan Wang, Xinbin Cheng
Categories: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Keywords: substantial inference latency, combine image reconstruction, object detection, Computational imaging enables, introduce substantial inference
Comments: 15 pages, 11 figures; supplementary material: 3 pages, 2 figures
Abstract:Computational imaging enables compact infrared systems, but deep-learning pipelines that combine image reconstruction and object detection often introduce substantial inference latency. Most existing acceleration strategies compress the reconstruction network while overlooking physical priors from the optical path, leaving a trade-off between accuracy and speed. We present Physics-aware Dual-Integrated Network (PDI-Net), a low-latency framework that integrates infrared reconstruction with object detection and further embeds optical priors into the learning process. PDI-Net uses a supervised U-Net during training, while a semi-U-Net encoder shares features directly with a YOLO-based detector during inference, avoiding full image reconstruction. To bridge the gap between fidelity-oriented reconstruction features and detection-oriented semantics, we introduce a physics-aware large-small bridge (PALS-Bridge), which uses field-dependent point spread function priors to adaptively modulate multiscale convolutional branches. A physics-informed optical degradation simulation pipeline is also developed for training and validation. The method is deployed on a single-lens infrared camera, reducing system weight by about 50% compared with traditional multi-lens designs. On the M3FD benchmark under low-SNR conditions, PDI-Net reduces inference time by 84.06% compared with the Rec+Det with pruning strategy while improving mAP@0.5:0.95 by 5.07%. These results demonstrate compact, low-latency computational infrared imaging for real-time object detection on resource-constrained platforms.
126. 【2605.21957】Bounding-Box Trajectories Matter for Video Anomaly Detection
Link: https://arxiv.org/abs/2605.21957
Authors: Inpyo Song, Jangwon Lee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: remains highly challenging, extensive research due, safety and security, scene dynamics, remains highly
Comments: 17 pages, 3 figures
Abstract:Video anomaly detection is critical for public safety and security, yet remains highly challenging despite extensive research due to large variations in appearance, viewpoint, and scene dynamics. Among existing approaches, human pose-based methods have emerged as a major line of research, showing strong performance since many anomalies in public datasets involve humans and pose representations are robust to appearance changes while providing compact motion descriptions. However, these methods often overlook bounding-box trajectories, although such information is inherently available in pose-based pipelines. In this paper, we explicitly leverage these trajectories as a primary anomaly cue. We present TrajVAD, a framework that models multi-class bounding-box trajectories using normalizing flows to learn normal kinematic patterns. Its trajectory-only variant (TrajVAD-T) eliminates pose estimation and surpasses all compared pose-based methods on ShanghaiTech in AP (87.7%), while achieving the best results on MSAD. An extended version (TrajVAD-P) incorporates pose information and further improves performance to 88.6% AUROC and 90.9% AP on ShanghaiTech, highlighting bounding-box trajectories as an effective yet underexplored modality for video anomaly detection.
127. 【2605.21954】MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues
Link: https://arxiv.org/abs/2605.21954
Authors: Dazhao Du, Liao Duan, Jian Liu, Tao Han, Yujia Zhang, Eric Liu, Xi Chen, Song Guo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: large language models, multimodal large language, Video temporal grounding, Temporal Grounding Heads, language models
Comments: Project Website: [this https URL](https://ddz16.github.io/mllmsknowwhen.github.io/)
Abstract:Video temporal grounding (VTG), which localizes the start and end times of a queried event in an untrimmed video, is a key test of whether multimodal large language models (MLLMs) understand not only what happens but also when it happens. Although modern MLLMs describe video content fluently, their timestamp predictions remain unreliable, while existing remedies either require costly post-training on temporal annotations or rely on coarse training-free heuristics. In this work, we probe the cross-modal attention of MLLMs and uncover a perception-generation gap. Our key finding is that MLLMs often know the target interval during prefill, but lose this signal when generating the final answer. In the prefill stage, a sparse set of attention heads, which we call Temporal Grounding Heads (TG-Heads), concentrates query-to-video attention on the ground-truth interval. During autoregressive decoding, however, the answer tokens shift attention away from this interval toward visually salient but query-irrelevant segments. This observation motivates an inference-time read-then-regenerate framework. We first convert TG-Head prefill attention into a debiased frame-level relevance signal and extract the high-attention interval it highlights. We then re-invoke the MLLM with visual context restricted to this interval, using video cropping or attention masking to suppress distractors. Without parameter updates or architectural changes, our framework consistently improves MiMo-VL-7B, Qwen3-VL-8B, and TimeLens-8B on three VTG benchmarks, with gains of up to +3.5 mIoU. The project website can be found at this https URL.
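The "extract the high-attention interval" step can be pictured with a simple peak-expansion rule. The abstract does not say how the interval is obtained, so the rule below (grow outward from the peak frame while scores stay above a fraction of the peak) is purely an assumption; `high_attention_interval`, `to_seconds`, and `ratio` are hypothetical, and the debiasing step is not modeled.

```python
def high_attention_interval(relevance, ratio=0.5):
    """Return inclusive (start, end) frame indices of the contiguous run
    around the peak whose scores stay at or above ratio * peak score.
    Assumes non-negative relevance scores, one per sampled frame."""
    peak = max(range(len(relevance)), key=relevance.__getitem__)
    threshold = ratio * relevance[peak]
    start = end = peak
    while start > 0 and relevance[start - 1] >= threshold:
        start -= 1
    while end < len(relevance) - 1 and relevance[end + 1] >= threshold:
        end += 1
    return start, end

def to_seconds(interval, fps):
    """Convert an inclusive frame interval to (start_s, end_s) at a given sampling rate."""
    start, end = interval
    return start / fps, (end + 1) / fps

relevance = [0.1, 0.2, 0.6, 0.9, 0.7, 0.2, 0.8]
interval = high_attention_interval(relevance)
print(interval)                     # (2, 4)
print(to_seconds(interval, fps=2))  # (1.0, 2.5)
```

The isolated high score at the last frame is ignored because it is not contiguous with the peak; a real pipeline might handle several candidate runs differently.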
128. 【2605.21931】EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models
Link: https://arxiv.org/abs/2605.21931
Authors: Shiqi Huang, Ziyue Wang, Zhongrong Zuo, Han Qiu, Qi She, Bihan Wen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, Recent Video Large, Video Large Language, Large Language, demonstrated strong capabilities
Comments: Project page: [this https URL](https://huangshiqi128.github.io/EvoVid.io/)
Abstract:Recent Video Large Language Models (Video-LLMs) have demonstrated strong capabilities in video reasoning through reinforcement learning (RL). However, existing RL pipelines rely heavily on human-annotated tasks and solutions, making them costly to scale and fundamentally constrained by human expertise. Self-evolving frameworks have recently emerged as a promising alternative through autonomous Questioner-Solver self-play. Unfortunately, these approaches are primarily designed for static modalities such as text and images, fundamentally failing to capture the temporal dynamics that are central to video reasoning. In this work, we propose EvoVid, a temporal-centric self-evolving framework that enables Video-LLMs to improve directly from raw, unannotated videos. Specifically, we introduce two complementary temporal-centric rewards: a temporal-aware Questioner reward that encourages temporally dependent question generation through temporal perturbation sensitivity, and a temporal-grounded Solver reward that provides automatic temporal supervision via inherent video segment localization. Extensive experiments across four base models and six benchmarks demonstrate consistent improvements over both base models and existing self-evolving baselines, achieving competitive performance with supervised methods. These results highlight temporal-centric self-evolution as an effective and scalable paradigm for video understanding and reasoning.
129. 【2605.21924】Visual-Advantage On-Policy Distillation for Vision-Language Models
Link: https://arxiv.org/abs/2605.21924
Authors: Ruiqi Liu, Xiaolei Lv, Gengsheng Li, Ximo Zhu, Zhiheng Wang, Zhengbo Zhang, Junkai Chen, Zhiheng Li, Bo Li, Jun Gao, Shu Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: On-policy knowledge distillation, vision-language models, standard on-policy distillation, on-policy distillation, fine-grained visual detail
Comments:
Abstract:On-policy knowledge distillation has proven effective for language models, yet its application to vision-language models (VLMs) remains underexplored. We observe that standard on-policy distillation can improve a student's output quality while failing to strengthen its reliance on visual input: on vision-critical tokens, the student's predictions remain largely unchanged whether or not fine-grained visual detail is present, even though the teacher's predictions depend heavily on it. To make this difference observable, we introduce visual advantage (VA), the token-level log-probability difference when the teacher scores a student-generated rollout with versus without access to fine-grained visual detail. VA is concentrated in a small minority of tokens, and these high-VA tokens are the ones that actually carry the visual supervision signal. This motivates a distillation objective that treats them differently from language scaffolding, so their contribution is not diluted by the abundant surrounding language tokens. We propose Visual-Advantage On-Policy Distillation (VA-OPD), which uses VA at two granularities: rollout-level reweighting by trajectory-averaged VA, and token-level KL averaged within high-VA and low-VA groups separately. We train on two math datasets (Geometry3K and ViRL39K) and evaluate on eight benchmarks covering both mathematical reasoning and visual understanding, across three teacher sizes (4B, 8B, and 32B) on the Qwen3-VL family. VA-OPD improves over standard on-policy distillation on every benchmark, with the gain growing monotonically along both the teacher-size and data-scale axes, suggesting that these factors compound consistently.
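A minimal sketch of the token-level part of this idea follows, under stated assumptions: VA is taken as the per-token difference of teacher log-probabilities with and without visual detail, the high-VA group is the top fraction of tokens by VA, and the two group means are summed with equal weight. The function names, the `top_fraction` parameter, and the equal weighting are hypothetical; the abstract does not specify them.

```python
def visual_advantage(logp_with, logp_without):
    """Per-token VA: teacher log-prob with visual detail minus without it."""
    return [a - b for a, b in zip(logp_with, logp_without)]

def split_by_va(va, top_fraction=0.25):
    """Return (high, low) token index lists; high holds the top fraction by VA.
    The selection rule is an assumption made for illustration."""
    k = max(1, round(top_fraction * len(va)))
    order = sorted(range(len(va)), key=lambda i: va[i], reverse=True)
    return sorted(order[:k]), sorted(order[k:])

def group_mean(values, indices):
    """Mean of values at the given indices, or 0.0 for an empty group."""
    if not indices:
        return 0.0
    return sum(values[i] for i in indices) / len(indices)

def grouped_kl_loss(token_kl, high, low):
    """Average KL within each group separately, then add the two means,
    so a few high-VA tokens are not diluted by many low-VA tokens."""
    return group_mean(token_kl, high) + group_mean(token_kl, low)

logp_with = [-1.0, -0.5, -2.0, -0.1]
logp_without = [-1.0, -3.5, -2.0, -0.1]
va = visual_advantage(logp_with, logp_without)  # [0.0, 3.0, 0.0, 0.0]
high, low = split_by_va(va)                     # [1], [0, 2, 3]

token_kl = [0.25, 1.0, 0.25, 0.25]
print(grouped_kl_loss(token_kl, high, low))  # 1.25
print(sum(token_kl) / len(token_kl))         # 0.4375 for a plain token average
```

In the plain average the single vision-critical token contributes a quarter of its KL, whereas the grouped form gives it a full share regardless of how many language tokens surround it.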
130. 【2605.21919】SDGBiasBench: Benchmarking and Mitigating Vision--Language Models' Biases in Sustainable Development Goals
Link: https://arxiv.org/abs/2605.21919
Authors: Zihang Lin, Huaiyuan Qin, Muli Yang, Hongyuan Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Assessing progress, Sustainable Development Goals, requires multi-step reasoning, imperfect evidence integration, introduce hidden prediction
Comments:
Abstract:Assessing progress toward the Sustainable Development Goals (SDGs) requires multi-step reasoning over visual cues, contextual knowledge, and development indicators, where incomplete evidence use and imperfect evidence integration can introduce hidden prediction biases. Real-world SDG monitoring further spans both qualitative judgments and quantitative estimation. However, existing benchmarks typically evaluate these aspects in isolation, obscuring systematic biases that emerge when models substitute priors for evidence. To address this gap, we propose SDGBiasBench, a large-scale benchmark suite for SDG-oriented vision-language reasoning. Spanning 500k expert-involved multiple-choice questions and 50k regression tasks, the benchmark enables comprehensive assessment of both decision-level and estimation-level bias in Vision-Language Models (VLMs). Evaluations on SDGBiasBench reveal an intrinsic SDG bias in current VLMs, where predictions are frequently driven by SDG-specific priors rather than reliable multi-modal cues. To mitigate such bias, we propose CADE (Contrastive Adaptive Debias Ensemble), a training-free, plug-and-play method that leverages modality-specific answer priors. CADE yields significant gains on the proposed benchmark, improving multiple-choice accuracy by up to 25% and reducing regression MAE by up to 12 points across multiple VLMs. We hope our work can foster the development of fairer and more reliable AI systems for sustainable development.
131. 【2605.21917】MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks
Link: https://arxiv.org/abs/2605.21917
Authors: Han Zhang, Wanting Jiang, Tomasz Kornuta, Tian Zheng, Vidya Murali
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Vision Language Models, Training Vision Language, Language Models, Vision Language, requires high-quality structured
Comments: CVPR 2026 Workshop
Abstract:Training Vision Language Models (VLMs) for video event reasoning requires high-quality structured annotations capturing not only what happened, but when, where, why, and with what consequence, at a scale manual labelling cannot support. We present MAVEN (Multi-stage Agentic Video Event aNnotation), a multi-stage agentic pipeline that turns raw videos into multi-task training data with Chain-of-Thought (CoT) reasoning traces, organized around a designated Event of Focus. At its core, MAVEN synthesizes a Multi-Scale Spatio-Temporal Event Description (MSTED) from three complementary caption levels; this explicit intermediate serves as the sole input to downstream QA generation across multiple task formats. Crucially, MAVEN supports agent-driven domain adaptation: given a new video dataset and target question examples, the agent redesigns all prompts top-down without manual re-engineering. A hierarchical refinement loop further classifies annotation errors against a taxonomy, traces root causes to the originating pipeline stage, and applies targeted edits that rewrite prompts or modify the pipeline structure itself, iteratively improving data quality. We apply MAVEN to label over 5,300 traffic videos and fine-tune Cosmos-Reason2-8B on the resulting data. On a private CCTV evaluation set, fine-tuning surpasses both Gemini 2.5 Pro and 3.1 Flash, including a $+38.8$-point gain in MCQ accuracy over zero-shot. On AccidentBench, CCTV-only training lifts Cosmos-Reason2 by $+10.7$ MCQ points and matches Gemini 2.5 Pro despite seeing no dashcam videos; adding agent-adapted dashcam annotations narrows the gap to Gemini 3.1 Flash, and RL post-training pushes overall performance past both Gemini baselines. Qualitative results on warehouse surveillance and public safety videos further show the agentic workflow readily adapts the pipeline to new domains.
132. 【2605.21913】Multi-scale interaction network for stereo image super-resolution
Link: https://arxiv.org/abs/2605.21913
Authors: Liyi Xu, Lin Qi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: leveraging complementary information, generate high-resolution images, Stereo image super-resolution, binocular systems, image super-resolution aims
Comments:
Abstract:Stereo image super-resolution aims to generate high-resolution images by leveraging complementary information from binocular systems. Although previous studies have achieved impressive results, the potential of intra-view and cross-view information has not been fully exploited. To address this issue, we propose a novel multi-scale interaction network for stereo image super-resolution. Specifically, we design a Multi-scale Spatial-Channel Attention Module that utilizes multi-scale large separable kernel attention and simple channel attention to improve intra-view feature extraction. Additionally, we propose a Dual-View Epipolar Attention Module, utilizing an optimal transport algorithm to achieve more accurate matching along the epipolar line. Extensive experiments and ablation studies show that our method achieves competitive results, outperforming most SOTA methods.
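The abstract does not say which optimal transport solver the Dual-View Epipolar Attention Module uses. As a rough sketch, assuming entropy-regularized optimal transport solved with Sinkhorn iterations (a common choice, not confirmed by the source), matching the pixels of one row of a rectified stereo pair could look like the following. The one-hot toy features, `reg`, and `n_iters` are illustrative assumptions.

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport with uniform marginals.

    Returns a transport plan whose rows sum to 1/n and columns to 1/m.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)        # mass on each left-view pixel
    b = np.full(m, 1.0 / m)        # mass on each right-view pixel
    K = np.exp(-cost / reg)        # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)            # rescale rows toward marginal a
        v = b / (K.T @ u)          # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

# Toy epipolar line: 5 left-view features, and the same features in
# shuffled order standing in for the right view.
left = np.eye(5)
perm = np.array([2, 0, 4, 1, 3])
right = left[perm]
cost = 1.0 - left @ right.T        # cosine distance for unit-norm features
plan = sinkhorn(cost)
matches = plan.argmax(axis=1)      # best right-view index per left pixel
```

Because the plan is constrained by both marginals, each right-view pixel can absorb only a limited amount of mass, which discourages many left-view pixels from collapsing onto the same match.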
133. 【2605.21907】Guided Trajectory Optimization with Sparse Scaling for Test-Time Diffusion
Link: https://arxiv.org/abs/2605.21907
Authors: Gang Dai, Yining Huang, Yiming Xia, Guohao Chen, Shuaicheng Niu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: efficient Test-Time Scaling, paradigm offers, perspective for enhancing, enhancing the generation, generation performance
Comments:
Abstract:The efficient Test-Time Scaling (TTS) paradigm offers a promising perspective for enhancing the generation performance of diffusion models. However, current solutions are limited to a static, pre-defined noise pool and suffer from inflexible noise exploration across the denoising trajectory. To bridge this gap, we propose RTS, a novel Reward-guided Trajectory Scaling method to fully unlock the generative potential of diffusion models. Unlike existing methods, RTS facilitates the synthesis of refined, high-fidelity images via two core innovations: 1) a reward-guided noise optimization strategy to actively direct the search towards promising regions; and 2) a sparse test-time scaling framework together with a PCA-driven curvature analysis scheme to prioritize key intermediate steps in the entire denoising space, effectively compressing the search space. Experiments show that our approach outperforms baselines by 15.6% on GenEval score and by 60.4% on ImageReward score, setting a new SOTA while providing a practical guideline for more effective test-time scaling across diffusion-specific architectures.
134. 【2605.21906】Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining
Link: https://arxiv.org/abs/2605.21906
Authors: Yuheng Li, Yuan Gao, Haoyu Dong, Yuxiang Lai, Shansong Wang, Mojtaba Safari, James E. Baciak, Xiaofeng Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: CT-based artificial intelligence, artificial intelligence remains, intelligence remains fragmented, Computed tomography, CT-based artificial
Comments:
Abstract:Computed tomography (CT) is central to three-dimensional medical imaging, yet CT-based artificial intelligence remains fragmented across task-specific models for segmentation, classification, registration, and report analysis. Here we present FlexiCT, a family of CT foundation models trained by agglomerative continual pretraining on 266,227 CT volumes from 56 publicly available datasets, forming a large-scale public resource for CT representation learning. FlexiCT uses agglomerative pretraining across three stages: two-dimensional axial pretraining, three-dimensional anatomical pretraining and report-guided semantic alignment. This training strategy supports slice-level, volume-level and vision-language analysis. Across five downstream task families (segmentation, classification, registration, vision-language understanding and clinical retrieval), FlexiCT matches or exceeds prior task-specific approaches on multiple benchmarks. Its embeddings further organize CT scans along gradients associated with various tumor stages, suggesting that CT foundation models can capture imaging features relevant to disease phenotype characterization. Code is available at this https URL
135. 【2605.21882】Thermo-VL: Extending Vision-Language Models to Thermal Infrared Perception
Link: https://arxiv.org/abs/2605.21882
Authors: Rusiru Thushara, Yasiru Ranasinghe, Jay Paranjape, Vishal M. Patel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: visible cues degrade, infrared preserves complementary, preserves complementary scene, complementary scene structure, thermal infrared preserves
Comments: 18 pages, 11 figures
Abstract:Vision-language models (VLMs) often fail under low illumination because their visual grounding is learned predominantly from RGB imagery, whereas thermal infrared preserves complementary scene structure when visible cues degrade. We present Thermo-VL, a wavelength-aware VLM that augments a frozen Molmo-7B backbone with a trainable thermal encoder and a text-guided dual-attention fusion module. Given aligned RGB tokens, thermal tokens, and prompt embeddings, the fusion module conditions thermal features on both language and RGB context, then injects a gated residual into the frozen RGB stream so thermal evidence can be incorporated without disrupting Molmo's pretrained RGB-language interface. We train the model with the standard language-modeling objective together with auxiliary alignment and regularization losses that improve cross-modal grounding and reduce over-reliance on RGB. We also introduce a pixel-aligned RGB-thermal instruction-tuning dataset and Thermo-VL-Bench, a manually screened RGB-thermal VQA benchmark for low-light and cross-spectrum reasoning. Experiments show strong gains on challenging thermal-only and RGB+thermal reasoning tasks, highlighting the value of prompt-conditioned multispectral fusion. Our dataset and code are publicly available at: this https URL
136. 【2605.21869】Two-Stage Multimodal Framework for Emotion Mimicry Intensity Prediction
Link: https://arxiv.org/abs/2605.21869
Authors: Dinithi Dissanayake, Shaveen Silva, Ovindu Atukorala, Prasanth Sasikumar, Suranga Nanayakkara
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Keywords: Emotional Mimicry Intensity, emotion intensity dimensions, continuous emotion intensity, Empathic Pain, Emotional Mimicry
Comments: 10th Affective Behavior Analysis in-the-wild, CVPR Workshop 2026
Abstract:We present our submission to the Hume-ABAW10 Emotional Mimicry Intensity (EMI) Challenge, which aims to predict six continuous emotion intensity dimensions (Admiration, Amusement, Determination, Empathic Pain, Excitement, and Joy) from in-the-wild multimodal video clips. We propose a staged multimodal framework that combines textual, acoustic, and visual representations, with an optional motion branch. Our approach first trains modality-specific encoders independently and then fuses their learned representations through a lightweight regressor with modality dropout and controlled encoder adaptation. Across our submitted systems, the best validation performance is obtained by the text-audio-vision-motion fusion model under the expanded 4:1 split, achieving an average Pearson correlation of 0.4722. Although the motion branch yields only very slight gains, its behavior may be worth further study. Our team placed third in the EMI challenge, achieving an average Pearson correlation of 0.57 on the test set. Overall, we provide a practical and reproducible baseline for EMI prediction.
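For readers unfamiliar with the metric quoted in this abstract, here is a minimal sketch of an average Pearson correlation over the six emotion dimensions. It assumes the score is the plain mean of per-dimension Pearson correlations between predicted and reference intensities; the challenge's official evaluation script is not described in the abstract, and the data below is synthetic.

```python
import numpy as np

EMOTIONS = ["Admiration", "Amusement", "Determination",
            "Empathic Pain", "Excitement", "Joy"]

def mean_pearson(y_true, y_pred):
    """Average Pearson correlation over the columns (emotion dimensions)
    of two arrays shaped (num_clips, num_dimensions)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    per_dim = [
        float(np.corrcoef(y_true[:, j], y_pred[:, j])[0, 1])
        for j in range(y_true.shape[1])
    ]
    return float(np.mean(per_dim)), per_dim

# Synthetic example: predictions that are an exact linear function of the
# reference intensities correlate perfectly in every dimension.
rng = np.random.default_rng(0)
y_true = rng.random((50, len(EMOTIONS)))
y_pred = 2.0 * y_true + 1.0
avg, per_dim = mean_pearson(y_true, y_pred)
```

Pearson correlation is invariant to scale and offset, so this metric rewards tracking the relative intensity across clips rather than matching absolute values.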
137. 【2605.21861】Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models
Link: https://arxiv.org/abs/2605.21861
Authors: Yuting He, Chenyu You, Shuo Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: pronounced Non-IID feature, Non-IID feature statistics, foundation models, fundamentally challenged, challenged by pronounced
Comments: Accepted by KDD 2026
Abstract:Multi-modality medical vision (MV) foundation models (FM) are fundamentally challenged by pronounced Non-IID feature statistics across heterogeneous imaging modalities. Monolithic self-supervised optimization on such data induces conflicting gradients, driving representations to collapse toward modality-dominant shortcuts. This work reframes this failure as an imbalance between specialization and coordination in emergent modularity, and proposes Director-Experts (DEX), a modular network that explicitly regulates these dynamics in stacked modules. Each DEX module comprises a pool of experts, dynamically adapted by our image-wise activation strategy, autonomously specializing in modality-dominant statistics, together with a director, updated via our group exponential moving average, which distills multi-expert knowledge into a shared space for semantic integration across modalities, thus driving the emergence of modular representations. We curate a new benchmark, Medical Vision Universe, comprising over 4 million images across 10 modalities, which gives DEX FM-level pre-training data with the broadest coverage of distinct imaging modalities. Extensive evaluations on 26 downstream tasks demonstrate improved optimization behavior and transferability, indicating DEX as a principled step toward general-purpose multi-modality medical AI. Our code and dataset will be released at this https URL.
138. 【2605.21854】CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models
Link: https://arxiv.org/abs/2605.21854
Authors: Zhi Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Direct Preference Optimisation, discrete-token autoregression, architectural patterns, rapidly converged, small set
Comments: Workshop draft, 14 pages, 4 figures. Code, ckpts, data: [this https URL](https://github.com/lz-googlefycy/vla-lab)
Abstract:Vision-Language-Action (VLA) models have rapidly converged on a small set of architectural patterns: discrete-token autoregression (e.g. OpenVLA) and continuous-action flow-matching (e.g. pi-0.5). Yet preference alignment via Direct Preference Optimisation (DPO) -- the de-facto post-training step in language models -- has been studied almost exclusively on autoregressive VLAs. We present CrossVLA, an empirical study of cross-paradigm VLA post-training. Three contributions: (i) a surrogate flow-matching log-probability estimator that lets DPO operate on continuous-action backbones without probability-flow ODE integration; (ii) a head-to-head comparison of LoRA and DoRA as the parameter-efficient layer for VLA DPO, finding DoRA improves over OpenVLA SFT by a mean +10.4 pp across LIBERO 4-suite (600 trials, 3 seeds) -- per-suite +20.0 Object, +11.0 Long-horizon, +8.0 Goal, +2.7 Spatial -- with zero seed variance on Object (38/50 on each of 3 seeds); (iii) an inference-time anatomy showing the denoise loop dominates 78.6% of sample_actions latency and prefix-K/V caching a la VLA-Cache caps at a 21% acceleration ceiling -- both chunk-level and token-level cache strategies degrade success rate to 0-80% in our benchmarks. We further pretrain a multi-view + temporal projection head on 6000 LIBERO frames, achieving 99.5% k-NN recall@1 for same-task retrieval (36x over random), available as a downstream initialisation. All code, ckpts, training logs, and reproduction scripts are open at this https URL.
139. 【2605.21852】Seizure-Semiology-Suite (S3): A Clinically Multimodal Dataset, Benchmark, and Models for Seizure Semiology Understanding
Link: https://arxiv.org/abs/2605.21852
Authors: Lina Zhang, Tonmoy Monsoor, Peizheng Li, Jiarui Cui, Xinyi Peng, Chong Han, Prateik Sinha, Siyuan Dai, Jessica Nichole Pasqua, Colin M McCrimmon, Weiting Liu, Hailey Marie Miranda, Bing Hu, Xiangting Wu, Tengyou Xu, Chunhan Li, Jiaye Tian, Jiarui Tang, Detao Ma, Lingye Kong, Junnan Lyu, Jungang Li, Yan Zan, Junhua Huang, Rajarshi Mazumder, Vwani Roychowdhury
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, Multimodal Large Language, Large Language, remains largely untested, demonstrated remarkable proficiency
Comments: Accepted to ICML 2026 as a Spotlight presentation
Abstract:While Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in general video understanding, their capacity to interpret involuntary, spatio-temporally evolving pathologic motor behaviors such as seizure semiology remains largely untested. To address this gap, we introduce Seizure-Semiology-Suite, a clinically grounded dataset and benchmark for fine-grained, structured seizure semiology understanding. The dataset includes 438 seizure videos annotated with over 35,000 dense labels covering 20 ILAE-defined semiological features. Building on this dataset, we propose a seven-task hierarchical benchmark that systematically evaluates MLLMs from low-level visual perception to temporal sequencing, narrative report generation, and seizure diagnosis. To enable clinically meaningful evaluation of generated reports, we further introduce the Report Quality Index for Seizure Semiology (Seizure-RQI). Extensive baselines across 11 open-weight MLLMs reveal systematic weaknesses in laterality reasoning, temporal localization, symptom sequencing, and clinically faithful reporting. We show that seizure-specific fine-tuning substantially improves performance across tasks, and that a two-stage neuro-symbolic framework achieves an F1 score of 0.96 on epileptic versus non-epileptic seizure classification. Seizure-Semiology-Suite establishes a rigorous benchmark for evaluating multimodal models in safety-critical medical video understanding and guides the development of clinically reliable, domain-adaptive multimodal intelligence.
140. 【2605.21796】MM-Conv: A Multimodal Dataset and Benchmark for Context-Aware Grounding in 3D Dialogue
Link: https://arxiv.org/abs/2605.21796
Authors: Anna Deichler, Jim O'Regan, Fethiye Irmak Dogan, Lubos Marcinek, Anna Klezovich, Iolanda Leite, Jonas Beskow
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: physical world requires, dynamically during conversation, physical world, world requires, requires AI systems
Comments: Extended version of the paper published at LREC 2026 (Palma de Mallorca, Spain), with expanded VLM baselines and inter-annotator agreement analysis
Abstract:Grounding language in the physical world requires AI systems to interpret references that emerge dynamically during conversation. While current vision-language models (VLMs) excel at static image tasks, they struggle to resolve ambiguous expressions in spontaneous, multi-turn dialogue. We address this gap by introducing (1) a benchmark for referential communication in dynamic 3D environments, built from 6.7 hours of egocentric VR interaction with synchronized speech, motion, gaze, and 3D scene geometry, and (2) a two-stage grounding pipeline that explicitly resolves conversational ambiguity before visual localization. The benchmark includes over 4,200 manually verified referring expressions spanning full, partitive, and pronominal types. Our contextual rewriting approach improves grounding performance by 11-22 percentage points on average, with a pure detector (GroundingDINO) reaching 56.7% on pronominals after rewriting, nearly double the best end-to-end baseline. Results demonstrate that decoupling linguistic reasoning from visual perception is more effective than end-to-end approaches for conversational grounding.
141. 【2605.21788】SceneGraphGrounder: Zero-Shot 3D Visual Grounding via Structured Scene Graph Matching
Link: https://arxiv.org/abs/2605.21788
Authors: Xuefei Sun, Xujia Zhang, Brendan Crowe, Doncey Albin, Christoffer Heckman
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: free-form natural language, requires localizing objects, grounding requires localizing, natural language, requires localizing
Comments:
Abstract:Zero-shot 3D visual grounding requires localizing objects in unstructured environments from free-form natural language. Recent vision-language model (VLM) approaches achieve promising results but rely on view-dependent reasoning or implicit representations, limiting spatial consistency and interpretability for compositional queries. We propose SceneGraphGrounder, a framework that reformulates 3D grounding as structured graph matching over a reconstructed 3D scene graph. To enable this formulation, we introduce a visual marker prompting strategy that enables a VLM to infer object-object relationships from 2D views, which are subsequently lifted into a persistent 3D scene graph encoding both spatial and semantic relations. Given a query, we construct a query graph and perform constrained alignment with the scene graph, ensuring multi-view consistency and interpretable reasoning. Experiments on the ScanRefer benchmark demonstrate that our method achieves competitive performance among zero-shot approaches, using only RGB-D inputs. We further validate our framework through real-world deployment on a mobile robot, demonstrating robust spatial reasoning in long-horizon physical environments. We will make our code publicly available upon acceptance.
142. 【2605.21766】BodyReLux: Temporally Consistent Full-Body Video Relighting
Link: https://arxiv.org/abs/2605.21766
Authors: Li Ma, Mingming He, Xueming Yu, David M. George, Ahmet Levent Taşel, Paul Debevec, Julien Philip
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: content creation, relight human performance, fundamental task, task for post, post production
Comments: Siggraph 2026 Journal Track. Project page: [this https URL](https://eyeline-labs.github.io/bodyrelux/)
Abstract:Being able to relight human performance is a fundamental task for post-production and content creation. We present BodyReLux, a subject-specific video diffusion-based framework for relighting full-body human performances in a temporally consistent way. Our model is trained on a hybrid dataset of pixel-aligned video relighting pairs, covering a diverse combination of lighting conditions, performances and viewpoints. To acquire such a dataset, we combine traditional static One-Light-at-a-Time (OLAT) capture and a novel dynamic performance capture in which two smoothly varying lighting sequences are rapidly interleaved. Because the lighting operates above the human flicker-fusion threshold, the interleaving does not appear to strobe. We train our video relighting model from a pretrained text-to-video model to fully leverage the generative priors for producing high quality videos. To achieve accurate lighting control, we introduce a new lighting conditioning method that represents each light source as a token. We further condition on sequences of lighting using masked attention to support dynamic lighting control. Together with a carefully designed data augmentation pipeline, we achieve photorealistic, robust, and temporally consistent video relighting of subject-specific human performances.
143. 【2605.21747】Improving 3D Labeling in Self-Driving by Inferring Vehicle Information using Vision Language Models
Link: https://arxiv.org/abs/2605.21747
Authors: Steven Chen, Shivesh Khaitan, Nemanja Djuric
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: Vision Language Model, leveraging Vehicle Make, self-driving applications, applications through zero-shot, Vision Language
Comments: To appear in Proceedings of the IEEE Intelligent Vehicles Symposium (IV), 2026. Accepted for oral presentation
Abstract:We present an approach to improve 3D vehicle labeling in self-driving applications through zero-shot inference of vehicle information, leveraging Vehicle Make and Model Recognition (VMMR) methods. The proposed approach utilizes a Vision Language Model (VLM) to both infer a vehicle's make, model, and generation from image crops, and output accurate 3D bounding box dimensions to seed manual labeling. We evaluate the impact of iterative prompt engineering and the choice of different VLMs on both vehicle bounding box inference and make/model/generation recognition. When compared to strong baselines, the proposed approach not only shows high accuracy, but also excels in mitigating specific failure modes where VLMs provide better dimensions than initial lidar-aided human annotated labels (e.g., in cases of significant vehicle occlusion). Experiments on both public and proprietary data strongly suggest that our conclusions are generalizable across different labelers and datasets. The results demonstrate that integrating VLMs into the labeling process can reduce manual labeling time while increasing label quality.
144. 【2605.21728】BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model
Link: https://arxiv.org/abs/2605.21728
Authors: Gonçalo Gomes, Bruno Martins, Chrysoula Zerva
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Image captioning evaluation, vision-language models evolve, Large Language Models, captioning evaluation remains, Image captioning
Comments:
Abstract:Image captioning evaluation remains a significant challenge, as vision-language models evolve toward more challenging capabilities such as generating long-form and context-rich descriptions. State-of-the-art evaluation metrics involve extensive computational costs associated with the use of Large Language Models (LLMs) as judges, or instead suffer from the limitations of standard CLIP-based encoders, such as strict token limits, lack of fine-grained sensitivity, or lack of compositional generalization by treating captions as "bags-of-words." We propose a new learned metric that tackles the aforementioned challenges, based on a lightweight cross-encoder that is initialized from a visual question-answering model checkpoint, balancing a strong weight initialization with computational efficiency. Our training scheme uses a carefully assembled data mixture for supervised learning, featuring adversarial LLM-based data augmentations to enhance model sensitivity to fine-grained visual-linguistic errors. We also introduce a new benchmark designed to assess detailed captioning evaluation across diverse scenarios. Experimental results demonstrate that the proposed metric achieves state-of-the-art performance while maintaining the efficiency required for large-scale benchmarking, quality-aware decoding, or reward guidance.
145. 【2605.21714】AVI-HT: Adaptive Vision-IMU Fusion for 3D Hand Tracking
Link: https://arxiv.org/abs/2605.21714
Authors: Ziyi Kou, Ankit Kumar, Mia Huang, Taylor Niehues, Vatsal Mehta, Ergys Ristani, Li Guan
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: adaptive visual-IMU fusion, visual-IMU fusion approach, image with on-glove, adaptive visual-IMU, jointly modeling
Comments:
Abstract:We present AVI-HT, an adaptive visual-IMU fusion approach for tracking 3D hand poses by jointly modeling the egocentric image with on-glove 6-DoF IMU signals. AVI-HT achieves significantly improved accuracy and availability, particularly in hand-object interaction (HOI) scenarios involving heavy visual occlusion. Two complementary ingredients underpin its success: (1) synchronized multi-modal training data pairing on-body vision-IMU sensor streams with ground-truth 3D hand poses from a motion-capture system, and (2) a cross-sensor deep attention mechanism that adaptively modulates the trust assigned to the vision and individual IMU sensors. To evaluate AVI-HT in real-world settings, we conduct extensive experiments on our DexGloveHOI dataset that consists of 100K+ pairwise vision-IMU samples with synchronized 3D annotated poses, in which users manipulate a variety of objects during daily tasks. We compare against multiple single- and multi-modal tracking approaches under two hand models (UmeTrack, MANO). The results show that AVI-HT reduces mean keypoint error by 16.1% and its wrist-aligned variant by 24.2% over the baselines. Ablation studies further reveal the per-finger contribution of IMU sensors across activity types, and the model's sensitivity to IMU noise and temporal misalignment in vision-IMU fusion.
146. 【2605.21669】MRecover: A Conditional Generative Model for Recovering Motion-Corrupted MR images Using AI Generated Contrast
Link: https://arxiv.org/abs/2605.21669
Authors: Jinghang Li, Tales Santini, Courtney Clark, Bruno de Almeida, Cong Chu, Salem Alkhateeb, Andrea Sajewski, Jacob Berardinelli, Hecheng Jin, Tobias Campos, Jeremy J. Berardo, Joseph Mettenburg, Ariel Gildengers, Howard J. Aizenstein, Minjie Wu, Tamer S. Ibrahim
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: turbo spin echo, segmentation requires high-resolution, substantial data loss, subfield segmentation requires, requires high-resolution
Comments:
Abstract:Hippocampal subfield segmentation requires high-resolution T2w turbo spin echo (TSE) MRI, yet this sequence is susceptible to motion artifacts, leading to substantial data loss. We developed a conditional generative model (MRecover) that synthesizes TSE images from routinely acquired T1w images, using autoregressive slice conditioning for volumetric consistency. Trained on 7T MRI data (n=577), the model achieved high in-domain fidelity (n=148, SSIM=0.84, FSIM=0.94) and generalized well to out-of-domain 3T data: subfield volumes from the synthesized and the as-acquired images closely matched (n=416, r=0.87-0.97), and synthesis yielded 31.8% more analyzable subjects in the motion-affected ADNI3 dataset after quality control (593 vs 450). Owing to the larger sample size, the synthesized images also achieved larger effect sizes for diagnostic group differences in hippocampal subfield atrophy (whole hippocampus $\epsilon^2$ = 0.121-0.100 vs. 0.086-0.062, left-right hemispheres). Project page: this https URL
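The 31.8% figure follows directly from the two subject counts quoted in the abstract. A short check, where the helper name `percent_increase` is ours:

```python
def percent_increase(new, old):
    """Relative increase of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

# Analyzable ADNI3 subjects after quality control, as reported in the abstract.
with_synthesis, as_acquired = 593, 450
gain = percent_increase(with_synthesis, as_acquired)
print(f"{gain:.1f}% more analyzable subjects")  # prints "31.8% more analyzable subjects"
```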
147. 【2605.21661】Hierarchical Variational Policies for Reward-Guided Diffusion
Link: https://arxiv.org/abs/2605.21661
Authors: Kushagra Pandey, Farrin Marouf Sofian, Jan Niklas Groeneveld, Felix Draxler, Stephan Mandt
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Adapting pretrained diffusion, Adapting pretrained, requires expensive test-time, expensive test-time guidance, downstream objectives
Comments:
Abstract:Adapting pretrained diffusion models to downstream objectives such as inverse problems often requires expensive test-time guidance or optimization. We propose a principled framework for generating high-quality reward-aligned samples at substantially reduced inference cost. Our approach formulates test-time adaptation as a hierarchical variational model, where control is amortized into a lightweight yet expressive stochastic policy. This formulation naturally supports few-step diffusion sampling: large step sizes enable fast inference, while the learned policy maintains sample quality by providing structured per-step control. The resulting fully amortized sampler achieves a strong quality-speed tradeoff, matching or exceeding recent test-time scaling baselines while requiring significantly less compute. For example, on 4x super-resolution, our method achieves better perceptual quality with more than 5x faster inference compared to the best-performing baseline. We further extend our approach to a semi-amortized regime that combines cheap amortized proposals with limited test-time optimization, achieving state-of-the-art perceptual quality across several challenging inverse problems.
148. 【2605.21652】Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming
Link: https://arxiv.org/abs/2605.21652
Authors: Yue Zhou, Erxuan Wu, Yikang Sun, Hongjoo Lee, Yuan Bi, Huixiong Xu, Zhongliang Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: visual question answering, ultrasound remains suboptimal, significantly advanced medical, advanced medical visual, medical visual question
Comments:
点击查看摘要
Abstract:Vision-Language Models (VLMs) have significantly advanced medical visual question answering, yet their performance in ultrasound remains suboptimal. In clinical practice, sonographers explicitly focus on lesion regions to formulate reports, though diagnostic interpretations sometimes vary due to inherent subjectivity. However, existing VLMs are not explicitly structured to interactively zoom into lesions prior to diagnosis; moreover, they typically treat annotations as unbiased ground truths, failing to account for their inherent subjectivity and ambiguity. In this paper, we propose a framework specifically designed to consider the sonographer's cognitive workflow. We first introduce a structured Zoom-then-Diagnose paradigm, which replicates the interactive search process to enable lesion-focused reasoning. Furthermore, within the Group Relative Policy Optimization (GRPO) framework, we introduce an uncertainty-aware reward derived from stochastic group-wise rollouts to estimate prediction consistency as a proxy for model confidence. Together, these two components encourage the model to reinforce accurate predictions on clear cases while remaining cautious under ambiguity. Experiments across liver, breast, and thyroid datasets show that our framework improves lesion localization by 39.3%, demonstrating that our model has learned the ability to actively look closer and diagnose.
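The abstract above estimates confidence from the consistency of a group of rollouts within GRPO. As a rough illustration only, the sketch below computes a majority-agreement score for a group of sampled answers and the standard group-normalised GRPO advantage. The function names and the agreement measure are assumptions made for this example; the paper's actual reward is not specified in the abstract.

```python
from collections import Counter
from statistics import fmean, pstdev


def group_consistency(answers):
    """Return (majority_answer, agreement) for one group of sampled answers.

    `agreement` is the fraction of rollouts that match the majority answer;
    it is used here as a simple, assumed proxy for model confidence.
    """
    if not answers:
        raise ValueError("answers must be non-empty")
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / len(answers)


def group_relative_advantages(rewards):
    """Group-normalised advantages: (r_i - mean) / std within one group.

    If every rollout received the same reward, the group carries no
    learning signal and all advantages are zero.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# Four hypothetical rollouts for one ultrasound question.
answers = ["benign", "benign", "malignant", "benign"]
majority, agreement = group_consistency(answers)
advantages = group_relative_advantages([1.0, 1.0, 0.0, 1.0])
```

A low agreement score flags an ambiguous case, where a reward design could reasonably temper how strongly any single answer is reinforced.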
149. 【2605.21642】Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?
Link: https://arxiv.org/abs/2605.21642
Authors: Tianyi Zhang, Mahtab Bigverdi, Ranjay Krishna
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: non-textual tokens intended, Vision-language models, visual thinking, intended to support, Token Replacement Test
Comments:
Abstract:Vision-language models (VLMs) are increasingly augmented with continuous or latent non-textual tokens intended to support "visual thinking." Despite improved task accuracy, this alone does not show that models actually use these tokens for reasoning -- gains may arise from confounds such as added context length, special-token anchoring, or training-time regularization. We formalize a diagnostic principle, Ablate-to-Validate, for testing whether latent-token content is genuinely utilized, and instantiate it as the Token Replacement Test (TRT), a standardized suite of content-replacement ablations. TRT holds the prompt, image, token budget, and decoding fixed while replacing intermediate tokens with zero, random, first-repeat, or oracle alternatives, isolating whether performance depends on token content or merely on token presence. As a controlled testbed, we study relative depth reasoning with LLaVA-13B and Qwen2.5-VL-3B, training models to predict and consume continuous or discrete depth spans across multiple frozen encoders (SigLIP2, CLIP, DINOv2) and token budgets. We additionally apply TRT to three off-the-shelf visual-thinking systems (Mirage, Mull-Tokens, CoVT) on BLINK, VSP, and CV-Bench. Across all settings, accuracy gains are a misleading proxy for latent-token reasoning: VLMs retain most improvement even when token content is corrupted or replaced, revealing a persistent gap between having a latent channel and using it as an information bottleneck. We recommend TRT as a standard diagnostic alongside accuracy for any method introducing continuous thought tokens.
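The content-replacement ablations named in the abstract (zero, random, first-repeat, oracle) can be sketched as follows. This is a minimal illustration that assumes latent tokens are plain lists of floats; `replace_latent_tokens` and `content_dependence` are hypothetical names and not part of the authors' released code.

```python
import random


def replace_latent_tokens(tokens, mode, oracle=None, seed=0):
    """Return a copy of `tokens` with their content replaced.

    The number of tokens and their dimensionality are preserved, so the
    token budget stays fixed and only the content changes.
    """
    rng = random.Random(seed)
    if mode == "zero":
        return [[0.0] * len(t) for t in tokens]
    if mode == "random":
        return [[rng.gauss(0.0, 1.0) for _ in t] for t in tokens]
    if mode == "first_repeat":
        return [list(tokens[0]) for _ in tokens]
    if mode == "oracle":
        if oracle is None or len(oracle) != len(tokens):
            raise ValueError("oracle tokens must match the token budget")
        return [list(t) for t in oracle]
    raise ValueError(f"unknown mode: {mode}")


def content_dependence(acc_original, acc_replaced, acc_baseline):
    """Fraction of the accuracy gain that is lost once content is replaced.

    A hypothetical summary statistic: values near 0 suggest the gain does
    not depend on token content, values near 1 suggest that it does.
    """
    gain = acc_original - acc_baseline
    if gain <= 0:
        return 0.0
    return (acc_original - acc_replaced) / gain


tokens = [[1.0, 2.0], [3.0, 4.0]]
zeroed = replace_latent_tokens(tokens, "zero")
repeated = replace_latent_tokens(tokens, "first_repeat")
```

Holding the prompt, image, and decoding fixed while swapping only `tokens` is what separates dependence on content from dependence on mere token presence.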
150. 【2605.21625】Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
Link: https://arxiv.org/abs/2605.21625
Authors: Aditya Chetan, Eric Cai, Peeyush Kushwaha, Bharath Raj Nagoor Kani, Utkarsh Mall, Qianqian Wang, Noah Snavely, Bharath Hariharan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Vision-Language Models, Vision-Language Models, emergence of Large, Large Vision-Language, Models
Comments: CVPR 2026
Abstract:The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. Furthermore, these benchmarks often rely on entities that can be easily identified verbally, such as household objects, animals, and human subjects, limiting their applicability to complex, in-the-wild video scenarios. Yet many applications, such as furniture assembly and cooking, require step-by-step fine-grained spatio-temporal understanding of the video, which is not sufficiently evaluated in current benchmarks. To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly state, understanding part mating, and tracking, using multiple-choice questions paired with visual prompts highlighting relevant parts as references for fine-grained questions. Our experiments reveal that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, highlighting their limited ability to leverage temporal information from videos, track parts, and understand spatial interactions such as physical contact.
151. 【2605.21611】UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation
Link: https://arxiv.org/abs/2605.21611
Authors: Jiayun Wang, Yu Wang, Weijie Gan, Zhenting Wang, Wei Wei
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: controllable image generation, contextual image generation, image generation, image, standalone text encoder
Comments:
Abstract:We introduce spatially grounded contextual image generation, a controllable image generation task that reframes the conditioning paradigm. Instead of supplying a reference image and a global text prompt through two separate encoders, one for vision and one for language, UniVL is trained to bind semantics to spatial locations directly from a single unified visual input, where the textual instruction is rendered onto the spatial mask. This removes the need for a standalone text encoder at inference time. The resulting model supports contextual image generation by following user-specified instructions about what should appear where, while substantially reducing computation. To address this task, we propose a framework in which the UniVL encoder, adapted from an optical-character-recognition-pretrained backbone, reads the unified condition optically and produces a UniVL embedding, fVIL, that fuses visual and semantic intent with spatial locations in a single token sequence. A two-stage pipeline first aligns UniVL with the VAE embedding space and then conditions a pretrained diffusion backbone entirely on UniVL embeddings, eliminating the standalone text encoder, such as T5. Although this reframing uses a deliberately minimal text interface, it yields strong empirical gains. On UniVL-ImgGen, a benchmark of 477K mask-annotated images that we construct for training and evaluation, UniVL improves image quality over text-prompted baselines, reducing FID from 14 to 11 and increasing PSNR from 16 to 20. It also eliminates the text encoder entirely, reducing inference TFLOPs by up to 52% and runtime by up to 44%. Additional ablation studies validate the contributions of the proposed components, paving the way for efficient, spatially grounded image generation with a unified conditioning paradigm.
152. 【2605.21605】GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation
Link: https://arxiv.org/abs/2605.21605
Authors: Sixiang Chen, Zhaohu Xing, Tian Ye, Xinyu Geng, Yunlong Lin, Jianyu Lai, Xuanhua He, Fuxiang Zhai, Jialin Gao, Lei Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Open-ended image generation, Open-ended image, Visual Experience Distillation, longer a simple, Visual Experience
Comments:
Abstract:Open-ended image generation is no longer a simple prompt-to-image problem. High-quality generation often requires an agent to combine a model's internal generative ability with external resources. As requests become more diverse and demanding, we aim to develop a general image-generation agent that can self-evolve through trajectories and use tools more effectively across varied generation challenges. To this end, we propose GenEvolve, a self-evolving framework based on Tool-Orchestrated Visual Experience Distillation. In GenEvolve, each generation attempt is modeled as a tool-orchestrated trajectory, where the agent gathers evidence, selects references, invokes generation skills, and composes them into a prompt-reference program. Unlike existing agentic generation methods that mainly rely on image-level scalar rewards, GenEvolve compares multiple trajectories for the same request and abstracts best-worst differences into structured visual experience, provided only to a privileged teacher branch. Inspired by on-policy self-distillation, Visual Experience Distillation provides dense token-level supervision, helping the student internalize better search, knowledge activation, reference selection, and prompt construction. We further construct GenEvolve-Data and GenEvolve-Bench. Experiments on public benchmarks and GenEvolve-Bench show substantial gains over strong baselines, achieving state-of-the-art performance among current image-generation frameworks. Our website is as follows: this https URL
153. 【2605.21573】Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models
Link: https://arxiv.org/abs/2605.21573
Authors: Dong Chen, Fangyun Wei, Ziyu Wan, Dongdong Chen, Jiawei Zhang, Jinjing Zhao, Sirui Zhang, Yang Yue, Zhiyang Liang, Baining Guo, Chong Luo, Jianmin Bao, Ji Li, Lei Shi, Qinhong Yang, Xiuyu Wu, Xuelu Feng, Yan Lu, Yanchen Dong, Yitong Wang, Yunuo Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieves performance competitive, cases surpassing, achieves performance, performance competitive, requiring significantly
Comments: Project Page: https://github.com/microsoft/Lens
Abstract:We introduce Lens, a 3.8B-parameter T2I model that achieves performance competitive with, and in several cases surpassing, state-of-the-art models with more than 6B parameters across various benchmarks, while requiring significantly less training compute. For example, Lens requires only about 19.3% of the training compute used by Z-Image. The training efficiency of Lens stems from two key strategies beyond its compact model size. First, we maximize data information density per training batch by (i) training on Lens-800M, a dataset of 800M densely captioned image-text pairs whose captions are generated by GPT-4.1 and contain approximately 109 words on average, providing richer semantic supervision than conventional short captions, and (ii) constructing each batch from images with multiple resolutions and diverse aspect ratios, thereby enlarging the effective visual coverage of each optimization step. Second, we improve convergence speed through careful architectural choices, including adopting a semantic VAE that provides better latent representations and employing a strong language encoder that accelerates optimization while enabling multilingual generalization from English-only training data. After pre-training, we apply RL with taxonomy-driven prompts (Lens-RL-8K) and structured reward rubrics to suppress artifacts and improve visual quality, a reasoner module with training-free system prompt search to better align user requests with the model, and distillation-based acceleration for 4-step inference. Through efficient training and systematic optimization, Lens generalizes to arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440^2, and supports prompts in several commonly used languages. Thanks to its compact size, Lens generates a 1024^2 image in 3.15 seconds on a single NVIDIA H100 GPU, while its distilled turbo version performs 4-step generation in 0.84 seconds.
154. 【2605.21572】PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects
Link: https://arxiv.org/abs/2605.21572
Authors: Ziang Cao, Yinghao Liu, Haitian Li, Runmao Yao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, Ziwei Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: promising direction owing, promising direction, direction owing, broad applicability, Simulation-ready physical
Comments: Project page: https://physx-omni.github.io/
Abstract:Simulation-ready physical 3D assets have emerged as a promising direction owing to their broad applicability in downstream tasks. However, most existing 3D generation methods either neglect physical properties or are limited to a single asset category, e.g., rigid, deformable, or articulated objects. To address these limitations, we introduce PhysX-Omni, a unified framework for simulation-ready physical 3D generation across diverse asset types. Specifically, we develop a novel and efficient geometry representation tailored for Vision-Language Models, which directly encodes high-resolution 3D structures without compression, significantly improving generation performance. In addition, we construct the first general simulation-ready 3D dataset, PhysXVerse, covering diverse indoor and outdoor categories. Furthermore, to comprehensively and flexibly evaluate both generative and understanding capabilities in the wild, we propose PhysX-Bench, which encompasses six key attributes: geometry, absolute scale, material, affordance, kinematics, and function description. Extensive experiments with conventional metrics and PhysX-Bench show that PhysX-Omni performs strongly in both generation and understanding. Moreover, additional studies further validate the potential of PhysX-Omni for applications in simulation-ready scene generation and robotic policy learning. We believe PhysX-Omni can significantly advance a wide range of downstream applications, particularly in embodied AI and physics-based simulation.
155. 【2605.21493】Don't Collapse Your Features: Why CenterLoss Hurts OOD Detection and Multi-Scale Mahalanobis Wins
Link: https://arxiv.org/abs/2605.21493
Authors: Rahul D Ray
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: machine learning systems, average OOD AUROC, ability to detect, inputs is fundamental, fundamental to safe
Comments:
Abstract:The ability to detect out-of-distribution (OOD) inputs is fundamental to safe deployment of machine learning systems. Yet, current methods often rely on feature representations that are optimised solely for classification accuracy, neglecting the distinct requirements of epistemic uncertainty. We introduce GOEN (Geometry-Optimised Epistemic Network), a simple pipeline that combines multi-scale features, L2 normalisation, Mahalanobis distance, and a calibration head trained with real hard OOD examples. Through systematic ablation we uncover a counter-intuitive finding: CenterLoss, a popular regulariser for feature compactness, significantly degrades OOD detection performance, reducing average OOD AUROC from 0.9483 to 0.9366 despite improving classification accuracy. The best variant, GOEN-NoCenterLoss, achieves an average OOD AUROC of 0.9483, surpassing all baselines including deep ensembles (0.8827), KNN (0.8967), and ODIN (0.8870) on CIFAR-10 benchmarks, while maintaining competitive in-distribution accuracy. Our results challenge the prevailing assumption that better classification geometry automatically leads to better epistemic uncertainty. Instead, we show that overly tight feature clusters compress inter-class margins and distort the covariance structure needed for effective OOD detection. GOEN is efficient, training in under 20 minutes on a single GPU, and provides a practical blueprint for building AI systems that reliably recognise their own limitations.
156. 【2605.22425】Time-varying rPPG signal separation via block-sparse signal model
Link: https://arxiv.org/abs/2605.22425
Authors: Kosuke Kurihara, Yoshihiro Maeda, Daisuke Sugimura, Takayuki Hamamoto
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: analyzing subtle color, Remote photoplethysmography, enables non-contact measurement, facial videos, non-contact measurement
Comments: Accepted by IEEE International Conference on Image Processing (ICIP 2026)
Abstract:Remote photoplethysmography (rPPG) enables non-contact measurement of cardiac pulse signals by analyzing subtle color changes in facial videos. Nevertheless, extracting rPPG signals remains challenging because of their extremely weak signal strength and susceptibility to illumination noise. In this paper, we propose an rPPG signal extraction method that exploits the quasi-periodic characteristics of rPPG signals. Our approach models quasi-periodicity of the rPPG signal, which arises from the stable cardiac cycle, as a block-sparse structure in the time-frequency domain. To incorporate a block-sparse model and enable adaptive signal separation under illumination fluctuations, we construct a time-varying signal separation framework. Experiments using a public dataset demonstrate the effectiveness of our method.
157. 【2605.21970】Entropy-Guided Self-Supervised Learning for Medical Image Classification
Link: https://arxiv.org/abs/2605.21970
Authors: Joao Florindo, Viviane Moura
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: early disease diagnosis, Accurate and robust, medical image classification, treatment planning, robust medical image
Comments:
Abstract:Accurate and robust medical image classification is paramount for early disease diagnosis and treatment planning. However, challenges such as limited annotated data, high intra-class variability, and subtle inter-class differences often hinder the performance of deep learning models. This paper introduces a synergistic deep learning framework that leverages the strengths of self-supervised learning and transfer learning for enhanced medical image classification. Our approach employs two distinct ConvNeXt-Tiny models: one pre-trained on a large-scale natural image dataset (ImageNet) and another pre-trained using an entropy-guided Masked Autoencoder (MAE) on the target medical dataset. Both models are then fine-tuned on specific medical image classification tasks. A final ensemble strategy, based on averaging predicted probabilities, is utilized to combine the complementary insights from these two models. Rigorous experimental validation across four diverse medical imaging datasets (Breast Ultrasound Images (BUSI), International Skin Imaging Collaboration (ISIC) 2018, Kvasir, and COVID) demonstrates the superior performance and robustness of our ensemble approach. The MAE pre-training significantly improves feature learning on domain-specific data, while the ImageNet pre-training provides strong generalizable features. The ensemble consistently achieves state-of-the-art results, outperforming individual models and existing methods, highlighting the efficacy of combining diverse pre-training strategies for challenging medical image analysis.
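The ensemble step described above, averaging the predicted probabilities of the two fine-tuned models, amounts to the following. This is a minimal sketch with made-up probability vectors; the function names are illustrative and the real models would supply softmax outputs per image.

```python
def average_probabilities(probs_a, probs_b):
    """Element-wise mean of two class-probability vectors of equal length."""
    if len(probs_a) != len(probs_b):
        raise ValueError("both models must predict over the same classes")
    return [(a + b) / 2.0 for a, b in zip(probs_a, probs_b)]


def predicted_class(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical outputs for one image from the two pre-training routes.
imagenet_model_probs = [0.6, 0.3, 0.1]
mae_model_probs = [0.2, 0.7, 0.1]

ensemble_probs = average_probabilities(imagenet_model_probs, mae_model_probs)
label = predicted_class(ensemble_probs)
```

In this toy case the ImageNet-pretrained model alone would pick class 0, while the averaged prediction picks class 1, which shows how a confident second model can overturn a weaker first guess.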
158. 【2605.21835】An Open Multi-Center Whole-Body FDG PET/CT Foundation Model for Tumor Segmentation
Link: https://arxiv.org/abs/2605.21835
Authors: Xiaofeng Liu, Qianru Zhang, Thibault Marin, Menghua Xia, Chi Liu, Georges El Fakhri, Jinsong Ouyang
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Keywords: positron emission tomography, computed tomography, emission tomography, information from computed, information from positron
Comments: Code available at: https://github.com/liu-xiaofeng/Foundation-Model-for-PET-CT
Abstract:The synergistic interpretation of anatomical information from computed tomography (CT) and metabolic information from positron emission tomography (PET) is important to oncologic imaging. However, existing deep learning methods for PET/CT remain largely task-specific, are often trained on single-center cohorts, or adopt dual-branch fusion schemes that delay cross-modal interaction and underutilize early spatial correspondence between PET and CT. To address these limitations, we present an open-source, multi-center, whole-body FDG PET/CT foundation model utilizing 4,997 harmonized scans from four public datasets. Our framework employs hierarchical UNet-shaped backbones with early channel-wise concatenation, enabling anatomical and metabolic features to interact from the first embedding layer onward. We further introduce a masked autoencoding objective based on zero-mean imputation, combined with a weighted global reconstruction loss. This design avoids non-physical intensity discontinuities at masked-region boundaries that arise from learnable mask tokens. On downstream AutoPET lesion segmentation, the proposed models demonstrate strong label efficiency: with only 10% of the labeled training data, they achieve performance comparable to models trained from scratch on the full dataset. Under extreme 5-shot linear probing, joint PET/CT pretraining also achieves higher Dice scores than separated-modality pretraining. This multi-center foundation model demonstrates label efficiency and cross-modality representation learning for PET/CT tumor segmentation. It provides a robust, open-source basis for advancing automated oncologic imaging, significantly reducing the need for large-scale manual annotations in clinical practice.
159. 【2605.21804】Mapping Tomato Cropping Systems in California Using AlphaEarth Geospatial Embeddings and Deep Learning Analysis
Link: https://arxiv.org/abs/2605.21804
Authors: Mohammadreza Narimani, Alireza Pourreza, Parastoo Farajpoor
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: remote-sensing workflows built, statewide crop identification, support supply-chain forecasting, hand-engineered spectral features, forecasting and policy
Comments: 5 pages, 3 figures, 1 table. Preprint submitted to ASABE 2026 AIM
Abstract:Field-scale crop maps support supply-chain forecasting and policy, yet statewide crop identification still often depends on retrospective surveys or remote-sensing workflows built around hand-engineered spectral features. Those pipelines can be accurate, but they require repeated preprocessing and often lose robustness across years. This study evaluated whether Google DeepMind's AlphaEarth geospatial embeddings can serve as an analysis-ready alternative for mapping processing tomato systems in California. LandIQ 2018 crop polygons were used to assemble a balanced reference dataset of 4,742 tomato and 4,742 non-tomato fields. For each polygon, 64-band AlphaEarth embedding chips were extracted and aligned with binary masks, then divided into spatially independent training (n = 6,638), validation (n = 1,422), and test (n = 1,424) sets. A U-Net segmentation model was trained on AWS SageMaker using a composite masked binary cross-entropy and soft Dice loss. To complement hard predictions, Monte Carlo dropout was retained at inference and repeated 100 times per chip to estimate predictive mean and variance. On the independent test set, the model achieved 99.19% pixel accuracy, 98.69% precision, 99.40% recall, 99.04% F1 score, 98.11% intersection over union, and 99.02% chip accuracy. Uncertainty maps were consistently highest near field edges and low within field interiors. The results show that AlphaEarth embeddings retain crop-relevant spatial and temporal structure and can support accurate, field-scale tomato mapping without manual feature engineering.
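Monte Carlo dropout, as used above, keeps dropout active at inference and summarises repeated stochastic forward passes by their mean and variance. The sketch below shows that loop with a toy stand-in for the network. `toy_stochastic_forward`, its weights, and the dropout rate are assumptions for illustration; only the 100 repeated passes come from the abstract.

```python
import math
import random
from statistics import fmean, pvariance


def toy_stochastic_forward(x, rng, weights=(0.5, -0.25, 1.0), p_drop=0.5):
    """Toy stand-in for a network with dropout left on at inference.

    Each input is dropped with probability `p_drop`; surviving inputs are
    rescaled (inverted dropout) before a weighted sum and a sigmoid.
    """
    kept = [xi / (1.0 - p_drop) if rng.random() >= p_drop else 0.0 for xi in x]
    z = sum(w * k for w, k in zip(weights, kept))
    return 1.0 / (1.0 + math.exp(-z))


def mc_dropout_stats(forward, x, n_passes=100, seed=0):
    """Run `n_passes` stochastic passes; return (predictive mean, variance)."""
    rng = random.Random(seed)
    samples = [forward(x, rng) for _ in range(n_passes)]
    return fmean(samples), pvariance(samples)


mean, variance = mc_dropout_stats(toy_stochastic_forward, [1.0, 2.0, 3.0])
```

Applied per pixel, the variance gives an uncertainty map of the kind the abstract reports as highest near field edges.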
160. 【2605.21671】HyperBench: Standardizing and Scaling Synthetic Evaluation for Hyperspectral Super-Resolution
Link: https://arxiv.org/abs/2605.21671
Authors: Ritik Shah, Marco F. Duarte
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: low-resolution hyperspectral image, high-resolution multispectral image, hyperspectral image, fusing a low-resolution, image
Comments:
Abstract:Hyperspectral super-resolution (HSR) reconstructs a high-spatial-resolution hyperspectral image by fusing a low-resolution hyperspectral image (LR-HSI) with a high-resolution multispectral image (HR-MSI). In the absence of real-world paired data, HSR methods are evaluated almost exclusively on synthetic experiments derived from hyperspectral datasets through Wald's protocol. Despite the protocol's widespread adoption, its practical implementation varies markedly across research works, typically relying on a single (usually Gaussian) or very few point spread functions (PSFs), one or two spectral response functions (SRFs), and a couple of spatial downsampling factors. As a result, reported performance figures are difficult to compare across the literature, in addition to being often difficult to reproduce; furthermore, they may not generalize across realistic sensing conditions. We introduce HyperBench, a unified and extensible framework that standardizes synthetic experimentation for HSR. HyperBench supports diverse degradation configurations spanning ten PSFs, four SRFs derived from operational multispectral sensors, configurable spatial downsampling factors, and matched additive white Gaussian noise; its goal is to automate large-scale evaluation and structured logging. By decoupling model development from experimental design, the framework enables reproducible, apples-to-apples cross-method comparison with minimal friction. We use HyperBench to evaluate six recently proposed HSR methods across a 70-configuration sweep on four widely used hyperspectral scenes and observe that the inter-method PSNR spread widens from approximately 5 dB on the easiest PSF to over 13 dB on the hardest - a fragility that is structurally invisible to the prevailing single-configuration evaluation protocol. HyperBench code is available at this https URL .
161. 【2605.21633】VRXU-net: A Deep Learning Approach for Brain Ischemic Stroke Lesion Detection and Segmentation in T1W MRI
Link: https://arxiv.org/abs/2605.21633
Authors: Sayed Amir Mousavi Mobarakeh
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: ischemic stroke lesions, oxygen delivery, tissues becomes insufficient, leading to cellular, cellular necrosis
Comments:
Abstract:When the blood supply to the brain is obstructed by a clot, oxygen delivery to brain tissues becomes insufficient, leading to cellular necrosis. In healthcare settings, accurately identifying and delineating ischemic lesion boundaries is essential for treatment and surgical planning. However, ischemic stroke lesions vary widely in shape, size, and location, and in grayscale MRI modalities such as T1W they may resemble surrounding brain structures. This makes lesion detection and segmentation a challenging task for clinicians. This study introduces a novel VRU-Net architecture, derived from visual features, residual connections, and a U-shaped network, for detecting and segmenting ischemic stroke lesions in 3D magnetic resonance imaging scans. The proposed method first uses a modified VGG model to identify ischemic stroke in separate 2D slices. Then, a U-shaped segmentation model with residual blocks segments the lesion in each slice. This procedure is applied independently to the axial, sagittal, and coronal planes, and the final output is generated by aggregating the three segmentation results. To improve both performance and processing speed, a high-performance classifier is applied before the segmentation model in a sequential framework. This strategy reduces unnecessary segmentation of non-lesion slices and improves overall accuracy. In addition, decomposing 3D images into 2D slices reduces model complexity while allowing information from three anatomical planes to support more accurate lesion localization. The proposed model is trained on the Anatomical Tracings of Lesions After Stroke dataset and outperforms state-of-the-art models in terms of accuracy and Dice coefficient. Moreover, the segmentation output provides feedback that helps the classification model reduce false-positive predictions.
162. 【2605.21527】CryoNet: A Deep Learning Framework for Multi-Modal Debris-Covered Glacier Mapping. A Case Study of the Poiqu Basin, Central Himalaya
Link: https://arxiv.org/abs/2605.21527
Authors: Farzaneh Barzegar, Tobias Bolch, Norbert Kuehtreiber, Silvia L. Ullo
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: remains challenging due, Principal Component Analysis, debris-covered glaciers, climate change, automatic delineation
Comments: 15 pages, 10 figures, 5 tables. Preprint submitted to IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS); currently under review
Abstract:Glaciers play a critical role as freshwater reserves and indicators of climate change, yet their automatic delineation, especially for debris-covered glaciers, remains challenging due to spectral similarity with surrounding terrain. This study introduces CryoNet, a deep learning framework that leverages a rich multi-modal dataset combining Sentinel-2 optical imagery, DEM-derived topographic variables, spectral indices, Principal Component Analysis (PCA), InSAR coherence and phase, tasseled-cap features, and GLCM texture to discriminate clean-ice glaciers, debris-covered glaciers, and glacial lakes. CryoNet is an encoder-decoder CNN with nested skip connections and spatial-channel Squeeze-and-Excitation (scSE) attention, built upon a ResNet101 encoder to capture hierarchical contextual and spatial features. The study is conducted in the Poiqu Basin in the central Himalaya, and transferability is evaluated by applying the trained model to the Mont Blanc Massif in the Alps. We additionally analyse the importance of each data layer in improving glacier mapping performance. The proposed model achieves an overall IoU of 90.52%, mean Recall of 98.08%, and mean Precision of 92.26%. For debris-covered glaciers specifically, CryoNet obtains an IoU of 90.46%, a recall of 95.79%, and a precision of 94.21%. Across both per-class and overall metrics, CryoNet surpasses DeepLabV3+, SegFormer, and U-Net, taken as state-of-the-art (SOTA) references, demonstrating its effectiveness for robust glacier mapping in complex high-mountain environments.
163. 【2605.21523】Tackle CSM in JPEG Steganalysis with Data Adaptation
Link: https://arxiv.org/abs/2605.21523
Authors: Rony Abecidan (CRIStAL), Vincent Itier (IMT Nord Europe, CRIStAL), Jérémie Boulanger (CRIStAL), Patrick Bas (CRIStAL), Tomáš Pevný (CTU)
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Signal Processing (eess.SP)
Keywords: Steganalysis models excel, Steganalysis models, Cover Source Mismatch, processing pipeline unseen, unseen during training
Comments: ACM Workshop on Information Hiding and Multimedia Security (IHMMSec '26), Jun 2026, Florence, Italy
Abstract:Steganalysis models excel on benchmark datasets but struggle in the wild when analyzed images are produced by a processing pipeline unseen during training. This problem, known as Cover Source Mismatch (CSM), is particularly hard in realistic settings where practitioners (1) have access to only a small, unlabeled dataset, (2) are unsure of the processing techniques applied to these images, and (3) lack information on the proportion of covers and stegos in that set. To address this challenge, we introduce TADA (Target Alignment through Data Adaptation), a framework that learns to emulate the unknown processing pipeline from a small unlabeled target set. This architecture is trained with a loss combining residual covariance alignment, residual distribution matching, and an $\ell^2$ loss constraining the emulator to produce realistic images. Across toy and operational targets, TADA yields substantial gains in robustness to CSM and improves operational generalization compared to strong holistic and atomistic baselines. Additional resources are available at this link: this https URL
164. 【2605.21500】A Task-Agnostic Algebraic Integrity Metric for Event-Camera Streams Toward SOTIF-Compliant Perception using Pearson Correlation Coefficient
Link: https://arxiv.org/abs/2605.21500
Authors: Arthur de Miranda Neto
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: low-latency sensing modality, automated driving systems, offering microsecond temporal, microsecond temporal resolution, low-latency sensing
Comments: 12 pages, 6 figures, 3 tables, 14 equations. Theoretical framework paper with procedural-synthetic illustrations; empirical validation on real datasets reserved for follow-up. Code and demonstration video available
Abstract:Event cameras have emerged as a high-bandwidth, low-latency sensing modality for safety-critical perception in automated driving systems (ADS), offering microsecond temporal resolution, 120-140 dB dynamic range, and intrinsic absence of motion blur. However, no task-agnostic quality metric currently operates directly on the asynchronous event stream: state-of-the-art proxies require a downstream task (e.g., detection accuracy, tracking error) to assess stream integrity, which is incompatible with the certification requirements of ISO 21448 (SOTIF) and ISO/PAS 8800:2024. The recent BiasBench benchmark (CVPR 2025) explicitly identifies this gap. This work proposes a unified algebraic framework that lifts the Pearson Correlation Coefficient (PCC), historically used in two prior works for redundancy filtering and ROI selection on frame-based images, to the three standard event representations: Time Surface, Event Frame, and Voxel Grid. The framework yields three metrics: (i) r-TS for stream integrity monitoring against an ego-motion-predicted Time Surface, (ii) r2-EF for adaptive ROI selection requiring only integer comparisons, and (iii) r-VG for temporal redundancy gating. A structural isomorphism is established between the contrast-threshold mechanism of the event camera ($|\Delta L| = C$) and the PCC-based change criterion, the three lifted metrics are formalized, and pipeline latency and information loss are analyzed symmetrically against the raw stream. Illustrative behavior of each metric is demonstrated on a procedural-synthetic event stream, generated by direct simulation of the emission model rather than drawn from any real or video-derived dataset, including a tunnel-dip integrity-anomaly scenario in which r_C drops from 0.93 (coherent flow) to below 0 (alarm). An explicit epistemic convention ([ESTABLISHED], [SOLID], [HYPOTH.], [OPEN]) delineates the status of every contribution.
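To make the integrity-monitoring idea above concrete, here is a minimal sketch of a Pearson correlation between an observed and a predicted Time Surface, each flattened to a list of values. This is not the paper's implementation: the function names, the toy surfaces, the zero-variance handling, and the alarm threshold of 0 (chosen to mirror the "below 0 (alarm)" wording) are all assumptions made for illustration.

```python
import math
from typing import Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        # Assumption: a constant surface carries no structure, so report 0.
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def integrity_alarm(observed: Sequence[float],
                    predicted: Sequence[float],
                    threshold: float = 0.0) -> bool:
    """Raise an alarm when the correlation falls below the (assumed) threshold."""
    return pearson(observed, predicted) < threshold


# Toy flattened time surfaces (hypothetical values, not from the paper).
predicted = [0.1, 0.4, 0.7, 1.0]
coherent = [0.2, 0.5, 0.6, 0.9]          # roughly follows the prediction
anomalous = list(reversed(predicted))    # structure inverted

print(integrity_alarm(coherent, predicted))   # no alarm expected
print(integrity_alarm(anomalous, predicted))  # alarm expected
```

The sketch uses only the standard library, so the arithmetic stays visible. A practical version would more likely operate on array data and would need the ego-motion prediction step, which is omitted here.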

