Publication details

Bibliographic description

Enhancing full-reference image quality assessment via dual-attention fusion of global–local features and CLIP-based text prior / Mengyao Huang, Yi Zhang, Damon M. Chandler, Mikołaj LESZCZUK, Mylène C. Q. Farias // Journal on Image and Video Processing [Electronic document]. Electronic journal; ISSN 3091-454X. 2026, vol. 2026, art. no. 6, pp. 1–26. System requirements: Adobe Reader. Bibliography: pp. 24–26; abstract. Publication available online since: 2026-03-11

Authors (5)

Keywords

contrastive language-image pre-training; full reference; human visual system; perceptual quality assessment; feature fusion

Bibliometric data

BaDAP ID: 167765
Date added to BaDAP: 2026-06-02
Source text: URL
DOI: 10.1186/s13640-026-00687-6
Publication year: 2026
Publication type: journal article
Open access: yes
Creative Commons
Journal/series: EURASIP Journal on Image and Video Processing

Abstract

Image quality assessment (IQA) is a field that focuses on evaluating the quality of images, playing a crucial role in various image processing and/or computer vision applications. Traditional full-reference (FR) IQA algorithms struggle with an accurate perceptual quality evaluation due in part to their reliance on handcrafted features and simple mathematical functions to calculate the elementwise distance between the reference and distorted images. Although deep-learning-based FR IQA methods have shown advantages in providing a certain degree of tolerance to texture resampling, their performances are still limited by the redundant model parameters and ineffective quality-aware feature extraction/representation. To address this issue, in this paper, we propose a multi-modal dual-attention FR IQA algorithm based on combining a global-and-local image structure analysis with text information interpreted by the widely used large language model. Specifically, the proposed multi-modal dual-attention network consists of four modules. First, a global-and-local feature extraction module was employed to extract the quality-aware features from the reference and distorted images, which were then realigned along the spatial and channel dimensions by a feature fusion module. To take into account both the channel and spatial attentions and thus increase the model capacity in representing long-range dependencies among different image areas, a feature enhancement module was designed to encode the spatial information along two directions, based on which the direction-aware attention maps with position information were generated. Finally, the text prior knowledge interpreted by the contrastive language-image pre-training (CLIP) model was embedded to assist the attention-based prediction module for quality estimation. Experimental results on four benchmark datasets demonstrate the effectiveness of our model as compared with other state-of-the-art FR IQA methods. 
The code is available at https://vinelab.jp/m2da/.
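
The abstract says the feature enhancement module encodes spatial information along two directions and derives direction-aware attention maps from those encodings, but it does not give the layer details. The block below is only a minimal sketch of that general idea, not the authors' implementation. It assumes a single-channel feature map, average pooling along each axis, and a plain sigmoid gate standing in for whatever learned layers the paper uses. The function name `direction_aware_attention` is hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, used here as a stand-in for a learned gating layer."""
    return 1.0 / (1.0 + math.exp(-x))


def direction_aware_attention(feature_map):
    """Reweight an H x W feature map with two 1-D attention vectors.

    feature_map: list of H rows, each a list of W floats (one channel).
    Assumption: pooling plus sigmoid replaces the learned transforms,
    which the abstract does not describe.
    """
    height = len(feature_map)
    width = len(feature_map[0])

    # Encode along the horizontal direction: average each row over its
    # width, which keeps the vertical position (one value per row).
    row_means = [sum(row) / width for row in feature_map]

    # Encode along the vertical direction: average each column over the
    # height, which keeps the horizontal position (one value per column).
    col_means = [
        sum(feature_map[i][j] for i in range(height)) / height
        for j in range(width)
    ]

    # Turn each directional encoding into attention weights in (0, 1).
    row_attention = [sigmoid(m) for m in row_means]
    col_attention = [sigmoid(m) for m in col_means]

    # Each position (i, j) is scaled by the weight of its row and its column,
    # so the resulting map carries position information from both directions.
    return [
        [feature_map[i][j] * row_attention[i] * col_attention[j]
         for j in range(width)]
        for i in range(height)
    ]


features = [[0.0, 2.0], [4.0, 6.0]]
weighted = direction_aware_attention(features)
```

Because every attention weight lies strictly between 0 and 1, this sketch can only attenuate activations. A trained module would learn which rows and columns to emphasize rather than deriving the weights directly from the pooled values.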