CN117078656A - Novel unsupervised image quality assessment method based on multi-mode prompt learning - Google Patents

Novel unsupervised image quality assessment method based on multi-mode prompt learning Download PDF

Info

Publication number
CN117078656A
Authority
CN
China
Prior art keywords
image
model
text
image quality
prompt
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311131117.9A
Other languages
Chinese (zh)
Inventor
纪荣嵘
高体民
潘文胜
郑侠武
张岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN202311131117.9A priority Critical patent/CN117078656A/en
Publication of CN117078656A publication Critical patent/CN117078656A/en
Pending legal-status Critical Current

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/088 - Non-supervised learning, e.g. competitive learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 - Subject of image; Context of image processing
    • G06T 2207/30168 - Image quality inspection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

A novel unsupervised image quality assessment method based on multi-modal prompt learning, belonging to the technical field of computer vision. No-reference image quality assessment aims to simulate the human assessment of image quality without a reference image (the original image). The invention fully exploits the potential of the pre-trained CLIP model in the challenging task of image perceptual evaluation. First, multi-modal prompt learning is introduced so that the representation space of the CLIP model can be flexibly adjusted for BIQA, unlocking its potential in challenging image perceptual evaluation tasks. Second, the previous text prompt learning approach is improved: the antonym-based text prompts used in earlier methods are replaced with fine-grained text prompts, so that fine-grained characteristics of the image can be captured and a more accurate quality assessment obtained.

Description

Novel unsupervised image quality assessment method based on multi-mode prompt learning
Technical Field
The invention belongs to the technical field of computer vision, and in particular relates to a novel no-reference image quality assessment method based on multi-modal prompt learning.
Background
Image quality assessment (IQA) is an important research direction in the field of computer vision; its main objective is to quantitatively predict and evaluate the visual quality of an image in accordance with human visual perception. Image quality assessment has important practical significance in many applications, such as image processing, image transmission, and video coding.
Over the years, a number of IQA methods have been developed and evaluated. They generally fall into three categories according to how much information the original reference image provides: full-reference (FR-IQA), reduced-reference (RR-IQA), and no-reference (NR-IQA), the last also called blind IQA (BIQA). Full-reference image quality assessment refers to the original undistorted image and obtains the quality score of the distorted image from the difference between the two. Reduced-reference image quality assessment predicts image quality using partial information of the original image as a reference. In practical applications, however, the reference image is often difficult to obtain, making the first two approaches inapplicable, so work in recent years has gradually focused on the BIQA field.
Conventional no-reference image quality assessment methods generally rely on manually designed features and rules, and struggle to cope with complex and variable image distortions and quality variations. With the rapid development of deep learning, image quality assessment methods based on deep neural networks have made remarkable progress in accuracy and generalization. These methods exploit the strong representational power of deep neural networks to automatically learn and extract complex perceptual features from images, thereby achieving accurate assessment of image quality. Deep learning overcomes the limitations of hand-crafted features in traditional methods and allows end-to-end training on large-scale datasets, improving both the performance and the generalization ability of image quality assessment.
OpenAI proposed the CLIP model (Radford et al., Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (PMLR), pages 8748-8763, 2021), a self-supervised pre-training framework that encodes images and text into a shared vector space without manually labeled data. CLIP shows powerful zero-shot transfer learning capability across various tasks. Wang et al. (Wang et al., Exploring CLIP for assessing the look and feel of images. arXiv preprint arXiv:2207.12396, 2022) were the first to explore the potential of CLIP in the challenging task of image perceptual evaluation. They proposed a prompt pairing strategy that uses antonym prompts (e.g., "good photo" and "bad photo") to reduce prompt ambiguity. Building on this, Zhang et al. (Zhang et al., Blind image quality assessment via vision-language correspondence: A multitask learning perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14071-14081, 2023) proposed a multitask approach that fine-tunes CLIP by combining image quality assessment with distortion-type classification and scene classification. Their approach automatically determines parameter sharing and loss weights, exploiting auxiliary knowledge from the other tasks. They used a five-level Likert quality scale, with the model outputting the score as a probability-weighted sum over the quality levels.
However, the effectiveness of these methods is often limited by the choice of the input text prompt; for image quality assessment, selecting an appropriate text prompt is crucial, and a poor choice may lead to instability and fluctuations in performance. In addition, some existing models still predict a high quality score when the main content of an image is clear but other regions are distorted, which does not match human perception. Moreover, existing methods rarely consider the distortion of different regions of the image and focus only on the perceptual evaluation of the whole image, which limits them when processing images with local distortion.
Disclosure of Invention
The invention aims to provide a novel no-reference image quality assessment method based on multi-modal prompt learning, which fully exploits the potential of the CLIP model in the challenging task of image perceptual evaluation. The method introduces fine-grained text prompts, classifying the image quality assessment problem more finely so that the model can capture subtle differences in image quality more accurately. Meanwhile, a learnable image prompt is added at each layer of the visual branch, so that the model can better adjust its representation space for the image quality assessment task and understand quality-related information in the image more comprehensively. In addition, a two-stage evaluation paradigm is proposed, which allows the model to adapt gradually from coarse perception to detailed evaluation, thereby achieving a comprehensive understanding and accurate prediction of image quality.
The invention comprises the following steps:
1) Grade each picture in the dataset according to its quality score and assign it a level or category label; for example, images with scores in the range [49, 50) are classified as level 50, i.e., category 50.
2) Introduce a learnable text prompt into the text branch of the model to address the prompt sensitivity of the CLIP model on downstream tasks; meanwhile, to enable the model to better adjust its representation space for the image quality assessment task and understand quality-related information in the image more comprehensively, introduce a learnable image prompt into the image branch.
3) Train the model in two stages: in the first stage, only the text prompt and the image prompt are trained and all other parameters are frozen, so that the transferred CLIP model acquires the ability to perceive and evaluate image quality with very few trainable parameters. In the second stage, the prompts of the two branches are frozen and only the image encoder is trained, so that the model can gradually adapt from coarse perception to detailed evaluation, achieving a comprehensive understanding and accurate prediction of image quality.
The invention has the following characteristics and effects:
The novel no-reference image quality assessment method based on multi-modal prompt learning provided by the invention removes the performance bottleneck of prior work and greatly improves performance. The invention introduces multi-modal prompt learning into the image quality assessment task: it not only introduces a learnable text prompt in the text branch, but also adds a learnable image prompt at each layer of the image branch. This enables the model to fully exploit the information interaction between text and image when evaluating image quality and to better understand the quality-related characteristics of the image, thereby improving performance on the image quality assessment task.
Drawings
FIG. 1 is a comparative diagram of the present invention and previous methods.
Fig. 2 is a frame diagram of the present invention.
Fig. 3 is a visualization heatmap of the present invention.
Detailed Description
The invention will be further illustrated by the following examples in conjunction with the accompanying drawings.
The flow of the method of the present invention is shown in Fig. 2 and comprises two stages. In the first training stage, only the text prompt and the image prompt are trained and all other parameters are frozen, so that the transferred CLIP model acquires the ability to perceive and evaluate image quality with very few trainable parameters. In the second training stage, the prompts of the two branches are frozen and only the image encoder is trained, so that the model can gradually adapt from coarse perception to detailed evaluation, achieving a comprehensive understanding and accurate prediction of image quality. The upper right corner of Fig. 2 shows where learnable prompts are introduced in the image branch: a learnable image prompt is added at each Transformer layer of the image encoder.
1. Training instructions
The embodiment of the invention comprises the following steps:
1) Grade each picture in the dataset according to its quality score and assign it a level or category label; for example, images with scores in the range [49, 50) are classified as level 50, i.e., category 50. The model computes the similarity between the output features of the text encoder and the output features of the image encoder, and obtains the final quality score q(x) as a weighted sum over the quality levels, i.e., q(x) = Σ_c s_c · P(c|x),
where P(c|x) is the class probability after applying softmax, s_c is the quality level (score) represented by class c, and C is the total number of classes.
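For illustration only, a minimal PyTorch-style sketch of this weighted-sum score computation (the function name, temperature, and feature dimensions are assumptions, not taken from the patent):

```python
import torch

def quality_score(image_feat, text_feats, levels, temperature=100.0):
    """Weighted-sum quality score q(x) from CLIP-style similarities.

    image_feat: (D,) image-encoder output feature for image x
    text_feats: (C, D) one text-encoder feature per quality level
    levels:     (C,) the score value s_c associated with each level
    """
    # cosine similarity = dot product of L2-normalized features
    image_feat = image_feat / image_feat.norm()
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    logits = temperature * (text_feats @ image_feat)  # (C,)
    probs = logits.softmax(dim=-1)                    # P(c|x)
    return (probs * levels).sum()                     # q(x) = sum_c s_c * P(c|x)

# toy usage: 100 quality levels (1..100) and random 512-d features
img = torch.randn(512)
txt = torch.randn(100, 512)
lvl = torch.arange(1, 101, dtype=torch.float32)
print(quality_score(img, txt, lvl))
```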
2) A set of learnable text prompts is introduced in the text branch of the model to fully exploit the representational capability of the text encoder. The text input takes the form "[X]_1 [X]_2 [X]_3 ... [X]_M [class]", where "[X]_1 [X]_2 [X]_3 ... [X]_M" denotes the learnable prefix of the text input and M is the number of learnable tokens. In addition, in the image branch, "deep visual prompts" are introduced, which means adding prompts at each layer of the image encoder. The main purpose is to enhance the alignment between the perceptual features of the image and the quality-level text features. By introducing prompts at multiple layers, the learning process of image features is controlled more finely, so that the model adapts better to the image quality assessment task.
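For illustration only, a sketch of such a learnable prefix realized as continuous context vectors prepended to the class-name token embeddings (the class, dimensions, and names below are assumptions; the deep visual prompts are detailed next):

```python
import torch
import torch.nn as nn

class LearnableTextPrompt(nn.Module):
    """Learnable prefix [X]_1 ... [X]_M prepended to each quality-level
    class-name embedding (a sketch; 512 matches CLIP's text width)."""
    def __init__(self, m_tokens=16, embed_dim=512):
        super().__init__()
        # shared learnable context tokens [X]_1 .. [X]_M
        self.context = nn.Parameter(0.02 * torch.randn(m_tokens, embed_dim))

    def forward(self, class_embeddings):
        # class_embeddings: (C, L, D) token embeddings of each class name
        c = class_embeddings.shape[0]
        ctx = self.context.unsqueeze(0).expand(c, -1, -1)  # (C, M, D)
        # "[X]_1 ... [X]_M [class]" -> fed to the frozen CLIP text encoder
        return torch.cat([ctx, class_embeddings], dim=1)

# toy usage: 100 quality-level classes, 4 name tokens each
prompts = LearnableTextPrompt()(torch.randn(100, 4, 512))
print(prompts.shape)  # torch.Size([100, 20, 512])
```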
A set of learnable tokens P is inserted between the image's class token and its patch embeddings. At each layer of the image encoder, the input is represented as [Cls_{i-1}, P_{i-1}, E_{i-1}], where Cls denotes the class token and E denotes the patch embeddings. After the i-th Transformer layer L_i, a new set of learnable tokens P_i is introduced and concatenated with the outputs Cls_i and E_i:
[Cls_i, ○, E_i] = L_i([Cls_{i-1}, P_{i-1}, E_{i-1}])   (2)
where ○ denotes the prompt outputs, which are not passed as input to the next Transformer layer. This design introduces learnable tokens at every layer of the image encoder, enriching the model's ability to capture and represent important image features and quality characteristics, and ultimately contributing to more effective and accurate image quality assessment.
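For illustration only, a sketch of this deep visual prompting, assuming a ViT-style encoder; the wrapper class, parameter names, and prompt length are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PromptedViTEncoder(nn.Module):
    """Deep visual prompts per Eq. (2): fresh learnable tokens P_i are inserted
    at every Transformer layer; the prompt outputs ('○') are discarded."""
    def __init__(self, layers, num_prompts=8, dim=768):
        super().__init__()
        self.layers = nn.ModuleList(layers)  # the ViT Transformer blocks
        self.prompts = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(num_prompts, dim)) for _ in layers]
        )

    def forward(self, cls_tok, patch_emb):
        # cls_tok: (B, 1, dim) class token, patch_emb: (B, N, dim) patch embeddings
        for layer, prompt in zip(self.layers, self.prompts):
            p = prompt.unsqueeze(0).expand(cls_tok.size(0), -1, -1)
            x = layer(torch.cat([cls_tok, p, patch_emb], dim=1))  # [Cls, P_i, E]
            cls_tok = x[:, :1]                # Cls_i
            patch_emb = x[:, 1 + p.size(1):]  # E_i (prompt outputs dropped)
        return cls_tok, patch_emb

# toy usage with stand-in blocks (plain linear layers instead of real ViT blocks)
enc = PromptedViTEncoder([nn.Linear(768, 768) for _ in range(12)])
cls_out, patch_out = enc(torch.randn(2, 1, 768), torch.randn(2, 196, 768))
print(cls_out.shape, patch_out.shape)
```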
3) Training of the model is divided into two stages. In the first training stage, only the text prompt and the image prompt are trained and all other parameters are frozen, so that the transferred CLIP model acquires the ability to perceive and evaluate image quality with very few trainable parameters. The loss function used is:
where i denotes the i-th picture, T and V denote the output features of the text encoder and the image encoder respectively, sim(·,·) denotes the cosine similarity of the two features, B denotes a mini-batch, and A denotes the set of all pictures in the mini-batch that belong to the same category as picture i.
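The formula itself is not reproduced above; for illustration only, the following sketch shows one plausible image-text alignment loss consistent with that description (cross-entropy over cosine similarities, averaged over the same-category set A(i)). The exact formula in the patent may differ, and the temperature value is an assumption:

```python
import torch
import torch.nn.functional as F

def stage1_alignment_loss(img_feats, txt_feats, labels, tau=0.07):
    """One plausible stage-1 loss (an assumption, not the patent's exact formula):
    pull each image feature V_i toward the text features T_a of all samples
    a in A(i), the same-category set within the mini-batch B.

    img_feats: (B, D) image-encoder outputs V
    txt_feats: (B, D) text-encoder outputs T (one per sample's quality category)
    labels:    (B,)   quality-category label of each picture
    """
    v = F.normalize(img_feats, dim=-1)
    t = F.normalize(txt_feats, dim=-1)
    sim = v @ t.T / tau                                  # (B, B) cosine similarities
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)  # log softmax over the batch
    same_cat = (labels[:, None] == labels[None, :]).float()
    # average log-probability over A(i), then over the mini-batch
    return -(same_cat * log_prob).sum(1).div(same_cat.sum(1)).mean()

# toy usage
loss = stage1_alignment_loss(torch.randn(8, 512), torch.randn(8, 512),
                             torch.randint(0, 4, (8,)))
print(loss)
```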
In the second training stage, the prompts of the two branches are frozen and only the image encoder is trained, so that the model can gradually adapt from coarse perception to detailed evaluation, achieving a comprehensive understanding and accurate prediction of image quality. The loss function used at this stage is:
introducing loss of fidelityTo consider pair-wise learning to rank the model estimates. Furthermore, use is made of a smooth L1LossAnd cross entropy loss with tag smoothing +.>To optimize. Wherein α and β are balance ∈ ->And->Is a coefficient of (a).
p and p' represent the true probability distribution and the predicted probability distribution, respectively.
I'_k = (1 - ε)·I_k + ε/C denotes the smoothed value of quality level k in the target distribution, and P_k denotes the predicted logit of category k.
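For illustration only, a sketch of how these three losses could be combined for stage 2. The fidelity-loss form, the pairwise-probability construction, and the way the terms are weighted are standard choices assumed here, not the patent's exact formulas; only the coefficients α = 0.001 and β = 0.1 come from the implementation details below:

```python
import torch
import torch.nn.functional as F

def fidelity_loss(p, p_hat, eps=1e-8):
    """Standard fidelity loss between a true distribution p and a prediction p'."""
    return (1.0 - torch.sqrt(p * p_hat + eps).sum(dim=-1)).mean()

def pairwise_rank_probs(scores, mos):
    """For each pair (i, j): true preference p from the MOS ordering, predicted
    preference p' from the difference of predicted scores (one common choice)."""
    p_true = (mos[:, None] > mos[None, :]).float()
    p_pred = torch.sigmoid(scores[:, None] - scores[None, :])
    # stack into two-outcome distributions [p, 1 - p] for the fidelity loss
    return (torch.stack([p_true, 1 - p_true], dim=-1),
            torch.stack([p_pred, 1 - p_pred], dim=-1))

def stage2_loss(scores, logits, mos, level_targets, alpha=0.001, beta=0.1):
    """Illustrative combination: fidelity + alpha * smooth L1 + beta * label-smoothed CE."""
    p, p_hat = pairwise_rank_probs(scores, mos)
    l_fid = fidelity_loss(p, p_hat)
    l_l1 = F.smooth_l1_loss(scores, mos)
    l_ce = F.cross_entropy(logits, level_targets, label_smoothing=0.1)
    return l_fid + alpha * l_l1 + beta * l_ce

# toy usage: batch of 8, 100 quality-level classes
scores, mos = torch.randn(8) * 10 + 50, torch.randn(8) * 10 + 50
print(stage2_loss(scores, torch.randn(8, 100), mos, torch.randint(0, 100, (8,))))
```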
2. Implementation details
1) Model details
The invention is implemented with the PyTorch framework. Both the image encoder and the text encoder are taken from the CLIP framework. Specifically, a ViT-B/16 architecture with a single-layer Transformer decoder is employed as the image encoder. This configuration contains 12 Transformer layers, each with a hidden size of 768. To match the output of the text encoder, a linear layer is used to reduce the dimension of the image feature vector from 768 to 512.
2) Training details
The training process is divided into two stages, each containing 60 epochs. In the first stage, only the learnable text prompts and image prompts are trained, with all other parameters frozen. The Adam optimizer is used with an initial learning rate of 3×10^-5, which is then decayed with a cosine learning-rate schedule. To augment the training data, each original image is randomly cropped into 8 sub-images, each of size 3×224×224. In the second stage, the image encoder is optimized. The Adam optimizer is still used, but this stage includes a warm-up of 10 epochs during which the learning rate increases linearly from 9.5×10^-7 to 5×10^-6; the learning rate is then multiplied by 0.1 at the 30th and 50th epochs. Data augmentation at this stage includes random horizontal flipping and random cropping. The coefficient α is set to 0.001 and β to 0.1.
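For illustration only, the two-stage optimizer and learning-rate schedule described above could be set up roughly as follows (the stand-in model, the parameter-name filter, and the scheduler choices are assumptions):

```python
import torch
import torch.nn as nn

# stand-in model: in practice this would be the prompted CLIP model of the invention
class DummyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_prompt = nn.Parameter(torch.zeros(16, 512))
        self.image_prompt = nn.Parameter(torch.zeros(12, 8, 768))
        self.image_encoder = nn.Linear(768, 512)

model = DummyModel()

# stage 1: train only the prompts (lr 3e-5, cosine decay over 60 epochs)
prompt_params = [p for n, p in model.named_parameters() if "prompt" in n]
opt1 = torch.optim.Adam(prompt_params, lr=3e-5)
sched1 = torch.optim.lr_scheduler.CosineAnnealingLR(opt1, T_max=60)

# stage 2: train the image encoder; 10-epoch linear warm-up from 9.5e-7 to
# 5e-6, then multiply the learning rate by 0.1 at epochs 30 and 50
opt2 = torch.optim.Adam(model.image_encoder.parameters(), lr=5e-6)

def lr_factor(epoch):
    if epoch < 10:
        start = 9.5e-7 / 5e-6
        return start + (1.0 - start) * epoch / 10
    return 1.0 if epoch < 30 else (0.1 if epoch < 50 else 0.01)

sched2 = torch.optim.lr_scheduler.LambdaLR(opt2, lr_lambda=lr_factor)
```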
Each image is assigned a new category label based on the score label for each image. The batch sizes of LIVE and CSIQ data sets are set to 32, while the batch sizes of other data sets are set to 64. Data were divided into 80% for training and 20% for testing. To reduce the performance bias, each experiment was repeated 10 times and the average PLCC and SRCC were calculated.
3. Application field
The invention can be applied in the field of no-reference image quality assessment and can be used to judge the quality of a distorted picture when no original image is available.
Table 1 compares the performance of the model of the present invention with previous models on 6 common image quality assessment datasets. These datasets include four synthetic-distortion datasets (LIVE, CSIQ, TID2013 and KADID) and two authentic-distortion datasets (LIVEC and KonIQ). To evaluate model performance, the Pearson linear correlation coefficient (PLCC) and the Spearman rank-order correlation coefficient (SRCC) are used as evaluation metrics. PLCC measures the accuracy of the model's predictions, while SRCC evaluates the monotonicity of the BIQA algorithm's predictions. Both metrics range from 0 to 1, and higher values indicate better prediction accuracy and monotonicity.
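For illustration only, these two metrics can be computed as follows (a minimal sketch using SciPy; in practice PLCC is often computed after a monotonic logistic remapping of the predictions, which is omitted here, and the numbers shown are made up):

```python
import numpy as np
from scipy import stats

def plcc_srcc(pred, mos):
    """PLCC measures prediction accuracy; SRCC measures prediction monotonicity."""
    plcc, _ = stats.pearsonr(pred, mos)
    srcc, _ = stats.spearmanr(pred, mos)
    return plcc, srcc

# toy usage with made-up predicted scores and ground-truth MOS values
pred = np.array([62.1, 55.3, 71.8, 40.2, 48.9])
mos = np.array([60.0, 58.0, 75.0, 42.0, 47.5])
print(plcc_srcc(pred, mos))
```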
As can be seen from Table 1, the model of the present invention exhibits competitive performance against state-of-the-art methods across all datasets. Notably, it achieves state-of-the-art performance on the CSIQ, TID2013 and KADID datasets, improving PLCC by 1.0%, 1.7% and 5.0%, and SRCC by 1.9%, 2.2% and 4.2%, respectively, over existing methods. These results highlight the effectiveness and leading performance of the model in image quality assessment.
TABLE 1
Table 1: Performance comparison measured by averages of SRCC and PLCC. Best results are highlighted in bold, second-best are underlined.
Table 2 compares the generalization performance of the model of the present invention with previous models in cross-dataset evaluation. Specifically, a BIQA model is trained on one dataset and evaluated directly on another without fine-tuning or parameter adjustment. Four datasets are used and the median experimental results are reported. The model of the present invention exhibits superior performance on the KonIQ and CSIQ datasets and remains competitive on the other datasets. These experimental results highlight the robust generalization performance of the model.
TABLE 2
Figure 1 compares the present invention with previous methods. It can be seen that CLIP-IQA+ introduces a learnable text prompt on top of CLIP-IQA but does not change the antonym-based classification scheme; in contrast to both, the present invention introduces fine-grained classification by quality-score interval. In addition, the invention also introduces learnable prompts in the image branch.
Fig. 3 shows visualization heatmaps, using Grad-CAM to visualize the feature attention maps of an input image in DEIQT (Qin G, Hu R, Liu Y, et al. Data-Efficient Image Quality Assessment with Attention-Panel Decoder. arXiv preprint arXiv:2304.04952, 2023) and in the model of the present invention. Focusing on low-scoring images, the aim is to reveal why the model of the invention outperforms DEIQT on such images. As shown in Fig. 3, DEIQT is observed to concentrate excessively on the main content of the image. As a result, DEIQT predicts a high quality score even when the main content of the image remains clear while other regions are severely distorted, which clearly does not match reality. In contrast, the model of the present invention takes the distortion of different regions of the image into account and therefore assesses image quality more accurately. This advantage results from the multi-modal prompt design and the two-stage training mode of the present invention. First, the multi-modal prompt design allows the model to combine textual and visual information, enabling a more comprehensive understanding of the intrinsic features of the image; in particular, for low-quality images the model can effectively capture fine distortions that might otherwise be ignored. Second, the two-stage training strategy enables the model to gradually adapt to the BIQA task, moving from coarse perception to detailed evaluation and achieving a full understanding of image quality.
The above-described embodiments are merely preferred embodiments of the present invention and should not be construed as limiting the scope of the present invention. All equivalent changes and modifications within the scope of the present invention are intended to be covered by the present invention.

Claims (4)

1. A novel unsupervised image quality assessment method based on multi-modal prompt learning, characterized by comprising the following steps:
1) grading each picture in the dataset according to its score and assigning it a level or category label;
2) Introducing a learnable text prompt into a text branch of a model to solve the problem of prompt sensitivity of the CLIP model to downstream tasks; meanwhile, in order to enable the model to better adjust the representation space in the image quality assessment task, the information related to quality in the image is more comprehensively understood, and a learnable image prompt is introduced into the image branch;
3) training the model in two stages: in the first training stage, only the text prompt and the image prompt are trained and all other parameters are frozen, so that the transferred CLIP model acquires the ability to perceive and evaluate image quality with very few trainable parameters; in the second training stage, the prompts of the two branches are frozen and only the image encoder is trained, so that the model can gradually adapt from coarse perception to detailed evaluation, achieving a comprehensive understanding and accurate prediction of image quality.
2. The unsupervised image quality assessment method based on multi-modal prompt learning as claimed in claim 1, wherein in step 1), the pictures are graded according to the score of each picture in the dataset and assigned a level or category label; for example, images with scores in the range [49, 50) are classified as level 50, i.e., category 50; the model computes the similarity between the output features of the text encoder and the output features of the image encoder, and obtains the final quality score q(x) by weighted summation:
where P(c|x) represents the class probability after applying softmax and C represents the total number of classes.
3. The novel multi-modal prompt learning-based unsupervised image quality assessment method as claimed in claim 1, wherein in step 2), in the text branch of the model, a learnable text prompt is introduced to solve the problem of prompt sensitivity of the CLIP model to downstream tasks, in particular:
a set of learnable text prompts is introduced to fully exploit the representational capability of the text encoder, and the text input is designed in the form "[X]_1 [X]_2 [X]_3 ... [X]_M [class]", where "[X]_1 [X]_2 [X]_3 ... [X]_M" denotes the learnable prefix of the text input and M denotes the number of learnable tokens; in addition, in the image branch, "deep visual prompts" are introduced, which means adding prompts at each layer of the image encoder to enhance the alignment between the perceptual features of the image and the quality-level text features; by introducing prompts at multiple layers, the learning process of the image features is controlled more finely, so that the model adapts better to the image quality assessment task;
a set of learnable tokens P is inserted between the image's class token and its patch embeddings; at each layer of the image encoder, the input is represented as [Cls_{i-1}, P_{i-1}, E_{i-1}], where Cls denotes the class token and E denotes the patch embeddings; after the i-th Transformer layer L_i, a new set of learnable tokens P_i is introduced and concatenated with the outputs Cls_i and E_i:
[Cls_i, ○, E_i] = L_i([Cls_{i-1}, P_{i-1}, E_{i-1}])   (2)
where ○ denotes the prompt outputs, which are not passed as input to the next Transformer layer; this design introduces learnable tokens at every layer of the image encoder, enriching the model's ability to capture and represent important image features and quality characteristics, and ultimately contributing to more effective and accurate image quality assessment.
4. The novel multi-modal prompt learning-based unsupervised image quality assessment method according to claim 1, wherein in step 3), the training of the model is divided into two stages: in the first training stage, only the text prompt and the image prompt are trained and all other parameters are frozen, so that the transferred CLIP model acquires the ability to perceive and evaluate image quality with very few trainable parameters; the loss function used is:
wherein i represents the i-th picture, T and V represent the output features of the text encoder and the image encoder respectively, sim(·,·) represents the cosine similarity of the two features, B represents a mini-batch, and A represents the set of all pictures in the mini-batch that belong to the same category as picture i;
in the second model training stage, the prompts of the two branches are frozen, and only the image encoder is trained, so that the model can be gradually adapted to detail evaluation from coarse perception, and the comprehensive understanding and accurate prediction of the image quality are realized; the loss function used at this stage is:
a fidelity loss is introduced so that pairwise learning-to-rank is taken into account in the model's estimates; in addition, a smooth L1 loss and a cross-entropy loss with label smoothing are used for optimization; wherein α and β are the coefficients that balance the smooth L1 loss and the label-smoothed cross-entropy loss, respectively;
wherein p and p' represent the true probability distribution and the predicted probability distribution, respectively;
wherein I'_k = (1 - ε)·I_k + ε/C represents the smoothed value of quality level k in the target distribution, and P_k represents the predicted logit of category k.
CN202311131117.9A 2023-09-04 2023-09-04 Novel unsupervised image quality assessment method based on multi-mode prompt learning Pending CN117078656A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311131117.9A CN117078656A (en) 2023-09-04 2023-09-04 Novel unsupervised image quality assessment method based on multi-mode prompt learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311131117.9A CN117078656A (en) 2023-09-04 2023-09-04 Novel unsupervised image quality assessment method based on multi-mode prompt learning

Publications (1)

Publication Number Publication Date
CN117078656A true CN117078656A (en) 2023-11-17

Family

ID=88715252

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311131117.9A Pending CN117078656A (en) 2023-09-04 2023-09-04 Novel unsupervised image quality assessment method based on multi-mode prompt learning

Country Status (1)

Country Link
CN (1) CN117078656A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117437211A (en) * 2023-11-20 2024-01-23 电子科技大学 Low-cost image quality evaluation method based on double-bias calibration learning

Similar Documents

Publication Publication Date Title
CN109949317B (en) Semi-supervised image example segmentation method based on gradual confrontation learning
CN109299274B (en) Natural scene text detection method based on full convolution neural network
US7187811B2 (en) Method for image resolution enhancement
JP2006172437A (en) Method for determining position of segment boundary in data stream, method for determining segment boundary by comparing data subset with vicinal data subset, program of instruction executable by computer, and system or device for identifying boundary and non-boundary in data stream
CN110598018B (en) Sketch image retrieval method based on cooperative attention
CN114282047A (en) Small sample action recognition model training method and device, electronic equipment and storage medium
CN111696136A (en) Target tracking method based on coding and decoding structure
Al-Amaren et al. RHN: A residual holistic neural network for edge detection
CN114926826A (en) Scene text detection system
CN112967292B (en) Automatic cutout and scoring method and system for E-commerce products
CN111428730A (en) Weak supervision fine-grained object classification method
CN112348809A (en) No-reference screen content image quality evaluation method based on multitask deep learning
CN117058386A (en) Asphalt road crack detection method based on improved deep Labv3+ network
CN117011515A (en) Interactive image segmentation model based on attention mechanism and segmentation method thereof
CN117078656A (en) Novel unsupervised image quality assessment method based on multi-mode prompt learning
CN114821174B (en) Content perception-based transmission line aerial image data cleaning method
CN115294424A (en) Sample data enhancement method based on generation countermeasure network
CN114596433A (en) Insulator identification method
CN114445662A (en) Robust image classification method and system based on label embedding
Zhao et al. Single image dehazing based on enhanced generative adversarial network
CN113077525A (en) Image classification method based on frequency domain contrast learning
CN117115123A (en) Novel reference-free image quality assessment method based on text-image pair
CN111626409B (en) Data generation method for image quality detection
Montajabi et al. Using ML to Find the Semantic Region of Interest
Xu et al. Blind Quality Assessment of Tone-Mapped Images with Multi-scale Visual Feature Extraction Neural Network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination