CN116310375A - Blind image quality assessment method based on visual attention mechanism - Google Patents

Blind image quality assessment method based on visual attention mechanism

Info

Publication number
CN116310375A
Authority
CN
China
Prior art keywords
image
attention
feature
level features
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310003353.6A
Other languages
Chinese (zh)
Inventor
于天河
孙岩
程士成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202310003353.6A priority Critical patent/CN116310375A/en
Publication of CN116310375A publication Critical patent/CN116310375A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30168 Image quality inspection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a blind image quality assessment method based on a visual attention mechanism. The method comprises the following steps: after size limitation, the original image is input into feature extraction network A to extract high-level and low-level features; the original image is preprocessed to generate a just-noticeable distortion (JND) map and a saliency map, which are input into feature extraction network B to extract high-level features; the low-level features extracted from the original image are passed through separate dimension-reduction pooling modules to obtain feature vectors; the high-level features extracted from the original image and the JND map are fused with the features obtained by dimension-reduction pooling; and the quality score is obtained from the fused feature vector by a quality regression network. The method incorporates visual attention mechanisms and uses different attention mechanisms in the different feature extraction networks, so the extracted features better match the attention characteristics of the human eye; it also takes into account the influence of the low-level features extracted from the image on image quality, making the image quality evaluation more accurate.

Description

Blind image quality assessment method based on visual attention mechanism
Technical Field
The invention relates to a blind image quality assessment method based on a visual attention mechanism, and belongs to the field of image processing.
Background
With the continuous development of information technology, digital information resources have grown explosively, and people can acquire complex and varied information through electronic devices. Image information records objective things intuitively and is more efficient than text or voice information; it appears in many fields of daily life and is one of the most widely used and most efficient information media. However, noise and other interference are inevitably introduced during image acquisition, transmission, and other processes, degrading image quality. Low-quality images severely affect the human visual experience and hinder the development of computer vision in many areas, so how to accurately evaluate image quality is a fundamental and important problem.
Among objective image quality assessment approaches, blind image quality assessment methods have received considerable attention from researchers because they do not require a reference image. Kang et al. first applied a convolutional neural network (CNN) to image quality assessment; their model was a shallow CNN into which image patches were fed to obtain the quality score. Ma et al. designed a CNN-based multi-task learning model comprising two sub-networks, a quality prediction sub-network and a distortion type identification sub-network, in which part of the feature extraction parameters are shared; the distortion type identification sub-network is pre-trained first, and then the whole network is trained.
Currently, most deep-learning-based image quality assessment methods extract features directly from the distorted image, whereas the human visual system does not attend to every region of the image equally; instead, it screens out and focuses on the regions of interest relevant to the task. These methods ignore the effect of the human visual attention mechanism on image quality assessment, which leads to inaccurate predictions. Moreover, using only the features extracted by a deep network overlooks, to a certain extent, the influence of low-level features such as texture and gradient on image quality.
Disclosure of Invention
The invention aims to provide a blind image quality assessment method based on a visual attention mechanism. The method uses attention mechanisms to make the network focus on the parts of the image that have a larger influence on image quality, and it also takes the low-level features extracted from the image into account, thereby improving the prediction accuracy of quality scores for distorted images.
A blind image quality assessment method based on visual attention mechanisms, comprising:
step 1, after size limitation, inputting an original image into a feature extraction network A to extract high-level features and low-level features;
step 2, preprocessing the original image to generate a just-noticeable distortion (JND) map and a saliency map, and inputting the JND map and the saliency map into a feature extraction network B to extract high-level features;
step 3, passing the low-level features extracted in step 1 through respective dimension-reduction pooling modules to obtain feature vectors;
step 4, performing feature fusion on the high-level features extracted in step 1 and step 2 and the features obtained in step 3;
and step 5, obtaining the quality score from the fused feature vector through a quality regression network.
The invention is also characterized in that:
the size in step 1 is defined as the input picture size being limited to a range of n x n, and when the image is wider or taller than n, scaling the wider or taller than n to n, where n takes 512 pixels.
In step 1, the size-limited original distorted image is taken as input. Feature extraction network A is based on the MobileNetV2 network, with a mixed attention module added to the last inverted residual of each bottleneck structure. The low-level features are the feature maps output by the second and fourth bottleneck structures, and the high-level features are the feature map output by the last bottleneck structure of the network.
The added mixed attention module applies the channel attention first and then the spatial attention. The channel attention operates as follows:
C = Mul(σ(a(K₁(GAP(m)), K₂(GMP(m)))), m)
wherein m is the feature map to be passed through the channel attention module; GAP and GMP are the global average pooling and global max pooling operations on m, respectively; K₁ and K₂ are 1×1 adaptive convolution operations applied to the pooled GAP and GMP features; a adds the features produced by K₁ and K₂ element-wise; σ is the sigmoid operation; and Mul multiplies m channel-wise by the channel weights produced by σ. The output C of the channel attention module is then fed into the spatial attention, which is the spatial attention part of the CBAM attention module and constitutes top-down, task-driven attention.
In step 2, the just-noticeable distortion (JND) map and the saliency map are taken as inputs; the saliency map and the JND map are concatenated into a two-channel image, with the saliency map serving as the spatial attention part for the JND map. The specific process is as follows: the saliency map and the JND map of the image are extracted using a saliency extraction model and a JND model; the saliency map and the JND map are concatenated to obtain a two-channel image; and the concatenated image is input into the designed feature extraction network B, which is feature extraction network A of step 1 with the spatial attention module removed.
The specific process of step 3 is as follows: the low-level features extracted in step 1 are each passed through a dimension-reduction pooling module, which comprises average pooling, 1×1 convolution, and SPP pooling; a 2×2 average pooling with stride 2 halves the width and height of the feature map, a 1×1 convolution reduces the number of channels to 10, and finally SPP pooling converts the feature map into a one-dimensional feature vector.
Step 4 is specifically as follows: feature fusion is performed twice. First, the high-level features extracted in step 1 and step 2 are concatenated along the channel dimension, and the concatenated features are passed through adaptive average pooling to obtain a feature vector; this feature vector is then concatenated with the feature vectors obtained in step 3 to form the features finally input into the quality regression network.
The beneficial effects of the invention are as follows:
the invention provides a blind image quality assessment method based on a visual attention mechanism, which constructs two feature extraction networks, wherein the two paths of networks use the same channel attention, different spatial attention is used for different input images, the extracted features are more in line with the attention characteristics of human eyes, and the accuracy of image quality assessment is improved. The image quality evaluation requires low-level information such as gradient and texture of an image and high-level semantic information, and the low-level features extracted from the original image make up the defect that most of the current image quality evaluation methods for deep learning only use the high-level features, so that the accuracy of image quality evaluation is improved.
Drawings
Fig. 1 is a flow chart of a blind image quality assessment method based on visual attention mechanisms according to the present invention.
Fig. 2 is a diagram of the dimension-reduction pooling module according to the present invention.
Detailed Description
The invention provides a blind image quality assessment method based on a visual attention mechanism. In order to better understand the technical solution in the embodiments of the present invention and to make the above objects, features and advantages of the present invention clearer, the technical solution is described in further detail below with reference to the accompanying drawings:
The invention first provides a blind image quality assessment method based on a visual attention mechanism, as shown in Fig. 1; the specific method is as follows:
Step 1, after size limitation, inputting an original image into feature extraction network A to extract high-level features and low-level features;
Step 1 specifically comprises the following:
the input picture size is limited to the n x n range, and when the image width or height is greater than n, the image width or height greater than n is scaled to n, where n takes 512 pixels. The image with limited size is input into a feature extraction network, the feature extraction network A is based on a Mobilene V2 network, and the Mobilene V2 network is a lightweight convolutional neural network and has the advantages of small parameter quantity, small calculation amount, high accuracy and the like, and can extract rich image features while keeping small calculation cost. And removing the last output layer and the pooling layer of the Mobilene V2, adding a mixed attention module in the last pouring residual error of each bottleneck structure of the Mobilene V2, taking the characteristics output by the second bottleneck structure and the fourth bottleneck structure as low-level characteristics, and taking the characteristic diagram output by the last bottleneck structure as high-level characteristics.
The added mixed attention module applies the channel attention first and then the spatial attention. The channel attention operates as follows:
C = Mul(σ(a(K₁(GAP(m)), K₂(GMP(m)))), m)
wherein m is the feature map to be passed through the channel attention module; GAP and GMP are the global average pooling and global max pooling operations on m, respectively; K₁ and K₂ are 1×1 adaptive convolution operations applied to the pooled GAP and GMP features; a adds the features produced by K₁ and K₂ element-wise; σ is the sigmoid operation; and Mul multiplies m channel-wise by the channel weights produced by σ. The output C of the channel attention module is then fed into the spatial attention, which is the spatial attention part of the CBAM attention module. The mixed attention module essentially assigns weights in a task-driven manner, which is a form of top-down attention.
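By way of illustration, a minimal PyTorch sketch of this channel attention follows. It assumes that K₁ and K₂ are 1×1 convolutions that keep the channel count unchanged; the exact channel dimensions of the adaptive convolutions are not specified above, so they are placeholders here.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention C = Mul(sigma(a(K1(GAP(m)), K2(GMP(m)))), m)."""

    def __init__(self, channels: int):
        super().__init__()
        # K1 / K2: 1x1 convolutions applied to the pooled descriptors (channel count assumed unchanged).
        self.k1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.k2 = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, m: torch.Tensor) -> torch.Tensor:
        gap = torch.mean(m, dim=(2, 3), keepdim=True)        # GAP: global average pooling
        gmp = torch.amax(m, dim=(2, 3), keepdim=True)         # GMP: global max pooling
        weights = torch.sigmoid(self.k1(gap) + self.k2(gmp))  # a(.) is element-wise addition, sigma is sigmoid
        return m * weights                                    # Mul: channel-wise reweighting of m
```

In the full mixed attention module, the output of this block would then pass through the spatial attention part of CBAM, as described above.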
Step 2, preprocessing the original image to generate a just-noticeable distortion (JND) map and a saliency map, and inputting the JND map and the saliency map into feature extraction network B to extract high-level features;
the step 2 is specifically as follows:
The just-noticeable distortion (JND) map and the saliency map are taken as inputs. The JND map is produced by an existing JND model and reflects the sensitivity of the human eye to different distortions and the perceptible distortion threshold. The saliency map serves as the spatial attention part for the JND map; it expresses saliency-based attention, which is typically bottom-up and driven by external stimuli, and is produced by an existing saliency extraction model.
The saliency map and the JND map of the image are extracted using the saliency extraction model and the JND model; the saliency map and the JND map are concatenated to obtain a two-channel image; the concatenated image is input into the designed feature extraction network B, which is feature extraction network A of step 1 with the spatial attention module removed.
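As a sketch only, the two-channel input for network B could be assembled as follows; saliency_model and jnd_model are placeholders for the existing saliency extraction and just-noticeable distortion models referred to above, assumed here to return single-channel maps.

```python
import torch

def build_network_b_input(image: torch.Tensor, saliency_model, jnd_model) -> torch.Tensor:
    """Concatenate the JND map and the saliency map into a two-channel image.

    image: (N, 3, H, W) batch; saliency_model / jnd_model are assumed to return
    maps of shape (N, 1, H, W) for that batch.
    """
    jnd_map = jnd_model(image)            # just-noticeable distortion map
    saliency_map = saliency_model(image)  # saliency map (acts as spatial attention)
    return torch.cat([jnd_map, saliency_map], dim=1)  # (N, 2, H, W), fed to network B
```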
step 3, obtaining feature vectors from the low-level features extracted in the step 1 through respective dimension reduction pooling modules;
the step 3 is specifically as follows:
The feature maps output by the second and fourth bottleneck structures of the network are taken as the low-level features and are each passed through the dimension-reduction pooling module shown in Fig. 2: a 2×2 average pooling with stride 2 halves the width and height of the feature map, a 1×1 convolution reduces the number of channels to 10, and finally SPP pooling converts the feature map into a one-dimensional feature vector.
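A minimal PyTorch sketch of this dimension-reduction pooling module is given below. The SPP pyramid levels (1, 2, and 4) and the use of max pooling inside SPP are assumptions made for the sketch; the text above only fixes the three stages (2×2 average pooling with stride 2, 1×1 convolution to 10 channels, and SPP pooling to a one-dimensional vector).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DimReductionPooling(nn.Module):
    """Average pooling -> 1x1 convolution -> SPP pooling, yielding a 1-D feature vector."""

    def __init__(self, in_channels: int, levels=(1, 2, 4)):
        super().__init__()
        self.avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)    # halve width and height
        self.reduce = nn.Conv2d(in_channels, 10, kernel_size=1)  # reduce channels to 10
        self.levels = levels                                     # assumed SPP pyramid levels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(self.avg_pool(x))
        # Spatial pyramid pooling: pool to fixed grids and concatenate into one vector.
        pooled = [F.adaptive_max_pool2d(x, level).flatten(1) for level in self.levels]
        return torch.cat(pooled, dim=1)  # shape (N, 10 * (1 + 4 + 16)) = (N, 210)
```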
Step 4, performing feature fusion on the high-level features extracted in step 1 and step 2 and the features obtained in step 3.
The step 4 is specifically as follows:
Feature fusion is performed twice. First, the high-level features extracted in step 1 and step 2 are concatenated along the channel dimension, and the concatenated features are passed through adaptive average pooling to obtain a feature vector; this feature vector is then concatenated with the feature vectors obtained in step 3 to form the features finally input into the quality regression network.
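For illustration, a sketch of this two-stage fusion is shown below; it assumes the two high-level feature maps have the same spatial size (which holds when networks A and B share the same backbone strides), and the tensor names are placeholders.

```python
import torch
import torch.nn.functional as F

def fuse_features(high_a: torch.Tensor, high_b: torch.Tensor,
                  low_vec_2: torch.Tensor, low_vec_4: torch.Tensor) -> torch.Tensor:
    """First fusion: channel-concatenate the high-level maps and pool them to a vector.
    Second fusion: concatenate that vector with the low-level feature vectors."""
    high = torch.cat([high_a, high_b], dim=1)                  # channel splicing of high-level maps
    high_vec = F.adaptive_avg_pool2d(high, 1).flatten(1)       # adaptive average pooling -> vector
    return torch.cat([high_vec, low_vec_2, low_vec_4], dim=1)  # final input to the quality regression network
```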
Step 5, obtaining the quality score from the fused feature vector through a quality regression network;
the step 5 is specifically as follows:
The quality regression network no longer extracts image features; instead, it maps the previously extracted features to the image quality. The quality regression network consists of 4 fully connected layers and uses a sigmoid activation function. The loss function combines the Smooth L1 loss with a ranking (ordering) loss as the final loss, and the network parameters are updated through back-propagation. Compared with the L1 and L2 loss functions, the Smooth L1 loss converges faster, is insensitive to outliers, and is easier to train. The ranking loss improves the model's ability to predict image quality scores whose ordering matches that of the actual quality scores, so combining the Smooth L1 loss with the ranking loss is more beneficial.
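A minimal sketch of such a quality regression head follows; the hidden layer widths and the placement of the sigmoid activations are assumptions, since the text above only fixes the number of fully connected layers (four) and the activation function.

```python
import torch
import torch.nn as nn

class QualityRegression(nn.Module):
    """Four fully connected layers with sigmoid activations, mapping the fused
    feature vector to a single quality score."""

    def __init__(self, in_features: int):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_features, 512), nn.Sigmoid(),
            nn.Linear(512, 128), nn.Sigmoid(),
            nn.Linear(128, 32), nn.Sigmoid(),
            nn.Linear(32, 1),  # final layer outputs the predicted quality score
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.layers(features).squeeze(-1)
```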
The Smooth L1 loss function formula is as follows:
Smooth L1 = 0.5 × (xᵢ − yᵢ)², if |xᵢ − yᵢ| < 1; |xᵢ − yᵢ| − 0.5, otherwise
the ordering penalty is as follows:
[Pairwise ranking loss L_rank^(i,j) and aggregate ranking loss L_rank: formulas provided as images in the original document.]
the total loss function formula is as follows:
L = α × Smooth L1 + β × L_rank
wherein xᵢ is the true score of the i-th picture, yᵢ is the predicted score of the i-th picture, L_rank^(i,j) is the ranking loss between picture i and picture j, and α and β are the weights of the Smooth L1 loss and the ranking loss, respectively.
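By way of illustration, a sketch of the combined loss is given below. The Smooth L1 term uses the standard PyTorch implementation; the pairwise ranking term is one common hinge-style formulation chosen here as an assumption, because the exact ranking loss formula appears only as an image in the original document, and the default weights α and β are placeholders.

```python
import torch
import torch.nn.functional as F

def total_loss(pred: torch.Tensor, target: torch.Tensor,
               alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """L = alpha * SmoothL1 + beta * L_rank over a batch of predicted/true scores (shape (N,))."""
    smooth_l1 = F.smooth_l1_loss(pred, target)
    # Pairwise ranking term (assumed form): penalise pairs whose predicted order
    # disagrees with the order of the true scores.
    diff_pred = pred.unsqueeze(0) - pred.unsqueeze(1)      # element [i, j] = pred[j] - pred[i]
    diff_true = target.unsqueeze(0) - target.unsqueeze(1)  # element [i, j] = target[j] - target[i]
    rank = torch.relu(-diff_pred * torch.sign(diff_true)).mean()
    return alpha * smooth_l1 + beta * rank
```

The network parameters would then be updated by back-propagating this combined loss, as described above.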
The invention is a blind image quality assessment method based on a visual attention mechanism in which features are extracted from the original image and from a just-noticeable distortion (JND) map, respectively. The JND map reflects the distortion-perception characteristics of the human visual system. Top-down, task-driven attention is used when extracting features from the original image; because the JND map loses part of the image content while reflecting distortion perception, bottom-up attention driven by external stimuli, namely the saliency map, is used to express its salient regions. The two feature extraction branches use the same channel attention, so the extracted features better match the attention characteristics of the human eye and the accuracy of image quality assessment is improved. The method also combines the high-level and low-level features extracted from the image, further improving the accuracy of image quality assessment.

Claims (6)

1. A blind image quality assessment method based on a visual attention mechanism, characterized in that the method is realized by the following steps:
step 1, after size limitation, inputting an original image into a feature extraction network A to extract high-level features and low-level features;
step 2, preprocessing the original image to generate a just-noticeable distortion map and a saliency map, and inputting the just-noticeable distortion map and the saliency map into a feature extraction network B to extract high-level features;
step 3, passing the low-level features extracted in step 1 through respective dimension-reduction pooling modules to obtain feature vectors;
step 4, performing feature fusion on the high-level features extracted in step 1 and step 2 and the features obtained in step 3;
and step 5, obtaining the quality score from the fused feature vector through a quality regression network.
2. The visual attention mechanism based blind image quality assessment method of claim 1, wherein: the size limitation in step 1 restricts the input picture to the range n×n; when the width or height of the image is greater than n, that dimension is scaled down to n, and n is 512 pixels.
3. The visual attention mechanism based blind image quality assessment method of claim 1, wherein: in step 1, the feature extraction network A is based on a MobileNetV2 network, a mixed attention module is added to the last inverted residual of each bottleneck structure, the low-level features are the feature maps output by the second and fourth bottleneck structures, the high-level features are the feature map output by the last bottleneck structure of the network, and the channel attention in the added mixed attention module operates as follows:
C = Mul(σ(a(K₁(GAP(m)), K₂(GMP(m)))), m)
wherein m is the feature map to be passed through the channel attention module; GAP and GMP are the global average pooling and global max pooling operations on m, respectively; K₁ and K₂ are 1×1 adaptive convolution operations applied to the pooled GAP and GMP features; a adds the features produced by K₁ and K₂ element-wise; σ is the sigmoid operation; Mul multiplies m channel-wise by the channel weights produced by σ; the output C of the channel attention module is input into the spatial attention, and the spatial attention is the spatial attention part of the CBAM attention module.
4. The visual attention mechanism based blind image quality assessment method of claim 1, wherein: in step 2, the saliency map and the just-noticeable distortion map are concatenated into a two-channel image, the saliency map is used as the spatial attention part for the just-noticeable distortion map, and the feature extraction network B is the network of step 1 with the spatial attention module removed.
5. The visual attention mechanism based blind image quality assessment method of claim 1, wherein: the dimension-reduction pooling module in step 3 comprises average pooling, 1×1 convolution, and SPP pooling; the low-level features extracted in step 1 are down-sampled by the average pooling, the number of channels is reduced by the 1×1 convolution, and finally the feature map is converted into a one-dimensional feature vector by the SPP pooling.
6. The visual attention mechanism based blind image quality assessment method of claim 1, wherein: step 4 comprises two feature fusions, in which the high-level features extracted in step 1 and step 2 are first concatenated along the channel dimension, the concatenated features are passed through adaptive average pooling to obtain a feature vector, and this feature vector is then concatenated with the feature vectors obtained in step 3 to form the features finally input into the quality regression network.
CN202310003353.6A 2023-01-03 2023-01-03 Blind image quality assessment method based on visual attention mechanism Pending CN116310375A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310003353.6A CN116310375A (en) 2023-01-03 2023-01-03 Blind image quality assessment method based on visual attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310003353.6A CN116310375A (en) 2023-01-03 2023-01-03 Blind image quality assessment method based on visual attention mechanism

Publications (1)

Publication Number Publication Date
CN116310375A true CN116310375A (en) 2023-06-23

Family

ID=86815744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310003353.6A Pending CN116310375A (en) 2023-01-03 2023-01-03 Blind image quality assessment method based on visual attention mechanism

Country Status (1)

Country Link
CN (1) CN116310375A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117611516A (en) * 2023-09-04 2024-02-27 北京智芯微电子科技有限公司 Image quality evaluation, face recognition, label generation and determination methods and devices


Similar Documents

Publication Publication Date Title
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
CN113688723B (en) Infrared image pedestrian target detection method based on improved YOLOv5
US20230186056A1 (en) Grabbing detection method based on rp-resnet
CN112634276A (en) Lightweight semantic segmentation method based on multi-scale visual feature extraction
CN110569851B (en) Real-time semantic segmentation method for gated multi-layer fusion
CN109874053A (en) The short video recommendation method with user's dynamic interest is understood based on video content
CN110751649B (en) Video quality evaluation method and device, electronic equipment and storage medium
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN111832453B (en) Unmanned scene real-time semantic segmentation method based on two-way deep neural network
CN112487949A (en) Learner behavior identification method based on multi-modal data fusion
CN113554032A (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN114120272A (en) Multi-supervision intelligent lane line semantic segmentation method fusing edge detection
CN111079864A (en) Short video classification method and system based on optimized video key frame extraction
CN114913493A (en) Lane line detection method based on deep learning
CN114781499B (en) Method for constructing ViT model-based intensive prediction task adapter
CN116310375A (en) Blind image quality assessment method based on visual attention mechanism
CN115035171A (en) Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion
CN115565043A (en) Method for detecting target by combining multiple characteristic features and target prediction method
CN112418032A (en) Human behavior recognition method and device, electronic equipment and storage medium
CN116310305A (en) Coding and decoding structure semantic segmentation model based on tensor and second-order covariance attention mechanism
US20220301106A1 (en) Training method and apparatus for image processing model, and image processing method and apparatus
CN117314787A (en) Underwater image enhancement method based on self-adaptive multi-scale fusion and attention mechanism
CN112149496A (en) Real-time road scene segmentation method based on convolutional neural network
CN116977200A (en) Processing method and device of video denoising model, computer equipment and storage medium
CN117151987A (en) Image enhancement method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination