CN114187261A - Non-reference stereo image quality evaluation method based on multi-dimensional attention mechanism - Google Patents
Non-reference stereo image quality evaluation method based on multi-dimensional attention mechanism
- Publication number
- CN114187261A (application number CN202111507792.8A)
- Authority
- CN
- China
- Prior art keywords
- image
- feature map
- network
- attention
- channel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T7/0002—Image analysis; Inspection of images, e.g. flaw detection
- G06N3/045—Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
- G06N3/047—Probabilistic or stochastic networks
- G06N3/048—Activation functions
- G06N3/08—Learning methods
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
- G06T2207/30168—Image quality inspection
Abstract
The invention relates to a non-reference stereo image quality evaluation method based on a multi-dimensional attention mechanism, which comprises the following steps: preprocessing the original stereo images used for training, converting them into grayscale images and dividing them into non-overlapping small image blocks, assigning the true quality score of the image to each image block, and randomly selecting a number of image blocks as the input of the network model; training a convolutional neural network based on multi-dimensional attention, which comprises: (1) extracting primary features from the left and right views using a group of CCP modules that perform convolution and pooling operations, processing the left and right views to obtain primary feature maps; (2) feeding the primary feature maps of the left and right views into a view fusion sub-network and calculating a fused feature map; (3) inputting the fused feature map into a multi-scale feature enhancement sub-network based on multi-dimensional attention and predicting the quality score of the image; (4) calculating the loss function of the network and performing iterative training; and (5) evaluating image quality using the trained network.
Description
Technical Field
The invention relates to the field of stereo image quality evaluation, and in particular to a no-reference evaluation algorithm that simulates binocular competition and the visual attention mechanism.
Background
Stereo image quality evaluation algorithms can be divided into subjective and objective evaluation according to the evaluating subject. A subjective evaluation algorithm requires test subjects to score image quality against a set of given indices in a controlled experimental environment, after which the average score of the image is calculated. Subjective evaluation usually yields a Mean Opinion Score (MOS) or a Differential Mean Opinion Score (DMOS). An objective evaluation algorithm simulates the human visual system by means of a mathematical model and then evaluates image quality. Since humans are the ultimate recipients of images, subjective evaluation is typically more accurate. However, subjective evaluation is time-consuming, cannot be performed in real time, is expensive, and is easily influenced by the subjects. Compared with subjective evaluation, objective quality evaluation based on an algorithm does not require extensive manual participation: once a suitable prediction model is designed, the quality score of an image can be obtained through stereo image feature extraction, model training and similar processes, so objective evaluation has become a research focus.
Objective Stereoscopic Image Quality Assessment (SIQA) methods can be classified into three types, Full Reference (FR), Reduced Reference (RR) and No Reference (NR), according to the degree to which they depend on a reference image. In practical environments a reference image may be unavailable or difficult to obtain, so NR-SIQA, which does not depend on a reference image, has a wider application range and is gradually becoming the mainstream research direction.
Early NR-SIQA applied mature 2D image quality evaluation algorithms directly to the individual views of a stereoscopic image and then expressed the quality score of the stereoscopic image as the average of the left- and right-view scores. However, these algorithms do not consider binocular vision characteristics and therefore cannot accurately evaluate the quality of a stereoscopic image. As the understanding of the human visual mechanism has deepened, methods based on disparity response and binocular vision characteristics have been proposed, and some methods further combine visual saliency models to better simulate human visual information processing. However, because of the hierarchical structure and complexity of the Human Visual System (HVS), the performance of current SIQA methods based on hand-crafted feature extraction remains unsatisfactory.
With the rise of deep learning, attempts have been made in recent years to solve the image quality evaluation problem with deep learning. Unlike hand-crafted feature extraction, deep learning methods typically use a Convolutional Neural Network (CNN) model to extract features automatically. Thanks to the large number of parameters and the self-learning capability of the network, CNN-based SIQA methods achieve accurate evaluation performance.
Disclosure of Invention
The invention provides a non-reference stereo image quality evaluation algorithm based on a multidimensional attention mechanism, which can better simulate binocular competition and the visual attention mechanism of a human visual system, and the technical scheme is as follows:
a non-reference stereo image quality evaluation method based on a multi-dimensional attention mechanism is characterized by comprising the following steps:
firstly, preprocessing the original stereo image used for training, converting it into a grayscale image and dividing it into non-overlapping small image blocks, assigning the true quality score of the image to each image block, and randomly selecting a number of image blocks as the input of the network model;
secondly, training a convolutional neural network based on multidimensional attention, wherein the method comprises the following steps:
(1) extracting primary features from the left and right views using a group of CCP modules that perform convolution and pooling operations, and processing the left and right views to obtain primary feature maps;
(2) feeding the primary feature maps of the left and right views into a view fusion sub-network and calculating a fused feature map: the view fusion sub-network comprises a multi-dimensional attention module consisting of a channel attention module and a spatial attention module; in the channel attention module, the input primary feature map passes through two identical branches; in each branch, channel dimensionality reduction is performed, followed by global average pooling, the weight of each channel is obtained through a fully connected layer and a Sigmoid activation function, and the feature map of each channel is weighted to obtain a channel-attention-weighted feature map; in the spatial attention module, the two parallel channel-attention-weighted feature maps are dimension-transformed and multiplied as matrices, the weight of each view combining channel and spatial attention is obtained through a Softmax activation function, and the primary feature maps of the left and right views are weighted with these weights to obtain a fused feature map;
(3) inputting the fused feature map into a multi-scale feature enhancement sub-network based on multi-dimensional attention and predicting the quality score of the image: the multi-dimensional attention-based multi-scale feature enhancement sub-network uses three groups of CCP modules performing convolution and pooling operations to extract feature maps of the dimension-transformed fused feature map at three different scales, the feature map at the smallest scale being called the original deep fused feature map; the feature map at each scale is input into a multi-dimensional attention module and undergoes channel dimensionality reduction to obtain three dimension-reduced feature maps; two of the dimension-reduced feature maps are channel-weighted by a channel attention module and then subjected to dimension transformation, matrix multiplication and a Softmax activation function to obtain a multi-dimensional attention weight, which is used to weight the third of the three dimension-reduced feature maps to obtain a feature map based on the multi-dimensional attention mechanism; the feature maps based on the multi-dimensional attention mechanism obtained at the three scales are fused by up-sampling, and three CCP modules perform feature extraction on the fused result to obtain a deep multi-scale feature enhancement map based on multi-dimensional attention; this map is added to the original deep fused feature map to obtain an enhanced feature map, which is fed into a fully connected layer to predict the image quality score;
(4) calculating the loss function of the network and performing iterative training: after the predicted image quality score is obtained, the loss function of the network is calculated; the loss function uses the Root Mean Square Error (RMSE) with an added L2 regularization term to prevent over-fitting, and measures the difference between the image quality score predicted by the network and the true quality score; through repeated iterations during training, the network parameters are continuously updated to minimize the loss function so that the predicted quality score approaches the true score, yielding a trained network model;
and thirdly, evaluating image quality using the trained network.
Wherein the CCP module comprises two 3 x 3 convolutional layers and one pooling layer.
Further, in the step (2) of the second step, channel dimensionality reduction is performed in each branch through a convolution kernel with the size of 1 × 1, then global average pooling is performed, the weight of each channel is obtained through a full connection layer and a Sigmoid activation function, and a feature map of each channel is weighted by using Scale operation to obtain a feature map weighted by channel attention.
Further, in the step (3) of the second step, the feature map on each scale is input into a multidimensional attention module, and the channel dimensionality reduction is performed through three parallel 1 × 1 convolution operations to obtain three dimensionality-reduced feature maps.
Further, the method of the third step is as follows: preprocessing the stereo image to be evaluated, inputting the preprocessed image into the network, and averaging the quality scores of the image blocks output by the network to obtain the quality score of the whole image.
The technical scheme provided by the invention has the following beneficial effects: the invention makes full use of the HVS, calculating weights for the left and right views through a multi-dimensional attention mechanism and using them to weight the two views into a fused view, thereby simulating the binocular fusion and binocular competition mechanisms of the HVS. By performing multi-scale feature extraction on the fused view and using multi-dimensional attention to enhance the features at different scales, so that weights are assigned to information at different scales, the visual attention mechanism of the HVS is simulated. These characteristics make the method usable in technical practice, for example in evaluating the transmission performance of new media such as 3D television and 3D movies; the evaluation results of the algorithm are highly consistent with subjective human evaluation, which is of significant value.
Drawings
FIG. 1 is the overall block diagram of the algorithm;
FIG. 2 is the block diagram of the multi-dimensional attention module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
The embodiment of the invention provides a non-reference stereo image quality evaluation algorithm based on a multi-dimensional attention mechanism, which is further explained below with reference to the accompanying drawings. The invention is realized through the following steps:
first, the original stereo image used for training is preprocessed.
The original stereo image used for training is converted into a grayscale image, and the left and right views are each divided into 220 non-overlapping 32 × 32 image blocks. Each image block is assigned the true quality score of the image, and 30% of the image blocks, namely 66 blocks, are randomly selected as the input of the network. This random selection reduces the time complexity of image preprocessing and improves the generalization ability of the network.
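As a concrete illustration, this preprocessing can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names, the use of NumPy, the random generator, and the choice to sample the same block positions in both views are illustrative and are not taken from the patent.

```python
import numpy as np

def extract_patches(gray_view: np.ndarray, patch_size: int = 32) -> np.ndarray:
    """Split a grayscale view (H x W) into non-overlapping patch_size x patch_size blocks."""
    h, w = gray_view.shape
    h_crop, w_crop = h - h % patch_size, w - w % patch_size
    patches = (gray_view[:h_crop, :w_crop]
               .reshape(h_crop // patch_size, patch_size, w_crop // patch_size, patch_size)
               .swapaxes(1, 2)
               .reshape(-1, patch_size, patch_size))
    return patches

def sample_training_blocks(left_gray, right_gray, dmos, ratio=0.3, rng=None):
    """Randomly keep 30% of the block positions (assumed identical in both views);
    every kept block inherits the image-level quality score."""
    rng = rng or np.random.default_rng()
    left_p, right_p = extract_patches(left_gray), extract_patches(right_gray)
    n_keep = int(len(left_p) * ratio)                 # e.g. 66 of 220 blocks
    idx = rng.choice(len(left_p), size=n_keep, replace=False)
    labels = np.full(n_keep, dmos, dtype=np.float32)  # each block gets the image's true score
    return left_p[idx], right_p[idx], labels
```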
In a second step, a multi-dimensional attention-based convolutional neural network is trained; the network comprises a multi-dimensional attention-based view fusion sub-network and a multi-scale feature enhancement sub-network. The view fusion sub-network contains a multi-dimensional attention module that calculates weights for the left and right views and uses them to weight the two views into a fused view. The fused feature map then passes through the multi-dimensional attention-based multi-scale feature enhancement sub-network, where multi-scale extraction and attention weighting yield an enhanced feature map, and finally a fully connected layer produces the quality score of the image.
(1) Primary features are extracted from the left and right views using a set of CCP modules that perform convolution and pooling operations.
The left and right views are initially processed using the CCP module to obtain 16 × 16 × 32 primary feature maps, where the CCP module includes two 3 × 3 convolutional layers and one pooling layer.
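A minimal PyTorch sketch of one CCP module follows. The patent only fixes the structure (two 3 × 3 convolutional layers and one pooling layer) and the resulting 16 × 16 × 32 size; the padding, the ReLU activations, and the use of max pooling are assumptions made here for illustration.

```python
import torch
import torch.nn as nn

class CCP(nn.Module):
    """Two 3x3 convolutions followed by one pooling layer (convolution-convolution-pooling)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),   # halves the spatial resolution
        )

    def forward(self, x):
        return self.block(x)

# a 32x32 grayscale patch becomes a 16x16x32 primary feature map, matching the sizes in the text
ccp = CCP(in_ch=1, out_ch=32)
primary = ccp(torch.randn(1, 1, 32, 32))   # shape: (1, 32, 16, 16)
```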
(2) The primary feature maps of the left and right views are fed into the view fusion sub-network, and the fused feature map is calculated.
The network structure of the multi-dimensional attention-based view fusion sub-network is shown in FIG. 1. The sub-network consists of two parts: a channel attention module and a spatial attention module. In the channel attention module, the input primary feature map passes through two identical branches. In each branch, channel dimensionality reduction is performed with a 1 × 1 convolution kernel, a 1 × 1 × 16 feature map is obtained by global average pooling, the weight of each channel is obtained through two fully connected layers and a Sigmoid activation function, and the feature map of each channel is weighted by a Scale operation to obtain a 16 × 16 × 16 channel-attention-weighted feature map. In the spatial attention module, the two parallel channel-attention-weighted feature maps are dimension-transformed and multiplied as matrices, and weights for the left and right views combining channel and spatial attention are obtained through a Softmax activation function. The primary feature maps of the left and right views are weighted and fused with these weights, thereby simulating the binocular fusion and binocular competition mechanisms of the HVS. The calculation formula is as follows:
FM_C = W_L ⊗ FM_L ⊕ W_R ⊗ FM_R    (1)

where FM_C, FM_L and FM_R denote the fused feature map, the left-view primary feature map and the right-view primary feature map respectively, W_L and W_R are the left- and right-view weights calculated by the multi-dimensional attention mechanism, and ⊕ and ⊗ denote matrix addition and matrix multiplication respectively.
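The following PyTorch sketch shows one possible reading of this view fusion sub-network. The patent specifies the ingredients (1 × 1 channel reduction to 16 channels, global average pooling, two fully connected layers with a Sigmoid, a Scale operation, then dimension transformation, matrix multiplication and Softmax), but not the exact tensor layout of the spatial attention step, so the reshape/bmm arrangement and the class names below are assumptions rather than the patent's definitive design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionBranch(nn.Module):
    """1x1 reduction (32->16 channels), global average pooling, two FC layers + Sigmoid, Scale."""
    def __init__(self, in_ch: int = 32, red_ch: int = 16):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, red_ch, kernel_size=1)
        self.fc = nn.Sequential(nn.Linear(red_ch, red_ch), nn.ReLU(inplace=True),
                                nn.Linear(red_ch, red_ch), nn.Sigmoid())

    def forward(self, x):
        r = self.reduce(x)                                   # (B, 16, 16, 16)
        w = self.fc(r.mean(dim=(2, 3)))                      # GAP -> channel weights (B, 16)
        return r * w[:, :, None, None]                       # Scale: channel-attention-weighted map

class ViewFusion(nn.Module):
    """Fuses the left and right primary feature maps: FM_C = W_L (x) FM_L (+) W_R (x) FM_R."""
    def __init__(self, in_ch: int = 32, red_ch: int = 16):
        super().__init__()
        self.branch_l = ChannelAttentionBranch(in_ch, red_ch)
        self.branch_r = ChannelAttentionBranch(in_ch, red_ch)

    def forward(self, fm_l, fm_r):                           # primary feature maps (B, 32, 16, 16)
        a_l = self.branch_l(fm_l).flatten(2)                 # (B, 16, N), N = 256
        a_r = self.branch_r(fm_r).flatten(2)
        sim = torch.bmm(a_l.transpose(1, 2), a_r)            # cross-view similarity (B, N, N), assumed layout
        w_l = F.softmax(sim, dim=-1)                         # spatial weight for the left view
        w_r = F.softmax(sim.transpose(1, 2), dim=-1)         # spatial weight for the right view
        fused = torch.bmm(fm_l.flatten(2), w_l) + torch.bmm(fm_r.flatten(2), w_r)
        return fused.view_as(fm_l)                           # fused feature map (B, 32, 16, 16)
```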
(3) The fused feature map is input into the multi-dimensional attention-based multi-scale feature enhancement sub-network, and the quality score of the image is predicted.
The network structure of the multi-dimensional attention-based multi-scale feature enhancement sub-network is shown in FIG. 1. Through three groups of convolution and pooling (CCP) operations, the sub-network extracts feature maps at three scales from the dimension-transformed 16 × 16 × 32 fused feature map, namely 8 × 8 × 64, 4 × 4 × 128 and 2 × 2 × 256; the 2 × 2 × 256 feature map is referred to as the original deep feature map. The feature map at each scale is passed through a multi-dimensional attention module, whose structure is shown in FIG. 2.
In this module, the input feature map undergoes channel dimensionality reduction through three parallel 1 × 1 convolutions, yielding three dimension-reduced feature maps. Two of them are channel-weighted by a channel attention module, dimension-transformed and multiplied as matrices, and passed through a Softmax activation function to obtain the multi-dimensional attention weight; this weight is used to weight the third of the three dimension-reduced feature maps, and a 1 × 1 convolution then produces the feature map based on the multi-dimensional attention mechanism.
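A hedged PyTorch sketch of this multi-dimensional attention module is given below. The text lists the operations (three parallel 1 × 1 reductions, channel attention on two of the reduced maps, dimension transformation, matrix multiplication and Softmax, weighting of the third map, and a final 1 × 1 convolution), but not the tensor shapes; the non-local-style layout, the squeeze-and-excitation realization of the channel attention, and the reduction ratio are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelGate(nn.Module):
    """Squeeze-and-excitation-style channel weighting (an assumed realization of the channel attention)."""
    def __init__(self, ch: int, ratio: int = 4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch // ratio), nn.ReLU(inplace=True),
                                nn.Linear(ch // ratio, ch), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x.mean(dim=(2, 3)))[:, :, None, None]

class MultiDimAttention(nn.Module):
    def __init__(self, in_ch: int, red_ch: int = None):
        super().__init__()
        red_ch = red_ch or in_ch // 2
        self.q = nn.Conv2d(in_ch, red_ch, 1)     # three parallel 1x1 reductions
        self.k = nn.Conv2d(in_ch, red_ch, 1)
        self.v = nn.Conv2d(in_ch, red_ch, 1)
        self.gate_q, self.gate_k = ChannelGate(red_ch), ChannelGate(red_ch)
        self.out = nn.Conv2d(red_ch, in_ch, 1)   # final 1x1 convolution back to the input width

    def forward(self, x):
        b, _, h, w = x.shape
        q = self.gate_q(self.q(x)).flatten(2)                       # channel-weighted, (B, C', N)
        k = self.gate_k(self.k(x)).flatten(2)
        v = self.v(x).flatten(2)                                     # third, un-gated reduced map
        attn = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)    # multi-dimensional attention weight (B, N, N)
        y = torch.bmm(v, attn).view(b, -1, h, w)                     # weight the third map with the attention
        return self.out(y)
```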
Feature maps of size 8 × 8 × 64, 4 × 4 × 128 and 2 × 2 × 256 based on the multi-dimensional attention mechanism are thus obtained at the three scales, and the three maps are fused by up-sampling into a 16 × 16 × 32 attention-weighted feature map, as sketched below. This operation assigns corresponding weights to feature maps of different scales, simulating the degree of attention the HVS pays to objects of different sizes in the image. Three CCP modules then perform deep feature extraction on the attention-weighted feature map; the result is added to the original deep feature map to obtain an enhanced 2 × 2 × 256 feature map, and a fully connected layer outputs the predicted image quality score.
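The multi-scale fusion and enhancement path might be realized as in the sketch below. How the three attention-weighted maps are brought to a common 16 × 16 × 32 size is not spelled out in the text, so the 1 × 1 projections, bilinear up-sampling, element-wise summation, and the sizes of the fully connected head are assumptions; CCP is the module sketched after step (1).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleEnhance(nn.Module):
    """Fuses the attention-weighted maps from the three scales, extracts deep features with
    three CCP modules, adds the original deep feature map, and predicts the patch score."""
    def __init__(self):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, 32, 1) for c in (64, 128, 256)])   # assumed projections
        self.ccp = nn.Sequential(CCP(32, 64), CCP(64, 128), CCP(128, 256))          # CCP from the earlier sketch
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(2 * 2 * 256, 512),
                                  nn.ReLU(inplace=True), nn.Linear(512, 1))         # assumed FC sizes

    def forward(self, attn_maps, deep_fm):   # attn_maps: [(B,64,8,8), (B,128,4,4), (B,256,2,2)]
        fused = sum(F.interpolate(p(m), size=(16, 16), mode='bilinear', align_corners=False)
                    for p, m in zip(self.proj, attn_maps))    # attention-weighted 16x16x32 map
        enhanced = self.ccp(fused) + deep_fm                   # add deep features to the original deep map
        return self.head(enhanced)                             # predicted quality score of the patch
```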
(4) A loss function of the network is calculated.
During network training, the loss function of the network model uses the Root Mean Square Error (RMSE) with an added L2 regularization term to prevent over-fitting, and is calculated as follows:
L = √( (1/N) · Σ_{i=1}^{N} (q_i − q̂_i)² ) + α‖ω‖₂²    (2)

where N denotes the number of image blocks, q_i denotes the true Differential Mean Opinion Score (DMOS) of the image, and q̂_i denotes the value predicted by the network model; the second part of the formula is the L2 regularization term, α is the regularization coefficient, and ω is the weight vector of the network being trained. The loss function reflects the difference between the prediction and the true value; through repeated iterations during training, the network parameters are continuously updated to minimize the loss function, so that the quality score predicted by the network approaches the true score and the network performs better.
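A minimal sketch of this loss in PyTorch, assuming the L2 term is taken over all trainable weights and that α is a small constant chosen by the user (the patent does not give its value):

```python
import torch

def rmse_l2_loss(pred, target, model, alpha=1e-4):
    """RMSE between predicted and true block scores plus an L2 penalty on the weights, as in (2)."""
    rmse = torch.sqrt(torch.mean((pred - target) ** 2))     # data term
    l2 = sum((p ** 2).sum() for p in model.parameters())    # ||w||_2^2 over the network weights
    return rmse + alpha * l2
```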
In the third step, image quality is evaluated using the trained network.
(1) The stereo image to be evaluated is preprocessed.
The stereo image to be evaluated is converted into a grayscale image, the left and right views are each divided into 220 non-overlapping 32 × 32 image blocks, and 30% of the blocks, namely 66 blocks, are selected as the input of the network.
(2) The quality score of the stereo image to be evaluated is calculated.
The image block quality scores of the same stereo image output by the network are averaged to obtain the quality score of the whole stereo image, and the calculation formula is as follows:
Q = (1/N) · Σ_{i=1}^{N} q̂_i    (3)

where Q represents the quality score of the entire stereoscopic image, N the number of sampled image blocks and q̂_i the predicted score of the i-th block. The SROCC and PLCC are then obtained from the final predictions and the true DMOS values of the stereo images in order to evaluate the network performance.
The parameters of the whole network are detailed in table 1.
TABLE 1 network architecture parameters
Example 3
The feasibility of the scheme of Example 1 is verified in conjunction with specific experiments, as described in detail below:
this experiment used LIVE 3D, two public 3D image databases of hydroloo IVC 3D to test the performance. Each database contains a number of images with different distortion types. The quality of the image is described by Mean Opinion Scores (MOSs) or Differential Mean Opinion Scores (DMOS), where a larger MOS value indicates a better image quality and a lower DMOS value indicates a better image quality.
To measure the accuracy, monotonicity and consistency of an objective evaluation algorithm, two common indices are generally adopted: the Spearman rank-order correlation coefficient (SROCC) and the Pearson linear correlation coefficient (PLCC). SROCC describes the monotonicity of an image quality assessment algorithm, and its expression is as follows:
in equation 4, the parameter diRepresenting the difference between the objective score of the ith image and its subjective quality score ranking. I then represents the total number of images contained in the database. PLCC is a linear correlation coefficient between objective scores obtained by an algorithm and subjective quality scores of images after nonlinear regression processing, and the calculation formula is as follows:
in the formula 5, qiAnd SFiRespectively representing the subjective score and the predicted value, mu, of the ith imageqAndrespectively, represent the mean of the two. Both correlation coefficients have values in the range of-1 to 1, with larger values indicating better network performance.
The Spearman rank-order correlation coefficient and the Pearson linear correlation coefficient are used to measure the consistency between the scores of the objective quality evaluation algorithm and the subjective DMOS scores in the database. The higher the correlation between the subjective and objective scores, the better the performance of the algorithm.
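For reference, the two indices can be computed with SciPy as in the sketch below. The patent applies PLCC after a nonlinear regression step; that fitting is omitted here, so this is a simplified illustration rather than the exact evaluation protocol.

```python
import numpy as np
from scipy import stats

def evaluate(pred_scores: np.ndarray, dmos: np.ndarray):
    """SROCC and PLCC between predicted image scores and subjective DMOS values."""
    srocc, _ = stats.spearmanr(pred_scores, dmos)
    plcc, _ = stats.pearsonr(pred_scores, dmos)
    return srocc, plcc
```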
To verify the performance of the invention, 7 mainstream no-reference stereo image quality evaluation algorithms are selected for comparison on the LIVE 3D database. These include 4 traditional evaluation algorithms (3D-AdaBoost, BVCDP, BSFML and SA) and 3 CNN-based algorithms (DCNN, RM-CNN and VSM-CNN). On the Waterloo IVC 3D database, since few deep-learning-based evaluation algorithms have been tested on it, 4 traditional evaluation algorithms are chosen for comparison: SINQ, DBN, BSIQE and BVCDP. The results are shown in Tables 2 and 3.
TABLE 2 LIVE 3D image database-based algorithmic Performance comparison
TABLE 3 comparison of Algorithm Performance based on Waterloo IVC 3D image database
TABLE 4 specific distortion type Performance comparison based on LIVE 3D Phase I image database
TABLE 5 specific distortion type Performance comparison based on LIVE 3D Phase II image database
Tables 4 and 5 show the results of the invention for specific distortion types on the LIVE 3D Phase I and LIVE 3D Phase II databases. In each column, the best result is shown in bold. As can be seen from Tables 4 and 5, the invention outperforms all compared methods in both SROCC and PLCC on images of several distortion types, and its mean SROCC and PLCC over these specific distortion types exceed 0.9. Overall, compared with other networks, the method adapts to a variety of distortion conditions and shows high consistency with subjective human evaluation.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (5)
1. A non-reference stereo image quality evaluation method based on a multi-dimensional attention mechanism is characterized by comprising the following steps:
firstly, preprocessing the original stereo image used for training, converting it into a grayscale image and dividing it into non-overlapping small image blocks, assigning the true quality score of the image to each image block, and randomly selecting a number of image blocks as the input of the network model;
secondly, training a convolutional neural network based on multi-dimensional attention, which comprises the following steps:
(1) extracting primary features from the left and right views using a group of CCP modules that perform convolution and pooling operations, and processing the left and right views to obtain primary feature maps;
(2) feeding the primary feature maps of the left and right views into a view fusion sub-network and calculating a fused feature map: the view fusion sub-network comprises a multi-dimensional attention module consisting of a channel attention module and a spatial attention module; in the channel attention module, the input primary feature map passes through two identical branches; in each branch, channel dimensionality reduction is performed, followed by global average pooling, the weight of each channel is obtained through a fully connected layer and a Sigmoid activation function, and the feature map of each channel is weighted to obtain a channel-attention-weighted feature map; in the spatial attention module, the two parallel channel-attention-weighted feature maps are dimension-transformed and multiplied as matrices, the weight of each view combining channel and spatial attention is obtained through a Softmax activation function, and the primary feature maps of the left and right views are weighted with these weights to obtain a fused feature map;
(3) inputting the fused feature map into a multi-scale feature enhancement sub-network based on multi-dimensional attention and predicting the quality score of the image: the multi-dimensional attention-based multi-scale feature enhancement sub-network uses three groups of CCP modules performing convolution and pooling operations to extract feature maps of the dimension-transformed fused feature map at three different scales, the feature map at the smallest scale being called the original deep fused feature map; the feature map at each scale is input into a multi-dimensional attention module and undergoes channel dimensionality reduction to obtain three dimension-reduced feature maps; two of the dimension-reduced feature maps are channel-weighted by a channel attention module and then subjected to dimension transformation, matrix multiplication and a Softmax activation function to obtain a multi-dimensional attention weight, which is used to weight the third of the three dimension-reduced feature maps to obtain a feature map based on the multi-dimensional attention mechanism; the feature maps based on the multi-dimensional attention mechanism obtained at the three scales are fused by up-sampling, and three CCP modules perform feature extraction on the fused result to obtain a deep multi-scale feature enhancement map based on multi-dimensional attention; this map is added to the original deep fused feature map to obtain an enhanced feature map, which is fed into a fully connected layer to predict the image quality score;
(4) calculating the loss function of the network and performing iterative training: after the predicted image quality score is obtained, the loss function of the network is calculated; the loss function uses the Root Mean Square Error (RMSE) with an added L2 regularization term to prevent over-fitting, and measures the difference between the image quality score predicted by the network and the true quality score; through repeated iterations during training, the network parameters are continuously updated to minimize the loss function so that the predicted quality score approaches the true score, yielding a trained network model;
and thirdly, evaluating image quality using the trained network.
2. The method of claim 1, wherein the CCP module includes two 3 × 3 convolutional layers and one pooling layer.
3. The method for evaluating the quality of the non-reference stereo image according to claim 1, wherein in the step (2) of the second step, channel dimensionality reduction is performed in each branch through a convolution kernel with the size of 1 x 1, then global average pooling is performed, the weight of each channel is obtained through a full connection layer and a Sigmoid activation function, and a Scale operation is used for weighting the feature map of each channel to obtain a channel attention weighted feature map.
4. The method for evaluating the quality of the non-reference stereo image according to claim 1, wherein in the step (3) of the second step, the feature map on each scale is input into a multi-dimensional attention module, and the three feature maps after dimension reduction are obtained by performing channel dimension reduction through three parallel 1 x 1 convolution operations.
5. The method for evaluating the quality of a reference-free stereoscopic image according to claim 1, wherein the third step comprises: preprocessing the stereo image to be evaluated, inputting the preprocessed image into the network, and averaging the quality scores of the image blocks output by the network to obtain the quality score of the whole image.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111487714 | 2021-12-07 | ||
CN2021114877146 | 2021-12-07 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114187261A true CN114187261A (en) | 2022-03-15 |
CN114187261B CN114187261B (en) | 2024-08-27 |
Family
ID=80543149
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111507792.8A Active CN114187261B (en) | 2021-12-07 | 2021-12-10 | Multi-dimensional attention mechanism-based non-reference stereoscopic image quality evaluation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114187261B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110060236A (en) * | 2019-03-27 | 2019-07-26 | 天津大学 | Stereo image quality evaluation method based on depth convolutional neural networks |
CN112183645A (en) * | 2020-09-30 | 2021-01-05 | 深圳龙岗智能视听研究院 | Image aesthetic quality evaluation method based on context-aware attention mechanism |
CN112634238A (en) * | 2020-12-25 | 2021-04-09 | 武汉大学 | Image quality evaluation method based on attention module |
CN112884682A (en) * | 2021-01-08 | 2021-06-01 | 福州大学 | Stereo image color correction method and system based on matching and fusion |
CN113706386A (en) * | 2021-09-04 | 2021-11-26 | 大连钜智信息科技有限公司 | Super-resolution reconstruction method based on attention mechanism |
Non-Patent Citations (1)
Title |
---|
富振奇;费延佳;杨艳;邵枫;: "基于深层特征学习的无参考立体图像质量评价", 光电子・激光, no. 05, 15 May 2018 (2018-05-15) * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114897854A (en) * | 2022-05-20 | 2022-08-12 | 辽宁大学 | No-reference stereo image quality evaluation method based on double-current interactive network |
CN115272776A (en) * | 2022-09-26 | 2022-11-01 | 山东锋士信息技术有限公司 | Hyperspectral image classification method based on double-path convolution and double attention and storage medium |
CN115272776B (en) * | 2022-09-26 | 2023-01-20 | 山东锋士信息技术有限公司 | Hyperspectral image classification method based on double-path convolution and double attention and storage medium |
CN115661911A (en) * | 2022-12-23 | 2023-01-31 | 四川轻化工大学 | Face feature extraction method, device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN114187261B (en) | 2024-08-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114187261B (en) | Multi-dimensional attention mechanism-based non-reference stereoscopic image quality evaluation method | |
CN111182292B (en) | No-reference video quality evaluation method and system, video receiver and intelligent terminal | |
CN109360178B (en) | Fusion image-based non-reference stereo image quality evaluation method | |
CN110060236B (en) | Stereoscopic image quality evaluation method based on depth convolution neural network | |
CN110728656A (en) | Meta-learning-based no-reference image quality data processing method and intelligent terminal | |
CN110516716B (en) | No-reference image quality evaluation method based on multi-branch similarity network | |
CN109872305B (en) | No-reference stereo image quality evaluation method based on quality map generation network | |
CN110570363A (en) | Image defogging method based on Cycle-GAN with pyramid pooling and multi-scale discriminator | |
Yue et al. | Blind stereoscopic 3D image quality assessment via analysis of naturalness, structure, and binocular asymmetry | |
CN108389192A (en) | Stereo-picture Comfort Evaluation method based on convolutional neural networks | |
CN111429402B (en) | Image quality evaluation method for fusion of advanced visual perception features and depth features | |
CN108235003B (en) | Three-dimensional video quality evaluation method based on 3D convolutional neural network | |
CN112489164B (en) | Image coloring method based on improved depth separable convolutional neural network | |
CN112767385B (en) | No-reference image quality evaluation method based on significance strategy and feature fusion | |
CN112419242A (en) | No-reference image quality evaluation method based on self-attention mechanism GAN network | |
CN109816646B (en) | Non-reference image quality evaluation method based on degradation decision logic | |
Si et al. | A no-reference stereoscopic image quality assessment network based on binocular interaction and fusion mechanisms | |
CN109859166A (en) | It is a kind of based on multiple row convolutional neural networks without ginseng 3D rendering method for evaluating quality | |
CN113554599A (en) | Video quality evaluation method based on human visual effect | |
CN115205196A (en) | No-reference image quality evaluation method based on twin network and feature fusion | |
Jiang et al. | Stereoscopic image quality assessment by learning non-negative matrix factorization-based color visual characteristics and considering binocular interactions | |
CN111667407A (en) | Image super-resolution method guided by depth information | |
CN114066812B (en) | No-reference image quality evaluation method based on spatial attention mechanism | |
CN114972232A (en) | No-reference image quality evaluation method based on incremental meta-learning | |
CN106022362A (en) | Reference-free image quality objective evaluation method for JPEG2000 compression distortion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |