CN108391121B - No-reference stereo image quality evaluation method based on deep neural network - Google Patents


Info

Publication number
CN108391121B
Authority
CN
China
Prior art keywords
distorted
image
neural network
deep neural
quality
Prior art date
Legal status
Active
Application number
CN201810375052.5A
Other languages
Chinese (zh)
Other versions
CN108391121A (en
Inventor
陈志波
周玮
李卫平
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201810375052.5A priority Critical patent/CN108391121B/en
Publication of CN108391121A publication Critical patent/CN108391121A/en
Application granted granted Critical
Publication of CN108391121B publication Critical patent/CN108391121B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • H04N17/00 Diagnosis, testing or measuring for television systems or their details
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T2207/10004 Still image; Photographic image
    • G06T2207/10012 Stereo images
    • G06T2207/20021 Dividing image into blocks, subimages or windows
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30168 Image quality inspection
    • H04N2013/0074 Stereoscopic image analysis


Abstract

The invention discloses a no-reference stereo image quality evaluation method based on a deep neural network. Motivated by the interaction between left-right view fusion and disparity information in the multilayer structure of the human visual system, the method simultaneously feeds pairs of left-view and right-view distorted image blocks into an end-to-end dual-stream interactive deep neural network, and combines discriminative feature extraction and regression learning into a single end-to-end optimization process, thereby effectively predicting the perceptual quality of distorted stereo images.

Description

No-reference stereo image quality evaluation method based on deep neural network
Technical Field
The invention relates to the technical field of deep learning, in particular to a no-reference stereo image quality evaluation method based on a deep neural network.
Background
With the rapid popularization and development of 3D multimedia technologies such as 3D movies, 3D stereoscopic images have gradually entered people's daily lives, and viewing them can create an immersive visual experience that 2D images lack. At the same time, because of the additional depth perception and the asymmetric distortion that may exist between the left-view and right-view images, 3D stereoscopic image quality evaluation is more challenging: a more complex binocular vision mechanism must be considered when designing a 3D stereoscopic image quality evaluation algorithm.
Early full-reference 3D stereoscopic image quality evaluation algorithms were derived from 2D image quality evaluation methods, such as article [1] (A. Benoit, P. Le Callet, P. Campisi, and R. Cousseau. Quality assessment of stereoscopic images. EURASIP Journal on Image and Video Processing, 2008(1):659024, 2009) and article [2] (J. You, L. Xing, A. Perkis, and X. Wang. Perceptual quality assessment for stereoscopic images based on 2D image quality metrics and disparity analysis. In Proc. of International Workshop on Video Processing and Quality Metrics, Scottsdale, AZ, USA, 2010), which apply 2D quality metrics to the left and right views and combine them with a disparity or depth evaluation to obtain the final stereo image quality. However, such methods designed for 2D images are not well suited to stereoscopic image quality evaluation, because a binocular vision mechanism needs to be considered when evaluating stereoscopic image quality.
Later, more sophisticated algorithms proposed that the binocular visual characteristics of the human visual system need to be considered, such as the contrast masking effect [3] (P. Gorley and N. Holliman. Stereoscopic image quality metrics and compression. In Proc. SPIE, volume 6803, page 680305, 2008) and binocular combination behavior [4] (Y. H. Lin and J. L. Wu. Quality assessment of stereoscopic 3D image compression by binocular integration behaviors. IEEE Transactions on Image Processing, 23(4):1547, 2014).
Since the original undistorted stereo image usually cannot be obtained in practical scenarios, no-reference 3D stereo image quality evaluation algorithms are needed. These include distortion-specific and general-purpose quality evaluation algorithms; they mainly extract hand-crafted discriminative features from the distorted stereo image based on human visual characteristics, natural scene statistics, and so on, and then evaluate the perceptual quality of the stereo image through a regression learning model such as Support Vector Regression (SVR). Among them, general-purpose no-reference stereo image quality evaluation algorithms such as article [5] (M. J. Chen, L. K. Cormack, and A. C. Bovik. No-reference quality assessment of natural stereopairs. IEEE Transactions on Image Processing, 22(9):3379-3391, 2013) are more widely applicable.
Regarding the application of deep learning to quality evaluation, most existing work applies deep learning to 2D image quality evaluation. According to the training strategy, these techniques are mainly divided into training based on image blocks and training based on whole images. Patch-based methods divide the original image into smaller image blocks and then regress the perceptual quality of each input block through a deep neural network, such as the no-reference convolutional neural network (NR-CNN) [6] (L. Kang, P. Ye, Y. Li, and D. Doermann. Convolutional neural networks for no-reference image quality assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1733-1740, 2014). Whole-image-based methods instead aggregate the features of the image blocks, or the prediction score of each block, to evaluate the quality of the whole image.
However, the performance of both article [5] and the no-reference convolutional neural network (NR-CNN) approach [6] remains limited.
Disclosure of Invention
The invention aims to provide a no-reference stereo image quality evaluation method based on a deep neural network.
The purpose of the invention is realized by the following technical scheme:
a no-reference stereo image quality evaluation method based on a deep neural network comprises the following steps:
dividing the left-view and right-view images that make up all distorted stereo images into non-overlapping distorted image blocks, obtaining a number of left-right view distorted image block pairs;
inputting the left-right view distorted image block pairs obtained by division into a pre-constructed dual-stream input interactive deep neural network, training the network model, and obtaining the one-dimensional perceptual quality of the corresponding left-right view distorted image block pairs through regression learning;
when predicting the quality of a distorted stereo image, predicting the quality of each left-right view distorted image block pair with the trained network model, and then averaging the qualities of all left-right view distorted image block pairs in each distorted stereo image to obtain the quality of that distorted stereo image.
According to the technical scheme provided by the invention, by considering the interaction between left-right view fusion and disparity information in the multilayer structure of the human visual system, the end-to-end dual-stream interactive deep neural network for no-reference 3D stereo image quality evaluation takes the left and right distorted image blocks as simultaneous inputs and combines discriminative feature extraction and regression learning into a single end-to-end optimization process, thereby effectively predicting the perceptual quality of distorted stereo images.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a flowchart of a method for evaluating quality of a non-reference stereo image based on a deep neural network according to an embodiment of the present invention;
fig. 2 is a flowchart of an embodiment of a method for evaluating quality of a reference-free stereo image based on a deep neural network according to the present invention;
FIG. 3 is a schematic diagram of interaction of a multi-layer network in a dual-stream input interactive deep neural network according to an embodiment of the present invention;
FIG. 4 is a distorted stereo image and its corresponding fusion image and difference image according to an embodiment of the present invention;
fig. 5 is a schematic diagram illustrating that the distortion quality of a local image block can be effectively evaluated according to the present invention;
fig. 6 is a process of optimizing the loss function for all distortions on the LIVE database according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a no-reference stereo image quality evaluation method based on a deep neural network, which mainly comprises the following steps as shown in figure 1:
step 1, dividing left and right view images forming all distorted stereo images into non-overlapping distorted image blocks respectively to obtain a plurality of left and right view distorted image block pairs.
In the embodiment of the invention, the non-overlapping distorted image blocks may be of size 32x32; obtaining a large number of left-right view distorted image block pairs effectively increases the amount of training data.
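As a minimal sketch of the block division in step 1 (assuming, as one possible choice, that edge rows and columns which do not fill a whole 32x32 block are discarded; the patent does not specify edge handling):

```python
import numpy as np

def split_into_patches(image, patch=32):
    """Split an H x W image into non-overlapping patch x patch blocks.

    Edge rows/columns that do not fill a whole block are discarded
    (one assumption; the patent does not state how edges are handled).
    """
    h, w = image.shape[:2]
    blocks = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            blocks.append(image[y:y + patch, x:x + patch])
    return blocks

# A 96x64 left-view image yields (96 // 32) * (64 // 32) = 6 blocks;
# the right view would be split identically to form block pairs.
left = np.zeros((96, 64))
patches = split_into_patches(left)
```

The same function applied to the right view, with blocks matched by position, produces the left-right view distorted image block pairs used for training.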
In addition, in the embodiment of the present invention, the quality of the distorted stereo image is used as the training-sample label value of all left-right view distorted image block pairs obtained by dividing that distorted stereo image.
Step 2, inputting the left-right view distorted image block pairs obtained by division into the pre-constructed dual-stream input interactive deep neural network, training the network model, and obtaining the one-dimensional perceptual quality of the corresponding left-right view distorted image block pairs through regression learning.
Because the human visual cortex has a layered structure and the interaction between left and right vision occurs across multiple visual areas, the left and right sub-networks in the proposed framework interact at several convolution layers: the fusion image and the difference image computed from the corresponding feature maps are concatenated, representing the binocular fusion capability and the disparity information respectively. Finally, the fully connected layers of the two sub-networks are also interactively connected, so that the one-dimensional perceptual quality of the input distorted image block pair is obtained through regression learning.
Step 3, when predicting the quality of a distorted stereo image, predicting the quality of each left-right view distorted image block pair with the trained network model, and averaging the qualities of all left-right view distorted image block pairs in each distorted stereo image to obtain the quality of each distorted stereo image.
Because the distortion is assumed to be spatially uniform, in the embodiment of the present invention the quality of the distorted stereo image to be predicted may be computed by averaging the qualities of all left-right view distorted image block pairs in that image.
Those skilled in the art will understand that, after the network model has been trained with the left-right view distorted image blocks obtained from the set of distorted stereo images, the trained model can evaluate the quality of any distorted stereo image to be predicted, which may belong to that set or not; the left-right view distorted image block pairs mentioned in step 3 are obtained by dividing the image to be predicted in the same manner as in step 1.
For the sake of understanding, the following description is further made with reference to the accompanying drawings.
Fig. 2 shows the flow of the method of the present invention. The dual-stream input interactive deep neural network framework takes the left and right distorted stereo image block pairs as separate inputs, learns feature representations through end-to-end training, and regresses a perceptual quality score.
In an embodiment of the present invention, the dual-stream input interactive deep neural network comprises two identical sub-networks, which respectively receive the input left-view and right-view distorted image blocks, i.e., the visual information streams of the left and right views. Each sub-network is obtained by adapting AlexNet (A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105) and consists of a first convolution layer, a second convolution layer, a third group of convolution layers (i.e., the third, fourth and fifth convolution layers), and first and second fully connected layers. The second convolution layer, the fifth convolution layer (i.e., the last convolution layer of the third group) and the second fully connected layer are the points at which the two sub-networks fuse and interact, finally yielding one quality score.
During training, the divided left and right distorted image blocks are respectively input to the two sub-networks; the parameters are updated by forward propagation followed by backward propagation, minimizing the mean squared error between the predicted value and the true value, so that the network parameters are continuously updated through learning iterations.
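As a rough illustration of this forward/backward optimization loop, the sketch below fits a toy linear regressor (standing in for the dual-stream network) to quality labels by gradient descent on the mean squared error; all shapes, the learning rate and the data are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for f(P_l, P_r; w): a linear map from flattened patch
# features to one quality score. Shapes and values are illustrative only.
X = rng.normal(size=(64, 8))   # 64 block pairs, 8 features each
true_w = rng.normal(size=8)
y = X @ true_w                 # label: the image-level quality score

w = np.zeros(8)
lr = 0.05
for _ in range(2000):
    pred = X @ w                              # forward propagation
    grad = 2 * X.T @ (pred - y) / len(y)      # gradient of the MSE loss
    w -= lr * grad                            # backward parameter update

mse = float(np.mean((X @ w - y) ** 2))        # should approach zero
```

In the patent's actual scheme the regressor is the dual-stream interactive network and the gradients are obtained by backpropagation through all of its layers; the loop structure is the same.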
In the dual-stream input interactive deep neural network, two convolution layers of each sub-network (the second convolution layer and the last convolution layer of the third group) and one fully connected layer (the second fully connected layer) are correspondingly connected between the sub-networks to realize the interaction, as shown in Fig. 3.
The interaction at a convolution layer first performs the following operations on the feature maps of the corresponding left-view and right-view networks to obtain a fusion image S+ and a difference image S-, and then concatenates the fusion image S+ and the difference image S-:

S+ = F_l + F_r
S- = F_l - F_r

where F_l and F_r are the left-view and right-view feature maps corresponding to the left and right distorted image blocks;
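The fusion/difference interaction above can be sketched in a few lines; the feature-map sizes here are illustrative, and channel-wise concatenation is one plausible reading of "connecting" the two images:

```python
import numpy as np

# Feature maps from the left and right sub-networks at an interaction
# point (channels x height x width; the sizes are illustrative only).
F_l = np.arange(12.0).reshape(3, 2, 2)
F_r = np.ones((3, 2, 2))

S_plus = F_l + F_r     # fusion image: binocular combination of the views
S_minus = F_l - F_r    # difference image: disparity-like information

# The interaction concatenates both along the channel axis before the
# next layer, doubling the channel count (3 -> 6 here).
interacted = np.concatenate([S_plus, S_minus], axis=0)
```

In a real network these arrays would be convolutional activations, and the concatenated tensor would feed the next convolution layer of each sub-network.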
The interaction of the fully connected layers connects the last fully connected layer of each sub-network (i.e., the second fully connected layer);
The left-right view distorted image block pairs obtained by division are input to the pre-constructed dual-stream input interactive deep neural network for forward propagation training, and the Euclidean loss computed against the training sample labels is used as the objective function to be minimized during training:

L(w) = (1/N) * sum_{i=1}^{N} || f(P_li, P_ri; w) - y_i ||_2^2

w' = argmin_w L(w)

where f(P_li, P_ri; w) represents the predicted quality of the left-right view distorted image block pair (P_li, P_ri) under the parameters w; w denotes the parameters of the dual-stream input interactive deep neural network, which are updated iteratively; w' denotes the updated network parameters; ||.||_2 denotes the 2-norm; y_i is the training sample label value corresponding to the block pair (P_li, P_ri); and N is the number of training block pairs.
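Numerically, the Euclidean loss is just the mean squared error between the per-block-pair predictions and their labels; a tiny worked example (all numbers invented):

```python
import numpy as np

# Hypothetical predicted qualities f(P_li, P_ri; w) for N = 3 training
# block pairs, and their labels y_i (the image-level quality copied to
# every block pair from that image, as described above).
preds = np.array([40.0, 44.0, 48.0])
labels = np.array([42.0, 42.0, 42.0])

# Euclidean (mean squared error) loss L(w) over these samples:
# ((40-42)^2 + (44-42)^2 + (48-42)^2) / 3 = (4 + 4 + 36) / 3
loss = float(np.mean((preds - labels) ** 2))
```

Training drives this quantity toward zero by adjusting the network parameters w.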
As seen from Fig. 4, the fusion and difference images of different distortion types are discriminative, so effective quality features can be learned by the proposed dual-stream input interactive deep neural network.
Through the learning iterations of the dual-stream input interactive deep neural network, a trained network model is obtained. Because the distortion is uniform, the qualities of all left-right view distorted image block pairs in each distorted stereo image can be averaged to obtain the quality of each distorted stereo image, computed as:

Q = (1/P) * sum_{i=1}^{P} f(P_li, P_ri; w')

where f(P_li, P_ri; w') denotes the predicted quality of the left-right view distorted image block pair (P_li, P_ri) under the updated parameters w', and P denotes the number of left-right view distorted image block pairs in each distorted stereo image.
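The image-level pooling formula above is a plain average of the per-block-pair predictions; a minimal sketch with invented scores:

```python
import numpy as np

# Hypothetical per-block-pair quality predictions f(P_li, P_ri; w') for
# one distorted stereo image with P = 4 block pairs.
patch_scores = np.array([46.0, 50.0, 48.0, 52.0])

# Image-level quality is the plain average over all block pairs, which
# the patent justifies by the assumed spatial uniformity of distortion.
image_quality = float(patch_scores.mean())
```

For non-uniform distortions a weighted pooling might be preferable, but the patent uses the simple mean.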
The scheme of the invention can effectively evaluate the distortion quality of local image blocks and thus effectively predict the quality of distorted stereo images. In Fig. 5, (a), (b), (c) and (d) are symmetrically distorted stereo images with JPEG compression, blur, fast fading and white noise, respectively; the predicted qualities of the corresponding distorted image blocks are 46.128, 53.495, 66.804 and 74.013, where a higher predicted value indicates lower visual perceptual quality.
On the other hand, various tests are also performed based on the scheme of the embodiment of the invention.
As shown in Fig. 6, the optimization of the loss function over all distortions on the LIVE databases, using the above scheme of the embodiment of the present invention, converges well. In Fig. 6, curve 1 (thick line) corresponds to LIVE database 1, and curve 2 (thin line) corresponds to LIVE database 2.
The performance comparison results of the above-mentioned solutions of the embodiments of the present invention and the solutions mentioned in the background art on the stereo image LIVE database for all distortions are shown in table 1.
(Table 1 appears as an image in the original publication and is not reproduced here.)
Table 1 comparison of the performance of the present invention with other solutions on the stereo image LIVE database for all distortions
The results of comparing the performance of the related schemes of the present invention and the background art on symmetrically distorted and asymmetrically distorted stereo images are shown in table 2.
Algorithm                                        Symmetric distortion    Asymmetric distortion
Article [5]                                      0.918                   0.834
No-reference convolutional neural network [6]    0.590                   0.633
The invention                                    0.979                   0.927
TABLE 2 comparison of the Performance of the related schemes of the present invention and the background art on symmetrically distorted and asymmetrically distorted stereo images
The results of the performance comparison between the related schemes of the present invention and the background art on the cross database test are shown in table 3.
(Table 3 appears as an image in the original publication and is not reproduced here.)
TABLE 3 comparison of Performance on Cross database testing of related protocols as mentioned in the present invention and background
The results of comparing the performance of the present invention with that of the prior art solutions for different distortion types are shown in table 4.
(Table 4 appears as an image in the original publication and is not reproduced here.)
Table 4 comparison of the performance of the various schemes of the invention and the background art with respect to different distortion types
According to the comparison result, the performance of the scheme of the embodiment of the invention is far better than that of each scheme mentioned in the background technology.
According to the scheme of the embodiment of the invention, by considering the interaction between left-right view fusion and disparity information in the multilayer structure of the human visual system, the end-to-end dual-stream interactive deep neural network for no-reference 3D stereo image quality evaluation takes the left and right distorted image blocks as simultaneous inputs and combines discriminative feature extraction and regression learning into a single end-to-end optimization process, thereby achieving the goal of effectively predicting the perceptual quality of distorted stereo images.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (4)

1. A no-reference stereo image quality evaluation method based on a deep neural network, characterized in that the interaction principle of fusion between left and right views and disparity information in the multilayer structure of the human visual system is considered, the method comprising the following steps:
dividing the left-view and right-view images that make up all distorted stereo images into non-overlapping distorted image blocks, obtaining a number of left-right view distorted image block pairs;
inputting the left-right view distorted image block pairs obtained by division into a pre-constructed dual-stream input interactive deep neural network, training the network model, and obtaining the one-dimensional perceptual quality of the corresponding left-right view distorted image block pairs through regression learning;
when predicting the quality of a distorted stereo image, predicting the one-dimensional perceptual quality of each left-right view distorted image block pair with the trained network model, and averaging the one-dimensional perceptual qualities of all left-right view distorted image block pairs in each distorted stereo image to obtain the one-dimensional perceptual quality of each distorted stereo image;
the double-current input interactive deep neural network comprises two identical sub-networks which are used for inputting left and right visual distortion image blocks respectively and correspondingly; each sub-network comprises a first convolution layer, a first pooling layer, a second convolution layer, a second pooling layer, a third convolution layer, a third pooling layer, a first full-connection layer and a second full-connection layer which are arranged in sequence;
in the double-current input interactive deep neural network, the second convolution layer of the two sub-networks, the last convolution layer of the third convolution layer and the second full-connection layer are correspondingly connected with each other in an interactive mode;
the interaction of the convolution layer is to firstly perform the following operation on the characteristic diagram of the corresponding left-right view network to obtain a fused image S+Sum-difference image S-Re-connecting the fused image S+Sum-difference image S-
S+=Fl+Fr
S-=Fl-Fr
Wherein, Fl、FrThe left and right distortion image blocks are corresponding left and right view characteristic diagrams;
the interaction of the fully-connected layer is to connect the second fully-connected layers of the sub-networks.
2. The method according to claim 1, wherein the quality of the distorted stereo image is used as the label value of the training sample of all left-right view distorted image block pairs obtained after the distorted stereo image is divided.
3. The method according to claim 1, wherein the deep neural network-based no-reference stereo image quality evaluation method comprises,
inputting the left-right view distorted image block pairs obtained by division into the pre-constructed dual-stream input interactive deep neural network for forward propagation training, and using the Euclidean loss computed against the training sample labels as the objective function to be minimized during training:

L(w) = (1/N) * sum_{i=1}^{N} || f(P_li, P_ri; w) - y_i ||_2^2

w' = argmin_w L(w)

wherein f(P_li, P_ri; w) represents the predicted quality of the left-right view distorted image block pair (P_li, P_ri) under the parameters w; w denotes the parameters of the dual-stream input interactive deep neural network, which are updated iteratively; w' denotes the updated network parameters; ||.||_2 denotes the 2-norm; y_i is the training sample label value corresponding to the block pair (P_li, P_ri); and N is the number of training block pairs.
4. The method according to claim 1, wherein the quality of all left-right view distorted image block pairs in each distorted stereo image is averaged, and the calculation formula for obtaining the quality of each distorted stereo image is as follows:
Q = (1/P) * sum_{i=1}^{P} f(P_li, P_ri; w')

wherein f(P_li, P_ri; w') denotes the predicted quality of the left-right view distorted image block pair (P_li, P_ri) under the updated parameters w', and P denotes the number of left-right view distorted image block pairs in each distorted stereo image.
CN201810375052.5A 2018-04-24 2018-04-24 No-reference stereo image quality evaluation method based on deep neural network Active CN108391121B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810375052.5A CN108391121B (en) 2018-04-24 2018-04-24 No-reference stereo image quality evaluation method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810375052.5A CN108391121B (en) 2018-04-24 2018-04-24 No-reference stereo image quality evaluation method based on deep neural network

Publications (2)

Publication Number Publication Date
CN108391121A CN108391121A (en) 2018-08-10
CN108391121B true CN108391121B (en) 2020-10-27

Family

ID=63065726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810375052.5A Active CN108391121B (en) 2018-04-24 2018-04-24 No-reference stereo image quality evaluation method based on deep neural network

Country Status (1)

Country Link
CN (1) CN108391121B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784358B (en) * 2018-11-23 2023-07-11 南京航空航天大学 No-reference image quality evaluation method integrating artificial features and depth features
CN110246111B (en) * 2018-12-07 2023-05-26 天津大学青岛海洋技术研究院 No-reference stereoscopic image quality evaluation method based on fusion image and enhanced image
CN109492759B (en) * 2018-12-17 2022-05-20 北京百度网讯科技有限公司 Neural network model prediction method, device and terminal
CN109714592A (en) * 2019-01-31 2019-05-03 天津大学 Stereo image quality evaluation method based on binocular fusion network
CN110738645B (en) * 2019-10-11 2022-06-10 浙江科技学院 3D image quality detection method based on convolutional neural network
CN111127435B (en) * 2019-12-25 2022-11-15 福州大学 No-reference image quality evaluation method based on double-current convolution neural network
US20210233259A1 (en) * 2020-01-28 2021-07-29 Ssimwave Inc. No-reference visual media assessment combining deep neural networks and models of human visual system and video content/distortion analysis
CN113128517B (en) * 2021-03-22 2023-06-13 西北大学 Tone mapping image mixed visual feature extraction model establishment and quality evaluation method
CN113393461B (en) * 2021-08-16 2021-12-07 北京大学第三医院(北京大学第三临床医学院) Method and system for screening metaphase chromosome image quality based on deep learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105160678A (en) * 2015-09-02 2015-12-16 山东大学 Convolutional-neural-network-based reference-free three-dimensional image quality evaluation method
CN106650699A (en) * 2016-12-30 2017-05-10 中国科学院深圳先进技术研究院 CNN-based face detection method and device
CN106815579A (en) * 2017-01-22 2017-06-09 深圳市唯特视科技有限公司 A kind of motion detection method based on multizone double fluid convolutional neural networks model
CN106845471A (en) * 2017-02-20 2017-06-13 深圳市唯特视科技有限公司 A kind of vision significance Forecasting Methodology based on generation confrontation network
CN107067465A (en) * 2017-04-14 2017-08-18 深圳市唯特视科技有限公司 A kind of 3-D view synthetic method that network is generated based on checking transition diagram picture
CN107481188A (en) * 2017-06-23 2017-12-15 珠海经济特区远宏科技有限公司 A kind of image super-resolution reconstructing method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A no-reference stereoscopic image quality assessment algorithm based on convolutional neural networks; Qu Chenfei et al.; Sciencepaper Online (中国科技论文在线); 2015-08-27; Sections 1-3 *


Similar Documents

Publication Publication Date Title
CN108391121B (en) No-reference stereo image quality evaluation method based on deep neural network
Chen et al. No-reference quality assessment of natural stereopairs
CN109360178B (en) Fusion image-based non-reference stereo image quality evaluation method
CN110060236B (en) Stereoscopic image quality evaluation method based on depth convolution neural network
Shao et al. Blind image quality assessment for stereoscopic images using binocular guided quality lookup and visual codebook
CN108769671B (en) Stereo image quality evaluation method based on self-adaptive fusion image
CN109831664B (en) Rapid compressed stereo video quality evaluation method based on deep learning
CN110246111B (en) No-reference stereoscopic image quality evaluation method based on fusion image and enhanced image
Su et al. Visual quality assessment of stereoscopic image and video: challenges, advances, and future trends
CN108520510B (en) No-reference stereo image quality evaluation method based on overall and local analysis
CN109523513A (en) Based on the sparse stereo image quality evaluation method for rebuilding color fusion image
Geng et al. A stereoscopic image quality assessment model based on independent component analysis and binocular fusion property
Yan et al. Blind stereoscopic image quality assessment by deep neural network of multi-level feature fusion
Shao et al. Toward simultaneous visual comfort and depth sensation optimization for stereoscopic 3-D experience
CN111915589A (en) Stereo image quality evaluation method based on hole convolution
CN105376563A (en) No-reference three-dimensional image quality evaluation method based on binocular fusion feature similarity
Jiang et al. 3D Visual Attention for Stereoscopic Image Quality Assessment.
CN108259893B (en) Virtual reality video quality evaluation method based on double-current convolutional neural network
CN114648482A (en) Quality evaluation method and system for three-dimensional panoramic image
CN116485741A (en) No-reference image quality evaluation method, system, electronic equipment and storage medium
Liu et al. Blind stereoscopic image quality assessment accounting for human monocular visual properties and binocular interactions
CN108492275B (en) No-reference stereo image quality evaluation method based on deep neural network
CN105488792B (en) Based on dictionary learning and machine learning without referring to stereo image quality evaluation method
CN105898279A (en) Stereoscopic image quality objective evaluation method
Kim et al. Visual comfort aware-reinforcement learning for depth adjustment of stereoscopic 3d images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant