CN108389189B - Three-dimensional image quality evaluation method based on dictionary learning


Info

Publication number
CN108389189B
CN108389189B
Authority
CN
China
Prior art keywords
dictionary
image
salient
sift
distorted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810126932.9A
Other languages
Chinese (zh)
Other versions
CN108389189A (en)
Inventor
李素梅
常永莉
韩旭
侯春萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201810126932.9A
Publication of CN108389189A
Application granted
Publication of CN108389189B
Legal status: Expired - Fee Related

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image
    • G06T 2207/10012 Stereo images
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30168 Image quality inspection

Abstract

The invention belongs to the field of image processing and aims to provide a stereoscopic image quality evaluation method consistent with human subjective perception. A reference stereoscopic image pair is used to train a scale-invariant feature transform (SIFT) dictionary and a saliency dictionary. For the SIFT dictionary, SIFT transformation is applied to the reference stereoscopic image pair, the images are represented by SIFT features, and dictionary training is performed with the feature-sign search method and the Lagrange dual method. The sparse coefficient matrices obtained in the two test channels are processed to yield the quality Q1 and the quality Q2 of the corresponding distorted stereoscopic image; finally, Q1 and Q2 are combined into the final quality score Q of the distorted stereoscopic image pair. The invention mainly applies to image processing.

Description

Three-dimensional image quality evaluation method based on dictionary learning
Technical Field
The invention belongs to the field of image processing and relates to research on stereoscopic image quality evaluation, applying characteristics of the human visual system, sparse coding and sparse dictionaries to the objective evaluation of stereoscopic images. It specifically concerns a stereoscopic image quality evaluation method based on dictionary learning.
Background
Stereoscopic images/videos give viewers an immersive, on-the-scene visual experience; their generation, processing, transmission, display and quality evaluation have therefore become hot research topics in stereoscopic imaging technology. However, some noise is inevitably introduced at each link of the stereoscopic imaging chain, causing viewers visual discomfort. Designing a systematic and effective quality evaluation method for stereoscopic images/videos is important for verifying the correctness of each link of the stereoscopic imaging chain. This document mainly studies the visual discomfort caused by the stereoscopic image itself. Stereoscopic image quality evaluation comprises subjective and objective methods: subjective evaluation is closer to real human perception but is time-consuming and labor-intensive, while objective evaluation is not influenced by subjective human factors and offers good stability and easy operability.
At present, scholars at home and abroad have proposed many algorithms for objectively evaluating stereo image quality. In document [1], Cardoso et al. propose a disparity-weighting method that treats the absolute difference map of the left and right views as a disparity map and uses its pixel values as weight-adjustment factors for the reference and distorted stereo pairs. In document [2], Khan et al. propose a statistical model based on natural stereo image pairs; it describes how the statistical model leads to an estimated stereo image quality while studying the statistical correlation between the luminance and disparity sub-bands that directly describe the structural content. Owing to the limitations of stereo matching algorithms, evaluating stereo image quality directly from disparity information gives low accuracy; here only the disparity map of the reference stereo pair is considered, avoiding the inaccuracy of estimating a disparity map for the distorted pair.
Documents [3-5] show that the perceptual information of neural coding is carried by a small fraction of activated neurons, an idea that suggests sparse coding of images. Document [6] proposes an objective stereo image quality evaluation method based on sparse representation: stereo images of different frequency bands are trained separately to obtain corresponding dictionaries, the test images are sparsely represented with those dictionaries, and the sparse-feature similarities are finally fused into the quality scores of the stereo images. Document [7] proposes a full-reference stereo image quality evaluation method that learns a multi-scale dictionary simulating human-eye characteristics and, taking sparse energy and sparse complexity as the basis of binocular combination, computes the sparse-feature similarity and global luminance similarity to obtain a quality score. Document [8] proposes a sparsity-based full-reference stereo image quality evaluation method, finally integrating the quality score of the distorted stereo image from structural and non-structural features using the sparse coefficient matrices obtained with the corresponding dictionaries. Both dictionary-based methods above use only the left view of the reference stereo pair as dictionary training data; yet a stereo image is the fusion of the left and right views into one image in the human brain, and using only the left view may lose right-view information absent from the left view, so the trained dictionary may not contain complete stereo information. Studies of the human visual system in document [9] show that humans tend to attend to certain regions, called salient regions, when viewing an image, a characteristic known as visual saliency. Document [10] shows that the human eye tends to attend to the central region of an image. At present, no published work combines visual saliency features with a sparse dictionary to evaluate stereo image quality. Therefore, the absolute difference map and several saliency features are used here to extract the saliency map of the stereo image, combining visual saliency with the depth information of the stereo image (covering both left-view and right-view information). To better match the visual characteristics of the human eye, the model is further optimized with the center-bias and fovea characteristics, and the resulting salient stereo image serves as input data for dictionary training.
Disclosure of Invention
To overcome the shortcomings of the prior art and obtain a stereo image quality evaluation method consistent with human subjective perception, the invention adopts the following technical scheme. In the stereo image quality evaluation method based on dictionary learning, two dictionaries, a scale-invariant feature transform (SIFT) dictionary and a saliency dictionary, are trained from a reference stereo image pair. For the SIFT dictionary, SIFT transformation is applied to the reference stereo pair, the images are represented by SIFT features, and dictionary training with the feature-sign search method and the Lagrange dual method yields the SIFT dictionary. For the saliency dictionary, an initial stereoscopic saliency map is obtained by incorporating the absolute difference map and is optimized with the center-bias and fovea characteristics of the human visual mechanism to give the final saliency map; the saliency maps of the reference stereo pair are extracted, the images are covered with n × n overlapping blocks, and the first m salient blocks of each saliency map, ranked by variance, are selected as source data for training the saliency dictionary. In the test stage, on the one hand, the reference and distorted stereo pairs undergo SIFT transformation and are sparse coded with the trained SIFT dictionary to obtain their sparse coefficient matrices, which are processed into the quality Q1 of the corresponding distorted stereo image; on the other hand, saliency processing of the reference and distorted stereo pairs gives the reference and distorted stereo saliency maps, all non-overlapping n × n salient blocks are taken as input data and combined with the saliency dictionary to obtain the sparse coefficient matrices of the corresponding reference and distorted pairs, which are processed into the quality Q2 of the corresponding distorted stereo image. Finally, Q1 and Q2 are combined into the final quality score Q of the distorted stereo image pair.
The quality of the stereo saliency map is used to reflect the quality of the stereo image, and the absolute difference map of the left and right views represents the depth information. First, the luminance, chrominance and texture-contrast features of the image are extracted to represent the saliency information and combined with the absolute difference map into an initial feature saliency map; the initial map is then optimized with the center-bias and fovea characteristics to obtain the final stereo saliency map.
Further:
(1) Center bias
An anisotropic Gaussian kernel is used to simulate the center-bias (CB) factor of attention spreading from the center to the periphery:
CB(x,y) = \exp\left(-\left(\frac{(x-x_0)^2}{2\sigma_h^2} + \frac{(y-y_0)^2}{2\sigma_v^2}\right)\right)   (1)

CB(x, y) represents the offset information of pixel (x, y) with respect to the center point (x_0, y_0); (x_0, y_0) is the coordinate of the center of the distorted right view and (x, y) the coordinate of a pixel; σ_h and σ_v are the standard deviations of the image in the horizontal and vertical directions, taken as σ_h = W/3 and σ_v = H/3, where W and H are the numbers of horizontal and vertical pixels of the image;
(2) Fovea
This characteristic is simulated with the function shown in formula (2):

CT(f, e) = C_0 \exp\left(\delta f \, \frac{e(x,y) + e_1}{e_1}\right)   (2)

where e(x, y) is the retinal eccentricity of pixel (x, y), in degrees; f is the spatial frequency, in cycles/degree; C_0 is the contrast threshold; δ is the spatial-frequency decay parameter; e_1 is the half-resolution eccentricity constant;
the method comprises the following steps of searching a gray value which enables the difference between the foreground and the background of the three-dimensional saliency map to be maximum by using a maximum inter-class variance method, wherein the gray value is an optimal threshold value, dividing the three-dimensional saliency map into a saliency region and a non-saliency region by using the threshold value, further calculating the retina centrifugation degree e (x, y) by using the spatial distance relation between a pixel and a saliency pixel closest to the pixel, and if any pixel coordinate is (x, y) and a saliency pixel coordinate closest to the pixel coordinate is (x1, y1), calculating the centrifugation degree e (x, y) by using a formula (3):
Figure GDA0001704589320000031
wherein, W is the number of horizontal pixels of the image, v is the viewing distance, and the euclidean distance between the pixel point (x, y) and the pixel point (x1, y 1):
Figure GDA0001704589320000032
the dictionary training comprises the following specific steps: fixing one of the dictionary and the sparse coefficient matrix to solve the other, and then carrying out iteration to obtain a proper dictionary and sparse coefficient matrix:
Figure GDA0001704589320000033
Figure GDA0001704589320000034
in equations (4) and (5), X is the input signal, B is the complete dictionary, and S is the sparse coefficient matrix. I | · | purple windFIs the F-norm, λ is the regularization parameter, | · | | | luminance1Is a1Norm, BiRepresenting the ith column of atoms in the dictionary, formula (4) is an L1 regularized optimization problem, and typical optimization methods comprise a least angle regression method (LARS) and a characteristic symbolThe Feature-sign method changes the undifferentiable concave problem into an unconstrained quadratic optimization problem qp (quadratic optimization) by guessing the sign of a sparse coefficient in each iteration step; formula (5) is a typical quadratic constraint least square optimization problem, and the optimization methods include a QCQP convex optimization method, an iterative gradient descent method and a Lagrange multiplier method.
In one example, the saliency dictionary training step specifically comprises: processing each image with 8 × 8 overlapping windows to obtain many 8 × 8 small image blocks; extracting the first 3000 8 × 8 salient blocks of each saliency map using the variance information; vectorizing all salient blocks, each becoming a column vector, and assembling them into a matrix as the input of dictionary training; and iterating the feature-sign algorithm and the Lagrange dual method to obtain a suitable sparse saliency dictionary. Since the small image blocks are 8 × 8, the size of the saliency dictionary is chosen as 64 × 128.
Specifically, in the SIFT dictionary training stage, a 16 × 16 neighborhood centered on a feature point is taken as the sampling window; the relative orientations of the sampling points and the feature point, after Gaussian weighting, are accumulated into an orientation histogram with 8 bins; a gradient histogram of 8 directions is computed on each 4 × 4 cell to form a seed point, and each key point is described by the 4 × 4 = 16 seed points, so one key point yields a 4 × 4 × 8 = 128-dimensional SIFT feature vector; according to this feature dimension, the size of the SIFT dictionary is chosen as 128 × 1024.
After the saliency dictionary and the scale-invariant feature transform (SIFT) dictionary are obtained in the dictionary training stage, the distorted images are tested with them: after preprocessing the reference and test images, sparse coding is performed directly with the corresponding dictionary obtained in training, the orthogonal matching pursuit (OMP) algorithm yields the sparse coefficient matrices of the reference and distorted stereo pairs, the change in image sparsity is chosen as the basis of stereo image quality evaluation, and the sparse matrices of the reference and distorted pairs of the two channels are combined into the stereo image quality scores Q1 and Q2 of the two channels; finally, the two channel scores are combined into the quality score Q of the final distorted stereo image.
The quality evaluation value Q is specifically obtained as follows. The quality of the distorted image is evaluated from its sparse matrices. X^{o,l,1}, X^{o,r,1}, X^{d,l,1} and X^{d,r,1} denote the sparse coefficient matrices obtained by SIFT preprocessing the reference and distorted stereo pairs and sparse coding them with the SIFT dictionary; X^{o,l,2}, X^{o,r,2}, X^{d,l,2} and X^{d,r,2} denote those obtained by saliency preprocessing and sparse coding with the saliency dictionary; o denotes the reference image, d the distorted image, l the left image, r the right image, 1 the SIFT-dictionary test channel and 2 the saliency-dictionary test channel. The left-view quality Q_l^1 of the SIFT-dictionary test channel is obtained from formulas (6), (7), (8) and (9). [Formulas (6)-(9) appear only as images in the source and are not reproduced here.] S(i, j) denotes the sparse-coefficient similarity index; the closer S(i, j) is to 1, the smaller the distortion of the distorted image; x^{o,l,1}(i, j) and x^{d,l,1}(i, j) denote the values at position (i, j) of the reference and distorted left-image sparse matrices; t and c are constants; M is the number of rows of the sparse matrix and N the number of columns of sparse coefficients, with α + β = 1. The right-view quality Q_r^1 of the SIFT-dictionary test channel is obtained in the same way, and the same formulas (6)-(9) give the left and right quality scores Q_l^2 and Q_r^2 of the saliency-dictionary test channel. The left- and right-view scores are combined by a weighted geometric mean into the stereo quality score Q_1 of the SIFT-dictionary test channel and Q_2 of the saliency-dictionary test channel. The weights are the mean-square values of the sparse matrices of the left and right distorted views in each channel, given by formulas (10) and (11):

w_l^k = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(x^{d,l,k}(i,j)\right)^2   (10)

w_r^k = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(x^{d,r,k}(i,j)\right)^2   (11)

With these weights, the quality scores Q_1 and Q_2 of the two channels follow from formula (12):

Q_k = \left(Q_l^k\right)^{w_l^k/(w_l^k + w_r^k)} \left(Q_r^k\right)^{w_r^k/(w_l^k + w_r^k)}, \quad k = 1, 2   (12)

Finally, the quality score Q of the distorted stereo image combines Q_1 and Q_2 as in formula (13). [Formula (13) appears only as an image in the source and is not reproduced here.]
The invention has the characteristics and beneficial effects that:
the method is very effective for most of evaluation effects of different distortion types in two public LIVE libraries, the correlation between the evaluation result and the subjective evaluation result is strong, the goodness of fit between the obtained data and the subjective data is good, the quality of the stereo image can be well reflected, and the subjective feeling of human eyes is met.
Description of the drawings:
FIG. 1 is the flow chart of the algorithm: (a) dictionary learning stage; (b) test stage.
FIG. 2 shows the process of acquiring the stereoscopic saliency map.
Detailed Description
To obtain a stereo image quality evaluation method consistent with human subjective perception, the invention provides a stereo image quality evaluation method based on SIFT features and saliency sparse dictionary learning. Two dictionaries, the SIFT dictionary and the saliency dictionary, are trained from reference stereo image pairs. The SIFT dictionary applies SIFT transformation to the reference stereo pair, represents the images by SIFT features, and performs dictionary training with the feature-sign search and Lagrange dual methods; the saliency dictionary obtains an initial stereoscopic saliency map by incorporating the absolute difference map and optimizes it with the center-bias and fovea characteristics of the human visual mechanism into the final saliency map. The saliency maps of the reference stereo pairs are extracted, and the first 3000 salient blocks of each saliency map, taken as 8 × 8 overlapping blocks ranked by variance, are selected as source data for training the saliency dictionary. The algorithm was tested on the two public LIVE libraries: on LIVE I the PLCC reaches 0.9429 and the SROCC 0.9383; on LIVE II the PLCC reaches 0.9116 and the SROCC 0.9036. The experimental results show that the evaluation results of the algorithm correlate well with the subjective results and better match the perception of the human visual system.

The invention provides a stereo image quality evaluation method based on dictionary learning comprising two stages: a dictionary training stage and a test stage. In the dictionary training stage two dictionaries are trained, a scale-invariant feature transform (SIFT) dictionary and a saliency dictionary. The SIFT dictionary is obtained by applying SIFT to the reference stereo pair, using the SIFT descriptor features representing the images as dictionary training input, and training with the feature-sign search and Lagrange dual methods [11]. For the saliency dictionary, the saliency maps of the reference stereo pair are first extracted and optimized with the fovea and center-bias characteristics into the final saliency maps; then the first 3000 salient blocks of each saliency map, taken as 8 × 8 overlapping blocks ranked by variance, are selected as source data, and dictionary training with the feature-sign search method and the Lagrange dual method [11] yields the saliency dictionary. In the test stage, on the one hand, the reference and distorted stereo pairs undergo SIFT transformation and are sparse coded with the trained SIFT dictionary to obtain their sparse coefficient matrices, which are processed into the quality Q1 of the corresponding distorted stereo image; on the other hand, saliency processing of the reference and distorted stereo pairs gives the reference and distorted stereo saliency maps, all non-overlapping 8 × 8 salient blocks are taken as input data and combined with the saliency dictionary to obtain the sparse coefficient matrices of the corresponding reference and distorted pairs, which are processed into the quality Q2 of the corresponding distorted stereo image. Finally, the scores of the two channels are combined into the final quality score Q of the distorted stereo image pair.
As shown in fig. 1, the flow chart of the sparse representation stereo image quality evaluation algorithm is divided into two stages: a dictionary learning phase and a testing phase.
The individual steps will be analyzed in detail below:
1. Stereoscopic saliency detection model
Stereo images carry more information than single-view images; the human eye cannot match all feature edges in a short time, so most people attend only to the "important regions", extract the object boundaries in those regions, and finally match the boundaries to form stereoscopic vision. According to the stereoscopic visual attention characteristic of the human eye, observers pay more attention to the content of the salient regions of an image [12]; the quality of the stereo saliency map is therefore used here to reflect the quality of the stereo image. Compared with planar images, depth information must be considered for stereo images, so the algorithm uses the absolute difference map of the left and right views to represent depth information. First, the luminance, chrominance and texture-contrast features of the image are extracted to represent the saliency information and combined with the absolute difference map into an initial feature saliency map, as sketched below. Finally, the stereoscopic saliency map is optimized with the center-bias and fovea characteristics into the final stereo saliency map.
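The construction just described can be sketched in a few lines of Python. The fusion rule (an equal-weight sum of normalized contrast maps, modulated by the absolute difference map) and the specific contrast operators are assumptions for illustration only; the patent names the features but not the exact operators or weights.

```python
import cv2
import numpy as np

def normalize(m):
    m = m.astype(np.float64)
    return (m - m.min()) / (m.max() - m.min() + 1e-12)

def initial_saliency(left_bgr, right_bgr):
    left_gray = cv2.cvtColor(left_bgr, cv2.COLOR_BGR2GRAY)
    right_gray = cv2.cvtColor(right_bgr, cv2.COLOR_BGR2GRAY)

    # Depth proxy: absolute difference map of the left and right views.
    abs_diff = normalize(cv2.absdiff(left_gray, right_gray))

    # Contrast features of the left view: luminance and chroma from CIELab,
    # texture approximated by local gradient magnitude (assumed operators).
    lab = cv2.cvtColor(left_bgr, cv2.COLOR_BGR2Lab).astype(np.float64)
    lum = normalize(cv2.Laplacian(lab[:, :, 0], cv2.CV_64F) ** 2)
    chroma = normalize((lab[:, :, 1] - 128) ** 2 + (lab[:, :, 2] - 128) ** 2)
    gx = cv2.Sobel(left_gray, cv2.CV_64F, 1, 0)
    gy = cv2.Sobel(left_gray, cv2.CV_64F, 0, 1)
    texture = normalize(np.hypot(gx, gy))

    # Equal-weight fusion with the depth map (assumed combination rule).
    return normalize((lum + chroma + texture) / 3.0 * (1.0 + abs_diff))
```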
(1) Center bias
Due to the center-bias (CB) characteristic, the human eye always tends to look for the visual fixation point from the center of an image, with attention decreasing from the center to the periphery [10]. An anisotropic Gaussian kernel function [13] is used here to simulate the center-bias factor of attention spreading from the center to the periphery:

CB(x,y) = \exp\left(-\left(\frac{(x-x_0)^2}{2\sigma_h^2} + \frac{(y-y_0)^2}{2\sigma_v^2}\right)\right)   (1)

CB(x, y) represents the offset information of pixel (x, y) with respect to the center point (x_0, y_0); (x_0, y_0) is the coordinate of the center of the distorted right view and (x, y) the coordinate of a pixel; σ_h and σ_v are the standard deviations of the image in the horizontal and vertical directions, taken here as σ_h = W/3 and σ_v = H/3, where W and H are the numbers of horizontal and vertical pixels of the image.
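A minimal sketch of the center-bias map of formula (1), using the stated σ_h = W/3 and σ_v = H/3; taking the geometric image center as (x_0, y_0) is an assumption.

```python
import numpy as np

def center_bias(H, W):
    y, x = np.mgrid[0:H, 0:W].astype(np.float64)
    x0, y0 = (W - 1) / 2.0, (H - 1) / 2.0      # assumed geometric center
    sigma_h, sigma_v = W / 3.0, H / 3.0
    # Anisotropic Gaussian of Eq. (1), peaking at the image center.
    return np.exp(-((x - x0) ** 2 / (2 * sigma_h ** 2)
                    + (y - y0) ** 2 / (2 * sigma_v ** 2)))
```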
(2) Fovea
It is well known that the density of retinal photoreceptors decreases rapidly from the fovea to the periphery [10][14]. Hence, when an image is mapped onto the retina, the human eye has the highest spatial-frequency resolution in the foveal region; that region receives the most attention and resolves detail best, i.e., it is the salient region. As the eccentricity e increases, the spatial resolution of the human eye decreases. Studies have shown that the contrast sensitivity function (CSF) can be expressed as a function of the eccentricity e [14]. This property is modeled here by the function shown in formula (2):

CT(f, e) = C_0 \exp\left(\delta f \, \frac{e(x,y) + e_1}{e_1}\right)   (2)

where e(x, y) is the retinal eccentricity of pixel (x, y), in degrees; f is the spatial frequency, in cycles/degree; C_0 is the contrast threshold; δ is the spatial-frequency decay parameter; and e_1 is the half-resolution eccentricity constant. Following the experimental fit of [15], δ = 0.106, e_1 = 2.3 and C_0 = 1/64.
The maximum between-class variance method finds the gray level that maximizes the difference between foreground and background of the stereoscopic saliency map; this gray level is the optimal threshold, which divides the saliency map into salient and non-salient regions. The retinal eccentricity e(x, y) is then computed from the spatial distance between a pixel and the salient pixel closest to it. Let an arbitrary pixel have coordinates (x, y) and its nearest salient pixel be (x_1, y_1). The eccentricity e(x, y) is then determined by formula (3):

e(x,y) = \tan^{-1}\left(\frac{d}{v\,W}\right)   (3)

where W is the number of horizontal pixels of the image, v is the viewing distance (here taken as 5, in units of the image width), and d is the Euclidean distance between pixel (x, y) and pixel (x_1, y_1):

d = \sqrt{(x - x_1)^2 + (y - y_1)^2}
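The eccentricity computation of formula (3) can be realized with Otsu thresholding and a distance transform. Using cv2.distanceTransform to obtain the distance to the nearest salient pixel is an implementation choice, not something the patent prescribes.

```python
import cv2
import numpy as np

def eccentricity_map(saliency, v=5.0):
    # Otsu's method (maximum between-class variance) splits the saliency map
    # into salient (255) and non-salient (0) pixels.
    sal8 = cv2.normalize(saliency, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    _, mask = cv2.threshold(sal8, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # distanceTransform measures the distance to the nearest zero pixel,
    # so invert the mask to make salient pixels the zeros.
    d = cv2.distanceTransform(cv2.bitwise_not(mask), cv2.DIST_L2, 3)

    W = saliency.shape[1]
    return np.degrees(np.arctan(d / (v * W)))   # e(x, y) in degrees, Eq. (3)
```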
2. Dictionary learning
The purpose of sparse representation is to approximate an input signal vector by a weighted linear combination of a small number of "basis atoms". For dictionary training, solving for the dictionary B and the sparse coefficients S simultaneously is a non-convex optimization problem, but solving for either one with the other fixed is convex; therefore one of B and S is usually fixed while solving for the other, and the two steps are iterated to obtain a suitable dictionary and sparse coefficient matrix:
\min_S \|X - BS\|_F^2 + \lambda \|S\|_1   (4)

\min_B \|X - BS\|_F^2 \quad \text{s.t.} \quad \|B_i\|_2^2 \le 1, \ \forall i   (5)

In formulas (4) and (5), X is the input signal, B is the complete dictionary and S is the sparse coefficient matrix; \|\cdot\|_F is the Frobenius norm, λ is the regularization parameter, \|\cdot\|_1 is the l_1 norm, and B_i denotes the i-th column of atoms in the dictionary. Formula (4) is an l_1-regularized optimization problem, and typical optimization methods are least-angle regression (LARS) [16] and the feature-sign search algorithm [11]. In each iteration the feature-sign method guesses the signs of the sparse coefficients, turning the non-differentiable problem into an unconstrained quadratic program (QP), which both speeds up the computation and improves the accuracy of solving the sparse coefficients over a redundant dictionary. Formula (5) is a typical quadratically constrained least-squares problem; typical optimization methods include QCQP convex optimization, iterative gradient descent [17] and the Lagrange dual (multiplier) method [11]. Iterative gradient descent converges slowly and is time-consuming; the Lagrange dual method is an effective method built on gradient descent.

Sparse coding therefore follows the idea of the feature-sign algorithm, and dictionary learning adopts the Lagrange dual method. Training samples should be diverse and universal, so all reference stereo pairs in the LIVE library are used as training samples here. The dictionary can thus automatically learn the depth information of stereo images, making the results of the algorithm better match human subjective perception and the trained dictionary more accurate. A library-based stand-in for this alternation is sketched below.
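As a hedged stand-in for the alternation of formulas (4) and (5), scikit-learn's DictionaryLearning alternates an l_1-regularized coding step with a unit-norm dictionary update. It relies on LARS/coordinate descent rather than the feature-sign and Lagrange-dual pair used here, so it illustrates the optimization structure rather than reproducing the exact solvers.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

X = np.random.randn(500, 64)          # stand-in for 500 vectorized 8x8 blocks
learner = DictionaryLearning(n_components=128,          # 64x128 dictionary
                             alpha=1.0,                 # lambda in Eq. (4)
                             fit_algorithm='lars',
                             transform_algorithm='lasso_lars',
                             max_iter=100)
S = learner.fit_transform(X)          # sparse coefficients, one row per block
B = learner.components_               # atoms; note: scikit-learn stores the
                                      # transpose of the 64x128 convention
```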
2.1 Saliency dictionary
The dictionary learning stage is shown in Fig. 1(a).
After the stereo images are processed into stereo saliency maps according to the flow of Fig. 2, the saliency dictionary is trained from those maps. In the saliency maps obtained above, non-salient regions have little influence on stereo image quality, so part of the training data can be ignored during dictionary training; this speeds up training while preserving the quality of the resulting saliency dictionary. It must therefore be considered how to exclude the non-salient regions. Document [6] indicates that the larger the variance of an image, the richer its structural information. The images are processed with 8 × 8 overlapping windows to obtain many 8 × 8 small image blocks; the first 3000 8 × 8 salient blocks of each saliency map are extracted using the variance information; all salient blocks are vectorized, each becoming a column vector, and assembled into a matrix as the input of dictionary training; and the feature-sign algorithm and the Lagrange dual method are iterated to obtain a suitable sparse saliency dictionary (see the sketch below). Since the small image blocks are 8 × 8, the size of the saliency dictionary here is chosen as 64 × 128.
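A short sketch of the training-data selection above; the block size (8 × 8), the count (3000) and the variance ranking follow the text, while the one-pixel stride for the "overlapping" windows is an assumption.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def top_salient_blocks(sal_map, patch=8, keep=3000):
    # All overlapping patch x patch blocks (stride 1 assumed).
    blocks = sliding_window_view(sal_map, (patch, patch))
    blocks = blocks.reshape(-1, patch * patch)      # one block per row
    var = blocks.var(axis=1)
    idx = np.argsort(var)[::-1][:keep]              # highest variance first
    return blocks[idx].T                            # 64 x 3000 training matrix
```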
2.2 SIFT dictionary
The dictionary learning stage is shown in Fig. 1(a).
SIFT is a descriptor used in the field of image processing. It has scale invariance, can detect key points in an image, and is a local feature descriptor. With a feature point as center, a 16 × 16 neighborhood is taken as the sampling window, and the relative orientations of the sampling points and the feature point, after Gaussian weighting, are accumulated into an orientation histogram with 8 bins. A gradient histogram of 8 directions is computed on each 4 × 4 cell to form a seed point, and each key point is described by the 4 × 4 = 16 seed points, so one key point yields a 4 × 4 × 8 = 128-dimensional SIFT feature vector. The size of the SIFT dictionary here is chosen as 128 × 1024 according to this feature dimension.
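OpenCV's SIFT yields exactly the 4 × 4 × 8 = 128-dimensional descriptors described above; a minimal extraction sketch (requires opencv-python 4.4 or later, where SIFT is patent-free):

```python
import cv2

def sift_descriptors(gray_u8):
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(gray_u8, None)  # desc: num_keypoints x 128
    return desc.T if desc is not None else None     # 128 x num_keypoints
```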
2.3 Sparse coefficients
After the saliency dictionary and the SIFT dictionary are obtained in the dictionary training stage of Fig. 1(a), the distorted images are tested with them. As shown in Fig. 1(b), after preprocessing the reference and test images, sparse coding is performed directly with the corresponding dictionary obtained in the training stage, and the OMP algorithm [18] yields the sparse coefficient matrices of the reference and distorted stereo pairs. For a given image, the neural sparse coding differs across distortion types, so the change in sparsity of the image is chosen as the basis for stereo image quality evaluation (see the sketch below). The sparse matrices of the reference and distorted stereo pairs of the two channels are combined to obtain the stereo image quality scores Q_1 and Q_2 of the two channels; finally, the two channel scores are combined into the quality score Q of the final distorted stereo image.
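A minimal sketch of the test-stage OMP coding against a fixed trained dictionary; the sparsity level n_nonzero is an assumed parameter, as the patent does not state it.

```python
from sklearn.decomposition import sparse_encode

def omp_codes(X_cols, dictionary_rows, n_nonzero=10):
    # scikit-learn works with samples as rows, so transpose the
    # column-oriented data convention used in the text.
    return sparse_encode(X_cols.T, dictionary_rows,
                         algorithm='omp',
                         n_nonzero_coefs=n_nonzero).T   # atoms x samples
```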
2.4. Quality evaluation Q
The quality of the distorted image is evaluated from its sparse matrices. X^{o,l,1}, X^{o,r,1}, X^{d,l,1} and X^{d,r,1} denote the sparse coefficient matrices obtained by SIFT preprocessing the reference and distorted stereo pairs and sparse coding them with the SIFT dictionary; X^{o,l,2}, X^{o,r,2}, X^{d,l,2} and X^{d,r,2} denote those obtained by saliency preprocessing and sparse coding with the saliency dictionary; o denotes the reference image, d the distorted image, l the left image, r the right image, 1 the SIFT-dictionary test channel and 2 the saliency-dictionary test channel.

The left-view quality Q_l^1 of the SIFT-dictionary test channel is obtained from formulas (6), (7), (8) and (9). [Formulas (6)-(9) appear only as images in the source and are not reproduced here.] S(i, j) denotes the sparse-coefficient similarity index; the closer S(i, j) is to 1, the smaller the distortion of the distorted image. x^{o,l,1}(i, j) and x^{d,l,1}(i, j) denote the values at position (i, j) of the reference and distorted left-image sparse matrices, t and c are constants, M is the number of rows of the sparse matrix and N the number of columns of sparse coefficients, with α + β = 1. The right-view quality Q_r^1 of the SIFT-dictionary test channel is obtained in the same way, and the same formulas (6)-(9) give the left and right quality scores Q_l^2 and Q_r^2 of the saliency-dictionary test channel.

The left- and right-view scores are combined by a weighted geometric mean into the stereo image quality score Q_1 of the SIFT-dictionary test channel and Q_2 of the saliency-dictionary test channel. The weights are the mean-square values of the sparse matrices of the left and right distorted views in each channel, given by formulas (10) and (11):

w_l^k = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(x^{d,l,k}(i,j)\right)^2   (10)

w_r^k = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(x^{d,r,k}(i,j)\right)^2   (11)

With these weights, the quality scores Q_1 and Q_2 of the two channels follow from formula (12):

Q_k = \left(Q_l^k\right)^{w_l^k/(w_l^k + w_r^k)} \left(Q_r^k\right)^{w_r^k/(w_l^k + w_r^k)}, \quad k = 1, 2   (12)

Finally, the quality score Q of the distorted stereo image combines Q_1 and Q_2 as in formula (13). [Formula (13) appears only as an image in the source and is not reproduced here.]
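Since formulas (6)-(9) and (13) survive only as images, the following sketch is heavily hedged: it assumes an SSIM-style per-coefficient similarity (with constant t), mean pooling, and the mean-square weights and weighted geometric mean of formulas (10)-(12). It shows the shape of the computation for one channel, not the patent's exact formulas.

```python
import numpy as np

def view_quality(s_ref, s_dst, t=1e-4):
    # Assumed SSIM-style similarity between sparse coefficients (cf. Eq. (6)).
    sim = (2.0 * s_ref * s_dst + t) / (s_ref ** 2 + s_dst ** 2 + t)
    return sim.mean()   # assumed mean pooling over the M x N sparse matrix

def channel_quality(s_ref_l, s_dst_l, s_ref_r, s_dst_r):
    q_l = view_quality(s_ref_l, s_dst_l)
    q_r = view_quality(s_ref_r, s_dst_r)
    # Mean-square energies of the distorted sparse matrices, Eqs. (10)-(11).
    w_l = float((s_dst_l ** 2).mean())
    w_r = float((s_dst_r ** 2).mean())
    g = w_l / (w_l + w_r)
    # Weighted geometric mean of the two views, Eq. (12).
    return q_l ** g * q_r ** (1.0 - g)
```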
3. Experimental results
3.1 Test databases and performance indicators
The tests use two public LIVE databases. The LIVE I database contains 365 symmetrically distorted stereo image pairs and 20 original stereo image pairs, with five distortion types: JPEG, JP2K, Gblur, WN and FF.
The LIVE II database contains 360 symmetrically and asymmetrically distorted stereo image pairs and 8 original stereo image pairs; the distortion types likewise include JPEG, JP2K, Gblur, WN and FF.
Two evaluation indices, PLCC (Pearson linear correlation coefficient) and SROCC (Spearman rank-order correlation coefficient), are used to evaluate the performance of the proposed model. The closer PLCC and SROCC are to 1, the better the correlation between the objective and subjective evaluation. Because the subjective DMOS values and the objective quality scores lie in different ranges, a nonlinear logistic mapping is applied before computing the correlation coefficients; a five-parameter fit yields the model's subjective prediction of stereo image quality.
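A sketch of this evaluation protocol: the five-parameter logistic below is the standard VQEG form, which the text implies but does not write out, and the parameter initialization is an assumption.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr

def logistic5(q, b1, b2, b3, b4, b5):
    # Standard five-parameter logistic mapping to predicted DMOS.
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (q - b3)))) + b4 * q + b5

def plcc_srocc(q_obj, dmos):
    q_obj, dmos = np.asarray(q_obj, float), np.asarray(dmos, float)
    p0 = [np.max(dmos), 1.0, np.mean(q_obj), 1.0, np.mean(dmos)]  # assumed init
    params, _ = curve_fit(logistic5, q_obj, dmos, p0=p0, maxfev=10000)
    pred = logistic5(q_obj, *params)
    # PLCC uses the mapped scores; SROCC is rank-based and needs no mapping.
    return pearsonr(pred, dmos)[0], spearmanr(q_obj, dmos)[0]
```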
3.2 Performance comparison
Tables 1 and 2 give the results of the different indices obtained by the proposed algorithm and by document [7] for the different distortion types; Table 3 gives the overall performance comparison of different algorithms. The bolded data in the tables are the best among all data. As the data in Tables 1 and 2 show, the algorithm is effective for the different distortion types on the two public LIVE libraries, and most of the PLCC and SROCC values in the tables exceed 0.9, indicating a strong correlation between the evaluation results of this method and the subjective results. Although the algorithm does not lead on every individual distortion type, it is very effective overall. As Table 3 shows, the experimental results of the algorithm compare favorably with those of the other methods, so the agreement between the data obtained by the algorithm and the subjective data is better. The method can therefore reflect stereo image quality well, consistent with human subjective perception.

TABLE 1 Performance comparison of the two methods on the LIVE I database
[table data not recoverable from the source image]

TABLE 2 Performance comparison of the two methods on the LIVE II database
[table data not recoverable from the source image]

TABLE 3 Overall performance comparison of different evaluation methods
[table data not recoverable from the source image]
4. Conclusion
A stereo image quality evaluation method based on sparse dictionaries is proposed: two dictionaries are trained, and the qualities of the two channels are finally combined into the quality score of the distorted stereo image. To ensure the completeness of the dictionaries, the original reference stereo pairs serve as the data source for dictionary training, and the saliency dictionary and the SIFT dictionary are obtained by training on preprocessed images. Visual saliency is incorporated into stereo image quality evaluation, the fovea and center-bias characteristics are used to optimize the final saliency map, and the saliency dictionary is trained from it. The experimental results show that the algorithm performs well and that the model predictions are accurate when compared with the subjective evaluation values.
References
[1] J. V. de Miranda Cardoso, C. D. M. Regis, and M. S. de Alencar, "Disparity weighting applied to full-reference and no-reference stereoscopic image quality assessment," in 2015 IEEE International Conference on Consumer Electronics (ICCE), pp. 477-480, Jan 2015.
[2] S. K. Md, B. Appina, and S. S. Channappayya, "Full-reference stereo image quality assessment using natural stereo scene statistics," IEEE Signal Processing Letters, vol. 22, pp. 1985-1989, Nov 2015.
[3] B. A. Olshausen et al., "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, vol. 381, no. 6583, pp. 607-609, 1996.
[4] B. A. Olshausen and D. J. Field, "Sparse coding of sensory inputs," Current Opinion in Neurobiology, vol. 14, no. 4, pp. 481-487, 2004.
[5] P. Reinagel, "How do visual neurons respond in the real world?," Current Opinion in Neurobiology, vol. 11, no. 4, pp. 437-442, 2001.
[6] Li Kemeng, Shao Feng, et al., "The method of evaluating the objective quality of stereoscopic images based on sparse representation," Journal of Optoelectronics Laser, 2014, 25(11): 2227-2233.
[7] F. Shao, K. Li, W. Lin, G. Jiang, M. Yu, and Q. Dai, "Full-reference quality assessment of stereoscopic images by learning binocular receptive field properties," IEEE Transactions on Image Processing, vol. 24, pp. 2971-2983, Oct 2015.
[8] S. K. Md and S. S. Channappayya, "Sparsity based stereoscopic image quality assessment," 2016 50th Asilomar Conference on Signals, Systems and Computers, pp. 1858-1862, 2016.
[9] J. K. Tsotsos, S. M. Culhane, W. Y. K. Wai, et al., "Modelling visual attention via selective tuning," Artificial Intelligence, Oct. 1995, 78(1): 507-545.
[10] P. Tseng, R. Carmi, I. G. M. Cameron, et al., "Quantifying center bias of observers in free viewing of dynamic natural scenes," Journal of Vision, 2009, 9(7): 1-16.
[11] H. Lee, A. Battle, R. Raina, et al., "Efficient sparse coding algorithms," Advances in Neural Information Processing Systems, 2007, 19: 801.
[12] F. Wang, "Visual saliency detection based on context and background," Dalian University of Technology, 2013.
[13] O. Le Meur, P. Le Callet, D. Barba, et al., "A coherent computational approach to model bottom-up visual attention," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006, 28(5): 802-817.
[14] Zhou Wang, Ligang Lu, and Alan Conrad Bovik, "Foveation scalable video coding with automatic fixation selection," IEEE Transactions on Image Processing, February 2003, 12(2): 243-254.
[15] W. S. Geisler and J. S. Perry, "A real-time foveated multiresolution system for low-bandwidth video communication," Proc. SPIE, vol. 3299, pp. 294-305, July 1998.
[16] B. Efron, T. Hastie, I. Johnstone, et al., "Least angle regression," The Annals of Statistics, 2004, 32(2): 407-499.
[17] Y. Censor and S. A. Zenios, Parallel Optimization: Theory, Algorithms, and Applications, Oxford University Press on Demand, 1997.
[18] R. Rubinstein, M. Zibulevsky, and M. Elad, "Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit," CS Technion, vol. 40, no. 8, pp. 1-15, 2008.

Claims (8)

1. A stereo image quality evaluation method based on dictionary learning, characterized in that two dictionaries, a scale-invariant feature transform (SIFT) dictionary and a saliency dictionary, are trained from a reference stereo image pair; the SIFT dictionary specifically applies SIFT transformation to the reference stereo pair, represents the images by SIFT features, and performs dictionary training with the feature-sign search method and the Lagrange dual method to obtain the SIFT dictionary; the saliency dictionary specifically obtains an initial stereoscopic saliency map by incorporating the absolute difference map, optimizes it with the center-bias and fovea characteristics of the human visual mechanism into the final saliency map, extracts the saliency maps of the reference stereo pair, and then selects, as n × n overlapping blocks ranked by variance, the first m salient blocks of each saliency map as source data for training the saliency dictionary; in the test stage, on the one hand, the reference and distorted stereo pairs undergo SIFT transformation and are sparse coded with the trained SIFT dictionary to obtain their sparse coefficient matrices, which are processed into the quality Q1 of the corresponding distorted stereo image; on the other hand, saliency processing of the reference and distorted stereo pairs gives the reference and distorted stereo saliency maps, all non-overlapping n × n salient blocks are taken as input data and combined with the saliency dictionary to obtain the sparse coefficient matrices of the corresponding reference and distorted pairs, which are processed into the quality Q2 of the corresponding distorted stereo image; finally, Q1 and Q2 are combined into the final quality score Q of the distorted stereo image pair.
2. The method according to claim 1, characterized in that the quality of the stereo saliency map is used to reflect the quality of the stereo image, the absolute difference map of the left and right views represents the depth information, the luminance, chrominance and texture-contrast features of the image are first extracted to represent the saliency information and combined with the absolute difference map into an initial feature saliency map, and the initial feature saliency map is then optimized with the center-bias and fovea characteristics to obtain the final stereo saliency map.
3. The stereo image quality evaluation method based on dictionary learning according to claim 1, characterized by further comprising:
(1) Center bias
An anisotropic Gaussian kernel is used to simulate the center-bias (CB) factor of attention spreading from the center to the periphery:

CB(x,y) = \exp\left(-\left(\frac{(x-x_0)^2}{2\sigma_h^2} + \frac{(y-y_0)^2}{2\sigma_v^2}\right)\right)   (1)

CB(x, y) represents the offset information of pixel (x, y) with respect to the center point (x_0, y_0); (x_0, y_0) is the coordinate of the center of the distorted right view and (x, y) the coordinate of a pixel; σ_h and σ_v are the standard deviations of the image in the horizontal and vertical directions, taken as σ_h = W/3 and σ_v = H/3, where W and H are the numbers of horizontal and vertical pixels of the image;

(2) Fovea

This characteristic is simulated with the function shown in formula (2):

CT(f, e) = C_0 \exp\left(\delta f \, \frac{e(x,y) + e_1}{e_1}\right)   (2)

where e(x, y) is the retinal eccentricity of pixel (x, y), in degrees; f is the spatial frequency, in cycles/degree; C_0 is the contrast threshold; δ is the spatial-frequency decay parameter; e_1 is the half-resolution eccentricity constant;

the maximum between-class variance method finds the gray level that maximizes the difference between foreground and background of the stereoscopic saliency map; this gray level is the optimal threshold, which divides the saliency map into salient and non-salient regions; the retinal eccentricity e(x, y) is then computed from the spatial distance between a pixel and the salient pixel closest to it: assuming an arbitrary pixel has coordinates (x, y) and its nearest salient pixel is (x_1, y_1), the eccentricity e(x, y) is obtained by formula (3):

e(x,y) = \tan^{-1}\left(\frac{d}{v\,W}\right)   (3)

where W is the number of horizontal pixels of the image, v is the viewing distance, and d is the Euclidean distance between pixel (x, y) and pixel (x_1, y_1):

d = \sqrt{(x - x_1)^2 + (y - y_1)^2}
4. the stereo image quality evaluation method based on dictionary learning as claimed in claim 1, wherein the dictionary training comprises the following steps: fixing one of the dictionary and the sparse coefficient matrix to solve the other, and then carrying out iteration to obtain a proper dictionary and sparse coefficient matrix:
min over S:  ‖X − BS‖F² + λ‖S‖1   (4)

min over B:  ‖X − BS‖F²  subject to ‖Bi‖² ≤ 1 for all i   (5)

In formulas (4) and (5), X is the input signal, B is the complete dictionary, and S is the sparse coefficient matrix; ‖·‖F is the Frobenius norm, λ is the regularization parameter, ‖·‖1 is the l1 norm, and Bi denotes the i-th column (atom) of the dictionary. Formula (4) is an L1-regularized optimization problem; typical optimization methods include the least angle regression method (LARS) and the feature-sign search algorithm, in which each iteration step turns the non-differentiable problem into an unconstrained quadratic program (QP) by guessing the signs of the sparse coefficients. Formula (5) is a typical quadratically constrained least squares optimization problem; optimization methods include QCQP convex optimization, iterative gradient descent, and the Lagrange multiplier (dual) method.
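A minimal alternating-minimization sketch of (4) and (5): scikit-learn's Lasso stands in for the feature-sign search in step (4), and a least-squares update with column renormalization approximates the Lagrange-dual solve of the constrained problem (5):

```python
import numpy as np
from sklearn.linear_model import Lasso

def train_dictionary(X, n_atoms=128, lam=0.1, n_iter=20, seed=0):
    # X: input signal matrix, one sample per column (dim x n_samples)
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((X.shape[0], n_atoms))
    B /= np.linalg.norm(B, axis=0)
    for _ in range(n_iter):
        # (4): fix B, solve the L1-regularized sparse codes S
        lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=2000)
        lasso.fit(B, X)
        S = lasso.coef_.T                 # n_atoms x n_samples
        # (5): fix S, update B by least squares, then renormalize columns
        B = np.linalg.lstsq(S.T, X.T, rcond=None)[0].T
        B /= np.maximum(np.linalg.norm(B, axis=0), 1e-12)
    return B, S
```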
5. The method for evaluating the quality of the stereo image based on dictionary learning as claimed in claim 1, wherein in one example the step of training the salient dictionary specifically comprises: processing the image with 8×8 overlapped windows to obtain a number of 8×8 small image blocks; extracting the first 3000 8×8 salient blocks of each saliency map according to variance information; vectorizing all the salient blocks, each salient block becoming a column vector, and assembling them into a matrix as the input of dictionary training; and performing iterative training with the feature-sign algorithm and the Lagrange dual method to obtain a suitable sparse salient dictionary; since the size of the small image blocks is 8×8, the size of the salient dictionary herein is chosen as 64×128.
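A sketch of the training-data construction in this claim; the stride of the overlapped window is an assumption, since the claim fixes only the 8×8 block size and m = 3000:

```python
import numpy as np

def top_variance_blocks(sal_map, n=8, m=3000, step=4):
    # Slide an n x n overlapped window (assumed stride: step) over the
    # saliency map and flatten each block to a 64-d vector
    H, W = sal_map.shape
    blocks = [sal_map[i:i + n, j:j + n].ravel()
              for i in range(0, H - n + 1, step)
              for j in range(0, W - n + 1, step)]
    blocks = np.asarray(blocks)                   # num_blocks x 64
    order = np.argsort(blocks.var(axis=1))[::-1]  # highest variance first
    return blocks[order[:m]].T                    # 64 x m training matrix
```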
6. The method as claimed in claim 5, wherein in the training stage of the SIFT dictionary, a 16×16 neighborhood centered on a feature point is taken as the sampling window; after Gaussian weighting, the relative orientations of the sampling points and the feature point are assigned to an orientation histogram with 8 bins; a gradient histogram over 8 directions is computed on each 4×4 block to form one seed point, and 4×4 = 16 seed points are used to describe each key point, so that one key point yields a 4×4×8 = 128-dimensional SIFT feature vector; according to this feature dimension, the size of the SIFT dictionary is chosen as 128×1024.
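OpenCV's SIFT produces exactly these 4×4×8 = 128-dimensional descriptors, so a sketch of assembling the 128-row input matrix for dictionary training (a stand-in, not the patent's own extraction code) might look like:

```python
import cv2
import numpy as np

def sift_descriptor_matrix(gray_u8):
    # gray_u8: 8-bit grayscale image; descriptors are stacked as columns
    # to match the 128 x 1024 dictionary shape of claim 6
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(gray_u8, None)   # num_kp x 128
    return desc.T if desc is not None else np.empty((128, 0))
```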
7. The dictionary-learning-based stereo image quality evaluation method as claimed in claim 5, wherein after the salient dictionary and the SIFT dictionary are obtained in the dictionary training stage, the trained dictionaries are used to test distorted images: after the reference image and the test image are preprocessed, sparse coding is performed directly with the corresponding dictionary obtained in the training stage, and the sparse coefficient matrices of the reference and distorted stereoscopic image pairs are obtained with the OMP algorithm; the change in sparsity of the images is selected as the basis of stereoscopic image quality evaluation, and the sparse matrices of the reference and distorted stereoscopic image pairs of the two channels are combined respectively to obtain the stereoscopic image quality scores Q1 and Q2 of the two channels; finally, the quality scores of the two channels are combined to obtain the final quality score Q of the distorted stereoscopic image.
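A sketch of the test-stage sparse coding with OMP, using scikit-learn's orthogonal_mp; the sparsity level is an assumed parameter:

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def omp_codes(B, X, n_nonzero=10):
    # B: trained dictionary (dim x n_atoms); X: preprocessed feature
    # columns (dim x n_samples). Returns the sparse coefficient matrix.
    S = orthogonal_mp(B, X, n_nonzero_coefs=n_nonzero)
    return S if S.ndim == 2 else S[:, None]   # n_atoms x n_samples
```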
8. The dictionary-learning-based stereoscopic image quality evaluation method as claimed in claim 5, wherein the quality evaluation value Q is obtained as follows:
The quality of the distorted image is evaluated from the sparse matrices of the distorted image. Let S1_ol, S1_or, S1_dl and S1_dr denote the sparse coefficient matrices obtained by SIFT-preprocessing the reference and distorted stereo pairs and sparse coding them with the SIFT dictionary, and let S2_ol, S2_or, S2_dl and S2_dr denote the sparse coefficient matrices obtained by saliency-preprocessing the reference and distorted stereo pairs and sparse coding them with the salient dictionary, where o denotes the reference image, d the distorted image, l the left image, r the right image, and the superscripts 1 and 2 denote the SIFT dictionary test channel and the salient dictionary test channel respectively. The left-view quality of the SIFT dictionary test channel is obtained from formulas (6), (7), (8) and (9):
S(i, j) = ( 2·S1_ol(i, j)·S1_dl(i, j) + c ) / ( S1_ol(i, j)² + S1_dl(i, j)² + c )   (6)

[formula images (7)–(9), which pool S(i, j) over the M×N entries of the sparse matrices using the constants t, c and the exponents α, β to give the left-view quality Q1_l, are not reproduced in this text]

where S(i, j) is the sparse coefficient similarity index, and the closer S(i, j) is to 1, the smaller the degree of distortion of the distorted image; S1_ol(i, j) and S1_dl(i, j) are the values at position (i, j) of the reference and distorted left-image sparse matrices respectively; t and c are constants; M is the number of rows and N the number of columns of the sparse matrix; and α + β = 1. Similarly, the right-view quality Q1_r of the SIFT dictionary test channel is obtained.
For the salient dictionary test channel, the left and right test stereoscopic image quality scores Q2_l and Q2_r are obtained with the same formulas (6), (7), (8) and (9).
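Since formulas (7)–(9) are not recoverable from the source, the following per-view pooling is only a plausible sketch: the entry-wise similarity of the reconstructed formula (6) combined with a global energy term, with t, c and α assumed:

```python
import numpy as np

def view_quality(S_ref, S_dis, t=0.01, c=0.02, alpha=0.5):
    # Entry-wise similarity as in the reconstructed formula (6)
    sim = (2 * S_ref * S_dis + c) / (S_ref ** 2 + S_dis ** 2 + c)
    # Assumed global energy-similarity term using the constant t
    eo, ed = np.sum(S_ref ** 2), np.sum(S_dis ** 2)
    energy = (2 * np.sqrt(eo * ed) + t) / (eo + ed + t)
    # Assumed pooling with exponents alpha + beta = 1
    return sim.mean() ** alpha * energy ** (1 - alpha)
```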
combining the quality scores of the left view and the right view by using the weight geometry to obtain the quality score Q of the stereo image of the SIFT dictionary testing channel1And stereo image quality score Q of salient dictionary test channel2
Figure FDA00015737664800000312
The mean square value of the distorted stereo image pair representing the SIFT dictionary test channel and the significant dictionary test channel respectively to the sparse matrix of the left and right distorted stereo images is obtained by the following formulas (10) and (11) for solving the weight:
Figure FDA00015737664800000313
Figure FDA00015737664800000314
according to the obtained weight and the formula (12), the mass fraction Q of the two channels is obtained1And Q2
Figure FDA0001573766480000041
Finally, the quality score Q of the distorted stereoscopic image is obtained by combining Q1 and Q2, as in formula (13):

[formula image (13) not reproduced]
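A sketch of the score combination: formulas (10)–(12) as reconstructed above, plus an assumed weighted geometric mean with parameter eta standing in for the unrecoverable formula (13):

```python
import numpy as np

def channel_score(q_l, q_r, S_dis_l, S_dis_r):
    # Weights from mean square values of the distorted sparse matrices,
    # normalized to sum to 1 (reconstructed formulas (10)/(11))
    e_l, e_r = np.mean(S_dis_l ** 2), np.mean(S_dis_r ** 2)
    w_l = e_l / (e_l + e_r)
    # Weighted geometric mean of the two view scores (formula (12))
    return q_l ** w_l * q_r ** (1.0 - w_l)

def final_score(q1, q2, eta=0.5):
    # Assumed stand-in for formula (13): weighted geometric mean of Q1, Q2
    return q1 ** eta * q2 ** (1.0 - eta)
```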
CN201810126932.9A 2018-02-08 2018-02-08 Three-dimensional image quality evaluation method based on dictionary learning Expired - Fee Related CN108389189B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810126932.9A CN108389189B (en) 2018-02-08 2018-02-08 Three-dimensional image quality evaluation method based on dictionary learning

Publications (2)

Publication Number Publication Date
CN108389189A CN108389189A (en) 2018-08-10
CN108389189B (en) 2021-05-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210514