CN106875389B - Stereo video quality evaluation method based on motion significance - Google Patents

Publication number: CN106875389B (granted); other version: CN106875389A
Application number: CN201710100339.2A, filed by Tianjin University
Inventors: 李素梅, 段志成, 丁学东, 常永莉
Current assignee: Tianjin University
Legal status: Expired - Fee Related

Classifications

    • G06T 7/0002 — Image analysis; inspection of images, e.g. flaw detection
    • G06F 18/22 — Pattern recognition; analysing; matching criteria, e.g. proximity measures
    • G06T 2207/10021 — Image acquisition modality: stereoscopic video; stereoscopic image sequence
    • G06T 2207/30168 — Subject of image: image quality inspection


Abstract

The invention belongs to the field of image processing and aims to evaluate stereoscopic video quality more accurately and effectively, thereby promoting, to a certain extent, the development of stereoscopic imaging technology. In the proposed motion-saliency-based stereoscopic video quality evaluation method, motion information is first estimated from two adjacent frames (frame t and frame t+1) of the left or right viewpoint of the reference stereoscopic video, yielding a motion-saliency weight map for frame t. The corresponding frame t of the left or right viewpoint of the distorted stereoscopic video is then evaluated with the 2D structural similarity (SSIM) image quality metric; the resulting quality map of the frame is weighted by the motion-saliency weight map to obtain the overall quality of the frame. The qualities of the left and right views are then fused to obtain the quality of a single stereoscopic frame, and finally the qualities of all stereoscopic frames of the video are averaged to obtain the overall quality of the stereoscopic video. The invention is mainly applied to image processing.

Description

Stereo video quality evaluation method based on motion significance
Technical Field
The invention belongs to the field of image processing, relates to the improvement and optimization of stereoscopic video quality evaluation methods, and in particular to the application of motion-saliency segmentation and structural similarity to the objective evaluation of stereoscopic video quality. Specifically, it relates to a stereoscopic video quality evaluation method based on motion saliency.
Background
With the development of 3D technology in recent years, stereoscopic video is becoming a major medium for the dissemination of visual information. However, network transmission bandwidth is limited, which makes compression of stereoscopic video necessary, and practice has shown that asymmetric compression of stereoscopic video achieves higher compression efficiency. At the same time, compression inevitably degrades the quality of the stereoscopic video, and that quality directly affects the visual, physiological and psychological well-being of the viewer. It is therefore of great importance to study the visual factors that influence the viewing experience and to develop an effective and reliable stereoscopic video quality evaluation method.
Stereoscopic video quality evaluation methods fall into two categories: subjective and objective. In a subjective evaluation method, a large number of viewers rate the quality of the stereoscopic content directly. Because people perform the evaluation themselves, such methods accurately reflect human visual experience, but subjective evaluation is time-consuming and labor-intensive and cannot be used in real-time applications[1].
Objective quality evaluation methods assess stereoscopic video quality by means of an algorithm and can effectively compensate for the shortcomings of subjective evaluation. Because stereoscopic video technology has only developed in recent years, very few methods evaluate stereoscopic video quality directly, and most researchers in the field apply conventional 2D image or video quality metrics to stereoscopic video. Typical 2D image quality evaluation methods include peak signal-to-noise ratio (PSNR), visual signal-to-noise ratio (VSNR), visual information fidelity (VIF), and structural similarity (SSIM).
Peak signal-to-noise ratio (PSNR) is the most typical image quality evaluation method based on image-difference analysis; it is closely related to the mean squared error (MSE), but neither can accurately reflect the characteristics of the human visual system[2,3]. The visual signal-to-noise ratio (VSNR) was proposed by Chandler et al. after a series of studies[4,5,6,7] and is a typical method based on the human visual system (HVS). The visual information fidelity (VIF) model was proposed by Sheikh et al. in 2006[8,9]; it introduces the information-theoretic notion of information content, builds an image model that follows the statistics of natural images, and defines the information content of an image on that basis; the difference between the information content of the original image and of the test image is then used to evaluate image quality. Structural similarity (SSIM) is an image quality evaluation method proposed by Wang et al.[2,3,10,11]. It focuses on the structural similarity of two images; specifically, SSIM considers three factors: luminance, contrast, and structure. It is currently one of the best-performing methods in 2D image quality evaluation. Although these conventional methods can, to some extent, reflect the quality of a single frame of a single viewpoint in a stereoscopic video, they only evaluate planar images.
Reference [12] proposes a video quality evaluation algorithm that combines luminance information, motion information, and SSIM. Reference [13] likewise incorporates motion information into video quality evaluation. Although both methods consider the influence of motion information on video quality, the motion information is computed over fixed 8×8 pixel blocks, and such fixed-size square regions cannot faithfully reflect the motion of real objects in a three-dimensional scene.
The development of stereoscopic video quality evaluation algorithms is of great significance for the development of stereoscopic video.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention aims to provide a salient-region segmentation method based on motion information and, building on it, to construct a novel stereoscopic video quality evaluation method that combines human binocular vision characteristics with SSIM. The proposed algorithm evaluates stereoscopic video quality more accurately and effectively, and thereby promotes, to a certain extent, the development of stereoscopic imaging technology. The adopted technical scheme, a motion-saliency-based stereoscopic video quality evaluation method, proceeds as follows: first, motion information is estimated from two adjacent frames (frame t and frame t+1) of the left or right viewpoint of the reference stereoscopic video, and this motion information guides the region segmentation of the corresponding single frame (frame t), yielding a motion-saliency weight map for that frame. Next, the corresponding frame t of the left or right viewpoint of the distorted stereoscopic video is evaluated with the 2D structural similarity (SSIM) image quality metric, yielding a quality map of that frame. The quality map of the frame is then weighted by the motion-saliency weight map obtained in the previous step to obtain the overall quality of the frame. Next, the qualities of the left and right views of the stereoscopic video are fused using the binocular rivalry and binocular suppression characteristics of the human visual system (HVS) to obtain the quality of a single stereoscopic frame. Finally, the qualities of all stereoscopic frames of the video are averaged to obtain the overall quality of the stereoscopic video.
The region segmentation proceeds in three steps: first, the motion vector of each pixel in the image is computed with an optical flow motion estimation algorithm; then, the edge contours of the objects in the image are extracted with an edge extraction algorithm; finally, for each region enclosed by an object's edge contour, the mean motion magnitude over all pixels in the region is computed and taken as the motion amount of that object, so that different objects in the image are represented by their motion amounts and the image is segmented accordingly.
SSIM evaluates the degree of similarity of two images, which is measured by three indicators: luminance similarity, contrast similarity, and structural similarity;
the brightness similarity is calculated by the following formula:
$$L(X,Y) = \frac{2\,u_X u_Y + C_1}{u_X^2 + u_Y^2 + C_1} \tag{1}$$
In equation (1), X and Y denote a local pixel block of the reference image and the corresponding local pixel block of the distorted image, u_X and u_Y denote the means of those blocks, and C_1 is a constant;
the contrast similarity is calculated by the following formula:
$$C(X,Y) = \frac{2\,\sigma_X \sigma_Y + C_2}{\sigma_X^2 + \sigma_Y^2 + C_2} \tag{2}$$
As in equation (1), X and Y in equation (2) denote local pixel blocks of the reference and distorted images, σ_X and σ_Y denote the standard deviations of those blocks, and C_2 is a constant;
The structural similarity is calculated by the following formula:
$$S(X,Y) = \frac{\sigma_{XY} + C_3}{\sigma_X \sigma_Y + C_3} \tag{3}$$
Likewise, X and Y in equation (3) denote local pixel blocks of the reference and distorted images, σ_X and σ_Y their standard deviations, σ_XY the covariance of the two blocks, and C_3 a constant. The final SSIM quality score is computed by the following equation:
$$\mathrm{SSIM}(X,Y) = [L(X,Y)]^{\alpha} \cdot [C(X,Y)]^{\beta} \cdot [S(X,Y)]^{\gamma} \tag{4}$$
where α, β and γ are non-negative constants that adjust the relative importance of the three indices (luminance similarity, contrast similarity and structural similarity); with the typical values α = β = γ = 1, equation (4) becomes:
$$\mathrm{SSIM}(X,Y) = \frac{(2\,u_X u_Y + C_1)(2\,\sigma_{XY} + C_2)}{(u_X^2 + u_Y^2 + C_1)(\sigma_X^2 + \sigma_Y^2 + C_2)} \tag{5}$$
the method comprises the steps of obtaining an SSIM value of a (X, Y) position by selecting a pixel area of 11 × 11 with a pixel (X, Y) as a center from a corresponding position in a reference image and a corresponding position in a distorted image respectively through calculation by using an equation (5), and finally sliding a calculation window pixel by pixel to obtain the SSIM value of all pixels, wherein the formulas (1), (2) and (3) show that when the two images are completely the same, L (X, Y), C (X, Y) and S (X, Y) reach the maximum value of 1 so that the SSIM reaches the maximum value of 1, and along with the aggravation of the damage degree of the distorted image, L (X, Y), C (X, Y) and S (X, Y) are reduced so that the SSIM is correspondingly reduced, and the value range of the SSIM is 0-1.
Specifically, the overall quality of frame t of the left or right viewpoint of the stereoscopic video is obtained as the weighted average of the SSIM quality map of that frame under the frame's motion-saliency weight map. If a single frame of the stereoscopic video has size P×Q, then the motion-saliency weight map and the SSIM quality map of the frame are both P×Q matrices, and the values at position (x, y) in these matrices are, respectively, the motion-saliency weight and the SSIM quality value of the pixel at (x, y) of the frame. The overall qualities Ql(t) and Qr(t) of frame t of the left and right viewpoints are computed by the following equations:
$$Q_l(t) = \frac{\sum_{x=1}^{P}\sum_{y=1}^{Q} MS_l(x,y,t)\,\mathrm{SSIM}_l(x,y,t)}{\sum_{x=1}^{P}\sum_{y=1}^{Q} MS_l(x,y,t)} \tag{6}$$

$$Q_r(t) = \frac{\sum_{x=1}^{P}\sum_{y=1}^{Q} MS_r(x,y,t)\,\mathrm{SSIM}_r(x,y,t)}{\sum_{x=1}^{P}\sum_{y=1}^{Q} MS_r(x,y,t)} \tag{7}$$
In equations (6) and (7), Ql(t) and Qr(t) denote the overall qualities of frame t of the left and right viewpoints, respectively; MSl(x, y, t) and MSr(x, y, t) denote the motion-saliency weights at position (x, y) of frame t of the left and right viewpoints; SSIMl(x, y, t) and SSIMr(x, y, t) denote the corresponding SSIM scores;
Combining the binocular rivalry and binocular suppression characteristics of the human visual system, the quality of stereoscopic frame t of the stereoscopic video is finally computed by the following equation:
$$Q(t) = \frac{F_l(t)\,Q_l(t) + F_r(t)\,Q_r(t)}{F_l(t) + F_r(t)} \tag{8}$$
In equation (8), Q(t) is the quality of stereoscopic frame t, and Fl(t) and Fr(t) are the two-dimensional spatial frequencies of frame t of the left and right viewpoints of the stereoscopic video, respectively. The two-dimensional spatial frequency of an image is computed by the following equations:
$$F = \sqrt{F_h^2 + F_v^2} \tag{9}$$

$$F_h = \sqrt{\frac{1}{PQ}\sum_{p=1}^{P}\sum_{q=2}^{Q}\bigl[I(p,q) - I(p,q-1)\bigr]^2} \tag{10}$$

$$F_v = \sqrt{\frac{1}{PQ}\sum_{p=2}^{P}\sum_{q=1}^{Q}\bigl[I(p,q) - I(p-1,q)\bigr]^2} \tag{11}$$
In equations (9), (10) and (11), F is the two-dimensional spatial frequency of the image, Fh is the row frequency (the horizontal component of the spatial frequency), Fv is the column frequency (the vertical component), P and Q are the numbers of rows and columns of the image, and I(p, q) is the luminance of the pixel in row p and column q;
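The spatial-frequency computation of equations (9)-(11) can be sketched directly; this is a minimal NumPy version in which the first differences naturally skip the first row or column, matching the sums that start at index 2:

```python
import numpy as np

def spatial_frequency(img):
    """Two-dimensional spatial frequency F of an image, equations (9)-(11).

    Fh (row frequency) is the RMS of horizontal first differences of the
    luminance I(p, q); Fv (column frequency) is the RMS of vertical first
    differences; F combines the two.
    """
    I = np.asarray(img, dtype=np.float64)
    P, Q = I.shape
    Fh = np.sqrt(np.sum((I[:, 1:] - I[:, :-1]) ** 2) / (P * Q))  # eq. (10)
    Fv = np.sqrt(np.sum((I[1:, :] - I[:-1, :]) ** 2) / (P * Q))  # eq. (11)
    return np.sqrt(Fh ** 2 + Fv ** 2)                            # eq. (9)
```

A constant image yields F = 0, and images with more fine detail yield larger F, which is what makes F usable as a view weight in the binocular fusion step.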
$$Q = \frac{1}{N}\sum_{t=1}^{N} Q(t) \tag{12}$$

where N is the number of frames of the stereoscopic video.
the final stereoscopic video quality is obtained by the calculation of equation (12).
The invention has the characteristics and beneficial effects that:
the experimental results and data comparison show that the results obtained by the MS _ VQM, SSIM _ VQM and MSSIM _ VQM methods are better consistent with the subjective evaluation results. This demonstrates that the method based on structural similarity is more consistent with the human viewing experience in terms of stereo video quality assessment. The consistency of the result obtained by the objective evaluation method VIF _ VQM and the subjective evaluation result is the worst, and the result obtained by the method MS _ VQM and the subjective evaluation result have the best consistency; the evaluation performance of the objective evaluation methods PSNR _ VQM and VSNR _ VQM is slightly better than that of VIF _ VQM, while the evaluation performance of the objective evaluation methods SSIM _ VQM and MSSIM _ VQM is better than that of the objective evaluation methods PSNR _ VQM and VSNR _ VQM, and is only inferior to that of the objective quality evaluation method MS _ VQM mentioned herein. Through the performance comparison of the various stereoscopic video quality evaluation methods, the method based on the structural similarity is closer to subjective evaluation made by human beings in the stereoscopic video quality evaluation aspect, and in the stereoscopic video quality evaluation method based on the structural similarity, the evaluation result of MS _ VQM provided by the invention is closest to the subjective evaluation result of human beings, which shows that the region segmentation method based on the motion significance is beneficial to improving the reliability and the accuracy of the stereoscopic video quality evaluation method, and simultaneously shows that the selection of a proper binocular fusion method is important for improving the performance of the stereoscopic video quality evaluation method.
Description of the drawings:
Fig. 1 is a flow chart of the stereoscopic video quality evaluation method based on motion saliency and binocular fusion.
Fig. 2 airplane motion vector diagram.
Fig. 3 airplane motion saliency region diagram.
Fig. 4 ballroom motion vector diagram.
Fig. 5 ballroom motion saliency region diagram.
Fig. 6 woshou motion vector diagram.
Fig. 7 woshou motion saliency region diagram.
Fig. 8 airplane salient region segmentation diagram.
Fig. 9 airplane salient region distribution diagram.
Fig. 10 ballroom salient region segmentation diagram.
Fig. 11 ballroom salient region distribution diagram.
Fig. 12 woshou salient region segmentation diagram.
Fig. 13 woshou salient region distribution diagram.
Fig. 14 airplane QP = 40 SSIM quality distribution map.
Fig. 15 ballroom QP = 44 SSIM quality distribution map.
Fig. 16 woshou QP = 36 SSIM quality distribution map.
Fig. 17 VIF_VQM scatter plot.
Fig. 18 PSNR_VQM scatter plot.
Fig. 19 VSNR_VQM scatter plot.
Fig. 20 SSIM_VQM scatter plot.
Fig. 21 MSSIM_VQM scatter plot.
Fig. 22 MS_VQM scatter plot.
Detailed Description
The invention provides a stereoscopic video quality evaluation method based on motion saliency and binocular fusion. First, the image is segmented into regions according to motion saliency, and each segmented region is assigned a saliency weight according to the magnitude of its motion. Second, the segmented regions are evaluated with the SSIM image quality metric to obtain local quality, and the local quality of each region is weighted by its motion-saliency weight to obtain the global quality of the single frame. Finally, the left and right viewpoints are assigned corresponding weights by combining the binocular rivalry and binocular suppression phenomena of the human visual system, yielding the objective quality of the stereoscopic video.
The method comprises the following specific steps:
First, motion information is estimated from two adjacent frames (frame t and frame t+1) of the left (or right) viewpoint of the reference stereoscopic video, and this motion information guides the region segmentation of the corresponding single frame (frame t), yielding a motion-saliency weight map for that frame. Then the corresponding frame (frame t) of the left (or right) viewpoint of the distorted stereoscopic video is evaluated with the 2D image quality metric SSIM, yielding a quality map of that frame. The quality map is then weighted by the motion-saliency weight map obtained in the previous step, giving the overall quality of the frame. Next, the qualities of the left and right views of the stereoscopic video are fused using the binocular rivalry and binocular suppression characteristics of the human visual system (HVS) to obtain the quality of a single stereoscopic frame. Finally, the qualities of all stereoscopic frames of the video are averaged to obtain the overall quality of the stereoscopic video.
1. Region segmentation based on motion saliency
We all have this experience in daily life: when observing our surroundings, we notice moving objects first, and the faster they move, the more attention they attract; this is a result of long-term human evolution. Motion information is therefore crucial for stereoscopic video quality assessment. The invention first uses the classical optical flow motion estimation algorithm[18] to estimate the motion information between adjacent frames. The algorithm was proposed by Horn and Schunck in 1981[18] and computes optical flow motion vectors under the assumptions of small motion and constant brightness.
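A minimal sketch of the Horn-Schunck iteration cited as [18] may help make the estimation step concrete. This is an illustrative NumPy version with periodic boundary handling and simple finite-difference derivatives, not the exact implementation used in the patent:

```python
import numpy as np

def horn_schunck(im1, im2, alpha=1.0, n_iter=100):
    """Minimal Horn-Schunck optical flow: brightness constancy plus a
    smoothness term, solved by Jacobi-style iterations."""
    im1 = im1.astype(np.float64)
    im2 = im2.astype(np.float64)
    # spatial derivatives averaged over the two frames; temporal derivative
    Ix = 0.5 * (np.gradient(im1, axis=1) + np.gradient(im2, axis=1))
    Iy = 0.5 * (np.gradient(im1, axis=0) + np.gradient(im2, axis=0))
    It = im2 - im1
    u = np.zeros_like(im1)
    v = np.zeros_like(im1)

    def avg(f):
        # 4-neighbour average with periodic (wrap-around) boundaries
        return 0.25 * (np.roll(f, 1, 0) + np.roll(f, -1, 0)
                       + np.roll(f, 1, 1) + np.roll(f, -1, 1))

    for _ in range(n_iter):
        u_bar, v_bar = avg(u), avg(v)
        num = Ix * u_bar + Iy * v_bar + It
        den = alpha ** 2 + Ix ** 2 + Iy ** 2
        u = u_bar - Ix * num / den
        v = v_bar - Iy * num / den
    return u, v
```

The smoothness weight `alpha` and the iteration count are illustrative defaults; larger `alpha` produces smoother flow fields.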
Figs. 2, 4 and 6 show the motion vector diagrams of the first frame of the left viewpoint of the stereoscopic videos airplane, ballroom and woshou, respectively. The blue arrows represent the motion vectors of the corresponding pixel regions: the direction of an arrow gives the motion direction of the region, and its length gives the magnitude of the motion, i.e. the degree of motion saliency. Most regions of the picture have very little motion, with motion amounts close to zero, and such regions are rarely noticed by the viewer; we call these easily overlooked low-motion regions non-salient regions. In contrast, a few regions of the picture have a relatively large amount of motion and tend to attract most of the viewer's attention; we call these easily noticed high-motion regions salient regions. Figs. 3, 5 and 7 show the motion saliency regions of airplane, ballroom and woshou, respectively, where black areas represent non-salient regions and colored areas represent salient regions.
As Figs. 2, 4 and 6 show, the motion vectors of adjacent areas within the same picture are discontinuous in both direction and magnitude, and different parts of the same object have different motion vectors. For example, in airplane, the textured region of the aircraft nose has large motion vectors while the smooth region of the nose has motion vectors close to zero. In woshou, part of the background wall has large motion vectors and part has small ones, with the variation in vector strength randomly and non-uniformly distributed. Both observations reflect the instability of the optical flow motion estimation algorithm, and the same phenomenon appears in Figs. 3, 5 and 7: the aircraft nose in Fig. 3 should belong to the salient region, yet parts of it are labeled non-salient, while parts of the background that should be non-salient are randomly labeled salient; the same holds in Figs. 5 and 7. These phenomena clearly contradict human perception, under which the same part of the same object should have the same or similar motion amount and direction. They are caused by the instability of the optical flow motion estimation, and they degrade the accuracy and reliability of the stereoscopic video quality evaluation method. Improving the extraction of motion saliency regions is therefore important for improving its performance. The region segmentation method based on motion saliency can produce a motion saliency image consistent with human cognition; the proposed method is discussed in detail below.
Although the optical flow motion estimation algorithm cannot by itself distinguish different objects in the image by their motion amounts, it does reflect, to a certain extent, the motion amount of the areas where objects are located. Therefore, to obtain a motion saliency image that conforms to human cognition and reflects different motion amounts, it suffices to segment the image into objects along their edge contours and then compute the motion amount of the area occupied by each object. The proposed region segmentation method based on motion saliency follows this idea and is implemented as follows: first, the motion vector of each pixel in the image is computed with the optical flow motion estimation algorithm; then, the edge contours of the objects in the image are extracted with an edge extraction algorithm; finally, for each region enclosed by an object's edge contour, the mean motion magnitude over all pixels in the region is computed and taken as the motion amount of that object.
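The three-step segmentation described above can be sketched as follows. This is an illustrative version under stated assumptions: the flow field is taken as given (from any optical flow estimator), a simple Sobel gradient threshold stands in for the unspecified edge extraction algorithm, and `edge_thresh` is a hypothetical parameter:

```python
import numpy as np
from scipy import ndimage

def motion_saliency_map(flow_u, flow_v, image, edge_thresh=30.0):
    """Assign each contour-enclosed region its mean motion magnitude.

    flow_u, flow_v : per-pixel optical flow components
    image          : luminance image used for edge extraction
    """
    mag = np.hypot(flow_u, flow_v)              # per-pixel motion amount
    gx = ndimage.sobel(image.astype(np.float64), axis=1)
    gy = ndimage.sobel(image.astype(np.float64), axis=0)
    edges = np.hypot(gx, gy) > edge_thresh      # crude object contours
    regions, n = ndimage.label(~edges)          # regions between contours
    out = np.zeros_like(mag)                    # contour pixels stay 0
    for lab in range(1, n + 1):
        mask = regions == lab
        out[mask] = mag[mask].mean()            # region-wise mean motion
    return out
```

Replacing the per-pixel flow magnitudes with region-wise means is exactly what removes the random speckle the text describes: every pixel of the same object receives one motion amount.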
Figs. 8, 10 and 12 show the motion-saliency-based region segmentation maps of airplane, ballroom and woshou, respectively, and Figs. 9, 11 and 13 the corresponding motion saliency region distribution maps. In Figs. 8, 10 and 12, brighter regions have larger motion amounts, are more salient and more easily noticed by the viewer, and should be given larger weights; darker regions have less motion, are less salient and less likely to be noticed, and should be given smaller weights.
Figs. 9, 11 and 13 are the motion saliency region distribution maps corresponding to Figs. 8, 10 and 12, respectively: the more vivid the color, the stronger the corresponding motion saliency; conversely, the darker the color (down to black), the weaker the motion saliency. The segmentation and distribution maps show that the motion salient regions now reflect the shape and contour of the objects well: the distribution of salient regions is no longer random but follows the relative motion amounts of the objects in the image, and the motion amount of each object is expressed by a different brightness, giving the image a strong sense of layering. The motion-saliency-based region segmentation method thus helps to improve the reliability and accuracy of the stereoscopic video quality evaluation method.
2. Quality structural similarity (SSIM)
Although SSIM is a quality metric for 2D images only and cannot directly evaluate the quality of a stereoscopic video, the basic building block of a stereoscopic video, i.e. each individual frame, is itself a 2D image. Based on this fact, we propose a bottom-up stereoscopic video quality evaluation method in which the quality score of each frame of the stereoscopic video is obtained with SSIM.
SSIM evaluates the degree of similarity of two images, which is measured by three indicators: luminance similarity, contrast similarity, and structural similarity.
The brightness similarity is calculated by the following formula:
$$L(X,Y) = \frac{2\,u_X u_Y + C_1}{u_X^2 + u_Y^2 + C_1} \tag{1}$$
In equation (1), X and Y denote a local pixel block of the reference image and the corresponding local pixel block of the distorted image, and u_X and u_Y denote the means of those blocks. When the denominator u_X^2 + u_Y^2 is very small, the stability of the result suffers; to prevent this, a constant C_1 is introduced. Here we set C_1 = (K_1·L)^2 with K_1 = 0.01 and L = 255.
The contrast similarity is calculated by the following formula:
$$C(X,Y) = \frac{2\,\sigma_X \sigma_Y + C_2}{\sigma_X^2 + \sigma_Y^2 + C_2} \tag{2}$$
As in equation (1), X and Y in equation (2) denote local pixel blocks of the reference and distorted images, and σ_X and σ_Y denote the standard deviations of those blocks. The constant C_2 plays the same role as C_1; here we set C_2 = (K_2·L)^2 with K_2 = 0.01 and L = 255.
The structural similarity is calculated by the following formula:
$$S(X,Y) = \frac{\sigma_{XY} + C_3}{\sigma_X \sigma_Y + C_3} \tag{3}$$
Likewise, X and Y in equation (3) denote local pixel blocks of the reference and distorted images, σ_X and σ_Y their standard deviations, and σ_XY the covariance of the two blocks. The constant C_3 plays the same role as C_1; here we set C_3 = C_2 / 2. The final SSIM quality score is computed by the following equation:
$$SSIM(X,Y)=[L(X,Y)]^{\alpha}\cdot[C(X,Y)]^{\beta}\cdot[S(X,Y)]^{\gamma}\qquad(4)$$
where α, β, and γ are non-negative constants that adjust the relative importance of the three indicators (brightness similarity, contrast similarity, and structural similarity); a typical setting is α = β = γ = 1, under which equation (4) reduces to the following form:
$$SSIM(X,Y)=\frac{(2u_X u_Y + C_1)(2\sigma_{XY} + C_2)}{(u_X^2 + u_Y^2 + C_1)(\sigma_X^2 + \sigma_Y^2 + C_2)}\qquad(5)$$
To obtain the SSIM value at pixel position (X, Y), an 11 × 11 pixel area centered on (X, Y) is selected at the corresponding position in the reference image and in the distorted image, and equation (5) is evaluated on these two blocks; the calculation window then slides pixel by pixel until SSIM values are obtained for all pixels. Formulas (1), (2), and (3) show that when the two images are identical, L(X, Y), C(X, Y), and S(X, Y) all reach the maximum value 1, so SSIM reaches its maximum value 1. As the degree of impairment of the distorted image increases, L(X, Y), C(X, Y), and S(X, Y) decrease and SSIM decreases accordingly; the value range of SSIM is 0 to 1.
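The sliding-window computation described above can be sketched as follows. This is a minimal NumPy sketch, not the patent's implementation: it uses a uniform 11 × 11 window (classic SSIM uses a Gaussian window), sets K_1 = K_2 = 0.01 as in the text (the standard SSIM choice is K_2 = 0.03), and produces the map only at fully valid window positions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def ssim_map(ref, dist, win=11, K1=0.01, K2=0.01, L=255):
    # Per-position SSIM map via equation (5). An 11x11 window slides
    # over every valid position of both images.
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    ref = ref.astype(np.float64)
    dist = dist.astype(np.float64)

    def lm(img):  # local mean over each win x win block
        return sliding_window_view(img, (win, win)).mean(axis=(2, 3))

    mu_x, mu_y = lm(ref), lm(dist)
    s_x2 = lm(ref * ref) - mu_x ** 2      # local variance, reference
    s_y2 = lm(dist * dist) - mu_y ** 2    # local variance, distorted
    s_xy = lm(ref * dist) - mu_x * mu_y   # local covariance
    return ((2 * mu_x * mu_y + C1) * (2 * s_xy + C2)) / \
           ((mu_x ** 2 + mu_y ** 2 + C1) * (s_x2 + s_y2 + C2))
```

For identical images the map is everywhere 1; as distortion grows, values fall toward 0, matching the behavior of formulas (1)-(3).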
Three-dimensional video quality Q
As shown in fig. 1, if the size of a single frame image of the stereoscopic video is P × Q, then the motion saliency weight distribution of the frame image and its SSIM quality distribution are both P × Q matrices, and the values at position (x, y) in these matrices correspond to the motion saliency weight and the SSIM quality value of the pixel at position (x, y) in the frame image. The overall quality Q_tl and Q_tr of the t-th frame image of the left and right viewpoints is calculated by the following equations:
$$Q_{tl}=Q_l(t)=\frac{\sum_{x=1}^{P}\sum_{y=1}^{Q}MS_l(x,y,t)\,SSIM_l(x,y,t)}{\sum_{x=1}^{P}\sum_{y=1}^{Q}MS_l(x,y,t)}\qquad(6)$$

$$Q_{tr}=Q_r(t)=\frac{\sum_{x=1}^{P}\sum_{y=1}^{Q}MS_r(x,y,t)\,SSIM_r(x,y,t)}{\sum_{x=1}^{P}\sum_{y=1}^{Q}MS_r(x,y,t)}\qquad(7)$$
Q_l(t) and Q_r(t) in equations (6) and (7) represent the overall quality Q_tl and Q_tr of the t-th frame image of the left and right viewpoints, respectively; MS_l(x, y, t) and MS_r(x, y, t) represent the motion saliency weights at position (x, y) of the t-th frame image of the left and right viewpoints, respectively; SSIM_l(x, y, t) and SSIM_r(x, y, t) represent the SSIM scores at position (x, y) of the t-th frame image of the left and right viewpoints, respectively.
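The saliency-weighted pooling of equations (6) and (7) can be sketched as below. The normalization by the total weight is an assumption on my part, since the original equations are rendered as images in the source; it is the usual form for weighted pooling of a quality map.

```python
import numpy as np

def frame_quality(ms_weight, ssim_vals):
    # One viewpoint, one frame: pool the per-pixel SSIM map with the
    # motion-saliency weight map (equations (6)/(7)). Weighted average,
    # normalized by the total saliency weight (assumed normalization).
    ms_weight = np.asarray(ms_weight, dtype=float)
    ssim_vals = np.asarray(ssim_vals, dtype=float)
    return float((ms_weight * ssim_vals).sum() / ms_weight.sum())
```

With uniform weights this reduces to the plain mean of the SSIM map; concentrated weights let salient moving regions dominate the frame score.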
Distorted stereoscopic video is mostly produced by asymmetric compression coding, i.e., different quantization parameters are applied to the left and right viewpoints under H.264 coding, which saves bandwidth and improves coding efficiency [20]. The most important feature of stereoscopic video compared with 2D video is the difference between the left and right viewpoints; it is precisely this difference that enables stereoscopic video to give viewers a sense of depth. Numerous subjective experiments have shown that when the quality of the left and right viewpoints differs greatly, human perception is easily dominated by the more severely distorted viewpoint [19]. Therefore, when evaluating the quality of distorted stereoscopic video obtained by asymmetric coding, a high weight should be given to the viewpoint with heavier distortion and a low weight to the viewpoint with lighter distortion. Image sharpness reflects the degree of distortion well: the more severe the distortion, the lower the sharpness, and vice versa. The two-dimensional spatial frequency in turn reflects image sharpness: the sharper the image, the larger its spatial frequency [21]. Hence, of the left and right viewpoints, the one with the larger two-dimensional spatial frequency is less distorted and should receive the smaller weight; conversely, the smaller the two-dimensional spatial frequency, the greater the distortion, and the greater the weight that should be given. Finally, combining the binocular rivalry and binocular suppression characteristics of the human visual system, the quality of the t-th frame stereoscopic image in the stereoscopic video is calculated by the following equation:
$$Q(t)=\frac{F_r(t)\,Q_l(t) + F_l(t)\,Q_r(t)}{F_l(t) + F_r(t)}\qquad(8)$$
In equation (8), Q(t) is the quality of the t-th frame stereoscopic image, and F_l(t) and F_r(t) are the two-dimensional spatial frequencies of the t-th frame of the left and right viewpoints of the stereoscopic video, respectively. The two-dimensional spatial frequency of an image is calculated by the following equations:
$$F=\sqrt{F_h^2 + F_v^2}\qquad(9)$$

$$F_h=\sqrt{\frac{1}{PQ}\sum_{p=1}^{P}\sum_{q=2}^{Q}\left[I(p,q)-I(p,q-1)\right]^2}\qquad(10)$$

$$F_v=\sqrt{\frac{1}{PQ}\sum_{p=2}^{P}\sum_{q=1}^{Q}\left[I(p,q)-I(p-1,q)\right]^2}\qquad(11)$$
In equations (9), (10), and (11), F is the two-dimensional spatial frequency of the image, F_h and F_v are the row frequency (horizontal component of the spatial frequency) and the column frequency (vertical component of the spatial frequency), P and Q are the numbers of rows and columns of the image, respectively, and I(p, q) is the luminance value of the pixel in the p-th row and q-th column of the image.
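Equations (9)-(11) amount to root-mean-square horizontal and vertical neighbor differences. A direct sketch:

```python
import numpy as np

def spatial_frequency(img):
    # Equations (9)-(11): row frequency Fh from horizontal neighbor
    # differences, column frequency Fv from vertical neighbor
    # differences, combined as F = sqrt(Fh^2 + Fv^2).
    I = np.asarray(img, dtype=float)
    P, Q = I.shape
    Fh = np.sqrt(((I[:, 1:] - I[:, :-1]) ** 2).sum() / (P * Q))
    Fv = np.sqrt(((I[1:, :] - I[:-1, :]) ** 2).sum() / (P * Q))
    return np.sqrt(Fh ** 2 + Fv ** 2), Fh, Fv
```

A constant image yields F = 0, while a maximally alternating 0/255 checkerboard yields F = 255, the extreme of "sharpness" under this measure.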
As shown in fig. 1, the overall quality of the stereoscopic video is obtained by averaging the quality of all frames.
$$Q=\frac{1}{T}\sum_{t=1}^{T}Q(t)\qquad(12)$$

where T is the total number of frames in the stereoscopic video.
The final stereoscopic video quality is obtained by the calculation of equation (12).
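Equations (8) and (12) together can be sketched as below. The cross-weighting form of equation (8) — the left view's quality weighted by the right view's spatial frequency and vice versa, so the viewpoint with the lower spatial frequency (heavier distortion) gets the larger weight — is inferred from the surrounding text, since the equation itself is an image in the source.

```python
def stereo_video_quality(Ql, Qr, Fl, Fr):
    # Ql, Qr: per-frame qualities of the left and right viewpoints
    # (from equations (6)/(7)); Fl, Fr: per-frame two-dimensional
    # spatial frequencies (equations (9)-(11)).
    per_frame = [
        # Equation (8), assumed cross-weighted form: a heavily
        # distorted view (small F) makes the other weight dominate,
        # pulling Q(t) toward the distorted view's quality.
        (fr * ql + fl * qr) / (fl + fr)
        for ql, qr, fl, fr in zip(Ql, Qr, Fl, Fr)
    ]
    # Equation (12): average over all frames.
    return sum(per_frame) / len(per_frame)
```

For equal spatial frequencies this reduces to the plain average of the two viewpoint qualities, and as one view's frequency goes to zero the score converges to that view's quality, matching the binocular suppression rationale.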
The present invention will be described in further detail with reference to the accompanying drawings and specific examples.
Three stereoscopic video sequences, airplan.yuv (resolution 480 × 270), balloon.yuv (resolution 640 × 480), and woshou.yuv (resolution 512 × 384), are selected. Using the H.264 standard, single-viewpoint compression at 7 levels is applied to the left and right viewpoints of each original video; the distorted left and right viewpoints are then fused into stereoscopic videos, yielding 3 × 7 × 7 = 147 distorted stereoscopic videos. Adding the three reference stereoscopic videos (the original stereoscopic videos) gives 150 stereoscopic videos in total.
Figs. 14-16 show the SSIM quality distribution maps of one frame of a distorted stereoscopic video corresponding to airplan, ballrom, and woshou, respectively. Fig. 14 is the single-frame SSIM quality distribution map for the stereoscopic video airplan at quantization parameter QP = 40; fig. 15 is that for ballrom at QP = 44; fig. 16 is that for woshou at QP = 36. The distorted stereoscopic videos considered in the invention are obtained by H.264 compression coding, where QP refers to an important parameter of the compression coding system, the quantization parameter. A larger QP value means a larger quantization step size and more severe distortion; conversely, a smaller QP value means a smaller quantization step size and less distortion. Previous research has shown that when the quantization parameter QP is less than 24, human eyes cannot perceive a reduction in stereoscopic video quality, and when QP is greater than 48, human eyes can hardly recognize the content of the stereoscopic scene [19]. Therefore, the quantization parameters of the distorted stereoscopic videos are in the range 24-48.
In the SSIM quality distribution maps of figs. 14-16, white represents a high quality score, indicating that the region is only slightly distorted; black represents a low quality score, indicating that the region is severely distorted. The SSIM quality distribution maps show that distortion is mainly concentrated in the edge contours and complex-texture regions of objects, while smooth regions show relatively little distortion. This indicates that picture distortion is strongly correlated with the shapes of objects in the scene, so a quality evaluation method based on scene segmentation is more reasonable.
In the aspect of objective quality evaluation of stereoscopic video, most conventional stereoscopic video quality evaluation methods are improvements on 2D image quality evaluation methods: a 2D image quality evaluation method is applied directly to each frame image of a single viewpoint in the stereoscopic video, and the 2D qualities of all frames are then averaged to obtain the final stereoscopic video quality. Typical 2D image quality evaluation methods include visual information fidelity (VIF), peak signal-to-noise ratio (PSNR), visual signal-to-noise ratio (VSNR), structural similarity (SSIM), multi-scale structural similarity (MSSIM), and so on. The invention applies each of these 2D image quality evaluation methods to every frame image of the left and right viewpoints of the stereoscopic video, averages the frame qualities of each viewpoint to obtain the left- and right-viewpoint qualities, and finally takes the average of the two viewpoint qualities as the final stereoscopic video quality. The results of these five stereoscopic video quality evaluation methods serve as comparison data, named PSNR_VQM, VSNR_VQM, SSIM_VQM, MSSIM_VQM, and VIF_VQM, respectively. Together with the proposed objective stereoscopic video quality evaluation method MS_VQM, six different stereoscopic video quality evaluation methods are involved. Figs. 17-22 show scatter diagrams between the results of the six objective evaluation methods and the subjective evaluation results.
As can be seen from fig. 17-22: the MS _ VQM, SSIM _ VQM and MSSIM _ VQM methods have better consistency with the subjective evaluation results. This demonstrates that the method based on structural similarity is more consistent with the human viewing experience in terms of stereo video quality assessment.
The invention uses the Pearson correlation coefficient (PCC), the Spearman rank correlation coefficient (SPCC), and the root mean square error (RMSE) to measure the consistency between subjective and objective evaluation results. The Pearson correlation coefficient, Spearman rank correlation coefficient, and root mean square error between the objective quality score obtained by each objective quality evaluation method and the MOS value obtained by the subjective evaluation method are shown in Table 1.
Table 1: Pearson correlation coefficient, Spearman rank correlation coefficient, and root mean square error between each objective method's quality scores and the subjective MOS values.
As can be seen from Table 1, the result obtained by the objective evaluation method VIF_VQM has the worst consistency with the subjective evaluation results, while the result obtained by MS_VQM has the best. The evaluation performance of PSNR_VQM and VSNR_VQM is slightly better than that of VIF_VQM, while SSIM_VQM and MSSIM_VQM perform better than PSNR_VQM and VSNR_VQM and are second only to the proposed objective quality evaluation method MS_VQM. This performance comparison shows that methods based on structural similarity come closer to human subjective evaluation of stereoscopic video quality, and that among the structural-similarity-based methods, the evaluation result of the proposed MS_VQM is closest to human subjective evaluation. This indicates that the region segmentation method based on motion saliency helps improve the reliability and accuracy of stereoscopic video quality evaluation, and also that selecting a suitable binocular fusion method is important for improving the performance of a stereoscopic video quality evaluation method.
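The three consistency measures can be computed from their textbook definitions with plain NumPy, as in the sketch below (the Spearman coefficient is implemented as the Pearson correlation of rank vectors, without tie handling).

```python
import numpy as np

def consistency_metrics(objective, mos):
    # PCC, Spearman rank correlation, and RMSE between objective quality
    # scores and subjective MOS values.
    x = np.asarray(objective, dtype=float)
    y = np.asarray(mos, dtype=float)
    pcc = np.corrcoef(x, y)[0, 1]
    # Spearman: Pearson correlation of the rank vectors (this sketch
    # ignores ties, which rarely occur among continuous scores)
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    spcc = np.corrcoef(rx, ry)[0, 1]
    rmse = np.sqrt(np.mean((x - y) ** 2))
    return pcc, spcc, rmse
```

Higher PCC/SPCC and lower RMSE mean better agreement with subjective scores; a monotone but nonlinear mapping between objective and subjective scores keeps SPCC at 1 while lowering PCC, which is why both are reported.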
Reference to the literature
[1] Kalpana Seshadrinathan, Rajiv Soundararajan, Alan C. Bovik, Lawrence K. Cormack, "Study of Subjective and Objective Quality Assessment of Video", IEEE Transactions on Image Processing, Vol. 19, Issue 6, 2010.
[2] Wang, Z., A. C. Bovik, H. R. Sheikh, et al. "Image quality assessment: from error visibility to structural similarity." IEEE Transactions on Image Processing, 2004, 13(4): pp. 600-612.
[3] Wang, Z. and A. C. Bovik. "Mean squared error: love it or leave it? A new look at signal fidelity measures." IEEE Signal Processing Magazine, 2009, 26(1): pp. 98-117.
[4] Chandler, D. M., M. A. Masry, and S. S. Hemami. "Quantifying the visual quality of wavelet-compressed images based on local contrast, visual masking, and global precedence." Signals, Systems and Computers, 2003. IEEE Press, 2003, pp. 1393-1397, Vol. 2.
[5] Chandler, D. M. and S. S. Hemami. "VSNR: A Wavelet-Based Visual Signal-to-Noise Ratio for Natural Images". IEEE Transactions on Image Processing, 2007, 16(9): pp. 2284-2298.
[6] Chandler, D. and S. Hemami. Online supplement to "VSNR: A visual signal-to-noise ratio for natural images based on near-threshold and suprathreshold vision". Retrieved July 2010, 15(3): pp. 12-17.
[7] Chandler, D. M. and S. S. Hemami. "Effects of natural images on the detectability of simple and compound wavelet subband quantization distortions". J. Opt. Soc. Am. A, 2003, 20(7): pp. 1164-1180.
[8] Sheikh, H. R., A. C. Bovik, and G. de Veciana. "An information fidelity criterion for image quality assessment using natural scene statistics." IEEE Transactions on Image Processing, 2005, 14(12): pp. 2117-2128.
[9] Sheikh, H. R. and A. C. Bovik. "Image information and visual quality." IEEE Transactions on Image Processing, 2006, 15(2): pp. 430-444.
[10] Wang, Z. and A. C. Bovik. "A Universal Image Quality Index." IEEE Signal Processing Letters, 2002, 9(3): pp. 81-84.
[11] Wang, Z., A. C. Bovik, and L. Lu. "Why is image quality assessment so difficult?" Acoustics, Speech, and Signal Processing, 2002, 4(1): p. IV.
[12] Wang, Z., L. Lu, and A. C. Bovik. "Video quality assessment based on structural distortion measurement", 2004, 19(1): pp. 1-9.
[13] Lu Guo-qing, Li Jun-li, Chen Gang, et al. "Method of video quality assessment based on visual regions-of-interest". Computer Engineering, 2009, 35(10): pp. 217-219.
[14] Banitalebi-Dehkordi A, Pourazad M T, Nasiopoulos P (2013). "3D video quality metric for 3D video compression." 11th IEEE IVMSP Workshop: 3D Image/Video Technologies and Applications, Seoul.
[15] Banitalebi-Dehkordi A, Pourazad M T, Nasiopoulos P (January 2013). "A human visual system based 3D video quality metric", ISO/IEC JTC1/SC29/WG11, Doc. M27745, Geneva, Switzerland.
[16] Hanhart P, De Simone F, Ebrahimi T (2012). "Quality assessment of asymmetric stereo pair formed from decoded and synthesized views." In Quality of Multimedia Experience (QoMEX), 2012 Fourth International Workshop on, pp. 236-241.
[17] Jin L, Boev A, Pyykko S J, Haustola T, Gotchev A. "Novel stereo video quality metric", Mobile3DTV project report, available: http://sp.cs.tut.fi/mobile3dtv/results/
[18] Horn B K, Schunck B G. "Determining optical flow" [C]// 1981 Technical Symposium East. International Society for Optics and Photonics, 1981: pp. 319-331.
[19] Maden yang, prunus mume, maruzer, et al. "Objective evaluation of stereoscopic video quality based on motion and parallax information" [J]. Journal of Optoelectronics·Laser, 2013, 24(10): pp. 2002-2009.
[20] Nukhet Ozbek, Gizem Ertan, Oktay Karakus, "Perceptual quality evaluation of asymmetric stereo video coding for efficient 3D rate scaling", Turk J Elec Eng & Comp Sci, (2014) 22: pp. 663-678.
[21] Zhan, Anping, Zhang Qiweng, et al. "Binocular stereo video minimum recognizable distortion model and its application in quality evaluation" [J]. Journal of Electronics & Information Technology, 2012, 34(3): pp. 698-70.

Claims (3)

1. A stereoscopic video quality evaluation method based on motion saliency, characterized in that: corresponding motion information is estimated from two adjacent frames, the t-th frame and the (t+1)-th frame, of the left or right viewpoint of a reference stereoscopic video; the obtained motion information is used to guide region segmentation of the corresponding single t-th frame image, yielding a motion saliency weight distribution map of that frame; a 2D image quality evaluation method, structural similarity (SSIM), is then used to evaluate the corresponding t-th frame image of the left or right viewpoint of the distorted stereoscopic video, yielding a quality distribution map of that frame; the quality distribution of the frame is then weighted by the motion saliency weight distribution map obtained in the previous step to obtain the overall quality of the frame image; the binocular rivalry and binocular suppression characteristics of the human visual system (HVS) are then used to fuse the left- and right-viewpoint qualities of the stereoscopic video to obtain the quality of a single-frame stereoscopic image; finally, the stereoscopic image qualities of all frames of the stereoscopic video are averaged to obtain the overall quality of the stereoscopic video; specifically, if the size of a single frame image of the stereoscopic video is P × Q, the motion saliency weight distribution of the frame image and the SSIM quality distribution of the frame image are both P × Q matrices, and the values at position (x, y) in the matrices correspond to the motion saliency weight and the SSIM quality value of the pixel at position (x, y) in the frame image; the overall quality Q_tl and Q_tr of the t-th frame image of the left and right viewpoints is calculated by the following equations:
$$Q_{tl}=Q_l(t)=\frac{\sum_{x=1}^{P}\sum_{y=1}^{Q}MS_l(x,y,t)\,SSIM_l(x,y,t)}{\sum_{x=1}^{P}\sum_{y=1}^{Q}MS_l(x,y,t)}\qquad(6)$$

$$Q_{tr}=Q_r(t)=\frac{\sum_{x=1}^{P}\sum_{y=1}^{Q}MS_r(x,y,t)\,SSIM_r(x,y,t)}{\sum_{x=1}^{P}\sum_{y=1}^{Q}MS_r(x,y,t)}\qquad(7)$$
Q_l(t) and Q_r(t) in equations (6) and (7) represent the overall quality Q_tl and Q_tr of the t-th frame image of the left and right viewpoints, respectively; MS_l(x, y, t) and MS_r(x, y, t) represent the motion saliency weights at position (x, y) of the t-th frame image of the left and right viewpoints, respectively; SSIM_l(x, y, t) and SSIM_r(x, y, t) represent the SSIM scores at position (x, y) of the t-th frame image of the left and right viewpoints, respectively;
combining the characteristics of binocular competition and binocular inhibition in a human visual system, and finally calculating the quality of the t-th frame of stereo image in the stereo video by using the following equation:
$$Q(t)=\frac{F_r(t)\,Q_l(t) + F_l(t)\,Q_r(t)}{F_l(t) + F_r(t)}\qquad(8)$$
In equation (8), Q(t) is the quality of the t-th frame stereoscopic image, and F_l(t) and F_r(t) are the two-dimensional spatial frequencies of the t-th frame of the left and right viewpoints of the stereoscopic video, respectively; the two-dimensional spatial frequency of an image is calculated by the following equations:
$$F=\sqrt{F_h^2 + F_v^2}\qquad(9)$$

$$F_h=\sqrt{\frac{1}{PQ}\sum_{p=1}^{P}\sum_{q=2}^{Q}\left[I(p,q)-I(p,q-1)\right]^2}\qquad(10)$$

$$F_v=\sqrt{\frac{1}{PQ}\sum_{p=2}^{P}\sum_{q=1}^{Q}\left[I(p,q)-I(p-1,q)\right]^2}\qquad(11)$$
In equations (9), (10), and (11), F is the two-dimensional spatial frequency of the image, F_h is the row frequency (horizontal component of the spatial frequency), F_v is the column frequency (vertical component of the spatial frequency), P and Q are the numbers of rows and columns of the image, respectively, and I(p, q) is the luminance value of the pixel in the p-th row and q-th column of the image;
$$Q=\frac{1}{T}\sum_{t=1}^{T}Q(t)\qquad(12)$$
the final stereoscopic video quality is obtained by the calculation of equation (12).
2. The method for evaluating stereoscopic video quality based on motion saliency according to claim 1, characterized in that the specific steps of region segmentation are: first, the motion vector corresponding to each pixel in the image is calculated using an optical flow motion estimation algorithm; then, the edge contours of the objects in the image are extracted using an edge extraction algorithm; finally, the mean of the motion amounts of all pixels within the region enclosed by each object's edge contour is calculated and taken as the motion amount of that object, and the different objects in the image are characterized by their motion amounts so as to perform segmentation.
3. The method as claimed in claim 1, wherein SSIM evaluates the degree of similarity between two images, measured by three indicators: brightness similarity, contrast similarity, and structural similarity;
the brightness similarity is calculated by the following formula:
$$L(X,Y)=\frac{2u_X u_Y + C_1}{u_X^2 + u_Y^2 + C_1}\qquad(1)$$
X and Y in equation (1) represent a local pixel block of the reference image and a local pixel block of the distorted image, respectively; u_X and u_Y represent the mean of the local pixel block of the reference image and the mean of the local pixel block of the distorted image, respectively; C_1 is a constant;
the contrast similarity is calculated by the following formula:
$$C(X,Y)=\frac{2\sigma_X \sigma_Y + C_2}{\sigma_X^2 + \sigma_Y^2 + C_2}\qquad(2)$$
similarly, in equation (2), X and Y represent a local pixel block of the reference image and a local pixel block of the distorted image, respectively; σ_X and σ_Y represent the standard deviation of the local pixel block of the reference image and that of the distorted image, respectively; C_2 is a constant;
the structural similarity is calculated by the following formula:
$$S(X,Y)=\frac{\sigma_{XY} + C_3}{\sigma_X \sigma_Y + C_3}\qquad(3)$$
similarly, in equation (3), X and Y represent a local pixel block of the reference image and a local pixel block of the distorted image, respectively; σ_X and σ_Y represent the standard deviations of the local pixel blocks of the reference and distorted images, respectively, and σ_XY represents the covariance between the local pixel block of the reference image and that of the distorted image; C_3 is a constant; the final SSIM quality evaluation result is calculated by the following equation:
$$SSIM(X,Y)=[L(X,Y)]^{\alpha}\cdot[C(X,Y)]^{\beta}\cdot[S(X,Y)]^{\gamma}\qquad(4)$$
wherein α, β, and γ are non-negative constants that adjust the relative importance of the three indicators of brightness similarity, contrast similarity, and structural similarity; with the typical setting α = β = γ = 1, equation (4) becomes the following form:
$$SSIM(X,Y)=\frac{(2u_X u_Y + C_1)(2\sigma_{XY} + C_2)}{(u_X^2 + u_Y^2 + C_1)(\sigma_X^2 + \sigma_Y^2 + C_2)}\qquad(5)$$
the SSIM value at position (X, Y) is obtained by selecting an 11 × 11 pixel area centered on pixel (X, Y) at the corresponding positions in the reference image and the distorted image and evaluating equation (5); the calculation window then slides pixel by pixel to obtain SSIM values for all pixels; formulas (1), (2), and (3) show that when the two images are identical, L(X, Y), C(X, Y), and S(X, Y) reach the maximum value 1 so that SSIM reaches its maximum value 1; as the degree of impairment of the distorted image increases, L(X, Y), C(X, Y), and S(X, Y) decrease so that SSIM decreases accordingly; the value range of SSIM is 0 to 1.
CN201710100339.2A 2017-02-23 2017-02-23 Stereo video quality evaluation method based on motion significance Expired - Fee Related CN106875389B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710100339.2A CN106875389B (en) 2017-02-23 2017-02-23 Stereo video quality evaluation method based on motion significance


Publications (2)

Publication Number Publication Date
CN106875389A CN106875389A (en) 2017-06-20
CN106875389B true CN106875389B (en) 2020-06-30

Family

ID=59168658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710100339.2A Expired - Fee Related CN106875389B (en) 2017-02-23 2017-02-23 Stereo video quality evaluation method based on motion significance

Country Status (1)

Country Link
CN (1) CN106875389B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108566547A (en) * 2017-10-25 2018-09-21 央视国际网络无锡有限公司 A kind of method for evaluating video quality of optimization
CN108924542A (en) * 2018-05-24 2018-11-30 天津大学 Based on conspicuousness and sparsity without reference three-dimensional video quality evaluation method
CN109714593A (en) * 2019-01-31 2019-05-03 天津大学 Three-dimensional video quality evaluation method based on binocular fusion network and conspicuousness
CN110473200B (en) * 2019-08-22 2021-11-12 深圳大学 Full-reference video image quality evaluation method
CN112911281B (en) * 2021-02-09 2022-07-15 北京三快在线科技有限公司 Video quality evaluation method and device
CN115225961B (en) * 2022-04-22 2024-01-16 上海赛连信息科技有限公司 No-reference network video quality evaluation method and device
CN115209121B (en) * 2022-07-14 2024-03-15 江苏龙威中科技术有限公司 Full-range simulation system and method with intelligent integration function

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105049838A (en) * 2015-07-10 2015-11-11 天津大学 Objective evaluation method for compressing stereoscopic video quality

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8620092B2 (en) * 2010-03-04 2013-12-31 Hewlett-Packard Development Company, L.P. Determining similarity of two images
CN102170581B (en) * 2011-05-05 2013-03-20 天津大学 Human-visual-system (HVS)-based structural similarity (SSIM) and characteristic matching three-dimensional image quality evaluation method
CN103400378A (en) * 2013-07-23 2013-11-20 清华大学 Method for objectively evaluating quality of three-dimensional image based on visual characteristics of human eyes
CN103780895B (en) * 2014-01-16 2015-11-04 天津大学 A kind of three-dimensional video quality evaluation method
CN104811691B (en) * 2015-04-08 2017-07-21 宁波大学 A kind of stereoscopic video quality method for objectively evaluating based on wavelet transformation
CN105049835A (en) * 2015-05-22 2015-11-11 天津大学 Perceived stereoscopic image quality objective evaluation method
CN105959684B (en) * 2016-05-26 2019-05-14 天津大学 Stereo image quality evaluation method based on binocular fusion

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105049838A (en) * 2015-07-10 2015-11-11 天津大学 Objective evaluation method for compressing stereoscopic video quality

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于双目融合与竞争的无参考立体图像质量评价方法;何美伶 等;《宁波大学学报(理工版)》;20161031;第29卷(第4期);参见第51页 *

Also Published As

Publication number Publication date
CN106875389A (en) 2017-06-20

Similar Documents

Publication Publication Date Title
CN106875389B (en) Stereo video quality evaluation method based on motion significance
Xu et al. Assessing visual quality of omnidirectional videos
Cao et al. Visual quality of compressed mesh and point cloud sequences
Tian et al. NIQSV+: A no-reference synthesized view quality assessment metric
CN103763552B (en) Stereoscopic image non-reference quality evaluation method based on visual perception characteristics
CN103152600B (en) Three-dimensional video quality evaluation method
CN104243973B (en) Video perceived quality non-reference objective evaluation method based on areas of interest
CN105049838B (en) Objective evaluation method for compressing stereoscopic video quality
Appina et al. Study of subjective quality and objective blind quality prediction of stereoscopic videos
CN103780895B (en) A kind of three-dimensional video quality evaluation method
Wang et al. Quaternion representation based visual saliency for stereoscopic image quality assessment
CN102523477A (en) Stereoscopic video quality evaluation method based on binocular minimum discernible distortion model
Han et al. Stereoscopic video quality assessment model based on spatial-temporal structural information
Tsai et al. Quality assessment of 3D synthesized views with depth map distortion
Fang et al. Perceptual quality assessment for asymmetrically distorted stereoscopic video by temporal binocular rivalry
Wan et al. Depth perception assessment of 3D videos based on stereoscopic and spatial orientation structural features
Kim et al. Depth perception and motion cue based 3D video quality assessment
Jakhetiya et al. Distortion specific contrast based no-reference quality assessment of DIBR-synthesized views
Yang et al. Latitude and binocular perception based blind stereoscopic omnidirectional image quality assessment for VR system
Dedhia et al. Saliency prediction for omnidirectional images considering optimization on sphere domain
Suen et al. Spatial-temporal visual attention model for video quality assessment
Zhao et al. No-reference objective stereo video quality assessment based on visual attention and edge difference
Ortiz-Jaramillo et al. A full reference video quality measure based on motion differences and saliency maps evaluation
CN110401832B (en) Panoramic video objective quality assessment method based on space-time pipeline modeling
Wang et al. Quality assessment for DIBR-synthesized images with local and global distortions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200630
