WO2010103112A1 - Method and apparatus for video quality measurement without reference - Google Patents

Method and apparatus for video quality measurement without reference

Info

Publication number
WO2010103112A1
Authority
WO
WIPO (PCT)
Prior art keywords
determining
video quality
artefacts
low level
stream information
Application number
PCT/EP2010/053211
Other languages
French (fr)
Inventor
Xiaodong Gu
Zhibo Chen
Debing Liu
Feng Xu
Original Assignee
Thomson Licensing
Application filed by Thomson Licensing filed Critical Thomson Licensing
Publication of WO2010103112A1 publication Critical patent/WO2010103112A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/0002 - Inspection of images, e.g. flaw detection
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 - ... using adaptive coding
    • H04N19/134 - ... characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/136 - Incoming video signal characteristics or properties
    • H04N19/14 - Coding unit complexity, e.g. amount of activity or edge presence estimation
    • H04N19/154 - Measured or subjectively estimated visual quality after decoding, e.g. measurement of distortion
    • H04N19/169 - ... characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 - ... the unit being an image region, e.g. an object
    • H04N19/176 - ... the region being a block, e.g. a macroblock
    • H04N19/189 - ... characterised by the adaptation method, adaptation tool or adaptation type used for the adaptive coding
    • H04N19/196 - ... being specially adapted for the computation of encoding parameters, e.g. by averaging previously computed encoding parameters
    • H04N19/44 - Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
    • H04N19/60 - ... using transform coding
    • H04N19/61 - ... using transform coding in combination with predictive coding
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30168 - Image quality inspection

Definitions

  • the features (blur, blockiness, noise, NR-PSNR) for each image in DS are extracted according to the method described in Section 1.
  • Ten random subjects (users) are involved in the user study; each is seated at a viewing distance of about three times the display height from the 60" display on which each image of DS is displayed.
  • the users are asked to give a MOS (Mean Opinion Score) for each displayed image, that is, to rate it on one of five quality levels.
  • Linear pooling is a basic strategy to fuse the artefacts based quality estimation scheme and NR-PSNR:
  • FUSE' = (1 - α)·ANN_Artefacts + α·NR-PSNR, where α is a constant, ANN_Artefacts is the artefacts based quality estimation model, and FUSE' is the model fusing ANN_Artefacts and NR-PSNR with the linear pooling strategy.
  • Fig.2 shows the correlation between the prediction accuracy of model FUSE' (Y axis) and the coefficient ⁇ (X axis) .
  • an ANN-based algorithm includes the following steps: first, video features are extracted from videos, and subjective assessments are performed on these videos to obtain MOS scores. These features and the corresponding MOS are used for the training step, in order to make the ANN learn knowledge of the "distortion" (in our study, 80% of the images in the dataset are used for training and the remaining 20% for testing). After the training, the model works as an NR model that assesses video quality like human observers in some respects. In the artefacts based quality estimation scheme based on ANN training, the selected features are the appointed artefacts: blur, blockiness, and noise in our study.
  • a generalization to this scheme is to make NR-PSNR another extracted feature in model training.
  • the model trained with the artefacts features is called ANN_Artefacts, while the model trained with both the artefacts features and NR-PSNR is called FUSE. According to the results reported in Tab.2, the quality estimation accuracy is improved by putting NR-PSNR into the feature space.
  • Fig.3 and Fig.4 show two examples of quality estimation, featuring the following parameters:
  • Fig.3 is of higher user perception, as shown by its higher MOS value.
  • NR-PSNR is selected as the representative of stream information for simplicity.
  • Other low level, basic stream information can be used instead.
  • Fig.5 shows a flow chart of the method for measuring video quality of a video received in a data stream S_in.
  • the first step is extracting 1 from the data stream low level stream information such as QP or bitrate.
  • the second step which can be done before, after or at the same time as the first step, is estimating 2 video quality QM2 based on artefacts. This may comprise determining blur 21, determining blockiness 22 and determining noise (PSNR) 23.
  • the third step is combining 3 the low level stream information and the artefacts based estimated video quality, wherein a video quality measure QM_out is obtained. Particular embodiments are described below.
  • the low level stream information comprises bitrate or quantization parameter (QP) .
  • the step of combining the low level stream information and the artefacts based estimated video quality comprises steps of calculating a NR-PSNR value from the low level stream information, and calculating a weighted sum of said NR-PSNR value and a value resulting from said artefacts based estimated video quality.
  • the weighted sum can be calculated according to
  • FUSE' = (1 - α)·ANN_Artefacts + α·NR-PSNR.
  • the calculating of a NR-PSNR value from low level stream information comprises de-quantizing received DCT coefficients of a block using at least QPs of the low level stream information, calculating a NR-PSNR value from the de-quantized DCT coefficients, determining a distribution of the de-quantized DCT coefficients in the block, and calculating a mean-square-error (MSE) of the determined distribution, wherein the mean-square-error represents said NR-PSNR value.
  • a dataset DS = {(NR-PSNR_i, blur_i, noise_i, blockiness_i, MOS_i) | i = 1, 2, ...}
  • determining blur comprises linewise scanning, detecting edges and measuring the width of the detected edges.
  • said measuring the width of the detected edges comprises determining start and end positions of an edge, wherein the width is given by the difference between the start and end positions.
  • the start and end positions of the detected edge are local extreme locations of luminance that are closest to the edge.
  • the edge width is used as the local blur measure for a current edge location, further comprising the step of averaging the local blur measures over all edge locations to obtain a global blur measure for the whole image.
  • the step of determining blockiness comprises determining a ratio between the difference at block boundary and the difference inside the block.
  • the step of determining noise comprises determining a luminance intensity variation of a block, e.g. by determining a standard deviation of luminance values within a block.
  • Fig.6 shows a block diagram of an apparatus 60 for video quality measurement of a video received in a data stream, according to one aspect of the invention.
  • the apparatus 60 for measuring video quality comprises extractor means 61 for extracting from the data stream S_in low level stream information, video quality estimator means 62 for estimating video quality QM2 based on artefacts, wherein the video quality estimator means 62 comprises first determining means 621 for determining blur, second determining means 622 for determining blockiness and third determining means 623 for determining noise, and first combiner means 63 for combining the low level stream information and the artefacts based estimated video quality, wherein a video quality measure QM_out is obtained.
  • Measures of the determined blur, noise and blockiness can be combined in a second combiner means 624, which results in a combined measure QM2 of blur, noise and blockiness.
  • the second combiner means 624 is optional.
  • the low level stream information comprises quantization parameter or bitrate.
  • the first combiner means 63 for combining the low level stream information and the artefacts based estimated video quality comprises first calculator means 631 for calculating an NR-PSNR value from the low level stream information, and second calculator means 632 for calculating a weighted sum of said NR-PSNR value and a value (such as ANN_Artefacts) resulting from said artefacts based estimated video quality.
  • the first calculator means 631 for calculating a NR-PSNR value from said low level stream information comprises a de-quantizer, and further distribution determining means 6311 for determining a distribution of de-quantized DCT coefficients in a block, and MSE calculator means 6312 for calculating a mean-square- error (MSE) of the determined distribution, wherein the mean-square-error is said NR-PSNR value.
  • the first determining means 621 for determining blur comprises scanning means for linewise scanning an image, an edge detector for detecting edges and a measurement module for measuring the width of the detected edges. In one embodiment, said measurement module determines start and end positions of an edge, and provides the width as being the difference between the start and end positions.
  • the start and end positions of the detected edge are local extreme locations of luminance that are closest to the edge.
  • the edge width is used as the local blur measure for a current edge location, and the measurement module further comprises averaging means for averaging the local blur measures over all edge locations, thus obtaining a global blur measure for the whole image.
  • the second determining means 622 for determining blockiness determines a ratio between the difference at block boundary and the difference inside the block.
  • the third determining means 623 for determining noise determines a luminance intensity variation within a block.
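The linear pooling strategy used in these embodiments (FUSE' combining ANN_Artefacts and NR-PSNR with the coefficient α) can be sketched in plain Python. This is only an illustrative sketch, not the patented implementation: it assumes both input scores have already been normalized to a common quality scale, and the helper `sweep_alpha` (a hypothetical name) mimics a Fig.2-style search for the best coefficient by Pearson correlation against labelled MOS values.

```python
def fuse(ann_artefacts, nr_psnr, alpha):
    """FUSE' = (1 - alpha)*ANN_Artefacts + alpha*NR-PSNR.
    Both inputs are assumed pre-normalized to a common quality scale."""
    return (1.0 - alpha) * ann_artefacts + alpha * nr_psnr

def sweep_alpha(ann_scores, psnr_scores, mos, steps=20):
    """Evaluate candidate alphas on a grid and return the one whose
    fused scores correlate best (Pearson) with the labelled MOS."""
    def pearson(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        va = sum((x - ma) ** 2 for x in a) ** 0.5
        vb = sum((y - mb) ** 2 for y in b) ** 0.5
        return cov / (va * vb) if va and vb else 0.0
    return max((k / steps for k in range(steps + 1)),
               key=lambda al: pearson(
                   [fuse(a, p, al) for a, p in zip(ann_scores, psnr_scores)],
                   mos))
```

In practice the sweep would run on a held-out set, since the optimal α depends on how reliable each of the two component scores is for the content at hand.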

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)
  • Image Analysis (AREA)

Abstract

This invention relates to video quality measurement without having the information of the original signal. The method for measuring video quality of a video received in a data stream comprises steps of extracting from the data stream low level stream information such as QP or bitrate, estimating video quality based on artefacts, comprising determining blur, determining blockiness and determining noise, and combining the low level stream information and the artefacts based estimated video quality, wherein a video quality measure is obtained.

Description

METHOD AND APPARATUS FOR VIDEO QUALITY MEASUREMENT WITHOUT REFERENCE
Field of the invention
This invention relates to a method and apparatus for video quality measurement.
Background
This invention relates to video quality measurement without having the information of the original signal. This is called "no-reference" (NR) , and it is an essential technique in applications such as video quality monitoring, adaptive coding & streaming, etc.
Due to the increasing transmission of digital video contents over broadband and wireless networks, quality monitoring of multimedia data is becoming an important matter. From a quality of experience perspective, it is desirable to evaluate content quality at the receivers. Since the original signals are usually not available at the receiver, quality scores should be provided without having knowledge of the original signals (no-reference metrics, NR) or using very limited information about them. The limited information may be transmitted through a side channel (reduced-reference metrics, RR) .
Engelke and Zepernick1 provided a detailed analysis of contemporary NR quality metrics. The task of NR quality metrics is very complex, as no information about the original, undistorted signal is available. Therefore, a NR method is an absolute measure of features and properties (such as blockiness and blur) in the distorted signal, which have to
1 U. Engelke, H. Zepernick: "Perceptual-based Quality Metrics for Image and Video Services: A Survey", 3rd EURO-NGI Conference on Next Generation Internet Networks, 2007, p.190-197 be related to perceived quality. And then the perceptual quality prediction is achieved by quantifying different features and combining them in a certain way. The weights for feature quantification are often derived from subjective experiments to find better accordance to perceived quality2'3.
There are recent works [4][5] on Peak Signal-to-Noise Ratio (PSNR) estimation based on stream information, without the presence of the original signal (NR-PSNR). The NR-PSNR is estimated from the statistical properties of the DCT coefficients, with the coding parameters inferred from the decoded pictures. Others [6] take encoding parameters into account, which does not reveal the current transmission characteristics.
Summary of the Invention
Being a statistical analysis of the video stream, NR-PSNR does not take a Human Visual System (HVS) model into account for video quality estimation. It has been found that it is possible to improve video quality estimation by co-operation between the transmitted video stream and the reconstructed distorted video. The artefacts based quality estimation scheme, operating on the reconstructed distorted videos, is directly related to the HVS property and hence more accurate in predicting subjective viewer perception. However, the artefact detection algorithms themselves are imperfect, which leads to unsteady prediction accuracy. The NR-PSNR scheme, based on the transmitted video stream, is much steadier in performance but lacks correlation with the HVS property. In this invention, we provide an improved solution that fuses the two schemes, the artefacts based scheme and NR-PSNR, for better performance.
In one aspect of the invention, a method for measuring video quality of a video received in a data stream comprises steps of extracting from the data stream low level stream information such as QP and/or bitrate, estimating video quality based on artefacts, and combining the low level stream information and the artefacts based estimated video quality, wherein a video quality measure is obtained. The estimating comprises determining blur, determining blockiness and determining noise.
In another aspect of the invention, an apparatus for measuring video quality of a video received in a data stream comprises extractor means for extracting from the data stream low level stream information such as QP or bitrate, video quality estimator means for estimating video quality based on artefacts, and combiner means for combining the low level stream information and the artefacts based estimated video quality, wherein a video quality measure is obtained. The video quality estimator means is e.g. a processor, and comprises first determining means for determining blur, second determining means for determining blockiness and third determining means for determining noise. The quality estimation accuracy is improved at least by adding NR-PSNR, obtained from low level stream information, to the feature space of an artificial neural network for estimating quality, together with blur, blockiness and noise.
Further objects, features and advantages of the invention will become apparent from a consideration of the following description and the appended claims when taken in connection with the accompanying drawings.
[2] Z. Wang, H. R. Sheikh, A. C. Bovik: "No-Reference Perceptual Quality Assessment of JPEG Compressed Images", in Proc. of IEEE International Conference on Image Processing, vol.1, Sept. 2002, p.477-480
[3] M. C. Q. Farias, S. K. Mitra: "No-Reference Video Quality Metric based on Artifact Measurements", in Proc. of IEEE International Conference on Image Processing, vol.3, Sept. 2005, p.141-144
[4] D. S. Turaga, Y. Chen, J. Caviedes: "No-Reference PSNR Estimation for Compressed Pictures", Signal Processing: Image Communication, vol.19, 2004, p.173-184
[5] A. Ichigaya, M. Kurozumi, N. Hara, Y. Nishida, E. Nakasu: "A Method of Estimating Coding PSNR using Quantized DCT Coefficients", IEEE Transactions on Circuits and Systems for Video Technology, vol.16, no.2, Feb. 2006
[6] Zhang et al., US 2008/0037864 A1
Brief description of the drawings
Exemplary embodiments of the invention are described with reference to the accompanying drawings, which show in
Fig.1 distorted images with the same mean-square-error (MSE);
Fig.2 the correlation between the prediction accuracy of model FUSE' and the coefficient α;
Fig.3 an image encoded by H.264 with QP=29;
Fig.4 an image encoded by H.264 with QP=37;
Fig.5 a flow chart of the method; and
Fig.6 a block diagram of an apparatus for video quality measurement.
Detailed description of the invention
1.1 Artefacts' based Quality Estimation Scheme
First, an artefacts' based quality estimation scheme is described. Three features are extracted in this scheme: blur, blockiness, and noise.
Image blur is defined as the spread of edges in the spatial domain. If an edge is spread over more pixels (in width), the image is more blurry than if the edge were spread over fewer pixels. According to the NR blur detection algorithm proposed in the literature [7], first an edge detector (e.g. a vertical Sobel filter) is applied in order to find edges; second, each row of the image is scanned, and for pixels corresponding to an edge location, the start and end positions of the edge are defined as the local extreme locations closest to the edge; the edge width is given by the difference between the end and start positions and is taken as the local blur measure for this edge location. Finally, the global blur measure for the whole image is obtained by averaging the local blur values over all edge locations.
[7] P. Marziliano, F. Dufaux, S. Winkler, T. Ebrahimi: "A No-Reference Perceptual Blur Metric", in Proc. of IEEE International Conference on Image Processing, 2002, p.57-60
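As an illustrative sketch only (not the patented implementation), the edge-width blur measure can be coded as follows; a simple central-difference gradient stands in for the vertical Sobel filter, and the threshold `edge_thresh` is an assumed tuning parameter:

```python
def edge_width(row, x):
    """Width of the edge at column x: distance between the local
    extrema (start and end of the monotone ramp) enclosing x."""
    rising = row[min(x + 1, len(row) - 1)] >= row[max(x - 1, 0)]
    p = row if rising else [-v for v in row]
    a = x
    while a > 0 and p[a - 1] < p[a]:             # walk left to the ramp start
        a -= 1
    b = x
    while b < len(row) - 1 and p[b + 1] > p[b]:  # walk right to the ramp end
        b += 1
    return b - a

def blur_measure(img, edge_thresh=50.0):
    """Global blur: mean edge width over all detected edge pixels.
    img is a list of rows (lists) of luminance values."""
    widths = []
    for row in img:
        for x in range(1, len(row) - 1):
            # central difference as a stand-in for the vertical Sobel response
            if abs(row[x + 1] - row[x - 1]) > edge_thresh:
                widths.append(edge_width(row, x))
    return sum(widths) / len(widths) if widths else 0.0
```

A sharp step edge yields a width of about one pixel per edge location, while a smoothed ramp of the same amplitude yields the full ramp width, so the measure grows with blur as intended.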
Image blockiness is defined as the ratio between the difference at the block boundary and the difference inside the block. Noise is estimated from the luminance intensity variation of the blocks.
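A minimal sketch of these two measures, assuming 8x8 blocks and, for brevity, only horizontal differences (an illustration under those assumptions, not the patented implementation):

```python
import statistics

def blockiness_measure(img, block=8):
    """Ratio of the mean absolute luminance difference across block
    boundaries to the mean absolute difference inside blocks."""
    boundary, inner = [], []
    for row in img:
        for x in range(len(row) - 1):
            d = abs(row[x + 1] - row[x])
            # column x+1 starting a new block makes (x, x+1) a boundary pair
            (boundary if (x + 1) % block == 0 else inner).append(d)
    if not boundary or not inner or sum(inner) == 0:
        return 0.0
    return (sum(boundary) / len(boundary)) / (sum(inner) / len(inner))

def noise_measure(img, block=8):
    """Noise estimate: mean per-block standard deviation of luminance."""
    stds = []
    for by in range(0, len(img) - block + 1, block):
        for bx in range(0, len(img[0]) - block + 1, block):
            vals = [img[y][x] for y in range(by, by + block)
                              for x in range(bx, bx + block)]
            stds.append(statistics.pstdev(vals))
    return sum(stds) / len(stds) if stds else 0.0
```

On a smooth gradient the blockiness ratio is close to 1, while a visible jump at an 8-pixel boundary drives it well above 1; the noise measure is zero for flat blocks and grows with intra-block variation.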
A dataset DS = {(blur_i, blockiness_i, noise_i, MOS_i) | i = 1, 2, ...} is then collected with enough samples, where blur_i, blockiness_i, and noise_i are the artefact measures of the i-th sample, while MOS_i is the manually labelled quality level.
Since the relationship between human perception and the measurements of the video distortions is difficult to express in a definite manner, a quality prediction model M is built with properly selected machine learning tools, such as an Artificial Neural Network (ANN) [8].
For evaluating the performance of this learning model, a test set with extracted features and labelled quality levels is used. E.g., the Pearson Correlation (PC), defined as the correlation coefficient between model-predicted and manually labelled scores, can be adopted for this purpose; it provides an evaluation of the prediction accuracy of the model.
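The PC evaluation can be written directly from its definition; a minimal sketch:

```python
import math

def pearson_correlation(predicted, mos):
    """Correlation coefficient between model-predicted scores and
    manually labelled MOS values; +1 means perfect linear agreement."""
    n = len(predicted)
    mp, mm = sum(predicted) / n, sum(mos) / n
    cov = sum((p - mp) * (m - mm) for p, m in zip(predicted, mos))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    sm = math.sqrt(sum((m - mm) ** 2 for m in mos))
    return cov / (sp * sm)
```

A model whose predictions are a positive linear function of the labels scores exactly 1, regardless of scale or offset, which is why PC is a convenient accuracy measure across differently scaled quality models.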
1.2. Statistical based NR-PSNR Scheme
Second, an NR-PSNR scheme based on statistics is described. NR-PSNR is proposed based on the following observations: The Mean Square Error (MSE) in the pixel domain is equivalent to the MSE in the DCT domain, because the DCT is a normalized orthogonal transformation. When a standard DCT is applied to a natural image or video sequence, the distribution of its AC coefficients shows high regularity. This can be modelled by a Probability Distribution Function (PDF); among such models, the Laplacian is the most popular one.
Although the coefficient values change after quantization and de-quantization, the distribution trend of the coefficients remains the same. That is, the distributions of the original DCT coefficients and the de-quantized DCT coefficients can be modelled by the same PDF, as has been shown in several studies. From these observations, it is possible to estimate the original DCT coefficient distribution from the quantized DCT coefficients. Therefore, the quantized DCT coefficients are de-quantized using the QP, and the DCT coefficient distribution is measured, since it is the same before quantization and after de-quantization. In the next step, the MSE in the DCT domain caused by quantization can be estimated; this is equivalent to the MSE in the pixel domain, and therefore the PSNR (NR-PSNR) is estimated.
[8] X. H. Jiang, F. Meng, J. B. Xu, W. Zhou: "No-Reference Perceptual Video Quality Measurement for High Definition Videos based on Artificial Neural Network", in: International Conference on Computer and Electrical Engineering, Dec. 2008, p.424-427
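The idea can be illustrated with a small sketch. This is a simplification, not the patented method: it assumes a plain uniform mid-tread quantizer with step `step` (in H.264 the de-quantization step is derived from the QP in a more involved way), fits the Laplacian scale by its maximum-likelihood estimate, and integrates the expected squared quantization error numerically:

```python
import math

def laplacian_scale(coeffs):
    """ML estimate of the Laplacian rate parameter: lam = 1 / mean(|c|)."""
    return len(coeffs) / sum(abs(c) for c in coeffs)

def quantization_mse(lam, step, n_bins=64, grid=200):
    """Expected squared error of a uniform mid-tread quantizer applied
    to a Laplacian(lam) source, by numerical integration per bin."""
    mse = 0.0
    for k in range(-n_bins, n_bins + 1):
        lo, recon = (k - 0.5) * step, k * step
        w = step / grid
        for i in range(grid):
            x = lo + (i + 0.5) * w
            pdf = 0.5 * lam * math.exp(-lam * abs(x))
            mse += (x - recon) ** 2 * pdf * w
    return mse

def nr_psnr(dequantized_ac, step, peak=255.0):
    """Estimate PSNR without the reference: the DCT-domain quantization
    MSE equals the pixel-domain MSE because the DCT is orthonormal."""
    lam = laplacian_scale(dequantized_ac)
    mse = quantization_mse(lam, step)
    return 10.0 * math.log10(peak ** 2 / max(mse, 1e-12))
```

As expected, the estimated PSNR decreases monotonically as the quantization step grows, since a coarser quantizer incurs a larger expected squared error under the fitted PDF.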
1.3. Performance Comparison
An artefacts based quality estimation scheme has a much higher correlation with human perception than statistical based schemes (MSE, PSNR), since it is directly related to HVS properties, as already pointed out in former studies. Fig.1, which shows distorted images with the same MSE, is an example from the literature⁹: image (a) is the original image, while (b)-(e) are images distorted with the same MSE by various kinds of artefacts: b) luminance shift, c) noise, d) blockiness and e) blur, respectively.
Though images (b)-(e) have the same MSE (and thus the same PSNR), viewer perception varies greatly.
⁹ Z. Wang, A.C. Bovik: "Mean Squared Error: Love It or Leave It? - A New Look at Signal Fidelity Measures", IEEE Signal Processing Magazine, vol. 26, no. 1, Jan. 2009, pp. 98-117

However, this superiority of artefacts based quality estimation schemes is always weakened by the following facts. First, it is impossible to take all kinds of artefacts into account; the missed artefacts weaken the superiority, since they are counted by MSE in a steady way. Second, and most importantly, the NR artefact detection algorithms themselves are not very accurate. Taking the algorithms mentioned in Section 1.1 as an example, the NR blockiness detection algorithm cannot distinguish texture on a block boundary from the blockiness effect, and the NR blur detection algorithm does not take the source edge width into account.
All these facts result in uncertain performance stability of the artefacts based quality estimation scheme. On the contrary, statistical based quality estimation schemes (NR-PSNR) have high performance stability while correlating much less with human perception. The present invention provides a way to utilize the strong points of both schemes in combination.
2. User study
Results have been verified based on a user study. 30 video shots (all of SD resolution, 720x480 or 720x576, each composed of 240 images) are selected for the user study, including movie, cartoon, sports, news, and advertisement content. Each video shot is encoded with H.264 to obtain six processed videos with QP set to 24, 29, 34, 37, 40, and 45 respectively; the GOP structure is IBBP... and the GOP size is set to 36.
From all these processed videos, 1288 720x480 images and 1260 720x576 images are randomly selected to generate a dataset DS = DS_720x480 ∪ DS_720x576 for the user study. The features (blur, blockiness, noise, NR-PSNR) of each image in DS are extracted according to the methods described in Section 1. Ten randomly chosen subjects (users) take part in the user study, each seated about three times the screen height away from the 60" display on which each image of DS is shown. The users are asked to give a MOS (Mean Opinion Score) for each displayed image, that is, to assign it to one of five levels:
5 - Excellent; 4 - Good; 3 - Fair; 2 - Poor; 1 - Bad
Tab.1: MOS levels
The MOS score for each image is then defined as the average over all subjects, after clear outliers have been picked out.
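The pooling of subjective ratings into one MOS per image might look like this; the text only says that clear outliers are picked out, so the concrete outlier rule (ratings more than 1.5 levels from the initial mean) is an assumption:

```python
def mos_score(ratings, threshold=1.5):
    """Average the subjects' ratings for one image, discarding clear
    outliers: here, ratings farther than `threshold` levels from the
    initial mean (the exact outlier rule is an assumption)."""
    mean0 = sum(ratings) / len(ratings)           # mean over all subjects
    kept = [r for r in ratings if abs(r - mean0) <= threshold]
    return sum(kept) / len(kept)                  # mean over the inliers
```

For example, with ratings [4, 4, 5, 4, 1], the stray rating 1 is discarded and the MOS is 4.25.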
3. Pooling strategies

According to the discussion in Section 1, artefacts based quality estimation is quite intuitive in expressing the user experience when viewing videos, and hence should estimate viewer perception with high accuracy. However, the estimation accuracy is reduced for some unavoidable reasons: first, it is impossible to include all possible artefacts in the scheme, and second, the artefact detection algorithms themselves are not very accurate. As a result, the known artefacts based quality estimation scheme lacks performance stability. Conversely, NR-PSNR, being a statistical based quality estimation scheme in which the HVS is not considered, is not expected to predict viewer perception very accurately; however, NR-PSNR has high performance stability. One aspect of the invention is to fuse the two kinds of quality estimation schemes for better performance, i.e. better prediction accuracy and/or performance stability. In the following, it is shown that fusing the two kinds of schemes is advantageous.

3.1. Linear Pooling
Linear pooling is a basic strategy to fuse the artefacts based quality estimation scheme and NR-PSNR:
FUSE' = (1-α)·ANN_Artefacts + α·NR-PSNR

where α is a constant, ANN_Artefacts is the artefacts based quality estimation model, and FUSE' is the model that fuses ANN_Artefacts and NR-PSNR with the linear pooling strategy. The dataset DS is used to train the coefficient α. Fig.2 shows the relationship between the prediction accuracy of model FUSE' (Y axis) and the coefficient α (X axis). Judged by the result, the prediction accuracy reaches its maximum (Pearson Correlation = 0.89, as listed in Tab.2) at about α = 0.3; that is, the prediction accuracy is optimized at α = 0.3. Interestingly, 1-α = 0.7 is close to the prediction accuracy of the blur and blockiness detection algorithms that were adopted.
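The training of α can be sketched as a simple sweep over [0, 1]. Here, negative mean squared error against the labelled MOS stands in for the Pearson-correlation criterion used in the text, and all names are illustrative:

```python
def fuse_linear(ann_artefacts, nr_psnr, alpha):
    """Linear pooling: FUSE' = (1 - alpha) * ANN_Artefacts + alpha * NR-PSNR."""
    return (1.0 - alpha) * ann_artefacts + alpha * nr_psnr

def train_alpha(samples, mos, step=0.01):
    """Sweep alpha and keep the value whose fused scores best match the
    labelled MOS (negative MSE as a stand-in selection criterion).
    `samples` is a list of (ANN_Artefacts, NR-PSNR) pairs per image."""
    best_alpha, best_score = 0.0, float("-inf")
    steps = int(round(1.0 / step))
    for k in range(steps + 1):
        alpha = k * step
        fused = [fuse_linear(a, p, alpha) for a, p in samples]
        score = -sum((f - m) ** 2 for f, m in zip(fused, mos)) / len(mos)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha
```

On data generated with a 0.7/0.3 mix, the sweep recovers α = 0.3, mirroring the optimum reported for dataset DS.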
3.2. Training by Artificial Neural Network
Generally, an ANN-based algorithm comprises the following steps. First, video features are extracted from videos, and subjective assessments are performed on these videos to obtain MOS scores. These features and the corresponding MOS are used in the training step to make the ANN learn the characteristics of the "distortion" (in our study, 80% of the images in the dataset are used for training and the remaining 20% for testing). After training, the model works as an NR model that assesses video quality like human observers in some respects. In the artefacts based quality estimation scheme based on ANN training, the selected features are the appointed artefacts: blur, blockiness, and noise in our study.
In order to fuse this scheme with NR-PSNR, a generalization is to make NR-PSNR another extracted feature in model training. For identification, the model trained with the artefact features alone is called ANN_Artefacts, while the model trained with both the artefact features and NR-PSNR is called FUSE. According to the results reported in Tab.2, the quality estimation accuracy is improved by adding NR-PSNR to the feature space.
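Folding NR-PSNR into the feature space amounts to one extra column in the training data. A sketch of the data preparation is given below (the ANN itself, e.g. a small multilayer perceptron, would then be fitted on the 80% split; the function name and record layout are assumptions):

```python
def build_training_data(dataset, include_nr_psnr=True, train_frac=0.8):
    """Turn (NR-PSNR_i, blur_i, blockiness_i, noise_i, MOS_i) records
    into (feature-vector, MOS) pairs. With include_nr_psnr=True the
    result corresponds to the FUSE model, otherwise to ANN_Artefacts."""
    rows = []
    for nr_psnr, blur, blockiness, noise, mos in dataset:
        feats = [blur, blockiness, noise]
        if include_nr_psnr:
            feats.append(nr_psnr)        # NR-PSNR as an additional feature
        rows.append((feats, mos))
    split = int(len(rows) * train_frac)  # 80% training / 20% testing split
    return rows[:split], rows[split:]
```

The only difference between the two models is thus the dimensionality of the feature vectors handed to the learner.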
Tab.2: Pearson Correlation of ANN_Artefacts and FUSE for different datasets
Fig.3 and Fig.4 show two examples of quality estimation, featuring the following parameters:
Fig.3: ANN_Artefacts = 2.62; NR-PSNR = 37.04; FUSE = 3.32; MOS = 4.0
Fig.4: ANN_Artefacts = 2.92; NR-PSNR = 33.06; FUSE = 2.50; MOS = 2.5
Fig.3 is selected from a video sequence encoded with H.264 at QP=29, while Fig.4 is selected from a video sequence encoded at QP=37 (QP = quantization parameter). Fig.3 has the higher user perception, as shown by its higher MOS value. However, the artefacts based quality estimation scheme (ANN_Artefacts) gives similar scores to both images, mainly because the blur detection algorithm fails to recognize the special shooting technique used in creating Fig.3, which leads to an inaccurate blur score. The problem is settled to some extent by the presence of NR-PSNR in FUSE. This example shows that the model becomes more stable with the presence of NR-PSNR, especially when the artefact detection algorithm does not give an accurate score.

4. Conclusion
Two fusion strategies are proposed to improve the performance of artefacts based quality estimation schemes, in both accuracy and stability, with the help of stream information. The invention is pioneering work in adopting stream information to improve the performance of NR quality measurement. Its importance becomes clear considering the fact that stream information is available in most application scenarios.
In the examples, NR-PSNR is selected as the representative of stream information for simplicity. Other low level, basic stream information can be used instead.
Fig.5 shows a flow chart of the method for measuring video quality of a video received in a data stream S_in. The first step is extracting 1 from the data stream low level stream information such as QP or bitrate. The second step, which can be done before, after or at the same time as the first step, is estimating 2 video quality QM2 based on artefacts. This may comprise determining blur 21, determining blockiness 22 and determining noise (PSNR) 23. The third step is combining 3 the low level stream information and the artefacts based estimated video quality, wherein a video quality measure QM_out is obtained. Particular embodiments are described below.
In one embodiment of the invention, the low level stream information comprises bitrate or quantization parameter (QP) . In one embodiment, the step of combining the low level stream information and the artefacts based estimated video quality comprises steps of calculating a NR-PSNR value from the low level stream information, and calculating a weighted sum of said NR-PSNR value and a value resulting from said artefacts based estimated video quality. The weighted sum can be calculated according to
FUSE'= (1-α) -ANN_Artefacts + α-NR-PSNR.
In one embodiment, the calculating of a NR-PSNR value from low level stream information comprises de-quantizing received DCT coefficients of a block using at least QPs of the low level stream information, calculating a NR-PSNR value from the de-quantized DCT coefficients, determining a distribution of the de-quantized DCT coefficients in the block, and calculating a mean-square-error (MSE) of the determined distribution, wherein the mean-square-error represents said NR-PSNR value.
In one embodiment, a dataset DS = {(NR-PSNR_i, blur_i, noise_i, blockiness_i, MOS_i) | i = 1, 2, ..., N} is collected with N sample images, where blockiness_i, blur_i and noise_i are the artefact measures of a sample image, NR-PSNR_i is said NR-PSNR value calculated from low level stream information, and MOS_i is a manually labelled quality level of the sample image, and the method further comprises training a machine learning tool with the dataset. In one embodiment, determining blur comprises linewise scanning, detecting edges and measuring the width of the detected edges. In one embodiment, said measuring the width of the detected edges comprises determining start and end positions of an edge, wherein the width is given by the difference between the start and end positions. In one embodiment, the start and end positions of the detected edge are local extreme locations of luminance that are closest to the edge. In one embodiment, the edge width is used as the local blur measure for a current edge location, further comprising the step of averaging the local blur measures over all edge locations to obtain a global blur measure for the whole image. In one embodiment, the step of determining blockiness comprises determining a ratio between the difference at a block boundary and the difference inside the block. In one embodiment, the step of determining noise comprises determining a luminance intensity variation of a block, e.g. by determining a standard deviation of luminance values within the block.
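The three artefact measures referred to in these embodiments can be sketched on single scan lines and blocks as follows. The block size of 8, the treatment of edge direction, and all function names are assumptions for illustration:

```python
def noise_measure(block):
    """Luminance intensity variation of a block: the standard
    deviation of the luminance values inside it."""
    vals = [v for row in block for v in row]
    mean = sum(vals) / len(vals)
    return (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5

def blockiness_measure(line, block_size=8):
    """Ratio between the average luminance difference across block
    boundaries and the average difference inside blocks, on one scan line."""
    boundary, inside = [], []
    for i in range(1, len(line)):
        d = abs(line[i] - line[i - 1])
        (boundary if i % block_size == 0 else inside).append(d)
    inside_avg = max(sum(inside) / len(inside), 1e-9)  # avoid divide-by-zero
    return (sum(boundary) / len(boundary)) / inside_avg

def edge_width(line, pos):
    """Local blur measure: width of the edge at `pos` on a scan line,
    i.e. the distance between the closest luminance extrema bounding
    the monotone run around the edge."""
    direction = line[pos] - line[pos - 1]   # sign of the edge slope
    start = pos - 1
    while start > 0 and (line[start] - line[start - 1]) * direction > 0:
        start -= 1                          # extend back while still rising/falling
    end = pos
    while end < len(line) - 1 and (line[end + 1] - line[end]) * direction > 0:
        end += 1                            # extend forward likewise
    return end - start
```

A sharp step gives a small edge width, a gradual ramp a large one; the global blur measure would average these widths over all detected edge locations.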
Fig.6 shows a block diagram of an apparatus 60 for video quality measurement of a video received in a data stream, according to one aspect of the invention. The apparatus 60 for measuring video quality comprises extractor means 61 for extracting from the data stream S_in low level stream information, video quality estimator means 62 for estimating video quality QM2 based on artefacts, wherein the video quality estimator means 62 comprises first determining means 621 for determining blur, second determining means 622 for determining blockiness and third determining means 623 for determining noise, and first combiner means 63 for combining the low level stream information and the artefacts based estimated video quality, wherein a video quality measure QM_out is obtained. Measures of the determined blur, noise and blockiness can be combined in a second combiner means 624, which results in a combined measure QM2 of blur, noise and blockiness. The second combiner means 624 is optional.
In one embodiment, the low level stream information comprises quantization parameter or bitrate. In one embodiment, the first combiner means 63 for combining the low level stream information and the artefacts based estimated video quality comprises first calculator means 631 for calculating an NR-PSNR value from the low level stream information, and second calculator means 632 for calculating a weighted sum of said NR-PSNR value and a value (such as ANN_Artefacts) resulting from said artefacts based estimated video quality. In one embodiment, the first calculator means 631 for calculating a NR-PSNR value from said low level stream information comprises a de-quantizer, and further distribution determining means 6311 for determining a distribution of de-quantized DCT coefficients in a block, and MSE calculator means 6312 for calculating a mean-square-error (MSE) of the determined distribution, wherein the mean-square-error is said NR-PSNR value. In one embodiment, the first determining means 621 for determining blur comprises scanning means for linewise scanning an image, an edge detector for detecting edges and a measurement module for measuring the width of the detected edges. In one embodiment, said measurement module determines start and end positions of an edge, and provides the width as being the difference between the start and end positions.
In one embodiment, the start and end positions of the detected edge are local extreme locations of luminance that are closest to the edge. In one embodiment, the edge width is used as the local blur measure for a current edge location, and the measurement module further comprises averaging means for averaging the local blur measures over all edge locations, thus obtaining a global blur measure for the whole image. In one embodiment, the second determining means 622 for determining blockiness determines a ratio between the difference at block boundary and the difference inside the block. In one embodiment, the third determining means 623 for determining noise determines a luminance intensity variation within a block.
While fundamental novel features of the present invention as applied to preferred embodiments thereof have been shown, described, and pointed out, it will be understood that various omissions, substitutions and changes in the apparatus and method described, in the form and details of the corresponding devices and in their operation, may be made by those skilled in the art without departing from the spirit of the present invention. Although the present invention has been disclosed with regard to only certain stream parameters, one skilled in the art would recognize that the method and devices described herein may be applied to any stream parameters. It is expressly intended that all combinations of those elements that perform substantially the same function in substantially the same way to achieve the same results are within the scope of the invention. Substitutions of elements from one described embodiment to another are also fully intended and contemplated.
It will be understood that the present invention has been described purely by way of example, and modifications of detail can be made without departing from the scope of the invention.
Each feature disclosed in the description and (where appropriate) the claims and drawings may be provided independently or in any appropriate combination. Features may, where appropriate, be implemented in hardware, software, or a combination of the two. Connections may, where applicable, be implemented as wireless or wired connections, not necessarily direct or dedicated. Reference numerals appearing in the claims are by way of illustration only and shall have no limiting effect on the scope of the claims.

Claims
1. A method for measuring video quality of a video received in a data stream, comprising steps of
- extracting (1) from the data stream low level stream information;
- estimating (2) video quality based on artefacts, comprising determining blur, determining blockiness and determining noise (PSNR); and
- combining (3) the low level stream information and the artefacts based estimated video quality, wherein a video quality measure is obtained.
2. Method according to claim 1, wherein said low level stream information comprises quantization parameter.
3. Method according to claim 1 or 2, wherein the step of combining the low level stream information and the artefacts based estimated video quality comprises steps of calculating a NR-PSNR value from the low level stream information, and calculating a weighted sum of said NR-PSNR value and a value (ANN_Artefacts) resulting from said artefacts based estimated video quality.
4. Method according to one of the claims 1-3, further comprising steps of
- de-quantizing received DCT coefficients of a block, using said low level stream information;
- calculating a NR-PSNR value from the de-quantized DCT coefficients;
- determining a distribution of the de-quantized DCT coefficients in the block; and
- calculating a mean-square-error (MSE) of the determined distribution, wherein the mean-square-error is said NR-PSNR value.
5. Method according to claim 3, wherein a dataset DS = {(NR-PSNR_i, blur_i, blockiness_i, noise_i, MOS_i) | i = 1, 2, ..., N} is collected with N sample images, where blockiness_i, blur_i and noise_i are the artefact measures of a sample image, NR-PSNR_i is said NR-PSNR value calculated from low level stream information, and MOS_i is a manually labelled quality level of the sample image, further comprising a step of training a machine learning tool with the dataset.
6. Method according to one of the claims 1-5, wherein determining blur comprises linewise scanning, detecting edges and measuring the width of the detected edges.
7. Method according to claim 6, wherein said measuring the width of the detected edges comprises determining start and end positions of an edge, wherein the width is given by the difference between the start and end positions .
8. Method according to claim 6 or 7, wherein the start and end positions of the detected edge are local extreme locations of luminance that are closest to the edge, and the edge width is used as the local blur measure for a current edge location, further comprising a step of averaging the local blur measures over all edge locations to obtain a global blur measure for the whole image.
9. Method according to one of the claims 1-8, wherein the step of determining blockiness comprises determining a ratio between luminance differences at a block boundary and luminance differences inside the block.
10. Method according to one of the claims 1-9, wherein the step of determining noise comprises determining a luminance intensity variation of a block.
11. An apparatus for measuring video quality of a video received in a data stream, comprising
- extractor means (61) for extracting from the data stream (S_in) low level stream information;
- video quality estimator means (62) for estimating video quality based on artefacts, wherein the video quality estimator means comprises first determining means (621) for determining blur, second determining means (622) for determining blockiness and third determining means (623) for determining noise; and
- combiner means (63) for combining the low level stream information and the artefacts based estimated video quality, wherein a video quality measure (QM_out) is obtained.
PCT/EP2010/053211 2009-03-13 2010-03-12 Method and apparatus for video quality measurement without reference WO2010103112A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP09305235 2009-03-13
EP09305235.5 2009-03-13

Publications (1)

Publication Number Publication Date
WO2010103112A1 true WO2010103112A1 (en) 2010-09-16

Family

ID=42307941

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2010/053211 WO2010103112A1 (en) 2009-03-13 2010-03-12 Method and apparatus for video quality measurement without reference

Country Status (1)

Country Link
WO (1) WO2010103112A1 (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040114685A1 (en) * 2002-12-13 2004-06-17 International Business Machines Corporation Method and system for objective quality assessment of image and video streams
WO2007066066A2 (en) * 2005-12-05 2007-06-14 British Telecommunications Public Limited Company Non-intrusive video quality measurement
US20070217705A1 (en) * 2006-03-16 2007-09-20 Samsung Electronics Co., Ltd. Method for estimating quantization factor of compressed digital image


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Engelke, U. et al.: "Perceptual-based Quality Metrics for Image and Video Services: A Survey", 3rd EuroNGI Conference on Next Generation Internet Networks, IEEE, May 2007, pp. 190-197, XP031176445 *
Farias, M.C.Q. et al.: "No-Reference Video Quality Metric Based on Artifact Measurements", IEEE International Conference on Image Processing (ICIP 2005), Genova, Italy, Sept. 2005, vol. 3, pp. 141-144, XP010851416 *
Ichigaya, A. et al.: "A Method of Estimating Coding PSNR Using Quantized DCT Coefficients", IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 2, Feb. 2006, pp. 251-259, XP001240857 *
Lu, Ligang et al.: "Full-Reference Video Quality Assessment Considering Structural Distortion and No-Reference Quality Evaluation of MPEG Video", IEEE International Conference on Multimedia and Expo (ICME 2002), Lausanne, Switzerland, Aug. 2002, vol. 1, pp. 61-64, XP010604306 *
Marziliano, P. et al.: "Perceptual Blur and Ringing Metrics: Application to JPEG2000", Signal Processing: Image Communication, vol. 19, no. 2, Feb. 2004, pp. 163-172, XP004483133 *
Jiang, Xiuhua et al.: "No-Reference Perceptual Video Quality Measurement for High Definition Videos Based on an Artificial Neural Network", International Conference on Computer and Electrical Engineering (ICCEE 2008), Dec. 2008, pp. 424-427, XP031403397 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102006497A (en) * 2010-11-16 2011-04-06 江南大学 No-reference blurred image evaluation method based on local statistical characteristics of images
CN102006497B (en) * 2010-11-16 2013-06-12 江南大学 No-reference blurred image evaluation method based on local statistical characteristics of images
WO2013078582A1 (en) * 2011-11-28 2013-06-06 Thomson Licensing Video quality measurement considering multiple artifacts
US9924167B2 (en) 2011-11-28 2018-03-20 Thomson Licensing Video quality measurement considering multiple artifacts
GB2529446A (en) * 2014-07-17 2016-02-24 British Academy Of Film And Television Arts The Measurement of video quality
CN111314733A (en) * 2020-01-20 2020-06-19 北京百度网讯科技有限公司 Method and apparatus for evaluating video sharpness
CN111314733B (en) * 2020-01-20 2022-06-10 北京百度网讯科技有限公司 Method and apparatus for evaluating video sharpness
CN114246570A (en) * 2021-12-06 2022-03-29 南京邮电大学 Near-infrared heart rate detection method with peak signal-to-noise ratio and Pearson correlation coefficient fused
CN114246570B (en) * 2021-12-06 2023-10-20 南京邮电大学 Near-infrared heart rate detection method by fusing peak signal-to-noise ratio and Peerson correlation coefficient


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10708541

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15.12.2011)

122 Ep: pct application non-entry in european phase

Ref document number: 10708541

Country of ref document: EP

Kind code of ref document: A1