CN111784694A - No-reference video quality evaluation method based on visual attention mechanism

No-reference video quality evaluation method based on visual attention mechanism

Info

Publication number
CN111784694A
CN111784694A
Authority
CN
China
Prior art keywords
video
optical flow
flow field
visual attention
attention mechanism
Prior art date
Legal status
Granted
Application number
CN202010841520.0A
Other languages
Chinese (zh)
Other versions
CN111784694B (en)
Inventor
Shi Ping
Hou Ming
Pan Da
Current Assignee
Communication University of China
Original Assignee
Communication University of China
Priority date
Filing date
Publication date
Application filed by Communication University of China
Priority to CN202010841520.0A
Publication of CN111784694A
Application granted
Publication of CN111784694B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

The invention discloses a no-reference video quality evaluation method based on a visual attention mechanism. The method exploits how human eyes perceive distorted video: motion information in a video attracts visual attention, so viewers focus more on moving regions, which in turn influences their judgment of overall video quality. In addition, motion has a masking effect, so distortion in a moving region is less easily perceived by human eyes. The invention designs a visual attention mechanism model to simulate this perception process: the motion information of each video frame is represented pixel by pixel by an optical flow field, which serves as a visual attention map, and the attention map is applied to a deep neural network, thereby improving the performance of the video quality evaluation model.

Description

No-reference video quality evaluation method based on visual attention mechanism
Technical Field
The invention relates to a no-reference video quality evaluation method based on a visual attention mechanism, and belongs to the technical field of digital video processing.
Background
With the development of 5G network facilities and digital media, video is increasingly common in people's lives. However, video suffers a certain amount of distortion during acquisition, compression and transmission, which degrades the viewing experience. In order to improve the quality of video services, video providers need to evaluate video quality; this task is called Video Quality Assessment (VQA).
Video quality evaluation methods can be classified into subjective and objective methods. In subjective evaluation, observers score video quality directly; such evaluation is labor-intensive, time-consuming and inconvenient. In objective evaluation, a computer calculates a quality index for the video according to an algorithm. Depending on whether a reference video is needed during evaluation, objective methods are divided into Full Reference (FR), Reduced Reference (RR) and No Reference (NR) methods:
(1) Full-reference video quality evaluation. Given a lossless video as the reference, an FR algorithm compares the video to be evaluated against the reference and analyzes its degree of distortion, thereby obtaining a quality evaluation of the video to be evaluated. Common FR methods include: evaluation based on video pixel statistics (mainly peak signal-to-noise ratio and mean squared error), evaluation based on deep learning, and evaluation based on structural information (mainly structural similarity). FR algorithms are by far the most reliable objective video quality evaluation methods.
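For concreteness, the two pixel-statistics metrics just named can be written down directly; a minimal numpy sketch (the function names are ours, not from the patent) for 8-bit frames:

```python
import numpy as np

def mse(ref: np.ndarray, dist: np.ndarray) -> float:
    """Mean squared error between a reference frame and a distorted frame."""
    return float(np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2))

def psnr(ref: np.ndarray, dist: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB, assuming 8-bit frames (peak = 255)."""
    err = mse(ref, dist)
    return float("inf") if err == 0 else 10.0 * np.log10(peak ** 2 / err)
```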
(2) Reduced-reference video quality evaluation. An RR algorithm extracts partial feature information from the reference video and compares the video to be evaluated against it to obtain a quality evaluation. Common RR algorithms are mainly: methods based on original video features and methods based on wavelet-domain statistical models.
(3) No-reference video quality evaluation. An NR algorithm evaluates the quality of a video without any lossless video as reference. Common NR algorithms are mainly: methods based on natural scene statistics and methods based on deep learning.
Disclosure of Invention
Aiming at the poor performance of existing no-reference video quality evaluation, the invention provides a no-reference objective quality evaluation method.
The technical scheme adopted by the invention is a no-reference video quality evaluation method based on a visual attention mechanism, comprising the following steps:
Step 1, extracting video frames.
For a video, the extracted frames serve as the input units of the visual attention mechanism model.
Step 1.1, extract one video frame every 4 frames, and discard the remaining frames as redundant;
Step 1.2, discard the last extracted frame, because no optical flow field can be computed for it (see the sketch below);
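A minimal sketch of step 1, assuming OpenCV for decoding (the patent names no library; `extract_frames` is our name):

```python
import cv2

def extract_frames(video_path: str, interval: int = 4):
    """Step 1.1: keep one frame every `interval` frames; the rest are redundant.
    The last kept frame exists only to serve as the optical-flow target of its
    predecessor (step 2); per step 1.2 it is excluded from the scored set."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```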
Step 2, generating optical flow field data.
An optical flow field for the video data is generated using the open-source model PWC-Net.
Step 2.1, build the PWC-Net model, using the open-source pretrained weights;
Step 2.2, form a video frame pair from each video frame and the frame that follows it, as the input to PWC-Net;
Step 2.3, feed each video frame pair into PWC-Net to obtain optical flow field data for all video frames (see the sketch below).
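A sketch of how steps 2.2 and 2.3 might be wired together. The `pwc_net` module and its `estimate_flow` function are hypothetical stand-ins for the open-source pretrained PWC-Net; the patent does not specify a calling interface:

```python
# Hypothetical wrapper: `pwc_net.estimate_flow(prev, nxt)` stands in for the
# open-source pretrained PWC-Net and is NOT a real API; it is assumed to
# return an (H, W, 2) array holding the X and Y flow channels (step 2.1).
from pwc_net import estimate_flow

def flows_for_video(frames):
    """Steps 2.2-2.3: pair each frame with the one that follows it and run
    PWC-Net, yielding len(frames) - 1 flow fields, one per scored frame."""
    return [estimate_flow(prev, nxt) for prev, nxt in zip(frames[:-1], frames[1:])]
```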
Step 3, preprocessing the optical flow field data.
The optical flow field data generated by PWC-Net are threshold-truncated, normalized, and converted to magnitudes.
Step 3.1, set thresholds Tx (default 140) and Ty (default 160) for the X and Y channels of the optical flow field data respectively; optical flow values beyond a threshold are clipped to that threshold;
Step 3.2, divide all values of the X and Y channels of the optical flow field data by Tx and Ty respectively, for normalization;
Step 3.3, calculate the amplitude M = √(X² + Y²) of all optical flow field data as the optical flow amplitude map;
Step 3.4, scale the optical flow amplitude map to one quarter of its original size, keeping the aspect ratio unchanged (see the sketch below).
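A minimal numpy/OpenCV sketch of steps 3.1 through 3.4. Two readings are assumed: truncation is symmetric clipping to ±Tx/±Ty (flow values can be negative), and "one quarter of the original size" means one quarter per dimension, which keeps the aspect ratio:

```python
import numpy as np
import cv2

TX, TY = 140.0, 160.0  # default truncation thresholds from step 3.1

def flow_to_attention_map(flow: np.ndarray) -> np.ndarray:
    """Steps 3.1-3.4 for one (H, W, 2) optical flow field."""
    # Steps 3.1 + 3.2: clip each channel to its threshold (read here as
    # symmetric +/-T, since flow values can be negative), then normalize.
    x = np.clip(flow[..., 0], -TX, TX) / TX
    y = np.clip(flow[..., 1], -TY, TY) / TY
    # Step 3.3: amplitude M = sqrt(X^2 + Y^2), the optical flow amplitude map.
    m = np.sqrt(x ** 2 + y ** 2).astype(np.float32)
    # Step 3.4: quarter size per dimension, which preserves the aspect ratio.
    h, w = m.shape
    return cv2.resize(m, (w // 4, h // 4), interpolation=cv2.INTER_LINEAR)
```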
Step 4, building and training the visual attention mechanism model.
A visual attention mechanism network based on ResNet50 is constructed and trained.
Step 4.1, modify the ResNet50 network by adding a visual attention mechanism module after its second group of convolutional layers, i.e., the optical flow amplitude map obtained in step 3 is multiplied element-wise with the feature map at that point; the output of the visual attention mechanism module serves as the input of the third group of convolutional layers of ResNet50;
Step 4.2, organize the training data: the video frames generated in step 1 and the corresponding optical flow amplitude maps generated in step 3 are input to the model, and the labels are the quality scores of the videos;
Step 4.3, train the visual attention mechanism network using MSE loss (a sketch of the modified network follows).
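The following PyTorch sketch shows one reading of step 4.1. The patent does not pin down which ResNet50 stage is the "second group of convolutional layers"; here it is assumed to be conv2_x (torchvision's `layer1`), since its stride-4 feature map has exactly the spatial size of the quarter-scaled attention map from step 3. The class name and the single-score regression head are ours:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class FlowAttentionResNet50(nn.Module):
    """Sketch of the step-4 model. "Second group of convolutional layers" is
    mapped to conv2_x (torchvision's `layer1`); its stride-4 output matches
    the 1/4-scaled attention map, and `layer2` plays the "third group"."""

    def __init__(self):
        super().__init__()
        net = resnet50(weights=None)  # pretrained weights are optional here
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1 = net.layer1
        self.layer2, self.layer3, self.layer4 = net.layer2, net.layer3, net.layer4
        self.avgpool = net.avgpool
        self.fc = nn.Linear(net.fc.in_features, 1)  # regress one quality score

    def forward(self, frame: torch.Tensor, attn: torch.Tensor) -> torch.Tensor:
        # frame: (N, 3, H, W); attn: (N, 1, H/4, W/4) amplitude map from step 3
        x = self.layer1(self.stem(frame))   # (N, 256, H/4, W/4)
        x = x * attn                        # attention: element-wise product
        x = self.layer4(self.layer3(self.layer2(x)))
        x = self.avgpool(x).flatten(1)
        return self.fc(x).squeeze(1)        # one quality score per frame
```

Under this assumption no resizing of the attention map is needed; if the second group were instead taken to be conv3_x, the map would have to be downscaled once more before the element-wise product.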
Step 5, evaluating the quality of a video.
Frame extraction and optical flow computation are performed on the video, followed by quality evaluation.
Step 5.1, extract video frames from the video under test according to step 1;
Step 5.2, generate the optical flow amplitude maps of the frames under test using steps 2 and 3;
Step 5.3, perform quality evaluation with the visual attention mechanism network trained in step 4, obtaining a quality score for each video frame;
Step 5.4, average the quality scores of all video frames to obtain the overall quality score of the video (see the sketch below).
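Steps 5.3 and 5.4 then amount to running every (frame, attention map) pair through the trained network and averaging; a minimal sketch, assuming the FlowAttentionResNet50 sketch above and per-frame tensors produced by steps 1-3:

```python
import torch

@torch.no_grad()
def score_video(model, frames, attn_maps, device="cpu"):
    """Steps 5.3-5.4: score each (frame, attention map) pair, then average.
    `frames` are (3, H, W) tensors and `attn_maps` (1, H/4, W/4) tensors."""
    model = model.eval().to(device)
    scores = [
        model(f.unsqueeze(0).to(device), a.unsqueeze(0).to(device)).item()
        for f, a in zip(frames, attn_maps)
    ]
    return sum(scores) / len(scores)  # step 5.4: overall video quality score
```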
Compared with the prior art, the invention has the following advantages:
(1) The present invention exploits the human eye's distortion-perception characteristics for video motion regions to improve VQA performance. When human eyes perceive video distortion, motion information attracts attention, so viewers focus more easily on moving regions, which influences their judgment of overall video quality. On the other hand, motion has a masking effect: distortion produced in a moving region is less easily perceived. If motion regions can be identified, the human visual system can be simulated more faithfully, making the VQA model more accurate.
(2) The invention uses PWC-Net to generate the optical flow field, which extracts video motion regions well and better represents the visual perception characteristics in VQA. The optical flow field describes the motion information in a video pixel by pixel, and therefore serves well as the attention map of the visual attention mechanism. PWC-Net is a fast, accurate deep learning model that, compared with traditional methods, produces higher-quality optical flow fields more efficiently.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention;
FIG. 2 is a diagram of a visual attention mechanism model of the present invention based on ResNet 50;
Detailed Description
The method is described in detail below with reference to the figures and examples.
The flow chart of an embodiment is shown in fig. 1, and comprises the following steps:
step S10, extracting video frames;
step S20, generating an optical flow field;
step S30, preprocessing optical flow field data;
step S40, building and training a visual attention mechanism model;
step S50, carrying out quality evaluation on the video;
the extracted video frame adjusting step S10 of an embodiment further includes the steps of:
step S100, extracting video frames, selecting the video frames at equal intervals, and directly discarding other video frames due to redundancy;
in step S110, the last frame of the extracted video frame is discarded because the optical flow field cannot be calculated.
The optical flow field generation step S20 of an embodiment further includes the following steps:
Step S200, build the PWC-Net model, using the open-source pretrained weights;
Step S210, form a video frame pair from each video frame and the frame that follows it, as the input to PWC-Net;
Step S220, feed each video frame pair into PWC-Net to obtain optical flow field data for all video frames.
The optical flow field data preprocessing step S30 of an embodiment further includes the following steps:
Step S300, set thresholds Tx and Ty for the X and Y channels of the optical flow field data respectively; optical flow values beyond a threshold are clipped to that threshold;
Step S310, divide all values of the X and Y channels by Tx and Ty respectively, for normalization;
Step S320, calculate the amplitude M of all optical flow field data;
Step S330, scale the optical flow amplitude map to one quarter of its original size, keeping the aspect ratio unchanged.
The visual attention mechanism model building and training step S40 of an embodiment further includes the following steps:
Step S400, modify the ResNet50 network by adding a visual attention mechanism module after its second group of convolutional layers, i.e., the optical flow amplitude map obtained in step S30 is multiplied element-wise with the feature map at that point;
Step S410, organize the training data: the model input is each individual video frame together with its corresponding optical flow amplitude map, and the label is the quality score of the video;
Step S420, train the visual attention mechanism network using MSE loss (a sketch follows).
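A minimal training pass consistent with steps S410 and S420, assuming a PyTorch DataLoader that yields (frame, attention map, score) triples; the patent fixes only the MSE loss, so the optimizer is left to the caller:

```python
import torch.nn as nn

def train_epoch(model, loader, optimizer, device="cpu"):
    """One pass over the step-S410 data for step S420: the model receives a
    frame and its attention map, the label is the video's quality score, and
    the loss is MSE. The optimizer is the caller's choice (not specified)."""
    criterion = nn.MSELoss()
    model = model.train().to(device)
    for frame, attn, score in loader:  # DataLoader yielding (frame, map, score)
        frame, attn = frame.to(device), attn.to(device)
        score = score.float().to(device)
        optimizer.zero_grad()
        loss = criterion(model(frame, attn), score)
        loss.backward()
        optimizer.step()
```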
The video quality evaluation step S50 of an embodiment further includes the following steps:
Step S500, extract video frames from the video under test according to step S10;
Step S510, generate the optical flow amplitude maps of the frames under test using steps S20 and S30;
Step S520, perform quality evaluation with the visual attention mechanism network trained in step S40, obtaining a quality score for each video frame;
Step S530, average the quality scores of all video frames to obtain the overall quality score of the video.
The results of experiments using the present invention are given below.
Table 1 shows the performance of the present invention on several VQA databases.
Note: SRCC (Spearman rank correlation coefficient)
PLCC (Pearson linear correlation coefficient)
Table 1 Results of testing the present invention on various VQA databases

Database    LIVE     CSIQ     KoNViD-1k
SRCC        0.824    0.801    0.801
PLCC        0.829    0.829    0.814

Claims (5)

1. A no-reference video quality evaluation method based on a visual attention mechanism, characterized in that the method comprises the following steps:
step 1, extracting video frames from a video;
step 2, generating optical flow field data for the extracted video frame by using an open source model PWC-Net;
step 3, preprocessing the optical flow field data to obtain a zoomed optical flow field amplitude map;
step 4, building and training a visual attention mechanism model, specifically building and training a visual attention mechanism model based on ResNet50, wherein the visual attention mechanism model is used for scoring the quality of each extracted video frame;
and step 5, extracting frames from the video to be evaluated according to step 1, scoring the quality of each extracted frame with the trained visual attention mechanism model, and averaging the quality scores of all frames to obtain the overall quality score of the video.
2. The method according to claim 1, characterized in that the step of extracting video frames from the video described in step 1 is specifically as follows:
step 1.1, extracting one video frame every 4 frames, and discarding the other video frames as redundant;
and step 1.2, discarding the last one of the extracted video frames.
3. The method according to claim 1, characterized in that the step of preprocessing the optical flow field data described in step 3 is as follows:
step 3.1, setting thresholds Tx and Ty for the X and Y channels of the optical flow field data respectively, and, for optical flow field data whose X channel exceeds Tx or whose Y channel exceeds Ty, setting the X channel value to Tx and the Y channel value to Ty;
step 3.2, dividing all threshold-truncated values of the X and Y channels by Tx and Ty respectively, for normalization;
the optical flow field amplitude map in step 3 is calculated as follows: after normalization, the amplitude M = √(X² + Y²) of all optical flow field data is computed and used as the optical flow field amplitude map.
4. The method according to claim 1, characterized in that the visual attention mechanism model in step 4 is a modified ResNet50 network, the modification specifically being the addition of a visual attention mechanism module after the second group of convolutional layers of ResNet50, i.e., the scaled optical flow field amplitude map obtained in step 3 is multiplied element-wise with the output feature map of the second group of convolutional layers of ResNet50, and the output of the visual attention mechanism module serves as the input of the third group of convolutional layers of ResNet50.
5. The method according to claim 1, characterized in that in the model training of step 4, the training data input to the model are the video frames obtained in step 1 and the corresponding optical flow field amplitude maps generated in step 3, and the labels are the quality scores of the training videos;
and in step 4 the model is trained with MSE Loss as the loss function.
CN202010841520.0A 2020-08-20 2020-08-20 No-reference video quality evaluation method based on visual attention mechanism Active CN111784694B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010841520.0A CN111784694B (en) 2020-08-20 2020-08-20 No-reference video quality evaluation method based on visual attention mechanism


Publications (2)

Publication Number Publication Date
CN111784694A 2020-10-16
CN111784694B CN111784694B (en) 2024-07-23

Family

ID=72762317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010841520.0A Active CN111784694B (en) 2020-08-20 2020-08-20 No-reference video quality evaluation method based on visual attention mechanism

Country Status (1)

Country Link
CN (1) CN111784694B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020126891A1 (en) * 2001-01-17 2002-09-12 Osberger Wilfried M. Visual attention model
CN102769772A (en) * 2011-05-05 2012-11-07 浙江大学 Method and device for evaluating video sequence distortion
US20170154415A1 (en) * 2015-11-30 2017-06-01 Disney Enterprises, Inc. Saliency-weighted video quality assessment
CN107318014A (en) * 2017-07-25 2017-11-03 西安电子科技大学 The video quality evaluation method of view-based access control model marking area and space-time characterisation
US20190258902A1 (en) * 2018-02-16 2019-08-22 Spirent Communications, Inc. Training A Non-Reference Video Scoring System With Full Reference Video Scores
CN110598537A (en) * 2019-08-02 2019-12-20 杭州电子科技大学 Video significance detection method based on deep convolutional network
CN111193923A (en) * 2019-09-24 2020-05-22 腾讯科技(深圳)有限公司 Video quality evaluation method and device, electronic equipment and computer storage medium
CN110677639A (en) * 2019-09-30 2020-01-10 中国传媒大学 Non-reference video quality evaluation method based on feature fusion and recurrent neural network
CN111182292A (en) * 2020-01-05 2020-05-19 西安电子科技大学 No-reference video quality evaluation method and system, video receiver and intelligent terminal
CN111314733A (en) * 2020-01-20 2020-06-19 北京百度网讯科技有限公司 Method and apparatus for evaluating video sharpness

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wu Zemin; Peng Taopin; Tian Chang; Hu Lei; Wang Lumeng: "No-reference video quality assessment algorithm fusing spatio-temporal perceptual characteristics", Acta Electronica Sinica, No. 03 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112954312A (en) * 2021-02-07 2021-06-11 福州大学 No-reference video quality evaluation method fusing spatio-temporal characteristics
CN112954312B (en) * 2021-02-07 2024-01-05 福州大学 Non-reference video quality assessment method integrating space-time characteristics
CN114202728A (en) * 2021-12-10 2022-03-18 北京百度网讯科技有限公司 Video detection method, device, electronic equipment, medium and product
CN114202728B (en) * 2021-12-10 2022-09-02 北京百度网讯科技有限公司 Video detection method, device, electronic equipment and medium

Also Published As

Publication number Publication date
CN111784694B (en) 2024-07-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Ying Zefeng
Inventor after: Shi Ping
Inventor after: Hou Ming
Inventor after: Pan Da
Inventor before: Shi Ping
Inventor before: Hou Ming
Inventor before: Pan Da
GR01 Patent grant