CN110677639B

CN110677639B - Non-reference video quality evaluation method based on feature fusion and recurrent neural network

Info

Publication number: CN110677639B
Application number: CN201910938025.9A
Authority: CN
Inventors: 史萍; 侯明; 潘达; 应泽峰; 韩明良
Original assignee: Communication University of China
Current assignee: Communication University of China
Priority date: 2019-09-30
Filing date: 2019-09-30
Publication date: 2021-06-11
Anticipated expiration: 2039-09-30
Also published as: CN110677639A

Abstract

The invention discloses a no-reference video quality evaluation method based on feature fusion and a recurrent neural network. The neural network used by the invention directly takes the video segment as input and adopts the feature fusion network, and the design can better extract the direct relation of the video frame, thereby more accurately obtaining the overall quality evaluation index of the video. The feature fusion network can process multiple frames at one time and obtain a feature with low dimensionality, namely the feature scale is greatly reduced relative to the data volume, and the total time can be greatly reduced for a whole video during operation.

Description

Non-reference video quality evaluation method based on feature fusion and recurrent neural network

Technical Field

The invention relates to a no-reference video quality evaluation method based on feature fusion and a recurrent neural network, and belongs to the technical field of digital video processing.

Background

Video, as a complex source of visual information, implies a large amount of valuable information. The Quality of Video directly affects the subjective feeling and information acquisition of people, and other Video tasks such as Video compression can be fed back and measured, and research on Video Quality Assessment (VQA) has also been widely regarded in recent years.

Video quality evaluation can be classified into a subjective evaluation method and an objective evaluation method. In subjective evaluation, an observer subjectively scores video quality, but the subjective evaluation has large workload, long time consumption and inconvenience; the objective evaluation method is that a computer calculates the quality index of the video according to a certain algorithm, and the evaluation method can be divided into three evaluation methods, namely Full Reference (FR), half Reference (RR) and No Reference (No Reference, NR), according to whether the Reference video is needed during evaluation:

(1) a full reference video quality evaluation method. The FR algorithm is to compare the difference between a video to be evaluated and a reference video, and analyze the distortion degree of the video to be evaluated, thereby obtaining the quality evaluation of the video to be evaluated, under the standard that an ideal video is given as the reference video. Among the common FR methods are: video quality evaluation based on video pixel statistics (mainly including peak signal-to-noise ratio and mean square error), video quality evaluation based on deep learning, and video quality evaluation based on structural information (mainly structural similarity). The FR algorithm is by far the most reliable method in objective video quality evaluation.

(2) A semi-reference video quality evaluation method. The RR algorithm is to extract partial characteristic information of a reference video as a reference, and compare and analyze the video to be evaluated so as to obtain the quality evaluation of the video. The common RR algorithm is mainly: an original video characteristic based method and a Wavelet domain statistical model based method.

(3) A no-reference video quality evaluation method. The NR algorithm is a method for evaluating the quality of a video to be evaluated under the condition of no ideal video. The commonly used NR algorithm is mainly: a method based on natural scene statistics and a method based on deep learning.

In the process of acquiring, processing, transmitting and recording videos, due to the fact that an imaging system, a processing method, a transmission medium, recording equipment and the like are not complete, and due to the fact that objects move, noise interference and the like cause video distortion and video degradation, the quality of a section of distorted video is often measured, the quality measurement of the video is directly obtained from the distorted video without using a reference video of the distorted video, and the objective evaluation of the quality of the non-reference video is called.

CN201811071199.1 discloses a no-reference image quality evaluation method based on a hierarchical feature fusion network, which mainly solves the problems of low precision and low speed in the prior art. The implementation scheme is as follows: selecting a reference image from the MSCOCO data set and establishing a polluted image database by adding noise; mean value removing and cutting are carried out on the training set images and the test set images simultaneously; designing a hierarchical feature fusion network model for performing end-to-end joint optimization according to a hierarchical processing mechanism from local features to global semantics of a human visual system; training the hierarchical feature fusion network model by utilizing a training set and a test set; performing mean value removal and cutting processing on an image to be evaluated, and inputting the processed image into a trained hierarchical feature fusion network model to obtain an image quality prediction score; therefore, the accuracy and speed of the non-reference quality evaluation are improved, and the method can be used for image screening, compression and video quality monitoring.

CN201810239888.2 discloses a full-reference virtual reality video quality evaluation method based on a convolutional neural network, which comprises the following steps: video preprocessing: obtaining a VR differential video by utilizing a left view video and a right view video of the VR video, uniformly extracting frames from the differential video, and giving non-overlapping blocks to each frame, wherein video blocks at the same position of each frame form a VR video patch; establishing two convolution neural network models with the same configuration; training a convolutional neural network model: by utilizing a gradient descent method, taking VR video patches as input, matching each patch with an original video quality score as a label, inputting the patches into a network in batches, and fully optimizing weights of each layer of the network after multiple iterations to finally obtain a convolutional neural network model for extracting virtual reality video features; extracting features by using a convolutional neural network; and obtaining a local score by using a support vector machine, obtaining a final score by using a score fusion strategy, and improving the accuracy of the objective evaluation method.

The invention aims to perform reference objective quality evaluation on video quality by adopting feature fusion and a recurrent neural network.

Disclosure of Invention

The invention provides a no-reference objective quality evaluation method aiming at the problem of poor quality evaluation performance of no-reference video in the existing video quality evaluation.

The invention adopts the technical scheme that a no-reference video quality evaluation method based on feature fusion and a recurrent neural network comprises the following steps:

step 1, obtaining a video segment from a video.

For a video, video segments are obtained through frame extraction, cropping and combination, and the video segments are used as input of the VQA model.

Step 1.1, extracting video frames, selecting the video frames at intervals of 4, and directly discarding other video frames due to redundancy;

step 1.2, cutting video frames, cutting each video frame into 280 x 280 image blocks in a window cutting mode, and setting one frame capable of cutting M image blocks;

and 1.3, combining the clipped image blocks, randomly taking N starting points in a video sequence, continuously taking T frames at the same position of the image blocks along the time direction, taking T as 8 to obtain a T multiplied by 280 video segment, wherein the T multiplied by 280 video segment is used as a minimum unit input by an VQA model, and a video segment is obtained into an M multiplied by N video segment.

And 2, building and training a feature fusion network.

Constructing and training a Resnet 50-based feature fusion network, inputting a video segment obtained in the step 1, and outputting a 1024-dimensional feature vector:

step 2.1, transforming Resnet50 into a feature fusion network, inputting [ (Batch-Size × T) × Channel × 280 × 280], adjusting to [ (Batch-Size × 1) × (Channel × T) × 280 × 280] after the 2 nd Bottleneck Layer of Resnet50, and realizing feature fusion;

step 2.2, preparing training data, taking the video segment generated in the step 1 as the input of the network, and taking the label of the video segment as the quality score of the whole video;

and 2.3, training the feature fusion network, adding the input dimension of the full connection layer to the tail of the feature fusion network to be 1, inputting the feature fusion network to be a video segment, outputting a label to be a quality score, and training by using MSE Loss.

And 3, obtaining the feature vector representation of the video.

And generating a 1024-dimensional feature vector for each video segment through the trained feature fusion network, and further forming video features.

Step 3.1, discarding the last full connection layer of the trained feature fusion network, and outputting a 1024-dimensional vector;

step 3.2, using the trained feature fusion network obtained in the step 3.1 after discarding the full connection layer to generate a feature vector of each video segment;

and 3.3, combining the characteristics of the video, and correspondingly segmenting the positions along the time axis direction to obtain the characteristics of M multiplied by N multiplied by 1024 as the video characteristics.

And 4, building and training a recurrent neural network.

And (3) building and training an LSTM recurrent neural network, inputting the video characteristics of a certain division position output in the step (3), namely the Nx 1024 video characteristics, and outputting the video quality scores.

Step 4.1, constructing an LSTM recurrent neural network, wherein the network comprises 2 layers of LSTM structures, the size of a first hidden layer is 2048, the size of a second hidden layer is 256, and then connecting a full-connection layer with the output of 1;

and 4.2, training data are sorted, and feature vectors of N video segments of one video segment are sorted into Nx 1024 to be used as the input of the recurrent neural network.

And 4.3, training a recurrent neural network, using the video quality score as a label, and using MSELoss for training.

And 5, evaluating the quality of the video.

And segmenting a section of video, sampling, extracting characteristics and evaluating quality.

Step 5.1, cutting the video to be tested into video segments according to the step 1;

step 5.2, using the feature fusion network trained in the step 2 to extract the features of the video segment cut in the step 5.1;

and 5.3, performing quality evaluation by using the recurrent neural network trained in the step 4, and obtaining M local quality scores from a video.

And 5.4, averaging the M local quality scores to obtain the overall quality score of the video.

Compared with the prior art, the invention has the following advantages:

(1) the conventional VQA technical method based on deep learning often uses a frame level network to evaluate the quality of a frame level first, and then obtains the quality score of the whole video according to the result of each frame. The neural network used by the invention directly takes the video segment as input, adopts the feature fusion network, and uses the circulating neural network to fuse the features of the video segment. The design of the invention can better extract the direct relation of the video frames, thereby more accurately obtaining the overall quality evaluation index of the video.

(2) Compared with a neural network used by a traditional image, the characteristic fusion network used by the invention has the advantages that the correlation of the content between video frames can be more fully extracted due to the characteristic fusion design on the time axis, and the characteristics obtained by the network can better represent the overall characteristics of the video.

(3) Compared with a recurrent neural network which uses frame-level characteristics as input and is used in a traditional video task, the recurrent neural network used in the invention inputs the characteristics of a video segment, so that the range of network detection quality is wider, and the overall quality evaluation of the video is more accurate.

Drawings

FIG. 1 is a flow chart of an embodiment of the present invention;

FIG. 2 is a diagram of a feature fusion network and recurrent neural network architecture according to the present invention;

Detailed Description

The method is described in detail below with reference to the figures and examples.

Provided is an implementation mode.

The flow chart of an embodiment is shown in fig. 1, and comprises the following steps:

step S10, extracting a cutting video segment;

step S20, building and training a feature fusion network;

step S30, obtaining the feature vector representation of the video;

step S40, building and training a recurrent neural network;

step S50, carrying out quality evaluation on the video;

the extract cropped video segment adjusting step S10 of an embodiment further comprises the steps of:

step S100, extracting video frames, selecting the video frames at equal intervals, and directly discarding other video frames due to redundancy;

step S110, cutting video frames, cutting each video frame into image blocks in a window cutting mode, and setting one frame capable of cutting M image blocks;

and step S120, combining the cut image blocks, randomly taking N starting points in a video sequence, and continuously taking T frames at the same position of the image blocks along the time direction to obtain a video segment, wherein the video segment is used as a minimum unit for VQA model input, and M multiplied by N video segments can be obtained from a video segment.

The step S20 of building and training a feature fusion network according to the embodiment further includes the following steps:

step S200, modifying Resnet50 as a feature fusion network to realize feature fusion;

step S210, preparing training data, and setting labels for the video segments generated in the step S10, wherein the labels are the quality scores of the videos;

and S220, training the feature fusion network, adding the input dimension of the full connection layer to the end of the network to be 1, inputting the video segment of S210, and training by using MSE Loss, wherein the output label is a quality score.

The feature vector representation adjusting step S30 of the obtained video of the embodiment further includes the steps of:

step S300, discarding the last full connection layer of the trained feature fusion network, and outputting a 1024-dimensional vector;

step S310, generating a feature vector of each video segment by using the fusion network of S300;

step S320, combining the features of the video, and corresponding to the split position along the time axis direction to obtain the features of M × N × 1024 as the video features.

The step of building and training the recurrent neural network adjustment S40 of the embodiment further includes the following steps:

s400, building an LSTM recurrent neural network, wherein the network comprises 2 layers of LSTM structures, the size of a first hidden layer is 2048, the size of a second hidden layer is 256, and then connecting a full-connection layer with the output of 1;

step S410, training data are arranged, and feature vectors of N sections of video bands obtained in step S320 are arranged into Nx 1024 to be used as input of a recurrent neural network;

and step S420, training a cyclic neural network, using the video quality score as a label, and using MSE Loss for training.

The video quality evaluation adjustment step S50 according to the embodiment further includes the steps of:

step S500, cutting the video to be tested into video segments according to the step S10;

step S510, feature extraction is carried out on the video segments cut in the step 5.1 by using the feature fusion network trained in the step S20;

step S520, the quality evaluation is carried out by using the recurrent neural network trained in the step S40, and M local quality scores are obtained in a section of video;

step S530, averaging the M local quality scores to obtain an overall quality score of the video.

The results of experiments using the present invention are given below.

Table 1 shows the performance results on various VQA databases using the present invention. (without adding pretraining)

Table 1 results of testing the present invention in various VQA databases

Database with a plurality of databases	LIVE	CISQ	KoNVid-1k
				SRCC	0.784	0.751	0.762
PLCC	0.799	0.779	0.784

Claims

1. A no-reference video quality evaluation method based on feature fusion and a recurrent neural network is characterized in that: the method comprises the following steps of,

step 1, obtaining a video segment from a video;

for a video, video segments are obtained through frame extraction, clipping and combination and are used as input of an VQA model;

step 2, building and training a feature fusion network;

building and training a Resnet 50-based feature fusion network, inputting the video segment obtained in the step 1, and outputting a 1024-dimensional feature vector; constructing and training a Resnet 50-based feature fusion network, inputting a video segment obtained in the step 1, and outputting a 1024-dimensional feature vector:

step 2.3, training the feature fusion network, adding the input dimension of a full connection layer to the tail of the feature fusion network to be 1, inputting the feature fusion network to be a video segment, outputting a label to be a mass fraction, and training by using MSE Loss;

step 3, obtaining the feature vector representation of the video;

generating a 1024-dimensional feature vector for each video segment through the trained feature fusion network, and further forming video features;

step 4, building and training a recurrent neural network;

building and training an LSTM recurrent neural network, inputting the video characteristics of a certain segmentation position output in the step 3, and outputting the video characteristics as a quality score of the video;

constructing an LSTM recurrent neural network, wherein the network comprises a 2-layer LSTM structure, the size of a first hidden layer is 2048, the size of a second hidden layer is 256, and then connecting a full-connection layer with the output of 1;

training data are sorted, and feature vectors of N sections of video bands obtained in the step S320 are sorted into Nx 1024 to be used as input of a recurrent neural network;

training a cyclic neural network, using the video quality score as a label, and using MSE Loss to train;

step 5, evaluating the quality of the video;

2. The method according to claim 1, wherein the method comprises the following steps: the steps of deriving a video segment from a video are as follows,

and 1.3, combining the clipped image blocks, randomly taking N starting points in a video sequence, continuously taking T frames at the same position of the image blocks along the time direction, taking 8 by default to obtain a T multiplied by 280 video segment, wherein the video segment is used as a minimum unit input by an VQA model, and the M multiplied by N video segments are obtained from a video segment.