CN113822856A - End-to-end no-reference video quality evaluation method based on layered time-space domain feature representation - Google Patents

End-to-end no-reference video quality evaluation method based on layered time-space domain feature representation

Info

Publication number
CN113822856A
Authority
CN
China
Prior art keywords
video
stage
quality
time
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110945647.1A
Other languages
Chinese (zh)
Inventor
杨峰
周明亮
沈文昊
咸伟志
江蔚
纪程
隋修宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Zhongke Inverse Entropy Technology Co ltd
Original Assignee
Nanjing Zhongke Inverse Entropy Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Zhongke Inverse Entropy Technology Co ltd filed Critical Nanjing Zhongke Inverse Entropy Technology Co ltd
Priority to CN202110945647.1A
Publication of CN113822856A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30168 Image quality inspection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

The invention discloses an end-to-end no-reference video quality evaluation method based on layered time-space domain feature representation. First, the video is preprocessed: the original video is divided into non-overlapping time segments, and each segment is cut into blocks, with the regions at the same position in each segment forming a video block that serves as the input to the neural network. Second, the neural network is trained: features are extracted from the input video segment, and a series of spatio-temporal feature maps extracted stage by stage are output. The feature maps of each stage are then fed to a convolutional neural network and a recurrent neural network to obtain stage quality feature vectors of the same dimensionality. Finally, the quality score of each stage is computed, and the global quality score of the video sequence is computed in combination with an attention model. The invention uses three-dimensional convolutional layers to form a feature extractor, so the network can effectively extract spatio-temporal features and thereby detect the degradation patterns of the video.

Description

End-to-end no-reference video quality evaluation method based on layered time-space domain feature representation
Technical Field
The invention relates to the technical field of video quality evaluation in video coding, in particular to an end-to-end no-reference video quality evaluation method based on layered time-space domain feature representation.
Background
Over the past few years, the demand for video services has increased exponentially. Cisco predicts that in the next few years video traffic will account for 80-90% of total network data traffic. With the development of communication technology, two-thirds of mobile data is delivered to various multimedia mobile devices to meet consumer demand. Such a flexible digital lifestyle requires that consumers be able to enjoy high-quality multimedia content at any time, regardless of their location.
Digital video is subject to various distortions during acquisition, processing, compression, storage and transmission, which degrade its visual quality; the purpose of Video Quality Assessment (VQA) is to predict the perceived quality of a video. A good quality evaluation method can not only evaluate video quality automatically and accurately, but also monitor and guide parameter updates and optimization algorithms in real time, thereby better serving video transmission.
Broadly, video quality evaluation methods are divided into three categories: Full Reference (FR), Reduced Reference (RR) and No Reference (NR). A no-reference quality evaluation method assesses the objective quality of a distorted video when the original lossless video is unavailable, which makes the problem particularly difficult: 1) averaging the distortions of single frames gives poor accuracy; 2) spatial distortion induced by motion is hard to perceive; 3) the interaction between spatio-temporal artifacts is difficult to estimate.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to effectively fuse video time-space domain features and to establish, based on deep learning and hierarchical time-space domain feature representation, a no-reference video quality evaluation method with excellent performance. The evaluation method is more accurate and efficient, makes reasonable use of the semantic information of the intermediate layers of the neural network, and can be jointly optimized as an end-to-end framework.
In order to achieve the purpose, the technical scheme of the invention is as follows: an end-to-end no-reference video quality evaluation method based on layered time-space domain feature representation comprises the following steps:
preprocessing a video: dividing an original video into non-overlapping time segments, and cutting the time segments into blocks, wherein the regions at the same position in each segment form a video block;
training a neural network to obtain a first network model with a video space-time feature extraction function, and performing staged feature extraction on an input video time segment by using the first network model;
training the convolutional neural network and the recurrent neural network to obtain a second network model with a video space-time feature fusion function, and inputting the feature maps of each stage into the second network model to obtain feature vectors of the same dimensionality for each stage;
and calculating the quality score of each stage by using a third network model, obtained by training, that computes video quality scores, and calculating the global quality score of the video sequence in combination with an attention model in the third network model.
Preferably, the first network model includes J three-dimensional convolutional layers, J generalized divisive normalization layers and J maximum pooling layers, and each stage includes a three-dimensional convolutional layer, a generalized divisive normalization layer and a maximum pooling layer connected in sequence; the first network model employs a linear rectification unit as an activation function.
Preferably, the second network model includes J branches, each branch includes a plurality of spatial-domain convolutional layers, a plurality of gated recurrent units and a uniform pooling layer, the activation function adopts a linear rectification unit, and the features of each stage are respectively input into the branches to obtain a quality feature vector of each branch, where J is the total number of stages.
Preferably, the third network model comprises J fully-connected layers for stage quality regression, an attention model consisting of two fully-connected layers, a linear rectification unit and an S-shaped growth curve (sigmoid) unit, and a fully-connected layer for global quality regression.
Compared with the prior art, the invention has the following remarkable advantages:
the method can effectively fuse the video time-space domain characteristics, reasonably utilize the semantic information of the neural network middle layer, and can be used as an end-to-end overall framework for common optimization, and the accuracy and reliability of the method are superior to those of other current video objective quality evaluation methods; the invention utilizes the three-dimensional convolution layer to form a feature extractor, and the network can effectively extract space-time features so as to detect the degradation mode of the video.
Drawings
FIG. 1 is a basic flow diagram of the present invention;
Detailed Description
The following describes the detailed implementation of the present invention with reference to the accompanying drawings.
The invention relates to an end-to-end no-reference video quality evaluation method based on layered time-space domain feature representation which, as shown in FIG. 1, comprises the following steps:
step 1: preprocessing a video: dividing an original video into non-overlapping time segments, and cutting the time segments into blocks, wherein the areas at the same position in each segment form a video block which is used as the input of a neural network;
specifically, in order to sufficiently extract temporal and spatial information of videos, each video is segmented into video segments with a time length of 8; meanwhile, each frame of the video is also uniformly and non-overlapping cut into image blocks. According to the difference of the resolution, each video can extract a plurality of video blocks with the time length of 8 and the space size of 256 multiplied by 3, which respectively represent the height, the width and the channel number of the video frame.
Step 2: training a neural network to obtain a first network model with a video space-time feature extraction function, and performing staged feature extraction on an input video time segment by using the first network model;
specifically, the first network model includes J three-dimensional convolution layers, four generalized divisor normalization layers, and four maximum pooling layers, and each stage includes a three-dimensional convolution layer, a generalized divisor normalization layer, and a maximum pooling layer connected in sequence; the first network model employs a linear rectification unit (ReLU) as an activation function. The resulting outputs of each stage are as follows:
X_j = CNN_j(X_{j-1}), j ∈ {1, 2, …, J},

where CNN_j(·) denotes the three-dimensional convolutional layers of the j-th stage, X_j denotes the output of the j-th stage, and J is the total number of stages.
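For illustration, a hedged sketch of such a staged 3D-CNN feature extractor is given below; the channel widths, kernel sizes, layer ordering within a stage and the simplified divisive-normalization term are assumptions, not values specified by the patent:

```python
# Hypothetical sketch of the first network model: J stages of
# 3D convolution -> (simplified) divisive normalization -> ReLU -> 3D max pooling.
import torch
import torch.nn as nn

class Stage3D(nn.Module):
    """One stage of the staged spatio-temporal feature extractor."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.beta = nn.Parameter(torch.ones(out_ch))               # simplified GDN bias
        self.gamma = nn.Parameter(torch.full((out_ch, out_ch), 0.1))
        self.pool = nn.MaxPool3d(kernel_size=(1, 2, 2))            # pool spatially only

    def forward(self, x):
        y = self.conv(x)                                           # (N, C, T, H, W)
        denom = torch.sqrt(self.beta.view(1, -1, 1, 1, 1)
                           + torch.einsum('ij,njthw->nithw', self.gamma, y * y))
        return self.pool(torch.relu(y / denom))

class FeatureExtractor(nn.Module):
    """First network model: returns the feature map X_j of every stage."""
    def __init__(self, channels=(3, 32, 64, 128, 256)):
        super().__init__()
        self.stages = nn.ModuleList(
            [Stage3D(c_in, c_out) for c_in, c_out in zip(channels[:-1], channels[1:])])

    def forward(self, x):                                          # x: (N, 3, 8, 256, 256)
        outputs = []
        for stage in self.stages:
            x = stage(x)
            outputs.append(x)                                      # X_1 ... X_J
        return outputs
```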
Step 3: training the convolutional neural network and the recurrent neural network to obtain a second network model with a video space-time feature fusion function, and inputting the feature maps of each stage into the second network model to obtain feature vectors of the same dimensionality for each stage;
specifically, each branch of the second network model includes a plurality of airspace convolution layers, a plurality of gating circulation units and a uniform pooling layer, and the activation function adopts a linear rectification unit to finally obtain a quality feature vector of each branch.
Specifically, the process of inputting the feature map of each stage into the second network model to obtain a feature vector of the same dimension for each stage is as follows:
step 31: performing spatial feature fusion by using a plurality of sequentially connected airspace convolution layers to obtain features with consistent dimensions:
F_j^k = Φ_j(X_j^k),

where Φ_j(·) denotes a series of spatial convolutional layers with a kernel size of 3 × 3, zero padding and a stride of 2 × 2, X_j^k denotes the feature map of time segment k in stage j, and F_j^k denotes the feature vector of time segment k in stage j after spatial-domain information fusion.
Step 32: given frame level features
F_j^k, a global maximum pooling layer (denoted GP_max) is used to obtain effective features and reduce spatial redundancy. Meanwhile, a Gated Recurrent Unit (GRU) is used to refine the frame-level features by integrating temporal information:

G_j^k = GRU(GP_max(F_j^k)),

where G_j^k denotes the feature vector of time segment k in stage j after temporal-domain information fusion.
Step 33: obtaining the feature vector of the stage after the spatio-temporal information fusion by using uniform pooling
h_j:

h_j = (1/K) Σ_{k=1}^{K} G_j^k,

where K is the total number of time segments.
Step 34: steps 31-33 are repeated for each stage to obtain a quality feature vector of the same dimension for every stage.
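A hedged sketch of one branch of the second network model (steps 31-33) follows; the layer widths, the number of spatial convolutions and the use of the last GRU state per segment are assumptions made here for illustration:

```python
# Illustrative sketch of one branch of the second network model: spatial-domain
# convolutions, global max pooling, a GRU over frames, and uniform pooling over segments.
import torch
import torch.nn as nn

class StageBranch(nn.Module):
    """Fuses stage-j feature maps of all K time segments into one quality feature vector h_j."""
    def __init__(self, in_ch, hidden=128, num_spatial_convs=2):
        super().__init__()
        convs, ch = [], in_ch
        for _ in range(num_spatial_convs):                 # step 31: 3x3 kernels, zero padding,
            convs += [nn.Conv2d(ch, hidden, kernel_size=3,  # stride 2x2
                                stride=2, padding=1), nn.ReLU()]
            ch = hidden
        self.spatial = nn.Sequential(*convs)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)  # step 32: temporal refinement

    def forward(self, x):                                   # x: (N, K, C, T, H, W)
        n, k, c, t, h, w = x.shape
        frames = x.permute(0, 1, 3, 2, 4, 5).reshape(n * k * t, c, h, w)
        f = self.spatial(frames)                            # step 31: F_j^k
        f = torch.amax(f, dim=(2, 3))                       # step 32: global max pooling GP_max
        f = f.reshape(n * k, t, -1)                          # frame sequence within each segment
        g, _ = self.gru(f)                                   # step 32: GRU over frames
        g = g[:, -1, :].reshape(n, k, -1)                    # last hidden state -> G_j^k
        return g.mean(dim=1)                                 # step 33: uniform pooling -> h_j
```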
Step 4: the quality score of each stage is calculated using the trained third network model that computes video quality scores, and the global quality score of the video sequence is calculated in combination with the attention model in the third network model.
Specifically, the third network model comprises four fully connected layers for stage quality regression, an attention model consisting of two fully connected layers, a linear rectification unit and an S-shaped growth curve (sigmoid) unit, and one fully connected layer for global quality regression.
Further, the specific process of calculating the global quality score of the video sequence is as follows:
step 41: respectively inputting the quality characteristic vectors of all stages into a full-connection layer to obtain the quality scores of all stages of the video:
q_j = FC_j(h_j),

where FC_j(·) denotes the fully connected layer that takes the stage-j quality feature vector as input, and q_j is the quality score of that stage.
Step 42: because a learned model tends to overfit specific scenes in the training set, an attention model is used to obtain a corresponding weight vector that emphasizes the features with a greater influence on perceptual quality. The attention model consists of two fully connected layers, a linear rectification unit and an S-shaped growth curve (sigmoid) unit, and is calculated as follows:
H = h_1 ⊕ h_2 ⊕ … ⊕ h_J,

H_W = Sigmoid(FC_{w2}(ReLU(FC_{w1}(H)))),

where ⊕ denotes the concatenation operation, FC_{w1}(·) and FC_{w2}(·) both denote fully connected layers, ReLU(·) and Sigmoid(·) denote the linear rectification function and the S-shaped growth curve function respectively, and H and H_W denote the feature vector and the weight vector respectively.
Step 43: inputting the global quality feature vector into a full-connection layer to obtain a global quality score:
Q = FC(H ⊙ H_W),

where ⊙ denotes element-wise multiplication of corresponding elements.
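Putting steps 41-43 together, a possible sketch of the third network model is shown below; the feature dimension and the number of stages J = 4 are assumed here for illustration and are not specified values:

```python
# Hedged sketch of the third network model: per-stage quality regression plus the
# attention-weighted global regression of steps 41-43.
import torch
import torch.nn as nn

class QualityRegressor(nn.Module):
    def __init__(self, feat_dim=128, num_stages=4):
        super().__init__()
        self.stage_fc = nn.ModuleList([nn.Linear(feat_dim, 1) for _ in range(num_stages)])
        self.attn = nn.Sequential(                        # two FC layers + ReLU + Sigmoid
            nn.Linear(feat_dim * num_stages, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, feat_dim * num_stages),
            nn.Sigmoid())
        self.global_fc = nn.Linear(feat_dim * num_stages, 1)

    def forward(self, stage_feats):                       # list of h_j, each (N, feat_dim)
        stage_scores = [fc(h) for fc, h in zip(self.stage_fc, stage_feats)]  # q_j (step 41)
        H = torch.cat(stage_feats, dim=1)                 # H = h_1 ⊕ ... ⊕ h_J (step 42)
        HW = self.attn(H)                                 # weight vector H_W
        Q = self.global_fc(H * HW)                        # Q = FC(H ⊙ H_W) (step 43)
        return Q, stage_scores
```

In the end-to-end framework described above, the global score Q (and optionally the stage scores q_j) would be regressed against subjective quality labels so that the three network models are optimized jointly.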
The above embodiment represents only one implementation of the present invention, and although it is described in a specific and detailed manner, it should not be construed as limiting the scope of the patent. It should be noted that a person skilled in the art may make several variations and modifications without departing from the inventive concept, and such presently unforeseen alternatives or modifications are intended to fall within the scope of the present disclosure.

Claims (7)

1. An end-to-end no-reference video quality evaluation method based on layered time-space domain feature representation is characterized by comprising the following steps:
preprocessing a video: dividing an original video into non-overlapping time segments, and cutting the time segments into blocks, wherein the regions at the same position in each segment form a video block;
training a neural network to obtain a first network model with a video space-time feature extraction function, and performing staged feature extraction on an input video time segment by using the first network model;
training the convolutional neural network and the recurrent neural network to obtain a second network model with a video space-time feature fusion function, and inputting the feature maps of each stage into the second network model to obtain feature vectors of the same dimensionality for each stage;
and respectively calculating the quality scores of all stages by using a third network model which is obtained by training and has the function of calculating the video quality scores, and calculating the global quality scores of the video sequence by combining an attention model in the third network model.
2. The end-to-end no-reference video quality evaluation method based on the hierarchical time-space domain feature representation according to claim 1, wherein the first network model comprises J three-dimensional convolutional layers, J generalized divisive normalization layers and J maximum pooling layers, and each stage comprises a three-dimensional convolutional layer, a generalized divisive normalization layer and a maximum pooling layer which are connected in sequence; the first network model adopts a linear rectification unit as an activation function, and J is the total number of stages.
3. The end-to-end no-reference video quality evaluation method based on the hierarchical time-space domain feature representation according to claim 1 or 2, wherein the output features of the first network model at each stage are as follows:
X_j = CNN_j(X_{j-1}), j ∈ {1, 2, …, J},
wherein CNN_j(·) represents the three-dimensional convolutional layers of the j-th stage, X_j represents the output of the j-th stage, and J is the total number of stages.
4. The method according to claim 1, wherein the second network model comprises J branches, each branch comprises a plurality of spatial-domain convolutional layers, a plurality of gated recurrent units and a uniform pooling layer, the activation function adopts a linear rectification unit, and the features of each stage are respectively input into the branches to obtain the quality feature vector of each branch, wherein J is the total number of stages.
5. The end-to-end no-reference video quality evaluation method based on the hierarchical time-space domain feature representation according to claim 1 or 4, characterized in that the specific process of inputting the feature map of each stage to the second network model to obtain the feature vector of the same dimension of each stage is as follows:
step 31: performing spatial feature fusion by using a plurality of sequentially connected spatial-domain convolutional layers to obtain features with consistent dimensions:
F_j^k = Φ_j(X_j^k),
wherein Φ_j(·) represents the plurality of spatial convolutional layers, X_j^k represents the feature map with time-segment serial number k in the j-th stage, and F_j^k represents the feature vector with time-segment serial number k in the j-th stage after spatial-domain information fusion;
step 32: given frame level features
F_j^k, using a global maximum pooling layer GP_max to obtain effective features and reduce spatial redundancy, while using a gated recurrent unit GRU to refine the frame-level features by integrating temporal information:
G_j^k = GRU(GP_max(F_j^k)),
wherein G_j^k represents the feature vector with time-segment serial number k in the j-th stage after temporal-domain information fusion;
step 33: obtaining the feature vector of the stage after the spatio-temporal information fusion by using uniform pooling
h_j:
h_j = (1/K) Σ_{k=1}^{K} G_j^k,
wherein K is the total number of time segments, and h_j represents the stage-j feature vector after spatio-temporal information fusion;
step 34: and repeating the steps 31-33 for each stage feature to obtain the quality feature vector with the same dimension of each stage.
6. The method according to claim 1, wherein the third network model comprises J fully-connected layers for stage quality regression, an attention model consisting of two fully-connected layers, a linear rectification unit and an S-shaped growth curve unit, and a fully-connected layer for global quality regression.
7. The end-to-end reference-free video quality evaluation method based on the hierarchical time-space domain feature representation according to claim 1 or 6, characterized in that the specific process of calculating the global quality score of the video sequence is as follows:
step 41: respectively inputting the quality characteristic vectors of all stages into a full-connection layer to obtain the quality scores of all stages of the video:
q_j = FC_j(h_j),
wherein FC_j(·) represents the fully connected layer that takes the stage-j quality feature vector as input, and q_j is the quality score of that stage;
step 42: inputting the quality feature vectors of all stages into the attention model to obtain the corresponding weight vector, emphasizing the features that have a greater influence on perceptual quality, calculated as follows:
H = h_1 ⊕ h_2 ⊕ … ⊕ h_J,
H_W = Sigmoid(FC_{w2}(ReLU(FC_{w1}(H)))),
wherein ⊕ represents the concatenation operation, FC_{w1}(·) and FC_{w2}(·) represent fully connected layers, ReLU(·) and Sigmoid(·) represent the linear rectification function and the S-shaped growth curve function respectively, and H and H_W represent the feature vector and the weight vector respectively;
step 43: inputting the global quality feature vector into a full-connection layer to obtain a global quality score:
Q = FC(H ⊙ H_W),
wherein ⊙ represents element-wise multiplication of corresponding elements.
CN202110945647.1A 2021-08-16 2021-08-16 End-to-end no-reference video quality evaluation method based on layered time-space domain feature representation Pending CN113822856A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110945647.1A CN113822856A (en) 2021-08-16 2021-08-16 End-to-end no-reference video quality evaluation method based on layered time-space domain feature representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110945647.1A CN113822856A (en) 2021-08-16 2021-08-16 End-to-end no-reference video quality evaluation method based on layered time-space domain feature representation

Publications (1)

Publication Number Publication Date
CN113822856A true CN113822856A (en) 2021-12-21

Family

ID=78922891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110945647.1A Pending CN113822856A (en) 2021-08-16 2021-08-16 End-to-end no-reference video quality evaluation method based on layered time-space domain feature representation

Country Status (1)

Country Link
CN (1) CN113822856A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100316131A1 (en) * 2009-06-12 2010-12-16 Motorola, Inc. Macroblock level no-reference objective quality estimation of video
CN106303507A (en) * 2015-06-05 2017-01-04 江苏惠纬讯信息科技有限公司 Video quality evaluation without reference method based on space-time united information
CN107959848A (en) * 2017-12-08 2018-04-24 天津大学 Universal no-reference video quality evaluation algorithms based on Three dimensional convolution neutral net
CN108235003A (en) * 2018-03-19 2018-06-29 天津大学 Three-dimensional video quality evaluation method based on 3D convolutional neural networks
CN110517237A (en) * 2019-08-20 2019-11-29 西安电子科技大学 No-reference video quality evaluating method based on expansion Three dimensional convolution neural network
CN110677639A (en) * 2019-09-30 2020-01-10 中国传媒大学 Non-reference video quality evaluation method based on feature fusion and recurrent neural network
CN112085102A (en) * 2020-09-10 2020-12-15 西安电子科技大学 No-reference video quality evaluation method based on three-dimensional space-time characteristic decomposition
CN112784698A (en) * 2020-12-31 2021-05-11 杭州电子科技大学 No-reference video quality evaluation method based on deep spatiotemporal information
CN112954312A (en) * 2021-02-07 2021-06-11 福州大学 No-reference video quality evaluation method fusing spatio-temporal characteristics
CN113255786A (en) * 2021-05-31 2021-08-13 西安电子科技大学 Video quality evaluation method based on electroencephalogram signals and target significant characteristics

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
吴泽民; 彭韬频; 田畅; 胡磊; 王露萌: "No-reference video quality assessment algorithm fusing spatio-temporal perceptual characteristics" (融合空时感知特性的无参考视频质量评估算法), Acta Electronica Sinica (电子学报), no. 03 *
王春峰; 苏荔; 黄庆明: "No-reference video quality assessment method based on spatio-temporal fusion using convolutional neural networks" (基于卷积神经网络的时空融合的无参考视频质量评价方法), Journal of University of Chinese Academy of Sciences (中国科学院大学学报), no. 04 *

Similar Documents

Publication Publication Date Title
CN107483920B A panoramic video quality evaluation method and system based on multi-layer quality factors
CN113240580A (en) Lightweight image super-resolution reconstruction method based on multi-dimensional knowledge distillation
CN108391121B (en) No-reference stereo image quality evaluation method based on deep neural network
CN108235003B (en) Three-dimensional video quality evaluation method based on 3D convolutional neural network
CN109831664B (en) Rapid compressed stereo video quality evaluation method based on deep learning
CN112954312A (en) No-reference video quality evaluation method fusing spatio-temporal characteristics
CN110674925B (en) No-reference VR video quality evaluation method based on 3D convolutional neural network
CN112291570B (en) Real-time video enhancement method based on lightweight deformable convolutional neural network
CN112507920B (en) Examination abnormal behavior identification method based on time displacement and attention mechanism
Bosse et al. Neural network-based full-reference image quality assessment
CN105046725B (en) Head shoulder images method for reconstructing in low-bit rate video call based on model and object
CN109859166A A no-reference 3D image quality evaluation method based on multi-column convolutional neural networks
CN109685772B (en) No-reference stereo image quality evaluation method based on registration distortion representation
CN116485741A (en) No-reference image quality evaluation method, system, electronic equipment and storage medium
CN105376563A (en) No-reference three-dimensional image quality evaluation method based on binocular fusion feature similarity
CN111160356A (en) Image segmentation and classification method and device
CN109523558A (en) A kind of portrait dividing method and system
CN110782458A (en) Object image 3D semantic prediction segmentation method of asymmetric coding network
CN114598864A (en) Full-reference ultrahigh-definition video quality objective evaluation method based on deep learning
CN115546589A (en) Image generation method based on graph neural network
CN113947538A (en) Multi-scale efficient convolution self-attention single image rain removing method
CN113822954A (en) Deep learning image coding method for man-machine cooperation scene under resource constraint
CN113362239A (en) Deep learning image restoration method based on feature interaction
CN112862675A (en) Video enhancement method and system for space-time super-resolution
CN113822856A (en) End-to-end no-reference video quality evaluation method based on layered time-space domain feature representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination