CN113822856B - End-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation


Info

Publication number
CN113822856B
CN113822856B (application CN202110945647.1A)
Authority
CN
China
Prior art keywords
stage
video
quality
feature
time
Prior art date
Legal status
Active
Application number
CN202110945647.1A
Other languages
Chinese (zh)
Other versions
CN113822856A (en)
Inventor
杨峰
周明亮
沈文昊
咸伟志
江蔚
纪程
隋修宝
Current Assignee
Nanjing Zhongke Inverse Entropy Technology Co ltd
Original Assignee
Nanjing Zhongke Inverse Entropy Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Nanjing Zhongke Inverse Entropy Technology Co., Ltd.
Priority to CN202110945647.1A
Publication of CN113822856A
Application granted
Publication of CN113822856B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30168 Image quality inspection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

The invention discloses an end-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation. First, the video is preprocessed: the original video is divided into non-overlapping temporal segments, and the segments are cropped so that the regions at the same position in each segment form a video block, which serves as the input of the neural network. Second, the neural network is trained: features are extracted from the input video segments, and a series of spatio-temporal feature maps extracted stage by stage are output. Then, the feature maps of each stage are fed into a convolutional neural network and a recurrent neural network to obtain stage quality feature vectors of the same dimension. Finally, the quality score of each stage is calculated, and the global quality score of the video sequence is calculated in combination with an attention model. The invention uses three-dimensional convolution layers to form the feature extractor, so the network can effectively extract spatio-temporal features and thereby detect the degradation patterns of the video.

Description

End-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation
Technical Field
The invention relates to the technical field of video quality evaluation in video coding, in particular to an end-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation.
Background
The demand for video services has grown exponentially over the past few years. Cisco predicts that in the coming years video traffic will account for 80-90% of total network data traffic. With the development of communication technology, two thirds of mobile data is delivered to various multimedia mobile devices to meet consumer demand. This flexible digital lifestyle means that consumers expect to enjoy high-quality multimedia content anytime, wherever they are.
Digital video can suffer various distortions during acquisition, processing, compression, storage, and transmission, which degrade its visual quality; the purpose of video quality assessment (VQA) is to predict the perceived quality of the video. A good quality evaluation method can not only evaluate video quality automatically and accurately, but also monitor transmission in real time and guide parameter updating and algorithm optimization, thereby better serving video transmission.
Broadly, video quality evaluation methods are divided into three types: full-reference (FR), reduced-reference (RR), and no-reference (NR). A no-reference method evaluates the objective quality of a distorted video when the original lossless video is not available, which makes it the most difficult to research: 1) averaging the distortions of single frames gives poor accuracy; 2) spatial distortions induced by motion are hard to perceive; 3) interactions between spatio-temporal artifacts are difficult to estimate.
Disclosure of Invention
In order to overcome the shortcomings of the prior art, the invention aims to effectively fuse spatio-temporal video features and to establish a high-performance non-reference video quality evaluation method based on deep learning and hierarchical spatio-temporal feature representation. The evaluation method is accurate and efficient, makes reasonable use of the semantic information of the intermediate layers of the neural network, and can be jointly optimized as an end-to-end overall architecture.
In order to achieve the above purpose, the technical scheme of the invention is as follows: an end-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation comprises the following steps:
Preprocessing video: preprocessing video: dividing an original video into non-overlapping time slices, and cutting the time slices into blocks, wherein the regions with the same positions in each slice form a video block;
Training a neural network to obtain a first network model with a video space-time feature extraction function, and carrying out staged feature extraction on an input video time segment by using the first network model;
training a convolutional neural network and a recurrent neural network to obtain a second network model with a video spatio-temporal feature fusion function, and inputting the feature maps of each stage into the second network model to obtain feature vectors of the same dimension for each stage;
Calculating the quality score of each stage with a third network model having a video quality score calculation function, and calculating the global quality score of the video sequence in combination with the attention model in the third network model.
Preferably, the first network model comprises J three-dimensional convolution layers, J generalized divisive normalization (GDN) layers and J max-pooling layers, and each stage comprises a three-dimensional convolution layer, a GDN layer and a max-pooling layer connected in sequence; the first network model employs a linear rectification unit (ReLU) as the activation function.
Preferably, the second network model includes J branches, each branch comprising a plurality of spatial-domain convolution layers, a plurality of gated recurrent units and a uniform pooling layer, with a linear rectification unit as the activation function; the features of each stage are input into the corresponding branch to obtain the quality feature vector of that branch, where J is the total number of stages.
Preferably, the third network model comprises J fully connected layers for stage quality regression, an attention model consisting of two fully connected layers, a linear rectification unit and a sigmoid (S-shaped growth curve) unit, and a fully connected layer for global quality regression.
Compared with the prior art, the invention has the remarkable advantages that:
the method can effectively fuse the spatio-temporal features of the video, makes reasonable use of the semantic information of the intermediate layers of the neural network, and can be jointly optimized as an end-to-end overall architecture; its accuracy and reliability are superior to those of other existing objective video quality evaluation methods. The invention uses three-dimensional convolution layers to form the feature extractor, so the network can effectively extract spatio-temporal features and thereby detect the degradation patterns of the video.
Drawings
FIG. 1 is a basic flow chart of the present invention;
Detailed Description
The following describes the implementation of the present invention in detail with reference to the accompanying drawings.
The invention relates to an end-to-end non-reference video quality evaluation method based on hierarchical spatio-temporal feature representation, which, as shown in FIG. 1, comprises the following steps:
Step 1: preprocessing video: dividing an original video into non-overlapping time slices, and cutting the time slices into blocks, wherein the regions with the same positions in each slice form a video block which is used as the input of a neural network;
Specifically, in order to fully extract the temporal and spatial information of the videos, each video is divided into segments of temporal length 8; at the same time, each video frame is cut into uniform, non-overlapping image blocks. Depending on the resolution, several video blocks of temporal length 8 and spatial size 256×256×3 (the height, width and number of channels of a video frame) can be extracted from each video.
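A minimal sketch of this preprocessing, assuming the video is given as a NumPy array of shape (T, H, W, 3); the clip length of 8 and the 256×256 crop size follow the description above, while the function name and variable names are illustrative only:

```python
import numpy as np


def preprocess_video(video: np.ndarray, clip_len: int = 8, crop: int = 256) -> np.ndarray:
    """Split a (T, H, W, 3) video into non-overlapping clips of `clip_len` frames,
    then cut every frame into non-overlapping `crop` x `crop` patches; the patches
    at the same spatial position within one clip form a video block of shape
    (clip_len, crop, crop, 3), which is one network input."""
    t, h, w, _ = video.shape
    blocks = []
    for ci in range(t // clip_len):                       # non-overlapping time segments
        clip = video[ci * clip_len:(ci + 1) * clip_len]
        for y in range(0, h - crop + 1, crop):            # non-overlapping rows
            for x in range(0, w - crop + 1, crop):        # non-overlapping columns
                blocks.append(clip[:, y:y + crop, x:x + crop, :])
    return np.stack(blocks)                               # (N, clip_len, crop, crop, 3)


# e.g. 16 frames at 512x512 yield 2 time segments x 4 spatial positions = 8 blocks
dummy = np.random.rand(16, 512, 512, 3).astype(np.float32)
print(preprocess_video(dummy).shape)                      # (8, 8, 256, 256, 3)
```

In this sketch, border regions that do not fill a complete 256×256 patch are simply discarded.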
Step 2: training a neural network to obtain a first network model with a video space-time feature extraction function, and carrying out staged feature extraction on an input video time segment by using the first network model;
Specifically, the first network model comprises J three-dimensional convolution layers, J generalized divisive normalization (GDN) layers and J max-pooling layers, and each stage consists of a three-dimensional convolution layer, a GDN layer and a max-pooling layer connected in sequence; the first network model employs a linear rectification unit (ReLU) as the activation function. The output of each stage is as follows:
X_j = CNN_j(X_{j-1}), j ∈ [1, 2, …, J],
where CNN_j denotes the three-dimensional convolution layer of the j-th stage, X_j denotes the output of the j-th stage, X_0 is the input video block, and J is the total number of stages.
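As a concrete sketch of this extractor (assuming a PyTorch implementation with J = 4 stages; the channel widths, kernel sizes, the simplified per-channel GDN and the exact placement of the ReLU are assumptions not fixed by the description):

```python
import torch
import torch.nn as nn


class SimpleGDN(nn.Module):
    """Simplified generalized divisive normalization: y = x / sqrt(beta + gamma * x^2).
    (Per-channel parameters only; a full GDN uses a cross-channel weight matrix.)"""
    def __init__(self, channels: int):
        super().__init__()
        self.beta = nn.Parameter(torch.ones(1, channels, 1, 1, 1))
        self.gamma = nn.Parameter(0.1 * torch.ones(1, channels, 1, 1, 1))

    def forward(self, x):
        return x / torch.sqrt(self.beta + self.gamma * x.pow(2) + 1e-6)


class Stage(nn.Module):
    """One stage of the first network model: Conv3d -> GDN -> ReLU -> MaxPool3d."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(c_in, c_out, kernel_size=3, padding=1),
            SimpleGDN(c_out),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),          # halve H and W, keep the temporal length
        )

    def forward(self, x):                                 # X_j = CNN_j(X_{j-1})
        return self.block(x)


class HierarchicalExtractor(nn.Module):
    """First network model: J cascaded stages whose intermediate outputs X_1..X_J are all kept."""
    def __init__(self, widths=(3, 16, 32, 64, 128)):      # J = 4 stages with assumed channel widths
        super().__init__()
        self.stages = nn.ModuleList(Stage(widths[j], widths[j + 1])
                                    for j in range(len(widths) - 1))

    def forward(self, x):                                 # x: (N, 3, T, H, W) video block
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats                                      # [X_1, ..., X_J]


block = torch.randn(1, 3, 8, 256, 256)                    # one 8-frame 256x256 video block
for j, x_j in enumerate(HierarchicalExtractor()(block), start=1):
    print(f"stage {j}:", tuple(x_j.shape))
```

Keeping every intermediate output X_j is what makes the representation hierarchical: each stage feeds its own branch in the second network model.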
Step 3: training a convolutional neural network and a recurrent neural network to obtain a second network model with a video spatio-temporal feature fusion function, and inputting the feature maps of each stage into the second network model to obtain feature vectors of the same dimension for each stage;
Specifically, each branch of the second network model comprises a plurality of spatial-domain convolution layers, a plurality of gated recurrent units and a uniform pooling layer, with a linear rectification unit as the activation function; each branch finally yields the quality feature vector of that branch.
The specific process of inputting the feature maps of each stage into the second network model to obtain feature vectors of the same dimension for each stage is as follows:
Step 31: spatial feature fusion is carried out by using a plurality of airspace convolution layers which are connected in sequence so as to obtain features with consistent dimensions:
Where Φ j (·) represents a series of spatial convolution layers, with kernels of 3 x 3 size, zero padding and step sizes of 2 x2 size, In the j-th stage, the characteristic diagram with the time slice number k is expressed as/>And in the j-th stage, the feature vector with the sequence number k of the time slice after the spatial information fusion is represented.
Step 32: given frame level featuresA global maximization layer (denoted GP max) is used to obtain efficient features and reduce spatial redundancy. Simultaneously using a gating loop unit (Gate Recurrent Unit, GRU), the frame-level features are refined by integrating temporal information:
Wherein the method comprises the steps of And in the j-th stage, the feature vector with the time slice sequence number k after the time domain information fusion is represented.
Step 33: obtaining the feature vector after the temporal and spatial information fusion at the stage by using uniform pooling
Step 34: repeating steps 31-33 for each stage to obtain the quality feature vectors with the same dimension of each stage.
Step 4: and respectively calculating the quality scores of all stages by using the third network model with the function of calculating the video quality scores, and calculating the global quality score of the video sequence by combining the attention model in the third network model.
Specifically, the third network model comprises J fully connected layers for stage quality regression (one per stage), an attention model consisting of two fully connected layers, a linear rectification unit and a sigmoid unit, and a fully connected layer for global quality regression.
Further, the specific process of calculating the global quality score of the video sequence is as follows:
Step 41: the quality feature vector of each stage is fed into a fully connected layer to obtain the stage-wise quality scores of the video:
q_j = FC_j(H_j),
where H_j is the quality feature vector of stage j, FC_j(·) denotes the fully connected layer that takes this vector as input, and q_j is the quality score of that stage.
Step 42: because the learned model tends to over-fit a particular scene in the training set, the attention model is used to obtain a corresponding weight vector, resulting in features that have a greater impact on perceived quality. The attention model consists of two full-connection layers, a linear rectifying unit and an S-shaped growth curve unit, and the calculation mode is as follows:
HW=Sigmoid(FCw2(ReLu(FCw1(H)))),
Wherein, Representing the join operation, FC w1 (. Cndot.) and FC w2 (. Cndot.) represent fully joined layers, sigmoid (. Cndot.) and ReLu (. Cndot.) represent linear rectification functions and S-type growth curve functions, respectively, and H, H W represents eigenvectors and weight vectors, respectively.
Step 43: inputting the global quality feature vector into the full connection layer to obtain the global quality score:
Q=FC(H⊙HW).
Wherein, the parity element is multiplied correspondingly.
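A sketch of the third network model covering steps 41-43, under the assumptions of one scalar score per stage, J = 4 stages, 128-dimensional stage feature vectors and a small hidden width in the attention block (none of these values are fixed by the description):

```python
import torch
import torch.nn as nn


class QualityRegression(nn.Module):
    """Third network model: one fully connected head per stage gives q_j (step 41),
    an attention block FC_w1 -> ReLU -> FC_w2 -> Sigmoid weights the concatenated
    stage scores H (step 42), and a final fully connected layer regresses the
    global score Q = FC(H * H_W) (step 43)."""
    def __init__(self, n_stages: int = 4, feat_dim: int = 128, attn_hidden: int = 16):
        super().__init__()
        self.stage_fc = nn.ModuleList(nn.Linear(feat_dim, 1) for _ in range(n_stages))
        self.attn = nn.Sequential(
            nn.Linear(n_stages, attn_hidden),
            nn.ReLU(),
            nn.Linear(attn_hidden, n_stages),
            nn.Sigmoid(),
        )
        self.global_fc = nn.Linear(n_stages, 1)

    def forward(self, stage_vectors):                     # list of J stage feature vectors H_j
        q = torch.cat([fc(h_j.unsqueeze(0))               # q_j = FC_j(H_j)
                       for fc, h_j in zip(self.stage_fc, stage_vectors)], dim=1)
        h_w = self.attn(q)                                # H_W: one weight per stage
        return self.global_fc(q * h_w).squeeze()          # Q = FC(H ⊙ H_W)


stage_vectors = [torch.randn(128) for _ in range(4)]      # outputs of the four branches
print(QualityRegression()(stage_vectors).item())
```

The sigmoid keeps every stage weight in (0, 1), so the final layer regresses a weighted combination of the stage scores rather than their plain sum.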
The foregoing detailed description presents only one embodiment of the invention in detail and is not intended to limit the scope of the invention. It should be noted that persons skilled in the art may make several variations and modifications without departing from the spirit of the invention, and such alternatives and modifications also fall within the scope of the invention.

Claims (3)

1. The end-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation is characterized by comprising the following steps of:
Preprocessing video: dividing an original video into non-overlapping time slices, and cutting the time slices into blocks, wherein the regions with the same positions in each slice form a video block;
Training a neural network to obtain a first network model with a video space-time feature extraction function, and carrying out staged feature extraction on an input video time segment by using the first network model;
The first network model comprises J three-dimensional convolution layers, J generalized divisive normalization layers and J max-pooling layers, and each stage comprises a three-dimensional convolution layer, a generalized divisive normalization layer and a max-pooling layer connected in sequence; the first network model adopts a linear rectification unit as the activation function, and J is the total number of stages;
training a convolutional neural network and a recurrent neural network to obtain a second network model with a video spatio-temporal feature fusion function, and inputting the feature maps of each stage into the second network model to obtain feature vectors of the same dimension for each stage;
The second network model comprises J branches, each branch comprising a plurality of spatial-domain convolution layers, a plurality of gated recurrent units and a uniform pooling layer, with a linear rectification unit as the activation function; the features of each stage are respectively input into the corresponding branch to obtain the quality feature vector of that branch, wherein J is the total number of stages;
calculating the quality score of each stage by using a third network model, obtained through training, with a video quality score calculation function, and calculating the global quality score of the video sequence in combination with the attention model in the third network model;
the third network model comprises J fully connected layers for stage quality regression, an attention model consisting of two fully connected layers, a linear rectification unit and a sigmoid (S-shaped growth curve) unit, and a fully connected layer for global quality regression;
The specific process for calculating the global quality score of the video sequence is as follows:
Step 41: the quality feature vector of each stage is input into a fully connected layer to obtain the stage-wise quality scores of the video:
q_j = FC_j(H_j),
where H_j denotes the quality feature vector of stage j, FC_j(·) denotes the fully connected layer that takes this vector as input, and q_j is the quality score of that stage;
step 42: the quality score of each stage is input into the attention model to obtain a corresponding weight vector, so that features with a larger influence on perceived quality are obtained; the calculation is as follows:
H = q_1 ⊕ q_2 ⊕ … ⊕ q_J,
H_W = Sigmoid(FC_w2(ReLU(FC_w1(H)))),
where ⊕ denotes the concatenation operation, FC_w1(·) and FC_w2(·) denote fully connected layers, ReLU(·) and Sigmoid(·) denote the linear rectification function and the sigmoid (S-shaped growth curve) function respectively, and H and H_W denote the feature vector and the weight vector respectively;
step 43: inputting the global quality feature vector into the full connection layer to obtain the global quality score:
Q = FC(H ⊙ H_W),
where ⊙ denotes element-wise multiplication.
2. The end-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation according to claim 1, wherein the feature output at each stage of the first network model is:
X_j = CNN_j(X_{j-1}), j ∈ [1, 2, …, J],
where CNN_j denotes the three-dimensional convolution layer of the j-th stage, X_j denotes the output of the j-th stage, and J is the total number of stages.
3. The end-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation according to claim 1, wherein the specific process of inputting feature graphs of each stage to the second network model to obtain feature vectors of the same dimension of each stage is as follows:
Step 31: spatial feature fusion is carried out by using a plurality of airspace convolution layers which are connected in sequence so as to obtain features with consistent dimensions:
wherein phi j (DEG) represents a plurality of spatial convolution layers, In the j-th stage, the feature map with the time slice number k is shown,In the j-th stage, the feature vector with the sequence number k of the time slice after the spatial information fusion is represented;
Step 32: given frame level features The global maximization layer GP max is used to obtain efficient features and reduce spatial redundancy, while the gating loop unit GRU is used to refine frame-level features by integrating temporal information:
Wherein the method comprises the steps of In the j-th stage, the feature vector with the time slice sequence number k after the time domain information fusion is represented;
Step 33: obtaining the feature vector after the temporal and spatial information fusion at the stage by using uniform pooling
Wherein K is the total number of time slices;
Step 34: repeating the steps 31-33 for each stage feature to obtain quality feature vectors with the same dimension of each stage.
CN202110945647.1A 2021-08-16 2021-08-16 End-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation Active CN113822856B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110945647.1A CN113822856B (en) 2021-08-16 2021-08-16 End-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation

Publications (2)

Publication Number Publication Date
CN113822856A CN113822856A (en) 2021-12-21
CN113822856B (en) 2024-06-21

Family

ID=78922891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110945647.1A Active CN113822856B (en) 2021-08-16 2021-08-16 End-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation

Country Status (1)

Country Link
CN (1) CN113822856B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108235003A (en) * 2018-03-19 2018-06-29 天津大学 Three-dimensional video quality evaluation method based on 3D convolutional neural networks
CN112954312A (en) * 2021-02-07 2021-06-11 福州大学 No-reference video quality evaluation method fusing spatio-temporal characteristics

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100316131A1 (en) * 2009-06-12 2010-12-16 Motorola, Inc. Macroblock level no-reference objective quality estimation of video
CN106303507B (en) * 2015-06-05 2019-01-22 江苏惠纬讯信息科技有限公司 Video quality evaluation without reference method based on space-time united information
CN107959848B (en) * 2017-12-08 2019-12-03 天津大学 Universal no-reference video quality evaluation algorithms based on Three dimensional convolution neural network
CN110517237B (en) * 2019-08-20 2022-12-06 西安电子科技大学 No-reference video quality evaluation method based on expansion three-dimensional convolution neural network
CN110677639B (en) * 2019-09-30 2021-06-11 中国传媒大学 Non-reference video quality evaluation method based on feature fusion and recurrent neural network
CN112085102B (en) * 2020-09-10 2023-03-10 西安电子科技大学 No-reference video quality evaluation method based on three-dimensional space-time characteristic decomposition
CN113255786B (en) * 2021-05-31 2024-02-09 西安电子科技大学 Video quality evaluation method based on electroencephalogram signals and target salient characteristics

Also Published As

Publication number Publication date
CN113822856A (en) 2021-12-21

Similar Documents

Publication Publication Date Title
CN112004085B (en) Video coding method under guidance of scene semantic segmentation result
CN110060236B (en) Stereoscopic image quality evaluation method based on depth convolution neural network
CN110837842A (en) Video quality evaluation method, model training method and model training device
CN112291570B (en) Real-time video enhancement method based on lightweight deformable convolutional neural network
US20220222796A1 (en) Image processing method and apparatus, server, and storage medium
CN112653899A (en) Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene
CN110222592B (en) Construction method of time sequence behavior detection network model based on complementary time sequence behavior proposal generation
CN110674925B (en) No-reference VR video quality evaluation method based on 3D convolutional neural network
CN112507920B (en) Examination abnormal behavior identification method based on time displacement and attention mechanism
CN106937116A (en) Low-complexity video coding method based on random training set adaptive learning
CN115063326B (en) Infrared night vision image efficient communication method based on image compression
CN113743269A (en) Method for identifying video human body posture in light weight mode
CN113658122A (en) Image quality evaluation method, device, storage medium and electronic equipment
CN110807369B (en) Short video content intelligent classification method based on deep learning and attention mechanism
CN114598864A (en) Full-reference ultrahigh-definition video quality objective evaluation method based on deep learning
CN116580184A (en) YOLOv 7-based lightweight model
CN113822954B (en) Deep learning image coding method for man-machine cooperative scene under resource constraint
CN113822856B (en) End-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation
CN110555120A (en) picture compression control method and device, computer equipment and storage medium
CN113938254A (en) Attention mechanism-based layered source-channel joint coding transmission system and transmission method thereof
US20240062347A1 (en) Multi-scale fusion defogging method based on stacked hourglass network
CN116468625A (en) Single image defogging method and system based on pyramid efficient channel attention mechanism
CN116543338A (en) Student classroom behavior detection method based on gaze target estimation
CN114663307B (en) Integrated image denoising system based on uncertainty network
CN113255695A (en) Feature extraction method and system for target re-identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant