CN115511858A - Video quality evaluation method based on novel time sequence characteristic relation mapping - Google Patents

Video quality evaluation method based on novel time sequence characteristic relation mapping

Info

Publication number
CN115511858A
CN115511858A (application number CN202211230036.XA)
Authority
CN
China
Prior art keywords
video
network
feature
time sequence
quality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211230036.XA
Other languages
Chinese (zh)
Inventor
毛钰
郑博仑
颜成钢
孙垚棋
高宇涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202211230036.XA
Publication of CN115511858A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/98 Detection or correction of errors, e.g. by rescanning the pattern or by human intervention; Evaluation of the quality of the acquired patterns
    • G06V10/993 Evaluation of the quality of the acquired pattern
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30168 Image quality inspection

Abstract

The invention discloses a video quality evaluation method based on a novel time-sequence feature relation mapping. A pre-trained feature extraction network extracts frame-level features, and a Bi-LSTM network then captures the long-term dependence between the content-aware features of the video sequence context and frame-level quality. The hidden states output by the network are combined with the frame-level features extracted by the pre-trained network to construct a novel time-sequence feature relation map, so that the long-term and short-term temporal relations between adjacent frames and interval frames of the video sequence are fully exploited. The bidirectional long short-term memory neural network used for temporal modeling fuses the content-aware features of the video sequence more effectively in the time dimension, while the constructed novel time-sequence feature relation map captures the short-term changes of the video's temporal information more effectively, providing rich temporal information for the subsequent quality prediction task.

Description

Video quality evaluation method based on novel time sequence characteristic relation mapping
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a video quality evaluation method based on novel time sequence characteristic relation mapping.
Background Art
In recent years, mobile communication technology has developed rapidly and digital devices (such as mobile phones and tablet computers) have proliferated. Video, as a main form of content delivery, covers many fields such as consumption, medical care and education, and accounts for more than 80% of Internet traffic. However, distortion and degradation are difficult to avoid during video acquisition, compression, transmission, storage and playback, and they affect the viewing experience of audiences and even the understanding of video semantics. To handle video traffic effectively, to guide subsequent processing such as further compression and enhancement, and to trade off bandwidth against video quality, designing a reasonable and reliable Video Quality Assessment (VQA) method has become a research hotspot in the field of computer vision.
Video quality evaluation methods can be roughly divided into two categories: subjective evaluation and objective evaluation. Subjective evaluation, which relies on human assessment of the distorted video, is the most reliable; the average of the opinion scores collected from test subjects is called the Mean Opinion Score (MOS). However, building a large-scale data set with subjective quality assessment consumes a large amount of resources (manpower, materials, etc.). Moreover, subjective evaluation cannot meet the need for real-time assessment of video quality in practical applications such as real-time quality monitoring on live-streaming websites. It is therefore important to study objective video quality evaluation methods, which have great practical value and consume far fewer resources. Among them, no-reference video quality assessment (NR-VQA) models only need distorted videos to train and evaluate the objective model, offer higher flexibility and real-time performance, and have received continuous attention and research in recent years.
Due to the lack of the original reference video, conventional no-reference evaluation methods were generally used early on to evaluate videos distorted by compression. Natural Scene Statistics (NSS) methods then built statistical models over the distributions of transform-domain coefficients, extracted relevant feature information from them, and completed video quality prediction with machine-learning algorithms. However, because of the diversity of noise and the complex content of no-reference videos, these methods cannot represent the various distortions effectively and correctly, which limits model performance. Since deep learning can extract high-level semantic features, researchers have in recent years introduced deep-learning methods into the video quality evaluation task, and the prediction accuracy of no-reference video quality scores has improved continuously.
Because most video sequences in existing data sets are shorter than 20 seconds, simply using a recurrent neural network (such as an LSTM or a GRU) cannot capture the short-range changes in the temporal information of a video well: the recurrent network emphasizes the association between adjacent frames and weakens the temporal relationship between interval frames. If some adjacent frames are unrelated to the video theme, those frames may be unimportant for determining the final video quality, and the information carried by the interval frames becomes more important.
In summary, existing research on video quality evaluation either weakens the influence of the video's temporal information on video quality or confuses the video's information representation in the temporal and spatial domains, and therefore cannot accurately simulate the human visual quality perception process.
Disclosure of Invention
The invention aims to provide a video quality evaluation method based on a novel time-sequence feature relation mapping that addresses the defects of existing quality evaluation methods. The invention makes full use of the temporal features of the video sequence.
The technical scheme adopted for solving the problems is as follows:
step 1, extracting content perception features;
the invention utilizes a pre-trained convolutional neural network as a content perception feature extraction network. The involved frame-level video content-aware feature extractor is the ResNet-50 model pre-trained on ImageNet. The content-aware feature extraction network comprises a pre-trained ResNet-50 model, a spatial global average pooling layer and a global standard deviation pooling layer.
1-1. Input to the content-aware feature extraction network: all frames of a video are used as the input of the convolutional neural network ResNet-50, features are extracted for each video frame, and N feature maps M_t are output:

M_t = CNN(I_t)    (1)

where t is the frame index, t = 1, 2, 3, ..., N, and N is the total number of frames of the video; I_t denotes the image of the t-th frame of the video; M_t denotes the feature map corresponding to the t-th frame of the video.
1-2. Spatial pooling operations are used to retain the more effective information: specifically, a global average pooling operation removes redundant information between different frames, and a global standard deviation pooling operation preserves the variation information between different frames, yielding the feature vectors f_t^mean and f_t^std respectively. Finally, the feature vectors f_t^mean and f_t^std are aggregated to form the content-aware feature f_t, calculated as follows:

f_t^mean = GP_mean(M_t)    (2)

f_t^std = GP_std(M_t)    (3)

f_t = f_t^mean ⊕ f_t^std    (4)

where GP_mean(·) denotes the spatial global average pooling operation, GP_std(·) denotes the global standard deviation pooling operation, f_t^mean and f_t^std are the feature vectors obtained through the global average pooling and global standard deviation pooling operations respectively, ⊕ denotes the concatenation of two vectors, and f_t denotes the final content-aware feature extracted from a single video frame.
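By way of illustration only, the following is a minimal PyTorch/torchvision sketch of this content-aware feature extraction stage (equations (1)-(4)); the function name extract_content_features and the assumption that frames are already preprocessed are illustrative and not taken from the patent.

import torch
import torchvision.models as models

# ResNet-50 pre-trained on ImageNet, truncated before the average-pool and
# classification head so that it outputs the frame-level feature maps M_t.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()

@torch.no_grad()
def extract_content_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (N, 3, H, W) preprocessed video frames -> (N, 4096) features f_t."""
    maps = feature_extractor(frames)            # M_t = CNN(I_t), shape (N, 2048, h, w)
    f_mean = maps.mean(dim=(2, 3))              # spatial global average pooling
    f_std = maps.std(dim=(2, 3))                # spatial global standard deviation pooling
    return torch.cat((f_mean, f_std), dim=1)    # f_t = f_mean concatenated with f_std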
Step 2, time sequence characteristic fusion:
The extracted frame-level content-aware features f_t are fed into a bidirectional long short-term memory (Bi-LSTM) neural network, and the output features of the network are then used, taking the features of every five consecutive video frames as a group, to construct a brand-new mapping that fuses the features of the current frame, the previous two frames and the following two frames. The structure of the new feature map is shown in FIG. 2, and it is implemented as follows:
2-1. Because the dimensionality of the extracted single-frame video features is too high, which is not conducive to subsequent model training, the extracted content-aware feature f_t of each frame is first passed through a fully-connected layer for dimension reduction, yielding a new feature vector X_t:

X_t = W_fx f_t + b_fx    (5)

where b_fx and W_fx denote the bias and the weights of the single fully-connected layer respectively, and f_t denotes the final content-aware feature extracted from a single video frame.
2-2. The frame-level feature vectors X_t are fed into a Bi-LSTM network to capture the long-term dependence between the content-aware features of the video sequence context and frame-level quality. The hidden size of the single-layer network unit is 128, the convolution kernel is 1 × 128, and the initial hidden state of the bidirectional long short-term network is H_0. The hidden state H_t of the bidirectional long short-term network at the current time is computed from the input feature vector X_t at the current time and the hidden state H_(t-1) of the network at the previous time, specifically as follows:

H_t^a = a(X_t, H_(t-1)^a)    (6)

H_t^a' = a'(X_t, H_(t-1)^a')    (7)

where X_t denotes the frame-level feature vector, a and a' are the two network units of the Bi-LSTM, H_t^a denotes the current hidden state of the unidirectional network a of the bidirectional long short-term network, H_(t-1)^a denotes the hidden state of that network at the previous time, H_t^a' denotes the current hidden state of the unidirectional network a', and H_(t-1)^a' denotes the hidden state of a' at the previous time.
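As an illustration of steps 2-1 and 2-2, the following PyTorch sketch applies the dimension-reducing fully-connected layer and a single-layer Bi-LSTM; the class name TemporalEncoder and the assumption that the fully-connected layer reduces the 4096-dimensional feature to the 128-dimensional hidden size quoted above are not from the patent.

import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Dimension reduction (eq. (5)) followed by a single-layer Bi-LSTM (eqs. (6)-(7))."""
    def __init__(self, in_dim: int = 4096, hidden: int = 128):
        super().__init__()
        self.reduce = nn.Linear(in_dim, hidden)                 # X_t = W_fx f_t + b_fx
        self.bilstm = nn.LSTM(hidden, hidden, num_layers=1,
                              batch_first=True, bidirectional=True)

    def forward(self, f: torch.Tensor):
        # f: (B, N, in_dim) frame-level content-aware features f_t
        x = self.reduce(f)                                      # (B, N, 128)
        h, _ = self.bilstm(x)                                   # (B, N, 256)
        h_fwd, h_bwd = h.chunk(2, dim=-1)                       # hidden states of units a and a'
        return x, h_fwd, h_bwd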
2-3. A new time-sequence feature map is constructed from the output of the bidirectional long short-term memory network.

Because the input video sequence contains many frames, constructing a single time-sequence feature map containing all the video frame features would require too much computation, so the invention divides the sequence into n groups of five consecutive frames. For each group, a 5 × 5 time-sequence feature mapping matrix is constructed, using the five consecutive frame-level feature vectors X_t contained in the group and the hidden states of the corresponding bidirectional long short-term network as its target elements. The structure of the time-sequence feature mapping matrix is shown in FIG. 2.

Further, the specific steps of constructing the time-sequence feature mapping matrix are as follows:

Input: frame-level video features X_t, hidden states H^a and H^a' of the bidirectional long short-term network.

Output: n feature mapping matrices of dimension 5 × 5.

(1) Divide the frame-level video features X_t into n groups;

(2) Construct n feature mapping matrices of dimension 5 × 5, i.e. with row count i = 5 and column count j = 5;

Divide the frame-level features X_t into n groups, construct the feature mapping matrix contained in each group, and enter the following loop:

1) The first-row elements are the frame-level features of the 5 consecutive video frames contained in the current group, i.e. I_1j = X_j;

2) Starting from the second row of the matrix, the hidden states H^a and H^a' of the bidirectional long short-term network are introduced; the corresponding position elements are selected by the cyclic process of equations (8)-(9), illustrated in FIG. 2;

3) Return the n groups of feature mapping matrices.

where i denotes the row of the new time-sequence feature mapping matrix, j denotes the column of the matrix, and I_ij is the element of the matrix at that position.
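The sketch below illustrates step 2-3 under stated assumptions: the first row holds the five frame-level features X_t of the group as described above, but the exact rule of equations (8)-(9) that places the Bi-LSTM hidden states in rows 2 to 5 is given only in FIG. 2, so the alternating fill used here is a placeholder choice rather than the patented rule.

import torch

def build_mapping_matrices(x, h_fwd, h_bwd, group: int = 5):
    """x, h_fwd, h_bwd: (N, 128) -> (n, 5, 5, 128) time-sequence feature mapping matrices."""
    n = x.shape[0] // group                       # number of five-frame groups
    matrices = []
    for g in range(n):
        sl = slice(g * group, (g + 1) * group)
        rows = [x[sl]]                            # row 1: I_1j = X_j
        for i in range(2, group + 1):             # rows 2..5: Bi-LSTM hidden states
            src = h_fwd if i % 2 == 0 else h_bwd  # assumed alternation; see eqs. (8)-(9)
            rows.append(src[sl])
        matrices.append(torch.stack(rows))        # one (5, 5, 128) matrix per group
    return torch.stack(matrices)                  # (n, 5, 5, 128)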
2-4. Feature aggregation is performed on each generated group of time-sequence feature mapping matrices:

x_i = conv3(conv2(conv1(map_i)))    (10)

The feature aggregation unit consists of three two-dimensional convolution layers, where map_i denotes the time-sequence feature mapping matrix of the i-th group and x_i is the output feature of the i-th time-sequence feature mapping matrix. The first convolution layer conv1 has a 5 × 5 kernel, a stride of 1 × 1, 128 convolution kernels and 128 output channels; the second convolution layer conv2 has a 3 × 3 kernel, a stride of 1 × 1, 128 convolution kernels and 128 output channels; the third convolution layer conv3 has a 3 × 3 kernel, a stride of 1 × 1, 128 convolution kernels and 128 output channels.
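A minimal PyTorch sketch of the feature-aggregation unit of equation (10) follows; the kernel sizes, strides and channel counts match the text above, while the padding (not specified in the text) is assumed to be 'same' so that the two 3 × 3 layers still receive a 5 × 5 input.

import torch.nn as nn

feature_aggregation = nn.Sequential(
    nn.Conv2d(128, 128, kernel_size=5, stride=1, padding=2),  # conv1: 5 x 5 kernel
    nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1),  # conv2: 3 x 3 kernel
    nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1),  # conv3: 3 x 3 kernel
)

# map_i from step 2-3 has shape (n, 5, 5, 128); Conv2d expects channels first,
# so x_i = feature_aggregation(map_i.permute(0, 3, 1, 2)) has shape (n, 128, 5, 5).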
Step 3, quality regression. The specific method is as follows:

After the features of the i groups have been aggregated, the aggregated group features need to be mapped onto the quality score of the video by a regression model. The regression model used in the invention is a multi-layer perceptron comprising two fully-connected layers, with 128 neurons and 1 neuron respectively, which regresses the aggregated group features to quality scores, specifically:

q_i = f_wFC(x_i)    (11)

where f_wFC is the function formed by the two fully-connected layers and q_i denotes the regressed quality score of the i-th group.
Step 4, quality pooling. The specific method is as follows:

The quality pooling module contains a temporal average pooling model, which averagely pools the results predicted by the quality regression module for all feature groups and outputs the quality score Q of the entire video, specifically:

Q = (1/n) Σ_i q_i    (12)

where q_i is the quality score of each group and Q is the final quality score of the entire video.
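For illustration, the following sketch combines the quality-regression multi-layer perceptron of step 3 with the temporal average pooling of step 4; the two layer widths (128 neurons and 1 neuron) follow the text, while the flattening of the aggregated group feature and the ReLU activation are assumptions.

import torch
import torch.nn as nn

quality_regressor = nn.Sequential(
    nn.Flatten(),                   # (n, 128, 5, 5) -> (n, 3200); flattening is assumed
    nn.Linear(128 * 5 * 5, 128),    # first fully-connected layer: 128 neurons
    nn.ReLU(),                      # activation not specified in the text; assumed
    nn.Linear(128, 1),              # second fully-connected layer: 1 neuron
)

def pool_video_quality(aggregated_groups: torch.Tensor) -> torch.Tensor:
    q = quality_regressor(aggregated_groups).squeeze(-1)  # q_i, one score per group (eq. (11))
    return q.mean()                                       # Q = (1/n) * sum_i q_i (eq. (12))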
Step 5, jointly training the content-aware feature extraction network, the temporal feature fusion module, the quality regression module and the quality pooling model. The specific method is as follows:

The overall model is trained with the Adaptive Moment Estimation (Adam) optimizer, with a weight decay of 0 and an initial learning rate of 10^-5 that then decays by 80% every 200 epochs. The model weights are initialized with the pre-trained ResNet-50 network, and the L1 loss is used during training, specifically:

L(w, b) = Σ_i |ŷ_i - y_i|    (13)

where w denotes the model weights, b denotes the bias parameters during training, ŷ_i is the actual quality score of the video, and y_i is the quality score predicted by the quality evaluation model of the invention.
The invention has the following beneficial effects:
the invention uses the pre-trained feature extraction network to extract the features at the frame level, then utilizes the Bi-LSTM network to capture the long-term dependence relationship between the content perception features of the context information of the video sequence and the frame level quality, and combines the hidden state output by the network and the features extracted at the frame level by the pre-trained feature extraction network to construct a novel time sequence feature relationship diagram, thereby fully utilizing the long-term and short-term time sequence relationship between the adjacent frames and the interval frames of the video sequence. Compared with the method that the time sequence relation modeling is directly carried out on the extracted frame level characteristics by using the unidirectional circulation neural network, the content perception characteristics of the video sequence are more effectively fused in the time dimension by using the bidirectional long-short term memory neural network to carry out the time sequence modeling, and meanwhile, the time sequence information change of the video in a short term is more effectively captured by the constructed novel time sequence characteristic relation graph, so that abundant time sequence information is provided for the development of a subsequent quality prediction task.
Drawings
FIG. 1 is a block diagram of the method of the present invention;
FIG. 2 is a diagram of the novel time-sequence feature map according to the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
The method comprises a pre-trained convolutional neural network, a bidirectional long short-term network, a quality regression model and a video quality pooling model. Assuming a video contains N frames, the complete video sequence is input into the overall model. First, the pre-trained convolutional neural network extracts the content-aware features of each frame, and global pooling operations process these features and retain the effective information. A Bi-LSTM network then captures the long-term dependence between the content-aware features of the video sequence context and frame-level quality. The hidden states output by the network are combined with the frame-level features extracted by the pre-trained feature extraction network to construct the novel time-sequence feature relation maps; a feature aggregation unit extracts the output features of each time-sequence feature relation map, which are fed into the quality regression model to obtain the quality score of each group; finally, a temporal pooling module produces the quality score of the entire video.
As shown in fig. 1, the method of the present invention is as follows:
step 1, extracting content perception features, wherein a pre-trained convolutional neural network is used as a content perception feature extraction network. The involved frame-level video content-aware feature extractor is the ResNet-50 model pre-trained on ImageNet. The content-aware feature extraction network comprises a pre-trained ResNet-50 model, a spatial global average pooling layer and a global standard deviation pooling layer.
1-1. Input to the content-aware feature extraction network: all frames of a video are used as the input of the convolutional neural network ResNet-50, features are extracted for each video frame, and N feature maps M_t are output:

M_t = CNN(I_t)    (1)

where t is the frame index, t = 1, 2, 3, ..., N, and N is the total number of frames of the video; I_t denotes the image of the t-th frame of the video; M_t denotes the feature map corresponding to the t-th frame of the video.
1-2. Spatial pooling operations are used to retain the more effective information: specifically, a global average pooling operation removes redundant information between different frames, and a global standard deviation pooling operation preserves the variation information between different frames, yielding the feature vectors f_t^mean and f_t^std respectively. Finally, the feature vectors f_t^mean and f_t^std are aggregated to form the content-aware feature f_t, calculated as follows:

f_t^mean = GP_mean(M_t)    (2)

f_t^std = GP_std(M_t)    (3)

f_t = f_t^mean ⊕ f_t^std    (4)

where GP_mean(·) denotes the spatial global average pooling operation, GP_std(·) denotes the global standard deviation pooling operation, f_t^mean and f_t^std are the feature vectors obtained through the global average pooling and global standard deviation pooling operations respectively, ⊕ denotes the concatenation of two vectors, and f_t denotes the final content-aware feature extracted from a single video frame.
Step 2, temporal feature fusion. The specific method is as follows:

The extracted frame-level content-aware features are fed into a bidirectional long short-term memory (Bi-LSTM) neural network, and the output features of the network are used, taking the features of every five consecutive video frames as a group, to construct a brand-new mapping that fuses the features of the current frame, the previous two frames and the following two frames. The new feature map is shown in FIG. 2.
2-1. Because the dimensionality of the extracted single-frame video features is too high, which is not conducive to subsequent model training, the extracted content-aware feature f_t of each frame is first passed through a fully-connected layer for dimension reduction, yielding a new feature vector X_t:

X_t = W_fx f_t + b_fx    (5)

where b_fx and W_fx denote the bias and the weights of the single fully-connected layer respectively, and f_t denotes the final content-aware feature extracted from a single video frame.
2-2. The frame-level feature vectors X_t are fed into a Bi-LSTM network to capture the long-term dependence between the content-aware features of the video sequence context and frame-level quality. The hidden size of the single-layer network unit is 128, the convolution kernel is 1 × 128, and the initial hidden state of the bidirectional long short-term network is H_0. The hidden state H_t of the bidirectional long short-term network at the current time is computed from the input feature vector X_t at the current time and the hidden state H_(t-1) of the network at the previous time, specifically as follows:

H_t^a = a(X_t, H_(t-1)^a)    (6)

H_t^a' = a'(X_t, H_(t-1)^a')    (7)

where X_t denotes the frame-level feature vector, a and a' are the two network units of the Bi-LSTM, H_t^a denotes the current hidden state of the unidirectional network a of the bidirectional long short-term network, H_(t-1)^a denotes the hidden state of that network at the previous time, H_t^a' denotes the current hidden state of the unidirectional network a', and H_(t-1)^a' denotes the hidden state of a' at the previous time.
2-3. A new time-sequence feature map is constructed from the output of the bidirectional long short-term memory network.

Because the input video sequence contains many frames, constructing a single time-sequence feature map containing all the video frame features would require too much computation, so the invention divides the sequence into n groups of five consecutive frames. For each group, a 5 × 5 time-sequence feature mapping matrix is constructed, using the five consecutive frame-level feature vectors X_t contained in the group and the hidden states of the corresponding bidirectional long short-term network as its target elements. The structure of the time-sequence feature mapping matrix is shown in FIG. 2.

Further, the specific steps of constructing the time-sequence feature mapping matrix are as follows:

Input: frame-level video features X_t, hidden states H^a and H^a' of the bidirectional long short-term network.

Output: n feature mapping matrices of dimension 5 × 5.

(1) Divide the frame-level video features X_t into n groups;

(2) Construct n feature mapping matrices of dimension 5 × 5, i.e. with row count i = 5 and column count j = 5;

Divide the frame-level features X_t into n groups, construct the feature mapping matrix contained in each group, and enter the following loop:

1) The first-row elements are the frame-level features of the 5 consecutive video frames contained in the current group, i.e. I_1j = X_j;

2) Starting from the second row of the matrix, the hidden states H^a and H^a' of the bidirectional long short-term network are introduced; the corresponding position elements are selected by the cyclic process of equations (8)-(9), illustrated in FIG. 2;

3) Return the n groups of feature mapping matrices.

where i denotes the row of the new time-sequence feature mapping matrix, j denotes the column of the matrix, and I_ij is the element of the matrix at that position.
2-4. Feature aggregation is performed on each generated group of time-sequence feature mapping matrices:

x_i = conv3(conv2(conv1(map_i)))    (10)

The feature aggregation unit consists of three two-dimensional convolution layers, where map_i denotes the time-sequence feature mapping matrix of the i-th group and x_i is the output feature of the i-th time-sequence feature mapping matrix. The first convolution layer conv1 has a 5 × 5 kernel, a stride of 1 × 1, 128 convolution kernels and 128 output channels; the second convolution layer conv2 has a 3 × 3 kernel, a stride of 1 × 1, 128 convolution kernels and 128 output channels; the third convolution layer conv3 has a 3 × 3 kernel, a stride of 1 × 1, 128 convolution kernels and 128 output channels.
Step 3, quality regression. The specific method is as follows:

After the features of the i groups have been aggregated, the aggregated group features need to be mapped onto the quality score of the video by a regression model. The regression model used in the invention is a multi-layer perceptron comprising two fully-connected layers, with 128 neurons and 1 neuron respectively, which regresses the aggregated group features to quality scores, specifically:

q_i = f_wFC(x_i)    (11)

where f_wFC is the function formed by the two fully-connected layers and q_i denotes the regressed quality score of the i-th group.
Step 4, quality pooling. The specific method is as follows:

The quality pooling module contains a temporal average pooling model, which averagely pools the results predicted by the quality regression module for all feature groups and outputs the quality score Q of the entire video, specifically:

Q = (1/n) Σ_i q_i    (12)

where q_i is the quality score of each group and Q is the final quality score of the entire video.
Step 5, jointly training the content-aware feature extraction network, the temporal feature fusion module, the quality regression module and the quality pooling model. The specific method is as follows:

The overall model is trained with the Adaptive Moment Estimation (Adam) optimizer, with a weight decay of 0 and an initial learning rate of 10^-5 that then decays by 80% every 200 epochs. The model weights are initialized with the pre-trained ResNet-50 network, and the L1 loss is used during training, specifically:

L(w, b) = Σ_i |ŷ_i - y_i|    (13)

where w denotes the model weights, b denotes the bias parameters during training, ŷ_i is the actual quality score of the video, and y_i is the quality score predicted by the quality evaluation model of the invention.

Claims (8)

1. A video quality evaluation method based on novel time-sequence feature relation mapping, characterized by comprising the following steps:
Step 1, extracting content-aware features;
Step 2, temporal feature fusion;
Step 3, quality regression;
Step 4, quality pooling;
Step 5, jointly training the content-aware feature extraction network, the temporal feature fusion module, the quality regression module and the quality pooling model.
2. The video quality evaluation method based on novel time-sequence feature relation mapping according to claim 1, characterized in that step 1 is implemented as follows:
a pre-trained convolutional neural network is used as the content-aware feature extraction network, which comprises a pre-trained ResNet-50 model, a spatial global average pooling layer and a global standard deviation pooling layer;
1-1. all frames of a video are used as the input of the convolutional neural network ResNet-50, features are extracted for each video frame, and N feature maps M_t are output:
M_t = CNN(I_t)    (1)
where t is the frame index, t = 1, 2, 3, ..., N, and N is the total number of frames of the video; I_t denotes the image of the t-th frame of the video; M_t denotes the feature map corresponding to the t-th frame of the video;
1-2. spatial pooling operations are used to retain the more effective information: specifically, a global average pooling operation removes redundant information between different frames and a global standard deviation pooling operation preserves the variation information between different frames, yielding the feature vectors f_t^mean and f_t^std respectively; finally, the feature vectors f_t^mean and f_t^std are aggregated to form the content-aware feature f_t, calculated as follows:
f_t^mean = GP_mean(M_t)    (2)
f_t^std = GP_std(M_t)    (3)
f_t = f_t^mean ⊕ f_t^std    (4)
where GP_mean(·) denotes the spatial global average pooling operation, GP_std(·) denotes the global standard deviation pooling operation, f_t^mean and f_t^std are the feature vectors obtained through the global average pooling and global standard deviation pooling operations respectively, ⊕ denotes the concatenation of two vectors, and f_t denotes the final content-aware feature extracted from a single video frame.
3. The video quality evaluation method based on novel time-sequence feature relation mapping according to claim 2, characterized in that step 2 is implemented as follows:
the extracted frame-level content-aware features f_t are fed into a bidirectional long short-term memory neural network, and the output features of the network are used, taking the features of every five consecutive frames as a group, to construct a brand-new mapping that fuses the features of the current frame, the previous two frames and the following two frames.
4. The video quality evaluation method based on novel time-sequence feature relation mapping according to claim 3, characterized in that the novel feature mapping is specifically realized as follows:
2-1. the extracted content-aware feature f_t of each frame is first passed through a fully-connected layer for dimension reduction, yielding a new feature vector X_t:
X_t = W_fx f_t + b_fx    (5)
where b_fx and W_fx denote the bias and the weights of the single fully-connected layer respectively, and f_t denotes the final content-aware feature extracted from a single video frame;
2-2. the frame-level feature vectors X_t are fed into a Bi-LSTM network to capture the long-term dependence between the content-aware features of the video sequence context and frame-level quality; the hidden size of the single-layer network unit is set to 128, the convolution kernel is 1 × 128, and the initial hidden state of the bidirectional long short-term network is set to H_0; the hidden state H_t of the bidirectional long short-term network at the current time is computed from the input feature vector X_t at the current time and the hidden state H_(t-1) of the network at the previous time, specifically as follows:
H_t^a = a(X_t, H_(t-1)^a)    (6)
H_t^a' = a'(X_t, H_(t-1)^a')    (7)
where X_t denotes the frame-level feature vector, a and a' are the two network units of the Bi-LSTM, H_t^a denotes the current hidden state of the unidirectional network a of the bidirectional long short-term network, H_(t-1)^a denotes the hidden state of that network at the previous time, H_t^a' denotes the current hidden state of the unidirectional network a', and H_(t-1)^a' denotes the hidden state of a' at the previous time;
2-3. a new time-sequence feature map is constructed from the output of the bidirectional long short-term memory network: the sequence is divided into n groups of five consecutive frames, and for each group a 5 × 5 time-sequence feature mapping matrix is constructed, using the five consecutive frame-level feature vectors X_t contained in the group and the hidden states of the corresponding bidirectional long short-term network as its target elements;
2-4. feature aggregation is performed on each generated group of time-sequence feature mapping matrices.
5. The method according to claim 4, characterized in that the time-sequence feature mapping matrix is constructed by the following steps:
Input: frame-level video features X_t, hidden states H^a and H^a' of the bidirectional long short-term network;
Output: n feature mapping matrices of dimension 5 × 5;
(1) divide the frame-level video features X_t into n groups;
(2) construct n feature mapping matrices of dimension 5 × 5, i.e. with row count i = 5 and column count j = 5;
divide the frame-level features X_t into n groups, construct the feature mapping matrix contained in each group, and enter the following loop:
1) the first-row elements are the frame-level features of the 5 consecutive video frames contained in the current group, i.e. I_1j = X_j;
2) starting from the second row of the matrix, the hidden states H^a and H^a' of the bidirectional long short-term network are introduced; the corresponding position elements are selected by the cyclic process of equations (8)-(9), illustrated in FIG. 2;
3) return the n groups of feature mapping matrices;
where i denotes the row of the new time-sequence feature mapping matrix, j denotes the column of the matrix, and I_ij is the element of the matrix at that position.
6. The method according to claim 4, characterized in that the feature aggregation unit consists of three two-dimensional convolution layers, with the specific formula:
x_i = conv3(conv2(conv1(map_i)))    (10)
where map_i denotes the time-sequence feature mapping matrix of the i-th group and x_i is the output feature of the i-th time-sequence feature mapping matrix; the first convolution layer conv1 has a 5 × 5 kernel, a stride of 1 × 1, 128 convolution kernels and 128 output channels; the second convolution layer conv2 has a 3 × 3 kernel, a stride of 1 × 1, 128 convolution kernels and 128 output channels; the third convolution layer conv3 has a 3 × 3 kernel, a stride of 1 × 1, 128 convolution kernels and 128 output channels.
7. The video quality evaluation method based on novel time-sequence feature relation mapping according to claim 4, characterized in that the quality regression in step 3 is specifically as follows:
after the features of the i groups have been aggregated, the aggregated group features are mapped onto the quality score of the video by a regression model, which regresses the aggregated features of each group to a quality score:
q_i = f_wFC(x_i)    (11)
where f_wFC is the function formed by the two fully-connected layers and q_i denotes the regressed quality score of the i-th group.
8. The video quality evaluation method based on novel time-sequence feature relation mapping according to claim 4, characterized in that the quality pooling in step 4 is specifically as follows:
the results predicted by the quality regression module for all feature groups are averagely pooled to output the quality score Q of the entire video:
Q = (1/n) Σ_i q_i    (12)
where q_i is the quality score of each group and Q is the final quality score of the entire video.
CN202211230036.XA 2022-10-08 2022-10-08 Video quality evaluation method based on novel time sequence characteristic relation mapping Pending CN115511858A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211230036.XA CN115511858A (en) 2022-10-08 2022-10-08 Video quality evaluation method based on novel time sequence characteristic relation mapping

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211230036.XA CN115511858A (en) 2022-10-08 2022-10-08 Video quality evaluation method based on novel time sequence characteristic relation mapping

Publications (1)

Publication Number Publication Date
CN115511858A true CN115511858A (en) 2022-12-23

Family

ID=84508112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211230036.XA Pending CN115511858A (en) 2022-10-08 2022-10-08 Video quality evaluation method based on novel time sequence characteristic relation mapping

Country Status (1)

Country Link
CN (1) CN115511858A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071691A (en) * 2023-04-03 2023-05-05 成都索贝数码科技股份有限公司 Video quality evaluation method based on content perception fusion characteristics
CN116071691B (en) * 2023-04-03 2023-06-23 成都索贝数码科技股份有限公司 Video quality evaluation method based on content perception fusion characteristics


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination