CN112954312A - No-reference video quality evaluation method fusing spatio-temporal characteristics - Google Patents
- Publication number
- CN112954312A CN112954312A CN202110176125.XA CN202110176125A CN112954312A CN 112954312 A CN112954312 A CN 112954312A CN 202110176125 A CN202110176125 A CN 202110176125A CN 112954312 A CN112954312 A CN 112954312A
- Authority
- CN
- China
- Prior art keywords
- video
- network
- sub
- feature extraction
- quality
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N17/00—Diagnosis, testing or measuring for television systems or their details
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformation in the plane of the image
- G06T3/40—Scaling the whole image or part thereof
- G06T3/4038—Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20212—Image combination
- G06T2207/20221—Image fusion; Image merging
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30168—Image quality inspection
Abstract
The invention relates to a no-reference video quality evaluation method fusing spatio-temporal characteristics, which comprises the following steps: step S1, acquiring a video data set as a training set; step S2, constructing a spatial domain feature extraction sub-network and training it on a frame set obtained by down-sampling the training set; step S3, constructing a time domain feature extraction sub-network and training it on residual image sequences of the training set; step S4, constructing a video quality evaluation network from the trained spatial domain and time domain feature extraction sub-networks and training it to obtain a video quality evaluation model, with an attention mechanism adaptively adjusting the influence of the temporal and spatial features on perceived video quality; and step S5, extracting the temporal and spatial features of the video under test with the obtained video quality evaluation model and calculating its quality score. The invention can significantly improve the performance of no-reference video quality evaluation.
Description
Technical Field
The invention relates to the field of image and video processing and computer vision, in particular to a no-reference video quality evaluation method fusing spatio-temporal characteristics.
Background
With the development of social media applications and the popularity of consumer capture devices, people can record their daily lives anytime and anywhere by shooting video on portable mobile devices and share it through various media platforms. This has led to a proliferation of user-generated content (UGC) videos shared and streamed over the Internet, so an accurate video quality assessment (VQA) model for consumer videos is needed to monitor, control, and optimize this enormous volume of content. Because most users have no professional training, their lack of imaging expertise can introduce distortions caused by camera shake, sensor noise, defocus, and the like. In addition, part of the original data is inevitably lost while a video is encoded, decoded, stored, transmitted, and processed, producing distortion phenomena such as noise, deformation, and missing content. These distortions lose information contained in the original video to varying degrees, degrading how viewers perceive the video and hindering their ability to obtain information from it. For an organization providing user-centric video services, it is important to ensure that videos leaving the production and distribution chain meet the quality requirements of the receiving end. A video quality evaluation model can score a video according to its degree of distortion and thereby provide a basis for subsequent video processing. Video quality assessment is one of the key technologies in the field of video processing and is crucial for applications in medicine, aviation, education, entertainment, and other fields.
Quality assessment of video can be divided into subjective and objective quality assessment. Subjective quality assessment, which relies on manual scoring, is the most accurate and reasonable, but the time and labor it consumes limit its widespread use in the real world. Researchers have therefore proposed objective quality assessment methods that automatically predict the visual quality of distorted video. According to the availability of reference information, objective methods are divided into full-reference, reduced-reference, and no-reference. Many videos have no reference in practical applications: for user-generated content, a completely distortion-free "perfect" video cannot be captured in the first place, and transmitting the additional reference information would also occupy considerable bandwidth. The no-reference quality evaluation method, which requires no original video, therefore has the widest practical application value.
Most existing no-reference video quality assessment models mainly target synthetic distortions (such as compression distortion). There is a large difference between real and synthetically distorted video: the former may suffer from complex mixtures of real-world distortions, and the distortion may also differ across time periods of the same video. According to recent studies, some state-of-the-art video quality assessment methods validated on synthetic-distortion datasets do not perform well on real-distortion video datasets. In recent years, with the release of real-distortion video quality assessment datasets and the urgent needs of real applications, the invention inputs the video residual image sequence into a 3D convolutional network to compute the temporal features of the video, and applies an attention mechanism to adaptively adjust the influence of temporal and spatial distortions on perceived video quality. The model can significantly improve the performance of no-reference video quality evaluation.
Disclosure of Invention
In view of the above, the present invention provides a method for evaluating quality of a reference-free video by fusing spatio-temporal features, so as to effectively improve the efficiency and performance of evaluating the quality of the reference-free video.
In order to achieve the purpose, the invention adopts the following technical scheme:
a no-reference video quality assessment method fusing spatio-temporal features comprises the following steps:
step S1, acquiring a video data set as a training set;
s2, constructing a spatial domain feature extraction sub-network, and training based on a frame set obtained by down-sampling of a training set;
s3, constructing a time domain feature extraction sub-network, and training based on a residual image sequence of a training set;
step S4, constructing a video quality evaluation network according to the trained spatial domain feature extraction sub-network and time domain feature extraction sub-network, and training to obtain a video quality evaluation model by adaptively adjusting the influence of the time domain and spatial domain features on the video perception quality through an attention mechanism;
and step S5, extracting time domain and space domain characteristics of the video to be detected according to the obtained video quality evaluation model, and calculating the quality score of the video to be detected.
Further, the step S2 is specifically:
step S21, uniformly down-sampling each video of the training set, taking one frame every f frames, and assigning the video's quality score to each sampled frame to obtain a training frame set;
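The uniform down-sampling in step S21 can be sketched as follows; the helper name and the use of zero-based frame indices are illustrative, not part of the patent.

```python
def uniform_downsample_indices(num_frames, f):
    """Indices of the frames kept when taking one frame every f frames."""
    return list(range(0, num_frames, f))

# Example: a 10-frame video sampled with f = 3 keeps frames 0, 3, 6 and 9,
# and each kept frame inherits the quality score of the whole video.
```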
s22, constructing a spatial domain feature extraction sub-network according to the image classification network as a backbone network, and pre-training;
and step S23, fixing the pre-trained parameters in the backbone network, training the spatial domain feature extraction sub-network on the training frame set, and learning the optimal parameters of the model by minimizing the mean squared error loss between the predicted and ground-truth quality scores of all frames in the training frame set, thereby completing the training of the spatial domain feature extraction sub-network.
Further, the spatial domain feature extraction sub-network is specifically: VGG16, ResNet50 or DenseNet is used as the backbone network, and the part of the backbone network after the last convolutional layer is replaced with the following: first, a 1 × 1 convolutional layer with C channels produces the spatial feature map F_s ∈ R^(C×H×W) of the video frame; then global average pooling and global standard-deviation pooling are applied to F_s, the two pooled vectors are concatenated, and a fully connected layer maps the concatenated vector to the quality score of the video frame. The modified network serves as the sub-network for extracting the spatial features of the video.
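The pooling head described above (global average pooling plus global standard-deviation pooling of F_s, followed by concatenation) can be sketched in NumPy; the function name is hypothetical and the final fully connected regression layer is omitted.

```python
import numpy as np

def spatial_head_pool(F_s):
    """Pool a spatial feature map F_s of shape (C, H, W) into a 2C-dim vector:
    global average pooling and global standard-deviation pooling, concatenated."""
    avg = F_s.mean(axis=(1, 2))        # (C,) global average pooling
    std = F_s.std(axis=(1, 2))         # (C,) global standard-deviation pooling
    return np.concatenate([avg, std])  # (2C,), fed to the fully connected layer
```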
Further, the step S3 is specifically:
step S31, constructing a neural network composed of a plurality of 3D convolution layers as a video time domain feature extraction sub-network;
step S32, dividing each training-set video into a plurality of sub-videos, taking the sub-videos obtained from all videos in the training set as a sub-video set, and assigning the ground-truth quality score of each video to its sub-videos as their ground-truth quality scores;
step S33, training the time domain feature extraction sub-network in batches using the sub-video set, and learning the optimal parameters of the model by minimizing the mean squared error loss between the predicted and ground-truth quality scores of the sub-videos, thereby completing the training of the time domain feature extraction sub-network.
Further, the time domain feature extraction sub-network consists, in order, of a 3D convolution module, a pooling module and a regression module, specifically: the 3D convolution module has 6 3D convolutional layers; the convolution kernel size of the first 5 layers is 3 × 3 × 3 and that of the last layer is 1 × 1 × m; each convolutional layer is followed by a ReLU activation function, and the last 3D convolutional layer has C channels. The output of the 3D convolution module is the temporal feature map F_t ∈ R^(C×H×W) of the input sub-video. The pooling module consists of a global max-pooling layer and converts the temporal feature map F_t into a feature vector. The regression module consists of a fully connected layer and maps the feature vector to the quality score.
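The effect of these kernel sizes can be checked with a small shape calculation. The padding values (1 on the five 3 × 3 × 3 layers, none on the final temporal kernel) and the sub-video length m = 8 are assumptions for illustration, not stated in the patent:

```python
def conv3d_out_shape(t, h, w, kt, kh, kw, pt=0, ph=0, pw=0):
    """Output (T, H, W) of a stride-1 3D convolution with the given kernel/padding."""
    return (t + 2 * pt - kt + 1, h + 2 * ph - kh + 1, w + 2 * pw - kw + 1)

m = 8                                 # hypothetical sub-video length
shape = (m, 32, 32)                   # m residual frames of 32x32
for _ in range(5):                    # five 3x3x3 layers with padding 1: shape kept
    shape = conv3d_out_shape(*shape, 3, 3, 3, 1, 1, 1)
shape = conv3d_out_shape(*shape, m, 1, 1)  # temporal kernel of size m collapses time
# shape is now (1, 32, 32): only a spatial C-channel map F_t remains
```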
Further, the step S32 is specifically: dividing each training-set video into a plurality of equal-length sub-videos, each containing m consecutive frames, and calculating a residual image sequence for each sub-video by:

RF_(i~j) = F_((i+1)~j) − F_(i~(j−1))

where F_i denotes the i-th frame of the video, F_(i~j) denotes the sub-video from frame i to frame j, and RF_(i~j) denotes the residual image sequence of that sub-video;
inputting the residual image sequence of each sub-video into the network designed in step S31: the 3D convolution module yields a C × H × W temporal feature map F_t, where C, H and W are the number of channels, height and width of the feature map; the pooling module then produces a C × 1 vector, and the regression module maps it to the quality score of the sub-video.
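Under the formula above, the residual image sequence is simply the frame-to-frame difference of the sub-video; a NumPy sketch (function name illustrative):

```python
import numpy as np

def residual_sequence(frames):
    """RF_(i~j) = F_((i+1)~j) - F_(i~(j-1)): each frame minus its predecessor."""
    return [frames[k + 1] - frames[k] for k in range(len(frames) - 1)]
```

An m-frame sub-video thus yields m − 1 residual images under this reading.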
Furthermore, the video quality evaluation network comprises a spatial domain feature extraction module, a time domain feature extraction module, an attention module, a plurality of subsequent pooling layers and a full-connection layer; the trained spatial domain feature extraction module is a backbone network and a 1 x 1 convolution layer of a spatial domain feature extraction sub-network, and the time domain feature extraction module is a 3D convolution module of a time domain feature extraction sub-network.
Further, the video quality assessment network is constructed and trained, specifically:
obtaining the spatial feature map F_s ∈ R^(C×H×W) of the corresponding sub-video by averaging the m spatial feature maps of its frames, and then preliminarily fusing F_t and F_s by concatenation into the spatio-temporal feature map F_st ∈ R^(2C×H×W);
designing an attention module comprising fused attention and spatial attention: first, a fused attention map is computed from the spatio-temporal feature map F_st; the spatial information of each feature map of F_st is aggregated separately by average pooling and max pooling into F_st^avg and F_st^max, the two results are passed through a shared multi-layer perceptron and added, and a sigmoid function yields the fused attention map A_f;
computing the spatial attention map of the spatio-temporal features: the fused attention map A_f is broadcast along the spatial dimensions to A′_f, the expanded A′_f is multiplied element-wise with the original feature map F_st to obtain the new feature map F′_st, and the new feature map F′_st is then used to generate the spatial attention map A_s;
applying average pooling and max pooling to the new feature map F′_st along the channel dimension to obtain F′_st^avg and F′_st^max, which are concatenated and passed through a convolutional layer and a sigmoid function to generate the spatial attention map A_s;
multiplying the spatial attention map A_s element-wise with the spatio-temporal features F′_st to obtain the final spatio-temporal feature map F_fusion;
converting the spatio-temporal feature map F_fusion into the C-dimensional vector F_v by global pooling, and finally regressing the vector F_v through a fully connected layer to obtain the sub-video quality score;
using the parameters of the corresponding part in the trained spatial domain feature extraction sub-network as the parameters of the spatial domain feature extraction module, and using the parameters of the corresponding part in the trained time domain feature extraction sub-network as the parameters of the time domain feature extraction module;
fixing the parameters of the spatial domain feature extraction module and the time domain feature extraction module, and training the video quality evaluation network on the sub-video set;
and learning the optimal parameters of the model by minimizing the mean square error loss between the predicted quality scores and the real quality scores of all the sub-videos, thereby completing the training process of the video quality evaluation network.
Further, A_f is computed as:

A_f = σ(MLP(F_st^avg) + MLP(F_st^max))

where σ denotes the sigmoid function and MLP is a shared multi-layer perceptron with a ReLU activation function after each layer;
A_s is computed as:

A_s = σ(Conv([F′_st^avg ; F′_st^max]))

where ⊗ denotes element-wise multiplication, [ ; ] denotes channel-wise concatenation, σ denotes the sigmoid function, and Conv denotes the convolutional layer.
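The two attention maps follow the familiar channel/spatial attention pattern; a minimal NumPy sketch with toy shapes, in which the weight matrices W1, W2 and the 1 × 1 convolution kernel are hypothetical stand-ins for the learned parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fused_attention(F_st, W1, W2):
    """A_f = sigmoid(MLP(avg-pool(F_st)) + MLP(max-pool(F_st))), shared MLP."""
    avg, mx = F_st.mean(axis=(1, 2)), F_st.max(axis=(1, 2))  # each (2C,)
    mlp = lambda v: W2 @ np.maximum(W1 @ v, 0.0)             # ReLU between layers
    return sigmoid(mlp(avg) + mlp(mx))                       # (2C,) channel weights

def spatial_attention(F_st_new, kernel):
    """A_s = sigmoid(Conv([avg-pool_c ; max-pool_c])) with a toy 1x1 convolution."""
    stacked = np.stack([F_st_new.mean(axis=0), F_st_new.max(axis=0)])  # (2, H, W)
    return sigmoid(np.tensordot(kernel, stacked, axes=([0], [0])))     # (H, W)
```

With all-zero weights both maps collapse to 0.5 everywhere, a convenient sanity check.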
Further, converting the spatio-temporal feature map F_fusion into the C-dimensional vector F_v by global pooling is specifically: the first C/2 feature maps of F_fusion are subjected to average pooling and standard-deviation pooling respectively, and the two resulting vectors are concatenated into a C-dimensional vector, which is reduced to C/2 dimensions by a fully connected layer (to keep the features balanced) and denoted F_sv; the last C/2 feature maps of F_fusion are max-pooled into a C/2-dimensional vector denoted F_tv; F_sv and F_tv are then concatenated to obtain the C-dimensional vector F_v.
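The split pooling of F_fusion can be sketched as follows; W_fc is a hypothetical stand-in for the learned fully connected weights of shape (C/2, C) that produce F_sv:

```python
import numpy as np

def global_pool_fusion(F_fusion, W_fc):
    """Pool F_fusion (C, H, W) into the C-dim vector F_v as described above."""
    C = F_fusion.shape[0]
    first, last = F_fusion[:C // 2], F_fusion[C // 2:]
    pooled = np.concatenate([first.mean(axis=(1, 2)),
                             first.std(axis=(1, 2))])  # (C,) avg + std pooling
    F_sv = W_fc @ pooled                 # (C/2,) after the balancing FC layer
    F_tv = last.max(axis=(1, 2))         # (C/2,) from max pooling
    return np.concatenate([F_sv, F_tv])  # (C,) vector F_v
```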
Compared with the prior art, the invention has the following beneficial effects:
1. The method extracts deep semantic features through the spatial domain feature extraction module to address the content dependency of predicted video quality. A time domain feature extraction module is designed that replaces RGB frames with video residual images, removing static objects and background information to capture more motion-specific information. The attention module fuses the spatio-temporal features and adaptively adjusts the influence of spatial and temporal distortions on perceived video quality, which can significantly improve the performance of no-reference video quality evaluation.
2. The model of the invention is well suited to videos suffering from complex mixed real-world distortions and has wide practical application value.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a block diagram of a model for reference-free video quality assessment incorporating spatiotemporal features in an embodiment of the present invention;
FIG. 3 is a block diagram of a time domain feature extraction sub-network in an example of the invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
Referring to fig. 1, the present invention provides a no-reference video quality assessment method with spatio-temporal features fused, comprising the following steps:
step S1, acquiring a video data set and randomly dividing it into a training set (80%) and a test set (20%);
s2, constructing a spatial domain feature extraction sub-network, and training based on a frame set obtained by down-sampling of a training set;
step S21, uniformly down-sampling each video of the training set, taking one frame every f frames, and assigning the video's quality score to each sampled frame to obtain a training frame set;
s22, constructing a spatial domain feature extraction sub-network according to the image classification network as a backbone network, and pre-training;
and step S23, fixing the pre-trained parameters in the backbone network, training the spatial domain feature extraction sub-network on the training frame set, and learning the optimal parameters of the model by minimizing the mean squared error loss between the predicted and ground-truth quality scores of all frames in the training frame set, thereby completing the training of the spatial domain feature extraction sub-network.
step S3, constructing a time domain feature extraction sub-network and training it on residual image sequences of the training set, specifically comprising the following steps:
step S31, constructing a neural network composed of a plurality of 3D convolution layers as a video time domain feature extraction sub-network;
step S32, dividing each training-set video into a plurality of sub-videos, taking the sub-videos obtained from all videos in the training set as a sub-video set, and assigning the ground-truth quality score of each video to its sub-videos as their ground-truth quality scores;
preferably, each training-set video is divided into a plurality of equal-length sub-videos, each containing m consecutive frames, and a residual image sequence is calculated for each sub-video by:

RF_(i~j) = F_((i+1)~j) − F_(i~(j−1))

where F_i denotes the i-th frame of the video, F_(i~j) denotes the sub-video from frame i to frame j, and RF_(i~j) denotes the residual image sequence of that sub-video;
inputting the residual image sequence of each sub-video into the network designed in step S31: the 3D convolution module yields a C × H × W temporal feature map F_t, where C, H and W are the number of channels, height and width of the feature map; the pooling module then produces a C × 1 vector, and the regression module maps it to the quality score of the sub-video.
step S33, training the time domain feature extraction sub-network in batches using the sub-video set, and learning the optimal parameters of the model by minimizing the mean squared error loss between the predicted and ground-truth quality scores of the sub-videos, thereby completing the training of the time domain feature extraction sub-network.
Step S4, constructing a video quality evaluation network according to the trained spatial domain feature extraction sub-network and time domain feature extraction sub-network, and training to obtain a video quality evaluation model by adaptively adjusting the influence of the time domain and spatial domain features on the video perception quality through an attention mechanism;
and step S5, extracting time domain and space domain characteristics of the video to be detected according to the obtained video quality evaluation model, and calculating the quality score of the video to be detected.
Preferably, in this embodiment, the spatial domain feature extraction sub-network is specifically: VGG16, ResNet50 or DenseNet is used as the backbone network, and the part of the backbone network after the last convolutional layer is replaced with the following: first, a 1 × 1 convolutional layer with C channels (C = 128) produces the spatial feature map F_s ∈ R^(C×H×W) of the video frame; then global average pooling and global standard-deviation pooling are applied to F_s, the two pooled vectors are concatenated, and a fully connected layer maps the concatenated vector to the quality score of the video frame. The modified network serves as the sub-network for extracting the spatial features of the video.
Preferably, in this embodiment, the time domain feature extraction sub-network consists, in order, of a 3D convolution module, a pooling module and a regression module, specifically: the 3D convolution module has 6 3D convolutional layers; the convolution kernel size of the first 5 layers is 3 × 3 × 3 and that of the last layer is 1 × 1 × m; each convolutional layer is followed by a ReLU activation function, and the last 3D convolutional layer has C channels. The output of the 3D convolution module is the temporal feature map F_t ∈ R^(C×H×W) of the input sub-video. The pooling module consists of a global max-pooling layer and converts the temporal feature map F_t into a feature vector. The regression module consists of a fully connected layer and maps the feature vector to the quality score.
Preferably, in this embodiment, the video quality evaluation network includes a spatial domain feature extraction module, a temporal domain feature extraction module, an attention module, and a plurality of subsequent pooling layers and full-link layers; the trained spatial domain feature extraction module is a backbone network and a 1 x 1 convolution layer of a spatial domain feature extraction sub-network, and the time domain feature extraction module is a 3D convolution module of a time domain feature extraction sub-network.
The video quality assessment network construction and training method specifically comprises the following steps:
obtaining the spatial feature map F_s ∈ R^(C×H×W) of the corresponding sub-video by averaging the m spatial feature maps of its frames, and then preliminarily fusing F_t and F_s by concatenation into the spatio-temporal feature map F_st ∈ R^(2C×H×W);
designing an attention module comprising fused attention and spatial attention: first, a fused attention map is computed from the spatio-temporal feature map F_st; the spatial information of each feature map of F_st is aggregated separately by average pooling and max pooling into F_st^avg and F_st^max, the two results are passed through a shared multi-layer perceptron and added, and a sigmoid function yields the fused attention map A_f, computed as:

A_f = σ(MLP(F_st^avg) + MLP(F_st^max))

where σ denotes the sigmoid function and MLP is a shared multi-layer perceptron with a ReLU activation function after each layer;
computing the spatial attention map of the spatio-temporal features: the fused attention map A_f is broadcast along the spatial dimensions to A′_f, the expanded A′_f is multiplied element-wise with the original feature map F_st to obtain the new feature map F′_st, and the new feature map F′_st is then used to generate the spatial attention map A_s;
applying average pooling and max pooling to the new feature map F′_st along the channel dimension to obtain F′_st^avg and F′_st^max, which are concatenated and passed through a convolutional layer and a sigmoid function to generate the spatial attention map A_s, computed as:

A_s = σ(Conv([F′_st^avg ; F′_st^max]))

where ⊗ denotes element-wise multiplication, [ ; ] denotes channel-wise concatenation, σ denotes the sigmoid function, and Conv denotes the convolutional layer.
multiplying the spatial attention map A_s element-wise with the spatio-temporal features F′_st to obtain the final spatio-temporal feature map F_fusion;
Converting the spatio-temporal feature map F_fusion to a C-dimensional vector F_v by global pooling: the first C/2 feature maps of F_fusion are subjected to average pooling and standard-deviation pooling respectively, the two resulting vectors are concatenated into a C-dimensional vector, which is reduced to C/2 dimensions by a fully connected layer to preserve feature balance and denoted F_sv; the last C/2 feature maps of F_fusion are subjected to max pooling to obtain a C/2-dimensional vector denoted F_tv; F_sv and F_tv are then concatenated into the C-dimensional vector F_v; finally, the vector F_v is regressed to the sub-video quality score through a fully connected layer;
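The split pooling head can be sketched as follows (NumPy; the weight shapes are illustrative assumptions — in the patent both the reduction and the final regression are learned fully connected layers):

```python
import numpy as np

def quality_head(F_fusion, W_reduce, w_reg):
    """Map a (C, H, W) fused feature map to a scalar quality score.

    W_reduce: (C/2, C) fully connected reduction for the spatial half.
    w_reg:    (C,) weights of the final regression layer.
    """
    C = F_fusion.shape[0]
    half = C // 2
    spatial, temporal = F_fusion[:half], F_fusion[half:]
    # first C/2 maps: average pooling + standard-deviation pooling -> C dims
    sv = np.concatenate([spatial.mean(axis=(1, 2)), spatial.std(axis=(1, 2))])
    F_sv = W_reduce @ sv                  # reduce to C/2 dims for feature balance
    # last C/2 maps: global max pooling -> C/2 dims
    F_tv = temporal.max(axis=(1, 2))
    F_v = np.concatenate([F_sv, F_tv])    # C-dimensional vector F_v
    return float(w_reg @ F_v)             # regression to the sub-video score
```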
using the parameters of the corresponding part in the trained spatial domain feature extraction sub-network as the parameters of the spatial domain feature extraction module, and using the parameters of the corresponding part in the trained time domain feature extraction sub-network as the parameters of the time domain feature extraction module;
fixing the parameters of the spatial feature extraction module and the temporal feature extraction module, and training the video quality evaluation network on the sub-video set;
and learning the optimal parameters of the model by minimizing the mean square error loss between the predicted quality scores and the real quality scores of all the sub-videos, thereby completing the training process of the video quality evaluation network.
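With the feature extraction modules frozen, only the remaining parameters are updated by minimizing the MSE loss. Treating the frozen modules as producing fixed feature vectors and the trainable part as a linear regressor, one SGD step looks like this (a simplified sketch; the actual network also trains the attention and pooling layers):

```python
import numpy as np

def mse(pred, target):
    """Mean squared error between predicted and real quality scores."""
    return float(np.mean((pred - target) ** 2))

def sgd_step(w, X, y, lr=0.05):
    """One gradient step on regression weights w; rows of X are frozen
    sub-video features, y the ground-truth quality scores."""
    grad = 2.0 * X.T @ (X @ w - y) / len(y)   # gradient of the MSE loss
    return w - lr * grad
```

Iterating the step drives the MSE loss down, which is the training criterion the embodiment describes.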
In this embodiment, step S5 specifically comprises:
Step S51: dividing each video to be tested into a plurality of sub-videos by the method of step S32, each sub-video comprising m consecutive frames.
Step S52: the sub-video is first split into frames, which are input to the spatial feature extraction module; the sub-video itself is input to the temporal feature extraction module; finally, the quality score of the sub-video is predicted by the video quality evaluation network.
Step S53: taking the average of the predicted quality scores of all sub-videos of a video as the predicted quality score of that video.
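Steps S51–S53 amount to splitting, per-sub-video prediction, and averaging; a sketch follows, where `predict_subvideo` is a hypothetical stand-in for the full network of step S52:

```python
def split_into_subvideos(frames, m):
    """Consecutive, non-overlapping m-frame sub-videos (step S51);
    a trailing remainder shorter than m frames is dropped."""
    return [frames[i:i + m] for i in range(0, len(frames) - m + 1, m)]

def video_score(frames, m, predict_subvideo):
    """Average of the predicted sub-video scores (step S53)."""
    subs = split_into_subvideos(frames, m)
    return sum(predict_subvideo(s) for s in subs) / len(subs)
```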
The above description is only a preferred embodiment of the present invention, and all equivalent changes and modifications made in accordance with the claims of the present invention should be covered by the present invention.
Claims (10)
1. A no-reference video quality evaluation method fusing spatio-temporal characteristics is characterized by comprising the following steps:
step S1, acquiring a video data set as a training set;
step S2, constructing a spatial feature extraction sub-network, and training it on a frame set obtained by downsampling the training set;
step S3, constructing a temporal feature extraction sub-network, and training it on residual image sequences of the training set;
step S4, constructing a video quality evaluation network according to the trained spatial domain feature extraction sub-network and time domain feature extraction sub-network, and training to obtain a video quality evaluation model by adaptively adjusting the influence of the time domain and spatial domain features on the video perception quality through an attention mechanism;
and step S5, extracting time domain and space domain characteristics of the video to be detected according to the obtained video quality evaluation model, and calculating the quality score of the video to be detected.
2. The method for evaluating quality of a reference-free video fused with spatio-temporal features according to claim 1, wherein the step S2 specifically comprises:
step S21, uniformly downsampling each video of the training set, taking one frame every f frames, and assigning the quality score of the video to each sampled frame to obtain a training frame set;
step S22, constructing a spatial feature extraction sub-network using an image classification network as the backbone network, and pre-training it;
step S23, fixing the pre-trained parameters of the backbone network, training the spatial feature extraction sub-network on the training frame set, and learning the optimal parameters of the model by minimizing the mean square error loss between the predicted and real quality scores of all frames in the training frame set, thereby completing the training of the spatial feature extraction sub-network.
3. The method as claimed in claim 2, wherein the spatial feature extraction sub-network specifically comprises: VGG16, ResNet50 or DenseNet is used as the backbone network, and the part of the backbone network after the last convolutional layer is replaced as follows: first, a 1 × 1 convolutional layer with C channels produces the spatial feature map F_s of the video frame; then global average pooling and global standard-deviation pooling are applied to F_s, the two pooled vectors are concatenated, and a fully connected layer maps the concatenated vector to the quality score of the video frame; the modified network serves as the sub-network for extracting spatial features of the video.
4. The method for evaluating quality of a reference-free video fused with spatio-temporal features according to claim 1, wherein the step S3 specifically comprises:
step S31, constructing a neural network composed of a plurality of 3D convolution layers as a video time domain feature extraction sub-network;
step S32, dividing the training set video into a plurality of sub-videos, taking the sub-videos obtained from all the videos in the training set as a sub-video set, and taking the real quality score of each sub-video as the real quality score of the corresponding video;
step S33, training a time domain feature extraction sub-network by using a sub-video set and taking batches as units; and the training process of the time domain feature extraction sub-network is completed by minimizing the mean square error loss between the predicted quality fraction and the real quality fraction of the sub-video and learning the optimal parameters of the model.
5. The method as claimed in claim 4, wherein the temporal feature extraction sub-network consists, in order, of a 3D convolution module, a pooling module and a regression module, specifically: the 3D convolution module has six 3D convolutional layers; the first five layers use 3 × 3 × 3 kernels and the last layer uses a 1 × 1 × m kernel; each convolutional layer is followed by a ReLU activation function, and the last 3D convolutional layer has C channels; the output of the 3D convolution module is the temporal feature map F_t of the input sub-video; the pooling module consists of a global max-pooling layer that converts F_t into a feature vector; the regression module consists of a fully connected layer that maps the feature vector to the quality score.
6. The method as claimed in claim 4, wherein step S32 specifically comprises: dividing each video of the training set into a plurality of sub-videos of equal length, each sub-video comprising m consecutive frames; and calculating the corresponding residual image sequence for each sub-video according to the following formula:
RF_{i~j} = F_{i+1~j} − F_{i~j−1}
wherein F_i denotes the i-th frame of the video, F_{i~j} denotes the sub-video from frame i to frame j, and RF_{i~j} denotes the residual image sequence of that sub-video;
inputting the residual image sequence of each sub-video into the network designed in step S31: the 3D convolution module produces a C × H × W temporal feature map F_t, where C, H and W are the number of channels, the height and the width of the feature map respectively; the pooling module then yields a C × 1 vector, which the regression module maps to the quality score of the sub-video.
7. The method as claimed in claim 1, wherein the video quality evaluation network comprises a spatial feature extraction module, a temporal feature extraction module, an attention module, and subsequent pooling layers and a fully connected layer; the trained spatial feature extraction module consists of the backbone network and the 1 × 1 convolutional layer of the spatial feature extraction sub-network, and the temporal feature extraction module is the 3D convolution module of the temporal feature extraction sub-network.
8. The method for evaluating the quality of the reference-free video fused with the spatio-temporal features according to claim 1, wherein the video quality evaluation network is constructed and trained, and specifically comprises the following steps:
obtaining the spatial feature map F_s of the corresponding sub-video by averaging the m spatial feature maps, and then concatenating F_t and F_s along the channel dimension to obtain a preliminary spatio-temporal feature map F_st;
designing an attention module comprising fusion attention and spatial attention: first, a fusion attention map is computed from the spatio-temporal feature map F_st; average pooling and max pooling are applied separately to aggregate the spatial information of each feature map of F_st, yielding F_st^avg and F_st^max; the two results are passed through a shared multilayer perceptron, added, and fed to a sigmoid function to obtain the fusion attention map A_f;
computing the spatial attention map of the spatio-temporal features: the fusion attention map A_f is broadcast along the spatial dimensions to A'_f, which is multiplied element-wise with the original feature map F_st to obtain a new feature map F'_st; the new feature map F'_st is then used to generate the spatial attention map A_s;
applying average pooling and max pooling to the new feature map F'_st along the channel dimension to obtain F'_st^avg and F'_st^max, which are concatenated and passed through a convolutional layer and a sigmoid function to generate the spatial attention map A_s;
multiplying the spatial attention map A_s element-wise with the spatio-temporal features F'_st to obtain the final spatio-temporal feature map F_fusion;
converting the spatio-temporal feature map F_fusion to a C-dimensional vector F_v by global pooling, and finally regressing the vector F_v to the sub-video quality score through a fully connected layer;
using the parameters of the corresponding part in the trained spatial domain feature extraction sub-network as the parameters of the spatial domain feature extraction module, and using the parameters of the corresponding part in the trained time domain feature extraction sub-network as the parameters of the time domain feature extraction module;
fixing the parameters of the spatial feature extraction module and the temporal feature extraction module, and training the video quality evaluation network on the sub-video set;
and learning the optimal parameters of the model by minimizing the mean square error loss between the predicted quality scores and the real quality scores of all the sub-videos, thereby completing the training process of the video quality evaluation network.
9. The method according to claim 8, wherein the fusion attention map A_f is calculated as:

A_f = σ(MLP(F_st^avg) + MLP(F_st^max))

wherein σ denotes the sigmoid function and MLP is the shared multilayer perceptron, each layer of which is followed by a ReLU activation function; and the spatial attention map A_s is calculated as:

A_s = σ(Conv([F'_st^avg ; F'_st^max]))

wherein [· ; ·] denotes channel-wise concatenation and Conv denotes the convolutional layer.
10. The method according to claim 8, wherein converting the spatio-temporal feature map F_fusion to the C-dimensional vector F_v by global pooling specifically comprises: the first C/2 feature maps of F_fusion are subjected to average pooling and standard-deviation pooling respectively, and the two resulting vectors are concatenated into a C-dimensional vector, which is reduced to C/2 dimensions by a fully connected layer to preserve feature balance and denoted F_sv; the last C/2 feature maps of F_fusion are subjected to max pooling to obtain a C/2-dimensional vector denoted F_tv; F_sv and F_tv are then concatenated to obtain the C-dimensional vector F_v.
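The residual image sequence of claim 6 is a first-order temporal difference; for an m-frame sub-video it can be computed as follows (NumPy sketch, grayscale frames assumed for simplicity):

```python
import numpy as np

def residual_sequence(sub_video):
    """RF_{i~j} = F_{i+1~j} - F_{i~j-1}: frame-wise differences of an
    (m, H, W) sub-video, giving an (m-1, H, W) residual sequence."""
    return sub_video[1:] - sub_video[:-1]
```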
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110176125.XA CN112954312B (en) | 2021-02-07 | 2021-02-07 | Non-reference video quality assessment method integrating space-time characteristics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110176125.XA CN112954312B (en) | 2021-02-07 | 2021-02-07 | Non-reference video quality assessment method integrating space-time characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112954312A true CN112954312A (en) | 2021-06-11 |
CN112954312B CN112954312B (en) | 2024-01-05 |
Family
ID=76244601
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110176125.XA Active CN112954312B (en) | 2021-02-07 | 2021-02-07 | Non-reference video quality assessment method integrating space-time characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112954312B (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113473117A (en) * | 2021-07-19 | 2021-10-01 | 上海交通大学 | No-reference audio and video quality evaluation method based on gated recurrent neural network |
CN113554599A (en) * | 2021-06-28 | 2021-10-26 | 杭州电子科技大学 | Video quality evaluation method based on human visual effect |
CN113642513A (en) * | 2021-08-30 | 2021-11-12 | 东南大学 | Action quality evaluation method based on self-attention and label distribution learning |
CN113784113A (en) * | 2021-08-27 | 2021-12-10 | 中国传媒大学 | No-reference video quality evaluation method based on short-term and long-term time-space fusion network and long-term sequence fusion network |
CN113810683A (en) * | 2021-08-27 | 2021-12-17 | 南京信息工程大学 | No-reference evaluation method for objectively evaluating underwater video quality |
CN113822856A (en) * | 2021-08-16 | 2021-12-21 | 南京中科逆熵科技有限公司 | End-to-end no-reference video quality evaluation method based on layered time-space domain feature representation |
CN113837047A (en) * | 2021-09-16 | 2021-12-24 | 广州大学 | Video quality evaluation method, system, computer equipment and storage medium |
CN114697648A (en) * | 2022-04-25 | 2022-07-01 | 上海为旌科技有限公司 | Frame rate variable video non-reference evaluation method and system, electronic device and storage medium |
CN115278303A (en) * | 2022-07-29 | 2022-11-01 | 腾讯科技(深圳)有限公司 | Video processing method, apparatus, device and medium |
WO2024041268A1 (en) * | 2022-08-24 | 2024-02-29 | 腾讯科技(深圳)有限公司 | Video quality assessment method and apparatus, and computer device, computer storage medium and computer program product |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104023227A (en) * | 2014-05-28 | 2014-09-03 | 宁波大学 | Objective video quality evaluation method based on space domain and time domain structural similarities |
US20160330439A1 (en) * | 2016-05-27 | 2016-11-10 | Ningbo University | Video quality objective assessment method based on spatiotemporal domain structure |
US20180189572A1 (en) * | 2016-12-30 | 2018-07-05 | Mitsubishi Electric Research Laboratories, Inc. | Method and System for Multi-Modal Fusion Model |
CN109325435A (en) * | 2018-09-15 | 2019-02-12 | 天津大学 | Video actions identification and location algorithm based on cascade neural network |
CN110135369A (en) * | 2019-05-20 | 2019-08-16 | 威创集团股份有限公司 | A kind of Activity recognition method, system, equipment and computer readable storage medium |
CN110837842A (en) * | 2019-09-12 | 2020-02-25 | 腾讯科技(深圳)有限公司 | Video quality evaluation method, model training method and model training device |
CN111784694A (en) * | 2020-08-20 | 2020-10-16 | 中国传媒大学 | No-reference video quality evaluation method based on visual attention mechanism |
WO2020221278A1 (en) * | 2019-04-29 | 2020-11-05 | 北京金山云网络技术有限公司 | Video classification method and model training method and apparatus thereof, and electronic device |
CN112085102A (en) * | 2020-09-10 | 2020-12-15 | 西安电子科技大学 | No-reference video quality evaluation method based on three-dimensional space-time characteristic decomposition |
Non-Patent Citations (1)
Title |
---|
ZHU Ze, SANG Qingbing, ZHANG Hao: "No-reference video quality assessment based on spatio-temporal features and attention mechanism", Laser & Optoelectronics Progress, vol. 57, no. 18, pages 181509-1 *
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113554599A (en) * | 2021-06-28 | 2021-10-26 | 杭州电子科技大学 | Video quality evaluation method based on human visual effect |
CN113554599B (en) * | 2021-06-28 | 2023-08-18 | 杭州电子科技大学 | Video quality evaluation method based on human visual effect |
CN113473117A (en) * | 2021-07-19 | 2021-10-01 | 上海交通大学 | No-reference audio and video quality evaluation method based on gated recurrent neural network |
CN113473117B (en) * | 2021-07-19 | 2022-09-02 | 上海交通大学 | Non-reference audio and video quality evaluation method based on gated recurrent neural network |
CN113822856A (en) * | 2021-08-16 | 2021-12-21 | 南京中科逆熵科技有限公司 | End-to-end no-reference video quality evaluation method based on layered time-space domain feature representation |
CN113784113A (en) * | 2021-08-27 | 2021-12-10 | 中国传媒大学 | No-reference video quality evaluation method based on short-term and long-term time-space fusion network and long-term sequence fusion network |
CN113810683A (en) * | 2021-08-27 | 2021-12-17 | 南京信息工程大学 | No-reference evaluation method for objectively evaluating underwater video quality |
CN113642513B (en) * | 2021-08-30 | 2022-11-18 | 东南大学 | Action quality evaluation method based on self-attention and label distribution learning |
CN113642513A (en) * | 2021-08-30 | 2021-11-12 | 东南大学 | Action quality evaluation method based on self-attention and label distribution learning |
CN113837047A (en) * | 2021-09-16 | 2021-12-24 | 广州大学 | Video quality evaluation method, system, computer equipment and storage medium |
CN114697648A (en) * | 2022-04-25 | 2022-07-01 | 上海为旌科技有限公司 | Frame rate variable video non-reference evaluation method and system, electronic device and storage medium |
CN114697648B (en) * | 2022-04-25 | 2023-12-08 | 上海为旌科技有限公司 | Variable frame rate video non-reference evaluation method, system, electronic equipment and storage medium |
CN115278303A (en) * | 2022-07-29 | 2022-11-01 | 腾讯科技(深圳)有限公司 | Video processing method, apparatus, device and medium |
CN115278303B (en) * | 2022-07-29 | 2024-04-19 | 腾讯科技(深圳)有限公司 | Video processing method, device, equipment and medium |
WO2024041268A1 (en) * | 2022-08-24 | 2024-02-29 | 腾讯科技(深圳)有限公司 | Video quality assessment method and apparatus, and computer device, computer storage medium and computer program product |
Also Published As
Publication number | Publication date |
---|---|
CN112954312B (en) | 2024-01-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112954312B (en) | Non-reference video quality assessment method integrating space-time characteristics | |
Sun et al. | MC360IQA: A multi-channel CNN for blind 360-degree image quality assessment | |
CN109544524B (en) | Attention mechanism-based multi-attribute image aesthetic evaluation system | |
CN110751649B (en) | Video quality evaluation method and device, electronic equipment and storage medium | |
Zhu et al. | No-reference video quality assessment based on artifact measurement and statistical analysis | |
Moorthy et al. | Visual quality assessment algorithms: what does the future hold? | |
Sun et al. | Deep learning based full-reference and no-reference quality assessment models for compressed ugc videos | |
Zhang et al. | Fine-grained quality assessment for compressed images | |
CN112995652B (en) | Video quality evaluation method and device | |
CN111047543A (en) | Image enhancement method, device and storage medium | |
Prabhushankar et al. | Ms-unique: Multi-model and sharpness-weighted unsupervised image quality estimation | |
Xu et al. | Perceptual quality assessment of internet videos | |
Siahaan et al. | Semantic-aware blind image quality assessment | |
Shen et al. | An end-to-end no-reference video quality assessment method with hierarchical spatiotemporal feature representation | |
Sinno et al. | Spatio-temporal measures of naturalness | |
Antsiferova et al. | Video compression dataset and benchmark of learning-based video-quality metrics | |
CN116703857A (en) | Video action quality evaluation method based on time-space domain sensing | |
Wang | A survey on IQA | |
Chen et al. | GAMIVAL: Video quality prediction on mobile cloud gaming content | |
Xian et al. | A content-oriented no-reference perceptual video quality assessment method for computer graphics animation videos | |
Da et al. | Perceptual quality assessment of nighttime video | |
Jenadeleh | Blind Image and Video Quality Assessment | |
CN112380395A (en) | Method and system for obtaining emotion of graph convolution network based on double-flow architecture and storage medium | |
Nyman et al. | Evaluation of the visual performance of image processing pipes: information value of subjective image attributes | |
Qiu et al. | Blind 360-degree image quality assessment via saliency-guided convolution neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||