CN112954312B - Non-reference video quality assessment method integrating space-time characteristics - Google Patents

Non-reference video quality assessment method integrating space-time characteristics

Info

Publication number
CN112954312B
CN112954312B (application CN202110176125.XA)
Authority
CN
China
Prior art keywords
video
network
sub
feature extraction
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110176125.XA
Other languages
Chinese (zh)
Other versions
CN112954312A (en)
Inventor
牛玉贞
钟梦真
陈俊豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202110176125.XA priority Critical patent/CN112954312B/en
Publication of CN112954312A publication Critical patent/CN112954312A/en
Application granted granted Critical
Publication of CN112954312B publication Critical patent/CN112954312B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N17/00Diagnosis, testing or measuring for television systems or their details
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038Image mosaicing, e.g. composing plane images from plane sub-images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30168Image quality inspection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a no-reference video quality assessment method fusing spatio-temporal features, which comprises the following steps: S1, acquiring a video data set as a training set; S2, constructing a spatial-domain feature extraction sub-network and training it on a frame set downsampled from the training set; S3, constructing a temporal-domain feature extraction sub-network and training it on residual image sequences of the training set; S4, constructing a video quality assessment network from the trained spatial and temporal feature extraction sub-networks, adaptively adjusting the influence of temporal and spatial features on perceived video quality through an attention mechanism, and training it to obtain a video quality assessment model; and S5, extracting the temporal and spatial features of the video under test with the obtained model and computing its quality score. The invention can significantly improve the performance of no-reference video quality assessment.

Description

Non-reference video quality assessment method integrating space-time characteristics
Technical Field
The invention relates to the fields of image and video processing and computer vision, in particular to a non-reference video quality assessment method integrating space-time characteristics.
Background
With the growth of social media applications and the popularity of consumer capture devices, people can record their daily lives anywhere and at any time by shooting video on portable mobile devices and can share it through various media platforms. This has led to a proliferation of user-generated content (UGC) videos shared and streamed over the Internet, so accurate video quality assessment (VQA) models for consumer video are urgently needed to monitor, control and optimize this vast amount of content. In addition, because most users lack professional imaging training, their videos often contain distortions caused by camera shake, sensor noise, defocus and the like. Part of the original data is also inevitably lost when the video is encoded, decoded, stored, transmitted and processed, introducing further distortions such as noise, deformation and missing content. Distortion removes, to varying degrees, information contained in the original video, degrading viewers' perception of the video and hindering their ability to obtain information from it. For organizations that provide user-centric video services, it is critical to ensure that videos emerging from the production and distribution chain meet the quality requirements of the receiving end. A video quality assessment model can evaluate video quality according to the degree of distortion and thus provide a basis for subsequent video processing. Video quality assessment is one of the key technologies in the video processing field and is important in areas such as medicine, aviation, education and entertainment.
Video quality assessment can be divided into subjective quality assessment and objective quality assessment. Subjective quality assessment, which relies on manual scoring, is the most accurate and reasonable, but the time and manpower it consumes limit its widespread use in the real world. Researchers have therefore proposed objective quality assessment methods that automatically predict the visual quality of distorted video. According to the availability of reference information, objective quality assessment methods are divided into full-reference, reduced-reference and no-reference methods. In practical applications many videos have no reference video: for user-generated content, a perfectly distortion-free video cannot be captured in the first place, and transmitting the additional reference information would also occupy considerable bandwidth. No-reference quality assessment methods, which require no access to the original video, therefore have the widest practical application value.
Most existing no-reference video quality assessment models are mainly designed for synthetic distortions (for example, compression artifacts). Authentically distorted video differs greatly from synthetically distorted video: it may suffer from complex mixtures of real-world distortions, and the distortion within a single video may vary over time. Recent studies also show that several state-of-the-art video quality assessment methods validated on synthetically distorted data sets perform poorly on authentically distorted video data sets. In recent years, authentically distorted video quality assessment data sets have been published and real-world applications urgently demand suitable methods. The invention therefore proposes a no-reference video quality assessment method that fuses spatio-temporal features: the temporal features of a video are computed by feeding a sequence of video residual images into a 3D convolutional network, and an attention mechanism adaptively adjusts the influence of temporal and spatial distortions on perceived video quality. The model can significantly improve the performance of no-reference video quality assessment.
Disclosure of Invention
Therefore, the invention aims to provide a non-reference video quality assessment method integrating space-time characteristics, which effectively improves the efficiency and performance of non-reference video quality assessment.
In order to achieve the above purpose, the invention adopts the following technical scheme:
A no-reference video quality assessment method fusing spatio-temporal features comprises the following steps:
Step S1: acquiring a video data set as a training set;
Step S2: constructing a spatial-domain feature extraction sub-network and training it on a frame set downsampled from the training set;
Step S3: constructing a temporal-domain feature extraction sub-network and training it on residual image sequences of the training set;
Step S4: constructing a video quality assessment network from the trained spatial and temporal feature extraction sub-networks, adaptively adjusting the influence of temporal and spatial features on perceived video quality through an attention mechanism, and training it to obtain a video quality assessment model;
Step S5: extracting the temporal and spatial features of the video under test with the obtained video quality assessment model and computing its quality score.
Further, step S2 specifically comprises:
Step S21: uniformly downsampling each video of the training set, taking one frame every f frames, and assigning the video's quality score to each sampled frame to obtain a training frame set;
Step S22: constructing the spatial feature extraction sub-network on top of an image classification network used as the backbone, and pre-training the backbone;
Step S23: with the pre-trained backbone parameters fixed, training the spatial feature extraction sub-network on the training frame set, and learning the optimal model parameters by minimizing the mean squared error between the predicted and ground-truth quality scores of all frames in the training frame set, thereby completing the training of the spatial feature extraction sub-network.
Further, the spatial feature extraction sub-network is specifically constructed as follows: VGG16, ResNet50 or DenseNet is used as the backbone network, and everything after the last convolutional layer of the backbone is replaced with the following: first, a 1×1 convolutional layer with C channels produces the spatial feature map F_s of the video frame; global average pooling and global standard-deviation pooling are then applied to F_s, the two pooled vectors are concatenated, and a fully connected layer maps the concatenated vector to the quality score of the video frame. The modified network serves as the spatial feature extraction sub-network of the video.
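As a concrete illustration of this sub-network, the following PyTorch sketch builds the spatial head on a ResNet-50 backbone with C=128 channels (the value used in the embodiment below); the class name, layer widths and the choice of ResNet-50 rather than VGG16 or DenseNet are illustrative assumptions, not the patent's exact implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SpatialSubNetwork(nn.Module):
    """Spatial feature extraction sub-network: pretrained backbone, 1x1 conv to C
    channels, global average + standard-deviation pooling, FC regression head."""
    def __init__(self, channels=128):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        # keep everything up to (and including) the last convolutional stage
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.conv1x1 = nn.Conv2d(2048, channels, kernel_size=1)
        self.fc = nn.Linear(2 * channels, 1)            # concat(avg, std) -> frame score

    def forward(self, frame):                           # frame: (B, 3, H, W)
        f_s = self.conv1x1(self.backbone(frame))        # spatial feature map F_s: (B, C, h, w)
        avg = f_s.mean(dim=(2, 3))                      # global average pooling
        std = f_s.std(dim=(2, 3))                       # global standard-deviation pooling
        score = self.fc(torch.cat([avg, std], dim=1))   # predicted frame quality score
        return score, f_s
```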
Further, step S3 specifically comprises:
Step S31: constructing a neural network composed of several 3D convolutional layers as the video temporal feature extraction sub-network;
Step S32: dividing each training-set video into several sub-videos and taking the sub-videos obtained from all training videos as the sub-video set, where the ground-truth quality score of each sub-video is that of the corresponding video;
Step S33: training the temporal feature extraction sub-network on the sub-video set in batches, and learning the optimal model parameters by minimizing the mean squared error between the predicted and ground-truth quality scores of the sub-videos, thereby completing the training of the temporal feature extraction sub-network.
Further, the temporal feature extraction sub-network consists, in order, of a 3D convolution module, a pooling module and a regression module. Specifically, the 3D convolution module has six 3D convolutional layers: the kernel size of the first five layers is 3×3 and that of the last layer is 1×m; each convolutional layer is followed by a ReLU activation function, and the last 3D convolutional layer has C channels. The output of the 3D convolution module is the temporal feature map F_t of the input sub-video. The pooling module consists of a global max pooling layer that converts F_t into a feature vector, and the regression module consists of a fully connected layer that maps the feature vector to a quality score.
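A minimal PyTorch sketch of such a temporal sub-network is given below. The patent does not specify per-layer channel widths, and the kernel sizes "3×3" and "1×m" are interpreted here as 3×3×3 kernels and a kernel that collapses the temporal axis of the m−1 residual frames; those choices, like all names, are assumptions.

```python
import torch
import torch.nn as nn

class TemporalSubNetwork(nn.Module):
    """Temporal feature extraction sub-network: six 3D conv layers (ReLU after
    each), global max pooling, and a fully connected regression layer."""
    def __init__(self, channels=128, m=16):
        super().__init__()
        widths = [3, 32, 64, 64, 128, 128]              # illustrative channel widths
        layers = []
        for c_in, c_out in zip(widths[:-1], widths[1:]):
            layers += [nn.Conv3d(c_in, c_out, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
        # last layer: "1 x m" kernel read as collapsing the temporal axis (m-1 residual frames)
        layers += [nn.Conv3d(widths[-1], channels, kernel_size=(m - 1, 1, 1)),
                   nn.ReLU(inplace=True)]
        self.conv3d = nn.Sequential(*layers)
        self.fc = nn.Linear(channels, 1)

    def forward(self, residual_clip):                   # (B, 3, m-1, H, W) residual sequence
        f_t = self.conv3d(residual_clip).squeeze(2)     # temporal feature map F_t: (B, C, H, W)
        v = torch.amax(f_t, dim=(2, 3))                 # global max pooling -> (B, C)
        return self.fc(v), f_t                          # predicted sub-video quality score
```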
Further, step S32 specifically comprises: dividing each training-set video into several equal-length sub-videos, each containing m consecutive frames, and computing the corresponding residual image sequence of each sub-video as

RF_{i~j} = F_{i+1~j} - F_{i~j-1}

where F_i denotes the i-th frame of the video, F_{i~j} denotes the sub-video from the i-th to the j-th frame, and RF_{i~j} denotes the residual image sequence of that sub-video, i.e. the sequence of differences between consecutive frames.

The residual image sequence of each sub-video is fed into the network designed in step S31; the 3D convolution module produces a C×H×W temporal feature map F_t, where C, H and W are the number of channels, the height and the width of the feature map, respectively; the pooling module then yields a C×1 vector, and the regression module maps it to the quality score of the sub-video.
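The sub-video splitting and the residual computation can be expressed compactly; the sketch below assumes the frames are held as a NumPy array and is illustrative only.

```python
import numpy as np

def split_into_subvideos(video: np.ndarray, m: int) -> list:
    """Split a video of shape (T, H, W, 3) into consecutive, non-overlapping
    sub-videos of m frames each (any trailing remainder is dropped)."""
    return [video[k * m:(k + 1) * m] for k in range(video.shape[0] // m)]

def residual_sequence(sub_video: np.ndarray) -> np.ndarray:
    """Residual image sequence RF_{i~j} = F_{i+1~j} - F_{i~j-1} of an m-frame
    sub-video, i.e. the m-1 differences between consecutive frames. Static
    objects and background largely cancel, leaving motion-specific information."""
    frames = sub_video.astype(np.float32)
    return frames[1:] - frames[:-1]
```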
Further, the video quality assessment network comprises a spatial feature extraction module, a temporal feature extraction module, an attention module, and subsequent pooling and fully connected layers. The spatial feature extraction module is the backbone network plus the 1×1 convolutional layer of the trained spatial feature extraction sub-network, and the temporal feature extraction module is the 3D convolution module of the trained temporal feature extraction sub-network.
Further, the video quality assessment network is constructed and trained as follows:
The m spatial feature maps of a sub-video (one per frame) are averaged to obtain the spatial feature map F_s of that sub-video; F_t and F_s are then concatenated into a preliminary spatio-temporal feature map F_st.
The attention module combines fused attention and spatial attention. First, a fused attention map is computed from the spatio-temporal feature map F_st: average pooling and max pooling are used to aggregate the spatial information of each feature map of F_st, yielding F_st^avg and F_st^max; both are passed through a shared multi-layer perceptron, the two results are added, and a sigmoid function yields the fused attention map A_f.
Next, the spatial attention map of the spatio-temporal features is computed: A_f is broadcast to A'_f with the same size as F_st, A'_f is multiplied element-wise with the original feature map F_st to obtain a new feature map F'_st, and F'_st is used to generate the spatial attention map A_s.
Specifically, average pooling and max pooling are applied to F'_st along the channel dimension to obtain F'_st^avg and F'_st^max, which are concatenated and passed through a convolutional layer and a sigmoid function to generate the spatial attention map A_s.
The spatial attention map A_s is multiplied element-wise with the spatio-temporal features F'_st to obtain the final spatio-temporal feature map F_fusion.
F_fusion is converted into a C-dimensional vector F_v by a global pooling scheme, and F_v is finally regressed to the sub-video quality score by a fully connected layer.
The parameters of the corresponding parts of the trained spatial feature extraction sub-network are used as the parameters of the spatial feature extraction module, and the parameters of the corresponding parts of the trained temporal feature extraction sub-network are used as the parameters of the temporal feature extraction module.
With the parameters of the spatial and temporal feature extraction modules fixed, the video quality assessment network is trained on the sub-video set.
The optimal model parameters are learned by minimizing the mean squared error between the predicted and ground-truth quality scores of all sub-videos, completing the training of the video quality assessment network.
Further, A_f is computed as

A_f = σ( MLP(F_st^avg) + MLP(F_st^max) )

where σ denotes the sigmoid function and MLP is the shared multi-layer perceptron, each layer of which is followed by a ReLU activation function.

A_s is computed as

F'_st = A'_f ⊗ F_st,  A_s = σ( Conv( [AvgPool(F'_st); MaxPool(F'_st)] ) )

where ⊗ denotes element-wise multiplication, [;] denotes concatenation along the channel dimension, σ denotes the sigmoid function, and Conv denotes the convolutional layer.
Further, the global pooling scheme that converts the spatio-temporal feature map F_fusion into the C-dimensional vector F_v is as follows: average pooling and standard-deviation pooling are applied to the first C/2 feature maps of F_fusion to obtain two vectors, which are reduced to C/2 dimensions by a fully connected layer to maintain feature balance and denoted F_sv; the last C/2 feature maps of F_fusion are max-pooled to obtain a C/2-dimensional vector denoted F_tv; F_sv and F_tv are then concatenated to obtain the C-dimensional vector F_v.
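The attention module and the global pooling head described above can be sketched in PyTorch as follows. The concatenated map is assumed to have 2C channels and is split in half for the dual-branch pooling, the spatial-attention convolution uses a 7×7 kernel, and the MLP reduction ratio is 8; these details, like all names, are assumptions where the patent is silent.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuses the sub-video spatial map F_s and temporal map F_t, re-weights the
    result with a fused attention map A_f and a spatial attention map A_s, then
    pools the fused map to a sub-video quality score."""
    def __init__(self, channels=128, reduction=8):
        super().__init__()
        c = 2 * channels                                    # F_st = concat(F_s, F_t)
        self.mlp = nn.Sequential(                           # shared MLP for A_f
            nn.Linear(c, c // reduction), nn.ReLU(inplace=True),
            nn.Linear(c // reduction, c))
        self.conv_spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)   # for A_s
        self.fc_balance = nn.Linear(c, c // 2)              # reduce the avg+std branch
        self.fc_score = nn.Linear(c, 1)

    def forward(self, f_s, f_t):                            # both (B, C, H, W)
        f_st = torch.cat([f_s, f_t], dim=1)                 # (B, 2C, H, W)
        # fused attention A_f: spatial avg / max pooling per map -> shared MLP -> sigmoid
        a_f = torch.sigmoid(self.mlp(f_st.mean(dim=(2, 3))) +
                            self.mlp(f_st.amax(dim=(2, 3))))
        f_st = f_st * a_f[:, :, None, None]                 # F'_st = A'_f (x) F_st
        # spatial attention A_s: channel-wise avg / max pooling -> conv -> sigmoid
        pooled = torch.cat([f_st.mean(dim=1, keepdim=True),
                            f_st.amax(dim=1, keepdim=True)], dim=1)
        a_s = torch.sigmoid(self.conv_spatial(pooled))
        f_fusion = f_st * a_s                                # final spatio-temporal map
        # dual-branch global pooling: first half avg+std pooled, second half max pooled
        half = f_fusion.shape[1] // 2
        sv = self.fc_balance(torch.cat([f_fusion[:, :half].mean(dim=(2, 3)),
                                        f_fusion[:, :half].std(dim=(2, 3))], dim=1))
        tv = f_fusion[:, half:].amax(dim=(2, 3))
        return self.fc_score(torch.cat([sv, tv], dim=1))     # sub-video quality score
```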
Compared with the prior art, the invention has the following beneficial effects:
1. The spatial feature extraction module extracts deep semantic features to address the content dependence of predicted video quality. The temporal feature extraction module uses video residual images instead of RGB frames, removing static objects and background information so as to capture more motion-specific information. The attention module fuses the spatio-temporal features and adaptively adjusts the influence of spatial and temporal distortions on perceived video quality, which significantly improves the performance of no-reference video quality assessment.
2. The model applies well to videos suffering from complex mixtures of real-world distortions and therefore has broad practical application value.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a block diagram of a non-reference video quality assessment model incorporating spatio-temporal features in an embodiment of the present invention;
fig. 3 is a block diagram of a time domain feature extraction sub-network in an example of the invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples.
Referring to fig. 1, the present invention provides a no-reference video quality assessment method fusing spatio-temporal features, comprising the following steps:
Step S1: acquiring a video data set and randomly dividing it into a training set (80%) and a test set (20%) according to a preset proportion;
s2, constructing a space domain feature extraction sub-network, and training a frame set based on downsampling of a training set;
step S21, uniformly downsampling each video of a training set, wherein the sampling frequency is that one frame is taken for each f frames, and the quality fraction of the video is taken as the quality fraction of each frame to obtain a training frame set;
s22, constructing an airspace feature extraction sub-network according to the image classification network as a main network, and pre-training;
step S23, training the space domain feature extraction sub-network according to the training frame set by fixing the pre-trained parameters in the main network, and learning the optimal parameters of the model by minimizing the mean square error loss between the predicted quality scores and the real quality scores of all frames in the training frame set, so as to complete the training process of the space domain feature extraction sub-network.
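A possible training loop for step S23 is sketched below, assuming the SpatialSubNetwork sketch given earlier (with a `backbone` attribute) and a dataset that yields (frame, score) pairs; the optimizer, learning rate, batch size and epoch count are illustrative, since the patent only specifies frozen backbone parameters and a mean-squared-error loss.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_spatial_subnetwork(model, frame_dataset, epochs=10, lr=1e-4, device="cuda"):
    """Train the spatial sub-network on the downsampled frame set with MSE loss,
    keeping the pretrained backbone frozen (only the 1x1 conv and FC head learn)."""
    model = model.to(device)
    for p in model.backbone.parameters():          # fix the pre-trained backbone parameters
        p.requires_grad = False
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    criterion = nn.MSELoss()
    loader = DataLoader(frame_dataset, batch_size=32, shuffle=True)
    for _ in range(epochs):
        for frames, scores in loader:              # each frame inherits its video's quality score
            frames, scores = frames.to(device), scores.to(device).float()
            pred, _ = model(frames)                # SpatialSubNetwork returns (score, F_s)
            loss = criterion(pred.squeeze(1), scores)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```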
Step S3: constructing a temporal-domain feature extraction sub-network and training it on residual image sequences of the training set, specifically:
Step S31: constructing a neural network composed of several 3D convolutional layers as the video temporal feature extraction sub-network;
Step S32: dividing each training-set video into several sub-videos and taking the sub-videos obtained from all training videos as the sub-video set, where the ground-truth quality score of each sub-video is that of the corresponding video.
Preferably, each training-set video is divided into several equal-length sub-videos, each containing m consecutive frames, and the corresponding residual image sequence of each sub-video is computed as

RF_{i~j} = F_{i+1~j} - F_{i~j-1}

where F_i denotes the i-th frame of the video, F_{i~j} denotes the sub-video from the i-th to the j-th frame, and RF_{i~j} denotes the residual image sequence of that sub-video.

The residual image sequence of each sub-video is fed into the network designed in step S31; the 3D convolution module produces a C×H×W temporal feature map F_t, where C, H and W are the number of channels, the height and the width of the feature map, respectively; the pooling module then yields a C×1 vector, and the regression module maps it to the quality score of the sub-video.
Step S33: training the temporal feature extraction sub-network on the sub-video set in batches, and learning the optimal model parameters by minimizing the mean squared error between the predicted and ground-truth quality scores of the sub-videos, thereby completing the training of the temporal feature extraction sub-network.
Step S4: constructing a video quality assessment network from the trained spatial and temporal feature extraction sub-networks, adaptively adjusting the influence of temporal and spatial features on perceived video quality through an attention mechanism, and training it to obtain a video quality assessment model;
Step S5: extracting the temporal and spatial features of the video under test with the obtained video quality assessment model and computing its quality score.
Preferably, in this embodiment, the spatial feature extraction sub-network is constructed as follows: VGG16, ResNet50 or DenseNet is used as the backbone network, and everything after the last convolutional layer of the backbone is replaced with the following: first, a 1×1 convolutional layer with C channels (C=128) produces the spatial feature map F_s of the video frame; global average pooling and global standard-deviation pooling are then applied to F_s, the two pooled vectors are concatenated, and a fully connected layer maps the concatenated vector to the quality score of the video frame. The modified network serves as the spatial feature extraction sub-network of the video.
Preferably, in this embodiment, the temporal feature extraction sub-network consists, in order, of a 3D convolution module, a pooling module and a regression module. Specifically, the 3D convolution module has six 3D convolutional layers: the kernel size of the first five layers is 3×3 and that of the last layer is 1×m; each convolutional layer is followed by a ReLU activation function, and the last 3D convolutional layer has C channels. The output of the 3D convolution module is the temporal feature map F_t of the input sub-video. The pooling module consists of a global max pooling layer that converts F_t into a feature vector, and the regression module consists of a fully connected layer that maps the feature vector to a quality score.
Preferably, in this embodiment, the video quality assessment network comprises a spatial feature extraction module, a temporal feature extraction module, an attention module, and subsequent pooling and fully connected layers. The spatial feature extraction module is the backbone network plus the 1×1 convolutional layer of the trained spatial feature extraction sub-network, and the temporal feature extraction module is the 3D convolution module of the trained temporal feature extraction sub-network.
The video quality assessment network is constructed and trained as follows:
The m spatial feature maps of a sub-video (one per frame) are averaged to obtain the spatial feature map F_s of that sub-video; F_t and F_s are then concatenated into a preliminary spatio-temporal feature map F_st.
The attention module combines fused attention and spatial attention. First, a fused attention map is computed from the spatio-temporal feature map F_st: average pooling and max pooling are used to aggregate the spatial information of each feature map of F_st, yielding F_st^avg and F_st^max; both are passed through a shared multi-layer perceptron, the two results are added, and a sigmoid function yields the fused attention map A_f:

A_f = σ( MLP(F_st^avg) + MLP(F_st^max) )

where σ denotes the sigmoid function and MLP is the shared multi-layer perceptron, each layer of which is followed by a ReLU activation function.
Next, the spatial attention map of the spatio-temporal features is computed: A_f is broadcast to A'_f with the same size as F_st, A'_f is multiplied element-wise with the original feature map F_st to obtain a new feature map F'_st, and F'_st is used to generate the spatial attention map A_s. Average pooling and max pooling are applied to F'_st along the channel dimension to obtain F'_st^avg and F'_st^max, which are concatenated and passed through a convolutional layer and a sigmoid function:

F'_st = A'_f ⊗ F_st,  A_s = σ( Conv( [AvgPool(F'_st); MaxPool(F'_st)] ) )

where ⊗ denotes element-wise multiplication, [;] denotes concatenation along the channel dimension, σ denotes the sigmoid function, and Conv denotes the convolutional layer.
The spatial attention map A_s is multiplied element-wise with the spatio-temporal features F'_st to obtain the final spatio-temporal feature map F_fusion.
F_fusion is converted into a C-dimensional vector F_v by a global pooling scheme: average pooling and standard-deviation pooling are applied to the first C/2 feature maps of F_fusion to obtain two vectors, which are reduced to C/2 dimensions by a fully connected layer to maintain feature balance and denoted F_sv; the last C/2 feature maps of F_fusion are max-pooled to obtain a C/2-dimensional vector denoted F_tv; F_sv and F_tv are then concatenated to obtain the C-dimensional vector F_v, and F_v is finally regressed to the sub-video quality score by a fully connected layer.
The parameters of the corresponding parts of the trained spatial feature extraction sub-network are used as the parameters of the spatial feature extraction module, and the parameters of the corresponding parts of the trained temporal feature extraction sub-network are used as the parameters of the temporal feature extraction module.
With the parameters of the spatial and temporal feature extraction modules fixed, the video quality assessment network is trained on the sub-video set.
The optimal model parameters are learned by minimizing the mean squared error between the predicted and ground-truth quality scores of all sub-videos, completing the training of the video quality assessment network.
In this embodiment, step S5 specifically comprises:
Step S51: dividing each video under test into several sub-videos by the method of step S32, each containing m consecutive frames.
Step S52: first inputting the frames of the sub-video to the spatial feature extraction module, then inputting the sub-video to the temporal feature extraction module, and finally predicting the quality score of the sub-video with the video quality assessment network.
Step S53: taking the mean of the predicted quality scores of all sub-videos of a video as the predicted quality score of that video.
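Steps S51 to S53 can be summarized in an inference routine such as the one below, which reuses the earlier sketches (SpatialSubNetwork, TemporalSubNetwork, AttentionFusion); resizing F_t to the spatial size of F_s before fusion is an added assumption, as the patent does not state how the two maps are aligned.

```python
import numpy as np
import torch

def predict_video_quality(video, spatial_net, temporal_net, fusion_net, m=16, device="cuda"):
    """Split the test video into m-frame sub-videos, run the spatial, temporal and
    fusion modules on each sub-video, and average the sub-video scores.
    `video` is a (T, H, W, 3) uint8 array; modules are assumed to be in eval mode."""
    scores = []
    with torch.no_grad():
        for clip in (video[k * m:(k + 1) * m] for k in range(video.shape[0] // m)):
            clip = clip.astype(np.float32) / 255.0
            frames = torch.from_numpy(clip).permute(0, 3, 1, 2).to(device)       # (m, 3, H, W)
            _, f_s_frames = spatial_net(frames)                                   # per-frame F_s maps
            f_s = f_s_frames.mean(dim=0, keepdim=True)                            # sub-video F_s: (1, C, h, w)
            residual = torch.from_numpy(clip[1:] - clip[:-1])                     # residual image sequence
            residual = residual.permute(3, 0, 1, 2).unsqueeze(0).to(device)       # (1, 3, m-1, H, W)
            _, f_t = temporal_net(residual)                                       # sub-video F_t: (1, C, H, W)
            f_t = torch.nn.functional.interpolate(f_t, size=f_s.shape[-2:])       # align spatial sizes (assumption)
            scores.append(fusion_net(f_s, f_t).item())                            # sub-video quality score
    return float(np.mean(scores))                                                 # video score = mean of sub-video scores
```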
The foregoing description is only of the preferred embodiments of the invention, and all changes and modifications that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (4)

1. A no-reference video quality assessment method fusing spatio-temporal features, characterized by comprising the following steps:
Step S1: acquiring a video data set as a training set;
Step S2: constructing a spatial-domain feature extraction sub-network and training it on a frame set downsampled from the training set;
Step S3: constructing a temporal-domain feature extraction sub-network and training it on residual image sequences of the training set;
Step S4: constructing a video quality assessment network from the trained spatial and temporal feature extraction sub-networks, adaptively adjusting the influence of temporal and spatial features on perceived video quality through an attention mechanism, and training it to obtain a video quality assessment model;
Step S5: extracting the temporal and spatial features of the video under test with the obtained video quality assessment model and computing its quality score;
wherein step S2 specifically comprises:
Step S21: uniformly downsampling each video of the training set, taking one frame every f frames, and assigning the video's quality score to each sampled frame to obtain a training frame set;
Step S22: constructing the spatial feature extraction sub-network on top of an image classification network used as the backbone, and pre-training the backbone;
Step S23: with the pre-trained backbone parameters fixed, training the spatial feature extraction sub-network on the training frame set, and learning the optimal model parameters by minimizing the mean squared error between the predicted and ground-truth quality scores of all frames in the training frame set, thereby completing the training of the spatial feature extraction sub-network;
wherein the spatial feature extraction sub-network is specifically constructed as follows: VGG16, ResNet50 or DenseNet is used as the backbone network, and everything after the last convolutional layer of the backbone is replaced with the following: first, a 1×1 convolutional layer with C channels produces the spatial feature map F_s of the video frame; global average pooling and global standard-deviation pooling are then applied to F_s, the two pooled vectors are concatenated, and a fully connected layer maps the concatenated vector to the quality score of the video frame, the modified network serving as the spatial feature extraction sub-network of the video;
wherein step S3 specifically comprises:
Step S31: constructing a neural network composed of several 3D convolutional layers as the video temporal feature extraction sub-network;
Step S32: dividing each training-set video into several sub-videos and taking the sub-videos obtained from all training videos as a sub-video set, wherein the ground-truth quality score of each sub-video is that of the corresponding video;
Step S33: training the temporal feature extraction sub-network on the sub-video set in batches, and learning the optimal model parameters by minimizing the mean squared error between the predicted and ground-truth quality scores of the sub-videos, thereby completing the training of the temporal feature extraction sub-network;
wherein the temporal feature extraction sub-network consists, in order, of a 3D convolution module, a pooling module and a regression module; specifically, the 3D convolution module has six 3D convolutional layers, the kernel size of the first five layers is 3×3 and that of the last layer is 1×m, each convolutional layer is followed by a ReLU activation function, and the last 3D convolutional layer has C channels; the output of the 3D convolution module is the temporal feature map F_t of the input sub-video; the pooling module consists of a global max pooling layer that converts F_t into a feature vector; and the regression module consists of a fully connected layer that maps the feature vector to a quality score;
wherein step S32 specifically comprises: dividing each training-set video into several equal-length sub-videos, each containing m consecutive frames, and computing the corresponding residual image sequence of each sub-video as

RF_{i~j} = F_{i+1~j} - F_{i~j-1}

where F_i denotes the i-th frame of the video, F_{i~j} denotes the sub-video from the i-th to the j-th frame, and RF_{i~j} denotes the residual image sequence of that sub-video;

the residual image sequence of each sub-video is input into the network designed in step S31; the 3D convolution module produces a C×H×W temporal feature map F_t, wherein C, H and W are the number of channels, the height and the width of the feature map, respectively; the pooling module then yields a C×1 vector, and the regression module maps it to the quality score of the sub-video;
wherein the video quality assessment network is constructed and trained as follows:
the m spatial feature maps of a sub-video are averaged to obtain the spatial feature map F_s of that sub-video, and F_t and F_s are then concatenated into a preliminary spatio-temporal feature map F_st;
the attention module combines fused attention and spatial attention: first, a fused attention map is computed from the spatio-temporal feature map F_st, wherein average pooling and max pooling are used to aggregate the spatial information of each feature map of F_st to obtain F_st^avg and F_st^max, both are passed through a shared multi-layer perceptron, the two results are added, and a sigmoid function yields the fused attention map A_f;
the spatial attention map of the spatio-temporal features is then computed: A_f is broadcast to A'_f with the same size as F_st, A'_f is multiplied element-wise with the original feature map F_st to obtain a new feature map F'_st, and F'_st is used to generate the spatial attention map A_s;
average pooling and max pooling are applied to F'_st along the channel dimension to obtain F'_st^avg and F'_st^max, which are concatenated and passed through a convolutional layer and a sigmoid function to generate the spatial attention map A_s;
the spatial attention map A_s is multiplied element-wise with the spatio-temporal features F'_st to obtain the final spatio-temporal feature map F_fusion;
F_fusion is converted into a C-dimensional vector F_v by a global pooling method, and F_v is finally regressed to the sub-video quality score by a fully connected layer;
using the parameters of the corresponding parts of the trained spatial feature extraction sub-network as the parameters of the spatial feature extraction module, and using the parameters of the corresponding parts of the trained temporal feature extraction sub-network as the parameters of the temporal feature extraction module;
fixing the parameters of the spatial and temporal feature extraction modules and training the video quality assessment network on the sub-video set;
and learning the optimal model parameters by minimizing the mean squared error between the predicted and ground-truth quality scores of all sub-videos, thereby completing the training of the video quality assessment network.
2. The no-reference video quality assessment method fusing spatio-temporal features according to claim 1, wherein the video quality assessment network comprises a spatial feature extraction module, a temporal feature extraction module, an attention module, and subsequent pooling and fully connected layers; the spatial feature extraction module is the backbone network plus the 1×1 convolutional layer of the trained spatial feature extraction sub-network, and the temporal feature extraction module is the 3D convolution module of the trained temporal feature extraction sub-network.
3. The no-reference video quality assessment method fusing spatio-temporal features according to claim 1, wherein A_f is computed as

A_f = σ( MLP(F_st^avg) + MLP(F_st^max) )

wherein σ denotes the sigmoid function and MLP is the shared multi-layer perceptron, each layer of which is followed by a ReLU activation function;

and A_s is computed as

F'_st = A'_f ⊗ F_st,  A_s = σ( Conv( [AvgPool(F'_st); MaxPool(F'_st)] ) )

wherein ⊗ denotes element-wise multiplication, [;] denotes concatenation along the channel dimension, σ denotes the sigmoid function, and Conv denotes the convolutional layer.
4. The no-reference video quality assessment method fusing spatio-temporal features according to claim 1, wherein converting the spatio-temporal feature map F_fusion into the C-dimensional vector F_v by the global pooling method specifically comprises: applying average pooling and standard-deviation pooling to the first C/2 feature maps of F_fusion to obtain two vectors, which are reduced to C/2 dimensions by a fully connected layer to maintain feature balance and denoted F_sv; max-pooling the last C/2 feature maps of F_fusion to obtain a C/2-dimensional vector denoted F_tv; and concatenating F_sv and F_tv to obtain the C-dimensional vector F_v.
CN202110176125.XA 2021-02-07 2021-02-07 Non-reference video quality assessment method integrating space-time characteristics Active CN112954312B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110176125.XA CN112954312B (en) 2021-02-07 2021-02-07 Non-reference video quality assessment method integrating space-time characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110176125.XA CN112954312B (en) 2021-02-07 2021-02-07 Non-reference video quality assessment method integrating space-time characteristics

Publications (2)

Publication Number Publication Date
CN112954312A CN112954312A (en) 2021-06-11
CN112954312B true CN112954312B (en) 2024-01-05

Family

ID=76244601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110176125.XA Active CN112954312B (en) 2021-02-07 2021-02-07 Non-reference video quality assessment method integrating space-time characteristics

Country Status (1)

Country Link
CN (1) CN112954312B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113554599B (en) * 2021-06-28 2023-08-18 杭州电子科技大学 Video quality evaluation method based on human visual effect
CN113473117B (en) * 2021-07-19 2022-09-02 上海交通大学 Non-reference audio and video quality evaluation method based on gated recurrent neural network
CN113822856B (en) * 2021-08-16 2024-06-21 南京中科逆熵科技有限公司 End-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation
CN113810683B (en) * 2021-08-27 2023-07-18 南京信息工程大学 No-reference evaluation method for objectively evaluating underwater video quality
CN113784113A (en) * 2021-08-27 2021-12-10 中国传媒大学 No-reference video quality evaluation method based on short-term and long-term time-space fusion network and long-term sequence fusion network
CN113642513B (en) * 2021-08-30 2022-11-18 东南大学 Action quality evaluation method based on self-attention and label distribution learning
CN113837047B (en) * 2021-09-16 2022-10-28 广州大学 Video quality evaluation method, system, computer equipment and storage medium
CN114697648B (en) * 2022-04-25 2023-12-08 上海为旌科技有限公司 Variable frame rate video non-reference evaluation method, system, electronic equipment and storage medium
CN115278303B (en) * 2022-07-29 2024-04-19 腾讯科技(深圳)有限公司 Video processing method, device, equipment and medium
CN117676121A (en) * 2022-08-24 2024-03-08 腾讯科技(深圳)有限公司 Video quality assessment method, device, equipment and computer storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104023227A (en) * 2014-05-28 2014-09-03 宁波大学 Objective video quality evaluation method based on space domain and time domain structural similarities
CN109325435A (en) * 2018-09-15 2019-02-12 天津大学 Video actions identification and location algorithm based on cascade neural network
CN110135369A (en) * 2019-05-20 2019-08-16 威创集团股份有限公司 A kind of Activity recognition method, system, equipment and computer readable storage medium
CN110837842A (en) * 2019-09-12 2020-02-25 腾讯科技(深圳)有限公司 Video quality evaluation method, model training method and model training device
CN111784694A (en) * 2020-08-20 2020-10-16 中国传媒大学 No-reference video quality evaluation method based on visual attention mechanism
WO2020221278A1 (en) * 2019-04-29 2020-11-05 北京金山云网络技术有限公司 Video classification method and model training method and apparatus thereof, and electronic device
CN112085102A (en) * 2020-09-10 2020-12-15 西安电子科技大学 No-reference video quality evaluation method based on three-dimensional space-time characteristic decomposition

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106028026B (en) * 2016-05-27 2017-09-05 宁波大学 A kind of efficient video assessment method for encoding quality based on space-time domain structure
US10417498B2 (en) * 2016-12-30 2019-09-17 Mitsubishi Electric Research Laboratories, Inc. Method and system for multi-modal fusion model

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104023227A (en) * 2014-05-28 2014-09-03 宁波大学 Objective video quality evaluation method based on space domain and time domain structural similarities
CN109325435A (en) * 2018-09-15 2019-02-12 天津大学 Video actions identification and location algorithm based on cascade neural network
WO2020221278A1 (en) * 2019-04-29 2020-11-05 北京金山云网络技术有限公司 Video classification method and model training method and apparatus thereof, and electronic device
CN110135369A (en) * 2019-05-20 2019-08-16 威创集团股份有限公司 A kind of Activity recognition method, system, equipment and computer readable storage medium
CN110837842A (en) * 2019-09-12 2020-02-25 腾讯科技(深圳)有限公司 Video quality evaluation method, model training method and model training device
CN111784694A (en) * 2020-08-20 2020-10-16 中国传媒大学 No-reference video quality evaluation method based on visual attention mechanism
CN112085102A (en) * 2020-09-10 2020-12-15 西安电子科技大学 No-reference video quality evaluation method based on three-dimensional space-time characteristic decomposition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
No-reference video quality assessment based on spatio-temporal features and attention mechanism; 朱泽, 桑庆兵, 张浩; Laser & Optoelectronics Progress; Vol. 57, No. 18; 181509-1 to 181509-9 *

Also Published As

Publication number Publication date
CN112954312A (en) 2021-06-11

Similar Documents

Publication Publication Date Title
CN112954312B (en) Non-reference video quality assessment method integrating space-time characteristics
CN108898579B (en) Image definition recognition method and device and storage medium
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
Moorthy et al. Visual quality assessment algorithms: what does the future hold?
CN110751649B (en) Video quality evaluation method and device, electronic equipment and storage medium
Bovik Automatic prediction of perceptual image and video quality
Tian et al. A multi-order derivative feature-based quality assessment model for light field image
US11928793B2 (en) Video quality assessment method and apparatus
CN112995652B (en) Video quality evaluation method and device
Xu et al. Perceptual quality assessment of internet videos
Siahaan et al. Semantic-aware blind image quality assessment
CN113191495A (en) Training method and device for hyper-resolution model and face recognition method and device, medium and electronic equipment
CN111047543A (en) Image enhancement method, device and storage medium
Shen et al. An end-to-end no-reference video quality assessment method with hierarchical spatiotemporal feature representation
Li et al. Recent advances and challenges in video quality assessment
Min et al. Perceptual video quality assessment: A survey
Da et al. Perceptual quality assessment of nighttime video
Jenadeleh Blind Image and Video Quality Assessment
Ying et al. Telepresence video quality assessment
Shi et al. Blind Image Quality Assessment via Transformer Predicted Error Map and Perceptual Quality Token
Kim et al. No‐reference quality assessment of dynamic sports videos based on a spatiotemporal motion model
Wang et al. Blind Multimodal Quality Assessment of Low-light Images
Huong et al. An Effective Foveated 360° Image Assessment Based on Graph Convolution Network
WO2024041268A1 (en) Video quality assessment method and apparatus, and computer device, computer storage medium and computer program product
Niu et al. Blind consumer video quality assessment with spatial-temporal perception and fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant