CN112954312A - No-reference video quality evaluation method fusing spatio-temporal characteristics - Google Patents

No-reference video quality evaluation method fusing spatio-temporal characteristics

Info

Publication number
CN112954312A
Authority
CN
China
Prior art keywords
video
network
sub
feature extraction
quality
Prior art date
Legal status
Granted
Application number
CN202110176125.XA
Other languages
Chinese (zh)
Other versions
CN112954312B (en)
Inventor
牛玉贞
钟梦真
陈俊豪
Current Assignee
Fuzhou University
Original Assignee
Fuzhou University
Priority date
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202110176125.XA priority Critical patent/CN112954312B/en
Publication of CN112954312A publication Critical patent/CN112954312A/en
Application granted granted Critical
Publication of CN112954312B publication Critical patent/CN112954312B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N17/00Diagnosis, testing or measuring for television systems or their details
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4038Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30168Image quality inspection

Abstract

The invention relates to a no-reference video quality evaluation method fusing spatio-temporal characteristics, which comprises the following steps: step S1, acquiring a video data set as a training set; step S2, constructing a spatial domain feature extraction sub-network and training it on a frame set obtained by down-sampling the training set; step S3, constructing a time domain feature extraction sub-network and training it on residual image sequences of the training set; step S4, constructing a video quality evaluation network from the trained spatial domain and time domain feature extraction sub-networks and training it, using an attention mechanism to adaptively adjust the influence of the time domain and spatial domain features on the perceived video quality, to obtain a video quality evaluation model; and step S5, extracting the time domain and spatial domain features of the video to be evaluated with the obtained video quality evaluation model and calculating its quality score. The invention can significantly improve the performance of no-reference video quality evaluation.

Description

No-reference video quality evaluation method fusing spatio-temporal characteristics
Technical Field
The invention relates to the field of image and video processing and computer vision, in particular to a no-reference video quality evaluation method fusing spatio-temporal characteristics.
Background
With the development of social media applications and the popularity of consumer capture devices, people can record their daily lives anytime and anywhere by shooting video with portable mobile devices and share it through various media platforms. This has led to an explosion in the number of user-generated content (UGC) videos shared and streamed over the Internet, so an accurate video quality assessment (VQA) model for consumer video is needed to monitor, control and optimize this enormous volume of content. Because most users have no professional training, the lack of imaging expertise introduces distortions caused by camera shake, sensor noise, defocus and the like. In addition, part of the original data is inevitably lost while the video is encoded, decoded, stored, transmitted and processed, producing further distortions such as noise, deformation, artifacts and missing content. These distortions destroy, to varying degrees, the information contained in the original video, degrading how the video is perceived and hindering viewers from extracting information from it. For organizations providing user-centered video services, it is important to ensure that the video delivered through the production and distribution chain meets the quality requirements of the receiving end. A video quality evaluation model can assess video quality according to the degree of distortion and thus provide a basis for subsequent video processing. Video quality assessment is therefore one of the key technologies in the field of video processing and is of great importance in fields such as medicine, aviation, education and entertainment.
Quality assessment of video can be divided into subjective and objective quality assessment. Subjective quality assessment, which relies on manual scoring, is the most accurate and reasonable form of quality assessment, but the time and labor it consumes limit its widespread use in the real world. Researchers have therefore proposed objective quality assessment methods that automatically predict the visual quality of distorted video. According to the availability of reference information, objective quality assessment methods are divided into full-reference, reduced-reference and no-reference methods. In practical applications many videos, such as user-generated content videos, have no reference video, because a completely distortion-free "perfect" video cannot be captured during acquisition; moreover, transmitting the additional reference information would occupy considerable bandwidth. Therefore, no-reference quality evaluation methods, which do not require the original video, have wider practical application value.
Most existing no-reference video quality assessment models are designed mainly for synthetic distortions (such as compression distortion). There are large differences between real and synthetically distorted videos: the former may suffer from complex mixtures of real-world distortions, and the distortion may also vary across different time periods of the same video. Recent studies have shown that some state-of-the-art video quality assessment methods validated on synthetic-distortion datasets do not perform well on real-distortion video datasets. In recent years, real-distortion video quality assessment datasets have become publicly available, and practical applications urgently require models suited to them. The present method therefore inputs the video residual image sequence into a 3D convolutional network to compute the temporal features of the video, and applies an attention mechanism to adaptively adjust the influence of temporal and spatial distortions on perceived video quality. This model can significantly improve the performance of no-reference video quality evaluation.
Disclosure of Invention
In view of the above, the present invention provides a method for evaluating quality of a reference-free video by fusing spatio-temporal features, so as to effectively improve the efficiency and performance of evaluating the quality of the reference-free video.
In order to achieve the purpose, the invention adopts the following technical scheme:
a no-reference video quality assessment method fusing spatio-temporal features comprises the following steps:
step S1, acquiring a video data set as a training set;
s2, constructing a spatial domain feature extraction sub-network, and training based on a frame set obtained by down-sampling of a training set;
s3, constructing a time domain feature extraction sub-network, and training based on a residual image sequence of a training set;
step S4, constructing a video quality evaluation network according to the trained spatial domain feature extraction sub-network and time domain feature extraction sub-network, and training to obtain a video quality evaluation model by adaptively adjusting the influence of the time domain and spatial domain features on the video perception quality through an attention mechanism;
and step S5, extracting time domain and space domain characteristics of the video to be detected according to the obtained video quality evaluation model, and calculating the quality score of the video to be detected.
Further, the step S2 is specifically:
step S21, uniformly downsampling each video of the training set, wherein the sampling frequency is that one frame is taken for each f frame, and the quality fraction of the video is taken as the quality fraction of each frame to obtain a training frame set;
s22, constructing a spatial domain feature extraction sub-network according to the image classification network as a backbone network, and pre-training;
and S23, fixing pre-trained parameters in the backbone network, training the spatial domain feature extraction sub-network according to the training frame set, learning the optimal parameters of the model by minimizing the loss of the mean square error between the predicted quality fraction and the real quality fraction of all frames in the training frame set, and completing the training process of the spatial domain feature extraction sub-network.
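As an illustration of steps S21 to S23, the following is a minimal sketch of the uniform down-sampling and frame-level label assignment described above. It assumes PyTorch, videos already decoded into (T, C, H, W) tensors, and an illustrative sampling interval f = 10; the helper names are not from the patent.

```python
import torch

def downsample_frames(video: torch.Tensor, f: int = 10) -> torch.Tensor:
    """Uniformly keep one frame out of every f frames of a (T, C, H, W) video."""
    return video[::f]

def build_training_frame_set(videos, scores, f: int = 10):
    """Pair every sampled frame with the quality score of its source video (step S21)."""
    frames, labels = [], []
    for video, score in zip(videos, scores):
        sampled = downsample_frames(video, f)
        frames.append(sampled)
        labels.append(torch.full((sampled.shape[0],), float(score)))
    return torch.cat(frames), torch.cat(labels)

# Usage: two dummy videos of 100 and 60 frames with quality scores 72.5 and 40.0.
videos = [torch.rand(100, 3, 224, 224), torch.rand(60, 3, 224, 224)]
frames, labels = build_training_frame_set(videos, [72.5, 40.0], f=10)
print(frames.shape, labels.shape)  # torch.Size([16, 3, 224, 224]) torch.Size([16])
```

The resulting frame set and its per-frame labels can then be used to train the spatial domain sub-network with a mean squared error loss, as in step S23.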
Further, the spatial domain feature extraction sub-network specifically includes: VGG16, ResNet50 or DenseNet is used as the backbone network, and the part of the backbone network after the last convolutional layer is replaced as follows: firstly, a 1 × 1 convolutional layer with C channels is used to obtain the spatial domain feature map of the video frame, F_s ∈ R^(C×H×W); then global average pooling and global standard-deviation pooling are applied to the spatial domain feature map F_s, the two pooled vectors are concatenated, and a fully connected layer maps the concatenated vector to the quality score of the video frame. The modified network is used as the sub-network for extracting the spatial domain features of the video.
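A possible PyTorch realization of this sub-network is sketched below. It assumes a ResNet50 backbone (torchvision 0.13 or newer), C = 128 as in the embodiment, and a 224 × 224 input; the class and attribute names are illustrative and not taken from the patent.

```python
import torch
import torch.nn as nn
from torchvision import models

class SpatialSubNetwork(nn.Module):
    """Backbone truncated after its last convolutional stage, followed by a 1x1
    convolution with C channels, global average + standard-deviation pooling and
    a fully connected regression layer that outputs a frame quality score."""

    def __init__(self, channels: int = 128):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.reduce = nn.Conv2d(2048, channels, kernel_size=1)          # 1x1 conv, C channels
        self.fc = nn.Linear(2 * channels, 1)                            # pooled vector -> score

    def forward(self, x):
        f_s = self.reduce(self.backbone(x))   # spatial domain feature map F_s: (B, C, H, W)
        avg = f_s.mean(dim=(2, 3))            # global average pooling -> (B, C)
        std = f_s.std(dim=(2, 3))             # global standard-deviation pooling -> (B, C)
        score = self.fc(torch.cat([avg, std], dim=1)).squeeze(1)
        return score, f_s

net = SpatialSubNetwork()
score, f_s = net(torch.rand(2, 3, 224, 224))
print(score.shape, f_s.shape)  # torch.Size([2]) torch.Size([2, 128, 7, 7])
```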
Further, the step S3 is specifically:
step S31, constructing a neural network composed of a plurality of 3D convolution layers as a video time domain feature extraction sub-network;
step S32, dividing the training set video into a plurality of sub-videos, taking the sub-videos obtained from all the videos in the training set as a sub-video set, and taking the real quality score of each sub-video as the real quality score of the corresponding video;
step S33, training a time domain feature extraction sub-network by using a sub-video set and taking batches as units; and the training process of the time domain feature extraction sub-network is completed by minimizing the mean square error loss between the predicted quality fraction and the real quality fraction of the sub-video and learning the optimal parameters of the model.
Further, the time domain feature extraction sub-network is composed of a 3D convolution module, a pooling module and a regression module in sequence, specifically: the 3D convolution module has six 3D convolutional layers; the convolution kernel size of the first five convolutional layers is 3 × 3, and that of the last layer is 1 × m; each convolutional layer is followed by a ReLU activation function, and the number of channels of the last 3D convolutional layer is C. The output of the 3D convolution module is the time domain feature map of the input sub-video, F_t ∈ R^(C×H×W). The pooling module is composed of a global max-pooling layer and converts the time domain feature map F_t into a feature vector; the regression module is composed of a fully connected layer and maps the feature vector to the quality score.
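Below is a minimal PyTorch sketch of such a temporal branch. Because the exact 3D kernel shapes and per-layer channel widths are not fully legible in the text above, the sketch assumes 3 × 3 × 3 kernels for the first five layers, a final kernel that spans the whole temporal length of the residual sequence, and illustrative channel widths; only the overall structure (six 3D convolutions with ReLU, C output channels, global max pooling, fully connected regression) follows the description.

```python
import torch
import torch.nn as nn

class TemporalSubNetwork(nn.Module):
    """Six 3D convolutional layers with ReLU, a global max-pooling module and a
    fully connected regression module, as described for the temporal branch."""

    def __init__(self, channels: int = 128, t_frames: int = 7):
        super().__init__()
        widths = [3, 32, 64, 64, 128, 128]            # illustrative channel widths
        layers = []
        for c_in, c_out in zip(widths[:-1], widths[1:]):
            layers += [nn.Conv3d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
        # last layer: C channels, kernel spanning the t_frames residual images
        layers += [nn.Conv3d(widths[-1], channels, kernel_size=(t_frames, 1, 1)), nn.ReLU(inplace=True)]
        self.conv3d = nn.Sequential(*layers)
        self.fc = nn.Linear(channels, 1)

    def forward(self, residuals):
        # residuals: (B, 3, t_frames, H, W) residual image sequence of one sub-video
        f_t = self.conv3d(residuals).squeeze(2)       # temporal feature map F_t: (B, C, H, W)
        pooled = f_t.amax(dim=(2, 3))                 # global max pooling -> (B, C)
        return self.fc(pooled).squeeze(1), f_t

net = TemporalSubNetwork(t_frames=7)                  # m = 8 frames -> 7 residual images
score, f_t = net(torch.rand(2, 3, 7, 112, 112))
print(score.shape, f_t.shape)  # torch.Size([2]) torch.Size([2, 128, 112, 112])
```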
Further, the step S32 is specifically: dividing each video of the training set into a plurality of sub-videos of equal length, each sub-video comprising m consecutive frames; and calculating the corresponding residual image sequence for each sub-video with the following formula:

RF_{i~j} = F_{(i+1)~j} - F_{i~(j-1)}

where F_i denotes the i-th frame of the video, F_{i~j} denotes the sub-video from frame i to frame j of the video, and RF_{i~j} denotes the residual image sequence of that sub-video.

The residual image sequence of each sub-video is input into the network designed in step S31; a C × H × W time domain feature map F_t is obtained through the 3D convolution module, where C, H and W are respectively the channel number, height and width of the feature map; a C × 1 vector is then obtained through the pooling module, and the quality score of the sub-video is obtained through the mapping of the regression module.
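A minimal sketch of the sub-video splitting and residual computation, assuming PyTorch and decoded (T, C, H, W) videos; m = 8 and the function names are illustrative.

```python
import torch

def split_into_subvideos(video: torch.Tensor, m: int = 8):
    """Split a (T, C, H, W) video into consecutive non-overlapping m-frame sub-videos."""
    return [video[k * m:(k + 1) * m] for k in range(video.shape[0] // m)]

def residual_sequence(sub_video: torch.Tensor) -> torch.Tensor:
    """RF_{i~j} = F_{(i+1)~j} - F_{i~(j-1)}: frame differences of one sub-video,
    giving m - 1 residual images for an m-frame sub-video."""
    return sub_video[1:] - sub_video[:-1]

video = torch.rand(64, 3, 112, 112)            # 64 decoded frames
subs = split_into_subvideos(video, m=8)        # 8 sub-videos of 8 frames each
residuals = [residual_sequence(s) for s in subs]
print(len(residuals), residuals[0].shape)      # 8 torch.Size([7, 3, 112, 112])
```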
Furthermore, the video quality evaluation network comprises a spatial domain feature extraction module, a time domain feature extraction module, an attention module, a plurality of subsequent pooling layers and a full-connection layer; the trained spatial domain feature extraction module is a backbone network and a 1 x 1 convolution layer of a spatial domain feature extraction sub-network, and the time domain feature extraction module is a 3D convolution module of a time domain feature extraction sub-network.
Further, the video quality assessment network is constructed and trained, specifically:
A spatial domain feature map of the corresponding sub-video, F_s ∈ R^(C×H×W), is obtained by averaging the m frame-level spatial domain feature maps; F_t and F_s are then preliminarily fused by concatenation into a spatio-temporal feature map F_st.

An attention module comprising fusion attention and spatial attention is designed. First, a fusion attention map is computed from the spatio-temporal feature map F_st: average pooling and max pooling are applied to F_st separately to aggregate the spatial information of each feature map, giving F^f_avg and F^f_max; the results of passing F^f_avg and F^f_max through a shared multi-layer perceptron are added, and a sigmoid function yields the fusion attention map A_f.

Next, the spatial attention map of the spatio-temporal features is computed: the fusion attention map A_f is broadcast along the spatial dimensions to give A'_f; the expanded A'_f is multiplied element by element with the original feature map F_st to obtain a new feature map F'_st, which is then used to generate the spatial attention map A_s. Average pooling and max pooling are applied to the new feature map F'_st along the channel dimension to obtain F^s_avg and F^s_max, which are concatenated and passed through a convolutional layer and a sigmoid function to generate the spatial attention map A_s.

The spatial attention map A_s is multiplied element by element with the spatio-temporal features F'_st to obtain the final spatio-temporal feature map F_fusion.

The spatio-temporal feature map F_fusion is converted into a C-dimensional vector F_v using global pooling, and finally the vector F_v is regressed to the sub-video quality score through a fully connected layer.
using the parameters of the corresponding part in the trained spatial domain feature extraction sub-network as the parameters of the spatial domain feature extraction module, and using the parameters of the corresponding part in the trained time domain feature extraction sub-network as the parameters of the time domain feature extraction module;
fixing the parameters of the spatial domain feature extraction module and the time domain feature extraction module, and training the video quality evaluation network on the sub-video set;
and learning the optimal parameters of the model by minimizing the mean square error loss between the predicted quality scores and the real quality scores of all the sub-videos, thereby completing the training process of the video quality evaluation network.
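The fixing of the pre-trained feature extractors and the MSE training of the remaining layers could look like the sketch below. It assumes a `quality_net` object exposing `spatial_module` and `temporal_module` attributes and a data loader yielding (frames, residuals, score) batches; these names and the hyper-parameters are illustrative, not prescribed by the patent.

```python
import torch
import torch.nn as nn

def train_quality_network(quality_net, loader, epochs: int = 10, lr: float = 1e-4):
    """Freeze the spatial/temporal feature extraction modules and train the remaining
    layers (attention, pooling, regression) by minimizing the MSE to the real scores."""
    for module in (quality_net.spatial_module, quality_net.temporal_module):
        for p in module.parameters():
            p.requires_grad = False

    trainable = [p for p in quality_net.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=lr)
    criterion = nn.MSELoss()

    for _ in range(epochs):
        for frames, residuals, target_score in loader:
            predicted = quality_net(frames, residuals)   # predicted sub-video quality score
            loss = criterion(predicted, target_score)    # mean squared error loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return quality_net
```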
Further, A_f is calculated as follows:

F_st = Concat(F_t, F_s)
F^f_avg = AvgPool(F_st)
F^f_max = MaxPool(F_st)
A_f = σ(MLP(F^f_avg) + MLP(F^f_max))

where Concat denotes concatenation along the channel dimension, σ denotes the sigmoid function, and MLP is a shared multi-layer perceptron in which each layer is followed by a ReLU activation function;
A_s is calculated as follows:

F'_st = A'_f ⊗ F_st
F^s_avg = AvgPool(F'_st)
F^s_max = MaxPool(F'_st)
A_s = σ(Conv(Concat(F^s_avg, F^s_max)))

where ⊗ denotes element-by-element multiplication, Concat denotes concatenation along the channel dimension, σ denotes the sigmoid function, and Conv denotes a convolutional layer.
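The two attention maps above can be implemented compactly in PyTorch, in the style of CBAM. The sketch below assumes F_st has 2C channels (F_t and F_s concatenated, C = 128); the MLP reduction ratio and the 7 × 7 convolution kernel are illustrative choices, not values stated in the patent.

```python
import torch
import torch.nn as nn

class FusedSpatialAttention(nn.Module):
    """Fusion (channel) attention A_f over F_st followed by a spatial attention
    map A_s, mirroring the formulas above."""

    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        # shared MLP for the fusion attention
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        # convolution producing the spatial attention map
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f_st):                                # f_st: (B, 2C, H, W)
        b, c, h, w = f_st.shape
        avg = self.mlp(f_st.mean(dim=(2, 3)))               # MLP(F^f_avg)
        mx = self.mlp(f_st.amax(dim=(2, 3)))                # MLP(F^f_max)
        a_f = torch.sigmoid(avg + mx).view(b, c, 1, 1)      # A_f, broadcast as A'_f
        f_st_prime = a_f * f_st                             # F'_st = A'_f ⊗ F_st

        avg_s = f_st_prime.mean(dim=1, keepdim=True)        # F^s_avg: (B, 1, H, W)
        max_s = f_st_prime.amax(dim=1, keepdim=True)        # F^s_max: (B, 1, H, W)
        a_s = torch.sigmoid(self.conv(torch.cat([avg_s, max_s], dim=1)))  # A_s
        return a_s * f_st_prime                             # F_fusion

attn = FusedSpatialAttention(channels=256)   # 2C with C = 128
f_fusion = attn(torch.rand(2, 256, 7, 7))
print(f_fusion.shape)                        # torch.Size([2, 256, 7, 7])
```

The returned F_fusion is then pooled into F_v and regressed, as described next.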
Further, converting the spatio-temporal feature map F_fusion into the C-dimensional vector F_v using global pooling is specifically: the first C/2 feature maps of F_fusion are subjected to average pooling and standard-deviation pooling respectively, and the two resulting vectors are concatenated into a C-dimensional vector, which, to keep the features balanced, is reduced to C/2 dimensions by a fully connected layer and denoted F_sv; the last C/2 feature maps of F_fusion are subjected to max pooling to obtain a C/2-dimensional vector denoted F_tv; F_sv and F_tv are then concatenated to obtain the C-dimensional vector F_v.
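A sketch of this pooling step, assuming the fused feature map has an even number of channels; the shapes in the comments assume 256 channels as in the attention sketch above.

```python
import torch
import torch.nn as nn

class FusionPooling(nn.Module):
    """Pools F_fusion into the vector F_v: the first half of the channels is average-
    and standard-deviation-pooled, concatenated and reduced by a fully connected layer
    (F_sv); the second half is max-pooled (F_tv); F_v = concat(F_sv, F_tv)."""

    def __init__(self, channels: int):
        super().__init__()
        assert channels % 2 == 0, "channel count must be even"
        self.half = channels // 2
        self.fc = nn.Linear(channels, self.half)   # keeps the two halves balanced

    def forward(self, f_fusion):                   # f_fusion: (B, channels, H, W)
        first, second = f_fusion[:, :self.half], f_fusion[:, self.half:]
        f_sv = self.fc(torch.cat([first.mean(dim=(2, 3)), first.std(dim=(2, 3))], dim=1))
        f_tv = second.amax(dim=(2, 3))             # max pooling of the last half of the maps
        return torch.cat([f_sv, f_tv], dim=1)      # F_v: (B, channels)

pool = FusionPooling(channels=256)                 # e.g. 2C with C = 128
f_v = pool(torch.rand(2, 256, 7, 7))
print(f_v.shape)                                   # torch.Size([2, 256])
```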
Compared with the prior art, the invention has the following beneficial effects:
1. The method extracts deep semantic features through the spatial domain feature extraction module to address the content dependency of predicted video quality. A time domain feature extraction module is designed that replaces RGB frames with video residual images, removing static objects and background information so as to capture more motion-specific information. An attention module fuses the spatio-temporal features and adaptively adjusts the influence of spatial and temporal distortions on perceived video quality, which can significantly improve the performance of no-reference video quality evaluation.
2. The model of the invention can be well suitable for the video suffering from complex mixed real world distortion, and has wider practical application value.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a block diagram of a model for reference-free video quality assessment incorporating spatiotemporal features in an embodiment of the present invention;
FIG. 3 is a block diagram of a time domain feature extraction sub-network in an example of the invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
Referring to fig. 1, the present invention provides a no-reference video quality assessment method with spatio-temporal features fused, comprising the following steps:
step S1, acquiring a video data set, and randomly dividing the video data set into a training set (80%) and a testing set (20%) according to a preset proportion;
s2, constructing a spatial domain feature extraction sub-network, and training based on a frame set obtained by down-sampling of a training set;
step S21, uniformly downsampling each video of the training set, wherein the sampling frequency is that one frame is taken for each f frame, and the quality fraction of the video is taken as the quality fraction of each frame to obtain a training frame set;
s22, constructing a spatial domain feature extraction sub-network according to the image classification network as a backbone network, and pre-training;
and S23, fixing pre-trained parameters in the backbone network, training the spatial domain feature extraction sub-network according to the training frame set, learning the optimal parameters of the model by minimizing the loss of the mean square error between the predicted quality fraction and the real quality fraction of all frames in the training frame set, and completing the training process of the spatial domain feature extraction sub-network.
S3, constructing a time domain feature extraction sub-network, and training based on a residual image sequence of a training set, specifically comprising the following steps:
step S31, constructing a neural network composed of a plurality of 3D convolution layers as a video time domain feature extraction sub-network;
step S32, dividing the training set video into a plurality of sub-videos, taking the sub-videos obtained from all the videos in the training set as a sub-video set, and taking the real quality score of each sub-video as the real quality score of the corresponding video;
preferably, each video of the training set is divided into a plurality of sub-videos of equal length, each sub-video comprising m consecutive frames; the corresponding residual image sequence is calculated for each sub-video with the following formula:

RF_{i~j} = F_{(i+1)~j} - F_{i~(j-1)}

where F_i denotes the i-th frame of the video, F_{i~j} denotes the sub-video from frame i to frame j of the video, and RF_{i~j} denotes the residual image sequence of that sub-video.

The residual image sequence of each sub-video is input into the network designed in step S31; a C × H × W time domain feature map F_t is obtained through the 3D convolution module, where C, H and W are respectively the channel number, height and width of the feature map; a C × 1 vector is then obtained through the pooling module, and the quality score of the sub-video is obtained through the mapping of the regression module.
Step S33, training a time domain feature extraction sub-network by using a sub-video set and taking batches as units; and the training process of the time domain feature extraction sub-network is completed by minimizing the mean square error loss between the predicted quality fraction and the real quality fraction of the sub-video and learning the optimal parameters of the model.
Step S4, constructing a video quality evaluation network according to the trained spatial domain feature extraction sub-network and time domain feature extraction sub-network, and training to obtain a video quality evaluation model by adaptively adjusting the influence of the time domain and spatial domain features on the video perception quality through an attention mechanism;
and step S5, extracting time domain and space domain characteristics of the video to be detected according to the obtained video quality evaluation model, and calculating the quality score of the video to be detected.
Preferably, in this embodiment, the spatial domain feature extraction sub-network specifically includes: VGG16, ResNet50 or DenseNet is used as the backbone network, and the part of the backbone network after the last convolutional layer is replaced as follows: firstly, a 1 × 1 convolutional layer with C channels (C = 128) is used to obtain the spatial domain feature map of the video frame, F_s ∈ R^(C×H×W); then global average pooling and global standard-deviation pooling are applied to the spatial domain feature map F_s, the two pooled vectors are concatenated, and a fully connected layer maps the concatenated vector to the quality score of the video frame. The modified network is used as the sub-network for extracting the spatial domain features of the video.
Preferably, in this embodiment, the time domain feature extraction sub-network is composed of a 3D convolution module, a pooling module and a regression module in sequence, specifically: the 3D convolution module has six 3D convolutional layers; the convolution kernel size of the first five convolutional layers is 3 × 3, and that of the last layer is 1 × m; each convolutional layer is followed by a ReLU activation function, and the number of channels of the last 3D convolutional layer is C. The output of the 3D convolution module is the time domain feature map of the input sub-video, F_t ∈ R^(C×H×W). The pooling module is composed of a global max-pooling layer and converts the time domain feature map F_t into a feature vector; the regression module is composed of a fully connected layer and maps the feature vector to the quality score.
Preferably, in this embodiment, the video quality evaluation network includes a spatial domain feature extraction module, a temporal domain feature extraction module, an attention module, and a plurality of subsequent pooling layers and full-link layers; the trained spatial domain feature extraction module is a backbone network and a 1 x 1 convolution layer of a spatial domain feature extraction sub-network, and the time domain feature extraction module is a 3D convolution module of a time domain feature extraction sub-network.
The video quality assessment network construction and training method specifically comprises the following steps:
A spatial domain feature map of the corresponding sub-video, F_s ∈ R^(C×H×W), is obtained by averaging the m frame-level spatial domain feature maps; F_t and F_s are then preliminarily fused by concatenation into a spatio-temporal feature map F_st.

An attention module comprising fusion attention and spatial attention is designed. First, a fusion attention map is computed from the spatio-temporal feature map F_st: average pooling and max pooling are applied to F_st separately to aggregate the spatial information of each feature map, giving F^f_avg and F^f_max; the results of passing F^f_avg and F^f_max through a shared multi-layer perceptron are added, and a sigmoid function yields the fusion attention map A_f, calculated as follows:

F_st = Concat(F_t, F_s)
F^f_avg = AvgPool(F_st)
F^f_max = MaxPool(F_st)
A_f = σ(MLP(F^f_avg) + MLP(F^f_max))

where Concat denotes concatenation along the channel dimension, σ denotes the sigmoid function, and MLP is a shared multi-layer perceptron in which each layer is followed by a ReLU activation function.

Next, the spatial attention map of the spatio-temporal features is computed: the fusion attention map A_f is broadcast along the spatial dimensions to give A'_f; the expanded A'_f is multiplied element by element with the original feature map F_st to obtain a new feature map F'_st, which is then used to generate the spatial attention map A_s. Average pooling and max pooling are applied to the new feature map F'_st along the channel dimension to obtain F^s_avg and F^s_max, which are concatenated and passed through a convolutional layer and a sigmoid function to generate the spatial attention map A_s, calculated as follows:

F'_st = A'_f ⊗ F_st
F^s_avg = AvgPool(F'_st)
F^s_max = MaxPool(F'_st)
A_s = σ(Conv(Concat(F^s_avg, F^s_max)))

where ⊗ denotes element-by-element multiplication, Concat denotes concatenation along the channel dimension, σ denotes the sigmoid function, and Conv denotes a convolutional layer.

The spatial attention map A_s is multiplied element by element with the spatio-temporal features F'_st to obtain the final spatio-temporal feature map F_fusion.

The spatio-temporal feature map F_fusion is converted into a C-dimensional vector F_v using global pooling: the first C/2 feature maps of F_fusion are subjected to average pooling and standard-deviation pooling respectively, and the two resulting vectors are concatenated into a C-dimensional vector, which, to keep the features balanced, is reduced to C/2 dimensions by a fully connected layer and denoted F_sv; the last C/2 feature maps of F_fusion are subjected to max pooling to obtain a C/2-dimensional vector denoted F_tv; F_sv and F_tv are then concatenated to obtain the C-dimensional vector F_v. Finally, the vector F_v is regressed to the sub-video quality score through a fully connected layer.
using the parameters of the corresponding part in the trained spatial domain feature extraction sub-network as the parameters of the spatial domain feature extraction module, and using the parameters of the corresponding part in the trained time domain feature extraction sub-network as the parameters of the time domain feature extraction module;
fixing the parameters of the spatial domain feature extraction module and the time domain feature extraction module, and training the video quality evaluation network on the sub-video set;
and learning the optimal parameters of the model by minimizing the mean square error loss between the predicted quality scores and the real quality scores of all the sub-videos, thereby completing the training process of the video quality evaluation network.
In this embodiment, step S5 is specifically:
Step S51: each video to be tested is divided into a plurality of sub-videos by the method of step S32, each sub-video comprising m consecutive frames.
Step S52: the frames of the sub-video are first input into the spatial domain feature extraction module, and the sub-video is then input into the time domain feature extraction module; finally, the quality score of the sub-video is predicted by the video quality evaluation network.
Step S53: the average of the predicted quality scores of all sub-videos of the video is taken as the predicted quality score of the video.
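A minimal inference sketch of steps S51 to S53, assuming the network interface from the training sketch above (a `quality_net(frames, residuals)` call returning one score per sub-video); the exact tensor layout the real network expects may differ.

```python
import torch

@torch.no_grad()
def predict_video_quality(quality_net, video: torch.Tensor, m: int = 8) -> float:
    """Split a test video into m-frame sub-videos, score each one and
    return the mean of the sub-video scores as the video-level prediction."""
    quality_net.eval()
    scores = []
    for k in range(video.shape[0] // m):
        sub = video[k * m:(k + 1) * m]                   # (m, C, H, W) sub-video
        frames = sub.unsqueeze(0)                        # frames for the spatial module
        residuals = (sub[1:] - sub[:-1]).unsqueeze(0)    # residual image sequence
        scores.append(float(quality_net(frames, residuals)))
    return sum(scores) / len(scores)
```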
The above description is only a preferred embodiment of the present invention, and all equivalent changes and modifications made in accordance with the claims of the present invention should be covered by the present invention.

Claims (10)

1. A no-reference video quality evaluation method fusing spatio-temporal characteristics is characterized by comprising the following steps:
step S1, acquiring a video data set as a training set;
s2, constructing a spatial domain feature extraction sub-network, and training based on a frame set obtained by down-sampling of a training set;
s3, constructing a time domain feature extraction sub-network, and training based on a residual image sequence of a training set;
step S4, constructing a video quality evaluation network according to the trained spatial domain feature extraction sub-network and time domain feature extraction sub-network, and training to obtain a video quality evaluation model by adaptively adjusting the influence of the time domain and spatial domain features on the video perception quality through an attention mechanism;
and step S5, extracting time domain and space domain characteristics of the video to be detected according to the obtained video quality evaluation model, and calculating the quality score of the video to be detected.
2. The method for evaluating quality of a reference-free video fused with spatio-temporal features according to claim 1, wherein the step S2 specifically comprises:
step S21, uniformly downsampling each video of the training set, wherein the sampling frequency is that one frame is taken for each f frame, and the quality fraction of the video is taken as the quality fraction of each frame to obtain a training frame set;
s22, constructing a spatial domain feature extraction sub-network according to the image classification network as a backbone network, and pre-training;
and S23, fixing pre-trained parameters in the backbone network, training the spatial domain feature extraction sub-network according to the training frame set, learning the optimal parameters of the model by minimizing the loss of the mean square error between the predicted quality fraction and the real quality fraction of all frames in the training frame set, and completing the training process of the spatial domain feature extraction sub-network.
3. The method as claimed in claim 2, wherein the spatial domain feature extraction sub-network specifically comprises: VGG16, ResNet50 or DenseNet is used as the backbone network, and the part of the backbone network after the last convolutional layer is replaced as follows: firstly, a 1 × 1 convolutional layer with C channels is used to obtain the spatial domain feature map of the video frame, F_s ∈ R^(C×H×W); then global average pooling and global standard-deviation pooling are applied to the spatial domain feature map F_s, the two pooled vectors are concatenated, and a fully connected layer maps the concatenated vector to the quality score of the video frame; the modified network is used as the sub-network for extracting the spatial domain features of the video.
4. The method for evaluating quality of a reference-free video fused with spatio-temporal features according to claim 1, wherein the step S3 specifically comprises:
step S31, constructing a neural network composed of a plurality of 3D convolution layers as a video time domain feature extraction sub-network;
step S32, dividing the training set video into a plurality of sub-videos, taking the sub-videos obtained from all the videos in the training set as a sub-video set, and taking the real quality score of each sub-video as the real quality score of the corresponding video;
step S33, training a time domain feature extraction sub-network by using a sub-video set and taking batches as units; and the training process of the time domain feature extraction sub-network is completed by minimizing the mean square error loss between the predicted quality fraction and the real quality fraction of the sub-video and learning the optimal parameters of the model.
5. The method for evaluating the quality of the reference-free video fused with the spatio-temporal features as claimed in claim 4, wherein the time domain feature extraction sub-network is composed of a 3D convolution module, a pooling module and a regression module in sequence, specifically: the 3D convolution module has six 3D convolutional layers; the convolution kernel size of the first five convolutional layers is 3 × 3, and that of the last layer is 1 × m; each convolutional layer is followed by a ReLU activation function, and the number of channels of the last 3D convolutional layer is C; the output of the 3D convolution module is the time domain feature map of the input sub-video, F_t ∈ R^(C×H×W); the pooling module is composed of a global max-pooling layer and converts the time domain feature map F_t into a feature vector; the regression module is composed of a fully connected layer and maps the feature vector to the quality score.
6. The method for reference-free video quality assessment with fusion of spatio-temporal features according to claim 4, wherein said step S32 specifically comprises: dividing a video of a training set into a plurality of sub-videos with equal length, wherein each sub-video comprises continuous m frames; calculating a corresponding residual image sequence for each sub-video, the calculation formula is as follows:
RF_{i~j} = F_{(i+1)~j} - F_{i~(j-1)}

where F_i denotes the i-th frame of the video, F_{i~j} denotes the sub-video from frame i to frame j of the video, and RF_{i~j} denotes the residual image sequence of that sub-video;

the residual image sequence of each sub-video is input into the network designed in step S31; a C × H × W time domain feature map F_t is obtained through the 3D convolution module, where C, H and W are respectively the channel number, height and width of the feature map; a C × 1 vector is then obtained through the pooling module, and the quality score of the sub-video is obtained through the mapping of the regression module.
7. The method for evaluating the quality of the reference-free video fused with the spatio-temporal features according to claim 1, wherein the video quality evaluation network comprises a spatial domain feature extraction module, a temporal domain feature extraction module, an attention module, a plurality of subsequent pooling layers and a full-link layer; the trained spatial domain feature extraction module is a backbone network and a 1 x 1 convolution layer of a spatial domain feature extraction sub-network, and the time domain feature extraction module is a 3D convolution module of a time domain feature extraction sub-network.
8. The method for evaluating the quality of the reference-free video fused with the spatio-temporal features according to claim 1, wherein the video quality evaluation network is constructed and trained, and specifically comprises the following steps:
obtaining a spatial domain feature map of the corresponding sub-video, F_s ∈ R^(C×H×W), by averaging the m frame-level spatial domain feature maps, and then preliminarily fusing F_t and F_s by concatenation into a spatio-temporal feature map F_st;

designing an attention module comprising fusion attention and spatial attention: first, computing a fusion attention map based on the spatio-temporal feature map F_st, wherein average pooling and max pooling are applied to F_st separately to aggregate the spatial information of each feature map, giving F^f_avg and F^f_max, the results of passing F^f_avg and F^f_max through a shared multi-layer perceptron are added, and a sigmoid function yields the fusion attention map A_f;

computing a spatial attention map of the spatio-temporal features: broadcasting the fusion attention map A_f along the spatial dimensions to give A'_f, multiplying the expanded A'_f element by element with the original feature map F_st to obtain a new feature map F'_st, and then using the new feature map F'_st to generate the spatial attention map A_s;

applying average pooling and max pooling to the new feature map F'_st along the channel dimension to obtain F^s_avg and F^s_max, which are concatenated and passed through a convolutional layer and a sigmoid function to generate the spatial attention map A_s;

multiplying the spatial attention map A_s element by element with the spatio-temporal features F'_st to obtain the final spatio-temporal feature map F_fusion;

converting the spatio-temporal feature map F_fusion into a C-dimensional vector F_v using global pooling, and finally regressing the vector F_v to the sub-video quality score through a fully connected layer;
using the parameters of the corresponding part in the trained spatial domain feature extraction sub-network as the parameters of the spatial domain feature extraction module, and using the parameters of the corresponding part in the trained time domain feature extraction sub-network as the parameters of the time domain feature extraction module;
fixing the parameters of the spatial domain feature extraction module and the time domain feature extraction module, and training the video quality evaluation network on the sub-video set;
and learning the optimal parameters of the model by minimizing the mean square error loss between the predicted quality scores and the real quality scores of all the sub-videos, thereby completing the training process of the video quality evaluation network.
9. The method for evaluating the quality of the reference-free video fused with the spatio-temporal features according to claim 8, wherein A_f is calculated as follows:

F_st = Concat(F_t, F_s)
F^f_avg = AvgPool(F_st)
F^f_max = MaxPool(F_st)
A_f = σ(MLP(F^f_avg) + MLP(F^f_max))

where Concat denotes concatenation along the channel dimension, σ denotes the sigmoid function, and MLP is a shared multi-layer perceptron in which each layer is followed by a ReLU activation function;

A_s is calculated as follows:

F'_st = A'_f ⊗ F_st
F^s_avg = AvgPool(F'_st)
F^s_max = MaxPool(F'_st)
A_s = σ(Conv(Concat(F^s_avg, F^s_max)))

where ⊗ denotes element-by-element multiplication, Concat denotes concatenation along the channel dimension, σ denotes the sigmoid function, and Conv denotes a convolutional layer.
10. The method for evaluating the quality of the reference-free video fused with the spatio-temporal features as claimed in claim 8, wherein converting the spatio-temporal feature map F_fusion into the C-dimensional vector F_v using global pooling specifically comprises: subjecting the first C/2 feature maps of F_fusion to average pooling and standard-deviation pooling respectively, concatenating the two resulting vectors into a C-dimensional vector, and, to keep the features balanced, reducing this vector to C/2 dimensions by a fully connected layer, denoted F_sv; subjecting the last C/2 feature maps of F_fusion to max pooling to obtain a C/2-dimensional vector denoted F_tv; and then concatenating F_sv and F_tv to obtain the C-dimensional vector F_v.
CN202110176125.XA 2021-02-07 2021-02-07 Non-reference video quality assessment method integrating space-time characteristics Active CN112954312B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110176125.XA CN112954312B (en) 2021-02-07 2021-02-07 Non-reference video quality assessment method integrating space-time characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110176125.XA CN112954312B (en) 2021-02-07 2021-02-07 Non-reference video quality assessment method integrating space-time characteristics

Publications (2)

Publication Number Publication Date
CN112954312A true CN112954312A (en) 2021-06-11
CN112954312B CN112954312B (en) 2024-01-05

Family

ID=76244601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110176125.XA Active CN112954312B (en) 2021-02-07 2021-02-07 Non-reference video quality assessment method integrating space-time characteristics

Country Status (1)

Country Link
CN (1) CN112954312B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104023227A (en) * 2014-05-28 2014-09-03 宁波大学 Objective video quality evaluation method based on space domain and time domain structural similarities
US20160330439A1 (en) * 2016-05-27 2016-11-10 Ningbo University Video quality objective assessment method based on spatiotemporal domain structure
US20180189572A1 (en) * 2016-12-30 2018-07-05 Mitsubishi Electric Research Laboratories, Inc. Method and System for Multi-Modal Fusion Model
CN109325435A (en) * 2018-09-15 2019-02-12 天津大学 Video actions identification and location algorithm based on cascade neural network
WO2020221278A1 (en) * 2019-04-29 2020-11-05 北京金山云网络技术有限公司 Video classification method and model training method and apparatus thereof, and electronic device
CN110135369A (en) * 2019-05-20 2019-08-16 威创集团股份有限公司 A kind of Activity recognition method, system, equipment and computer readable storage medium
CN110837842A (en) * 2019-09-12 2020-02-25 腾讯科技(深圳)有限公司 Video quality evaluation method, model training method and model training device
CN111784694A (en) * 2020-08-20 2020-10-16 中国传媒大学 No-reference video quality evaluation method based on visual attention mechanism
CN112085102A (en) * 2020-09-10 2020-12-15 西安电子科技大学 No-reference video quality evaluation method based on three-dimensional space-time characteristic decomposition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
朱泽, 桑庆兵, 张浩: "No-reference video quality assessment based on spatio-temporal features and attention mechanism", Laser & Optoelectronics Progress, vol. 57, no. 18, pages 181509-1 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113554599A (en) * 2021-06-28 2021-10-26 杭州电子科技大学 Video quality evaluation method based on human visual effect
CN113554599B (en) * 2021-06-28 2023-08-18 杭州电子科技大学 Video quality evaluation method based on human visual effect
CN113473117A (en) * 2021-07-19 2021-10-01 上海交通大学 No-reference audio and video quality evaluation method based on gated recurrent neural network
CN113473117B (en) * 2021-07-19 2022-09-02 上海交通大学 Non-reference audio and video quality evaluation method based on gated recurrent neural network
CN113822856A (en) * 2021-08-16 2021-12-21 南京中科逆熵科技有限公司 End-to-end no-reference video quality evaluation method based on layered time-space domain feature representation
CN113784113A (en) * 2021-08-27 2021-12-10 中国传媒大学 No-reference video quality evaluation method based on short-term and long-term time-space fusion network and long-term sequence fusion network
CN113810683A (en) * 2021-08-27 2021-12-17 南京信息工程大学 No-reference evaluation method for objectively evaluating underwater video quality
CN113642513B (en) * 2021-08-30 2022-11-18 东南大学 Action quality evaluation method based on self-attention and label distribution learning
CN113642513A (en) * 2021-08-30 2021-11-12 东南大学 Action quality evaluation method based on self-attention and label distribution learning
CN113837047A (en) * 2021-09-16 2021-12-24 广州大学 Video quality evaluation method, system, computer equipment and storage medium
CN114697648A (en) * 2022-04-25 2022-07-01 上海为旌科技有限公司 Frame rate variable video non-reference evaluation method and system, electronic device and storage medium
CN114697648B (en) * 2022-04-25 2023-12-08 上海为旌科技有限公司 Variable frame rate video non-reference evaluation method, system, electronic equipment and storage medium
CN115278303A (en) * 2022-07-29 2022-11-01 腾讯科技(深圳)有限公司 Video processing method, apparatus, device and medium
CN115278303B (en) * 2022-07-29 2024-04-19 腾讯科技(深圳)有限公司 Video processing method, device, equipment and medium
WO2024041268A1 (en) * 2022-08-24 2024-02-29 腾讯科技(深圳)有限公司 Video quality assessment method and apparatus, and computer device, computer storage medium and computer program product

Also Published As

Publication number Publication date
CN112954312B (en) 2024-01-05

Similar Documents

Publication Publication Date Title
CN112954312B (en) Non-reference video quality assessment method integrating space-time characteristics
Sun et al. MC360IQA: A multi-channel CNN for blind 360-degree image quality assessment
CN109544524B (en) Attention mechanism-based multi-attribute image aesthetic evaluation system
CN110751649B (en) Video quality evaluation method and device, electronic equipment and storage medium
Zhu et al. No-reference video quality assessment based on artifact measurement and statistical analysis
Moorthy et al. Visual quality assessment algorithms: what does the future hold?
Sun et al. Deep learning based full-reference and no-reference quality assessment models for compressed ugc videos
Zhang et al. Fine-grained quality assessment for compressed images
CN112995652B (en) Video quality evaluation method and device
CN111047543A (en) Image enhancement method, device and storage medium
Prabhushankar et al. Ms-unique: Multi-model and sharpness-weighted unsupervised image quality estimation
Xu et al. Perceptual quality assessment of internet videos
Siahaan et al. Semantic-aware blind image quality assessment
Shen et al. An end-to-end no-reference video quality assessment method with hierarchical spatiotemporal feature representation
Sinno et al. Spatio-temporal measures of naturalness
Antsiferova et al. Video compression dataset and benchmark of learning-based video-quality metrics
CN116703857A (en) Video action quality evaluation method based on time-space domain sensing
Wang A survey on IQA
Chen et al. GAMIVAL: Video quality prediction on mobile cloud gaming content
Xian et al. A content-oriented no-reference perceptual video quality assessment method for computer graphics animation videos
Da et al. Perceptual quality assessment of nighttime video
Jenadeleh Blind Image and Video Quality Assessment
CN112380395A (en) Method and system for obtaining emotion of graph convolution network based on double-flow architecture and storage medium
Nyman et al. Evaluation of the visual performance of image processing pipes: information value of subjective image attributes
Qiu et al. Blind 360-degree image quality assessment via saliency-guided convolution neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant