CN112954312A - No-reference video quality evaluation method fusing spatio-temporal characteristics - Google Patents
- Publication number
- CN112954312A CN112954312A CN202110176125.XA CN202110176125A CN112954312A CN 112954312 A CN112954312 A CN 112954312A CN 202110176125 A CN202110176125 A CN 202110176125A CN 112954312 A CN112954312 A CN 112954312A
- Authority
- CN
- China
- Prior art keywords
- video
- network
- sub
- feature extraction
- quality
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N17/00—Diagnosis, testing or measuring for television systems or their details
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformation in the plane of the image
- G06T3/40—Scaling the whole image or part thereof
- G06T3/4038—Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20212—Image combination
- G06T2207/20221—Image fusion; Image merging
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30168—Image quality inspection
Abstract
The invention relates to a no-reference video quality evaluation method fusing spatio-temporal characteristics, which comprises the following steps: step S1, acquiring a video data set as a training set; step S2, constructing a spatial domain feature extraction sub-network and training it on a frame set obtained by down-sampling the training set; step S3, constructing a time domain feature extraction sub-network and training it on residual image sequences of the training set; step S4, constructing a video quality evaluation network from the trained spatial domain and time domain feature extraction sub-networks and training it to obtain a video quality evaluation model, with an attention mechanism adaptively adjusting the influence of the temporal and spatial features on perceived video quality; and step S5, extracting the temporal and spatial features of the video under test with the obtained video quality evaluation model and calculating its quality score. The invention can significantly improve the performance of no-reference video quality evaluation.
Description
Technical Field
The invention relates to the field of image and video processing and computer vision, in particular to a no-reference video quality evaluation method fusing spatio-temporal characteristics.
Background
With the development of social media applications and the popularity of consumer capture devices, people can record their daily lives anytime and anywhere by shooting video on portable mobile devices and share it through various media platforms. This has led to a proliferation of user-generated content (UGC) videos shared and streamed over the Internet, so an accurate video quality assessment (VQA) model for consumer videos is needed to monitor, control, and optimize this enormous volume of content. Because most users have no professional training, their lack of imaging expertise can introduce distortions caused by camera shake, sensor noise, defocus, and the like. In addition, part of the original data is inevitably lost while a video is encoded, decoded, stored, transmitted, and processed, producing distortion phenomena such as noise, deformation, and missing content. These distortions lose information contained in the original video to varying degrees, degrading how viewers perceive the video and hindering their ability to obtain information from it. For an organization providing user-centric video services, it is important to ensure that videos leaving the production and distribution chain meet the quality requirements of the receiving end. A video quality evaluation model can score a video according to its degree of distortion and thereby provide a basis for subsequent video processing. Video quality assessment is one of the key technologies in the field of video processing and is crucial for applications in medicine, aviation, education, entertainment, and other fields.
Quality assessment of video can be divided into subjective and objective quality assessment. Subjective quality assessment, which relies on manual scoring, is the most accurate and reasonable, but the time and labor it consumes limit its widespread use in the real world. Researchers have therefore proposed objective quality assessment methods that automatically predict the visual quality of distorted video. According to the availability of reference information, objective methods are divided into full-reference, reduced-reference, and no-reference. Many videos have no reference in practical applications: for user-generated content, a completely distortion-free "perfect" video cannot be captured in the first place, and transmitting the additional reference information would also occupy considerable bandwidth. The no-reference quality evaluation method, which requires no original video, therefore has the widest practical application value.
Most existing no-reference video quality assessment models mainly target synthetic distortions (such as compression distortion). There is a large difference between real and synthetically distorted video: the former may suffer from complex mixtures of real-world distortions, and the distortion may also differ across time periods of the same video. According to recent studies, some state-of-the-art video quality assessment methods validated on synthetic-distortion datasets do not perform well on real-distortion video datasets. In recent years, with the release of real-distortion video quality assessment datasets and the urgent needs of real applications, the invention inputs the video residual image sequence into a 3D convolutional network to compute the temporal features of the video, and applies an attention mechanism to adaptively adjust the influence of temporal and spatial distortions on perceived video quality. The model can significantly improve the performance of no-reference video quality evaluation.
Disclosure of Invention
In view of the above, the present invention provides a method for evaluating quality of a reference-free video by fusing spatio-temporal features, so as to effectively improve the efficiency and performance of evaluating the quality of the reference-free video.
In order to achieve the purpose, the invention adopts the following technical scheme:
a no-reference video quality assessment method fusing spatio-temporal features comprises the following steps:
step S1, acquiring a video data set as a training set;
s2, constructing a spatial domain feature extraction sub-network, and training based on a frame set obtained by down-sampling of a training set;
s3, constructing a time domain feature extraction sub-network, and training based on a residual image sequence of a training set;
step S4, constructing a video quality evaluation network according to the trained spatial domain feature extraction sub-network and time domain feature extraction sub-network, and training to obtain a video quality evaluation model by adaptively adjusting the influence of the time domain and spatial domain features on the video perception quality through an attention mechanism;
and step S5, extracting time domain and space domain characteristics of the video to be detected according to the obtained video quality evaluation model, and calculating the quality score of the video to be detected.
Further, the step S2 is specifically:
step S21, uniformly down-sampling each video of the training set, taking one frame every f frames, and assigning the video's quality score to each sampled frame to obtain a training frame set;
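The uniform down-sampling in step S21 can be sketched as follows; the helper name and the use of zero-based frame indices are illustrative, not part of the patent.

```python
def uniform_downsample_indices(num_frames, f):
    """Indices of the frames kept when taking one frame every f frames."""
    return list(range(0, num_frames, f))

# Example: a 10-frame video sampled with f = 3 keeps frames 0, 3, 6 and 9,
# and each kept frame inherits the quality score of the whole video.
```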
s22, constructing a spatial domain feature extraction sub-network according to the image classification network as a backbone network, and pre-training;
and step S23, fixing the pre-trained parameters in the backbone network, training the spatial domain feature extraction sub-network on the training frame set, and learning the optimal parameters of the model by minimizing the mean squared error loss between the predicted and ground-truth quality scores of all frames in the training frame set, thereby completing the training of the spatial domain feature extraction sub-network.
Further, the spatial domain feature extraction sub-network is specifically: VGG16, ResNet50 or DenseNet is used as the backbone network, and the part of the backbone network after the last convolutional layer is replaced with the following: first, a 1 × 1 convolutional layer with C channels produces the spatial feature map F_s ∈ R^(C×H×W) of the video frame; then global average pooling and global standard-deviation pooling are applied to F_s, the two pooled vectors are concatenated, and a fully connected layer maps the concatenated vector to the quality score of the video frame. The modified network serves as the sub-network for extracting the spatial features of the video.
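The pooling head described above (global average pooling plus global standard-deviation pooling of F_s, followed by concatenation) can be sketched in NumPy; the function name is hypothetical and the final fully connected regression layer is omitted.

```python
import numpy as np

def spatial_head_pool(F_s):
    """Pool a spatial feature map F_s of shape (C, H, W) into a 2C-dim vector:
    global average pooling and global standard-deviation pooling, concatenated."""
    avg = F_s.mean(axis=(1, 2))        # (C,) global average pooling
    std = F_s.std(axis=(1, 2))         # (C,) global standard-deviation pooling
    return np.concatenate([avg, std])  # (2C,), fed to the fully connected layer
```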
Further, the step S3 is specifically:
step S31, constructing a neural network composed of a plurality of 3D convolution layers as a video time domain feature extraction sub-network;
step S32, dividing each training-set video into a plurality of sub-videos, taking the sub-videos obtained from all videos in the training set as a sub-video set, and assigning the ground-truth quality score of each video to its sub-videos as their ground-truth quality scores;
step S33, training the time domain feature extraction sub-network in batches using the sub-video set, and learning the optimal parameters of the model by minimizing the mean squared error loss between the predicted and ground-truth quality scores of the sub-videos, thereby completing the training of the time domain feature extraction sub-network.
Further, the time domain feature extraction sub-network consists, in order, of a 3D convolution module, a pooling module and a regression module, specifically: the 3D convolution module has 6 3D convolutional layers; the convolution kernel size of the first 5 layers is 3 × 3 × 3 and that of the last layer is 1 × 1 × m; each convolutional layer is followed by a ReLU activation function, and the last 3D convolutional layer has C channels. The output of the 3D convolution module is the temporal feature map F_t ∈ R^(C×H×W) of the input sub-video. The pooling module consists of a global max-pooling layer and converts the temporal feature map F_t into a feature vector. The regression module consists of a fully connected layer and maps the feature vector to the quality score.
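The effect of these kernel sizes can be checked with a small shape calculation. The padding values (1 on the five 3 × 3 × 3 layers, none on the final temporal kernel) and the sub-video length m = 8 are assumptions for illustration, not stated in the patent:

```python
def conv3d_out_shape(t, h, w, kt, kh, kw, pt=0, ph=0, pw=0):
    """Output (T, H, W) of a stride-1 3D convolution with the given kernel/padding."""
    return (t + 2 * pt - kt + 1, h + 2 * ph - kh + 1, w + 2 * pw - kw + 1)

m = 8                                 # hypothetical sub-video length
shape = (m, 32, 32)                   # m residual frames of 32x32
for _ in range(5):                    # five 3x3x3 layers with padding 1: shape kept
    shape = conv3d_out_shape(*shape, 3, 3, 3, 1, 1, 1)
shape = conv3d_out_shape(*shape, m, 1, 1)  # temporal kernel of size m collapses time
# shape is now (1, 32, 32): only a spatial C-channel map F_t remains
```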
Further, the step S32 is specifically: dividing each training-set video into a plurality of equal-length sub-videos, each containing m consecutive frames, and calculating a residual image sequence for each sub-video by:

RF_(i~j) = F_((i+1)~j) − F_(i~(j−1))

where F_i denotes the i-th frame of the video, F_(i~j) denotes the sub-video from frame i to frame j, and RF_(i~j) denotes the residual image sequence of that sub-video;
inputting the residual image sequence of each sub-video into the network designed in step S31: the 3D convolution module yields a C × H × W temporal feature map F_t, where C, H and W are the number of channels, height and width of the feature map; the pooling module then produces a C × 1 vector, and the regression module maps it to the quality score of the sub-video.
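Under the formula above, the residual image sequence is simply the frame-to-frame difference of the sub-video; a NumPy sketch (function name illustrative):

```python
import numpy as np

def residual_sequence(frames):
    """RF_(i~j) = F_((i+1)~j) - F_(i~(j-1)): each frame minus its predecessor."""
    return [frames[k + 1] - frames[k] for k in range(len(frames) - 1)]
```

An m-frame sub-video thus yields m − 1 residual images under this reading.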
Furthermore, the video quality evaluation network comprises a spatial domain feature extraction module, a time domain feature extraction module, an attention module, a plurality of subsequent pooling layers and a full-connection layer; the trained spatial domain feature extraction module is a backbone network and a 1 x 1 convolution layer of a spatial domain feature extraction sub-network, and the time domain feature extraction module is a 3D convolution module of a time domain feature extraction sub-network.
Further, the video quality assessment network is constructed and trained, specifically:
obtaining the spatial feature map F_s ∈ R^(C×H×W) of the corresponding sub-video by averaging the m spatial feature maps of its frames, and then preliminarily fusing F_t and F_s by concatenation into the spatio-temporal feature map F_st ∈ R^(2C×H×W);
designing an attention module comprising fused attention and spatial attention: first, a fused attention map is computed from the spatio-temporal feature map F_st; the spatial information of each feature map of F_st is aggregated separately by average pooling and max pooling into F_st^avg and F_st^max, the two results are passed through a shared multi-layer perceptron and added, and a sigmoid function yields the fused attention map A_f;
computing the spatial attention map of the spatio-temporal features: the fused attention map A_f is broadcast along the spatial dimensions to A′_f, the expanded A′_f is multiplied element-wise with the original feature map F_st to obtain the new feature map F′_st, and the new feature map F′_st is then used to generate the spatial attention map A_s;
applying average pooling and max pooling to the new feature map F′_st along the channel dimension to obtain F′_st^avg and F′_st^max, which are concatenated and passed through a convolutional layer and a sigmoid function to generate the spatial attention map A_s;
multiplying the spatial attention map A_s element-wise with the spatio-temporal features F′_st to obtain the final spatio-temporal feature map F_fusion;
converting the spatio-temporal feature map F_fusion into the C-dimensional vector F_v by global pooling, and finally regressing the vector F_v through a fully connected layer to obtain the sub-video quality score;
using the parameters of the corresponding part in the trained spatial domain feature extraction sub-network as the parameters of the spatial domain feature extraction module, and using the parameters of the corresponding part in the trained time domain feature extraction sub-network as the parameters of the time domain feature extraction module;
fixing the parameters of the spatial domain feature extraction module and the time domain feature extraction module, and training the video quality evaluation network on the sub-video set;
and learning the optimal parameters of the model by minimizing the mean square error loss between the predicted quality scores and the real quality scores of all the sub-videos, thereby completing the training process of the video quality evaluation network.
Further, A_f is computed as:

A_f = σ(MLP(F_st^avg) + MLP(F_st^max))

where σ denotes the sigmoid function and MLP is a shared multi-layer perceptron with a ReLU activation function after each layer;
A_s is computed as:

A_s = σ(Conv([F′_st^avg ; F′_st^max]))

where ⊗ denotes element-wise multiplication, [ ; ] denotes channel-wise concatenation, σ denotes the sigmoid function, and Conv denotes the convolutional layer.
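The two attention maps follow the familiar channel/spatial attention pattern; a minimal NumPy sketch with toy shapes, in which the weight matrices W1, W2 and the 1 × 1 convolution kernel are hypothetical stand-ins for the learned parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fused_attention(F_st, W1, W2):
    """A_f = sigmoid(MLP(avg-pool(F_st)) + MLP(max-pool(F_st))), shared MLP."""
    avg, mx = F_st.mean(axis=(1, 2)), F_st.max(axis=(1, 2))  # each (2C,)
    mlp = lambda v: W2 @ np.maximum(W1 @ v, 0.0)             # ReLU between layers
    return sigmoid(mlp(avg) + mlp(mx))                       # (2C,) channel weights

def spatial_attention(F_st_new, kernel):
    """A_s = sigmoid(Conv([avg-pool_c ; max-pool_c])) with a toy 1x1 convolution."""
    stacked = np.stack([F_st_new.mean(axis=0), F_st_new.max(axis=0)])  # (2, H, W)
    return sigmoid(np.tensordot(kernel, stacked, axes=([0], [0])))     # (H, W)
```

With all-zero weights both maps collapse to 0.5 everywhere, a convenient sanity check.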
Further, converting the spatio-temporal feature map F_fusion into the C-dimensional vector F_v by global pooling is specifically: the first C/2 feature maps of F_fusion are subjected to average pooling and standard-deviation pooling respectively, and the two resulting vectors are concatenated into a C-dimensional vector, which is reduced to C/2 dimensions by a fully connected layer (to keep the features balanced) and denoted F_sv; the last C/2 feature maps of F_fusion are max-pooled into a C/2-dimensional vector denoted F_tv; F_sv and F_tv are then concatenated to obtain the C-dimensional vector F_v.
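The split pooling of F_fusion can be sketched as follows; W_fc is a hypothetical stand-in for the learned fully connected weights of shape (C/2, C) that produce F_sv:

```python
import numpy as np

def global_pool_fusion(F_fusion, W_fc):
    """Pool F_fusion (C, H, W) into the C-dim vector F_v as described above."""
    C = F_fusion.shape[0]
    first, last = F_fusion[:C // 2], F_fusion[C // 2:]
    pooled = np.concatenate([first.mean(axis=(1, 2)),
                             first.std(axis=(1, 2))])  # (C,) avg + std pooling
    F_sv = W_fc @ pooled                 # (C/2,) after the balancing FC layer
    F_tv = last.max(axis=(1, 2))         # (C/2,) from max pooling
    return np.concatenate([F_sv, F_tv])  # (C,) vector F_v
```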
Compared with the prior art, the invention has the following beneficial effects:
1. The method extracts deep semantic features through the spatial domain feature extraction module to address the content dependency of predicted video quality. A time domain feature extraction module is designed that replaces RGB frames with video residual images, removing static objects and background information to capture more motion-specific information. The attention module fuses the spatio-temporal features and adaptively adjusts the influence of spatial and temporal distortions on perceived video quality, which can significantly improve the performance of no-reference video quality evaluation.
2. The model of the invention is well suited to videos suffering from complex mixed real-world distortions and has wide practical application value.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a block diagram of a model for reference-free video quality assessment incorporating spatiotemporal features in an embodiment of the present invention;
FIG. 3 is a block diagram of a time domain feature extraction sub-network in an example of the invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
Referring to fig. 1, the present invention provides a no-reference video quality assessment method with spatio-temporal features fused, comprising the following steps:
step S1, acquiring a video data set and randomly dividing it into a training set (80%) and a test set (20%);
s2, constructing a spatial domain feature extraction sub-network, and training based on a frame set obtained by down-sampling of a training set;
step S21, uniformly down-sampling each video of the training set, taking one frame every f frames, and assigning the video's quality score to each sampled frame to obtain a training frame set;
s22, constructing a spatial domain feature extraction sub-network according to the image classification network as a backbone network, and pre-training;
and step S23, fixing the pre-trained parameters in the backbone network, training the spatial domain feature extraction sub-network on the training frame set, and learning the optimal parameters of the model by minimizing the mean squared error loss between the predicted and ground-truth quality scores of all frames in the training frame set, thereby completing the training of the spatial domain feature extraction sub-network.
step S3, constructing a time domain feature extraction sub-network and training it on residual image sequences of the training set, specifically comprising the following steps:
step S31, constructing a neural network composed of a plurality of 3D convolution layers as a video time domain feature extraction sub-network;
step S32, dividing each training-set video into a plurality of sub-videos, taking the sub-videos obtained from all videos in the training set as a sub-video set, and assigning the ground-truth quality score of each video to its sub-videos as their ground-truth quality scores;
preferably, each training-set video is divided into a plurality of equal-length sub-videos, each containing m consecutive frames, and a residual image sequence is calculated for each sub-video by:

RF_(i~j) = F_((i+1)~j) − F_(i~(j−1))

where F_i denotes the i-th frame of the video, F_(i~j) denotes the sub-video from frame i to frame j, and RF_(i~j) denotes the residual image sequence of that sub-video;
inputting the residual image sequence of each sub-video into the network designed in step S31: the 3D convolution module yields a C × H × W temporal feature map F_t, where C, H and W are the number of channels, height and width of the feature map; the pooling module then produces a C × 1 vector, and the regression module maps it to the quality score of the sub-video.
step S33, training the time domain feature extraction sub-network in batches using the sub-video set, and learning the optimal parameters of the model by minimizing the mean squared error loss between the predicted and ground-truth quality scores of the sub-videos, thereby completing the training of the time domain feature extraction sub-network.
Step S4, constructing a video quality evaluation network according to the trained spatial domain feature extraction sub-network and time domain feature extraction sub-network, and training to obtain a video quality evaluation model by adaptively adjusting the influence of the time domain and spatial domain features on the video perception quality through an attention mechanism;
and step S5, extracting time domain and space domain characteristics of the video to be detected according to the obtained video quality evaluation model, and calculating the quality score of the video to be detected.
Preferably, in this embodiment, the spatial domain feature extraction sub-network is specifically: VGG16, ResNet50 or DenseNet is used as the backbone network, and the part of the backbone network after the last convolutional layer is replaced with the following: first, a 1 × 1 convolutional layer with C channels (C = 128) produces the spatial feature map F_s ∈ R^(C×H×W) of the video frame; then global average pooling and global standard-deviation pooling are applied to F_s, the two pooled vectors are concatenated, and a fully connected layer maps the concatenated vector to the quality score of the video frame. The modified network serves as the sub-network for extracting the spatial features of the video.
Preferably, in this embodiment, the time domain feature extraction sub-network consists, in order, of a 3D convolution module, a pooling module and a regression module, specifically: the 3D convolution module has 6 3D convolutional layers; the convolution kernel size of the first 5 layers is 3 × 3 × 3 and that of the last layer is 1 × 1 × m; each convolutional layer is followed by a ReLU activation function, and the last 3D convolutional layer has C channels. The output of the 3D convolution module is the temporal feature map F_t ∈ R^(C×H×W) of the input sub-video. The pooling module consists of a global max-pooling layer and converts the temporal feature map F_t into a feature vector. The regression module consists of a fully connected layer and maps the feature vector to the quality score.
Preferably, in this embodiment, the video quality evaluation network includes a spatial domain feature extraction module, a temporal domain feature extraction module, an attention module, and a plurality of subsequent pooling layers and full-link layers; the trained spatial domain feature extraction module is a backbone network and a 1 x 1 convolution layer of a spatial domain feature extraction sub-network, and the time domain feature extraction module is a 3D convolution module of a time domain feature extraction sub-network.
The video quality assessment network construction and training method specifically comprises the following steps:
obtaining the spatial feature map F_s ∈ R^(C×H×W) of the corresponding sub-video by averaging the m spatial feature maps of its frames, and then preliminarily fusing F_t and F_s by concatenation into the spatio-temporal feature map F_st ∈ R^(2C×H×W);
designing an attention module comprising fused attention and spatial attention: first, a fused attention map is computed from the spatio-temporal feature map F_st; the spatial information of each feature map of F_st is aggregated separately by average pooling and max pooling into F_st^avg and F_st^max, the two results are passed through a shared multi-layer perceptron and added, and a sigmoid function yields the fused attention map A_f, computed as:

A_f = σ(MLP(F_st^avg) + MLP(F_st^max))

where σ denotes the sigmoid function and MLP is a shared multi-layer perceptron with a ReLU activation function after each layer;
computing the spatial attention map of the spatio-temporal features: the fused attention map A_f is broadcast along the spatial dimensions to A′_f, the expanded A′_f is multiplied element-wise with the original feature map F_st to obtain the new feature map F′_st, and the new feature map F′_st is then used to generate the spatial attention map A_s;
applying average pooling and max pooling to the new feature map F′_st along the channel dimension to obtain F′_st^avg and F′_st^max, which are concatenated and passed through a convolutional layer and a sigmoid function to generate the spatial attention map A_s, computed as:

A_s = σ(Conv([F′_st^avg ; F′_st^max]))

where ⊗ denotes element-wise multiplication, [ ; ] denotes channel-wise concatenation, σ denotes the sigmoid function, and Conv denotes the convolutional layer.
multiplying the spatial attention map A_s element-wise with the spatio-temporal features F′_st to obtain the final spatio-temporal feature map F_fusion;
Converting the spatio-temporal feature map F_fusion to a C-dimensional vector F_v by global pooling: the first C/2 feature maps of F_fusion are subjected to average pooling and standard-deviation pooling respectively, the two resulting vectors are concatenated into a C-dimensional vector, which is reduced to C/2 dimensions by a fully connected layer to preserve feature balance and denoted F_sv; the last C/2 feature maps of F_fusion are subjected to max pooling to obtain a C/2-dimensional vector denoted F_tv; F_sv and F_tv are then concatenated into the C-dimensional vector F_v; finally, the vector F_v is regressed to the sub-video quality score through a fully connected layer;
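The split pooling head can be sketched as follows (NumPy; the weight shapes are illustrative assumptions — in the patent both the reduction and the final regression are learned fully connected layers):

```python
import numpy as np

def quality_head(F_fusion, W_reduce, w_reg):
    """Map a (C, H, W) fused feature map to a scalar quality score.

    W_reduce: (C/2, C) fully connected reduction for the spatial half.
    w_reg:    (C,) weights of the final regression layer.
    """
    C = F_fusion.shape[0]
    half = C // 2
    spatial, temporal = F_fusion[:half], F_fusion[half:]
    # first C/2 maps: average pooling + standard-deviation pooling -> C dims
    sv = np.concatenate([spatial.mean(axis=(1, 2)), spatial.std(axis=(1, 2))])
    F_sv = W_reduce @ sv                  # reduce to C/2 dims for feature balance
    # last C/2 maps: global max pooling -> C/2 dims
    F_tv = temporal.max(axis=(1, 2))
    F_v = np.concatenate([F_sv, F_tv])    # C-dimensional vector F_v
    return float(w_reg @ F_v)             # regression to the sub-video score
```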
using the parameters of the corresponding part in the trained spatial domain feature extraction sub-network as the parameters of the spatial domain feature extraction module, and using the parameters of the corresponding part in the trained time domain feature extraction sub-network as the parameters of the time domain feature extraction module;
fixing the parameters of the spatial feature extraction module and the temporal feature extraction module, and training the video quality evaluation network on the sub-video set;
and learning the optimal parameters of the model by minimizing the mean square error loss between the predicted quality scores and the real quality scores of all the sub-videos, thereby completing the training process of the video quality evaluation network.
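With the feature extraction modules frozen, only the remaining parameters are updated by minimizing the MSE loss. Treating the frozen modules as producing fixed feature vectors and the trainable part as a linear regressor, one SGD step looks like this (a simplified sketch; the actual network also trains the attention and pooling layers):

```python
import numpy as np

def mse(pred, target):
    """Mean squared error between predicted and real quality scores."""
    return float(np.mean((pred - target) ** 2))

def sgd_step(w, X, y, lr=0.05):
    """One gradient step on regression weights w; rows of X are frozen
    sub-video features, y the ground-truth quality scores."""
    grad = 2.0 * X.T @ (X @ w - y) / len(y)   # gradient of the MSE loss
    return w - lr * grad
```

Iterating the step drives the MSE loss down, which is the training criterion the embodiment describes.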
In this embodiment, step S5 specifically comprises:
Step S51: dividing each video to be tested into a plurality of sub-videos by the method of step S32, each sub-video comprising m consecutive frames.
Step S52: the sub-video is first split into frames, which are input to the spatial feature extraction module; the sub-video itself is input to the temporal feature extraction module; finally, the quality score of the sub-video is predicted by the video quality evaluation network.
Step S53: taking the average of the predicted quality scores of all sub-videos of a video as the predicted quality score of that video.
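Steps S51–S53 amount to splitting, per-sub-video prediction, and averaging; a sketch follows, where `predict_subvideo` is a hypothetical stand-in for the full network of step S52:

```python
def split_into_subvideos(frames, m):
    """Consecutive, non-overlapping m-frame sub-videos (step S51);
    a trailing remainder shorter than m frames is dropped."""
    return [frames[i:i + m] for i in range(0, len(frames) - m + 1, m)]

def video_score(frames, m, predict_subvideo):
    """Average of the predicted sub-video scores (step S53)."""
    subs = split_into_subvideos(frames, m)
    return sum(predict_subvideo(s) for s in subs) / len(subs)
```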
The above description is only a preferred embodiment of the present invention, and all equivalent changes and modifications made in accordance with the claims of the present invention should be covered by the present invention.
Claims (10)
1. A no-reference video quality evaluation method fusing spatio-temporal characteristics is characterized by comprising the following steps:
step S1, acquiring a video data set as a training set;
step S2, constructing a spatial feature extraction sub-network, and training it on a frame set obtained by downsampling the training set;
step S3, constructing a temporal feature extraction sub-network, and training it on residual image sequences of the training set;
step S4, constructing a video quality evaluation network according to the trained spatial domain feature extraction sub-network and time domain feature extraction sub-network, and training to obtain a video quality evaluation model by adaptively adjusting the influence of the time domain and spatial domain features on the video perception quality through an attention mechanism;
and step S5, extracting time domain and space domain characteristics of the video to be detected according to the obtained video quality evaluation model, and calculating the quality score of the video to be detected.
2. The method for evaluating quality of a reference-free video fused with spatio-temporal features according to claim 1, wherein the step S2 specifically comprises:
step S21, uniformly downsampling each video of the training set, taking one frame every f frames, and assigning the quality score of the video to each sampled frame to obtain a training frame set;
step S22, constructing a spatial feature extraction sub-network using an image classification network as the backbone network, and pre-training it;
step S23, fixing the pre-trained parameters of the backbone network, training the spatial feature extraction sub-network on the training frame set, and learning the optimal parameters of the model by minimizing the mean square error loss between the predicted and real quality scores of all frames in the training frame set, thereby completing the training of the spatial feature extraction sub-network.
3. The method as claimed in claim 2, wherein the spatial feature extraction sub-network specifically comprises: VGG16, ResNet50 or DenseNet is used as the backbone network, and the part of the backbone network after the last convolutional layer is replaced as follows: first, a 1 × 1 convolutional layer with C channels produces the spatial feature map F_s of the video frame; then global average pooling and global standard-deviation pooling are applied to F_s, the two pooled vectors are concatenated, and a fully connected layer maps the concatenated vector to the quality score of the video frame; the modified network serves as the sub-network for extracting spatial features of the video.
4. The method for evaluating quality of a reference-free video fused with spatio-temporal features according to claim 1, wherein the step S3 specifically comprises:
step S31, constructing a neural network composed of a plurality of 3D convolution layers as a video time domain feature extraction sub-network;
step S32, dividing the training set video into a plurality of sub-videos, taking the sub-videos obtained from all the videos in the training set as a sub-video set, and taking the real quality score of each sub-video as the real quality score of the corresponding video;
step S33, training a time domain feature extraction sub-network by using a sub-video set and taking batches as units; and the training process of the time domain feature extraction sub-network is completed by minimizing the mean square error loss between the predicted quality fraction and the real quality fraction of the sub-video and learning the optimal parameters of the model.
5. The method as claimed in claim 4, wherein the temporal feature extraction sub-network consists, in order, of a 3D convolution module, a pooling module and a regression module, specifically: the 3D convolution module has six 3D convolutional layers; the first five layers use 3 × 3 × 3 kernels and the last layer uses a 1 × 1 × m kernel; each convolutional layer is followed by a ReLU activation function, and the last 3D convolutional layer has C channels; the output of the 3D convolution module is the temporal feature map F_t of the input sub-video; the pooling module consists of a global max-pooling layer that converts F_t into a feature vector; the regression module consists of a fully connected layer that maps the feature vector to the quality score.
6. The method as claimed in claim 4, wherein step S32 specifically comprises: dividing each video of the training set into a plurality of sub-videos of equal length, each sub-video comprising m consecutive frames; and calculating the corresponding residual image sequence for each sub-video according to the following formula:
RF_{i~j} = F_{i+1~j} − F_{i~j−1}
wherein F_i denotes the i-th frame of the video, F_{i~j} denotes the sub-video from frame i to frame j, and RF_{i~j} denotes the residual image sequence of that sub-video;
inputting the residual image sequence of each sub-video into the network designed in step S31: the 3D convolution module produces a C × H × W temporal feature map F_t, where C, H and W are the number of channels, the height and the width of the feature map respectively; the pooling module then yields a C × 1 vector, which the regression module maps to the quality score of the sub-video.
7. The method as claimed in claim 1, wherein the video quality evaluation network comprises a spatial feature extraction module, a temporal feature extraction module, an attention module, and subsequent pooling layers and a fully connected layer; the trained spatial feature extraction module consists of the backbone network and the 1 × 1 convolutional layer of the spatial feature extraction sub-network, and the temporal feature extraction module is the 3D convolution module of the temporal feature extraction sub-network.
8. The method for evaluating the quality of the reference-free video fused with the spatio-temporal features according to claim 1, wherein the video quality evaluation network is constructed and trained, and specifically comprises the following steps:
obtaining the spatial feature map F_s of the corresponding sub-video by averaging the m spatial feature maps, and then concatenating F_t and F_s along the channel dimension to obtain a preliminary spatio-temporal feature map F_st;
designing an attention module comprising fusion attention and spatial attention: first, a fusion attention map is computed from the spatio-temporal feature map F_st; average pooling and max pooling are applied separately to aggregate the spatial information of each feature map of F_st, yielding F_st^avg and F_st^max; the two results are passed through a shared multilayer perceptron, added, and fed to a sigmoid function to obtain the fusion attention map A_f;
computing the spatial attention map of the spatio-temporal features: the fusion attention map A_f is broadcast along the spatial dimensions to A'_f, which is multiplied element-wise with the original feature map F_st to obtain a new feature map F'_st; the new feature map F'_st is then used to generate the spatial attention map A_s;
applying average pooling and max pooling to the new feature map F'_st along the channel dimension to obtain F'_st^avg and F'_st^max, which are concatenated and passed through a convolutional layer and a sigmoid function to generate the spatial attention map A_s;
multiplying the spatial attention map A_s element-wise with the spatio-temporal features F'_st to obtain the final spatio-temporal feature map F_fusion;
converting the spatio-temporal feature map F_fusion to a C-dimensional vector F_v by global pooling, and finally regressing the vector F_v to the sub-video quality score through a fully connected layer;
using the parameters of the corresponding part in the trained spatial domain feature extraction sub-network as the parameters of the spatial domain feature extraction module, and using the parameters of the corresponding part in the trained time domain feature extraction sub-network as the parameters of the time domain feature extraction module;
fixing the parameters of the spatial feature extraction module and the temporal feature extraction module, and training the video quality evaluation network on the sub-video set;
and learning the optimal parameters of the model by minimizing the mean square error loss between the predicted quality scores and the real quality scores of all the sub-videos, thereby completing the training process of the video quality evaluation network.
9. The method according to claim 8, wherein the fusion attention map A_f is calculated as:

A_f = σ(MLP(F_st^avg) + MLP(F_st^max))

wherein σ denotes the sigmoid function and MLP is the shared multilayer perceptron, each layer of which is followed by a ReLU activation function; and the spatial attention map A_s is calculated as:

A_s = σ(Conv([F'_st^avg ; F'_st^max]))

wherein [· ; ·] denotes channel-wise concatenation and Conv denotes the convolutional layer.
10. The method according to claim 8, wherein converting the spatio-temporal feature map F_fusion to the C-dimensional vector F_v by global pooling specifically comprises: the first C/2 feature maps of F_fusion are subjected to average pooling and standard-deviation pooling respectively, and the two resulting vectors are concatenated into a C-dimensional vector, which is reduced to C/2 dimensions by a fully connected layer to preserve feature balance and denoted F_sv; the last C/2 feature maps of F_fusion are subjected to max pooling to obtain a C/2-dimensional vector denoted F_tv; F_sv and F_tv are then concatenated to obtain the C-dimensional vector F_v.
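The residual image sequence of claim 6 is a first-order temporal difference; for an m-frame sub-video it can be computed as follows (NumPy sketch, grayscale frames assumed for simplicity):

```python
import numpy as np

def residual_sequence(sub_video):
    """RF_{i~j} = F_{i+1~j} - F_{i~j-1}: frame-wise differences of an
    (m, H, W) sub-video, giving an (m-1, H, W) residual sequence."""
    return sub_video[1:] - sub_video[:-1]
```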
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110176125.XA CN112954312B (en) | 2021-02-07 | 2021-02-07 | Non-reference video quality assessment method integrating space-time characteristics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110176125.XA CN112954312B (en) | 2021-02-07 | 2021-02-07 | Non-reference video quality assessment method integrating space-time characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112954312A true CN112954312A (en) | 2021-06-11 |
CN112954312B CN112954312B (en) | 2024-01-05 |
Family
ID=76244601
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110176125.XA Active CN112954312B (en) | 2021-02-07 | 2021-02-07 | Non-reference video quality assessment method integrating space-time characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112954312B (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113473117A (en) * | 2021-07-19 | 2021-10-01 | 上海交通大学 | No-reference audio and video quality evaluation method based on gated recurrent neural network |
CN113554599A (en) * | 2021-06-28 | 2021-10-26 | 杭州电子科技大学 | Video quality evaluation method based on human visual effect |
CN113642513A (en) * | 2021-08-30 | 2021-11-12 | 东南大学 | Action quality evaluation method based on self-attention and label distribution learning |
CN113784113A (en) * | 2021-08-27 | 2021-12-10 | 中国传媒大学 | No-reference video quality evaluation method based on short-term and long-term time-space fusion network and long-term sequence fusion network |
CN113810683A (en) * | 2021-08-27 | 2021-12-17 | 南京信息工程大学 | No-reference evaluation method for objectively evaluating underwater video quality |
CN113822856A (en) * | 2021-08-16 | 2021-12-21 | 南京中科逆熵科技有限公司 | End-to-end no-reference video quality evaluation method based on layered time-space domain feature representation |
CN113837047A (en) * | 2021-09-16 | 2021-12-24 | 广州大学 | Video quality evaluation method, system, computer equipment and storage medium |
CN114697648A (en) * | 2022-04-25 | 2022-07-01 | 上海为旌科技有限公司 | Frame rate variable video non-reference evaluation method and system, electronic device and storage medium |
CN115278303A (en) * | 2022-07-29 | 2022-11-01 | 腾讯科技(深圳)有限公司 | Video processing method, apparatus, device and medium |
WO2024041268A1 (en) * | 2022-08-24 | 2024-02-29 | 腾讯科技(深圳)有限公司 | Video quality assessment method and apparatus, and computer device, computer storage medium and computer program product |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104023227A (en) * | 2014-05-28 | 2014-09-03 | 宁波大学 | Objective video quality evaluation method based on space domain and time domain structural similarities |
US20160330439A1 (en) * | 2016-05-27 | 2016-11-10 | Ningbo University | Video quality objective assessment method based on spatiotemporal domain structure |
US20180189572A1 (en) * | 2016-12-30 | 2018-07-05 | Mitsubishi Electric Research Laboratories, Inc. | Method and System for Multi-Modal Fusion Model |
CN109325435A (en) * | 2018-09-15 | 2019-02-12 | 天津大学 | Video actions identification and location algorithm based on cascade neural network |
CN110135369A (en) * | 2019-05-20 | 2019-08-16 | 威创集团股份有限公司 | A kind of Activity recognition method, system, equipment and computer readable storage medium |
CN110837842A (en) * | 2019-09-12 | 2020-02-25 | 腾讯科技(深圳)有限公司 | Video quality evaluation method, model training method and model training device |
CN111784694A (en) * | 2020-08-20 | 2020-10-16 | 中国传媒大学 | No-reference video quality evaluation method based on visual attention mechanism |
WO2020221278A1 (en) * | 2019-04-29 | 2020-11-05 | 北京金山云网络技术有限公司 | Video classification method and model training method and apparatus thereof, and electronic device |
CN112085102A (en) * | 2020-09-10 | 2020-12-15 | 西安电子科技大学 | No-reference video quality evaluation method based on three-dimensional space-time characteristic decomposition |
Non-Patent Citations (1)
Title |
---|
ZHU Ze, SANG Qingbing, ZHANG Hao: "No-reference video quality assessment based on spatio-temporal features and attention mechanism", Laser & Optoelectronics Progress, vol. 57, no. 18, pages 181509-1 *
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113554599A (en) * | 2021-06-28 | 2021-10-26 | 杭州电子科技大学 | Video quality evaluation method based on human visual effect |
CN113554599B (en) * | 2021-06-28 | 2023-08-18 | 杭州电子科技大学 | Video quality evaluation method based on human visual effect |
CN113473117A (en) * | 2021-07-19 | 2021-10-01 | 上海交通大学 | No-reference audio and video quality evaluation method based on gated recurrent neural network |
CN113473117B (en) * | 2021-07-19 | 2022-09-02 | 上海交通大学 | Non-reference audio and video quality evaluation method based on gated recurrent neural network |
CN113822856A (en) * | 2021-08-16 | 2021-12-21 | 南京中科逆熵科技有限公司 | End-to-end no-reference video quality evaluation method based on layered time-space domain feature representation |
CN113784113A (en) * | 2021-08-27 | 2021-12-10 | 中国传媒大学 | No-reference video quality evaluation method based on short-term and long-term time-space fusion network and long-term sequence fusion network |
CN113810683A (en) * | 2021-08-27 | 2021-12-17 | 南京信息工程大学 | No-reference evaluation method for objectively evaluating underwater video quality |
CN113642513B (en) * | 2021-08-30 | 2022-11-18 | 东南大学 | Action quality evaluation method based on self-attention and label distribution learning |
CN113642513A (en) * | 2021-08-30 | 2021-11-12 | 东南大学 | Action quality evaluation method based on self-attention and label distribution learning |
CN113837047A (en) * | 2021-09-16 | 2021-12-24 | 广州大学 | Video quality evaluation method, system, computer equipment and storage medium |
CN114697648A (en) * | 2022-04-25 | 2022-07-01 | 上海为旌科技有限公司 | Frame rate variable video non-reference evaluation method and system, electronic device and storage medium |
CN114697648B (en) * | 2022-04-25 | 2023-12-08 | 上海为旌科技有限公司 | Variable frame rate video non-reference evaluation method, system, electronic equipment and storage medium |
CN115278303A (en) * | 2022-07-29 | 2022-11-01 | 腾讯科技(深圳)有限公司 | Video processing method, apparatus, device and medium |
CN115278303B (en) * | 2022-07-29 | 2024-04-19 | 腾讯科技(深圳)有限公司 | Video processing method, device, equipment and medium |
WO2024041268A1 (en) * | 2022-08-24 | 2024-02-29 | 腾讯科技(深圳)有限公司 | Video quality assessment method and apparatus, and computer device, computer storage medium and computer program product |
Also Published As
Publication number | Publication date |
---|---|
CN112954312B (en) | 2024-01-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112954312B (en) | Non-reference video quality assessment method integrating space-time characteristics | |
Sun et al. | MC360IQA: A multi-channel CNN for blind 360-degree image quality assessment | |
CN109544524B (en) | Attention mechanism-based multi-attribute image aesthetic evaluation system | |
CN110751649B (en) | Video quality evaluation method and device, electronic equipment and storage medium | |
Zhu et al. | No-reference video quality assessment based on artifact measurement and statistical analysis | |
Moorthy et al. | Visual quality assessment algorithms: what does the future hold? | |
Sun et al. | Deep learning based full-reference and no-reference quality assessment models for compressed ugc videos | |
Zhang et al. | Fine-grained quality assessment for compressed images | |
CN112995652B (en) | Video quality evaluation method and device | |
CN111047543A (en) | Image enhancement method, device and storage medium | |
Prabhushankar et al. | Ms-unique: Multi-model and sharpness-weighted unsupervised image quality estimation | |
Xu et al. | Perceptual quality assessment of internet videos | |
Siahaan et al. | Semantic-aware blind image quality assessment | |
Shen et al. | An end-to-end no-reference video quality assessment method with hierarchical spatiotemporal feature representation | |
Sinno et al. | Spatio-temporal measures of naturalness | |
Antsiferova et al. | Video compression dataset and benchmark of learning-based video-quality metrics | |
CN116703857A (en) | Video action quality evaluation method based on time-space domain sensing | |
Wang | A survey on IQA | |
Chen et al. | GAMIVAL: Video quality prediction on mobile cloud gaming content | |
Xian et al. | A content-oriented no-reference perceptual video quality assessment method for computer graphics animation videos | |
Da et al. | Perceptual quality assessment of nighttime video | |
Jenadeleh | Blind Image and Video Quality Assessment | |
CN112380395A (en) | Method and system for obtaining emotion of graph convolution network based on double-flow architecture and storage medium | |
Nyman et al. | Evaluation of the visual performance of image processing pipes: information value of subjective image attributes | |
Qiu et al. | Blind 360-degree image quality assessment via saliency-guided convolution neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||