CN115511858A - Video quality evaluation method based on novel time sequence characteristic relation mapping - Google Patents

Video quality evaluation method based on novel time sequence characteristic relation mapping

Info

Publication number
CN115511858A
CN115511858A (application number CN202211230036.XA)
Authority
CN
China
Prior art keywords
video
network
feature
time sequence
quality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211230036.XA
Other languages
Chinese (zh)
Inventor
毛钰
郑博仑
颜成钢
孙垚棋
高宇涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202211230036.XA
Publication of CN115511858A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/98 Detection or correction of errors, e.g. by rescanning the pattern or by human intervention; Evaluation of the quality of the acquired patterns
    • G06V10/993 Evaluation of the quality of the acquired pattern
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30168 Image quality inspection

Abstract

The invention discloses a video quality evaluation method based on a novel time-sequence feature relation mapping. A pre-trained feature extraction network extracts frame-level features, and a Bi-LSTM network then captures the long-term dependence between the content-aware features of the video sequence context and frame-level quality. The hidden states output by the network are combined with the frame-level features extracted by the pre-trained network to construct a novel time-sequence feature relation map, so that the long-term and short-term temporal relations between adjacent frames and interval frames of the video sequence are fully exploited. The bidirectional long short-term memory neural network used for temporal modeling fuses the content-aware features of the video sequence more effectively in the time dimension, while the constructed novel time-sequence feature relation map captures the short-term changes of the video's temporal information more effectively, providing rich temporal information for the subsequent quality prediction task.

Description

Video quality evaluation method based on novel time sequence characteristic relation mapping
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a video quality evaluation method based on novel time sequence characteristic relation mapping.
Background Art
In recent years, mobile communication technology has developed rapidly and digital devices (such as mobile phones and tablet computers) have proliferated. Video, as a main form of content delivery, covers many fields such as consumption, medical care and education, and accounts for more than 80% of Internet traffic. However, distortion and degradation are difficult to avoid during video acquisition, compression, transmission, storage and playback, and they affect the viewing experience of audiences and even the understanding of video semantics. To handle video traffic effectively, to guide subsequent processing such as further compression and enhancement, and to trade off bandwidth against video quality, designing a reasonable and reliable Video Quality Assessment (VQA) method has become a research hotspot in the field of computer vision.
Video quality evaluation methods can be roughly divided into two categories: subjective evaluation and objective evaluation. Subjective evaluation, which relies on human assessment of the distorted video, is the most reliable; the average of the opinion scores collected from test subjects is called the Mean Opinion Score (MOS). However, building a large-scale data set with subjective quality assessment consumes a large amount of resources (manpower, materials, etc.). Moreover, subjective evaluation cannot meet the need for real-time assessment of video quality in practical applications such as real-time quality monitoring on live-streaming websites. It is therefore important to study objective video quality evaluation methods, which have great practical value and consume far fewer resources. Among them, no-reference video quality assessment (NR-VQA) models only need distorted videos to train and evaluate the objective model, offer higher flexibility and real-time performance, and have received continuous attention and research in recent years.
Due to the lack of the original reference video, conventional no-reference evaluation methods were generally used early on to evaluate videos distorted by compression. Natural Scene Statistics (NSS) methods then built statistical models over the distributions of transform-domain coefficients, extracted relevant feature information from them, and completed video quality prediction with machine-learning algorithms. However, because of the diversity of noise and the complex content of no-reference videos, these methods cannot represent the various distortions effectively and correctly, which limits model performance. Since deep learning can extract high-level semantic features, researchers have in recent years introduced deep-learning methods into the video quality evaluation task, and the prediction accuracy of no-reference video quality scores has improved continuously.
Because most video sequences in existing data sets are shorter than 20 seconds, simply using a recurrent neural network (such as an LSTM or a GRU) cannot capture the short-range changes in the temporal information of a video well: the recurrent network emphasizes the association between adjacent frames and weakens the temporal relationship between interval frames. If some adjacent frames are unrelated to the video theme, those frames may be unimportant for determining the final video quality, and the information carried by the interval frames becomes more important.
In summary, existing research on video quality evaluation either weakens the influence of the video's temporal information on video quality or confuses the video's information representation in the temporal and spatial domains, and therefore cannot accurately simulate the human visual quality perception process.
Disclosure of Invention
The invention aims to provide a video quality evaluation method based on a novel time-sequence feature relation mapping that addresses the defects of existing quality evaluation methods. The invention makes full use of the temporal features of the video sequence.
The technical scheme adopted for solving the problems is as follows:
step 1, extracting content perception features;
the invention utilizes a pre-trained convolutional neural network as a content perception feature extraction network. The involved frame-level video content-aware feature extractor is the ResNet-50 model pre-trained on ImageNet. The content-aware feature extraction network comprises a pre-trained ResNet-50 model, a spatial global average pooling layer and a global standard deviation pooling layer.
1-1. Input to the content-aware feature extraction network: all frames of a video are used as the input of the convolutional neural network ResNet-50, features are extracted for each video frame, and N feature maps M_t are output:

M_t = CNN(I_t)    (1)

where t is the frame index, t = 1, 2, 3, ..., N, and N is the total number of frames of the video; I_t denotes the image of the t-th frame of the video; M_t denotes the feature map corresponding to the t-th frame of the video.
1-2. Spatial pooling operations are used to retain the more effective information: specifically, a global average pooling operation removes redundant information between different frames, and a global standard deviation pooling operation preserves the variation information between different frames, yielding the feature vectors f_t^mean and f_t^std respectively. Finally, the feature vectors f_t^mean and f_t^std are aggregated to form the content-aware feature f_t, calculated as follows:

f_t^mean = GP_mean(M_t)    (2)

f_t^std = GP_std(M_t)    (3)

f_t = f_t^mean ⊕ f_t^std    (4)

where GP_mean(·) denotes the spatial global average pooling operation, GP_std(·) denotes the global standard deviation pooling operation, f_t^mean and f_t^std are the feature vectors obtained through the global average pooling and global standard deviation pooling operations respectively, ⊕ denotes the concatenation of two vectors, and f_t denotes the final content-aware feature extracted from a single video frame.
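By way of illustration only, the following is a minimal PyTorch/torchvision sketch of this content-aware feature extraction stage (equations (1)-(4)); the function name extract_content_features and the assumption that frames are already preprocessed are illustrative and not taken from the patent.

import torch
import torchvision.models as models

# ResNet-50 pre-trained on ImageNet, truncated before the average-pool and
# classification head so that it outputs the frame-level feature maps M_t.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()

@torch.no_grad()
def extract_content_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (N, 3, H, W) preprocessed video frames -> (N, 4096) features f_t."""
    maps = feature_extractor(frames)            # M_t = CNN(I_t), shape (N, 2048, h, w)
    f_mean = maps.mean(dim=(2, 3))              # spatial global average pooling
    f_std = maps.std(dim=(2, 3))                # spatial global standard deviation pooling
    return torch.cat((f_mean, f_std), dim=1)    # f_t = f_mean concatenated with f_std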
Step 2, time sequence characteristic fusion:
The extracted frame-level content-aware features f_t are fed into a bidirectional long short-term memory (Bi-LSTM) neural network, and the output features of the network are then used, taking the features of every five consecutive video frames as a group, to construct a brand-new mapping that fuses the features of the current frame, the previous two frames and the following two frames. The structure of the new feature map is shown in FIG. 2, and it is implemented as follows:
2-1. Because the dimensionality of the extracted single-frame video features is too high, which is not conducive to subsequent model training, the extracted content-aware feature f_t of each frame is first passed through a fully-connected layer for dimension reduction, yielding a new feature vector X_t:

X_t = W_fx f_t + b_fx    (5)

where b_fx and W_fx denote the bias and the weights of the single fully-connected layer respectively, and f_t denotes the final content-aware feature extracted from a single video frame.
2-2. The frame-level feature vectors X_t are fed into a Bi-LSTM network to capture the long-term dependence between the content-aware features of the video sequence context and frame-level quality. The hidden size of the single-layer network unit is 128, the convolution kernel is 1 × 128, and the initial hidden state of the bidirectional long short-term network is H_0. The hidden state H_t of the bidirectional long short-term network at the current time is computed from the input feature vector X_t at the current time and the hidden state H_(t-1) of the network at the previous time, specifically as follows:

H_t^a = a(X_t, H_(t-1)^a)    (6)

H_t^a' = a'(X_t, H_(t-1)^a')    (7)

where X_t denotes the frame-level feature vector, a and a' are the two network units of the Bi-LSTM, H_t^a denotes the current hidden state of the unidirectional network a of the bidirectional long short-term network, H_(t-1)^a denotes the hidden state of that network at the previous time, H_t^a' denotes the current hidden state of the unidirectional network a', and H_(t-1)^a' denotes the hidden state of a' at the previous time.
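As an illustration of steps 2-1 and 2-2, the following PyTorch sketch applies the dimension-reducing fully-connected layer and a single-layer Bi-LSTM; the class name TemporalEncoder and the assumption that the fully-connected layer reduces the 4096-dimensional feature to the 128-dimensional hidden size quoted above are not from the patent.

import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Dimension reduction (eq. (5)) followed by a single-layer Bi-LSTM (eqs. (6)-(7))."""
    def __init__(self, in_dim: int = 4096, hidden: int = 128):
        super().__init__()
        self.reduce = nn.Linear(in_dim, hidden)                 # X_t = W_fx f_t + b_fx
        self.bilstm = nn.LSTM(hidden, hidden, num_layers=1,
                              batch_first=True, bidirectional=True)

    def forward(self, f: torch.Tensor):
        # f: (B, N, in_dim) frame-level content-aware features f_t
        x = self.reduce(f)                                      # (B, N, 128)
        h, _ = self.bilstm(x)                                   # (B, N, 256)
        h_fwd, h_bwd = h.chunk(2, dim=-1)                       # hidden states of units a and a'
        return x, h_fwd, h_bwd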
2-3. A new time-sequence feature map is constructed from the output of the bidirectional long short-term memory network.

Because the input video sequence contains many frames, constructing a single time-sequence feature map containing all the video frame features would require too much computation, so the invention divides the sequence into n groups of five consecutive frames. For each group, a 5 × 5 time-sequence feature mapping matrix is constructed, using the five consecutive frame-level feature vectors X_t contained in the group and the hidden states of the corresponding bidirectional long short-term network as its target elements. The structure of the time-sequence feature mapping matrix is shown in FIG. 2.

Further, the specific steps of constructing the time-sequence feature mapping matrix are as follows:

Input: frame-level video features X_t, hidden states H^a and H^a' of the bidirectional long short-term network.

Output: n feature mapping matrices of dimension 5 × 5.

(1) Divide the frame-level video features X_t into n groups;

(2) Construct n feature mapping matrices of dimension 5 × 5, i.e. with row count i = 5 and column count j = 5;

Divide the frame-level features X_t into n groups, construct the feature mapping matrix contained in each group, and enter the following loop:

1) The first-row elements are the frame-level features of the 5 consecutive video frames contained in the current group, i.e. I_1j = X_j;

2) Starting from the second row of the matrix, the hidden states H^a and H^a' of the bidirectional long short-term network are introduced; the corresponding position elements are selected by the cyclic process of equations (8)-(9), illustrated in FIG. 2;

3) Return the n groups of feature mapping matrices.

where i denotes the row of the new time-sequence feature mapping matrix, j denotes the column of the matrix, and I_ij is the element of the matrix at that position.
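The sketch below illustrates step 2-3 under stated assumptions: the first row holds the five frame-level features X_t of the group as described above, but the exact rule of equations (8)-(9) that places the Bi-LSTM hidden states in rows 2 to 5 is given only in FIG. 2, so the alternating fill used here is a placeholder choice rather than the patented rule.

import torch

def build_mapping_matrices(x, h_fwd, h_bwd, group: int = 5):
    """x, h_fwd, h_bwd: (N, 128) -> (n, 5, 5, 128) time-sequence feature mapping matrices."""
    n = x.shape[0] // group                       # number of five-frame groups
    matrices = []
    for g in range(n):
        sl = slice(g * group, (g + 1) * group)
        rows = [x[sl]]                            # row 1: I_1j = X_j
        for i in range(2, group + 1):             # rows 2..5: Bi-LSTM hidden states
            src = h_fwd if i % 2 == 0 else h_bwd  # assumed alternation; see eqs. (8)-(9)
            rows.append(src[sl])
        matrices.append(torch.stack(rows))        # one (5, 5, 128) matrix per group
    return torch.stack(matrices)                  # (n, 5, 5, 128)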
2-4. Feature aggregation is performed on each generated group of time-sequence feature mapping matrices:

x_i = conv3(conv2(conv1(map_i)))    (10)

The feature aggregation unit consists of three two-dimensional convolution layers, where map_i denotes the time-sequence feature mapping matrix of the i-th group and x_i is the output feature of the i-th time-sequence feature mapping matrix. The first convolution layer conv1 has a 5 × 5 kernel, a stride of 1 × 1, 128 convolution kernels and 128 output channels; the second convolution layer conv2 has a 3 × 3 kernel, a stride of 1 × 1, 128 convolution kernels and 128 output channels; the third convolution layer conv3 has a 3 × 3 kernel, a stride of 1 × 1, 128 convolution kernels and 128 output channels.
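A minimal PyTorch sketch of the feature-aggregation unit of equation (10) follows; the kernel sizes, strides and channel counts match the text above, while the padding (not specified in the text) is assumed to be 'same' so that the two 3 × 3 layers still receive a 5 × 5 input.

import torch.nn as nn

feature_aggregation = nn.Sequential(
    nn.Conv2d(128, 128, kernel_size=5, stride=1, padding=2),  # conv1: 5 x 5 kernel
    nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1),  # conv2: 3 x 3 kernel
    nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1),  # conv3: 3 x 3 kernel
)

# map_i from step 2-3 has shape (n, 5, 5, 128); Conv2d expects channels first,
# so x_i = feature_aggregation(map_i.permute(0, 3, 1, 2)) has shape (n, 128, 5, 5).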
Step 3, quality regression. The specific method is as follows:

After the features of the i groups have been aggregated, the aggregated group features need to be mapped onto the quality score of the video by a regression model. The regression model used in the invention is a multi-layer perceptron comprising two fully-connected layers, with 128 neurons and 1 neuron respectively, which regresses the aggregated group features to quality scores, specifically:

q_i = f_wFC(x_i)    (11)

where f_wFC is the function formed by the two fully-connected layers and q_i denotes the regressed quality score of the i-th group.
Step 4, quality pooling. The specific method is as follows:

The quality pooling module contains a temporal average pooling model, which averagely pools the results predicted by the quality regression module for all feature groups and outputs the quality score Q of the entire video, specifically:

Q = (1/n) Σ_i q_i    (12)

where q_i is the quality score of each group and Q is the final quality score of the entire video.
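For illustration, the following sketch combines the quality-regression multi-layer perceptron of step 3 with the temporal average pooling of step 4; the two layer widths (128 neurons and 1 neuron) follow the text, while the flattening of the aggregated group feature and the ReLU activation are assumptions.

import torch
import torch.nn as nn

quality_regressor = nn.Sequential(
    nn.Flatten(),                   # (n, 128, 5, 5) -> (n, 3200); flattening is assumed
    nn.Linear(128 * 5 * 5, 128),    # first fully-connected layer: 128 neurons
    nn.ReLU(),                      # activation not specified in the text; assumed
    nn.Linear(128, 1),              # second fully-connected layer: 1 neuron
)

def pool_video_quality(aggregated_groups: torch.Tensor) -> torch.Tensor:
    q = quality_regressor(aggregated_groups).squeeze(-1)  # q_i, one score per group (eq. (11))
    return q.mean()                                       # Q = (1/n) * sum_i q_i (eq. (12))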
Step 5, jointly training the content-aware feature extraction network, the temporal feature fusion module, the quality regression module and the quality pooling model. The specific method is as follows:

The overall model is trained with the Adaptive Moment Estimation (Adam) optimizer, with a weight decay of 0 and an initial learning rate of 10^-5 that then decays by 80% every 200 epochs. The model weights are initialized with the pre-trained ResNet-50 network, and the L1 loss is used during training, specifically:

L(w, b) = Σ_i |ŷ_i - y_i|    (13)

where w denotes the model weights, b denotes the bias parameters during training, ŷ_i is the actual quality score of the video, and y_i is the quality score predicted by the quality evaluation model of the invention.
The invention has the following beneficial effects:
the invention uses the pre-trained feature extraction network to extract the features at the frame level, then utilizes the Bi-LSTM network to capture the long-term dependence relationship between the content perception features of the context information of the video sequence and the frame level quality, and combines the hidden state output by the network and the features extracted at the frame level by the pre-trained feature extraction network to construct a novel time sequence feature relationship diagram, thereby fully utilizing the long-term and short-term time sequence relationship between the adjacent frames and the interval frames of the video sequence. Compared with the method that the time sequence relation modeling is directly carried out on the extracted frame level characteristics by using the unidirectional circulation neural network, the content perception characteristics of the video sequence are more effectively fused in the time dimension by using the bidirectional long-short term memory neural network to carry out the time sequence modeling, and meanwhile, the time sequence information change of the video in a short term is more effectively captured by the constructed novel time sequence characteristic relation graph, so that abundant time sequence information is provided for the development of a subsequent quality prediction task.
Drawings
FIG. 1 is a block diagram of the method of the present invention;
FIG. 2 is a diagram of the novel time-sequence feature map according to the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
The method comprises a pre-trained convolutional neural network, a bidirectional long short-term network, a quality regression model and a video quality pooling model. Assuming a video contains N frames, the complete video sequence is input into the overall model. First, the pre-trained convolutional neural network extracts the content-aware features of each frame, and global pooling operations process these features and retain the effective information. A Bi-LSTM network then captures the long-term dependence between the content-aware features of the video sequence context and frame-level quality. The hidden states output by the network are combined with the frame-level features extracted by the pre-trained feature extraction network to construct the novel time-sequence feature relation maps; a feature aggregation unit extracts the output features of each time-sequence feature relation map, which are fed into the quality regression model to obtain the quality score of each group; finally, a temporal pooling module produces the quality score of the entire video.
As shown in fig. 1, the method of the present invention is as follows:
step 1, extracting content perception features, wherein a pre-trained convolutional neural network is used as a content perception feature extraction network. The involved frame-level video content-aware feature extractor is the ResNet-50 model pre-trained on ImageNet. The content-aware feature extraction network comprises a pre-trained ResNet-50 model, a spatial global average pooling layer and a global standard deviation pooling layer.
1-1. Input to the content-aware feature extraction network: all frames of a video are used as the input of the convolutional neural network ResNet-50, features are extracted for each video frame, and N feature maps M_t are output:

M_t = CNN(I_t)    (1)

where t is the frame index, t = 1, 2, 3, ..., N, and N is the total number of frames of the video; I_t denotes the image of the t-th frame of the video; M_t denotes the feature map corresponding to the t-th frame of the video.
1-2. Spatial pooling operations are used to retain the more effective information: specifically, a global average pooling operation removes redundant information between different frames, and a global standard deviation pooling operation preserves the variation information between different frames, yielding the feature vectors f_t^mean and f_t^std respectively. Finally, the feature vectors f_t^mean and f_t^std are aggregated to form the content-aware feature f_t, calculated as follows:

f_t^mean = GP_mean(M_t)    (2)

f_t^std = GP_std(M_t)    (3)

f_t = f_t^mean ⊕ f_t^std    (4)

where GP_mean(·) denotes the spatial global average pooling operation, GP_std(·) denotes the global standard deviation pooling operation, f_t^mean and f_t^std are the feature vectors obtained through the global average pooling and global standard deviation pooling operations respectively, ⊕ denotes the concatenation of two vectors, and f_t denotes the final content-aware feature extracted from a single video frame.
Step 2, temporal feature fusion. The specific method is as follows:

The extracted frame-level content-aware features are fed into a bidirectional long short-term memory (Bi-LSTM) neural network, and the output features of the network are used, taking the features of every five consecutive video frames as a group, to construct a brand-new mapping that fuses the features of the current frame, the previous two frames and the following two frames. The new feature map is shown in FIG. 2.
2-1. Because the dimensionality of the extracted single-frame video features is too high, which is not conducive to subsequent model training, the extracted content-aware feature f_t of each frame is first passed through a fully-connected layer for dimension reduction, yielding a new feature vector X_t:

X_t = W_fx f_t + b_fx    (5)

where b_fx and W_fx denote the bias and the weights of the single fully-connected layer respectively, and f_t denotes the final content-aware feature extracted from a single video frame.
2-2. The frame-level feature vectors X_t are fed into a Bi-LSTM network to capture the long-term dependence between the content-aware features of the video sequence context and frame-level quality. The hidden size of the single-layer network unit is 128, the convolution kernel is 1 × 128, and the initial hidden state of the bidirectional long short-term network is H_0. The hidden state H_t of the bidirectional long short-term network at the current time is computed from the input feature vector X_t at the current time and the hidden state H_(t-1) of the network at the previous time, specifically as follows:

H_t^a = a(X_t, H_(t-1)^a)    (6)

H_t^a' = a'(X_t, H_(t-1)^a')    (7)

where X_t denotes the frame-level feature vector, a and a' are the two network units of the Bi-LSTM, H_t^a denotes the current hidden state of the unidirectional network a of the bidirectional long short-term network, H_(t-1)^a denotes the hidden state of that network at the previous time, H_t^a' denotes the current hidden state of the unidirectional network a', and H_(t-1)^a' denotes the hidden state of a' at the previous time.
2-3. A new time-sequence feature map is constructed from the output of the bidirectional long short-term memory network.

Because the input video sequence contains many frames, constructing a single time-sequence feature map containing all the video frame features would require too much computation, so the invention divides the sequence into n groups of five consecutive frames. For each group, a 5 × 5 time-sequence feature mapping matrix is constructed, using the five consecutive frame-level feature vectors X_t contained in the group and the hidden states of the corresponding bidirectional long short-term network as its target elements. The structure of the time-sequence feature mapping matrix is shown in FIG. 2.

Further, the specific steps of constructing the time-sequence feature mapping matrix are as follows:

Input: frame-level video features X_t, hidden states H^a and H^a' of the bidirectional long short-term network.

Output: n feature mapping matrices of dimension 5 × 5.

(1) Divide the frame-level video features X_t into n groups;

(2) Construct n feature mapping matrices of dimension 5 × 5, i.e. with row count i = 5 and column count j = 5;

Divide the frame-level features X_t into n groups, construct the feature mapping matrix contained in each group, and enter the following loop:

1) The first-row elements are the frame-level features of the 5 consecutive video frames contained in the current group, i.e. I_1j = X_j;

2) Starting from the second row of the matrix, the hidden states H^a and H^a' of the bidirectional long short-term network are introduced; the corresponding position elements are selected by the cyclic process of equations (8)-(9), illustrated in FIG. 2;

3) Return the n groups of feature mapping matrices.

where i denotes the row of the new time-sequence feature mapping matrix, j denotes the column of the matrix, and I_ij is the element of the matrix at that position.
2-4. Feature aggregation is performed on each generated group of time-sequence feature mapping matrices:

x_i = conv3(conv2(conv1(map_i)))    (10)

The feature aggregation unit consists of three two-dimensional convolution layers, where map_i denotes the time-sequence feature mapping matrix of the i-th group and x_i is the output feature of the i-th time-sequence feature mapping matrix. The first convolution layer conv1 has a 5 × 5 kernel, a stride of 1 × 1, 128 convolution kernels and 128 output channels; the second convolution layer conv2 has a 3 × 3 kernel, a stride of 1 × 1, 128 convolution kernels and 128 output channels; the third convolution layer conv3 has a 3 × 3 kernel, a stride of 1 × 1, 128 convolution kernels and 128 output channels.
Step 3, quality regression. The specific method is as follows:

After the features of the i groups have been aggregated, the aggregated group features need to be mapped onto the quality score of the video by a regression model. The regression model used in the invention is a multi-layer perceptron comprising two fully-connected layers, with 128 neurons and 1 neuron respectively, which regresses the aggregated group features to quality scores, specifically:

q_i = f_wFC(x_i)    (11)

where f_wFC is the function formed by the two fully-connected layers and q_i denotes the regressed quality score of the i-th group.
Step 4, quality pooling. The specific method is as follows:

The quality pooling module contains a temporal average pooling model, which averagely pools the results predicted by the quality regression module for all feature groups and outputs the quality score Q of the entire video, specifically:

Q = (1/n) Σ_i q_i    (12)

where q_i is the quality score of each group and Q is the final quality score of the entire video.
Step 5, jointly training the content-aware feature extraction network, the temporal feature fusion module, the quality regression module and the quality pooling model. The specific method is as follows:

The overall model is trained with the Adaptive Moment Estimation (Adam) optimizer, with a weight decay of 0 and an initial learning rate of 10^-5 that then decays by 80% every 200 epochs. The model weights are initialized with the pre-trained ResNet-50 network, and the L1 loss is used during training, specifically:

L(w, b) = Σ_i |ŷ_i - y_i|    (13)

where w denotes the model weights, b denotes the bias parameters during training, ŷ_i is the actual quality score of the video, and y_i is the quality score predicted by the quality evaluation model of the invention.

Claims (8)

1. A video quality evaluation method based on novel time-sequence feature relation mapping, characterized by comprising the following steps:
Step 1, extracting content-aware features;
Step 2, temporal feature fusion;
Step 3, quality regression;
Step 4, quality pooling;
Step 5, jointly training the content-aware feature extraction network, the temporal feature fusion module, the quality regression module and the quality pooling model.
2. The video quality evaluation method based on novel time-sequence feature relation mapping according to claim 1, characterized in that step 1 is implemented as follows:
a pre-trained convolutional neural network is used as the content-aware feature extraction network, which comprises a pre-trained ResNet-50 model, a spatial global average pooling layer and a global standard deviation pooling layer;
1-1. all frames of a video are used as the input of the convolutional neural network ResNet-50, features are extracted for each video frame, and N feature maps M_t are output:
M_t = CNN(I_t)    (1)
where t is the frame index, t = 1, 2, 3, ..., N, and N is the total number of frames of the video; I_t denotes the image of the t-th frame of the video; M_t denotes the feature map corresponding to the t-th frame of the video;
1-2. spatial pooling operations are used to retain the more effective information: specifically, a global average pooling operation removes redundant information between different frames and a global standard deviation pooling operation preserves the variation information between different frames, yielding the feature vectors f_t^mean and f_t^std respectively; finally, the feature vectors f_t^mean and f_t^std are aggregated to form the content-aware feature f_t, calculated as follows:
f_t^mean = GP_mean(M_t)    (2)
f_t^std = GP_std(M_t)    (3)
f_t = f_t^mean ⊕ f_t^std    (4)
where GP_mean(·) denotes the spatial global average pooling operation, GP_std(·) denotes the global standard deviation pooling operation, f_t^mean and f_t^std are the feature vectors obtained through the global average pooling and global standard deviation pooling operations respectively, ⊕ denotes the concatenation of two vectors, and f_t denotes the final content-aware feature extracted from a single video frame.
3. The video quality evaluation method based on novel time-sequence feature relation mapping according to claim 2, characterized in that step 2 is implemented as follows:
the extracted frame-level content-aware features f_t are fed into a bidirectional long short-term memory neural network, and the output features of the network are used, taking the features of every five consecutive frames as a group, to construct a brand-new mapping that fuses the features of the current frame, the previous two frames and the following two frames.
4. The video quality evaluation method based on novel time-sequence feature relation mapping according to claim 3, characterized in that the novel feature mapping is specifically realized as follows:
2-1. the extracted content-aware feature f_t of each frame is first passed through a fully-connected layer for dimension reduction, yielding a new feature vector X_t:
X_t = W_fx f_t + b_fx    (5)
where b_fx and W_fx denote the bias and the weights of the single fully-connected layer respectively, and f_t denotes the final content-aware feature extracted from a single video frame;
2-2. the frame-level feature vectors X_t are fed into a Bi-LSTM network to capture the long-term dependence between the content-aware features of the video sequence context and frame-level quality; the hidden size of the single-layer network unit is set to 128, the convolution kernel is 1 × 128, and the initial hidden state of the bidirectional long short-term network is set to H_0; the hidden state H_t of the bidirectional long short-term network at the current time is computed from the input feature vector X_t at the current time and the hidden state H_(t-1) of the network at the previous time, specifically as follows:
H_t^a = a(X_t, H_(t-1)^a)    (6)
H_t^a' = a'(X_t, H_(t-1)^a')    (7)
where X_t denotes the frame-level feature vector, a and a' are the two network units of the Bi-LSTM, H_t^a denotes the current hidden state of the unidirectional network a of the bidirectional long short-term network, H_(t-1)^a denotes the hidden state of that network at the previous time, H_t^a' denotes the current hidden state of the unidirectional network a', and H_(t-1)^a' denotes the hidden state of a' at the previous time;
2-3. a new time-sequence feature map is constructed from the output of the bidirectional long short-term memory network: the sequence is divided into n groups of five consecutive frames, and for each group a 5 × 5 time-sequence feature mapping matrix is constructed, using the five consecutive frame-level feature vectors X_t contained in the group and the hidden states of the corresponding bidirectional long short-term network as its target elements;
2-4. feature aggregation is performed on each generated group of time-sequence feature mapping matrices.
5. The method according to claim 4, characterized in that the time-sequence feature mapping matrix is constructed by the following steps:
Input: frame-level video features X_t, hidden states H^a and H^a' of the bidirectional long short-term network;
Output: n feature mapping matrices of dimension 5 × 5;
(1) divide the frame-level video features X_t into n groups;
(2) construct n feature mapping matrices of dimension 5 × 5, i.e. with row count i = 5 and column count j = 5;
divide the frame-level features X_t into n groups, construct the feature mapping matrix contained in each group, and enter the following loop:
1) the first-row elements are the frame-level features of the 5 consecutive video frames contained in the current group, i.e. I_1j = X_j;
2) starting from the second row of the matrix, the hidden states H^a and H^a' of the bidirectional long short-term network are introduced; the corresponding position elements are selected by the cyclic process of equations (8)-(9), illustrated in FIG. 2;
3) return the n groups of feature mapping matrices;
where i denotes the row of the new time-sequence feature mapping matrix, j denotes the column of the matrix, and I_ij is the element of the matrix at that position.
6. The method according to claim 4, characterized in that the feature aggregation unit consists of three two-dimensional convolution layers, with the specific formula:
x_i = conv3(conv2(conv1(map_i)))    (10)
where map_i denotes the time-sequence feature mapping matrix of the i-th group and x_i is the output feature of the i-th time-sequence feature mapping matrix; the first convolution layer conv1 has a 5 × 5 kernel, a stride of 1 × 1, 128 convolution kernels and 128 output channels; the second convolution layer conv2 has a 3 × 3 kernel, a stride of 1 × 1, 128 convolution kernels and 128 output channels; the third convolution layer conv3 has a 3 × 3 kernel, a stride of 1 × 1, 128 convolution kernels and 128 output channels.
7. The video quality evaluation method based on novel time-sequence feature relation mapping according to claim 4, characterized in that the quality regression in step 3 is specifically as follows:
after the features of the i groups have been aggregated, the aggregated group features are mapped onto the quality score of the video by a regression model, which regresses the aggregated features of each group to a quality score:
q_i = f_wFC(x_i)    (11)
where f_wFC is the function formed by the two fully-connected layers and q_i denotes the regressed quality score of the i-th group.
8. The video quality evaluation method based on novel time-sequence feature relation mapping according to claim 4, characterized in that the quality pooling in step 4 is specifically as follows:
the results predicted by the quality regression module for all feature groups are averagely pooled to output the quality score Q of the entire video:
Q = (1/n) Σ_i q_i    (12)
where q_i is the quality score of each group and Q is the final quality score of the entire video.
CN202211230036.XA 2022-10-08 2022-10-08 Video quality evaluation method based on novel time sequence characteristic relation mapping Pending CN115511858A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211230036.XA CN115511858A (en) 2022-10-08 2022-10-08 Video quality evaluation method based on novel time sequence characteristic relation mapping

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211230036.XA CN115511858A (en) 2022-10-08 2022-10-08 Video quality evaluation method based on novel time sequence characteristic relation mapping

Publications (1)

Publication Number Publication Date
CN115511858A true CN115511858A (en) 2022-12-23

Family

ID=84508112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211230036.XA Pending CN115511858A (en) 2022-10-08 2022-10-08 Video quality evaluation method based on novel time sequence characteristic relation mapping

Country Status (1)

Country Link
CN (1) CN115511858A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071691A (en) * 2023-04-03 2023-05-05 成都索贝数码科技股份有限公司 Video quality evaluation method based on content perception fusion characteristics
CN116071691B (en) * 2023-04-03 2023-06-23 成都索贝数码科技股份有限公司 Video quality evaluation method based on content perception fusion characteristics


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination