CN113642513A - Action quality evaluation method based on self-attention and label distribution learning

Action quality evaluation method based on self-attention and label distribution learning

Info

Publication number: CN113642513A
Application number: CN202111000981.6A
Authority: CN (China)
Prior art keywords: attention, self, distribution, video, true
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Other versions: CN113642513B (en)
Inventors: 张宇, 米思娅, 徐天宇
Assignee (current and original): Southeast University (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Application filed by Southeast University; priority to CN202111000981.6A; published as CN113642513A; granted and published as CN113642513B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an action quality evaluation method based on self-attention and label distribution learning. First, the video is preprocessed and each video clip is input to a feature extraction module, which generates the spatio-temporal feature of each clip. The spatio-temporal features of the video clips are then input as a sequence to a self-attention module to obtain self-attention features containing the context information between sequence elements. All self-attention features are concatenated and input to a label distribution learning module, which outputs a predicted distribution. Next, the real label is converted into a true distribution using a Gaussian function; the loss between the predicted distribution and the true distribution is computed and minimized to train the model. Finally, the trained model evaluates the test videos, yielding the predicted distributions of the test set and, from them, the evaluation scores on the test dataset. Using the Spearman rank correlation coefficient as the evaluation index, the method obtains better evaluation results, demonstrating the effectiveness of the action quality evaluation method.

Description

Action quality evaluation method based on self-attention and label distribution learning
Technical Field
The application relates to the field of computer vision, and in particular to an action quality assessment method based on self-attention and label distribution learning.
Background
Video Action Quality Assessment (AQA) aims to assess the performance and completion quality of a specific action in a video. Automated action quality assessment effectively reduces the cost of human labor and can evaluate video content more accurately and fairly. The technology has potential value and wide application in fields such as skill teaching, sports competition, and medical surgery, and has become an attractive new research topic in computer vision.
Over the past few years, a number of AQA methods have been proposed. Most simply treat quality evaluation as a regression problem, regressing the extracted features to directly produce a predicted action score, or learn quality features through pairwise comparison. Both approaches have limited effect: the first ignores the inherent ambiguity of the labels, i.e., different judges give different scores and a given score is subjective, while the second introduces uncertainty through the choice of reference samples. In addition, most existing methods divide a video into several discrete video sequences and extract features with three-dimensional convolutions of a fixed receptive field size. These methods face the multi-scale spatio-temporal feature problem: different videos differ in subject scale in the spatial dimension and in duration and execution rate in the temporal dimension, which limits the model's ability to understand the samples and degrades the quality evaluation. To solve the above problems, an action quality assessment method based on self-attention and label distribution learning is proposed.
Disclosure of Invention
Purpose of the invention: a new video action quality evaluation model is designed that uses the label distribution learning method to predict the score distribution, so that the model can better deal with label ambiguity. By using a self-attention module with positive-correlation and negative-correlation heads, the model can learn the positive and negative correlation of each video clip in the sequence with the whole sample video, which enhances the context information of the video clips and addresses the multi-scale spatio-temporal problem in traditional action quality evaluation tasks.
The technical scheme is as follows: an action quality assessment method based on self-attention and label distribution learning, characterized by comprising the following steps:
Step one, video preprocessing: down-sample the original video to obtain an input video of total length L, V = {F_1, F_2, …, F_L}, where F_l denotes the l-th frame; then segment it into n mutually overlapping clips C = {C_1, C_2, …, C_n}, where C_n denotes the n-th clip and each clip contains M frames; each frame in each clip is down-sampled and data-enhanced;
Step two, input each video clip C preprocessed in step one into the feature extraction module to generate the spatio-temporal feature sequence of the clips, F_α = {α_1, α_2, …, α_n}, where α_n denotes the spatio-temporal feature of the n-th clip;
Step three, input the spatio-temporal feature sequence F_α of the video clips as a sequence into the self-attention module to obtain the self-attention feature sequence F_β = {β_1, β_2, …, β_n} containing the context information between sequence elements, where β_n denotes the self-attention feature of the n-th clip;
Step four, concatenate the self-attention feature sequence F_β and input it to the label distribution learning module, which outputs a predicted distribution s_pre = {p_pre(c_1), p_pre(c_2), …, p_pre(c_m)}, where p_pre(c_i) denotes the predicted probability of the score c_i;
Step five, convert the true label S, whose value range is 0 to m, into a true distribution s_true = {p_true(c_1), p_true(c_2), …, p_true(c_m)} using a Gaussian function, where p_true(c_i) denotes the true probability of the score c_i;
Step six, compute the loss function between the predicted distribution obtained in step four and the true distribution obtained in step five, minimize the loss, and train the model;
Step seven, evaluate the test videos with the model trained in step six to obtain the predicted distributions of the test set, and from them the evaluation scores on the test dataset.
Further, in step one, the original video is down-sampled and divided into 10 clips of 16 frames each; down-sampling refers to center-cropping the picture, and data enhancement refers to using a random number to select one of nearest-neighbor interpolation, bilinear interpolation, bicubic interpolation, and the Lanczos algorithm to process each frame in each clip, as sketched below.
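A minimal sketch of this preprocessing in Python with OpenCV, whose resize flags match the four interpolation methods named above; the even spacing of the clip start indices and the square center crop are illustrative assumptions:

```python
import random
import cv2
import numpy as np

# The four resampling methods named above map directly onto OpenCV flags.
INTERP_FLAGS = [cv2.INTER_NEAREST, cv2.INTER_LINEAR, cv2.INTER_CUBIC, cv2.INTER_LANCZOS4]

def preprocess(frames, n_clips=10, clip_len=16, size=224):
    """frames: list of H x W x 3 uint8 arrays (the temporally down-sampled video)."""
    # Evenly spaced start indices give n partially overlapping clips (assumption).
    starts = np.linspace(0, len(frames) - clip_len, n_clips).astype(int)
    clips = []
    for s in starts:
        clip = []
        for f in frames[s:s + clip_len]:
            h, w = f.shape[:2]
            side = min(h, w)
            top, left = (h - side) // 2, (w - side) // 2
            patch = f[top:top + side, left:left + side]       # center crop
            flag = random.choice(INTERP_FLAGS)                # random resampling method
            clip.append(cv2.resize(patch, (size, size), interpolation=flag))
        clips.append(np.stack(clip))                          # (16, 224, 224, 3)
    return np.stack(clips)                                    # (10, 16, 224, 224, 3)
```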
Further, the feature extraction module in step two is composed of the multi-receptive-field three-dimensional convolutional neural network I3D, comprising three-dimensional convolutional layers, a three-dimensional max-pooling layer, a three-dimensional average-pooling layer, and Inception layers; each Inception layer comprises several convolutional layers and a max-pooling layer; these modules are connected according to the I3D network structure to obtain the spatio-temporal feature sequence F_α.
Further, the self-attention module in step three is composed of two self-attention heads, one positively and one negatively correlated. For each clip, a self-attention head linearly maps the input spatio-temporal feature α_n to a Query (q_n), a Key (k_n), and a Value (v_n); the Query is used to match Keys, and the Value represents the information extracted from the input spatio-temporal feature α_n. Dot products with the other clips, scaling, and a Softmax function then generate the self-attention feature sequence F_β, extracted with reference to all clips; a code sketch follows.
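A minimal PyTorch sketch of such a two-head self-attention over the clip feature sequence; the feature and head dimensions (1024 inputs, d_k = d_v = 256) are assumptions, and the two heads are architecturally identical, with their positive/negative specialization left to training:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadSelfAttention(nn.Module):
    def __init__(self, in_dim=1024, d_k=256, d_v=256):
        super().__init__()
        self.d_k, self.d_v = d_k, d_v
        # One Query/Key/Value projection triple per head (positive and negative).
        self.heads = nn.ModuleList([nn.Linear(in_dim, 2 * d_k + d_v) for _ in range(2)])

    def forward(self, x):                                        # x: (n, in_dim)
        outs = []
        for proj in self.heads:
            q, k, v = proj(x).split([self.d_k, self.d_k, self.d_v], dim=-1)
            w = F.softmax(q @ k.t() / self.d_k ** 0.5, dim=-1)   # (n, n) attention weights
            outs.append(w @ v)                                   # weighted sum of Values
        return torch.cat(outs, dim=-1)                           # F_beta: (n, 2 * d_v)

f_beta = TwoHeadSelfAttention()(torch.randn(10, 1024))           # (10, 512)
```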
Further, in step four the label distribution learning module comprises a multi-layer perceptron, which concatenates the self-attention feature sequences F_β of the video clips, passes them through multi-layer linear perceptron layers, and finally generates the predicted distribution through Softmax.
Further, the specific method for converting the true label into the true distribution in step five is as follows: generate the true distribution with the sample's true label S as the mean and the hyperparameter σ as the variance.
Further, in step six, the KL divergence loss function is used for the predicted distribution s_pre and the true distribution s_true, and the overall loss function is expressed as:

L({s_pre}) = Σ_{i=1}^{m} p_true(c_i) · log( p_true(c_i) / p_pre(c_i) )

wherein L({s_pre}) denotes the overall loss of each training sample; the above loss function is optimized with an Adam optimizer.
Further, in step seven, the Spearman rank correlation coefficient is adopted as the evaluation index.
Beneficial effects: compared with the prior art, this deep learning method for action quality evaluation can extract temporal self-attention features between video sequences, increasing the weight of the video clips positively correlated with the evaluation result and reducing the weight of the negatively correlated clips, thereby improving the evaluation accuracy of the model. In addition, the method can acquire multi-scale spatio-temporal context information between video sequences and generate fine-grained predicted distributions. The following examples show that the invention can effectively learn discriminative high-level action features for action quality evaluation, and that the proposed method performs well on several behavior assessment datasets.
Drawings
Fig. 1 is a flowchart of an action quality evaluation method based on self-attention and marker distribution learning according to an embodiment of the present invention;
fig. 2 is a diagram of an action quality evaluation model structure based on self-attention and marker distribution learning according to an embodiment of the present invention;
FIG. 3 is a block diagram of a feature extraction module according to an embodiment of the present invention;
FIG. 4 is a self-attention module architecture according to an embodiment of the present invention;
FIG. 5 is a block diagram of a label distribution learning module according to an embodiment of the present invention;
FIG. 6 is a comparison of the present method and other methods provided by embodiments of the present invention in the MTL-AQA data set;
FIG. 7 is a comparison of the present method and other methods provided by embodiments of the present invention in a JIGSAWS data set;
FIG. 8 is a visualization of assessment results provided by embodiments of the present invention;
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment applies the self-attention mechanism and label distribution learning to action quality evaluation on the MTL-AQA dataset. MTL-AQA is a multi-label Olympic diving dataset whose labels comprise the judge scores, the final score, the difficulty coefficient, the viewing angle, the action type, competition commentary, and so on. The final score is the sum of all judge scores, after removing the two highest and the two lowest, multiplied by the difficulty coefficient. The number of samples is 1412, and each sample is sampled to 103 frames. This example splits the data into training and test sets at a ratio of 3 to 1. During training, the I3D model of the feature extraction module is pre-trained on the Kinetics dataset. The number of training epochs is set to 100, the learning rate of the Adam optimizer is 10^-4, the weight decay rate is set to 10^-5, the training batch size is 4, and the test batch size is 20. The trained model evaluates the test videos to obtain the predicted distributions and, from them, the evaluation scores on the test dataset. The technical solution of the invention is explained below with reference to the accompanying drawings and specific examples.
Fig. 1 shows the flowchart of the action quality assessment method based on self-attention and label distribution learning provided by the embodiment of the present invention, which includes the following steps:
Step 1: down-sample the original video of total length L to obtain a video V, divide V into n clips {C_1, C_2, …, C_n}, and apply cropping and data enhancement.
Step 2: input each video clip into the I3D model to generate the spatio-temporal feature sequence of all clips, F_α = {α_1, α_2, …, α_n}, where α_n denotes the spatio-temporal feature of the n-th clip.
Step 3: input the spatio-temporal feature sequence F_α into the self-attention module to obtain the self-attention feature sequence F_β = {β_1, β_2, …, β_n} containing context information, where β_n denotes the self-attention feature of the n-th clip.
Step 4: concatenate the self-attention feature sequence F_β, input it to the multi-layer perceptron, and generate the predicted distribution s_pre.
Step 5: in the test stage, select the score with the highest probability in s_pre as each judge score, remove the two highest and the two lowest judge scores, multiply the sum of the rest by the difficulty coefficient DD, output the final score s_final, and finish.
Step 6: in the training stage, convert the true label into the true distribution s_true.
Step 7: compute the loss between s_true and s_pre using the KL divergence loss function, and train the model using the Adam optimizer.
Step 8: if the number of training iterations is less than 100, return to step 2; otherwise, finish.
Fig. 2 shows a structure diagram of an action quality assessment model based on self-attention and marker distribution learning according to an embodiment of the present invention, which is described in detail as follows:
the invention mainly comprises three parts: the system comprises a feature extraction module, a multi-head self-attention module and a mark distribution learning module. The algorithm flow of the invention is described below, a sample video with the total length of L is given, a video segment V with 103 frames is obtained by downsampling, the video segment V is segmented into 10 partially overlapped video segments, each segment comprises 16 frames, data enhancement is performed, and the data enhancement is input to I3D for three-dimensional feature extraction, so that space-time features are generated. Then, extracting positive correlation and negative correlation self-attention features between sequences by using a multi-head self-attention mechanism, splicing all the features, inputting the features into a multi-layer perceptron to transform dimensionality, and finally generating prediction label distribution by using a Softmax layer, wherein fig. 2 is a schematic network structure diagram of the action quality evaluation method based on self-attention and label distribution learning provided by the invention.
FIG. 3 is a block diagram of a feature extraction module according to an embodiment of the present invention;
the characteristic extraction module is mainly composed of a three-dimensional convolutional neural network I3D with multiple receptive fields, and is provided with 4 three-dimensional convolutional layers (Conv3d), 4 maximum pooling layers (Maxpool3d), 1 average pooling layer (Avgpool3d) and 9 inclusion layers (Inc), the convolution and pooling steps of different dimensions are different, except the last layer of convolution layer, batch normalization is carried out behind other convolutional layers in the model, and as the ReLU activation function has the function of unilateral inhibition and has the characteristic of sparse activation, part of parameters can be reduced, the purpose of preventing overfitting is achieved, relatively wide excitation boundaries can be realized, and any input characteristic can be activated; since the gradient is constant, no phenomena of gradient disappearance or gradient explosion occur, the ReLU is used as an activation function, which is not shown in fig. 3 for the sake of simplicity of representation of the model. The Inc in the figure represents an Incep layer, the structure of the Incep layer is shown in the figure, after convolution is carried out on convolution kernels with different receptive field sizes of 3 x 3 and 1 x 1, all scale features are spliced to form a feature which can represent more information and has more depth, and therefore the problem of multi-scale space-time features is solved.
The specific process of the feature extraction module can be described as follows: in a segment of size [3,16,224 ]]The video Clip of (1) is input in the format of channel number, frame number, width and height, and the unit is frame and pixel, after passing through each layer, the final convolution is performed in time, and the average pooling is obtained to be the size of [1,1024 ]]Space-time feature of (a)nThe sequence of spatio-temporal features of all video segments can be denoted as Fa={α12,…,αn}. Due to the versatility of the I3D model, the I3D model of the present algorithm was pre-trained on the Kinetics dataset.
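For orientation, a shape-level stand-in with the same input/output contract as the I3D backbone (one [3, 16, 224, 224] clip in, a 1024-dimensional feature out); this toy module is a hypothetical stub, not the real Inception-3D architecture:

```python
import torch
import torch.nn as nn

class ToyI3D(nn.Module):
    """Stand-in honouring the I3D contract: clip [B, 3, 16, 224, 224] -> [B, 1024]."""
    def __init__(self, out_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(64, out_dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),   # final average pooling to a vector
        )

    def forward(self, x):
        return self.net(x)

clips = torch.randn(10, 3, 16, 224, 224)    # the 10 preprocessed clips of one video
f_alpha = ToyI3D()(clips)                    # F_alpha: (10, 1024) spatio-temporal features
```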
FIG. 4 is a self-attention module architecture according to an embodiment of the present invention;
In order to extract the action information at different positions in a video, action quality evaluation algorithms usually generate several video clip sequences by segmentation, extract features from them, and then aggregate the features by average pooling. Inspired by the seq2seq task in natural language processing, this can be regarded as a seq2seq process and handled with a sequence model. Considering that RNNs cannot be computed in parallel and still have difficulty capturing long-distance relationships, the multi-head self-attention mechanism is used to compute the relationships between video clip contexts. The attention mechanism can be understood as follows: a group of sequences is taken as input, a dot-product operation is performed between the Query and Key produced by linear-layer mappings, the result is used to compute a weighted sum of the Values, and a group of vector sequences weighted over all input positions is output.
The spatio-temporal feature α_n of a clip is input to the multi-head attention module. Three linear layers output a Query and a Key of dimension d_k, denoted q_n and k_n, and a Value of dimension d_v, denoted v_n; the Query is used to match Keys, and the Value represents the information extracted from the input spatio-temporal feature α_n. The dot product of q_n with the Key of every clip m in the sequence is then computed, with the result denoted p_{n,m}, m ∈ {1, …, n}. To prevent overly large values, which would make the result after the subsequent activation function constantly 0 or 1, each dot product is divided by √d_k. The scaled results are passed through a Softmax function to obtain the weights of the clips' Values, and the weighted sum over the Values yields the self-attention feature β_n of the video clip:

β_n = Σ_{m=1}^{n} Softmax_m( p_{n,m} / √d_k ) · v_m,  with p_{n,m} = q_n · k_m
Compared with extracting a single Query, Key, and Value, the present algorithm uses linear layers to extract two Queries, Keys, and Values from each clip in parallel. As shown on the right of FIG. 4, the spatio-temporal feature α_n is input to a positive-correlation head (Positive Head) and a negative-correlation head (Negative Head) simultaneously. In the example, the clip in which the athlete is falling has a very high positive correlation with its neighboring clips, while its correlation with the first few clips is very low, since the clips at the head of the sequence tend to be irrelevant background content, or even negatively correlated. This in effect adds a dimension that captures both the positively and the negatively correlated features between the clips of the sequence. The computation then proceeds exactly as in the single self-attention mechanism, and the results of the two heads are concatenated to form the attention feature of the clip. In the actual computation, the clip sequences are stacked into matrices and all clip results are obtained in parallel, as shown in the following formula:

F_β = Concat( Softmax(Q_pos · K_pos^T / √d_k) · V_pos , Softmax(Q_neg · K_neg^T / √d_k) · V_neg )

In the formula, Q, K, and V denote the matrices stacked from the Query, Key, and Value obtained by linearly mapping the spatio-temporal feature α_n of each clip, the subscripts pos and neg denote the parameters of the positive-correlation and negative-correlation heads, and d_k denotes the dimension of K. The result is the self-attention feature sequence F_β = {β_1, β_2, …, β_n}, obtained with reference to all video clips and containing context information.
FIG. 5 is a diagram illustrating the multi-layer perceptron structure in the label distribution learning module according to an embodiment of the present invention;
In the label distribution learning module, assume a training sample whose true label is s. To obtain the true distribution s_true of the sample, a Gaussian function with the label score s as the mean and σ as the variance is first generated (the experimental section includes an ablation study on the choice of distribution function), as shown in the following formula:

g(c) = (1 / (√(2π) · σ)) · exp( -(c - s)² / (2σ²) )

where σ is both the variance and a hyperparameter describing the uncertainty of evaluating how good an action is. The label score interval is then uniformly discretized into a set of scores c = {c_1, c_2, …, c_m}; the true labels of the MTL-AQA action quality assessment dataset range from 0 to 10 and the prediction score interval is set to 0.5, i.e., [0, 0.5, …, 9.5, 10], so the output dimension m is 21. The vector g_c = {g(c_1), g(c_2), …, g(c_m)} then describes the degree (i.e., probability) of each score; normalizing g_c as follows yields the true distribution of the training sample, s_true = {p_true(c_1), p_true(c_2), …, p_true(c_m)}:

p_true(c_i) = g(c_i) / Σ_{j=1}^{m} g(c_j)
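A small sketch of this discretized Gaussian in Python (NumPy); σ = 1.0 is an illustrative choice for the hyperparameter:

```python
import numpy as np

# Discretised Gaussian label distribution (s_true): bins 0, 0.5, ..., 10 (m = 21),
# mean = ground-truth score s, std sigma (hyperparameter; 1.0 is an assumption).
def true_distribution(s, sigma=1.0, lo=0.0, hi=10.0, step=0.5):
    c = np.arange(lo, hi + step, step)                          # score set c_1 .. c_m
    g = np.exp(-(c - s) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return c, g / g.sum()                                       # normalise to p_true

bins, s_true = true_distribution(s=8.5)
print(bins.size, s_true.sum())                                  # 21 1.0
```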
To learn the predicted distribution s_pre, the n self-attention features of the clips learned by the multi-head self-attention mechanism, F_β = {β_1, β_2, …, β_n}, are concatenated to form one large self-attention feature β'. As shown in FIG. 5, the multi-layer perceptron converts the size of the self-attention feature β' to m, the same as that of s_true; a ReLU activation function adds a non-linear factor, and a Softmax function then produces the fine-grained predicted distribution:

s_pre = {p_pre(c_1), p_pre(c_2), …, p_pre(c_m)}
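A minimal PyTorch sketch of this head; the 512-dimensional clip features and the 256-unit hidden layer are assumptions, as the patent does not give layer sizes:

```python
import torch
import torch.nn as nn

# Label distribution head: concatenate the clip self-attention features into
# beta', map to the m = 21 score bins, and softmax into a distribution.
head = nn.Sequential(nn.Linear(10 * 512, 256), nn.ReLU(), nn.Linear(256, 21))

f_beta = torch.randn(10, 512)                          # output of the attention module
s_pre = torch.softmax(head(f_beta.flatten()), dim=-1)  # (21,) predicted distribution
```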
in the MTL-AQA action quality evaluation data set, labels are given by a plurality of judges, and the final score is calculated by removing two maximum values and two minimum values from all the judge scores and adding the restThe scores are summed and multiplied by a difficulty factor. Therefore, after the self-attention model, K multi-layer perceptrons are trained in parallel to obtain K true distributions
Figure BDA0003235591190000071
Figure BDA0003235591190000072
And the predicted distribution
Figure BDA0003235591190000073
Can be represented as strue,kAnd spre,k. Since the label of the label distribution learning is probability distribution, and the Kullback-Leibler divergence is called information divergence or relative entropy and is used for measuring the asymmetry measurement of the difference between the two probability distributions, s is calculated by using the KL divergence as the loss function of the label distribution learning in the training stagetrue,kAnd spre,kThe Loss between the two probability distributions is optimized by minimizing KL Loss by using a gradient descent method, so that the difference between the two probability distributions is minimized, namely, the more similar the predicted distribution and the real distribution, the better. The loss function is shown as follows:
Figure BDA0003235591190000074
wherein, L ({ s)pre,kDenotes the overall loss for each training sample, spre,kRepresenting the k-th prediction distribution, s, in the sampletrue,kRepresenting the k-th true distribution, p, of the samplepre(ci,k) Representing the probability of the score at the ith scoring position in the kth predicted distribution. p is a radical oftrue(ci,k) Representing the score probability of the ith score position in the kth true distribution, the present invention optimizes the above-mentioned loss function using an Adam optimizer.
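A minimal PyTorch sketch of this loss under assumed sizes (K = 7 judge heads, m = 21 score bins); note that F.kl_div takes the prediction as log-probabilities:

```python
import torch
import torch.nn.functional as F

# KL-divergence loss between predicted and true score distributions (sketch).
# reduction="sum" realises the double sum over judges k and score bins i above.
pred_logits = torch.randn(7, 21, requires_grad=True)    # K = 7 judge heads (assumed)
true_dists = torch.softmax(torch.randn(7, 21), dim=-1)  # stand-in Gaussian targets
loss = F.kl_div(F.log_softmax(pred_logits, dim=-1), true_dists, reduction="sum")
loss.backward()
# Optimiser settings quoted in the embodiment:
# torch.optim.Adam(params, lr=1e-4, weight_decay=1e-5)
```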
In the test stage, the score with the highest probability in each predicted distribution s_pre,k is selected as the predicted score of the k-th judge, s_final,k. The two highest and the two lowest of all predicted judge scores are then rejected, and the sum of the remaining scores is multiplied by the difficulty coefficient of the sample, which is known from the label, to obtain the final predicted score s_final, as shown in the following formula:

s_final = DD · Σ_{k∈U} s_final,k

where DD denotes the difficulty coefficient of the sample and k ∈ U ranges over all scores remaining after the two maxima and the two minima are rejected.
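A sketch of this test-time scoring rule in Python (NumPy); the judge count K = 7 and the random distributions are illustrative assumptions:

```python
import numpy as np

# Argmax score per judge distribution, drop the two highest and two lowest,
# scale the remaining sum by the known difficulty coefficient DD.
def final_score(pred_dists, bins, dd):
    judge_scores = bins[np.argmax(pred_dists, axis=-1)]   # (K,) one score per judge head
    kept = np.sort(judge_scores)[2:-2]                    # reject 2 maxima and 2 minima
    return dd * kept.sum()

bins = np.arange(0.0, 10.5, 0.5)                          # the 21 score bins
print(final_score(np.random.rand(7, 21), bins, dd=3.2))   # toy example
```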
FIG. 6 is a comparison of the present method and other methods provided by embodiments of the present invention on the MTL-AQA dataset;
Using the proposed action quality assessment model based on the self-attention mechanism and label distribution learning, comparative experiments were performed on the MTL-AQA dataset against other mainstream action quality assessment models of different structures and depths, analyzing the influence of the algorithm on the action quality evaluation task. The experiments use the Spearman rank correlation coefficient as the evaluation index. The results show that the pose feature extraction method based on manual design performs worst. Methods that use a regression model with the final score as the supervision label are superior to most current methods; the present method in turn outperforms the current regression-based action quality evaluation algorithms, which proves the effectiveness of the self-attention mechanism with positive- and negative-correlation heads and shows that the self-attention module and the label distribution learning module bring large improvements. The variant using label distribution learning performs best, reaching a Spearman rank correlation coefficient of 0.9384 and surpassing the best current reference method. The experimental results fully indicate the effectiveness of the method and show that it generalizes better than the other models.
FIG. 7 is a comparison of the present method and other methods provided by embodiments of the present invention in a JIGSAWS data set;
the method and other reference methods are tested in three scenes of knotting (knotting), Needle threading (Needle paging) and Suturing (Suturing) in the JIGSAWS data set. The number of video frames of the jitswaws dataset dynamically changes with the sample video length, so 160 frames are randomly sampled as input to the model, which then divides the video into 10 segments of 16 frames each, as with the MTL-AQA dataset. Comparing the experimental results with the structures of the reference methods, the results show that the method has better performance in the tasks of suturing (0.7806) and needle threading (0.8040) than all other reference methods, the average Spierman grade coefficient in the three tasks is 0.7762, and the experimental results fully prove the effectiveness of the method in the task of motion quality assessment.
FIG. 8 is a visualization of assessment results provided by embodiments of the present invention;
and visualizing the quality evaluation result of the method by using a scatter diagram, wherein each scatter represents a prediction sample used by the method, the y-axis represents a prediction score, the x-axis represents a real score, and the real sample is represented by a dotted line. The more concentrated the scatter distribution is, the closer the prediction result is to the real sample, and the higher the model is accurate. The action quality evaluation result based on self-attention and mark distribution learning is close to real sample data, and the effectiveness of the method is fully proved by the experimental result.

Claims (8)

1. An action quality assessment method based on self-attention and label distribution learning, characterized by comprising the following steps:
Step one, video preprocessing: down-sample the original video to obtain an input video of total length L, V = {F_1, F_2, …, F_L}, where F_l denotes the l-th frame; then segment it into n mutually overlapping clips C = {C_1, C_2, …, C_n}, where C_n denotes the n-th clip and each clip contains M frames; each frame in each clip is down-sampled and data-enhanced;
Step two, input each video clip C preprocessed in step one into the feature extraction module to generate the extracted spatio-temporal feature sequence of the clips, F_α = {α_1, α_2, …, α_n}, where α_m denotes the spatio-temporal feature of the m-th clip;
Step three, input the spatio-temporal feature sequence F_α of the video clips as a sequence into the self-attention module to obtain the self-attention feature sequence F_β = {β_1, β_2, …, β_n} containing the context information between sequence elements, where β_n denotes the self-attention feature of the n-th clip;
Step four, concatenate the self-attention feature sequence F_β and input it to the label distribution learning module, which outputs a predicted distribution s_pre = {p_pre(c_1), p_pre(c_2), …, p_pre(c_m)}, where p_pre(c_i) denotes the predicted probability of the score c_i;
Step five, convert the true label S, whose value range is 0 to m, into a true distribution s_true = {p_true(c_1), p_true(c_2), …, p_true(c_m)} using a Gaussian function, where p_true(c_i) denotes the true probability of the score c_i;
Step six, compute the loss function between the predicted distribution obtained in step four and the true distribution obtained in step five, minimize the loss, and train the model;
Step seven, evaluate the test videos with the model trained in step six to obtain the predicted distributions of the test set, and from them the evaluation scores on the test dataset.
2. The action quality assessment method based on self-attention and label distribution learning according to claim 1, wherein: in step one, the original video is down-sampled and divided into 10 clips of 16 frames each; down-sampling refers to center-cropping the picture, and data enhancement refers to using a random number to select one of nearest-neighbor interpolation, bilinear interpolation, bicubic interpolation, and the Lanczos algorithm to process each frame in each clip.
3. The action quality assessment method based on self-attention and label distribution learning according to claim 1, wherein: the feature extraction module in step two is composed of the multi-receptive-field three-dimensional convolutional neural network I3D, comprising three-dimensional convolutional layers, a three-dimensional max-pooling layer, a three-dimensional average-pooling layer, and Inception layers, each Inception layer comprising several convolutional layers and a max-pooling layer; the modules are connected according to the I3D network structure to obtain the spatio-temporal feature sequence F_α.
4. The action quality assessment method based on self-attention and label distribution learning according to claim 1, wherein: the self-attention module in step three is composed of two self-attention heads, one positively and one negatively correlated; the self-attention head of each clip linearly maps the input spatio-temporal feature α_n to a Query (q_n), a Key (k_n), and a Value (v_n), where the Query is used to match Keys and the Value represents the information extracted from the input spatio-temporal feature α_n; dot products with the other clips, scaling, and a Softmax function generate the self-attention feature sequence F_β extracted with reference to all clips.
5. The action quality assessment method based on self-attention and label distribution learning according to claim 1, wherein: in step four, the label distribution learning module comprises a multi-layer perceptron that concatenates the self-attention feature sequences F_β of the video clips, passes them through multi-layer linear perceptron layers, and finally generates the predicted distribution through Softmax.
6. The action quality assessment method based on self-attention and label distribution learning according to claim 1, wherein: the specific method for converting the true label into the true distribution in step five is as follows: generate the true distribution with the sample's true label S as the mean and the hyperparameter σ as the variance.
7. The action quality assessment method based on self-attention and label distribution learning according to claim 1, wherein: in step six, the KL divergence loss function is used for the predicted distribution s_pre and the true distribution s_true, and the overall loss function is expressed as:

L({s_pre}) = Σ_{i=1}^{m} p_true(c_i) · log( p_true(c_i) / p_pre(c_i) )

wherein L({s_pre}) denotes the overall loss of each training sample; the above loss function is optimized with an Adam optimizer.
8. The action quality assessment method based on self-attention and label distribution learning according to claim 1, wherein: in step seven, the Spearman rank correlation coefficient is adopted as the evaluation index.
CN202111000981.6A 2021-08-30 2021-08-30 Action quality evaluation method based on self-attention and label distribution learning Active CN113642513B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111000981.6A CN113642513B (en) 2021-08-30 2021-08-30 Action quality evaluation method based on self-attention and label distribution learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111000981.6A CN113642513B (en) 2021-08-30 2021-08-30 Action quality evaluation method based on self-attention and label distribution learning

Publications (2)

Publication Number Publication Date
CN113642513A true CN113642513A (en) 2021-11-12
CN113642513B CN113642513B (en) 2022-11-18

Family

ID=78424319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111000981.6A Active CN113642513B (en) 2021-08-30 2021-08-30 Action quality evaluation method based on self-attention and label distribution learning

Country Status (1)

Country Link
CN (1) CN113642513B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114463551A (en) * 2022-02-14 2022-05-10 北京百度网讯科技有限公司 Image processing method, image processing device, storage medium and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111784121A (en) * 2020-06-12 2020-10-16 清华大学 Action quality evaluation method based on uncertainty score distribution learning
CN112085102A (en) * 2020-09-10 2020-12-15 西安电子科技大学 No-reference video quality evaluation method based on three-dimensional space-time characteristic decomposition
CN112954312A (en) * 2021-02-07 2021-06-11 福州大学 No-reference video quality evaluation method fusing spatio-temporal characteristics

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111784121A (en) * 2020-06-12 2020-10-16 清华大学 Action quality evaluation method based on uncertainty score distribution learning
CN112085102A (en) * 2020-09-10 2020-12-15 西安电子科技大学 No-reference video quality evaluation method based on three-dimensional space-time characteristic decomposition
CN112954312A (en) * 2021-02-07 2021-06-11 福州大学 No-reference video quality evaluation method fusing spatio-temporal characteristics

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114463551A (en) * 2022-02-14 2022-05-10 北京百度网讯科技有限公司 Image processing method, image processing device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN113642513B (en) 2022-11-18

Similar Documents

Publication Publication Date Title
WO2021139069A1 (en) General target detection method for adaptive attention guidance mechanism
CN111259850B (en) Pedestrian re-identification method integrating random batch mask and multi-scale representation learning
CN107341452B (en) Human behavior identification method based on quaternion space-time convolution neural network
CN107480261B (en) Fine-grained face image fast retrieval method based on deep learning
CN111611847B (en) Video motion detection method based on scale attention hole convolution network
US11270124B1 (en) Temporal bottleneck attention architecture for video action recognition
CN112784798A (en) Multi-modal emotion recognition method based on feature-time attention mechanism
CN112699956A (en) Neural morphology visual target classification method based on improved impulse neural network
CN109543602A (en) A kind of recognition methods again of the pedestrian based on multi-view image feature decomposition
AU2021379758A9 (en) A temporal bottleneck attention architecture for video action recognition
CN110046550A (en) Pedestrian's Attribute Recognition system and method based on multilayer feature study
CN110135369A (en) A kind of Activity recognition method, system, equipment and computer readable storage medium
Zhang et al. Cross-scale generative adversarial network for crowd density estimation from images
CN112149665A (en) High-performance multi-scale target detection method based on deep learning
Karkanis et al. Detecting abnormalities in colonoscopic images by textural description and neural networks
CN113642513B (en) Action quality evaluation method based on self-attention and label distribution learning
Aldhaheri et al. MACC Net: Multi-task attention crowd counting network
CN114170657A (en) Facial emotion recognition method integrating attention mechanism and high-order feature representation
Sharma et al. Mango leaf diseases detection using deep learning
CN113537240B (en) Deformation zone intelligent extraction method and system based on radar sequence image
CN112597842B (en) Motion detection facial paralysis degree evaluation system based on artificial intelligence
CN111209433A (en) Video classification algorithm based on feature enhancement
Eghbali et al. Deep Convolutional Neural Network (CNN) for Large-Scale Images Classification
JP7466815B2 (en) Information processing device
CN117275681B (en) Method and device for detecting and evaluating honeycomb lung disease course period based on transducer parallel cross fusion model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant