CN113642513B - Action quality evaluation method based on self-attention and label distribution learning - Google Patents


Info

Publication number
CN113642513B
CN113642513B (application CN202111000981.6A)
Authority
CN
China
Prior art keywords
attention
self
distribution
video
true
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111000981.6A
Other languages
Chinese (zh)
Other versions
CN113642513A (en)
Inventor
Zhang Yu (张宇)
Mi Siya (米思娅)
Xu Tianyu (徐天宇)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202111000981.6A priority Critical patent/CN113642513B/en
Publication of CN113642513A publication Critical patent/CN113642513A/en
Application granted granted Critical
Publication of CN113642513B publication Critical patent/CN113642513B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an action quality evaluation method based on self-attention and label distribution learning. First, a video is preprocessed and each video clip is input to a feature extraction module, which generates the spatio-temporal features of each clip. The spatio-temporal features of the video clips are then input as a sequence to a self-attention module to obtain self-attention features containing the context information between the sequence elements. All self-attention features are concatenated and input to a label distribution learning module, which outputs a predicted distribution. Next, the true label is converted into a true distribution with a Gaussian function, the loss between the predicted distribution and the true distribution is computed, and the model is trained by minimizing this loss. Finally, the trained model evaluates the test videos to obtain the predicted distribution of the test set and, from it, the evaluation scores on the test data set. Using the Spearman rank correlation coefficient as the evaluation index, the method achieves better evaluation results, demonstrating the effectiveness of the action quality evaluation method.

Description

Action quality evaluation method based on self-attention and label distribution learning
Technical Field
The application relates to the field of computer vision, in particular to an action quality assessment method based on self-attention and label distribution learning.
Background
Video action quality assessment (AQA) aims to evaluate the performance and completion quality of specific actions in a video. Automated action quality assessment effectively reduces the demand on human labor and can evaluate video content more accurately and fairly. The technology has potential value and wide application in fields such as skill teaching, sports competition, and medical surgery, and has become a new and attractive research topic in computer vision.
Over the past few years, a number of AQA methods have been proposed. Most treat quality assessment simply as a regression problem, regressing the extracted features to directly produce a predicted action score, or learning quality features through pairwise comparison. However, both approaches have limited effect: the first ignores the inherent ambiguity of the labels, i.e., different judges give different scores and any given score is subjective, and the second suffers from uncertainty in the choice of reference samples. In addition, most existing methods divide the video into several discrete video sequences and extract features with three-dimensional convolutions of fixed receptive-field size. These methods face the problem of multi-scale spatio-temporal features: different videos may differ in subject size in the spatial dimension and in duration and execution rate in the temporal dimension, which limits the model's comprehension of the sample and degrades the quality assessment. To address these problems, an action quality evaluation method based on self-attention and label distribution learning is proposed.
Disclosure of Invention
The purpose of the invention is as follows: a new video action quality evaluation model is designed, which predicts the score distribution with a label distribution learning method so that the model better handles label ambiguity. A self-attention module with positively and negatively correlated heads lets the model learn the positive and negative correlation of each clip in the video sequence with the whole sample video, enhancing the context information of the video clips and addressing the multi-scale spatio-temporal problem of conventional action quality assessment tasks.
The technical scheme is as follows: an action quality assessment method based on self-attention and label distribution learning, characterized by comprising the following steps:
Step 1, preprocess the video: downsample the original video to obtain an input video of total length L, V = {F_1, F_2, …, F_L}, where F_l denotes the l-th frame; then segment it into n mutually overlapping clips C = {C_1, C_2, …, C_n}, where C_n denotes the n-th clip; each clip contains M frames, and every frame in every clip is downsampled and data-enhanced;
Step 2, input each video clip C preprocessed in step 1 to the feature extraction module to generate the spatio-temporal feature sequence F_α = {α_1, α_2, …, α_n}, where α_n denotes the spatio-temporal feature of the n-th clip;
Step 3, input the spatio-temporal feature sequence F_α of the video clips to the self-attention module to obtain the self-attention feature sequence F_β = {β_1, β_2, …, β_n} containing the context information between the sequence elements, where β_n denotes the self-attention feature of the n-th clip;
Step 4, concatenate the self-attention feature sequence F_β and input it to the label distribution learning module, which outputs the predicted distribution s_pre = {p_pre(c_1), p_pre(c_2), …, p_pre(c_m)}, where p_pre(c_i) denotes the predicted probability of the score c_i;
Step 5, convert the true label s, whose value ranges from 0 to m, into the true distribution s_true = {p_true(c_1), p_true(c_2), …, p_true(c_m)} using a Gaussian function, where p_true(c_i) denotes the true probability of the score c_i;
Step 6, compute the loss between the predicted distribution from step 4 and the true distribution from step 5, minimize the loss, and train the model;
Step 7, evaluate the test videos with the model trained in step 6 to obtain the predicted distribution on the test set and, from it, the evaluation scores on the test data set.
Further, in step 1, the original video is downsampled and divided into 10 clips of 16 frames each; downsampling refers to center-cropping each picture, and data enhancement refers to using a random number to select one of nearest-neighbor interpolation, bilinear interpolation, bicubic interpolation, and the Lanczos algorithm for processing each frame of each clip.
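By way of illustration, a minimal Python sketch of this preprocessing follows (OpenCV and NumPy assumed; the uniform spacing of clip starts and the square center crop are choices of the sketch, not details fixed by the disclosure):

```python
import random
import numpy as np
import cv2

# The four interpolation methods named above; one is chosen at random per frame.
INTERPOLATIONS = [cv2.INTER_NEAREST, cv2.INTER_LINEAR, cv2.INTER_CUBIC, cv2.INTER_LANCZOS4]

def preprocess(frames, n_clips=10, clip_len=16, size=224):
    """Split a downsampled video (list of HxWx3 uint8 frames) into n_clips
    overlapping clips of clip_len frames; center-crop each frame to a square
    and resize it to size x size with a randomly chosen interpolation."""
    starts = np.linspace(0, len(frames) - clip_len, n_clips).astype(int)  # overlapping starts
    clips = []
    for s in starts:
        clip = []
        for f in frames[s:s + clip_len]:
            h, w = f.shape[:2]
            side = min(h, w)
            top, left = (h - side) // 2, (w - side) // 2
            f = f[top:top + side, left:left + side]          # center crop (downsampling)
            interp = random.choice(INTERPOLATIONS)           # data enhancement
            clip.append(cv2.resize(f, (size, size), interpolation=interp))
        clips.append(np.stack(clip))
    return np.stack(clips)   # shape: (n_clips, clip_len, size, size, 3)
```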
Further, the feature extraction module in step 2 consists of a three-dimensional convolutional neural network, I3D, with multiple receptive fields, comprising three-dimensional convolution layers, three-dimensional max-pooling layers, a three-dimensional average-pooling layer, and Inception layers; each Inception layer comprises several convolution layers and a max-pooling layer. These modules are connected according to the I3D network structure to obtain the spatio-temporal feature sequence F_α.
Further, the self-attention module in step 3 consists of two self-attention heads, one positively and one negatively correlated. For each clip, a self-attention head linearly maps the input spatio-temporal feature α_n to a Query (q_n), a Key (k_n), and a Value (v_n); the Query is used to match Keys, and the Value represents the information extracted from the input spatio-temporal feature α_n. Dot products with the other clips, scaling, and a Softmax function then generate the self-attention feature sequence F_β, extracted with reference to all clips.
Further, in step 4, the label distribution learning module comprises a multi-layer perceptron: the self-attention feature sequences F_β of all video clips are concatenated, passed through the multi-layer perceptron, and finally turned into the predicted distribution by a Softmax.
Further, the specific method of converting the true label into the true distribution in step 5 is: generate the true distribution with the sample's true label s as the mean and the hyper-parameter σ as the variance.
Further, in step 6, a KL-divergence loss function is used between the predicted distribution s_pre and the true distribution s_true, the overall loss function being

L(\{s_{pre}\}) = \mathrm{KL}(s_{true} \| s_{pre}) = \sum_{i=1}^{m} p_{true}(c_i) \log \frac{p_{true}(c_i)}{p_{pre}(c_i)}

where L({s_pre}) denotes the overall loss of each training sample; the above loss function is optimized with an Adam optimizer.
Further, in step 7, the Spearman rank correlation coefficient is adopted as the evaluation index.
Beneficial effects: compared with the prior art, this deep learning method for action quality evaluation extracts temporal self-attention features between video sequences, increasing the weight of video clips positively correlated with the evaluation result and decreasing the weight of negatively correlated clips, thereby improving the evaluation accuracy of the model. In addition, the method acquires multi-scale spatio-temporal context information between video sequences and generates a fine-grained predicted distribution. The embodiments below show that the invention effectively learns high-level, action-discriminative features for action quality evaluation, and that the proposed method performs well on several behavior assessment data sets.
Drawings
Fig. 1 is a flowchart of an action quality evaluation method based on self-attention and label distribution learning according to an embodiment of the present invention;
Fig. 2 is a diagram of an action quality evaluation model structure based on self-attention and label distribution learning according to an embodiment of the present invention;
FIG. 3 is a block diagram of a feature extraction module according to an embodiment of the present invention;
FIG. 4 is a self-attention module structure according to an embodiment of the present invention;
FIG. 5 is a block diagram of a label distribution learning module according to an embodiment of the present invention;
FIG. 6 is a comparison of the present method and other methods provided by embodiments of the present invention in the MTL-AQA data set;
FIG. 7 is a comparison of the present method and other methods provided by embodiments of the present invention in a JIGSAWS data set;
FIG. 8 is a visualization of assessment results provided by embodiments of the present invention;
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
This embodiment applies the action quality evaluation method based on the self-attention mechanism and label distribution learning to the MTL-AQA data set. The MTL-AQA data set used in this example is a multi-label Olympic diving data set whose labels comprise the judges' scores, the final score, the difficulty coefficient, the viewing angle, the action type, competition commentary, and so on. The final score is the sum of all judges' scores, after removing the two highest and the two lowest, multiplied by a difficulty coefficient; there are 1412 samples, and each sample is sampled to 103 frames. The example splits the data into training and test sets at a ratio of 3:1. During training, the I3D model of the feature extraction module is pre-trained on the Kinetics data set. The number of training epochs is set to 100, the learning rate of the Adam optimizer is 10^-4, the weight decay rate is set to 10^-5, the training batch size is 4, and the test batch size is 20. The trained model evaluates the test videos to obtain the predicted distribution and, from it, the evaluation scores on the test data set. To explain the technical solution of the invention, the following description refers to the accompanying drawings and specific examples.
Fig. 1 shows a flowchart of the action quality assessment method based on self-attention and label distribution learning according to an embodiment of the present invention, which includes the following steps:
step 1: the original video with the total length of L is downsampled to obtain a video V, and the video V is divided into n segments { C } 1 ,C 2 ,…,C n And (4) cutting and enhancing data.
Step 2: inputting each video segment into the I3D model to generate a space-time characteristic sequence F of all the segments α ={α 12 ,…,α n },α n Representing the spatiotemporal characteristics of the nth segment.
And step 3: the spatial feature sequence F α Inputting the data into a self-attention module to obtain a self-attention feature sequence F containing context information β ={β 12 ,…,β n },β n The self-attention feature of the nth segment is shown.
And 4, step 4: will self-attention feature sequence F β Splicing, inputting to multi-layer perceptron and generating prediction distribution s pre
And 5: if it is a testStage, select s pre Taking the medium maximum probability score as a judgment score, multiplying the sum of all judgment scores after removing the two maximum values and the two minimum values by a difficulty coefficient DD, and outputting a final score s final And then, the process is ended.
And 6: if the training stage is adopted, the real label is converted into a real distribution s true
And 7: calculating s using KL divergence loss function true And s pre The model is trained using Adam optimizer.
And step 8: if the training times are less than 100, returning to the step 2, otherwise, ending.
Fig. 2 shows a structure diagram of the action quality assessment model based on self-attention and label distribution learning according to an embodiment of the present invention, described in detail as follows:
The invention mainly comprises three parts: a feature extraction module, a multi-head self-attention module, and a label distribution learning module. The algorithm flow is as follows: given a sample video of total length L, downsample it to obtain a 103-frame video V, segment V into 10 partially overlapping video clips of 16 frames each, apply data enhancement, and input the clips to I3D for three-dimensional feature extraction, generating spatio-temporal features. Then a multi-head self-attention mechanism extracts positively and negatively correlated self-attention features between the sequence elements; all features are concatenated and input to a multi-layer perceptron to transform the dimensionality, and finally a Softmax layer generates the predicted label distribution. Fig. 2 is a schematic network-structure diagram of the proposed action quality evaluation method based on self-attention and label distribution learning.
FIG. 3 is a block diagram of a feature extraction module according to an embodiment of the present invention;
the characteristic extraction module is mainly composed of a three-dimensional convolutional neural network I3D with multiple receptive fields, and is provided with 4 three-dimensional convolutional layers (Conv 3D), 4 maximum pooling layers (MaxPool 3D), 1 average pooling layer (AvgPool 3D) and 9 inclusion layers (Inc), the convolution and pooling steps in different dimensions are different, batch normalization is carried out on the back surfaces of other convolutional layers in the model except the last layer of convolution layer, and as the ReLU activation function has the function of unilateral inhibition and has the characteristic of sparse activation, part of parameters can be reduced, the purpose of over-fitting prevention is achieved, the excitation boundary is relatively wide, and any input characteristic can be activated; since the gradient is constant, no phenomena of gradient disappearance or gradient explosion occur, the ReLU is used as an activation function, which is not shown in fig. 3 for the sake of simplicity of representation of the model. The Inc in the figure represents an Incep layer, the structure of the Incep layer is shown in the figure, after convolution is carried out on convolution kernels with different receptive field sizes of 3 x 3 and 1 x 1, all scale features are spliced to form a feature which can represent more information and has more depth, and therefore the problem of multi-scale space-time features is solved.
The specific process of the feature extraction module can be described as follows: a video clip (Clip n) of size [3, 16, 224, 224] is input, the format being [channels, frames, width, height] in frames and pixels. After passing through each layer, a final temporal convolution and average pooling yield a spatio-temporal feature α_n of size [1, 1024]; the spatio-temporal feature sequence of all video clips is denoted F_α = {α_1, α_2, …, α_n}. Owing to the generality of the I3D model, the algorithm's I3D model is pre-trained on the Kinetics data set.
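A shape-level sketch of this interface follows; the tiny stand-in network below only reproduces the input and output sizes described above and is not the Kinetics-pretrained I3D itself:

```python
import torch
import torch.nn as nn

class I3DStandIn(nn.Module):
    """Illustrative stand-in for the I3D backbone: maps a clip of shape
    (B, 3, 16, 224, 224) to a 1024-d spatio-temporal feature, as in the text."""
    def __init__(self, out_dim=1024):
        super().__init__()
        self.stem = nn.Conv3d(3, 64, kernel_size=7, stride=(2, 4, 4), padding=3)
        self.pool = nn.AdaptiveAvgPool3d(1)   # final average pooling over T, H, W
        self.proj = nn.Linear(64, out_dim)

    def forward(self, clip):
        x = self.pool(self.stem(clip)).flatten(1)   # (B, 64)
        return self.proj(x)                          # (B, 1024)

clip = torch.randn(1, 3, 16, 224, 224)   # [channels, frames, width, height]
alpha_n = I3DStandIn()(clip)             # spatio-temporal feature of size [1, 1024]
```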
FIG. 4 is a self-attention module architecture according to an embodiment of the present invention;
in order to extract action information at different positions in a video, an action quality evaluation algorithm usually generates a plurality of video segment sequences in a segmented manner to extract features, and then an average pooling is used to achieve the purpose of feature aggregation. Inspired by the seq2seq task in natural language processing, this can be considered as a seq2seq process, using a sequence model. Considering that parallel output cannot be realized in the RNN calculation process, and long-distance relationships are still difficult to obtain, the multi-head self-attention mechanism is used for calculating the relationship between video segment contexts. The attention mechanism can be understood as that a group of sequences are used as input, the dot product operation is carried out by using Query and Key mapped by a linear layer, the result and Value are subjected to weighted summation, and a group of vector sequences with weights among all the input sequences are output.
The spatio-temporal feature α_n of a given clip is input to the multi-head attention module. Three linear layers output a Query and a Key of dimension d_k, denoted q_n and k_n, and a Value of dimension d_v, denoted v_n; the Query is used to match Keys, and the Value represents the information extracted from the input spatio-temporal feature α_n. The dot product of q_n with the Key of each other clip m in the sequence is then computed, the result being denoted p_{n,m}, with m ∈ {1, …, n}. To keep the values from growing so large that the subsequent activation function saturates at a constant 0 or 1, each dot product is divided by \sqrt{d_k}. A Softmax over the scaled results gives the weights of the Values of the sequence clips, and the weighted combination with the Values finally yields the self-attention feature β_n of the video clip:

\beta_n = \sum_{m=1}^{n} \mathrm{softmax}_m\left(\frac{p_{n,m}}{\sqrt{d_k}}\right) v_m
Compared with extracting a single Query, Key, and Value, the present algorithm uses linear layers to extract two Queries, Keys, and Values from each clip in parallel. As shown on the right of Fig. 4, the spatio-temporal feature α_n is input simultaneously to a positive-correlation head (Positive Head) and a negative-correlation head (Negative Head). In the example, the clip of the athlete falling has a very high positive correlation with its neighboring clips, while its correlation with the first few clips is very low, since the clips at the head of the sequence tend to be irrelevant background content and may even be negatively correlated. This amounts to adding a dimension that contains both positively and negatively correlated features among the many clips of the sequence. The computation then proceeds exactly as in the single self-attention mechanism above, and the results of the two self-attention heads are concatenated to obtain the attention feature of the clip. In the actual computation, the video clip sequences are grouped into matrices, and the results for all clips are usually obtained with the parallel operation shown below:

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\left(\mathrm{softmax}\left(\frac{Q_{pos} K_{pos}^{T}}{\sqrt{d_k}}\right) V_{pos},\ \mathrm{softmax}\left(\frac{Q_{neg} K_{neg}^{T}}{\sqrt{d_k}}\right) V_{neg}\right)

where Q, K, and V denote the matrices stacked from the Query, Key, and Value obtained by linearly mapping each clip's spatio-temporal feature α_n, the subscripts pos and neg denote the parameters of the positive- and negative-correlation heads respectively, and d_k denotes the dimension of K. Finally, the self-attention feature sequence F_β = {β_1, β_2, …, β_n} with context information is obtained with reference to all the video clips.
FIG. 5 is a diagram illustrating a multi-layered perceptron structure in a label distribution learning module according to an embodiment of the present invention;
in the label distribution learning module, a training sample is assumed, and the true label is s. In order to obtain the true distribution s of the sample true Firstly, a gaussian equation with the sample mean as the label score s and the variance as σ is generated (in the experimental part, the ablation experiment is performed on the distribution function selection), as shown in the following formula:
Figure BDA0003235591190000063
where σ is both a variance and a hyperparameter, the uncertainty of how good an action is evaluated. The label score interval is then uniformly discretized into a set of scores c = { c = { (c) } 1 ,c 2 ,…,c m The value range of the true label of the MTL-AQA action quality assessment dataset is 0 to 10, and the predicted score interval is set to 0.5, i.e., [0,0.5, …,9.5,10]And thus the output dimension m takes 21. Then using the vector g c ={g(c 1 ),g(c 2 ),…,g(c m ) Describing the degree (i.e. probability) of each score, for g c The following normalization is performed to obtain the true distribution s of the training samples true ={p true (c 1 ),p true (c 2 ),…,p true (c m )}。
Figure BDA0003235591190000064
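A short sketch of this label-to-distribution conversion (the constant factor 1/(\sqrt{2\pi}\,σ) cancels in the normalization, so the sketch omits it):

```python
import torch

def score_to_distribution(s, sigma=1.0, lo=0.0, hi=10.0, step=0.5):
    """Discretize the score range into c = [0, 0.5, ..., 10] (m = 21 bins),
    evaluate the Gaussian centered on the true label s, and normalize,
    as described above. sigma is the uncertainty hyper-parameter."""
    c = torch.arange(lo, hi + step, step)             # score set c_1 ... c_m
    g = torch.exp(-(c - s) ** 2 / (2 * sigma ** 2))   # g_c before normalization
    return c, g / g.sum()                             # p_true sums to 1

c, s_true = score_to_distribution(s=8.5, sigma=1.0)
print(c.numel(), s_true.sum())   # 21 tensor(1.)
```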
To learn the predicted distribution s_pre, the self-attention feature sequence F_β = {β_1, β_2, …, β_n} of the n clips learned by the multi-head self-attention mechanism is concatenated to form one large self-attention feature β'. β' is input to the multi-layer perceptron shown in Fig. 5, which converts its size to m, the same as that of s_true; a ReLU activation function adds a nonlinear factor, and a Softmax function then produces the fine-grained predicted distribution s_pre = {p_pre(c_1), p_pre(c_2), …, p_pre(c_m)}.
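A sketch of one such perceptron head follows (the hidden width of 256 is an assumption; the disclosure does not fix the internal layer sizes):

```python
import torch
import torch.nn as nn

class LDLHead(nn.Module):
    """Sketch of one label-distribution MLP head: concatenated self-attention
    features -> linear layers with ReLU -> Softmax over the m = 21 score bins."""
    def __init__(self, n_clips=10, feat_dim=1024, m=21, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_clips * feat_dim, hidden),
            nn.ReLU(),                        # the nonlinear factor from the text
            nn.Linear(hidden, m),
        )

    def forward(self, f_beta):                # f_beta: (B, n_clips, feat_dim)
        logits = self.mlp(f_beta.flatten(1))  # β': concatenated features
        return torch.softmax(logits, dim=-1)  # fine-grained predicted distribution

head = LDLHead()
s_pre = head(torch.randn(4, 10, 1024))        # (4, 21), each row sums to 1
```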
in the MTL-AQA action quality evaluation data set, labels are given by a plurality of referees, and the final score is calculated by removing two maximum values and two minimum values from all referee scores, summing the rest scores and multiplying by a difficulty coefficient. Therefore, after the self-attention model, K multi-layer perceptrons are trained in parallel to obtain K true distributions
Figure BDA0003235591190000071
Figure BDA0003235591190000072
And the predicted distribution
Figure BDA0003235591190000073
Can be represented as s true,k And s pre,k . Since the label of the label distribution learning is probability distribution, and the Kullback-Leibler divergence is called information divergence or relative entropy and is used for measuring the asymmetry measurement of the difference between the two probability distributions, s is calculated by using the KL divergence as the loss function of the label distribution learning in the training stage true,k And s pre,k The Loss between the two probability distributions is optimized by minimizing KL Loss by using a gradient descent method, so that the difference between the two probability distributions is minimized, namely, the more similar the predicted distribution and the real distribution, the better. The loss function is shown as follows:
Figure BDA0003235591190000074
wherein, L ({ s) pre,k Denotes the overall loss for each training sample, s pre,k Representing the k-th prediction distribution, s, in the sample true,k Representing the kth true distribution, p, of the sample pre (c i,k ) Representing the probability of the score at the ith scoring position in the kth predicted distribution. p is a radical of true (c i,k ) Representing the score probability of the ith score position in the kth true distribution, the present invention optimizes the above-mentioned loss function using an Adam optimizer.
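A sketch of this loss with PyTorch's built-in KL divergence (K = 7 judges matches the MTL-AQA diving setting; the random target distributions stand in for the Gaussian label distributions built above):

```python
import torch
import torch.nn.functional as F

def ldl_kl_loss(pred_logits, true_dists):
    """KL-divergence loss between the K predicted and K true judge-score
    distributions, summed over judges as in the formula above.
    pred_logits: (K, m) raw MLP outputs; true_dists: (K, m) label distributions."""
    log_pred = F.log_softmax(pred_logits, dim=-1)   # log s_pre,k
    # F.kl_div takes log-probabilities as input and probabilities as target;
    # reduction='sum' gives sum_k sum_i p_true * (log p_true - log p_pre).
    return F.kl_div(log_pred, true_dists, reduction='sum')

pred_logits = torch.randn(7, 21, requires_grad=True)    # K = 7 judges, m = 21 bins
true_dists = torch.softmax(torch.randn(7, 21), dim=-1)  # stand-in label distributions
loss = ldl_kl_loss(pred_logits, true_dists)
loss.backward()   # optimized with Adam in the embodiment
```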
In the test stage, the score with the highest probability in each predicted distribution s_pre,k is selected as the final predicted score s_final,k of the k-th judge; the two highest and the two lowest of all predicted judge scores are then removed, and the sum of the remaining scores is multiplied by the difficulty coefficient of the known label to obtain the final predicted score s_final:

s_{final} = DD \times \sum_{k \in U} s_{final,k}

where DD denotes the difficulty coefficient of the sample and U denotes the set of all scores remaining after the two highest and the two lowest are rejected.
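A sketch of this test-time aggregation:

```python
import torch

def final_score(pred_dists, c, dd):
    """Arg-max score from each of the K predicted judge distributions,
    drop the two highest and two lowest, multiply the sum of the rest
    by the known difficulty coefficient DD, per the formula above."""
    judge_scores = c[pred_dists.argmax(dim=-1)]   # s_final,k for each judge
    kept = judge_scores.sort().values[2:-2]       # reject two maxima, two minima
    return dd * kept.sum()

c = torch.arange(0, 10.5, 0.5)                          # the 21 score bins
pred_dists = torch.softmax(torch.randn(7, 21), dim=-1)  # K = 7 predicted distributions
print(final_score(pred_dists, c, dd=3.2))
```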
FIG. 6 is a comparison of the present method and other methods provided by embodiments of the present invention in the MTL-AQA data set;
using the self-attentive mechanism and signature distribution learning based action quality assessment model presented herein the present method performs comparative experiments on MTL-AQA datasets with other mainstream action quality assessment models. Compared with models with different structures and depths, the method has the advantage that the influence of the algorithm on the action quality evaluation task is analyzed. The experiment takes the spearman grade correlation coefficient as an evaluation index, the experimental result shows that the posture characteristic extraction method based on manual design has the worst effect, the method using the regression model takes the final score as a supervision label, is superior to most methods at present, predicts the score by using the regression method, is superior to the existing action quality evaluation algorithm based on regression, proves the effectiveness of the self-attention mechanism with positive and negative correlation heads, shows that the self-attention module and the mark distribution learning module greatly improve the invention, and the method using the mark distribution learning is superior to the former, has the spearman grade correlation coefficient reaching 0.9384, and is superior to the standard method with the best effect at present. The experimental results fully indicate the effectiveness of the method. The experimental results show that the method proposed and improved herein has better generalization than other models and demonstrates the effectiveness of the method.
FIG. 7 is a comparison of the present method and other methods provided by embodiments of the present invention in a JIGSAWS data set;
the method and other reference methods are tested in three scenes of knotting (knotting), needle threading (Needle paging) and Suturing (Suturing) in the JIGSAWS data set. The number of video frames of the jitswaws dataset dynamically changes with the sample video length, so 160 frames are randomly sampled as input to the model, which then divides the video into 10 segments of 16 frames each, as with the MTL-AQA dataset. Comparing the experimental results with the structures of the reference methods, the results show that the method has better performance in the tasks of suturing (0.7806) and threading (0.8040) than all other reference methods, the average Spireman grade coefficient in the three tasks is 0.7762, and the experimental results fully prove the effectiveness of the method in the task of motion quality evaluation.
FIG. 8 is a visualization of assessment results provided by embodiments of the present invention;
and visualizing the quality evaluation result of the method by using a scatter diagram, wherein each scatter represents a prediction sample used by the method, the y-axis represents a prediction score, the x-axis represents a real score, and the real sample is represented by a dotted line. The more concentrated the scattered point distribution is, the closer the prediction result is to the real sample, and the higher the model is accurate. The action quality evaluation result based on self-attention and mark distribution learning is close to real sample data, and the effectiveness of the method is fully proved by the experimental result.

Claims (8)

1. An action quality assessment method based on self-attention and label distribution learning, characterized by comprising the following steps:
step 1, preprocessing the video: downsampling the original video to obtain an input video of total length L, V = {F_1, F_2, …, F_L}, where F_l denotes the l-th frame; then segmenting it into n mutually overlapping clips C = {C_1, C_2, …, C_n}, where C_n denotes the n-th clip, each clip containing M frames, and every frame in every clip being downsampled and data-enhanced;
step 2, inputting each video clip C preprocessed in step 1 to a feature extraction module to generate the spatio-temporal feature sequence F_α = {α_1, α_2, …, α_n}, where α_n denotes the spatio-temporal feature of the n-th clip;
step 3, inputting the spatio-temporal feature sequence F_α of the video clips to a self-attention module to obtain the self-attention feature sequence F_β = {β_1, β_2, …, β_n} containing the context information between the sequence elements, where β_n denotes the self-attention feature of the n-th clip;
step 4, concatenating the self-attention feature sequence F_β and inputting it to a label distribution learning module, which outputs the predicted distribution s_pre = {p_pre(c_1), p_pre(c_2), …, p_pre(c_m)}, where p_pre(c_i) denotes the predicted probability of the score c_i;
step 5, converting the true label s, whose value ranges from 0 to m, into the true distribution s_true = {p_true(c_1), p_true(c_2), …, p_true(c_m)} using a Gaussian function, where p_true(c_i) denotes the true probability of the score c_i;
step 6, computing the loss between the predicted distribution obtained in step 4 and the true distribution obtained in step 5, minimizing the loss, and training the model;
step 7, evaluating the test videos with the model trained in step 6 to obtain the predicted distribution on the test set and, from it, the evaluation scores on the test data set.
2. The action quality assessment method based on self-attention and label distribution learning according to claim 1, characterized in that: in step 1, the original video is downsampled and divided into 10 clips of 16 frames each; downsampling refers to center-cropping each picture, and data enhancement refers to randomly selecting one of nearest-neighbor interpolation, bilinear interpolation, bicubic interpolation, and the Lanczos algorithm to process each frame of each clip.
3. The action quality assessment method based on self-attention and label distribution learning according to claim 1, characterized in that: the feature extraction module in step 2 consists of a three-dimensional convolutional neural network, I3D, with multiple receptive fields, comprising three-dimensional convolution layers, three-dimensional max-pooling layers, a three-dimensional average-pooling layer, and Inception layers; each Inception layer comprises several convolution layers and a max-pooling layer; these modules are connected according to the I3D network structure to obtain the spatio-temporal feature sequence F_α.
4. The action quality assessment method based on self-attention and label distribution learning according to claim 1, characterized in that: the self-attention module in step 3 consists of two self-attention heads, one positively and one negatively correlated; for each clip, a self-attention head linearly maps the input spatio-temporal feature α_n to a Query (q_n), a Key (k_n), and a Value (v_n), the Query being used to match Keys and the Value representing the information extracted from the input spatio-temporal feature α_n; dot products with the other clips, scaling, and a Softmax function then generate the self-attention feature sequence F_β, extracted with reference to all clips.
5. The action quality assessment method based on self-attention and label distribution learning according to claim 1, characterized in that: in step 4, the label distribution learning module comprises a multi-layer perceptron; the self-attention feature sequences F_β of all video clips are concatenated, passed through the multi-layer perceptron, and finally turned into the predicted distribution by a Softmax.
6. The action quality assessment method based on self-attention and label distribution learning according to claim 1, characterized in that: the specific method of converting the true label into the true distribution in step 5 is: generating the true distribution with the sample's true label s as the mean and the hyper-parameter σ as the variance.
7. The action quality assessment method based on self-attention and label distribution learning according to claim 1, characterized in that: in step 6, a KL-divergence loss function is used between the predicted distribution s_pre and the true distribution s_true, the overall loss function being

L(\{s_{pre}\}) = \mathrm{KL}(s_{true} \| s_{pre}) = \sum_{i=1}^{m} p_{true}(c_i) \log \frac{p_{true}(c_i)}{p_{pre}(c_i)}

where L({s_pre}) denotes the overall loss of each training sample; the above loss function is optimized with an Adam optimizer.
8. The action quality assessment method based on self-attention and label distribution learning according to claim 1, characterized in that: in step 7, the Spearman rank correlation coefficient is adopted as the evaluation index.
CN202111000981.6A 2021-08-30 2021-08-30 Action quality evaluation method based on self-attention and label distribution learning Active CN113642513B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111000981.6A CN113642513B (en) 2021-08-30 2021-08-30 Action quality evaluation method based on self-attention and label distribution learning


Publications (2)

Publication Number Publication Date
CN113642513A CN113642513A (en) 2021-11-12
CN113642513B (en) 2022-11-18

Family

ID=78424319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111000981.6A Active CN113642513B (en) 2021-08-30 2021-08-30 Action quality evaluation method based on self-attention and label distribution learning

Country Status (1)

Country Link
CN (1) CN113642513B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111784121A (en) * 2020-06-12 2020-10-16 清华大学 Action quality evaluation method based on uncertainty score distribution learning
CN112085102A (en) * 2020-09-10 2020-12-15 西安电子科技大学 No-reference video quality evaluation method based on three-dimensional space-time characteristic decomposition
CN112954312A (en) * 2021-02-07 2021-06-11 福州大学 No-reference video quality evaluation method fusing spatio-temporal characteristics


Also Published As

Publication number Publication date
CN113642513A (en) 2021-11-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant