CN113642513A - Action quality evaluation method based on self-attention and label distribution learning

Action quality evaluation method based on self-attention and label distribution learning

Info

Publication number: CN113642513A
Application number: CN202111000981.6A
Authority: CN (China)
Prior art keywords: attention, self, distribution, video, true
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Other versions: CN113642513B (en)
Inventors: 张宇, 米思娅, 徐天宇
Assignee (current and original): Southeast University (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Application filed by Southeast University; priority to CN202111000981.6A; published as CN113642513A; granted and published as CN113642513B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an action quality evaluation method based on self-attention and label distribution learning. First, the video is preprocessed and each video clip is input to a feature extraction module, which generates the spatio-temporal feature of each clip. The spatio-temporal features of the video clips are then input as a sequence to a self-attention module to obtain self-attention features containing the context information between sequence elements. All self-attention features are concatenated and input to a label distribution learning module, which outputs a predicted distribution. Next, the real label is converted into a true distribution using a Gaussian function; the loss between the predicted distribution and the true distribution is computed and minimized to train the model. Finally, the trained model evaluates the test videos, yielding the predicted distributions of the test set and, from them, the evaluation scores on the test dataset. Using the Spearman rank correlation coefficient as the evaluation index, the method obtains better evaluation results, demonstrating the effectiveness of the action quality evaluation method.

Description

Action quality evaluation method based on self-attention and label distribution learning
Technical Field
The application relates to the field of computer vision, and in particular to an action quality assessment method based on self-attention and label distribution learning.
Background
Video Action Quality Assessment (AQA) aims to assess the performance and completion quality of a specific action in a video. Automated action quality assessment effectively reduces the cost of human labor and can evaluate video content more accurately and fairly. The technology has potential value and wide application in fields such as skill teaching, sports competition, and medical surgery, and has become an attractive new research topic in computer vision.
Over the past few years, a number of AQA methods have been proposed. Most simply treat quality evaluation as a regression problem, regressing the extracted features to directly produce a predicted action score, or learn quality features through pairwise comparison. Both approaches have limited effect: the first ignores the inherent ambiguity of the labels, i.e., different judges give different scores and a given score is subjective, while the second introduces uncertainty through the choice of reference samples. In addition, most existing methods divide a video into several discrete video sequences and extract features with three-dimensional convolutions of a fixed receptive field size. These methods face the multi-scale spatio-temporal feature problem: different videos differ in subject scale in the spatial dimension and in duration and execution rate in the temporal dimension, which limits the model's ability to understand the samples and degrades the quality evaluation. To solve the above problems, an action quality assessment method based on self-attention and label distribution learning is proposed.
Disclosure of Invention
Purpose of the invention: a new video action quality evaluation model is designed that uses the label distribution learning method to predict the score distribution, so that the model can better deal with label ambiguity. By using a self-attention module with positive-correlation and negative-correlation heads, the model can learn the positive and negative correlation of each video clip in the sequence with the whole sample video, which enhances the context information of the video clips and addresses the multi-scale spatio-temporal problem in traditional action quality evaluation tasks.
The technical scheme is as follows: an action quality assessment method based on self-attention and label distribution learning, characterized by comprising the following steps:
Step one, video preprocessing: down-sample the original video to obtain an input video of total length L, V = {F_1, F_2, …, F_L}, where F_l denotes the l-th frame; then segment it into n mutually overlapping clips C = {C_1, C_2, …, C_n}, where C_n denotes the n-th clip and each clip contains M frames; each frame in each clip is down-sampled and data-enhanced;
Step two, input each video clip C preprocessed in step one into the feature extraction module to generate the spatio-temporal feature sequence of the clips, F_α = {α_1, α_2, …, α_n}, where α_n denotes the spatio-temporal feature of the n-th clip;
Step three, input the spatio-temporal feature sequence F_α of the video clips as a sequence into the self-attention module to obtain the self-attention feature sequence F_β = {β_1, β_2, …, β_n} containing the context information between sequence elements, where β_n denotes the self-attention feature of the n-th clip;
Step four, concatenate the self-attention feature sequence F_β and input it to the label distribution learning module, which outputs a predicted distribution s_pre = {p_pre(c_1), p_pre(c_2), …, p_pre(c_m)}, where p_pre(c_i) denotes the predicted probability of the score c_i;
Step five, convert the true label S, whose value range is 0 to m, into a true distribution s_true = {p_true(c_1), p_true(c_2), …, p_true(c_m)} using a Gaussian function, where p_true(c_i) denotes the true probability of the score c_i;
Step six, compute the loss function between the predicted distribution obtained in step four and the true distribution obtained in step five, minimize the loss, and train the model;
Step seven, evaluate the test videos with the model trained in step six to obtain the predicted distributions of the test set, and from them the evaluation scores on the test dataset.
Further, in step one, the original video is down-sampled and divided into 10 clips of 16 frames each; down-sampling refers to center-cropping the picture, and data enhancement refers to using a random number to select one of nearest-neighbor interpolation, bilinear interpolation, bicubic interpolation, and the Lanczos algorithm to process each frame in each clip, as sketched below.
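A minimal sketch of this preprocessing in Python with OpenCV, whose resize flags match the four interpolation methods named above; the even spacing of the clip start indices and the square center crop are illustrative assumptions:

```python
import random
import cv2
import numpy as np

# The four resampling methods named above map directly onto OpenCV flags.
INTERP_FLAGS = [cv2.INTER_NEAREST, cv2.INTER_LINEAR, cv2.INTER_CUBIC, cv2.INTER_LANCZOS4]

def preprocess(frames, n_clips=10, clip_len=16, size=224):
    """frames: list of H x W x 3 uint8 arrays (the temporally down-sampled video)."""
    # Evenly spaced start indices give n partially overlapping clips (assumption).
    starts = np.linspace(0, len(frames) - clip_len, n_clips).astype(int)
    clips = []
    for s in starts:
        clip = []
        for f in frames[s:s + clip_len]:
            h, w = f.shape[:2]
            side = min(h, w)
            top, left = (h - side) // 2, (w - side) // 2
            patch = f[top:top + side, left:left + side]       # center crop
            flag = random.choice(INTERP_FLAGS)                # random resampling method
            clip.append(cv2.resize(patch, (size, size), interpolation=flag))
        clips.append(np.stack(clip))                          # (16, 224, 224, 3)
    return np.stack(clips)                                    # (10, 16, 224, 224, 3)
```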
Further, the feature extraction module in step two is composed of the multi-receptive-field three-dimensional convolutional neural network I3D, comprising three-dimensional convolutional layers, a three-dimensional max-pooling layer, a three-dimensional average-pooling layer, and Inception layers; each Inception layer comprises several convolutional layers and a max-pooling layer; these modules are connected according to the I3D network structure to obtain the spatio-temporal feature sequence F_α.
Further, the self-attention module in step three is composed of two self-attention heads, one positively and one negatively correlated. For each clip, a self-attention head linearly maps the input spatio-temporal feature α_n to a Query (q_n), a Key (k_n), and a Value (v_n); the Query is used to match Keys, and the Value represents the information extracted from the input spatio-temporal feature α_n. Dot products with the other clips, scaling, and a Softmax function then generate the self-attention feature sequence F_β, extracted with reference to all clips; a code sketch follows.
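A minimal PyTorch sketch of such a two-head self-attention over the clip feature sequence; the feature and head dimensions (1024 inputs, d_k = d_v = 256) are assumptions, and the two heads are architecturally identical, with their positive/negative specialization left to training:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadSelfAttention(nn.Module):
    def __init__(self, in_dim=1024, d_k=256, d_v=256):
        super().__init__()
        self.d_k, self.d_v = d_k, d_v
        # One Query/Key/Value projection triple per head (positive and negative).
        self.heads = nn.ModuleList([nn.Linear(in_dim, 2 * d_k + d_v) for _ in range(2)])

    def forward(self, x):                                        # x: (n, in_dim)
        outs = []
        for proj in self.heads:
            q, k, v = proj(x).split([self.d_k, self.d_k, self.d_v], dim=-1)
            w = F.softmax(q @ k.t() / self.d_k ** 0.5, dim=-1)   # (n, n) attention weights
            outs.append(w @ v)                                   # weighted sum of Values
        return torch.cat(outs, dim=-1)                           # F_beta: (n, 2 * d_v)

f_beta = TwoHeadSelfAttention()(torch.randn(10, 1024))           # (10, 512)
```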
Further, in step four the label distribution learning module comprises a multi-layer perceptron, which concatenates the self-attention feature sequences F_β of the video clips, passes them through multi-layer linear perceptron layers, and finally generates the predicted distribution through Softmax.
Further, the specific method for converting the true label into the true distribution in step five is as follows: generate the true distribution with the sample's true label S as the mean and the hyperparameter σ as the variance.
Further, in step six, the KL divergence loss function is used for the predicted distribution s_pre and the true distribution s_true, and the overall loss function is expressed as:

L({s_pre}) = Σ_{i=1}^{m} p_true(c_i) · log( p_true(c_i) / p_pre(c_i) )

wherein L({s_pre}) denotes the overall loss of each training sample; the above loss function is optimized with an Adam optimizer.
Further, in step seven, the Spearman rank correlation coefficient is adopted as the evaluation index.
Beneficial effects: compared with the prior art, this deep learning method for action quality evaluation can extract temporal self-attention features between video sequences, increasing the weight of the video clips positively correlated with the evaluation result and reducing the weight of the negatively correlated clips, thereby improving the evaluation accuracy of the model. In addition, the method can acquire multi-scale spatio-temporal context information between video sequences and generate fine-grained predicted distributions. The following examples show that the invention can effectively learn discriminative high-level action features for action quality evaluation, and that the proposed method performs well on several behavior assessment datasets.
Drawings
Fig. 1 is a flowchart of an action quality evaluation method based on self-attention and marker distribution learning according to an embodiment of the present invention;
fig. 2 is a diagram of an action quality evaluation model structure based on self-attention and marker distribution learning according to an embodiment of the present invention;
FIG. 3 is a block diagram of a feature extraction module according to an embodiment of the present invention;
FIG. 4 is a self-attention module architecture according to an embodiment of the present invention;
FIG. 5 is a block diagram of a label distribution learning module according to an embodiment of the present invention;
FIG. 6 is a comparison of the present method and other methods provided by embodiments of the present invention in the MTL-AQA data set;
FIG. 7 is a comparison of the present method and other methods provided by embodiments of the present invention in a JIGSAWS data set;
FIG. 8 is a visualization of assessment results provided by embodiments of the present invention;
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment applies the self-attention mechanism and label distribution learning to action quality evaluation on the MTL-AQA dataset. MTL-AQA is a multi-label Olympic diving dataset whose labels comprise the judge scores, the final score, the difficulty coefficient, the viewing angle, the action type, competition commentary, and so on. The final score is the sum of all judge scores, after removing the two highest and the two lowest, multiplied by the difficulty coefficient. The number of samples is 1412, and each sample is sampled to 103 frames. This example splits the data into training and test sets at a ratio of 3 to 1. During training, the I3D model of the feature extraction module is pre-trained on the Kinetics dataset. The number of training epochs is set to 100, the learning rate of the Adam optimizer is 10^-4, the weight decay rate is set to 10^-5, the training batch size is 4, and the test batch size is 20. The trained model evaluates the test videos to obtain the predicted distributions and, from them, the evaluation scores on the test dataset. The technical solution of the invention is explained below with reference to the accompanying drawings and specific examples.
Fig. 1 shows the flowchart of the action quality assessment method based on self-attention and label distribution learning provided by the embodiment of the present invention, which includes the following steps:
Step 1: down-sample the original video of total length L to obtain a video V, divide V into n clips {C_1, C_2, …, C_n}, and apply cropping and data enhancement.
Step 2: input each video clip into the I3D model to generate the spatio-temporal feature sequence of all clips, F_α = {α_1, α_2, …, α_n}, where α_n denotes the spatio-temporal feature of the n-th clip.
Step 3: input the spatio-temporal feature sequence F_α into the self-attention module to obtain the self-attention feature sequence F_β = {β_1, β_2, …, β_n} containing context information, where β_n denotes the self-attention feature of the n-th clip.
Step 4: concatenate the self-attention feature sequence F_β, input it to the multi-layer perceptron, and generate the predicted distribution s_pre.
Step 5: in the test stage, select the score with the highest probability in s_pre as each judge score, remove the two highest and the two lowest judge scores, multiply the sum of the rest by the difficulty coefficient DD, output the final score s_final, and finish.
Step 6: in the training stage, convert the true label into the true distribution s_true.
Step 7: compute the loss between s_true and s_pre using the KL divergence loss function, and train the model using the Adam optimizer.
Step 8: if the number of training iterations is less than 100, return to step 2; otherwise, finish.
Fig. 2 shows a structure diagram of an action quality assessment model based on self-attention and marker distribution learning according to an embodiment of the present invention, which is described in detail as follows:
the invention mainly comprises three parts: the system comprises a feature extraction module, a multi-head self-attention module and a mark distribution learning module. The algorithm flow of the invention is described below, a sample video with the total length of L is given, a video segment V with 103 frames is obtained by downsampling, the video segment V is segmented into 10 partially overlapped video segments, each segment comprises 16 frames, data enhancement is performed, and the data enhancement is input to I3D for three-dimensional feature extraction, so that space-time features are generated. Then, extracting positive correlation and negative correlation self-attention features between sequences by using a multi-head self-attention mechanism, splicing all the features, inputting the features into a multi-layer perceptron to transform dimensionality, and finally generating prediction label distribution by using a Softmax layer, wherein fig. 2 is a schematic network structure diagram of the action quality evaluation method based on self-attention and label distribution learning provided by the invention.
FIG. 3 is a block diagram of a feature extraction module according to an embodiment of the present invention;
the characteristic extraction module is mainly composed of a three-dimensional convolutional neural network I3D with multiple receptive fields, and is provided with 4 three-dimensional convolutional layers (Conv3d), 4 maximum pooling layers (Maxpool3d), 1 average pooling layer (Avgpool3d) and 9 inclusion layers (Inc), the convolution and pooling steps of different dimensions are different, except the last layer of convolution layer, batch normalization is carried out behind other convolutional layers in the model, and as the ReLU activation function has the function of unilateral inhibition and has the characteristic of sparse activation, part of parameters can be reduced, the purpose of preventing overfitting is achieved, relatively wide excitation boundaries can be realized, and any input characteristic can be activated; since the gradient is constant, no phenomena of gradient disappearance or gradient explosion occur, the ReLU is used as an activation function, which is not shown in fig. 3 for the sake of simplicity of representation of the model. The Inc in the figure represents an Incep layer, the structure of the Incep layer is shown in the figure, after convolution is carried out on convolution kernels with different receptive field sizes of 3 x 3 and 1 x 1, all scale features are spliced to form a feature which can represent more information and has more depth, and therefore the problem of multi-scale space-time features is solved.
The specific process of the feature extraction module can be described as follows: in a segment of size [3,16,224 ]]The video Clip of (1) is input in the format of channel number, frame number, width and height, and the unit is frame and pixel, after passing through each layer, the final convolution is performed in time, and the average pooling is obtained to be the size of [1,1024 ]]Space-time feature of (a)nThe sequence of spatio-temporal features of all video segments can be denoted as Fa={α12,…,αn}. Due to the versatility of the I3D model, the I3D model of the present algorithm was pre-trained on the Kinetics dataset.
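For orientation, a shape-level stand-in with the same input/output contract as the I3D backbone (one [3, 16, 224, 224] clip in, a 1024-dimensional feature out); this toy module is a hypothetical stub, not the real Inception-3D architecture:

```python
import torch
import torch.nn as nn

class ToyI3D(nn.Module):
    """Stand-in honouring the I3D contract: clip [B, 3, 16, 224, 224] -> [B, 1024]."""
    def __init__(self, out_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(64, out_dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),   # final average pooling to a vector
        )

    def forward(self, x):
        return self.net(x)

clips = torch.randn(10, 3, 16, 224, 224)    # the 10 preprocessed clips of one video
f_alpha = ToyI3D()(clips)                    # F_alpha: (10, 1024) spatio-temporal features
```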
FIG. 4 is a self-attention module architecture according to an embodiment of the present invention;
In order to extract the action information at different positions in a video, action quality evaluation algorithms usually generate several video clip sequences by segmentation, extract features from them, and then aggregate the features by average pooling. Inspired by the seq2seq task in natural language processing, this can be regarded as a seq2seq process and handled with a sequence model. Considering that RNNs cannot be computed in parallel and still have difficulty capturing long-distance relationships, the multi-head self-attention mechanism is used to compute the relationships between video clip contexts. The attention mechanism can be understood as follows: a group of sequences is taken as input, a dot-product operation is performed between the Query and Key produced by linear-layer mappings, the result is used to compute a weighted sum of the Values, and a group of vector sequences weighted over all input positions is output.
The spatio-temporal feature α_n of a clip is input to the multi-head attention module. Three linear layers output a Query and a Key of dimension d_k, denoted q_n and k_n, and a Value of dimension d_v, denoted v_n; the Query is used to match Keys, and the Value represents the information extracted from the input spatio-temporal feature α_n. The dot product of q_n with the Key of every clip m in the sequence is then computed, with the result denoted p_{n,m}, m ∈ {1, …, n}. To prevent overly large values, which would make the result after the subsequent activation function constantly 0 or 1, each dot product is divided by √d_k. The scaled results are passed through a Softmax function to obtain the weights of the clips' Values, and the weighted sum over the Values yields the self-attention feature β_n of the video clip:

β_n = Σ_{m=1}^{n} Softmax_m( p_{n,m} / √d_k ) · v_m,  with p_{n,m} = q_n · k_m
Compared with extracting a single Query, Key, and Value, the present algorithm uses linear layers to extract two Queries, Keys, and Values from each clip in parallel. As shown on the right of FIG. 4, the spatio-temporal feature α_n is input to a positive-correlation head (Positive Head) and a negative-correlation head (Negative Head) simultaneously. In the example, the clip in which the athlete is falling has a very high positive correlation with its neighboring clips, while its correlation with the first few clips is very low, since the clips at the head of the sequence tend to be irrelevant background content, or even negatively correlated. This in effect adds a dimension that captures both the positively and the negatively correlated features between the clips of the sequence. The computation then proceeds exactly as in the single self-attention mechanism, and the results of the two heads are concatenated to form the attention feature of the clip. In the actual computation, the clip sequences are stacked into matrices and all clip results are obtained in parallel, as shown in the following formula:

F_β = Concat( Softmax(Q_pos · K_pos^T / √d_k) · V_pos , Softmax(Q_neg · K_neg^T / √d_k) · V_neg )

In the formula, Q, K, and V denote the matrices stacked from the Query, Key, and Value obtained by linearly mapping the spatio-temporal feature α_n of each clip, the subscripts pos and neg denote the parameters of the positive-correlation and negative-correlation heads, and d_k denotes the dimension of K. The result is the self-attention feature sequence F_β = {β_1, β_2, …, β_n}, obtained with reference to all video clips and containing context information.
FIG. 5 is a diagram illustrating the multi-layer perceptron structure in the label distribution learning module according to an embodiment of the present invention;
In the label distribution learning module, assume a training sample whose true label is s. To obtain the true distribution s_true of the sample, a Gaussian function with the label score s as the mean and σ as the variance is first generated (the experimental section includes an ablation study on the choice of distribution function), as shown in the following formula:

g(c) = (1 / (√(2π) · σ)) · exp( -(c - s)² / (2σ²) )

where σ is both the variance and a hyperparameter describing the uncertainty of evaluating how good an action is. The label score interval is then uniformly discretized into a set of scores c = {c_1, c_2, …, c_m}; the true labels of the MTL-AQA action quality assessment dataset range from 0 to 10 and the prediction score interval is set to 0.5, i.e., [0, 0.5, …, 9.5, 10], so the output dimension m is 21. The vector g_c = {g(c_1), g(c_2), …, g(c_m)} then describes the degree (i.e., probability) of each score; normalizing g_c as follows yields the true distribution of the training sample, s_true = {p_true(c_1), p_true(c_2), …, p_true(c_m)}:

p_true(c_i) = g(c_i) / Σ_{j=1}^{m} g(c_j)
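A small sketch of this discretized Gaussian in Python (NumPy); σ = 1.0 is an illustrative choice for the hyperparameter:

```python
import numpy as np

# Discretised Gaussian label distribution (s_true): bins 0, 0.5, ..., 10 (m = 21),
# mean = ground-truth score s, std sigma (hyperparameter; 1.0 is an assumption).
def true_distribution(s, sigma=1.0, lo=0.0, hi=10.0, step=0.5):
    c = np.arange(lo, hi + step, step)                          # score set c_1 .. c_m
    g = np.exp(-(c - s) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return c, g / g.sum()                                       # normalise to p_true

bins, s_true = true_distribution(s=8.5)
print(bins.size, s_true.sum())                                  # 21 1.0
```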
To learn the predicted distribution s_pre, the n self-attention features of the clips learned by the multi-head self-attention mechanism, F_β = {β_1, β_2, …, β_n}, are concatenated to form one large self-attention feature β'. As shown in FIG. 5, the multi-layer perceptron converts the size of the self-attention feature β' to m, the same as that of s_true; a ReLU activation function adds a non-linear factor, and a Softmax function then produces the fine-grained predicted distribution:

s_pre = {p_pre(c_1), p_pre(c_2), …, p_pre(c_m)}
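A minimal PyTorch sketch of this head; the 512-dimensional clip features and the 256-unit hidden layer are assumptions, as the patent does not give layer sizes:

```python
import torch
import torch.nn as nn

# Label distribution head: concatenate the clip self-attention features into
# beta', map to the m = 21 score bins, and softmax into a distribution.
head = nn.Sequential(nn.Linear(10 * 512, 256), nn.ReLU(), nn.Linear(256, 21))

f_beta = torch.randn(10, 512)                          # output of the attention module
s_pre = torch.softmax(head(f_beta.flatten()), dim=-1)  # (21,) predicted distribution
```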
in the MTL-AQA action quality evaluation data set, labels are given by a plurality of judges, and the final score is calculated by removing two maximum values and two minimum values from all the judge scores and adding the restThe scores are summed and multiplied by a difficulty factor. Therefore, after the self-attention model, K multi-layer perceptrons are trained in parallel to obtain K true distributions
Figure BDA0003235591190000071
Figure BDA0003235591190000072
And the predicted distribution
Figure BDA0003235591190000073
Can be represented as strue,kAnd spre,k. Since the label of the label distribution learning is probability distribution, and the Kullback-Leibler divergence is called information divergence or relative entropy and is used for measuring the asymmetry measurement of the difference between the two probability distributions, s is calculated by using the KL divergence as the loss function of the label distribution learning in the training stagetrue,kAnd spre,kThe Loss between the two probability distributions is optimized by minimizing KL Loss by using a gradient descent method, so that the difference between the two probability distributions is minimized, namely, the more similar the predicted distribution and the real distribution, the better. The loss function is shown as follows:
Figure BDA0003235591190000074
wherein, L ({ s)pre,kDenotes the overall loss for each training sample, spre,kRepresenting the k-th prediction distribution, s, in the sampletrue,kRepresenting the k-th true distribution, p, of the samplepre(ci,k) Representing the probability of the score at the ith scoring position in the kth predicted distribution. p is a radical oftrue(ci,k) Representing the score probability of the ith score position in the kth true distribution, the present invention optimizes the above-mentioned loss function using an Adam optimizer.
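A minimal PyTorch sketch of this loss under assumed sizes (K = 7 judge heads, m = 21 score bins); note that F.kl_div takes the prediction as log-probabilities:

```python
import torch
import torch.nn.functional as F

# KL-divergence loss between predicted and true score distributions (sketch).
# reduction="sum" realises the double sum over judges k and score bins i above.
pred_logits = torch.randn(7, 21, requires_grad=True)    # K = 7 judge heads (assumed)
true_dists = torch.softmax(torch.randn(7, 21), dim=-1)  # stand-in Gaussian targets
loss = F.kl_div(F.log_softmax(pred_logits, dim=-1), true_dists, reduction="sum")
loss.backward()
# Optimiser settings quoted in the embodiment:
# torch.optim.Adam(params, lr=1e-4, weight_decay=1e-5)
```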
In the test stage, the score with the highest probability in each predicted distribution s_pre,k is selected as the predicted score of the k-th judge, s_final,k. The two highest and the two lowest of all predicted judge scores are then rejected, and the sum of the remaining scores is multiplied by the difficulty coefficient of the sample, which is known from the label, to obtain the final predicted score s_final, as shown in the following formula:

s_final = DD · Σ_{k∈U} s_final,k

where DD denotes the difficulty coefficient of the sample and k ∈ U ranges over all scores remaining after the two maxima and the two minima are rejected.
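A sketch of this test-time scoring rule in Python (NumPy); the judge count K = 7 and the random distributions are illustrative assumptions:

```python
import numpy as np

# Argmax score per judge distribution, drop the two highest and two lowest,
# scale the remaining sum by the known difficulty coefficient DD.
def final_score(pred_dists, bins, dd):
    judge_scores = bins[np.argmax(pred_dists, axis=-1)]   # (K,) one score per judge head
    kept = np.sort(judge_scores)[2:-2]                    # reject 2 maxima and 2 minima
    return dd * kept.sum()

bins = np.arange(0.0, 10.5, 0.5)                          # the 21 score bins
print(final_score(np.random.rand(7, 21), bins, dd=3.2))   # toy example
```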
FIG. 6 is a comparison of the present method and other methods provided by embodiments of the present invention on the MTL-AQA dataset;
Using the proposed action quality assessment model based on the self-attention mechanism and label distribution learning, comparative experiments were performed on the MTL-AQA dataset against other mainstream action quality assessment models of different structures and depths, analyzing the influence of the algorithm on the action quality evaluation task. The experiments use the Spearman rank correlation coefficient as the evaluation index. The results show that the pose feature extraction method based on manual design performs worst. Methods that use a regression model with the final score as the supervision label are superior to most current methods; the present method in turn outperforms the current regression-based action quality evaluation algorithms, which proves the effectiveness of the self-attention mechanism with positive- and negative-correlation heads and shows that the self-attention module and the label distribution learning module bring large improvements. The variant using label distribution learning performs best, reaching a Spearman rank correlation coefficient of 0.9384 and surpassing the best current reference method. The experimental results fully indicate the effectiveness of the method and show that it generalizes better than the other models.
FIG. 7 is a comparison of the present method and other methods provided by embodiments of the present invention in a JIGSAWS data set;
the method and other reference methods are tested in three scenes of knotting (knotting), Needle threading (Needle paging) and Suturing (Suturing) in the JIGSAWS data set. The number of video frames of the jitswaws dataset dynamically changes with the sample video length, so 160 frames are randomly sampled as input to the model, which then divides the video into 10 segments of 16 frames each, as with the MTL-AQA dataset. Comparing the experimental results with the structures of the reference methods, the results show that the method has better performance in the tasks of suturing (0.7806) and needle threading (0.8040) than all other reference methods, the average Spierman grade coefficient in the three tasks is 0.7762, and the experimental results fully prove the effectiveness of the method in the task of motion quality assessment.
FIG. 8 is a visualization of assessment results provided by embodiments of the present invention;
and visualizing the quality evaluation result of the method by using a scatter diagram, wherein each scatter represents a prediction sample used by the method, the y-axis represents a prediction score, the x-axis represents a real score, and the real sample is represented by a dotted line. The more concentrated the scatter distribution is, the closer the prediction result is to the real sample, and the higher the model is accurate. The action quality evaluation result based on self-attention and mark distribution learning is close to real sample data, and the effectiveness of the method is fully proved by the experimental result.

Claims (8)

1. An action quality assessment method based on self-attention and label distribution learning, characterized by comprising the following steps:
Step one, video preprocessing: down-sample the original video to obtain an input video of total length L, V = {F_1, F_2, …, F_L}, where F_l denotes the l-th frame; then segment it into n mutually overlapping clips C = {C_1, C_2, …, C_n}, where C_n denotes the n-th clip and each clip contains M frames; each frame in each clip is down-sampled and data-enhanced;
Step two, input each video clip C preprocessed in step one into the feature extraction module to generate the extracted spatio-temporal feature sequence of the clips, F_α = {α_1, α_2, …, α_n}, where α_m denotes the spatio-temporal feature of the m-th clip;
Step three, input the spatio-temporal feature sequence F_α of the video clips as a sequence into the self-attention module to obtain the self-attention feature sequence F_β = {β_1, β_2, …, β_n} containing the context information between sequence elements, where β_n denotes the self-attention feature of the n-th clip;
Step four, concatenate the self-attention feature sequence F_β and input it to the label distribution learning module, which outputs a predicted distribution s_pre = {p_pre(c_1), p_pre(c_2), …, p_pre(c_m)}, where p_pre(c_i) denotes the predicted probability of the score c_i;
Step five, convert the true label S, whose value range is 0 to m, into a true distribution s_true = {p_true(c_1), p_true(c_2), …, p_true(c_m)} using a Gaussian function, where p_true(c_i) denotes the true probability of the score c_i;
Step six, compute the loss function between the predicted distribution obtained in step four and the true distribution obtained in step five, minimize the loss, and train the model;
Step seven, evaluate the test videos with the model trained in step six to obtain the predicted distributions of the test set, and from them the evaluation scores on the test dataset.
2. The action quality assessment method based on self-attention and label distribution learning according to claim 1, wherein: in step one, the original video is down-sampled and divided into 10 clips of 16 frames each; down-sampling refers to center-cropping the picture, and data enhancement refers to using a random number to select one of nearest-neighbor interpolation, bilinear interpolation, bicubic interpolation, and the Lanczos algorithm to process each frame in each clip.
3. The action quality assessment method based on self-attention and label distribution learning according to claim 1, wherein: the feature extraction module in step two is composed of the multi-receptive-field three-dimensional convolutional neural network I3D, comprising three-dimensional convolutional layers, a three-dimensional max-pooling layer, a three-dimensional average-pooling layer, and Inception layers, each Inception layer comprising several convolutional layers and a max-pooling layer; the modules are connected according to the I3D network structure to obtain the spatio-temporal feature sequence F_α.
4. The action quality assessment method based on self-attention and label distribution learning according to claim 1, wherein: the self-attention module in step three is composed of two self-attention heads, one positively and one negatively correlated; the self-attention head of each clip linearly maps the input spatio-temporal feature α_n to a Query (q_n), a Key (k_n), and a Value (v_n), where the Query is used to match Keys and the Value represents the information extracted from the input spatio-temporal feature α_n; dot products with the other clips, scaling, and a Softmax function generate the self-attention feature sequence F_β extracted with reference to all clips.
5. The action quality assessment method based on self-attention and label distribution learning according to claim 1, wherein: in step four, the label distribution learning module comprises a multi-layer perceptron that concatenates the self-attention feature sequences F_β of the video clips, passes them through multi-layer linear perceptron layers, and finally generates the predicted distribution through Softmax.
6. The action quality assessment method based on self-attention and label distribution learning according to claim 1, wherein: the specific method for converting the true label into the true distribution in step five is as follows: generate the true distribution with the sample's true label S as the mean and the hyperparameter σ as the variance.
7. The action quality assessment method based on self-attention and label distribution learning according to claim 1, wherein: in step six, the KL divergence loss function is used for the predicted distribution s_pre and the true distribution s_true, and the overall loss function is expressed as:

L({s_pre}) = Σ_{i=1}^{m} p_true(c_i) · log( p_true(c_i) / p_pre(c_i) )

wherein L({s_pre}) denotes the overall loss of each training sample; the above loss function is optimized with an Adam optimizer.
8. The action quality assessment method based on self-attention and label distribution learning according to claim 1, wherein: in step seven, the Spearman rank correlation coefficient is adopted as the evaluation index.
CN202111000981.6A 2021-08-30 2021-08-30 Action quality evaluation method based on self-attention and label distribution learning Active CN113642513B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111000981.6A CN113642513B (en) 2021-08-30 2021-08-30 Action quality evaluation method based on self-attention and label distribution learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111000981.6A CN113642513B (en) 2021-08-30 2021-08-30 Action quality evaluation method based on self-attention and label distribution learning

Publications (2)

Publication Number Publication Date
CN113642513A true CN113642513A (en) 2021-11-12
CN113642513B CN113642513B (en) 2022-11-18

Family

ID=78424319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111000981.6A Active CN113642513B (en) 2021-08-30 2021-08-30 Action quality evaluation method based on self-attention and label distribution learning

Country Status (1)

Country Link
CN (1) CN113642513B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114463551A (en) * 2022-02-14 2022-05-10 北京百度网讯科技有限公司 Image processing method, image processing device, storage medium and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111784121A (en) * 2020-06-12 2020-10-16 清华大学 Action quality evaluation method based on uncertainty score distribution learning
CN112085102A (en) * 2020-09-10 2020-12-15 西安电子科技大学 No-reference video quality evaluation method based on three-dimensional space-time characteristic decomposition
CN112954312A (en) * 2021-02-07 2021-06-11 福州大学 No-reference video quality evaluation method fusing spatio-temporal characteristics

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111784121A (en) * 2020-06-12 2020-10-16 清华大学 Action quality evaluation method based on uncertainty score distribution learning
CN112085102A (en) * 2020-09-10 2020-12-15 西安电子科技大学 No-reference video quality evaluation method based on three-dimensional space-time characteristic decomposition
CN112954312A (en) * 2021-02-07 2021-06-11 福州大学 No-reference video quality evaluation method fusing spatio-temporal characteristics

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114463551A (en) * 2022-02-14 2022-05-10 北京百度网讯科技有限公司 Image processing method, image processing device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN113642513B (en) 2022-11-18

Similar Documents

Publication Publication Date Title
WO2021139069A1 (en) General target detection method for adaptive attention guidance mechanism
CN111259850B (en) Pedestrian re-identification method integrating random batch mask and multi-scale representation learning
CN107341452B (en) Human behavior identification method based on quaternion space-time convolution neural network
CN107480261B (en) Fine-grained face image fast retrieval method based on deep learning
CN111611847B (en) Video motion detection method based on scale attention hole convolution network
US11270124B1 (en) Temporal bottleneck attention architecture for video action recognition
CN112784798A (en) Multi-modal emotion recognition method based on feature-time attention mechanism
CN112699956A (en) Neural morphology visual target classification method based on improved impulse neural network
CN109543602A (en) A kind of recognition methods again of the pedestrian based on multi-view image feature decomposition
AU2021379758A9 (en) A temporal bottleneck attention architecture for video action recognition
CN110046550A (en) Pedestrian's Attribute Recognition system and method based on multilayer feature study
CN110135369A (en) A kind of Activity recognition method, system, equipment and computer readable storage medium
Zhang et al. Cross-scale generative adversarial network for crowd density estimation from images
CN112149665A (en) High-performance multi-scale target detection method based on deep learning
Karkanis et al. Detecting abnormalities in colonoscopic images by textural description and neural networks
CN113642513B (en) Action quality evaluation method based on self-attention and label distribution learning
Aldhaheri et al. MACC Net: Multi-task attention crowd counting network
CN114170657A (en) Facial emotion recognition method integrating attention mechanism and high-order feature representation
Sharma et al. Mango leaf diseases detection using deep learning
CN113537240B (en) Deformation zone intelligent extraction method and system based on radar sequence image
CN112597842B (en) Motion detection facial paralysis degree evaluation system based on artificial intelligence
CN111209433A (en) Video classification algorithm based on feature enhancement
Eghbali et al. Deep Convolutional Neural Network (CNN) for Large-Scale Images Classification
JP7466815B2 (en) Information processing device
CN117275681B (en) Method and device for detecting and evaluating honeycomb lung disease course period based on transducer parallel cross fusion model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant