CN113642513B - Action quality evaluation method based on self-attention and label distribution learning - Google Patents


Info

Publication number
CN113642513B
CN113642513B (application CN202111000981.6A)
Authority
CN
China
Prior art keywords
attention
self
distribution
video
true
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111000981.6A
Other languages
Chinese (zh)
Other versions
CN113642513A (en)
Inventor
Zhang Yu (张宇)
Mi Siya (米思娅)
Xu Tianyu (徐天宇)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202111000981.6A priority Critical patent/CN113642513B/en
Publication of CN113642513A publication Critical patent/CN113642513A/en
Application granted granted Critical
Publication of CN113642513B publication Critical patent/CN113642513B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an action quality evaluation method based on self-attention and label distribution learning. First, a video is preprocessed and each video clip is input to a feature extraction module, which generates the spatio-temporal features of each clip. The spatio-temporal features of the video clips are then input as a sequence to a self-attention module to obtain self-attention features containing the context information between the sequence elements. All self-attention features are concatenated and input to a label distribution learning module, which outputs a predicted distribution. Next, the true label is converted into a true distribution with a Gaussian function, the loss between the predicted distribution and the true distribution is computed, and the model is trained by minimizing this loss. Finally, the trained model evaluates the test videos to obtain the predicted distribution of the test set and, from it, the evaluation scores on the test data set. Using the Spearman rank correlation coefficient as the evaluation index, the method achieves better evaluation results, demonstrating the effectiveness of the action quality evaluation method.

Description

Action quality evaluation method based on self-attention and label distribution learning
Technical Field
The application relates to the field of computer vision, in particular to an action quality assessment method based on self-attention and label distribution learning.
Background
Video action quality assessment (AQA) aims to evaluate the performance and completion quality of specific actions in a video. Automated action quality assessment effectively reduces the demand on human labor and can evaluate video content more accurately and fairly. The technology has potential value and wide application in fields such as skill teaching, sports competition, and medical surgery, and has become a new and attractive research topic in computer vision.
Over the past few years, a number of AQA methods have been proposed. Most treat quality assessment simply as a regression problem, regressing the extracted features to directly produce a predicted action score, or learning quality features through pairwise comparison. However, both approaches have limited effect: the first ignores the inherent ambiguity of the labels, i.e., different judges give different scores and any given score is subjective, and the second suffers from uncertainty in the choice of reference samples. In addition, most existing methods divide the video into several discrete video sequences and extract features with three-dimensional convolutions of fixed receptive-field size. These methods face the problem of multi-scale spatio-temporal features: different videos may differ in subject size in the spatial dimension and in duration and execution rate in the temporal dimension, which limits the model's comprehension of the sample and degrades the quality assessment. To address these problems, an action quality evaluation method based on self-attention and label distribution learning is proposed.
Disclosure of Invention
The purpose of the invention is as follows: a new video action quality evaluation model is designed, which predicts the score distribution with a label distribution learning method so that the model better handles label ambiguity. A self-attention module with positively and negatively correlated heads lets the model learn the positive and negative correlation of each clip in the video sequence with the whole sample video, enhancing the context information of the video clips and addressing the multi-scale spatio-temporal problem of conventional action quality assessment tasks.
The technical scheme is as follows: an action quality assessment method based on self-attention and label distribution learning, characterized by comprising the following steps:
Step 1, preprocess the video: downsample the original video to obtain an input video of total length L, V = {F_1, F_2, …, F_L}, where F_l denotes the l-th frame; then segment it into n mutually overlapping clips C = {C_1, C_2, …, C_n}, where C_n denotes the n-th clip; each clip contains M frames, and every frame in every clip is downsampled and data-enhanced;
Step 2, input each video clip C preprocessed in step 1 to the feature extraction module to generate the spatio-temporal feature sequence F_α = {α_1, α_2, …, α_n}, where α_n denotes the spatio-temporal feature of the n-th clip;
Step 3, input the spatio-temporal feature sequence F_α of the video clips to the self-attention module to obtain the self-attention feature sequence F_β = {β_1, β_2, …, β_n} containing the context information between the sequence elements, where β_n denotes the self-attention feature of the n-th clip;
Step 4, concatenate the self-attention feature sequence F_β and input it to the label distribution learning module, which outputs the predicted distribution s_pre = {p_pre(c_1), p_pre(c_2), …, p_pre(c_m)}, where p_pre(c_i) denotes the predicted probability of the score c_i;
Step 5, convert the true label s, whose value ranges from 0 to m, into the true distribution s_true = {p_true(c_1), p_true(c_2), …, p_true(c_m)} using a Gaussian function, where p_true(c_i) denotes the true probability of the score c_i;
Step 6, compute the loss between the predicted distribution from step 4 and the true distribution from step 5, minimize the loss, and train the model;
Step 7, evaluate the test videos with the model trained in step 6 to obtain the predicted distribution on the test set and, from it, the evaluation scores on the test data set.
Further, in step 1, the original video is downsampled and divided into 10 clips of 16 frames each; downsampling refers to center-cropping each picture, and data enhancement refers to using a random number to select one of nearest-neighbor interpolation, bilinear interpolation, bicubic interpolation, and the Lanczos algorithm for processing each frame of each clip.
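By way of illustration, a minimal Python sketch of this preprocessing follows (OpenCV and NumPy assumed; the uniform spacing of clip starts and the square center crop are choices of the sketch, not details fixed by the disclosure):

```python
import random
import numpy as np
import cv2

# The four interpolation methods named above; one is chosen at random per frame.
INTERPOLATIONS = [cv2.INTER_NEAREST, cv2.INTER_LINEAR, cv2.INTER_CUBIC, cv2.INTER_LANCZOS4]

def preprocess(frames, n_clips=10, clip_len=16, size=224):
    """Split a downsampled video (list of HxWx3 uint8 frames) into n_clips
    overlapping clips of clip_len frames; center-crop each frame to a square
    and resize it to size x size with a randomly chosen interpolation."""
    starts = np.linspace(0, len(frames) - clip_len, n_clips).astype(int)  # overlapping starts
    clips = []
    for s in starts:
        clip = []
        for f in frames[s:s + clip_len]:
            h, w = f.shape[:2]
            side = min(h, w)
            top, left = (h - side) // 2, (w - side) // 2
            f = f[top:top + side, left:left + side]          # center crop (downsampling)
            interp = random.choice(INTERPOLATIONS)           # data enhancement
            clip.append(cv2.resize(f, (size, size), interpolation=interp))
        clips.append(np.stack(clip))
    return np.stack(clips)   # shape: (n_clips, clip_len, size, size, 3)
```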
Further, the feature extraction module in step 2 consists of a three-dimensional convolutional neural network, I3D, with multiple receptive fields, comprising three-dimensional convolution layers, three-dimensional max-pooling layers, a three-dimensional average-pooling layer, and Inception layers; each Inception layer comprises several convolution layers and a max-pooling layer. These modules are connected according to the I3D network structure to obtain the spatio-temporal feature sequence F_α.
Further, the self-attention module in step 3 consists of two self-attention heads, one positively and one negatively correlated. For each clip, a self-attention head linearly maps the input spatio-temporal feature α_n to a Query (q_n), a Key (k_n), and a Value (v_n); the Query is used to match Keys, and the Value represents the information extracted from the input spatio-temporal feature α_n. Dot products with the other clips, scaling, and a Softmax function then generate the self-attention feature sequence F_β, extracted with reference to all clips.
Further, in step 4, the label distribution learning module comprises a multi-layer perceptron: the self-attention feature sequences F_β of all video clips are concatenated, passed through the multi-layer perceptron, and finally turned into the predicted distribution by a Softmax.
Further, the specific method of converting the true label into the true distribution in step 5 is: generate the true distribution with the sample's true label s as the mean and the hyper-parameter σ as the variance.
Further, in step 6, a KL-divergence loss function is used between the predicted distribution s_pre and the true distribution s_true, the overall loss function being

L(\{s_{pre}\}) = \mathrm{KL}(s_{true} \| s_{pre}) = \sum_{i=1}^{m} p_{true}(c_i) \log \frac{p_{true}(c_i)}{p_{pre}(c_i)}

where L({s_pre}) denotes the overall loss of each training sample; the above loss function is optimized with an Adam optimizer.
Further, in step 7, the Spearman rank correlation coefficient is adopted as the evaluation index.
Beneficial effects: compared with the prior art, this deep learning method for action quality evaluation extracts temporal self-attention features between video sequences, increasing the weight of video clips positively correlated with the evaluation result and decreasing the weight of negatively correlated clips, thereby improving the evaluation accuracy of the model. In addition, the method acquires multi-scale spatio-temporal context information between video sequences and generates a fine-grained predicted distribution. The embodiments below show that the invention effectively learns high-level, action-discriminative features for action quality evaluation, and that the proposed method performs well on several behavior assessment data sets.
Drawings
Fig. 1 is a flowchart of an action quality evaluation method based on self-attention and label distribution learning according to an embodiment of the present invention;
Fig. 2 is a diagram of an action quality evaluation model structure based on self-attention and label distribution learning according to an embodiment of the present invention;
FIG. 3 is a block diagram of a feature extraction module according to an embodiment of the present invention;
FIG. 4 is a self-attention module structure according to an embodiment of the present invention;
FIG. 5 is a block diagram of a label distribution learning module according to an embodiment of the present invention;
FIG. 6 is a comparison of the present method and other methods provided by embodiments of the present invention in the MTL-AQA data set;
FIG. 7 is a comparison of the present method and other methods provided by embodiments of the present invention in a JIGSAWS data set;
FIG. 8 is a visualization of assessment results provided by embodiments of the present invention;
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
This embodiment applies the action quality evaluation method based on the self-attention mechanism and label distribution learning to the MTL-AQA data set. The MTL-AQA data set used in this example is a multi-label Olympic diving data set whose labels comprise the judges' scores, the final score, the difficulty coefficient, the viewing angle, the action type, competition commentary, and so on. The final score is the sum of all judges' scores, after removing the two highest and the two lowest, multiplied by a difficulty coefficient; there are 1412 samples, and each sample is sampled to 103 frames. The example splits the data into training and test sets at a ratio of 3:1. During training, the I3D model of the feature extraction module is pre-trained on the Kinetics data set. The number of training epochs is set to 100, the learning rate of the Adam optimizer is 10^-4, the weight decay rate is set to 10^-5, the training batch size is 4, and the test batch size is 20. The trained model evaluates the test videos to obtain the predicted distribution and, from it, the evaluation scores on the test data set. To explain the technical solution of the invention, the following description refers to the accompanying drawings and specific examples.
Fig. 1 shows a flowchart of the action quality assessment method based on self-attention and label distribution learning according to an embodiment of the present invention, which includes the following steps:
step 1: the original video with the total length of L is downsampled to obtain a video V, and the video V is divided into n segments { C } 1 ,C 2 ,…,C n And (4) cutting and enhancing data.
Step 2: inputting each video segment into the I3D model to generate a space-time characteristic sequence F of all the segments α ={α 12 ,…,α n },α n Representing the spatiotemporal characteristics of the nth segment.
And step 3: the spatial feature sequence F α Inputting the data into a self-attention module to obtain a self-attention feature sequence F containing context information β ={β 12 ,…,β n },β n The self-attention feature of the nth segment is shown.
And 4, step 4: will self-attention feature sequence F β Splicing, inputting to multi-layer perceptron and generating prediction distribution s pre
And 5: if it is a testStage, select s pre Taking the medium maximum probability score as a judgment score, multiplying the sum of all judgment scores after removing the two maximum values and the two minimum values by a difficulty coefficient DD, and outputting a final score s final And then, the process is ended.
And 6: if the training stage is adopted, the real label is converted into a real distribution s true
And 7: calculating s using KL divergence loss function true And s pre The model is trained using Adam optimizer.
And step 8: if the training times are less than 100, returning to the step 2, otherwise, ending.
Fig. 2 shows a structure diagram of the action quality assessment model based on self-attention and label distribution learning according to an embodiment of the present invention, described in detail as follows:
The invention mainly comprises three parts: a feature extraction module, a multi-head self-attention module, and a label distribution learning module. The algorithm flow is as follows: given a sample video of total length L, downsample it to obtain a 103-frame video V, segment V into 10 partially overlapping video clips of 16 frames each, apply data enhancement, and input the clips to I3D for three-dimensional feature extraction, generating spatio-temporal features. Then a multi-head self-attention mechanism extracts positively and negatively correlated self-attention features between the sequence elements; all features are concatenated and input to a multi-layer perceptron to transform the dimensionality, and finally a Softmax layer generates the predicted label distribution. Fig. 2 is a schematic network-structure diagram of the proposed action quality evaluation method based on self-attention and label distribution learning.
FIG. 3 is a block diagram of a feature extraction module according to an embodiment of the present invention;
the characteristic extraction module is mainly composed of a three-dimensional convolutional neural network I3D with multiple receptive fields, and is provided with 4 three-dimensional convolutional layers (Conv 3D), 4 maximum pooling layers (MaxPool 3D), 1 average pooling layer (AvgPool 3D) and 9 inclusion layers (Inc), the convolution and pooling steps in different dimensions are different, batch normalization is carried out on the back surfaces of other convolutional layers in the model except the last layer of convolution layer, and as the ReLU activation function has the function of unilateral inhibition and has the characteristic of sparse activation, part of parameters can be reduced, the purpose of over-fitting prevention is achieved, the excitation boundary is relatively wide, and any input characteristic can be activated; since the gradient is constant, no phenomena of gradient disappearance or gradient explosion occur, the ReLU is used as an activation function, which is not shown in fig. 3 for the sake of simplicity of representation of the model. The Inc in the figure represents an Incep layer, the structure of the Incep layer is shown in the figure, after convolution is carried out on convolution kernels with different receptive field sizes of 3 x 3 and 1 x 1, all scale features are spliced to form a feature which can represent more information and has more depth, and therefore the problem of multi-scale space-time features is solved.
The specific process of the feature extraction module can be described as follows: a video clip (Clip n) of size [3, 16, 224, 224] is input, the format being [channels, frames, width, height] in frames and pixels. After passing through each layer, a final temporal convolution and average pooling yield a spatio-temporal feature α_n of size [1, 1024]; the spatio-temporal feature sequence of all video clips is denoted F_α = {α_1, α_2, …, α_n}. Owing to the generality of the I3D model, the algorithm's I3D model is pre-trained on the Kinetics data set.
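A shape-level sketch of this interface follows; the tiny stand-in network below only reproduces the input and output sizes described above and is not the Kinetics-pretrained I3D itself:

```python
import torch
import torch.nn as nn

class I3DStandIn(nn.Module):
    """Illustrative stand-in for the I3D backbone: maps a clip of shape
    (B, 3, 16, 224, 224) to a 1024-d spatio-temporal feature, as in the text."""
    def __init__(self, out_dim=1024):
        super().__init__()
        self.stem = nn.Conv3d(3, 64, kernel_size=7, stride=(2, 4, 4), padding=3)
        self.pool = nn.AdaptiveAvgPool3d(1)   # final average pooling over T, H, W
        self.proj = nn.Linear(64, out_dim)

    def forward(self, clip):
        x = self.pool(self.stem(clip)).flatten(1)   # (B, 64)
        return self.proj(x)                          # (B, 1024)

clip = torch.randn(1, 3, 16, 224, 224)   # [channels, frames, width, height]
alpha_n = I3DStandIn()(clip)             # spatio-temporal feature of size [1, 1024]
```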
FIG. 4 is a self-attention module architecture according to an embodiment of the present invention;
in order to extract action information at different positions in a video, an action quality evaluation algorithm usually generates a plurality of video segment sequences in a segmented manner to extract features, and then an average pooling is used to achieve the purpose of feature aggregation. Inspired by the seq2seq task in natural language processing, this can be considered as a seq2seq process, using a sequence model. Considering that parallel output cannot be realized in the RNN calculation process, and long-distance relationships are still difficult to obtain, the multi-head self-attention mechanism is used for calculating the relationship between video segment contexts. The attention mechanism can be understood as that a group of sequences are used as input, the dot product operation is carried out by using Query and Key mapped by a linear layer, the result and Value are subjected to weighted summation, and a group of vector sequences with weights among all the input sequences are output.
The spatio-temporal feature α_n of a given clip is input to the multi-head attention module. Three linear layers output a Query and a Key of dimension d_k, denoted q_n and k_n, and a Value of dimension d_v, denoted v_n; the Query is used to match Keys, and the Value represents the information extracted from the input spatio-temporal feature α_n. The dot product of q_n with the Key of each other clip m in the sequence is then computed, the result being denoted p_{n,m}, with m ∈ {1, …, n}. To keep the values from growing so large that the subsequent activation function saturates at a constant 0 or 1, each dot product is divided by \sqrt{d_k}. A Softmax over the scaled results gives the weights of the Values of the sequence clips, and the weighted combination with the Values finally yields the self-attention feature β_n of the video clip:

\beta_n = \sum_{m=1}^{n} \mathrm{softmax}_m\left(\frac{p_{n,m}}{\sqrt{d_k}}\right) v_m
Compared with extracting a single Query, Key, and Value, the present algorithm uses linear layers to extract two Queries, Keys, and Values from each clip in parallel. As shown on the right of Fig. 4, the spatio-temporal feature α_n is input simultaneously to a positive-correlation head (Positive Head) and a negative-correlation head (Negative Head). In the example, the clip of the athlete falling has a very high positive correlation with its neighboring clips, while its correlation with the first few clips is very low, since the clips at the head of the sequence tend to be irrelevant background content and may even be negatively correlated. This amounts to adding a dimension that contains both positively and negatively correlated features among the many clips of the sequence. The computation then proceeds exactly as in the single self-attention mechanism above, and the results of the two self-attention heads are concatenated to obtain the attention feature of the clip. In the actual computation, the video clip sequences are grouped into matrices, and the results for all clips are usually obtained with the parallel operation shown below:

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\left(\mathrm{softmax}\left(\frac{Q_{pos} K_{pos}^{T}}{\sqrt{d_k}}\right) V_{pos},\ \mathrm{softmax}\left(\frac{Q_{neg} K_{neg}^{T}}{\sqrt{d_k}}\right) V_{neg}\right)

where Q, K, and V denote the matrices stacked from the Query, Key, and Value obtained by linearly mapping each clip's spatio-temporal feature α_n, the subscripts pos and neg denote the parameters of the positive- and negative-correlation heads respectively, and d_k denotes the dimension of K. Finally, the self-attention feature sequence F_β = {β_1, β_2, …, β_n} with context information is obtained with reference to all the video clips.
FIG. 5 is a diagram illustrating a multi-layered perceptron structure in a label distribution learning module according to an embodiment of the present invention;
in the label distribution learning module, a training sample is assumed, and the true label is s. In order to obtain the true distribution s of the sample true Firstly, a gaussian equation with the sample mean as the label score s and the variance as σ is generated (in the experimental part, the ablation experiment is performed on the distribution function selection), as shown in the following formula:
Figure BDA0003235591190000063
where σ is both a variance and a hyperparameter, the uncertainty of how good an action is evaluated. The label score interval is then uniformly discretized into a set of scores c = { c = { (c) } 1 ,c 2 ,…,c m The value range of the true label of the MTL-AQA action quality assessment dataset is 0 to 10, and the predicted score interval is set to 0.5, i.e., [0,0.5, …,9.5,10]And thus the output dimension m takes 21. Then using the vector g c ={g(c 1 ),g(c 2 ),…,g(c m ) Describing the degree (i.e. probability) of each score, for g c The following normalization is performed to obtain the true distribution s of the training samples true ={p true (c 1 ),p true (c 2 ),…,p true (c m )}。
Figure BDA0003235591190000064
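A short sketch of this label-to-distribution conversion (the constant factor 1/(\sqrt{2\pi}\,σ) cancels in the normalization, so the sketch omits it):

```python
import torch

def score_to_distribution(s, sigma=1.0, lo=0.0, hi=10.0, step=0.5):
    """Discretize the score range into c = [0, 0.5, ..., 10] (m = 21 bins),
    evaluate the Gaussian centered on the true label s, and normalize,
    as described above. sigma is the uncertainty hyper-parameter."""
    c = torch.arange(lo, hi + step, step)             # score set c_1 ... c_m
    g = torch.exp(-(c - s) ** 2 / (2 * sigma ** 2))   # g_c before normalization
    return c, g / g.sum()                             # p_true sums to 1

c, s_true = score_to_distribution(s=8.5, sigma=1.0)
print(c.numel(), s_true.sum())   # 21 tensor(1.)
```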
To learn the predicted distribution s_pre, the self-attention feature sequence F_β = {β_1, β_2, …, β_n} of the n clips learned by the multi-head self-attention mechanism is concatenated to form one large self-attention feature β'. β' is input to the multi-layer perceptron shown in Fig. 5, which converts its size to m, the same as that of s_true; a ReLU activation function adds a nonlinear factor, and a Softmax function then produces the fine-grained predicted distribution s_pre = {p_pre(c_1), p_pre(c_2), …, p_pre(c_m)}.
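A sketch of one such perceptron head follows (the hidden width of 256 is an assumption; the disclosure does not fix the internal layer sizes):

```python
import torch
import torch.nn as nn

class LDLHead(nn.Module):
    """Sketch of one label-distribution MLP head: concatenated self-attention
    features -> linear layers with ReLU -> Softmax over the m = 21 score bins."""
    def __init__(self, n_clips=10, feat_dim=1024, m=21, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_clips * feat_dim, hidden),
            nn.ReLU(),                        # the nonlinear factor from the text
            nn.Linear(hidden, m),
        )

    def forward(self, f_beta):                # f_beta: (B, n_clips, feat_dim)
        logits = self.mlp(f_beta.flatten(1))  # β': concatenated features
        return torch.softmax(logits, dim=-1)  # fine-grained predicted distribution

head = LDLHead()
s_pre = head(torch.randn(4, 10, 1024))        # (4, 21), each row sums to 1
```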
in the MTL-AQA action quality evaluation data set, labels are given by a plurality of referees, and the final score is calculated by removing two maximum values and two minimum values from all referee scores, summing the rest scores and multiplying by a difficulty coefficient. Therefore, after the self-attention model, K multi-layer perceptrons are trained in parallel to obtain K true distributions
Figure BDA0003235591190000071
Figure BDA0003235591190000072
And the predicted distribution
Figure BDA0003235591190000073
Can be represented as s true,k And s pre,k . Since the label of the label distribution learning is probability distribution, and the Kullback-Leibler divergence is called information divergence or relative entropy and is used for measuring the asymmetry measurement of the difference between the two probability distributions, s is calculated by using the KL divergence as the loss function of the label distribution learning in the training stage true,k And s pre,k The Loss between the two probability distributions is optimized by minimizing KL Loss by using a gradient descent method, so that the difference between the two probability distributions is minimized, namely, the more similar the predicted distribution and the real distribution, the better. The loss function is shown as follows:
Figure BDA0003235591190000074
wherein, L ({ s) pre,k Denotes the overall loss for each training sample, s pre,k Representing the k-th prediction distribution, s, in the sample true,k Representing the kth true distribution, p, of the sample pre (c i,k ) Representing the probability of the score at the ith scoring position in the kth predicted distribution. p is a radical of true (c i,k ) Representing the score probability of the ith score position in the kth true distribution, the present invention optimizes the above-mentioned loss function using an Adam optimizer.
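A sketch of this loss with PyTorch's built-in KL divergence (K = 7 judges matches the MTL-AQA diving setting; the random target distributions stand in for the Gaussian label distributions built above):

```python
import torch
import torch.nn.functional as F

def ldl_kl_loss(pred_logits, true_dists):
    """KL-divergence loss between the K predicted and K true judge-score
    distributions, summed over judges as in the formula above.
    pred_logits: (K, m) raw MLP outputs; true_dists: (K, m) label distributions."""
    log_pred = F.log_softmax(pred_logits, dim=-1)   # log s_pre,k
    # F.kl_div takes log-probabilities as input and probabilities as target;
    # reduction='sum' gives sum_k sum_i p_true * (log p_true - log p_pre).
    return F.kl_div(log_pred, true_dists, reduction='sum')

pred_logits = torch.randn(7, 21, requires_grad=True)    # K = 7 judges, m = 21 bins
true_dists = torch.softmax(torch.randn(7, 21), dim=-1)  # stand-in label distributions
loss = ldl_kl_loss(pred_logits, true_dists)
loss.backward()   # optimized with Adam in the embodiment
```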
In the test stage, the score with the highest probability in each predicted distribution s_pre,k is selected as the final predicted score s_final,k of the k-th judge; the two highest and the two lowest of all predicted judge scores are then removed, and the sum of the remaining scores is multiplied by the difficulty coefficient of the known label to obtain the final predicted score s_final:

s_{final} = DD \times \sum_{k \in U} s_{final,k}

where DD denotes the difficulty coefficient of the sample and U denotes the set of all scores remaining after the two highest and the two lowest are rejected.
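A sketch of this test-time aggregation:

```python
import torch

def final_score(pred_dists, c, dd):
    """Arg-max score from each of the K predicted judge distributions,
    drop the two highest and two lowest, multiply the sum of the rest
    by the known difficulty coefficient DD, per the formula above."""
    judge_scores = c[pred_dists.argmax(dim=-1)]   # s_final,k for each judge
    kept = judge_scores.sort().values[2:-2]       # reject two maxima, two minima
    return dd * kept.sum()

c = torch.arange(0, 10.5, 0.5)                          # the 21 score bins
pred_dists = torch.softmax(torch.randn(7, 21), dim=-1)  # K = 7 predicted distributions
print(final_score(pred_dists, c, dd=3.2))
```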
FIG. 6 is a comparison of the present method and other methods provided by embodiments of the present invention in the MTL-AQA data set;
using the self-attentive mechanism and signature distribution learning based action quality assessment model presented herein the present method performs comparative experiments on MTL-AQA datasets with other mainstream action quality assessment models. Compared with models with different structures and depths, the method has the advantage that the influence of the algorithm on the action quality evaluation task is analyzed. The experiment takes the spearman grade correlation coefficient as an evaluation index, the experimental result shows that the posture characteristic extraction method based on manual design has the worst effect, the method using the regression model takes the final score as a supervision label, is superior to most methods at present, predicts the score by using the regression method, is superior to the existing action quality evaluation algorithm based on regression, proves the effectiveness of the self-attention mechanism with positive and negative correlation heads, shows that the self-attention module and the mark distribution learning module greatly improve the invention, and the method using the mark distribution learning is superior to the former, has the spearman grade correlation coefficient reaching 0.9384, and is superior to the standard method with the best effect at present. The experimental results fully indicate the effectiveness of the method. The experimental results show that the method proposed and improved herein has better generalization than other models and demonstrates the effectiveness of the method.
FIG. 7 is a comparison of the present method and other methods provided by embodiments of the present invention in a JIGSAWS data set;
the method and other reference methods are tested in three scenes of knotting (knotting), needle threading (Needle paging) and Suturing (Suturing) in the JIGSAWS data set. The number of video frames of the jitswaws dataset dynamically changes with the sample video length, so 160 frames are randomly sampled as input to the model, which then divides the video into 10 segments of 16 frames each, as with the MTL-AQA dataset. Comparing the experimental results with the structures of the reference methods, the results show that the method has better performance in the tasks of suturing (0.7806) and threading (0.8040) than all other reference methods, the average Spireman grade coefficient in the three tasks is 0.7762, and the experimental results fully prove the effectiveness of the method in the task of motion quality evaluation.
FIG. 8 is a visualization of assessment results provided by embodiments of the present invention;
and visualizing the quality evaluation result of the method by using a scatter diagram, wherein each scatter represents a prediction sample used by the method, the y-axis represents a prediction score, the x-axis represents a real score, and the real sample is represented by a dotted line. The more concentrated the scattered point distribution is, the closer the prediction result is to the real sample, and the higher the model is accurate. The action quality evaluation result based on self-attention and mark distribution learning is close to real sample data, and the effectiveness of the method is fully proved by the experimental result.

Claims (8)

1. An action quality assessment method based on self-attention and label distribution learning, characterized by comprising the following steps:
step 1, preprocessing the video: downsampling the original video to obtain an input video of total length L, V = {F_1, F_2, …, F_L}, where F_l denotes the l-th frame; then segmenting it into n mutually overlapping clips C = {C_1, C_2, …, C_n}, where C_n denotes the n-th clip, each clip containing M frames, and every frame in every clip being downsampled and data-enhanced;
step 2, inputting each video clip C preprocessed in step 1 to a feature extraction module to generate the spatio-temporal feature sequence F_α = {α_1, α_2, …, α_n}, where α_n denotes the spatio-temporal feature of the n-th clip;
step 3, inputting the spatio-temporal feature sequence F_α of the video clips to a self-attention module to obtain the self-attention feature sequence F_β = {β_1, β_2, …, β_n} containing the context information between the sequence elements, where β_n denotes the self-attention feature of the n-th clip;
step 4, concatenating the self-attention feature sequence F_β and inputting it to a label distribution learning module, which outputs the predicted distribution s_pre = {p_pre(c_1), p_pre(c_2), …, p_pre(c_m)}, where p_pre(c_i) denotes the predicted probability of the score c_i;
step 5, converting the true label s, whose value ranges from 0 to m, into the true distribution s_true = {p_true(c_1), p_true(c_2), …, p_true(c_m)} using a Gaussian function, where p_true(c_i) denotes the true probability of the score c_i;
step 6, computing the loss between the predicted distribution obtained in step 4 and the true distribution obtained in step 5, minimizing the loss, and training the model;
step 7, evaluating the test videos with the model trained in step 6 to obtain the predicted distribution on the test set and, from it, the evaluation scores on the test data set.
2. The action quality assessment method based on self-attention and label distribution learning according to claim 1, characterized in that: in step 1, the original video is downsampled and divided into 10 clips of 16 frames each; downsampling refers to center-cropping each picture, and data enhancement refers to randomly selecting one of nearest-neighbor interpolation, bilinear interpolation, bicubic interpolation, and the Lanczos algorithm to process each frame of each clip.
3. The action quality assessment method based on self-attention and label distribution learning according to claim 1, characterized in that: the feature extraction module in step 2 consists of a three-dimensional convolutional neural network, I3D, with multiple receptive fields, comprising three-dimensional convolution layers, three-dimensional max-pooling layers, a three-dimensional average-pooling layer, and Inception layers; each Inception layer comprises several convolution layers and a max-pooling layer; these modules are connected according to the I3D network structure to obtain the spatio-temporal feature sequence F_α.
4. The action quality assessment method based on self-attention and label distribution learning according to claim 1, characterized in that: the self-attention module in step 3 consists of two self-attention heads, one positively and one negatively correlated; for each clip, a self-attention head linearly maps the input spatio-temporal feature α_n to a Query (q_n), a Key (k_n), and a Value (v_n), the Query being used to match Keys and the Value representing the information extracted from the input spatio-temporal feature α_n; dot products with the other clips, scaling, and a Softmax function then generate the self-attention feature sequence F_β, extracted with reference to all clips.
5. The action quality assessment method based on self-attention and label distribution learning according to claim 1, characterized in that: in step 4, the label distribution learning module comprises a multi-layer perceptron; the self-attention feature sequences F_β of all video clips are concatenated, passed through the multi-layer perceptron, and finally turned into the predicted distribution by a Softmax.
6. The action quality assessment method based on self-attention and label distribution learning according to claim 1, characterized in that: the specific method of converting the true label into the true distribution in step 5 is: generating the true distribution with the sample's true label s as the mean and the hyper-parameter σ as the variance.
7. The action quality assessment method based on self-attention and label distribution learning according to claim 1, characterized in that: in step 6, a KL-divergence loss function is used between the predicted distribution s_pre and the true distribution s_true, the overall loss function being

L(\{s_{pre}\}) = \mathrm{KL}(s_{true} \| s_{pre}) = \sum_{i=1}^{m} p_{true}(c_i) \log \frac{p_{true}(c_i)}{p_{pre}(c_i)}

where L({s_pre}) denotes the overall loss of each training sample; the above loss function is optimized with an Adam optimizer.
8. The action quality assessment method based on self-attention and label distribution learning according to claim 1, characterized in that: in step 7, the Spearman rank correlation coefficient is adopted as the evaluation index.
CN202111000981.6A 2021-08-30 2021-08-30 Action quality evaluation method based on self-attention and label distribution learning Active CN113642513B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111000981.6A CN113642513B (en) 2021-08-30 2021-08-30 Action quality evaluation method based on self-attention and label distribution learning


Publications (2)

Publication Number Publication Date
CN113642513A CN113642513A (en) 2021-11-12
CN113642513B (en) 2022-11-18

Family

ID=78424319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111000981.6A Active CN113642513B (en) 2021-08-30 2021-08-30 Action quality evaluation method based on self-attention and label distribution learning

Country Status (1)

Country Link
CN (1) CN113642513B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111784121A (en) * 2020-06-12 2020-10-16 清华大学 Action quality evaluation method based on uncertainty score distribution learning
CN112085102A (en) * 2020-09-10 2020-12-15 西安电子科技大学 No-reference video quality evaluation method based on three-dimensional space-time characteristic decomposition
CN112954312A (en) * 2021-02-07 2021-06-11 福州大学 No-reference video quality evaluation method fusing spatio-temporal characteristics


Also Published As

Publication number Publication date
CN113642513A (en) 2021-11-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant