CN114821812A - Deep-learning-based skeleton point action recognition method for figure skaters - Google Patents
- Publication number
- CN114821812A (application CN202210721105.0A)
- Authority
- CN
- China
- Prior art keywords
- action
- track
- video set
- training
- motion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention relates to a deep-learning-based skeleton point action recognition method for figure skaters. An action video set to be classified is randomly divided into a training video set and a test video set, from which a training trajectory and a test trajectory of each action are computed; the training trajectory is input into an action recognition model for action recognition, and the model is then optimized with the test trajectory to obtain a deep action recognition model that extracts actions quickly. Through the deep action recognition model, an action video set under test is acquired, and the limb actions and trajectories of the subject in it are recognized to obtain a recognition result for the test set. The method enables fast action recognition, overcomes the non-uniform standards and low efficiency of the purely manual evaluation of the prior art, and improves both the efficiency and the consistency of evaluation.
Description
Technical Field
The invention relates to the technical field of sports training assistance, and in particular to a deep-learning-based skeleton point action recognition method for figure skaters.
Background
Human action recognition is an important research direction at the intersection of computer vision, pattern recognition, image processing, artificial intelligence and other disciplines, with great application value and theoretical significance in human-computer interaction, intelligent surveillance and medical care. It analyzes image sequences containing people, extracts features and classifies moving objects, so as to recognize and understand individual human actions and the interactions between people and their environment.
In recent years many skeleton-based action recognition methods have been proposed. Their basic principle is to combine key skeletal posture features into action sequences and to distinguish actions by comparing the probabilities with which different postures appear in them, or the differences between postures. Compared with earlier silhouette- or contour-based recognition methods, static skeleton modeling improves the recognition rate to some extent, but it does not fully exploit the temporal and spatial characteristics of the skeleton, struggles to separate similar actions such as waving and drawing a symbol, and is therefore of limited use in real environments.
Dynamic skeleton modeling has also been proposed: the action sequence is treated as a spatio-temporal dynamic problem, motion features of the skeletal nodes are extracted, and the recognition result is obtained through feature analysis and classification.
This noticeably improves recognition accuracy, but because the spatio-temporal characteristics of the skeleton are complex and robust motion features are hard to obtain, much current research is devoted to building effective feature-extraction models. Moreover, if the skeleton data are inaccurate due to occlusion or viewpoint change, the recognition result is strongly affected.
Figure skating is a highly athletic and highly watchable discipline that has always held an important place in international sport, but when figure skating actions are coached and improved, the judgment of action accuracy depends heavily on the judges' experience, which easily leads to subjective judgments, low efficiency and scoring disputes.
Disclosure of Invention
To overcome at least some of these shortcomings of the prior art, the deep-learning-based skeleton point action recognition method for figure skaters can improve the recognition of an athlete's actions and score the actions according to the defects found in them.
The deep-learning-based skeleton point action recognition method for figure skaters according to the invention comprises the following steps:
S1, acquiring the action video set to be classified through the client and uploading it to a cache area of the server, where it is stored;
S2, randomly dividing the action video set to be classified in the cache area into a training video set and a test video set, computing a training trajectory and a test trajectory of the action from them respectively, inputting the training trajectory into an action recognition model for action recognition, and then optimizing the action recognition model with the test trajectory to obtain a deep action recognition model that extracts actions quickly;
S3, acquiring an action video set under test through the deep action recognition model and recognizing the limb actions and trajectories of the subject in it to obtain a recognition result for the test set;
S4, decomposing the recognition result of the test set and comparing it with the standard figure skating action model through a scoring system, scoring according to the degree to which the trajectories match, and outputting the score, wherein the test set is a single action under test or a set of several actions under test.
Further, the method comprises acquiring athletes' daily training videos through a coaching system, uploading them to a deep action recognition model deployed on a cloud server for skeleton point action analysis, deriving improvement suggestions from the scores and action defects given by the deep action recognition model, and storing the daily training videos in the coaching system, so that a user can access the coaching system through a human-computer interface and replay actions and/or inspect postures by accessing the daily videos.
Further, the method comprises filtering out the background and invalid actions in the video set under test with a first filtering module arranged at the client, and extracting features, through the following steps:
A1, extracting the three-dimensional coordinates of the 16 most active skeletal joint points in the training or test video set, namely head, shoulder center, spine, hip center, left shoulder, left elbow, left wrist, right shoulder, right elbow, right wrist, left hip, left knee, left ankle, right hip, right knee and right ankle;
A2, computing the translation matrix and quaternion rotation of the 16 skeletal joint points: the translation matrix represents the change of position of a joint between the current frame and the previous frame, the quaternion rotation represents the change of its angle, and together the position change and angle change constitute the motion features of the joint;
A3, forming motion features based on body parts: the human body is divided into 9 parts, and the motion features of the skeletal joints belonging to each part are fused to form part-based motion features; the 9 parts are the trunk, left upper arm, left forearm, right upper arm, right forearm, left thigh, left shank, right thigh and right shank.
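The per-joint features of steps A1–A3 can be loosely sketched as follows. This is an illustrative Python sketch under simplifying assumptions — the translation is given as a vector, and the rotation quaternion is derived from bone direction vectors across two frames; all function names are hypothetical, not from the patent:

```python
import math

def translation(p_prev, p_cur):
    """Per-joint translation between consecutive frames (vector form;
    it would occupy the last column of a 4x4 homogeneous translation matrix)."""
    return tuple(c - p for p, c in zip(p_prev, p_cur))

def rotation_quaternion(v_prev, v_cur):
    """Unit quaternion (w, x, y, z) rotating direction v_prev onto v_cur."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))
    def unit(v):
        n = norm(v)
        return tuple(x / n for x in v)
    a, b = unit(v_prev), unit(v_cur)
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    angle = math.acos(dot)
    # rotation axis = normalized cross product of the two directions
    cx = a[1] * b[2] - a[2] * b[1]
    cy = a[2] * b[0] - a[0] * b[2]
    cz = a[0] * b[1] - a[1] * b[0]
    n = norm((cx, cy, cz))
    if n < 1e-12:                       # parallel directions: identity rotation
        return (1.0, 0.0, 0.0, 0.0)
    s = math.sin(angle / 2.0) / n
    return (math.cos(angle / 2.0), cx * s, cy * s, cz * s)
```

Fusing the part-based features of step A3 would then amount to concatenating or averaging these per-joint translations and quaternions over the joints belonging to each of the 9 parts.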
Further, the deep action recognition model comprises an ST-GCN skeleton point classification model and a denoising encoder, and the ST-GCN skeleton point classification model is constructed through the following steps:
B1: preprocessing the data acquired in the training trajectory with the denoising encoder, removing non-smooth endpoints and incomplete trajectories and separating the different action groups from one another, to obtain several segments of smooth training trajectory;
B2: building an ST-GCN network and an action-trajectory fitting unit, and embedding the fitting unit after the convolution layers of the ST-GCN network to form the overall network;
B3: training the network with the training set and optimizing its parameters to obtain a trajectory-based skeleton behavior recognition network;
B4: inputting the test set into the network obtained in step B3 for prediction and outputting the corresponding action category.
Furthermore, at least three three-dimensional cameras are connected to the client and installed around the target under test, and the position of each camera can track the target. Captured video is first cached in the cameras, then time-stamped by time period, split according to the time stamps, shuffled, and uploaded to the cache area of the server.
Further, the method also comprises constructing a standard figure skating action model, through the following steps:
C1, acquiring a standard action video set of figure skaters whose actions are standard;
C2, using the standard action video set as a training set to compute training trajectories of the actions, and inputting them into the action recognition model for action recognition to obtain a standard figure skating action model, which serves as the scoring reference of the scoring system.
Furthermore, the scoring system comprises several scoring modules that score the action along different evaluation dimensions: completion of a single action, completion of the overall performance, fluency of the transitions between actions, difficulty of a single action, and difficulty of the overall performance.
Further, the method comprises manifold-mapping the training or test action video set through the denoising encoder, as follows: each action in the set is represented as a collection of the motion features of the 9 body parts; these features are mapped onto a low-dimensional manifold by a locally linear embedding algorithm, so that each action yields 9 part trajectories corresponding to the 9 parts, where a part involved in the action traces a curve and an uninvolved part collapses to a point; the training trajectories and test trajectories are obtained in this way.
Furthermore, the ST-GCN network predicts a trajectory; the action-trajectory fitting unit fits the trajectory predicted by the ST-GCN network to the trajectory of the test set obtained through the denoising encoder, yielding a fitted trajectory; the predicted and fitted trajectories are then differenced, and the resulting difference data are passed to the scoring system, which scores the action along the various dimensions.
Further, the ST-GCN skeleton point classification model is constructed through the following steps:
D1, inputting the skeleton sequence, normalizing the input matrix and constructing the topological graph structure;
D2, transforming the temporal and spatial dimensions with ST-GCN units that alternate GCN and TCN layers;
D3, classifying the features with average pooling and a fully connected layer, then outputting the action classification result through the improved Softmax.
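The spatial half of step D2 (the GCN part of an ST-GCN unit) can be illustrated with a minimal single-frame sketch: each joint's features are aggregated from its neighbors through a degree-normalized adjacency matrix with self-loops and then linearly projected. This is a toy plain-Python version, not the patent's implementation, and all names are hypothetical:

```python
def normalize_adjacency(A):
    """Row-normalized adjacency with self-loops: D^-1 (A + I)."""
    n = len(A)
    A_hat = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    return [[A_hat[i][j] / sum(A_hat[i]) for j in range(n)] for i in range(n)]

def graph_conv(X, A, W):
    """One spatial graph-convolution step: aggregate neighbor features (A @ X),
    then project into the output channels (@ W).
    X: V x C_in joint features, A: V x V normalized adjacency, W: C_in x C_out."""
    n, c_in = len(X), len(X[0])
    c_out = len(W[0])
    AX = [[sum(A[i][k] * X[k][j] for k in range(n)) for j in range(c_in)]
          for i in range(n)]
    return [[sum(AX[i][k] * W[k][j] for k in range(c_in)) for j in range(c_out)]
            for i in range(n)]
```

In the full model this operation runs per frame, and the TCN part then convolves each joint's feature sequence along the time axis.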
The invention has the following advantages: introducing a deep learning algorithm to recognize figure skating actions quickly and effectively improves the recognition of athletes' actions and allows the actions to be scored according to the defects found in them; at the same time, athletes can be coached in a targeted way based on the recognition results, which reinforces the training effect and reduces sports injuries caused by long-term non-standard movements.
In order to make the aforementioned and other objects, features and advantages of the invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
To illustrate the embodiments of the present invention or the prior-art solutions more clearly, the drawings used in their description are briefly introduced below. The drawings described below obviously show only some embodiments of the invention; those skilled in the art can derive further drawings from them without creative effort.
FIG. 1 is a schematic diagram of the steps of the deep-learning-based skeleton point action recognition method for figure skaters.
FIG. 2 is a schematic diagram of the construction of the ST-GCN skeleton point classification model.
FIG. 3 is a schematic diagram of the structure and data flow of the ST-GCN unit of FIG. 2.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In a preferred embodiment of the present invention, the deep-learning-based skeleton point action recognition method for figure skaters comprises the following steps:
S1, acquiring the action video set to be classified through the client and uploading it to a cache area of the server, where it is stored;
S2, randomly dividing the action video set to be classified in the cache area into a training video set and a test video set, computing a training trajectory and a test trajectory of the action from them respectively, inputting the training trajectory into an action recognition model for action recognition, and then optimizing the action recognition model with the test trajectory to obtain a deep action recognition model that extracts actions quickly;
S3, acquiring an action video set under test through the deep action recognition model and recognizing the limb actions and trajectories of the subject in it to obtain a recognition result for the test set;
S4, decomposing the recognition result of the test set and comparing it with the standard figure skating action model through a scoring system, scoring according to the degree to which the trajectories match, and outputting the score, wherein the test set is a single action under test or a set of several actions under test.
In this embodiment, the method further comprises acquiring athletes' daily training videos through a coaching system, uploading them to a deep action recognition model deployed on a cloud server for skeleton point action analysis, deriving improvement suggestions from the scores and action defects given by the deep action recognition model, and storing the daily training videos in the coaching system, so that a user can access the coaching system through a human-computer interface and replay actions and/or inspect postures by accessing the daily videos. In actual implementation, the coaching system can be hosted on the server side or on the user's terminal device. With voice and/or text guidance preset in the coaching system, spoken or written guidance can be given to the athlete once the deep action recognition model has recognized the difference between the athlete's action and the target action, so that the action is corrected in time and the training effect is improved.
In the above embodiment, the method further includes filtering out the background and invalid actions in the video set under test with a first filtering module arranged at the client, and extracting features, through the following steps:
A1, extracting the three-dimensional coordinates of the 16 most active skeletal joint points in the training or test video set, namely head, shoulder center, spine, hip center, left shoulder, left elbow, left wrist, right shoulder, right elbow, right wrist, left hip, left knee, left ankle, right hip, right knee and right ankle;
A2, computing the translation matrix and quaternion rotation of the 16 skeletal joint points: the translation matrix represents the change of position of a joint between the current frame and the previous frame, the quaternion rotation represents the change of its angle, and together the position change and angle change constitute the motion features of the joint;
A3, forming motion features based on body parts: the human body is divided into 9 parts, and the motion features of the skeletal joints belonging to each part are fused to form part-based motion features; the 9 parts are the trunk, left upper arm, left forearm, right upper arm, right forearm, left thigh, left shank, right thigh and right shank.
In the above embodiment, the deep action recognition model includes an ST-GCN skeleton point classification model and a denoising encoder, and the ST-GCN skeleton point classification model is constructed as follows:
B1: preprocessing the data acquired in the training trajectory with the denoising encoder, removing non-smooth endpoints and incomplete trajectories and separating the different action groups from one another, to obtain several segments of smooth training trajectory;
B2: building an ST-GCN network and an action-trajectory fitting unit, and embedding the fitting unit after the convolution layers of the ST-GCN network to form the overall network;
B3: training the network with the training set and optimizing its parameters to obtain a trajectory-based skeleton behavior recognition network;
B4: the test set is input into the network obtained in step B3 for prediction, and the corresponding action category is given.
In this embodiment, at least three three-dimensional cameras are connected to the client and installed around the target under test, and the position of each camera can track the target. Captured video is first cached in the cameras, then time-stamped by time period, split according to the time stamps, shuffled, and uploaded to the cache area of the server. In actual implementation at least 4 three-dimensional cameras are used: one serves as the reference, the positions of the other 3 or more are calibrated against it, and the calibrated cameras track and film the athlete's actual motion trajectory, so that more complete limb-action data can be acquired.
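The caching, time-tagging, splitting and out-of-order upload described above can be sketched as follows; the segment tags would let the server reassemble the original order after the shuffled upload. All names and the seeded shuffle are illustrative assumptions, not the patent's protocol:

```python
import random

def split_and_shuffle(frame_timestamps, segment_seconds, seed=0):
    """Tag frames by time period, split them into segments, then shuffle
    the segment order before upload. Returns a list of (tag, frames) pairs."""
    segments = {}
    for t in frame_timestamps:
        segments.setdefault(int(t // segment_seconds), []).append(t)
    tagged = [(tag, frames) for tag, frames in sorted(segments.items())]
    rng = random.Random(seed)
    rng.shuffle(tagged)   # out-of-order upload; tags allow server-side reassembly
    return tagged
```

Sorting the received pairs by tag on the server side restores the chronological order.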
In the above embodiment, the method further comprises constructing a standard figure skating action model, through the following steps:
C1, acquiring a standard action video set of figure skaters whose actions are standard;
C2, using the standard action video set as a training set to compute training trajectories of the actions, and inputting them into the action recognition model for action recognition to obtain a standard figure skating action model, which serves as the scoring reference of the scoring system. In actual implementation, the figure skating action model is refined through different combinations of the same standard action video set and is divided into sub-modules according to the length of each action, so that when the athlete's actions are evaluated by the scoring system, different actions can be evaluated independently and the response time of the evaluation is reduced.
In the above embodiment, the scoring system comprises several scoring modules that score the action along different evaluation dimensions: completion of a single action, completion of the overall performance, fluency of the transitions between actions, difficulty of a single action, and difficulty of the overall performance. In actual implementation, the total score can be obtained by adding the scores of these dimensions, or by multiplying them by evaluation coefficients assigned to the different dimensions.
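The two aggregation schemes just mentioned — plain addition, and weighting each dimension by an evaluation coefficient — might look like this; the dimension names are illustrative, not from the patent:

```python
def total_score_additive(scores):
    """Total score as the plain sum of the dimension scores."""
    return sum(scores.values())

def total_score_weighted(scores, coeffs):
    """Total score with each dimension multiplied by its evaluation coefficient."""
    return sum(scores[k] * coeffs[k] for k in scores)

# Illustrative dimensions matching the five listed in the text.
example_scores = {
    "single_completion": 8.0,
    "overall_completion": 7.5,
    "fluency": 9.0,
    "single_difficulty": 6.0,
    "overall_difficulty": 7.0,
}
```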
In the above embodiment, the method further comprises manifold-mapping the training or test action video set through the denoising encoder, as follows: each action in the set is represented as a collection of the motion features of the 9 body parts; these features are mapped onto a low-dimensional manifold by a locally linear embedding algorithm, so that each action yields 9 part trajectories corresponding to the 9 parts, where a part involved in the action traces a curve and an uninvolved part collapses to a point; the training trajectories and test trajectories are obtained in this way.
In the above embodiment, the ST-GCN network predicts a trajectory; the action-trajectory fitting unit fits the trajectory predicted by the ST-GCN network to the trajectory of the test set obtained through the denoising encoder, yielding a fitted trajectory; the predicted and fitted trajectories are then differenced, and the resulting difference data are passed to the scoring system, which scores the action along the various dimensions.
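The differencing of the predicted and fitted trajectories can be sketched as a point-wise distance whose summary is handed to the scoring system. The patent does not specify the difference measure, so Euclidean distance between corresponding samples is assumed here, and the names are hypothetical:

```python
import math

def trajectory_difference(predicted, fitted):
    """Point-wise Euclidean distance between corresponding samples of two
    trajectories, given as equal-length lists of coordinate tuples."""
    if len(predicted) != len(fitted):
        raise ValueError("trajectories must be sampled at the same instants")
    return [math.dist(p, f) for p, f in zip(predicted, fitted)]

def mean_deviation(predicted, fitted):
    """Scalar summary of the difference data passed on to the scoring system."""
    diffs = trajectory_difference(predicted, fitted)
    return sum(diffs) / len(diffs)
```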
Referring to FIG. 2, in actual implementation the ST-GCN skeleton point classification model is constructed as follows: the skeleton sequence is input, the input matrix is normalized, and the topological graph structure is constructed; ST-GCN units transform the temporal and spatial dimensions by alternately applying GCN and TCN; the features are classified with average pooling and a fully connected layer, and the action classification result is output through the improved Softmax. In actual use, 9 ST-GCN units are employed, numbered 1 to 9, with the stride of the 4th and 7th temporal convolution layers set to 2; the input and output of each ST-GCN unit are shown in FIG. 2.
An improved Dropout is adopted in constructing the ST-GCN skeleton point classification model; the Huber loss function is selected, accuracy is measured by top-1 and top-5, and stochastic gradient descent with momentum is used as the optimization function; the weights are initialized, the data, model and optimizer are loaded, and end-to-end training is performed. In actual implementation, to keep outliers from having a large influence on the result and to improve the robustness of the model, the Huber loss function is used; the corrected formula is as follows:
\[
L_{\delta}\left(v_{t+1}, \hat{v}_{t+1}\right) =
\begin{cases}
\dfrac{1}{2}\left(v_{t+1} - \hat{v}_{t+1}\right)^{2}, & \left|v_{t+1} - \hat{v}_{t+1}\right| \le \delta \\[6pt]
\delta \cdot \left|v_{t+1} - \hat{v}_{t+1}\right| - \dfrac{1}{2}\delta^{2}, & \text{otherwise}
\end{cases}
\]
where v_{t+1} is the actual value, \hat{v}_{t+1} is the value predicted by the model, \delta is the threshold, L_{\delta} is the Huber loss value, and \cdot denotes multiplication.
when the MSE is used as a loss function, the model often forcibly fits singular point data because the loss function value needs to be reduced, so that the prediction result is influenced; huber Loss is a parameterized piecewise Loss function used in the regression problem, which has the advantage of enhancing the robustness of the mean square error Loss function (MSE) to outliers. Given a delta, it takes the squared error when the prediction bias is less than delta, and reduces Loss when the prediction bias is greater than delta, using a linear function. The method can reduce the weight of singular data points on Loss calculation, avoids model overfitting compared with least square linear regression, and reduces the punishment degree of outliers by Huberloss.
In ST-GCN all channels share one adjacency matrix, which means they all share the same aggregation kernel — a case known as coupled aggregation. In a convolutional neural network, however, each channel has its own convolution-kernel parameters, which guarantees the diversity of the extracted features; therefore different groups of channels are processed with different adjacency matrices. These adjacency matrices are analogous to convolution kernels in that their parameters can be trained, which greatly increases the diversity of the adjacency matrices. When n = C, every channel generates its own spatial aggregation kernel, producing a large number of redundant parameters; when n = 1, the operation degenerates to coupled graph convolution. The formula of the corrected graph convolution is as follows:
\[
\tilde{X}\Big[:,\ \Big\lfloor\tfrac{(i-1)C}{n}\Big\rfloor : \Big\lfloor\tfrac{iC}{n}\Big\rfloor\Big]
= \tilde{A}_{i}\, X\Big[:,\ \Big\lfloor\tfrac{(i-1)C}{n}\Big\rfloor : \Big\lfloor\tfrac{iC}{n}\Big\rfloor\Big]\, W,
\qquad i = 1, \ldots, n
\]
where \tilde{X} is the newly computed feature-map information, X is the original feature map, C is its number of channels, n is the number of groups into which the channels are divided, \lfloor\cdot\rfloor denotes rounding down, and ":" separates the two ends of a python-style slice. \tilde{A}_{i} is the decoupled adjacency matrix selected for the i-th channel group: the first slice index takes all joint information in every channel, and the second selects the channels of the group, so group 1 runs from the first channel to the \lfloor C/n\rfloor-th, group 2 from the following channel onward, and group n up to the last channel. W is the trainable weight of each key point in the channel, i.e. the variable convolution kernel, through which information between different channels — the channel correlation — is computed.
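A toy version of this decoupled aggregation: the channels are split into n contiguous groups, and each group is aggregated with its own adjacency matrix instead of all channels sharing one. This is a plain-Python sketch under that reading (the weight projection is omitted, and all names are hypothetical):

```python
def decoupled_graph_conv(X, A_list):
    """X: V x C joint features; A_list: n adjacency matrices (V x V), one per
    contiguous channel group. Group i's channels are aggregated with A_list[i]."""
    V, C = len(X), len(X[0])
    n = len(A_list)
    width = C // n
    out = [[0] * C for _ in range(V)]
    for g, A in enumerate(A_list):
        lo = g * width
        hi = C if g == n - 1 else (g + 1) * width   # last group absorbs remainder
        for i in range(V):
            for j in range(lo, hi):
                out[i][j] = sum(A[i][k] * X[k][j] for k in range(V))
    return out
```

With n = 1 this collapses to ordinary coupled aggregation; with n equal to the channel count, every channel gets its own adjacency.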
Meanwhile, because the graph neural network is adapted to a non-Euclidean spatial structure, graph convolution mixes the features of each node with those of its neighbor nodes, so deleting a single node alone cannot avoid overfitting. The Dropout mechanism is therefore modified to strengthen the regularization result:
When a node is deleted, part of the nodes around it are deleted at the same time. Two parameters are therefore introduced: the node drop probability $\gamma$ and the neighborhood range $k$ of a dropped node, within which the surrounding nodes are dropped as well; here $k = 1$ is used directly. The hyper-parameters passed in are the probability $p_{keep}$ that a node is kept and the node drop probability $1 - p_{keep}$. Assuming the average degree of each node is $\bar{d} = 2e/n$, where $n$ is the total number of nodes and $e$ is the total number of edges, each seed node drops on average $1 + \bar{d}$ nodes (itself and its neighbors), so the probability that each node is dropped as a seed can be calculated as $\gamma = \dfrac{1 - p_{keep}}{1 + \bar{d}}$.
When classifying the features, average pooling is adopted; after the average pooling, the weights are L2-regularized and the bias is set to 0, and an angular margin coefficient $m$ is introduced as a cosine margin distance. Since $\cos(\theta + m)$ is inconvenient to differentiate during back-propagation, whereas the derivative of $\cos\theta - m$ is unchanged during back-propagation, the Softmax function is changed to:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos\theta_{y_i}-m)}}{e^{s(\cos\theta_{y_i}-m)}+\sum_{j\ne y_i}e^{s\cos\theta_j}}$$

where $s$ is a scale factor that scales the cosine values; $s$ needs to be set large, and 25 is used here to accelerate and stabilize optimization. $y_i$ denotes the class to which the $i$-th sample belongs, $\theta_{y_i}$ is the target angle of that class in the sample space, and $\cos\theta_{y_i}-m$ is the cosine distance. $N$ is the number of training samples, $e$ refers to the natural constant, $c$ represents the number of outputs (classes) of the neural network, $j \ne y_i$ ranges over all classification categories other than $y_i$, and $\theta_j$ is the angle in the sample space corresponding to class $j$.
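The modified Softmax can be sketched as a loss over precomputed cosine similarities. The description above fixes $s = 25$; the margin value $m = 0.35$ in the default below is an assumed value for illustration.

```python
import numpy as np

def margin_softmax_loss(cos_theta, labels, s=25.0, m=0.35):
    """Cosine-margin Softmax loss over precomputed cosine similarities.

    cos_theta : (N, c) cosine of the angle between each sample's feature
                and each class weight (both L2-normalised, bias = 0)
    labels    : (N,) class index y_i of each sample
    s         : scale factor (25 in the description above)
    m         : angular margin coefficient (0.35 is an assumed value)
    """
    cos_theta = np.asarray(cos_theta, dtype=float)
    N = cos_theta.shape[0]
    rows = np.arange(N)
    logits = s * cos_theta
    # subtract the margin only from the target class: s * (cos(theta_yi) - m)
    logits[rows, labels] = s * (cos_theta[rows, labels] - m)
    # numerically stable log-softmax, then mean negative log-likelihood
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()
```

A larger margin shrinks the target-class logit, so the loss increases monotonically in `m`, which is what pushes the learned features of each class apart.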
The principle and implementation of the invention have been explained above through specific embodiments; the description of these embodiments is only intended to help understand the method of the invention and its core idea. Meanwhile, a person skilled in the art may, following the idea of the invention, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.
Claims (10)
1. A deep learning-based skeleton point action recognition method for pattern skating players, characterized by comprising the following steps:
S1, acquiring an action video set to be classified through a client and uploading it to a cache area of a server side for storage;
S2, randomly dividing the action video set to be classified in the cache area into a training video set and a test video set, using them respectively to calculate the training track and the test track of an action, inputting the training track into an action recognition model for action recognition, and then optimizing the action recognition model with the test track to obtain a deep action recognition model that quickly extracts actions;
S3, acquiring a motion video set to be tested through the deep action recognition model, and recognizing the limb actions and tracks of the object in the motion video set to be tested to obtain the recognition result of the test set;
S4, decomposing the recognition result of the test set and comparing it with a standard fancy skating action model through a scoring system, scoring according to the degree of track matching, and outputting the scoring result, wherein the test set is a single action to be tested or a set of multiple actions to be tested.
2. The deep learning-based skeleton point action recognition method for pattern skating players according to claim 1, further comprising: acquiring daily training videos of athletes through a coaching system, uploading them to the deep action recognition model deployed on a cloud server for skeleton point action analysis, and then obtaining improvement suggestions according to the scores of the athletes' actions and the action defects given by the deep action recognition model; the daily training videos are stored in the coaching system, and a user can access the coaching system through a human-computer interaction interface so as to replay actions and/or check postures by reviewing the daily videos.
3. The deep learning-based skeleton point action recognition method for pattern skating players according to claim 1, further comprising filtering out the background and invalid actions in the video set to be tested with a noise-reduction encoder arranged at the server side, and performing feature extraction through the following steps:
A1, extracting the three-dimensional coordinates of 16 relatively active skeletal joint points in the training video set or the test video set, the 16 skeletal joint points being the head, shoulder center, spine, hip center, left shoulder, left elbow, left wrist, right shoulder, right elbow, right wrist, left hip, left knee, left ankle, right hip, right knee and right ankle;
A2, calculating the translation matrix and quaternion rotation of the 16 skeletal joint points: the translation matrix represents the position change of a skeletal joint point between the current frame and the previous frame; the quaternion rotation represents its angle change between the current frame and the previous frame; together, the position change and the angle change constitute the motion features of the skeletal joint point;
A3, forming motion features based on human body parts: dividing the human body into 9 parts and fusing the motion features of the skeletal joint points belonging to each of the 9 parts to form part-based motion features; the 9 parts of the human body are the torso, left upper arm, left lower arm, right upper arm, right lower arm, left upper leg, left lower leg, right upper leg and right lower leg.
4. The deep learning-based skeleton point action recognition method for pattern skating players according to claim 2, wherein the deep action recognition model comprises an ST-GCN skeleton point classification model and a noise-reduction encoder, and the ST-GCN skeleton point classification model is constructed by:
B1: preprocessing the acquired training video set and test video set data with the noise-reduction encoder, removing unsmooth end points and incomplete tracks from the training tracks, and separating different action groups from each other to obtain multiple segments of smooth training tracks;
B2: establishing an ST-GCN network and an action track fitting unit, and embedding the action track fitting unit after the convolution layers of the ST-GCN network to build the overall network;
B3: training the ST-GCN network with the training set and optimizing its parameters to obtain an action-track-based skeleton behavior recognition network;
B4: inputting the test set into the network obtained in step B3 for prediction and outputting the corresponding action category.
5. The deep learning-based skeleton point action recognition method for pattern skating players according to claim 4, wherein the client is connected with at least three three-dimensional cameras installed around the target to be detected, the positions of the three-dimensional cameras can track the target to be detected, the captured videos are first cached in the three-dimensional cameras and time-stamped by time period, the videos are split according to the time stamps, and the split videos are then shuffled and uploaded to the cache area of the server side.
6. The deep learning-based skeleton point action recognition method for pattern skating players according to claim 1, further comprising the construction of a standard fancy skating action model through the following steps:
C1, acquiring a standard action video set of pattern skaters performing standard actions;
C2, calculating the training track of each action with the standard action video set as the training set, and inputting the training track into the action recognition model for action recognition to obtain the standard fancy skating action model, which serves as the scoring reference of the scoring system.
7. The deep learning-based skeleton point action recognition method for pattern skating players according to claim 1, wherein the scoring system comprises a plurality of scoring modules for scoring the performance of an action in different evaluation dimensions, the evaluation dimensions comprising the completion degree of a single action, the completion degree of the overall action, the fluency of action transitions, the difficulty of a single action and the difficulty of the overall action.
8. The deep learning-based skeleton point action recognition method for pattern skating players according to claim 3, further comprising performing manifold mapping on the training action video set or the test action video set through the noise-reduction encoder, specifically: each action in the training or test action video set is represented as a set of the motion features of the 9 parts; the motion features of the 9 parts of each action are mapped onto a low-dimensional manifold through a locally linear embedding algorithm, so that each action forms 9 part tracks corresponding to the 9 parts, the part tracks related to the action being curves and the part tracks unrelated to the action being points, thereby obtaining the training tracks and the test tracks.
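The manifold mapping of claim 8 can be sketched with a minimal locally linear embedding written directly in NumPy; the neighbor count `k`, target dimension `d`, and regularization constant are assumed values, and a production implementation would use an established LLE library.

```python
import numpy as np

def lle(X, k=5, d=2, reg=1e-3):
    """Minimal locally linear embedding sketch.

    X : (T, D) one body part's motion-feature sequence over T frames
    Returns a (T, d) trajectory on the low-dimensional manifold.
    """
    T = X.shape[0]
    W = np.zeros((T, T))
    for i in range(T):
        # k nearest neighbours of frame i (excluding itself)
        dist = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(dist)[1:k + 1]
        Z = X[nbrs] - X[i]                  # neighbours in local coordinates
        G = Z @ Z.T                         # local Gram matrix
        tr = np.trace(G)
        G += (reg * tr if tr > 0 else reg) * np.eye(k)   # regularize
        w = np.linalg.solve(G, np.ones(k))  # reconstruction weights
        W[i, nbrs] = w / w.sum()
    # embedding: bottom eigenvectors of (I - W)^T (I - W), skipping the
    # constant eigenvector
    M = (np.eye(T) - W).T @ (np.eye(T) - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:d + 1]
```

A part driving the action traces out a curve in the embedding, while a nearly static part collapses toward a point, matching the claim's distinction between related and unrelated part tracks.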
9. The deep learning-based skeleton point action recognition method for pattern skating players according to claim 4, wherein the ST-GCN network is used to predict a track, the action track fitting unit fits the track predicted by the ST-GCN network to the track of the test set obtained through the noise-reduction encoder to obtain a fitted track, the predicted track and the fitted track are then differenced to obtain difference data, the difference data are transmitted to the scoring system, and the scoring system scores the actions along the different dimensions.
10. The deep learning-based skeleton point action recognition method for pattern skating players according to claim 1, wherein the ST-GCN skeleton point classification model is constructed by the following steps:
D1, inputting a skeleton sequence, normalizing the input matrix and constructing the topological graph structure;
D2, transforming the temporal and spatial dimensions with ST-GCN units that alternately apply GCN and TCN;
D3, classifying the features with average pooling and a fully connected layer, and outputting the action classification result through the improved Softmax.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210721105.0A CN114821812B (en) | 2022-06-24 | 2022-06-24 | Deep learning-based skeleton point action recognition method for pattern skating players |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114821812A true CN114821812A (en) | 2022-07-29 |
CN114821812B CN114821812B (en) | 2022-09-13 |
Family
ID=82520787
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210721105.0A Active CN114821812B (en) | 2022-06-24 | 2022-06-24 | Deep learning-based skeleton point action recognition method for pattern skating players |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114821812B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106384093A (en) * | 2016-09-13 | 2017-02-08 | 东北电力大学 | Human action recognition method based on noise reduction automatic encoder and particle filter |
CN109063568A (en) * | 2018-07-04 | 2018-12-21 | 复旦大学 | A method of the figure skating video auto-scoring based on deep learning |
CN109948459A (en) * | 2019-02-25 | 2019-06-28 | 广东工业大学 | A kind of football movement appraisal procedure and system based on deep learning |
CN111784121A (en) * | 2020-06-12 | 2020-10-16 | 清华大学 | Action quality evaluation method based on uncertainty score distribution learning |
US20210026329A1 (en) * | 2017-07-11 | 2021-01-28 | Telefonaktiebolaget Lm Ericsson (Publ) | Methods and Arrangements for Robot Device Control in a Cloud |
CN112529941A (en) * | 2020-12-17 | 2021-03-19 | 深圳市普汇智联科技有限公司 | Multi-target tracking method and system based on depth trajectory prediction |
CN112634325A (en) * | 2020-12-10 | 2021-04-09 | 重庆邮电大学 | Unmanned aerial vehicle video multi-target tracking method |
CN112734808A (en) * | 2021-01-19 | 2021-04-30 | 清华大学 | Trajectory prediction method for vulnerable road users in vehicle driving environment |
US20210275807A1 (en) * | 2020-03-06 | 2021-09-09 | Northwell Health, Inc. | System and method for determining user intention from limb or body motion or trajectory to control neuromuscular stimuation or prosthetic device operation |
CN113568410A (en) * | 2021-07-29 | 2021-10-29 | 西安交通大学 | Heterogeneous intelligent agent track prediction method, system, equipment and medium |
WO2022032652A1 (en) * | 2020-08-14 | 2022-02-17 | Intel Corporation | Method and system of image processing for action classification |
Non-Patent Citations (4)
Title |
---|
CHENGMING XU等: "Learning to Score Figure Skating Sport Videos", 《IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY》 * |
HUI-YING LI 等: "Skeleton Based Action Quality Assessment of Figure Skating Videos", 《2021 11TH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY IN MEDICINE AND EDUCATION (ITME)》 * |
LIN FENG等: "Skeleton-Based Action Recognition with Dense Spatial Temporal Graph Network", 《INTERNATIONAL CONFERENCE ON NEURAL INFORMATION PROCESSING, ICONIP 2020》 * |
姜东: "基于骨骼点特征的人体滑冰运动识别", 《中国优秀硕士学位论文全文数据库社会科学Ⅱ辑》 * |
Also Published As
Publication number | Publication date |
---|---|
CN114821812B (en) | 2022-09-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106384093B (en) | A kind of human motion recognition method based on noise reduction autocoder and particle filter | |
CN114220176A (en) | Human behavior recognition method based on deep learning | |
CN115661943B (en) | Fall detection method based on lightweight attitude assessment network | |
Bu | Human motion gesture recognition algorithm in video based on convolutional neural features of training images | |
CN108875586B (en) | Functional limb rehabilitation training detection method based on depth image and skeleton data multi-feature fusion | |
Liu | Objects detection toward complicated high remote basketball sports by leveraging deep CNN architecture | |
CN110490109A (en) | A kind of online human body recovery action identification method based on monocular vision | |
CN114998983A (en) | Limb rehabilitation method based on augmented reality technology and posture recognition technology | |
CN112149472A (en) | Artificial intelligence-based limb action recognition and comparison method | |
Liu et al. | Viewpoint invariant action recognition using rgb-d videos | |
Zhang et al. | A Gaussian mixture based hidden Markov model for motion recognition with 3D vision device | |
Wang et al. | Basketball shooting angle calculation and analysis by deeply-learned vision model | |
Cui et al. | Deep learning based advanced spatio-temporal extraction model in medical sports rehabilitation for motion analysis and data processing | |
CN114821812B (en) | Deep learning-based skeleton point action recognition method for pattern skating players | |
CN115546491B (en) | Fall alarm method, system, electronic equipment and storage medium | |
Pan et al. | Analysis and Improvement of Tennis Motion Recognition Algorithm Based on Human Body Sensor Network | |
Tsai et al. | Temporal-variation skeleton point correction algorithm for improved accuracy of human action recognition | |
Liu et al. | The visual movement analysis of physical education teaching considering the generalized hough transform model | |
CN111274908A (en) | Human body action recognition method | |
CN111209433A (en) | Video classification algorithm based on feature enhancement | |
Nayak et al. | Learning a sparse dictionary of video structure for activity modeling | |
Skublewska-Paszkowska et al. | Attention Temporal Graph Convolutional Network for Tennis Groundstrokes Phases Classification | |
Zhang et al. | Application of optimized BP neural network based on genetic algorithm in rugby tackle action recognition | |
Li | Deep Learning Based Sports Video Classification Research | |
He et al. | Recognition and prediction of badminton attitude based on video image analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||