CN114783046A - CNN and LSTM-based human body continuous motion similarity scoring method - Google Patents


Info

Publication number
CN114783046A
CN114783046A
Authority
CN
China
Prior art keywords
human body
action
frame
video
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210195641.1A
Other languages
Chinese (zh)
Other versions
CN114783046B (en)
Inventor
谢铭
索帅
董建武
吴林涛
郑博文
王立刚
蔡荣华
胡小勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Scistor Technologies Co ltd
Original Assignee
Beijing Scistor Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Scistor Technologies Co ltd filed Critical Beijing Scistor Technologies Co ltd
Priority to CN202210195641.1A priority Critical patent/CN114783046B/en
Publication of CN114783046A publication Critical patent/CN114783046A/en
Application granted granted Critical
Publication of CN114783046B publication Critical patent/CN114783046B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a CNN- and LSTM-based human body continuous action similarity scoring method. Key frames of a standard action video and an action video to be tested are aligned to obtain corresponding action sequence frame sets. For each frame in the standard and test action sequence frame sets, a convolutional neural network detects the human body key points, and the obtained key point coordinate information is converted into human body key included angle information. The key angle sequence of each video is fed into a recurrent neural network to obtain the continuous action feature vector of that video. The distance between the feature vectors of the standard video and the test video is computed and converted into an action similarity score. The invention can produce results for competitions judged against standard actions, scoring in place of or alongside the referees of such competitions.

Description

CNN and LSTM-based human body continuous motion similarity scoring method
Technical Field
The invention relates to the field of video data analysis, and in particular to a human body continuous action similarity scoring method based on a convolutional neural network and a recurrent neural network.
Background
In many areas of sport, fitness, dance events and teaching, athletes or trainees are required to perform standard movements, and referees or instructors must score how well a set of continuous movements is completed. In practice a referee needs specialist knowledge and must rely on eye and memory to compare the performer's continuous movements against the standard. The difficulty is that a single referee with such expertise must score many athletes one by one, which is inefficient.
In addition, video-based teaching within sports teaching software has become popular in recent years; students need feedback on their own movements, yet judges cannot directly evaluate the movements of large numbers of users on the internet.
Disclosure of Invention
The invention provides a human body continuous action similarity scoring method based on a convolutional neural network and a recurrent neural network. It combines the two-dimensional image information of the standard action video and the action video to be tested with the temporal information of the continuous action, and can effectively assist or replace a referee with professional knowledge in scoring motion.
The invention discloses a CNN and LSTM-based human body continuous motion similarity scoring method, which comprises a data preparation stage, a human body key information extraction stage, a continuous motion characteristic vector extraction stage and a scoring stage.
The data preparation stage aligns the key frames of the standard action video and the action video to be tested, generating a standard action sequence frame set and a test action sequence frame set.
The human body key information extraction stage extracts the coordinates of the human body key points in each action frame with a trained convolutional neural network, and converts the key point coordinate information into human body key included angle information, giving a standard action sequence key angle set and a test action sequence key angle set.
The continuous action feature vector extraction stage extracts the feature vector of each video with a trained recurrent neural network (LSTM).
The scoring stage establishes a distance-score mapping relation, computes the distance between the standard action video feature vector and the test action video feature vector, and maps the distance to the final score according to the mapping relation, completing the scoring.
The invention has the advantages that:
(1) The invention completes human body continuous action similarity scoring in an intelligent, automated way, requiring only the standard action video and the action video to be tested as input. This effectively reduces the workload of referees, is efficient, and lowers labor cost.
(2) The method is not limited by time or region and can score human body continuous action similarity in both offline and online motion evaluation scenarios, with few usage restrictions.
(3) The invention converts human body key point coordinate information into key included angle information. Unlike key point coordinates, key angle information is unaffected by body shape, so the information is more accurate and scoring accuracy is effectively improved.
(4) The invention feeds the human body key angle sequence into a recurrent neural network for feature extraction, effectively fusing the temporal continuity information of the motion and further improving scoring accuracy.
Drawings
FIG. 1 is a general flowchart of the human body continuous action similarity scoring method of the present invention;
FIG. 2 is a flow chart of extracting human body key angle information according to the present invention;
FIG. 3 is a schematic diagram illustrating a process of converting coordinate information of key points of a human body into information of key included angles of the human body according to the present invention;
FIG. 4 is a flowchart of the present invention for extracting continuous motion feature vectors;
FIG. 5 is a flowchart of the scoring stage of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
As shown in fig. 1, the overall process of the human body continuous action similarity scoring method based on the convolutional neural network and the recurrent neural network is divided into four stages: a data preparation stage, a human body key information extraction stage, a continuous action feature vector extraction stage and a scoring stage.
In the data preparation stage, the action video to be tested is key-frame aligned to the frame count of the standard action video, as follows:
a. Let the action video to be tested have N_d frames and the standard action video N_s frames; the frame-count difference is:
N_diff = N_s - N_d (1)
If N_diff = 0, no frame insertion or deletion is performed.
If N_diff > 0, the video to be tested requires frame insertion: frame data are inserted at positions in the test video frame sequence given by formulas reproduced only as images in the original, where i denotes the insertion order, i ∈ [1, N_diff].
If N_diff < 0, the video to be tested requires frame deletion: frames are deleted at positions in the test video frame sequence given by a formula reproduced only as an image in the original.
Finally, the frames of the processed action video to be tested and of the standard action video are stored in order as action sequence frame sets.
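The alignment step above can be sketched in Python. The patent's exact insertion and deletion index formulas appear only as images in the source, so evenly spaced positions are assumed here purely for illustration; `align_frames` and its even-spacing rule are hypothetical stand-ins, not the patent's formulas:

```python
import numpy as np

def align_frames(test_frames, standard_frames):
    """Align the test video to the standard video's frame count by
    duplicating (inserting) or deleting frames at evenly spaced
    positions -- an assumed spacing, since the exact index formulas
    are reproduced only as images in the source."""
    n_d, n_s = len(test_frames), len(standard_frames)
    n_diff = n_s - n_d                          # formula (1): N_diff = N_s - N_d
    frames = list(test_frames)
    if n_diff == 0:
        return frames                           # no insertion or deletion
    # |N_diff| evenly spaced source positions to duplicate or delete.
    idx = np.linspace(0, n_d - 1, num=abs(n_diff), dtype=int)
    if n_diff > 0:
        for k, i in enumerate(idx):
            frames.insert(i + k, frames[i + k])  # duplicate the frame there
    else:
        for i in sorted(idx.tolist(), reverse=True):
            del frames[i]                        # drop the frame there
    return frames
```

After alignment, both frame lists have the same length, so the per-frame angle sequences compared later are in one-to-one correspondence.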
In the human body key information extraction stage, as shown in fig. 2, the specific method is as follows:
A. The standard action sequence frame set and the test action sequence frame set produced in the data preparation stage are fed into a trained convolutional neural network.
B. For each frame in the standard and test action sequence frame sets, the convolutional neural network extracts the coordinates of the key points of a single human body; the per-frame key point coordinates are stored in frame order, giving the standard action sequence key point coordinate set and the test action sequence key point coordinate set. The human body key points are the top of the head, upper neck, lower neck, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, pelvis, left knee, right knee, left ankle and right ankle.
The convolutional neural network is SimplePose, a single-person key point extraction network with an encoder-decoder structure. The encoder uses a ResNet, a convolutional neural network with residual connections that performs well at image feature extraction. The decoder applies 3 layers of transposed convolution to upsample the encoder's feature map, yielding the probability that each pixel is a human body key point, from which the key point coordinates are obtained.
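The decoder's output is a per-pixel keypoint probability map from which coordinates are read off. A minimal numpy sketch of that final step (heatmap to coordinate), assuming heatmaps shaped (K, H, W); the function name is illustrative, not from the patent:

```python
import numpy as np

def heatmaps_to_keypoints(heatmaps):
    """Convert per-keypoint probability heatmaps of shape (K, H, W)
    into (x, y) pixel coordinates by taking the argmax of each map."""
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1).argmax(axis=1)  # flat index per keypoint
    ys, xs = np.divmod(flat, w)                    # flat = y * w + x
    return np.stack([xs, ys], axis=1)              # (K, 2) as (x, y)
```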
C. The 16 human body key point coordinates of each frame obtained in step B are converted into 11 key included angles (∠1 to ∠11). The conversion is as follows: the coordinates of three points determine one included angle and its two edges, and the radian of the included angle is obtained with an inverse trigonometric function. This gives the standard action sequence key angle set and the test action sequence key angle set; as shown in fig. 3, the left graph shows the key point coordinate information and the right graph shows the converted key angle information.
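The three-point angle computation of step C can be sketched directly; `joint_angle` is a hypothetical helper name. Note that the inverse cosine returns an unsigned angle in [0, π], whereas the training intervals later span [0, 2π); covering the full range would need a signed variant (e.g. via atan2):

```python
import math

def joint_angle(a, b, c):
    """Radian of the included angle at vertex b, formed by the edges
    b->a and b->c, via the inverse cosine of the normalized dot
    product (unsigned, in [0, pi])."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    cos_t = max(-1.0, min(1.0, dot / norm))  # clamp for float safety
    return math.acos(cos_t)
```

For example, an elbow angle would be `joint_angle(shoulder, elbow, wrist)` on the three detected key point coordinates.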
As shown in fig. 4, in the continuous motion feature vector extraction stage, the specific method is as follows:
a. Each frame of the processed standard action sequence key angle set and of the test action sequence key angle set is input to the LSTM model in sequence order.
b. For the key angle information of each frame in the standard and test key angle sets, the recurrent neural network outputs a 32-dimensional feature vector that incorporates information from the preceding frames. The 32-dimensional vectors output for the frames are concatenated in order to obtain the standard action sequence feature vector and the test action sequence feature vector, completing the feature extraction of the continuous action.
The LSTM model above is a special recurrent neural network structure. One LSTM module takes as input the current-time input, the hidden state of the previous time step and the cell state of the previous time step, and outputs the current-time output, the current hidden state and the current cell state. The hidden state mainly carries short-term information and the cell state carries long-term information, hence the name long short-term memory network. A forget gate, an input gate and an output gate inside the LSTM module control the flow of information over time: the forget gate decides which information of the previous cell state to discard, the input gate decides how much of the current input is added to the current cell state, and the output gate decides which information is output at the current time.
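The gating behaviour described above can be sketched as a minimal numpy LSTM cell run over an 11-angle sequence, emitting one 32-dimensional vector per frame and splicing them in order. The weights here are random and untrained, and biases are omitted; class and parameter names are illustrative only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyLSTM:
    """Minimal LSTM cell with the forget, input and output gates
    described above; untrained random weights, for illustration."""
    def __init__(self, in_dim=11, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        z = in_dim + hidden                # size of concatenated [h, x]
        self.Wf, self.Wi, self.Wo, self.Wc = [
            0.1 * rng.standard_normal((hidden, z)) for _ in range(4)]
        self.hidden = hidden

    def features(self, seq):
        """seq: (T, in_dim) per-frame key angles -> (T*hidden,) spliced
        per-frame feature vectors, each carrying prior-frame context."""
        h = np.zeros(self.hidden)
        c = np.zeros(self.hidden)
        outs = []
        for x in seq:
            z = np.concatenate([h, x])
            f = sigmoid(self.Wf @ z)       # forget gate: drop old cell info
            i = sigmoid(self.Wi @ z)       # input gate: admit new info
            o = sigmoid(self.Wo @ z)       # output gate: expose the state
            c = f * c + i * np.tanh(self.Wc @ z)
            h = o * np.tanh(c)
            outs.append(h)                 # one 32-dim vector per frame
        return np.concatenate(outs)        # spliced in frame order
```

In practice a framework implementation (e.g. a 32-unit LSTM layer) would be used; the sketch only mirrors the gate equations in the text.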
The LSTM model is trained with a self-made data set, constructed as follows: the 11 key angles shown in fig. 3 form one sample, and each angle is divided into categories by range. For example, if ∠2 takes one category per angle of π/6 (the step size), the intervals [0, π/6), [π/6, 2π/6), [2π/6, 3π/6), [3π/6, 4π/6), [4π/6, 5π/6), [5π/6, 6π/6), [6π/6, 7π/6), [7π/6, 8π/6), [8π/6, 9π/6), [9π/6, 10π/6), [10π/6, 11π/6), [11π/6, 12π/6) give 12 intervals, each interval being one angle category, for 2π/(π/6) = 12 categories in total. The number of categories of each angle is freely configurable: let a_i be the number of categories of the i-th angle and β the step size of the interval; then 2π/a_i = β. For example, if ∠2 is configured with 12 categories, it generates 12 intervals with step 2π/12 = π/6; if ∠1 is configured with 6 categories, it generates 6 intervals with step 2π/6 = π/3. The number of classes in the training data set of the samples is:
n = a_1 × a_2 × … × a_11
where a_i denotes the number of categories of the i-th angle.
A certain amount of training sample data is randomly generated for each class according to the value range of each angle in that class. For example, for a sample [θ_{x,1}, θ_{x,2}, …, θ_{x,11}], each angle θ_{x,y} is generated randomly and classified by the interval it falls in; since each sample has 11 angles, combining the 11 angle categories yields a data set with a_1 × a_2 × … × a_11 distinct sample classes.
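The class construction and random sample generation described above can be sketched as follows, using a small per-angle category configuration for illustration (the function name is assumed):

```python
import itertools
import math
import random

def make_dataset(class_counts, samples_per_class=2, seed=0):
    """Each combination of per-angle interval indices is one class
    (prod(a_i) classes in total); each sample draws every angle
    uniformly from its class's interval of width beta_i = 2*pi/a_i."""
    rng = random.Random(seed)
    steps = [2 * math.pi / a for a in class_counts]   # beta_i per angle
    data = []
    combos = itertools.product(*[range(a) for a in class_counts])
    for label, combo in enumerate(combos):
        for _ in range(samples_per_class):
            sample = [rng.uniform(k * s, (k + 1) * s)
                      for k, s in zip(combo, steps)]
            data.append((sample, label))
    return data
```

With the patent's configuration (11 angles, up to 12 categories each) the class count grows multiplicatively, which is why only "a certain amount" of samples is generated per class.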
The LSTM model is trained with the self-made data set as follows: a fully connected layer is added to the output of the LSTM model so that the training model can perform classification, i.e., classify a sample [θ_{x,1}, θ_{x,2}, …, θ_{x,11}] of the data set into one of the a_1 × a_2 × … × a_11 specific classes. After the LSTM model is trained, the fully connected layer is removed and only the LSTM, which outputs the feature vectors, is kept, giving the trained LSTM model.
The scoring stage computes the Euclidean distance between the standard action sequence feature vector and the test action sequence feature vector, and then maps the distance to the final score of the two videos according to a pre-established distance-score mapping relation, as shown in fig. 5.
The Euclidean distance of the two vectors is obtained by substituting the standard action sequence feature vector and the test action sequence feature vector into the Euclidean distance formula.
The distance-score mapping relation maps the Euclidean distance of two action videos to a score. Specifically: X groups of videos are given, each group containing one standard action video and one action video to be tested. If n sets of standard actions need to be scored, n standard action videos are given, and each standard action is paired with m corresponding test videos, producing X = n × m groups: [standard action video 1, test video 1_1], [standard action video 1, test video 1_2], …, [standard action video i, test video i_j], …, [standard action video n, test video n_m], where i ∈ [1, n] and j ∈ [1, m]. Experts in the sport shown in the videos score the action similarity of the X groups from 0 to 100, and the Euclidean distance of each group is then computed as above. The Euclidean distances of the groups scored 100 are averaged; this average is called the full-mark truncation distance. The Euclidean distances of the groups scored 0 are averaged; this average is called the zero-mark truncation distance. A one-variable quadratic equation is fitted by least squares to the expert scores and Euclidean distances of the remaining groups. The distance-score mapping relation is:
score = 100, if d ≤ d_full
score = A·d² + B·d + C, if d_full < d < d_zero
score = 0, if d ≥ d_zero
where A, B and C are the quadratic, linear and constant coefficients of the one-variable quadratic equation; d is the Euclidean distance; d_full and d_zero are the full-mark and zero-mark truncation distances; and score ∈ [0, 100].
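Under the definitions above, the mapping can be sketched as follows: the two truncation distances are group means, the quadratic is least-squares fitted to the remaining pairs, and scores are clamped at the truncation distances. Function names are illustrative, and the piecewise form is inferred from the surrounding text:

```python
import numpy as np

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(np.asarray(u, float) - np.asarray(v, float)))

def fit_score_map(distances, scores):
    """distances/scores: expert-labelled video groups.  Returns a
    score(d) function: 100 below the full-mark truncation distance,
    0 above the zero-mark one, and a fitted quadratic in between."""
    d = np.asarray(distances, float)
    s = np.asarray(scores, float)
    d_full = d[s == 100].mean()              # full-mark truncation distance
    d_zero = d[s == 0].mean()                # zero-mark truncation distance
    mid = (s > 0) & (s < 100)
    A, B, C = np.polyfit(d[mid], s[mid], 2)  # least-squares quadratic fit

    def score(dist):
        if dist <= d_full:
            return 100.0
        if dist >= d_zero:
            return 0.0
        return float(np.clip(A * dist ** 2 + B * dist + C, 0.0, 100.0))
    return score
```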

Claims (7)

1. A CNN and LSTM-based human body continuous motion similarity scoring method is characterized in that: the method comprises a data preparation stage, a human body key information extraction stage, a continuous action characteristic vector extraction stage and a grading stage;
in the data preparation stage, key frames of the standard action video and the action video to be tested are aligned, generating a standard action sequence frame set and a test action sequence frame set;
in the human body key information extraction stage, the coordinates of the human body key points in each action frame are extracted with a trained convolutional neural network, and the key point coordinate information is converted into human body key included angle information, giving a standard action sequence key angle set and a test action sequence key angle set;
the continuous action feature vector extraction stage extracts the feature vector of each video with a trained recurrent neural network (LSTM);
the scoring stage establishes a distance-score mapping relation, computes the distance between the standard action video feature vector and the test action video feature vector, and maps the distance to the final score according to the mapping relation, completing the scoring.
2. The human body continuous motion similarity scoring method based on CNN and LSTM as claimed in claim 1, characterized in that: in the data preparation stage, the method for key-frame aligning the standard action video and the action video to be tested is as follows:
a. Let the action video to be tested have N_d frames and the standard action video N_s frames; the frame-count difference is:
N_diff = N_s - N_d
If N_diff = 0, no frame insertion or deletion is performed;
If N_diff > 0, the action video to be tested requires frame insertion: frame data are inserted at positions in the test video frame sequence given by formulas reproduced only as images in the original, where i denotes the insertion order, i ∈ [1, N_diff];
If N_diff < 0, the action video to be tested requires frame deletion: frames are deleted at positions in the test video frame sequence given by a formula reproduced only as an image in the original.
3. The human body continuous motion similarity scoring method based on CNN and LSTM as claimed in claim 1, characterized in that: in the human body key information extraction stage, a SimplePose single human body key point extraction network is adopted for extracting the human body key points.
4. The human body continuous motion similarity scoring method based on CNN and LSTM as claimed in claim 1, characterized in that: in the human body key information extraction stage, the method for converting the human body key point coordinate information into the human body key included angle information comprises the following steps: and determining an included angle and two edges according to the coordinates of the three points, and obtaining the radian of the included angle by using an inverse trigonometric function.
5. The human body continuous motion similarity scoring method based on CNN and LSTM as claimed in claim 1, characterized in that: in the continuous action feature vector extraction stage, a self-made data set is used to train the recurrent neural network LSTM, and the self-made data set is prepared as follows: all the key angles form one sample, and each angle is divided into categories by range; let a_i be the number of categories of the i-th angle and β the step size of the interval, so that 2π/a_i = β; the number of classes in the final training data set is:
n = a_1 × a_2 × … × a_11 (the product of the a_i over all angles)
6. The human body continuous motion similarity scoring method based on CNN and LSTM as claimed in claim 1, characterized in that: in the continuous action feature vector extraction stage, the recurrent neural network LSTM is trained as follows: a fully connected layer is added to the output of the LSTM model so that the training model can perform classification, i.e., classify a sample of the data set; after the LSTM model is trained, the fully connected layer is removed and only the LSTM, which outputs the feature vectors, is kept, giving the trained LSTM model.
7. The human body continuous motion similarity scoring method based on CNN and LSTM as claimed in claim 1, characterized in that: in the scoring stage, the distance-score mapping relation is a mapping between the Euclidean distance of two action videos and a score, established as follows: X groups of videos are given, each group containing a standard action video and an action video to be tested, and experts in the sport shown in the videos score the action similarity of the X groups from 0 to 100; the Euclidean distances of the X groups are then computed; the Euclidean distances of the groups scored 100 are averaged, the average being called the full-mark truncation distance; the Euclidean distances of the groups scored 0 are averaged, the average being called the zero-mark truncation distance; a one-variable quadratic equation is fitted by least squares to the expert scores and Euclidean distances of the remaining groups, and the distance-score mapping relation is:
score = 100, if d ≤ d_full
score = A·d² + B·d + C, if d_full < d < d_zero
score = 0, if d ≥ d_zero
where A, B and C are the quadratic, linear and constant coefficients of the one-variable quadratic equation; d is the Euclidean distance; d_full and d_zero are the full-mark and zero-mark truncation distances; and score ∈ [0, 100].
CN202210195641.1A 2022-03-01 2022-03-01 CNN and LSTM-based human body continuous motion similarity scoring method Active CN114783046B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210195641.1A CN114783046B (en) 2022-03-01 2022-03-01 CNN and LSTM-based human body continuous motion similarity scoring method


Publications (2)

Publication Number Publication Date
CN114783046A true CN114783046A (en) 2022-07-22
CN114783046B CN114783046B (en) 2023-04-07

Family

ID=82423039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210195641.1A Active CN114783046B (en) 2022-03-01 2022-03-01 CNN and LSTM-based human body continuous motion similarity scoring method

Country Status (1)

Country Link
CN (1) CN114783046B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108446594A (en) * 2018-02-11 2018-08-24 四川省北青数据技术有限公司 Emergency reaction ability assessment method based on action recognition
WO2018191555A1 (en) * 2017-04-14 2018-10-18 Drishti Technologies. Inc Deep learning system for real time analysis of manufacturing operations
CN109829442A (en) * 2019-02-22 2019-05-31 焦点科技股份有限公司 A kind of method and system of the human action scoring based on camera


Also Published As

Publication number Publication date
CN114783046B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN108734104B (en) Body-building action error correction method and system based on deep learning image recognition
CN109522850B (en) Action similarity evaluation method based on small sample learning
CN110448870B (en) Human body posture training method
CN110555387B (en) Behavior identification method based on space-time volume of local joint point track in skeleton sequence
CN113762133A (en) Self-weight fitness auxiliary coaching system, method and terminal based on human body posture recognition
CN112464808A (en) Rope skipping posture and number identification method based on computer vision
CN105913487A (en) Human eye image iris contour analyzing and matching-based viewing direction calculating method
CN109584290A (en) A kind of three-dimensional image matching method based on convolutional neural networks
CN112287891A (en) Method for evaluating learning concentration through video based on expression and behavior feature extraction
CN113627409B (en) Body-building action recognition monitoring method and system
CN112464915B (en) Push-up counting method based on human skeleton point detection
CN109145944B (en) Classification method based on longitudinal three-dimensional image deep learning features
CN111985579A (en) Double-person diving synchronism analysis method based on camera cooperation and three-dimensional skeleton estimation
CN111860157A (en) Motion analysis method, device, equipment and storage medium
CN112488047A (en) Piano fingering intelligent identification method
CN115482580A (en) Multi-person evaluation system based on machine vision skeletal tracking technology
Liao et al. Ai golf: Golf swing analysis tool for self-training
CN113947811A (en) Taijiquan action correction method and system based on generation of confrontation network
CN116844084A (en) Sports motion analysis and correction method and system integrating blockchain
CN111833439A (en) Artificial intelligence-based ammunition throwing analysis and mobile simulation training method
CN117542121B (en) Computer vision-based intelligent training and checking system and method
CN114926762A (en) Motion scoring method, system, terminal and storage medium
CN114092863A (en) Human body motion evaluation method for multi-view video image
CN114783046B (en) CNN and LSTM-based human body continuous motion similarity scoring method
CN112633083A (en) Method for detecting abnormal behaviors of multiple persons and wearing of mask based on improved Openpos examination

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A Method for Evaluating the Similarity of Human Continuous Actions Based on CNN and LSTM

Effective date of registration: 20230802

Granted publication date: 20230407

Pledgee: Zhongguancun Beijing technology financing Company limited by guarantee

Pledgor: BEIJING SCISTOR TECHNOLOGIES CO.,LTD.

Registration number: Y2023990000389
