CN116597507A - Human body action normalization evaluation method and system - Google Patents
- Publication number: CN116597507A (Application CN202310441995.4A)
- Authority
- CN
- China
- Prior art keywords: human, motion, key point, video, action
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The invention discloses a human body action normalization evaluation method and system, comprising the following steps: S1, model the human body in the action video to be evaluated and in a standard action video to obtain human skeleton key point sequences; S2, interpolate the missing values at positions where key point coordinates are missing in all action videos; S3, divide the acquired human skeleton key point sequences into sequence blocks of equal length; S4, extract from the human skeleton key point sequences feature information that contains only the human action elements; S5, after temporal alignment, compare the feature information of the action video to be evaluated with that of the standard action video and generate an action normalization evaluation score. The method and system effectively address two problems of the prior art: unfair competition results caused by subjective factors or reviewer fatigue, and inconsistent evaluation of videos captured from different viewing angles.
Description
Technical Field
The invention relates to the field of human motion analysis, and in particular to a human action normalization evaluation method and system.
Background
With the growing emphasis on health and physical fitness, more and more people actively participate in various forms of sport, and in these sports the normative assessment of movements plays a vital role. However, conventional scoring relies mainly on the subjective judgment of human referees, which has several disadvantages. First, subjectivity is significant: a referee's judgment may be affected by personal preference, emotion, and other factors, leading to unfair competition results. Second, referees are subject to fatigue and error, which also affect their judgment of the competition; even an experienced referee cannot maintain a high level of performance over long periods of continuous work. In addition, in many scenarios the human action videos to be evaluated are captured from different viewing angles, owing to environmental and human factors.
Disclosure of Invention
The invention provides a human action normalization evaluation system, comprising: a pose estimation module, a human skeleton key point sequence missing-coordinate interpolation module, a human skeleton key point sequence segmentation module, a human joint angle feature extraction module, and a human action normalization evaluation module.
Further, the pose estimation module is used to model the human action poses in the video.
Further, the human skeleton key point sequence missing-coordinate interpolation module is used to complete key points missing from video frames.
Further, the human action normalization evaluation module is used to compare the human body features of the specified video with those of the video to be evaluated, and to produce a score evaluation from the comparison.
Further, the system also comprises an autoencoder network module; the autoencoder network module holds an autoencoder model for extracting and analyzing action information features.
Further, the autoencoder model comprises: a motion information encoder, a skeleton structure information encoder, a camera-view information encoder, and a decoder. The three encoders decouple the human skeleton key point sequence into the following three feature vectors: A1, a time-dependent motion information feature vector, representing the motion-related information of the human body; A2, a time-independent human skeleton structure feature, representing the structure of the human skeleton; A3, a time-independent camera-view feature, representing the camera angle at the time the action video was captured.
Further, the decoder is used to recombine the three independent feature vectors obtained by encoding, in order, to reconstruct the corresponding human skeleton key point sequence, compare it with a given ground-truth sample, and compute the corresponding loss.
A human action normalization evaluation method, comprising the steps of: S1, model the human body in the action video to be evaluated and in a standard action video to obtain human skeleton key point sequences; S2, interpolate the missing values at positions where key point coordinates are missing in all action videos; S3, divide the acquired human skeleton key point sequences into sequence blocks of equal length; S4, extract from the human skeleton key point sequences feature information that contains only the human action elements; S5, after temporal alignment, compare the feature information of the action video to be evaluated with that of the standard action video and generate an action normalization evaluation score.
Further, in step S1, human skeleton information in the action video is extracted by the OpenPose algorithm to obtain a human skeleton key point sequence.
Further, the step S2 comprises the following substeps: S21, find the nearest frames before and after the frame with the missing coordinate in which that key point is present, denoted t1 and t2, and compute a weighted combination of their key point coordinates to obtain P_ave, where t is the frame containing the missing key point coordinate and T is the total number of frames in the video; S22, split the sequence into two segments at the position of the missing coordinate, fit a polynomial regression to the key point data of each segment, and obtain regression predictions for the missing key point from the earlier and later segments: P_before = y_j, j = 0, 1, ..., i-1, and P_after = y_j, j = i+1, i+2, ..., T, where y_j is the polynomial-regression prediction and T is the total number of frames; S23, combine the three predictions P_ave, P_before, and P_after with fixed weights to obtain the final prediction: P_(t,i) = (1/2)P_ave + (1/4)P_before + (1/4)P_after.
In step S3, the input two-dimensional human skeleton key point sequence is divided into sequence blocks of equal length by a sliding window of size w and stride r.
Further, the step S5 comprises the following substeps: S51, compute through the DTW algorithm the best matching path W between the two sets of human motion information feature vectors F_1 and F_2: W = {(1,1), ..., (x,y), ..., (n1,n2)}, where a pair (x,y) indicates that element x of one feature vector sequence is aligned with element y of the other, and n1 and n2 are the last elements of the two sequences to be aligned; S52, compute the cosine similarity between every matched pair of motion feature vectors on the best matching path W to obtain the similarity set S of human skeleton data sequence blocks: cos(x, y) = Σ_k x_k·y_k / (√(Σ_k x_k²)·√(Σ_k y_k²)), where x_k and y_k are the k-th elements of the two motion feature vectors; S53, average the elements in the set S and normalize to obtain the final score: score = (1/n_s) Σ_{s∈S} s, where n_s is the number of elements in the set S.
The invention provides a human body action normalization evaluation method and system that effectively address two problems of the prior art: unfair competition results caused by subjective factors or reviewer fatigue, and inconsistent evaluation of videos captured from different viewing angles.
Drawings
FIG. 1 is a flow chart of a method and system for evaluating normalization of human actions according to the present invention;
FIG. 2 is a schematic diagram of a system structure of a method and a system for evaluating normalization of human actions according to the present invention;
fig. 3 is a two-dimensional human body coordinate diagram of a human body motion normalization evaluation method and system according to the present invention.
Detailed Description
The following describes embodiments of the invention in detail with reference to the accompanying drawings. The embodiments described are only some, not all, of the possible embodiments; for clarity, material not related to the invention is omitted from the drawings and the description.
As shown in fig. 1, the invention provides a human action normalization evaluation method comprising the following steps: S1, model the human body in the action video to be evaluated and in a standard action video to obtain human skeleton key point sequences; S2, interpolate the missing values at positions where key point coordinates are missing in all action videos; S3, divide the acquired human skeleton key point sequences into sequence blocks of equal length; S4, extract from the human skeleton key point sequences feature information that contains only the human action elements; S5, after temporal alignment, compare the feature information of the action video to be evaluated with that of the standard action video and generate an action normalization evaluation score.
In step S1, human skeleton information in the action video is extracted by the OpenPose algorithm to obtain the human skeleton key point sequence.
Common interpolation algorithms for missing values in time series include mean interpolation, median interpolation, regression interpolation, and nearest-neighbor interpolation. Because video naturally carries good contextual information, an average-interpolation scheme is a natural choice for filling in missing coordinates: if the coordinate P_(t,i) of the i-th key point is missing in frame t but the corresponding key point is present in the previous and following frames, its predicted value is the mean of those two coordinates, P_(t,i) = (P_(t-1,i) + P_(t+1,i)) / 2.
the step S2 comprises the following substeps: s21, finding out the front and rear nearest neighbor frames of the key point coordinate missing value, and carrying out feature weighting calculation on the front and rear nearest neighbor frames to obtain P ave :And t is j T, wherein T is the frame where the coordinate of the missing key point is located, T1 and T2 respectively represent two frames which are nearest to the frame before and after the frame T and have no missing corresponding to the coordinate of the key point, and T is the total frame number of the video; s22, segmenting the sequence two according to the position of the missing value of the key point coordinate, performing polynomial regression on key point data corresponding to each segment of data, and obtaining regression predicted values of the time sequence of the front segment and the rear segment according to the missing key point: p (P) before =y j ;j=0,1,...,i-1;P after =y j The method comprises the steps of carrying out a first treatment on the surface of the j=i+1, i+2,.. T, in which y j The prediction result of polynomial regression is that T is the total frame number of the video; s23, predicting the obtained P ave 、P before 、P after Weighting calculation is carried out, and a final prediction result is obtained: p (P) t,i = 1 / 2 P ave + 1 / 4 P before + 1 / 4 P after 。
In step S3, the input two-dimensional human skeleton key point sequence is divided into sequence blocks of equal length using a sliding window of size w and stride r.
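The windowing step is straightforward; a minimal sketch (window size w and stride r are parameters chosen by the user):

```python
# Split a key point sequence of length T into fixed-length blocks
# using a sliding window of size w and stride r; trailing frames
# that do not fill a whole window are dropped.
def split_into_blocks(sequence, w, r):
    return [sequence[i:i + w] for i in range(0, len(sequence) - w + 1, r)]
```

For example, a 10-frame sequence with w = 4 and r = 2 yields four overlapping blocks starting at frames 0, 2, 4, and 6.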
The step S5 comprises the following substeps: S51, compute through the DTW algorithm the best matching path W between the two sets of human motion information feature vectors F_1 and F_2: W = {(1,1), ..., (x,y), ..., (n1,n2)}, where a pair (x,y) indicates that element x of one feature vector sequence is aligned with element y of the other, and n1 and n2 are the last elements of the two sequences to be aligned; S52, compute the cosine similarity between every matched pair of motion feature vectors on the best matching path W to obtain the similarity set S of human skeleton data sequence blocks: cos(x, y) = Σ_k x_k·y_k / (√(Σ_k x_k²)·√(Σ_k y_k²)), where x_k and y_k are the k-th elements of the two motion feature vectors; S53, average the elements in the set S and normalize to obtain the final score: score = (1/n_s) Σ_{s∈S} s, where n_s is the number of elements in the set S.
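A minimal sketch of S51-S53, assuming plain Python lists of nonzero feature vectors. The textbook dynamic-programming DTW with cosine distance as step cost, and the final mapping of the mean cosine similarity from [-1, 1] into [0, 1], are assumptions; the patent does not fix a specific DTW implementation or normalization.

```python
import math

def cosine(x, y):
    # Cosine similarity between two (nonzero) feature vectors.
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def dtw_path(F1, F2):
    # S51: DP table of cumulative cosine distance, then backtrack the path W.
    n1, n2 = len(F1), len(F2)
    INF = float("inf")
    D = [[INF] * (n2 + 1) for _ in range(n1 + 1)]
    D[0][0] = 0.0
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            cost = 1.0 - cosine(F1[i - 1], F2[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    path, i, j = [], n1, n2
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((D[i - 1][j - 1], i - 1, j - 1),
                      (D[i - 1][j], i - 1, j),
                      (D[i][j - 1], i, j - 1))
    return list(reversed(path))

def normativity_score(F1, F2):
    # S52: cosine similarity along the matched path;
    # S53: average, then map from [-1, 1] into [0, 1] (assumed normalization).
    sims = [cosine(F1[i], F2[j]) for i, j in dtw_path(F1, F2)]
    return (sum(sims) / len(sims) + 1.0) / 2.0
```

Comparing a feature sequence with itself aligns along the diagonal and yields the maximum score of 1.0.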
As shown in fig. 2, the invention provides a human action normalization evaluation system comprising: a pose estimation module, a human skeleton key point sequence missing-coordinate interpolation module, a human skeleton key point sequence segmentation module, a human joint angle feature extraction module, and a human action normalization evaluation module.
The pose estimation module is used to model the human action poses in the video. The missing-coordinate interpolation module completes key points missing from video frames. The human action normalization evaluation module compares the human body features of the specified video with those of the video to be evaluated and produces a score evaluation from the comparison.
The system further comprises an autoencoder network module holding an autoencoder model for extracting and analyzing action information features. The autoencoder model comprises a motion information encoder, a skeleton structure information encoder, a camera-view information encoder, and a decoder. The three encoders decouple the human skeleton key point sequence into three feature vectors: A1, a time-dependent motion information feature vector, representing the motion-related information of the human body; A2, a time-independent human skeleton structure feature, representing the structure of the human skeleton; A3, a time-independent camera-view feature, representing the camera angle at the time the action video was captured.
The decoder recombines the three independent feature vectors obtained by encoding, in order, to reconstruct the corresponding human skeleton key point sequence, compares it with a given ground-truth sample, and computes the corresponding loss.
Extracting the two-dimensional human skeleton key point sequence: the human action poses in the standard action video and the action video to be evaluated are modeled with an existing pose estimation algorithm; that is, human skeleton information is extracted from the action videos by the pose estimation algorithm to obtain the human skeleton key point sequences, as shown in fig. 3.
Interpolation of missing coordinate values in the human skeleton key point sequence: although the OpenPose algorithm achieves high accuracy for single-person detection, coordinate values may still be missing from the extracted key point sequence, for example because of occlusion of the body or motion blur caused by fast movement. Because the coordinates of a given key point change continuously across the video stream, a missing coordinate in one frame also degrades the information available from the neighboring frames. Only by completing the missing key points can the extracted skeleton data fully express the action pose of the human body in the video, which in turn improves the accuracy of the subsequent action similarity evaluation.
Let the two-dimensional human skeleton key point sequences corresponding to the standard action video and the action video to be evaluated have lengths T_1 and T_2 respectively. A sliding window of size w and stride r divides the two sequences into two sets of human skeleton key point data sequence blocks (patches), X_1 and X_2.
The elements of X_1 and X_2 are then fed in turn to the motion encoder of the trained autoencoder network to extract motion information feature vectors, yielding two sets of human motion information feature vectors, F_1 and F_2.
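The block-to-feature pipeline can be sketched as follows. The encoder here is only a stand-in for the trained motion encoder (a per-joint mean of frame-to-frame displacements), chosen so the sketch runs without the network; the function names are illustrative.

```python
# Stand-in motion encoder: summarises a sequence block as the average
# per-joint displacement between consecutive frames (NOT the patent's
# trained encoder, just a runnable placeholder with the same interface).
def motion_encoder(block):
    # block: list of frames, each frame a list of (x, y) key points
    T, J = len(block), len(block[0])
    feats = []
    for j in range(J):
        dx = sum(block[t + 1][j][0] - block[t][j][0] for t in range(T - 1))
        dy = sum(block[t + 1][j][1] - block[t][j][1] for t in range(T - 1))
        feats.extend([dx / (T - 1), dy / (T - 1)])
    return feats

def extract_features(blocks):
    # Map every sequence block in X_1 (or X_2) to one feature vector in F_1 (or F_2).
    return [motion_encoder(b) for b in blocks]
```

With the real system, `motion_encoder` would be replaced by a forward pass through the trained motion information encoder.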
The invention constrains and separates the three latent features, and reconstructs the corresponding two-dimensional human skeleton key point sequence, using a loss function composed of three terms: a cross-reconstruction loss, a reconstruction loss, and a triplet loss.
Cross-reconstruction loss: when training the autoencoder network, each iteration randomly selects a pair of samples from the dataset and passes them through the encoders and decoder to output two-dimensional human action sequences. The cross-reconstruction loss L_cross is used to minimize the difference between the input and the reconstructed output.
reconstruction loss: in addition to cross-reconstruction, at each iterative training, a network is also required to reconstruct the original inputSample entering:thus, the total reconstruction loss function is: l (L) rec_corss =L rec +L cross 。
Triplet loss: the cross-reconstruction and reconstruction losses ensure that actions of the same category are reassembled into the corresponding two-dimensional skeleton key point sequence after encoding and decoding, but they impose no explicit separation between the different latent features, so the latent space of one attribute may still contain information about the other two. To enhance feature separation, and so that action samples of the same class cluster well in the latent space, the invention introduces the triplet loss from deep metric learning to increase inter-class distances and reduce intra-class distances. The goal of the triplet loss is to make the distance between the positive sample and the anchor sample smaller, and the distance between the negative sample and the anchor sample larger, in the embedding space, so that the anchor-positive distance is less than the anchor-negative distance: L_triplet_M = max(d(a, p) − d(a, n) + m, 0), where a is the anchor sample, p and n are the positive and negative samples, d is the distance in the embedding space, and m is the margin between positive and negative samples, here set to 0.3. Similarly, the skeleton structure and camera-view encoders use triplet losses of the same form, L_triplet_S and L_triplet_V, to enforce the separation of their features.
thus, the total triplet loss function is: l (L) triplct =L triplct_M +L triplct_S +L triplct_V Adding the two loss functions to obtain a total loss function as follows: l=l rec_cross +L triplct 。
The foregoing is merely a preferred embodiment of the invention. The invention is not limited to the form disclosed here; numerous other combinations, modifications, and environments are possible within the scope of the inventive concept, whether through the teaching above or through the skill and knowledge of the relevant art. Modifications and variations that do not depart from the spirit and scope of the invention are intended to fall within the scope of the appended claims.
Claims (12)
1. A human action normalization evaluation system, comprising: a pose estimation module, a human skeleton key point sequence missing-coordinate interpolation module, a human skeleton key point sequence segmentation module, a human joint angle feature extraction module, and a human action normalization evaluation module.
2. The human action normalization evaluation system of claim 1, wherein the pose estimation module is configured to model human action poses in a video.
3. The human action normalization evaluation system of claim 1, wherein the missing-coordinate interpolation module is configured to complete key points missing from video frames.
4. The human action normalization evaluation system of claim 1, wherein the human action normalization evaluation module is configured to compare the human body features of the specified video with those of the video to be evaluated, and to produce a score evaluation from the comparison.
5. The human action normalization evaluation system of claim 1, further comprising an autoencoder network module; the autoencoder network module holds an autoencoder model for extracting and analyzing action information features.
6. The human action normalization evaluation system of claim 5, wherein the autoencoder model comprises: a motion information encoder, a skeleton structure information encoder, a camera-view information encoder, and a decoder; the three encoders decouple the human skeleton key point sequence into the following three feature vectors: A1, a time-dependent motion information feature vector, representing the motion-related information of the human body; A2, a time-independent human skeleton structure feature, representing the structure of the human skeleton; A3, a time-independent camera-view feature, representing the camera angle at the time the action video was captured.
7. The human action normalization evaluation system of claim 6, wherein the decoder is configured to recombine the three independent feature vectors obtained by encoding, in order, to reconstruct the corresponding human skeleton key point sequence, compare it with a given ground-truth sample, and compute the corresponding loss.
8. A human action normalization evaluation method based on the human action normalization evaluation system according to any one of claims 1 to 7, characterized by comprising the steps of: S1, model the human body in the action video to be evaluated and in a standard action video to obtain human skeleton key point sequences; S2, interpolate the missing values at positions where key point coordinates are missing in all action videos; S3, divide the acquired human skeleton key point sequences into sequence blocks of equal length; S4, extract from the human skeleton key point sequences feature information that contains only the human action elements; S5, after temporal alignment, compare the feature information of the action video to be evaluated with that of the standard action video and generate an action normalization evaluation score.
9. The human action normalization evaluation method of claim 8, wherein in step S1, human skeleton information in the action video is extracted by the OpenPose algorithm to obtain a human skeleton key point sequence.
10. The human action normalization evaluation method of claim 8, wherein step S2 comprises the substeps: S21, find the nearest frames before and after the frame with the missing coordinate in which that key point is present, denoted t1 and t2, and compute a weighted combination of their key point coordinates to obtain P_ave, where t is the frame containing the missing key point coordinate and T is the total number of frames in the video; S22, split the sequence into two segments at the position of the missing coordinate, fit a polynomial regression to the key point data of each segment, and obtain regression predictions for the missing key point from the earlier and later segments: P_before = y_j, j = 0, 1, ..., i-1, and P_after = y_j, j = i+1, i+2, ..., T, where y_j is the polynomial-regression prediction and T is the total number of frames; S23, combine the three predictions P_ave, P_before, and P_after with fixed weights to obtain the final prediction: P_(t,i) = (1/2)P_ave + (1/4)P_before + (1/4)P_after.
11. The method for evaluating the normalization of human motion according to claim 8, wherein in said step S3, the input two-dimensional human skeleton key point sequence is divided into sequence blocks of equal length by a sliding window of window size w and step size r.
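The sliding-window segmentation of step S3 amounts to taking every window of w consecutive frames, advancing by r frames each time; `sliding_window_blocks` is a hypothetical helper name for illustration:

```python
def sliding_window_blocks(seq, w, r):
    """Divide a per-frame key point sequence into equal-length blocks
    using a sliding window of size w and step r (claim 11)."""
    return [seq[i:i + w] for i in range(0, len(seq) - w + 1, r)]
```

For a 10-frame sequence with w = 4 and r = 2 this yields four overlapping blocks starting at frames 0, 2, 4, and 6.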
12. The method of evaluating the normalization of human actions according to claim 8, wherein said step S5 comprises the sub-steps of:
S51, calculating, by the DTW algorithm, the best matching path matrix W between the two human motion information feature vector sets F_1 and F_2:
W = {(1, 1), ..., (x, y), ..., (n1, n2)},
where a pair (x, y) indicates that element x of one motion information feature vector is aligned with element y of the other, and n1 and n2 are the last elements of the two motion information feature vectors to be aligned;
S52, calculating the cosine similarity between every successfully matched pair of motion information feature vectors on the best matching path matrix W to obtain the similarity set S of human skeleton data sequence blocks:
s = (Σ_k x_k·y_k) / (√(Σ_k x_k²)·√(Σ_k y_k²)),
where x_k and y_k are respectively the kth elements of the two motion information feature vectors;
S53, averaging the elements of the set S and normalizing the result:
score = (1/n_s)·Σ_{s∈S} s,
where n_s is the number of elements in the set S.
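Sub-steps S51–S53 can be sketched as below. The claim does not specify the local cost used by DTW or the backtracking rule, so using cosine distance as the step cost and greedy minimum-cost backtracking are assumptions; `dtw_path` and `normalization_score` are hypothetical helper names:

```python
import numpy as np

def cosine_similarity(x, y):
    """S52: cosine similarity between two feature vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def dtw_path(F1, F2):
    """S51: classic DTW alignment between two feature-vector sequences,
    returning the 1-indexed best matching path {(1,1), ..., (n1,n2)}.
    Cosine distance as the local cost is an assumption."""
    n1, n2 = len(F1), len(F2)
    cost = np.full((n1 + 1, n2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            d = 1.0 - cosine_similarity(F1[i - 1], F2[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                 cost[i - 1, j - 1])
    # Backtrack from (n1, n2) to (1, 1) along the cheapest predecessors.
    path, i, j = [], n1, n2
    while (i, j) != (1, 1):
        path.append((i, j))
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)],
                   key=lambda p: cost[p])
    path.append((1, 1))
    return path[::-1]

def normalization_score(F1, F2):
    """S52-S53: score each matched pair on the path, then average."""
    sims = [cosine_similarity(F1[i - 1], F2[j - 1])
            for i, j in dtw_path(F1, F2)]
    return sum(sims) / len(sims)
```

Two identical feature sequences align along the diagonal, every matched pair has cosine similarity 1, and the averaged score is 1.0.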
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310441995.4A CN116597507A (en) | 2023-04-23 | 2023-04-23 | Human body action normalization evaluation method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116597507A true CN116597507A (en) | 2023-08-15 |
Family
ID=87603581
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310441995.4A Pending CN116597507A (en) | 2023-04-23 | 2023-04-23 | Human body action normalization evaluation method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116597507A (en) |
Cited By (2)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN117809849A (en) * | 2024-02-29 | 2024-04-02 | 四川赛尔斯科技有限公司 | Analysis method and system for walking postures of old people with cognitive dysfunction
CN117809849B (en) * | 2024-02-29 | 2024-05-03 | 四川赛尔斯科技有限公司 | Analysis method and system for walking postures of old people with cognitive dysfunction
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||