CN114611600A - Self-supervision-based three-dimensional pose estimation method for skiers - Google Patents

Self-supervision-based three-dimensional pose estimation method for skiers

Info

Publication number
CN114611600A
CN114611600A (application CN202210229185.8A)
Authority
CN
China
Prior art keywords
dimensional
network
training
joint
posture estimation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210229185.8A
Other languages
Chinese (zh)
Inventor
鲍文霞
马中玉
王年
朱明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202210229185.8A priority Critical patent/CN114611600A/en
Publication of CN114611600A publication Critical patent/CN114611600A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a self-supervision-based three-dimensional pose estimation method for skiers, which overcomes the prior-art difficulty of estimating a skier's three-dimensional pose from video. The invention comprises the following steps: acquiring a training data set; constructing a three-dimensional human pose estimation network model; training the three-dimensional human pose estimation network model; acquiring the skiing motion image to be estimated; and obtaining the skier's three-dimensional pose estimation result. With this method, accurate estimation of the skier's three-dimensional pose is achieved using a two-dimensional data set, without requiring three-dimensional ground-truth labels.

Description

Self-supervision-based three-dimensional pose estimation method for skiers
Technical Field
The invention relates to the technical field of three-dimensional human pose estimation, and in particular to a self-supervision-based three-dimensional pose estimation method for skiers.
Background
Skiing research builds on multiple disciplines, comprehensively analyzing the factors that influence performance in aerial-skill events and providing training-practice services that help coaches guide the sport. Pose estimation and analysis of skiers is important both for achieving better performance and for avoiding joint injuries caused by improper movements.
Human pose estimation mainly refers to detecting the positions of the body's joint points and the direction and angle information of the bones from an image. Thanks to the emergence of large-scale two-dimensional pose annotations and deep neural networks, two-dimensional human pose estimation has achieved great success in recent years. In contrast, progress in three-dimensional human pose estimation remains limited: on one hand, recovering three-dimensional information from a single image suffers from semantic ambiguity; on the other hand, annotation information for three-dimensional data sets is difficult and costly to obtain, so large-scale data sets with three-dimensional ground-truth annotations are lacking.
Disclosure of Invention
The invention aims to overcome the difficulty in the prior art of estimating a skier's three-dimensional pose from video, and provides a self-supervision-based three-dimensional pose estimation method for skiers to solve this problem.
To achieve this purpose, the technical solution of the invention is as follows:
A self-supervision-based three-dimensional pose estimation method for skiers comprises the following steps:
acquiring a training data set: constructing a two-dimensional training data set from the image portions of the public MPII and Human3.6M data sets, and preprocessing it;
constructing a three-dimensional human pose estimation model: the model is based on ResNet50 and the WASP module, introduces the CBAM attention mechanism, and constructs three-dimensional labels using epipolar geometry to realize self-supervision;
training the three-dimensional human pose estimation model: pre-training the model on the MPII data set, then performing self-supervised training with the Human3.6M data set, and applying synthetic occlusion to the data during training;
acquiring the skiing motion image to be estimated: acquiring video captured by a high-speed camera at a ski field and extracting it frame by frame into pictures, which serve as the skiing motion images to be estimated;
obtaining the skier's three-dimensional pose estimation result: inputting the skiing motion image to be estimated into the 3D pose estimation network of the trained three-dimensional human pose estimation model to obtain the skier's three-dimensional pose estimation result, and computing the spatial angles of the key joints.
Constructing the three-dimensional human pose estimation model comprises the following steps:
setting the three-dimensional human pose estimation model to comprise an upper branch and a lower branch, where the upper branch is a 2D pose estimation network and the lower branch is a 3D pose estimation network;
setting the upper branch as the 2D network: extracting features with a backbone network, obtaining a volumetric heat map H after a deconvolution operation, and applying soft-argmax to two dimensions of H to obtain the two-dimensional pose U:
extracting features with ResNet50 as the backbone and inserting a CBAM attention module, combining spatial and channel attention, before Layer1 and after Layer4 of ResNet50;
adding a waterfall atrous spatial pooling (WASP) module after the ResNet backbone, using the WASP module to give the extracted features a larger receptive field and to capture multi-scale context information of the picture;
connecting a deconvolution network after the WASP module and passing the extracted features through it to obtain the volumetric heat map H;
applying a soft-argmax function to the x and y dimensions of the volumetric heat map H to obtain the two-dimensional pose U;
setting the lower branch as the 3D network: after obtaining the volumetric heat map H, applying soft-argmax to its three dimensions to obtain the three-dimensional pose V:
extracting features with ResNet50 as the backbone and inserting the CBAM attention module before Layer1 and after Layer4 of ResNet50;
adding the WASP module after the ResNet backbone and using it to capture multi-scale context information from the extracted features;
inputting the extracted features into a deconvolution network to obtain the volumetric heat map H;
applying soft-argmax to the x, y and z dimensions of the volumetric heat map H to obtain the three-dimensional pose V.
Training the three-dimensional human pose estimation model comprises the following steps:
pre-training the 2D network: pre-training the 2D network of the three-dimensional human pose estimation model on the MPII data set so that it accurately estimates the two-dimensional human pose;
setting the training hyper-parameters: training with the ADAM optimizer, with the learning rate set to 0.001, the batch size of each training iteration set to 16, the test batch size set to 32, and the total number of training iterations set to 140 epochs;
loading the pre-trained model: transferring the model parameters pre-trained by the 2D network on the MPII data set into the 3D network;
performing data enhancement: adding salt-and-pepper noise and Gaussian noise, adjusting brightness, and applying synthetic occlusion to the Human3.6M data set;
the synthetic occlusion processing uses the Pascal VOC data set: persons and objects marked as difficult or truncated are filtered from the segmented objects extracted from Pascal VOC, and the remaining 2638 objects are pasted at random positions of the Human3.6M images with occlusion probability P_occ to synthesize randomly occluded training images, where P_occ is set to 0.5 and the degree of occlusion is between 0% and 70%;
self-supervised training of the 3D pose estimation network:
acquiring images I_i and I_{i+1} from different viewpoints of the Human3.6M data set as input and feeding them simultaneously into the 2D network and the 3D pose estimation network of the three-dimensional human pose estimation model;
the 2D network uses epipolar geometry to obtain the three-dimensional label: the 2D network yields the two-dimensional poses U_i and U_{i+1}; an epipolar-geometric transformation of these two-dimensional poses gives the three-dimensional pose in the global coordinate system, which is cached as the three-dimensional ground-truth label and denoted V_gt;
the 3D pose estimation network is trained with the three-dimensional label V_gt obtained in the previous step, realizing self-supervision of the network and producing the predicted three-dimensional pose V;
computing the loss function: the loss uses smooth L1 loss, where smooth_L1(x) is given by formula (1) below;
substituting x = V_gt - V into smooth_L1(x) and minimizing smooth_L1(V_gt - V) trains the 3D pose estimation network, where V is the predicted three-dimensional pose of the 3D pose estimation network projected into the corresponding camera space and V_gt is the three-dimensional label in the 2D network obtained from epipolar geometry,
smooth_L1(x) = 0.5 x^2, if |x| < 1; |x| - 0.5, otherwise    (1)
the method for obtaining the three-dimensional attitude estimation result of the skier comprises the following steps:
inputting a skiing motion image to be estimated into a 3D posture estimation network of a trained three-dimensional human posture estimation model to obtain skiing athlete three-dimensional posture estimation, and visually representing the posture of the skiing athlete three-dimensional posture estimation model by using a skeleton diagram;
according to the analysis of the skiing gesture, selecting key joints and bones, and calculating the spatial angle information of joint points and a trunk:
selecting a space angle of a knee joint and an elbow joint as a key joint and skeleton, wherein joint points E, F, G are a hip joint, a knee joint and an ankle joint respectively, and joint points A, B, C are a shoulder joint, an elbow joint and a wrist joint respectively;
let the three-dimensional coordinate of the hip joint E be (x)e,ye,ze) Three-dimensional sitting of the knee joint FIs marked as (x)f,yf,zf) The three-dimensional coordinate of the knee joint G is (x)g,yg,zg) The skeleton vector FE ═ xf-xe,yf-ye,zf-ze), FG=(xf-xg,yf-yg,zf-zg);
Substituting joint point coordinates output by the three-dimensional attitude estimation network into a space angle formula
Figure BDA0003537487540000042
Solving the degree of a spatial angle & lt EFG of the knee joint F;
let the coordinates of the shoulder, elbow and wrist joints be A ═ xa,ya,za)、B=(xb,yb,zb)、 C=(xc,yc,zc) Bone vector BA ═ xb-xa,yb-ya,zb-za),BC=(xb-xc,yb-yc,zb-zc) Substituting the coordinates of the joint points into the formula
Figure BDA0003537487540000043
And then solving the degree of the spatial angle & lt ABC of the shoulder joint B.
The 2D network obtains the three-dimensional label from epipolar geometry as follows:
let the two-dimensional coordinates of the j-th joint of the i-th input picture be [x_{i,j}, y_{i,j}], so that U = [x_{i,j}, y_{i,j}] is the 2D pose, and let the three-dimensional coordinates be [x_{i,j}, y_{i,j}, z_{i,j}], so that V = [x_{i,j}, y_{i,j}, z_{i,j}] is the 3D pose;
the following formulas are obtained from the pinhole image projection model:
U_{i,j}^T F U_{i+1,j} = 0    (2)
E = K^T F K    (3)
w_{i,j} [x_{i,j}, y_{i,j}, 1]^T = K (R V_{i,j} + T), where K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]    (4)
where w_{i,j} is the depth of joint j in the i-th picture with the camera as the reference frame, K is the known camera intrinsic matrix, f_x and f_y are the focal lengths, c_x and c_y are the offsets of the camera's optical axis in the image coordinate system, and R and T are the camera extrinsics, R being the rotation matrix and T the translation vector;
since camera extrinsics are not used, the first camera is assumed to lie at the origin of the coordinate system, i.e. its rotation R is the identity;
U_i and U_{i+1} are substituted into formula (2) to obtain the fundamental matrix F;
the known camera intrinsics K and the fundamental matrix F are then substituted into formula (3) to obtain the essential matrix E;
SVD is then performed on the essential matrix E, giving four possible solutions for the camera extrinsics R;
finally, all joint points of the 2D poses U_i and U_{i+1} corresponding to images I_i and I_{i+1} are substituted into epipolar formula (4), the three-dimensional coordinates of the corresponding joints are obtained by polynomial triangulation, and they are cached as the three-dimensional label V_gt.
Advantageous effects
Compared with the prior art, the self-supervision-based three-dimensional pose estimation method for skiers achieves accurate estimation of a skier's three-dimensional pose using two-dimensional data sets, without requiring three-dimensional ground-truth labels. The method estimates the athlete's three-dimensional pose from two-dimensional RGB images of skiing and, during model training, reconstructs three-dimensional labels with epipolar geometry to realize self-supervision, thereby solving the difficulty of obtaining three-dimensional ground-truth labels for data sets in three-dimensional pose estimation.
The invention also has the following advantages:
1. A three-dimensional human pose estimation network model is constructed based on ResNet50 and the WASP module, with the CBAM attention mechanism introduced; the receptive field is enlarged while fine-grained important pixels are selected, making the model's three-dimensional pose estimation more accurate;
2. Three-dimensional labels for training are constructed using epipolar geometry between two views, without any three-dimensional ground-truth labels or camera extrinsics, realizing true self-supervision;
3. The 2D network is first pre-trained on the MPII data set and its parameters are then transferred into the 3D pose estimation network; the transferred parameters both improve the accuracy of joint detection and give the model parameters better generalization;
4. The model is applied to the three-dimensional pose estimation of skiers, accurately estimating the athlete's three-dimensional pose from two-dimensional RGB images and computing the three-dimensional angle information of key joints, which facilitates analysis of the athlete's movements.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a general framework of the present invention;
FIG. 3 is the overall structure diagram of the constructed three-dimensional human pose estimation model;
FIG. 4 is a detailed schematic diagram of the implementation of the self-supervision technique;
FIG. 5 is a network architecture diagram of ResNet50 after the CBAM module has been introduced;
FIG. 6 is a general block diagram of a CBAM module;
FIG. 7 is a general block diagram of the WASP module;
FIG. 8 is a human bone topology of the present invention;
FIG. 9a is an input skiing motion image;
FIG. 9b is a three-dimensional pose graph output using the method of the present invention;
FIG. 10 is a visualization of the calculation of the ski player joint angle.
Detailed Description
So that the above-recited features of the present invention can be clearly understood, a more particular description of the invention, briefly summarized above, is given below with reference to embodiments, some of which are illustrated in the appended drawings, wherein:
the invention provides a self-supervision-based three-dimensional pose estimation method for skiers, which comprises the following steps:
firstly, acquiring a training data set: the two-dimensional training data set of the invention is constructed from the image portions of the public MPII and Human3.6M data sets and preprocessed. MPII is a public data set used to evaluate two-dimensional human pose estimation; it is annotated with 16 joints and contains approximately 25,000 annotated images. Human3.6M is the most widely used public data set for three-dimensional human pose estimation; it contains 3.6 million RGB images captured from 4 different viewpoints by a motion-capture system in an indoor environment and is annotated with 17 joints. Because the network learns directly from unlabeled data during self-supervised training and needs no three-dimensional annotations, only the image portion of the Human3.6M training set needs to be downloaded. To reduce data redundancy, key frames are extracted: one frame is kept out of every 5 in the training set, leaving 51,765 samples after redundancy removal.
Secondly, constructing the three-dimensional human pose estimation model: the model is based on ResNet50 and the WASP module, the CBAM attention module is introduced to build the three-dimensional human pose estimation network model, and three-dimensional labels are constructed with epipolar geometry to realize self-supervision. The overall structure of the constructed three-dimensional human pose estimation model is shown in Fig. 3.
The three-dimensional human pose estimation model uses ResNet50 as the backbone for feature extraction; a CBAM attention module combining spatial and channel attention is inserted before Layer1 and after Layer4 of ResNet50, and the overall structure of ResNet50 with CBAM introduced is shown in Fig. 5. The CBAM module consists of two parts, a channel attention module (CA) and a spatial attention module (SA); as shown in Fig. 6, the output of the convolutional layer first passes through the channel attention module to obtain a weighted result, which then passes through the spatial attention module to obtain the final weighted result.
The attention mechanism is analogous to human selective visual attention: its goal is to select the information most relevant to the current task from the many available signals, suppress useless information, and thereby greatly improve the efficiency and accuracy of visual processing. In human pose estimation, attention lets the network focus on keypoint information and select fine-grained important pixels, making the model's estimates more accurate. CBAM combines spatial and channel attention: it considers both the importance of each channel and the importance of pixels at different positions within the same channel, and therefore performs better than attention mechanisms that consider channels only.
The method comprises the following specific steps:
(1) The three-dimensional human pose estimation model comprises an upper branch and a lower branch: the upper branch is a 2D pose estimation network and the lower branch is a 3D pose estimation network.
(2) The upper branch is set as the 2D network: features are extracted with the backbone, the extracted features are deconvolved to obtain the volumetric heat map H, and soft-argmax is applied to two dimensions of H to obtain the two-dimensional pose U:
A1) features are extracted with ResNet50 as the backbone, and a CBAM attention module combining spatial and channel attention is inserted before Layer1 and after Layer4 of ResNet50 to select fine-grained important pixels;
A2) a Waterfall Atrous Spatial Pooling (WASP) module is added after the ResNet backbone; the WASP module gives the extracted features a larger receptive field and captures multi-scale context information of the picture;
A3) a deconvolution network is connected after the WASP module and the extracted features are passed through it to obtain the volumetric heat map H;
A4) a soft-argmax function is applied to the x and y dimensions of H to obtain the two-dimensional pose U;
A5) a three-dimensional pose is constructed using epipolar geometry and cached as the three-dimensional label, denoted V_gt;
(3) The lower branch is set as the 3D pose estimation network: after the volumetric heat map H is obtained, soft-argmax is applied to its three dimensions to obtain the three-dimensional pose V:
B1) features are extracted with ResNet50 as the backbone, and the CBAM attention module is inserted before Layer1 and after Layer4 of ResNet50;
B2) the WASP module is added after the ResNet backbone and used to capture multi-scale context information from the extracted features, enlarging the receptive field;
B3) the extracted features are input into a deconvolution network to obtain the volumetric heat map H;
B4) soft-argmax is applied to the x, y and z dimensions of H to obtain the three-dimensional pose V.
In the CBAM module introduced by the invention, the input feature map is passed through max pooling and average pooling respectively, and then through a shared MLP. The two MLP outputs are added and passed through a sigmoid to produce the channel attention feature; this channel attention is multiplied with the module's input features, and the weighted result is fed into the spatial attention module. There, max pooling and average pooling are taken along the channel dimension, a convolution reduces the result to a single channel, and a sigmoid produces the spatial attention feature; finally, the spatial attention is multiplied with the features to obtain the module's output.
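The channel-then-spatial weighting described above can be written compactly. The following is a minimal PyTorch sketch of a CBAM block; it is not the patent's code, and the reduction ratio of 16 and the 7x7 spatial kernel are common defaults assumed here.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention (shared MLP over max/avg pooled features) followed by spatial attention."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Shared MLP applied to both the max-pooled and average-pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Spatial attention: a convolution over the channel-wise max and mean maps.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # Channel attention weights, then re-weight the input feature map.
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention weights from per-pixel channel statistics.
        stats = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(stats))

# Example: attention over a Layer4-sized feature map (2048 channels assumed).
out = CBAM(2048)(torch.randn(1, 2048, 8, 8))
```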
Adding the WASP module after the ResNet backbone gives the network a larger receptive field and multi-scale features, so context information can be used effectively when predicting joint positions. The structure of WASP is shown in Fig. 7: for a given input, atrous (dilated) convolutions with different dilation rates are applied, their results are combined so the number of channels grows, and a 1x1 convolution then reduces the channels to the expected value. This is equivalent to capturing image context at multiple scales, so the network makes better use of global information rather than over-attending to a small subset of features, and important information is not ignored when making decisions. A deconvolution network is connected after the WASP module, and the extracted features are passed through it to obtain the volumetric heat map H.
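A hedged PyTorch sketch of a WASP-style block in the waterfall arrangement described above: cascaded atrous convolutions whose outputs are concatenated and fused by a 1x1 convolution. The dilation rates (6, 12, 18, 24) and channel counts are assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

class WASP(nn.Module):
    def __init__(self, in_channels, out_channels, rates=(6, 12, 18, 24)):
        super().__init__()
        # Waterfall of atrous convolutions: each branch consumes the previous branch's output.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels if i == 0 else out_channels, out_channels,
                      kernel_size=3, padding=r, dilation=r, bias=False)
            for i, r in enumerate(rates)
        ])
        # 1x1 convolution reduces the concatenated channels back to the expected value.
        self.fuse = nn.Conv2d(out_channels * len(rates), out_channels, kernel_size=1)

    def forward(self, x):
        outs, feat = [], x
        for branch in self.branches:
            feat = branch(feat)        # progressively larger receptive field
            outs.append(feat)
        return self.fuse(torch.cat(outs, dim=1))

# Example: reduce a 2048-channel backbone output to 256 channels with multi-scale context.
y = WASP(2048, 256)(torch.randn(1, 2048, 8, 8))
```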
For the 2D network (upper branch) in the model, applying soft-argmax to only the x and y dimensions of the volumetric heat map H yields the two-dimensional pose U; for the 3D network (lower branch), applying soft-argmax to all three dimensions of H yields the three-dimensional pose V. The 2D network applies an epipolar-geometric transformation to the 2D poses U_i and U_{i+1} obtained from different views to get a 3D pose, which is cached as the three-dimensional label used to train the 3D network, thereby realizing self-supervised training of the 3D network.
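The soft-argmax step used by both branches can be sketched as follows in PyTorch, assuming a volumetric heat map of shape batch x joints x depth x height x width (this layout is an assumption, not taken from the patent). The 2D branch keeps only the x and y expectations, the 3D branch keeps all three.

```python
import torch
import torch.nn.functional as F

def soft_argmax_3d(heatmaps):
    """Differentiable (x, y, z) coordinates from volumetric heat maps of shape (b, j, d, h, w)."""
    b, j, d, h, w = heatmaps.shape
    probs = F.softmax(heatmaps.reshape(b, j, -1), dim=2).reshape(b, j, d, h, w)
    # Marginal distributions along each axis, then their expectations.
    px = probs.sum(dim=(2, 3))                     # over depth and height -> (b, j, w)
    py = probs.sum(dim=(2, 4))                     # over depth and width  -> (b, j, h)
    pz = probs.sum(dim=(3, 4))                     # over height and width -> (b, j, d)
    x = (px * torch.arange(w, dtype=probs.dtype)).sum(dim=2)
    y = (py * torch.arange(h, dtype=probs.dtype)).sum(dim=2)
    z = (pz * torch.arange(d, dtype=probs.dtype)).sum(dim=2)
    return torch.stack([x, y, z], dim=2)           # (b, j, 3)

V = soft_argmax_3d(torch.randn(2, 16, 64, 64, 64))  # three-dimensional pose V
U = V[..., :2]                                       # two-dimensional pose U from the same heat map
```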
Thirdly, training the three-dimensional human pose estimation model: as shown in Fig. 2, the model is pre-trained on the two-dimensional MPII data set, then self-supervised training is performed on the Human3.6M data set, and synthetic occlusion is applied to the data during training to enhance the training images.
Pre-training serves two purposes. On one hand, it lets the model estimate the two-dimensional pose accurately, so the three-dimensional labels constructed from the two-dimensional poses are more accurate and joint detection improves. On the other hand, the MPII and Human3.6M data sets differ greatly: MPII is diverse, with many outdoor scenes and complex backgrounds, whereas Human3.6M is costly to collect and was mostly captured by a motion-capture system in a closed indoor environment with a single, simple setting. Pre-training on MPII and then performing self-supervised training on Human3.6M therefore gives the model parameters better generalization. Synthetic occlusion is also an effective data-enhancement method; applying it to the training images improves both the accuracy and the robustness of the model.
The supervision signal in self-supervised training is not manually annotated; instead, it is constructed automatically from unsupervised data by an algorithm and then used for supervised training. With self-supervised learning, 3D human pose estimation can be achieved using only 2D data, solving the problem that labeled three-dimensional data is hard to obtain. The specific realization of the self-supervision technique of the invention is shown in Fig. 4: three-dimensional labels are constructed from the 2D poses estimated by the 2D network together with epipolar geometry, and are used to supervise the training of the 3D network.
The method comprises the following specific steps:
(1) Pre-training the 2D network: the 2D network of the three-dimensional human pose estimation model is pre-trained on the MPII data set so that it can accurately estimate the two-dimensional human pose.
(2) Setting the hyper-parameters of training:
in the experiments, a Linux system was used and training was performed on an NVIDIA 2070 GPU. The network models are built in PyTorch and trained with the ADAM optimizer; the learning rate is set to 0.001, the batch size is 16 per training iteration and 32 at test time, and training runs for 140 epochs in total.
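A sketch of the stated training configuration (ADAM, learning rate 0.001, train batch size 16, test batch size 32, 140 epochs). The model, data and loss below are placeholders, not the patent's actual network or data sets.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(10, 3)                                   # placeholder for the pose network
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)       # ADAM optimizer, lr = 0.001

train_set = TensorDataset(torch.randn(64, 10), torch.randn(64, 3))   # dummy data
test_set = TensorDataset(torch.randn(32, 10), torch.randn(32, 3))
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)    # training batch size 16
test_loader = DataLoader(test_set, batch_size=32)                    # test batch size 32

for epoch in range(140):                                         # 140 epochs in total
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.smooth_l1_loss(model(x), y)
        loss.backward()
        optimizer.step()
```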
(3) Loading the pre-trained model: the parameters of the 2D network pre-trained on the MPII data set are transferred to the 3D pose estimation network.
(4) Performing data enhancement: salt-and-pepper noise and Gaussian noise are added, brightness is adjusted, and synthetic occlusion is applied to the Human3.6M data set;
data enhancement is an effective learning strategy. Besides adding salt-and-pepper noise and Gaussian noise and adjusting brightness, synthetic occlusion is applied to the Human3.6M data set to enhance the training images. The synthetic occlusion processing uses the Pascal VOC data set: persons and objects marked as difficult or truncated are filtered out of the segmented objects, and the remaining 2638 objects are pasted at random positions of the Human3.6M images with occlusion probability P_occ, where P_occ is set to 0.5 and the degree of occlusion is between 0% and 70%.
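A hedged sketch of the synthetic-occlusion step: with probability P_occ = 0.5, a randomly chosen object crop is pasted at a random position of the training image. The occluders argument is assumed to hold RGBA crops already extracted from Pascal VOC and filtered as described above.

```python
import random
import numpy as np

def synthetic_occlusion(image, occluders, p_occ=0.5):
    """image: (H, W, 3) uint8 array; occluders: list of (h, w, 4) RGBA object crops."""
    if random.random() > p_occ or not occluders:
        return image
    occ = random.choice(occluders)
    h, w = occ.shape[:2]
    H, W = image.shape[:2]
    if h >= H or w >= W:
        return image                                  # skip crops larger than the image
    y, x = random.randint(0, H - h), random.randint(0, W - w)
    alpha = occ[..., 3:4].astype(np.float32) / 255.0  # object mask
    region = image[y:y + h, x:x + w].astype(np.float32)
    blended = alpha * occ[..., :3].astype(np.float32) + (1.0 - alpha) * region
    image[y:y + h, x:x + w] = blended.astype(np.uint8)
    return image
```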
(5) Self-supervised training of the 3D pose estimation network:
C1) images I_i and I_{i+1} from different viewpoints are taken from the Human3.6M data set as input and fed simultaneously into the 2D network and the 3D pose estimation network of the three-dimensional human pose estimation model;
C2) the 2D network uses epipolar geometry to obtain the three-dimensional label: the 2D network yields the two-dimensional poses U_i and U_{i+1}; an epipolar-geometric transformation of these two-dimensional poses gives the three-dimensional pose in the global coordinate system, which is cached as the three-dimensional ground-truth label and denoted V_gt;
The 2D network obtains the three-dimensional label from epipolar geometry as follows:
C21) let the two-dimensional coordinates of the j-th joint of the i-th input picture be [x_{i,j}, y_{i,j}], so that U = [x_{i,j}, y_{i,j}] is the 2D pose, and let the three-dimensional coordinates be [x_{i,j}, y_{i,j}, z_{i,j}], so that V = [x_{i,j}, y_{i,j}, z_{i,j}] is the 3D pose;
the following formulas are obtained from the pinhole image projection model:
U_{i,j}^T F U_{i+1,j} = 0    (2)
E = K^T F K    (3)
w_{i,j} [x_{i,j}, y_{i,j}, 1]^T = K (R V_{i,j} + T), where K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]    (4)
where w_{i,j} is the depth of joint j in the i-th picture with the camera as the reference frame, K is the known camera intrinsic matrix, f_x and f_y are the focal lengths, c_x and c_y are the offsets of the camera's optical axis in the image coordinate system, and R and T are the camera extrinsics, R being the rotation matrix and T the translation vector;
C22) since camera extrinsics are not used, the first camera is assumed to lie at the origin of the coordinate system, i.e. its rotation R is the identity;
C23) U_i and U_{i+1} are substituted into formula (2) to obtain the fundamental matrix F;
C24) the known camera intrinsics K and the fundamental matrix F are then substituted into formula (3) to obtain the essential matrix E;
C25) SVD is then performed on the essential matrix E, giving four possible solutions for the camera extrinsics R;
C26) finally, all joint points of the 2D poses U_i and U_{i+1} corresponding to images I_i and I_{i+1} are substituted into epipolar formula (4), the three-dimensional coordinates of the corresponding joints are obtained by polynomial triangulation, and they are cached as the three-dimensional label V_gt.
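The label-construction steps C23)-C26) can be sketched with OpenCV as below, under stated assumptions: U_i and U_next are (J, 2) arrays of joint coordinates, K is the known 3x3 intrinsic matrix, and cv2.triangulatePoints performs linear (DLT) triangulation rather than the polynomial triangulation named in the patent, so this is only an approximation of that step.

```python
import cv2
import numpy as np

def build_3d_label(U_i, U_next, K):
    U_i = np.asarray(U_i, dtype=np.float64)
    U_next = np.asarray(U_next, dtype=np.float64)
    # Formula (2): U_i^T F U_{i+1} = 0 -> estimate the fundamental matrix F.
    F, _ = cv2.findFundamentalMat(U_i, U_next, cv2.FM_8POINT)
    # Formula (3): essential matrix E = K^T F K.
    E = K.T @ F @ K
    # Pick the physically valid (R, t) among the four decompositions of E (cheirality check).
    _, R, t, _ = cv2.recoverPose(E, U_i, U_next, K)
    # First camera at the origin with identity rotation, second camera at (R, t).
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    X_h = cv2.triangulatePoints(P1, P2, U_i.T, U_next.T)   # 4 x J homogeneous points
    V_gt = (X_h[:3] / X_h[3]).T                            # (J, 3) cached three-dimensional label
    return V_gt
```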
C3) The 3D pose estimation network is trained with the three-dimensional labels obtained by the 2D network, realizing self-supervision of the network and producing the predicted three-dimensional pose V.
(6) Computing the loss function: the loss uses smooth L1 loss, where smooth_L1(x) is given by formula (1) below;
substituting x = V_gt - V into smooth_L1(x) and minimizing smooth_L1(V_gt - V) trains the lower-branch (3D pose estimation) network, where V is the predicted three-dimensional pose of the lower branch projected into the corresponding camera space and V_gt is the three-dimensional label of the upper branch obtained from epipolar geometry,
smooth_L1(x) = 0.5 x^2, if |x| < 1; |x| - 0.5, otherwise    (1)
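A minimal PyTorch sketch of this loss step; torch.nn.SmoothL1Loss with its default settings matches formula (1), and the tensor shapes below are illustrative only.

```python
import torch
import torch.nn as nn

criterion = nn.SmoothL1Loss()                              # formula (1)
V = torch.randn(16, 16, 3, requires_grad=True)             # predicted 3D pose (batch, joints, xyz)
V_gt = torch.randn(16, 16, 3)                              # cached epipolar label
loss = criterion(V, V_gt)                                  # smooth_L1(V_gt - V), averaged
loss.backward()                                            # gradients flow only into the 3D branch
```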
and step four, acquiring the skiing motion image to be estimated: video captured by a high-speed camera at a ski field is acquired and extracted frame by frame into pictures, which serve as the skiing motion images to be estimated.
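A small OpenCV sketch of the frame-by-frame extraction described above; the file names and the sampling step are assumptions (step=1 saves every frame, step=5 would reproduce the 1-in-5 sampling used for the training set).

```python
import cv2

def extract_frames(video_path, out_pattern="frame_{:06d}.jpg", step=1):
    """Save every `step`-th frame of the captured skiing video as an image file."""
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(out_pattern.format(saved), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```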
Although a multi-view data set is required as input during self-supervised training, at application time only a monocular skiing motion image needs to be input to obtain the athlete's three-dimensional pose, which greatly reduces the difficulty of acquiring skiing motion images and increases the practical value of the invention.
And fifthly, obtaining the skier's three-dimensional pose estimation result: the skiing motion image to be estimated is input into the 3D pose estimation network (lower branch) of the trained three-dimensional human pose estimation model to obtain the skier's three-dimensional pose estimation result, and the spatial angles of the key joints are computed. The specific steps are as follows:
(1) The skiing motion image to be estimated is input into the 3D pose estimation network of the trained three-dimensional human pose estimation model to obtain the skier's three-dimensional pose estimate, which is visualized with a skeleton diagram, as shown in Fig. 9: Fig. 9a is the input skiing motion image and Fig. 9b is the obtained three-dimensional pose.
Because the model is pre-trained on the MPII data set and then self-supervised on the Human3.6M data set, and the MPII skeleton topology connects 16 joint points in a fixed pattern while the Human3.6M topology has 17 joint points, the skeleton topology finally output by the network model has 16 joint points, as shown in Fig. 8.
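A hedged matplotlib sketch for drawing the 16-joint skeleton as a 3D line plot. The joint ordering and bone pairs below follow the standard MPII convention and are assumptions about the model's output order, not values taken from the patent.

```python
import numpy as np
import matplotlib.pyplot as plt

# (parent, child) bone pairs for a standard 16-joint MPII-style skeleton (assumed ordering).
BONES = [(0, 1), (1, 2), (2, 6), (3, 6), (3, 4), (4, 5),            # legs and pelvis
         (6, 7), (7, 8), (8, 9),                                     # spine and head
         (10, 11), (11, 12), (12, 7), (7, 13), (13, 14), (14, 15)]   # arms and shoulders

def plot_skeleton(pose_3d):
    """pose_3d: (16, 3) array of joint coordinates from the 3D branch."""
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    for a, b in BONES:
        xs, ys, zs = zip(pose_3d[a], pose_3d[b])
        ax.plot(xs, ys, zs, marker="o")
    plt.show()

plot_skeleton(np.random.rand(16, 3))   # illustrative input only
```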
(2) According to the analysis of the skiing pose, key joints and bones are selected, and the spatial angle information of the joint points and trunk is computed:
In skiing, the standardization of the athlete's movements, especially spatial angles such as shoulder-elbow-wrist and hip-knee-ankle, has a significant impact on the final performance. The spatial angles of the knee and elbow joints are therefore selected as the key joints and bones. As shown in Fig. 10, joint points E, F, G are the hip, knee and ankle respectively, and joint points A, B, C are the shoulder, elbow and wrist respectively.
Taking the knee angle as an example, let the three-dimensional coordinates of the hip E be (x_e, y_e, z_e), of the knee F be (x_f, y_f, z_f) and of the ankle G be (x_g, y_g, z_g); the bone vectors are FE = (x_f - x_e, y_f - y_e, z_f - z_e) and FG = (x_f - x_g, y_f - y_g, z_f - z_g). The joint coordinates output by the three-dimensional pose estimation network are substituted into the spatial angle formula
∠EFG = arccos( (FE · FG) / (|FE| |FG|) )
to obtain the degree of the spatial angle ∠EFG at the knee joint F, as shown in Fig. 10.
Similarly, the spatial angle at the elbow joint B in Fig. 10 can be obtained. Let the coordinates of the shoulder, elbow and wrist be A = (x_a, y_a, z_a), B = (x_b, y_b, z_b) and C = (x_c, y_c, z_c); the bone vectors are BA = (x_b - x_a, y_b - y_a, z_b - z_a) and BC = (x_b - x_c, y_b - y_c, z_b - z_c). Substituting the joint coordinates into the formula
∠ABC = arccos( (BA · BC) / (|BA| |BC|) )
then gives the degree of the spatial angle ∠ABC at the elbow joint B.
With these formulas, the specific degrees of the spatial angles of the skier's key joints, such as the knees and elbows, can be obtained. Compared with judging the body pose and joint angles by eye from Fig. 9b, the exact degrees are more intuitive and allow the skier's pose to be analyzed better.
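A small NumPy sketch of the spatial-angle computation in the two formulas above: the knee angle ∠EFG from hip E, knee F and ankle G, and likewise the elbow angle ∠ABC from shoulder A, elbow B and wrist C. The coordinates in the example are illustrative, not real network output.

```python
import numpy as np

def joint_angle(center, p1, p2):
    """Angle in degrees at `center` between the vectors center->p1 and center->p2."""
    v1 = np.asarray(p1, dtype=float) - np.asarray(center, dtype=float)
    v2 = np.asarray(p2, dtype=float) - np.asarray(center, dtype=float)
    cos_a = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0))))

# Knee angle: center F (knee), endpoints E (hip) and G (ankle); illustrative coordinates.
knee = joint_angle(center=[0.10, 0.45, 0.05], p1=[0.12, 0.90, 0.00], p2=[0.08, 0.05, 0.15])
```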
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are merely illustrative of the principles of the invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (5)

1. A self-supervision-based three-dimensional pose estimation method for skiers, characterized by comprising the following steps:
11) acquiring a training data set: constructing a two-dimensional training data set from the image portions of the public MPII and Human3.6M data sets, and preprocessing it;
12) constructing a three-dimensional human pose estimation model: the model is based on ResNet50 and the WASP module, introduces the CBAM attention mechanism, and constructs three-dimensional labels using epipolar geometry to realize self-supervision;
13) training the three-dimensional human pose estimation model: pre-training the model on the MPII data set, then performing self-supervised training with the Human3.6M data set, and applying synthetic occlusion to the data during training;
14) acquiring the skiing motion image to be estimated: acquiring video captured by a high-speed camera at a ski field and extracting it frame by frame into pictures, which serve as the skiing motion images to be estimated;
15) obtaining the skier's three-dimensional pose estimation result: inputting the skiing motion image to be estimated into the 3D pose estimation network of the trained three-dimensional human pose estimation model to obtain the skier's three-dimensional pose estimation result, and computing the spatial angles of the key joints.
2. The self-supervision-based three-dimensional pose estimation method for skiers according to claim 1, wherein constructing the three-dimensional human pose estimation model comprises the following steps:
21) setting the three-dimensional human pose estimation model to comprise an upper branch and a lower branch, where the upper branch is a 2D pose estimation network and the lower branch is a 3D pose estimation network;
22) setting the upper branch as the 2D network: extracting features with a backbone network, obtaining a volumetric heat map H after a deconvolution operation, and applying soft-argmax to two dimensions of H to obtain the two-dimensional pose U:
221) extracting features with ResNet50 as the backbone and inserting a CBAM attention module, combining spatial and channel attention, before Layer1 and after Layer4 of ResNet50;
222) adding a waterfall atrous spatial pooling (WASP) module after the ResNet backbone, using the WASP module to give the extracted features a larger receptive field and to capture multi-scale context information of the picture;
223) connecting a deconvolution network after the WASP module and passing the extracted features through it to obtain the volumetric heat map H;
224) applying a soft-argmax function to the x and y dimensions of the volumetric heat map H to obtain the two-dimensional pose U;
23) setting the lower branch as the 3D network: after obtaining the volumetric heat map H, applying the soft-argmax function to its three dimensions to obtain the three-dimensional pose V:
231) extracting features with ResNet50 as the backbone and inserting the CBAM attention module before Layer1 and after Layer4 of ResNet50;
232) adding the WASP module after the ResNet backbone and using it to capture multi-scale context information from the extracted features;
233) inputting the extracted features into a deconvolution network to obtain the volumetric heat map H;
234) applying the soft-argmax function to the x, y and z dimensions of the volumetric heat map H to obtain the three-dimensional pose V.
3. The self-supervision-based three-dimensional pose estimation method for skiers as claimed in claim 1, wherein training the three-dimensional human pose estimation model comprises the following steps:
31) pre-training the 2D network: pre-training the 2D network of the three-dimensional human pose estimation model on the MPII data set so that it accurately estimates the two-dimensional human pose;
32) setting the training hyper-parameters: training with the ADAM optimizer, with the learning rate set to 0.001, the batch size of each training iteration set to 16, the test batch size set to 32, and the total number of training iterations set to 140 epochs;
33) loading the pre-trained model: transferring the model parameters pre-trained by the 2D network on the MPII data set into the 3D network;
34) performing data enhancement: adding salt-and-pepper noise and Gaussian noise, adjusting brightness, and applying synthetic occlusion to the Human3.6M data set;
the synthetic occlusion processing uses the Pascal VOC data set: persons and objects marked as difficult or truncated are filtered from the segmented objects extracted from Pascal VOC, and the remaining 2638 objects are pasted at random positions of the Human3.6M images with occlusion probability P_occ to synthesize randomly occluded training images, where P_occ is set to 0.5 and the degree of occlusion is between 0% and 70%;
35) self-supervised training of the 3D pose estimation network:
351) acquiring images I_i and I_{i+1} from different viewpoints of the Human3.6M data set as input and feeding them simultaneously into the 2D network and the 3D pose estimation network of the three-dimensional human pose estimation model;
352) the 2D network uses epipolar geometry to obtain the three-dimensional label: the 2D network yields the two-dimensional poses U_i and U_{i+1}; an epipolar-geometric transformation of these two-dimensional poses gives the three-dimensional pose in the global coordinate system, which is cached as the three-dimensional ground-truth label and denoted V_gt;
353) the 3D pose estimation network is trained with the three-dimensional label V_gt obtained in the previous step, realizing self-supervision of the network and producing the predicted three-dimensional pose V;
36) computing the loss function: the loss uses smooth L1 loss, where smooth_L1(x) is given by formula (1) below;
substituting x = V_gt - V into smooth_L1(x) and minimizing smooth_L1(V_gt - V) trains the 3D pose estimation network, where V is the predicted three-dimensional pose of the 3D pose estimation network projected into the corresponding camera space and V_gt is the three-dimensional label in the 2D network obtained from epipolar geometry,
smooth_L1(x) = 0.5 x^2, if |x| < 1; |x| - 0.5, otherwise    (1)
4. The self-supervision-based three-dimensional pose estimation method for skiers according to claim 1, characterized in that obtaining the skier's three-dimensional pose estimation result comprises the following steps:
41) inputting the skiing motion image to be estimated into the 3D pose estimation network of the trained three-dimensional human pose estimation model to obtain the skier's three-dimensional pose estimate, and visualizing the pose with a skeleton diagram;
42) according to the analysis of the skiing pose, selecting key joints and bones and computing the spatial angle information of the joint points and trunk:
selecting the spatial angles of the knee and elbow joints as the key joints and bones, where joint points E, F, G are the hip, knee and ankle respectively, and joint points A, B, C are the shoulder, elbow and wrist respectively;
letting the three-dimensional coordinates of the hip E be (x_e, y_e, z_e), of the knee F be (x_f, y_f, z_f) and of the ankle G be (x_g, y_g, z_g), with bone vectors FE = (x_f - x_e, y_f - y_e, z_f - z_e) and FG = (x_f - x_g, y_f - y_g, z_f - z_g);
substituting the joint coordinates output by the three-dimensional pose estimation network into the spatial angle formula
∠EFG = arccos( (FE · FG) / (|FE| |FG|) )
to obtain the degree of the spatial angle ∠EFG at the knee joint F;
letting the coordinates of the shoulder, elbow and wrist be A = (x_a, y_a, z_a), B = (x_b, y_b, z_b) and C = (x_c, y_c, z_c), with bone vectors BA = (x_b - x_a, y_b - y_a, z_b - z_a) and BC = (x_b - x_c, y_b - y_c, z_b - z_c), and substituting the joint coordinates into the formula
∠ABC = arccos( (BA · BC) / (|BA| |BC|) )
to obtain the degree of the spatial angle ∠ABC at the elbow joint B.
5. The self-supervision-based three-dimensional pose estimation method for skiers as claimed in claim 3, wherein the 2D network obtaining the three-dimensional label from epipolar geometry comprises the following steps:
51) letting the two-dimensional coordinates of the j-th joint of the i-th input picture be [x_{i,j}, y_{i,j}], so that U = [x_{i,j}, y_{i,j}] is the 2D pose, and the three-dimensional coordinates be [x_{i,j}, y_{i,j}, z_{i,j}], so that V = [x_{i,j}, y_{i,j}, z_{i,j}] is the 3D pose;
the following formulas are obtained from the pinhole image projection model:
U_{i,j}^T F U_{i+1,j} = 0    (2)
E = K^T F K    (3)
w_{i,j} [x_{i,j}, y_{i,j}, 1]^T = K (R V_{i,j} + T), where K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]    (4)
where w_{i,j} is the depth of joint j in the i-th picture with the camera as the reference frame, K is the known camera intrinsic matrix, f_x and f_y are the focal lengths, c_x and c_y are the offsets of the camera's optical axis in the image coordinate system, and R and T are the camera extrinsics, R being the rotation matrix and T the translation vector;
52) since camera extrinsics are not used, assuming the first camera lies at the origin of the coordinate system, i.e. its rotation R is the identity;
53) substituting U_i and U_{i+1} into formula (2) to obtain the fundamental matrix F;
54) then substituting the known camera intrinsics K and the fundamental matrix F into formula (3) to obtain the essential matrix E;
55) then performing SVD on the essential matrix E to obtain four possible solutions for the camera extrinsics R;
56) finally, substituting all joint points of the 2D poses U_i and U_{i+1} corresponding to images I_i and I_{i+1} into epipolar formula (4), obtaining the three-dimensional coordinates of the corresponding joints by polynomial triangulation, and caching them as the three-dimensional label V_gt.
CN202210229185.8A 2022-03-09 2022-03-09 Self-supervision technology-based three-dimensional attitude estimation method for skiers Pending CN114611600A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210229185.8A CN114611600A (en) 2022-03-09 2022-03-09 Self-supervision technology-based three-dimensional attitude estimation method for skiers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210229185.8A CN114611600A (en) 2022-03-09 2022-03-09 Self-supervision technology-based three-dimensional attitude estimation method for skiers

Publications (1)

Publication Number Publication Date
CN114611600A true CN114611600A (en) 2022-06-10

Family

ID=81861561

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210229185.8A Pending CN114611600A (en) 2022-03-09 2022-03-09 Self-supervision technology-based three-dimensional attitude estimation method for skiers

Country Status (1)

Country Link
CN (1) CN114611600A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115187982A (en) * 2022-07-12 2022-10-14 河北华清环境科技集团股份有限公司 Algae detection method and device and terminal equipment
CN115187982B (en) * 2022-07-12 2023-05-23 河北华清环境科技集团股份有限公司 Algae detection method and device and terminal equipment
CN116797879A (en) * 2023-06-28 2023-09-22 脉得智能科技(无锡)有限公司 Thyroid cancer metastasis lymph node prediction model construction method, system, equipment and medium
CN117275092A (en) * 2023-10-09 2023-12-22 奥雪文化传播(北京)有限公司 Intelligent skiing action evaluation method, system, equipment and medium

Similar Documents

Publication Publication Date Title
Luo et al. 3d human motion estimation via motion compression and refinement
Jalal et al. Human body parts estimation and detection for physical sports movements
Rogez et al. Lcr-net++: Multi-person 2d and 3d pose detection in natural images
Zimmermann et al. Freihand: A dataset for markerless capture of hand pose and shape from single rgb images
CN109636831B (en) Method for estimating three-dimensional human body posture and hand information
Gamra et al. A review of deep learning techniques for 2D and 3D human pose estimation
Zhou et al. Towards 3d human pose estimation in the wild: a weakly-supervised approach
Oberweger et al. Deepprior++: Improving fast and accurate 3d hand pose estimation
CN114611600A (en) Self-supervision technology-based three-dimensional attitude estimation method for skiers
Gu et al. Multi-person hierarchical 3d pose estimation in natural videos
CN110660017A (en) Dance music recording and demonstrating method based on three-dimensional gesture recognition
Abobakr et al. Body joints regression using deep convolutional neural networks
Wang et al. Adversarial learning for joint optimization of depth and ego-motion
CN112926475A (en) Human body three-dimensional key point extraction method
Chang et al. 2d–3d pose consistency-based conditional random fields for 3d human pose estimation
Zou et al. Automatic reconstruction of 3D human motion pose from uncalibrated monocular video sequences based on markerless human motion tracking
CN114519727A (en) Image driving method, device, equipment and medium
Nguyen et al. Combined YOLOv5 and HRNet for high accuracy 2D keypoint and human pose estimation
Huynh-The et al. Learning action images using deep convolutional neural networks for 3D action recognition
Lin et al. Overview of 3d human pose estimation
Liu et al. Key algorithm for human motion recognition in virtual reality video sequences based on hidden markov model
CN114548224A (en) 2D human body pose generation method and device for strong interaction human body motion
Jiang et al. EvHandPose: Event-based 3D Hand Pose Estimation with Sparse Supervision
Zhao et al. Temporally refined graph u-nets for human shape and pose estimation from monocular videos
Sun et al. SimpleMeshNet: end to end recovery of 3d body mesh with one fully connected layer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination