CN112766153A - Three-dimensional human body posture estimation method and system based on deep learning - Google Patents


Info

Publication number
CN112766153A
Authority
CN
China
Prior art keywords
dimensional
transformation
joint
minimum unit
posture
Prior art date
Legal status
Granted
Application number
CN202110067988.3A
Other languages
Chinese (zh)
Other versions
CN112766153B (en)
Inventor
刘晓平
王冬
谢文军
蔡有城
沈子祺
Current Assignee
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date
Filing date
Publication date
Application filed by Hefei University of Technology
Priority to CN202110067988.3A
Publication of CN112766153A
Application granted
Publication of CN112766153B
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 — Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 — Movements or behaviour, e.g. gesture recognition
    • G06V40/23 — Recognition of whole body movements, e.g. for sport training
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/08 — Learning methods


Abstract

The invention discloses a three-dimensional human body posture estimation method and system based on deep learning. An image acquisition module acquires an image, and a two-dimensional joint extraction module extracts two-dimensional joints from the acquired image; a joint point transformation module performs joint point transformation on the two-dimensional joints acquired by the two-dimensional joint extraction module; and a three-dimensional joint extraction module and a three-dimensional joint pre-training module perform combined deep learning training on the transformed two-dimensional joints and extract the three-dimensional human body posture. The transformation parameters are learned automatically, which better suits the transformation process of the two-dimensional posture, and by constraining the transformation process the coordinate points of the two-dimensional posture are transformed directly and adaptively, preventing excessive error in the deep learning process.

Description

Three-dimensional human body posture estimation method and system based on deep learning
Technical Field
The invention relates to the technical field of three-dimensional human body postures, in particular to a three-dimensional human body posture estimation method and system based on deep learning.
Background
The three-dimensional human body posture is an important topic in computer vision. Human body posture research is generally divided into three fields: extraction from an image to a two-dimensional posture, extraction from an image to a three-dimensional posture, and lifting from a two-dimensional posture to a three-dimensional posture. Because the accuracy of extracting the three-dimensional posture directly from an image is poor, the research of this application is based on extraction from an image to a two-dimensional posture followed by lifting from the two-dimensional posture to the three-dimensional posture.
In current computer vision, data augmentation is applied almost exclusively at the image level. For example, the article "Adversarial Semantic Data Augmentation for Human Pose Estimation", published in 2020, adopts a semantic augmentation method to augment the original image data so that the trained network is robust. In human body posture research, work from 2018 showed that data can be augmented through image training: the article "Adversarial Data Augmentation in Human Pose Estimation" clearly explains GAN-based augmentation of images, so that the original image data can be learned under rotation, scaling and occlusion, further improving the network.
However, existing two-dimensional human body posture extraction takes the image as its research object, and the above schemes can only augment images. Because the accuracy of obtaining the three-dimensional human body posture directly from an image cannot be guaranteed, and the data used for lifting a two-dimensional posture to a three-dimensional posture cannot be produced by operating on the image directly, the prior art has no good data augmentation scheme for estimating the human body posture from a two-dimensional posture to a three-dimensional posture, and this research remains blank.
In 2020, one study gained attention: the paper "Cascaded Deep Monocular 3D Human Pose Estimation with Evolutionary Training Data", published at CVPR 2020, gives a transformation process for two-dimensional body posture data. As shown in FIG. 8 of the accompanying drawings of the specification, it constructs new body posture data by transforming joint points, indicating a research direction for the two-dimensional body posture. However, that research is a direct mathematical operation: the original data set is transformed mathematically in a centralized manner, and the original two-dimensional human body posture undergoes a fixed transformation through the formulas given in the paper to obtain a transformed human body posture. It therefore cannot provide adaptive network parameters for learning the two-dimensional human body posture, nor provide a corresponding three-dimensional human body posture as a supervision signal for a new network to learn.
Therefore, based on the characteristics of the two-dimensional human body posture and in combination with image data augmentation schemes, the method and system for estimating the three-dimensional human body posture based on deep learning are provided.
Disclosure of Invention
Aiming at the above problems, the invention provides a three-dimensional human body posture estimation method and system based on deep learning, which can automatically learn transformation parameters and better suit the two-dimensional posture transformation process. By constraining the transformation process, the coordinate points of the two-dimensional posture can be transformed directly and adaptively, preventing excessive error in the deep learning process and effectively solving the problems described in the background.
In order to achieve the purpose, the invention provides the following technical scheme:
a three-dimensional human body posture estimation method based on deep learning comprises an image acquisition module, a two-dimensional joint extraction module and a three-dimensional body posture estimation module, wherein the image acquisition module acquires an image and performs two-dimensional joint extraction on the acquired image to obtain a two-dimensional joint;
performing joint point transformation on the two-dimensional joint acquired by the two-dimensional joint extraction module by using a joint point transformation module;
and performing combined deep learning training on the two-dimensional joint after the joint point transformation is performed by the joint point transformation module by using the three-dimensional joint extraction module and the three-dimensional joint pre-training module, and extracting the three-dimensional human body posture.
As a preferred embodiment of the present invention, the performing joint point transformation on the two-dimensional joint acquired by the two-dimensional joint extraction module by using the joint point transformation module includes
The joint adaptive transformation unit executing the two-dimensional joint points carries out joint point transformation on the two-dimensional joint acquired by the two-dimensional joint extraction module to obtain a minimum unit after two-dimensional transformation after joint point transformation;
updating the minimum unit after the two-dimensional transformation by using a two-dimensional joint updating module to obtain an updated minimum unit after the two-dimensional transformation and obtain a two-dimensional transformation joint point;
predicting by using a three-dimensional joint pre-training module according to the two-dimensional transformation joint points to obtain a minimum unit after three-dimensional posture transformation;
carrying out posture adjustment on the three-dimensional posture ground truth before transformation and the minimum unit after the three-dimensional posture transformation to obtain a three-dimensional transformation joint point;
and inputting the two-dimensional transformation joint points into a three-dimensional joint extraction module, performing Loss supervision by using the three-dimensional transformation joint points, and recording as Loss1 to obtain the three-dimensional human body posture.
As a preferred embodiment of the present invention, the joint adaptive transformation unit for executing the two-dimensional joint point transforms the joint point of the two-dimensional joint acquired by the two-dimensional joint extraction module, and the minimum unit after the two-dimensional transformation for obtaining the joint point transform includes
Acquiring the number of two-dimensional joint points and the positions of corresponding joint points;
synthesizing the number of the two-dimensional joint points and the positions of the corresponding joint points into a minimum unit before two-dimensional transformation, and performing rotation and scale transformation learning on the minimum unit before two-dimensional transformation to obtain a joint adaptability transformation unit;
inputting the minimum unit before two-dimensional transformation into the joint adaptive transformation unit to obtain a minimum unit after two-dimensional transformation;
and performing Loss supervision on the minimum unit after the two-dimensional transformation by using the minimum unit before the two-dimensional transformation, recorded as Loss2d.
As a preferable aspect of the present invention, the step of inputting the minimum unit before the two-dimensional transformation into the joint adaptive transformation unit to obtain the minimum unit after the two-dimensional transformation includes
Combining the number of the two-dimensional joint points and the positions of the corresponding joint points, and performing convolution and linear transformation on the two-dimensional joint points and the positions of the corresponding joint points to obtain the number of control parameters of the joint adaptive transformation unit;
the control parameters of the joint adaptability transformation unit are respectively used as learning parameters of rotation and scale transformation;
and performing position conversion on the learned rotation and scale and the coordinate of the two-dimensional joint of the minimum unit before two-dimensional transformation to obtain the minimum unit after two-dimensional transformation.
As a preferred technical solution of the present invention, the updating the minimum unit after the two-dimensional transformation by using the two-dimensional joint updating module to obtain the updated minimum unit after the two-dimensional transformation, and obtaining the two-dimensional transformation joint point includes:
acquiring a minimum unit after two-dimensional transformation;
according to the image frame number obtained by the image obtaining module, random frame number mixing is carried out on the minimum unit after two-dimensional transformation and the minimum unit before two-dimensional transformation, and the minimum unit after two-dimensional transformation after frame number mixing is obtained;
Loss supervision is carried out between the minimum unit before two-dimensional transformation and the frame-mixed minimum unit after two-dimensional transformation, recorded as Loss2, to obtain the updated minimum unit after two-dimensional transformation;
and disassembling the number of the two-dimensional joint points of the updated two-dimensional transformed minimum unit and the positions of the corresponding joint points to obtain the two-dimensional transformed joint points.
As a preferred technical solution of the present invention, performing posture adjustment on the three-dimensional posture ground truth before transformation and the minimum unit after the three-dimensional posture transformation to obtain the three-dimensional transformation joint point includes
Inputting the two-dimensional transformation joint points into a three-dimensional joint pre-training module to obtain a minimum unit after three-dimensional posture transformation;
computing the Loss between the minimum unit after the three-dimensional posture transformation and the three-dimensional posture ground truth before transformation, recorded as Loss3;
and obtaining the three-dimensional transformation joint points.
As a preferred technical solution of the present invention, the joint adaptability transformation unit includes a convolution prediction layer, a linear alignment layer, and a parameter merging layer;
the convolution prediction layer performs convolution operation on the number of the two-dimensional joint points and the positions of the corresponding joint points to acquire space position correlation information of the two-dimensional joint points; the linear alignment layer carries out linear regular output on the spatial position correlation information of the two-dimensional joint points, obtains the alignment of the number of the two-dimensional joints and the number of the prediction parameters, and obtains the learning parameters of rotation and scale; and the parameter merging layer constructs the rotation angle and the scale characteristic of the learning parameters of the two-dimensional joint point rotation and scale to obtain the joint adaptive transformation unit.
As a preferred technical solution of the present invention, there are 3 learning parameters for two-dimensional joint point rotation and scale, of which 2 parameters construct the rotation angle feature and 1 parameter constructs the scale feature. The size of the constructed joint adaptability transformation unit is consistent with the batch dimension and frame-number dimension of the input image, and the minimum unit after two-dimensional transformation is obtained by performing position transformation on the coordinates of the two-dimensional joints of the minimum unit before two-dimensional transformation.
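One way the 3 predicted parameters could yield a rotation matrix and a scale is sketched below. Normalizing the 2 rotation parameters to a unit vector interpreted as (cos θ, sin θ), and exponentiating the scale parameter to keep it positive, are our assumptions; the description only states that 2 parameters build the rotation angle feature and 1 builds the scale feature.

```python
# Sketch: mapping 3 raw network outputs to a valid rotation matrix and scale.
import numpy as np

def params_to_rotation_scale(p):
    """p : (3,) raw network outputs -> (2x2 rotation matrix, positive scale)."""
    v = p[:2] / (np.linalg.norm(p[:2]) + 1e-8)   # unit vector -> (cos t, sin t)
    c, s = v
    rot = np.array([[c, -s], [s, c]])            # proper rotation, det = 1
    scale = np.exp(p[2])                         # exponential keeps scale > 0
    return rot, scale

rot, scale = params_to_rotation_scale(np.array([3.0, 4.0, 0.0]))
```

Constructing the matrix from normalized outputs guarantees a proper rotation regardless of the raw parameter magnitudes, which keeps the constrained transformation from drifting far from the initial two-dimensional joints.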
As a preferred technical solution of the present invention, the images acquired by the image acquisition module are continuous frames of a video sequence.
The invention also provides a system comprising
The image acquisition module is used for acquiring continuous frame images of the video sequence;
the two-dimensional joint extraction module is used for extracting two-dimensional joint points in the continuous frame images;
the joint point transformation module comprises a joint adaptive transformation unit and a two-dimensional joint point updating unit;
synthesizing the number of two-dimensional joint points and the positions of corresponding joint points in the continuous frame images into a minimum unit before two-dimensional transformation by the two-dimensional joint extraction module;
obtaining the joint adaptive transformation unit through parameter learning on the minimum unit before two-dimensional transformation, and transforming the minimum unit before two-dimensional transformation into the minimum unit after two-dimensional joint transformation by using the joint adaptive transformation unit, wherein Loss supervision is executed against the minimum unit before two-dimensional transformation and recorded as Loss2d;
according to the number of image frames input in each training, random frame mixing is carried out between the minimum unit after two-dimensional transformation and the minimum unit before two-dimensional transformation, a two-dimensional joint point updating unit is constructed, and the frame-mixed minimum unit after two-dimensional transformation is obtained; Loss supervision is performed between the minimum unit before two-dimensional transformation and the frame-mixed minimum unit after two-dimensional transformation, recorded as Loss2, and the two-dimensional joint point updating unit updates the minimum unit after two-dimensional joint transformation to obtain the updated minimum unit after two-dimensional transformation;
disassembling the number of the two-dimensional joint points of the updated two-dimensional transformed minimum unit and the positions of the corresponding joint points to obtain two-dimensional transformed joint points;
the three-dimensional joint pre-training module extracts a neural network model for a pre-trained three-dimensional joint, predicts two-dimensional transformation joint points by using the three-dimensional joint pre-training module to obtain a minimum unit after three-dimensional posture transformation, supervises and learns the minimum unit after three-dimensional posture transformation and a group of three-dimensional posture before transformation, and records as Loss3 to obtain three-dimensional transformation joint points;
the three-dimensional joint extraction module inputs the two-dimensional transformation joint points to the three-dimensional joint extraction module for training, and the three-dimensional transformation joint points are used as supervision signals and are marked as Loss 1;
and performing joint supervision training through Loss1, Loss2, Loss3 and Loss2d to obtain the three-dimensional human body posture.
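The joint supervision can be sketched as a single summed objective. The uniform weighting and the mean per-joint Euclidean distance used below are assumptions; the description names the four terms Loss1, Loss2, Loss2d and Loss3 but does not specify the distance measure or weights.

```python
# Minimal sketch: combining the four supervision terms into one objective.
import numpy as np

def joint_distance_loss(pred, target):
    """Mean per-joint Euclidean distance between two (J, D) poses."""
    return float(np.linalg.norm(pred - target, axis=-1).mean())

def total_loss(loss1, loss2, loss2d, loss3, weights=(1.0, 1.0, 1.0, 1.0)):
    """Sum the four supervision terms (uniform weights are an assumption)."""
    w1, w2, w2d, w3 = weights
    return w1 * loss1 + w2 * loss2 + w2d * loss2d + w3 * loss3

a = np.zeros((17, 3))            # toy 17-joint 3D pose
b = np.ones((17, 3))
l = joint_distance_loss(a, b)    # sqrt(3) for each joint, so sqrt(3) overall
```

Summing the terms means a single backward pass propagates gradients to every stage at once, which is the multi-stage back-propagation the description relies on.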
Compared with the prior art, the invention has the beneficial effects that:
1. joint point transformation is executed on the two-dimensional joint points by combining the data transformation thought of the image, a parameter learning process can be constructed, and adaptive transformation can be realized on the two-dimensional joint points extracted from the image through a joint adaptive transformation unit;
2. the two-dimensional joint point updating unit is utilized to mix the random frame number of the minimum unit after the two-dimensional transformation and the minimum unit before the two-dimensional transformation; the minimum unit after two-dimensional transformation after frame number mixing is more in line with the learning of the network;
3. A joint adaptive transformation unit is constructed according to the two-dimensional joints of the initial image, and the transformation process is constrained so that it better conforms to the transformation of the initial image's two-dimensional joints. The obtained two-dimensional transformation joints therefore do not differ too much from the initial two-dimensional joints, and the minimum unit after three-dimensional posture transformation predicted by the three-dimensional joint pre-training module does not differ too much from the three-dimensional posture ground truth before transformation, avoiding the situation in whole-image data augmentation where an excessive transformation prevents learning. Compared with the prior art, applying the joint adaptive transformation unit to learn new two-dimensional joints through a neural network saves time and guarantees accuracy with a small amount of data, and makes the transformed data conform to the data type used for lifting the two-dimensional human posture to the three-dimensional human posture, so the data constructed by parameter learning is more realistic.
4. According to the constructed joint adaptive transformation unit, the batch size and image frame number remain unchanged during model learning; only the number and positions of the joint points need to be learned, and the obtained joint adaptive transformation unit can directly and adaptively transform the coordinate points of the initial image's two-dimensional joints, preventing excessive error in the deep learning process. According to the frame number, the two-dimensional joint point updating unit mixes random frames of the minimum unit after two-dimensional transformation with the minimum unit before two-dimensional transformation to obtain the frame-mixed minimum unit after two-dimensional transformation, so the transformed two-dimensional joints mix multiple transformed postures while still containing the two-dimensional joints of the initial image. The constructed two-dimensional transformation joint points are thus easier to learn, the loss computed against the three-dimensional transformation joint points keeps decreasing rapidly, and network learning is accelerated.
Drawings
FIG. 1 is a first schematic flow chart of the method of the present invention;
FIG. 2 is a second schematic flow chart of the method of the present invention;
FIG. 3 is a schematic diagram of minimum unit acquisition after two-dimensional transformation according to the present invention;
FIG. 4 is a schematic view of an adaptive joint change unit according to the present invention;
FIG. 5 is a diagram illustrating a two-dimensional joint update unit according to the present invention;
FIG. 6 is a schematic diagram of three-dimensional transformed joint acquisition according to the present invention;
FIG. 7 is a schematic diagram of the system of the present invention;
FIG. 8 is a schematic diagram of prior art two-dimensional data enhancement;
FIG. 9 is a three-dimensional human posture estimation effect diagram of the method of the invention on a Human3.6M data set image;
FIG. 10 is a diagram illustrating the effect of the two-dimensional joint acquisition method on a COCO data set image.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
At present, most image data augmentation treats the image statically: the image is flipped, scaled and translated to construct an affine matrix, and the original image is transformed. There are also dynamic augmentation networks, such as ADA ("Adversarial Data Augmentation in Human Pose Estimation") and ASDA ("Adversarial Semantic Data Augmentation for Human Pose Estimation"), but these must introduce a GAN into the initial image pipeline and feed newly augmented data into GAN training to obtain trained augmentation. This is effective for obtaining two-dimensional joints from images, but for predicting three-dimensional joints from two-dimensional joints, the prediction starts from actual joint positions. Relying on a GAN in advance not only increases the dependence on the GAN; more importantly, image-level augmentation transforms the image and the two-dimensional joint ground truth simultaneously, while for the two-dimensional-to-three-dimensional task it is the joints themselves that must be transformed. Applying existing image transformation methods therefore cannot make joint point prediction surpass the effect of the pre-trained network, and the three-dimensional joint extraction network cannot reach the expected effect.
By superposing loss calculations through parameter learning and joint training among multiple units and modules, the method realizes multi-stage back-propagation, so that the synthesized two-dimensional human body posture data have good adaptability. The data augmentation of the method transforms the original data and the two-dimensional human body posture without increasing the data amount; only through the change of the two-dimensional human body posture data, a pre-training model is introduced and the three-dimensional human body posture ground truth is corrected to adapt to the two-dimensional posture transformation, making the whole model more robust. Network prediction performance can be improved without increasing the data amount, and the three-dimensional joint extraction module not only considers the training data results but can also adaptively transform test data, so as to accurately extract the positions of the two-dimensional human posture joints.
Example (b):
referring to fig. 1 to 10, the present invention provides a technical solution:
A three-dimensional human body posture estimation method based on deep learning uses an image acquisition module, a two-dimensional joint extraction module and a three-dimensional posture estimation module: the image acquisition module acquires an image, and two-dimensional joint extraction is performed on the acquired image to obtain a two-dimensional joint. The method used for acquiring the two-dimensional joint is CPN, currently the most popular in the prior art; it appeared in "Cascaded Pyramid Network for Multi-Person Pose Estimation", identifies the two-dimensional key points of multiple persons well, and won first place on the COCO data set in 2018. The related acquisition effect is shown in FIG. 10 of this application.
Performing joint point transformation on the two-dimensional joint acquired by the two-dimensional joint extraction module by using a joint point transformation module;
and performing combined deep learning training on the two-dimensional joint after the joint point transformation is performed by the joint point transformation module by using the three-dimensional joint extraction module and the three-dimensional joint pre-training module, and extracting the three-dimensional human body posture.
It should be particularly pointed out that the data augmentation scheme of this application is used when retraining any model, such as the TCN network in "3D Human Pose Estimation in Video with Temporal Convolutions and Semi-Supervised Training" and the TAG network in "Cascaded Deep Monocular 3D Human Pose Estimation with Evolutionary Training Data". The three-dimensional joint pre-training module of this application therefore refers to a pre-training module already trained with an existing network, and the whole process of this application is given taking the TCN network as an example. Because the experiments of this application are performed on the S1 data of Human3.6M, the three-dimensional joint pre-training module is obtained after training the TCN network on the S1 data of Human3.6M, and its internal parameters are not updated during the experiments. With the three-dimensional joint pre-training module's parameters fixed, the three-dimensional joint extraction module can be further trained so that its performance is improved; the three-dimensional joint pre-training module refers to a network that has converged on the data. Verification also shows that the method performs better in the case of missing data.
The main idea of the TCN network design is given here; the specific scheme is in "3D Human Pose Estimation in Video with Temporal Convolutions and Semi-Supervised Training". The 17 joint points and coordinate points of the data [1024, 27, 17, 2] are expanded by a linear network, the 27 continuous frames are convolved through a CNN to synthesize 1 frame of 3D data, the features are reduced to 51 dimensions to form [1024, 51, 1] data, and three-dimensional posture data [1024, 1, 17, 3] are obtained through dimension conversion. That is, 27 continuous frames of 2D data are synthesized into 1 frame of 3D data through the TCN network.
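The shape flow just described can be walked through as follows. This demonstrates only the tensor shapes: the temporal convolution is replaced by a stand-in projection (a mean over frames and a random matrix), so the model itself is not reproduced.

```python
# Shape walk-through of the TCN data flow: [B,27,17,2] -> [B,27,34] -> [B,51,1] -> [B,1,17,3].
import numpy as np

B, F, J = 1024, 27, 17                  # batch, frames, joints
x = np.random.rand(B, F, J, 2)          # 2D input: [1024, 27, 17, 2]
x = x.reshape(B, F, J * 2)              # joints and coords merged: [1024, 27, 34]

# A temporal CNN collapses the 27 frames to 1 and maps 34 -> 51 channels;
# here that step is faked with a frame mean and a random projection matrix.
w = np.random.rand(J * 2, J * 3)
y = (x.mean(axis=1) @ w)[:, :, None]    # [1024, 51, 1]

pose3d = y.reshape(B, 1, J, 3)          # 3D output: [1024, 1, 17, 3]
```

The 51 output channels are exactly 17 joints times 3 coordinates, which is why the final dimension conversion to [1024, 1, 17, 3] is a plain reshape.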
It should be noted that the improvement of the present application is not directed at the TCN network itself. The direct conversion of 2D key points by the TCN network is adapted through transformation parameter learning, so that the data learned by the whole network are the augmented data while TCN performance is not reduced. This makes the two-dimensional joint parameter learning process more robust: with the original network structure retained, large-parameter learning enhances network performance and also enhances the learning of 2D joints, with strong adaptability.
Preferably, the joint point transformation of the two-dimensional joint acquired by the two-dimensional joint extraction module by using the joint point transformation module comprises
The joint adaptive transformation unit executing the two-dimensional joint points carries out joint point transformation on the two-dimensional joint acquired by the two-dimensional joint extraction module to obtain a minimum unit after two-dimensional transformation after joint point transformation;
updating the minimum unit after the two-dimensional transformation by using a two-dimensional joint updating module to obtain an updated minimum unit after the two-dimensional transformation and obtain a two-dimensional transformation joint point;
predicting by using a three-dimensional joint pre-training module according to the two-dimensional transformation joint points to obtain a minimum unit after three-dimensional posture transformation;
carrying out posture adjustment on the three-dimensional posture ground truth before transformation and the minimum unit after the three-dimensional posture transformation to obtain a three-dimensional transformation joint point;
and inputting the two-dimensional transformation joint points into a three-dimensional joint extraction module, performing Loss supervision by using the three-dimensional transformation joint points, and recording as Loss1 to obtain the three-dimensional human body posture.
Preferably, the joint adaptive transformation unit for executing the two-dimensional joint point transforms the two-dimensional joint acquired by the two-dimensional joint extraction module, and the minimum unit after the two-dimensional transformation after the joint point transformation is obtained includes
Acquiring the number of two-dimensional joint points and the positions of corresponding joint points;
synthesizing the number of two-dimensional joint points and the positions of corresponding joint points into a minimum unit before two-dimensional transformation, i.e. synthesizing [1024,27,17,2] data into [1024,27,34], wherein 1024 is the batch size, 27 the number of consecutive frames, 17 the number of two-dimensional joint points, and 2 the two-dimensional coordinates of the corresponding joints; because the 34 dimensions directly contain the coordinate positions, rotation and scale transformation learning can be performed on the minimum unit before two-dimensional transformation to obtain a joint adaptive transformation unit. The learning of the joint adaptive transformation unit is based mainly on the coordinate points and the joints themselves, so the learned unit can better recognize the relationships between the joints of the minimum unit before the two-dimensional transformation; this will be described again later with respect to the structure of the joint adaptive transformation unit.
Inputting the minimum unit before two-dimensional transformation into the joint adaptive transformation unit to obtain a minimum unit after two-dimensional transformation;
and performing Loss supervision on the minimum unit after the two-dimensional transformation using the minimum unit before the two-dimensional transformation, recorded as Loss2d; since the dimensions of the two units are consistent, supervised learning with the L2 norm is performed on them.
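Because both minimum units share the same [batch, frames, 34] layout, Loss2d reduces to a plain mean-square (L2) error; a minimal sketch with illustrative variable names and a small batch:

```python
import numpy as np

rng = np.random.default_rng(0)
# minimum unit before 2D transformation (small batch of 4 for illustration)
unit_before = rng.normal(size=(4, 27, 34))
# output of the joint adaptive transformation unit: a small perturbation here
unit_after = unit_before + 0.01 * rng.normal(size=(4, 27, 34))

# identical dimensions allow direct L2 (mean square error) supervision,
# with no reshaping or resampling in between
loss2d = float(np.mean((unit_after - unit_before) ** 2))
```

In training this scalar would be back-propagated to the transformation parameters; here it only demonstrates that no dimension conversion is needed.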
The three-dimensional joint extraction module is the network model that needs to be trained; the network is back-propagated under the joint calculation of several loss functions to obtain the three-dimensional human body posture. Two- and three-dimensional corresponding data after data enhancement are realized from the transformed two-dimensional and three-dimensional transformation joint points, and the training parameters of the three-dimensional joint extraction module are continuously updated. At the same time, the parameter learning of the joint adaptive transformation unit is most important: similar to static data transformation on images, three parameters [a, b, c] are learned to control rotation and scale, with the following formulas:
rotation:
angles = arctan(a/b)
rot_mat = [cos(angles), -sin(angles); sin(angles), cos(angles)]
and simultaneously, the scale:
scaling = exp(c)
Through the learning of only three parameters, the method constructs a parameter matrix whose trailing dimensions are 2 x 2 (the batch size and frame number are kept unchanged; they can be found in the input format of the "3D human pose estimation in video with temporal convolutions and semi-supervised training" paper), thereby obtaining the joint adaptive transformation unit.
The above operation enables the joint adaptive transformation unit to operate directly on the minimum unit before the two-dimensional transformation, because the two-dimensional joint is data whose endpoints are coordinate points. According to the constructed joint adaptive transformation unit, the batch size and the number of image frames are unchanged during model learning; only the number and positions of the joint points, and the joint-point relations of the minimum unit before the two-dimensional transformation, need to be learned. The resulting joint adaptive transformation unit is a matrix of [1024,27,2,2], which matches the coordinate points of the minimum unit before two-dimensional transformation exactly. This solves the precision loss caused by unavoidable dimension conversion during image-based data enhancement, enables direct adaptive transformation of the two-dimensional joint coordinates of the initial image, and prevents excessive error in the deep learning process.
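The rotation and scale formulas above can be sketched as follows. This is a hedged illustration: in the actual scheme a, b, c are learned per batch element and frame to form the [1024,27,2,2] matrix, whereas here one global 2x2 matrix is broadcast over all coordinates for brevity:

```python
import numpy as np

def adaptive_transform(coords, a, b, c):
    """Apply the 3-parameter transform: angles = arctan(a/b), scaling = exp(c).
    coords: [B, T, J, 2] two-dimensional joint coordinates."""
    angles = np.arctan2(a, b)                     # arctan(a/b), quadrant-safe
    rot = np.array([[np.cos(angles), -np.sin(angles)],
                    [np.sin(angles),  np.cos(angles)]])
    mat = np.exp(c) * rot                         # rotation combined with scale
    # multiply every (x, y) coordinate by the 2x2 matrix; the batch and
    # frame dimensions are untouched, matching the text's description
    return np.einsum('ij,btkj->btki', mat, coords)

joints = np.random.default_rng(1).normal(size=(2, 27, 17, 2))
same = adaptive_transform(joints, a=0.0, b=1.0, c=0.0)   # identity: angle 0, scale 1
```

With a = 0, b = 1, c = 0 the transform is the identity, which is a convenient sanity check that the matrix acts directly on the coordinate points.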
Preferably, the step of inputting the minimum unit before the two-dimensional transformation into the joint adaptive transformation unit to obtain the minimum unit after the two-dimensional transformation includes
combining the number of two-dimensional joint points and the positions of the corresponding joint points, and performing convolution and linear transformation on them to obtain the control parameters of the joint adaptive transformation unit;
using the control parameters of the joint adaptive transformation unit respectively as the learning parameters of rotation and scale transformation;
and performing position conversion between the learned rotation and scale and the two-dimensional joint coordinates of the minimum unit before two-dimensional transformation to obtain the minimum unit after two-dimensional transformation.
According to the method, a joint adaptive transformation unit is constructed from the two-dimensional joints of the initial image. Compared with the transformation of an STN network, only 3 parameters are learned for rotation and scale, which limits the transformation process and constructs a transformation matrix suited to two-dimensional transformation. The unit is thus more consistent with the transformation of the two-dimensional joints of the initial image: the difference between the obtained two-dimensional transformation joints and the initial two-dimensional joints is not too large, and the difference between the minimum unit after the three-dimensional posture transformation predicted by the three-dimensional joint pre-training module and the ground truth of the three-dimensional posture before transformation is likewise not too large, avoiding the case where data enhanced on the initial image cannot be learned because the transformation is too large. The transformed data also satisfy the data relationship from the two-dimensional to the three-dimensional human body posture, so the data constructed by parameter learning are more realistic. This receives no attention in prior-art image methods and is not easily envisioned in data enhancement of two-dimensional joints.
It should be noted that fig. 8 illustrates two-dimensional data enhancement in the prior art; as also explained in the background art, although it has a good enhancement effect, it takes too much time, the enhanced data are produced by a fixed formula, and enhanced data conforming to the network cannot be synthesized automatically according to the network training process. The present application is therefore proposed, implemented mainly by applying the idea of network-trained learnable parameters to the characteristics of two-dimensional joint data. Fig. 10 shows the labels before the 2-dimensional data transformation in the present application, which demonstrates that the two-dimensional data acquisition of the application is problem-free and further verifies that the implementation on the three-dimensional posture is effective.
Preferably, the updating the two-dimensional transformed minimum unit by using the two-dimensional joint updating module to obtain an updated two-dimensional transformed minimum unit, and obtaining the two-dimensional transformed joint point includes:
acquiring a minimum unit after two-dimensional transformation;
according to the image frame number obtained by the image obtaining module, random frame number mixing is carried out on the minimum unit after two-dimensional transformation and the minimum unit before two-dimensional transformation, and the minimum unit after two-dimensional transformation after frame number mixing is obtained;
LOSS supervision is carried out on the frame-mixed minimum unit after two-dimensional transformation using the minimum unit before two-dimensional transformation, recorded as LOSS2; since in the method their formats remain consistent after frame mixing, Loss2 is also calculated with the L2 norm, giving the updated minimum unit after two-dimensional transformation;
and disassembling the number of the two-dimensional joint points of the updated two-dimensional transformed minimum unit and the positions of the corresponding joint points to obtain the two-dimensional transformed joint points.
The two-dimensional joint point updating unit mixes random frames of the minimum unit after the two-dimensional transformation with the minimum unit before the two-dimensional transformation; the frame-mixed minimum unit after two-dimensional transformation better suits the learning of the network. The mixed data can be generated randomly with the random utilities in pytorch: of the 27 frames, generally the middle 9 frames are retained while the first 9 and last 9 frames of each input are randomly replaced, keeping the [1024,27,17,2] data format, to obtain the frame-mixed minimum unit after two-dimensional transformation. The transformed two-dimensional joints are thus mixed with a variety of previously non-existent postures while still containing the two-dimensional joints of the initial image, making the two-dimensional transformation joint points easier to learn and construct; the loss calculated against the three-dimensional transformation joint points keeps decreasing rapidly, accelerating network learning.
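The random frame mixing can be sketched as follows. This is an assumption-laden illustration: the text describes the scheme only loosely, so the keep-middle-9, randomly-swap-head-and-tail rule below is one plausible reading, written in numpy for self-containment:

```python
import numpy as np

def mix_frames(unit_after, unit_before, keep=9, rng=None):
    """Keep the middle `keep` frames of the transformed unit and randomly
    replace the leading/trailing frame blocks with the original data."""
    rng = rng if rng is not None else np.random.default_rng()
    mixed = unit_after.copy()
    T = unit_after.shape[1]                        # frame axis, e.g. 27
    head, tail = (T - keep) // 2, (T + keep) // 2  # 9 and 18 when T=27, keep=9
    for lo, hi in ((0, head), (tail, T)):
        if rng.random() < 0.5:                     # swap a block at random
            mixed[:, lo:hi] = unit_before[:, lo:hi]
    return mixed

before = np.zeros((2, 27, 17, 2))                  # stands for the pre-transform unit
after = np.ones((2, 27, 17, 2))                    # stands for the transformed unit
mixed = mix_frames(after, before, rng=np.random.default_rng(0))
```

Whatever the random draws, the [1024,27,17,2]-style layout is preserved and the middle frames always come from the transformed unit, matching the description above.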
In the application, LOSS2, i.e. the Loss supervision of the frame-mixed minimum unit after two-dimensional transformation by the minimum unit before two-dimensional transformation, does not restrict the random frame mixing process itself; on the contrary, the mixing process can generate two-dimensional postures the network has never seen before, making the trained three-dimensional joint extraction module more robust.
However, since the minimum unit after the two-dimensional transformation is obtained through the joint adaptive transformation unit, the Loss2 further feeds back the transformation parameters of the joint adaptive transformation unit, and the minimum unit after the two-dimensional transformation after the updating is considered in the self-updating process of the joint adaptive transformation unit, so that the minimum unit after the two-dimensional transformation after the updating promotes the learning of the network, thereby promoting the parameters of the three-dimensional joint extraction module to be better trained and obtaining the three-dimensional human body posture meeting the reduction of the Loss function.
Preferably, the obtaining of the three-dimensional transformation joint point by performing the posture adjustment on the three-dimensional posture ground truth before transformation and the minimum unit after the three-dimensional posture transformation includes
Inputting the two-dimensional transformation joint points into a three-dimensional joint pre-training module to obtain a minimum unit after three-dimensional posture transformation;
performing LOSS solution on the minimum unit after the three-dimensional posture transformation and the three-dimensional posture ground truth before transformation, recorded as Loss3; in the application the two have a consistent format, the data format [1024,1,17,3], representing one frame of 3D data per batch element, so the L2 norm is also used to calculate Loss3.
And obtaining the three-dimensional transformation joint points.
In the present application, it has been described above that the three-dimensional joint pre-training module employs a TCN pre-training module, i.e. the trained model itself. In this application the frame number is 27, the batch size is 1024, and the number of input two-dimensional joint points is 17; the obtained two-dimensional transformation joint points of size [1024,27,17,2] are input into the previously pre-trained TCN network, yielding one frame of the minimum unit after three-dimensional posture transformation with batch size 1024 and format [1024,1,17,3]. This minimum unit is supervised against the three-dimensional posture ground truth in the network to avoid the case where it deviates too far from the ground truth. It should be pointed out that the parameters of the pre-trained TCN network adopted by the application are not changed; therefore the calculation of Loss3 actually feeds back to the parameter updates of the three-dimensional joint extraction module and the joint adaptive transformation unit of the present application.
Preferably, the joint adaptive transformation unit comprises a convolution prediction layer, a linear alignment layer and a parameter merging layer;
the convolution prediction layer performs convolution operation on the number of the two-dimensional joint points and the positions of the corresponding joint points to acquire space position correlation information of the two-dimensional joint points; the linear alignment layer carries out linear regular output on the spatial position correlation information of the two-dimensional joint points, obtains the alignment of the number of the two-dimensional joints and the number of the prediction parameters, and obtains the learning parameters of rotation and scale; and the parameter merging layer constructs the rotation angle and the scale characteristic of the learning parameters of the two-dimensional joint point rotation and scale to obtain the joint adaptive transformation unit.
In the application, the 3 learned rotation and scale parameters are actually obtained by learning from the [1024,27,17,2] two-dimensional joints of the input image sequence. The 17 joint points and the 2-dimensional (x, y) coordinates are combined into a 34-dimensional space and input into the convolution prediction layer to obtain the spatial position relations among the 34-dimensional data. The application uses the Conv2d toolkit in pytorch to acquire the associated information of the 34-dimensional space, with stride 1 and a Conv2d convolution kernel of 3; two identical convolutions in sequence acquire the joint association information, giving a dimension of [1024,27,30]. Through this operation, the spatial position association information of the two-dimensional joint points is linearly normalized and output by the linear alignment layer; in the 512->3 output process, the learning parameters for two-dimensional joint rotation and scale number 3, of which 2 parameters construct the rotation angle feature and 1 parameter constructs the scale feature. The constructed joint adaptive transformation unit is consistent with the batch and frame-number dimensions of the input images.
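A possible shape of the convolution prediction and linear alignment layers can be sketched in pytorch. The (1, 3) kernels reproduce the 34 -> 32 -> 30 feature shrink mentioned in the text and the 512-unit hidden layer follows the 512->3 output note, but the channel counts, activations, and layer order are assumptions rather than the patent's exact architecture:

```python
import torch
import torch.nn as nn

class ParamPredictor(nn.Module):
    """Sketch of the layers that regress the 3 transform parameters
    (2 for the rotation angle, 1 for the scale) from the minimum unit."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=(1, 3)), nn.ReLU(),  # [B,1,27,34] -> [B,1,27,32]
            nn.Conv2d(1, 1, kernel_size=(1, 3)), nn.ReLU(),  # -> [B,1,27,30]
        )
        self.align = nn.Sequential(
            nn.Flatten(),                   # [B, 27*30]
            nn.Linear(27 * 30, 512), nn.ReLU(),
            nn.Linear(512, 3),              # -> (a, b, c)
        )

    def forward(self, x):                   # x: [B, 27, 34] minimum unit
        h = self.conv(x.unsqueeze(1))       # add a channel axis for Conv2d
        return self.align(h)

params = ParamPredictor()(torch.zeros(4, 27, 34))   # a [batch, 3] parameter tensor
```

The output per batch element would feed the rotation/scale formulas given earlier; the batch and frame dimensions pass through untouched, as the text requires.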
Compared with directly learning 6 parameters as in the STN, the method realizes information fusion of the joints. The STN cannot directly transform the two-dimensional joints in the data transformation process with a learned matrix (analogous to the joint adaptive transformation unit of this application), which causes precision loss during data transformation and is very unfavorable for training on two-dimensional joints. The joint adaptive transformation unit obtained in this application is a [1024,27,2,2] transformation matrix, and the coordinates of the two-dimensional joints of the minimum unit before transformation are transformed directly to obtain the minimum unit after the two-dimensional joint transformation.
In the application, Loss supervision of the minimum unit after two-dimensional transformation by the minimum unit before two-dimensional transformation is recorded as Loss2d. Because the transformation matrix of the joint adaptive transformation unit can operate directly on the original minimum unit before two-dimensional transformation, the calculation of Loss2d avoids loss of spatial position during the transformation and improves transformation precision. In addition, the dimension of the minimum unit after two-dimensional transformation is [1024,27,17,2], fully consistent with that of the minimum unit before transformation, so the loss function is more convenient to calculate and the operation brings no extra spatial precision loss; the calculation of Loss2d thus fits the gradient descent of the neural network, i.e. the dimensional consistency allows the direct use of the L2 loss (mean square error).
Preferably, the sequence of images acquired by the image acquisition module is a video sequence of consecutive frames. Following the idea of the TCN network in the above paper, using consecutive frames enhances the acquisition of adjacent-frame information. Although the present application uses 27-frame data, it does not disturb any data between adjacent frames but only acquires information between the joint points, so it adapts easily to network training whether the consecutive window is 27 frames or 243 frames.
In the application, for the final three-dimensional human body posture, the Loss calculation is performed by adding Loss1, Loss2, Loss3 and Loss2d, realizing loss back-propagation for each link. The data are transformed to stay close to the original data while the recognition of different postures is learned, and real values of 3-dimensional data consistent with the 2-dimensional data enhancement are further constructed for supervision; when data are scarce, this can greatly enhance the recognition capability of the network.
The loss function design in the application is fully adapted to the data format of the two-dimensional joints, which existing data enhancement cannot match. In particular, loss2d and loss2 take the minimum unit before two-dimensional transformation as the supervision reference: it supervises the calculations of the joint adaptive transformation unit and the two-dimensional joint point updating unit, so that the transformed and updated two-dimensional joints approach the minimum unit before two-dimensional transformation during training, and the training data cannot be expanded arbitrarily without bound. In the neural network learning process, the application obtains the transformation parameters from the minimum unit before two-dimensional transformation and uses that same unit as supervision data; the whole network therefore learns the associations among the two-dimensional joints of the original image, the enhanced data inherit the relations of those joints, and two-dimensional transformation joint points meeting the training requirements are obtained. In addition, under the supervision of loss3 and loss1, the three-dimensional human posture learning takes the three-dimensional posture ground truth before transformation as reference and obtains supervision data similar to it, so the three-dimensional transformation joint points likewise cannot be expanded arbitrarily without bound; at present no neural network can directly simulate an excellent three-dimensional posture.
Loss3 of the application supervises the two-dimensional transformation joint points obtained under loss2d and loss2 so that their posture stays similar to the original minimum unit before two-dimensional transformation and does not change too much; three-dimensional transformation joint points meeting the conditions can then be simulated by the three-dimensional pre-training module and supervised by loss1, training the three-dimensional joint extraction module of the application. This further limits the descent speed of loss2d, loss2 and loss3, realizes overall gradient feedback, lets the network learn the loss of each link after data enhancement, and guarantees continuous decrease and rapid convergence.
The invention also provides a three-dimensional human body posture estimation system based on deep learning, which comprises
The image acquisition module is used for acquiring continuous frame images of the video sequence;
the two-dimensional joint extraction module is used for extracting two-dimensional joint points in the continuous frame images;
the joint point transformation module comprises a joint adaptive transformation unit and a two-dimensional joint point updating unit;
synthesizing the number of two-dimensional joint points and the positions of corresponding joint points in the continuous frame images into a minimum unit before two-dimensional transformation by the two-dimensional joint extraction module;
obtaining the joint adaptive transformation unit through parameter learning of the minimum unit before two-dimensional transformation, and transforming the minimum unit before two-dimensional transformation into a minimum unit after two-dimensional joint transformation by using the joint adaptive transformation unit, wherein Loss supervision is executed by the minimum unit before two-dimensional transformation and is marked as Loss2d;
according to the number of image frames input in each training, random frame mixing is carried out on the minimum unit after two-dimensional transformation and the minimum unit before two-dimensional transformation, a two-dimensional joint point updating unit is constructed, and the frame-mixed minimum unit after two-dimensional transformation is obtained; LOSS supervision is performed on the frame-mixed minimum unit after two-dimensional transformation using the minimum unit before two-dimensional transformation, marked as LOSS2, and the minimum unit after two-dimensional joint transformation is updated by the two-dimensional joint point updating unit to obtain the updated minimum unit after two-dimensional transformation;
disassembling the number of the two-dimensional joint points of the updated two-dimensional transformed minimum unit and the positions of the corresponding joint points to obtain two-dimensional transformed joint points;
the three-dimensional joint pre-training module is a pre-trained neural network model for three-dimensional joint extraction; it predicts from the two-dimensional transformation joint points to obtain a minimum unit after three-dimensional posture transformation, which is supervised and learned against the three-dimensional posture ground truth before transformation, recorded as Loss3, to obtain the three-dimensional transformation joint points;
the three-dimensional joint extraction module: the two-dimensional transformation joint points are input to the three-dimensional joint extraction module for training, and the three-dimensional transformation joint points are used as supervision signals, recorded as Loss1;
in the method, joint supervision training is carried out through Loss1, Loss2, Loss3 and Loss2d. This largely prevents the joint adaptive transformation unit from transforming the original joints over too large a range, which would make the updated two-dimensional joints differ too much from the original data, and prevents the training data and the final test data from falling under different standards; it thereby improves the reliability and robustness of model training and yields the correct three-dimensional human body posture. The obtained Loss1, Loss2, Loss3 and Loss2d are added to form the final Loss calculation, used as the supervision signal and realizing gradient return for each link:
Loss=Loss1+Loss2+Loss3+Loss2d
in the experiments, under the S1 data of the Human3.6M dataset alone, the application reduces the real loss of the final three-dimensional human posture in sequence as 70-68-66 (the original TCN network training gives 71-69-68), using only the S1 data of Human3.6M. It should be noted that an introduction to Human3.6M is easy to find in the prior art; S1 refers to one subject of Human3.6M with three-dimensional calibration. Training generally uses several subjects, but to save time while verifying the effectiveness of the application, S1 is adopted as the training set, with other subject data used for the test set, consistent with the setup of the original TCN network. With 80 epochs in the semi-supervised training mode, the original network obtains an MPJPE (mm) of 64.7 (MPJPE, Mean Per Joint Position Error, is by its literal meaning the average Euclidean distance between predicted key points and the ground truth); after the data enhancement of the present method, the MPJPE (mm) obtained through the same network at 80 epochs is 63.1, an improvement in the predictive performance of the network.
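The MPJPE metric quoted above can be computed as follows (this is the standard definition of the metric, not code from the application):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error: average Euclidean distance between
    predicted joints and ground truth. pred, gt: [..., J, 3] arrays in mm."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

gt = np.zeros((2, 17, 3))           # ground-truth 3D joints (illustrative)
pred = gt.copy()
pred[..., :2] = [3.0, 4.0]          # every joint offset by a 3-4-5 triangle, i.e. 5 mm
error = mpjpe(pred, gt)             # averages the 5 mm error over all joints
```

Averaging per-joint Euclidean distances is what makes the 64.7 mm vs 63.1 mm comparison above directly interpretable.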
The method does not need to rework all data in advance; it only adds the joint adaptive transformation unit to learn the transformation parameters and supervises the whole training process, saving a large amount of time. By combining the supervision of the minimum unit after two-dimensional transformation, of the updated minimum unit after two-dimensional transformation, of the minimum unit after three-dimensional posture transformation, and of the three-dimensional human body posture prediction, the joint adaptive transformation unit learns parameters for the minimum unit after two-dimensional transformation that best fits the model, and the network learns a suitable minimum unit after three-dimensional posture transformation. Further, in this supervision process, two-dimensional transformation joint points suited to the learning of the three-dimensional joint extraction module and three-dimensional transformation joint points serving as network supervision signals are constructed; through these enhanced joint points, the performance of the three-dimensional joint extraction module exceeds that of the original network without data enhancement, further illustrating the usefulness of the present application.
The following provides further improvements of the three-dimensional joint extraction module; details of constructing the module on the basis of the TCN framework are given, as shown in fig. 10. It should be noted that the application uses a two-dimensional joint extraction module (prior art) to extract the two-dimensional joints, and the subsequent training is directed at the two-dimensional joints (the innovative point of the application). Fig. 9 shows the acquisition of the input image (fig. 9, left) and its two-dimensional pose, then the acquisition of the three-dimensional human body posture (in fig. 9) and its comparison with the real value (fig. 9, right). Although the two differ slightly, and the experiment was performed only on the S1 data, the effect is improved to a certain extent over the existing TCN network, further verifying the effectiveness of the application from the perspective of visualization.
When the parameters of the joint adaptive transformation unit and the three-dimensional joint extraction module are trained, the parameters of the three-dimensional joint pre-training model remain unchanged throughout, i.e. it is a pre-trained TCN network (supervised and trained in advance on the S1 data of Human3.6M); the three-dimensional joint extraction module adopts an untrained TCN network, and it is this module that is trained.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A three-dimensional human body posture estimation method based on deep learning, characterized by comprising: acquiring an image by an image acquisition module, and performing two-dimensional joint extraction on the acquired image by a two-dimensional joint extraction module to obtain a two-dimensional joint;
performing joint point transformation on the two-dimensional joint acquired by the two-dimensional joint extraction module by using a joint point transformation module;
and performing, by using a three-dimensional joint extraction module and a three-dimensional joint pre-training module, combined deep learning training on the two-dimensional joint transformed by the joint point transformation module, to estimate the three-dimensional human body posture.
2. The three-dimensional human body posture estimation method based on deep learning of claim 1, wherein: the joint point transformation of the two-dimensional joint acquired by the two-dimensional joint extraction module by using the joint point transformation module comprises
The joint adaptive transformation unit for the two-dimensional joint points performs joint point transformation on the two-dimensional joint acquired by the two-dimensional joint extraction module to obtain a minimum unit after two-dimensional transformation;
updating the minimum unit after the two-dimensional transformation by using a two-dimensional joint updating module to obtain an updated minimum unit after the two-dimensional transformation and obtain a two-dimensional transformation joint point;
predicting by using a three-dimensional joint pre-training module according to the two-dimensional transformation joint points to obtain a minimum unit after three-dimensional posture transformation;
carrying out posture adjustment on the three-dimensional posture group before transformation and the minimum unit after the three-dimensional posture transformation to obtain a three-dimensional transformation joint point;
and inputting the two-dimensional transformation joint points into a three-dimensional joint extraction module, performing Loss supervision by using the three-dimensional transformation joint points, and recording as Loss1 to obtain the three-dimensional human body posture.
3. The three-dimensional human body posture estimation method based on deep learning of claim 2, characterized in that: the performing, by the joint adaptive transformation unit for the two-dimensional joint points, joint point transformation on the two-dimensional joint acquired by the two-dimensional joint extraction module to obtain the minimum unit after two-dimensional transformation comprises
Acquiring the number of two-dimensional joint points and the positions of corresponding joint points;
synthesizing the number of the two-dimensional joint points and the positions of the corresponding joint points into a minimum unit before two-dimensional transformation, and performing rotation and scale transformation learning on the minimum unit before two-dimensional transformation to obtain a joint adaptive transformation unit;
inputting the minimum unit before two-dimensional transformation into the joint adaptive transformation unit to obtain a minimum unit after two-dimensional transformation;
and performing Loss supervision on the minimum unit after the two-dimensional transformation by using the minimum unit before the two-dimensional transformation, and recording the minimum unit as Loss2 d.
4. The method for estimating the three-dimensional human body posture based on the deep learning of claim 3, wherein: inputting the minimum unit before the two-dimensional transformation into the joint adaptive transformation unit to obtain the minimum unit after the two-dimensional transformation comprises
Combining the number of the two-dimensional joint points and the positions of the corresponding joint points, and performing convolution and linear transformation on them to obtain the control parameters of the joint adaptive transformation unit;
the control parameters of the joint adaptive transformation unit are respectively used as learning parameters of rotation and scale transformation;
and performing position conversion on the learned rotation and scale and the coordinate of the two-dimensional joint of the minimum unit before two-dimensional transformation to obtain the minimum unit after two-dimensional transformation.
5. The method for estimating the three-dimensional human body posture based on the deep learning of claim 4, wherein the method comprises the following steps: the updating the minimum unit after the two-dimensional transformation by using the two-dimensional joint updating module to obtain the minimum unit after the two-dimensional transformation after the updating, and obtaining the two-dimensional transformation joint point comprises the following steps:
acquiring a minimum unit after two-dimensional transformation;
according to the image frame number obtained by the image obtaining module, random frame number mixing is carried out on the minimum unit after two-dimensional transformation and the minimum unit before two-dimensional transformation, and the minimum unit after two-dimensional transformation after frame number mixing is obtained;
Loss supervision is performed between the minimum unit before two-dimensional transformation and the minimum unit after two-dimensional transformation after frame number mixing, recorded as Loss2, to obtain the updated minimum unit after two-dimensional transformation;
and disassembling the number of the two-dimensional joint points of the updated two-dimensional transformed minimum unit and the positions of the corresponding joint points to obtain the two-dimensional transformed joint points.
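The random frame-number mixing of claim 5 can be sketched as follows. This is a hypothetical illustration under an assumed policy — a random number of frames in the transformed sequence are reverted to their pre-transformation originals; the patent does not fix the exact mixing rule.

```python
import random

def mix_frames(frames_before, frames_after, seed=None):
    """Randomly replace some two-dimensionally transformed frames with
    the corresponding original (pre-transformation) frames."""
    assert len(frames_before) == len(frames_after)
    rng = random.Random(seed)
    n = len(frames_before)
    k = rng.randrange(n + 1)               # how many frames to revert (assumed policy)
    revert = set(rng.sample(range(n), k))  # which frame indices revert to originals
    return [frames_before[i] if i in revert else frames_after[i]
            for i in range(n)]
```

The mixed sequence then serves as the frame-mixed minimum unit on which Loss2 is computed against the minimum unit before two-dimensional transformation.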
6. The method for estimating the three-dimensional human body posture based on the deep learning of claim 5, wherein: the performing posture adjustment on the three-dimensional posture group before transformation and the minimum unit after three-dimensional posture transformation to obtain the three-dimensional transformation joint points comprises
Inputting the two-dimensional transformation joint points into a three-dimensional joint pre-training module to obtain a minimum unit after three-dimensional posture transformation;
performing LOSS solution on the minimum unit after the three-dimensional posture is transformed and the three-dimensional posture group before the transformation, and recording as Loss 3;
and obtaining the three-dimensional transformation joint points.
7. The method for estimating the three-dimensional human body posture based on the deep learning of claim 4, wherein the method comprises the following steps: the joint adaptability transformation unit comprises a convolution prediction layer, a linear alignment layer and a parameter merging layer;
the convolution prediction layer performs a convolution operation on the number of the two-dimensional joint points and the positions of the corresponding joint points to acquire spatial position correlation information of the two-dimensional joint points; the linear alignment layer performs linearly normalized output of the spatial position correlation information of the two-dimensional joint points, aligning the number of two-dimensional joints with the number of prediction parameters to obtain the learning parameters of rotation and scale; the parameter merging layer constructs the rotation angle and scale features from the learning parameters of two-dimensional joint point rotation and scale to obtain the joint adaptive transformation unit;
the loss function design is adapted to the data format of the two-dimensional joint; Loss2d and Loss2 both take the minimum unit before two-dimensional transformation as the supervision reference, serving as supervision in the calculation of the joint adaptive transformation unit and the two-dimensional joint point updating unit, so that the transformed and updated two-dimensional joints approach the minimum unit before two-dimensional transformation during training; since the transformation parameters are obtained from the minimum unit before two-dimensional transformation and that unit serves as the supervision data, the enhanced data inherit the relationships of the two-dimensional joints of the original image, yielding two-dimensional transformation joint points that meet the training requirement; under the supervision and training of Loss3 and Loss1, the learning of the three-dimensional human body posture takes the three-dimensional posture group before transformation as reference to obtain supervision data similar to the original three-dimensional posture group; Loss3 supervises that the two-dimensional transformation joint points obtained under Loss2d and Loss2 correspond to a posture similar to the original minimum unit before two-dimensional transformation, so that three-dimensional transformation joint points meeting the conditions can be simulated by the three-dimensional pre-training module and then supervised by Loss1 to train the three-dimensional joint extraction module of the present application, ensuring the descent of Loss2d, Loss2 and Loss3 and realizing overall gradient feedback;
in the process of supervised training, the internal parameters of the three-dimensional joint pre-training module are unchanged.
8. The method for estimating the three-dimensional human body posture based on the deep learning of claim 7, wherein: the two-dimensional joint point rotation and scale learning parameters number 3, of which 2 parameters are used to construct the rotation angle feature and 1 parameter is used to construct the scale feature; the constructed joint adaptive transformation unit is consistent in size with the batch input dimension and frame number dimension of the input images, and the two-dimensional joint coordinates of the minimum unit before two-dimensional joint transformation are directly transformed in position to obtain the minimum unit after two-dimensional joint transformation.
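The three-parameter transformation of claim 8 can be sketched as follows. The parameterization here — two learned values `(a, b)` defining the rotation angle via `atan2` and one value `s` defining the scale — is an assumption consistent with, but not dictated by, the claim's "2 parameters construct the rotation angle feature, 1 parameter constructs the scale feature".

```python
import math

def transform_joints(joints, a, b, s):
    """Apply the learned rotation and scale directly to 2D joint
    coordinates (one frame of the minimum unit), about the origin."""
    theta = math.atan2(b, a)                # rotation angle from the two parameters
    c, si = math.cos(theta), math.sin(theta)
    return [(s * (c * x - si * y),          # rotate then scale each (x, y) joint
             s * (si * x + c * y)) for x, y in joints]
```

Applying this per batch element and per frame matches the claim's requirement that the transformation unit be consistent with the batch and frame-number dimensions of the input.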
9. The deep learning-based three-dimensional human body posture estimation method according to any one of claims 1-8, characterized in that: the image acquisition module acquires a sequence of images as continuous frames of a video sequence.
10. A three-dimensional human body posture estimation system based on deep learning, characterized by comprising: an image acquisition module, configured to acquire continuous frame images of a video sequence;
the two-dimensional joint extraction module is used for extracting two-dimensional joint points in the continuous frame images;
the joint point transformation module comprises a joint adaptive transformation unit and a two-dimensional joint point updating unit;
synthesizing the number of two-dimensional joint points and the positions of corresponding joint points in the continuous frame images into a minimum unit before two-dimensional transformation by the two-dimensional joint extraction module;
obtaining the joint adaptive transformation unit through parameter learning of the minimum unit before two-dimensional transformation, transforming the minimum unit before two-dimensional transformation into a minimum unit after two-dimensional joint transformation by using the joint adaptive transformation unit, wherein Loss supervision is executed by the minimum unit before two-dimensional transformation and is marked as Loss2 d;
according to the number of image frames input in each training, performing random frame number mixing on the minimum unit after two-dimensional transformation and the minimum unit before two-dimensional transformation to construct a two-dimensional joint point updating unit and obtain the minimum unit after two-dimensional transformation after frame number mixing; performing Loss supervision between the minimum unit before two-dimensional transformation and the frame-mixed minimum unit after two-dimensional transformation, recorded as Loss2, and updating the minimum unit after two-dimensional joint transformation by using the two-dimensional joint point updating unit to obtain the updated minimum unit after two-dimensional transformation;
disassembling the number of the two-dimensional joint points of the updated two-dimensional transformed minimum unit and the positions of the corresponding joint points to obtain two-dimensional transformed joint points;
the three-dimensional joint pre-training module extracts a neural network model for a pre-trained three-dimensional joint, predicts two-dimensional transformation joint points by using the three-dimensional joint pre-training module to obtain a minimum unit after three-dimensional posture transformation, supervises and learns the minimum unit after three-dimensional posture transformation and a group of three-dimensional posture before transformation, and records as Loss3 to obtain three-dimensional transformation joint points;
the three-dimensional joint extraction module inputs the two-dimensional transformation joint points to the three-dimensional joint extraction module for training, and the three-dimensional transformation joint points are used as supervision signals and are marked as Loss 1;
and performing joint supervision training through Loss1, Loss2, Loss3 and Loss2d to obtain the three-dimensional human body posture.
CN202110067988.3A 2021-01-19 2021-01-19 Three-dimensional human body posture estimation method and system based on deep learning Active CN112766153B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110067988.3A CN112766153B (en) 2021-01-19 2021-01-19 Three-dimensional human body posture estimation method and system based on deep learning

Publications (2)

Publication Number Publication Date
CN112766153A true CN112766153A (en) 2021-05-07
CN112766153B CN112766153B (en) 2022-03-11

Family

ID=75703094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110067988.3A Active CN112766153B (en) 2021-01-19 2021-01-19 Three-dimensional human body posture estimation method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN112766153B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830150A (en) * 2018-05-07 2018-11-16 山东师范大学 One kind being based on 3 D human body Attitude estimation method and device
CN109176512A (en) * 2018-08-31 2019-01-11 南昌与德通讯技术有限公司 A kind of method, robot and the control device of motion sensing control robot

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
邓益侬等: "基于深度学习的人体姿态估计方法综述", 《计算机工程与应用》 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant