CN118172412A

CN118172412A - Method and device for carrying out 3D human body posture positioning and restoring by using 2D image

Info

Publication number: CN118172412A
Application number: CN202410596275.XA
Authority: CN
Inventors: 周宇; 王微; 金帅; 刘德生
Original assignee: Zhongke Jingrui Suzhou Technology Co ltd
Current assignee: Zhongke Jingrui Suzhou Technology Co ltd
Priority date: 2024-05-14
Filing date: 2024-05-14
Publication date: 2024-06-11

Abstract

The invention discloses a method and a device for 3D human body pose localization and restoration using a 2D image, relating to the technical field of 3D pose estimation. The method comprises: S1, generating samples with a normalizing flow; S2, dimension lifting and projection loss calculation; S3, an occlusion processing network; S4, an adjustment part. Samples are generated by the normalizing flow technique, which enlarges the sample set so that it covers as many poses and scenes as possible and improves the generalization ability of the deep learning method during training. The missing depth information is obtained through dimension lifting and projection loss calculation. The occluded pose is used as the input of the occlusion processing network, which uses a multi-layer fully connected neural network to predict the occluded parts and outputs a complete, occlusion-free 3D pose. A length loss function and an action loss function are then used for correction, improving the stability and accuracy of the 2D-to-3D pose restoration algorithm.

Description

Method and device for carrying out 3D human body posture positioning and restoring by using 2D image

Technical Field

The invention relates to the technical field of 3D pose estimation, and in particular to a method and a device for 3D human body pose localization and restoration using a 2D image.

Background

3D pose estimation, also called 3D human pose estimation (3D HPE), aims to predict the position of each key joint of the human body in three-dimensional space. The technique has a wide range of applications, for example in human-machine interaction, human motion analysis, and rehabilitation. It can also provide skeletal-structure information for other computer vision tasks.

There are two main ways to represent the human body: one represents the pose as a skeleton made up of a series of key points and the lines connecting them; the other uses a parameterized mesh model of the human body to represent both pose and body shape.

However, estimating a three-dimensional pose from a two-dimensional image is an inherently ambiguous problem: several different three-dimensional poses may have identical two-dimensional projections. In addition, methods based on monocular images face practical challenges such as self-occlusion, occlusion by objects, and the difficulty of obtaining depth information.

At present, although identifying the key joint positions of the human body is a fairly mature technology, recovering the three-dimensional positions of the joints from a limited number of photos when multiple people and occlusion are present, and thereby inferring pose and action intention, remains a problem that requires further study.

The prior art still faces many challenges and problems in restoring 2D poses to 3D:

1. Absence of depth information: 2D images lack depth information. Solving this usually requires additional sensors, such as a depth camera or a laser scanner, but these devices tend to be costly and inconvenient to use. Furthermore, even with depth information, extracting the exact pose directly from it is not easy, because the depth structure of the human body is itself complex;

2. Diversity and complexity of poses: human poses are highly varied and complex. This is mainly because the body structure and muscular system are very complex, and because environmental factors also influence a person's pose;

3. Limitations of the datasets: existing datasets are small in scale and lack diversity, which greatly limits the generalization ability and universality of the model;

4. Stability and accuracy of the algorithm: the accuracy and stability of existing 2D-to-3D pose restoration algorithms still need to be improved; their performance often degrades on complex poses and under occlusion.

Therefore, the invention provides a method and a device for 3D human body pose localization and restoration using a 2D image.

Disclosure of Invention

The invention aims to provide a method and a device for 3D human body pose localization and restoration using a 2D image, so as to solve the problems described in the Background section.

In order to achieve the above purpose, the present invention provides the following technical solutions:

In a first aspect, a method for performing 3D human body pose localization and restoration using a 2D image includes:

S1, sample generation with a normalizing flow: a normalizing-flow generative model is trained on real samples; independent generative models are trained for the trunk and for the left and right legs and arms, and the negative log-likelihood between the generated samples and the real samples is minimized;

S2, dimension lifting and projection loss calculation: the 2D coordinates are lifted to 3D through a projection conversion matrix, the 3D object is rotated, and the rotated 3D object is re-projected to 2D through the conversion matrix;

S3, occlusion processing network: a 3D pose with occlusion is obtained from the 2D pose through dimension lifting; the occluded pose is used as the input of the occlusion processing network, which outputs the predicted occlusion-free 3D pose; the prediction uses a multi-layer fully connected neural network;

S4, adjustment part: a length loss function is introduced for the generated limbs to correct the limb lengths predicted by the network and keep them from deviating from normal values; for two actions separated by a fixed time interval, an action loss function is defined during training to correct the generated actions so that the change between them is closer to the real value.

Further, in step S1, the normalizing flow is a technique used within a GAN to build more complex generative models: through a series of invertible, easily computed transformations, it converts an input random noise vector into data samples with the desired distribution. These transformations are typically parameterized and can be learned by an optimization algorithm. The GAN comprises a generator and a discriminator, each typically composed of a deep neural network;

The main task of the generator is to generate data from random noise, and the generated data should be as close to the real data as possible. The generator works as follows: it receives a random noise vector as input and generates new data samples through a series of transformations;

The task of the discriminator is to distinguish whether the input data comes from the real dataset or was produced by the generator. The discriminator works as follows: it receives a data sample as input and, through a series of transformations, outputs a probability value indicating the likelihood that the sample is real data.
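The "reversible and easy-to-calculate" transformations mentioned for step S1 can be illustrated with a single affine coupling step, a common building block of normalizing flows. This is a minimal NumPy sketch, not the patent's model: the coupling design, the linear conditioners and the names `coupling_forward`, `coupling_inverse`, `w_scale` and `w_shift` are all assumptions made for illustration.

```python
import numpy as np

def coupling_forward(z, w_scale, w_shift):
    """Affine coupling step: keep the first half, scale-and-shift the second half."""
    z1, z2 = np.split(z, 2, axis=-1)
    log_s = np.tanh(z1 @ w_scale)      # hypothetical conditioner: bounded log-scale
    t = z1 @ w_shift                   # hypothetical conditioner: shift
    x2 = z2 * np.exp(log_s) + t
    log_det = log_s.sum(axis=-1)       # log |det Jacobian| of this step
    return np.concatenate([z1, x2], axis=-1), log_det

def coupling_inverse(x, w_scale, w_shift):
    """Exact inverse: the untouched half reproduces the same scale and shift."""
    x1, x2 = np.split(x, 2, axis=-1)
    log_s = np.tanh(x1 @ w_scale)
    t = x1 @ w_shift
    z2 = (x2 - t) * np.exp(-log_s)
    return np.concatenate([x1, z2], axis=-1)

rng = np.random.default_rng(0)
dim = 6                                # illustrative size, e.g. 2 joints x 3 coordinates
w_scale = rng.normal(size=(dim // 2, dim // 2))
w_shift = rng.normal(size=(dim // 2, dim // 2))

noise = rng.normal(size=(4, dim))      # random noise vectors (input of the transform)
samples, log_det = coupling_forward(noise, w_scale, w_shift)
recovered = coupling_inverse(samples, w_scale, w_shift)
```

Because half of the vector passes through unchanged, the inverse can recompute the same scale and shift, so `recovered` matches `noise` up to floating-point error, and the log-determinant is just a sum.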

Further, in step S1, the formula of the normalizing flow is as follows: [equation not reproduced], where the symbols denote, in order, the generated coordinates, the normalizing flow with its parameters, the estimated true position, a constant adjustment coefficient, and Gaussian noise whose mean and variance are 0 and 1, respectively.

Further, in step S1, the formula of the likelihood estimation is as follows: [equation not reproduced], where the symbols denote, in order, the likelihood, the generated coordinates, the real (ground truth, GT) coordinates, the number of samples N, and the probability density function estimated by the normalizing flow. This density is the distribution to be learned and must be trained on samples. Each normalizing flow corresponds to one body part; there are generally 5 parts, corresponding to the trunk, the hands and the feet. θ denotes the parameters and has no physical meaning.
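Minimizing the negative log-likelihood of a flow relies on the change-of-variables formula: the log-density of a sample is the base Gaussian log-density of its inverse image plus the log-determinant of the inverse transform. The sketch below uses the simplest possible flow (an element-wise affine map) for one body part. The function name `affine_flow_nll`, the 9-dimensional part vector and the synthetic data are assumptions for illustration and are not the patent's formula.

```python
import numpy as np

def affine_flow_nll(x, mu, log_sigma):
    """Mean negative log-likelihood of x under x = mu + exp(log_sigma) * z, z ~ N(0, I)."""
    z = (x - mu) * np.exp(-log_sigma)                                # inverse transform
    log_base = -0.5 * (z ** 2 + np.log(2.0 * np.pi)).sum(axis=-1)    # log N(z; 0, I)
    log_det = -np.sum(log_sigma)                                     # log |det dz/dx|
    return -np.mean(log_base + log_det)

rng = np.random.default_rng(0)
# Hypothetical "real samples" for one body part: 3 joints x 3 coordinates = 9 values.
real = rng.normal(loc=1.5, scale=0.3, size=(1000, 9))

# Untrained flow (identity transform) versus parameters fitted to the samples.
nll_untrained = affine_flow_nll(real, np.zeros(9), np.zeros(9))
nll_fitted = affine_flow_nll(real, real.mean(axis=0), np.log(real.std(axis=0)))
```

Under these assumptions `nll_fitted` is lower than `nll_untrained`, which is the quantity a trainer would drive down. In a per-part setup, one such model would be kept for each of the five body parts.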

Further, in step S2, the same 3D motion or pose produces several 2D projections, and a correct 3D reconstruction, when rotated and re-projected, should yield the different 2D images of that same motion. A loss function is therefore defined and minimized. The 2D loss function is as follows: [equation not reproduced], where the symbols denote the original 2D coordinates and the calculated rotated 3D coordinates, P is the 3D-to-2D conversion matrix, R is the rotation matrix for the azimuth angle, the azimuth angle range is [-π, π], and the last symbol denotes the inverse matrix.

Further, in step S2, a loss function is also calculated for the 3D coordinates to keep the conversion consistent. The 3D loss function is as follows: [equation not reproduced], where the symbols denote the reconstructed 3D coordinates and the rotated coordinates, P is the 3D-to-2D conversion matrix, R is the rotation matrix for the azimuth angle, the azimuth angle range is [-π, π], and the last symbol denotes the inverse matrix.
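The lift, rotate and re-project cycle of step S2 can be sketched as follows. Since the loss equations are not available in this text, the forms below are one plausible reading of the listed symbols and are assumptions: a 2D loss comparing the original 2D joints with P R⁻¹ applied to the rotated 3D pose, and a 3D loss comparing the reconstructed pose with R⁻¹ applied to the rotated pose. The orthographic `P`, the vertical rotation axis, the 17-joint skeleton and all function names are likewise hypothetical.

```python
import numpy as np

P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])   # assumed orthographic 3D-to-2D projection

def azimuth_rotation(angle):
    """Rotation about the vertical (y) axis; angle expected in [-pi, pi]."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def loss_2d(x2d, X_rot, R):
    """Assumed form: mean squared distance between x2d and P R^-1 X_rot."""
    back = X_rot @ np.linalg.inv(R).T          # undo the rotation (row-vector layout)
    return np.mean(np.sum((x2d - back @ P.T) ** 2, axis=-1))

def loss_3d(X_rec, X_rot, R):
    """Assumed form: mean squared distance between X_rec and R^-1 X_rot."""
    back = X_rot @ np.linalg.inv(R).T
    return np.mean(np.sum((X_rec - back) ** 2, axis=-1))

rng = np.random.default_rng(0)
x2d = rng.normal(size=(17, 2))                 # hypothetical 17-joint 2D pose
depth = rng.normal(size=(17, 1))               # stand-in for depth from a lifting model
X_rec = np.hstack([x2d, depth])                # lifted (reconstructed) 3D pose
R = azimuth_rotation(rng.uniform(-np.pi, np.pi))
X_rot = X_rec @ R.T                            # rotated 3D pose

consistent = (loss_2d(x2d, X_rot, R), loss_3d(X_rec, X_rot, R))
X_bad = X_rot + 0.1                            # an inconsistent rotated estimate
inconsistent = (loss_2d(x2d, X_bad, R), loss_3d(X_rec, X_bad, R))
```

With a consistent rotated pose both losses vanish up to floating-point error, while the perturbed estimate is penalized. In training, the rotated pose would come from a learned model rather than from an exact rotation.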

Further, in step S3, the network structure is a fully connected network with 4 to 5 layers, the activation function is ReLU, and the loss function is defined as follows: [equation not reproduced], where coordinates with subscript m are the values predicted by the occlusion network and coordinates with subscript o are the ground truth (GT), which includes both real samples and generated samples.
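A fully connected ReLU network of the size described for step S3 can be sketched in a few lines of NumPy. Only the layer count and the activation come from the text. The hidden width of 256, the 17-joint skeleton, the zero-masking of occluded joints and the mean-squared-error loss are assumptions, and the weights here are random rather than trained.

```python
import numpy as np

def init_mlp(sizes, rng):
    """He-initialised weights and zero biases for a fully connected network."""
    return [(rng.normal(scale=np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    """ReLU on every hidden layer, linear output layer."""
    for w, b in params[:-1]:
        x = np.maximum(x @ w + b, 0.0)
    w, b = params[-1]
    return x @ w + b

def occlusion_loss(pred, gt):
    """Assumed loss: mean squared error between predicted (m) and GT (o) coordinates."""
    return np.mean((pred - gt) ** 2)

rng = np.random.default_rng(0)
n_joints = 17                                          # hypothetical skeleton size
sizes = [n_joints * 3, 256, 256, 256, n_joints * 3]    # 4 fully connected layers
params = init_mlp(sizes, rng)

gt_pose = rng.normal(size=(8, n_joints * 3))           # stand-in GT poses (real or generated)
occluded = gt_pose.copy()
occluded[:, :9] = 0.0                                  # mask the first 3 joints as occluded
pred_pose = mlp_forward(params, occluded)
loss = occlusion_loss(pred_pose, gt_pose)
```

The network maps a flattened occluded pose to a flattened full pose of the same size, so the loss can be evaluated joint by joint against either real or generated ground truth.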

Further, in step S4, the length loss function is defined as follows: [equation not reproduced], where b denotes the length of a limb (spine, arm, leg, etc.), K is the total number of limbs in the sample, and the remaining two symbols denote the value predicted by the model and the GT value, respectively;

The action loss function is defined as follows: [equation not reproduced],

where a and b denote two actions at adjacent time intervals, and the two coordinate symbols denote, respectively, the coordinates predicted by the model from real GT samples and the coordinates predicted from generated samples.
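The two correction terms of step S4 can be sketched as below. The exact equations are not available in this text, so both forms are assumptions: the length loss is taken as the mean absolute difference between predicted and GT limb lengths over K limbs, and the action loss as the squared mismatch between the a-to-b change predicted from real samples and the change predicted from generated samples. The `BONES` skeleton and all function names are hypothetical.

```python
import numpy as np

# Hypothetical skeleton: (parent, child) joint index pairs, one per limb (K = 5).
BONES = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5)]

def bone_lengths(pose):
    """pose: (J, 3) joint coordinates -> (K,) limb lengths."""
    parents, children = zip(*BONES)
    return np.linalg.norm(pose[list(children)] - pose[list(parents)], axis=-1)

def length_loss(pred_pose, gt_pose):
    """Assumed form: mean absolute limb-length error over the K limbs."""
    return np.mean(np.abs(bone_lengths(pred_pose) - bone_lengths(gt_pose)))

def action_loss(real_a, real_b, gen_a, gen_b):
    """Assumed form: mismatch between the a->b change of real and generated predictions."""
    return np.mean(np.sum(((real_b - real_a) - (gen_b - gen_a)) ** 2, axis=-1))

rng = np.random.default_rng(0)
gt = rng.normal(size=(6, 3))          # stand-in GT pose with 6 joints
stretched = gt * 1.1                  # every limb 10% too long
shifted = gt + 0.5                    # rigid translation keeps limb lengths

len_ok = length_loss(shifted, gt)
len_bad = length_loss(stretched, gt)

# Two actions (a, b) one time step apart, predicted from real and from generated samples.
act_ok = action_loss(gt, gt + 0.2, shifted, shifted + 0.2)    # same change
act_bad = action_loss(gt, gt + 0.2, shifted, shifted + 0.4)   # different change
```

The length term ignores rigid translation and penalizes only limbs whose length drifts, while the action term compares changes between frames rather than absolute positions, which matches the stated goal of keeping the change of the action close to the real value.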

In a second aspect, an apparatus for 3D human body pose localization and restoration using a 2D image comprises a memory, a processor, and computer program instructions stored in the memory and executable on the processor, wherein the processor executes the computer program instructions to implement the above method for 3D human body pose localization and restoration using a 2D image.

In a third aspect, a computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the above method for 3D human body pose localization and restoration using a 2D image.

The invention provides a method and a device for carrying out 3D human body posture positioning and restoring by utilizing a 2D image, which have the following beneficial effects:

According to the invention, samples are generated by the normalizing flow technique, which enlarges the sample set so that it covers as many poses and scenes as possible and improves the generalization ability of the deep learning method during training. The missing depth information is obtained through dimension lifting and projection loss calculation. The occluded pose is used as the input of the occlusion processing network, which uses a multi-layer fully connected neural network to predict the occluded parts and outputs a complete, occlusion-free 3D pose. A length loss function and an action loss function are used for correction, keeping the generated limb lengths from deviating from normal values and bringing the change of the action closer to the real value, which improves the stability and accuracy of the 2D-to-3D pose restoration algorithm.

Drawings

Fig. 1 is a standardized flow operation logic diagram of a method for performing 3D human body posture positioning and restoring by using a 2D image.

Detailed Description

Embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are illustrative of the invention but are not intended to limit the scope of the invention.

As shown in fig. 1, a method for performing 3D human body posture positioning and restoring by using a 2D image includes:

In step S1, the normalizing flow is a technique used within a GAN to build more complex generative models: through a series of invertible, easily computed transformations, it converts an input random noise vector into data samples with the desired distribution. These transformations are typically parameterized and can be learned by an optimization algorithm. The GAN comprises a generator and a discriminator, each typically composed of a deep neural network;

The task of the discriminator is to distinguish whether the input data comes from the real dataset or was produced by the generator. The discriminator works as follows: it receives a data sample as input and, through a series of transformations, outputs a probability value indicating the likelihood that the sample is real data;

In step S1, the formula of the normalizing flow is as follows: [equation not reproduced], where the symbols denote, in order, the generated coordinates, the normalizing flow with its parameters, the estimated true position, a constant adjustment coefficient, and Gaussian noise whose mean and variance are 0 and 1, respectively;

In step S1, the formula of the likelihood estimation is as follows: [equation not reproduced], where the symbols denote, in order, the likelihood, the generated coordinates, the real (ground truth, GT) coordinates, the number of samples N, and the probability density function estimated by the normalizing flow. This density is the distribution to be learned and must be trained on samples. Each normalizing flow corresponds to one body part; there are generally 5 parts, corresponding to the trunk, the hands and the feet. θ denotes the parameters and has no physical meaning;

In step S2, the same 3D motion or pose produces several 2D projections, and a correct 3D reconstruction, when rotated and re-projected, should yield the different 2D images of that same motion. A loss function can therefore be defined and minimized. The 2D loss function is as follows: [equation not reproduced], where the symbols denote the original 2D coordinates and the calculated rotated 3D coordinates, P is the 3D-to-2D conversion matrix, R is the rotation matrix for the azimuth angle, the azimuth angle range is [-π, π], and the last symbol denotes the inverse matrix;

In step S2, a loss function is also calculated for the 3D coordinates to keep the conversion consistent. The 3D loss function is as follows: [equation not reproduced], where the symbols denote the reconstructed 3D coordinates and the rotated coordinates, P is the 3D-to-2D conversion matrix, R is the rotation matrix for the azimuth angle, the azimuth angle range is [-π, π], and the last symbol denotes the inverse matrix;

In step S3, the network structure is a fully connected network with 4 to 5 layers, the activation function is ReLU, and the loss function is defined as follows: [equation not reproduced], where coordinates with subscript m are the values predicted by the occlusion network and coordinates with subscript o are the ground truth (GT), which includes both real samples and generated samples;

S4, adjustment part: a length loss function is introduced for the generated limbs to correct the limb lengths predicted by the network and keep them from deviating from normal values; for two actions separated by a fixed time interval, an action loss function is defined during training to correct the generated actions so that the change between them is closer to the real value;

In step S4, the length loss function is defined as follows: [equation not reproduced], where b denotes the length of a limb (spine, arm, leg, etc.), K is the total number of limbs in the sample, and the remaining two symbols denote the value predicted by the model and the GT value, respectively;

The action loss function is defined as follows: [equation not reproduced], where a and b denote two actions at adjacent time intervals, and the two coordinate symbols denote, respectively, the coordinates predicted by the model from real GT samples and the coordinates predicted from generated samples.

An apparatus for 3D human body pose localization and restoration using a 2D image comprises a memory, a processor, and computer program instructions stored in the memory and executable on the processor, wherein the processor executes the computer program instructions to implement the above method for 3D human body pose localization and restoration using a 2D image.

A computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the above method for 3D human body pose localization and restoration using a 2D image.

In summary, as shown in fig. 1, the working principle of the method and the device for performing 3D human body posture positioning and restoring by using the 2D image is as follows:

S1, sample generation with a normalizing flow: a normalizing-flow generative model is trained on real samples. The formula of the normalizing flow is as follows: [equation not reproduced], where the symbols denote, in order, the generated coordinates, the normalizing flow with its parameters, the estimated true position, a constant adjustment coefficient, and Gaussian noise whose mean and variance are 0 and 1, respectively. Independent generative models are trained for the trunk and for the left and right legs and arms, and the negative log-likelihood between the generated samples and the real samples is minimized. The formula of the likelihood estimation is as follows: [equation not reproduced], where the symbols denote, in order, the likelihood, the generated coordinates, the real (ground truth, GT) coordinates, the number of samples, and the probability density function estimated by the normalizing flow. This density is the distribution to be learned and must be trained on samples. Each normalizing flow corresponds to one body part; there are generally 5 parts, corresponding to the trunk, the hands and the feet. The normalizing flow is a technique used within a GAN to build more complex generative models: through a series of invertible, easily computed transformations, it converts an input random noise vector into data samples with the desired distribution. These transformations are typically parameterized and can be learned by an optimization algorithm. The GAN comprises a generator and a discriminator, each typically composed of a deep neural network;

S2, dimension lifting and projection loss calculation: the 2D coordinates are lifted to 3D through a projection conversion matrix, the 3D object is rotated, and the rotated 3D object is re-projected to 2D through the conversion matrix. The same 3D motion or pose produces several 2D projections, and a correct 3D reconstruction, when rotated and re-projected, yields the different 2D images corresponding to that same motion, so a loss function is defined and minimized. The 2D loss function is as follows: [equation not reproduced], where the symbols denote the original 2D coordinates and the calculated rotated 3D coordinates, P is the 3D-to-2D conversion matrix, R is the rotation matrix for the azimuth angle, the azimuth angle range is [-π, π], and the last symbol denotes the inverse matrix. A loss function must also be calculated for the 3D coordinates to keep the conversion consistent. The 3D loss function is as follows: [equation not reproduced], where the symbols denote the reconstructed 3D coordinates and the rotated coordinates, P is the 3D-to-2D conversion matrix, R is the rotation matrix for the azimuth angle, the azimuth angle range is [-π, π], and the last symbol denotes the inverse matrix;

S3, occlusion processing network: a 3D pose with occlusion is obtained from the 2D pose through dimension lifting; the occluded pose is used as the input of the occlusion processing network, which outputs the predicted occlusion-free 3D pose; the prediction uses a multi-layer fully connected neural network. The network structure is a fully connected network with 4 to 5 layers, the activation function is ReLU, and the loss function is as follows: [equation not reproduced], where coordinates with subscript m are the values predicted by the occlusion network and coordinates with subscript o are the ground truth (GT), which includes both real samples and generated samples;

S4, adjustment part: a length loss function is introduced for the generated limbs to correct the limb lengths predicted by the network and keep them from deviating from normal values. The length loss function is defined as follows: [equation not reproduced], where b denotes the length of a limb (spine, arm, leg, etc.), K is the total number of limbs in the sample, and the remaining two symbols denote the value predicted by the model and the GT value, respectively. For two actions separated by a fixed time interval, an action loss function is defined during training to correct the generated actions so that the change between them is closer to the real value. The action loss function is defined as follows: [equation not reproduced], where a and b denote two actions at adjacent time intervals, and the two coordinate symbols denote, respectively, the coordinates predicted by the model from real GT samples and the coordinates predicted from generated samples.

The embodiments of the invention have been presented for purposes of illustration and description, and are not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims

1. A method for performing 3D human body posture positioning and restoring by using a 2D image, comprising the steps of:

2. The method for 3D human body pose localization and restoration using a 2D image according to claim 1, characterized in that, in step S1, the normalizing flow is a technique used within a GAN to build more complex generative models, in which an input random noise vector is converted into data samples with the desired distribution through a series of invertible, easily computed transformations; the transformations are typically parameterized and can be learned by an optimization algorithm; the GAN comprises a generator and a discriminator, each typically composed of a deep neural network;

3. The method for 3D human body pose localization and restoration using a 2D image according to claim 1, wherein in step S1 the formula of the normalizing flow is as follows: [equation not reproduced], where the symbols denote, in order, the generated coordinates, the normalizing flow with its parameters, the estimated true position, a constant adjustment coefficient, and Gaussian noise whose mean and variance are 0 and 1, respectively.

4. The method for 3D human body pose localization and restoration using a 2D image according to claim 1, wherein in step S1 the formula of the likelihood estimation is as follows: [equation not reproduced], where the symbols denote, in order, the likelihood, the generated coordinates, the real (ground truth, GT) coordinates, the number of samples, and the probability density function estimated by the normalizing flow, which is the distribution to be learned and must be trained on samples, wherein each normalizing flow corresponds to one body part, there being generally 5 parts, corresponding to the trunk, the hands and the feet.

5. The method for 3D human body pose localization and restoration using a 2D image according to claim 1, wherein in step S2 the same 3D motion or pose produces several 2D projections, and a correct 3D reconstruction, when rotated and re-projected, should yield the different 2D images of that same motion, so that a loss function can be defined and minimized, the 2D loss function being as follows: [equation not reproduced], where the symbols denote the original 2D coordinates and the calculated rotated 3D coordinates, P is the 3D-to-2D conversion matrix, R is the rotation matrix for the azimuth angle, the azimuth angle range is [-π, π], and the last symbol denotes the inverse matrix.

6. The method for 3D human body pose localization and restoration using a 2D image according to claim 5, wherein in step S2 a loss function is also calculated for the 3D coordinates to keep the conversion consistent, the 3D loss function being as follows: [equation not reproduced], where the symbols denote the reconstructed 3D coordinates and the rotated coordinates, P is the 3D-to-2D conversion matrix, R is the rotation matrix for the azimuth angle, the azimuth angle range is [-π, π], and the last symbol denotes the inverse matrix.

7. The method for 3D human body pose localization and restoration using a 2D image according to claim 1, wherein in step S3 the network structure is a fully connected network with 4 to 5 layers, the activation function is ReLU, and the loss function is defined as follows: [equation not reproduced], where coordinates with subscript m are the values predicted by the occlusion network and coordinates with subscript o are the ground truth (GT), which includes both real samples and generated samples.

8. The method for 3D human body pose localization and restoration using a 2D image according to claim 1, wherein in step S4 the length loss function is defined as follows: [equation not reproduced], where b denotes the length of a limb (spine, arm, leg, etc.), K is the total number of limbs in the sample, and the remaining two symbols denote the value predicted by the model and the GT value, respectively;

The action loss function is defined as follows: [equation not reproduced].

9. A device for 3D human body pose localization and restoration using a 2D image, comprising: a memory, a processor, and computer program instructions stored in the memory and executable on the processor, wherein the processor executes the computer program instructions to implement the method for 3D human body pose localization and restoration using a 2D image according to any one of claims 1-8.

10. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method for 3D human body pose localization and restoration using a 2D image according to any one of claims 1-8.