CN115471863A - Three-dimensional posture acquisition method, model training method and related equipment - Google Patents

Three-dimensional posture acquisition method, model training method and related equipment

Info

Publication number
CN115471863A
CN115471863A (application CN202210922155.5A)
Authority
CN
China
Prior art keywords
dimensional, model, training, human body, target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210922155.5A
Other languages
Chinese (zh)
Inventor
苗瑞
周波
蔡芳发
莫少锋
陈永刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen HQVT Technology Co Ltd
Original Assignee
Shenzhen HQVT Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen HQVT Technology Co Ltd filed Critical Shenzhen HQVT Technology Co Ltd
Priority to CN202210922155.5A priority Critical patent/CN115471863A/en
Publication of CN115471863A publication Critical patent/CN115471863A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects

Abstract

The invention provides a three-dimensional posture acquisition method, a model training method and related equipment. The three-dimensional posture acquisition method comprises the following steps: acquiring a two-dimensional image to be processed, and inputting it into a prediction model to obtain target two-dimensional coordinates of each key point of a human body predicted by the prediction model, wherein the prediction model is obtained by training on two-dimensional sample images in which at least some key points of the human body are occluded; converting each target two-dimensional coordinate into a corresponding target three-dimensional coordinate; and acquiring the three-dimensional posture of the human body according to the target three-dimensional coordinates. Because the prediction model is trained on two-dimensional sample images with occluded key points, the two-dimensional coordinates of a key point can be obtained accurately even when that key point is occluded in the two-dimensional image, so that an accurate three-dimensional posture is generated from the three-dimensional coordinates converted from the two-dimensional coordinates.

Description

Three-dimensional posture acquisition method, model training method and related equipment
Technical Field
The invention relates to the technical field of human body postures, in particular to a three-dimensional posture acquisition method, a model training method and related equipment.
Background
Human pose estimation is an important research area in computer vision, and it is crucial for machines to understand humans: when a robot can predict a person's body pose, it can interact with that person better.
Human body pose estimation includes two-dimensional and three-dimensional pose estimation. Two-dimensional human body pose estimation locates and identifies the two-dimensional coordinates of each key point of a human body and obtains a human skeleton from those coordinates. Three-dimensional human body pose estimation generates a three-dimensional pose from the three-dimensional coordinates of each key point. Key points are points representing parts of the human body, for example the hands, elbows, feet, facial features and the like.
In an exemplary technique, the three-dimensional coordinates of each key point of a human body are acquired from a two-dimensional image, and the three-dimensional pose of the human body is then generated from those coordinates. However, key points of the human body in the two-dimensional image may be occluded by objects or by other parts of the body, so the three-dimensional coordinates of the occluded key points cannot be obtained and the resulting three-dimensional pose is inaccurate.
Disclosure of Invention
The invention provides a three-dimensional posture acquisition method, a model training method and related equipment, which are used for solving the problem that the three-dimensional posture of a human body is inaccurate.
In one aspect, the present invention provides a method for obtaining a three-dimensional gesture, including:
acquiring a two-dimensional image to be processed, inputting the two-dimensional image to be processed into a prediction model, and obtaining target two-dimensional coordinates of each key point of a human body predicted by the prediction model, wherein the prediction model is obtained by training on two-dimensional sample images in which at least some key points of the human body are occluded;
converting each target two-dimensional coordinate into a corresponding target three-dimensional coordinate;
and acquiring the three-dimensional posture of the human body according to the target three-dimensional coordinates.
In an embodiment, said converting each of said target two-dimensional coordinates into corresponding target three-dimensional coordinates includes:
and inputting each target two-dimensional coordinate into a conversion model to obtain the target three-dimensional coordinate, corresponding to each target two-dimensional coordinate, output by the conversion model.
In another aspect, the present application further provides a model training method, including:
acquiring a sample data set, wherein the sample data set comprises a plurality of two-dimensional sample images and corresponding label data, at least some key points of a human body in the two-dimensional sample images are occluded, and the label data are the two-dimensional coordinates of the key points of the human body in the two-dimensional sample images;
and training a first preset model according to the sample data set to obtain a prediction model, wherein the prediction model is used for acquiring target two-dimensional coordinates of key points of the human body in the two-dimensional image to be processed.
In an embodiment, the training a first preset model according to the sample data set to obtain the prediction model includes:
inputting the two-dimensional sample image into a first sub-model of a first preset model to obtain two-dimensional coordinates to be processed of key points of the human body;
determining a difference value between the two-dimensional coordinate to be processed and the label data of the key point corresponding to the two-dimensional coordinate to be processed through a second sub-model of the first preset model;
stopping training the first preset model when the difference value is smaller than or equal to a preset threshold value, and determining the first preset model of which the training is stopped as the prediction model;
and when the difference value is larger than the preset threshold value, adjusting the loss function of the generative adversarial network of the first preset model, inputting the two-dimensional sample image into the first sub-model, and returning to the step of determining the difference value between the two-dimensional coordinates to be processed and the label data of the key point corresponding to those coordinates through the second sub-model of the first preset model.
In one embodiment, said adjusting the loss function of the generative adversarial network comprises:
acquiring first probability values corresponding to the to-be-processed two-dimensional coordinates of each key point in the two-dimensional sample image, and adjusting a loss function of the first sub-model according to each first probability value;
determining a probability value of a real two-dimensional image output by the first sub-model according to the second sub-model, and adjusting a loss function of the second sub-model according to the probability value, wherein the two-dimensional coordinates of key points of a human body in the real two-dimensional image are correct two-dimensional coordinates;
and adjusting the loss function of the generative adversarial network according to a preset mapping relation, the loss function of the first sub-model and the loss function of the second sub-model, wherein the preset mapping relation is the relation among the loss function of the first sub-model, the loss function of the second sub-model and the loss function of the generative adversarial network.
In an embodiment, after the training of the first preset model according to the sample data set to obtain the prediction model, the method further includes:
acquiring a plurality of training samples, wherein the training samples comprise two-dimensional coordinates to be trained and three-dimensional coordinates corresponding to the two-dimensional coordinates to be trained, and the two-dimensional coordinates to be trained are obtained by extracting two-dimensional images of a human body by the prediction model;
and training a second preset model according to each training sample to obtain a conversion model, wherein the conversion model is used for converting the two-dimensional coordinates into corresponding three-dimensional coordinates.
In another aspect, the present invention further provides an apparatus for obtaining a three-dimensional gesture, including:
the first acquisition module is used for acquiring a two-dimensional image to be processed and inputting it into a prediction model to obtain target two-dimensional coordinates of each key point of a human body predicted by the prediction model, wherein the prediction model is obtained by training on two-dimensional sample images in which at least some key points of the human body are occluded;
the conversion module is used for converting each target two-dimensional coordinate into a corresponding target three-dimensional coordinate;
the acquisition module is further used for acquiring the three-dimensional posture of the human body according to the target three-dimensional coordinates.
In another aspect, the present invention further provides a model training apparatus, including:
the second acquisition module is used for acquiring a sample data set, wherein the sample data set comprises a plurality of two-dimensional sample images and corresponding label data, and at least some key points of a human body in the two-dimensional sample images are occluded;
and the training module is used for training a first preset model according to the sample data set to obtain a prediction model, and the prediction model is used for acquiring target two-dimensional coordinates of key points of the human body in the two-dimensional image to be processed.
In another aspect, the present invention also provides an apparatus comprising: a memory and a processor;
the memory stores computer-executable instructions;
the processor executes the computer-executable instructions stored by the memory to cause the processor to perform the method for acquiring three-dimensional poses as described above or the method for model training as described above.
In another aspect, the present invention further provides a computer-readable storage medium, in which computer-executable instructions are stored, and the computer-executable instructions are executed by a processor to implement the three-dimensional pose acquisition method or the model training method.
According to the three-dimensional posture acquisition method, the model training method and the related equipment, the two-dimensional image is input into the prediction model to obtain the two-dimensional coordinates of each key point of the human body predicted by the prediction model. Because the prediction model is trained on two-dimensional sample images in which key points are occluded, the two-dimensional coordinates of a key point can be obtained accurately even when that key point is occluded in the two-dimensional image, so that an accurate three-dimensional posture is generated from the three-dimensional coordinates converted from the two-dimensional coordinates.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a schematic flow chart diagram illustrating a first embodiment of a method for obtaining a three-dimensional pose according to the present invention;
FIG. 2 is a schematic flow chart of a first embodiment of the model training method according to the present invention;
FIG. 3 is a flowchart illustrating a second embodiment of the model training method according to the present invention;
FIG. 4 is a schematic flowchart of a third embodiment of the model training method according to the present invention;
FIG. 5 is a schematic flow chart of a fourth embodiment of the model training method according to the present invention;
FIG. 6 is a schematic diagram of a model training process according to the present invention;
FIG. 7 is a block diagram of an apparatus for obtaining three-dimensional pose of the present invention;
FIG. 8 is a block diagram of a model training apparatus according to the present invention;
fig. 9 is a schematic structural diagram of the three-dimensional posture acquisition device/model training device according to the present invention.
With the foregoing drawings in mind, certain embodiments of the disclosure have been shown and described in more detail below. The drawings and written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the disclosed concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The following describes the technical solutions of the present invention and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a first embodiment of a method for acquiring a three-dimensional pose according to the present invention, the method for acquiring a three-dimensional pose includes the following steps:
step S101, a two-dimensional image to be processed is obtained, the two-dimensional image to be processed is input into a prediction model, target two-dimensional coordinates of each key point of the human body predicted by the prediction model are obtained, the prediction model is obtained according to training of each two-dimensional sample image, and at least part of key points of the human body in the two-dimensional sample image are shielded.
In this embodiment, the execution subject is a three-dimensional posture acquisition device. For convenience of description, this device is hereinafter referred to as the first device.
The first device obtains a two-dimensional image to be processed, where the two-dimensional image contains a human body. The first device is provided with the prediction model. The prediction model may be a neural network model or a generative adversarial network model.
The first device inputs the two-dimensional image to be processed into the prediction model. Because part of the human body in the two-dimensional image may be covered by clothes or objects, some key points of the human body cannot be determined accurately. For example, if a user in the two-dimensional image holds a fan that blocks the mouth, an extracted key point at that position cannot be confirmed to be the mouth or the chin. The prediction model is obtained by training on two-dimensional sample images in which at least some key points of the human body are occluded. Therefore, even if a key point of the human body is occluded in the two-dimensional image, the prediction model can predict the occluded key point from the image.
And then extracting the two-dimensional coordinates of the key points, wherein the extracted two-dimensional coordinates are defined as target two-dimensional coordinates. The key points are, for example, hands, feet, forehead, mouth, ears, etc. of the human body.
And step S102, converting each target two-dimensional coordinate into a corresponding target three-dimensional coordinate.
After obtaining the target two-dimensional coordinates, the first device converts each target two-dimensional coordinate into the corresponding target three-dimensional coordinate.
In one example, the target two-dimensional coordinates may be converted into the target three-dimensional coordinates through the conversion relation between a human body two-dimensional coordinate system and a human body three-dimensional coordinate system; this conversion relation can be obtained through deep learning.
In another example, each target two-dimensional coordinate may be converted into the corresponding target three-dimensional coordinate by a conversion model. The conversion model may be a neural network model.
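The patent does not specify the internal structure of the conversion model; a two-dimensional-to-three-dimensional "lifting" step is often realized with a small fully connected network. The NumPy sketch below (untrained random weights, all function names hypothetical) only illustrates the input and output shapes such a conversion model would have:

```python
import numpy as np

def init_lifting_model(num_keypoints, hidden=64, seed=0):
    """Randomly initialize a tiny 2-layer MLP that lifts 2D keypoints to 3D.
    Input: flattened (x, y) pairs; output: flattened (x, y, z) triples."""
    rng = np.random.default_rng(seed)
    d_in, d_out = 2 * num_keypoints, 3 * num_keypoints
    return {
        "W1": rng.normal(0, 0.1, (d_in, hidden)), "b1": np.zeros(hidden),
        "W2": rng.normal(0, 0.1, (hidden, d_out)), "b2": np.zeros(d_out),
    }

def lift_2d_to_3d(model, coords_2d):
    """coords_2d: (num_keypoints, 2) -> (num_keypoints, 3)."""
    x = coords_2d.reshape(-1)
    h = np.maximum(0.0, x @ model["W1"] + model["b1"])  # ReLU hidden layer
    out = h @ model["W2"] + model["b2"]
    return out.reshape(-1, 3)

model = init_lifting_model(num_keypoints=17)
coords_3d = lift_2d_to_3d(model, np.zeros((17, 2)))
print(coords_3d.shape)  # (17, 3)
```

A trained conversion model would learn `W1`, `W2` from the training samples of two-dimensional coordinates and their corresponding three-dimensional coordinates described later.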
Step S103, acquiring the three-dimensional posture of the human body according to the target three-dimensional coordinates.
After obtaining the target three-dimensional coordinates, the first device generates the three-dimensional posture of the human body based on them. For example, the first device may input the target three-dimensional coordinates into a convolutional neural network and obtain the three-dimensional posture output by the convolutional neural network. The three-dimensional posture can be used for recognizing human body actions, and such recognition can serve application scenarios such as fall-detection alarms.
In this embodiment, the two-dimensional image is input into the prediction model to obtain the two-dimensional coordinates of each key point of the human body predicted by the model. Since the prediction model is trained on two-dimensional sample images in which key points are occluded, the two-dimensional coordinates of a key point can be obtained accurately even when that key point is occluded in the two-dimensional image, so that an accurate three-dimensional posture is generated from the three-dimensional coordinates converted from those two-dimensional coordinates.
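Putting steps S101 to S103 together, the overall flow of this embodiment can be sketched as follows. All three functions are hypothetical stubs standing in for the trained models described above, not the patent's actual networks:

```python
import numpy as np

def predict_2d_keypoints(image, num_keypoints=17):
    """Stub for the prediction model (step S101): in the patent this is a
    trained GAN generator; here we return placeholder 2D coordinates."""
    rng = np.random.default_rng(0)
    return rng.uniform(0, image.shape[0], (num_keypoints, 2))

def convert_to_3d(coords_2d):
    """Stub for step S102: the conversion model lifts each (x, y) to (x, y, z)."""
    depth = np.zeros((coords_2d.shape[0], 1))  # placeholder depth values
    return np.hstack([coords_2d, depth])

def build_3d_pose(coords_3d, skeleton_edges):
    """Step S103: assemble the pose as keypoints plus skeleton connectivity."""
    return {"joints": coords_3d, "edges": skeleton_edges}

image = np.zeros((256, 256, 3))                    # two-dimensional image to be processed
coords_2d = predict_2d_keypoints(image)            # S101
coords_3d = convert_to_3d(coords_2d)               # S102
pose = build_3d_pose(coords_3d, [(0, 1), (1, 2)])  # S103
print(pose["joints"].shape)  # (17, 3)
```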
The invention also provides a model training method.
Referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the model training method of the present invention, and the method further includes:
step S201, a sample data set is obtained, the sample data set comprises a plurality of two-dimensional sample images and corresponding label data, at least part of key points of the human body in the two-dimensional sample images are shielded, and the label data are two-dimensional coordinates of the key points of the human body in the two-dimensional sample images.
In this embodiment, the execution subject is a model training device; for convenience of description, it is hereinafter referred to as the second device.
The second device acquires a sample data set. The sample data set includes a plurality of two-dimensional sample images and corresponding label data. At least some key points of the human body in the two-dimensional sample images are occluded, and the label data are the two-dimensional coordinates of the key points of the human body in the two-dimensional sample images.
The two-dimensional sample images may be obtained from a wearable device or from data published on the web. After the two-dimensional sample images are obtained, technicians label the two-dimensional coordinates of the key points of the human body on each image to obtain the corresponding label data.
Step S202, training a first preset model according to the sample data set to obtain a prediction model, wherein the prediction model is used for obtaining target two-dimensional coordinates of key points of the human body in the two-dimensional image to be processed.
The first preset model is provided with a generative adversarial network. The second device trains the generative adversarial network with the two-dimensional sample images to obtain the prediction model. The prediction model is used for acquiring the target two-dimensional coordinates of the key points of the human body in the two-dimensional image to be processed.
The first preset model comprises a first sub-model, which may be a generative model; that is, the generative adversarial network comprises a generative model. The generative model comprises three convolutional layers and three pooling layers, and the convolution kernel size of the convolutional layers may be 3. The generative model is used for extracting the two-dimensional coordinates of the key points of the human body in the two-dimensional sample image: specifically, it extracts features from the two-dimensional sample image through the three convolutional layers, and then performs dimension-reduction operations on the extracted features through the three pooling layers, thereby reducing redundant information in the features and facilitating data fitting. Since the prediction model is obtained by training the first preset model, the prediction model also comprises the generative model.
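As a rough illustration of the described generator structure (three convolutional stages with small kernels, each followed by a pooling layer that reduces dimensionality), the NumPy sketch below traces how a feature map shrinks through the three stages; the uniform kernel is a placeholder, not trained weights:

```python
import numpy as np

def conv3x3(x, kernel):
    """Valid 3x3 convolution over a 2D feature map (no padding, stride 1)."""
    h, w = x.shape[0] - 2, x.shape[1] - 2
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + 3, j:j + 3] * kernel)
    return out

def maxpool2(x):
    """2x2 max pooling: the dimension-reduction step after each convolution."""
    h, w = x.shape[0] // 2, x.shape[1] // 2
    return x[:2 * h, :2 * w].reshape(h, 2, w, 2).max(axis=(1, 3))

# Three conv + pool stages, as the generator is described: each stage
# extracts features, then pooling strips redundant information.
x = np.random.default_rng(0).normal(size=(64, 64))
for _ in range(3):
    x = maxpool2(conv3x3(x, np.full((3, 3), 1.0 / 9)))
print(x.shape)  # (6, 6)
```

A real generator would additionally have learned per-layer kernels, nonlinearities, and a head that regresses the keypoint coordinates; none of that is specified in the patent text.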
In this embodiment, the second device obtains the sample data set and trains the first preset model with it to obtain the prediction model. Because key points of the human body are occluded in the two-dimensional sample images of the sample data set, the prediction model can extract the two-dimensional coordinates of occluded key points even when key points of the human body in the two-dimensional image to be processed are occluded.
Referring to fig. 3, fig. 3 is a flowchart illustrating a second embodiment of the model training method of the present invention, based on the first embodiment, step S202 includes:
step S301, inputting the two-dimensional sample image into a first sub-model of a first preset model to obtain two-dimensional coordinates to be processed of key points of the human body.
In this embodiment, the first preset model includes a first sub-model and a second sub-model. The first sub-model is a generative model and the second sub-model is a discriminator; that is, the generative adversarial network comprises a generative model and a discriminator, and the network can be trained through the game between the generative model and the discriminator.
Specifically, the two-dimensional sample image is input into the generative adversarial network, that is, into the generative model, to obtain the to-be-processed two-dimensional coordinates of the key points of the human body.
Step S302, determining a difference value between the two-dimensional coordinate to be processed and the label data of the key point corresponding to the two-dimensional coordinate to be processed through a second sub-model of the first preset model.
The label data corresponding to the two-dimensional sample image are input into the discriminator. The extracted to-be-processed two-dimensional coordinates are also input into the discriminator of the generative adversarial network, and the discriminator discriminates the difference between the to-be-processed two-dimensional coordinates and the label data of the corresponding key point, where the label data are the labeled two-dimensional coordinates. The discrimination result can be represented by a difference value: the larger the difference value, the larger the difference between the two and the less accurate the two-dimensional coordinates extracted by the generative model.
It should be noted that the discriminator compares the to-be-processed two-dimensional coordinates and the labeled two-dimensional coordinates of the same key point; for example, it discriminates the difference between the to-be-processed two-dimensional coordinates of the hand and the labeled two-dimensional coordinates of the hand. After the discriminator obtains a difference value for each key point, a total difference value can be obtained through a weighted calculation based on the weight of each key point and its corresponding difference value. The total difference value is used to determine whether the generative model has finished training; that is, the second device judges whether the weighted difference value is less than or equal to a preset threshold value.
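The weighted total-difference computation described above can be sketched as follows. The Euclidean distance metric and the example weights are assumptions, since the patent does not fix either:

```python
import numpy as np

def keypoint_difference(pred_2d, label_2d):
    """Per-keypoint difference between the to-be-processed 2D coordinates and
    the labeled 2D coordinates; Euclidean distance is assumed as the metric."""
    return np.linalg.norm(pred_2d - label_2d, axis=-1)

def total_difference(pred_2d, label_2d, weights):
    """Weighted total difference value used to decide whether training stops."""
    per_kp = keypoint_difference(pred_2d, label_2d)
    return float(np.sum(weights * per_kp))

pred = np.array([[1.0, 1.0], [4.0, 5.0]])   # to-be-processed coordinates
label = np.array([[1.0, 1.0], [1.0, 1.0]])  # labeled coordinates
weights = np.array([0.5, 0.5])              # assumed per-keypoint weights
d = total_difference(pred, label, weights)
print(d)  # 2.5: keypoint 0 differs by 0.0, keypoint 1 by 5.0
```

Training would stop once this total difference value drops to the preset threshold or below.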
Step S303, when the difference value is less than or equal to the preset threshold value, stopping training the first preset model, and determining the first preset model of which the training is stopped as the prediction model.
When the difference value is less than or equal to the preset threshold value, it can be determined that the two-dimensional coordinates of the key points extracted by the generative model are close to the real two-dimensional coordinates of the key points. At this point, training of the first preset model stops, and the first preset model whose training has stopped can serve as the prediction model.
In addition, the discriminator may not be able to discriminate the difference between the two-dimensional coordinates to be processed and the labeled two-dimensional coordinates. For example, the probability value of the discriminator for discriminating that the two-dimensional coordinate to be processed is the labeled two-dimensional coordinate is 0.5, and the probability value of the discriminator for discriminating that the two-dimensional coordinate to be processed is not the labeled two-dimensional coordinate is also 0.5. When the discriminator cannot discriminate the difference between the two-dimensional coordinate to be processed and the marked two-dimensional coordinate, the difference value corresponding to the two-dimensional coordinate to be processed and the marked two-dimensional coordinate is a preset value, and the preset value is smaller than a preset threshold value.
Step S304, when the difference value is greater than the preset threshold value, adjusting the loss function of the generative adversarial network of the first preset model, and inputting the two-dimensional sample image into the first sub-model.
When the difference value is greater than the preset threshold value, the two-dimensional coordinates extracted by the generative model are not yet accurate, so the generative adversarial network needs further training. Accordingly, the second device adjusts the loss function of the generative adversarial network and then inputs a two-dimensional sample image into the first sub-model, so that the step of determining the difference value between the to-be-processed two-dimensional coordinates and the label data of the corresponding key point through the second sub-model of the first preset model is executed again; that is, training of the generative adversarial network continues. The two-dimensional sample image input this time may be the same as or different from the one input last time.
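The stop/continue logic of steps S301 to S304 can be sketched as a loop. The stand-in callables below are toy assumptions (not the patent's networks) whose difference value halves on every loss adjustment, purely so the loop terminates:

```python
def train_until_converged(generator_step, discriminator_difference,
                          adjust_loss, threshold, max_iters=1000):
    """Skeleton of steps S301-S304: generate coordinates, measure the
    difference against the labels, stop when it falls below the threshold,
    otherwise adjust the loss function and continue training."""
    diff = None
    for i in range(max_iters):
        coords = generator_step()                # S301: first sub-model output
        diff = discriminator_difference(coords)  # S302: second sub-model
        if diff <= threshold:                    # S303: stop, model is trained
            return i, diff
        adjust_loss(diff)                        # S304: adjust the GAN loss
    return max_iters, diff

# Toy stand-ins: the difference shrinks on every loss adjustment.
state = {"diff": 8.0}
iters, final = train_until_converged(
    generator_step=lambda: None,
    discriminator_difference=lambda _: state["diff"],
    adjust_loss=lambda _: state.update(diff=state["diff"] / 2),
    threshold=1.0,
)
print(iters, final)
```

In the patent's formulation the real termination signal is the weighted difference value computed by the discriminator, and the loss adjustment is the one detailed in steps S401 onward.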
In this embodiment, the prediction model is produced through the interplay of the discriminator and the generative model in the first preset model, which improves the accuracy of extracting the two-dimensional coordinates of the key points.
Referring to fig. 4, fig. 4 is a third embodiment of the model training method of the present invention; based on the second embodiment, step S304 includes:
step S401, obtaining first probability values corresponding to the to-be-processed two-dimensional coordinates of each key point in the two-dimensional sample image, and adjusting a loss function of the first sub-model according to each first probability value.
In this embodiment, the loss function of the generative model in the first preset model may be a cross-entropy loss function:
L_G = -(1/N) · Σ_{i=1}^{N} Σ_{k=1}^{K} y_{i,k} · log(P_{i,k})
where N is the number of two-dimensional sample images, K is the number of keypoint categories, P_{i,k} is the predicted probability that the i-th keypoint belongs to category k, y_{i,k} is the ground-truth label of the i-th keypoint (derived from the labeled two-dimensional coordinates), and L_G is the loss function of the generative model.
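The cross-entropy loss above can be sketched in a few lines of numpy (an illustrative implementation; the array shapes and the epsilon guard are our assumptions):

```python
import numpy as np

def generator_loss(P: np.ndarray, Y: np.ndarray) -> float:
    """Cross-entropy loss L_G of the generative model.
    P[i, k]: predicted probability that keypoint i belongs to category k.
    Y[i, k]: one-hot ground-truth label for keypoint i.
    Sums over the K categories and averages over the N rows."""
    N = P.shape[0]
    eps = 1e-12                      # guard against log(0)
    return float(-np.sum(Y * np.log(P + eps)) / N)
```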
Accordingly, the second device obtains the probability value corresponding to each two-dimensional coordinate to be processed output by the generative model; this probability value is defined as the first probability value. Based on the formula of the cross-entropy loss function, the second device obtains the mapping relationship (the first mapping relationship) between the loss function of the generative model and each first probability value, and can adjust the loss function of the generative model based on the first mapping relationship and each first probability value.
Step S402, determining the probability value of the first sub-model outputting a real two-dimensional image according to the second sub-model, and adjusting the loss function of the second sub-model according to the probability value, wherein the two-dimensional coordinates of key points of the human body in the real two-dimensional image are correct two-dimensional coordinates.
The discriminator determines the probability value that the generative adversarial network outputs a real two-dimensional image by comparing the two-dimensional coordinates to be processed with the labeled two-dimensional coordinates. The two-dimensional coordinates of the key points of the human body in the real two-dimensional image are the correct two-dimensional coordinates; that is, the real two-dimensional coordinates can be understood as the labeled two-dimensional coordinates of the key points.
For example, if the first probability value with which the discriminator judges the two-dimensional coordinate to be processed to be the labeled two-dimensional coordinate is 0.4, the probability value that the generative adversarial network outputs a real two-dimensional image is also 0.4. Alternatively, this probability value can be obtained from the weight corresponding to each key point together with each first probability value.
After the probability value that the generative adversarial network outputs a real two-dimensional image is obtained (defined as the second probability value), the loss function of the discriminator is adjusted according to the second probability value. The loss function L_D of the discriminator is:

L_D = log(1 - D(G(z)))

where D(G(z)) is the second probability value.
The second device obtains a mapping relationship (second mapping relationship) between the loss function of the discriminator and the second probability value based on a formula of the loss function of the discriminator, and adjusts the loss function of the discriminator based on the second mapping relationship and the obtained second probability value.
Step S403, adjusting the loss function of the generative adversarial network according to a preset mapping relationship, the loss function of the first sub-model, and the loss function of the second sub-model, where the preset mapping relationship is the relationship among the loss function of the first sub-model, the loss function of the second sub-model, and the loss function of the generative adversarial network.
In the present embodiment, the loss function L_GAN of the generative adversarial network is:

L_GAN = (1 - γ)·L_G + γ·L_D

where γ adjusts the relative importance of the two loss functions and may be a fixed value.
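The two loss terms above can be combined in a short sketch (γ = 0.5 below is just an illustrative choice of the fixed weight):

```python
import numpy as np

def discriminator_loss(d_gz: float) -> float:
    """L_D = log(1 - D(G(z))), where d_gz is the second probability value:
    the discriminator's probability that the generated coordinates are real."""
    return float(np.log(1.0 - d_gz))

def gan_loss(l_g: float, l_d: float, gamma: float = 0.5) -> float:
    """L_GAN = (1 - gamma) * L_G + gamma * L_D; gamma is a fixed weight
    balancing the importance of the two loss terms."""
    return (1.0 - gamma) * l_g + gamma * l_d
```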
Based on the loss function of the generative adversarial network, the mapping relationship (the third mapping relationship) among the loss function of the discriminator, the loss function of the generative model, and the loss function of the generative adversarial network can be obtained.
After the loss function of the discriminator and the loss function of the generative model are obtained, the target loss function of the generative adversarial network can be determined through the third mapping relationship together with these two loss functions, and the loss function of the generative adversarial network is then adjusted to this target loss function.
In this embodiment, the second device trains a prediction model with high extraction accuracy by adjusting the loss functions of the generative model and the discriminator and thereby adjusting the loss function of the generative adversarial network.
Referring to fig. 5, fig. 5 is a fourth embodiment of the model training method according to the present invention, and based on any one of the first to third embodiments, after step S202, the method further includes:
step S501, a plurality of training samples are obtained, where each training sample includes two-dimensional coordinates to be trained and the corresponding three-dimensional coordinates, the two-dimensional coordinates to be trained being extracted from a two-dimensional image of a human body by the prediction model.
In this embodiment, the three-dimensional coordinates of key points in a two-dimensional image may be labeled by means of a motion-capture tracker and a wearable IMU device to obtain a training sample; that is, the training sample includes the two-dimensional coordinates to be trained and the three-dimensional coordinates labeled for them.
Step S502, training the second preset model according to each training sample to obtain a conversion model, wherein the conversion model is used for converting the two-dimensional coordinates into corresponding three-dimensional coordinates.
The second device trains the second preset model based on each training sample to obtain the conversion model, which converts two-dimensional coordinates into corresponding three-dimensional coordinates. The second preset model may be a neural network model comprising a plurality of residual networks. Each residual network contains a fully connected layer, and the output of the fully connected layer is combined with the output of the previous residual network in a cascade, which can prevent information from dissipating. For example, if the second preset model includes three residual networks, the data output by the first residual network (the output of its fully connected layer) is input into the second residual network, and the data output by the second residual network, cascaded with the data output by the first residual network, is input into the third residual network. The loss function of the second preset model may be a mean square error (MSE) loss function. The loss function L_2D/3D of the second preset model is:
L_2D/3D = (1/N) · Σ_{i=1}^{N} (Y_i - y_i)²
where Y_i is the real labeled 3D feature point (the labeled three-dimensional coordinates) and y_i is the predicted 3D feature point (the predicted three-dimensional coordinates).
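A minimal numpy sketch of the MSE loss and of one cascaded residual block as described above (interpreting "combined in a cascade" as concatenation is our assumption, as are the ReLU activation and array shapes):

```python
import numpy as np

def mse_loss(Y_true: np.ndarray, Y_pred: np.ndarray) -> float:
    """L_2D/3D: mean square error between labeled 3D keypoints Y_i
    and predicted 3D keypoints y_i."""
    return float(np.mean((Y_true - Y_pred) ** 2))

def residual_block(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """One fully connected layer whose output is concatenated with its
    input, so features from earlier blocks are carried forward."""
    h = np.maximum(0.0, x @ W + b)   # fully connected layer with ReLU
    return np.concatenate([x, h], axis=-1)
```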
It should be noted that the two-dimensional coordinates to be trained are extracted by the prediction model from two-dimensional images of a human body; that is, the prediction model and the conversion model can be trained as one integral model. The loss function L_sum of the integral model is:

L_sum = L_GAN + L_2D/3D
Referring to fig. 6, fig. 6 shows the overall training flow of the generative model, the discriminant model (the discriminator), and the conversion model (the 2D/3D feature point conversion model). Specifically, a two-dimensional sample image is input into the generative model to obtain the two-dimensional coordinates to be processed of the human body key points, and the loss function L_G of the generative model is adjusted based on the two-dimensional coordinates to be processed and the labeled two-dimensional coordinates (the real label data). The two-dimensional coordinates to be processed and the labeled two-dimensional coordinates are then input into the discriminant model, which outputs True/False, i.e., the probability value (True) that the two-dimensional coordinate to be processed is the labeled two-dimensional coordinate and the probability value (False) that it is not. Based on this output, the loss function L_D of the discriminant model and the loss function of the generative model are adjusted, and the loss function of the generative adversarial network is finally adjusted accordingly. After training of the generative adversarial network is completed, the two-dimensional coordinates extracted by the generative model, together with the three-dimensional features labeled for those two-dimensional coordinates, are input into the conversion model; the conversion model outputs three-dimensional coordinates and is trained on that basis. For the training of each model and the adjustment of each loss function, refer to the descriptions above, which are not repeated here.
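Putting the pieces of fig. 6 together, the whole-model objective L_sum can be sketched end to end (a self-contained illustration; the argument shapes, the fixed γ, and the small epsilon guards are our additions, not the patent's implementation):

```python
import numpy as np

def joint_loss(P, Y, d_gz, Y3_true, Y3_pred, gamma=0.5):
    """L_sum = L_GAN + L_2D/3D: the objective when the prediction model
    (the GAN) and the conversion model are trained as one integral model.
    P, Y: generator keypoint probabilities and one-hot labels;
    d_gz: discriminator probability that generated coordinates are real;
    Y3_true, Y3_pred: labeled and predicted three-dimensional coordinates."""
    l_g = float(-np.sum(Y * np.log(P + 1e-12)) / P.shape[0])   # generator cross-entropy L_G
    l_d = float(np.log(1.0 - d_gz + 1e-12))                    # discriminator loss L_D
    l_gan = (1.0 - gamma) * l_g + gamma * l_d                  # combined GAN loss L_GAN
    l_2d3d = float(np.mean((Y3_true - Y3_pred) ** 2))          # conversion-model MSE L_2D/3D
    return l_gan + l_2d3d
```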
In addition, the three-dimensional coordinates labeled for the two-dimensional coordinates to be trained are obtained from two-dimensional images of the human body acquired by a plurality of image acquisition modules, each capturing the human body from a different viewing angle. For example, in a laboratory environment, scenes from 4 viewpoints are recorded simultaneously by 4 high-definition cameras, and accurate three-dimensional coordinates of the human body key points are acquired by a MoCap (motion capture) system. The image acquisition modules may be external devices of the second device or part of the second device.
In this embodiment, the second device trains the second preset model through a plurality of training samples to obtain a conversion model, so as to accurately obtain the three-dimensional coordinates of the key points through the conversion model.
The present invention also provides an apparatus for acquiring a three-dimensional posture, and referring to fig. 7, the apparatus 700 for acquiring a three-dimensional posture includes:
the first obtaining module 710 is configured to obtain a two-dimensional image to be processed, input the two-dimensional image to be processed into a prediction model, and obtain target two-dimensional coordinates of each key point of a human body predicted by the prediction model, where the prediction model is obtained by training according to each two-dimensional sample image, and at least part of the key points of the human body in the two-dimensional sample image are occluded;
a conversion module 720, configured to convert each target two-dimensional coordinate into a corresponding target three-dimensional coordinate;
the first obtaining module 710 is configured to obtain a three-dimensional posture of the human body according to the target three-dimensional coordinates.
In an embodiment, the apparatus 700 for acquiring three-dimensional gesture further includes:
and the input module is used for inputting the two-dimensional coordinates of each target into the conversion model to obtain the three-dimensional characteristics of the target corresponding to the two-dimensional coordinates of each target output by the conversion model.
The present invention also provides a model training apparatus, and referring to fig. 8, a model training apparatus 800 includes:
a second obtaining module 810, configured to obtain a sample data set, where the sample data set includes multiple two-dimensional sample images and corresponding tag data, at least part of key points of a human body in the two-dimensional sample images are occluded, and the tag data is two-dimensional coordinates of the key points of the human body in the two-dimensional sample images;
the training module 820 is configured to train the first preset model according to the sample data set to obtain a prediction model, where the prediction model is used to obtain target two-dimensional coordinates of key points of a human body in a two-dimensional image to be processed.
In one embodiment, the model training apparatus 800 further comprises:
the input module is used for inputting the two-dimensional sample image into a first sub-model of a first preset model to obtain two-dimensional coordinates to be processed of key points of the human body;
the determining module is used for determining a difference value between the two-dimensional coordinate to be processed and the label data of the key point corresponding to the two-dimensional coordinate to be processed through a second sub-model of the first preset model;
a training module 820, configured to stop training the first preset model when the difference value is less than or equal to a preset threshold, and determine the first preset model with the training stopped as the prediction model;
and the adjusting module is used for adjusting the loss function of the first preset model, which is a generative adversarial network, when the difference value is larger than the preset threshold value, inputting the two-dimensional sample image into the first sub-model, and returning to the step of determining the difference value between the two-dimensional coordinate to be processed and the label data of the key point corresponding to the two-dimensional coordinate to be processed through the second sub-model of the first preset model.
In one embodiment, the model training device 800 further comprises:
a second obtaining module 810, configured to obtain first probability values corresponding to the to-be-processed two-dimensional coordinates of each key point in the two-dimensional sample image, and adjust a loss function of the first sub-model according to each first probability value;
the output module is used for determining the probability value of the first sub-model for outputting a real two-dimensional image according to the second sub-model and adjusting the loss function of the second sub-model according to the probability value, wherein the two-dimensional coordinates of key points of a human body in the real two-dimensional image are correct two-dimensional coordinates;
and the adjusting module is used for adjusting the loss function of the generative adversarial network according to a preset mapping relation, the loss function of the first sub-model and the loss function of the second sub-model, wherein the preset mapping relation is the relation among the loss function of the first sub-model, the loss function of the second sub-model and the loss function of the generative adversarial network.
In one embodiment, the model training device 800 further comprises:
the second obtaining module 810 is configured to obtain a plurality of training samples, where the training samples include two-dimensional coordinates to be trained and three-dimensional coordinates corresponding to the two-dimensional coordinates to be trained, and the two-dimensional coordinates to be trained are obtained by extracting a two-dimensional image of a human body by using the prediction model;
and the training module 820 is configured to train the second preset model according to each training sample to obtain a conversion model, where the conversion model is configured to convert the two-dimensional coordinates into corresponding three-dimensional coordinates.
FIG. 9 is a hardware block diagram of a three-dimensional pose acquisition device/model training device according to an exemplary embodiment.
The three-dimensional pose acquisition device/model training device 900 may include: a processor 91 (e.g., a CPU), a memory 92, and a transceiver 93. Those skilled in the art will appreciate that the configuration shown in FIG. 9 does not constitute a limitation of the three-dimensional pose acquisition device/model training device, which may include more or fewer components than shown, combine certain components, or arrange the components differently. The memory 92 may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
The processor 91 may call a computer program stored in the memory 92 or execute instructions to perform all or part of the steps of the above-described three-dimensional pose acquisition method or all or part of the steps of the model training method.
The transceiver 93 is used for receiving and transmitting information from and to an external device.
A non-transitory computer-readable storage medium in which instructions (computer-executable instructions) are executed by a processor of a three-dimensional pose acquisition apparatus, so that the three-dimensional pose acquisition apparatus can execute the above-described three-dimensional pose acquisition method.
A non-transitory computer readable storage medium in which instructions (computer-executable instructions) are executed by a processor of a model training apparatus, enabling the model training apparatus to perform the above-described model training method.
A computer program product comprising a computer program which, when executed by a processor of a three-dimensional pose acquisition device, enables the three-dimensional pose acquisition device to perform the above-mentioned three-dimensional pose acquisition method.
A computer program product comprising a computer program which, when executed by a processor of a model training apparatus, enables the model training apparatus to perform the above-described model training method.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for acquiring a three-dimensional posture is characterized by comprising the following steps:
acquiring a two-dimensional image to be processed, inputting the two-dimensional image to be processed into a prediction model to obtain target two-dimensional coordinates of each key point of a human body predicted by the prediction model, wherein the prediction model is obtained by training according to each two-dimensional sample image, and at least part of key points of the human body in the two-dimensional sample image are shielded;
converting each target two-dimensional coordinate into a corresponding target three-dimensional coordinate;
and acquiring the three-dimensional posture of the human body according to the three-dimensional coordinates of the targets.
2. The method for acquiring a three-dimensional gesture according to claim 1, wherein the converting each target two-dimensional coordinate into a corresponding target three-dimensional coordinate comprises:
and inputting each target two-dimensional coordinate into a conversion model to obtain a target three-dimensional feature corresponding to each target two-dimensional coordinate output by the conversion model.
3. A method of model training, comprising:
acquiring a sample data set, wherein the sample data set comprises a plurality of two-dimensional sample images and corresponding label data, at least part of key points of a human body in the two-dimensional sample images are shielded, and the label data are two-dimensional coordinates of the key points of the human body in the two-dimensional sample images;
and training a first preset model according to the sample data set to obtain a prediction model, wherein the prediction model is used for acquiring target two-dimensional coordinates of key points of the human body in the two-dimensional image to be processed.
4. The method of claim 3, wherein the training a first predetermined model according to the sample data set to obtain the predictive model comprises:
inputting the two-dimensional sample image into a first sub-model of a first preset model to obtain two-dimensional coordinates to be processed of key points of the human body;
determining a difference value between the two-dimensional coordinate to be processed and the label data of the key point corresponding to the two-dimensional coordinate to be processed through a second sub-model of the first preset model;
stopping training the first preset model when the difference value is smaller than or equal to a preset threshold value, and determining the first preset model of which the training is stopped as the prediction model;
and when the difference value is larger than a preset threshold value, adjusting a loss function of the first preset model, which is a generative adversarial network, inputting the two-dimensional sample image into the first sub-model, and returning to execute the step of determining the difference value between the two-dimensional coordinate to be processed and the label data of the key point corresponding to the two-dimensional coordinate to be processed through the second sub-model of the first preset model.
5. The model training method of claim 4, wherein the adjusting the loss function of the generative adversarial network comprises:
acquiring first probability values corresponding to-be-processed two-dimensional coordinates of each key point in the two-dimensional sample image, and adjusting a loss function of the first sub-model according to each first probability value;
determining a probability value of a real two-dimensional image output by the first sub-model according to the second sub-model, and adjusting a loss function of the second sub-model according to the probability value, wherein the two-dimensional coordinates of key points of a human body in the real two-dimensional image are correct two-dimensional coordinates;
and adjusting the loss function of the generative adversarial network according to a preset mapping relation, the loss function of the first sub-model and the loss function of the second sub-model, wherein the preset mapping relation is the relation among the loss function of the first sub-model, the loss function of the second sub-model and the loss function of the generative adversarial network.
6. The method according to any one of claims 3 to 5, wherein said training a first predetermined model according to the sample data set, and after obtaining the prediction model, further comprises:
acquiring a plurality of training samples, wherein the training samples comprise two-dimensional coordinates to be trained and three-dimensional coordinates corresponding to the two-dimensional coordinates to be trained, and the two-dimensional coordinates to be trained are obtained by extracting two-dimensional images of a human body by the prediction model;
and training a second preset model according to each training sample to obtain a conversion model, wherein the conversion model is used for converting the two-dimensional coordinates into corresponding three-dimensional coordinates.
7. An apparatus for acquiring a three-dimensional posture, comprising:
the system comprises a first acquisition module, a first display module and a second display module, wherein the first acquisition module is used for acquiring a two-dimensional image to be processed and inputting the two-dimensional image to be processed into a prediction model to obtain target two-dimensional coordinates of each key point of a human body predicted by the prediction model, the prediction model is obtained by training according to each two-dimensional sample image, and at least part of key points of the human body in the two-dimensional sample image are shielded;
the conversion module is used for converting each target two-dimensional coordinate into a corresponding target three-dimensional coordinate;
the acquisition module is further used for acquiring the three-dimensional posture of the human body according to the target three-dimensional coordinates.
8. A model training apparatus, comprising:
the second acquisition module is used for acquiring a sample data set, wherein the sample data set comprises a plurality of two-dimensional sample images and corresponding label data, and at least part of key points of a human body in the two-dimensional sample images are shielded;
and the training module is used for training a first preset model according to the sample data set to obtain a prediction model, and the prediction model is used for acquiring target two-dimensional coordinates of key points of the human body in the two-dimensional image to be processed.
9. An apparatus, comprising: a memory and a processor;
the memory stores computer-executable instructions;
the processor executing the computer-executable instructions stored by the memory causes the processor to perform the method of acquiring a three-dimensional pose of any of claims 1-2 or the method of model training of any of claims 3-6.
10. A computer-readable storage medium, wherein a computer-executable instruction is stored therein, and when executed by a processor, the computer-executable instruction is used for implementing the method for acquiring a three-dimensional pose according to any one of claims 1-2 or the method for model training according to any one of claims 3-6.
CN202210922155.5A 2022-08-02 2022-08-02 Three-dimensional posture acquisition method, model training method and related equipment Pending CN115471863A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210922155.5A CN115471863A (en) 2022-08-02 2022-08-02 Three-dimensional posture acquisition method, model training method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210922155.5A CN115471863A (en) 2022-08-02 2022-08-02 Three-dimensional posture acquisition method, model training method and related equipment

Publications (1)

Publication Number Publication Date
CN115471863A true CN115471863A (en) 2022-12-13

Family

ID=84367711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210922155.5A Pending CN115471863A (en) 2022-08-02 2022-08-02 Three-dimensional posture acquisition method, model training method and related equipment

Country Status (1)

Country Link
CN (1) CN115471863A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115984972A (en) * 2023-03-20 2023-04-18 乐歌人体工学科技股份有限公司 Human body posture identification method based on motion video drive
CN115984972B (en) * 2023-03-20 2023-08-11 乐歌人体工学科技股份有限公司 Human body posture recognition method based on motion video driving

Similar Documents

Publication Publication Date Title
US11222239B2 (en) Information processing apparatus, information processing method, and non-transitory computer-readable storage medium
US11232286B2 (en) Method and apparatus for generating face rotation image
CN111709409B (en) Face living body detection method, device, equipment and medium
JP6798183B2 (en) Image analyzer, image analysis method and program
CN110909651B (en) Method, device and equipment for identifying video main body characters and readable storage medium
US9098740B2 (en) Apparatus, method, and medium detecting object pose
US20200272806A1 (en) Real-Time Tracking of Facial Features in Unconstrained Video
US11928893B2 (en) Action recognition method and apparatus, computer storage medium, and computer device
CN110363817B (en) Target pose estimation method, electronic device, and medium
KR20180057096A (en) Device and method to perform recognizing and training face expression
US11157749B2 (en) Crowd state recognition device, learning method, and learning program
EP4307233A1 (en) Data processing method and apparatus, and electronic device and computer-readable storage medium
CN111062263B (en) Method, apparatus, computer apparatus and storage medium for hand gesture estimation
JP6071002B2 (en) Reliability acquisition device, reliability acquisition method, and reliability acquisition program
CN112200057A (en) Face living body detection method and device, electronic equipment and storage medium
CN112200157A (en) Human body 3D posture recognition method and system for reducing image background interference
JP2020119127A (en) Learning data generation method, program, learning data generation device, and inference processing method
CN112200056B (en) Face living body detection method and device, electronic equipment and storage medium
WO2021217937A1 (en) Posture recognition model training method and device, and posture recognition method and device
CN110533184B (en) Network model training method and device
CN115482556A (en) Method for key point detection model training and virtual character driving and corresponding device
CN115471863A (en) Three-dimensional posture acquisition method, model training method and related equipment
CN116758212A (en) 3D reconstruction method, device, equipment and medium based on self-adaptive denoising algorithm
JP7239002B2 (en) OBJECT NUMBER ESTIMATING DEVICE, CONTROL METHOD, AND PROGRAM
CN113506328A (en) Method and device for generating sight line estimation model and method and device for estimating sight line

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination