CN114299541A - Model training method and device and human body posture recognition method and device

Info

Publication number: CN114299541A (application CN202111619948.1A)
Authority: CN (China)
Prior art keywords: human body, body posture, estimation model, posture estimation
Legal status: Granted
Application number: CN202111619948.1A
Other languages: Chinese (zh)
Other versions: CN114299541B (en)
Inventors: 罗鹏飞, 李超龙, 吴子扬
Current Assignee: iFlytek Co Ltd
Original Assignee: iFlytek Co Ltd
Application filed by iFlytek Co Ltd; priority to CN202111619948.1A
Publication of CN114299541A
Application granted; publication of CN114299541B
Legal status: Active

Landscapes

  • Image Analysis (AREA)

Abstract

The application provides a model training method and device and a human body posture recognition method and device. The method is applied to a human body posture estimation model and comprises the following steps: inputting a human body posture sample image into an initial human body posture estimation model to obtain a single-person image included in the human body posture sample image; inputting the single-person image into at least one single-person posture estimation model to respectively obtain at least one output result corresponding to the at least one single-person posture estimation model, wherein the at least one single-person posture estimation model is a top-down model; and training the initial human body posture estimation model with the at least one output result as a training guide to obtain a human body posture estimation model, wherein the human body posture estimation model is a bottom-up model and is used for recognizing a plurality of human body postures in a human body image. By guiding the training of the human body posture estimation model with a high-precision single-person posture estimation model, the recognition accuracy and speed of the human body posture estimation model are improved.

Description

Model training method and device and human body posture recognition method and device
Technical Field
The application relates to the technical field of image recognition, in particular to a model training method and device and a human body posture recognition method and device.
Background
At present, multi-person posture estimation is a basic research topic in computer vision and has broad application prospects in visual tasks such as behavior recognition, pedestrian re-identification and human-computer interaction. Two kinds of methods are generally adopted to solve the multi-person posture estimation task: top-down methods and bottom-up methods. A top-down method scales the person region corresponding to each person in the image to a fixed size and then performs single-person posture estimation. A bottom-up method directly performs multi-person posture estimation on the plurality of persons included in the image, which results in low recognition accuracy, and the bottom-up model is required to handle persons of different sizes, thereby increasing the difficulty of constructing a bottom-up model with high recognition accuracy.
In view of this, how to construct a bottom-up model with high recognition accuracy and high speed is an urgent technical problem to be solved.
Disclosure of Invention
In view of this, embodiments of the present application provide a model training method and apparatus, and a human body posture recognition method and apparatus, which can obtain a human body posture estimation model with high recognition accuracy and high speed.
In a first aspect, an embodiment of the present application provides a model training method, which is applied to a human body posture estimation model, and the method includes: inputting the human body posture sample image into an initial human body posture estimation model to obtain a single character image included in the human body posture sample image; inputting the single character image into at least one single person pose estimation model to obtain at least one output result of the at least one single person pose estimation model, wherein the at least one single person pose estimation model is a top-down model; and training the initial human body posture estimation model by taking at least one output result as a training guide to obtain a human body posture estimation model, wherein the human body posture estimation model is a bottom-up model and is used for identifying a plurality of human body postures in the human body image.
In some embodiments of the present application, obtaining an image of a single person included in a body pose sample image comprises: acquiring a characteristic diagram of a human body posture sample image; determining a first area of a single person on the feature map; determining a first loss value based on the bias vector of each first pixel point in the first region relative to a second pixel point corresponding to the first pixel point in the human body posture sample image, the target bias vector and the area of the first region; adjusting the first area according to the first loss value to obtain a second area of a single person; and intercepting the human body posture sample image according to the second area to obtain a single person image.
In some embodiments of the present application, determining the first loss value based on the bias vector of each first pixel point in the first region relative to the second pixel point corresponding to the first pixel point in the human body posture sample image, the target bias vector, and the area of the first region includes: dividing a second loss value, determined between the bias vector and the target bias vector by a scale-normalized loss function, by the area to obtain the first loss value.
In some embodiments of the present application, training the initial body posture estimation model with the at least one output result as a training guide to obtain the body posture estimation model includes: calculating a third loss value of the characteristic diagram of the at least one output result and the human body posture sample image; adjusting the human body posture estimation model according to the third loss value; and when the third loss value meets the preset condition, stopping calculating the third loss value, and acquiring the human body posture estimation model.
In some embodiments of the present application, calculating a third loss value of the feature map of the at least one output result and the body posture sample image includes: and calculating a third loss value between the at least one output result and the feature map of the human posture sample image by using the loss function.
In some embodiments of the present application, the at least one output result includes feature maps of a plurality of channels, wherein calculating a third loss value between the at least one output result and the feature map of the body pose sample image using a loss function includes: intercepting the characteristic diagram of the human body posture sample image according to the characteristic diagrams of the channels to obtain the characteristic diagram of a single person; and calculating a third loss value between the feature maps of the plurality of channels and the feature map of the single person by using the mean square error loss function.
In some embodiments of the present application, the at least one output result includes a first coordinate of each third pixel point in the single human image, the first coordinate includes a first center coordinate and a first offset vector, wherein calculating a third loss value between the at least one output result and the feature map of the body posture sample image using a loss function includes: determining a feature map of a single person in the feature map of the human posture sample image according to the first central coordinate; and calculating a third loss value between the first coordinate of each third pixel point and a second coordinate of each fourth pixel point in the feature map of the single person by using a smooth L1 loss function, wherein the second coordinate comprises a second center coordinate and a second offset vector.
In a second aspect, an embodiment of the present application provides a human body gesture recognition method, including: obtaining a human body posture estimation model, wherein the human body posture estimation model is obtained based on the model training method of the first aspect; and inputting the image containing the plurality of human body postures into the human body posture estimation model to obtain the recognition results of the plurality of human body postures.
In a third aspect, an embodiment of the present application provides an apparatus for training a model, the apparatus including: the first acquisition module is used for inputting the human body posture sample image into the initial human body posture estimation model so as to acquire a single character image included in the human body posture sample image; the input acquisition module is used for inputting the single character image into at least one single posture estimation model so as to respectively acquire at least one output result corresponding to the at least one single posture estimation model, wherein the at least one single posture estimation model is a top-down model; and the second acquisition module is used for training the initial human body posture estimation model by taking at least one output result as a training guide so as to acquire the human body posture estimation model, wherein the human body posture estimation model is a bottom-up model and is used for identifying a plurality of human body postures in the image.
In a fourth aspect, an embodiment of the present application provides a human body posture recognition apparatus, including: a third obtaining module, configured to obtain a body posture estimation model, where the body posture estimation model is obtained based on the model training method of the first aspect; and the fourth acquisition module is used for inputting the image containing the plurality of human body postures into the human body posture estimation model so as to acquire the recognition results of the plurality of human body postures.
In a fifth aspect, an embodiment of the present application provides a computer-readable storage medium, where the storage medium stores a computer program for executing the training method of the model according to the first aspect, and/or for executing the human body posture recognition method according to the second aspect.
In a sixth aspect, an embodiment of the present application provides an electronic device, including: a processor; a memory for storing processor executable instructions, wherein the processor is configured to perform the method for training a model according to the first aspect and/or to perform the method for human body posture recognition according to the second aspect.
The embodiment of the application provides a model training method and device and a human body posture recognition method and device, in which the human body posture estimation model (namely, a bottom-up model) learns, through knowledge distillation, the output results of higher-precision single-person posture estimation models (namely, top-down models), so that a high-precision and fast human body posture estimation model can be obtained for multi-person posture estimation.
Drawings
Fig. 1 is a schematic flowchart of a training method of a model according to an exemplary embodiment of the present application.
Fig. 2 is a flowchart illustrating a method for training a model according to another exemplary embodiment of the present application.
Fig. 3 is a flowchart illustrating a method for training a model according to another exemplary embodiment of the present application.
Fig. 4 is a flowchart illustrating a method for training a model according to still another exemplary embodiment of the present application.
Fig. 5 is a flowchart illustrating a method for training a model according to still another exemplary embodiment of the present application.
Fig. 6 is a flowchart illustrating a human body gesture recognition method according to an exemplary embodiment of the present application.
Fig. 7 is a schematic structural diagram of a training apparatus for a model according to an exemplary embodiment of the present application.
Fig. 8 is a schematic structural diagram of a human body posture recognition device according to an exemplary embodiment of the present application.
FIG. 9 is a block diagram of an electronic device for model training or human gesture recognition provided by an exemplary embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
A top-down method first uses a person detector to detect the multiple persons in an image, then scales the person region corresponding to each person to a fixed size and sends it to a single-person posture estimator for single-person posture estimation. It can be seen that the top-down approach requires a separate single-person posture estimation for each person, so the recognition time of the top-down approach is proportional to the number of persons, i.e., the more persons included in the image, the longer the recognition time.
A bottom-up method does not depend on a person detector: the whole image is used directly as input, and the postures of multiple persons can be predicted in a single forward pass, i.e., the method is not affected by the number of persons in the image, but its recognition accuracy is low. Moreover, inspired by the anchor-free object detection paradigm, a bottom-up method can directly predict bias vectors of different scales for each pixel position, but when calculating the loss value these bias vectors are often dominated by the errors of large-scale persons, so that the bias vectors of the skeletal points of small-scale persons are difficult to regress. Likewise, a bottom-up method can directly predict Gaussian response maps of human skeletal points at different scales, but the responses of the skeletal points of small-scale persons are difficult to learn.
In order to solve the above technical problem, an embodiment of the present application provides a bottom-up model training method as follows.
Fig. 1 is a schematic flowchart of a training method of a model according to an exemplary embodiment of the present application. The method of fig. 1 is performed by a computing device, e.g., a server. As shown in fig. 1, the training method of the model includes the following steps.
In an embodiment, the training method of the model may be training of the human body posture estimation model, that is, training the initial human body posture estimation model to obtain the human body posture estimation model for recognizing a plurality of human body postures included in the image.
110: and inputting the human body posture sample image into an initial human body posture estimation model to obtain a single character image included on the human body posture sample image.
Specifically, the body pose sample image may be an image including a plurality of persons, and the body pose sample image may be an image in a multi-person pose estimation dataset having annotation information, wherein the dataset may be a dataset used for training a multi-person pose estimation model, such as MSCOCO, MPII, and CROWDPOSE. For example, the body pose sample image may be a picture with multiple character labels in the MSCOCO.
The initial body pose estimation model may be a bottom-up model and may act as a student model in knowledge distillation, which may include a central heatmap prediction branch and a bias vector regression branch.
After the human body posture sample image is input into the initial human body posture estimation model, the initial human body posture estimation model can perform feature extraction on the human body posture sample image to obtain a feature map of the human body posture sample image. Then, a target person scale regression branch is introduced into the initial human body posture estimation model to determine, in the feature map of the human body posture sample image, the person region (namely, the first region) of any person, i.e., of a single person.
Further, a first loss value is determined according to a bias vector of each first pixel point in a first region calculated by a bias vector regression branch relative to a second pixel point in an input human body posture sample image, an actual bias vector (namely a target bias vector) of the first pixel point and the second pixel point, and the area of the first region, wherein the first region comprises a plurality of first pixel points, a feature map of the human body posture sample image comprises a plurality of second pixel points, and the plurality of first pixel points are in one-to-one correspondence with the plurality of second pixel points.
Further, adjusting the area range of the first area according to the first loss value, obtaining a second area of the adjusted single person, and then intercepting the human body posture sample image according to the second area to obtain a single person image.
It should be noted that step 110 may be understood as cropping the corresponding single-person image from the human body posture sample image using the output result (i.e., the second region) of the bias vector regression branch, and inputting the single-person image into the single-person posture estimation models of step 120 respectively.
For example, a single person image is input into a single person pose estimation model based on heat map regression and a single person pose estimation model based on coordinate regression, respectively.
It should be noted that the target person scale regression branch can only predict the approximate region of a person, whereas the bias vector regression branch, because it needs to account for the different human body parts, represents the person region better; therefore the output result of the bias vector regression branch is used to crop the single-person image.
120: the single character image is input into the at least one single person pose estimation model to obtain at least one output of the at least one single person pose estimation model.
In an embodiment, the at least one single person pose estimation model is a top-down model.
In particular, the at least one single person pose estimation model may be used as a teacher model in knowledge distillation. The at least one single pose estimation model may include a pre-trained single pose estimation model based on a heatmap regression and a pre-trained single pose estimation model based on a coordinate regression, where a specific type of the single pose estimation model may be selected according to a requirement of an actual effect and an existing training resource, and this is not specifically limited in the embodiment of the present application. It should be noted that, when the selected teacher model is too complicated or too large, more video memory resources of the electronic device may be occupied.
The number of the single pose estimation models may be one or more, and the number of the single pose estimation models is not particularly limited in the embodiments of the present application. That is, embodiments of the present application may train a student model based on one or more teacher models.
For example, in addition to the 17 skeletal point heat maps, the single-person posture estimation model based on heat map regression needs to predict a central heat map, where the center of the central heat map may be the center of the human body, i.e., the location center of the 17 skeletal points. The 17 skeletal points are the two ears, two eyes, nose, left/right shoulders, left/right elbows, left/right hands, left/right hips, left/right knees, and left/right ankles.
The output of the single-person pose estimation model based on heat map regression may be a heat map of a single person for multiple channels (or a feature map for multiple channels), where each channel represents a category of human feature points including a human center and multiple skeletal points.
In an embodiment, the output result of the single-person pose estimation model based on heat map regression may be an 18-channel feature map.
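As a concrete illustration (not taken from the patent itself), the following sketch renders such an 18-channel target from 17 skeletal point coordinates using NumPy, under the assumption that each channel is a 2D Gaussian response and that the extra center channel is placed at the mean of the 17 points; the function name and the sigma value are illustrative choices.

import numpy as np

def render_heatmaps(keypoints_xy, heatmap_size, sigma=2.0):
    # keypoints_xy: (17, 2) array of (x, y) skeletal point coordinates on the heat-map grid
    # heatmap_size: (height, width) of the output maps
    # returns:      (18, H, W) array -- 17 skeletal-point channels plus one center channel
    h, w = heatmap_size
    keypoints_xy = np.asarray(keypoints_xy, dtype=np.float32)
    center = keypoints_xy.mean(axis=0)            # body center taken as the mean of the 17 points
    points = np.vstack([keypoints_xy, center])    # 18 channels in total
    ys, xs = np.mgrid[0:h, 0:w]
    heatmaps = np.zeros((len(points), h, w), dtype=np.float32)
    for c, (px, py) in enumerate(points):
        heatmaps[c] = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma ** 2))
    return heatmaps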
The prediction rule (also called the prediction paradigm) of the coordinate regression-based single-person posture estimation model differs from the conventional paradigm of directly predicting the 34-dimensional coordinate values of the 17 skeletal points; in order to facilitate the subsequent distillation of the bias vector regression branch in the bottom-up model (i.e., the initial human body posture estimation model), the coordinate regression-based single-person posture estimation model may adopt a similar center-offset vector regression paradigm. Here, the 34 dimensions are the x-axis and y-axis coordinate values corresponding to each of the 17 skeletal points.
In an embodiment, the output of the coordinate regression based single person pose estimation model may be a first coordinate of the single person, which may include a first center coordinate and a first offset vector of a first region of the single person.
It should be noted that the center-offset vector regression paradigm may be understood as first defining the center of a person region and determining the x-axis and y-axis coordinates of that center. Each of the 17 skeletal points is then expressed as the center coordinate plus an offset (i.e., a first offset vector), such as a δx and a δy, and the first center coordinate plus the first offset vectors (i.e., the first coordinates) are used as the output of the center-offset vector regression paradigm.
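To make the paradigm concrete, a minimal decoding sketch follows (an illustration, not the patent's implementation), assuming NumPy arrays; the function name is a hypothetical one introduced here.

import numpy as np

def decode_center_offset(center_xy, offsets):
    # center_xy: (2,) array -- (x, y) of the predicted body center (the first center coordinate)
    # offsets:   (17, 2) array -- (dx, dy) of each skeletal point relative to the center
    # returns:   (17, 2) array -- absolute skeletal point coordinates (center + offset)
    center_xy = np.asarray(center_xy, dtype=np.float32)
    return center_xy[None, :] + np.asarray(offsets, dtype=np.float32)

# Usage: a center at (120, 80) with a left-shoulder offset of (-15, -30)
# yields an absolute left-shoulder coordinate of (105, 50).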
It should be further noted that the rectangular coordinate system used for determining the center coordinate may be a rectangular coordinate system established with the upper left corner of the human body posture sample image as an origin, or may be a rectangular coordinate system established with the center of the human body posture sample image as an origin, which is not specifically limited in this embodiment of the present application.
130: and training the initial human body posture estimation model by taking at least one output result as a training guide to obtain the human body posture estimation model.
In one embodiment, the body pose estimation model is a bottom-up model for identifying a plurality of body poses in the body image.
Specifically, at least one output result output by at least one single posture estimation model is transferred to the human posture estimation model in a knowledge distillation mode, so that the human posture estimation model can optimize the output result of the human posture estimation model according to the transferred knowledge (such as the output result of the single posture estimation model).
It should be noted that, in the embodiment of the present application, knowledge distillation may be understood as inducing training of a simplified and low-complexity student model through an introduced output result of a teacher model with superior reasoning performance, thereby implementing knowledge migration, that is, migrating an output result of at least one single-person posture estimation model to a human body posture estimation model.
In an embodiment, the single-person pose estimation model comprising a heat map regression-based single-person pose estimation model may correspond to a central heat map prediction branch of the human pose estimation model; the single-person pose estimation model based on coordinate regression included in the single-person pose estimation model may correspond to an offset vector regression branch of the human pose estimation model.
And performing feature extraction on the human body posture sample image based on the convolution layer and the pooling layer in the initial human body posture estimation model to obtain a feature map of the human body posture sample image. And then determining a third loss value of at least one output result of the at least one single-person posture estimation model and the feature map of the human body posture sample image by using the loss function.
In an embodiment, the at least one output result includes a plurality of channel feature maps (e.g., 18 channel feature maps) output by a single-person pose estimation model based on heat map regression (i.e., single-person pose estimation model). In this case, the method of determining the third loss value includes: intercepting the characteristic diagram of the human body posture sample image according to the characteristic diagrams of the channels to obtain the characteristic diagram of a single person; and calculating a third loss value between the feature maps of the plurality of channels and the feature map of the single person by using the mean square error loss function.
In another embodiment, the at least one output result includes a first center coordinate and a first bias vector output by the single person pose estimation model based on coordinate regression (i.e., the single person pose estimation model). In this case, the method of determining the third loss value includes: intercepting the feature map of the human body posture sample image according to the first central coordinate to obtain the feature map of a single person; and calculating a third loss value between the first coordinate of each third pixel point in the single character image and the second coordinate of each fourth pixel point in the feature map of the single character by using a smooth L1 loss function, wherein the second coordinate comprises a second center coordinate and a second offset vector in the feature map of the single character.
Further, the human body posture estimation model is adjusted according to the third loss value, so that the output result of the human body posture estimation model is consistent with the output result of the at least one single human body posture estimation model as far as possible. And when the third loss value meets the preset condition, stopping calculating the third loss value, and acquiring the human body posture estimation model. For example, when the third loss value is lower than the threshold for stopping training, the calculation of the third loss value is stopped, and the human body posture estimation model for completing training is obtained, where the threshold for stopping training may be flexibly set according to actual needs, and this is not specifically limited in this embodiment of the application.
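As a rough illustration of this adjust-and-stop procedure, the sketch below shows a minimal distillation loop under the assumption of a PyTorch-style setup; it glosses over the cropping of single-person images for the teacher, and the function and parameter names are illustrative rather than the patent's implementation.

import torch

def distill(student, teacher, loader, loss_fn, optimizer, stop_threshold, max_epochs=100):
    # Frozen teacher outputs act as the training guide; training stops once the
    # distillation ("third") loss falls below a preset threshold.
    teacher.eval()                                   # teacher parameters stay fixed
    for _ in range(max_epochs):
        for images in loader:
            with torch.no_grad():
                target = teacher(images)             # teacher output used as the training guide
            loss = loss_fn(student(images), target)  # e.g. MSELoss or SmoothL1Loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < stop_threshold:         # preset condition met: stop training
                return student
    return student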
It should be noted that the human body posture estimation model of the embodiment of the present application has two kinds of output results: feature maps of multiple channels, and one or more coordinates consisting of a center coordinate plus offset vectors.
It should be noted that after the training of the bottom-up body posture estimation model is completed, the body posture estimation model can be integrated into a corresponding electronic device (e.g., a computer, etc.) for use.
Therefore, through knowledge distillation, the human body posture estimation model (namely, the bottom-up model) can learn the output results of the higher-precision single-person posture estimation models (namely, the top-down models), so that a high-precision and fast human body posture estimation model can be obtained for multi-person posture estimation.
Fig. 2 is a flowchart illustrating a method for training a model according to another exemplary embodiment of the present application. The embodiment of fig. 2 is an example of the embodiment of fig. 1, and the same parts are not repeated herein, and the differences are mainly described here. As shown in fig. 2, the training method of the model includes the following steps.
210: and acquiring a characteristic diagram of the human body posture sample image.
Specifically, the feature map of the human body posture sample image can be obtained by performing feature extraction on the human body posture sample image through a convolution layer and a pooling layer in the initial human body posture estimation model.
220: a first region of the individual person on the feature map is determined.
Specifically, the body posture sample image may include one or more persons, which is not particularly limited in the embodiment of the present application. That is, the feature map of the human pose sample image may include one or more characters.
The first region may be the region, defined by its width and height, of any person in the human body posture sample image, i.e., the region of a single person.
In one embodiment, the initial network model may adopt the center-offset vector regression paradigm, and, drawing on the anchor-free object detection paradigm, a target person scale regression branch may further be introduced into the center-offset vector regression paradigm to obtain a center-scale regression branch, which is used to determine the first region of a single person on the feature map.
Although the target person scale regression branch can predict the person scale to a certain extent, it can only predict the approximate size of the person region and cannot accurately regress the position of each skeletal point. For this reason the target person scale regression branch is easier to train, and its training is consistent with the training of an object detection model.
230: and determining a first loss value based on the bias vector of each first pixel point in the first region relative to a second pixel point corresponding to the first pixel point in the human body posture sample image, the target bias vector and the area of the first region.
Specifically, the bias vector may be calculated by a bias vector regression branch, where each first pixel point in the first region in the feature map of the human posture sample image corresponds to a bias vector of a second pixel point corresponding to the first pixel point on the human posture sample image, such as δ x and δ y. That is, the bias vector may be understood as the predicted outcome of the bias vector regression branch.
The target offset vector may be an actual offset vector corresponding to each first pixel point in a first region in a feature map of the human body posture sample image and corresponding to a second pixel point corresponding to the first pixel point on the human body posture sample image.
The area of the first region is the area of a single person region. It should be noted that the area of the first region can be understood as the predicted result (or the output result) of the target person scale regression branch in the initial human body posture estimation model.
In one embodiment, since persons of different scales produce loss values of different magnitudes, the second loss value between the bias vector and the target bias vector, computed with a scale-normalized loss function, is divided by the area of the first region to obtain the first loss value.
It should be noted that this step may be understood as starting the training of the bias vector regression branch based on the trained center-scale regression branch. In addition, in the training process of the regression branch of the offset vector, the center-scale regression branch also participates in training, and parameters are not fixed, so that the whole module is optimized end to end.
240: and adjusting the first area according to the first loss value to obtain a second area of the single person.
Specifically, the size of the first region of the single person is modified according to the first loss value, so that a modified second region of the single person is obtained.
For example, the width and/or height of the first region is modified according to the first loss value.
250: and intercepting the human body posture sample image according to the second area to obtain a single person image.
Specifically, the adjusted or modified second region of the single person is intercepted on the human posture sample image (i.e., the sample image of the input initial human posture estimation model) to obtain the single person image.
Therefore, the embodiment of the application performs scale normalization on the offset vector according to the area of the character region, so that characters with different scales have loss values with the same magnitude, and the situation that the loss values are dominated by large-scale characters in the training process is avoided, namely the problem of inconsistent character scales is avoided.
In an embodiment of the present application, determining the first loss value based on the bias vector of each first pixel point in the first region relative to the second pixel point corresponding to the first pixel point in the human body posture sample image, the target bias vector, and the area of the first region includes: dividing the second loss value, determined between the bias vector and the target bias vector by a scale-normalized loss function, by the area to obtain the first loss value.
Specifically, the scale-normalized loss function may be a Smooth L1 loss function, and the type of the loss function is not particularly limited in the embodiments of the present application.
When the Smooth L1 loss function is used to calculate the loss (or error) between the bias vector predicted by the bias vector regression branch and the target bias vector, the area of the first region of the single person is introduced, namely the prediction result of the target person scale regression branch is introduced.
In one embodiment, a second loss value between the bias vector predicted by the bias vector regression branch and the target bias vector, calculated using the Smooth L1 loss function, is divided by the area of the first region to obtain the first loss value.
It should be noted that the area of the first region may be understood as a scale normalization factor, and the second loss value of each pixel point in the human body posture sample image is divided by the scale normalization factor, so as to avoid the numerical influence of people with different sizes.
Therefore, the numerical difference of the coordinate regression bias vectors of the people with different sizes is reduced by adopting the scale normalization loss function and the area size of the single person region, so that the gradient values of the bias vectors of the people with different sizes are the same, and the condition that the loss value is dominated by the large-scale person in the training process is avoided.
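A minimal sketch of the scale-normalized offset loss described above (assuming a PyTorch setup; the function name and tensor layout are illustrative, and the Smooth L1 choice follows the embodiment above):

import torch
import torch.nn.functional as F

def scale_normalized_offset_loss(pred_offsets, target_offsets, region_area):
    # pred_offsets / target_offsets: (N, 2) offset vectors for the pixels of one person region
    # region_area: scalar area of that person's first region (the scale-normalization factor)
    second_loss = F.smooth_l1_loss(pred_offsets, target_offsets, reduction="sum")
    return second_loss / region_area   # first loss value = second loss value / area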
Fig. 3 is a flowchart illustrating a method for training a model according to another exemplary embodiment of the present application. The embodiment of fig. 3 is an example of the embodiment of fig. 1, and the same parts are not repeated herein, and the differences are mainly described here. As shown in fig. 3, the training method of the model includes the following steps.
310: and calculating a third loss value of the at least one output result and the feature map of the human posture sample image.
Specifically, feature extraction is performed on the human body posture sample image based on the convolutional layers and pooling layers in the human body posture estimation model to obtain a feature map of the human body posture sample image. A third loss value between the at least one output result and the feature map of the human body posture sample image is then calculated using a loss function. It should be noted that the feature map of the human body posture sample image in this step may be the same feature map as that described in step 210 of the embodiment of fig. 2.
In an embodiment, the at least one output result includes a plurality of channel feature maps (e.g., 18 channel feature maps) output by a single-person pose estimation model based on heat map regression (i.e., single-person pose estimation model). The method comprises the following steps: and calculating a third loss value between the feature maps of the multiple channels and the feature map of the single person by using a mean square error loss function, wherein the feature map of the single person comprises the feature maps obtained by mapping the feature maps of the multiple channels on the feature map of the human posture sample image.
In another embodiment, the at least one output result includes first coordinates output by a single-person pose estimation model based on coordinate regression (i.e., single-person pose estimation model), the first coordinates including a first center coordinate and a first offset vector. The method comprises the following steps: and calculating a third loss value between the first coordinate of each third pixel point in the single character image and the second coordinate of each fourth pixel point in the feature map of the single character by using a smooth L1 loss function, wherein the second coordinate comprises a second center coordinate and a second offset vector in the feature map of the single character.
320: and adjusting the human body posture estimation model according to the third loss value.
Specifically, the human body posture estimation model is adjusted according to the third loss value, so that the output result of the single human body posture estimation model is consistent with the prediction result of the human body posture estimation model. And continuously determining a third loss value in the continuous training process until the third loss value meets the preset condition.
330: and when the third loss value meets the preset condition, stopping calculating the third loss value, and acquiring the human body posture estimation model.
Specifically, the preset condition may be that the third loss value is lower than a preset training stopping threshold; the preset condition is not specifically limited in the embodiments of the present application.
In an embodiment, when the third loss value is lower than a preset training stopping threshold, the calculation of the third loss value is stopped, and the human body posture estimation model which completes training is obtained. The training stopping threshold may be a loss value corresponding to the initial human body posture estimation model when the initial human body posture estimation model converges in the training process of the initial human body posture estimation model, and the training stopping threshold is not specifically limited in the embodiment of the present application.
Therefore, the loss value is continuously calculated in the training process of the human posture estimation model, and the human posture estimation model is adjusted according to the loss value, so that guarantee is provided for distilling the output result of the top-down model to the bottom-up model.
In an embodiment of the present application, calculating a third loss value of the feature map of the at least one output result and the body posture sample image includes: and calculating a third loss value between the at least one output result and the feature map of the human posture sample image by using the loss function.
In particular, the loss function may be an MSELoss loss function, a Smooth L1 loss function, a cross-entropy loss function, or the like, and the loss function is not particularly limited in the embodiments of the present application. It should be noted that the choice of loss function may be determined by the output result of the single-person posture estimation model used in the knowledge distillation of the human body posture estimation model.
In one embodiment, when knowledge distillation is performed on the central heat map prediction branch of the human pose estimation model using the feature maps of multiple channels output by the single-person pose estimation model based on heat map regression, the loss function may be a mean square error loss function (e.g., mselos).
In another embodiment, when the center point coordinates and the first bias vector output by the coordinate regression-based single-person posture estimation model are used to perform knowledge distillation on the bias vector regression branch of the human body posture estimation model, the loss function may be a Smooth L1 loss function.
Therefore, the loss value is calculated through the loss function, the loss function is flexibly adjusted according to the content of knowledge distillation, and the accuracy of the calculation result of the loss value is guaranteed.
Fig. 4 is a flowchart illustrating a method for training a model according to still another exemplary embodiment of the present application. The embodiment of fig. 4 is an example of the embodiment of fig. 3, and the same parts are not repeated, and the differences are mainly described here. As shown in fig. 4, the training method of the model includes the following steps.
In an embodiment, the at least one single-person posture estimation model includes a heat map regression-based single-person posture estimation model, and the at least one output result includes feature maps of multiple channels (e.g., an 18-channel feature map). The feature maps of the multiple channels may be understood as the 18-channel heat maps corresponding to a single-person region.
It should be noted that this step can be understood as training the central heat map prediction branch in the initial human body posture estimation model (i.e., the student model) based on the heat map regression-based single-person posture estimation model (i.e., the teacher model).
410: and according to the feature maps of the multiple channels, intercepting the feature map of the human body posture sample image to obtain the feature map of a single person.
Specifically, in the training process of the central heat map prediction branch included in the human posture estimation model (i.e., the student model), the feature maps of the multiple channels output by the single-person posture estimation model based on the heat map regression may be first mapped onto the feature map of the human posture sample image to determine the mapping regions of the feature maps of the multiple channels on the feature map of the human posture sample image. And then according to the determined mapping area, intercepting the mapping area on the feature map of the human posture sample image as the feature map of a single person.
In one embodiment, the application uses a Region of interest (ROI) mode to cut out the feature map of the corresponding individual person.
420: and calculating a third loss value between the feature maps of the plurality of channels and the feature map of the single person by using the mean square error loss function.
Specifically, a mean square error loss function (such as an MSELoss loss function) is calculated by using a feature map of a single person and feature maps of a plurality of channels output by a single person posture estimation model based on heat map regression, and a third loss value calculated by the mean square error loss function is used for adjusting a prediction result of a student model so as to improve the output precision of the person posture estimation model.
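A minimal sketch of this heat-map distillation term (assuming a PyTorch setup with torchvision's roi_align used for the region cropping; the function name, tensor shapes and the use of roi_align are illustrative assumptions rather than the patent's implementation):

import torch.nn.functional as F
from torchvision.ops import roi_align

def heatmap_distill_loss(student_maps, teacher_maps, person_boxes):
    # student_maps: (1, 18, H, W) student prediction over the whole sample image
    # teacher_maps: (P, 18, h, w) teacher output, one entry per single-person crop
    # person_boxes: list containing one (P, 4) tensor of boxes (x1, y1, x2, y2) on the
    #               student feature map, used to crop the single-person feature maps
    crops = roi_align(student_maps, person_boxes, output_size=teacher_maps.shape[-2:])
    return F.mse_loss(crops, teacher_maps)   # third loss value for the heat-map branch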
Therefore, the knowledge distillation is carried out on the human body posture estimation model through the output result of the single posture estimation model based on the heat map regression, and therefore the accuracy of the heat map regression branch in the human body posture estimation model is improved.
Fig. 5 is a flowchart illustrating a method for training a model according to still another exemplary embodiment of the present application. The embodiment of fig. 5 is an example of the embodiment of fig. 3, and the same parts are not repeated, and the differences are mainly described here. As shown in fig. 5, the training method of the model includes the following steps.
In one embodiment, the at least one output result includes a first coordinate of each third pixel point within the single character image, the first coordinate including a first center coordinate and a first offset vector.
The coordinate regression-based single-person posture estimation model may adopt a similar center-offset vector regression paradigm: after the first center coordinate of the single-person image is defined, the remaining pixel points or key points (for example, the 17 skeletal points) of the single-person image can be represented by the first center coordinate plus a first offset vector.
It should be noted that this step can be understood as using the coordinate regression-based single-person posture estimation model (i.e., the teacher model) to train the bias vector regression branch in the initial human body posture estimation model (i.e., the student model).
510: and determining the feature map of the single person in the feature map of the human posture sample image according to the first central coordinate.
Specifically, in the training process of the bias vector regression branch included in the human posture estimation model (i.e., the student model), first, a first center coordinate output by the coordinate regression-based single-person posture estimation model may be mapped onto the feature map of the human posture sample image to determine a second center coordinate corresponding to the first center coordinate. And then taking the character area comprising the second center coordinate as the characteristic map of the single character, namely selecting the characteristic map of the single character as the area needing knowledge distillation currently.
520: using smooth L1And calculating a third loss value between the first coordinate of each third pixel point and the second coordinate of each fourth pixel point in the feature map of the single character by using the loss function.
In an embodiment, the second coordinates include a second center coordinate and a second offset vector.
Specifically, the Smooth L1 loss function is calculated from the second center coordinates and the second offset vector (i.e., the second coordinates) corresponding to each fourth pixel point in the feature map of the single person and the first center coordinates and the first offset vector (i.e., the first coordinates) corresponding to each third pixel point in the single person image output by the coordinate regression-based single person posture estimation model. And the third loss value calculated by the Smooth L1 loss function is used for adjusting the prediction result (namely the second coordinate) of the human body posture estimation model so that the second coordinate is consistent with the first coordinate as much as possible.
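A minimal sketch of this coordinate-branch distillation term (PyTorch assumed; the function name and the center-plus-offsets tensor layout are illustrative assumptions):

import torch.nn.functional as F

def coordinate_distill_loss(student_coords, teacher_coords):
    # Both tensors have shape (K, 2): the center coordinate followed by the offset
    # vectors of the 17 skeletal points (second vs. first coordinates, respectively).
    return F.smooth_l1_loss(student_coords, teacher_coords)   # third loss value for the offset branch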
It should be noted that parameters of the single-person posture estimation model based on the heat map regression and the single-person posture estimation model based on the coordinate regression need to be fixed, and do not participate in training of the human body posture estimation model. The branches of the human body posture estimation model (i.e. the bottom-up model), such as the bias vector regression branch, the target character scale regression branch, and the like, need to be optimized end-to-end to improve the model effect.
Therefore, the knowledge distillation is carried out on the human body posture estimation model through the output result of the single posture estimation model based on the coordinate regression, and therefore the precision of the regression branch of the offset vector in the human body posture estimation model is improved.
Fig. 6 is a flowchart illustrating a human body gesture recognition method according to an exemplary embodiment of the present application. The method of fig. 6 is performed by a computing device, e.g., a server. As shown in fig. 6, the human body posture recognition method includes the following steps.
610: and acquiring a human body posture estimation model.
In an embodiment, the human body posture estimation model is obtained based on the training method of the model described in the above embodiment.
It should be noted that, for a specific method for obtaining the human body posture estimation model, please refer to the description of the above embodiments for details.
620: and inputting the image containing the plurality of human body postures into the human body posture estimation model to obtain the recognition results of the plurality of human body postures.
Specifically, the human body posture estimation model recognizes a plurality of human body postures included in the input image to obtain recognition results of the plurality of human body postures.
The recognition results of the plurality of human body postures comprise the background of the image and the recognition result of the human body, and binarization processing is performed on the recognition results of the background and the human body using a preset threshold to obtain the recognition results of the plurality of human body postures; for example, each pixel point in the recognition results may be represented by 0 or 1, where 1 represents a pixel point in a human body region and 0 represents a pixel point in the background region.
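A minimal sketch of such a thresholding step (NumPy assumed; the threshold value and function name are illustrative):

import numpy as np

def binarize_result(score_map, threshold=0.5):
    # 1 marks pixels inside a human body region, 0 marks background pixels
    return (score_map >= threshold).astype(np.uint8)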
Therefore, through knowledge distillation, the human body posture estimation model (namely, the bottom-up model) can learn the output results of the higher-precision single-person posture estimation models (namely, the top-down models), improving the speed and accuracy of human body posture recognition in the input image.
Fig. 7 is a schematic structural diagram of a training apparatus 700 for a model according to an exemplary embodiment of the present application. As shown in fig. 7, the training apparatus 700 of the model includes: a first retrieving module 710, an input retrieving module 720 and a second retrieving module 730.
The first obtaining module 710 is configured to input the body pose sample image into the initial body pose estimation model to obtain a single character image included in the body pose sample image; the input obtaining module 720 is configured to input the single character image into at least one single pose estimation model to obtain at least one output result corresponding to the at least one single pose estimation model, respectively, where the at least one single pose estimation model is a top-down model; the second obtaining module 730 is configured to train the initial human body posture estimation model by using the at least one output result as a training guide to obtain a human body posture estimation model, where the human body posture estimation model is a bottom-up model and is used for recognizing a plurality of human body postures in the image.
The embodiment of the application provides a model training apparatus, so that the human body posture estimation model (a bottom-up model) can learn, through knowledge distillation, the output results of the higher-precision single-person posture estimation models (top-down models), thereby obtaining a high-precision and fast human body posture estimation model for multi-person posture estimation.
According to an embodiment of the present application, the first obtaining module 710 is configured to obtain a feature map of a body posture sample image; determining a first area of a single person on the feature map; determining a first loss value based on the bias vector of each first pixel point in the first region relative to a second pixel point corresponding to the first pixel point in the human body posture sample image, the target bias vector and the area of the first region; adjusting the first area according to the first loss value to obtain a second area of a single person; and intercepting the human body posture sample image according to the second area to obtain a single person image.
According to an embodiment of the present application, the first obtaining module 710 is configured to divide the second loss value, determined between the bias vector and the target bias vector by a scale-normalized loss function, by the area to obtain the first loss value.
According to an embodiment of the present application, the second obtaining module 730 is configured to calculate a third loss value of the at least one output result and the feature map of the human body posture sample image; adjusting the human body posture estimation model according to the third loss value; and when the third loss value meets the preset condition, stopping calculating the third loss value, and acquiring the human body posture estimation model.
According to an embodiment of the present application, the second obtaining module 730 is configured to calculate a third loss value between the at least one output result and the feature map of the body posture sample image by using a loss function.
According to an embodiment of the present application, the at least one output result includes feature maps of multiple channels, and the second obtaining module 730 is configured to perform interception on the feature map of the human body posture sample image according to the feature maps of the multiple channels to obtain a feature map of a single person; and calculating a third loss value between the feature maps of the plurality of channels and the feature map of the single person by using the mean square error loss function.
According to an embodiment of the present application, the at least one output result includes a first coordinate of each third pixel point in the single character image, the first coordinate includes a first center coordinate and a first offset vector, and the second obtaining module 730 is configured to determine a feature map of the single character in the feature map of the human posture sample image according to the first center coordinate; and calculating a third loss value between the first coordinate of each third pixel point and a second coordinate of each fourth pixel point in the feature map of the single person by using a smooth L1 loss function, wherein the second coordinate comprises a second center coordinate and a second offset vector.
It should be understood that, in the above embodiment, specific working processes and functions of the first obtaining module 710, the input obtaining module 720 and the second obtaining module 730 may refer to descriptions in the model training method provided in the above embodiments of fig. 1 to 6, and are not described herein again to avoid repetition.
Fig. 8 is a schematic structural diagram of a human body posture recognition apparatus 800 according to an exemplary embodiment of the present application. As shown in fig. 8, the human body posture recognition apparatus 800 includes: a third obtaining module 810 and a fourth obtaining module 820.
The third obtaining module 810 is configured to obtain a human body posture estimation model, where the human body posture estimation model is obtained based on the training method of the model according to the first aspect; the fourth obtaining module 820 is configured to input an image including a plurality of human body postures into the human body posture estimation model to obtain recognition results of the plurality of human body postures.
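A usage sketch for this recognition flow is given below. The `BottomUpPoseNet` class, the commented-out checkpoint name and the tensor shapes are illustrative assumptions standing in for the trained bottom-up estimator; they are not part of the disclosed device.

```python
import torch
import torch.nn as nn

class BottomUpPoseNet(nn.Module):
    """Same hypothetical student network as in the earlier setup sketch."""
    def __init__(self, num_joints=17):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(64, num_joints, 1))

    def forward(self, image):
        return self.net(image)

model = BottomUpPoseNet().eval()
# model.load_state_dict(torch.load("pose_estimator.pth"))   # hypothetical trained weights

image = torch.rand(1, 3, 512, 512)        # stand-in for one preprocessed multi-person image
with torch.no_grad():
    heatmaps = model(image)                # one forward pass covers every person in the image
print(heatmaps.shape)                      # torch.Size([1, 17, 512, 512])
```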
The embodiment of the application provides a human body posture recognition device. Through knowledge distillation, the human body posture estimation model (a bottom-up model) learns from the output results of the higher-precision single-person posture estimation model (a top-down model), thereby improving both the speed and the accuracy of recognizing the human body postures in the input image.
It should be understood that, for the specific working processes and functions of the third obtaining module 810 and the fourth obtaining module 820 in the foregoing embodiment, reference may be made to the description in the human body posture recognition method provided in the foregoing embodiment of fig. 6, and details are not described herein again to avoid repetition.
FIG. 9 is a block diagram of an electronic device 900 for model training or human gesture recognition provided by an exemplary embodiment of the present application.
Referring to fig. 9, electronic device 900 includes a processing component 910 that further includes one or more processors, and memory resources, represented by memory 920, for storing instructions, such as applications, that are executable by processing component 910. The application programs stored in memory 920 may include one or more modules that each correspond to a set of instructions. Further, the processing component 910 is configured to execute instructions to perform the above-described training method of the model, or the human body posture recognition method.
The electronic device 900 may also include a power component configured to perform power management for the electronic device 900, a wired or wireless network interface configured to connect the electronic device 900 to a network, and an input-output (I/O) interface. The electronic device 900 may be operated based on an operating system stored in the memory 920, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
A non-transitory computer readable storage medium having instructions stored thereon that, when executed by a processor of the electronic device 900, enable the electronic device 900 to perform a method of model training, comprising: inputting the human body posture sample image into an initial human body posture estimation model to obtain a single character image included in the human body posture sample image; inputting the single character image into at least one single person pose estimation model to obtain at least one output result of the at least one single person pose estimation model, wherein the at least one single person pose estimation model is a top-down model; and training the initial human body posture estimation model by taking at least one output result as a training guide to obtain a human body posture estimation model, wherein the human body posture estimation model is a bottom-up model and is used for identifying a plurality of human body postures in the human body image.
Or, a human body posture recognition method includes: obtaining a human body posture estimation model, wherein the human body posture estimation model is obtained based on the model training method of the first aspect; and inputting the image containing the plurality of human body postures into the human body posture estimation model to obtain the recognition results of the plurality of human body postures.
All the above optional technical solutions can be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, or the part thereof contributing to the prior art, may in essence be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that, in the description of the present application, the terms "first", "second", "third", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present application, "a plurality" means two or more unless otherwise specified.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modifications, equivalents and the like that are within the spirit and principle of the present application should be included in the scope of the present application.

Claims (12)

1. A training method of a model is applied to a human body posture estimation model, and is characterized by comprising the following steps:
inputting the human body posture sample image into an initial human body posture estimation model to obtain a single character image included in the human body posture sample image;
inputting the single character image into at least one single person posture estimation model to respectively obtain at least one output result corresponding to the at least one single person posture estimation model, wherein the at least one single person posture estimation model is a top-down model;
and training the initial human body posture estimation model by taking the at least one output result as a training guide to obtain the human body posture estimation model, wherein the human body posture estimation model is a bottom-up model and is used for identifying a plurality of human body postures in the image.
2. The method for training a model according to claim 1, wherein said obtaining an image of a single person included in the body pose sample image comprises:
acquiring a characteristic diagram of the human body posture sample image;
determining a first area of a single person on the feature map;
determining a first loss value based on an offset vector of each first pixel point in the first region relative to a second pixel point in the human body posture sample image corresponding to the first pixel point, a target offset vector and the area of the first region;
adjusting the first area according to the first loss value to obtain a second area of the single person;
and intercepting the human body posture sample image according to the second area so as to obtain the single person image.
3. The method for training a model according to claim 2, wherein the determining a first loss value based on the offset vector of each first pixel point in the first region relative to a second pixel point in the human body posture sample image corresponding to the first pixel point, a target offset vector and the area of the first region comprises:
dividing a second loss value between the offset vector and the target offset vector, determined using a scale-normalized loss function, by the area to obtain the first loss value.
4. The method for training a model according to claim 1, wherein the training the initial body posture estimation model with the at least one output result as a training guide to obtain the body posture estimation model comprises:
calculating a third loss value between the at least one output result and the feature map of the human body posture sample image;
adjusting the human body posture estimation model according to the third loss value;
and when the third loss value meets a preset condition, stopping calculating the third loss value, and acquiring the human body posture estimation model.
5. The method for training a model according to claim 4, wherein the calculating a third loss value between the at least one output result and the feature map of the human body posture sample image comprises:
calculating the third loss value between the at least one output result and the feature map of the body posture sample image by using a loss function.
6. The method for training a model according to claim 5, wherein the at least one output result comprises feature maps of a plurality of channels,
wherein the calculating the third loss value between the at least one output result and the feature map of the body posture sample image by using a loss function comprises:
intercepting the feature map of the human body posture sample image according to the feature maps of the plurality of channels to obtain a feature map of a single person;
and calculating the third loss value between the feature maps of the plurality of channels and the feature map of the single person by using a mean square error loss function.
7. The method of claim 5, wherein the at least one output result includes a first coordinate for each third pixel point in the single character image, the first coordinate including a first center coordinate and a first offset vector,
wherein the calculating the third loss value between the at least one output result and the feature map of the body posture sample image by using a loss function comprises:
determining a feature map of a single person in the feature map of the human body posture sample image according to the first center coordinate;
and calculating the third loss value between the first coordinate of each third pixel point and a second coordinate of each fourth pixel point in the feature map of the single person by using a smooth L1 loss function, wherein the second coordinate comprises a second center coordinate and a second offset vector.
8. A human body posture recognition method is characterized by comprising the following steps:
obtaining a human body posture estimation model, wherein the human body posture estimation model is obtained based on a training method of the model of any one of the claims 1 to 7;
and inputting an image containing a plurality of human body postures into the human body posture estimation model to obtain the recognition results of the human body postures.
9. A training device of a model is applied to a human body posture estimation model, and is characterized by comprising:
the system comprises a first acquisition module, a second acquisition module and a display module, wherein the first acquisition module is used for inputting a human body posture sample image into an initial human body posture estimation model so as to acquire a single character image included in the human body posture sample image;
an input obtaining module, configured to input the single character image into at least one single pose estimation model to respectively obtain at least one output result corresponding to the at least one single pose estimation model, where the at least one single pose estimation model is a top-down model;
and the second obtaining module is used for training the initial human body posture estimation model by taking the at least one output result as a training guide so as to obtain the human body posture estimation model, wherein the human body posture estimation model is a bottom-up model and is used for identifying a plurality of human body postures in the image.
10. A human body posture identifying device, comprising:
a third obtaining module, configured to obtain a body posture estimation model, where the body posture estimation model is obtained based on a training method of the model of any one of claims 1 to 7;
and the fourth acquisition module is used for inputting the image containing a plurality of human body postures into the human body posture estimation model so as to acquire the recognition results of the plurality of human body postures.
11. A computer-readable storage medium, in which a computer program is stored for performing a method of training a model according to any of the preceding claims 1 to 7 and/or for performing a method of human body posture recognition according to claim 8.
12. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions,
wherein the processor is configured to perform a training method of the model of any one of the above claims 1 to 7, and/or to perform a human posture recognition method of the above claim 8.
CN202111619948.1A 2021-12-27 2021-12-27 Model training method and device and human body posture recognition method and device Active CN114299541B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111619948.1A CN114299541B (en) 2021-12-27 2021-12-27 Model training method and device and human body posture recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111619948.1A CN114299541B (en) 2021-12-27 2021-12-27 Model training method and device and human body posture recognition method and device

Publications (2)

Publication Number Publication Date
CN114299541A true CN114299541A (en) 2022-04-08
CN114299541B CN114299541B (en) 2023-10-17

Family

ID=80969967

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111619948.1A Active CN114299541B (en) 2021-12-27 2021-12-27 Model training method and device and human body posture recognition method and device

Country Status (1)

Country Link
CN (1) CN114299541B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210256680A1 (en) * 2020-02-14 2021-08-19 Huawei Technologies Co., Ltd. Target Detection Method, Training Method, Electronic Device, and Computer-Readable Medium
CN112200165A (en) * 2020-12-04 2021-01-08 北京软通智慧城市科技有限公司 Model training method, human body posture estimation method, device, equipment and medium
CN112529073A (en) * 2020-12-07 2021-03-19 北京百度网讯科技有限公司 Model training method, attitude estimation method and apparatus, and electronic device

Also Published As

Publication number Publication date
CN114299541B (en) 2023-10-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant