CN109508681B

CN109508681B - Method and device for generating human body key point detection model

Info

Publication number: CN109508681B
Application number: CN201811380813.2A
Authority: CN
Inventors: 鲍慊; 刘武; 梅涛
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2018-11-20
Filing date: 2018-11-20
Publication date: 2021-11-30
Anticipated expiration: 2038-11-20
Also published as: CN109508681A

Abstract

The embodiments of the present application disclose a method and a device for generating a human body key point detection model. One embodiment of the method comprises: acquiring a sample set comprising sample human body images and annotation information; selecting a sample from the sample set, and performing the following training steps: inputting the sample human body image of the selected sample into an initial first model to obtain a feature map with a pyramid structure; determining a first-layer loss value based on the feature map and the annotation information of the key points in the sample human body image; inputting the feature map into an initial second model to obtain the position coordinates of the detected key points; determining a second-layer loss value based on the position coordinates of the detected key points and the annotation information of the key points in the sample human body image; and, in response to determining that training of the initial first model and the initial second model is complete, determining the initial first model and the initial second model to be the human body key point detection model. This embodiment can detect occluded or hidden human body key points more accurately.

Description

Method and device for generating human body key point detection model

Technical Field

The embodiment of the application relates to the technical field of computers, in particular to a method and a device for generating a human body key point detection model.

Background

Human body key point detection obtains the positions of human body key points in an image or a video through computer vision techniques, and is divided into two problems: single-person key point detection and multi-person key point detection. Multi-person key point detection usually first detects each human body and then obtains the key point positions of each person in the picture with a single-person key point detection method, so improving the performance of single-person key point detection is especially important. Deep learning methods provide an effective way to improve the detection accuracy of human body key points.

In the related art, the detection accuracy is low for key points such as wrists and ankles, which are easily occluded and deformed. In addition, the differences between feature maps of different scales are not considered, so the problem that the target area of a key point is small and difficult to detect cannot be effectively solved.

Disclosure of Invention

The embodiment of the application provides a method and a device for generating a human body key point detection model.

In a first aspect, an embodiment of the present application provides a method for generating a human body key point detection model, including: acquiring a sample set, wherein each sample in the sample set comprises a sample human body image and annotation information of the key points in the sample human body image; selecting a sample from the sample set, and performing the following training steps: inputting the sample human body image of the selected sample into an initial first model to obtain a feature map with a pyramid structure; determining a first-layer loss value based on the feature map and the annotation information of the key points in the sample human body image; inputting the feature map into an initial second model to obtain the position coordinates of the detected key points; determining a second-layer loss value based on the position coordinates of the detected key points and the annotation information of the key points in the sample human body image; determining, based on the first-layer loss value and the second-layer loss value, whether training of the initial first model and the initial second model is complete; and, in response to determining that training of the initial first model and the initial second model is complete, determining the initial first model and the initial second model to be the human body key point detection model.

In some embodiments, inputting the sample human body image of the selected sample into the initial first model to obtain the feature map with a pyramid structure includes: inputting the sample human body image of the selected sample into a residual network to obtain the feature map output by the last convolutional layer of each residual block; and passing each of these feature maps through a fully convolutional layer, and then obtaining the feature map with a pyramid structure through up-sampling and lateral connections.

In some embodiments, determining the first-layer loss value based on the feature map and the annotation information of the key points in the sample human body image comprises: generating a real heat map for each key point according to the annotation information of the key points in the sample human body image; generating a predetermined number of first predicted heat maps from the feature map, wherein each first predicted heat map corresponds to one key point; and determining the first-layer loss value based on the positional deviation of each key point between the real heat map and the first predicted heat map.

In some embodiments, inputting the feature map into the initial second model to obtain the position coordinates of the detected key points comprises: generating an attention feature map from the feature map; generating a predetermined number of second predicted heat maps from the attention feature map, wherein each second predicted heat map corresponds to one key point; and, for each of the predetermined number of second predicted heat maps, detecting the position coordinates of the corresponding key point according to the position of the maximum-probability pixel in that second predicted heat map.

In some embodiments, generating the attention feature map from the feature map comprises: passing the feature map through different numbers of bottleneck blocks to obtain feature maps of different scales; up-sampling the feature maps of different scales and fusing them together to obtain a first feature map; inputting the feature maps of different scales into an attention model to obtain first attention maps of different resolutions; up-sampling the first attention maps of different resolutions and fusing them together to obtain a fused first attention map, and combining the fused first attention map with the first feature map to obtain a second feature map; inputting the second feature map into the attention model to obtain a second attention map; and combining the second attention map with the second feature map to obtain the attention feature map.

In some embodiments, determining the second-layer loss value based on the position coordinates of the detected key points and the annotation information of the key points in the sample human body image comprises: generating a real heat map for each key point according to the annotation information of the key points in the sample human body image; generating a predetermined number of second predicted heat maps from the attention feature map, wherein each second predicted heat map corresponds to one key point; and determining the second-layer loss value based on the positional deviation of each key point between the real heat map and the second predicted heat map.

In some embodiments, the method further comprises: in response to determining that training of the initial first model and the initial second model is not complete, adjusting the relevant parameters in the initial first model and the initial second model, reselecting a sample from the sample set, and continuing to perform the training steps with the adjusted initial first model and the adjusted initial second model.
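The select, train, check and adjust loop described for the first aspect can be sketched as follows. This is a minimal illustration only: `train`, `step_fn`, `update_fn`, `loss_threshold` and `max_iters` are hypothetical names, and the stopping rule (the sum of the two loss values falling below a threshold) is an assumption; the text above only states that completion is determined based on the two loss values.

```python
import random


def train(sample_set, step_fn, update_fn, loss_threshold=0.01, max_iters=1000, seed=0):
    """Repeat the training steps until an assumed stopping rule is met.

    step_fn(sample)   -> (first_layer_loss, second_layer_loss)
    update_fn(sample) -> adjusts the parameters of both models in place
    Returns (iterations_used, finished_flag).
    """
    rng = random.Random(seed)
    for iteration in range(1, max_iters + 1):
        sample = rng.choice(sample_set)        # select a sample from the sample set
        loss1, loss2 = step_fn(sample)         # run both models and compute both losses
        if loss1 + loss2 < loss_threshold:     # assumed completion criterion
            return iteration, True
        update_fn(sample)                      # adjust parameters, then reselect
    return max_iters, False
```

In practice `update_fn` would typically be a gradient step over both models; the loop itself does not depend on how the parameters are adjusted.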

In a second aspect, an embodiment of the present application provides a method for detecting human body key points, including: acquiring a human body image of a detection object; and inputting the human body image into the human body key point detection model generated by the method according to the first aspect, to generate the position coordinates of the human body key points of the detection object.

In a third aspect, an embodiment of the present application provides an apparatus for generating a human body key point detection model, including: an acquisition unit configured to acquire a sample set, wherein each sample in the sample set comprises a sample human body image and annotation information of the key points in the sample human body image; and a training unit configured to select a sample from the sample set and to perform the following training steps: inputting the sample human body image of the selected sample into an initial first model to obtain a feature map with a pyramid structure; determining a first-layer loss value based on the feature map and the annotation information of the key points in the sample human body image; inputting the feature map into an initial second model to obtain the position coordinates of the detected key points; determining a second-layer loss value based on the position coordinates of the detected key points and the annotation information of the key points in the sample human body image; determining, based on the first-layer loss value and the second-layer loss value, whether training of the initial first model and the initial second model is complete; and, in response to determining that training of the initial first model and the initial second model is complete, determining the initial first model and the initial second model to be the human body key point detection model.

In some embodiments, the training unit is further configured to: input the sample human body image of the selected sample into a residual network to obtain the feature map output by the last convolutional layer of each residual block; and pass each of these feature maps through a fully convolutional layer, and then obtain the feature map with a pyramid structure through up-sampling and lateral connections.

In some embodiments, the training unit is further configured to: generate a real heat map for each key point according to the annotation information of the key points in the sample human body image; generate a predetermined number of first predicted heat maps from the feature map, wherein each first predicted heat map corresponds to one key point; and determine the first-layer loss value based on the positional deviation of each key point between the real heat map and the first predicted heat map.

In some embodiments, the training unit is further configured to: generate an attention feature map from the feature map; generate a predetermined number of second predicted heat maps from the attention feature map, wherein each second predicted heat map corresponds to one key point; and, for each of the predetermined number of second predicted heat maps, detect the position coordinates of the corresponding key point according to the position of the maximum-probability pixel in that second predicted heat map.

In some embodiments, the training unit is further configured to: pass the feature map through different numbers of bottleneck blocks to obtain feature maps of different scales; up-sample the feature maps of different scales and fuse them together to obtain a first feature map; input the feature maps of different scales into an attention model to obtain first attention maps of different resolutions; up-sample the first attention maps of different resolutions and fuse them together to obtain a fused first attention map, and combine the fused first attention map with the first feature map to obtain a second feature map; input the second feature map into the attention model to obtain a second attention map; and combine the second attention map with the second feature map to obtain the attention feature map.

In some embodiments, the training unit is further configured to: generate a real heat map for each key point according to the annotation information of the key points in the sample human body image; generate a predetermined number of second predicted heat maps from the attention feature map, wherein each second predicted heat map corresponds to one key point; and determine the second-layer loss value based on the positional deviation of each key point between the real heat map and the second predicted heat map.

In some embodiments, the apparatus further comprises an adjustment unit configured to: in response to determining that training of the initial first model and the initial second model is not complete, adjust the relevant parameters in the initial first model and the initial second model, reselect a sample from the sample set, and continue to perform the training steps with the adjusted initial first model and the adjusted initial second model.

In a fourth aspect, an embodiment of the present application provides an apparatus for detecting human body key points, including: a detection unit configured to acquire a human body image of a detection object; and a generating unit configured to input the human body image into the human body key point detection model generated by the method according to any implementation of the first aspect, and to generate the position coordinates of the human body key points of the detection object.

In a fifth aspect, an embodiment of the present application provides an electronic device, including: one or more processors; and a storage device for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any implementation of the first aspect.

In a sixth aspect, the present application provides a computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of the first aspect.

According to the method and the device for generating the human body key point detection model provided by the embodiments of the present application, a pyramid model and an attention model are fused to generate the human body key point detection model. In a convolutional neural network, different depths correspond to semantic features of different levels: the shallow layers have high resolution and capture detail features, while the deep layers have low resolution and capture semantic features. The cascade model connects two or more neural networks in series to obtain more context information. The multi-scale problem in object detection can be addressed by changing the network connections, essentially without increasing the computation of the original model. The attention model measures the importance of different features to the current task by calculating feature weights, thereby emphasizing important features and weakening unimportant ones. Together, these techniques help the network detect occluded or hidden human body key points more accurately.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;

FIG. 2 is a flow diagram of one embodiment of a method of generating a human keypoint detection model according to the application;

FIGS. 3a and 3b are schematic diagrams of an application scenario of a method for generating a human body keypoint detection model according to the application;

FIG. 4 is a flow diagram of yet another embodiment of a method of generating a human keypoint detection model according to the present application;

FIG. 5 is a schematic diagram of an embodiment of an apparatus for generating a human keypoint detection model according to the application;

FIG. 6 is a schematic block diagram of one embodiment of an apparatus for detecting key points in a human body according to the present application;

FIG. 7 is a block diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

Fig. 1 illustrates an exemplary system architecture 100 to which a method of generating a human keypoint detection model, an apparatus for generating a human keypoint detection model, a method of detecting human keypoints, or an apparatus for detecting human keypoints according to an embodiment of the present application may be applied.

As shown in FIG. 1, the system architecture 100 may include terminals 101 and 102, a network 103, a database server 104, and a server 105. The network 103 serves as a medium for providing communication links between the terminals 101 and 102, the database server 104, and the server 105. The network 103 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

The user 110 may use the terminals 101 and 102 to interact with the server 105 over the network 103 to receive or send messages and the like. The terminals 101 and 102 may have various client applications installed, such as a model training application, a human body key point detection and recognition application, a shopping application, a payment application, a web browser, an instant messenger, and the like.

Here, the terminals 101 and 102 may be hardware or software. When the terminals 101 and 102 are hardware, they may be various electronic devices with display screens, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), laptop computers, desktop computers, and the like. When the terminals 101 and 102 are software, they can be installed in the electronic devices listed above. They may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. This is not particularly limited herein.

When the terminals 101 and 102 are hardware, an image capturing device may also be mounted on them. The image capturing device can be any device capable of capturing images, such as a camera or a sensor. The user 110 may use the image capturing device on the terminals 101 and 102 to capture a human body image of himself or herself, or of another person.

The database server 104 may be a database server that provides various services. For example, the database server may store a sample set. The sample set contains a large number of samples. Each sample can include a sample human body image and annotation information of the key points in the sample human body image. In this way, the user 110 may also select samples from the sample set stored by the database server 104 via the terminals 101 and 102.

The server 105 may also be a server providing various services, such as a background server providing support for the various applications displayed on the terminals 101 and 102. The background server may train the initial model using samples in the sample set sent by the terminals 101 and 102, and may send the training result (e.g., the generated human body key point detection model) to the terminals 101 and 102. In this way, the user can apply the generated human body key point detection model to perform human body key point detection.

Here, the database server 104 and the server 105 may be hardware or software. When they are hardware, they can be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When they are software, they may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. This is not particularly limited herein.

It should be noted that the method for generating a human body key point detection model or the method for detecting human body key points provided by the embodiments of the present application is generally performed by the server 105. Accordingly, the apparatus for generating a human body key point detection model or the apparatus for detecting human body key points is generally also provided in the server 105.

It is noted that database server 104 may not be provided in system architecture 100, as server 105 may perform the relevant functions of database server 104.

It should be understood that the number of terminals, networks, database servers, and servers in fig. 1 are merely illustrative. There may be any number of terminals, networks, database servers, and servers, as desired for implementation.

With continued reference to FIG. 2, a flow 200 of one embodiment of a method of generating a human keypoint detection model according to the present application is shown. The method for generating the human body key point detection model can comprise the following steps:

Step 201, obtaining a sample set.

In this embodiment, the executing entity of the method for generating a human body key point detection model (e.g., the server 105 shown in FIG. 1) may obtain the sample set in a variety of ways. For example, the executing entity may obtain an existing sample set stored in a database server (e.g., the database server 104 shown in FIG. 1) via a wired or wireless connection. As another example, a user may collect samples via a terminal (e.g., the terminals 101 and 102 shown in FIG. 1). In this way, the executing entity may receive the samples collected by the terminal and store them locally, thereby generating a sample set.

Here, the sample set may include at least one sample. The sample can include a sample human body image and annotation information associated with key points in the sample human body image.

Optionally, data augmentation may be applied to the training samples, including rotation, resizing, cropping, flipping, and changing the light intensity, to obtain augmented training data and improve the model's generalization. Experiments show that the size of the input picture affects the accuracy of key point detection: within a certain range, the larger the input picture, the more accurate the detected key point positions. Since the shape of a person in a picture is generally an elongated bar, the method sets the size of the input picture to 864 × 648 as a trade-off between accuracy and computation. In implementation, the picture is cropped with its aspect ratio kept unchanged, and its size is changed to 864 × 648 after the picture edges are zero-padded. When data augmentation is applied to a picture, the corresponding operations, such as rotation, scaling and flipping, are also applied to the annotated key point coordinates.
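The requirement that key point coordinates follow the same geometric operations as the picture can be illustrated with a short NumPy sketch. It is an illustration under stated assumptions, not the implementation used in this application: `letterbox` and `hflip` are hypothetical helper names, 864 is assumed to be the height and 648 the width, the padding is assumed to be centered, key points are given as (x, y) pixel coordinates, and nearest-neighbour resizing is used only to keep the example dependency-free.

```python
import numpy as np


def letterbox(image, keypoints, out_h=864, out_w=648):
    """Resize with the aspect ratio unchanged, zero-pad to (out_h, out_w),
    and apply the same scale and offset to the (x, y) key points."""
    h, w = image.shape[:2]
    scale = min(out_h / h, out_w / w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    # Nearest-neighbour resize using plain index arrays.
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = image[rows][:, cols]
    # Centre the resized picture on a zero canvas (assumed padding layout).
    top, left = (out_h - new_h) // 2, (out_w - new_w) // 2
    canvas = np.zeros((out_h, out_w) + image.shape[2:], dtype=image.dtype)
    canvas[top:top + new_h, left:left + new_w] = resized
    kps = np.asarray(keypoints, dtype=np.float64) * scale + np.array([left, top])
    return canvas, kps


def hflip(image, keypoints):
    """Flip the picture left-right and mirror the x coordinate of each key point."""
    w = image.shape[1]
    kps = np.asarray(keypoints, dtype=np.float64).copy()
    kps[:, 0] = (w - 1) - kps[:, 0]
    return image[:, ::-1].copy(), kps
```

A horizontal flip usually also requires swapping the labels of left and right body parts (for example, left wrist and right wrist); that bookkeeping depends on the annotation scheme and is omitted here.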

In the present embodiment, the sample human body image generally refers to an image containing a human body. It may be a planar human body image or a stereoscopic human body image (i.e., a human body image containing depth information). The sample human body image may be a color image (e.g., an RGB (Red, Green, Blue) photograph) and/or a grayscale image, etc. The format of the image is not limited in the present application; it may be, for example, JPG (Joint Photographic Experts Group, a picture format), BMP (Bitmap, an image file format), or RAW (RAW Image Format), as long as the executing entity can read and recognize it.

Step 202, selecting a sample from the sample set.

In this embodiment, the executing entity may select a sample from the sample set obtained in step 201 and perform the training steps from step 203 to step 208. The selection manner and the number of samples are not limited in the present application. For example, at least one sample may be selected at random, or a sample whose human body image has better definition (i.e., higher resolution) may be selected.

Step 203, inputting the sample human body image of the selected sample into the initial first model to obtain a feature map with a pyramid structure.

In this embodiment, the executing entity may input the sample human body image of the sample selected in step 202 into the initial first model. By detecting and analyzing the key point regions in the sample human body image, a feature map containing the key points can be obtained.

In this embodiment, the initial first model may be any of various existing neural network models created based on machine learning techniques. The neural network model may have any of various existing neural network structures (e.g., DenseBox, VGGNet, ResNet, SegNet, etc.). The storage location of the initial model is likewise not limited in this application.

As shown in FIG. 3a, in the first stage of the cascade model, ResNet-101 may be selected as the basic network structure. The feature maps conv2, conv3, conv4 and conv5 output by the last convolutional layer of each residual block are taken out and each passed through a 1 × 1 fully convolutional layer, and the feature maps of the feature pyramid structure are then obtained by up-sampling and laterally connecting the levels.
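The top-down construction described for the first stage can be sketched in NumPy as follows. This is a simplified illustration under assumptions rather than the network of this application: `conv1x1`, `upsample2x` and `build_pyramid` are hypothetical names, each level is assumed to be exactly half the spatial size of the previous one, up-sampling is nearest-neighbour, the lateral connection is assumed to be element-wise addition, and the 1 × 1 layer is modelled as a per-pixel linear map over channels without bias.

```python
import numpy as np


def conv1x1(x, weight):
    """1x1 convolution: x has shape (C_in, H, W), weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def upsample2x(x):
    """Nearest-neighbour 2x up-sampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def build_pyramid(backbone_maps, weights):
    """backbone_maps: [conv2, conv3, conv4, conv5], finest level first.
    weights: one (C_out, C_in) matrix per level.
    Returns the pyramid feature maps in the same order."""
    laterals = [conv1x1(f, w) for f, w in zip(backbone_maps, weights)]
    pyramid = [laterals[-1]]                      # start from the coarsest level
    for lateral in reversed(laterals[:-1]):
        # Lateral connection: add the up-sampled coarser map to the current level.
        pyramid.append(lateral + upsample2x(pyramid[-1]))
    return pyramid[::-1]
```

Every returned level has the same number of channels, which is what allows levels of different resolution to be added together after up-sampling.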

Step 204, determining a first-layer loss value based on the feature map and the annotation information of the key points in the sample human body image.

In this embodiment, the executing entity may analyze the annotation information of the key points of the sample human body image together with the feature map obtained in step 203 to determine the first-layer loss value. For example, the feature map and the annotation information of the key points may be input as parameters to a specified loss function, so that a loss value between the two can be calculated.

In this embodiment, the loss function is usually used to measure the degree of inconsistency between the model's predicted value (e.g., the feature map) and the actual value (e.g., the annotation information). It is a non-negative real-valued function. In general, the smaller the value of the loss function, the better the robustness of the model. The loss function may be set according to actual requirements.

In some optional implementations of this embodiment, determining the first-layer loss value based on the feature map and the annotation information of the key points in the sample human body image includes: generating a real heat map (heatmap) for each key point according to the annotation information of the key points in the sample human body image; generating a predetermined number of first predicted heat maps from the feature map, wherein each first predicted heat map corresponds to one key point; and determining the first-layer loss value based on the positional deviation of each key point between the real heat map and the first predicted heat map. Specifically, the feature map of each level output in step 203 is passed sequentially through a 1 × 1 convolutional layer and a 3 × 3 convolutional layer to obtain the detected key point heat maps at each resolution, and the L2 loss between the detected key point heat maps and the real key point heat maps is calculated as the first-layer loss function of the network.
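A real heat map and the L2 loss against a predicted heat map can be sketched as below. The text does not state how the real heat map is built from an annotation; placing a 2-D Gaussian at the annotated position is a common choice and is an assumption here, as are the names `gaussian_heatmap` and `l2_heatmap_loss`, the value of `sigma`, and the use of the mean (rather than the sum) of squared differences.

```python
import numpy as np


def gaussian_heatmap(height, width, center_xy, sigma=2.0):
    """Real heat map for one key point: a Gaussian peaked at the annotated (x, y)."""
    ys, xs = np.mgrid[0:height, 0:width]
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


def l2_heatmap_loss(predicted, target):
    """Mean squared difference between predicted and real heat maps of shape (K, H, W)."""
    return float(np.mean((predicted - target) ** 2))
```

With K key points, the real heat maps are stacked into a (K, H, W) array so that each channel of the prediction is compared with the heat map of its own key point.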

Step 205, inputting the feature map into the initial second model to obtain the position coordinates of the detected key points.

In this embodiment, the executing entity may input the feature map generated in step 203 into the initial second model to obtain the position coordinates of the detected key points. The initial second model may be a neural network based on an attention model. The primary purpose of the initial second model is to extract the attended features from feature maps of different scales while retaining both detail features and semantic features, thereby emphasizing important features and weakening unimportant ones.

In some optional implementations of this embodiment, inputting the feature map into the initial second model to obtain the position coordinates of the detected key points includes: generating an attention feature map from the feature map; generating a predetermined number of second predicted heat maps from the attention feature map, wherein each second predicted heat map corresponds to one key point; and, for each of the predetermined number of second predicted heat maps, detecting the position coordinates of the corresponding key point according to the position of the maximum-probability pixel in that second predicted heat map.
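Reading a key point position from the maximum-probability pixel of each predicted heat map can be sketched as follows. `decode_keypoints` is a hypothetical name; the function returns coordinates in heat map pixels, and mapping them back to the input picture (for example, by multiplying by the ratio between picture size and heat map size) is left out because the text does not specify that ratio.

```python
import numpy as np


def decode_keypoints(heatmaps):
    """heatmaps: array of shape (K, H, W), one predicted heat map per key point.
    Returns an integer array of shape (K, 2) holding (x, y) of each peak,
    and an array of shape (K,) holding the peak values."""
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    idx = flat.argmax(axis=1)                 # flat index of the maximum per map
    ys, xs = np.unravel_index(idx, (h, w))    # convert back to row and column
    return np.stack([xs, ys], axis=1), flat.max(axis=1)
```

The peak value can serve as a rough confidence score for the detected key point, although the text above does not describe such a use.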

In some optional implementations of this embodiment, generating the attention feature map according to the feature map includes:

Step 2051, passing the feature maps through different numbers of bottleneck blocks to obtain feature maps of different scales.

In this embodiment, in the second stage of the cascade model, different numbers of bottleneck blocks are applied to the feature maps of each level output in step 203, so as to obtain feature maps of different scales. More bottleneck blocks are stacked at the deeper levels, whose spatial size is smaller, which achieves a good balance between effectiveness and efficiency.

Step 2052, up-sampling the feature maps of different scales and fusing them together to obtain a first feature map.

In this embodiment, the feature maps of different scales obtained in step 2051 are up-sampled and then added pixel by pixel (pixel-wise add) to obtain a feature map f_c; see 301 in FIG. 3a.

Step 2053, inputting the feature maps of different scales into the attention model to obtain first attention maps of different resolutions.

In this embodiment, the feature maps of different scales output in step 2051 are passed through the attention model to obtain attention maps of different resolutions; see 302 in FIG. 3a.

Step 2054, upsampling the first attention maps of different resolutions and then fusing them together to obtain a fused first attention map, and combining the fused first attention map with the first feature map to obtain a second feature map.

In this embodiment, the fused first attention map is combined with f_c obtained in step 2052 to obtain a refined feature map f_AM1; see 303 in fig. 3a.

Step 2055, inputting the second feature map into the attention model to obtain a second attention map.

In this embodiment, f_AM1 is further passed through an attention model to obtain a refined attention map AM₂; see 304 in fig. 3a.

Step 2056, combining the second attention map and the second feature map to obtain an attention feature map.

In this embodiment, f_AM1 and AM₂ are combined to obtain a refined feature map f_out; see 305 in fig. 3a. In this step, different resolutions focus on different image features: low resolutions focus on global information, and high resolutions focus more on local details.
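As a non-limiting illustration, the data flow of steps 2052 to 2056 might be sketched as follows on single-channel maps. Several points are assumptions made for this example and are not specified by the application: nearest-neighbour upsampling, pixel-wise addition as the fusion of attention maps, pixel-wise multiplication as the way an attention map is "combined" with a feature map, and the placeholder `toy_attention` (a per-pixel sigmoid) standing in for the attention model. All function names are hypothetical, and step 2051 (the bottleneck blocks) is taken as already done.

```python
import math
from typing import Callable, List

Map2D = List[List[float]]  # single-channel map, indexed [row][col]


def upsample_nearest(m: Map2D, height: int, width: int) -> Map2D:
    """Nearest-neighbour upsampling to (height, width); an assumed choice."""
    h, w = len(m), len(m[0])
    return [[m[y * h // height][x * w // width] for x in range(width)]
            for y in range(height)]


def add(a: Map2D, b: Map2D) -> Map2D:
    """Pixel-wise addition of two maps of equal size."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def multiply(a: Map2D, b: Map2D) -> Map2D:
    """Pixel-wise multiplication; the assumed 'combining' operation."""
    return [[p * q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def fuse(maps: List[Map2D]) -> Map2D:
    """Upsample every map to the largest resolution, then add pixel-wise."""
    height = max(len(m) for m in maps)
    width = max(len(m[0]) for m in maps)
    fused = [[0.0] * width for _ in range(height)]
    for m in maps:
        fused = add(fused, upsample_nearest(m, height, width))
    return fused


def toy_attention(m: Map2D) -> Map2D:
    """Placeholder for the attention model: a per-pixel sigmoid gate."""
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in m]


def attention_feature_map(
        scales: List[Map2D],
        attention: Callable[[Map2D], Map2D] = toy_attention) -> Map2D:
    """Steps 2052-2056 applied to the multi-scale maps from step 2051."""
    f_c = fuse(scales)                            # step 2052: first feature map
    first_maps = [attention(m) for m in scales]   # step 2053: first attention maps
    f_am1 = multiply(fuse(first_maps), f_c)       # step 2054: second feature map
    am2 = attention(f_am1)                        # step 2055: second attention map
    return multiply(am2, f_am1)                   # step 2056: f_out
```

Passing an identity function as `attention` makes the arithmetic easy to check by hand; a trained attention model would take its place in practice.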

Step 206, determining a second-layer loss value based on the position coordinates of the detected key points and the labeling information of the key points in the sample human body image.

In this embodiment, the execution subject may analyze the labeling information of the key points of the sample human body image and the position coordinates of the key points obtained in step 205, so as to determine the second layer loss value. For example, the position coordinates of the detected key points and the label information of the key points may be input to a predetermined loss function (loss function) as parameters, and a loss value between the two values may be calculated.

In this embodiment, the loss function is generally used to measure the degree of inconsistency between the predicted value (e.g. the position coordinates of the detected key points) and the actual value (e.g. the annotation information of the key points) of the model. It is a non-negative real-valued function. In general, the smaller the loss function, the better the robustness of the model. The loss function may be set according to actual requirements.

In some optional implementations of this embodiment, determining the second-layer loss value based on the position coordinates of the detected key points and the labeling information of the key points in the sample human body image includes: generating a real thermodynamic diagram for each key point according to the labeling information of the key points in the sample human body image; generating a predetermined number of second predictive thermodynamic diagrams according to the attention feature map, wherein each second predictive thermodynamic diagram corresponds to one key point; and determining the second-layer loss value based on the position deviation of each key point between the real thermodynamic diagram and the second predictive thermodynamic diagram. Specifically, the feature map f_out output in step 2056 is input into full convolutional layer 2 (that is, passed sequentially through 1 × 1 and 3 × 3 convolutional layers) to obtain heat maps (thermodynamic diagrams) of the detected key points. The number of heat maps is the same as the number of key points, each heat map corresponds to one key point, and the position of the maximum-probability pixel on each heat map is taken as the position coordinate of the detected key point. The L2 loss between the detected heat maps and the real heat maps of the key points is calculated as the loss function of the second stage of the network.
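As a non-limiting illustration, the real thermodynamic diagrams and the L2 loss described above might be computed as in the following sketch. The application does not state how a real thermodynamic diagram is rendered from the labeling information; the Gaussian centred on the labeled position, the `sigma` parameter, and the function names are assumptions made for this example. The loss here is the sum of squared pixel differences, reported separately for each key point channel.

```python
import math
from typing import List, Tuple

Heatmap = List[List[float]]  # indexed [row][col]


def gaussian_heatmap(center: Tuple[int, int], height: int, width: int,
                     sigma: float = 1.0) -> Heatmap:
    """Assumed form of a real heat map: a Gaussian peaked at the label (x, y)."""
    cx, cy = center
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]


def l2_loss_per_keypoint(predicted: List[Heatmap],
                         real: List[Heatmap]) -> List[float]:
    """Sum of squared pixel differences, one value per key point channel."""
    losses = []
    for pred_map, real_map in zip(predicted, real):
        channel_loss = 0.0
        for pred_row, real_row in zip(pred_map, real_map):
            for p, r in zip(pred_row, real_row):
                channel_loss += (p - r) ** 2
        losses.append(channel_loss)
    return losses
```

Keeping one loss value per channel, rather than a single scalar, makes it possible to rank the key points by difficulty afterwards.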

Step 207, determining whether the initial first model and the initial second model are trained based on the first layer loss value and the second layer loss value.

In this embodiment, the first-layer loss value and the second-layer loss value are added to obtain the total loss value of the network. In each training iteration, images and the corresponding key point annotation data are input, the first-layer loss value and the second-layer loss value are calculated by forward propagation, the gradients of the two loss values are then calculated, and back propagation through the network is performed to update the parameters. Experiments show that a better detection effect on hard-to-detect key points is achieved if, after a certain number of iterations, the first-layer and second-layer loss values are modified so that only hard-to-detect key points are considered, that is, only the several key point channels with the larger second-layer loss values are calculated and back-propagated.
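As a non-limiting illustration, the total loss and the selection of hard key point channels might be sketched as follows. The function name `total_loss` and the parameter `top_k` are hypothetical, and the sketch assumes that the first-layer loss is kept unchanged while only the `top_k` largest second-layer channel losses are retained; the application does not fix the number of retained channels or the iteration at which the change is made.

```python
from typing import List, Optional


def total_loss(first_layer: List[float], second_layer: List[float],
               top_k: Optional[int] = None) -> float:
    """Add the two stage losses, given one loss value per key point channel.

    With top_k set, only the top_k channels with the largest second-layer
    loss (the hard-to-detect key points) contribute to the second term.
    """
    ranked = sorted(second_layer, reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    return sum(first_layer) + sum(ranked)
```

For example, with first-layer channel losses [1.0, 2.0] and second-layer channel losses [0.5, 3.0, 1.5], the plain total is 8.0, while `top_k=2` keeps only 3.0 and 1.5 from the second stage and gives 7.5.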

Based on the change in the loss value, the execution subject may determine whether training of the initial models is complete. As an example, if multiple samples are selected in step 202, the execution subject may determine that training of the initial first model and the initial second model is complete if the total loss value of each sample reaches the target value. As another example, the execution subject may count the proportion of the selected samples whose total loss values reach the target value. When this proportion reaches a preset sample proportion (e.g., 95%), it can be determined that training of the initial models is complete.
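As a non-limiting illustration, the two completion criteria above might be expressed by a single check, as in the following sketch. The function name `training_complete` is hypothetical, and "reaching the target value" is interpreted here as the total loss being less than or equal to the target, which is an assumption. Setting `min_ratio` to 1.0 corresponds to the first example (every sample reaches the target); 0.95 corresponds to the second.

```python
from typing import List


def training_complete(total_losses: List[float], target: float,
                      min_ratio: float = 0.95) -> bool:
    """Return True when enough of the selected samples reach the target loss."""
    if not total_losses:
        return False  # nothing was evaluated, so training cannot be judged complete
    reached = sum(1 for loss in total_losses if loss <= target)
    return reached / len(total_losses) >= min_ratio
```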

In this embodiment, if the execution subject determines that the training of the initial first model and the initial second model is complete, it may continue to execute step 208. If the execution subject determines that the initial first model and the initial second model are not fully trained, it may adjust the relevant parameters in the two models, for example by using a back propagation technique to modify the weights in each convolutional layer of the initial first model and the weights in each attention model of the initial second model. It may then return to step 202 and re-select samples from the sample set, so that the training steps described above can be continued.

It should be noted that the manner of selection is not limited in the present application. For example, in the case where there are a large number of samples in the sample set, the execution subject may select samples that have not been selected before.

Step 208, in response to determining that the training of the initial first model and the initial second model is complete, determining the initial first model and the initial second model as the human body key point detection models.

In this embodiment, if the execution subject determines that the training of the initial first model and the initial second model is completed, the initial first model and the initial second model may be determined as the human body key point detection model.

Optionally, the executing entity may store the generated human body key point detection model locally, or may send it to a terminal or a database server.

According to the method provided by the embodiment of the application, an attention model is added to the cascaded feature pyramid model, which improves the accuracy of detecting human key points that are hard to detect, occluded, or involved in rare actions. An example test result is shown in fig. 3b: the scheme can accurately detect 14 key points of the human body, namely the head, the neck, and the left and right shoulders, elbows, wrists, hips, knees and ankles, including key points that are hard to detect, occluded, or involved in rare actions.

Referring to fig. 4, a flowchart 400 of an embodiment of a method for detecting a human body provided by the present application is shown. The method for detecting a human body may include the steps of:

step 401, acquiring a human body image of a detection object.

In the present embodiment, an execution subject (e.g., the server 105 shown in fig. 1) of the method for detecting a human body may acquire a human body image of a detection object in various ways. For example, the execution subject may obtain the human body image stored therein from a database server (e.g., database server 104 shown in fig. 1) through a wired connection manner or a wireless connection manner. As another example, the execution subject may also receive a human body image captured by a terminal (e.g., terminals 101, 102 shown in fig. 1) or other device.

In the present embodiment, the detection object may be any user, such as a user using a terminal, or another user who appears in the image capturing range. The human body image may likewise be a color image and/or a grayscale image, and the format of the human body image is not limited in this application.

Step 402, inputting the human body image into the human body key point detection model, and generating the position coordinates of the human body key points of the detection object.

In this embodiment, the executing subject may input the human body image acquired in step 401 into the human body key point detection model, thereby generating a human body key point detection result of the detection object. The human body key point detection result may be position information for describing key points of the human body in the image.

In this embodiment, the human key point detection model may be generated by the method described in the embodiment of fig. 2. For a specific generation process, reference may be made to the related description of the embodiment in fig. 2, which is not described herein again.

It should be noted that the method for detecting a human body in this embodiment may be used to test the human body key point detection model generated in the foregoing embodiments, and the human body key point detection model can then be continuously optimized according to the test result. The method may also be a practical application method of the human body key point detection model generated by the above embodiments. Using the human body key point detection model generated by the above embodiments to detect human body key points improves detection performance: for example, more human body key points are found, and the information of the found key points is more accurate. In particular, the accuracy of detecting human body key points that are hard to detect, occluded, or involved in rare actions is improved.

With further reference to fig. 5, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for generating a human body key point detection model, where the apparatus embodiment corresponds to the method embodiment shown in fig. 2, and the apparatus may be applied to various electronic devices.

As shown in fig. 5, the apparatus 500 for generating a human body key point detection model of the present embodiment includes: an acquisition unit 501, a training unit 502 and an adjusting unit 503. The acquisition unit 501 is configured to acquire a sample set, wherein samples in the sample set include a sample human body image and annotation information of key points in the sample human body image. The training unit 502 is configured to select samples from the sample set and to perform the following training steps: inputting the sample human body image of a selected sample into the initial first model to obtain a feature map of a pyramid structure; determining a first-layer loss value based on the feature map and the annotation information of the key points in the sample human body image; inputting the feature map into the initial second model to obtain the position coordinates of the detected key points; determining a second-layer loss value based on the position coordinates of the detected key points and the annotation information of the key points in the sample human body image; determining whether the training of the initial first model and the initial second model is complete based on the first-layer loss value and the second-layer loss value; and, in response to determining that the training of the initial first model and the initial second model is complete, determining the initial first model and the initial second model as the human body key point detection model.

In this embodiment, for the specific processing of the acquisition unit 501, the training unit 502 and the adjusting unit 503 of the apparatus 500 for generating a human body key point detection model, reference may be made to steps 201 to 208 in the embodiment corresponding to fig. 2.

In some optional implementations of this embodiment, the training unit 502 is further configured to: input the sample human body image of the selected sample into a residual network to obtain the feature map output by the last convolutional layer of each residual block; and pass the feature maps output by the convolutional layers through full convolutional layers respectively, and then obtain the feature map of the pyramid structure through lateral connections after upsampling.

In some optional implementations of this embodiment, the training unit 502 is further configured to: generating a real thermodynamic diagram for each key point according to the labeling information of the key points in the sample human body image; generating a predetermined number of first predictive thermodynamic diagrams according to the feature map, wherein each first predictive thermodynamic diagram corresponds to a key point; a first layer loss value is determined based on a positional deviation of each keypoint in the real thermodynamic diagram from the first predicted thermodynamic diagram.

In some optional implementations of this embodiment, the training unit 502 is further configured to: generating an attention feature map according to the feature map; generating a predetermined number of second predictive thermodynamic diagrams according to the attention feature map, wherein each second predictive thermodynamic diagram corresponds to a key point; and for a second predictive thermodynamic diagram in the predetermined number of second predictive thermodynamic diagrams, detecting the position coordinates of the corresponding key points according to the positions of the maximum probability pixels in each second predictive thermodynamic diagram.

In some optional implementations of this embodiment, the training unit 502 is further configured to: pass the feature map through different numbers of bottleneck blocks to obtain feature maps of different scales; upsample the feature maps of different scales and then fuse them together to obtain a first feature map; input the feature maps of different scales into an attention model to obtain first attention maps of different resolutions; upsample the first attention maps of different resolutions and then fuse them together to obtain a fused first attention map, and combine the fused first attention map with the first feature map to obtain a second feature map; input the second feature map into the attention model to obtain a second attention map; and combine the second attention map and the second feature map to obtain the attention feature map.

In some optional implementations of this embodiment, the training unit 502 is further configured to: generating a real thermodynamic diagram for each key point according to the labeling information of the key points in the sample human body image; generating a predetermined number of second predictive thermodynamic diagrams according to the attention feature map, wherein each second predictive thermodynamic diagram corresponds to a key point; and determining a second layer loss value based on the position deviation of each key point in the real thermodynamic diagram and the second predicted thermodynamic diagram.

In some optional implementations of this embodiment, the apparatus 500 further includes an adjusting unit 503 configured to: in response to determining that the initial first model and the initial second model are not fully trained, adjust the relevant parameters in the initial first model and the initial second model, reselect samples from the sample set, and continue to perform the training steps using the adjusted initial first model and the adjusted initial second model.

With further reference to fig. 6, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an apparatus for detecting key points of a human body, which corresponds to the embodiment of the method shown in fig. 4, and which can be applied in various electronic devices.

As shown in fig. 6, the apparatus 600 for detecting key points of a human body according to the present embodiment includes: a detection unit 601 and a generation unit 602. Wherein the detection unit 601 is configured to acquire a human body image of the detection object. The generating unit 602 is configured to input the human body image into a human body key point detection model generated by the method described in the embodiment of fig. 2, and generate the position coordinates of the human body key points of the detection object.

It will be understood that the elements described in the apparatus 600 correspond to various steps in the method described with reference to fig. 4. Thus, the operations, features and resulting advantages described above with respect to the method are also applicable to the apparatus 600 and the units included therein, and are not described herein again.

Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

The following components are connected to the I/O interface 705: an input section 706 including a touch panel, a keyboard, a mouse, a camera, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is installed into the storage section 708 as necessary.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program, when executed by a Central Processing Unit (CPU)701, performs the above-described functions defined in the method of the present application. It should be noted that the computer readable medium of the present application can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit and a training unit. As another example, it can also be described as: a processor includes a detection unit and a generation unit. Where the names of these units do not in some cases constitute a limitation of the unit itself, for example, the acquisition unit may also be described as a "unit acquiring a sample set".

As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a sample set, wherein samples in the sample set comprise sample human body images and marking information of key points in the sample human body images; selecting samples from the sample set, and performing the following training steps: inputting a sample human body image of the selected sample into the initial first model to obtain a characteristic diagram of the pyramid structure; determining a first-layer loss value based on the feature map and the labeling information of the key points in the sample human body image; inputting the characteristic diagram into an initial second model to obtain the position coordinates of the detected key points; determining a second-layer loss value based on the position coordinates of the detected key points and the labeling information of the key points in the sample human body image; determining whether the training of the initial first model and the initial second model is finished based on the first layer loss value and the second layer loss value; in response to determining that the training of the initial first model and the initial second model is complete, the initial first model and the initial second model are determined to be human keypoint detection models.

Further, the one or more programs, when executed by the electronic device, may further cause the electronic device to: acquiring a human body image of a detection object; and inputting the human body image into the human body key point detection model to generate the position coordinates of the human body key points of the detection object. The human key point detection model may be generated by using the method for generating the human key point detection model described in the above embodiments.

The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by a person skilled in the art that the scope of the invention as referred to in the present application is not limited to the embodiments with a specific combination of the above-mentioned features, but also covers other embodiments with any combination of the above-mentioned features or their equivalents without departing from the inventive concept. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims

1. A method of generating a human keypoint detection model, comprising:

acquiring a sample set, wherein samples in the sample set comprise sample human body images and marking information of key points in the sample human body images;

selecting samples from the sample set, and performing the following training steps: inputting a sample human body image of a selected sample into an initial first model to obtain a characteristic diagram of a pyramid structure, wherein the initial first model adopts Resnet101 as a basic network structure; determining a first-layer loss value based on the feature map and the labeling information of key points in the sample human body image; inputting the characteristic diagram into an initial second model to obtain the position coordinates of the detected key points, wherein the initial second model is a neural network based on an attention model; determining a second-layer loss value based on the position coordinates of the detected key points and the labeling information of the key points in the sample human body image; determining whether training of the initial first model and the initial second model is completed based on the first layer loss value and the second layer loss value; in response to determining that training of the initial first model and the initial second model is complete, determining the initial first model and the initial second model as human key point detection models;

wherein, the inputting the feature map into the initial second model to obtain the position coordinates of the detected key points comprises:

generating an attention feature map according to the feature map;

generating a predetermined number of second predictive thermodynamic diagrams according to the attention feature map, wherein each second predictive thermodynamic diagram corresponds to a key point;

for a second predictive thermodynamic diagram of the predetermined number of second predictive thermodynamic diagrams, detecting position coordinates of corresponding key points according to the position of the maximum probability pixel in each second predictive thermodynamic diagram;

the generating of the attention feature map according to the feature map comprises:

passing the feature map through different numbers of bottleneck blocks to obtain feature maps of different scales;

the feature maps with different scales are subjected to upsampling and then fused together to obtain a first feature map;

inputting the feature maps of different scales into an attention model to obtain first attention maps of different resolutions;

the first attention diagrams with different resolutions are fused together after being subjected to upsampling to obtain a fused first attention diagram, and the fused first attention diagram is combined with the first feature diagram to obtain a second feature diagram;

inputting the second feature map into the attention model to obtain a second attention map;

and combining the second attention map and the second feature map to obtain an attention feature map.

2. The method of claim 1, wherein the inputting the sample human body image of the selected sample into the initial first model to obtain the feature map of the pyramid structure comprises:

inputting the sample human body image of the selected sample into a residual network to obtain the feature map output by the last convolutional layer of each residual block;

passing the feature maps output by the convolutional layers through full convolutional layers respectively, and then obtaining the feature map of the pyramid structure through lateral connections after upsampling.

3. The method of claim 2, wherein the determining a first layer loss value based on the feature map and annotation information for keypoints in the sample human image comprises:

generating a real thermodynamic diagram for each key point according to the labeling information of the key points in the sample human body image;

generating a predetermined number of first predictive thermodynamic diagrams according to the feature map, wherein each first predictive thermodynamic diagram corresponds to a key point;

a first layer loss value is determined based on a positional deviation of each keypoint in the real thermodynamic diagram from the first predicted thermodynamic diagram.

4. The method of claim 1, wherein the determining a second layer loss value based on the position coordinates of the detected keypoints and annotation information of keypoints in the sample human body image comprises:

and determining a second layer loss value based on the position deviation of each key point in the real thermodynamic diagram and the second predicted thermodynamic diagram.

5. The method according to one of claims 1-4, wherein the method further comprises:

in response to determining that the initial first model and the initial second model are not trained, adjusting relevant parameters in the initial first model and the initial second model, and reselecting samples from the sample set, and continuing the training step using the adjusted initial first model and initial second model.

6. A method for detecting human body key points, comprising:

acquiring a human body image of a detection object;

inputting the human body image into a human body key point detection model generated by adopting the method of any one of claims 1 to 4, and generating position coordinates of the human body key points of the detection object.
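
One common way to turn per-keypoint maps into position coordinates is to take the location of each map's maximum. The claims do not state how the trained model derives coordinates, so the following is only a hedged sketch of that generic technique; `decode_keypoints` is a hypothetical helper.

```python
def decode_keypoints(heatmaps):
    """Return one (x, y) pair per map: the column and row of the largest value.

    heatmaps: list of 2D maps (lists of rows), one per key point.
    Ties are resolved in favour of the first maximum in row-major order.
    """
    coords = []
    for hm in heatmaps:
        best = max(
            ((v, x, y) for y, row in enumerate(hm) for x, v in enumerate(row)),
            key=lambda item: item[0],
        )
        coords.append((best[1], best[2]))
    return coords
```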

7. An apparatus for generating a human keypoint detection model, comprising:

an acquisition unit configured to acquire a sample set, wherein samples in the sample set comprise a sample human body image and labeling information of key points in the sample human body image;

a training unit configured to select samples from the sample set and to perform the following training steps: inputting a sample human body image of a selected sample into an initial first model to obtain a feature map of a pyramid structure, wherein the initial first model adopts ResNet101 as a basic network structure; determining a first layer loss value based on the feature map and the labeling information of key points in the sample human body image; inputting the feature map into an initial second model to obtain the position coordinates of the detected key points, wherein the initial second model is a neural network based on an attention model; determining a second layer loss value based on the position coordinates of the detected key points and the labeling information of the key points in the sample human body image; determining whether training of the initial first model and the initial second model is completed based on the first layer loss value and the second layer loss value; in response to determining that training of the initial first model and the initial second model is complete, determining the initial first model and the initial second model as a human body key point detection model;

wherein the training unit is further configured to:

generating an attention feature map according to the feature map;

the training unit is further configured to:

8. The apparatus of claim 7, wherein the training unit is further configured to:

9. The apparatus of claim 8, wherein the training unit is further configured to:

10. The apparatus of claim 7, wherein the training unit is further configured to:

11. The apparatus according to one of claims 7-10, wherein the apparatus further comprises an adjustment unit configured to:

12. An apparatus for detecting human keypoints, comprising:

a detection unit configured to acquire a human body image of a detection object;

a generating unit configured to input the human body image into a human body key point detection model generated by the method according to any one of claims 1 to 5, and generate position coordinates of human body key points of the detection object.

13. An electronic device, comprising:

one or more processors;

storage means for storing one or more programs;

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.

14. A computer-readable medium, on which a computer program is stored, wherein the computer program, when being executed by a processor, carries out the method according to any one of claims 1-6.