WO2021007859A1

WO2021007859A1 - Method and apparatus for estimating pose of human body

Info

Publication number: WO2021007859A1
Application number: PCT/CN2019/096587
Authority: WO
Inventors: 谭文伟
Original assignee: 华为技术有限公司
Priority date: 2019-07-18
Filing date: 2019-07-18
Publication date: 2021-01-21
Also published as: CN113892113A

Abstract

A method and apparatus for estimating the pose of a human body, the method comprising: inputting an image to be processed into a neural network and detecting K classes of key points in said image in parallel to obtain a detection heat map and a tag pool for each class of key points (S301); acquiring, according to the detection heat map of each class of key points, the peaks in that detection heat map (S302); acquiring, according to the tag pool of each class of key points, the tag value of the key point corresponding to each peak in the detection heat map of that class (S303); and clustering key points with similar tag values as connected key points of the same human body (S304). The method and apparatus may improve the efficiency and accuracy of multi-person human body pose estimation.

Description

Human body posture estimation method and device

Technical field

The embodiments of the present application relate to the field of image processing, and in particular to a method and device for estimating a human body pose.

Background

Human body posture estimation has received more and more attention due to its important application value and theoretical significance. At present, the research on single-person human body pose estimation has reached high accuracy, and the research direction of the industry focuses on multi-person human body pose estimation.

One multi-person human body pose estimation method is the top-down method, which mainly includes: detecting each human body with a human body detector, determining its position, and framing it; then predicting the key points of each detected human body independently to obtain the final pose prediction. The top-down method is sensitive to detection quality: when a human body is occluded, or the background is complicated and easily confused, detection often degrades. Complex postures are also difficult for the top-down approach to handle. In addition, the time cost grows in proportion to the number of people.

Another multi-person human body pose estimation method is the bottom-up method, which mainly includes key point detection and clustering. First, all key points of all categories in the picture are detected; the key points are then clustered so that the key points belonging to each person are connected together, and the clustering yields the different individuals. Compared with the top-down method, the bottom-up method has slightly lower accuracy, but it has great advantages in time efficiency.

Many current multi-person human body pose estimation methods have their own advantages and disadvantages. How to balance efficiency and accuracy is an urgent problem in multi-person human body pose estimation.

Summary of the invention

The embodiments of the present application provide a method and device for estimating a human body posture to improve the efficiency and accuracy of multi-person human posture estimation.

In order to achieve the foregoing objectives, the following technical solutions are adopted in the embodiments of this application:

In a first aspect, a method for estimating a human body pose is provided. The method may include: inputting an image to be processed into a neural network and detecting K types of key points in the image to be processed in parallel, to obtain a detection heat map and a tag pool for each type of key point, wherein the detection heat map of a type of key point indicates the possibility of that type of key point appearing at different positions in the image to be processed, the tag pool of a type of key point includes the tag value of each key point of that type in the image to be processed, and the tag value of a key point indicates the human body group to which the key point belongs; obtaining, according to the detection heat map of each type of key point, the peaks in that detection heat map; obtaining, according to the tag pool of each type of key point, the tag value of the key point corresponding to each peak in the detection heat map of that type; and clustering key points with similar tag values as the connected key points of the same person.

With the human body pose estimation method provided in this application, in multi-person human body pose estimation, parallel detection reduces the computation of the model and increases its speed. Assigning a tag value to each key point as prior knowledge for grouping not only improves the accuracy of clustering and grouping but also allows grouping to be carried out in parallel, which improves grouping efficiency. The efficiency and accuracy of multi-person human body pose estimation are therefore both improved.

With reference to the first aspect, in a possible implementation, the neural network is configured with a grouping loss function, and the grouping loss function is used to assign the tag values of the key points. In this way, a tag value is assigned to each key point as prior knowledge for grouping, which improves the accuracy and efficiency of clustering and grouping.

In combination with the first aspect or any of the foregoing possible implementations, in another possible implementation, a specific implementation of the grouping loss function is provided. In the grouping loss function, N is the number of human bodies in the image to be processed; w(y) represents the tag value of the key point at coordinate y; and the reference value of the n-th person represents the mean of the tag values at all real key point positions of that person.

In combination with the first aspect or any of the foregoing possible implementations, in another possible implementation, the neural network is configured with a detection loss function, and the detection loss function is used to calculate the mean square error between the predicted detection heat map and the true key point heat map, so as to output the detection heat map.

In combination with the first aspect or any of the foregoing possible implementations, in another possible implementation, the label values of key points are allocated according to the spatial constraint relationship, and the key points of the same person in the spatial constraint relationship are allocated similar label values.

In combination with the first aspect or any of the foregoing possible implementations, in another possible implementation, that the detection heat map of a type of key point indicates the possibility of that type of key point appearing at different positions in the image to be processed includes: the detection heat map of a type of key point represents the Gaussian distribution of that type of key point over the different positions in the image to be processed; or the detection heat map of a type of key point represents the probability of that type of key point appearing at different positions in the image to be processed. For example, the detection heat map may be a confidence map.

In a second aspect, a device for estimating a human body pose is provided. The device may include a detection unit, an acquisition unit, and a clustering unit. The detection unit is used to input the image to be processed into the neural network and detect K types of key points in the image to be processed in parallel, to obtain a detection heat map and a tag pool for each type of key point, wherein the detection heat map of a type of key point indicates the possibility of that type of key point appearing at different positions in the image to be processed, the tag pool of a type of key point includes the tag value of each key point of that type in the image to be processed, and the tag value of a key point indicates the human body group to which the key point belongs. The acquisition unit is used to obtain, according to the detection heat map of each type of key point obtained by the detection unit, the peaks in that detection heat map. The acquisition unit is also used to obtain, according to the tag pool of each type of key point obtained by the detection unit, the tag value of the key point corresponding to each peak. The clustering unit is used to cluster the key points with similar tag values obtained by the acquisition unit as the connected key points of the same person.

With the human body posture estimation device provided in this application, in multi-person human body pose estimation, parallel detection reduces the computation of the model and increases its speed. Assigning a tag value to each key point as prior knowledge for grouping not only improves the accuracy of clustering and grouping but also allows grouping to be carried out in parallel, which improves grouping efficiency. The efficiency and accuracy of multi-person human body pose estimation are therefore both improved.

It should be noted that the human body posture estimation device provided in the second aspect is used to implement the human body posture estimation method provided in the above-mentioned first aspect, and the specific implementation may refer to the specific implementation of the above-mentioned first aspect.

In a third aspect, an embodiment of the present application provides a human body pose estimation device, the device includes a processor, and is configured to implement the human body pose estimation method described in the first aspect. The device may further include a memory, which is coupled to the processor, and when the processor executes the instructions stored in the memory, the human body posture estimation method described in the first aspect can be implemented. The device may further include a communication interface, which is used for the device to communicate with other devices. Exemplarily, the communication interface may be a transceiver, a circuit, a bus, a module, or other types of communication interfaces.

It should be noted that the instructions in the memory in this application can be pre-stored or downloaded from the Internet when the device is used and then stored. This application does not specifically limit the source of the instructions in the memory. The coupling in the embodiments of the present application is an indirect coupling or connection between devices, units or modules, which can be electrical, mechanical or other forms, used for information exchange between devices, units or modules.

In a fourth aspect, the embodiments of the present application also provide a computer-readable storage medium, including instructions which, when run on a computer, cause the computer to execute the human body posture estimation method described in any one of the above aspects or any one of the possible implementations.

In a fifth aspect, the embodiments of the present application also provide a computer program product, which when running on a computer, causes the computer to execute the human body posture estimation method described in any one of the above-mentioned aspects or any possible implementation manner.

In a sixth aspect, the embodiments of the present application provide a chip system, which includes a processor and may also include a memory, configured to implement the functions in the foregoing method. The chip system can be composed of chips, or can include chips and other discrete devices.

The solutions provided in the third aspect to the sixth aspect described above are used to implement the human body posture estimation method provided in the first aspect described above, and therefore can achieve the same beneficial effects as the first aspect, and will not be repeated here.

Description of the drawings

FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the application;

FIG. 2 is a schematic structural diagram of a human body posture estimation device provided by an embodiment of the application;

FIG. 3 is a schematic flowchart of a method for estimating a human body pose provided by an embodiment of this application;

FIG. 4 is a schematic diagram of another application scenario provided by an embodiment of the application;

FIG. 4a is a schematic diagram of a detection heat map provided by an embodiment of the application;

FIG. 5 is a comparison diagram before and after image processing provided by an embodiment of the application;

FIG. 6 is a schematic diagram of tag value clustering provided by an embodiment of the application;

FIG. 7 is a diagram, provided by an embodiment of the application, comparing human body pose estimation performed by the OpenPose algorithm with that performed by the algorithm of the application;

FIG. 8 is a diagram, provided by an embodiment of the application, comparing human body pose estimation performed by the OpenPose algorithm with that performed by the algorithm of the application;

FIG. 9 is a diagram, provided by an embodiment of the application, comparing human body pose estimation performed by the OpenPose algorithm with that performed by the algorithm of the application;

FIG. 10 is a schematic structural diagram of another apparatus for estimating a human body pose according to an embodiment of the application;

FIG. 11 is a schematic structural diagram of another apparatus for estimating a human body pose according to an embodiment of the application.

Detailed description

In the embodiments of the present application, in order to clearly describe the technical solutions, words such as "first" and "second" are used to distinguish identical or similar items that have basically the same function and effect. Those skilled in the art can understand that words such as "first" and "second" do not limit quantity or order of execution, and do not necessarily indicate that the items differ. Describing technical features as "first" and "second" implies no order of precedence or magnitude.

In the embodiments of the present application, words such as "exemplary" or "for example" are used to present examples, illustrations, or explanations. Any embodiment or design solution described as "exemplary" or "for example" in the embodiments of the present application should not be construed as being more preferable or advantageous than other embodiments or design solutions. Rather, words such as "exemplary" or "for example" are used to present related concepts in a specific manner to facilitate understanding.

In the description of this application, unless otherwise specified, "/" means that the related objects are in an "or" relationship; for example, A/B can mean A or B. "And/or" merely describes an association between associated objects and covers three cases; for example, "A and/or B" can mean: A exists alone, A and B exist at the same time, or B exists alone, where A and B may each be singular or plural. Also, unless otherwise specified, "plurality" means two or more. "At least one of the following items" or similar expressions refers to any combination of these items, including a single item or any combination of plural items. For example, at least one of a, b, or c can mean: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple.

In the embodiments of the present application, at least one can also be described as one or more, and the multiple can be two, three, four or more, which is not limited in this application.

In addition, the network architectures and scenarios described in the embodiments of this application are intended to illustrate the technical solutions of the embodiments more clearly, and do not constitute a limitation on the technical solutions provided in the embodiments of this application. Those of ordinary skill in the art will know that, with the evolution of network architectures and the emergence of new business scenarios, the technical solutions provided in the embodiments of the present application are equally applicable to similar technical problems.

The method provided by the embodiment of the present application can be used in a neural network for human body posture estimation, and the neural network can be a stacked hourglass network or other structure, which is not specifically limited in the embodiment of the present application. Fig. 1 shows a schematic diagram of the application scenario of the present application. As shown in Fig. 1, an input image is input to a neural network, and the neural network performs human posture estimation to obtain an output image.

It should be noted that the neural network is illustrated as a cascaded hourglass network in FIG. 1, but it is not specifically limited.

At present, when the neural network performs human body pose estimation, although the single-person human body pose estimation has reached a higher accuracy, the method of multi-person human body pose estimation still needs improvement.

Based on this, this application provides a human body pose estimation method whose basic principle is as follows: in bottom-up multi-person human body pose estimation, key points are detected in parallel, and the detection heat map and tag pool of each type of key point are obtained while the key points are detected. In this way, parallel detection reduces the computation of the model and increases its speed. Assigning a tag value to each key point as prior knowledge for grouping not only improves the accuracy of clustering and grouping but also allows grouping to be performed in parallel, which improves grouping efficiency. The efficiency and accuracy of multi-person human body pose estimation are therefore both improved.

The implementation of the embodiments of the present application will be described in detail below in conjunction with the accompanying drawings.

FIG. 2 is a schematic diagram of the composition of a human body posture estimation device 20 provided by an embodiment of the application. As shown in FIG. 2, the human body posture estimation device 20 may include at least one processor 21, a memory 22, a communication interface 23, and a communication bus 24. Each component of the human body posture estimation device 20 is described below with reference to FIG. 2.

The processor 21 may be a single processor or a collective term for multiple processing elements. For example, the processor 21 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application, for example, one or more digital signal processors (DSP) or one or more field programmable gate arrays (FPGA).

The processor 21 can execute the various functions of the human body posture estimation device 20 by running or executing a software program stored in the memory 22 and calling data stored in the memory 22. In a specific implementation, as an embodiment, the processor 21 may include one or more CPUs, such as CPU0 and CPU1 shown in FIG. 2.

In a specific implementation, as an embodiment, the human body posture estimation device 20 may include multiple processors, such as the processor 21 and the processor 25 shown in FIG. 2. Each of these processors can be a single-core processor (single-CPU) or a multi-core processor (multi-CPU). The processor here may refer to one or more devices, circuits, and/or processing cores for processing data (for example, computer program instructions).

The memory 22 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM) or another type of dynamic storage device that can store information and instructions. It may also be an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these. The memory 22 may exist independently and be connected to the processor 21 through the communication bus 24, or it may be integrated with the processor 21. The memory 22 stores the software program for executing the solution of the present application, and the processor 21 controls its execution.

The communication interface 23 uses any transceiver-type device to communicate with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN). The communication interface 23 may include a receiving unit and a sending unit.

The communication bus 24 may be an industry standard architecture (ISA) bus, a peripheral component interconnect (PCI) bus, or an extended industry standard architecture (EISA) bus. The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 2, but this does not mean that there is only one bus or one type of bus.

It should be pointed out that the components shown in FIG. 2 do not constitute a limitation on the device. The device may include more or fewer components than those shown in the figure, combine certain components, or use a different arrangement of components.

Specifically, the processor 21 executes the following functions by running or executing software programs and/or modules stored in the memory 22, and calling data stored in the memory 22:

Input the image to be processed into the neural network and detect K types of key points in the image to be processed in parallel, to obtain a detection heat map and a tag pool for each type of key point, where the detection heat map of a type of key point indicates the possibility of that type of key point appearing at different positions in the image to be processed, the tag pool of a type of key point includes the tag value of each key point of that type in the image to be processed, and the tag value of a key point indicates the human body group to which the key point belongs; obtain, according to the detection heat map of each type of key point, the peaks in that detection heat map; obtain, according to the tag pool of each type of key point, the tag value of the key point corresponding to each peak in the detection heat map of that type; and cluster the key points with similar tag values as the connected key points of the same person.

In one aspect, an embodiment of the present application provides a method for estimating a human body pose. As shown in FIG. 3, the method may include:

S301: Input the image to be processed into the neural network, and detect K types of key points in the image to be processed in parallel, to obtain a detection heat map and a tag pool for each type of key point.

Specifically, S301 may be performed pixel by pixel in the image to be processed, or the image to be processed may be cut into multiple regions, and S301 may be performed region by region, which is not specifically limited in the embodiment of the present application. The term “parallel” in this article refers to non-serial.

Among them, the key points may be joint points on the human body or others, which are not specifically limited in the embodiments of the present application. The category number K of the key points can be configured according to actual needs, and the embodiment of the application does not specifically limit it.

For example, 17 key points are defined in the COCO data set, namely: 0-nose, 1-left eye, 2-right eye, 3-left ear, 4-right ear, 5-left shoulder, 6-right shoulder, 7-left elbow, 8-right elbow, 9-left wrist, 10-right wrist, 11-left hip, 12-right hip, 13-left knee, 14-right knee, 15-left ankle, 16-right ankle.

The detection heat map of a type of key point indicates the possibility of that type of key point appearing at different positions in the image to be processed; the tag pool of a type of key point includes the tag value of each key point of that type in the image to be processed, and the tag value of a key point indicates the human body group to which the key point belongs.

In a possible implementation, that the detection heat map of a type of key point indicates the possibility of that type of key point appearing at different positions in the image to be processed can be specifically implemented as follows: the detection heat map of a type of key point represents the Gaussian distribution of that type of key point over the different positions in the image to be processed. In multi-person human body pose estimation, since an image contains multiple human bodies, the heat map has multiple peaks. For example, the detection heat map of the left elbow key point contains the left elbow peaks of multiple people.

In another possible implementation, the detection heat map of a type of key points indicates the possibility of such key points appearing in different positions in the image to be processed, which can be specifically implemented as follows: the detection heat map of a type of key points represents the image to be processed The probability of such key points appearing in different positions. For example, the detection heat map can be a two-dimensional confidence map, which expresses the probability of key points by color depth.
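As an illustration of a Gaussian-style detection heat map with several peaks, the following minimal sketch builds one heat map for a single key point class. The function name `gaussian_heatmap`, the `sigma` value, the grid size, and the use of a per-cell maximum over people are all assumptions made for this example; the application does not specify them.

```python
import math

def gaussian_heatmap(width, height, centers, sigma=2.0):
    """Return a height x width grid; each cell holds the maximum, over all
    centers, of an unnormalized Gaussian centred on that key point."""
    heatmap = [[0.0] * width for _ in range(height)]
    for cx, cy in centers:
        for y in range(height):
            for x in range(width):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                value = math.exp(-d2 / (2.0 * sigma ** 2))
                # Taking the max keeps one distinct peak per person.
                heatmap[y][x] = max(heatmap[y][x], value)
    return heatmap

# Two people in the image, so this key point class has two peaks.
hm = gaussian_heatmap(16, 16, centers=[(4, 5), (11, 9)])
```

Each centre produces a value of 1.0 at its own position, and the values decay towards 0 away from every key point.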

In a specific implementation, a detection loss function can be configured for the neural network, and the detection loss function is used to calculate the mean square error between the predicted detection heat map and the true key point heat map to output the detection heat map. The content of the detection loss function can be configured according to actual requirements, which is not specifically limited in the embodiment of the present application.

For example, the detection loss function adopts the form in the paper "Stacked hourglass networks for human pose estimation", which will not be repeated here.
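A mean-square-error detection loss of the kind described above can be sketched as follows. This is a plain-Python illustration over nested lists, not the form used in the cited paper or in the application; the name `detection_loss` and the averaging over all cells of all K heat maps are assumptions.

```python
def detection_loss(predicted, target):
    """Mean squared error over all cells of K heat maps.

    predicted and target are lists of K grids (each a list of rows)."""
    total, count = 0.0, 0
    for pred_map, true_map in zip(predicted, target):
        for pred_row, true_row in zip(pred_map, true_map):
            for p, t in zip(pred_row, true_row):
                total += (p - t) ** 2
                count += 1
    return total / count

# One key point class (K = 1) on a 2 x 2 grid.
pred = [[[0.0, 0.5], [1.0, 0.0]]]
true = [[[0.0, 1.0], [1.0, 0.0]]]
loss = detection_loss(pred, true)
```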

Specifically, the label values of key points can be assigned according to the spatial constraint relationship, and the key points of the same person in the spatial constraint relationship are assigned similar label values.

Among them, the spatial constraint relationship can be used to determine which areas in the image are the same human body and which areas are different human bodies. This application does not limit the specific content of the spatial constraint relationship, and can be configured according to actual needs.

For example, in a possible implementation, the spatial constraint relationship can exploit the fact that the pixels of the same human body lie in adjacent areas of the image and are relatively close together, whereas different human bodies are farther apart in image space. To assign joint points to the individuals they belong to, the area of each individual in the image is determined through non-maximum suppression, which yields which areas in the image belong to the same human body and which belong to different human bodies.

Specifically, different human bodies may be configured with different label value intervals in advance, and the label values may be obtained by determining the human body according to the spatial constraint relationship and the label value intervals configured for different human bodies.

Tag values are similar when the absolute value of their difference is less than a preset threshold. The specific value of the preset threshold can be configured according to actual needs, which is not specifically limited in the embodiment of the present application.
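The threshold test on tag values can be turned into a simple grouping step, sketched below. The greedy strategy, the comparison against a group's mean tag, the rule that a group holds at most one key point per class, and the names `group_by_tag` and `threshold` are assumptions for illustration only; the application leaves the clustering details open.

```python
def group_by_tag(keypoints, threshold=1.0):
    """Greedily group detections whose tags differ by less than threshold.

    keypoints: list of (class_id, (x, y), tag).
    Returns a list of groups, each a dict mapping class_id -> (x, y)."""
    groups = []  # each entry: {"tags": [...], "joints": {...}}
    for class_id, position, tag in keypoints:
        best, best_gap = None, threshold
        for group in groups:
            mean_tag = sum(group["tags"]) / len(group["tags"])
            gap = abs(tag - mean_tag)
            # "Similar" means the absolute difference is below the threshold.
            if gap < best_gap and class_id not in group["joints"]:
                best, best_gap = group, gap
        if best is None:
            # No similar group exists, so this detection starts a new person.
            best = {"tags": [], "joints": {}}
            groups.append(best)
        best["tags"].append(tag)
        best["joints"][class_id] = position
    return [g["joints"] for g in groups]

detections = [
    (0, (10, 12), 0.11), (0, (40, 15), 5.02),  # noses
    (5, (8, 20), 0.09), (5, (38, 24), 4.97),   # left shoulders
]
people = group_by_tag(detections, threshold=1.0)
```

With these made-up tag values, the two detections with tags near 0.1 end up in one group and the two with tags near 5.0 in another.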

Specifically, according to the spatial constraint relationship, similar tag values can be generated randomly for the different key points of the same person, or generated for them according to a preset algorithm, which is not specifically limited in the embodiment of the present application. It should be noted that the content of the preset algorithm can be configured according to actual needs.

For example, in multi-person human pose estimation, a real number is generated every time a key point is detected, and the real number is used as the label value to indicate the group to which the detected pixel belongs. Of course, the type of the tag value may also be in other forms than real numbers, which are not specifically limited in the embodiment of the present application.

In a specific implementation, a grouping loss function can be configured for the neural network, and the grouping loss function is used to assign the label value of the key point. The content of the grouping loss function can be configured according to actual needs, which is not specifically limited in the embodiment of the present application.

Exemplarily, the embodiment of the present application provides a specific implementation of a grouping loss function, given as equation (1).

In equation (1), N is the number of human bodies in the image to be processed; $w \in \mathbb{R}^{W \times H}$ represents the tag pool corresponding to the image to be processed; w(y) represents the tag value of the key point at coordinate y; and the reference value of the n-th person represents the mean of the tag values at all real key point positions of that person. For a given picture containing N people, the real position coordinates of the human joint points of these N people are $T = \{y_{nk}\}$, $n = 1, \ldots, N$, $k = 1, \ldots, K$.

The following briefly describes the design idea of the grouping loss function.

The grouping of key points is equivalent to a clustering problem. In order to get a better grouping result of key points, the generated prediction tag pool needs to aggregate the joint points of the same person as much as possible, and make the joint points of different people as much as possible Separation, in order to reduce the error of using clustering algorithm to group key points during testing. To establish the loss separately, first use the k-means clustering method for the same person, that is, introduce the reference value

Represents the mean value of the label values of all the real joint points of the nth person, and the mathematical expression is as follows (2).

According to the definition of k-means clustering, the square loss function is established as the following equation (3).

For the case of different people, let the mean of the label values of all the real joint points of the nth person be

Let the average value of the label values of all the real joint points of the n′th person be

Assume

Indicates the error between the former and the latter. In order to measure the distance between the two, the square loss function is also introduced as the following equation (4).

As mentioned above, it is necessary to make equation (3) smaller and make equation (4) larger. In order to unify (3) and (4) into a loss function, it is necessary to find the minimum value of equation (4), so a negative exponential function is introduced, and equation (4) is rewritten to obtain the following equation (5).

Combining (3) and (5) yields the grouping loss function, which is the following equation (6).
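Exemplarily, one form that is consistent with the verbal description of equations (2) to (6) can be sketched as follows. The notation is an assumption made for illustration: $\bar{w}_n$ stands for the reference value of the nth person, $w(y_{nk})$ for the tag value at the ground-truth position of the kth key point of the nth person, and $\sigma$ for a scale parameter of the negative exponential. Any normalization factors (for example division by N) are omitted because they cannot be inferred from the description.

```latex
\begin{align}
\bar{w}_n &= \frac{1}{K}\sum_{k=1}^{K} w(y_{nk}) \tag{2}\\
L_{\mathrm{same}} &= \sum_{n=1}^{N}\sum_{k=1}^{K}\bigl(\bar{w}_n - w(y_{nk})\bigr)^{2} \tag{3}\\
L_{\mathrm{diff}} &= \sum_{n=1}^{N}\sum_{n' \neq n}\bigl(\bar{w}_n - \bar{w}_{n'}\bigr)^{2} \tag{4}\\
\tilde{L}_{\mathrm{diff}} &= \sum_{n=1}^{N}\sum_{n' \neq n}\exp\!\Bigl(-\tfrac{1}{2\sigma^{2}}\bigl(\bar{w}_n - \bar{w}_{n'}\bigr)^{2}\Bigr) \tag{5}\\
L_g &= L_{\mathrm{same}} + \tilde{L}_{\mathrm{diff}} \tag{6}
\end{align}
```

In this form, minimizing (3) draws the tags of one person toward that person's mean, while minimizing (5) is equivalent to enlarging the squared distance in (4) between the means of different people.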

The total loss is the weighted sum of the grouping loss and the detection loss: $\mathrm{Loss} = m L_g + n L_d$, where $L_g$ is the grouping loss, $L_d$ is the detection loss, and m and n are hyperparameters.
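As a minimal sketch of the loss design described above, the following Python code computes a grouping loss from the tag values read at ground-truth key point positions and combines it with a detection loss. The function names, the scale parameter `sigma`, the absence of normalization, and the assumption that every person has all K key points annotated are illustrative choices, not details stated in this application.

```python
import numpy as np

def grouping_loss(tags, sigma=1.0):
    """Grouping loss for tags of shape (N, K): N people, K key points each.

    tags[n, k] is the tag value read from the tag pool at the ground-truth
    position of key point k of person n (all key points assumed visible).
    """
    tags = np.asarray(tags, dtype=float)
    n_people = tags.shape[0]
    means = tags.mean(axis=1)  # reference value of each person
    # Same-person term: squared deviation of each tag from its person's mean.
    pull = np.sum((tags - means[:, None]) ** 2)
    # Different-person term: negative exponential of the squared distance
    # between the means of every ordered pair of distinct people.
    diff = means[:, None] - means[None, :]
    off_diagonal = ~np.eye(n_people, dtype=bool)
    push = np.sum(np.exp(-diff[off_diagonal] ** 2 / (2.0 * sigma ** 2)))
    return float(pull + push)

def total_loss(l_g, l_d, m=1.0, n=1.0):
    """Weighted sum Loss = m * L_g + n * L_d with hyperparameters m and n."""
    return m * l_g + n * l_d
```

When each person's tags are identical and the means of different people are far apart, both terms approach zero, which is the behavior the grouping loss is intended to reward.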

It should be noted that the design idea of the above-mentioned grouping loss function is only an example for illustration, and all grouping loss functions based on this idea fall within the grouping loss function described in this application.

FIG. 4 is a schematic diagram of a scenario of the human body pose estimation method described in an embodiment of the application. As shown in FIG. 4, the input image is fed to the neural network (a cascaded hourglass network is shown in the figure), and the process of S301 is executed to obtain the detection heat map and tag pool shown in FIG. 4. Exemplarily, the detection heat maps obtained from the input image in FIG. 4 are shown in FIG. 4a. Each small image in FIG. 4a is the detection heat map of one type of key point, and the bright spots in each small image mark the peak positions of the key points.

For example, this solution can adopt a 4th-order hourglass structure, with a network input size of 256×256 and an output size of 128×128. If there are K types of human key points to be predicted, the network has 2K output channels: K channels are used for detection and K channels are used for grouping. For the COCO data set, since 17 types of key points are annotated, the final output of the network has 34 channels, of which the first 17 channels output the joint point detection heat maps and the last 17 channels output the grouping tag information of the joint points.
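The channel layout described above can be sketched as follows. This is a minimal illustration that assumes a channel-first array of shape (2K, H, W); the function name `split_output` and the all-zero dummy array are hypothetical and stand in for a real network output.

```python
import numpy as np

def split_output(output, num_keypoints):
    """Split a (2K, H, W) network output into K heat maps and K tag pools."""
    output = np.asarray(output)
    if output.shape[0] != 2 * num_keypoints:
        raise ValueError("expected 2K output channels")
    heatmaps = output[:num_keypoints]    # first K channels: detection
    tag_pools = output[num_keypoints:]   # last K channels: grouping
    return heatmaps, tag_pools

# COCO case from the text: K = 17 key point types, 128x128 output.
dummy_output = np.zeros((34, 128, 128), dtype=np.float32)
heatmaps, tag_pools = split_output(dummy_output, 17)
```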

The following describes the execution process of this application with specific examples.

As shown in Figure 5, the original image on the left is taken from the COCO data set. Applying the model of this application to it yields a detection heat map and a tag pool for each type of key point. The detection heat map can be a Gaussian distribution for each type of key point. The tag values contained in the 17 tag pools are as follows (arranged in the order of the 17 key points in the COCO data set):

[-1.8509495, 1.0292919, 4.163754], [-1.8537576, 1.0384212, 4.1641073], [-1.8466513, 1.0294132, 4.1671414], [-1.8384541, 1.0231285, 0], [-1.8542455, 0, 4.150399], [-1.8703872, 1.0921177, 4.120116], [-1.8843985, 1.0381733, 4.1086555], [-1.8894546, 1.1279857, 4.1338553], [-1.8111589, 1.057544, 4.1566596], [-1.9049348, 0, 4.1225743], [-1.8559527, 1.0369155, 4.124661], [-1.9162827, 1.0587022, 4.111921], [-1.9004283, 1.1545397, 4.1475873], [-1.9555404, 0, 4.185656], [-1.8526844, 0, 4.170965], [0, 0, 0], [0, 0, 0]. Here, 0 means that the key point information is not available.

S302. Obtain the peaks in the detection heat map of each type of key point according to the detection heat map of each type of key point.

Specifically, in the detection heat map of each type of key point, the positions where key points are most likely to be located are the peaks.

For example, based on the example in S301, in the detection heat map of the nose key points, 3 peaks can be obtained.
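This application does not specify how the peaks are extracted. A minimal sketch, assuming that a peak is a pixel whose value exceeds a threshold and is strictly greater than all eight of its neighbors, could look like this. The function name `find_peaks` and the `threshold` value are illustrative assumptions.

```python
import numpy as np

def find_peaks(heatmap, threshold=0.1):
    """Return (row, col) of strict local maxima above threshold."""
    h = np.asarray(heatmap, dtype=float)
    rows, cols = h.shape
    # Pad with -inf so that border pixels can also be peaks.
    padded = np.pad(h, 1, mode="constant", constant_values=-np.inf)
    is_peak = h > threshold
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            # Neighbor map shifted by (dr, dc) relative to h.
            neighbor = padded[1 + dr:1 + dr + rows, 1 + dc:1 + dc + cols]
            is_peak &= h > neighbor
    return [(int(r), int(c)) for r, c in zip(*np.nonzero(is_peak))]
```

With this rule, a heat map of the nose key points that contains three separated bright spots yields three peaks, one per person. Flat plateaus produce no peak under the strict comparison, which is a simplification of this sketch.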

S303. Obtain the tag value of the key point corresponding to each peak according to the tag pool of each type of key point.

Specifically, in S303, the tag value of the key point corresponding to each peak is read from the tag pool at the coordinate position of that peak in the image to be processed.

For example, based on the example in S301, the three peaks in the detection heat map of the nose key points obtained in S302 each have their own coordinates. According to these coordinates, the tag values at the corresponding positions are read from the tag pool of the nose key point, giving [-1.8509495, 1.0292919, 4.163754].
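The lookup in S303 can be sketched as a simple indexing operation. In the following minimal example, the function name `tags_at_peaks`, the 8×8 tag pool, and the peak coordinates are hypothetical; only the three tag values are taken from the nose example above.

```python
import numpy as np

def tags_at_peaks(tag_pool, peaks):
    """Read the tag value stored at each (row, col) peak position."""
    tag_pool = np.asarray(tag_pool, dtype=float)
    return [float(tag_pool[r, c]) for r, c in peaks]

# Hypothetical tag pool of the nose key point with three people.
nose_tag_pool = np.zeros((8, 8))
nose_peaks = [(1, 2), (4, 5), (6, 1)]   # assumed peak coordinates from S302
nose_tag_pool[1, 2] = -1.8509495
nose_tag_pool[4, 5] = 1.0292919
nose_tag_pool[6, 1] = 4.163754

nose_tags = tags_at_peaks(nose_tag_pool, nose_peaks)
```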

S304. Cluster the key points with similar tag values, and take each cluster as the connected key points of the same human body.

For example, in S304, the Lloyd method in the K-means algorithm can be used to match similar key points. Of course, other clustering methods can also be used, which is not specifically limited in the embodiment of the present application.

It should be noted that when the Lloyd method is used to match similar key points, the k value in the algorithm corresponds to the number of people in the picture. In this application, the number of people in the picture can be determined in S302 as the number of peaks in the detection heat map that has the most peaks. The "distance" in the Lloyd method can be the Euclidean distance. In actual operation, other distance measures can also be selected according to the actual situation to obtain the best key point grouping.

For example, based on the example in S301, using the Lloyd method to cluster the label values in S304, the grouped label values can be obtained as follows:

[-1.8509495, -1.8537576, -1.8466513, -1.8384541, -1.8542455, -1.8703872, -1.8843985, -1.8894546, -1.8111589, -1.9049348, -1.8559527, -1.9162827, -1.9004283, -1.9555404, -1.8526844, 0, 0];

[1.0292919, 1.0384212, 1.0294132, 1.0231285, 0, 1.0921177, 1.0381733, 1.1279857, 1.057544, 0, 1.0369155, 1.0587022, 1.1545397, 0, 0, 0, 0];

[4.163754, 4.1641073, 4.1671414, 0, 4.150399, 4.120116, 4.1086555, 4.1338553, 4.1566596, 4.1225743, 4.124661, 4.111921, 4.1475873, 4.185656, 4.170965, 0, 0].
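The clustering in S304 can be sketched with a small implementation of the Lloyd method for scalar tag values. The sketch makes several assumptions that are not stated in this application: the function name `lloyd_1d`, the initialization of the centers from the tags of the first key point type that has the most peaks, and the treatment of the value 0 as a marker for an unavailable key point. For brevity, only the first five key point types of the example are used.

```python
import numpy as np

def lloyd_1d(values, init_centers, max_iter=100):
    """Lloyd's algorithm for scalar values with Euclidean distance."""
    values = np.asarray(values, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    labels = np.zeros(len(values), dtype=int)
    for _ in range(max_iter):
        # Assignment step: each tag goes to its nearest center.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # Update step: each center moves to the mean of its members.
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = values[labels == j]
            if members.size:
                new_centers[j] = members.mean()
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Tag values of the first five key point types from the example;
# 0 marks a key point whose information is not available.
tags_per_type = [
    [-1.8509495, 1.0292919, 4.163754],
    [-1.8537576, 1.0384212, 4.1641073],
    [-1.8466513, 1.0294132, 4.1671414],
    [-1.8384541, 1.0231285, 0],
    [-1.8542455, 0, 4.150399],
]
detected = [[t for t in row if t != 0] for row in tags_per_type]
k = max(len(row) for row in detected)               # number of people
init = next(row for row in detected if len(row) == k)
values = [t for row in detected for t in row]
labels, centers = lloyd_1d(values, init)
```

Each entry of `labels` is the index of the person to which the corresponding tag is assigned, so tags near -1.85, 1.03, and 4.16 end up in three separate groups.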

Figure 6 shows the tag values after clustering. It can be seen that the tag values are well separated.

Connecting the key points whose tag values fall in the same group yields the output image on the right in Figure 5, in which the human postures of the original image are recognized well.

This application is a bottom-up method. Compared with the existing top-down method, the operating speed of the model is greatly improved when the accuracy is similar. Turning the serial key point detection into parallel can reduce the calculation amount of the model, greatly reduce the complexity of the model, and increase the processing speed of the model.

In the scenario shown in Figure 7, openpose (left) misses key points and connects key points incorrectly. For example, the right knee is detected as the left knee, and the left hip is connected to the right knee. The method of this application (the right picture in Figure 7) maintains relatively stable performance in this scenario.

The test was performed on the MS COCO test-dev data set, and the accuracy rate was compared with the bottom-up method of openpose. The comparison results are shown in Table 1.

Table 1

Method	AP	AP^50	AP^75	AP^M	AP^L	AR	AR^50	AR^75	AR^M	AR^L
Openpose	0.611	0.844	0.667	0.558	0.684	0.665	0.872	0.718	0.602	0.749
This application	0.665	0.849	0.726	0.612	0.744	0.701	0.867	0.755	0.640	0.789

In addition, the time efficiency was compared with that of openpose, a real-time human pose estimation algorithm. Testing openpose and the method of this application on the same batch of pictures shows that openpose takes 8.492 seconds on average to process a picture, whereas the method of this application takes only 3.409 seconds, that is, about 40% of the time (roughly 2.5 times faster). From the above comparison of accuracy and time efficiency, it can be seen that the comprehensive performance of the method of this application is greatly improved.

Figures 7, 8, and 9 illustrate the detection results of the method of this application and of openpose. In each figure, the left image is the result of openpose, and the right image is the result of the method of this application.

In the foregoing embodiments provided in the present application, the methods provided in the embodiments of the present application are introduced from the perspective of the working principle of the human body posture estimation device. In order to implement the functions in the methods provided in the above embodiments of the present application, the human body posture estimation apparatus may include a hardware structure and/or software module, and the above functions are implemented in the form of a hardware structure, a software module, or a hardware structure plus a software module. Whether one of the above-mentioned functions is executed in a hardware structure, a software module, or a hardware structure plus a software module depends on the specific application and design constraint conditions of the technical solution.

The division of modules in the embodiments of the present application is illustrative and is only a logical function division; in actual implementation, there may be other division methods. In addition, the functional modules in the various embodiments of the present application may be integrated into one processor, or may exist alone physically, or two or more modules may be integrated into one module. The above-mentioned integrated modules can be implemented in the form of hardware or in the form of software functional modules.

In the case of dividing each functional module corresponding to each function, as shown in FIG. 10, the human body posture estimation apparatus 100 provided in this embodiment of the present application is used to implement the functions in the foregoing method. As shown in FIG. 10, the human body posture estimation apparatus 100 may include: a detection unit 1001, an acquisition unit 1002, and a clustering unit 1003. The detection unit 1001 is used to perform S301 in FIG. 3; the acquisition unit 1002 is used to perform S302 and S303 in FIG. 3; the clustering unit 1003 is used to perform S304 in FIG. 3. Among them, all relevant content of each step involved in the above method embodiment can be cited in the function description of the corresponding function module, and will not be repeated here.
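The division into a detection unit, an acquisition unit, and a clustering unit can be illustrated with a small composition sketch. The class name `PoseEstimationApparatus`, the method name `estimate`, and the callable signatures are hypothetical; the units are injected as plain callables so that any implementation of S301 to S304 can be plugged in.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class PoseEstimationApparatus:
    # S301: image -> (detection heat maps, tag pools)
    detection_unit: Callable[[Any], tuple]
    # S302 and S303: (heat maps, tag pools) -> peaks with their tag values
    acquisition_unit: Callable[[Any, Any], Any]
    # S304: tagged peaks -> key points grouped per human body
    clustering_unit: Callable[[Any], Any]

    def estimate(self, image):
        """Run the three units in the order of S301 to S304."""
        heatmaps, tag_pools = self.detection_unit(image)
        tagged_peaks = self.acquisition_unit(heatmaps, tag_pools)
        return self.clustering_unit(tagged_peaks)
```

Keeping the units as separate callables mirrors the logical function division described above, while still allowing them to be integrated into a single processing module.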

In the case of adopting integrated division of various functional modules, as shown in FIG. 11, the human body posture estimation device 110 provided in this embodiment of the present application is used to implement the functions in the foregoing method. The human body posture estimation device 110 includes at least one processing module 1101 for implementing the functions in the method provided in the embodiment of the present application. Exemplarily, the processing module 1101 may be used to execute the processes S301 to S304 in FIG. 3. For details, please refer to the detailed description in the method example, which will not be repeated here.

The human body posture estimation device 110 may further include at least one storage module 1102 for storing program instructions and/or data. The storage module 1102 and the processing module 1101 are coupled. The coupling in the embodiments of the present application is an indirect coupling or communication connection between devices, units, or modules, and may be in electrical, mechanical or other forms, and is used for information exchange between devices, units or modules. The processing module 1101 may cooperate with the storage module 1102 to operate. The processing module 1101 may execute program instructions stored in the storage module 1102. At least one of the at least one storage module may be included in the processing module.

The human body posture estimation apparatus 110 may further include a communication module 1103 for communicating with other devices through a transmission medium, so as to ensure that the human body posture estimation apparatus 110 can communicate with other devices.

When the processing module 1101 is a processor, the storage module 1102 is a memory, and the communication module 1103 is a communication interface, the human body posture estimation apparatus 110 involved in FIG. 11 in the embodiment of the present application may be the human body posture estimation apparatus 20 shown in FIG. 2.

As mentioned above, the human body posture estimation device 100 or the human body posture estimation device 110 provided by the embodiments of the present application may be used to implement the functions in the methods implemented by the various embodiments of the present application. For ease of description, only the parts related to the embodiments of the present application are shown; for specific technical details that are not disclosed, please refer to the various embodiments of this application.

As another form of this embodiment, a computer-readable storage medium is provided, and an instruction is stored thereon, and the method in the foregoing method embodiment is executed when the instruction is executed.

As another form of this embodiment, a computer program product containing instructions is provided, and when the instructions are executed, the method in the foregoing method embodiment is executed.

The embodiment of the present application further provides a chip system. The chip system includes a processor for implementing the technical method in the embodiment of the present invention. In a possible design, the chip system further includes a memory for storing necessary program instructions and/or data of the communication device in the embodiment of the present invention. In a possible design, the chip system further includes a memory for the processor to call application program codes stored in the memory. The chip system may be composed of one or more chips, and may also include chips and other discrete devices, which are not specifically limited in the embodiment of the present application.

The steps of the method or algorithm described in combination with the disclosure of the present application can be implemented in hardware, or by a processor executing software instructions. Software instructions can be composed of corresponding software modules, which can be stored in RAM, flash memory, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor, so that the processor can read information from the storage medium and write information to the storage medium. Of course, the storage medium may also be an integral part of the processor. The processor and the storage medium may be located in an ASIC. In addition, the ASIC may be located in the core network interface device. Of course, the processor and the storage medium may also exist as discrete components in the core network interface device. Alternatively, the memory may be coupled with the processor. For example, the memory may exist independently and be connected to the processor through a bus. The memory can also be integrated with the processor. The memory may be used to store application program code that executes the technical solutions provided in the embodiments of the present application, and the processor controls the execution. The processor is used to execute the application program code stored in the memory, so as to implement the technical solutions provided by the embodiments of the present application.

Through the description of the above embodiments, those skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the above-mentioned functional modules is used as an example for illustration. In practical applications, the above-mentioned functions can be allocated to different functional modules as needed; that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.

In the several embodiments provided in this application, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative. For example, the division of the modules or units is only a logical function division; in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another device, or some features may be omitted or not implemented. In addition, the displayed or discussed mutual coupling, direct coupling, or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.

The units described as separate parts may or may not be physically separate. The parts displayed as units may be one physical unit or multiple physical units, that is, they may be located in one place, or they may be distributed to multiple different places. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage media include: a USB flash drive, a removable hard disk, ROM, RAM, a magnetic disk, an optical disk, and other media that can store program code.

The above are only specific implementations of this application, but the protection scope of this application is not limited to this. Any change or replacement within the technical scope disclosed in this application shall be covered by the protection scope of this application. Therefore, the protection scope of this application should be subject to the protection scope of the claims.

Claims

1. A method for estimating a human body pose, characterized in that it comprises:

inputting an image to be processed into a neural network, and detecting K types of key points in the image to be processed in parallel, to obtain a detection heat map and a tag pool of each type of key point; wherein the detection heat map of a type of key point represents the possibility of this type of key point appearing at different positions in the image to be processed; the tag pool of a type of key point includes the tag value of each key point of this type in the image to be processed, and the tag value of a key point is used to indicate the human body group to which the key point belongs;

obtaining the peaks in the detection heat map of each type of key point according to the detection heat map of each type of key point;

obtaining the tag value of the key point corresponding to each peak according to the tag pool of each type of key point;

taking the cluster of key points with similar tag values as the connected key points of the same human body.
2. The method according to claim 1, wherein the neural network is configured with a grouping loss function, and the grouping loss function is used for allocating tag values of key points.
3. The method according to claim 2, wherein:

The grouping loss function
wherein N is the number of human bodies in the image to be processed; w(y) represents the tag value of the key point at coordinate y; and the reference value represents the mean of the tag values at all ground-truth key point positions of the nth person.
4. The method according to any one of claims 1 to 3, wherein the neural network is configured with a detection loss function, and the detection loss function is used to calculate the mean square error between the predicted detection heat map and the ground-truth key point heat map, so as to output the detection heat map.
5. The method according to any one of claims 1 to 4, wherein the tag value is allocated according to a spatial constraint relationship, under which the key points of the same person are allocated similar tag values.
6. The method according to any one of claims 1 to 5, wherein the detection heat map of a type of key point representing the possibility of this type of key point appearing at different positions in the image to be processed comprises:

The detection heat map of a type of key points represents the Gaussian distribution of this type of key points in different positions in the image to be processed;

or,

The detection heat map of a type of key point represents the probability of this type of key point in different positions in the image to be processed.
7. A human body posture estimation device, characterized in that it comprises:

a detection unit, used to input the image to be processed into the neural network and to detect the K types of key points in the image to be processed in parallel, to obtain the detection heat map and tag pool of each type of key point; wherein the detection heat map of a type of key point represents the possibility of this type of key point appearing at different positions in the image to be processed; the tag pool of a type of key point includes the tag value of each key point of this type in the image to be processed, and the tag value of a key point is used to indicate the human body group to which the key point belongs;

An obtaining unit, configured to obtain a peak value in the detection heat map of each type of key point according to the detection heat map of each type of key point obtained by the detection unit;

The obtaining unit is further configured to obtain the tag value of the key point corresponding to each peak according to the tag pool of each type of key point obtained by the detection unit;

The clustering unit is used for clustering the key points with similar label values obtained by the obtaining unit as the connecting key points of the same person.
8. The device according to claim 7, wherein the neural network is configured with a grouping loss function, and the grouping loss function is used to allocate the label value of the key point.
9. The device according to claim 8, wherein:

The grouping loss function
wherein N is the number of human bodies in the image to be processed; w(y) represents the tag value of the key point at coordinate y; and the reference value represents the mean of the tag values at all ground-truth key point positions of the nth person.
10. The device according to any one of claims 7-9, wherein the neural network is configured with a detection loss function, and the detection loss function is used to calculate the mean square error between the predicted detection heat map and the ground-truth key point heat map, so as to output the detection heat map.
11. The device according to any one of claims 7-10, wherein the tag value is allocated according to a spatial constraint relationship, under which the key points of the same person are allocated similar tag values.
12. The device according to any one of claims 7-11, wherein the detection heat map of a type of key point representing the possibility of this type of key point appearing at different positions in the image to be processed comprises:

The detection heat map of a type of key points represents the Gaussian distribution of this type of key points in different positions in the image to be processed;

or,

The detection heat map of a type of key point represents the probability of this type of key point in different positions in the image to be processed.
13. A human body posture estimation device, comprising a processor and a memory, wherein the memory is coupled to the processor, and the processor is configured to execute the human body posture estimation method according to any one of claims 1 to 6.
14. A computer-readable storage medium, comprising instructions which, when run on a computer, cause the computer to execute the human body posture estimation method according to any one of claims 1 to 6.
15. A computer program product which, when run on a computer, causes the computer to execute the human body posture estimation method according to any one of claims 1 to 6.