CN112001217A - Multi-person human body posture estimation algorithm based on deep learning - Google Patents
- Publication number
- CN112001217A (application CN202010560950.5A)
- Authority
- CN
- China
- Prior art keywords
- human body
- limb
- image
- joint
- postures
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention provides a multi-person human body posture estimation algorithm based on deep learning, which comprises the following steps: an image or video file containing the postures of multiple persons is input into the built model; a 50-layer ResNet network extracts image features of the limbs and joint points of the multiple persons from the input image or video; a Convolutional Pose Machine detects candidate joint points; a Gaussian function selects the heat map of the optimal joint point from the detected candidates; Part Affinity Fields (PAFs) match joint-point pairs to obtain the set of all limb types and joint points required for the human body postures; and the Hungarian algorithm, together with a human body limb frame, matches the limb-type and joint-point sets to assemble the postures, completing the estimation of the postures of the multiple persons in the image. The invention can be applied to a rescue robot platform to accurately and efficiently estimate the postures of multiple persons to be rescued in complex environments such as dust, wetland and narrow spaces.
Description
Technical Field
The invention belongs to the technical field of multi-person human body image processing in a complex environment, and particularly relates to a multi-person human body posture estimation algorithm based on deep learning.
Background
Rescue in outdoor land environments is one of the main forms of human rescue. When facing complex land environments such as sand, dust, wetland and narrow spaces, traditional rescue methods cannot guarantee timely and accurate arrival at the scene, which adds many unstable factors to rescue tasks; at the same time, secondary disasters that may occur during search and rescue greatly threaten the safety of the personnel involved. To make up for the inability of existing search-and-rescue equipment to cover complex terrain on land, it is necessary to develop a portable, highly adaptable ground robot system that can operate across diverse search-and-rescue terrains. The main task of such a rescue robot is to quickly determine the posture of the injured person in preparation for further rescue measures. Because image information about the injured person is rich in content and quick to acquire, computer vision (CV) technology is widely used in land rescue robots. Visual search involves image classification, target detection, and target pose judgment and estimation. In actual rescue, the visual information of the injured person is easily affected by the harsh outdoor environment; in particular, interference from the image background and from the posture of the injured person (a single occluded person, or multiple mutually occluding persons) makes it difficult to obtain an effective image, so the posture estimation solution is not unique, and accurate, stable posture estimation of the injured person cannot be achieved.
To solve the problem of accurately locating injured people in complex environments by machine vision, developing a multi-person human posture estimation model that effectively resists outdoor environmental interference, improves the efficiency of posture estimation for injured people, and possesses a degree of robustness is of great significance to the development of current rescue robots.
Disclosure of Invention
In view of the above, the present invention is directed to a multi-person human body posture estimation algorithm based on deep learning, so as to solve the above-mentioned problems in the background art.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
a multi-person human body posture estimation algorithm based on deep learning transfers an image or video file into the model; a 50-layer ResNet network then extracts image features of the limbs and joint points of the multiple persons from the input image or video; a Convolutional Pose Machine (CPM) detects the joint points; a Gaussian function selects the heat map of the optimal joint point among the detected candidates; Part Affinity Fields (PAFs) match the obtained joint points with limbs to obtain the set of all limb types and joint points required for the human body postures; the Hungarian algorithm together with a human body limb frame then matches the limb-type and joint-point sets to obtain the complete human body posture estimate, finally completing multi-person posture estimation in the image.
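As a non-authoritative sketch of the pipeline order stated above, the stages can be expressed as a single function whose stage callables are placeholders (all names here are illustrative, not the patent's API):

```python
def estimate_multi_person_poses(image, extract_features, detect_joints,
                                select_peaks, compute_pafs, match_limbs):
    """Pipeline order described in the claim: ResNet-50 features -> CPM joint
    detection -> Gaussian peak selection -> PAF limb matching -> Hungarian
    assembly with the human limb frame. Each argument is a stage callable."""
    features = extract_features(image)      # 50-layer ResNet feature extraction
    heatmaps = detect_joints(features)      # Convolutional Pose Machine
    joints = select_peaks(heatmaps)         # Gaussian selection of optimal joints
    limbs = compute_pafs(features, joints)  # Part Affinity Fields
    return match_limbs(limbs, joints)       # Hungarian algorithm + limb frame
```

Passing real networks for the five callables would realize the pipeline; here the function only fixes the data flow between stages.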
Further, the model input specifically includes an RGB color image within 1000 × 1000 pixel size or a video file containing a multi-person image, and the file format is MP4 format.
Further, the feature extraction network consists of a 50-layer ResNet network. One core of ResNet is the deep bottleneck architecture: several identity-mapping layers (i.e., y = x, output equal to input) are added behind a shallower network to increase depth and improve the network's non-linear capacity, while the identity mapping cannot increase the error — a deeper network should not produce a larger training error. Residual networks successfully address two problems that frequently appear in conventional neural networks: first, when a layer is too far from the input layer, the derivative propagated back becomes too small, so the update value approaches zero and training stalls; second, each layer must learn an entirely new output function F(x), and as the network depth grows greatly, the number of such functions causes problems such as high computational pressure. The original input of the network is fed from a bypass directly into a deeper layer, which supplies the residual connection and remedies the vanishing of the residual signal; during training, each layer of a residual network only has to learn a residual relative to its input rather than the full mapping.
A conventional convolutional neural network tries to extract all information at once, which increases the risk of vanishing gradients. A residual block instead splits the computation into two paths: the first passes through the stacked layers and attempts to learn the residual F(x) directly from x; the second is a shortcut that carries the input x unchanged. Let the input be x and the result to be fitted (the output) be H(x). The residual module decomposes the output as H(x) = x + y and lets y = F(x), i.e., the residual is also fitted from x; the fitted residual is then added to x to obtain the layer's output. Since the mapping value H(x) differs from the input x by exactly the required residual, the residual structure only has to fit F(x) = H(x) − x, and the calculation formula is shown in equation (1):

H(x) = F(x) + x   (1)
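A minimal numerical sketch of the relation H(x) = F(x) + x (an illustration, not the patent's implementation — `residual_fn` stands in for the stacked convolution layers F):

```python
import numpy as np

def residual_block(x, residual_fn):
    """Compute H(x) = F(x) + x: the block only has to fit the residual F."""
    return residual_fn(x) + x

# Toy residual function standing in for two conv layers (hypothetical).
f = lambda x: 0.1 * x

x = np.array([1.0, 2.0, 3.0])
h = residual_block(x, f)

# If F learns the zero function, the block degenerates to the identity
# mapping y = x, so adding such blocks cannot increase the training error.
identity = residual_block(x, lambda x_: np.zeros_like(x_))
```

The second call shows why identity mappings are "free": a residual block whose residual branch outputs zero passes its input through unchanged.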
Further, in the feature detection process, confidence maps, denoted S, represent the 2D detection positions of specific key points in the image. For example, if there is only one person in the image and the joint point is visible, each confidence map should contain a single peak; if there are k people in the image and a given joint (say, the neck) is visible for j of them, there should be j peaks. The feature points obtained in the first step of the model are input into the Convolutional Pose Machine network for joint-point detection, yielding a batch of potential joint-point confidence maps; the optimal joint point is then obtained from the potential joint points X_{j,k} and the true joint position p using equation (2).
S*_{j,k}(p) = exp(−‖p − X_{j,k}‖² / σ²)   (2)

where σ controls the spread of the peak and p is the image coordinate of the evaluated point. The resulting set of per-person maps is aggregated into the final output predicted confidence map by taking a per-pixel maximum, as in equation (3):

S*_j(p) = max_k S*_{j,k}(p)   (3)
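A sketch of the Gaussian confidence map and the per-pixel maximum aggregation (grid size and σ are illustrative assumptions):

```python
import numpy as np

def confidence_map(joint_xy, shape, sigma=2.0):
    """S*_{j,k}(p) = exp(-||p - X_{j,k}||^2 / sigma^2) on a pixel grid."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    d2 = (xs - joint_xy[0]) ** 2 + (ys - joint_xy[1]) ** 2
    return np.exp(-d2 / sigma ** 2)

def merge_maps(maps):
    """Per-pixel max over people keeps every person's peak distinct,
    rather than blurring nearby peaks together as averaging would."""
    return np.max(maps, axis=0)

# Two people's necks at (x=3, y=4) and (x=7, y=2) on a 10x10 grid.
m1 = confidence_map((3, 4), (10, 10))
m2 = confidence_map((7, 2), (10, 10))
merged = merge_maps(np.stack([m1, m2]))
```

The merged map has two unit-height peaks, matching the statement that j visible necks should produce j peaks.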
Furthermore, the joint points in the image are obtained from the joint-detection confidence maps, and the network model connects the key points using Part Affinity Fields. Part Affinity Fields (PAFs), the core of the OpenPose model, store position and orientation information over the limb region. PAFs are further divided into the single-person and multi-person cases.
Further, in single-person detection each limb joint point points toward the other end of the limb, and each limb has a corresponding affinity field connecting the body parts it involves. Let X_{j1,k} and X_{j2,k} denote the coordinates of joint points j1 and j2, and let the vector v represent limb c of the k-th person formed by these two joint points. When a point P lies on this limb, the ground-truth PAF at P is the unit vector v pointing from j1 to j2; at all other points it is the zero vector. The judgment conditions are shown in equations (4) and (5):

L*_{c,k}(P) = v if P lies on limb c of person k, and 0 otherwise   (4)

v = (X_{j2,k} − X_{j1,k}) / ‖X_{j2,k} − X_{j1,k}‖   (5)
A point P is considered to lie on limb c when it satisfies both equations (6) and (7):

0 ≤ v · (P − X_{j1,k}) ≤ l_c   (6)

|v⊥ · (P − X_{j1,k})| ≤ σ_l   (7)

where l_c is the length of the limb, v⊥ is the vector perpendicular to the unit vector v, and σ_l is the width of the limb. When several limbs c of different people overlap at a point in the figure, the vectors must be averaged, as shown in equation (8):

L*_c(P) = (1/n_c(P)) Σ_k L*_{c,k}(P)   (8)
where n_c(P) is the number of non-zero vectors at point P. Candidate pairs formed by detected joint points are then tested: the true association pairs, and the limbs consistent with reality, are screened by computing the line integral of the PAF along the segment connecting the pair, as shown in equations (9) and (10):

E = ∫₀¹ L_c(p(u)) · (d_{j2} − d_{j1}) / ‖d_{j2} − d_{j1}‖ du   (9)

p(u) = (1 − u)·d_{j1} + u·d_{j2}   (10)

where p(u) is the interpolated position between the two joint points d_{j1} and d_{j2}.
Furthermore, in multi-person detection, non-maximum suppression is applied to the detected confidence maps to obtain a discrete candidate set of joint-point positions. In an image of multiple people these candidates must be assigned to different persons, so multiple solutions exist; the multi-person posture solution is obtained through the joint action of the Hungarian algorithm and the body limb frame.
Further, for the Hungarian algorithm the body limb parts and joint points are modeled as an undirected graph G = (V, E). The vertex set V can be divided into two mutually disjoint subsets X and Y (with no edges inside a subset) such that the two endpoints of every edge belong to different subsets; such a graph G is called a bipartite graph. The matching process must pair as many endpoints of X and Y as possible one-to-one without repetition. If |V₁| ≤ |V₂| (the subset that needs matching has no more endpoints than the other) and the matching M satisfies |M| = |V₁|, then M is called a complete matching; if additionally |V₁| = |V₂|, it is called a perfect matching.
Further, to help the Hungarian algorithm quickly match limb pairs that are hard to match in the graph, a human limb frame model is introduced, in which points represent the important joint points of the human body and lines represent the limbs. Since neither points nor lines carry volume, the model is built with a non-volumetric method, and a connection exists between adjacent joint points and limbs only where they are actually joined. When occlusion among multiple people is not severe, or the joint features of the human bodies are not distinct, the model can be fitted to the subset of limbs already detected; the spatial rotation range of the remaining limbs in the model then provides the network with prioritized candidate regions for the missing limbs and joint points. These regions carry high detection and matching weights, improving the network's recognition accuracy on images of overlapping people.
Compared with the prior art, the multi-person human body posture estimation algorithm based on deep learning has the following advantages:
the invention aims at the problems that the rescue robot has inaccurate recognition when recognizing the posture of a person in a complex land environment and the accuracy of an Open position human posture estimation model is to be further improved, and the invention carries out two improvements:
(1) the multi-person human body posture estimation algorithm based on deep learning is used as a multi-person human body posture estimation core algorithm in the complex environment recognized by the robot, and the robot is effectively helped to recognize the human body posture in the complex environment on the land.
(2) The matching problem of the human body limbs and the joint points of a plurality of people is solved under the combined action of the Hungarian algorithm and the human body limb framework, and the matching precision of the human body limbs and the joint points is further improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flowchart illustrating algorithm detection according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a single limb PAFs in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram of a candidate pair of nodes according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of bipartite graph matching according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a human body structure based on component synthesis according to an embodiment of the present invention;
FIG. 6 is a graph of the test results according to the embodiment of the present invention;
Detailed Description
It should be noted that the embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
On the basis of the OpenPose algorithm, the invention uses a 50-layer ResNet network as the human body feature extraction network, improving the robustness of multi-person posture estimation in unfamiliar environments, and then uses the Hungarian algorithm and a human body limb frame to match the limbs and joint points of multiple people, obtaining the multi-person human body posture estimation result.
1. First an image or video file is input into the model.
2. The input image or video file is then fed into the feature extraction network to obtain the feature points required for the multi-person human body postures.
3. During the detection process, confidence maps, denoted S, represent the 2D detection positions of specific key points in the image. For example, if there is only one person in the image and the joint point is visible, each confidence map should contain a single peak; if there are k people in the image and a given joint (say, the neck) is visible for j of them, there should be j peaks. The feature points obtained in the first step of the model are input into the Convolutional Pose Machine network for joint-point detection, yielding a batch of potential joint-point confidence maps; the optimal joint point is then obtained from the potential joint points X_{j,k} and the true joint position p using equation (11):

S*_{j,k}(p) = exp(−‖p − X_{j,k}‖² / σ²)   (11)

where σ controls the spread of the peak and p is the image coordinate of the evaluated point. The resulting set of per-person maps is aggregated into the final output predicted confidence map by taking a per-pixel maximum, as in equation (12):

S*_j(p) = max_k S*_{j,k}(p)   (12)
The joint points in the image are obtained from the joint-detection confidence maps, and the network model connects the key points into limbs using the Part Affinity Fields shown in FIG. 1.
Let X_{j1,k} and X_{j2,k} denote the coordinates of joint points j1 and j2, and let the vector v represent limb c of the k-th person formed by these two joint points. When a point P lies on this limb, the ground-truth PAF at P is the unit vector v pointing from j1 to j2; at all other points it is the zero vector. The judgment conditions are shown in equations (13) and (14):

L*_{c,k}(P) = v if P lies on limb c of person k, and 0 otherwise   (13)

v = (X_{j2,k} − X_{j1,k}) / ‖X_{j2,k} − X_{j1,k}‖   (14)
A point P is considered to lie on limb c when it satisfies both equations (15) and (16):

0 ≤ v · (P − X_{j1,k}) ≤ l_c   (15)

|v⊥ · (P − X_{j1,k})| ≤ σ_l   (16)

where l_c is the length of the limb, v⊥ is the vector perpendicular to the unit vector v, and σ_l is the width of the limb. When several limbs c overlap at a point in the figure, the vectors must be averaged, as shown in equation (17):

L*_c(P) = (1/n_c(P)) Σ_k L*_{c,k}(P)   (17)
where n_c(P) is the number of non-zero vectors at point P. Candidate pairs formed by the detected joint points are then tested: the true association pairs, and the limbs consistent with reality, are screened by computing the line integral of the PAF along the segment connecting the pair, as shown in equations (18) and (19):

E = ∫₀¹ L_c(p(u)) · (d_{j2} − d_{j1}) / ‖d_{j2} − d_{j1}‖ du   (18)

p(u) = (1 − u)·d_{j1} + u·d_{j2}   (19)

where p(u) is the interpolated position between the two joint points d_{j1} and d_{j2}.
After non-maximum suppression is applied to the detected confidence maps, a discrete candidate set of joint-point positions is obtained. In an image of multiple people there are multiple possible solutions, for example as shown in FIG. 2, where the candidate points must be assigned to different persons. Boxes of the same color in the figure represent the same joint point, and the possible ways the three joint points can be connected into limbs are shown in (b). The network model uses the Hungarian algorithm, the global context implicitly encoded by the pairwise association scores contained in the PAFs, and the human body limb frame to obtain high-quality multi-person key-point pair connections.
4. The idea of the Hungarian algorithm is as follows. Let G = (V, E) be an undirected graph whose vertex set V can be divided into two mutually disjoint subsets X and Y (with no edges inside a subset) such that the two endpoints of every edge belong to different subsets; such a graph G is a bipartite graph. The matching process must pair as many endpoints of X and Y as possible one-to-one without repetition. If |V₁| ≤ |V₂| (the subset that needs matching has no more endpoints than the other) and the matching M satisfies |M| = |V₁|, then M is called a complete matching; if additionally |V₁| = |V₂|, it is called a perfect matching.
An augmenting path may be defined as follows. Let M be the set of successfully matched edges in bipartite graph G, as shown in FIG. 3. If P is a path in G connecting two unmatched vertices (the start of P may lie in either X or Y) on which edges belonging to M and edges not belonging to M alternate, then P is an augmenting path with respect to M. The Hungarian algorithm proceeds by initializing M as empty, finding an augmenting path P for M, and inverting the path (swapping matched and unmatched edges along P) to obtain a larger matching M' that replaces M. This step is repeated until no further augmenting path can be found, so the core of the Hungarian algorithm is to find as many augmenting paths as possible.
The pseudo code of the augmenting-path search is as follows:
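The pseudo code itself is not reproduced in this extract. A minimal augmenting-path matcher of the kind the passage describes (a simplified unweighted Hungarian/Kuhn algorithm; variable names are illustrative) could look like:

```python
def max_bipartite_matching(adj, n_left, n_right):
    """adj[u] lists the right-side vertices connectable to left vertex u.
    For each unmatched left vertex, search for an augmenting path and
    invert it, exactly as the definition above prescribes."""
    match_right = [-1] * n_right  # match_right[v] = left vertex matched to v

    def try_augment(u, visited):
        for v in adj.get(u, []):
            if v in visited:
                continue
            visited.add(v)
            # v is free, or its current partner can be re-matched elsewhere
            # (this recursion is the path inversion):
            if match_right[v] == -1 or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    matched = 0
    for u in range(n_left):
        if try_augment(u, set()):
            matched += 1
    return matched, match_right

# Three candidate joint points on each side; edges are plausible limb links.
adj = {0: [0, 1], 1: [0], 2: [1, 2]}
size, pairing = max_bipartite_matching(adj, 3, 3)
```

In the example, greedily matching left vertex 0 to right vertex 0 would block left vertex 1; the augmenting-path inversion re-routes 0 to right vertex 1 so that all three pairs are matched — a perfect matching in the sense defined above.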
In the algorithm, the network model takes the two sets of limb endpoints that can be correctly connected as the subsets X and Y, and obtains the correct limb combinations through the Hungarian algorithm to form the complete human body posture structure.
To help the Hungarian algorithm quickly match limb pairs that are hard to match in the graph, a human limb frame model is introduced, in which points represent the important joint points of the human body and lines represent the limbs; since neither carries volume, the model is built with a non-volumetric method, and a connection exists between adjacent joint points and limbs only where they are actually joined. When occlusion among multiple people is not severe, or the human joint features are not distinct, the model can be fitted onto the subset of limbs already detected; the spatial rotation range of the remaining limbs in the model then provides the network with prioritized candidate regions for the missing limbs and joint points, and these regions carry high detection and matching weights, improving recognition accuracy on images of overlapping people. The model is organized in four layers: the first layer is the overall human posture; the second layer comprises the head, trunk, left arm, left leg, right arm and right leg; the third layer comprises the left upper arm, left forearm, right upper arm, right forearm, left thigh, left shank, right thigh and right shank; and the fourth layer comprises the joints connecting the parts of the third layer (for example, the fourth layer below the right forearm contains the wrist joint and the elbow joint). The overall structure is shown in FIG. 4, with arrows pointing from higher layers to lower layers.
During detection, a limb whose match has already been determined is first taken as a stable anchor. Since human limbs are connected to one another by hinge-like joints at both ends, each limb can be simplified as a rigid body, and the lengths of human limbs obey certain proportional relations. The constraints on human limbs are therefore divided into two parts: the first is a length constraint within the same limb group, calculated as in equation (20); the second is a length constraint between symmetrically positioned limbs, calculated as in equation (21).
where R_i represents a group of limbs having a certain similarity, S_i represents the i-th limb, and the remaining term represents the mean ratio between the lengths of all limbs in a group and their mean value.
After the estimated limb length is obtained, the joint point of the limb is taken as the center and the estimated limb length as the radius; joint points and limbs related to that limb are searched for within this radius, and the Hungarian algorithm is then run again to compute all limb matchings.
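The patent's exact formulas (20) and (21) are not reproduced in this extract, so the following sketch of the two constraint types uses assumed tolerance values and illustrative names:

```python
import numpy as np

def limb_length(p1, p2):
    """Euclidean length of a limb given its two joint coordinates."""
    return float(np.linalg.norm(np.asarray(p1, float) - np.asarray(p2, float)))

def within_group_ratio(length, group_mean, tol=0.25):
    """Constraint 1 (cf. eq. (20)): a limb's length stays near the mean
    of its similarity group R_i. The tolerance is an assumption."""
    return abs(length - group_mean) <= tol * group_mean

def symmetric_lengths_match(left_len, right_len, tol=0.15):
    """Constraint 2 (cf. eq. (21)): symmetrically positioned limbs
    (e.g. left/right forearm) have similar lengths."""
    return abs(left_len - right_len) <= tol * max(left_len, right_len)

forearm_l = limb_length((0, 0), (0, 10))
forearm_r = limb_length((5, 0), (5, 10.5))
ok_sym = symmetric_lengths_match(forearm_l, forearm_r)
ok_grp = within_group_ratio(forearm_l, group_mean=10.2)
```

Candidates failing either check would be excluded from the radius search before the Hungarian algorithm is re-run.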
Experiments and analyses
To test the generalization ability of the model across environments, multi-person photos were randomly selected from campus, battlefield, earthquake, fire and dust environments and tested on the trained model; the test results are shown in FIGS. 5 to 6.
Analysis of results
To quantitatively describe the detection accuracy of the model in complex environments, 100 images were randomly drawn from three low-visibility environments (war, earthquake and smoke), detected, and compared with the correct results to obtain the detection accuracy for each human body and for each limb; the results are shown in Tables 2 and 3.
TABLE 2 human body detection accuracy in complex environments
TABLE 3 accuracy of estimation of human body limbs in complex environment
As can be seen from Table 2, the mean human-body detection accuracy across the three environments is 0.83, although this evaluates only the number of human postures the model detects. Table 3 gives the detection accuracy of each limb in the three environments. Limb detection accuracy is lowest in the war environment, because the variation of human postures there is largest and the environment is extremely harsh, so the accuracy of each limb drops considerably compared with the model's original accuracy. In the earthquake environment, the pose of personnel may be occluded by the surroundings, so accuracy on the trunk and the limb extremities drops sharply while other parts drop only slightly. Compared with the other two environments, low-visibility conditions such as smoke reduce visibility and affect detection at the limb extremities, but pose variation is small, so the per-limb detection accuracy is relatively high.
In summary, when the figures in the picture stand against a clear background, occlusion is not severe, and the image is a close-range scene, the model detects well with high accuracy. When the background is complex but the figures are sparse and occlusion is not severe, the detection rate and accuracy remain high. When the figures are densely stacked and blend into the background, the detection effect degrades: under heavy stacking and background blending, the features extracted by the feature extraction network are poor, and features judged ineffective may even be discarded, so the joint-point features of those people disappear and the model ultimately cannot detect bodies that are stacked together and fully blended with the background.
This work addresses, in three respects, the problems that rescue robots estimate multi-person postures inaccurately in complex land environments and that the accuracy of existing human posture estimation models needs improvement: (1) a multi-person human body posture estimation model based on deep learning is proposed; (2) a 50-layer ResNet network is used as the feature extraction network; (3) the Hungarian algorithm and the human body limb frame are used together to obtain the postures of multiple human bodies.
The above description covers only preferred embodiments of the present invention and is not intended to limit it; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principle of the present invention shall fall within its scope of protection.
Claims (3)
1. A multi-person human pose estimation model based on deep learning, characterized in that: an image is input into the network, multi-person pose estimation features are obtained through a feature-extraction network, and these features are then fed into a matching network for human limbs and joint points to estimate the poses of multiple people.
2. The model of claim 1, characterized in that: the model input is an RGB color image no larger than 1000 × 1000 pixels, or a video file containing multi-person images in MP4 format.
3. The model of claim 1, characterized in that: the human limbs and joint points are matched using the Hungarian algorithm together with human limb boxes.
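Claim 2's input constraint (an RGB image within 1000 × 1000 pixels, or an MP4 video file) is simple to enforce before inference. A hypothetical validation helper might look as follows; the function and parameter names are illustrative, not from the patent:

```python
def validate_input(kind, *, width=None, height=None, channels=None, filename=None):
    """Check a model input against the claimed constraints: an RGB image
    no larger than 1000x1000 pixels, or a video file in MP4 format."""
    if kind == "image":
        if channels != 3:
            raise ValueError("image must be RGB (3 channels)")
        if width > 1000 or height > 1000:
            raise ValueError("image must be within 1000x1000 pixels")
        return True
    if kind == "video":
        if not filename.lower().endswith(".mp4"):
            raise ValueError("video must be in MP4 format")
        return True
    raise ValueError("kind must be 'image' or 'video'")

print(validate_input("image", width=640, height=480, channels=3))  # True
```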
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010560950.5A CN112001217A (en) | 2020-06-18 | 2020-06-18 | Multi-person human body posture estimation algorithm based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112001217A true CN112001217A (en) | 2020-11-27 |
Family
ID=73466633
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010560950.5A Pending CN112001217A (en) | 2020-06-18 | 2020-06-18 | Multi-person human body posture estimation algorithm based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112001217A (en) |
History
- 2020-06-18: Application CN202010560950.5A filed in China; published as CN112001217A, status Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107886069A (en) * | 2017-11-10 | 2018-04-06 | 东北大学 | A kind of multiple target human body 2D gesture real-time detection systems and detection method |
CN110084138A (en) * | 2019-04-04 | 2019-08-02 | 高新兴科技集团股份有限公司 | A kind of more people's Attitude estimation methods of 2D |
CN111199207A (en) * | 2019-12-31 | 2020-05-26 | 华南农业大学 | Two-dimensional multi-human body posture estimation method based on depth residual error neural network |
Non-Patent Citations (1)
Title |
---|
ZHE CAO et al.: "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields", arXiv * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113221924A (en) * | 2021-06-02 | 2021-08-06 | 福州大学 | Portrait shooting system and method based on OpenPose |
CN113368487A (en) * | 2021-06-10 | 2021-09-10 | 福州大学 | OpenPose-based 3D private fitness system and working method thereof |
CN113269166A (en) * | 2021-07-19 | 2021-08-17 | 环球数科集团有限公司 | Fire detection algorithm for cross-media analysis and inference |
CN113269166B (en) * | 2021-07-19 | 2021-09-24 | 环球数科集团有限公司 | Fire detection algorithm for cross-media analysis and inference |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110135375B (en) | Multi-person attitude estimation method based on global information integration | |
Gao et al. | Dual-hand detection for human–robot interaction by a parallel network based on hand detection and body pose estimation | |
CN108052896B (en) | Human body behavior identification method based on convolutional neural network and support vector machine | |
CN109522850B (en) | Action similarity evaluation method based on small sample learning | |
CN112001217A (en) | Multi-person human body posture estimation algorithm based on deep learning | |
CN104794737B (en) | A kind of depth information Auxiliary Particle Filter tracking | |
CN109559320A (en) | Realize that vision SLAM semanteme builds the method and system of figure function based on empty convolution deep neural network | |
CN102075686B (en) | Robust real-time on-line camera tracking method | |
CN110176016B (en) | Virtual fitting method based on human body contour segmentation and skeleton recognition | |
CN110008913A (en) | The pedestrian's recognition methods again merged based on Attitude estimation with viewpoint mechanism | |
CN108875586B (en) | Functional limb rehabilitation training detection method based on depth image and skeleton data multi-feature fusion | |
US20210216759A1 (en) | Recognition method, computer-readable recording medium recording recognition program, and learning method | |
CN111199207B (en) | Two-dimensional multi-human body posture estimation method based on depth residual error neural network | |
CN105869166A (en) | Human body action identification method and system based on binocular vision | |
CN108154066B (en) | Three-dimensional target identification method based on curvature characteristic recurrent neural network | |
CN111046734A (en) | Multi-modal fusion sight line estimation method based on expansion convolution | |
CN111881716A (en) | Pedestrian re-identification method based on multi-view-angle generation countermeasure network | |
CN105279522A (en) | Scene object real-time registering method based on SIFT | |
CN112257741B (en) | Method for detecting generative anti-false picture based on complex neural network | |
CN113111857A (en) | Human body posture estimation method based on multi-mode information fusion | |
CN111507184B (en) | Human body posture detection method based on parallel cavity convolution and body structure constraint | |
CN115035546B (en) | Three-dimensional human body posture detection method and device and electronic equipment | |
Hachaj et al. | Real-time recognition of selected karate techniques using GDL approach | |
CN113076891B (en) | Human body posture prediction method and system based on improved high-resolution network | |
CN114463619A (en) | Infrared dim target detection method based on integrated fusion features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20201127 |