CN113643362A - Limb corrector based on human body measurement in 2D human body posture estimation system - Google Patents

Limb corrector based on human body measurement in 2D human body posture estimation system

Info

Publication number
CN113643362A
Authority
CN
China
Prior art keywords
depth
joint
pose
limb
mapping
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110682875.4A
Other languages
Chinese (zh)
Inventor
王全玉
艾力
孙玥
张开翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202110682875.4A priority Critical patent/CN113643362A/en
Publication of CN113643362A publication Critical patent/CN113643362A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G06T19/20 Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G06T7/579 Depth or shape recovery from multiple images from motion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G06T7/593 Depth or shape recovery from multiple images from stereo images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30004 Biomedical image processing
    • G06T2207/30008 Bone
    • G06T2207/30012 Spine; Backbone
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2219/00 Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T2219/20 Indexing scheme for editing of 3D models
    • G06T2219/2012 Colour editing, changing, or manipulating; Use of colour codes

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Architecture (AREA)
  • Length Measuring Devices By Optical Means (AREA)
  • Image Analysis (AREA)

Abstract

A system and method for correcting limb depth in a 2D human body pose estimation system uses a depth sensor and anthropometry to implement 2D-to-3D mapping. When mapping 2D body poses to 3D poses, holes in the depth map, self-occlusion, and inter-object occlusion can cause the depth sensor to return an erroneous depth for a joint. The system and method rely on a 2D body pose estimator, depth sensor data, and average anthropometric data. The method obtains 2D body pose information from the 2D pose estimator and generates an accurate 3D pose using the depth map and the average body measurements. It exploits the fact that the 2D projection of a limb, together with the limb length, can be used to compute the limb's extent along the depth axis. Unlike traditional techniques such as pose matching and deep neural networks, the method has a low computational cost, so the 2D-to-3D projection is easy to implement and well suited to real-time pose estimation systems.

Description

Limb corrector based on human body measurement in 2D human body posture estimation system
Technical field:
The invention relates to the field of human body pose estimation systems, in particular to joint mapping from 2D to 3D in a 2D human body pose estimation system.
Background art:
Human pose estimation systems have recently attracted strong interest from researchers, driven by the availability of larger relevant datasets and by advances in software and hardware. Detecting human body joints in images lays the foundation for many practical applications. Human pose estimation can be performed in 2D or 3D space. In 2D human pose estimation, the system localizes joints from a monocular image, whereas in 3D pose estimation the system relies on a color image and a synchronized depth map. The depth map may be generated by other processes or by a depth sensor: maps generated by a pre-learned process are difficult to obtain and less accurate, while a depth sensor can generate a high-accuracy depth map in real time but covers a limited depth range. A depth sensor may produce the depth map using a time-of-flight sensor or stereo vision.
Training a model on paired samples (image, depth map) is one way to generate depth: once training is complete, the model can predict the depth of each pixel in an input image, but the accuracy of this process is low. A second approach synthesizes a depth map from multiple depth cues in the image, such as perspective geometry, defocus, visual saliency, and adaptive depth models. Other techniques use temporal information for stereo image generation or motion analysis to generate depth maps.
Although different methods are available for generating depth maps, the generated maps share some problems. One is that holes in the depth frames cause problems for point-to-point mapping. Another is the overlap of joints when mapping a human pose from 2D to 3D, which produces uncertain depths for some joints in the image. This second problem is not caused by the depth sensor but by the nature of the skeleton: connected joints can overlap in the projection.
To handle holes in the depth map, the invention does not use direct point-to-point mapping but instead selects the optimal depth point by examining the points near the target point. Anthropometry is used to correct the second problem, joint depth ambiguity caused by joint overlap: the approximate depth of each joint is calculated from anthropometric information and compared with the depth returned by the sensor to estimate the error in the joint depth. Average human limb lengths measured by anthropometry serve as the baseline.
Summary of the invention:
Systems, methods, and processes for mapping 2D body poses to 3D body poses and for limb depth correction are set forth. Many current solutions for 2D-to-3D mapping are computationally intensive and can even reduce overall application performance when cascaded with other computationally intensive tasks. The present invention maps 2D body pose joints to 3D using a minimal set of computations.
Fig. 1 illustrates an abstract representation of the system. An off-the-shelf 2D body pose estimator (0102) receives a real-time monocular image sequence (0101) and outputs the 2D body pose (0103) of N joints for each person in the image; the body pose model and joint configuration are shown in fig. 5. The detected 2D poses are then input to a 2D-to-3D mapper module (0104) that lifts each 2D joint into three dimensions by adding a depth axis. The mapper module generates a 3D joint set in which each joint is the 2D joint coordinate plus a depth value. Because of holes in the depth map or occlusions, the resulting 3D pose may lack some depth values or contain incorrect ones. The generated 3D pose (0105) is therefore input to an anthropometric limb corrector module (0106), which further disambiguates the depths and outputs an accurate 3D human pose (0107). A minimal sketch of the module interfaces follows.
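The following Python sketch illustrates the three-module decomposition just described; the function names, signatures, and tuple layouts are illustrative assumptions, not identifiers used by the patent.

```python
from typing import List, Tuple

Joint2D = Tuple[float, float, float]         # (x, y, confidence in 0-1)
Joint3D = Tuple[float, float, float, float]  # (x, y, z, confidence in 0-1)

def estimate_2d_poses(rgb_frame) -> List[List[Joint2D]]:
    """Off-the-shelf 2D pose estimator (0102): one N-joint list per person."""
    raise NotImplementedError("any estimator returning the fig. 5 joint set")

def map_2d_to_3d(pose_2d: List[Joint2D], depth_map) -> List[Joint3D]:
    """2D-to-3D mapper (0104): adds a depth coordinate to every 2D joint."""
    raise NotImplementedError

def correct_limbs(pose_3d: List[Joint3D]) -> List[Joint3D]:
    """Anthropometric limb corrector (0106): resolves ambiguous joint depths."""
    raise NotImplementedError
```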
The block diagram shown in fig. 2 represents a basic 2D body pose estimation system. The system acquires an image or a sequence of images (0201) and passes them to a pose estimator (0202) that generates a set of 2D joint locations for each person in the image. For an image in the sequence containing K persons and N pose joints per person, each 2D pose joint is represented as:
[Equation (1), rendered as an image in the original: a 2D joint is given by its pixel coordinates (x, y) and a confidence score.]
FIG. 3 illustrates a block diagram of the 2D-to-3D pose mapping module, which takes a set of 2D joint positions (0301) as input. These are processed (0302) to generate the corresponding 3D mapped joints (0303). The module does not map each 2D joint directly to the corresponding point of the depth map; instead, it selects the best depth representation from the neighborhood of the joint in the depth map. For an image in the sequence containing K persons and N pose joints per person, each 3D pose joint is represented as:
[Equation (2), rendered as an image in the original: a 3D joint is given by its coordinates (x, y, z) and a confidence score.]
Fig. 4 shows a block diagram of the anthropometric limb corrector module (0400). The module takes the generated 3D pose (0401) and outputs an accurate 3D pose (0403) by eliminating depth ambiguity in the limbs (0402). To resolve the depth ambiguities, the 2D projection of each limb and the depth returned by the depth sensor are compared with the anthropometric measurement of that limb in order to select the correct depth value.
Figure 5 shows the graphical human joint model (0500). The model consists of 21 pose joints, each of which is a 3D or 4D vector. A 2D joint, denoted J, is a set of (x, y) locations with a confidence score in the range 0-1. A 3D joint, denoted O, is a set of (x, y, z) positions with a confidence score in the range 0-1; N is the number of pose joints. The joint descriptions are shown in fig. 5.
Figure 6 shows the limb configuration and the average anthropometric limb measurements of the human body (0600). Here, the term "limb" describes a virtual bone between two pose joints. We use the term "virtual bone" because a few of these limbs do not correspond to an actual bone, for example the limbs from nose to eye, from eye to ear, and from nose to neck. All average human limb measurements are in millimeters, and symmetric limbs have the same limb length. Figure 6 also shows the ratio of each limb length to the average body height. A sketch of how such a limb table can be stored follows.
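The limb table of fig. 6 can be held as a simple mapping from (joint, parent joint) pairs to average lengths in millimeters. The sketch below shows the structure only; the numeric values are placeholders, not the measurements given in the figure.

```python
# Average anthropometric limb lengths in millimeters, keyed by (joint, parent).
# The numbers below are placeholders for illustration; the real averages are
# the ones listed in fig. 6. Symmetric limbs share a single length.
ANTHROPOMETRIC_LIMB_MM = {
    ("right_elbow", "right_shoulder"): 280.0,   # placeholder value
    ("right_wrist", "right_elbow"):    250.0,   # placeholder value
    ("left_elbow",  "left_shoulder"):  280.0,   # same length as the right side
    # ... one entry per limb of the 21-joint model in fig. 5
}
```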
Fig. 7A shows a color frame (0701) and a depth frame (0702) of identical size M x N, where M is the number of horizontal pixels and N is the number of vertical pixels. Each pixel in the color frame is denoted P(x, y) and each point in the depth map is denoted Z(x, y). In direct depth point mapping, the depth of each point in the color frame is the corresponding value in the depth map, i.e. the depth of P(x, y) is Z(x, y). This can produce a wrong depth, however, because the depth sensor leaves holes in the depth map. Instead, we select the optimal depth from the neighborhood of the point in the depth map. As shown in fig. 7B, the neighborhood is defined using the Euclidean distance (0703). The 3D joint O obtained by mapping the 2D joint J is calculated as follows:
[Equations (3)-(4), rendered as images in the original: the (x, y) coordinates of O are taken from J, and the depth of O is selected from the depth-map points within Euclidean distance r of J.]
where r is a parameter controlling the size of the neighborhood region in the depth map; the larger the value of r, the larger the neighborhood that is examined. A sketch of this selection rule follows.
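A minimal sketch of the neighborhood depth selection, assuming (as the patch examples of figs. 14-19 suggest) that a valid center depth is used directly and a hole (depth 0) is replaced by the minimum non-zero depth within Euclidean radius r:

```python
import math
import numpy as np

def select_joint_depth(depth_map: np.ndarray, x: int, y: int, r: int = 4) -> float:
    """Return the depth (mm) for the 2D joint at pixel (x, y).

    If the center pixel is valid (non-zero) it is used directly; otherwise the
    minimum non-zero depth inside the Euclidean neighborhood of radius r is
    used. Returns 0.0 when the whole neighborhood is a hole.
    """
    h, w = depth_map.shape
    center = float(depth_map[y, x])
    if center > 0:
        return center                      # direct point-to-point mapping
    candidates = []
    for v in range(max(0, y - r), min(h, y + r + 1)):
        for u in range(max(0, x - r), min(w, x + r + 1)):
            if math.hypot(u - x, v - y) <= r and depth_map[v, u] > 0:
                candidates.append(float(depth_map[v, u]))
    return min(candidates) if candidates else 0.0
```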
Fig. 8 shows the conversion of a screen coordinate point S(x, y) to a world coordinate point W(x, y, z). The process uses a perspective projection model in which the camera parameters are known, including the horizontal field of view, the vertical field of view, and the width and height of the screen plane. The conversion from the screen coordinate point S to the world coordinate point W is:
[Equations (5)-(6) and (8)-(9), rendered as images in the original, give W(x) and W(y) from the perspective projection model; equation (7) is W(z) = Z_Sensor.]
where FoV_h is the horizontal field of view of the depth camera, FoV_v is the vertical field of view of the depth camera, w is the width of the 2D screen plane, h is the height of the 2D screen plane, and Z_Sensor is the depth (in millimeters) that the depth sensor returns for W. All coordinates of S are in pixels and the coordinates of W are in millimeters. A sketch of such a conversion follows.
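Equations (5)-(9) are rendered as images in the original, so the sketch below uses a standard field-of-view perspective back-projection as an assumption about their content; only W(z) = Z_Sensor is stated explicitly in the text, and the sign conventions here are likewise assumptions.

```python
import math

def screen_to_world(sx: float, sy: float, z_sensor_mm: float,
                    fov_h_deg: float, fov_v_deg: float,
                    width_px: int, height_px: int):
    """Back-project screen point S(x, y) with sensor depth Z to world W(x, y, z).

    Standard FoV-based perspective model: pixel offsets from the image center
    are scaled by the tangent of half the field of view and by the depth.
    S is in pixels, W is in millimeters.
    """
    half_w, half_h = width_px / 2.0, height_px / 2.0
    wx = (sx - half_w) / half_w * z_sensor_mm * math.tan(math.radians(fov_h_deg) / 2)
    wy = (half_h - sy) / half_h * z_sensor_mm * math.tan(math.radians(fov_v_deg) / 2)
    wz = z_sensor_mm                       # equation (7): W(z) = Z_Sensor
    return wx, wy, wz
```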
Figure 9 illustrates the relationship between the anthropometric limb length, the joint depth returned by the sensor, and the 2D projection of the limb. As the figure shows, the anthropometric limb length is determined by the length of the limb in the 2D plane together with the end-joint depths returned by the sensor. This relationship can be used to find joints with a large depth error, and a joint with a wrong depth can be replaced by the depth calculated from the anthropometric measurement. Here O is the 3D joint whose depth error is to be computed and P is the parent joint of O, so OP is a limb with O and P as its end joints. Z_Anth^(OP) is the anthropometric length of limb OP, and Z_Sensor^O is the depth returned by the depth sensor for joint O. The approximate depth of joint O, Z_Approx^O, and the Limb Anthropometric Error (LAE) of joint O with parent P are calculated as:
[Equations (10)-(12), rendered as images in the original: equation (10) gives the approximate depth Z_Approx^O of joint O from the anthropometric limb length and the limb's 2D projection, equation (11) gives the LAE as the difference between Z_Sensor^O and Z_Approx^O, and equation (12) selects the final joint depth by comparing the LAE with a tolerance.]
where tolerance is a threshold hyper-parameter defining the interval within which the sensor depth is accepted as correct. A sketch of this correction step follows.
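Since equations (10)-(12) are rendered as images, the sketch below assumes a Pythagorean relation between the anthropometric limb length, the limb's projection in the world x-y plane, and the depth offset between its end joints; the sign of the offset (child joint farther than its parent) is also an assumption.

```python
import math

def approximate_joint_depth(parent_xyz_mm, child_xy_mm, limb_anth_mm):
    """Approximate the child joint's depth from the parent joint position and
    the anthropometric limb length (Pythagorean relation assumed from fig. 9)."""
    dx = child_xy_mm[0] - parent_xyz_mm[0]
    dy = child_xy_mm[1] - parent_xyz_mm[1]
    planar_sq = dx * dx + dy * dy
    depth_offset = math.sqrt(max(limb_anth_mm ** 2 - planar_sq, 0.0))
    return parent_xyz_mm[2] + depth_offset   # assumes the child lies behind the parent

def correct_depth(z_sensor_mm, z_approx_mm, tolerance_mm=30.0):
    """Limb Anthropometric Error (LAE) check: keep the sensor depth when it is
    within tolerance of the anthropometric approximation, otherwise replace it."""
    lae = abs(z_sensor_mm - z_approx_mm)
    return z_sensor_mm if lae <= tolerance_mm else z_approx_mm
```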
FIG. 10 shows the fine-tuning sequence of the 3D pose joints. The process starts at the root node, the mid-hip (O10), proceeds in parallel along three branches (right hip, torso, and left hip), and stops when the leaf nodes (right wrist, left wrist, right ear, and left ear) are reached. A traversal sketch follows.
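The refinement order can be produced by a breadth-first walk of the joint tree. The joint-to-parent table below is a hypothetical stand-in for Table 1, which is rendered as an image in the original; only a few entries are shown.

```python
from collections import deque

# Hypothetical joint -> parent table standing in for Table 1 (illustration only).
PARENT = {
    "right_hip": "mid_hip", "spine": "mid_hip", "left_hip": "mid_hip",
    "right_knee": "right_hip", "left_knee": "left_hip", "neck": "spine",
    # ... remaining joints of the 21-joint model
}

def refinement_order(root: str = "mid_hip"):
    """Breadth-first order for depth refinement: start at the root (mid-hip),
    walk the three branches in parallel, stop at the leaf joints."""
    children = {}
    for joint, parent in PARENT.items():
        children.setdefault(parent, []).append(joint)
    order, queue = [], deque([root])
    while queue:
        joint = queue.popleft()
        order.append(joint)
        queue.extend(children.get(joint, []))
    return order
```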
Fig. 11 shows the projection of the 2D pose on the y-axis and z-axis. The anthropometric limb length for each joint is denoted Z_Anth^O and the sensor depth returned for each joint is denoted Z_Sensor^O, where O is the particular 3D joint. The joint-parent configuration of each joint is shown in Table 1.
TABLE 1
[Table 1: joint-parent configuration, rendered as an image in the original.]
FIG. 12 shows the pose of random frame 1 in the world coordinate system. The frame shows four directional views of the pose: front, top, left, and right. The three axes (X, Y, Z) are in millimeters. The origin of the frame is (0 mm, 0 mm, 0 mm); the minimum value of each axis is -1000 mm and the maximum value is 1000 mm.
Fig. 13 shows the pose of random frame 1 in the screen coordinate system. The frame shows the front view of the pose at a size of 640 px x 480 px. The two axes (x, y) are in pixels. The origin (0 px, 0 px) of the frame is located in the upper left corner.
Fig. 14 shows the depth patch for joint J0 in random frame 1. The patch shown is the portion of the depth map (640 x 480) centered on the coordinates of joint J0. The size of the patch is 11 px x 11 px (horizontal portion of the depth map starting at 350). In direct mapping, the depth value at the patch center corresponds to the depth of the joint. In the neighbor assignment method, all points within distance r of the center participate in the mapping algorithm. Since the center of this patch is a hole in the depth map (depth value 0 mm), direct mapping is not used in this case; instead the minimum value in the neighborhood with r = 4, which is 2998 mm, is adopted.
Fig. 15 shows the depth patch for joint J1 in random frame 1. The patch shown is the portion of the depth map (640 x 480) centered on the coordinates of joint J1. The size of the patch is 11 px x 11 px (horizontal portion of the depth map starting at 350). In direct mapping, the depth value at the patch center corresponds to the depth of the joint. In the neighbor assignment method, all points within distance r of the center participate in the mapping algorithm. Since the center of this patch is not a hole in the depth map (depth value ≠ 0 mm), direct mapping is used in this case and a joint depth of 2798 mm is adopted. A small usage example follows.
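Using the select_joint_depth sketch from above, the two patches correspond to the two branches of the selection rule; the toy patch below is illustrative and not the patent's actual depth data.

```python
import numpy as np

# Toy 11x11 depth patch (values in mm, illustrative only).
patch = np.full((11, 11), 3005, dtype=np.int32)
patch[5, 6] = 2998

patch[5, 5] = 0      # center is a hole -> neighborhood minimum within r = 4
print(select_joint_depth(patch, x=5, y=5, r=4))   # 2998.0

patch[5, 5] = 2798   # center is valid -> direct mapping, center value returned
print(select_joint_depth(patch, x=5, y=5, r=4))   # 2798.0
```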
FIG. 16 shows the pose of random frame 2 in the world coordinate system. The frame shows four directional views of the pose: front, top, left, and right. The three axes (X, Y, Z) are in millimeters. The origin of the frame is (0 mm, 0 mm, 0 mm); the minimum value of each axis is -1000 mm and the maximum value is 1000 mm.
Fig. 17 shows the pose of random frame 2 in the screen coordinate system. The frame shows the front view of the pose at a size of 640 px x 480 px. The two axes (x, y) are in pixels. The origin (0 px, 0 px) of the frame is located in the upper left corner.
Fig. 18 shows the depth patch for joint J5 in random frame 2. The patch shown is the portion of the depth map (640 x 480) centered on the coordinates of joint J5. The size of the patch is 11 px x 11 px (horizontal portion of the depth map starting at 345). In direct mapping, the depth value at the patch center corresponds to the depth of the joint. In the neighbor assignment method, all points within distance r of the center participate in the mapping algorithm. Since the center of this patch is a hole in the depth map (depth value 0 mm), direct mapping is not used in this case; instead the minimum value in the neighborhood with r = 4, which is 2341 mm, is adopted.
Fig. 19 shows the depth patch for joint J8 in random frame 2. The patch shown is the portion of the depth map (640 x 480) centered on the coordinates of joint J8. The size of the patch is 11 px x 11 px (horizontal portion of the depth map starting at 307). In direct mapping, the depth value at the patch center corresponds to the depth of the joint. In the neighbor assignment method, all points within distance r of the center participate in the mapping algorithm. Since the center of this patch is not a hole in the depth map (depth value ≠ 0 mm), direct mapping is used in this case and a joint depth of 2527 mm is adopted.
Description of the drawings:
The figures illustrate the abstract idea of the invention and give an overview of the systems, methods, and processes used.
Fig. 1 shows an abstract block diagram of the 2D-to-3D body pose mapper and the anthropometric limb corrector.
FIG. 2 illustrates a block diagram of a typical 2D body pose estimation system.
FIG. 3 illustrates a block diagram of the 2D-to-3D pose mapping module.
Figure 4 shows the anthropometric limb corrector module.
Figure 5 shows the graphical human stick model of the pose detector.
Figure 6 illustrates the limb configuration and anthropometric results for each limb.
Fig. 7A illustrates a color and depth frame of size MxN.
Fig. 7B illustrates 2D point optimal depth selection based on euclidean distance.
Fig. 8 shows the conversion process of the screen coordinate system to the world coordinate system using perspective projection.
Fig. 9 illustrates a projection of a limb on the depth axis.
Fig. 10 illustrates the order of the 3D joint fine-tuning sequence.
FIG. 11 illustrates the projection of the 2D pose along the y-axis and z-axis.
Fig. 12 shows random frame 1 in the world coordinate system.
Fig. 13 shows random frame 1 in the screen coordinate system.
Fig. 14 shows the depth frame patch for joint J0 in random frame 1.
Fig. 15 shows the depth frame patch for joint J1 in random frame 1.
Fig. 16 shows random frame 2 in the world coordinate system.
Fig. 17 shows random frame 2 in the screen coordinate system.
Fig. 18 shows the depth frame patch for joint J5 in random frame 2.
Fig. 19 shows the depth frame patch for joint J8 in random frame 2.
Detailed description of embodiments:
The system produces a precise 3D pose from the 2D pose, a depth corrector module, and body measurements. For each frame in the image sequence, it acquires an RGB-D image as a color image and depth image pair. The color image is passed to a 2D pose estimator, which generates a 2D pose for each person in the image as a set of keypoints. The system then uses a depth mapping module to select an optimal depth candidate point from the depth map for each 2D joint, mapping the 2D pose to a 3D pose. The 3D pose may still contain joint depths that are wrong (i.e. differ greatly from the true depth); these are further refined using the anthropometric measurements. The system is compatible with any 2D body pose estimation module, regardless of its implementation, that returns the 2D joint set shown in fig. 5. A per-person sketch of this flow is given below.
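The sketch below composes the helper sketches from the previous section into one per-person flow; the camera and pose dictionaries are assumed data layouts, not structures defined by the patent.

```python
def lift_and_correct(pose_2d, depth_map, camera, limb_table=ANTHROPOMETRIC_LIMB_MM,
                     r=4, tolerance_mm=30.0):
    """2D pose -> accurate 3D pose: neighborhood depth mapping, screen-to-world
    conversion, then anthropometric depth correction in the fig. 10 order."""
    world = {}
    for name, (x, y, conf) in pose_2d.items():      # pose_2d: joint name -> (x, y, conf)
        z = select_joint_depth(depth_map, int(x), int(y), r)
        wx, wy, wz = screen_to_world(x, y, z, camera["fov_h_deg"], camera["fov_v_deg"],
                                     camera["width"], camera["height"])
        world[name] = [wx, wy, wz, conf]
    for name in refinement_order():                  # root first, leaves last
        p = PARENT.get(name)
        if p is None or p not in world or name not in world or (name, p) not in limb_table:
            continue
        z_approx = approximate_joint_depth(world[p][:3], world[name][:2],
                                           limb_table[(name, p)])
        world[name][2] = correct_depth(world[name][2], z_approx, tolerance_mm)
    return world
```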
Consider a depth camera that returns 640 x 480 pixel RGB frames at 30 fps and synchronized 640 x 480 pixel depth maps at 30 fps. The parameters of this camera are: nearest depth 1 meter, farthest depth 8 meters, FOVh = 60° and FOVv = 49.5°. Fig. 12 shows a random instantaneous frame (frame 1) in the world coordinate system, and fig. 13 shows its corresponding 640 x 480 pixel screen coordinate frame. Table 2 shows the pose detected by the 2D pose estimation module for the same frame, while figs. 14 and 15 show the corresponding depth map patches for joints J0 and J1, respectively. The pose estimation module returns a set of 21 2D joints for each image in the sequence. This input is then passed to the depth mapper module, which returns the 3D pose.
TABLE 2
[Table 2: 2D joints detected for frame 1, rendered as images in the original.]
The depth mapper module maps each joint depth using equation (4) and the neighbor parameter r = 4. Joint depths of 0, caused by holes in the depth map, are replaced as shown in the table above. Table 3 shows the screen coordinates of each joint in Table 2 and the corresponding world coordinates. The conversion from screen coordinates to world coordinates is done with equations (5)-(9); a usage sketch follows Table 3.
TABLE 3
[Table 3: screen coordinates and corresponding world coordinates for each joint of Table 2, rendered as images in the original.]
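For the camera of this embodiment, the screen-to-world sketch from above can be called as shown below; the joint pixel coordinates are illustrative, since the actual values of Tables 2 and 3 are rendered as images in the original.

```python
# Camera of the embodiment: 640x480 frames, FOVh = 60 deg, FOVv = 49.5 deg.
CAMERA = {"fov_h_deg": 60.0, "fov_v_deg": 49.5, "width": 640, "height": 480}

# Hypothetical joint pixel (illustrative) with the 2798 mm depth mapped above:
wx, wy, wz = screen_to_world(400, 200, 2798,
                             CAMERA["fov_h_deg"], CAMERA["fov_v_deg"],
                             CAMERA["width"], CAMERA["height"])
# wz equals 2798 mm (equation (7)); wx and wy scale the pixel offsets from the
# image center by the depth and the tangent of half the field of view.
```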
The world coordinates are then input to the anthropometric depth corrector module for depth value correction, which uses the anthropometric values given in figure 6. The system calculates the Limb Anthropometric Error (LAE) of each joint using equations (10)-(12) with a tolerance of 30 mm. Table 4 lists the approximate depth value, LAE, and final depth of each joint; a small numeric illustration follows Table 4.
TABLE 4
[Table 4: approximate depth, LAE, and final depth for each joint of frame 1, rendered as an image in the original.]
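A small numeric illustration of the correction step with the 30 mm tolerance, using the sketches from the previous section; the coordinates and limb length below are invented for illustration and do not come from Tables 2-4.

```python
# Parent joint at (100, -300, 2800) mm, child joint projected to (100, -600) mm
# in the world x-y plane, assumed anthropometric limb length 450 mm, and a
# sensor depth of 3200 mm for the child joint (all values illustrative).
z_approx = approximate_joint_depth((100.0, -300.0, 2800.0), (100.0, -600.0), 450.0)
# planar limb length = 300 mm, so z_approx = 2800 + sqrt(450**2 - 300**2) ~ 3135 mm
z_final = correct_depth(3200.0, z_approx, tolerance_mm=30.0)
# LAE = |3200 - 3135| ~ 65 mm > 30 mm, so the anthropometric depth replaces the
# sensor depth for this joint.
```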
For the same camera, fig. 16 shows a second random instantaneous frame (frame 2) in the world coordinate system, and fig. 17 shows its corresponding 640 x 480 pixel screen coordinate frame. Table 5 shows the pose detected by the 2D pose estimation module for this frame; the depth map patches for joints J5 and J8 are shown in figs. 18 and 19. The pose estimation module returns a set of 21 2D joints for each image in the sequence. This input is then passed to the depth mapper module, which returns the 3D pose.
TABLE 5
[Table 5: 2D joints detected for frame 2, rendered as an image in the original.]
The missing depths are recovered using the neighbor parameter r = 4 and are shown in Table 5. Table 6 gives the screen coordinates of each joint in Table 5 and the corresponding world coordinates. The conversion from screen coordinates to world coordinates is done with equations (5)-(9).
TABLE 6
[Table 6: screen coordinates and corresponding world coordinates for each joint of Table 5, rendered as images in the original.]
Table 7 shows the approximate depth values, LAE, and final depth for each joint.
TABLE 7
[Table 7: approximate depth, LAE, and final depth for each joint of frame 2, rendered as images in the original.]

Claims (12)

1. A computer-based method for 2D pose to 3D pose mapping and 3D pose optimization, comprising:
a. an input module for extracting a real-time RGB image sequence from an input device.
b. An input module for extracting a real-time synchronized depth map from a depth sensor.
c. A pose estimation module for acquiring images and generating 2D joint positions for each person in the images.
d. A mapping module for lifting a 2D human pose to a 3D pose.
e. A 3D pose optimization module for correcting joint depth using anthropometry.
2. The method of claim 1, receiving a sequence of RGB + D images comprising one or more persons.
3. The method of claim 1, for whole body and part body poses, mapping each 2D joint to a 3D joint in a depth map.
4. The method of claim 3, comprising direct point-to-point mapping between the RGB image and the depth map for joints whose depth is not missing.
5. The method of claim 4, comprising selecting the best depth candidate point in the vicinity of the mapped point in the depth map.
6. The method of claim 1, wherein the precise 3D body posture is generated using anthropometry.
7. A system for 2D to 3D human pose mapping and 3D pose refinement, the system comprising:
a. an apparatus for capturing a real-time monocular image sequence.
b. An apparatus for capturing a synchronized depth map using a depth sensor.
c. A sequence of images containing one or more persons is received.
d. Non-volatile memory for non-runtime saving of system binaries.
e. A main memory for saving system executable files, image sequences and depth maps during processing.
f. A computer processor for executing the system binaries to perform a method comprising:
i. the image sequence is captured and stored as a frame buffer.
Creating and storing the synchronized depth map as a depth buffer.
Processing each image to generate a set of 2D joint locations and joint confidence scores for each person in the image.
Processing the 3D pose using anthropometry and 2D projection of the joints to generate an accurate 3D pose.
8. The method of claim 7, wherein the depth approximation is performed using average human body measurements.
9. The method of claim 8, calculating the joint depth approximation using the limb 2D projection and anthropometric limb length.
10. The method of claim 8, calculating the limb joint measurement error as a difference between the sensor depth and the anthropometric approximate depth.
11. The method of claim 8, using limb joint measurement error for joint depth assignment.
12. The method of claim 8, comprising replacing the depth of a joint with the approximate value calculated from the 2D projection of its limb (from joint Om to joint On) and the average body measurements. [The formula of this claim is rendered as an image in the original.]
CN202110682875.4A 2021-06-18 2021-06-18 Limb corrector based on human body measurement in 2D human body posture estimation system Pending CN113643362A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110682875.4A CN113643362A (en) 2021-06-18 2021-06-18 Limb corrector based on human body measurement in 2D human body posture estimation system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110682875.4A CN113643362A (en) 2021-06-18 2021-06-18 Limb corrector based on human body measurement in 2D human body posture estimation system

Publications (1)

Publication Number Publication Date
CN113643362A true CN113643362A (en) 2021-11-12

Family

ID=78415949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110682875.4A Pending CN113643362A (en) 2021-06-18 2021-06-18 Limb corrector based on human body measurement in 2D human body posture estimation system

Country Status (1)

Country Link
CN (1) CN113643362A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination