CN114066932A - Real-time deep learning-based multi-person human body three-dimensional posture estimation and tracking method

Real-time deep learning-based multi-person human body three-dimensional posture estimation and tracking method

Info

Publication number
CN114066932A
CN114066932A (application CN202111130790.1A)
Authority
CN
China
Prior art keywords
joint
posture
dimensional
person
human body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111130790.1A
Other languages
Chinese (zh)
Inventor
欧林林
许成军
张旭环
张鑫
禹鑫燚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202111130790.1A priority Critical patent/CN114066932A/en
Publication of CN114066932A publication Critical patent/CN114066932A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person
    • G06T2207/30244 Camera pose

Abstract

A real-time multi-person human body three-dimensional pose estimation and tracking method based on deep learning first obtains a feature matrix (feature map) from RGB image information through a feature extraction network. Then, a CPM network regresses the positions of the human 2D joint points and the confidence scores corresponding to each joint (HeatMap and PAFs); next, the acquired 2D joint information and the feature matrix are input into a ResNet residual block to regress the 3D pose (Location Map). Finally, a priority redundancy association algorithm is proposed that assigns the detected two-dimensional keypoints and three-dimensional location maps to individuals. In addition, for the problem of multi-person pose tracking, the invention provides a multi-person human body pose tracking algorithm based on a greedy strategy, which can effectively track the three-dimensional poses of multiple people even when some frames are missing due to association errors or occlusion during pairing. The invention meets real-time and lightweight requirements and can be used in a variety of practical applications.

Description

Real-time deep learning-based multi-person human body three-dimensional posture estimation and tracking method
Technical Field
The invention relates to monocular multi-person human body three-dimensional pose estimation and tracking technology, and provides a real-time multi-person three-dimensional pose estimation method based on deep learning and a human body pose tracking method based on greedy matching. In particular, to address the real-time and occlusion problems in multi-person three-dimensional pose estimation, a lightweight neural network and a joint priority redundancy strategy are proposed, respectively.
Background
Human body pose perception, i.e., Human Pose Estimation, is an important task in computer vision and an essential step for computers to understand human actions and behaviors; it is widely applied in human-computer interaction, AR, VR, games, and other fields. In recent years, deep-learning-based human pose estimation methods have been proposed in succession, achieving performance far beyond traditional methods. In practice, human pose estimation is often converted into a keypoint prediction problem: the position coordinates of each human keypoint are predicted first, and then the spatial relations between the keypoints are determined from prior knowledge, yielding the predicted human skeleton. In previous work, for example the paper "Real-time human-robot interaction in complex environment using kinect v2 image recognition" by YANG Y, YAN H, DEHGHAN M, et al., a Kinect depth camera was widely used for three-dimensional human pose estimation, but reading human joint positions from depth information suffers from depth ambiguity caused by occlusion. To address occlusion, multi-view pose fusion is generally a good approach: more accurate pose information can be obtained by combining estimation results from different views. However, multi-view fusion is computationally heavy, so real-time performance degrades as the number of people grows, making it unsuitable for multi-person pose detection. Thanks to the development of deep learning in recent years, image-based three-dimensional human pose estimation has advanced greatly and performs well on both the real-time and the occlusion problems.
Disclosure of Invention
The invention overcomes the defects of the prior art by providing a real-time multi-person human body three-dimensional pose estimation method based on deep learning, which improves the real-time performance of pose estimation while reducing the number of sensors and preserving motion capture precision.
For RGB images or video streams acquired from a camera view, the invention estimates and tracks the 2D and 3D poses of the people under the current camera view through a lightweight neural network. First, a feature matrix (feature map) is obtained from the RGB image information through a feature extraction network. Then, a CPM network regresses the positions of the human 2D joint points and the confidence scores corresponding to each joint (HeatMap and PAFs); next, the acquired 2D joint information and the feature matrix are input into a ResNet residual block to regress the 3D pose (Location Map). Finally, a priority redundancy association algorithm is proposed that assigns the detected two-dimensional keypoints and three-dimensional location maps to individuals. In addition, for the problem of multi-person pose tracking, the invention provides a multi-person human body pose tracking algorithm based on a greedy strategy. The algorithm can effectively track the three-dimensional poses of multiple people, even when certain frames are missing due to association errors or occlusion during pairing.
A real-time multi-person human body three-dimensional posture estimation and tracking method based on deep learning comprises the following specific steps:
step 1: designing a neural network structure;
the present invention uses a lightweight MobileNet V3 as the backbone network and modifies it to have a multi-stage multitasking architecture. The network has two output branches including a 2D pose estimation branch and a 3D pose estimation branch, wherein the 2D pose branch simultaneously regresses keypoint heat Map and paf, and the 3D pose branch regresses Location Map. Given an RGB image, a feature matrix is obtained by a lightweight trunk and input to a two-dimensional branch, CPM-based resulting in keypoint heatma and paf. Then, the feature matrix and the 2D gesture are input into the three-dimensional branch network by using ResNet, and the 3D Location Map is regressed at the 2D body limb pixel. Furthermore, we supervise Location maps between different phases to reduce the dependency of the network on the data set. Assuming the predefined number of nodes is N, the network will output a fixed number of maps, including N HeatMaps, 2N paf, and 3N Location Maps. The output is represented as follows:
heatmaps: all in the imagePixel locations where a human joint may exist. All the human body 2D postures in the image are collected into
Figure BDA0003280453690000031
Each attitude piThere are 15 articulated points, each containing corresponding pixel coordinates
Figure BDA0003280453690000032
And corresponding confidence
Figure BDA0003280453690000033
Represents the detection evaluation of the neural network on the joint point when
Figure BDA0003280453690000034
It indicates that the node is not detected. Wherein i represents the number of the pose in the image, and j represents the joint number corresponding to the pose.
PAFs: a set of two-dimensional vectors, the vector at each node of interest representing the 2D direction of the respective body part. Its role is to correctly assign the detected 2D joint points to the corresponding persons.
Location map: the joint feature channels are used to store 3D coordinates that are regressed at 2D pixel locations. For each joint, three maps are required to represent the corresponding x, y, z estimated coordinates, respectively. For an image of size W H, the three-dimensional positions of all n joints are stored using 3n position maps of size W/k H/k, where k is a down-sampling factor. Similar to Heat Map, the 3D pose of the network fit is noted
Figure BDA0003280453690000035
Each 3D pose PiComprises 15 joints, each joint
Figure BDA0003280453690000036
Consisting of corresponding x, y, z coordinates.
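As a concrete illustration of these three outputs, the following minimal PyTorch sketch builds output heads for N = 15 joints; the 1×1 convolutional heads, channel widths, and module names are illustrative assumptions, not the patent's exact architecture.

    import torch
    import torch.nn as nn

    N_JOINTS = 15   # predefined joint count N
    N_LIMBS = 15    # limb count for PAFs (2 channels each)

    class PoseHeads(nn.Module):
        """Multi-task heads: N HeatMaps, 2N PAFs, 3N Location Maps."""
        def __init__(self, in_ch: int = 128):
            super().__init__()
            self.heatmaps = nn.Conv2d(in_ch, N_JOINTS, kernel_size=1)
            self.pafs = nn.Conv2d(in_ch, 2 * N_LIMBS, kernel_size=1)
            # the 3D branch consumes backbone features plus the 2D outputs
            self.location = nn.Conv2d(in_ch + N_JOINTS + 2 * N_LIMBS,
                                      3 * N_JOINTS, kernel_size=1)

        def forward(self, feat):
            hm = self.heatmaps(feat)    # (B, N, H/k, W/k) joint heatmaps
            paf = self.pafs(feat)       # (B, 2N, H/k, W/k) part affinity fields
            loc = self.location(torch.cat([feat, hm, paf], dim=1))  # (B, 3N, H/k, W/k)
            return hm, paf, loc

    heads = PoseHeads()
    feat = torch.randn(1, 128, 32, 56)      # downsampled feature map (H/k x W/k)
    hm, paf, loc = heads(feat)              # shapes: (1,15,...), (1,30,...), (1,45,...)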
Step 2: constructing a loss function;
In the present invention, we construct a loss function based on the 2D and 3D poses and a supervision process. During training, the $L_2$ loss is applied to all branches. The 2D pose loss $L_{2D}$ is the pixel-position error between the HeatMaps and PAFs and their ground truth. The 3D pose loss $L_{loc}$ is the joint error between the Location Maps and their ground truth. The supervision loss $L_{sup}$ is the Location Map error between different stages. The overall loss $L_{total}$ is expressed as follows:

$$L_{total} = \sum_{s=1}^{S} \left( w_{2D} L_{2D}^{s} + w_{loc} L_{loc}^{s} + w_{sup} L_{sup}^{s} \right) \tag{1}$$

where N and S are the numbers of joints and network stages respectively, each term is an $L_2$ error summed over the N joints and all pixel positions p, and the superscript * denotes the ground truth. $w_{2D}$, $w_{loc}$, and $w_{sup}$ are weighting (penalty) factors.
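A hedged PyTorch sketch of loss (1) follows, assuming plain mean-squared errors per stage and supervising intermediate Location Maps toward the final stage; the tensor layouts and the direction of the inter-stage supervision are assumptions not fixed by the text.

    import torch.nn.functional as F

    def total_loss(stages, hm_gt, paf_gt, loc_gt, w2d=1.0, wloc=1.0, wsup=1.0):
        """stages: list of (heatmaps, pafs, location_maps), one tuple per stage s."""
        final_loc = stages[-1][2].detach()   # assumed target for inter-stage supervision
        loss = 0.0
        for hm, paf, loc in stages:
            l2d = F.mse_loss(hm, hm_gt) + F.mse_loss(paf, paf_gt)  # 2D pose loss L_2D
            lloc = F.mse_loss(loc, loc_gt)                         # Location Map loss L_loc
            lsup = F.mse_loss(loc, final_loc)                      # supervision loss L_sup
            loss = loss + w2d * l2d + wloc * lloc + wsup * lsup
        return loss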
step 3: reconstructing the three-dimensional poses of multiple human bodies;
From the position coordinates of the keypoints in the Keypoint HeatMap and Location Map, we need to associate the detected joints with the corresponding individuals. If joints are assigned directly using PAF scores, the pose information is unreliable under occlusion. During inference, since the number of people in the input image is unknown, a root-joint depth map is used to determine how many people are present. Generally, the mid-body trunk joints (neck, hips) are not occluded, which makes them the best choice for the root joint; in this work, we take the human neck joint as the root joint. If a person's root joint is visible, we continue to assign joints to that person. Otherwise, the person is not visible in the scene and their pose cannot be predicted.
To address the occlusion problem, the invention prioritizes non-occluded persons when assigning joints. The occlusion state can be inferred from the depth map (the Z channel of the Location Map) predicted by the network, and the root depth value reflects each person's absolute position. Thus, instead of a PAF score, each person's priority is ordered from near to far by predicted root depth. Our network allows the position of a limb to be read from any 2D joint of that limb. For an individual, the basic pose $P_{base}$ is first read at the root joint; this basic pose is the average pose in the dataset. We then continue to read the limb poses from the joints near the root to obtain the full 3D pose. If a joint is valid, its limb pose replaces the corresponding joint of the basic pose; otherwise, the other joints of the limb are examined along the kinematic chain. If no joint of the limb is valid, the limb pose cannot be refined. Finally, another camera-model-based refinement is proposed to further reduce errors. Given the visible 2D coordinates and the joint depth, the 3D joint can be recovered by the camera model as follows:
$$[X, Y, Z]^T = Z\,K^{-1}[x, y, 1]^T \tag{2}$$
where [X, Y, Z] and (x, y) denote the 3D and 2D coordinates of the joint, respectively, and K is the camera intrinsic matrix.
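A minimal NumPy sketch of equation (2) follows; the intrinsic matrix values are placeholders chosen for illustration.

    import numpy as np

    def backproject(x, y, Z, K):
        """Recover [X, Y, Z]^T = Z * K^(-1) [x, y, 1]^T in camera coordinates."""
        return Z * np.linalg.inv(K) @ np.array([x, y, 1.0])

    # assumed example intrinsics (focal lengths fx, fy and principal point cx, cy)
    K = np.array([[1000.0, 0.0, 320.0],
                  [0.0, 1000.0, 240.0],
                  [0.0, 0.0, 1.0]])
    joint_3d = backproject(350.0, 260.0, 2.5, K)   # 2D pixel (350, 260) at depth 2.5 m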
In summary, in this process the pose estimator outputs 2D and 3D poses for each person in each frame. Each pose consists of the corresponding joint points: the 2D pose information contains the pixel coordinates of each joint in the image and the corresponding confidence, while the 3D pose information contains the spatial position of each joint relative to the root joint and is ultimately expressed in camera coordinates.
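The read-out just described can be summarized in the following hedged sketch; the helper names (KINEMATIC_CHAINS, is_valid, read_limb) and data structures are hypothetical stand-ins, and the real method's bookkeeping would be more involved.

    def reconstruct_poses(people, location_maps, base_pose):
        """Priority redundancy read-out: nearer (less occluded) people first."""
        people.sort(key=lambda person: person.root_depth)   # near-to-far priority
        poses = []
        for person in people:
            pose = base_pose.copy()          # average pose read at the root joint
            for chain in KINEMATIC_CHAINS:   # joints of one limb, ordered from the root
                for joint in chain:
                    if is_valid(person, joint):
                        # read the limb's 3D coordinates at this joint's 2D pixel
                        pose = read_limb(pose, location_maps, person, joint)
                        break                # first valid joint refines the limb
                # if no joint of the limb is valid, the limb keeps the base pose
            poses.append(pose)
        return poses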
step 4: tracking multi-person three-dimensional poses over a continuous time sequence;
the above-described 3D pose estimation method processes only data of the current frame. Therefore, a 3D gesture belonging to the same person cannot be recognized in consecutive frames. At this stage, a continuous frame three-dimensional pose tracking algorithm based on a greedy strategy is designed by using the three-dimensional pose estimation result of each frame, and the tracking problem of multi-person poses on a time sequence is solved.
At this step, we need to redefine the notation of the 3D poses to include a time index t. For example, $S^t$ denotes the set of all 3D skeletons at time t, $S_i^t$ the pose numbered i at the current time, $J_{i,n}^t$ the n-th joint of that skeleton, and $b_{i,n}^t \in \{0, 1\}$ indicates whether the n-th joint is present at time t.
The algorithm takes the unordered 3D poses of each frame as input and outputs a 4D pose sequence with temporal information. A forward-search method is used to find the skeletons belonging to the same person across consecutive frames, and skeletons in different frames are connected by computing corresponding costs with a greedy algorithm. Even when a skeleton is missing in some frames because of association errors or occlusion during pairing, the method still ensures that the skeleton can be tracked effectively. Because only three-dimensional poses are available at this stage, the cost between skeletons is defined as:

$$\zeta\left(S_i^t, S_j^{t-1}\right) = \frac{1}{N} \sum_{n=1}^{N} \left\lVert J_{i,n}^{t} - J_{j,n}^{t-1} \right\rVert \tag{3}$$

where $\lVert\cdot\rVert$ denotes the Euclidean distance between pose $S_i^t$ and pose $S_j^{t-1}$, $n \in \{1, 2, \dots, N\}$ is the joint index, and N is the total number of joints in the skeleton.
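A small NumPy sketch of the cost in equation (3), restricted to joints present in both frames via the flags b, is shown below (the array shapes are assumptions):

    import numpy as np

    def skeleton_cost(joints_t, joints_prev, b_t, b_prev):
        """joints_*: (N, 3) joint coordinates; b_*: (N,) 0/1 presence flags."""
        both = (b_t * b_prev).astype(bool)       # joints present in both frames
        if not both.any():
            return np.inf                        # no common joints: cannot pair
        dists = np.linalg.norm(joints_t[both] - joints_prev[both], axis=1)
        return dists.mean()                      # mean Euclidean joint distance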
Pose tracking falls into three cases. In case (a), the numbers of poses in the previous and current frames are the same, and skeletons in different frames are connected by their corresponding confidences. In case (b), the number of poses in the current frame is greater than in the previous frame; an unpaired skeleton (skeleton 1) continues to search forward and pair until frame t − τ_s. In case (c), the number of poses in the current frame is also greater than in the previous frame; after the forward search completes, an unpaired skeleton (skeleton 3) still remains in the current frame, and a new ID should then be assigned to skeleton 3.
Define the current frame t as the frame to be matched and initialize the search frame to t − 1. The matching degree is computed between all skeletons in the frame to be matched and the search frame, and all candidate skeleton pairs in the sequence are sorted by increasing value of ζ. After sorting all candidate correspondences, if $\zeta_{min}$ is below a set threshold δ, the pairing is considered valid: the pose $S_i^t$ in the current frame inherits the ID information of the successfully paired pose $S_j^{t-1}$ in the search frame. At the same time, $\zeta_{min}$ and its associated "redundant pairs" are deleted. If unmatched skeletons remain in the current frame, this means either that new skeletons have appeared or that these skeletons lost track during association due to errors or occlusion. In that case, the search frame is set to t − 2 and the pairing-and-updating process is repeated. This continues up to frame t − τ_s, where τ_s is the maximum number of frames that may be searched. If an unpaired pose still remains at that point, it is considered newly appeared and is given unique ID information.
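The following hedged sketch outlines this greedy pairing over a window of τ_s frames; the frame and track data structures are assumptions, and the redundancy-pair deletion of the real method is simplified here.

    import itertools

    def track(frames, delta, tau_s, cost):
        """frames[t] is a list of skeletons; returns tracks[t][i] = person ID."""
        new_id = itertools.count()
        tracks = {}
        for t, skeletons in enumerate(frames):
            ids = {}
            unmatched = set(range(len(skeletons)))
            for back in range(1, tau_s + 1):          # search t-1, t-2, ..., t-tau_s
                if t - back < 0 or not unmatched:
                    break
                prev_ids = tracks.get(t - back, {})
                pairs = sorted((cost(skeletons[i], frames[t - back][j]), i, j)
                               for i in unmatched for j in prev_ids)
                used = set()
                for zeta, i, j in pairs:              # cheapest valid pairing first
                    if zeta > delta:
                        break                         # remaining candidates are worse
                    if i in unmatched and j not in used:
                        ids[i] = prev_ids[j]          # inherit the paired skeleton's ID
                        unmatched.discard(i)
                        used.add(j)
            for i in unmatched:                       # still unpaired: new person
                ids[i] = next(new_id)
            tracks[t] = ids
        return tracks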
The invention has the advantages that:
1. The invention designs a lightweight monocular multi-person 3D human pose estimator that meets real-time requirements and uses it to estimate the positions of the 3D joint points of multiple people in a scene. Through human kinematic constraints, combined with joint redundancy and pose-encoding priority, it solves the problem of mutual occlusion between people in different scenes.
2. The invention designs a greedy strategy-based continuous frame three-dimensional pose tracking algorithm, solves the problems of continuous tracking and identification of different people in a scene, and improves the stability of a system.
Drawings
Fig. 1 is a schematic diagram of a monocular multi-person human body posture estimation network structure according to the present invention.
FIG. 2 is a schematic diagram of the human body pose estimation network of the present invention.
Fig. 3 is a schematic diagram of the network architecture composition of the present invention.
Figs. 4(a)-4(c) are schematic diagrams of the pose tracking of the present invention. In fig. 4(a), the numbers of poses in the previous and current frames are the same, and skeletons in different frames are connected by their corresponding confidences. In fig. 4(b), the number of poses in the current frame is greater than in the previous frame; the unpaired skeleton 1 continues to search forward and pair until frame t − τ_s. In fig. 4(c), the number of poses in the current frame is also greater than in the previous frame; after the forward search completes, an unpaired skeleton (skeleton 3) remains in the current frame, and an ID should then be assigned to skeleton 3.
Figs. 5(a)-5(c) are schematic diagrams of the multi-person human body pose estimation of the present invention, where figs. 5(a), 5(b), and 5(c) show the pose estimation results for three different image frames; the top row is the RGB image input to the system, and the bottom row shows the corresponding pose estimation results.
Detailed Description
The technical scheme of the invention is further explained by combining the attached drawings.
The invention relates to a real-time multi-person human body three-dimensional posture estimation method based on deep learning, which comprises the following specific processes:
In this embodiment, a color camera is used to capture the full-body poses of multiple people. The method imposes no limit on the scene or on the number of people in it, and has good generality.
Step 1: designing a neural network structure;
the present invention uses a lightweight MobileNet V3 as the backbone network and modifies it to have a multi-stage multitasking architecture. The network has two output branches including a 2D pose estimation branch and a 3D pose estimation branch, wherein the 2D pose branch simultaneously regresses keypoint heat Map and paf, and the 3D pose branch regresses Location Map. As shown in fig. 1, given an RGB image, human 2D pose information (Keypoints HeatMap and PAFs Map) and human 3D pose information (Location Map) are obtained through a neural network. And then, distributing the regressed joint information by a priority redundancy method to further obtain the 3D posture. Network regression method as shown in fig. 2, the RGB image is subjected to a backbone network to extract a feature matrix, and the feature matrix is input to a two-dimensional branch, and keypoint heatma and paf are obtained based on CPM. Then, the feature matrix and the 2D gesture are input into the three-dimensional branch network by using ResNet, and the 3D Location Map is regressed at the 2D body limb pixel. Furthermore, we supervise Location maps between different phases to reduce the dependency of the network on the data set. The specific implementation of each module is shown in fig. 3.
Step 2: constructing a loss function;
The invention constructs a loss function based on the 2D and 3D poses and a supervision process. During training, the $L_2$ loss is applied to all branches. The 2D pose loss $L_{2D}$ is the pixel-position error between the HeatMaps and PAFs and their ground truth. The 3D pose loss $L_{loc}$ is the joint error between the Location Maps and their ground truth. The supervision loss $L_{sup}$ is the Location Map error between different stages.
The 2D pose branch and the 3D branch of the network are trained and validated on the COCO and CMU datasets, respectively. COCO is a large-scale object detection dataset containing over 200,000 images and 250,000 person instances with keypoint labels; the training and test sets (over 150,000 people and 1.7 million labeled keypoints) are public. In the experiments, the pixel positions of multi-person two-dimensional keypoints are regressed on this dataset. CMU Panoptic is a large dataset of various indoor social activities (playing musical instruments, dancing, etc.) collected by multiple cameras; mutual occlusion and truncation between individuals makes recovering the 3D poses challenging. The invention likewise regresses the three-dimensional positions of the human joints on this dataset.
The present invention implements the proposed network using the PyTorch framework. The optimizer used in training is Adam with parameters β1 = 0.9 and β2 = 0.999, a learning rate of 0.0002, and a batch size of 32. Twenty epochs are trained on the COCO and CMU Panoptic datasets to obtain the final model. Images are resized to a fixed 455×256 as network input, and 200K images from different sequences are selected as our training set. Two cameras (16 and 30) across four activities (Haggling, Sports, Ultimatum, Pizza) serve as our test set. Since the COCO dataset lacks 3D pose annotations, the weight of the 3D loss is set to zero when COCO data is input.
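A sketch of this training configuration in PyTorch follows; the stand-in model and dummy tensors merely mirror the stated hyperparameters (Adam, β1 = 0.9, β2 = 0.999, learning rate 0.0002, batch size 32, 20 epochs) and are not the actual network or data pipeline.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    model = nn.Conv2d(3, 45, kernel_size=1)   # stand-in for the pose network
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999))

    # dummy tensors at a reduced resolution (real inputs are resized to 455x256)
    images = torch.randn(64, 3, 64, 114)
    targets = torch.randn(64, 45, 64, 114)
    loader = DataLoader(TensorDataset(images, targets), batch_size=32, shuffle=True)

    for epoch in range(20):                   # 20 epochs in the experiments
        for x, y in loader:
            optimizer.zero_grad()
            loss = nn.functional.mse_loss(model(x), y)   # 3D-loss weight zeroed on COCO
            loss.backward()
            optimizer.step()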
step 3: reconstructing the three-dimensional poses of multiple human bodies;
Joint assignment and three-dimensional pose reconstruction are performed based on the feature maps regressed by the network in step 2. The network in the present invention allows the position of a limb to be read from any 2D joint of that limb. For an individual, the basic pose is first read at the root joint; this basic pose is the average pose in the dataset. Then the limb poses continue to be read from the joints near the root to obtain the complete 3D pose. If a joint is valid, its limb pose replaces the corresponding joint of the basic pose; otherwise, the other joints of the limb are examined along the kinematic chain. If no joint of the limb is valid, the limb pose cannot be further refined. The RGB image is input into the network for prediction to obtain the multi-person pose estimation result, which is visualized in a ROS environment. The recognition results are shown in fig. 5.
step 4: tracking multi-person three-dimensional poses over a continuous time sequence;
The above-described 3D pose estimation method processes only the data of the current frame; therefore, 3D poses belonging to the same person cannot be identified across consecutive frames. At this stage, a continuous-frame three-dimensional pose tracking algorithm based on a greedy strategy is designed using the per-frame three-dimensional pose estimation results, solving the problem of tracking multi-person poses over a time sequence. Pose tracking falls into the three cases shown in figs. 4(a)-4(c). In fig. 4(a), the numbers of poses in the previous and current frames are the same, and skeletons in different frames are connected by their corresponding confidences. In fig. 4(b), the number of poses in the current frame is greater than in the previous frame; the unpaired skeleton 1 continues to search forward and pair until frame t − τ_s. In fig. 4(c), the number of poses in the current frame is also greater than in the previous frame; after the forward search completes, an unpaired skeleton (skeleton 3) remains in the current frame, and an ID should then be assigned to skeleton 3. During tracking, the invention tracks poses using image sequences within 2 seconds, with results shown in fig. 5. Frames 45, 384, and 731 of the scene sequence were chosen to demonstrate the robustness of the algorithm; it can be seen that the algorithm tracks every person in the scene effectively, even in the presence of occlusion.
The embodiments described in this specification merely illustrate implementations of the inventive concept. The scope of the present invention should not be considered limited to the specific forms set forth in the embodiments, but also covers equivalents that may occur to those skilled in the art in light of the inventive concept.

Claims (1)

1. A real-time multi-person human body three-dimensional posture estimation and tracking method based on deep learning comprises the following specific steps:
step 1: designing a neural network structure;
using lightweight MobileNetV3 as the backbone network and modifying it into a multi-stage multi-task structure; the network has two output branches, a 2D pose estimation branch and a 3D pose estimation branch, wherein the 2D pose branch simultaneously regresses keypoint HeatMaps and PAFs, and the 3D pose branch regresses Location Maps; given an RGB image, a feature matrix is acquired through the lightweight backbone and input into the two-dimensional branch, and the keypoint HeatMaps and PAFs are obtained based on CPM; then, using ResNet, the feature matrix and the 2D pose are input into the three-dimensional branch network, and the 3D Location Map is regressed at the 2D human limb pixels; in addition, the Location Maps between different stages are supervised to reduce the network's dependence on the dataset; assuming the predefined number of joints is N, the network outputs a fixed number of maps, including N HeatMaps, 2N PAFs, and 3N Location Maps; the output is represented as follows:
heatmaps: pixel positions where all human body joint points in the image may exist; recording all human body 2D postures in imageSet of states as
Figure FDA0003280453680000011
Each attitude piThere are 15 articulated points, each containing corresponding pixel coordinates
Figure FDA0003280453680000012
And corresponding confidence
Figure FDA0003280453680000013
Represents the detection evaluation of the neural network on the joint point when
Figure FDA0003280453680000014
If yes, the joint point is not detected; wherein i represents the number of the gesture in the image, and j represents the joint number corresponding to the gesture;
PAFs: a set of two-dimensional vectors, the vector at each node of interest representing the 2D direction of the respective body part; its role is to correctly assign the detected 2D joint points to the corresponding persons;
location map: the joint characteristic channel is used for storing 3D coordinates regressed at the 2D pixel position; for each joint, three maps are required to represent the corresponding x, y, z estimated coordinates; for an image of size W H, storing the three-dimensional positions of all n joints using 3n position maps of size W/k H/k, where k is a down-sampling factor; similar to Heat Map, the 3D pose of the network fit is noted
Figure FDA0003280453680000015
Each 3D pose PiComprises 15 joints, each joint
Figure FDA0003280453680000016
Consisting of corresponding x, y, z coordinates;
step 2: constructing a loss function;
constructing a loss function based on the 2D and 3D poses and a supervision process; during training, the $L_2$ loss is applied to all branches; the 2D pose loss $L_{2D}$ is the pixel-position error between the HeatMaps and PAFs and their ground truth; the 3D pose loss $L_{loc}$ is the joint error between the Location Maps and their ground truth; the supervision loss $L_{sup}$ is the Location Map error between different stages; the overall loss $L_{total}$ is expressed as follows:

$$L_{total} = \sum_{s=1}^{S} \left( w_{2D} L_{2D}^{s} + w_{loc} L_{loc}^{s} + w_{sup} L_{sup}^{s} \right) \tag{1}$$

wherein N and S are the numbers of joints and network stages respectively, each term is an $L_2$ error summed over the N joints and all pixel positions p, and the superscript * denotes the ground truth; $w_{2D}$, $w_{loc}$ and $w_{sup}$ are weighting factors;
and step 3: distributing three-dimensional joints of a multi-person human body and reconstructing three-dimensional postures;
associating the detected joints with the corresponding individuals according to the position coordinates of the keypoints in the Keypoint HeatMap and the Location Map; if joints are assigned directly using PAF scores, the pose information is unreliable under occlusion; during inference, since the number of people in the input image is unknown, a root-joint depth map is used to determine how many people are present; the human neck joint is taken as the root joint; if a person's root joint is visible, joints continue to be assigned to that person; otherwise, the person is not visible in the scene and their pose cannot be predicted;
prioritizing non-occluded persons when assigning joints; the occlusion state can be inferred from the depth map (the Z channel of the Location Map) predicted by the network; the root depth value reflects each person's absolute position; thus, instead of a PAF score, each person's priority is ordered from near to far by predicted root depth; the network allows the position of a limb to be read from any 2D joint of that limb; for an individual, the basic pose $P_{base}$ is first read at the root joint; this basic pose is the average pose in the dataset; then the limb poses continue to be read from the joints near the root to obtain the full 3D pose; if a joint is valid, its limb pose replaces the corresponding joint of the basic pose; otherwise, the other joints of the limb are checked along the kinematic chain; if no joint of the limb is valid, the limb pose cannot be refined; finally, another camera-model-based refinement is proposed to further reduce errors; given the visible 2D coordinates and the joint depth, the 3D joint is recovered by the camera model as follows:
$$[X, Y, Z]^T = Z\,K^{-1}[x, y, 1]^T \tag{2}$$
wherein [X, Y, Z] and (x, y) denote the 3D and 2D coordinates of the joint, respectively, and K is the camera intrinsic matrix;
in general, in this process, the pose estimator outputs 2D and 3D poses for each person in each frame; each posture consists of corresponding joint points, wherein the 2D posture information comprises pixel coordinate values corresponding to the joint points in the image and confidence degrees corresponding to the joint points; the 3D pose information contains the spatial coordinate position of each joint point relative to the root joint and is ultimately represented in camera coordinates;
step 4: tracking multi-person three-dimensional poses over a continuous time sequence;
the above 3D pose estimation method processes only the data of the current frame; therefore, 3D poses belonging to the same person cannot be identified across consecutive frames; at this stage, a continuous-frame three-dimensional pose tracking algorithm based on a greedy strategy is designed using the per-frame three-dimensional pose estimation results, solving the problem of tracking multi-person poses over a time sequence;
at this step, the notation of the 3D poses is redefined to include a time index t: $S^t$ denotes the set of all 3D skeletons at time t, $S_i^t$ the pose numbered i at the current time, $J_{i,n}^t$ the n-th joint of that skeleton, and $b_{i,n}^t \in \{0, 1\}$ indicates whether the n-th joint is present at time t;
taking the unordered 3D poses of each frame as input and outputting a 4D pose sequence with temporal information; a forward-search method is adopted to find the skeletons belonging to the same person across consecutive frames; in the pairing process of connecting different frames by computing corresponding costs with a greedy algorithm, even when a skeleton is missing in some frames because of association errors or occlusion, the skeleton can still be tracked effectively; because only three-dimensional poses are available at this stage, the cost between skeletons is defined as:

$$\zeta\left(S_i^t, S_j^{t-1}\right) = \frac{1}{N} \sum_{n=1}^{N} \left\lVert J_{i,n}^{t} - J_{j,n}^{t-1} \right\rVert \tag{3}$$

wherein $\lVert\cdot\rVert$ denotes the Euclidean distance between pose $S_i^t$ and pose $S_j^{t-1}$, $n \in \{1, 2, \dots, N\}$ is the joint index, and N is the total number of joints in the skeleton;
pose tracking falls into three cases; in case (a), the numbers of poses in the previous and current frames are the same, and skeletons in different frames are connected by their corresponding confidences; in case (b), the number of poses in the current frame is greater than in the previous frame; an unpaired skeleton (skeleton 1) continues to search forward and pair until frame t − τ_s; in case (c), the number of poses in the current frame is also greater than in the previous frame; after the forward search completes, an unpaired skeleton (skeleton 3) still remains in the current frame, and an ID should then be assigned to skeleton 3;
defining the current frame t as the frame to be matched and initializing the search frame to t − 1; computing the matching degree between all skeletons in the frame to be matched and the search frame; sorting all candidate skeleton pairs in the sequence by increasing value of ζ; after sorting all candidate correspondences, if $\zeta_{min}$ is below a set threshold δ, the pairing is considered valid, and the pose $S_i^t$ in the current frame inherits the ID information of the successfully paired pose $S_j^{t-1}$ in the search frame; at the same time, $\zeta_{min}$ and its associated "redundant pairs" are deleted; if unmatched skeletons remain in the current frame, this means that new skeletons have appeared, or that these skeletons lost track during association due to errors or occlusion; at this time the search frame is set to t − 2, and the pairing-and-updating process is repeated; this continues up to frame t − τ_s, where τ_s is the maximum number of frames that may be searched; if an unpaired pose still remains, it is considered newly appeared and is given unique ID information.
CN202111130790.1A 2021-09-26 2021-09-26 Real-time deep learning-based multi-person human body three-dimensional posture estimation and tracking method Pending CN114066932A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111130790.1A CN114066932A (en) 2021-09-26 2021-09-26 Real-time deep learning-based multi-person human body three-dimensional posture estimation and tracking method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111130790.1A CN114066932A (en) 2021-09-26 2021-09-26 Real-time deep learning-based multi-person human body three-dimensional posture estimation and tracking method

Publications (1)

Publication Number Publication Date
CN114066932A true CN114066932A (en) 2022-02-18

Family

ID=80233706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111130790.1A Pending CN114066932A (en) 2021-09-26 2021-09-26 Real-time deep learning-based multi-person human body three-dimensional posture estimation and tracking method

Country Status (1)

Country Link
CN (1) CN114066932A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination