CN114066932A - Real-time deep learning-based multi-person human body three-dimensional posture estimation and tracking method

Real-time deep learning-based multi-person human body three-dimensional posture estimation and tracking method

Info

Publication number
CN114066932A
CN114066932A (application CN202111130790.1A)
Authority
CN
China
Prior art keywords
joint
posture
dimensional
person
human body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111130790.1A
Other languages
Chinese (zh)
Inventor
欧林林
许成军
张旭环
张鑫
禹鑫燚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202111130790.1A priority Critical patent/CN114066932A/en
Publication of CN114066932A publication Critical patent/CN114066932A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person
    • G06T2207/30244 Camera pose

Abstract

A real-time multi-person human body three-dimensional pose estimation and tracking method based on deep learning first obtains a feature matrix (feature map) from RGB image information through a feature extraction network. Then, a CPM network regresses the positions of the human 2D joint points and the confidence scores corresponding to each joint (HeatMap and PAFs); next, the acquired 2D joint information and the feature matrix are input into a ResNet residual block to regress the 3D pose (Location Map). Finally, a priority redundancy association algorithm is proposed that assigns the detected two-dimensional keypoints and three-dimensional location maps to individuals. In addition, for the problem of multi-person pose tracking, the invention provides a multi-person human body pose tracking algorithm based on a greedy strategy, which can effectively track the three-dimensional poses of multiple people even when some frames are missing due to association errors or occlusion during pairing. The invention meets real-time and lightweight requirements and can be used in a variety of practical applications.

Description

Real-time deep learning-based multi-person human body three-dimensional posture estimation and tracking method
Technical Field
The invention relates to monocular multi-person human body three-dimensional pose estimation and tracking technology, and provides a real-time multi-person three-dimensional pose estimation method based on deep learning and a human body pose tracking method based on greedy matching. In particular, to address the real-time and occlusion problems in multi-person three-dimensional pose estimation, a lightweight neural network and a joint priority redundancy strategy are proposed, respectively.
Background
Human body pose perception, i.e., Human Pose Estimation, is an important task in computer vision and an essential step for computers to understand human actions and behaviors; it is widely applied in human-computer interaction, AR, VR, games, and other fields. In recent years, deep-learning-based human pose estimation methods have been proposed in succession, achieving performance far beyond traditional methods. In practice, human pose estimation is often converted into a keypoint prediction problem: the position coordinates of each human keypoint are predicted first, and then the spatial relations between the keypoints are determined from prior knowledge, yielding the predicted human skeleton. In previous work, for example the paper "Real-time human-robot interaction in complex environment using kinect v2 image recognition" by YANG Y, YAN H, DEHGHAN M, et al., a Kinect depth camera was widely used for three-dimensional human pose estimation, but reading human joint positions from depth information suffers from depth ambiguity caused by occlusion. To address occlusion, multi-view pose fusion is generally a good approach: more accurate pose information can be obtained by combining estimation results from different views. However, multi-view fusion is computationally heavy, so real-time performance degrades as the number of people grows, making it unsuitable for multi-person pose detection. Thanks to the development of deep learning in recent years, image-based three-dimensional human pose estimation has advanced greatly and performs well on both the real-time and the occlusion problems.
Disclosure of Invention
The invention overcomes the defects of the prior art by providing a real-time multi-person human body three-dimensional pose estimation method based on deep learning, which improves the real-time performance of pose estimation while reducing the number of sensors and preserving motion capture precision.
For RGB images or video streams acquired from a camera view, the invention estimates and tracks the 2D and 3D poses of the people under the current camera view through a lightweight neural network. First, a feature matrix (feature map) is obtained from the RGB image information through a feature extraction network. Then, a CPM network regresses the positions of the human 2D joint points and the confidence scores corresponding to each joint (HeatMap and PAFs); next, the acquired 2D joint information and the feature matrix are input into a ResNet residual block to regress the 3D pose (Location Map). Finally, a priority redundancy association algorithm is proposed that assigns the detected two-dimensional keypoints and three-dimensional location maps to individuals. In addition, for the problem of multi-person pose tracking, the invention provides a multi-person human body pose tracking algorithm based on a greedy strategy. The algorithm can effectively track the three-dimensional poses of multiple people, even when certain frames are missing due to association errors or occlusion during pairing.
A real-time multi-person human body three-dimensional posture estimation and tracking method based on deep learning comprises the following specific steps:
step 1: designing a neural network structure;
the present invention uses a lightweight MobileNet V3 as the backbone network and modifies it to have a multi-stage multitasking architecture. The network has two output branches including a 2D pose estimation branch and a 3D pose estimation branch, wherein the 2D pose branch simultaneously regresses keypoint heat Map and paf, and the 3D pose branch regresses Location Map. Given an RGB image, a feature matrix is obtained by a lightweight trunk and input to a two-dimensional branch, CPM-based resulting in keypoint heatma and paf. Then, the feature matrix and the 2D gesture are input into the three-dimensional branch network by using ResNet, and the 3D Location Map is regressed at the 2D body limb pixel. Furthermore, we supervise Location maps between different phases to reduce the dependency of the network on the data set. Assuming the predefined number of nodes is N, the network will output a fixed number of maps, including N HeatMaps, 2N paf, and 3N Location Maps. The output is represented as follows:
heatmaps: all in the imagePixel locations where a human joint may exist. All the human body 2D postures in the image are collected into
Figure BDA0003280453690000031
Each attitude piThere are 15 articulated points, each containing corresponding pixel coordinates
Figure BDA0003280453690000032
And corresponding confidence
Figure BDA0003280453690000033
Represents the detection evaluation of the neural network on the joint point when
Figure BDA0003280453690000034
It indicates that the node is not detected. Wherein i represents the number of the pose in the image, and j represents the joint number corresponding to the pose.
PAFs: a set of two-dimensional vectors, the vector at each node of interest representing the 2D direction of the respective body part. Its role is to correctly assign the detected 2D joint points to the corresponding persons.
Location map: the joint feature channels are used to store 3D coordinates that are regressed at 2D pixel locations. For each joint, three maps are required to represent the corresponding x, y, z estimated coordinates, respectively. For an image of size W H, the three-dimensional positions of all n joints are stored using 3n position maps of size W/k H/k, where k is a down-sampling factor. Similar to Heat Map, the 3D pose of the network fit is noted
Figure BDA0003280453690000035
Each 3D pose PiComprises 15 joints, each joint
Figure BDA0003280453690000036
Consisting of corresponding x, y, z coordinates.
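As a concrete illustration of these three outputs, the following minimal PyTorch sketch builds output heads for N = 15 joints; the 1×1 convolutional heads, channel widths, and module names are illustrative assumptions, not the patent's exact architecture.

    import torch
    import torch.nn as nn

    N_JOINTS = 15   # predefined joint count N
    N_LIMBS = 15    # limb count for PAFs (2 channels each)

    class PoseHeads(nn.Module):
        """Multi-task heads: N HeatMaps, 2N PAFs, 3N Location Maps."""
        def __init__(self, in_ch: int = 128):
            super().__init__()
            self.heatmaps = nn.Conv2d(in_ch, N_JOINTS, kernel_size=1)
            self.pafs = nn.Conv2d(in_ch, 2 * N_LIMBS, kernel_size=1)
            # the 3D branch consumes backbone features plus the 2D outputs
            self.location = nn.Conv2d(in_ch + N_JOINTS + 2 * N_LIMBS,
                                      3 * N_JOINTS, kernel_size=1)

        def forward(self, feat):
            hm = self.heatmaps(feat)    # (B, N, H/k, W/k) joint heatmaps
            paf = self.pafs(feat)       # (B, 2N, H/k, W/k) part affinity fields
            loc = self.location(torch.cat([feat, hm, paf], dim=1))  # (B, 3N, H/k, W/k)
            return hm, paf, loc

    heads = PoseHeads()
    feat = torch.randn(1, 128, 32, 56)      # downsampled feature map (H/k x W/k)
    hm, paf, loc = heads(feat)              # shapes: (1,15,...), (1,30,...), (1,45,...)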
Step 2: constructing a loss function;
In the present invention, we construct a loss function based on the 2D and 3D poses and a supervision process. During training, the $L_2$ loss is applied to all branches. The 2D pose loss $L_{2D}$ is the pixel-position error between the HeatMaps and PAFs and their ground truth. The 3D pose loss $L_{loc}$ is the joint error between the Location Maps and their ground truth. The supervision loss $L_{sup}$ is the Location Map error between different stages. The overall loss $L_{total}$ is expressed as follows:

$$L_{total} = \sum_{s=1}^{S} \left( w_{2D} L_{2D}^{s} + w_{loc} L_{loc}^{s} + w_{sup} L_{sup}^{s} \right) \tag{1}$$

where N and S are the numbers of joints and network stages respectively, each term is an $L_2$ error summed over the N joints and all pixel positions p, and the superscript * denotes the ground truth. $w_{2D}$, $w_{loc}$, and $w_{sup}$ are weighting (penalty) factors.
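A hedged PyTorch sketch of loss (1) follows, assuming plain mean-squared errors per stage and supervising intermediate Location Maps toward the final stage; the tensor layouts and the direction of the inter-stage supervision are assumptions not fixed by the text.

    import torch.nn.functional as F

    def total_loss(stages, hm_gt, paf_gt, loc_gt, w2d=1.0, wloc=1.0, wsup=1.0):
        """stages: list of (heatmaps, pafs, location_maps), one tuple per stage s."""
        final_loc = stages[-1][2].detach()   # assumed target for inter-stage supervision
        loss = 0.0
        for hm, paf, loc in stages:
            l2d = F.mse_loss(hm, hm_gt) + F.mse_loss(paf, paf_gt)  # 2D pose loss L_2D
            lloc = F.mse_loss(loc, loc_gt)                         # Location Map loss L_loc
            lsup = F.mse_loss(loc, final_loc)                      # supervision loss L_sup
            loss = loss + w2d * l2d + wloc * lloc + wsup * lsup
        return loss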
step 3: reconstructing the three-dimensional poses of multiple human bodies;
From the position coordinates of the keypoints in the Keypoint HeatMap and Location Map, we need to associate the detected joints with the corresponding individuals. If joints are assigned directly using PAF scores, the pose information is unreliable under occlusion. During inference, since the number of people in the input image is unknown, a root-joint depth map is used to determine how many people are present. Generally, the mid-body trunk joints (neck, hips) are not occluded, which makes them the best choice for the root joint; in this work, we take the human neck joint as the root joint. If a person's root joint is visible, we continue to assign joints to that person. Otherwise, the person is not visible in the scene and their pose cannot be predicted.
To address the occlusion problem, the invention prioritizes non-occluded persons when assigning joints. The occlusion state can be inferred from the depth map (the Z channel of the Location Map) predicted by the network, and the root depth value reflects each person's absolute position. Thus, instead of a PAF score, each person's priority is ordered from near to far by predicted root depth. Our network allows the position of a limb to be read from any 2D joint of that limb. For an individual, the basic pose $P_{base}$ is first read at the root joint; this basic pose is the average pose in the dataset. We then continue to read the limb poses from the joints near the root to obtain the full 3D pose. If a joint is valid, its limb pose replaces the corresponding joint of the basic pose; otherwise, the other joints of the limb are examined along the kinematic chain. If no joint of the limb is valid, the limb pose cannot be refined. Finally, another camera-model-based refinement is proposed to further reduce errors. Given the visible 2D coordinates and the joint depth, the 3D joint can be recovered by the camera model as follows:
$$[X, Y, Z]^T = Z\,K^{-1}[x, y, 1]^T \tag{2}$$
where [X, Y, Z] and (x, y) denote the 3D and 2D coordinates of the joint, respectively, and K is the camera intrinsic matrix.
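A minimal NumPy sketch of equation (2) follows; the intrinsic matrix values are placeholders chosen for illustration.

    import numpy as np

    def backproject(x, y, Z, K):
        """Recover [X, Y, Z]^T = Z * K^(-1) [x, y, 1]^T in camera coordinates."""
        return Z * np.linalg.inv(K) @ np.array([x, y, 1.0])

    # assumed example intrinsics (focal lengths fx, fy and principal point cx, cy)
    K = np.array([[1000.0, 0.0, 320.0],
                  [0.0, 1000.0, 240.0],
                  [0.0, 0.0, 1.0]])
    joint_3d = backproject(350.0, 260.0, 2.5, K)   # 2D pixel (350, 260) at depth 2.5 m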
In summary, in this process the pose estimator outputs 2D and 3D poses for each person in each frame. Each pose consists of the corresponding joint points: the 2D pose information contains the pixel coordinates of each joint in the image and the corresponding confidence, while the 3D pose information contains the spatial position of each joint relative to the root joint and is ultimately expressed in camera coordinates.
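The read-out just described can be summarized in the following hedged sketch; the helper names (KINEMATIC_CHAINS, is_valid, read_limb) and data structures are hypothetical stand-ins, and the real method's bookkeeping would be more involved.

    def reconstruct_poses(people, location_maps, base_pose):
        """Priority redundancy read-out: nearer (less occluded) people first."""
        people.sort(key=lambda person: person.root_depth)   # near-to-far priority
        poses = []
        for person in people:
            pose = base_pose.copy()          # average pose read at the root joint
            for chain in KINEMATIC_CHAINS:   # joints of one limb, ordered from the root
                for joint in chain:
                    if is_valid(person, joint):
                        # read the limb's 3D coordinates at this joint's 2D pixel
                        pose = read_limb(pose, location_maps, person, joint)
                        break                # first valid joint refines the limb
                # if no joint of the limb is valid, the limb keeps the base pose
            poses.append(pose)
        return poses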
step 4: tracking multi-person three-dimensional poses over a continuous time sequence;
the above-described 3D pose estimation method processes only data of the current frame. Therefore, a 3D gesture belonging to the same person cannot be recognized in consecutive frames. At this stage, a continuous frame three-dimensional pose tracking algorithm based on a greedy strategy is designed by using the three-dimensional pose estimation result of each frame, and the tracking problem of multi-person poses on a time sequence is solved.
At this step, we need to redefine the notation of the 3D poses to include a time index t. For example, $S^t$ denotes the set of all 3D skeletons at time t, $S_i^t$ the pose numbered i at the current time, $J_{i,n}^t$ the n-th joint of that skeleton, and $b_{i,n}^t \in \{0, 1\}$ indicates whether the n-th joint is present at time t.
The algorithm takes the unordered 3D poses of each frame as input and outputs a 4D pose sequence with temporal information. A forward-search method is used to find the skeletons belonging to the same person across consecutive frames, and skeletons in different frames are connected by computing corresponding costs with a greedy algorithm. Even when a skeleton is missing in some frames because of association errors or occlusion during pairing, the method still ensures that the skeleton can be tracked effectively. Because only three-dimensional poses are available at this stage, the cost between skeletons is defined as:

$$\zeta\left(S_i^t, S_j^{t-1}\right) = \frac{1}{N} \sum_{n=1}^{N} \left\lVert J_{i,n}^{t} - J_{j,n}^{t-1} \right\rVert \tag{3}$$

where $\lVert\cdot\rVert$ denotes the Euclidean distance between pose $S_i^t$ and pose $S_j^{t-1}$, $n \in \{1, 2, \dots, N\}$ is the joint index, and N is the total number of joints in the skeleton.
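A small NumPy sketch of the cost in equation (3), restricted to joints present in both frames via the flags b, is shown below (the array shapes are assumptions):

    import numpy as np

    def skeleton_cost(joints_t, joints_prev, b_t, b_prev):
        """joints_*: (N, 3) joint coordinates; b_*: (N,) 0/1 presence flags."""
        both = (b_t * b_prev).astype(bool)       # joints present in both frames
        if not both.any():
            return np.inf                        # no common joints: cannot pair
        dists = np.linalg.norm(joints_t[both] - joints_prev[both], axis=1)
        return dists.mean()                      # mean Euclidean joint distance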
Pose tracking falls into three cases. In case (a), the numbers of poses in the previous and current frames are the same, and skeletons in different frames are connected by their corresponding confidences. In case (b), the number of poses in the current frame is greater than in the previous frame; an unpaired skeleton (skeleton 1) continues to search forward and pair until frame t − τ_s. In case (c), the number of poses in the current frame is also greater than in the previous frame; after the forward search completes, an unpaired skeleton (skeleton 3) still remains in the current frame, and a new ID should then be assigned to skeleton 3.
Define the current frame t as the frame to be matched and initialize the search frame to t − 1. The matching degree is computed between all skeletons in the frame to be matched and the search frame, and all candidate skeleton pairs in the sequence are sorted by increasing value of ζ. After sorting all candidate correspondences, if $\zeta_{min}$ is below a set threshold δ, the pairing is considered valid: the pose $S_i^t$ in the current frame inherits the ID information of the successfully paired pose $S_j^{t-1}$ in the search frame. At the same time, $\zeta_{min}$ and its associated "redundant pairs" are deleted. If unmatched skeletons remain in the current frame, this means either that new skeletons have appeared or that these skeletons lost track during association due to errors or occlusion. In that case, the search frame is set to t − 2 and the pairing-and-updating process is repeated. This continues up to frame t − τ_s, where τ_s is the maximum number of frames that may be searched. If an unpaired pose still remains at that point, it is considered newly appeared and is given unique ID information.
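The following hedged sketch outlines this greedy pairing over a window of τ_s frames; the frame and track data structures are assumptions, and the redundancy-pair deletion of the real method is simplified here.

    import itertools

    def track(frames, delta, tau_s, cost):
        """frames[t] is a list of skeletons; returns tracks[t][i] = person ID."""
        new_id = itertools.count()
        tracks = {}
        for t, skeletons in enumerate(frames):
            ids = {}
            unmatched = set(range(len(skeletons)))
            for back in range(1, tau_s + 1):          # search t-1, t-2, ..., t-tau_s
                if t - back < 0 or not unmatched:
                    break
                prev_ids = tracks.get(t - back, {})
                pairs = sorted((cost(skeletons[i], frames[t - back][j]), i, j)
                               for i in unmatched for j in prev_ids)
                used = set()
                for zeta, i, j in pairs:              # cheapest valid pairing first
                    if zeta > delta:
                        break                         # remaining candidates are worse
                    if i in unmatched and j not in used:
                        ids[i] = prev_ids[j]          # inherit the paired skeleton's ID
                        unmatched.discard(i)
                        used.add(j)
            for i in unmatched:                       # still unpaired: new person
                ids[i] = next(new_id)
            tracks[t] = ids
        return tracks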
The invention has the advantages that:
1. The invention designs a lightweight monocular multi-person 3D human pose estimator that meets real-time requirements and uses it to estimate the positions of the 3D joint points of multiple people in a scene. Through human kinematic constraints, combined with joint redundancy and pose-encoding priority, it solves the problem of mutual occlusion between people in different scenes.
2. The invention designs a greedy strategy-based continuous frame three-dimensional pose tracking algorithm, solves the problems of continuous tracking and identification of different people in a scene, and improves the stability of a system.
Drawings
Fig. 1 is a schematic diagram of a monocular multi-person human body posture estimation network structure according to the present invention.
FIG. 2 is a schematic diagram of the human body pose estimation network of the present invention.
Fig. 3 is a schematic diagram of the network architecture composition of the present invention.
Figs. 4(a)-4(c) are schematic diagrams of the pose tracking of the present invention. In fig. 4(a), the numbers of poses in the previous and current frames are the same, and skeletons in different frames are connected by their corresponding confidences. In fig. 4(b), the number of poses in the current frame is greater than in the previous frame; the unpaired skeleton 1 continues to search forward and pair until frame t − τ_s. In fig. 4(c), the number of poses in the current frame is also greater than in the previous frame; after the forward search completes, an unpaired skeleton (skeleton 3) remains in the current frame, and an ID should then be assigned to skeleton 3.
Figs. 5(a)-5(c) are schematic diagrams of the multi-person human body pose estimation of the present invention, where figs. 5(a), 5(b), and 5(c) show the pose estimation results for three different image frames; the top row is the RGB image input to the system, and the bottom row shows the corresponding pose estimation results.
Detailed Description
The technical scheme of the invention is further explained by combining the attached drawings.
The invention relates to a real-time multi-person human body three-dimensional posture estimation method based on deep learning, which comprises the following specific processes:
In this embodiment, a color camera is used to capture the full-body poses of multiple people. The method imposes no limit on the scene or on the number of people in it, and has good generality.
Step 1: designing a neural network structure;
the present invention uses a lightweight MobileNet V3 as the backbone network and modifies it to have a multi-stage multitasking architecture. The network has two output branches including a 2D pose estimation branch and a 3D pose estimation branch, wherein the 2D pose branch simultaneously regresses keypoint heat Map and paf, and the 3D pose branch regresses Location Map. As shown in fig. 1, given an RGB image, human 2D pose information (Keypoints HeatMap and PAFs Map) and human 3D pose information (Location Map) are obtained through a neural network. And then, distributing the regressed joint information by a priority redundancy method to further obtain the 3D posture. Network regression method as shown in fig. 2, the RGB image is subjected to a backbone network to extract a feature matrix, and the feature matrix is input to a two-dimensional branch, and keypoint heatma and paf are obtained based on CPM. Then, the feature matrix and the 2D gesture are input into the three-dimensional branch network by using ResNet, and the 3D Location Map is regressed at the 2D body limb pixel. Furthermore, we supervise Location maps between different phases to reduce the dependency of the network on the data set. The specific implementation of each module is shown in fig. 3.
Step 2: constructing a loss function;
The invention constructs a loss function based on the 2D and 3D poses and a supervision process. During training, the $L_2$ loss is applied to all branches. The 2D pose loss $L_{2D}$ is the pixel-position error between the HeatMaps and PAFs and their ground truth. The 3D pose loss $L_{loc}$ is the joint error between the Location Maps and their ground truth. The supervision loss $L_{sup}$ is the Location Map error between different stages.
The 2D pose branch and the 3D branch of the network are trained and validated on the COCO and CMU datasets, respectively. COCO is a large-scale object detection dataset containing over 200,000 images and 250,000 person instances with keypoint labels; the training and test sets (over 150,000 people and 1.7 million labeled keypoints) are public. In the experiments, the pixel positions of multi-person two-dimensional keypoints are regressed on this dataset. CMU Panoptic is a large dataset of various indoor social activities (playing musical instruments, dancing, etc.) collected by multiple cameras; mutual occlusion and truncation between individuals makes recovering the 3D poses challenging. The invention likewise regresses the three-dimensional positions of the human joints on this dataset.
The present invention implements the proposed network using the PyTorch framework. The optimizer used in training is Adam with parameters β1 = 0.9 and β2 = 0.999, a learning rate of 0.0002, and a batch size of 32. Twenty epochs are trained on the COCO and CMU Panoptic datasets to obtain the final model. Images are resized to a fixed 455×256 as network input, and 200K images from different sequences are selected as our training set. Two cameras (16 and 30) across four activities (Haggling, Sports, Ultimatum, Pizza) serve as our test set. Since the COCO dataset lacks 3D pose annotations, the weight of the 3D loss is set to zero when COCO data is input.
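A sketch of this training configuration in PyTorch follows; the stand-in model and dummy tensors merely mirror the stated hyperparameters (Adam, β1 = 0.9, β2 = 0.999, learning rate 0.0002, batch size 32, 20 epochs) and are not the actual network or data pipeline.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    model = nn.Conv2d(3, 45, kernel_size=1)   # stand-in for the pose network
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999))

    # dummy tensors at a reduced resolution (real inputs are resized to 455x256)
    images = torch.randn(64, 3, 64, 114)
    targets = torch.randn(64, 45, 64, 114)
    loader = DataLoader(TensorDataset(images, targets), batch_size=32, shuffle=True)

    for epoch in range(20):                   # 20 epochs in the experiments
        for x, y in loader:
            optimizer.zero_grad()
            loss = nn.functional.mse_loss(model(x), y)   # 3D-loss weight zeroed on COCO
            loss.backward()
            optimizer.step()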
step 3: reconstructing the three-dimensional poses of multiple human bodies;
Joint assignment and three-dimensional pose reconstruction are performed based on the feature maps regressed by the network in step 2. The network in the present invention allows the position of a limb to be read from any 2D joint of that limb. For an individual, the basic pose is first read at the root joint; this basic pose is the average pose in the dataset. Then the limb poses continue to be read from the joints near the root to obtain the complete 3D pose. If a joint is valid, its limb pose replaces the corresponding joint of the basic pose; otherwise, the other joints of the limb are examined along the kinematic chain. If no joint of the limb is valid, the limb pose cannot be further refined. The RGB image is input into the network for prediction to obtain the multi-person pose estimation result, which is visualized in a ROS environment. The recognition results are shown in fig. 5.
step 4: tracking multi-person three-dimensional poses over a continuous time sequence;
The above-described 3D pose estimation method processes only the data of the current frame; therefore, 3D poses belonging to the same person cannot be identified across consecutive frames. At this stage, a continuous-frame three-dimensional pose tracking algorithm based on a greedy strategy is designed using the per-frame three-dimensional pose estimation results, solving the problem of tracking multi-person poses over a time sequence. Pose tracking falls into the three cases shown in figs. 4(a)-4(c). In fig. 4(a), the numbers of poses in the previous and current frames are the same, and skeletons in different frames are connected by their corresponding confidences. In fig. 4(b), the number of poses in the current frame is greater than in the previous frame; the unpaired skeleton 1 continues to search forward and pair until frame t − τ_s. In fig. 4(c), the number of poses in the current frame is also greater than in the previous frame; after the forward search completes, an unpaired skeleton (skeleton 3) remains in the current frame, and an ID should then be assigned to skeleton 3. During tracking, the invention tracks poses using image sequences within 2 seconds, with results shown in fig. 5. Frames 45, 384, and 731 of the scene sequence were chosen to demonstrate the robustness of the algorithm; it can be seen that the algorithm tracks every person in the scene effectively, even in the presence of occlusion.
The embodiments described in this specification merely illustrate implementations of the inventive concept. The scope of the present invention should not be considered limited to the specific forms set forth in the embodiments, but also covers equivalents that may occur to those skilled in the art in light of the inventive concept.

Claims (1)

1. A real-time multi-person human body three-dimensional posture estimation and tracking method based on deep learning comprises the following specific steps:
step 1: designing a neural network structure;
using lightweight MobileNetV3 as the backbone network and modifying it into a multi-stage multi-task structure; the network has two output branches, a 2D pose estimation branch and a 3D pose estimation branch, wherein the 2D pose branch simultaneously regresses keypoint HeatMaps and PAFs, and the 3D pose branch regresses Location Maps; given an RGB image, a feature matrix is acquired through the lightweight backbone and input into the two-dimensional branch, and the keypoint HeatMaps and PAFs are obtained based on CPM; then, using ResNet, the feature matrix and the 2D pose are input into the three-dimensional branch network, and the 3D Location Map is regressed at the 2D human limb pixels; in addition, the Location Maps between different stages are supervised to reduce the network's dependence on the dataset; assuming the predefined number of joints is N, the network outputs a fixed number of maps, including N HeatMaps, 2N PAFs, and 3N Location Maps; the output is represented as follows:
heatmaps: pixel positions where all human body joint points in the image may exist; recording all human body 2D postures in imageSet of states as
Figure FDA0003280453680000011
Each attitude piThere are 15 articulated points, each containing corresponding pixel coordinates
Figure FDA0003280453680000012
And corresponding confidence
Figure FDA0003280453680000013
Represents the detection evaluation of the neural network on the joint point when
Figure FDA0003280453680000014
If yes, the joint point is not detected; wherein i represents the number of the gesture in the image, and j represents the joint number corresponding to the gesture;
PAFs: a set of two-dimensional vectors, the vector at each node of interest representing the 2D direction of the respective body part; its role is to correctly assign the detected 2D joint points to the corresponding persons;
location map: the joint characteristic channel is used for storing 3D coordinates regressed at the 2D pixel position; for each joint, three maps are required to represent the corresponding x, y, z estimated coordinates; for an image of size W H, storing the three-dimensional positions of all n joints using 3n position maps of size W/k H/k, where k is a down-sampling factor; similar to Heat Map, the 3D pose of the network fit is noted
Figure FDA0003280453680000015
Each 3D pose PiComprises 15 joints, each joint
Figure FDA0003280453680000016
Consisting of corresponding x, y, z coordinates;
step 2: constructing a loss function;
constructing a loss function based on the 2D and 3D poses and a supervision process; during training, the $L_2$ loss is applied to all branches; the 2D pose loss $L_{2D}$ is the pixel-position error between the HeatMaps and PAFs and their ground truth; the 3D pose loss $L_{loc}$ is the joint error between the Location Maps and their ground truth; the supervision loss $L_{sup}$ is the Location Map error between different stages; the overall loss $L_{total}$ is expressed as follows:

$$L_{total} = \sum_{s=1}^{S} \left( w_{2D} L_{2D}^{s} + w_{loc} L_{loc}^{s} + w_{sup} L_{sup}^{s} \right) \tag{1}$$

wherein N and S are the numbers of joints and network stages respectively, each term is an $L_2$ error summed over the N joints and all pixel positions p, and the superscript * denotes the ground truth; $w_{2D}$, $w_{loc}$ and $w_{sup}$ are weighting factors;
and step 3: distributing three-dimensional joints of a multi-person human body and reconstructing three-dimensional postures;
associating the detected joints with the corresponding individuals according to the position coordinates of the keypoints in the Keypoint HeatMap and the Location Map; if joints are assigned directly using PAF scores, the pose information is unreliable under occlusion; during inference, since the number of people in the input image is unknown, a root-joint depth map is used to determine how many people are present; the human neck joint is taken as the root joint; if a person's root joint is visible, joints continue to be assigned to that person; otherwise, the person is not visible in the scene and their pose cannot be predicted;
prioritizing non-occluded persons when assigning joints; the occlusion state can be inferred from the depth map (the Z channel of the Location Map) predicted by the network; the root depth value reflects each person's absolute position; thus, instead of a PAF score, each person's priority is ordered from near to far by predicted root depth; the network allows the position of a limb to be read from any 2D joint of that limb; for an individual, the basic pose $P_{base}$ is first read at the root joint; this basic pose is the average pose in the dataset; then the limb poses continue to be read from the joints near the root to obtain the full 3D pose; if a joint is valid, its limb pose replaces the corresponding joint of the basic pose; otherwise, the other joints of the limb are checked along the kinematic chain; if no joint of the limb is valid, the limb pose cannot be refined; finally, another camera-model-based refinement is proposed to further reduce errors; given the visible 2D coordinates and the joint depth, the 3D joint is recovered by the camera model as follows:
$$[X, Y, Z]^T = Z\,K^{-1}[x, y, 1]^T \tag{2}$$
wherein [X, Y, Z] and (x, y) denote the 3D and 2D coordinates of the joint, respectively, and K is the camera intrinsic matrix;
in general, in this process, the pose estimator outputs 2D and 3D poses for each person in each frame; each posture consists of corresponding joint points, wherein the 2D posture information comprises pixel coordinate values corresponding to the joint points in the image and confidence degrees corresponding to the joint points; the 3D pose information contains the spatial coordinate position of each joint point relative to the root joint and is ultimately represented in camera coordinates;
step 4: tracking multi-person three-dimensional poses over a continuous time sequence;
the above 3D pose estimation method processes only the data of the current frame; therefore, 3D poses belonging to the same person cannot be identified across consecutive frames; at this stage, a continuous-frame three-dimensional pose tracking algorithm based on a greedy strategy is designed using the per-frame three-dimensional pose estimation results, solving the problem of tracking multi-person poses over a time sequence;
at this step, the notation of the 3D poses is redefined to include a time index t: $S^t$ denotes the set of all 3D skeletons at time t, $S_i^t$ the pose numbered i at the current time, $J_{i,n}^t$ the n-th joint of that skeleton, and $b_{i,n}^t \in \{0, 1\}$ indicates whether the n-th joint is present at time t;
taking the unordered 3D poses of each frame as input and outputting a 4D pose sequence with temporal information; a forward-search method is adopted to find the skeletons belonging to the same person across consecutive frames; in the pairing process of connecting different frames by computing corresponding costs with a greedy algorithm, even when a skeleton is missing in some frames because of association errors or occlusion, the skeleton can still be tracked effectively; because only three-dimensional poses are available at this stage, the cost between skeletons is defined as:

$$\zeta\left(S_i^t, S_j^{t-1}\right) = \frac{1}{N} \sum_{n=1}^{N} \left\lVert J_{i,n}^{t} - J_{j,n}^{t-1} \right\rVert \tag{3}$$

wherein $\lVert\cdot\rVert$ denotes the Euclidean distance between pose $S_i^t$ and pose $S_j^{t-1}$, $n \in \{1, 2, \dots, N\}$ is the joint index, and N is the total number of joints in the skeleton;
pose tracking falls into three cases; in case (a), the numbers of poses in the previous and current frames are the same, and skeletons in different frames are connected by their corresponding confidences; in case (b), the number of poses in the current frame is greater than in the previous frame; an unpaired skeleton (skeleton 1) continues to search forward and pair until frame t − τ_s; in case (c), the number of poses in the current frame is also greater than in the previous frame; after the forward search completes, an unpaired skeleton (skeleton 3) still remains in the current frame, and an ID should then be assigned to skeleton 3;
defining the current frame t as the frame to be matched and initializing the search frame to t − 1; computing the matching degree between all skeletons in the frame to be matched and the search frame; sorting all candidate skeleton pairs in the sequence by increasing value of ζ; after sorting all candidate correspondences, if $\zeta_{min}$ is below a set threshold δ, the pairing is considered valid, and the pose $S_i^t$ in the current frame inherits the ID information of the successfully paired pose $S_j^{t-1}$ in the search frame; at the same time, $\zeta_{min}$ and its associated "redundant pairs" are deleted; if unmatched skeletons remain in the current frame, this means that new skeletons have appeared, or that these skeletons lost track during association due to errors or occlusion; at this time the search frame is set to t − 2, and the pairing-and-updating process is repeated; this continues up to frame t − τ_s, where τ_s is the maximum number of frames that may be searched; if an unpaired pose still remains, it is considered newly appeared and is given unique ID information.
CN202111130790.1A 2021-09-26 2021-09-26 Real-time deep learning-based multi-person human body three-dimensional posture estimation and tracking method Pending CN114066932A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111130790.1A CN114066932A (en) 2021-09-26 2021-09-26 Real-time deep learning-based multi-person human body three-dimensional posture estimation and tracking method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111130790.1A CN114066932A (en) 2021-09-26 2021-09-26 Real-time deep learning-based multi-person human body three-dimensional posture estimation and tracking method

Publications (1)

Publication Number Publication Date
CN114066932A true CN114066932A (en) 2022-02-18

Family

ID=80233706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111130790.1A Pending CN114066932A (en) 2021-09-26 2021-09-26 Real-time deep learning-based multi-person human body three-dimensional posture estimation and tracking method

Country Status (1)

Country Link
CN (1) CN114066932A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination