CN113850221A - Pose tracking method based on keypoint screening - Google Patents

Pose tracking method based on keypoint screening

Info

Publication number
CN113850221A
CN113850221A (application CN202111168498.9A)
Authority
CN
China
Prior art keywords
module
key point
network
human body
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111168498.9A
Other languages
Chinese (zh)
Inventor
刘庆杰 (Liu Qingjie)
左文航 (Zuo Wenhang)
王蕴红 (Wang Yunhong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN202111168498.9A
Publication of CN113850221A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Abstract

The invention discloses a pose tracking method based on keypoint screening. It constructs a multi-person pose tracking model built around keypoint screening, comprising a target detection module, a pose estimation and keypoint screening module, a matching module, and a pedestrian re-identification module. The scheme addresses two problems in multi-person pose tracking: the human keypoint confidence threshold is difficult to tune, and low keypoint recognition depresses the multiple object tracking accuracy (MOTA) metric.

Description

Pose tracking method based on keypoint screening
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a pose tracking method based on keypoint screening.
Background
The task of multi-person pose tracking in computer vision is to process an input video, detect the pose of every pedestrian in each frame, compute and analyze each target's appearance features, position, and motion state, and correctly record each person's continuous pose trajectory over time. As a hot problem in computer vision, human pose tracking underlies almost all human-centered research in the field and is widely applied in human-computer interaction, video surveillance, activity recognition, and sports video analysis; these practical demands have aroused great interest in the topic.
Existing multi-person pose tracking methods generally adopt a top-down framework: first, a detector locates the bounding box of each human body, and pose estimation is then performed within each bounding box to obtain the pose. The information of the bodies being tracked in the current frame of the video is then matched against the information obtained in the next frame.
Although multi-person pose tracking is already widely applied, multi-person pose estimation and tracking must contend not only with the generally challenging problems of multi-object tracking, such as scale change, in-plane rotation, and illumination change, but also with the complex ways a person's pose changes over a video, which makes the task difficult. In addition, on the PoseTrack challenge dataset, keypoints below a certain threshold must be removed; existing methods all tune this threshold empirically, and the choice of threshold often affects the final performance of the model. These problems leave the model's multiple object tracking accuracy (MOTA) metric with room for improvement.
Disclosure of Invention
To remedy the defects of the prior art, the invention provides a pose tracking method based on keypoint screening, which addresses the problems that the human keypoint threshold is difficult to tune in multi-person pose tracking and that low keypoint recognition depresses the multiple object tracking accuracy (MOTA) metric. The specific technical scheme of the invention is as follows:
A pose tracking method based on keypoint screening comprises the following steps:
S1: constructing the datasets;
S1-1: adopting the COCO2017 dataset for detection and pose estimation, and the PoseTrack 2018 dataset for pose estimation and tracking;
S1-2: on the basis of the PoseTrack 2018 dataset, constructing a pedestrian re-identification dataset: all human boxes in PoseTrack 2018 are cropped and saved as images, and boxes sharing the same id within one video are given the same label, finally yielding a pedestrian re-identification dataset containing 4613 labels and 119656 images in total;
S2: constructing a multi-person pose tracking model based on a top-down framework and a general keypoint screening network;
S3: training the model;
S4: feeding the video under test into the model to obtain the detection result.
Further, the multi-person pose tracking model comprises a target detection module, a pose estimation and keypoint screening module, a matching module, and a pedestrian re-identification module, wherein:
the target detection module is a multi-task, multi-stage Hybrid Task Cascade model, i.e. an HTC model; its backbone is a 101-layer ResNeXt network, and it adopts a feature pyramid network (FPN) and a non-maximum suppression algorithm based on the PCK metric;
the pose estimation and keypoint screening module adopts a general pose estimation network and obtains each keypoint's heatmap and existence probability through a keypoint prediction module and a keypoint screening module respectively, finally yielding the human poses;
the matching module performs data association between the current frame's human poses and the existing human pose tracks, using the Hungarian algorithm on the intersection-over-union (IOU) distance of the bounding boxes; a human body whose track is lost is added to the pedestrian re-identification module, and when a newly appearing human body is detected, the pedestrian re-identification module is used to recover the lost track.
Further, the non-maximum suppression algorithm based on the PCK metric uses the PCK distance as the similarity measure between two human boxes: once a human box is confirmed to be kept, other human boxes whose PCK distance to it exceeds a set threshold are deleted.
Further, the keypoint prediction module and the keypoint screening module adopt the same network structure, composed of three layers of 3x3 transposed convolutions; the keypoint prediction module finally generates a heatmap for each keypoint, while the keypoint screening module passes each generated per-keypoint feature map through a fully connected layer to obtain the keypoint's binary classification probability.
Further, the pedestrian re-identification module uses the pose estimation network HRNet as its backbone to extract features and obtain a feature map for each person; each time the existing pose tracks are associated with the current frame's poses, lost tracks are stored. A pose that fails to match is compared by the pedestrian re-identification module against the lost pedestrian tracks: if the comparison succeeds, the new pose is merged with the lost pose track; if it fails, a new track is created;
in the training stage, all persons sharing a label are set as one class; each person's feature map is obtained and fed into a classification network, with multi-class cross entropy as the loss function;
during pose tracking, each person's feature map is extracted and matched against the lost pedestrians; the two feature maps are compared using the L1 distance, i.e. the Manhattan distance, and when the distance is below a threshold the two are considered the same person.
Further, the heatmap output by the keypoint prediction module is also smoothed by second-order Gaussian filtering.
Further, step S3 trains the model module by module, specifically:
S3-1: the target detection module adopts an HTC detector, whose mAP reaches 0.346 on the PoseTrack 2018 validation set; it is trained only on the COCO2017 dataset and is not fine-tuned on PoseTrack 2018;
S3-2: the pose estimation and keypoint screening module is trained on COCO2017 and the trained model is fine-tuned on the PoseTrack 2018 dataset; since only keypoints are annotated, the ground-truth box of a person instance is obtained by enlarging the tight bounding box of all its keypoints by 20% of its length, and the human bounding box is brought to a fixed aspect ratio by extending it in height or width;
during training of the pose estimation and keypoint screening module, the poses are augmented with random scaling, random rotation, and random flipping, and Adam is adopted as the optimizer;
S3-3: the pedestrian re-identification module crops pedestrians sharing a label out of the original images of the PoseTrack 2018 dataset to form a new dataset; an HRNet network pre-trained on the PoseTrack 2018 dataset serves as the backbone; during training the feature map output by HRNet is classified through a fully connected layer, while during validation only the HRNet feature map is used, without the fully connected layer.
The beneficial effects of the invention are as follows: on top of existing top-down multi-person pose estimation methods, a general keypoint screening network is added to the existing pose estimation network, which improves the accuracy of the model and reduces the influence of the keypoint threshold on tracking accuracy.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below, so that the features and advantages of the present invention can be understood more clearly by reference to them. The drawings are schematic and should not be construed as limiting the present invention in any way; for a person skilled in the art, other drawings can be obtained from these drawings without inventive effort. In the drawings:
FIG. 1 is the network flow diagram of the present invention;
FIG. 2 is the pose estimation and keypoint screening module of the present invention;
FIG. 3 is the HRNet network structure;
FIG. 4 is the keypoint screening module network architecture;
FIG. 5 is a pose estimation visualization;
FIG. 6 is a pose tracking visualization.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
The method adds a general keypoint screening model on top of existing top-down multi-person pose estimation methods, improving the accuracy of the model and reducing the influence of the keypoint threshold on tracking accuracy.
A pose tracking method based on keypoint screening comprises the following steps:
S1: constructing the datasets;
S1-1: adopting the COCO2017 dataset for detection and pose estimation, and the PoseTrack 2018 dataset for pose estimation and tracking;
S1-2: on the basis of the PoseTrack 2018 dataset, constructing a pedestrian re-identification dataset: all human boxes in PoseTrack 2018 are cropped and saved as images, and boxes sharing the same id within one video are given the same label, finally yielding a pedestrian re-identification dataset containing 4613 labels and 119656 images in total (an illustrative sketch of this cropping step follows this list);
S2: constructing a multi-person pose tracking model based on a top-down framework and a general keypoint screening network;
S3: training the model;
S4: feeding the video under test into the model to obtain the detection result.
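As a concrete illustration of step S1-2, the sketch below crops every annotated human box out of its frame and files it under its track id, so that crops of the same person in one video share a label. The annotation layout, directory scheme, and function name are our assumptions for illustration, not the patent's code.

```python
import os
import cv2

# Hypothetical annotation layout: one dict per human box, e.g.
# {"frame": "video01/000001.jpg", "track_id": 17, "bbox": (x, y, w, h)}.
# Assumes track ids have already been made globally unique across videos.
def build_reid_dataset(annotations, out_dir):
    for ann in annotations:
        img = cv2.imread(ann["frame"])
        x, y, w, h = map(int, ann["bbox"])
        crop = img[y:y + h, x:x + w]
        # crops with the same track id within a video share one label directory
        label_dir = os.path.join(out_dir, f"id_{ann['track_id']:05d}")
        os.makedirs(label_dir, exist_ok=True)
        name = os.path.splitext(os.path.basename(ann["frame"]))[0]
        cv2.imwrite(os.path.join(label_dir, f"{name}.jpg"), crop)
```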
As shown in figs. 1-4, the invention constructs a multi-person pose tracking model based on a top-down framework and a general keypoint screening network; the model comprises a target detection module, a pose estimation and keypoint screening module, a matching module, and a pedestrian re-identification module. The target detection module is a multi-task, multi-stage Hybrid Task Cascade model, i.e. an HTC model; its backbone is a 101-layer ResNeXt network, and it adopts a feature pyramid network (FPN) and a non-maximum suppression algorithm based on the PCK metric;
the pose estimation and keypoint screening module adopts a general pose estimation network and obtains each keypoint's heatmap and existence probability through the keypoint prediction module and the keypoint screening module respectively, finally yielding the human poses;
the matching module performs data association between the current frame's human poses and the existing human pose tracks, using the Hungarian algorithm on the intersection-over-union (IOU) distance of the bounding boxes; a human body whose track is lost is added to the pedestrian re-identification module, and when a newly appearing human body is detected, the pedestrian re-identification module is used to recover the lost track.
(1) The target detection module is based on the multi-task, multi-stage Hybrid Task Cascade (HTC) model. It uses ResNeXt as the backbone for feature extraction, a feature pyramid network (FPN) for multi-scale testing, and finally a non-maximum suppression algorithm on the PCK distance to filter redundant human proposal boxes, yielding the final human bounding boxes.
Each stage of the HTC network combines cascading and multi-tasking to improve information flow, and exploits spatial context to further improve accuracy. The backbone of the HTC model is a 101-layer ResNeXt network, a combination of the deep residual network (ResNet) and Inception. Unlike Inception v4, ResNeXt does not require hand-designing complicated Inception structure details; every branch adopts the same topology. The essence of ResNeXt is grouped convolution, with the number of groups controlled by the cardinality variable.
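For illustration, the grouped-convolution idea can be sketched in PyTorch as below; the channel widths and the cardinality of 32 are common illustrative values, not values disclosed in the patent.

```python
import torch.nn as nn

# Sketch of a ResNeXt-style bottleneck: the 3x3 convolution is a grouped
# convolution whose `groups` argument is the cardinality, realizing many
# parallel branches of identical topology in a single layer.
class ResNeXtBottleneck(nn.Module):
    def __init__(self, in_ch=256, width=128, cardinality=32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, width, kernel_size=1, bias=False),
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, width, kernel_size=3, padding=1,
                      groups=cardinality, bias=False),  # grouped convolution
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, in_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(in_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # residual connection around the grouped bottleneck
        return self.relu(x + self.block(x))
```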
The FPN network addresses the multi-scale problem in human detection: through simple changes to the network connections, it greatly improves the detection of small human bodies with essentially no increase in the original model's computation. In addition, a non-maximum suppression algorithm based on the PCK (percentage of correct keypoints) metric removes redundant human proposal boxes. PCK computes the proportion of detected keypoints whose normalized distance is below a set threshold. When humans occlude each other during interaction, conventional NMS based on the IOU distance easily filters out correct human boxes by mistake, whereas the PCK distance uses pose information to better preserve human proposal boxes under interactive occlusion, making it more suitable for human detection than the IOU metric used in conventional detection models. The PCK distance formula is:
$$\mathrm{PCK}_{pq}^{k} = \frac{\sum_{i} \delta\left(\frac{d_{pqi}}{s_{p}} \le T_{k}\right)}{\sum_{i} 1}$$

where $i$ denotes the keypoint with id $i$, $k$ indexes the $k$-th threshold $T_k$, $p$ denotes the $p$-th pedestrian, $d_{pqi}$ is the Euclidean distance between the prediction of keypoint $i$ for the $p$-th person and for the $q$-th person, $s_p$ is the scale factor of the $p$-th person, and $T_k$ is a manually set threshold.
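For concreteness, a minimal sketch of this PCK-based non-maximum suppression follows; the array shapes, the per-keypoint threshold, and the similarity threshold are assumptions, since the patent discloses no code.

```python
import numpy as np

# poses: (N, K, 2) predicted keypoints; scales: (N,) per-person scale
# factors; scores: (N,) detection scores. Threshold values are assumed.
def pck_similarity(pose_p, pose_q, scale_p, t_k=0.5):
    d = np.linalg.norm(pose_p - pose_q, axis=-1)   # Euclidean distance per keypoint
    return float(np.mean(d / scale_p <= t_k))      # fraction of matched keypoints

def pck_nms(poses, scales, scores, sim_thresh=0.5):
    order = np.argsort(-scores)                    # process highest-scoring first
    keep = []
    while order.size > 0:
        p = order[0]
        keep.append(int(p))
        # drop every remaining pose whose PCK similarity to the kept one
        # exceeds the threshold, i.e. it is likely the same person
        rest = order[1:]
        sims = np.array([pck_similarity(poses[p], poses[q], scales[p])
                         for q in rest])
        order = rest[sims <= sim_thresh]
    return keep
```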
(2) The backbone network of the pose estimation and keypoint screening module is shown in fig. 2; it adopts the High-Resolution Network (HRNet). As shown in fig. 3, HRNet gradually adds low-resolution feature-map subnetworks in parallel to a high-resolution main network, and performs multi-scale fusion and feature extraction across the different subnetworks.
In the invention, the third of HRNet's multi-resolution outputs serves as the feature extraction part of the pose estimation model, producing a human pose feature map; the heatmap and existence probability of each keypoint are then obtained through the pose estimation module and the keypoint screening module respectively.
The keypoint prediction module and the keypoint screening module adopt the same network structure, composed of three layers of 3x3 transposed convolutions. The keypoint prediction module finally generates a heatmap for each keypoint, while the keypoint screening module, on top of the transposed convolutions, passes each generated per-keypoint feature map through a fully connected layer to obtain the keypoint's binary classification probability. In addition, the heatmap output by the keypoint prediction module is smoothed by second-order Gaussian filtering to obtain more accurate keypoint positions.
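The following PyTorch sketch shows one plausible reading of these two heads; the channel counts, strides, and the 96x72 heatmap size (taken from the training details below) are assumptions, and all layer names are ours.

```python
import torch
import torch.nn as nn

NUM_KEYPOINTS = 15

def deconv_stack(in_ch, out_ch):
    # three 3x3 transposed convolutions, as described above (stride assumed 2)
    layers = []
    for i in range(3):
        layers += [
            nn.ConvTranspose2d(in_ch if i == 0 else out_ch, out_ch,
                               kernel_size=3, stride=2, padding=1,
                               output_padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        ]
    return nn.Sequential(*layers)

class KeypointHeads(nn.Module):
    def __init__(self, in_ch=256, mid_ch=256, heat_hw=(96, 72)):
        super().__init__()
        # prediction head: one heatmap per keypoint
        self.predict = nn.Sequential(deconv_stack(in_ch, mid_ch),
                                     nn.Conv2d(mid_ch, NUM_KEYPOINTS, 1))
        # screening head: per-keypoint feature maps -> FC -> visibility prob.
        self.screen = nn.Sequential(deconv_stack(in_ch, mid_ch),
                                    nn.Conv2d(mid_ch, NUM_KEYPOINTS, 1))
        self.screen_fc = nn.Linear(heat_hw[0] * heat_hw[1], 1)

    def forward(self, feats):
        # with a (B, C, 12, 9) input the three stride-2 deconvs yield 96x72 maps
        heatmaps = self.predict(feats)                 # (B, K, H, W)
        maps = self.screen(feats)                      # (B, K, H, W)
        b, k, h, w = maps.shape
        logits = self.screen_fc(maps.reshape(b, k, h * w))
        vis_prob = torch.sigmoid(logits.squeeze(-1))   # (B, K)
        return heatmaps, vis_prob
```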
Specifically, the output of the keypoint prediction module is a set of heatmaps for 15 keypoints; the keypoint labels are Gaussian-distributed probability maps, and the training loss function is:
$$L_{\mathrm{hm}} = \sum_{i} v_{i} \sum_{j} \left( h_{ij} - \hat{h}_{ij} \right)^{2}$$

where $i$ denotes the $i$-th keypoint, $j$ denotes the $j$-th position in the heatmap, $h_{ij}$ is the probability in the label that the $i$-th keypoint lies at position $j$, $\hat{h}_{ij}$ is the network's predicted probability of the $i$-th keypoint at position $j$, and $v_i$ indicates whether the $i$-th keypoint is visible; $v_i = 0$ means the keypoint is invisible and is not counted in the network's loss.
The output of the keypoint screening module is the binary classification probability of each of the 15 keypoints; the keypoint labels are one-hot two-class distributions, and the loss function is:
$$L_{\mathrm{cls}} = -\sum_{i} \left[ y_{i} \log z_{i} + (1 - y_{i}) \log (1 - z_{i}) \right]$$

where $y_i$ indicates whether the $i$-th keypoint is visible, and $z_i$ is the predicted visibility of the $i$-th keypoint.
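Read this way, the two losses can be sketched as a masked mean-squared error plus a binary cross entropy; the tensor shapes and the normalization over visible keypoints are our assumptions.

```python
import torch
import torch.nn.functional as F

# pred_hm, gt_hm: (B, K, H, W) predicted and Gaussian-label heatmaps
# vis_prob, gt_vis: (B, K) predicted visibility and 0/1 visibility labels (float)
def heatmap_loss(pred_hm, gt_hm, gt_vis):
    per_kp = ((pred_hm - gt_hm) ** 2).mean(dim=(2, 3))     # (B, K)
    # keypoints with v_i = 0 are invisible and contribute nothing to the loss
    return (gt_vis * per_kp).sum() / gt_vis.sum().clamp(min=1)

def screening_loss(vis_prob, gt_vis):
    # binary cross entropy on the per-keypoint visibility predictions
    return F.binary_cross_entropy(vis_prob, gt_vis)
```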
(3) The matching module performs data association between the current frame's human poses and the existing pose tracks from the previous frame, using the Hungarian algorithm on the bounding-box intersection-over-union (IOU) distance. The IOU is computed as:
$$\mathrm{IOU} = \frac{area(box1 \cap box2)}{area(box1 \cup box2)}$$

where $box1$ is the bounding box corresponding to a human pose in the current frame, $box2$ is the bounding box corresponding to an existing pose track from the previous frame, and $area(\cdot)$ computes the area of a box.
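A minimal sketch of this association step follows, using SciPy's Hungarian solver on a 1 - IOU cost matrix; the box format and the unmatched-pair threshold are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    # boxes as [x1, y1, x2, y2]
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_tracks(det_boxes, track_boxes, max_dist=0.7):
    if len(det_boxes) == 0 or len(track_boxes) == 0:
        return []
    cost = np.array([[1.0 - iou(d, t) for t in track_boxes] for d in det_boxes])
    rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm
    # pairs above the distance threshold are treated as unmatched
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
```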
(4) The pedestrian re-identification module uses the pose estimation network HRNet as its backbone to extract features and obtain a feature map for each person; each time the existing pose tracks are associated with the current frame's poses, lost tracks are stored. A pose that fails to match is compared by the pedestrian re-identification module against the lost pedestrian tracks: if the comparison succeeds, the new pose is merged with the lost pose track; if it fails, a new track is created;
in the training stage, all persons sharing a label are set as one class; each person's feature map is obtained and fed into a classification network, with multi-class cross entropy as the loss function;
during pose tracking, each person's feature map is extracted and matched against the lost pedestrians; the two feature maps are compared using the L1 distance, i.e. the Manhattan distance, and when the distance is below a threshold the two are considered the same person.
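The comparison against lost tracks can be sketched as below; the feature dimensionality and the distance threshold are illustrative assumptions, as the patent does not disclose them.

```python
import torch

# query_feat: (D,) feature of an unmatched pose from the HRNet backbone;
# lost_feats: (N, D) stored features of lost tracks
def reid_match(query_feat, lost_feats, thresh=10.0):    # thresh is assumed
    if lost_feats.numel() == 0:
        return None                    # no lost tracks: start a new one
    dists = (lost_feats - query_feat).abs().sum(dim=1)  # L1 (Manhattan) distance
    best = int(torch.argmin(dists))
    return best if dists[best] < thresh else None       # None -> new track
```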
In addition, the network model of the invention is trained module by module:
S3-1: the target detection module adopts an HTC detector, whose mAP reaches 0.346 on the PoseTrack 2018 validation set; it is trained only on the COCO2017 dataset and is not fine-tuned on PoseTrack 2018.
S3-2: the pose estimation and key point screening module trains on COCO2017, the trained model is finely adjusted on a Posetack 2018 data set, as only key points are annotated, a human body boundary frame is obtained by extending the boundary frames corresponding to all the key points by 20% of the length, and the human body boundary frame is made into a fixed aspect ratio by extending the boundary frames in height or width; for example, the height: width 4: 3, the resolution adopted is 384: 288, the size of the thermodynamic diagram is one quarter of the resolution, i.e. 96: 72.
During training of the pose estimation and keypoint screening module, the poses are augmented with random scaling (30%), random rotation (40 degrees), and random flipping, and Adam is used as the optimizer.
S3-3: the pedestrian re-identification module cuts pedestrians with the same label from an original image by using the Posetrack 2018 data set to form a new data set, wherein the new data set comprises 4613 pedestrian labels and 119656 pictures, the HRNet network pre-trained by the Posetrack 2018 data set is used as a backbone network, the feature maps output by the HRNet network are classified through the full-connection layer during training, and only the feature maps output by the HRNet network are used during verification and do not pass through the full-connection layer any more.
Example 1
To verify the effectiveness of each module of the model, ablation experiments were performed on the key modules. As shown in Table 1, the experiments comprise a baseline and the successive addition of two modules on top of it:
Baseline: an HTC detector, an HRNet pose estimation network, and IOU matching serve as the baseline method.
Baseline + keypoint screening module: the keypoint screening module (checkmodel) is added on top of the baseline.
Baseline + keypoint screening module + pedestrian re-identification (re-id): the keypoint screening module (checkmodel) and pedestrian re-identification (re-id) are added on top of the baseline.
The table shows that every module of the network contributes to the performance, with the keypoint screening module contributing most.
Table 1. Ablation experiments of the method of the invention on the PoseTrack 2018 dataset
[Table 1 image not reproduced here]
This embodiment also verifies the effectiveness of the keypoint screening module in different top-down pose estimation networks. As shown in Table 2, HRNet, Hourglass, and pose_resnet were each evaluated on the multiple object tracking accuracy (MOTA) metric with and without the keypoint screening module. The experimental results demonstrate that the keypoint screening module effectively improves the performance of the pose tracking network, and also show that the method of the invention generalizes well.
Table 2. Ablation experiments of the keypoint screening module in different pose estimation networks
[Table 2 image not reproduced here]
As shown in Table 3, within the framework of a top-down pose estimation network, the setting of the human keypoint threshold greatly affects the tracking accuracy (MOTA). With the keypoint screening module, the influence of the keypoint threshold on MOTA is markedly reduced, solving the problem that the human keypoint threshold is difficult to tune in multi-person pose tracking.
Table 3. Performance of the keypoint screening module at different thresholds
[Table 3 image not reproduced here]
Table 4 reports the results of the method of the invention compared with other methods, including LightTrack, Miracle+, OpenSVAI, and STAF. The table shows that the method achieves excellent results on the PoseTrack 2018 dataset; fig. 5 visualizes pose estimation on the dataset, and fig. 6 visualizes pose tracking.
Table 4. Comparison of the method of the invention with other methods
[Table 4 image not reproduced here]
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. A pose tracking method based on keypoint screening, characterized by comprising the following steps:
S1: constructing the datasets;
S1-1: adopting the COCO2017 dataset for detection and pose estimation, and the PoseTrack 2018 dataset for pose estimation and tracking;
S1-2: on the basis of the PoseTrack 2018 dataset, constructing a pedestrian re-identification dataset: all human boxes in PoseTrack 2018 are cropped and saved as images, and boxes sharing the same id within one video are given the same label, finally yielding a pedestrian re-identification dataset containing 4613 labels and 119656 images in total;
S2: constructing a multi-person pose tracking model based on a top-down framework and a general keypoint screening network;
S3: training the model;
S4: feeding the video under test into the model to obtain the detection result.
2. The pose tracking method based on keypoint screening of claim 1, wherein the multi-person pose tracking model comprises a target detection module, a pose estimation and keypoint screening module, a matching module, and a pedestrian re-identification module, wherein:
the target detection module is a multi-task, multi-stage Hybrid Task Cascade model, i.e. an HTC model; its backbone is a 101-layer ResNeXt network, and it adopts a feature pyramid network (FPN) and a non-maximum suppression algorithm based on the PCK metric;
the pose estimation and keypoint screening module adopts a general pose estimation network and obtains each keypoint's heatmap and existence probability through a keypoint prediction module and a keypoint screening module respectively, finally yielding the human poses;
the matching module performs data association between the current frame's human poses and the existing human pose tracks, using the Hungarian algorithm on the intersection-over-union (IOU) distance of the bounding boxes; a human body whose track is lost is added to the pedestrian re-identification module, and when a newly appearing human body is detected, the pedestrian re-identification module is used to recover the lost track.
3. The pose tracking method based on keypoint screening of claim 2, wherein the non-maximum suppression algorithm based on the PCK metric uses the PCK distance as the similarity measure between two human boxes: once a human box is confirmed to be kept, other human boxes whose PCK distance to it exceeds a set threshold are deleted.
4. The pose tracking method based on keypoint screening of claim 2, wherein the keypoint prediction module and the keypoint screening module have the same network structure, composed of three layers of 3x3 transposed convolutions; the keypoint prediction module finally generates a heatmap for each keypoint, and the keypoint screening module passes each generated per-keypoint feature map through a fully connected layer to obtain the keypoint's classification probability.
5. The pose tracking method based on keypoint screening of claim 2, wherein the pedestrian re-identification module uses the pose estimation network HRNet as its backbone to extract features and obtain a feature map for each person; each time the existing pose tracks are associated with the current frame's poses, lost tracks are stored; a pose that fails to match is compared by the pedestrian re-identification module against the lost pedestrian tracks: if the comparison succeeds, the new pose is merged with the lost pose track, and if it fails, a new track is created;
in the training stage, all persons sharing a label are set as one class; each person's feature map is obtained and fed into a classification network, with multi-class cross entropy as the loss function;
during pose tracking, each person's feature map is extracted and matched against the lost pedestrians; the two feature maps are compared using the L1 distance, i.e. the Manhattan distance, and when the distance is below a threshold the two are considered the same person.
6. The pose tracking method based on keypoint screening of claim 2, wherein the heatmap output by the keypoint prediction module is further smoothed by second-order Gaussian filtering.
7. The pose tracking method based on keypoint screening of claim 1, wherein step S3 trains the model module by module, specifically:
S3-1: the target detection module adopts an HTC detector, whose mAP reaches 0.346 on the PoseTrack 2018 validation set; it is trained only on the COCO2017 dataset and is not fine-tuned on PoseTrack 2018;
S3-2: the pose estimation and keypoint screening module is trained on COCO2017 and the trained model is fine-tuned on the PoseTrack 2018 dataset; since only keypoints are annotated, the ground-truth box of a person instance is obtained by enlarging the tight bounding box of all its keypoints by 20% of its length, and the human bounding box is brought to a fixed aspect ratio by extending it in height or width;
during training of the pose estimation and keypoint screening module, the poses are augmented with random scaling, random rotation, and random flipping, and Adam is adopted as the optimizer;
S3-3: the pedestrian re-identification module crops pedestrians sharing a label out of the original images of the PoseTrack 2018 dataset to form a new dataset; an HRNet network pre-trained on the PoseTrack 2018 dataset serves as the backbone; during training the feature map output by HRNet is classified through a fully connected layer, while during validation only the HRNet feature map is used, without the fully connected layer.
CN202111168498.9A 2021-09-30 2021-09-30 Pose tracking method based on keypoint screening Pending CN113850221A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111168498.9A CN113850221A (en) Pose tracking method based on keypoint screening

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111168498.9A CN113850221A (en) Pose tracking method based on keypoint screening

Publications (1)

Publication Number Publication Date
CN113850221A 2021-12-28

Family

ID=78977690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111168498.9A Pending CN113850221A (en) Pose tracking method based on keypoint screening

Country Status (1)

Country Link
CN (1) CN113850221A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100442A (en) * 2022-08-23 2022-09-23 浙江大华技术股份有限公司 Target matching method, target and part matching method and related equipment
CN115100442B (en) * 2022-08-23 2022-11-22 浙江大华技术股份有限公司 Target matching method, target and part matching method and related equipment
CN115953722A (en) * 2023-03-03 2023-04-11 阿里巴巴(中国)有限公司 Processing method and device for video classification task
CN115953722B (en) * 2023-03-03 2023-07-04 阿里巴巴(中国)有限公司 Processing method and device for video classification task


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination