CN116109673A - Multi-frame trajectory tracking system and method based on pedestrian pose estimation - Google Patents

Multi-frame trajectory tracking system and method based on pedestrian pose estimation

Info

Publication number
CN116109673A
CN116109673A (application CN202310095186.2A)
Authority
CN
China
Prior art keywords
frame
detection
tracking
frame image
pose
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310095186.2A
Other languages
Chinese (zh)
Inventor
田炜
高众
艾文瑾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN202310095186.2A
Publication of CN116109673A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion
    • G06T7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/62 - Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 - Proximity, similarity or dissimilarity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 - Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 - Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30196 - Human being; Person
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30241 - Trajectory
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a multi-frame trajectory tracking system and method based on pedestrian pose estimation. The system is built on a Tracking by Object Detection framework: pose keypoint detection is added on top of single-frame object detection, and the keypoint information is introduced into tracking, so that tracking proceeds through a joint target-and-pose detection paradigm. The method comprises the following steps: a single-frame image is fed through feature extraction into a detector, which outputs detection confidences and detection box coordinates; the pose of the pedestrian in each detection box is predicted separately; a tracker is initialized from the single-frame model output corresponding to the first frame image of the video; and, on top of detection-box matching and pose-based matching, tracking is further optimized using the detector's reference points, so that target associations between two frame images are established from joint judgments on pose and detector reference points. Compared with the prior art, the invention optimizes the overall tracking effect, improves detection and association performance in scenes with occlusion and motion, and effectively improves the tracking result.

Description

Multi-frame trajectory tracking system and method based on pedestrian pose estimation
Technical Field
The invention relates to the technical field of automatic driving, in particular to a multi-frame trajectory tracking system and method based on pedestrian pose estimation.
Background
Autonomous driving is one of the major trends in the development of the automotive industry in recent years, and its corresponding detection and control technologies have become a current research hotspot.
In addition to the vehicles coming and going, a common traffic environment inevitably contains a considerable number of pedestrians, so pedestrian detection is an unavoidable link in automatic driving technology. How to detect and track human body pose in an automatic driving environment with a vehicle-mounted visual perception system has therefore become an important problem in this field. Traditional algorithms use hand-crafted features and complex human models to obtain local representations and global pose structure; given the complexity of the human body, more and more models now use deep learning methods to extract the relevant features.
In recent years, deep learning algorithms have developed rapidly, and a large number of efficient models and well-curated datasets have been published, enabling multi-target understanding and tracking based on human pose estimation. However, most existing methods analyze tracking trajectories directly from the detector output. Although many powerful detector models perform excellently, simply tracking from detector results degrades the overall association performance in scenes with occlusion and motion, leading to a poor actual tracking effect.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a multi-frame trajectory tracking method based on pedestrian pose estimation, which effectively improves the tracking effect by integrating human pose information on top of an existing detector.
The aim of the invention can be achieved by the following technical scheme: a multi-frame trajectory tracking system based on pedestrian pose estimation, implemented on a Tracking by Object Detection framework, in which pose keypoint detection is added on the basis of single-frame object detection and the keypoint information is introduced into tracking, so that tracking is performed through a joint target-and-pose detection paradigm; the system comprises a single-frame model and a tracker connected in sequence, the single-frame model being connected to a vehicle-mounted camera so as to obtain single-frame images from the captured video data, and containing a detector and a multi-person pose estimation module, the detector being used for outputting the detection confidences and detection boxes of all pedestrian targets in a single-frame image;
the multi-person pose estimation module is used for performing 2D human pose estimation on each pedestrian target and outputting the corresponding 2D pose keypoint coordinates;
the tracker is used for tracking and matching the pedestrian targets in the current frame image and the previous frame image according to the output data of the single-frame model, and for synchronously updating its own parameters.
Further, the detector specifically adopts the Transformer-based Deformable DETR framework.
A multi-frame trajectory tracking method based on pedestrian pose estimation comprises the following steps:
S1, extracting a single-frame image from the video data acquired by a vehicle-mounted camera and inputting it into the single-frame model;
S2, the single-frame model processing the input single-frame image and outputting the detection confidences, detection boxes and 2D pose keypoint coordinates of all pedestrian targets in the image;
S3, initializing the parameters of the tracker according to the single-frame model output data corresponding to the first frame image in the video data, after which the single-frame model keeps updating its output data to the tracker;
S4, the tracker performing tracking matching on the pedestrian targets in the current frame image and the previous frame image, outputting the tracking result, and synchronously updating its own parameters. A sketch of this loop is given below.
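For illustration only, a minimal Python sketch of the S1-S4 loop follows; SingleFrameModel and Tracker are hypothetical stand-ins for the detector-plus-pose module and the tracker, and their names and interfaces are assumptions, not part of the original disclosure:

    # Sketch of the S1-S4 loop. single_frame_model and tracker are hypothetical
    # callables standing in for the components described above.
    import cv2  # assumed available for reading the on-board camera stream

    def run_tracking(video_path, single_frame_model, tracker):
        cap = cv2.VideoCapture(video_path)
        initialized = False
        while True:
            ok, frame = cap.read()                      # S1: extract a single frame
            if not ok:
                break
            confs, boxes, keypoints = single_frame_model(frame)       # S2
            if not initialized:
                tracker.initialize(confs, boxes, keypoints)           # S3: first frame
                initialized = True
            else:
                yield tracker.match_and_update(confs, boxes, keypoints)  # S4
        cap.release()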
Further, the step S2 specifically comprises the following steps:
S21, inputting the single-frame image into the detector of the single-frame model, which outputs the detection confidences and detection boxes corresponding to all pedestrian targets in the image;
S22, according to the data output by the detector, the multi-person pose estimation module of the single-frame model performing 2D pose estimation on each pedestrian target in the image and outputting the corresponding 2D pose keypoint coordinates.
Further, initializing the parameters of the tracker in step S3 specifically means initializing the following parameters of the tracker:
the detection confidence, the detection box, the 2D pose keypoints and the track ID, wherein the detection confidence, detection box and 2D pose keypoints are taken from the single-frame model output data corresponding to the first frame image in the video data, and the track IDs are non-repeating identifiers assigned starting from 0.
Further, the specific process by which the tracker in step S4 performs tracking matching between the pedestrian targets of the current frame image and of the previous frame image is as follows:
S41, from the single-frame model data corresponding to the current frame image and to the previous frame image, respectively calculating the detection-box matching degree, the keypoint similarity and the reference-point-based matching score, and summing the three results to obtain the final matching score matrix;
S42, determining the trajectory similarity between the current frame image and the previous frame image from the final matching score matrix; if the trajectory similarity exceeds the corresponding preset threshold, judging that the target in the previous frame image has found its matching object in the current frame image, i.e. the matching succeeds;
otherwise, judging that the target in the previous frame image has no matching object in the current frame image, i.e. the matching fails.
Further, the matching degree of the detection frame is specifically:
$$\mathrm{GIoU}(A,B)=\frac{|A\cap B|}{|A\cup B|}-\frac{|C\setminus(A\cup B)|}{|C|}$$
wherein A and B correspond to the areas occupied by the two detection boxes respectively, and C is the area occupied by the minimum rectangle circumscribing A and B.
Further, the key point similarity is specifically:
$$\mathrm{OKS}=\frac{\sum_i \exp\left(-d_i^{2}/(2S^{2}\kappa_i^{2})\right)\,\delta(v_i>0)}{\sum_i \delta(v_i>0)},\qquad d_i=\sqrt{(x_i-\hat{x}_i)^{2}+(y_i-\hat{y}_i)^{2}},\qquad S=\sqrt{(x_2-x_1)(y_2-y_1)}$$
wherein d_i is the Euclidean distance between corresponding keypoints, S is the size of the object, x and y are the keypoint coordinate values, (x_1, y_1) and (x_2, y_2) are the two vertex coordinates on the diagonal of the object truth box, κ_i is a per-keypoint normalization constant, and δ(v_i>0) selects the labeled keypoints.
Further, the calculation of the reference-point-based matching score comprises the following steps:
1) performing a primary matching of the reference points according to the embedding features and the detection boxes;
2) rearranging the embedding features according to the order given by the primary matching result of the reference points, obtaining the reference point order of the current frame;
3) obtaining the offsets of the corresponding reference points with a group of multi-layer perceptrons, yielding the reference point coordinates required by the tracking branch;
4) rearranging the reference points of the previous frame according to the current frame and feeding them to the decoder of the current frame to obtain the reference-point-based matching score.
Further, the specific process of updating the tracker's own parameters in step S4 is as follows:
if the matching succeeds, the confidence, detection box and 2D keypoint parameters currently stored in the tracker are updated with the single-frame model data corresponding to the current frame, and the successfully matched target is kept in the activated state;
if the matching fails, the state of the corresponding target is set to suspended and a suspension counter is incremented by one; when a matching object for the target is found again in a subsequent frame image, the suspension counter is cleared, and if the suspension counter exceeds a preset threshold, tracking of the target is closed.
Compared with the prior art, the invention has the following advantages:
1. The invention builds on a Tracking-by-Detection framework, adds pose keypoint detection on the basis of single-frame object detection, and introduces the pose information into tracking, thereby constructing a multi-frame trajectory tracking system comprising a single-frame model and a tracker. The single-frame model contains a detector and a multi-person pose estimation module: the detector predicts the detection box in which each pedestrian is located, and the multi-person 2D pose estimation module predicts each person's pose separately and outputs the coordinates of the 2D human keypoints. The tracker is initialized with the single-frame model output of the first frame image of the video; it then establishes associations between two frame images and updates the tracking trajectories by jointly using the pedestrians' predicted detection boxes and keypoint coordinates, so that the tracking result is optimized and the tracking effect is effectively improved.
2. The detector in the single-frame model adopts the Transformer-based Deformable DETR framework and outputs per-person detection confidences and detection boxes. The feature extraction capability of Deformable DETR can thus be fully utilized, and the prior structural information of the human body is mined from the sampling points in Deformable DETR, realizing a reference-point-based tracking optimization structure that improves the tracker's overall detection and association performance in scenes with occlusion and motion.
3. According to the invention, the tracker performs tracking matching between the targets of the current and previous frames by calculating the detection-box matching degree, the keypoint similarity and the reference-point-based matching score from the single-frame model output, and synchronously updates the tracker parameters based on the matching result. This guarantees real-time optimization of the tracker parameters and fully ensures trajectory tracking accuracy.
4. On top of detection-box-based matching and pose-based matching, the tracker further performs tracking optimization based on the detector's reference points: by calculating a reference-point-based matching score, target associations between two frames are established from joint judgments on pose and detector reference points. Compared with methods that track trajectories directly from the detector output, this greatly optimizes the overall tracking effect and improves detection and association performance in scenes with occlusion and motion.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a process by which a tracker calculates a reference point based matching score.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples.
Examples
The multi-frame trajectory tracking system based on pedestrian pose estimation is implemented on a Tracking by Object Detection framework: pose keypoint detection is added on the basis of single-frame object detection, and the keypoint information is introduced into tracking, i.e. a Tracking by Object and Pose Detection framework is constructed, in which tracking is performed through a joint target-and-pose detection paradigm. Specifically, the system comprises a single-frame model and a tracker connected in sequence, the single-frame model being connected to a vehicle-mounted camera so as to obtain single-frame images from the video data it captures;
a detector and a multi-person pose estimation module are arranged in the single-frame model: the detector is used for outputting the detection confidences and detection boxes corresponding to all pedestrian targets in the single-frame image, and the multi-person pose estimation module is used for performing 2D human pose estimation on each pedestrian target and outputting the corresponding 2D pose keypoint coordinates;
the tracker is used for tracking and matching the pedestrian targets in the current frame image and the previous frame image according to the output data of the single-frame model, and for synchronously updating its own parameters.
In this embodiment, the detector specifically employs the Transformer-based Deformable DETR framework and outputs per-person detection confidences and detection boxes. Deformable DETR can provide the human detection boxes and corresponding feature extraction for the tracking task, but it does not consider the prior structural information of the human body; the present scheme therefore adds a multi-person pose estimation module dedicated to human pose information. The module adopts a top-down scheme: it performs 2D human pose estimation on each individual according to the human detection boxes predicted by the detector, and takes the 2D keypoint coordinates as output, as in the sketch below.
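A minimal sketch of that top-down crop-and-regress flow, assuming pose_net is some single-person 2D pose network available as a callable; its name, signature and the input size are placeholders, not part of the disclosure:

    import numpy as np

    def estimate_poses(image, boxes, pose_net, input_size=(192, 256)):
        """Top-down 2D pose estimation: crop each detected pedestrian box, run a
        single-person pose network on the crop, then map the predicted keypoints
        back into full-image coordinates."""
        w_in, h_in = input_size
        all_keypoints = []
        for (x1, y1, x2, y2) in boxes.astype(int):
            crop = image[y1:y2, x1:x2]                 # per-person crop from its box
            kpts = pose_net(crop, input_size)          # assumed: (K, 2) in crop coords
            scale = np.array([(x2 - x1) / w_in, (y2 - y1) / h_in])
            all_keypoints.append(kpts * scale + np.array([x1, y1]))
        return all_keypoints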
Since relevant human pose labels are needed, this embodiment trains the pose-estimation-based tracker in advance on PoseTrack, a multi-frame image dataset annotated with human skeleton keypoints, so that detection boxes and joint keypoints can be integrated for multi-frame tracking.
Applying this system, a multi-frame trajectory tracking method based on pedestrian pose estimation is realized, as shown in FIG. 1, comprising the following steps:
S1, extracting a single-frame image from the video data acquired by the vehicle-mounted camera and inputting it into the single-frame model;
S2, the single-frame model processing the input single-frame image and outputting the detection confidences, detection boxes and 2D pose keypoint coordinates of all pedestrian targets in the image;
S3, initializing the parameters of the tracker according to the single-frame model output data corresponding to the first frame image in the video data, after which the single-frame model keeps updating its output data to the tracker; the initialized parameters comprise the detection confidence, the detection box, the 2D pose keypoints and the track ID, where the track IDs are non-repeating identifiers assigned starting from 0 and the other three correspond to the single-frame model output for the first frame image, as in the sketch below;
s4, the tracker performs tracking matching on pedestrian targets in the current frame image and the previous frame image (specifically, performs tracking matching on three information of the matching degree of the detection frame, the similarity of the key points and the matching score of the tracking optimization module based on the reference points), and synchronously updates the parameters of the tracker;
wherein the matching degree score based on the detection boxes is calculated as
$$\mathrm{GIoU}(A,B)=\frac{|A\cap B|}{|A\cup B|}-\frac{|C\setminus(A\cup B)|}{|C|}$$
wherein A and B correspond to the areas occupied by the two detection boxes, and C is the area occupied by the minimum rectangle circumscribing A and B;
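A sketch of this score; writing it in GIoU form is an inference from the A/B/C description above, since the original formula appears only as an image:

    def box_match_score(a, b):
        """Detection-box matching degree in the GIoU form implied by the text.
        Boxes are (x1, y1, x2, y2); C is the minimal rectangle enclosing both."""
        ax1, ay1, ax2, ay2 = a
        bx1, by1, bx2, by2 = b
        inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
        inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
        inter = inter_w * inter_h
        union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
        c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
        iou = inter / union if union > 0 else 0.0
        return iou - (c_area - union) / c_area if c_area > 0 else iou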
the similarity score based on keypoints is calculated as
$$\mathrm{OKS}=\frac{\sum_i \exp\left(-d_i^{2}/(2S^{2}\kappa_i^{2})\right)\,\delta(v_i>0)}{\sum_i \delta(v_i>0)},\qquad d_i=\sqrt{(x_i-\hat{x}_i)^{2}+(y_i-\hat{y}_i)^{2}},\qquad S=\sqrt{(x_2-x_1)(y_2-y_1)}$$
where d_i denotes the Euclidean distance between corresponding keypoints, S the size of the object, x and y the keypoint coordinate values, and (x_1, y_1), (x_2, y_2) the two vertex coordinates on the diagonal of the object truth box;
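A sketch of this keypoint similarity in the OKS style the definitions imply; the per-keypoint falloff constant kappa is an assumed value, as the original formula appears only as an image:

    import numpy as np

    def keypoint_similarity(kpts_a, kpts_b, truth_box, kappa=0.1):
        """Keypoint similarity: d is the per-keypoint Euclidean distance and S
        the object size derived from the truth box's diagonal vertices."""
        x1, y1, x2, y2 = truth_box
        s = np.sqrt(abs((x2 - x1) * (y2 - y1)))        # object size S
        d = np.linalg.norm(np.asarray(kpts_a) - np.asarray(kpts_b), axis=1)
        return float(np.mean(np.exp(-d ** 2 / (2 * (s * kappa) ** 2))))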
as shown in fig. 2, the multi-frame tracking optimization module flow based on the reference point is as follows:
1) Performing primary matching on the reference points according to the characteristics and the detection frame;
2) Rearranging the sounding features according to the sequence of the primary matching result of the reference points to obtain the sequence of the reference points of the current frame;
3) Calculating a group of MLPs to obtain offset of a corresponding reference point, and obtaining reference point coordinates required by a tracking branch;
4) Rearranging the reference point sequence of the previous frame according to the current frame and sending the reference point sequence into a decoder of the current frame, so that a matching result based on the reference point can be obtained;
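A sketch of steps 2 and 3 under assumptions: the text states only that a group of multi-layer perceptrons produces the offsets, so the two-layer shape and layer sizes here are illustrative:

    import torch
    import torch.nn as nn

    class ReferencePointOffset(nn.Module):
        """Rearranges embeddings by the primary-match order (step 2), then a
        small MLP maps each embedding to a (dx, dy) offset added to its
        reference point (step 3), giving the coordinates for the tracking branch."""
        def __init__(self, embed_dim=256, hidden_dim=256):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, 2),
            )

        def forward(self, embeddings, ref_points, match_order):
            emb = embeddings[match_order]     # step 2: reorder by primary matching
            ref = ref_points[match_order]
            return ref + self.mlp(emb)        # step 3: offset the reference points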
the primary matching result of the reference points is as follows:
for the ith detection result of a given T-th frame,
Figure BDA0004071507920000066
representing the corresponding ebedding feature, +.>
Figure BDA0004071507920000067
Reference point for DETR, +.>
Figure BDA0004071507920000068
A predicted detection frame size;
Figure BDA0004071507920000071
Figure BDA0004071507920000072
and obtaining a reference point matching result between two continuous frames of images by using a Hungary algorithm through the weighted sum of the two distance matrixes.
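A sketch of that primary matching under stated assumptions: the exact distance forms and weights are not specified beyond "the weighted sum of the two distance matrices", so w_emb and w_ref are illustrative:

    import numpy as np
    from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

    def primary_match(emb_t, ref_t, emb_prev, ref_prev, w_emb=1.0, w_ref=1.0):
        """Build the embedding-feature and reference-point distance matrices
        between consecutive frames and solve their weighted sum with the
        Hungarian algorithm."""
        d_emb = np.linalg.norm(emb_t[:, None, :] - emb_prev[None, :, :], axis=-1)
        d_ref = np.linalg.norm(ref_t[:, None, :] - ref_prev[None, :, :], axis=-1)
        cost = w_emb * d_emb + w_ref * d_ref
        rows, cols = linear_sum_assignment(cost)   # minimal-cost one-to-one match
        return rows, cols, cost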
The three calculation results are accumulated to obtain the final matching score matrix, from which the trajectory similarity between the current frame image and the previous frame image is determined. If the trajectory similarity exceeds the corresponding preset threshold, it is judged that the target in the previous frame image has found its matching object in the current frame image, i.e. the matching succeeds; otherwise, it is judged that the target in the previous frame image has no matching object in the current frame image, i.e. the matching fails. A sketch of this decision follows.
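A simplified greedy sketch of that decision; the threshold value is assumed, and a strict one-to-one assignment could instead reuse the Hungarian step above:

    import numpy as np

    def associate(box_scores, kpt_scores, ref_scores, threshold=0.5):
        """Accumulate the three score matrices (rows: previous-frame targets,
        columns: current-frame detections) and accept a match only when the
        best accumulated similarity exceeds the preset threshold."""
        total = box_scores + kpt_scores + ref_scores
        matches, unmatched_prev = {}, []
        for i in range(total.shape[0]):
            j = int(np.argmax(total[i]))
            if total[i, j] > threshold:
                matches[i] = j              # target i found its matching object
            else:
                unmatched_prev.append(i)    # matching fails for target i
        return matches, unmatched_prev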
In addition, when the tracker updates its own parameters, it follows these rules (sketched below):
1) For trajectories matched successfully in the current frame: the confidence, detection box, 2D keypoints and other data stored in the tracker are updated with the current frame's results, and the trajectory is kept in the activated state.
2) For trajectories with no matching object in the current frame: the state is set to suspended and the suspension counter is incremented by one; the counter is cleared if a matching object is found again, and the trajectory is closed if the suspension counter exceeds the threshold.
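A minimal sketch of these update rules, reusing the Track record from the initialization sketch; the suspension threshold max_suspend is an assumed value:

    def update_tracks(tracks, matches, detections, max_suspend=30):
        """Matched tracks refresh their stored data and stay active; unmatched
        tracks are suspended and closed once the counter passes the threshold."""
        alive = []
        for i, track in enumerate(tracks):
            if i in matches:
                conf, box, kpts = detections[matches[i]]
                track.confidence, track.box, track.keypoints = conf, box, kpts
                track.active = True
                track.suspend_count = 0     # counter cleared on a new match
                alive.append(track)
            else:
                track.active = False        # state turns to suspended
                track.suspend_count += 1
                if track.suspend_count <= max_suspend:
                    alive.append(track)     # otherwise the trajectory is closed
        return alive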
In summary, this technical scheme recognizes that multi-target tracking can generally be divided into two subtasks, detection and association, and that existing methods analyze tracking trajectories directly from the detector output, which degrades the overall detection and association performance in scenes with occlusion and motion. A detector can provide the human detection boxes and corresponding feature extraction for the tracking task, but it does not consider the prior structural information of the human body; this scheme therefore adds a multi-person pose estimation module to emphasize human pose information, so that the improved framework detects pedestrians and their pose keypoints simultaneously and integrates the human pose information. By exploiting joint judgments on pose and detection-box reference points, a definite association can be established between two frames instead of isolated detections, which effectively optimizes the tracking result.

Claims (10)

1. A multi-frame trajectory tracking system based on pedestrian pose estimation, characterized in that it adopts a Tracking by Object Detection framework, adds pose keypoint detection on the basis of single-frame object detection, and introduces the keypoint information into tracking so as to track through a joint target-and-pose detection paradigm; the system comprises a single-frame model and a tracker connected in sequence, the single-frame model being connected to a vehicle-mounted camera so as to obtain single-frame images from the video data acquired by the vehicle-mounted camera; a detector and a multi-person pose estimation module are arranged in the single-frame model, the detector being used for outputting the detection confidences and detection boxes corresponding to all pedestrian targets in the single-frame image;
the multi-person pose estimation module is used for performing 2D human pose estimation on the pedestrian targets and outputting the corresponding 2D pose keypoint coordinates;
the tracker is used for tracking and matching the pedestrian targets in the current frame image and the previous frame image according to the output data of the single-frame model, and for synchronously updating its own parameters.
2. The multi-frame trajectory tracking system based on pedestrian pose estimation according to claim 1, wherein the detector specifically adopts the Transformer-based Deformable DETR framework.
3. A multi-frame trajectory tracking method based on pedestrian pose estimation, characterized by comprising the following steps:
S1, extracting a single-frame image from video data acquired by a vehicle-mounted camera and inputting it into a single-frame model;
S2, the single-frame model processing the input single-frame image and outputting the detection confidences, detection boxes and 2D pose keypoint coordinates of all pedestrian targets in the single-frame image;
S3, initializing the parameters of a tracker according to the single-frame model output data corresponding to the first frame image in the video data, the single-frame model then updating its output data to the tracker;
S4, the tracker performing tracking matching on the pedestrian targets in the current frame image and the previous frame image, outputting the tracking result, and synchronously updating its own parameters.
4. The multi-frame trajectory tracking method based on pedestrian pose estimation according to claim 3, wherein said step S2 specifically comprises the following steps:
S21, inputting the single-frame image into the detector of the single-frame model, which outputs the detection confidences and detection boxes corresponding to all pedestrian targets in the single-frame image;
S22, according to the data output by the detector, the multi-person pose estimation module of the single-frame model performing 2D pose estimation on each pedestrian target in the single-frame image and outputting the corresponding 2D pose keypoint coordinates.
5. The multi-frame trajectory tracking method based on pedestrian pose estimation according to claim 3, wherein initializing the parameters of the tracker in step S3 specifically means initializing the following parameters of the tracker:
the detection confidence, the detection box, the 2D pose keypoints and the track ID, wherein the detection confidence, detection box and 2D pose keypoints correspond to the single-frame model output data for the first frame image in the video data, and the track IDs are non-repeating identifiers assigned starting from 0.
6. The multi-frame trajectory tracking method based on pedestrian pose estimation according to claim 3, wherein the specific process by which the tracker in step S4 performs tracking matching between the pedestrian targets of the current frame image and of the previous frame image is as follows:
S41, from the single-frame model data corresponding to the current frame image and to the previous frame image, respectively calculating the detection-box matching degree, the keypoint similarity and the reference-point-based matching score, and summing the three results to obtain the final matching score matrix;
S42, determining the trajectory similarity between the current frame image and the previous frame image from the final matching score matrix; if the trajectory similarity exceeds the corresponding preset threshold, judging that the target in the previous frame image has found its matching object in the current frame image, i.e. the matching succeeds;
otherwise, judging that the target in the previous frame image has no matching object in the current frame image, i.e. the matching fails.
7. The multi-frame trajectory tracking method based on pedestrian pose estimation according to claim 6, wherein the detection-box matching degree is specifically:
$$\mathrm{GIoU}(A,B)=\frac{|A\cap B|}{|A\cup B|}-\frac{|C\setminus(A\cup B)|}{|C|}$$
wherein A and B correspond to the areas occupied by the two detection boxes respectively, and C is the area occupied by the minimum rectangle circumscribing A and B.
8. The multi-frame trajectory tracking method based on pedestrian pose estimation according to claim 6, wherein the keypoint similarity is specifically:
$$\mathrm{OKS}=\frac{\sum_i \exp\left(-d_i^{2}/(2S^{2}\kappa_i^{2})\right)\,\delta(v_i>0)}{\sum_i \delta(v_i>0)},\qquad d_i=\sqrt{(x_i-\hat{x}_i)^{2}+(y_i-\hat{y}_i)^{2}},\qquad S=\sqrt{(x_2-x_1)(y_2-y_1)}$$
wherein d_i is the Euclidean distance between corresponding keypoints, S is the size of the object, x and y are the keypoint coordinate values, and (x_1, y_1), (x_2, y_2) are the two vertex coordinates on the diagonal of the object truth box.
9. The multi-frame trajectory tracking method based on pedestrian pose estimation according to claim 6, wherein the process of calculating the reference-point-based matching score comprises:
1) performing a primary matching of the reference points according to the embedding features and the detection boxes;
2) rearranging the embedding features according to the order given by the primary matching result of the reference points, obtaining the reference point order of the current frame;
3) obtaining the offsets of the corresponding reference points with a group of multi-layer perceptrons, yielding the reference point coordinates required by the tracking branch;
4) rearranging the reference points of the previous frame according to the current frame, and feeding them to the decoder of the current frame to obtain the reference-point-based matching score.
10. The multi-frame trajectory tracking method based on pedestrian pose estimation according to claim 3, wherein the specific process of updating the parameters of the tracker in step S4 is as follows:
if the matching succeeds, updating the confidence, detection box and 2D keypoint parameters currently stored in the tracker with the single-frame model data corresponding to the current frame, while keeping the successfully matched target in the activated state;
if the matching fails, setting the state of the corresponding target to suspended and incrementing a suspension counter by one; when a matching object for the target is found in a subsequent frame image, clearing the suspension counter; and if the suspension counter exceeds a preset threshold, closing the tracking of the target.
CN202310095186.2A 2023-01-20 2023-01-20 Multi-frame trajectory tracking system and method based on pedestrian pose estimation Pending CN116109673A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310095186.2A 2023-01-20 2023-01-20 Multi-frame trajectory tracking system and method based on pedestrian pose estimation (CN116109673A)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310095186.2A 2023-01-20 2023-01-20 Multi-frame trajectory tracking system and method based on pedestrian pose estimation (CN116109673A)

Publications (1)

Publication Number Publication Date
CN116109673A (en) 2023-05-12

Family

ID=86253869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310095186.2A Pending Multi-frame trajectory tracking system and method based on pedestrian pose estimation (CN116109673A) 2023-01-20 2023-01-20

Country Status (1)

Country Link
CN (1) CN116109673A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117351405A (en) * 2023-12-06 2024-01-05 江西珉轩智能科技有限公司 Crowd behavior analysis system and method
CN117351405B (en) * 2023-12-06 2024-02-13 江西珉轩智能科技有限公司 Crowd behavior analysis system and method

Similar Documents

Publication Publication Date Title
WO2021098261A1 (en) Target detection method and apparatus
CN113506317B (en) Multi-target tracking method based on Mask R-CNN and apparent feature fusion
US8345984B2 (en) 3D convolutional neural networks for automatic human action recognition
CN110717411A (en) Pedestrian re-identification method based on deep layer feature fusion
EP2395478A1 (en) Monocular 3D pose estimation and tracking by detection
CN112257569B (en) Target detection and identification method based on real-time video stream
CN111862145B (en) Target tracking method based on multi-scale pedestrian detection
Wang et al. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection
WO2023030182A1 (en) Image generation method and apparatus
CN113608663B (en) Fingertip tracking method based on deep learning and K-curvature method
CN112926475B (en) Human body three-dimensional key point extraction method
CN116109673A (en) Multi-frame track tracking system and method based on pedestrian gesture estimation
CN115375737A (en) Target tracking method and system based on adaptive time and serialized space-time characteristics
CN114926796A (en) Bend detection method based on novel mixed attention module
CN113269038A (en) Multi-scale-based pedestrian detection method
CN115063717B (en) Video target detection and tracking method based on real scene modeling of key area
CN116312512A (en) Multi-person scene-oriented audiovisual fusion wake-up word recognition method and device
CN116092189A (en) Bimodal human behavior recognition method based on RGB data and bone data
CN114973305B (en) Accurate human body analysis method for crowded people
WO2019136591A1 (en) Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network
Xie et al. Pedestrian detection and location algorithm based on deep learning
CN114820723A (en) Online multi-target tracking method based on joint detection and association
Das et al. Indian sign language recognition system for emergency words by using shape and deep features
CN113569650A (en) Unmanned aerial vehicle autonomous inspection positioning method based on electric power tower label identification
CN112069943A (en) Online multi-person posture estimation and tracking method based on top-down framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination