CN116645402A - Online pedestrian tracking method based on improved target detection network

Online pedestrian tracking method based on improved target detection network

Info

Publication number
CN116645402A
Authority
CN
China
Prior art keywords
frame
pedestrian
detection
target
score
Prior art date
2023-03-30
Legal status
Pending
Application number
CN202310327267.0A
Other languages
Chinese (zh)
Inventor
蒋畅江
舒鹏
刘朋
Current Assignee
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
2023-03-30
Filing date
2023-03-30
Publication date
2023-08-25
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202310327267.0A priority Critical patent/CN116645402A/en
Publication of CN116645402A publication Critical patent/CN116645402A/en
Pending legal-status Critical Current

Classifications

    • G06T7/277 Image analysis; analysis of motion involving stochastic approaches, e.g. using Kalman filters
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition
    • G06V10/74 Image or video pattern matching; proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G06V10/82 Image or video recognition or understanding using neural networks
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G06T2207/10016 Image acquisition modality: video; image sequence
    • G06V2201/07 Target detection
    • Y02T10/40 Engine management systems (climate change mitigation technologies for road transport)

Abstract

The invention relates to an online pedestrian tracking method based on an improved target detection network, which belongs to the field of target detection and comprises the following steps: the current frame image is input into a YOLOX target detection network fused with a CA attention mechanism; the detection results are divided into high-score detection frames and low-score detection frames according to the detection confidence; the high-score detection frames and the Kalman filtering prediction frames undergo similarity matching using a ReID network and GIoU, and successfully matched pairs are updated by Kalman filtering; a new tracking track is established for each detection frame that has a high score but matches no existing track; a second matching is then performed between the low-score detection frames and the tracks left unmatched in the first round (such tracks usually correspond to objects whose scores drop because of severe occlusion in the current frame); tracks that match no detection frame are retained for 30 frames and matched again when the target reappears, and deleted if no match is found. The invention can effectively reduce the influence of occlusion on recognition and improve the recognition rate and recognition speed.

Description

Online pedestrian tracking method based on improved target detection network
Technical Field
The invention belongs to the field of target detection, and relates to an online pedestrian tracking method based on an improved target detection network.
Background
Multi-target tracking in video is a fundamental and important task for many visual applications, such as video surveillance and autonomous driving. The purpose of the task is to locate multiple objects in each frame and to obtain a trajectory for each identity. Most current methods are tracking-by-detection methods, in either an online or an offline tracking mode: the online mode constructs an association matrix from the similarity between targets and detection frames and matches their positions with a matching algorithm; the offline mode constructs a graph from the detection frames within a period and the similarities between them, and solves the tracking problem by sub-graph partitioning.
The SORT algorithm proposed by Bewley et al. is a simple online real-time multi-target tracking algorithm: it mainly uses Kalman filtering to propagate each target into the next frame and then establishes associations using IoU as the metric. The DeepSORT method proposed by Wojke et al. improves on SORT by fusing a re-identification network and performing similarity matching between detection frames and predicted trajectories with the Hungarian algorithm.
Existing target tracking systems are easily affected by factors such as occlusion, camera resolution and background changes, leading to problems such as target identification errors. Most existing tracking systems build on detection results, so detection performance directly affects the tracking effect, yet an overly heavy, high-accuracy detector greatly reduces the overall speed. When a target is disturbed by the background or occluded, the ID number of its target frame may change, and the number may also change when an already-numbered target re-enters the shooting range.
Disclosure of Invention
In view of the above, the present invention aims to provide a pedestrian tracking method that uses YOLOX with a fused attention mechanism as the target detection network, so as to suppress the influence of occlusion on recognition and improve the recognition rate and recognition speed. In the method, Coordinate Attention (CA) is fused into the Neck part of YOLOX to improve the feature extraction capability; the target detection result is then input into a tracker, a ReID re-identification module and GIoU are used to perform similarity matching between the detection frames and the prediction frames, and the matched tracks are finally obtained according to the matching distance cost.
In order to achieve the above purpose, the present invention provides the following technical solutions:
an online pedestrian tracking method based on an improved target detection network, comprising the following steps:
S1: acquiring pedestrian images in the pedestrian video frames and preprocessing them;
S2: inputting the picture into a YOLOX target detection network, wherein a CA attention module is fused at the backbone output position of the YOLOX target detection network;
S3: inputting the output of target detection into a tracker to obtain high-score detection frames and low-score detection frames, the confidence of the high-score detection frames being higher than that of the low-score detection frames;
S4: performing similarity matching between the high-score detection frames and the Kalman filtering prediction frames; for a detection frame that matches no track but has a high score, a new track is opened and updated by Kalman filtering; successfully matched detection frames update the track set by Kalman filtering; tracks that fail to match wait for the second similarity matching;
S5: performing similarity matching between the low-score detection frames and the still-unmatched tracks, with IoU as the matching measure; successfully matched tracks are updated by Kalman filtering, detection frames that fail to match are deleted, and tracks that fail to match are retained for a certain time and then deleted if they cannot be matched to a detection frame again.
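For illustration only, the following Python sketch mirrors the two-stage association of S3 to S5; the class and function names, the threshold values and the stubbed Kalman and matching details are assumptions and are not taken from the patent text:

```python
# Minimal sketch of the two-stage association of S3-S5. The two matchers are
# passed in as callables; Track is a toy stand-in whose predict/update methods
# abstract the Kalman filter. All names and threshold values are assumptions.

HIGH_THRESH, LOW_THRESH, MAX_LOST = 0.6, 0.1, 30

class Track:
    def __init__(self, det):
        self.box, self.score = det
        self.lost = 0
    def predict(self):
        return self.box              # stand-in for the Kalman prediction frame
    def update(self, det):
        self.box, self.score = det   # stand-in for the Kalman state update
        self.lost = 0

def step(tracks, dets, match_reid_giou, match_iou):
    """One frame of tracking: dets is a list of (box, confidence) pairs."""
    high = [d for d in dets if d[1] >= HIGH_THRESH]
    low = [d for d in dets if LOW_THRESH <= d[1] < HIGH_THRESH]

    # First matching: high-score frames vs. Kalman prediction frames,
    # using the fused ReID + GIoU similarity (S4).
    matched, unmatched_tracks, unmatched_high = match_reid_giou(tracks, high)
    # Second matching: remaining tracks vs. low-score frames, IoU only (S5).
    matched2, unmatched_tracks2, _ = match_iou(unmatched_tracks, low)

    for t, d in matched + matched2:
        t.update(d)
    for t in unmatched_tracks2:
        t.lost += 1                  # kept for up to MAX_LOST frames
    tracks = [t for t in tracks if t.lost <= MAX_LOST]
    tracks += [Track(d) for d in unmatched_high]   # new track per unmatched high-score frame
    return tracks
```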
Further, step S1 is to acquire a pedestrian image in the pedestrian video frame, and pre-process the pedestrian image, which specifically includes:
S11: acquiring pedestrian images in pedestrian video frames and sampling them into a frame image set $\{I_1, I_2, \cdots, I_n\}$; detecting each frame image using YOLOX-s and outputting the detection sets $\{D_1, D_2, \cdots, D_n\}$ of frames 1 to n, together with the coordinate position information $\{P_1, P_2, \cdots, P_m\}$ of pedestrians 1 to m, comprising the center coordinates, the aspect ratio, and the accelerations in each direction;
s12: preprocessing pedestrian images, and adopting Mosaic data enhancement and MixUp data enhancement.
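As an illustration of the S12 preprocessing, the sketch below shows a typical MixUp operation on two training images; the Beta parameter and the box-pooling convention are assumptions, since the text does not specify them:

```python
import numpy as np

def mixup(img_a, img_b, boxes_a, boxes_b, alpha=8.0):
    """MixUp data enhancement sketch: blend two same-sized images and pool
    their pedestrian boxes. alpha=8.0 is an assumed Beta parameter."""
    lam = np.random.beta(alpha, alpha)
    mixed = lam * img_a.astype(np.float32) + (1.0 - lam) * img_b.astype(np.float32)
    return mixed.astype(img_a.dtype), boxes_a + boxes_b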
Further, in step S2, the CA attention module input/output flow includes:
S21: data is input from the backbone output position; each channel is encoded along the horizontal and vertical directions using pooling kernels of size $(H, 1)$ and $(1, W)$, outputting a direction-aware attention feature map $z^h$ of size $C \times H \times 1$ and a direction-aware attention feature map $z^w$ of size $C \times 1 \times W$;
S22: $z^h$ and $z^w$ are spliced by Concat, and a 1×1 convolution module generates the intermediate feature map $f \in R^{C/r \times 1 \times (H+W)}$, where $r$ is the channel downsampling ratio;
S23: $f$ is then split along the horizontal and vertical directions into $f^h \in R^{C/r \times H}$ and $f^w \in R^{C/r \times W}$, and two further 1×1 convolutions adjust $f^h$ and $f^w$ to tensors with the same number of channels as the input $X$;
S24: the Sigmoid activation function then yields the attention weights $g^h$ and $g^w$ for the two independent spatial directions; $g^h$ and $g^w$ are expanded to finally obtain an output feature map with stronger characterization information, which is output to the Neck part of YOLOX-s and finally passes through the detection head.
Further, step S3 specifically comprises: inputting the output of target detection into a tracker in which two confidence thresholds are set, a high-score threshold (high_thresh) and a low-score threshold (low_thresh); detection frames above the high-score threshold are high-score detection frames, those lying between the high-score and low-score thresholds are low-score detection frames, and all pedestrian frames with confidence below the low-score threshold are deleted, finally yielding a set of high-score detection frames and a set of low-score detection frames.
Further, the Kalman filtering prediction frame in step S4 is obtained by predicting the track set through Kalman filtering, with the state update equation

$\hat{x}_k = \hat{x}_k^- + K_k\left(z_k - H\hat{x}_k^-\right)$

where $\hat{x}_k$ denotes the posterior state estimate at time $k$, $\hat{x}_k^-$ denotes the prior estimate, i.e. the optimal prediction based on the previous time, $z_k$ denotes the observed value, $K_k$ is the Kalman gain and $H$ is the observation matrix. The predictions form the prediction frame set $D_t$.
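A self-contained numpy sketch of this predict/update cycle is given below; the 8-dimensional constant-velocity state and the noise magnitudes are assumptions in the style of SORT-class trackers, not values taken from the patent:

```python
import numpy as np

DIM = 8                                  # assumed state: (cx, cy, a, h) + velocities
F = np.eye(DIM); F[:4, 4:] = np.eye(4)   # constant-velocity transition, dt = 1 frame
H = np.eye(4, DIM)                       # observation extracts the box part
Q = np.eye(DIM) * 1e-2                   # assumed process noise
R = np.eye(4) * 1e-1                     # assumed measurement noise

def predict(x, P):
    """Prior estimate x_k^- and covariance, forming the prediction frame."""
    return F @ x, F @ P @ F.T + Q

def update(x_prior, P_prior, z):
    """State update: posterior x_k from prior x_k^- and observed box z_k."""
    S = H @ P_prior @ H.T + R
    K = P_prior @ H.T @ np.linalg.inv(S)        # Kalman gain K_k
    x_post = x_prior + K @ (z - H @ x_prior)    # the state update equation above
    P_post = (np.eye(DIM) - K @ H) @ P_prior
    return x_post, P_post
```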
In step S4, the similarity matching between the high-score detection frames and the Kalman filtering prediction frames is performed by obtaining a final similarity c through the ReID module and GIoU and matching the tracks using the Hungarian algorithm;
the ReID module compares pedestrians in the pedestrian track library with pedestrians under the high-score detection frames, extracting the appearance feature distance of the pedestrians under the high-score detection frames so as to judge whether they are the same pedestrian, and updates the pedestrian information in the pedestrian track library; the pedestrian track library comprises pedestrian appearance features and pedestrian positions;
the GIoU considers the non-overlapping part of the detected frame and the predicted frame and reflects the overlapping mode and the overlapping degree of the detected frame and the predicted frame.
Further, the ReID module extracts feature vectors from the prediction frames and the detection frames respectively using a ReID network model; the image is cropped using the coordinates in $P_j$, the cropped set of pedestrian images under the high-score detection frames is input into a pedestrian re-identification network model to obtain the appearance features of the pedestrians under the high-score detection frames, these are compared with the appearance features of pedestrians that have appeared in the images before, and the feature-vector similarity $d^{(1)}(i, j)$ is calculated.
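For illustration, a cosine version of $d^{(1)}$ over L2-normalised embeddings is sketched below; the patent does not fix the exact distance, so treating $d^{(1)}(i, j)$ as cosine similarity is an assumption:

```python
import numpy as np

def reid_similarity(track_feats, det_feats):
    """d1[i, j]: cosine similarity between the ReID embedding of track i
    (rows) and high-score detection j (columns)."""
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    return t @ d.T   # in [-1, 1]; larger means more similar
```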
Further, the GIoU similarity between a prediction frame and a detection frame is calculated as

$GIoU = IoU - \frac{|C \setminus (A \cup B)|}{|C|}$

where $IoU$ is the intersection-over-union of the prediction frame $A$ and the detection frame $B$, and $C$ is their smallest enclosing box; this yields the similarity $d^{(2)}(i, j)$, and according to the formula $c_{i,j} = \mu d^{(1)}(i,j) + (1-\mu)d^{(2)}(i,j)$ the hyper-parameter $\mu$ is set to obtain the final similarity $c$.
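The sketch below computes GIoU per its standard definition and fuses the two similarities; taking $d^{(2)}$ directly as the GIoU value and the choice $\mu = 0.5$ are assumptions, since the patent leaves both unspecified:

```python
def giou(a, b):
    """GIoU of boxes a, b in (x1, y1, x2, y2) form: IoU minus the fraction
    of the smallest enclosing box C not covered by the union."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])     # enclosing box corners
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    return inter / union - (c_area - union) / c_area

def fused_similarity(d1, d2, mu=0.5):
    """c_ij = mu * d1_ij + (1 - mu) * d2_ij, with assumed mu = 0.5."""
    return mu * d1 + (1.0 - mu) * d2
```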
Further, matching the tracks using the Hungarian algorithm specifically comprises:
initializing a bipartite graph and determining, from the input cost matrix, which previous-frame targets and current-frame targets may match; letting U be the previous-frame set and V the current-frame set, matching proceeds in ID order:
first, target 1 of the current frame is matched, say to target 1 of the previous frame; then target 2 is matched, then target 3; if previous-frame target 3 is already matched by target 1 or 2, the target in U holding it is re-matched to another target, and if target 1 in U is already matched to target 2 in V, target 2 in U is re-matched, and so on for targets 1, 2 and 3 in U; the remaining targets are then matched in the same way, and any target in V finally left unmatched is treated as a new target; overall this is a recursive augmenting process.
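In practice the same optimal assignment can be obtained with scipy's linear_sum_assignment, which solves the assignment problem the recursive procedure above describes; in the sketch below the gate min_sim is an assumed hyper-parameter:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(similarity, min_sim=0.3):
    """Assign tracks (rows) to detections (columns) by maximising the fused
    similarity c; pairs below min_sim are treated as unmatched."""
    rows, cols = linear_sum_assignment(1.0 - similarity)   # minimise cost
    matches = [(r, c) for r, c in zip(rows, cols) if similarity[r, c] >= min_sim]
    unmatched_tracks = sorted(set(range(similarity.shape[0])) - {r for r, _ in matches})
    unmatched_dets = sorted(set(range(similarity.shape[1])) - {c for _, c in matches})
    return matches, unmatched_tracks, unmatched_dets
```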
The invention has the beneficial effects that:
(1) Because real multi-target tracking scenes suffer from problems such as very small targets and frequent occlusion, the method adds a coordinate attention module to the Neck part of YOLOX, so that the system attends better to the person detail features in the video stream; information loss during feature extraction is reduced, the feature fusion part carries richer information, and the computational cost is small, thereby improving the detection effect, reducing false detections and giving a better tracking effect.
(2) The invention uses the target detection network to detect pedestrians in the current frame image, obtaining high-score and low-score detection frames; the high-score detection frames and the prediction frames undergo a first similarity matching with ReID and GIoU as the metrics, and the low-score detection frames and the unmatched tracks undergo a second similarity matching with IoU. Matching pedestrians twice strengthens the pedestrian tracking precision, fusing the ReID module gives a better tracking effect under occlusion, and problems such as lost pedestrian tracks and identity (ID) switches can be avoided.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of an online pedestrian tracking method based on an improved target detection network according to the present invention;
FIG. 2 is a schematic illustration of CA attention module addition locations;
FIG. 3 shows the experimental results and detection-frame results of the present invention compared with the DeepSORT algorithm on a section of verification video, wherein (a) and (b) are screenshots of the DeepSORT algorithm and (c) and (d) are the experimental results of the present invention.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes embodiments of the present invention with reference to specific examples. The invention may also be practiced or carried out in other, different embodiments, and the details of the present description may be modified or varied in various respects without departing from the spirit and scope of the present invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the invention, and the following embodiments and the features within them may be combined with each other where no conflict arises.
Wherein the drawings are for illustrative purposes only and are shown in schematic, non-physical, and not intended to limit the invention; for the purpose of better illustrating embodiments of the invention, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numbers in the drawings of the embodiments of the invention correspond to the same or similar components. In the description of the present invention, it should be understood that terms such as "upper", "lower", "left", "right", "front" and "rear", which indicate an orientation or positional relationship based on the drawings, are used only for convenience of describing the present invention and simplifying the description; they do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation. Such positional terms are therefore merely illustrative and should not be construed as limiting the present invention, and their specific meaning can be understood by those of ordinary skill in the art according to the specific circumstances.
The invention relates to a pedestrian tracking algorithm with an improved object detection network, whose flow chart is shown in FIG. 1. The method specifically comprises the following steps:
Step one, acquiring pedestrian images in pedestrian video frames and sampling them into a frame image set $\{I_1, I_2, \cdots, I_n\}$; detecting each frame image using YOLOX-s and outputting the detection results $\{D_1, D_2, \cdots, D_n\}$, where $D_i$ denotes the detection set of the i-th frame, together with $\{P_1, P_2, \cdots, P_m\}$, where $P_j$ denotes the coordinate position information of each pedestrian, comprising the center coordinates, the aspect ratio, and the accelerations in each direction. The pedestrian images are preprocessed with Mosaic and MixUp data enhancement.
Step two, inputting the picture into the YOLOX-s target detection network, which fuses a CA attention mechanism; the CA attention sits at the backbone output position of YOLOX-s, as shown in FIG. 2, and its input/output flow is as follows: data is input from the backbone output position; each channel is encoded along the horizontal and vertical directions using pooling kernels of size $(H, 1)$ and $(1, W)$, outputting a direction-aware attention feature map $z^h$ of size $C \times H \times 1$ and a direction-aware attention feature map $z^w$ of size $C \times 1 \times W$. $z^h$ and $z^w$ are spliced by Concat, and a 1×1 convolution module generates the intermediate feature map $f \in R^{C/r \times 1 \times (H+W)}$, where $r$ is the channel downsampling ratio. $f$ is then split along the horizontal and vertical directions into $f^h \in R^{C/r \times H}$ and $f^w \in R^{C/r \times W}$, two further 1×1 convolutions adjust $f^h$ and $f^w$ to tensors with the same number of channels as the input $X$, and the Sigmoid activation function yields the attention weights $g^h$ and $g^w$ for the two independent spatial directions; $g^h$ and $g^w$ are expanded to finally obtain an output feature map with stronger characterization information, which is output to the Neck part of YOLOX-s and finally passes through the detection head.
Step three, inputting the output of target detection into the tracker, in which two confidence thresholds are set, a high-score threshold (high_thresh) and a low-score threshold (low_thresh); detection frames above the high-score threshold are high-score detection frames, those between the two thresholds are low-score detection frames, and all pedestrian frames with confidence below the low-score threshold are deleted, finally yielding a set of high-score detection frames and a set of low-score detection frames, the confidence of the high-score detection frames being higher than that of the low-score detection frames.
Step four, predicting the track set by Kalman filtering, with the state update equation

$\hat{x}_k = \hat{x}_k^- + K_k\left(z_k - H\hat{x}_k^-\right)$

where $\hat{x}_k$ denotes the posterior state estimate at time $k$, $\hat{x}_k^-$ denotes the prior estimate, i.e. the optimal prediction based on the previous time, $z_k$ denotes the observed value, $K_k$ is the Kalman gain and $H$ is the observation matrix. The predictions form the prediction frame set $D_t$.
Feature vectors are extracted from the prediction frames and the detection frames respectively using a ReID network model; the image is cropped using the coordinates in $P_j$, the cropped set of pedestrian images (i.e. the pedestrian images under the high-score detection frames) is input into a pedestrian re-identification network model to obtain the pedestrian appearance features (the appearance features under the high-score detection frames), these are compared with the appearance features of pedestrians that have appeared in the images before, and the feature-vector similarity $d^{(1)}(i, j)$ is calculated. The similarity between the prediction frames and the detection frames is then calculated with the GIoU formula

$GIoU = IoU - \frac{|C \setminus (A \cup B)|}{|C|}$

where $IoU$ is the intersection-over-union, $A$ and $B$ are the prediction frame and the detection frame, and $C$ is their smallest enclosing box; this yields the similarity $d^{(2)}(i, j)$, and according to the formula $c_{i,j} = \mu d^{(1)}(i,j) + (1-\mu)d^{(2)}(i,j)$ the hyper-parameter $\mu$ is set to obtain the final similarity $c$.
Similarity matching is then performed between the high-score detection frames and the Kalman filtering prediction frames, the matching measure being the final similarity c obtained from the ReID feature metric and GIoU. The ReID module compares pedestrians in the pedestrian track library with pedestrians under the high-score detection frames, extracts the appearance feature distance of the pedestrians under the high-score detection frames so as to judge whether they are the same pedestrian, and updates the pedestrian information in the pedestrian track library. The pedestrian track library comprises pedestrian appearance features and pedestrian positions. GIoU considers the non-overlapping part of the detection frame and the prediction frame, which IoU does not, and reflects both the overlapping manner and the overlapping degree of the two frames. According to the similarity c, the tracks are matched using the Hungarian algorithm; for detection frames that match no track but have a high score, a new track is opened and updated by Kalman filtering; detection frames successfully matched to tracks update the track set by Kalman filtering; tracks that fail to match wait for the second similarity matching.
Step five, similarity matching is performed between the low-score detection frames and the still-unmatched tracks, with IoU adopted as the matching measure; after successful matching the track is updated by Kalman filtering, detection frames that fail to match are deleted, and tracks that fail to match are retained for 30 frames and deleted if they cannot be matched to a detection frame again.
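The sketch below isolates the 30-frame retention rule of step five; the track fields are illustrative assumptions consistent with the association loop sketched earlier:

```python
MAX_LOST = 30   # frames a lost track is retained before deletion

def prune_tracks(tracks, matched):
    """Keep matched tracks alive, age unmatched ones, delete stale ones.
    `matched` is the set of track objects re-matched in the current frame."""
    alive = []
    for t in tracks:
        if t in matched:
            t.lost = 0               # the target reappeared and was re-matched
        else:
            t.lost += 1              # still occluded or missing this frame
        if t.lost <= MAX_LOST:
            alive.append(t)          # otherwise the track is deleted
    return alive
```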
Examples: multi-target tracking experiment
The model is trained on the MOT17 and CrowdHuman datasets and validated on half of the MOT17 test set. The online pedestrian tracking algorithm based on the improved target detection network uses Mosaic and MixUp for data enhancement, adopts a cosine annealing strategy for dynamically updating the learning rate, and uses FP16 mixed precision to accelerate convergence. The experimental data are shown in Table 1.
TABLE 1
Compared with other methods, the method greatly improves precision, with a low ID-switch frequency (high IDF1) and good real-time performance (high FPS), which fully demonstrates that the method not only improves multi-target tracking precision but also effectively controls the influence of missed targets on the experimental results.
FIG. 3 compares the experimental results and detection-frame results on a section of verification video: (a) and (b) are screenshots of the DeepSORT algorithm, where a target is falsely detected because of the background and position information is missing; (c) and (d) are the experimental results of the invention, where the dummy is no longer falsely detected as a real person. From (b) and (d) it can be seen that the invention effectively tracks the target when the target pedestrian is occluded or small; even after occlusion the same target is matched again in subsequent frames, showing good robustness to occlusion.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.

Claims (9)

1. An online pedestrian tracking method based on an improved target detection network, characterized in that the method comprises the following steps:
S1: acquiring pedestrian images in the pedestrian video frames and preprocessing them;
S2: inputting the picture into a YOLOX target detection network, wherein a CA attention module is fused at the backbone output position of the YOLOX target detection network;
S3: inputting the output of target detection into a tracker to obtain high-score detection frames and low-score detection frames, the confidence of the high-score detection frames being higher than that of the low-score detection frames;
S4: performing similarity matching between the high-score detection frames and the Kalman filtering prediction frames; for a detection frame that matches no track but has a high score, a new track is opened and updated by Kalman filtering; successfully matched detection frames update the track set by Kalman filtering; tracks that fail to match wait for the second similarity matching;
S5: performing similarity matching between the low-score detection frames and the still-unmatched tracks, with IoU as the matching measure; successfully matched tracks are updated by Kalman filtering, detection frames that fail to match are deleted, and tracks that fail to match are retained for a certain time and then deleted if they cannot be matched to a detection frame again.
2. The improved object detection network-based online pedestrian tracking method of claim 1, wherein: step S1, acquiring a pedestrian image in a pedestrian video frame, and preprocessing the pedestrian image, specifically including:
S11: acquiring pedestrian images in pedestrian video frames and sampling them into a frame image set $\{I_1, I_2, \cdots, I_n\}$; detecting each frame image using YOLOX-s and outputting the detection sets $\{D_1, D_2, \cdots, D_n\}$ of frames 1 to n, together with the coordinate position information $\{P_1, P_2, \cdots, P_m\}$ of pedestrians 1 to m, comprising the center coordinates, the aspect ratio, and the accelerations in each direction;
s12: preprocessing pedestrian images, and adopting Mosaic data enhancement and MixUp data enhancement.
3. The improved object detection network-based online pedestrian tracking method of claim 1, wherein: in step S2, the CA attention module input/output flow includes:
S21: data is input from the backbone output position; each channel is encoded along the horizontal and vertical directions using pooling kernels of size $(H, 1)$ and $(1, W)$, outputting a direction-aware attention feature map $z^h$ of size $C \times H \times 1$ and a direction-aware attention feature map $z^w$ of size $C \times 1 \times W$;
S22: $z^h$ and $z^w$ are spliced by Concat, and a 1×1 convolution module generates the intermediate feature map $f \in R^{C/r \times 1 \times (H+W)}$, where $r$ is the channel downsampling ratio;
S23: $f$ is then split along the horizontal and vertical directions into $f^h \in R^{C/r \times H}$ and $f^w \in R^{C/r \times W}$, and two further 1×1 convolutions adjust $f^h$ and $f^w$ to tensors with the same number of channels as the input $X$;
S24: the Sigmoid activation function then yields the attention weights $g^h$ and $g^w$ for the two independent spatial directions; $g^h$ and $g^w$ are expanded to finally obtain an output feature map with stronger characterization information, which is output to the Neck part of YOLOX-s and finally passes through the detection head.
4. The improved object detection network-based online pedestrian tracking method of claim 1, wherein: step S3 specifically comprises: inputting the output of target detection into a tracker in which two confidence thresholds are set, a high-score threshold and a low-score threshold; detection frames above the high-score threshold are high-score detection frames, those lying between the high-score and low-score thresholds are low-score detection frames, and all pedestrian frames with confidence below the low-score threshold are deleted, finally yielding a set of high-score detection frames and a set of low-score detection frames.
5. The improved object detection network-based online pedestrian tracking method of claim 1, wherein: the Kalman filtering prediction frame in step S4 is obtained by predicting the track set through Kalman filtering, with the state update equation

$\hat{x}_k = \hat{x}_k^- + K_k\left(z_k - H\hat{x}_k^-\right)$

where $\hat{x}_k$ denotes the posterior state estimate at time $k$, $\hat{x}_k^-$ denotes the prior estimate, i.e. the optimal prediction based on the previous time, $z_k$ denotes the observed value, $K_k$ is the Kalman gain and $H$ is the observation matrix; the predictions form the prediction frame set $D_t$.
6. The improved object detection network-based online pedestrian tracking method of claim 5, wherein: in step S4, the similarity matching between the high-score detection frames and the Kalman filtering prediction frames is performed by obtaining a final similarity c through the ReID module and GIoU and matching the tracks using the Hungarian algorithm;
the ReID module compares pedestrians in the pedestrian track library with pedestrians under the high-score detection frames, extracts the appearance feature distance of the pedestrians under the high-score detection frames so as to judge whether they are the same pedestrian, and updates the pedestrian information in the pedestrian track library; the pedestrian track library comprises pedestrian appearance features and pedestrian positions;
the GIoU considers the non-overlapping part of the detected frame and the predicted frame and reflects the overlapping mode and the overlapping degree of the detected frame and the predicted frame.
7. The improved object detection network-based online pedestrian tracking method of claim 6, wherein: the ReID module extracts feature vectors from the prediction frames and the detection frames respectively using a ReID network model; the image is cropped using the coordinates in $P_j$, the cropped set of pedestrian images under the high-score detection frames is input into a pedestrian re-identification network model to obtain the appearance features of the pedestrians under the high-score detection frames, these are compared with the appearance features of pedestrians that have appeared in the images before, and the feature-vector similarity $d^{(1)}(i, j)$ is calculated.
8. The improved object detection network-based online pedestrian tracking method of claim 7, wherein: the GIoU similarity between a prediction frame and a detection frame is calculated as

$GIoU = IoU - \frac{|C \setminus (A \cup B)|}{|C|}$

where $IoU$ is the intersection-over-union of the prediction frame $A$ and the detection frame $B$, and $C$ is their smallest enclosing box; this yields the similarity $d^{(2)}(i, j)$, and according to the formula $c_{i,j} = \mu d^{(1)}(i,j) + (1-\mu)d^{(2)}(i,j)$ the hyper-parameter $\mu$ is set to obtain the final similarity $c$.
9. The improved object detection network-based online pedestrian tracking method of claim 7, wherein: matching the tracks using the Hungarian algorithm specifically comprises:
initializing a bipartite graph and determining, from the input cost matrix, which previous-frame targets and current-frame targets may match; letting U be the previous-frame set and V the current-frame set, matching proceeds in ID order:
first, target 1 of the current frame is matched, say to target 1 of the previous frame; then target 2 is matched, then target 3; if previous-frame target 3 is already matched by target 1 or 2, the target in U holding it is re-matched to another target, and if target 1 in U is already matched to target 2 in V, target 2 in U is re-matched, and so on for targets 1, 2 and 3 in U; the remaining targets are then matched in the same way, and any target in V finally left unmatched is treated as a new target.
CN202310327267.0A 2023-03-30 2023-03-30 Online pedestrian tracking method based on improved target detection network Pending CN116645402A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310327267.0A CN116645402A (en) 2023-03-30 2023-03-30 Online pedestrian tracking method based on improved target detection network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310327267.0A CN116645402A (en) 2023-03-30 2023-03-30 Online pedestrian tracking method based on improved target detection network

Publications (1)

Publication Number Publication Date
CN116645402A true CN116645402A (en) 2023-08-25

Family

ID=87617593

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310327267.0A Pending CN116645402A (en) 2023-03-30 2023-03-30 Online pedestrian tracking method based on improved target detection network

Country Status (1)

Country Link
CN (1) CN116645402A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116935446A (en) * 2023-09-12 2023-10-24 深圳须弥云图空间科技有限公司 Pedestrian re-recognition method and device, electronic equipment and storage medium
CN116935446B (en) * 2023-09-12 2024-02-20 深圳须弥云图空间科技有限公司 Pedestrian re-recognition method and device, electronic equipment and storage medium
CN117058139A (en) * 2023-10-11 2023-11-14 苏州凌影云诺医疗科技有限公司 Lower digestive tract focus tracking and key focus selecting method and system
CN117058139B (en) * 2023-10-11 2024-01-26 苏州凌影云诺医疗科技有限公司 Lower digestive tract focus tracking and key focus selecting method and system
CN117522924A (en) * 2023-11-22 2024-02-06 重庆大学 Depth-associated multi-target tracking method based on detection positioning confidence level guidance
CN117576165A (en) * 2024-01-15 2024-02-20 武汉理工大学 Ship multi-target tracking method and device, electronic equipment and storage medium
CN117576764A (en) * 2024-01-15 2024-02-20 四川大学 Video irrelevant person automatic identification method based on multi-target tracking
CN117576764B (en) * 2024-01-15 2024-04-16 四川大学 Video irrelevant person automatic identification method based on multi-target tracking
CN117576165B (en) * 2024-01-15 2024-04-19 武汉理工大学 Ship multi-target tracking method and device, electronic equipment and storage medium
CN117636402A (en) * 2024-01-23 2024-03-01 广州市德赛西威智慧交通技术有限公司 Pedestrian re-identification-based passenger analysis method and device and computer storage medium
CN117649430A (en) * 2024-01-29 2024-03-05 中国石油大学(华东) Multi-target tracking method based on Kalman filtering and correlation matching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination