CN117237411A - Pedestrian multi-target tracking method based on deep learning

Pedestrian multi-target tracking method based on deep learning

Info

Publication number
CN117237411A
CN117237411A
Authority
CN
China
Prior art keywords
pedestrian
module
fusion
target
track
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311210569.6A
Other languages
Chinese (zh)
Inventor
刘李漫
杨光
田金山
韩逸飞
罗官生
潘宁
胡怀飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South Central Minzu University
Original Assignee
South Central University for Nationalities
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South Central University for Nationalities filed Critical South Central University for Nationalities
Priority to CN202311210569.6A
Publication of CN117237411A
Legal status: Pending

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems


Abstract

The invention provides a pedestrian multi-target tracking method based on deep learning, which relates to the technical field of computer vision and comprises the following steps: S1, acquiring video data, extracting frames from the video data to convert it into image frames, and preprocessing the image frames; S2, performing pedestrian target detection on the image frames by using the improved YOLOv5-s model to obtain pedestrian targets; and S3, performing target tracking on the pedestrian targets of each frame by using the improved StrongSORT model, assigning an ID to each tracked pedestrian, and generating a pedestrian tracking result. The invention can accurately locate and track each pedestrian in the video.

Description

Pedestrian multi-target tracking method based on deep learning
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a pedestrian multi-target tracking method based on deep learning.
Background
Multi-target tracking is an important task in the field of computer vision. Its main task is to detect a number of specific targets in a video and assign each an ID for trajectory tracking, without knowing the number of targets in advance. Different targets have different IDs, which enables subsequent work such as trajectory prediction and accurate retrieval. Pedestrian multi-target tracking involves real-time tracking of multiple pedestrian targets in complex environments and has significant social and technical relevance. Fields such as urban traffic, security monitoring, intelligent transportation systems and crowd management have a growing demand for accurate and efficient pedestrian multi-target tracking technology.
Depending on the algorithm structure, multi-target tracking algorithms can be broadly divided into three categories: algorithms that separate detection and tracking, algorithms based on joint detection and tracking, and algorithms based on an attention mechanism.
Multi-target tracking algorithms that separate detection and tracking are built on the tight coupling of target detection and target association, and aim to accurately track multiple targets in complex scenes. First, for each frame of the video sequence, the algorithm analyzes the image using a target detection technique to obtain the position information and corresponding bounding boxes of all targets in the current frame, and crops the targets according to the bounding boxes to obtain all targets in the image. The problem is then converted into a target association problem: a similarity matrix is constructed from the intersection over union (IoU), appearance features and the like, and solved with the Hungarian algorithm, a greedy algorithm or similar, achieving optimal matching for target association. SORT (Simple Online and Realtime Tracking) is a simple online real-time tracking method that first incorporated a Kalman filter to predict the position of a target in a future frame from its position in the current frame. Thanks to its simplicity, SORT achieves high-speed inference and reduces the interference of similar targets on the tracking model, but it has limitations when dealing with fast-moving targets and long-term occlusion. DeepSORT improves on SORT by introducing deep-learning feature representations and a stronger target association scheme, effectively reducing the number of target identity switches and successfully alleviating the target re-identification problem.
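As a concrete illustration of this detection-then-association paradigm, the following is a minimal Python sketch of the IoU-based association step shared by SORT-style trackers, solved with the Hungarian algorithm via SciPy's linear_sum_assignment; the (x1, y1, x2, y2) box format and the 0.3 threshold are illustrative assumptions, not values taken from any of the cited methods.

```python
# Minimal sketch of SORT-style detection-to-track association (illustrative).
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def associate(track_boxes, det_boxes, iou_thresh=0.3):
    """Hungarian matching on a (1 - IoU) cost matrix."""
    if not track_boxes or not det_boxes:
        return [], list(range(len(track_boxes))), list(range(len(det_boxes)))
    cost = np.array([[1.0 - iou(t, d) for d in det_boxes] for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)           # optimal assignment
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1.0 - iou_thresh]
    um_t = [t for t in range(len(track_boxes)) if t not in {r for r, _ in matches}]
    um_d = [d for d in range(len(det_boxes)) if d not in {c for _, c in matches}]
    return matches, um_t, um_d                          # matched pairs + leftovers
```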
Multi-target tracking algorithms based on joint detection and tracking complete target detection and feature extraction simultaneously in a single network by adding a feature extraction branch to the detection network. JDE integrates target detection and embedding learning in the same network, which avoids redundant computation; at comparable accuracy, the JDE framework is significantly faster and enables real-time multi-target tracking. FairMOT learns the detection task and the re-identification task fairly, using two homogeneous branches to predict pixel-level target scores and appearance features; the fairness achieved between the tasks allows it to attain high detection and tracking accuracy.
Attention-based multi-target tracking algorithms have in recent years applied attention architectures such as the Transformer, which has surged in popularity in the field of computer vision, to multi-target tracking. Peize et al. constructed TransTrack, which uses a Transformer architecture to generate two sets of bounding boxes from two types of queries (object queries and track queries) and determines the final set of tracking boxes for each target by simple IoU matching, fully following the tracking-by-detection paradigm. In addition, it uses the Transformer query-key mechanism to track with predictions informed by prior detection knowledge. Daitao et al. insert a lightweight Transformer attention layer into the pyramid network, adding only a small amount of computation.
The invention patent with Chinese application number 202310529165.7 discloses a multi-target tracking method and system which tracks targets detected by point-cloud deep learning, filters the position and angle of each target, and associates targets in adjacent frames by similarity matching to stabilize the target output state and improve the target detection rate; meanwhile, a high/low-threshold algorithm is applied according to the target score, and the position, angle, speed and other information of new targets satisfying the high-threshold policy are added to the constructed tracking-target information table before tracking, reducing false detections of targets. This prior art addresses the false-detection case but does not handle missed detections well.
Disclosure of Invention
In view of the above, the invention provides a pedestrian multi-target tracking method based on deep learning, which mainly solves the problem that pedestrian tracking is lost or incorrect when a pedestrian is occluded by other objects or other pedestrians during pedestrian multi-target tracking.
The technical scheme of the invention is realized as follows:
the invention provides a pedestrian multi-target tracking method based on deep learning, which comprises the following steps:
S1, acquiring video data, extracting frames from the video data to convert it into image frames, and preprocessing the image frames;
S2, performing pedestrian target detection on the image frames by using the improved YOLOv5-s model to obtain pedestrian targets;
and S3, performing target tracking on the pedestrian targets of each frame by using the improved StrongSORT model, assigning an ID to each tracked pedestrian, and generating a pedestrian tracking result.
Based on the above technical solution, preferably, step S2 includes:
S21, constructing a YOLOv5-s model, wherein the model comprises an input layer, a backbone network, a feature fusion network and a prediction network;
S22, inputting the image frames into the YOLOv5-s model, wherein the backbone network extracts features from the image frames to obtain semantic features of different scales;
S23, inputting the semantic features of different scales into the feature fusion network for fusion to obtain fusion features;
S24, converting the fusion features into a target detection result, namely pedestrian targets, via the prediction network.
Based on the technical scheme, the network structure of the YOLOv5-s model is preferably as follows:
the backbone network comprises a first extraction module, a second extraction module and a third extraction module, which respectively extract semantic features of a first scale, semantic features of a second scale and semantic features of a third scale; the first extraction module is composed of three CBS modules and two C3-DCN modules, the second extraction module is composed of a CBS module and a C3-DCN module, and the third extraction module is composed of a CBS module, a C3-DCN module, a pyramid pooling layer and a convolutional attention module;
the feature fusion network comprises four CBS modules, two up-sampling modules, four C3-DCN modules and four fusion modules, and performs branched fusion on the semantic features of different scales, obtaining three fusion features along three fusion routes;
the prediction network comprises three convolution modules, which detect pedestrian targets from the three fusion features respectively.
On the basis of the above technical solution, preferably, the three fusion routes are a first fusion route, a second fusion route and a third fusion route, and the structure of the feature fusion network along the data flow direction is defined as: a first CBS module, a first up-sampling module, a first fusion module, a first C3-DCN module, a second CBS module, a second up-sampling module, a second fusion module, a second C3-DCN module, a third CBS module, a third fusion module, a third C3-DCN module, a fourth CBS module, a fourth fusion module and a fourth C3-DCN module;
the first fusion route is:
sequentially inputting the semantic features of the third scale into the first CBS module and the first up-sampling module, fusing them with the semantic features of the second scale in the first fusion module, and inputting the result into the first C3-DCN module and the second CBS module to obtain a first intermediate feature; inputting the first intermediate feature into the second up-sampling module, fusing it with the semantic features of the first scale in the second fusion module, and inputting the result into the second C3-DCN module to obtain a first fusion feature;
the second fusion route is:
inputting the first fusion feature into the third CBS module, fusing it with the first intermediate feature in the third fusion module, and inputting the result into the third C3-DCN module to obtain a second fusion feature;
the third fusion route is:
inputting the second fusion feature into the fourth CBS module, fusing it with the semantic features of the third scale in the fourth fusion module, and inputting the result into the fourth C3-DCN module to obtain a third fusion feature.
On the basis of the above technical solution, preferably, the semantic features of the third scale are features enhanced with channel attention and spatial attention, wherein:
the channel attention calculation formula is:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(μ_1(μ_0(F_avg^c)) + μ_1(μ_0(F_max^c)))
wherein M_c(F) is the semantic feature of the third scale with channel attention added, σ denotes the sigmoid function, MLP is a multi-layer perceptron, AvgPool is average pooling, MaxPool is maximum pooling, μ_1 and μ_0 are the weight coefficients of the multi-layer perceptron, F is the original third-scale semantic feature, F_avg^c denotes the feature obtained by the average pooling operation, and F_max^c denotes the feature obtained by the maximum pooling operation;
the spatial attention calculation formula is:
M_s(F) = σ(f^(7×7)([AvgPool(F); MaxPool(F)])) = σ(f^(7×7)([F_avg^s; F_max^s]))
wherein M_s(F) is the semantic feature of the third scale with spatial attention added, σ denotes the sigmoid function, AvgPool is average pooling, MaxPool is maximum pooling, f^(7×7) denotes a convolution operation with a 7×7 filter, F is the original third-scale semantic feature, F_avg^s denotes the feature obtained by average pooling along the channel direction, and F_max^s denotes the feature obtained by maximum pooling along the channel direction.
Based on the above technical solution, preferably, the YOLOv5-s model is a pre-trained model, and the loss function of the YOLOv5-s model during pre-training is:
L_WIoU = r · R_WIoU · L_IoU
R_WIoU = exp(((x − x^gt)² + (y − y^gt)²) / ((W_g² + H_g²)*))
L_IoU = 1 − (W_j · H_j) / S_u
S_u = w·h + w^gt·h^gt − W_j·H_j
wherein L_WIoU is the loss function, L_IoU is the intersection-over-union loss, R_WIoU is the penalty term, r is the non-monotonic focusing coefficient, x, y denote the center coordinates of the prediction frame, x^gt, y^gt denote the center coordinates of the real frame, W_g, H_g denote the width and height of the smallest enclosing frame containing both the predicted frame and the real frame, the superscript * denotes separation from the computational graph, W_j, H_j denote the width and height of the overlapping region between the real frame and the predicted frame, S_u denotes the area of the union of the predicted frame and the real frame, w, h denote the width and height of the predicted frame, and w^gt, h^gt denote the width and height of the real frame.
Based on the above technical solution, preferably, step S3 includes:
S31, initializing a predicted track for each pedestrian target by means of a Kalman filter, the track including the pedestrian's position, speed and related information;
S32, matching the predicted tracks with the pedestrian targets in the current image frame using the Hungarian algorithm to obtain matched pairs, and assigning a unique ID to the tracked pedestrian target of each matched pair;
S33, for each tracked pedestrian target, updating the pedestrian's track with the Kalman filter according to the historical track and the pedestrian features of the current image frame, including predicting the pedestrian's position, updating the state estimate and updating the covariance matrix;
and S34, outputting the pedestrian tracking result of the StrongSORT model after the tracking stop condition is reached.
Based on the above technical solution, preferably, in step S33, when the track of each pedestrian is updated using the Kalman filter, the appearance state e_i^t of the i-th track at image frame t is updated by means of a dynamic EMA:
e_i^t = α · e_i^(t−1) + (1 − α) · f_i^t
α = α_min + (α_max − α_min) × s_d
wherein e_i^t is the appearance state of the i-th track at image frame t, f_i^t is the appearance embedding of the currently matched detection, α is the dynamic weight, α_min and α_max are constants, and s_d is the confidence of the detection frame.
On the basis of the above technical solution, preferably, in step S33, when the track of each pedestrian is updated using the Kalman filter, the tracking state is updated with adaptive noise, and the adaptive noise covariance formula is:
R̃_k = (1 − c_k) · R_k
wherein R̃_k is the adaptive noise covariance, R_k is a preset constant noise covariance, and c_k is the confidence score of the detection in state k; when the noise is low, the detection score c_k is higher, resulting in a lower R̃_k.
On the basis of the above technical solution, preferably, step S32 includes:
S321, if a pedestrian target in the current image frame is successfully matched with a predicted track, a matched pair is formed, and step S33 is executed;
S322, if the matching between a pedestrian target and a predicted track in the current image frame fails, secondary matching is performed; if the secondary matching succeeds, a matched pair is formed and step S33 is executed, and if the secondary matching fails, go to step S323;
S323, judging whether the secondary matching failure is a pedestrian target matching failure or a predicted track tracking failure; if it is a pedestrian target matching failure, go to step S324, and if it is a predicted track tracking failure, go to step S325;
S324, creating a new predicted track for the pedestrian target that failed to match, setting the new predicted track as unconfirmed, and verifying the new predicted track three times; if the verification passes, the new predicted track is modified to confirmed to form a matched pair and step S33 is executed, and if the verification fails, the pedestrian target and the new predicted track are deleted;
S325, checking whether the predicted track that failed to match is set as unconfirmed; if it is unconfirmed, the predicted track is deleted, and if it is confirmed, a lifetime is set for the predicted track and step S326 is executed;
S326, examining the predicted track that failed to match within its lifetime; if matching still fails within the lifetime, the predicted track is deleted, and if matching succeeds within the lifetime, a matched pair is formed and step S33 is executed.
Compared with the prior art, the method has the following beneficial effects:
(1) The invention first uses the improved YOLOv5-s model to detect pedestrian targets in image frames quickly and accurately, which helps the system rapidly acquire target position information and provides an accurate initial position for subsequent target tracking; the improved StrongSORT model is then used to track the pedestrian targets of each frame, achieving high-precision target tracking. The model can accurately estimate states such as the target's position, speed and scale from its motion pattern and observation information, and assigns a unique ID to each tracked pedestrian to generate the pedestrian tracking result;
(2) By performing frame extraction on the video data, the invention reduces the number of frames to be processed, improving the real-time performance of the system. Meanwhile, the improved YOLOv5-s and StrongSORT models have high processing speed and can complete detection and tracking tasks in a short time, meeting the requirements of real-time applications; they are also robust and adapt to different scenes and target changes, which means the system can detect and track pedestrians in various complex environments, such as densely crowded places and environments with large illumination changes;
(3) The invention uses the improved YOLOv5-s model to detect pedestrian targets. By introducing the C3-DCN module designed with deformable convolution, the model can better adapt to pedestrian targets of different scales, improving pedestrian detection accuracy; conventional detection models may struggle to detect small-scale or large-scale pedestrians, while the improved model accommodates these variations better. By introducing the CBAM module, the model can generate attention feature map information in both the channel and spatial dimensions and multiply it with the original input feature map for adaptive feature refinement, so the model focuses better on important features in the image and suppresses unnecessary regional responses, improving the detection accuracy of pedestrian targets;
(4) To enhance tracking robustness, adaptive noise is used in the algorithm; it helps the tracking algorithm model and handle the uncertainty and noise in the target tracking process. In practical applications, factors such as the target's motion and the external environment introduce a certain amount of noise and uncertainty, and the adaptive noise automatically adjusts the noise model according to the current tracking situation, adapting better to different tracking scenes and improving the robustness of the tracking algorithm. The adaptive noise can dynamically adjust the weights and variances of the state updates based on the current tracking state and the observed information; by adaptively adjusting the noise model, the target's state can be estimated more accurately, over-fitting and under-fitting are avoided, and the accuracy of the state update is improved;
(5) The matching process of the invention adopts a secondary matching mechanism, a confirmation-and-deletion mechanism and a lifetime mechanism, so tracks that fail to match can be detected in time and either re-verified or deleted, avoiding overly long tracking of invalid tracks; this improves the matching accuracy of pedestrian targets and the track tracking effect, and improves the robustness and accuracy of the system, thereby better completing the pedestrian target tracking task.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a diagram of a network structure of a YOLOv5-s model according to an embodiment of the present invention;
FIG. 3 is a matching flow chart in object tracking according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will clearly and fully describe the technical aspects of the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, are intended to fall within the scope of the present invention.
As shown in fig. 1, the present invention provides a pedestrian multi-target tracking method based on deep learning, including:
S1, acquiring video data, extracting frames from the video data to convert it into image frames, and preprocessing the image frames;
S2, performing pedestrian target detection on the image frames by using the improved YOLOv5-s model to obtain pedestrian targets;
and S3, performing target tracking on the pedestrian targets of each frame by using the improved StrongSORT model, assigning an ID to each tracked pedestrian, and generating a pedestrian tracking result.
Specifically, in an embodiment of the present invention, step S1 includes:
acquiring video data: video data is acquired using a video acquisition device (e.g., a camera) or by reading an existing video file.
Converting the video into image frames by frame extraction: frames are extracted from the video data at a certain frame rate, obtaining each frame of image in the video. Frame extraction and conversion may be performed using a video processing library, such as the functions provided by OpenCV.
Image frame preprocessing: preprocessing the image of each frame, comprising the following steps:
image size adjustment: the size of the image is adjusted to the size of the model input so as to ensure the input requirement of the model.
Image enhancement: the image can be enhanced in terms of brightness, contrast, saturation, etc. to improve image quality.
Image normalization: the pixel values of the image are normalized so as to fall within a proper range.
In this embodiment, frame extraction and image preprocessing convert the video data into image frame data that the model can process, preparing for subsequent target detection and tracking; frame extraction reduces the number of frames to be processed, improving the processing speed of the system. Preprocessing optimizes the images, reduces noise and interference, and improves the model's detection and tracking performance. By preprocessing the image frames, the input image data has consistent size and quality, which contributes to stable model operation and reliable results.
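As a rough illustration of step S1, a minimal OpenCV sketch is given below; the frame stride, input size and [0, 1] normalization are assumptions for illustration, not values specified by this embodiment.

```python
# Minimal sketch of frame extraction and preprocessing (illustrative values).
import cv2
import numpy as np

def extract_and_preprocess(video_path, stride=2, size=(640, 640)):
    """Keep every `stride`-th frame, resize to the model input, normalize."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:                                    # end of video
            break
        if index % stride == 0:
            frame = cv2.resize(frame, size)           # match the model input size
            frame = frame.astype(np.float32) / 255.0  # normalize pixel values
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```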
Specifically, in an embodiment of the present invention, step S2 includes:
S21, constructing a YOLOv5-s model, wherein the model comprises an input layer, a backbone network, a feature fusion network and a prediction network;
S22, inputting the image frames into the YOLOv5-s model, wherein the backbone network extracts features from the image frames to obtain semantic features of different scales;
S23, inputting the semantic features of different scales into the feature fusion network for fusion to obtain fusion features;
S24, converting the fusion features into a target detection result, namely pedestrian targets, via the prediction network.
As shown in fig. 2, the network structure of the YOLOv5-s model in this embodiment is:
The backbone network comprises a first extraction module, a second extraction module and a third extraction module, which respectively extract semantic features of a first scale, semantic features of a second scale and semantic features of a third scale; the first extraction module is composed of three CBS modules and two C3-DCN modules, the second extraction module is composed of a CBS module and a C3-DCN module, and the third extraction module is composed of a CBS module, a C3-DCN module, a pyramid pooling layer and a convolutional attention module;
the feature fusion network comprises four CBS modules, two up-sampling modules, four C3-DCN modules and four fusion modules, and performs branched fusion on the semantic features of different scales, obtaining three fusion features along three fusion routes;
the prediction network comprises three convolution modules, which detect pedestrian targets from the three fusion features respectively.
The three fusion routes are a first fusion route, a second fusion route and a third fusion route, and the structure of the feature fusion network along the data flow direction is defined as: a first CBS module, a first up-sampling module, a first fusion module, a first C3-DCN module, a second CBS module, a second up-sampling module, a second fusion module, a second C3-DCN module, a third CBS module, a third fusion module, a third C3-DCN module, a fourth CBS module, a fourth fusion module and a fourth C3-DCN module;
the first fusion route is:
sequentially inputting the semantic features of the third scale into the first CBS module and the first up-sampling module, fusing them with the semantic features of the second scale in the first fusion module, and inputting the result into the first C3-DCN module and the second CBS module to obtain a first intermediate feature; inputting the first intermediate feature into the second up-sampling module, fusing it with the semantic features of the first scale in the second fusion module, and inputting the result into the second C3-DCN module to obtain a first fusion feature;
the second fusion route is:
inputting the first fusion feature into the third CBS module, fusing it with the first intermediate feature in the third fusion module, and inputting the result into the third C3-DCN module to obtain a second fusion feature;
the third fusion route is:
inputting the second fusion feature into the fourth CBS module, fusing it with the semantic features of the third scale in the fourth fusion module, and inputting the result into the fourth C3-DCN module to obtain a third fusion feature.
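The data flow of the three fusion routes described above can be sketched as follows; this is a structural illustration only, assuming CBS and C3-DCN module factories (with matching channel and stride settings, the CBS modules on the bottom-up path downsampling by stride 2) are defined elsewhere, and is not the patent's actual implementation.

```python
# Structural sketch of the three fusion routes (illustrative; cbs/c3dcn are
# factory callables returning modules with compatible channels and strides).
import torch
import torch.nn as nn

class FusionNeck(nn.Module):
    def __init__(self, cbs, c3dcn):
        super().__init__()
        self.cbs1, self.cbs2, self.cbs3, self.cbs4 = cbs(), cbs(), cbs(), cbs()
        self.c3_1, self.c3_2, self.c3_3, self.c3_4 = c3dcn(), c3dcn(), c3dcn(), c3dcn()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, p3, p4, p5):
        # Route 1: top-down path ending in the first fusion feature
        t = self.up(self.cbs1(p5))
        mid = self.cbs2(self.c3_1(torch.cat([t, p4], dim=1)))    # first intermediate feature
        f1 = self.c3_2(torch.cat([self.up(mid), p3], dim=1))     # first fusion feature
        # Route 2: bottom-up step re-using the intermediate feature
        f2 = self.c3_3(torch.cat([self.cbs3(f1), mid], dim=1))   # second fusion feature
        # Route 3: final step fusing back with the third-scale feature
        f3 = self.c3_4(torch.cat([self.cbs4(f2), p5], dim=1))    # third fusion feature
        return f1, f2, f3
```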
In this embodiment, the YOLOv5-s model, an end-to-end target detection model, is adopted for target detection and is improved as follows: 1. a C3-DCN module is designed using deformable convolution to strengthen the extraction of pedestrian features at different scales, and the original C3 module is replaced with the C3-DCN module; 2. a CBAM (convolutional block attention module) is added; the CBAM sequentially generates attention feature map information in the channel and spatial dimensions, multiplies the two kinds of feature map information with the original input feature map for adaptive feature refinement, and generates the final feature map, focusing on important features of the image and suppressing unnecessary regional responses.
Specifically, the CBS module comprises a convolution layer, a BN layer and an activation layer, where the activation layer uses the SiLU activation function; the C3-DCN module comprises three DCBS modules, a Bottleneck module and a Concat layer; the DCBS module replaces the convolution layer with a deformable convolution and is otherwise identical to the CBS module. The Bottleneck module has two variants: 1. BottleneckTrue: a 1×1 CBS module followed by a 3×3 CBS module, finally added to the initial input through a residual connection; 2. BottleneckFalse: a 1×1 CBS module followed by a 3×3 CBS module, without the residual connection. The pyramid pooling layer comprises three 5×5 maximum pooling layers; the convolutional attention module comprises two sub-modules: a channel attention sub-module comprising a maximum pooling layer, an average pooling layer and a shared fully connected layer, and a spatial attention sub-module comprising a maximum pooling layer, an average pooling layer and a convolution layer.
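A minimal PyTorch sketch of the CBS block and its deformable counterpart DCBS is given below, assuming torchvision's DeformConv2d for the deformable convolution; the kernel sizes and the small offset-prediction convolution are illustrative assumptions, not the patent's actual code.

```python
# Minimal sketch of CBS (Conv + BN + SiLU) and DCBS (deformable variant).
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class CBS(nn.Module):
    """Convolution + BatchNorm + SiLU, as described in the text."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class DCBS(nn.Module):
    """CBS with the plain convolution replaced by a deformable convolution."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        # a small convolution predicts the 2*k*k sampling offsets per position
        self.offset = nn.Conv2d(c_in, 2 * k * k, k, s, k // 2)
        self.dconv = DeformConv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.dconv(x, self.offset(x))))
```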
In this embodiment, the CBAM applies channel attention and spatial attention to the last feature map output by the backbone network, so the semantic features of the third scale are expressed as:
F ∈ R^(C×H×W)
where F is the third-scale semantic feature, and C, H and W represent the number of channels, the height and the width of the feature map, respectively.
The CBAM generates a 1D channel attention map M_c and a 2D spatial attention map M_s:
M_c ∈ R^(C×1×1)
M_s ∈ R^(1×H×W)
This process can be described by the following formulas:
F′ = M_c(F) ⊗ F
F″ = M_s(F′) ⊗ F′
where ⊗ denotes element-by-element multiplication. During multiplication, the attention values are broadcast (duplicated) accordingly: channel attention values are broadcast along the spatial dimensions, and vice versa. F″ is the final output.
Each channel of the feature map can be regarded as a feature detector; channel attention focuses on what the useful information in the image is. To compute the channel attention features efficiently, the spatial dimensions of the feature map are compressed, and the object's features are learned using both average pooling and maximum pooling. The channel attention calculation formula is:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(μ_1(μ_0(F_avg^c)) + μ_1(μ_0(F_max^c)))
wherein M_c(F) is the semantic feature of the third scale with channel attention added, σ denotes the sigmoid function, MLP is a multi-layer perceptron, AvgPool is average pooling, MaxPool is maximum pooling, μ_1 and μ_0 are the weight coefficients of the multi-layer perceptron, F is the original third-scale semantic feature, F_avg^c denotes the feature obtained by the average pooling operation, and F_max^c denotes the feature obtained by the maximum pooling operation.
Unlike channel attention, spatial attention focuses on where the effective information in the feature map is. To compute spatial attention, average pooling and maximum pooling are first applied along the channel dimension, and the feature maps they produce are concatenated. A convolution operation is then applied to the concatenated feature map to generate the final spatial attention feature map. The process is formulated as:
M_s(F) = σ(f^(7×7)([AvgPool(F); MaxPool(F)])) = σ(f^(7×7)([F_avg^s; F_max^s]))
wherein M_s(F) is the semantic feature of the third scale with spatial attention added, σ denotes the sigmoid function, AvgPool is average pooling, MaxPool is maximum pooling, f^(7×7) denotes a convolution operation with a 7×7 filter, F is the original third-scale semantic feature, F_avg^s denotes the feature obtained by average pooling along the channel direction, and F_max^s denotes the feature obtained by maximum pooling along the channel direction.
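The two attention formulas above can be sketched in PyTorch as follows; the reduction ratio r=16 in the shared MLP is a common CBAM default assumed here, not a value stated in this embodiment.

```python
# Minimal sketch of the CBAM described above (reduction ratio is assumed).
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(                     # shared MLP (mu_0, mu_1)
            nn.Conv2d(channels, channels // r, 1, bias=False),
            nn.ReLU(),
            nn.Conv2d(channels // r, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))  # F_avg^c branch
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))   # F_max^c branch
        return torch.sigmoid(avg + mx)                # M_c: (N, C, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)  # f^(7x7)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)      # channel-wise average pool
        mx, _ = torch.max(x, dim=1, keepdim=True)     # channel-wise max pool
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_s: (N, 1, H, W)

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, f):
        f = self.ca(f) * f      # F' = M_c(F) (x) F
        return self.sa(f) * f   # F'' = M_s(F') (x) F'
```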
The YOLOv5-s model in this embodiment is a pre-trained model. The network of the YOLOv5-s model extracts features from a sample picture, and the extracted feature map is passed through the CBAM to output an attention feature map; meanwhile, the picture is divided into cells and anchor frames are generated, the labeled prediction frames are associated with the feature map, and finally a loss function is established and end-to-end training begins. YOLOv5-s originally measures the rectangular-frame loss with CIoU; to improve the overall performance of the model, it is replaced here with WIoU, whose loss function is:
L_WIoU = r · R_WIoU · L_IoU
R_WIoU = exp(((x − x^gt)² + (y − y^gt)²) / ((W_g² + H_g²)*))
L_IoU = 1 − (W_j · H_j) / S_u
S_u = w·h + w^gt·h^gt − W_j·H_j
wherein L_WIoU is the loss function, L_IoU is the intersection-over-union loss, R_WIoU is the penalty term, r is the non-monotonic focusing coefficient, x, y denote the center coordinates of the prediction frame, x^gt, y^gt denote the center coordinates of the real frame, W_g, H_g denote the width and height of the smallest enclosing frame containing both the predicted frame and the real frame, the superscript * denotes separation from the computational graph, W_j, H_j denote the width and height of the overlapping region between the real frame and the predicted frame, S_u denotes the area of the union of the predicted frame and the real frame, w, h denote the width and height of the predicted frame, and w^gt, h^gt denote the width and height of the real frame.
Here, r is obtained from β, and its calculation formula is:
r = β / (δ · α^(β−δ))
wherein α and δ are weight parameters, and β is the outlier degree calculated from the monotonic focusing coefficient of the original YOLO model and its moving average with momentum m:
β = L_IoU* / L̄_IoU
wherein L_IoU* is the monotonic focusing coefficient in the original YOLO model, and L̄_IoU is its moving average with momentum m.
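Putting the above formulas together, a minimal sketch of the WIoU loss for center-format boxes might look as follows; the hyperparameter values (α=1.9, δ=3, momentum m=0.01) and the running-mean bookkeeping are assumptions for illustration, not values stated in this embodiment.

```python
# Minimal sketch of the WIoU loss reconstructed from the formulas above;
# pred, gt: (N, 4) boxes as (cx, cy, w, h); mean_liou: running mean of L_IoU.
import torch

def wiou_loss(pred, gt, mean_liou, alpha=1.9, delta=3.0, m=0.01):
    px1, py1 = pred[:, 0] - pred[:, 2] / 2, pred[:, 1] - pred[:, 3] / 2
    px2, py2 = pred[:, 0] + pred[:, 2] / 2, pred[:, 1] + pred[:, 3] / 2
    gx1, gy1 = gt[:, 0] - gt[:, 2] / 2, gt[:, 1] - gt[:, 3] / 2
    gx2, gy2 = gt[:, 0] + gt[:, 2] / 2, gt[:, 1] + gt[:, 3] / 2
    # overlap (W_j, H_j) and union area S_u
    wj = (torch.min(px2, gx2) - torch.max(px1, gx1)).clamp(min=0)
    hj = (torch.min(py2, gy2) - torch.max(py1, gy1)).clamp(min=0)
    s_u = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - wj * hj
    l_iou = 1.0 - wj * hj / s_u
    # penalty term R_WIoU; the enclosing box (W_g, H_g) is detached (*)
    wg = torch.max(px2, gx2) - torch.min(px1, gx1)
    hg = torch.max(py2, gy2) - torch.min(py1, gy1)
    r_wiou = torch.exp(((pred[:, 0] - gt[:, 0]) ** 2 + (pred[:, 1] - gt[:, 1]) ** 2)
                       / (wg ** 2 + hg ** 2).detach())
    # non-monotonic focusing coefficient r from the outlier degree beta
    beta = l_iou.detach() / mean_liou
    r = beta / (delta * alpha ** (beta - delta))
    mean_liou = (1 - m) * mean_liou + m * l_iou.detach().mean()  # update running mean
    return (r * r_wiou * l_iou).mean(), mean_liou
```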
In summary, this embodiment uses the improved YOLOv5-s model to detect pedestrian targets. By introducing the C3-DCN module designed with deformable convolution, the model can better adapt to pedestrian targets of different scales, improving pedestrian detection accuracy; conventional detection models may struggle to detect small-scale or large-scale pedestrians, while the improved model accommodates these variations better. By introducing the CBAM module, the model can generate attention feature map information in both the channel and spatial dimensions and multiply it with the original input feature map for adaptive feature refinement, enabling the model to focus better on important features in the image and suppress unnecessary regional responses, thereby improving the detection accuracy of pedestrian targets. Overall, the improved YOLOv5-s model of this embodiment improves detection accuracy and robustness: it adapts better to pedestrian targets of different scales and focuses on important image features, reducing false and missed detections, which is highly beneficial for pedestrian target detection tasks and improves the system's performance and reliability in real-world scenarios.
Specifically, in an embodiment of the present invention, step S3 includes:
S31, initializing a predicted track for each pedestrian target by means of a Kalman filter, the track including the pedestrian's position, speed and related information;
S32, matching the predicted tracks with the pedestrian targets in the current image frame using the Hungarian algorithm to obtain matched pairs, and assigning a unique ID to the tracked pedestrian target of each matched pair;
S33, for each tracked pedestrian target, updating the pedestrian's track with the Kalman filter according to the historical track and the pedestrian features of the current image frame, including predicting the pedestrian's position, updating the state estimate and updating the covariance matrix;
and S34, outputting the pedestrian tracking result of the StrongSORT model after the tracking stop condition is reached.
StrongSORT is an algorithmic model for target tracking. It is an improved version of the DeepSORT algorithm, aiming to improve the accuracy and robustness of target tracking. StrongSORT uses multi-feature fusion to improve tracking performance: by fusing multiple features (in this embodiment, the extracted appearance features and motion features), it can describe and distinguish different targets better, which reduces confusion among targets and improves recognition and tracking accuracy.
This embodiment modifies the Kalman filtering in StrongSORT, using width and height instead of the previously used aspect ratio, which improves how well the bounding box fits the object in width, so the generated detection boxes fit the target better. In order to incorporate appearance information into the track model selectively, only when it is of high quality, this embodiment proposes a dynamically weighted EMA (exponential moving average) to replace the previous EMA.
Specifically, in step S33, when the track of each pedestrian is updated using the Kalman filter, the appearance state e_i^t of the i-th track at image frame t is updated by means of the dynamic EMA:
e_i^t = α · e_i^(t−1) + (1 − α) · f_i^t
α = α_min + (α_max − α_min) × s_d
wherein e_i^t is the appearance state of the i-th track at image frame t, f_i^t is the appearance embedding of the currently matched detection, α is a dynamic weight generated from the confidence of the detection frame through the formula above, so that only high-quality appearance embeddings are accepted, α_min and α_max are constants, and s_d is the confidence of the detection frame.
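A minimal sketch of this dynamic EMA update is given below; the α_min and α_max values are illustrative assumptions, and the final normalization is a common practice for appearance embeddings rather than something stated here.

```python
# Minimal sketch of the confidence-weighted EMA appearance update.
import numpy as np

def update_appearance(e_prev, f_det, s_d, alpha_min=0.8, alpha_max=0.95):
    """e_prev: track appearance e_i^(t-1); f_det: matched detection embedding;
    s_d: detection confidence in [0, 1]."""
    alpha = alpha_min + (alpha_max - alpha_min) * s_d   # confidence-driven weight
    e = alpha * e_prev + (1.0 - alpha) * f_det          # e_i^t
    return e / (np.linalg.norm(e) + 1e-12)              # keep the embedding normalized
```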
In addition, when the Kalman filter is used to update the track of each pedestrian, the tracking state is updated with adaptive noise, and the adaptive noise covariance formula is:
R̃_k = (1 − c_k) · R_k
wherein R̃_k is the adaptive noise covariance, R_k is a preset constant noise covariance, and c_k is the confidence score of the detection in state k; when the noise is low, the detection score c_k is higher, resulting in a lower R̃_k.
A lower R̃_k means that the detection will have a higher weight in the state update step, and vice versa. This helps improve the accuracy of the updated state.
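A minimal sketch of a Kalman measurement update with this confidence-scaled noise covariance might look as follows; the matrix shapes are generic and the function layout is an illustrative assumption.

```python
# Minimal sketch of a Kalman update with adaptive measurement noise.
import numpy as np

def kalman_update(x, P, z, H, R, c_k):
    """x: state mean; P: state covariance; z: measurement; H: measurement
    matrix; R: constant measurement noise; c_k: detection confidence."""
    R_adaptive = (1.0 - c_k) * R                  # confident detection, lower noise
    S = H @ P @ H.T + R_adaptive                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)                # Kalman gain
    x_new = x + K @ (z - H @ x)                   # updated state estimate
    P_new = (np.eye(P.shape[0]) - K @ H) @ P      # updated covariance
    return x_new, P_new
```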
To enhance tracking robustness, adaptive noise is used in the algorithm; it helps the tracking algorithm model and handle the uncertainty and noise in the target tracking process. In practical applications, factors such as the target's motion and the external environment introduce a certain amount of noise and uncertainty, and the adaptive noise automatically adjusts the noise model according to the current tracking situation, adapting better to different tracking scenes and improving the robustness of the tracking algorithm. The adaptive noise can dynamically adjust the weights and variances of the state updates based on the current tracking state and the observed information; by adaptively adjusting the noise model, the target's state can be estimated more accurately, over-fitting and under-fitting are avoided, and the accuracy of the state update is improved.
Specifically, referring to fig. 3, in one embodiment of the present invention, step S32 includes:
S321, if a pedestrian target in the current image frame is successfully matched with a predicted track, a matched pair is formed, and step S33 is executed;
S322, if the matching between a pedestrian target and a predicted track in the current image frame fails, secondary matching is performed; if the secondary matching succeeds, a matched pair is formed and step S33 is executed, and if the secondary matching fails, go to step S323;
S323, judging whether the secondary matching failure is a pedestrian target matching failure or a predicted track tracking failure; if it is a pedestrian target matching failure, go to step S324, and if it is a predicted track tracking failure, go to step S325;
S324, creating a new predicted track for the pedestrian target that failed to match, setting the new predicted track as unconfirmed, and verifying the new predicted track three times; if the verification passes, the new predicted track is modified to confirmed to form a matched pair and step S33 is executed, and if the verification fails, the pedestrian target and the new predicted track are deleted;
S325, checking whether the predicted track that failed to match is set as unconfirmed; if it is unconfirmed, the predicted track is deleted, and if it is confirmed, a lifetime is set for the predicted track and step S326 is executed;
S326, examining the predicted track that failed to match within its lifetime; if matching still fails within the lifetime, the predicted track is deleted, and if matching succeeds within the lifetime, a matched pair is formed and step S33 is executed.
The following description is given by way of a specific example:
If a detection box is successfully matched with an already confirmed track, an update is performed, and then predict-observe-update is repeated. In the first failure case, IoU matching is performed again on the unmatched tracks and detection boxes; after a successful IoU match, the track is updated and the predict-observe-update flow continues.
IoU matching failures are divided into detection-box matching failures and track tracking failures. For a detection box that still fails to match, the reason may be that a newly appearing target has no previous track, or that its track has been occluded for a long time; a new track is therefore created for the newly appearing target, set as unconfirmed, and examined for three frames, and if it is a real target it is modified to confirmed and predict-observe-update proceeds. For a track that still fails to match, the detector may have missed the target: check whether the track is confirmed; if it is unconfirmed, delete it; otherwise, set a lifetime for it. If it remains unmatched beyond the lifetime (> max_age), it is deleted and the target is considered to have left the scene; if it matches again within the lifetime (< max_age), it is likewise examined for three frames to decide whether it is the tracked target, and then predict-observe-update is performed.
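The confirmation and lifetime bookkeeping described above can be sketched as a small state machine; the three-frame probation and the max_age check follow the text, while the class layout and the default max_age value are illustrative assumptions.

```python
# Schematic sketch of the track confirmation / lifetime mechanism.
class Track:
    def __init__(self, box, max_age=30):
        self.box = box
        self.state = "unconfirmed"   # becomes "confirmed" after 3 straight hits
        self.hits = 0                # consecutive successful matches
        self.misses = 0              # frames since the last match
        self.max_age = max_age

    def on_match(self):
        self.misses = 0
        self.hits += 1
        if self.state == "unconfirmed" and self.hits >= 3:
            self.state = "confirmed"

    def on_miss(self):
        self.misses += 1

    def should_delete(self):
        # unconfirmed tracks die on any miss; confirmed ones get a lifetime
        if self.state == "unconfirmed":
            return self.misses > 0
        return self.misses > self.max_age
```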
The matching process of this embodiment adopts a secondary matching mechanism, a confirmation-and-deletion mechanism and a lifetime mechanism, so tracks that fail to match can be detected in time and either re-verified or deleted, avoiding overly long tracking of invalid tracks; this improves the matching accuracy of pedestrian targets and the track tracking effect, and improves the robustness and accuracy of the system, thereby better completing the pedestrian target tracking task.
When the target tracking algorithm reaches a stop condition, the pedestrian tracking result of the StrongSORT model is output. The stop condition may be:
Time cut-off: target tracking stops when a preset time limit is reached.
Manual stop: when the tracking result is considered satisfactory, tracking is stopped manually or a stop signal is triggered to stop tracking.
The output pedestrian tracking result comprises:
target location and bounding box: the position of the object in each frame and the bounding box information are output for subsequent analysis and processing.
Target track: the trajectory of the output target throughout the tracking process may be a series of location points or a continuous bounding box.
Target ID: each tracked target is assigned a unique ID and the ID is associated with the position and trajectory of the target for subsequent identification and tracking.
And finally, outputting the generated tracking result on a system interface for display, so that a user can view the tracking result on the system interface according to the name of the picture set or the name of the video.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (10)

1. A pedestrian multi-target tracking method based on deep learning, characterized by comprising the following steps:
S1, acquiring video data, extracting frames from the video data to convert it into image frames, and preprocessing the image frames;
S2, performing pedestrian target detection on the image frames by using the improved YOLOv5-s model to obtain pedestrian targets;
and S3, performing target tracking on the pedestrian targets of each frame by using the improved StrongSORT model, assigning an ID to each tracked pedestrian, and generating a pedestrian tracking result.
2. The pedestrian multi-target tracking method based on deep learning of claim 1, wherein step S2 includes:
S21, constructing a YOLOv5-s model, wherein the model comprises an input layer, a backbone network, a feature fusion network and a prediction network;
S22, inputting the image frames into the YOLOv5-s model, wherein the backbone network extracts features from the image frames to obtain semantic features of different scales;
S23, inputting the semantic features of different scales into the feature fusion network for fusion to obtain fusion features;
S24, converting the fusion features into a target detection result, namely pedestrian targets, via the prediction network.
3. The pedestrian multi-target tracking method based on deep learning of claim 2, wherein the network structure of the YOLOv5-s model is:
the backbone network comprises a first extraction module, a second extraction module and a third extraction module, which respectively extract semantic features of a first scale, semantic features of a second scale and semantic features of a third scale; the first extraction module is composed of three CBS modules and two C3-DCN modules, the second extraction module is composed of a CBS module and a C3-DCN module, and the third extraction module is composed of a CBS module, a C3-DCN module, a pyramid pooling layer and a convolutional attention module;
the feature fusion network comprises four CBS modules, two up-sampling modules, four C3-DCN modules and four fusion modules, and performs branched fusion on the semantic features of different scales, obtaining three fusion features along three fusion routes;
the prediction network comprises three convolution modules, which detect pedestrian targets from the three fusion features respectively.
4. The pedestrian multi-target tracking method based on deep learning of claim 3, wherein the three fusion routes are a first fusion route, a second fusion route and a third fusion route, and the structure of the feature fusion network along the data flow direction is defined as: a first CBS module, a first up-sampling module, a first fusion module, a first C3-DCN module, a second CBS module, a second up-sampling module, a second fusion module, a second C3-DCN module, a third CBS module, a third fusion module, a third C3-DCN module, a fourth CBS module, a fourth fusion module and a fourth C3-DCN module;
the first fusion route is:
sequentially inputting the semantic features of the third scale into the first CBS module and the first up-sampling module, fusing them with the semantic features of the second scale in the first fusion module, and inputting the result into the first C3-DCN module and the second CBS module to obtain a first intermediate feature; inputting the first intermediate feature into the second up-sampling module, fusing it with the semantic features of the first scale in the second fusion module, and inputting the result into the second C3-DCN module to obtain a first fusion feature;
the second fusion route is:
inputting the first fusion feature into the third CBS module, fusing it with the first intermediate feature in the third fusion module, and inputting the result into the third C3-DCN module to obtain a second fusion feature;
the third fusion route is:
inputting the second fusion feature into the fourth CBS module, fusing it with the semantic features of the third scale in the fourth fusion module, and inputting the result into the fourth C3-DCN module to obtain a third fusion feature.
5. The pedestrian multi-target tracking method based on deep learning of claim 4, wherein the semantic features of the third scale are features enhanced with channel attention and spatial attention, wherein:
the channel attention calculation formula is:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(μ_1(μ_0(F_avg^c)) + μ_1(μ_0(F_max^c)))
wherein M_c(F) is the semantic feature of the third scale with channel attention added, σ denotes the sigmoid function, MLP is a multi-layer perceptron, AvgPool is average pooling, MaxPool is maximum pooling, μ_1 and μ_0 are the weight coefficients of the multi-layer perceptron, F is the original third-scale semantic feature, F_avg^c denotes the feature obtained by the average pooling operation, and F_max^c denotes the feature obtained by the maximum pooling operation;
the spatial attention calculation formula is:
M_s(F) = σ(f^(7×7)([AvgPool(F); MaxPool(F)])) = σ(f^(7×7)([F_avg^s; F_max^s]))
wherein M_s(F) is the semantic feature of the third scale with spatial attention added, σ denotes the sigmoid function, AvgPool is average pooling, MaxPool is maximum pooling, f^(7×7) denotes a convolution operation with a 7×7 filter, F is the original third-scale semantic feature, F_avg^s denotes the feature obtained by average pooling along the channel direction, and F_max^s denotes the feature obtained by maximum pooling along the channel direction.
6. The pedestrian multi-target tracking method based on deep learning of claim 2, wherein the YOLOv5-s model is a pre-trained model, and the loss function of the YOLOv5-s model during pre-training is:
L_WIoU = r · R_WIoU · L_IoU
R_WIoU = exp(((x − x^gt)² + (y − y^gt)²) / ((W_g² + H_g²)*))
L_IoU = 1 − (W_j · H_j) / S_u
S_u = w·h + w^gt·h^gt − W_j·H_j
wherein L_WIoU is the loss function, L_IoU is the intersection-over-union loss, R_WIoU is the penalty term, r is the non-monotonic focusing coefficient, x, y denote the center coordinates of the prediction frame, x^gt, y^gt denote the center coordinates of the real frame, W_g, H_g denote the width and height of the smallest enclosing frame containing both the predicted frame and the real frame, the superscript * denotes separation from the computational graph, W_j, H_j denote the width and height of the overlapping region between the real frame and the predicted frame, S_u denotes the area of the union of the predicted frame and the real frame, w, h denote the width and height of the predicted frame, and w^gt, h^gt denote the width and height of the real frame.
7. The pedestrian multi-target tracking method based on deep learning of claim 1, wherein step S3 includes:
S31, initializing a predicted track for each pedestrian target by means of a Kalman filter, the track including the pedestrian's position, speed and related information;
S32, matching the predicted tracks with the pedestrian targets in the current image frame using the Hungarian algorithm to obtain matched pairs, and assigning a unique ID to the tracked pedestrian target of each matched pair;
S33, for each tracked pedestrian target, updating the pedestrian's track with the Kalman filter according to the historical track and the pedestrian features of the current image frame, including predicting the pedestrian's position, updating the state estimate and updating the covariance matrix;
and S34, outputting the pedestrian tracking result of the StrongSORT model after the tracking stop condition is reached.
8. The pedestrian multi-target tracking method based on deep learning of claim 7, wherein in step S33, when the track of each pedestrian is updated using the Kalman filter, the appearance state e_i^t of the i-th track at image frame t is updated by means of a dynamic EMA:
e_i^t = α · e_i^(t−1) + (1 − α) · f_i^t
α = α_min + (α_max − α_min) × s_d
wherein e_i^t is the appearance state of the i-th track at image frame t, f_i^t is the appearance embedding of the currently matched detection, α is the dynamic weight, α_min and α_max are constants, and s_d is the confidence of the detection frame.
9. The pedestrian multi-target tracking method based on deep learning of claim 8, wherein in step S33, when the track of each pedestrian is updated using the Kalman filter, the tracking state is updated with adaptive noise, and the adaptive noise covariance formula is:
R̃_k = (1 − c_k) · R_k
wherein R̃_k is the adaptive noise covariance, R_k is a preset constant noise covariance, and c_k is the confidence score of the detection in state k; when the noise is low, the detection score c_k is higher, resulting in a lower R̃_k.
10. The pedestrian multi-target tracking method based on deep learning of claim 7, wherein step S32 includes:
S321, if a pedestrian target in the current image frame is successfully matched with a predicted track, a matched pair is formed, and step S33 is executed;
S322, if the matching between a pedestrian target and a predicted track in the current image frame fails, secondary matching is performed; if the secondary matching succeeds, a matched pair is formed and step S33 is executed, and if the secondary matching fails, go to step S323;
S323, judging whether the secondary matching failure is a pedestrian target matching failure or a predicted track tracking failure; if it is a pedestrian target matching failure, go to step S324, and if it is a predicted track tracking failure, go to step S325;
S324, creating a new predicted track for the pedestrian target that failed to match, setting the new predicted track as unconfirmed, and verifying the new predicted track three times; if the verification passes, the new predicted track is modified to confirmed to form a matched pair and step S33 is executed, and if the verification fails, the pedestrian target and the new predicted track are deleted;
S325, checking whether the predicted track that failed to match is set as unconfirmed; if it is unconfirmed, the predicted track is deleted, and if it is confirmed, a lifetime is set for the predicted track and step S326 is executed;
S326, examining the predicted track that failed to match within its lifetime; if matching still fails within the lifetime, the predicted track is deleted, and if matching succeeds within the lifetime, a matched pair is formed and step S33 is executed.
CN202311210569.6A (priority date 2023-09-19, filing date 2023-09-19) Pedestrian multi-target tracking method based on deep learning, published as CN117237411A (pending)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202311210569.6A | 2023-09-19 | 2023-09-19 | Pedestrian multi-target tracking method based on deep learning (published as CN117237411A)


Publications (1)

Publication Number Publication Date
CN117237411A 2023-12-15



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination