CN117392176A - Pedestrian tracking method and system for video monitoring and computer readable medium - Google Patents

Pedestrian tracking method and system for video monitoring and computer readable medium

Info

Publication number
CN117392176A
Authority
CN
China
Prior art keywords
pedestrian
frame
tracking method
video
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311381410.0A
Other languages
Chinese (zh)
Inventor
陈从平
吴伟鹏
陆洋
陈奔
刘雅玄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou University
Original Assignee
Changzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou University filed Critical Changzhou University
Priority to CN202311381410.0A
Publication of CN117392176A
Legal status: Pending

Classifications

    • G06T 7/248 — Analysis of motion using feature-based methods (e.g. tracking of corners or segments) involving reference images or patches
    • G06T 7/277 — Analysis of motion involving stochastic approaches, e.g. using Kalman filters
    • G06T 7/74 — Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • G06N 3/0464 — Convolutional networks [CNN, ConvNet]
    • G06N 3/08 — Learning methods
    • G06V 10/44 — Local feature extraction by analysis of parts of the pattern (e.g. edges, contours, loops, corners, strokes or intersections); connectivity analysis
    • G06V 10/75 — Organisation of the matching processes (e.g. simultaneous or sequential comparisons of image or video features); coarse-fine and multi-scale approaches; context analysis; selection of dictionaries
    • G06V 10/764 — Recognition or understanding using pattern recognition or machine learning, using classification (e.g. of video objects)
    • G06V 10/82 — Recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V 20/52 — Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 40/10 — Human or animal bodies (e.g. vehicle occupants or pedestrians); body parts (e.g. hands)
    • G06T 2207/10016 — Video; image sequence
    • G06T 2207/20024 — Filtering details
    • G06T 2207/20081 — Training; learning
    • G06T 2207/20084 — Artificial neural networks [ANN]
    • G06T 2207/30196 — Human being; person
    • G06T 2207/30232 — Surveillance
    • Y02T 10/40 — Engine management systems

Abstract

The invention relates to the technical field of image processing, and in particular to a pedestrian tracking method and system for video monitoring and a computer readable medium. The method comprises: processing a video stream containing pedestrian targets frame by frame, generating an image corresponding to each video frame, and preprocessing the image; inputting continuous frame image data into an improved YOLOv5 network model and performing feature extraction and target detection to obtain the bounding box information of a target pedestrian; establishing a lightweight DeepSort tracking model and replacing the original DeepSort feature extraction module with a MobileNetV2 feature extraction module; and associating the pedestrian motion information frame by frame by calculating the Mahalanobis distance between the detected pedestrian position and the position predicted by a Kalman filter. The invention solves the problems that traditional monitoring easily produces unstable monitoring results and easily misses important information, as well as the constraint of limited computing resources on edge devices.

Description

Pedestrian tracking method and system for video monitoring and computer readable medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a pedestrian tracking method and system for video surveillance, and a computer readable medium.
Background
With the development of computer vision technology, cameras have been widely used in the field of video surveillance. In particular, with people's increasing demands for public and personal security, cameras of various sizes and types are now deployed in large numbers at borders, public buildings, public transportation, shops, office buildings, parking lots and even homes.
An intelligent monitoring technology therefore needs to be introduced to reduce the burden on operators in traditional monitoring systems and to improve the monitoring effect. YOLOv5 is a deep-learning-based target detection algorithm; however, target detection alone is not sufficient, because in actual monitoring scenes pedestrians are often occluded, making it difficult for conventional target tracking algorithms to track pedestrian trajectories accurately.
In addition, running complex models on embedded and edge devices is often constrained by computational resources and memory limitations; therefore, a lighter-weight feature extraction module is adopted to reduce the number of model parameters and accommodate the resource limits of edge devices.
Disclosure of Invention
Aiming at the defects of existing methods, the invention solves the problems that traditional monitoring easily produces unstable monitoring results and easily misses important information, as well as the constraint of limited computing resources on edge devices.
The technical scheme adopted by the invention is as follows: the pedestrian tracking method for video monitoring comprises the following steps:
Step one, processing a video stream containing pedestrian targets frame by frame to generate an image corresponding to each video frame, and preprocessing the image;
Further, the preprocessing operation includes: adjusting the brightness and contrast of the image, reducing noise, and resizing the image.
Step two, inputting continuous frame image data into an improved YOLOv5 network model, and performing feature extraction and target detection to obtain the bounding box information of a target pedestrian;
Further, in the improved YOLOv5 network model, the C3 modules of layers 2, 4, 6 and 8 of the YOLOv5 network backbone are replaced with DAMC3 modules; a DAMC3 module consists of a spatial attention module, a channel attention module and a convolution module connected in sequence at the output of the C3 module.
Step three, establishing a lightweight DeepSort tracking model, and replacing the original DeepSort feature extraction module with a MobileNetV2 feature extraction module;
Further, the MobileNetV2 feature extraction module includes:
an expansion layer, which increases the dimension of low-dimensional features by point-by-point convolution with a 1x1 convolution kernel; a depthwise separable convolution, which performs channel-by-channel convolution with a 3x3 convolution kernel to reduce the number of computed parameters; and a projection layer, which reduces the dimension of high-dimensional features by point-by-point convolution with a 1x1 convolution kernel.
And step four, associating the pedestrian motion information frame by frame by calculating the Mahalanobis distance between the detected pedestrian position and the position predicted by the Kalman filter.
Further, the Mahalanobis distance is defined as:

d(i, j) = (d_j - y_i)^T S_i^{-1} (d_j - y_i)

where d_j is the position of detection box j, y_i is the predicted position of Kalman filter i, and S_i is the covariance matrix between the detected position and the predicted position.
Further, the fourth step further includes:
and carrying out IOU matching on the associated pedestrian prediction frame and the pedestrian detection frame transmitted by the second frame detection, setting a threshold value, confirming the tracking state and carrying out cascade matching.
Further, a pedestrian tracking system for video monitoring, comprising: a memory for storing instructions executable by the processor; and the processor is used for executing the instructions to realize the pedestrian tracking method for video monitoring.
Further, a computer readable medium storing computer program code, characterized in that the computer program code, when executed by a processor, implements a pedestrian tracking method for video surveillance.
The invention has the beneficial effects that:
1. By adding the dual-attention mechanism DAM to YOLOv5, the representation capability of pedestrian features is enhanced, target detection accuracy is improved, the estimation of target position and scale is improved, the influence of noise and occlusion is suppressed, and the amount of computation is effectively reduced;
2. The combination of the improved YOLOv5 network model and the lightweight DeepSort tracking model improves robustness for pedestrian targets in complex scenes: the improved YOLOv5 network model has stronger target detection capability, and the lightweight DeepSort tracking model uses appearance features and a motion model for target association and can handle challenges such as occlusion and appearance changes, thereby improving the accuracy and robustness of pedestrian tracking in intelligent video monitoring;
3. by replacing the original feature extraction convolution module with the MobileNetV2 feature extraction module, the parameter quantity of the model is greatly reduced, and the whole system can be easily deployed to edge equipment with limited computing resources and storage capacity.
Drawings
FIG. 1 is a schematic logic flow diagram of a pedestrian tracking method for video surveillance of the present invention;
FIG. 2 is a diagram of the improved YOLOv5 network model of the present invention;
FIG. 3 is a schematic diagram of a dual-attention mechanism DAM of the present invention;
FIG. 4 is a block diagram of a dual-attention mechanism DAMC3 of the present invention;
FIG. 5 is a diagram of a MobileNet V2 feature extraction module of the present invention;
FIG. 6 is a schematic diagram of a depth separable convolution structure of the present invention;
FIG. 7 is a tracking state flow diagram of the present invention;
FIG. 8 is a graph of the pedestrian tracking effect of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples, which are simplified schematic illustrations showing only the basic structure of the invention, and thus only those constructions relevant to the invention are shown.
As shown in fig. 1, the pedestrian tracking method for video monitoring includes the steps of:
Step one, processing a video stream containing pedestrian targets frame by frame to generate video frame images;
firstly, acquiring a real-time monitoring video, and decoding each frame of video to obtain image data.
Secondly, before each frame is processed, a preprocessing operation is performed on the image to optimize it and improve the performance of the subsequent target detection and tracking algorithms. The preprocessing operation includes adjusting the brightness and contrast of the image, reducing noise, and resizing the image; this improves the quality and clarity of the image so that targets can be detected and tracked in it more easily and accurately.
finally, repeating the steps, and carrying out the same processing on the next frame of video until all the video frames are processed.
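For illustration only, the following is a minimal sketch of the frame-by-frame decoding and preprocessing described above, written in Python with OpenCV; the video path, brightness/contrast factors, blur kernel size and target resolution are assumptions for the example and are not values specified by the invention.

```python
# Illustrative sketch: frame-by-frame decoding and preprocessing with OpenCV.
import cv2

def preprocess(frame, target_size=(640, 640), alpha=1.2, beta=10):
    """Adjust brightness/contrast, denoise, and resize a single frame."""
    # Brightness/contrast adjustment: out = alpha * frame + beta
    frame = cv2.convertScaleAbs(frame, alpha=alpha, beta=beta)
    # Noise reduction with a small Gaussian blur
    frame = cv2.GaussianBlur(frame, (3, 3), 0)
    # Resize to the detector's expected input size
    return cv2.resize(frame, target_size)

cap = cv2.VideoCapture("surveillance.mp4")  # hypothetical video source
while True:
    ok, frame = cap.read()
    if not ok:            # all frames processed
        break
    image = preprocess(frame)
    # ... pass `image` to the detection/tracking pipeline (steps two to four)
cap.release()
```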
Step two, as shown in FIG. 2 (the improved YOLOv5 network model), continuous frame image data are input into the network model for feature extraction and target detection, and the bounding box information of the target pedestrian is obtained;
To improve the accuracy and stability of pedestrian detection and tracking, YOLOv5 is improved by adding a dual-attention mechanism DAM (Dual Attention Model). The attention mechanism stems from an intuition about the human brain: when processing large amounts of information, it tends to focus on important information while ignoring secondary information. The principle of the dual-attention mechanism DAM is shown in FIG. 3; by introducing a spatial attention mechanism and a channel attention mechanism, the DAM module can better capture spatial correlations and channel correlations in the image. Both attention mechanisms are very useful for the pedestrian tracking task: the spatial attention mechanism helps the model focus on important regions of the image, such as the positions of pedestrians, improving the accuracy of pedestrian detection and ensuring that key parts of pedestrians are not ignored; the channel attention mechanism allows the model to automatically learn weights over the different channels of the feature map to emphasize channels related to pedestrian features, helping the model better distinguish pedestrians from the background or other objects and improving the discriminability of the features.
The DAM module applies the spatial attention mechanism first. Spatial attention helps the model focus on specific regions of the image when processing visual tasks, which benefits the pedestrian target detection task; features at specific positions can be selectively enhanced through spatial attention, reducing computational cost. If the channel attention mechanism were applied first, the operation would act on the whole feature map without considering position-specific information, increasing the amount of computation.
As shown in FIG. 4, the original C3 layer is changed to the DAMC3 layer. The benefits of introducing the DAM inside the C3 layer include the following (an illustrative sketch is given after this list):
1. Finer feature weighting: the attention mechanism of the DAM can be applied more precisely to a specific feature map; the model can adjust channel and spatial attention on the feature map generated by the C3 layer more finely, better adapting to, integrating and utilizing the features of that layer to improve detection and recognition performance.
2. Better model fine-tuning: placing the DAM inside a particular layer can make the model easier to fine-tune, because the DAM can adapt more precisely to the features of that layer, which is helpful for transfer learning and fine-tuning for a particular task.
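As an illustration only, the following PyTorch sketch shows one possible reading of the DAMC3 structure described above (the output of a C3 block passed through spatial attention, then channel attention, then a convolution). The attention kernel size, the channel reduction ratio and the final 1x1 convolution are assumptions, and the YOLOv5 C3 block itself is only referenced, not reimplemented.

```python
# Minimal sketch of a DAMC3-style block: C3 output -> spatial attention -> channel attention -> conv.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Pool along the channel axis, then learn a per-pixel attention map
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):  # reduction ratio is an assumption
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x):
        # Learn a per-channel weight and rescale the feature map
        return x * torch.sigmoid(self.fc(x))

class DAMC3(nn.Module):
    """C3 block followed by spatial attention, channel attention and a 1x1 conv."""
    def __init__(self, c3_block, channels):
        super().__init__()
        self.c3 = c3_block                 # existing YOLOv5 C3 module (not shown here)
        self.sa = SpatialAttention()
        self.ca = ChannelAttention(channels)
        self.conv = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        y = self.c3(x)
        return self.conv(self.ca(self.sa(y)))   # spatial first, then channel, then conv
```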
The obtained continuous frame image data are input into the network model for feature extraction and target detection, and the bounding box information of the target pedestrian, (x, y, w, h, dx, dy, dw, dh), is obtained and output, where (x, y) are the center coordinates of the bounding box of the target pedestrian, w is its width, h is its height, and (dx, dy, dw, dh) are the corresponding velocities of x, y, w and h in the image coordinate system.
Step three, a lightweight DeepSort tracking model is established, and the original feature extraction module of DeepSort is replaced by a MobileNetV2 feature extraction module. With little loss of accuracy, the number of model parameters is greatly reduced and the model weight shrinks from the original 45M to 2.5M, making it convenient to deploy the tracking model on edge devices with limited computing power.
Because the computing power of embedded edge monitoring devices is limited, and MobileNetV2 is designed to provide good performance in resource-limited environments, the main reason that replacing DeepSort's original feature extraction module with MobileNetV2 reduces the number of parameters is as follows: MobileNetV2 is itself a lightweight convolutional neural network architecture with few parameters relative to large feature extraction networks.
Key factors by which MobileNetV2 reduces the number of parameters:
1. Depthwise separable convolution: MobileNetV2 uses depthwise separable convolutions instead of standard convolutions. A depthwise separable convolution separates the spatial and channel dimensions and performs the convolution on each channel rather than on the entire feature map, which reduces the number of parameters in the model.
2. Lightweight structure: MobileNetV2 is designed with fewer layers and parameters to accommodate the computing resource limitations of embedded and mobile devices, and it uses a range of lightweight design strategies, such as residual connections and batch normalization, to reduce the complexity and parameter count of the network.
3. Inverted residuals: MobileNetV2 also introduces inverted residual blocks, further reducing the number of parameters. These inverted residual blocks contain a lightweight expansion convolution that increases the dimension of the feature map and a linear projection convolution that reduces it again.
The principle of the MobileNetV2 feature extraction module is shown in FIG. 5. The expansion layer increases the dimension of the low-dimensional features by point-by-point convolution with a 1x1 convolution kernel; the depthwise separable convolution (Depthwise Convolution) performs channel-by-channel convolution with a 3x3 convolution kernel to reduce the number of computed parameters; and the projection layer performs point-by-point convolution with a 1x1 convolution kernel to reduce the dimension of the high-dimensional features. The depthwise separable convolution decomposes the convolution into two steps, processing spatial information once and channel information once, which significantly reduces the number of parameters required for the computation; this separation allows the model to be more lightweight while still maintaining high detection performance.
In the channel-by-channel convolution, the feature map of each channel is computed by its own convolution kernel, as shown in the first part of FIG. 6; the number of channels of the feature map obtained after this step is the same as the number of input channels.
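For illustration, the following PyTorch sketch shows a standard MobileNetV2-style inverted-residual block matching the expansion/depthwise/projection description above. The expansion factor of 6 and the residual condition follow the published MobileNetV2 design and are assumptions here, not values taken from the invention.

```python
# Sketch of an inverted-residual block: 1x1 expansion -> 3x3 depthwise conv -> 1x1 linear projection.
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        hidden = in_ch * expand
        self.use_res = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # Expansion layer: 1x1 pointwise conv raises the dimension
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # Depthwise 3x3 conv: one filter per channel (groups=hidden) cuts parameters
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # Projection layer: 1x1 pointwise conv lowers the dimension (linear, no ReLU)
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return x + self.block(x) if self.use_res else self.block(x)
```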
The bounding box information obtained from the improved YOLOv5 network model is used as input to initialize the Kalman filter for the target pedestrian; the Kalman filter then computes prediction box information, giving a preliminary prediction of the target pedestrian's position in the current frame.
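A minimal sketch of this initialization and prediction step is given below, assuming a constant-velocity Kalman model over the eight-dimensional state (x, y, w, h, dx, dy, dw, dh) described above; the noise magnitudes and initial covariance are placeholder assumptions, not values from the invention.

```python
# Illustrative constant-velocity Kalman initialization and prediction with NumPy.
import numpy as np

def init_kalman(bbox):
    """bbox = (x, y, w, h) from the detector; returns initial state mean and covariance."""
    mean = np.zeros(8)
    mean[:4] = bbox              # position part comes from the detection
    cov = np.eye(8)              # initial uncertainty (assumed)
    return mean, cov

F = np.eye(8)
F[:4, 4:] = np.eye(4)            # x_{k+1} = x_k + v_k (unit time step)
Q = np.eye(8) * 1e-2             # process noise (assumed)

def predict(mean, cov):
    """Predict the pedestrian's bounding box state in the next frame."""
    return F @ mean, F @ cov @ F.T + Q
```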
Step four, associating the pedestrian motion information frame by frame by calculating the Mahalanobis distance between the detected pedestrian position and the position predicted by the Kalman filter;
the mahalanobis distance expression is:
wherein: d, d j Is the position of the detection frame j; y is i Is the predicted position of the Kalman filter i; s is S i Is the covariance matrix of the detected and predicted positions.
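As an illustration only, a minimal NumPy sketch of this distance computation under the notation above (inputs are assumed to be position vectors of equal length and a matching covariance matrix):

```python
# Squared Mahalanobis distance between detection d_j and Kalman-predicted position y_i.
import numpy as np

def mahalanobis_sq(d_j, y_i, S_i):
    diff = np.asarray(d_j) - np.asarray(y_i)
    return float(diff.T @ np.linalg.inv(S_i) @ diff)
```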
IOU matching is performed between the associated pedestrian prediction box and the pedestrian detection box passed in from the detection of the second frame; a threshold is set, the tracking state is confirmed, and cascade matching is performed.
The IOU matching calculation formula is:

IOU = |A ∩ B| / |A ∪ B|

where A represents the prediction box and B represents the actual detection box.
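For illustration, a minimal Python sketch of this IOU computation, assuming boxes are given as (x1, y1, x2, y2) corner coordinates (a convention not specified by the invention):

```python
# Intersection-over-union of a predicted box A and a detected box B.
def iou(A, B):
    ix1, iy1 = max(A[0], B[0]), max(A[1], B[1])
    ix2, iy2 = min(A[2], B[2]), min(A[3], B[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (A[2] - A[0]) * (A[3] - A[1])
    area_b = (B[2] - B[0]) * (B[3] - B[1])
    return inter / (area_a + area_b - inter + 1e-9)
```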
Tracking states fall into three categories: confirmed, unconfirmed, and deleted.
The tracking state flow shown in FIG. 7 is as follows: first, in the initialization stage, a track T is created from the target detection result of the first frame and a Kalman filter is used for position prediction; at this point the track is in the unconfirmed state. Then IOU (intersection-over-union) matching is performed between the target pedestrian detection box and the prediction box of the current frame, a cost matrix is calculated, and the Hungarian algorithm is applied to the cost matrix to obtain a linear matching result.
According to the matching result, the following cases are handled: 1. if a track T is unmatched, the unmatched track T is deleted; 2. an unmatched detection box D is initialized as a new track T; 3. if a Kalman-filter prediction box is successfully matched with a pedestrian detection box, the variables of the matched track T are updated through the Kalman filter. The target matching step is repeated in loop iterations until a track T in the confirmed state appears or the video frames end.
In addition, a cascade matching stage is included: Kalman filtering is used to predict the boxes corresponding to tracks T in the confirmed state and the unconfirmed state, and the boxes of the confirmed tracks T are cascade-matched with the detection boxes D. Cascade matching uses appearance features and motion information, and the appearance features and motion information of the previous n frames are stored, which improves matching accuracy.
Finally, in the stage of processing the matching result, the cases are handled according to the cascade matching result: if a track T is successfully matched, its variables are updated through Kalman filtering; if a detection box D is unmatched, IOU matching is performed between the unconfirmed tracks T together with the unmatched tracks T on one side and the unmatched detection boxes D on the other, and a cost matrix is calculated; the Hungarian algorithm is then applied again to obtain a linear matching result. The matching-result processing step is repeated in loop iterations until the video frames end.
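A sketch of the linear matching step described above is given below, using SciPy's Hungarian-algorithm solver (linear_sum_assignment); the cost function and the gating threshold value are assumptions for illustration.

```python
# Illustrative linear assignment over a track/detection cost matrix.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(tracks, detections, cost_fn, max_cost=0.7):
    """Return matched (track, detection) index pairs plus unmatched indices."""
    if not tracks or not detections:
        return [], list(range(len(tracks))), list(range(len(detections)))
    cost = np.array([[cost_fn(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)     # Hungarian algorithm
    matches, un_t, un_d = [], set(range(len(tracks))), set(range(len(detections)))
    for r, c in zip(rows, cols):
        if cost[r, c] <= max_cost:               # reject matches above the threshold
            matches.append((r, c))
            un_t.discard(r)
            un_d.discard(c)
    return matches, sorted(un_t), sorted(un_d)
```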
According to the associated and updated tracker states, the result of successfully tracking the pedestrian is output: bounding boxes are drawn, unique IDs are assigned, and pedestrian trajectories are generated; the effect is shown in FIG. 8.
Taking the above preferred embodiments of the present invention as an illustration, persons skilled in the relevant art can make various changes and modifications without departing from the technical idea of the present invention. The technical scope of the present invention is not limited to the description, but must be determined according to the scope of the claims.

Claims (8)

1. The pedestrian tracking method for video monitoring is characterized by comprising the following steps of:
step one, processing a video stream containing pedestrian targets frame by frame to generate an image corresponding to each video frame, and preprocessing the image;
step two, inputting continuous frame image data into an improved YOLOv5 network model, and performing feature extraction and target detection to obtain the bounding box information of a target pedestrian;
step three, establishing a lightweight DeepSort tracking model, and replacing the original DeepSort feature extraction module with a MobileNetV2 feature extraction module;
and step four, associating the pedestrian motion information frame by frame by calculating the Mahalanobis distance between the detected pedestrian position and the position predicted by the Kalman filter.
2. The pedestrian tracking method for video surveillance of claim 1, wherein the preprocessing operation includes: adjusting the brightness and contrast of the image, reducing noise, and resizing the image.
3. The pedestrian tracking method for video surveillance of claim 1, wherein the improvement to the YOLOv5 network model is to replace the C3 modules of layers 2, 4, 6 and 8 of the YOLOv5 network backbone with DAMC3 modules; a DAMC3 module consists of a spatial attention module, a channel attention module and a convolution module connected in sequence at the output of the C3 module.
4. The pedestrian tracking method for video surveillance of claim 1, wherein the MobileNetV2 feature extraction module comprises:
an expansion layer, which increases the dimension of low-dimensional features by point-by-point convolution with a 1x1 convolution kernel; a depthwise separable convolution, which performs channel-by-channel convolution with a 3x3 convolution kernel to reduce the number of computed parameters; and a projection layer, which reduces the dimension of high-dimensional features by point-by-point convolution with a 1x1 convolution kernel.
5. The pedestrian tracking method for video surveillance of claim 1, wherein the Mahalanobis distance is defined as:

d(i, j) = (d_j - y_i)^T S_i^{-1} (d_j - y_i)

where d_j is the position of detection box j, y_i is the predicted position of Kalman filter i, and S_i is the covariance matrix between the detected position and the predicted position.
6. The pedestrian tracking method for video surveillance of claim 1, wherein step four further includes:
and carrying out IOU matching on the associated pedestrian prediction frame and the pedestrian detection frame transmitted by the second frame detection, setting a threshold value, confirming the tracking state and carrying out cascade matching.
7. A pedestrian tracking system for video surveillance, comprising: a memory for storing instructions executable by the processor; a processor for executing instructions to implement the pedestrian tracking method for video surveillance of any one of claims 1-6.
8. Computer readable medium storing computer program code, characterized in that the computer program code, when executed by a processor, implements the pedestrian tracking method for video monitoring as claimed in any one of claims 1-6.
CN202311381410.0A 2023-10-24 2023-10-24 Pedestrian tracking method and system for video monitoring and computer readable medium Pending CN117392176A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311381410.0A CN117392176A (en) 2023-10-24 2023-10-24 Pedestrian tracking method and system for video monitoring and computer readable medium

Publications (1)

Publication Number Publication Date
CN117392176A 2024-01-12

Family

ID=89436993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311381410.0A Pending CN117392176A (en) 2023-10-24 2023-10-24 Pedestrian tracking method and system for video monitoring and computer readable medium

Country Status (1)

Country Link
CN (1) CN117392176A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination