CN114596340A - Multi-target tracking method and system for monitoring video

Multi-target tracking method and system for monitoring video

Info

Publication number
CN114596340A
Authority
CN
China
Prior art keywords
target
tracking
monitoring video
tracked
matching
Prior art date
Legal status
Pending
Application number
CN202210220010.0A
Other languages
Chinese (zh)
Inventor
丁萌
周嘉麒
曹云峰
魏丽
Current Assignee
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202210220010.0A
Publication of CN114596340A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/277Analysis of motion involving stochastic approaches, e.g. using Kalman filters
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30232Surveillance

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-target tracking method and system for surveillance video, relating to the technical fields of computer vision and civil aviation traffic engineering. The method acquires a real-time surveillance video to be tracked; splits the video into a sequence of video frames to be tracked; inputs the frame sequence into a multi-target recognition model to obtain a sequence of target information groups; and, from this sequence, determines the tracking trajectories of multiple targets in the video using a Kalman filtering algorithm and a multi-target tracking model. Because the multi-target recognition model and the multi-target tracking model are obtained by improving and training the YOLOv4 and DeepSORT neural networks, the tracking trajectories of multiple targets in the real-time surveillance video to be tracked can be determined automatically, which improves the intelligence level of video surveillance.

Description

Multi-target tracking method and system for monitoring video
Technical Field
The invention relates to the technical fields of computer vision and civil aviation traffic engineering, and in particular to a multi-target tracking method and system for surveillance video.
Background
In recent years, with the rapid development of the civil aviation industry, airport areas have grown steadily larger, traffic on airport surface elements such as runways, taxiways and aprons has become increasingly complex, and the probability of surface collisions between aircraft has risen. Large airports such as Beijing, Shanghai and Xi'an operate multiple runways, and the airport surface is often congested. In addition, the terminal building blocks the line of sight, so monitoring blind areas exist on aprons and parts of the taxiways, creating potential safety hazards for surface traffic control. Surveillance of the airport surface is therefore essential to help controllers maintain an accurate picture of the airport traffic situation.
Because existing automatic monitoring systems offer only simple functions, manual visual observation remains the main monitoring means at large domestic airports. At present, most high-traffic international airports (Hangzhou Xiaoshan International Airport, Chongqing Jiangbei International Airport, Shenzhen Bao'an International Airport and others) use semi-manual, semi-automatic monitoring systems. As airport passenger flow keeps growing, especially during flight rush hours, supervising apron safety becomes more difficult and places higher demands on the working capacity of apron monitoring personnel. Because there are many vehicles and people on the surface, working time limits are tight and the environment is relatively harsh, apron security staffing is generally insufficient; traditional manual visual monitoring has a safety bottleneck and can easily allow airport incidents caused by human factors. Conventional manual visual observation therefore struggles to meet the monitoring requirements of the airport surface, and the intelligence level of the surface video monitoring system needs to be improved.
Disclosure of Invention
The invention aims to provide a multi-target tracking method and a multi-target tracking system for a surveillance video, which can track a plurality of targets in the surveillance video and improve the intelligent level of video surveillance.
In order to achieve the purpose, the invention provides the following scheme:
a multi-target tracking method for surveillance videos comprises the following steps:
acquiring a real-time monitoring video to be tracked;
performing framing processing on a real-time monitoring video to be tracked to obtain a video frame sequence to be tracked;
inputting the video frame sequence to be tracked into a multi-target recognition model to obtain a target information group sequence; any target information group in the target information group sequence comprises the coordinates and the types of all targets in the same video frame to be tracked; the multi-target recognition model is obtained by training a YOLOv4-TS neural network by using a historical monitoring video; the YOLOv4-TS neural network is obtained by adding a spatial pyramid pooling module in a YOLOv4 neural network;
determining the tracking tracks of a plurality of targets in the real-time monitoring video to be tracked by utilizing a Kalman filtering algorithm and a multi-target tracking model according to the target information group sequence; the multi-target tracking model is obtained by training a DeepSORT neural network by using historical monitoring videos.
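For orientation only, the following minimal Python sketch shows how the four steps above could be chained; the detector and tracker objects, their method names and the frame-by-frame interface are illustrative assumptions, not part of the disclosed models.

```python
# Minimal pipeline sketch (assumed interfaces): frame the video, detect targets
# per frame with the trained YOLOv4-TS model, then associate detections into
# tracks with the trained DeepSORT-style tracker.
import cv2

def track_surveillance_video(video_path, detector, tracker):
    """detector(frame) -> list of (box, class); tracker.update(...) -> tracks."""
    cap = cv2.VideoCapture(video_path)
    all_tracks = []
    while True:
        ok, frame = cap.read()          # framing: one video frame per iteration
        if not ok:
            break
        detections = detector(frame)    # target information group: boxes + classes
        tracks = tracker.update(frame, detections)  # Kalman prediction + matching
        all_tracks.append(tracks)
    cap.release()
    return all_tracks
```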
Optionally, before the obtaining of the real-time monitoring video to be tracked, the method further includes:
acquiring a historical monitoring video;
extracting a plurality of historical monitoring video frames from the historical monitoring video according to preset time length to serve as a first historical monitoring video frame sequence;
labeling target information in each historical monitoring video frame in a first historical monitoring video frame sequence to obtain a labeled historical monitoring video frame sequence;
determining a historical target information group sequence of an annotated historical monitoring video frame sequence;
and training the YOLOv4-TS neural network by taking the marked historical monitoring video frame sequence as input and the historical target information group sequence as expected output to obtain the multi-target recognition model.
Optionally, before the obtaining of the real-time monitoring video to be tracked, the method further includes:
amplifying the sizes of the convolution layer, the residual module and the network output size of the DeepSORT neural network to obtain an amplified DeepSORT neural network;
carrying out dimensionality reduction on the network structure of the amplified DeepSORT neural network to obtain an improved DeepSORT neural network;
performing frame division processing on the historical monitoring video to obtain a plurality of historical monitoring video frames serving as a second historical monitoring video frame sequence;
carrying out related labeling on the same target information in the second historical monitoring video frame sequence by using a Darklabel tool to obtain historical tracking tracks of a plurality of targets;
and taking a plurality of historical target information group sequences as input, taking historical tracking tracks of a plurality of targets as expected output, and training the improved DeepSORT neural network to obtain the multi-target tracking model.
Optionally, before obtaining the real-time monitoring video to be tracked, the method further includes:
acquiring a historical monitoring video which is the same as the monitoring scene of the real-time monitoring video to be tracked as a pre-training video;
and training the multi-target tracking model by using the pre-training video to obtain a plurality of initial tracking tracks in a monitoring scene.
Optionally, the determining, according to the target information group sequence, the tracking trajectories of the multiple targets in the real-time monitoring video to be tracked by using a Kalman filtering algorithm and a multi-target tracking model specifically includes:
making the iteration number m equal to 1;
determining the initial tracking track as the tracking track of the 0 th iteration;
matching a plurality of targets in the mth target information group in the target information group sequence with a plurality of tracking tracks in the m-1 iteration by using a Hungarian algorithm and a cascade algorithm to obtain a first target-tracking track matching group and a first unmatched target group;
matching a plurality of targets in the first unmatched target group with a plurality of tracking tracks in the (m-1) th iteration by using an IoU matching algorithm to obtain a second target-tracking track matching group;
combining the first target-tracking track matching group and the second target-tracking track matching group into a total matching group;
updating the corresponding tracking tracks according to the coordinates of the targets in the total matching group to obtain a plurality of tracking tracks in the mth iteration;
performing real-time simulation display on a plurality of tracking tracks in the mth iteration by using a Kalman filtering algorithm;
and increasing the value of m by 1 and returning to the step of matching a plurality of targets in the mth target information group in the target information group sequence with a plurality of tracking tracks in the m-1 iteration by using a Hungarian algorithm and a cascade algorithm to obtain a first target-tracking track matching group and a first unmatched target group until the target information group sequence is traversed to obtain the tracking tracks of the plurality of targets in the time period of the real-time monitoring video to be tracked.
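A compact sketch of the iteration described above is given below; the helper functions cascade_match, iou_match and update_track are hypothetical stand-ins for the Hungarian/cascade matching, the IoU matching and the Kalman update steps.

```python
# Sketch of the per-frame track update loop described above (assumed helpers).
def run_tracking(target_info_groups, initial_tracks):
    tracks = initial_tracks                      # tracking tracks of the 0th iteration
    history = []
    for m, detections in enumerate(target_info_groups, start=1):
        # 1) Hungarian + cascade matching against the tracks of iteration m-1
        matched_1, unmatched = cascade_match(detections, tracks)
        # 2) IoU matching for the detections left unmatched
        matched_2, unmatched = iou_match(unmatched, tracks)
        # 3) merge the two matching groups and update the matched tracks
        for det, trk in matched_1 + matched_2:
            update_track(trk, det)               # Kalman update with the detection box
        history.append(list(tracks))             # tracks of the mth iteration
    return tracks, history
```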
Optionally, the matching of multiple targets in the mth target information group in the target information group sequence and multiple tracking tracks in the m-1 st iteration by using the hungarian algorithm and the cascade algorithm to obtain a first target-tracking track matching group and a first unmatched target group specifically includes:
determining any target in the mth target information group as a current target;
determining any one of the plurality of tracking tracks in the m-1 iteration as a current tracking track;
determining the motion information value of the current target and the current tracking track according to the coordinates of the current target by using the formula

d^(1)(i, j) = (D_j - P_i)^T S_i^(-1) (D_j - P_i),

wherein d^(1)(i, j) is the motion information value of the ith tracking track and the jth target, D_j denotes the detection result of the jth target, P_i represents the ith tracking track, and S_i represents the covariance matrix between the average track position obtained by Kalman filtering prediction and the detection position;
determining a first matching metric value of the current target and the current tracking track according to the motion information value;
determining the appearance information value of the current target and the current tracking track according to the appearance characteristics of the current target by using the formula

d^(2)(i, j) = min{ 1 - r_j^T r_k^(i) | r_k^(i) ∈ R_i },

wherein d^(2)(i, j) is the appearance information value of the ith tracking track and the jth target, r_j^T is the transpose of the appearance feature vector r_j of the jth target, r_k^(i) represents the kth feature on the ith tracking track, and R_i represents the feature set of the ith tracking track;
determining a second matching metric value of the current target and the current tracking track according to the appearance information value;
determining a total matching metric value from the first matching metric value and the second matching metric value by using the formula

b_{i,j} = ∏_{m=1}^{2} b_{i,j}^(m),

wherein b_{i,j} is the total matching metric value of the ith tracking track and the jth target, and b_{i,j}^(m) represents the mth matching metric value, m being 1 or 2;
judging whether the total matching metric value is 1 or not to obtain a first judgment result;
if the first judgment result is yes, adding the current target and the current tracking track into a first target-tracking track matching group;
if the first judgment result is negative, adding the current target into a first unmatched target group;
and updating the current tracking track and returning to the step of determining the motion information value of the current target and the current tracking track according to the coordinates of the current target by using the formula d^(1)(i, j) = (D_j - P_i)^T S_i^(-1) (D_j - P_i) until the plurality of tracking tracks in the (m-1)th iteration are traversed; then updating the current target and returning to the step of determining any one of the plurality of tracking tracks in the (m-1)th iteration as the current tracking track until the mth target information group is traversed, so as to obtain the first target-tracking track matching group and the first unmatched target group.
Optionally, the determining a first matching metric value of the current target and the current tracking track according to the motion information value specifically includes:
judging whether the motion information value is larger than a motion information threshold value or not to obtain a second judgment result;
if the second judgment result is negative, the first matching metric value is made to be 1;
if the second determination result is yes, the first matching metric value is set to 0.
Optionally, the determining a second matching metric value of the current target and the current tracking track according to the appearance information value specifically includes:
judging whether the appearance information value is larger than an appearance information threshold value or not to obtain a third judgment result;
if the third judgment result is negative, the second matching metric value is set to 1;
if the third judgment result is yes, the second matching metric value is set to 0.
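The two thresholding rules above and the product formula for the total matching metric amount to the following small sketch (variable names and threshold handling are illustrative):

```python
# Sketch of the gating described above: each distance is turned into a binary
# matching metric by thresholding, and a pair is admissible only if both are 1.
def match_metric(d_motion, d_appearance, t_motion, t_appearance):
    b1 = 1 if d_motion <= t_motion else 0          # first matching metric value
    b2 = 1 if d_appearance <= t_appearance else 0  # second matching metric value
    return b1 * b2                                 # total metric b_ij = b1 * b2
```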
A multi-target tracking system for surveillance videos, comprising:
the to-be-tracked real-time monitoring video acquisition module is used for acquiring a to-be-tracked real-time monitoring video;
the framing module is used for framing the real-time monitoring video to be tracked to obtain a video frame sequence to be tracked;
the target information group sequence determining module is used for inputting the video frame sequence to be tracked into the multi-target recognition model to obtain a target information group sequence; any target information group in the target information group sequence comprises the coordinates and the types of all targets in the same video frame to be tracked; the multi-target recognition model is obtained by training a YOLOv4-TS neural network by using a historical monitoring video; the YOLOv4-TS neural network is obtained by adding a spatial pyramid pooling module in a YOLOv4 neural network;
the tracking track determining module is used for determining the tracking tracks of a plurality of targets in the real-time monitoring video to be tracked by utilizing a Kalman filtering algorithm and a multi-target tracking model according to the target information group sequence; the multi-target tracking model is obtained by training a DeepSORT neural network by using historical monitoring videos.
Optionally, the system further includes:
the historical monitoring video acquisition module is used for acquiring historical monitoring videos;
the first historical monitoring video frame sequence extraction module is used for extracting a plurality of historical monitoring video frames from the historical monitoring video according to preset time length to serve as a first historical monitoring video frame sequence;
the annotated historical monitoring video frame sequence determining module is used for marking target information in each historical monitoring video frame in the first historical monitoring video frame sequence to obtain an annotated historical monitoring video frame sequence;
the historical target information group sequence determining module is used for determining a historical target information group sequence for marking a historical monitoring video frame sequence;
and the multi-target identification model determining module is used for training the YOLOv4-TS neural network by taking the marked historical monitoring video frame sequence as input and the historical target information group sequence as expected output to obtain the multi-target identification model.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the method is obtained by improving and training the YOLOv4 neural network and the DeepsORT neural network, and the multi-target recognition model and the multi-target tracking model can automatically determine the tracking tracks of a plurality of targets in the real-time monitoring video to be tracked, so that the intelligent level of video monitoring is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flowchart of a multi-target tracking method for surveillance videos according to an embodiment of the present invention;
FIG. 2 is a flow chart of the construction of an ASMD dataset according to an embodiment of the present invention;
FIG. 3 is a flowchart of YOLOv4-TS target detection in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a YOLOv4 neural network according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a YOLOv4-TS neural network structure according to an embodiment of the present invention;
FIG. 6 is a diagram of a structure of a PANET network according to an embodiment of the present invention;
FIG. 7 is a flow chart of the operation of the multi-target tracking model in an embodiment of the invention;
FIG. 8 is a schematic structural diagram of a multi-target tracking system for surveillance videos according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a data acquisition module according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of an airport surface target detection module based on YOLOv4-TS according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of an improved DeepSORT-based multi-target tracking module in the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The invention aims to provide a multi-target tracking method and a multi-target tracking system for a surveillance video, which can track a plurality of targets in the surveillance video and improve the intelligent level of video surveillance.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The invention provides a multi-target tracking method of a surveillance video, which comprises the following steps:
acquiring a real-time monitoring video to be tracked;
performing framing processing on a real-time monitoring video to be tracked to obtain a video frame sequence to be tracked;
inputting a video frame sequence to be tracked into a multi-target recognition model to obtain a target information group sequence; any target information group in the target information group sequence comprises the coordinates and the types of all targets in the same video frame to be tracked; the multi-target recognition model is obtained by training a YOLOv4-TS neural network by using a historical monitoring video; the YOLOv4-TS neural network is obtained by adding a spatial pyramid pooling module in the YOLOv4 neural network;
determining the tracking tracks of a plurality of targets in the real-time monitoring video to be tracked by utilizing a Kalman filtering algorithm and a multi-target tracking model according to the target information group sequence; the multi-target tracking model is obtained by training a DeepSORT neural network by using historical monitoring videos.
In addition, the multi-target tracking method for the surveillance videos, provided by the invention, further comprises the following steps before the real-time surveillance video to be tracked is obtained:
acquiring a historical monitoring video;
extracting a plurality of historical monitoring video frames from a historical monitoring video according to a preset time length to serve as a first historical monitoring video frame sequence;
labeling target information in each historical monitoring video frame in the first historical monitoring video frame sequence to obtain a labeled historical monitoring video frame sequence;
determining a historical target information group sequence of an annotated historical monitoring video frame sequence;
and training the YOLOv4-TS neural network by taking the marked historical monitoring video frame sequence as input and the historical target information group sequence as expected output to obtain the multi-target recognition model.
In addition, the multi-target tracking method for the surveillance videos, provided by the invention, further comprises the following steps before the real-time surveillance video to be tracked is obtained:
amplifying the size of the convolution layer, the residual module and the network output size of the DeepSORT neural network to obtain an amplified DeepSORT neural network;
carrying out dimensionality reduction on the network structure of the amplified DeepSORT neural network to obtain an improved DeepSORT neural network;
performing frame division processing on the historical monitoring video to obtain a plurality of historical monitoring video frames serving as a second historical monitoring video frame sequence;
carrying out related labeling on the same target information in the second historical monitoring video frame sequence by using a Darklabel tool to obtain historical tracking tracks of a plurality of targets;
and taking the historical target information group sequences as input, taking the historical tracking tracks of the targets as expected output, and training the improved DeepSORT neural network to obtain the multi-target tracking model.
In addition, the multi-target tracking method for the surveillance videos, provided by the invention, further comprises the following steps before the real-time surveillance video to be tracked is obtained:
acquiring a historical monitoring video which is the same as a monitoring scene of a real-time monitoring video to be tracked as a pre-training video;
and training the multi-target tracking model by utilizing the pre-training video to obtain a plurality of initial tracking tracks in the monitoring scene.
Specifically, determining the tracking tracks of a plurality of targets in the real-time monitoring video to be tracked by using a Kalman filtering algorithm and a multi-target tracking model according to a target information group sequence, specifically comprising:
making the iteration number m equal to 1;
determining the initial tracking track as the tracking track of the 0 th iteration;
matching a plurality of targets in the mth target information group in the target information group sequence with a plurality of tracking tracks in the m-1 iteration by using a Hungarian algorithm and a cascade algorithm to obtain a first target-tracking track matching group and a first unmatched target group;
matching a plurality of targets in the first unmatched target group with a plurality of tracking tracks in the (m-1) th iteration by using an IoU matching algorithm to obtain a second target-tracking track matching group;
combining the first target-tracking track matching group and the second target-tracking track matching group into a total matching group;
updating the corresponding tracking tracks according to the coordinates of the targets in the total matching group to obtain a plurality of tracking tracks in the mth iteration;
performing real-time simulation display on a plurality of tracking tracks in the mth iteration by using a Kalman filtering algorithm;
and increasing the value of m by 1 and returning to the step of matching a plurality of targets in the mth target information group in the target information group sequence with a plurality of tracking tracks in the m-1 iteration by using a Hungarian algorithm and a cascade algorithm to obtain a first target-tracking track matching group and a first unmatched target group until the target information group sequence is traversed to obtain the tracking tracks of the plurality of targets in the time period of the real-time monitoring video to be tracked.
The method comprises the following steps of matching a plurality of targets in an mth target information group in a target information group sequence with a plurality of tracking tracks in an m-1 iteration by using a Hungarian algorithm and a cascade algorithm to obtain a first target-tracking track matching group and a first unmatched target group, and specifically comprises the following steps:
determining any target in the mth target information group as a current target;
determining any one of the plurality of tracking tracks in the m-1 iteration as a current tracking track;
determining the motion information value of the current target and the current tracking track according to the coordinates of the current target by using the formula

d^(1)(i, j) = (D_j - P_i)^T S_i^(-1) (D_j - P_i),

wherein d^(1)(i, j) is the motion information value of the ith tracking track and the jth target, D_j denotes the detection result of the jth target, P_i represents the ith tracking track, and S_i represents the covariance matrix between the average track position predicted by Kalman filtering and the detection position;
determining a first matching metric value of a current target and a current tracking track according to the motion information value;
determining the appearance information value of the current target and the current tracking track according to the appearance characteristics of the current target by using the formula

d^(2)(i, j) = min{ 1 - r_j^T r_k^(i) | r_k^(i) ∈ R_i },

wherein d^(2)(i, j) is the appearance information value of the ith tracking track and the jth target, r_j^T is the transpose of the appearance feature vector r_j of the jth target, r_k^(i) represents the kth feature on the ith tracking track, and R_i represents the feature set of the ith tracking track;
determining a second matching metric value of the current target and the current tracking track according to the appearance information value;
determining a total matching metric value from the first matching metric value and the second matching metric value by using the formula

b_{i,j} = ∏_{m=1}^{2} b_{i,j}^(m),

wherein b_{i,j} is the total matching metric value of the ith tracking track and the jth target, and b_{i,j}^(m) represents the mth matching metric value, m being 1 or 2;
judging whether the total matching metric value is 1 or not to obtain a first judgment result;
if the first judgment result is yes, adding the current target and the current tracking track into a first target-tracking track matching group;
if the first judgment result is negative, adding the current target into the first unmatched target group;
and updating the current tracking track and returning to the step of determining the motion information value of the current target and the current tracking track according to the coordinates of the current target by using the formula d^(1)(i, j) = (D_j - P_i)^T S_i^(-1) (D_j - P_i) until the plurality of tracking tracks in the (m-1)th iteration are traversed; then updating the current target and returning to the step of determining any one of the plurality of tracking tracks in the (m-1)th iteration as the current tracking track until the mth target information group is traversed, so as to obtain the first target-tracking track matching group and the first unmatched target group.
Specifically, determining a first matching metric value of the current target and the current tracking track according to the motion information value specifically includes:
judging whether the motion information value is larger than a motion information threshold value or not to obtain a second judgment result;
if the second judgment result is negative, the first matching metric value is made to be 1;
if the second determination result is yes, the first matching metric value is set to 0.
Further, determining a second matching metric value of the current target and the current tracking track according to the appearance information value specifically includes:
judging whether the appearance information value is larger than an appearance information threshold value or not to obtain a third judgment result;
if the third judgment result is negative, the second matching metric value is set to 1;
if the third judgment result is yes, the second matching metric value is set to 0.
Referring to fig. 1, a multi-target tracking method for surveillance videos provided by the present invention is specifically described by taking an airport scene surveillance video as an example, and the present invention includes the following steps:
s100, constructing an airport scene multifunctional data set:
The ASMD data set is constructed to include a training set and a test set for training and evaluating the YOLOv4-TS target detection algorithm, a test set for evaluating the improved DeepSORT multi-target tracking algorithm, and a training set for training the improved Re-ID network. As shown in fig. 2, step S100 may include:
s101, data acquisition:
Specifically, a high-definition camera is used to shoot surveillance video of the airport surface, and, where no classified material is involved, some airport surface cameras are accessed remotely through the camera interface to obtain raw image information.
S102, constructing a target detection data set:
Specifically, a picture is captured every 500 ms, each screenshot is labeled using the LabelImg labeling tool, and a training set and a test set for training and evaluating the YOLOv4-TS target detection algorithm are constructed.
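As an illustration of this sampling step, the sketch below grabs roughly one frame every 500 ms with OpenCV; the output path and file naming are assumptions.

```python
# Sketch: capture one still every 500 ms from a surveillance video for the
# detection data set (output directory and file names are illustrative only).
import cv2

def sample_frames(video_path, out_dir, interval_ms=500):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(round(fps * interval_ms / 1000.0)))  # frames per 500 ms
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:06d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```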
S103, constructing a multi-target tracking data set:
specifically, each frame of the video is labeled by using a Darklabel tool, and a test set for evaluating the improved DeepSORT multi-target tracking algorithm is constructed.
S104, establishing a Re-ID data set:
specifically, each label picture is segmented from an original image according to the positions of different types of target detection frames by using the constructed data set, and a training set for training an improved Re-ID network is constructed.
S110, constructing an airport scene target detection model based on YOLOv 4-TS:
Specifically, the YOLOv4-TS airport surface target detection model is trained on the ASMD data set; given an input airport surface monitoring image, it outputs the detection box and coordinate information of each target. As shown in fig. 3, step S110 includes:
s111, feature extraction:
Specifically, a CSPDarknet network structure is adopted as the feature extraction network to extract features from the input surface monitoring image. The network has 53 convolution layers, i.e. CSPDarknet53 is used as the feature extraction network, and it is initialized with weights pre-trained on the MS-COCO data set for transfer learning.
S112, multi-scale feature information acquisition:
Specifically, multi-scale feature information is obtained using an SPP network. The SPP (spatial pyramid pooling) module obtains multi-scale feature information and is centered on three max-pooling layers with kernels of different sizes. With the stride of each max pooling denoted s and the kernel size denoted d, the output feature map size Y_size can be expressed by equation (1):

Y_size = ⌊(X_size - d + 2p) / s⌋ + 1 (1)

where ⌊·⌋ is the rounding-down function, X_size is the input feature map size, the stride s of the SPP module in the YOLOv4 network is 1, and p (pad) is the padding of the image, i.e. a number of pixels filled in around the periphery of the image. The padding size depends on the kernel size d and is calculated as shown in formula (2):

p = (d - 1) / 2 (2)
In fig. 4, CBL is a convolution block composed of three network layers, Conv (convolution layer), Batch Normalization (batch normalization layer) and LeakyReLU (activation function), and CBLn denotes n CBL modules connected in series; SPP represents the spatial pyramid pooling layer; Up sample and Down sample represent upsampling and downsampling, respectively. As shown in figs. 4-5, calculation shows that the input X_size and the output Y_size are equal, i.e. after processing by the SPP module the output feature map has the same size as the original feature map. The lower-level feature maps of a CNN have higher resolution and therefore contain more detailed information, while the higher-level feature maps have larger receptive fields and richer semantic information. The SPP module integrates features at more scales, absorbing the advantages of both low-level and high-level feature maps and obtaining more abstract information, thereby enlarging the range over which feature information is obtained and improving the prediction performance of the model. The invention adds SPP modules at the 38×38 and 76×76 positions of the neck network, in addition to the SPP module originally located behind the 19×19 feature map, which strengthens the detection of medium-scale and small-scale targets and better adapts to the complex environment of multi-scale targets present in camera scenes. Since the number of SPP modules in the original YOLOv4 network is increased from one to three, the improved network is named "YOLOv4-Triple SPP", abbreviated "YOLOv4-TS".
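A generic sketch of one such SPP block is shown below (PyTorch, illustrative only): with stride 1 and padding (d-1)/2 for kernels 5, 9 and 13, each pooled map keeps the input size, so the branches can be concatenated, consistent with equations (1)-(2). It is not the patented YOLOv4-TS network itself.

```python
# Sketch of a spatial pyramid pooling (SPP) block with kernels 5/9/13, stride 1
# and padding (d-1)//2, so every pooled map keeps the input spatial size and the
# four branches can be concatenated along the channel axis.
import torch
import torch.nn as nn

class SPP(nn.Module):
    def __init__(self, kernels=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=d, stride=1, padding=(d - 1) // 2)
             for d in kernels]
        )

    def forward(self, x):
        # output size Y = floor((X - d + 2p)/s) + 1 = X when s = 1, p = (d-1)/2
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)
```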
S113, multi-scale feature fusion:
in fig. 6, Class indicates Class, box indicates detection box, and mask indicates image mask, which is generally represented by a two-dimensional matrix array, as shown in fig. 6. Specifically, a PANet network is used to fuse different scale features. The method comprises the following steps: first, in the first part, with reference to the FPN structure, feature maps with the same spatial size are generated at the same network stage, using { P }2,P3,P4,P5Denotes generating feature levels for optimizing propagation paths. Second, the second part uses { N } for heuristic of ResNet architecture2,N3,N4,N5Denotes { P }2,P3,P4,P5Corresponding to the generated feature mapping, expanding the path from the N of the lowest level2At first, gradually increase to N5. Finally, each module uses the unprocessed profile PiAnd high resolution feature map NiAnd generating a new characteristic diagram in a transverse connection mode. The self-adaptive feature pool in the third part of the PANet plays the role of fusing single-layer features into multi-layer features, and the feature fusion is utilized to enable the network to have self-adaptive capacity, thereby providing powerful support for the architecture of bottom-up path expansion. The (fourth) part of the PANet is to classify and regress the feature layers fused from the (third) part. The fusion of the full connection layer is positioned in the fifth part of the PANet, the full connection layer mainly bears the work of semantic segmentation, plays the role of predicting and generating Mask, and the two branches fuse and generate Mask to finally obtain a prediction result.
S114, YOLOv4-TS network training:
specifically, training a network by using a data set of a YOLOv4-TS target detection algorithm, adjusting network hyper-parameters, and selecting an optimal network model;
s115, YOLOv4-TS target detection:
specifically, the airport scene monitoring image is sent to a YOLOv4-TS network model for multi-class target detection;
s120, constructing an improved DeepSORT-based multi-target tracking model;
Specifically, the improved deep appearance model is trained on the ASMD data set, improving the re-identification effect of the original network; then the DeepSORT multi-target tracking algorithm is used, with the detections obtained in step S110 as input, to obtain the tracking trajectories of multiple targets in the video. As shown in fig. 7, step S120 includes:
s121, Kalman filtering state prediction:
Specifically, a Kalman filtering algorithm is used to predict the state of multiple surface targets. The state prediction equation and covariance matrix equation involved in the prediction step are shown in equations (3) and (4), and the gain equation, updated optimal state equation and optimal estimated covariance matrix equation involved in the update step are shown in equations (5), (6) and (7). Here, X_{k,k} and X_{k-1,k-1} respectively represent the state vectors at times k and k-1, and X_{k,k-1} represents the state vector predicted from time k-1 to time k; P_{k,k} and P_{k-1,k-1} respectively represent the covariance matrices at times k and k-1, and P_{k,k-1} represents the covariance matrix predicted from time k-1 to time k; Z_k represents the observation vector, A represents the state transition matrix from time k-1 to k, B and U_k respectively represent the input gain matrix and the input vector, H represents the observation matrix, and the covariance matrices of the system noise and observation noise are set to Q and R, where Q and R are not affected by the system state.
X_{k,k-1} = A X_{k-1,k-1} + B U_k (3)
P_{k,k-1} = A P_{k-1,k-1} A^T + Q (4)
K_k = P_{k,k-1} H^T (H P_{k,k-1} H^T + R)^{-1} (5)
X_{k,k} = X_{k,k-1} + K_k (Z_k - H X_{k,k-1}) (6)
P_{k,k} = P_{k,k-1} - K_k H P_{k,k-1} (7)
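For reference, equations (3)-(7) translate directly into the following NumPy sketch (matrix shapes and variable names are assumptions):

```python
# Sketch of the Kalman prediction/update equations (3)-(7) above in NumPy.
import numpy as np

def kalman_predict(x, P, A, B, u, Q):
    x_pred = A @ x + B @ u                      # eq. (3) state prediction
    P_pred = A @ P @ A.T + Q                    # eq. (4) covariance prediction
    return x_pred, P_pred

def kalman_update(x_pred, P_pred, z, H, R):
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)         # eq. (5) Kalman gain
    x = x_pred + K @ (z - H @ x_pred)           # eq. (6) optimal state update
    P = P_pred - K @ H @ P_pred                 # eq. (7) covariance update
    return x, P
```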
S122, Hungarian matching:
Specifically, the detections and the predicted trajectories are further matched using the Hungarian algorithm. The matching combines a motion-information similarity based on the Mahalanobis distance with an appearance similarity based on the improved deep appearance model, and fuses the two for data association. The specific steps are as follows:
When there are uncertainty factors in the motion state of a target, the Mahalanobis distance can be used to express the motion information between the prediction box and the detection box; the calculation is shown in formula (8):

d^(1)(i, j) = (D_j - P_i)^T S_i^(-1) (D_j - P_i) (8)

where d^(1)(i, j) represents the motion matching information of the ith prediction box and the jth detection box, P_i denotes the prediction result of the ith prediction box, D_j denotes the detection result of the jth detection box, and S_i represents the covariance matrix between the average track position obtained by Kalman filtering prediction and the detection position. Furthermore, to ensure that invalid associations are filtered out, the Mahalanobis distance is thresholded with the 95% confidence interval computed from the chi-squared distribution, as shown in function (9):

b_{i,j}^(1) = 1[ d^(1)(i, j) ≤ t^(1) ] (9)

For the four-dimensional prediction space (x, y, w, h), the threshold is t^(1) = 9.4877; when the prediction box is successfully associated with the detection box, the result is 1. Here (x, y) represents the coordinates of the center point of the target, w represents the aspect ratio of the target's bounding box, and h represents the height of the bounding box.
Since the original feature network was trained on a pedestrian data set, its feature extraction objects are mainly pedestrians and its input size is only 128 × 64; the original feature network is shown in table 1. The ASMD data set constructed by the invention contains large-size targets such as vehicles and airplanes, and the input size of the original algorithm is a limitation for such targets. Therefore, the invention improves on the original network model: the input convolution layers and residual blocks are enlarged, the network input size is expanded to 128 × 128, the residual network is deepened, and the enlarged network is then dimension-reduced to keep the training model converging quickly. The improved network structure is shown in table 2.
Table 1: feature extraction network architecture before improvement (given as an image in the original document)
Table 2: improved feature extraction network architecture (given as an image in the original document)
The added deep appearance information is realized as follows. First, for each detection box D_j the corresponding appearance feature r_j is computed, subject to the constraint ||r_j|| = 1. Second, a feature set R_i is established for tracking track i; each time an association succeeds, the appearance feature is recorded into R_i, which keeps the most recent one hundred features. Finally, a minimum cosine distance is introduced to express the deep appearance information between prediction box P_i and detection box D_j. The appearance information between the prediction box and the detection box, i.e. the cosine distance, is computed with the improved network model as shown in expression (10):

d^(2)(i, j) = min{ 1 - r_j^T r_k^(i) | r_k^(i) ∈ R_i } (10)

In addition, by analogy with equation (8), d^(2)(i, j) is also thresholded, as shown in function (11); the threshold t^(2) is obtained by training the improved network model:

b_{i,j}^(2) = 1[ d^(2)(i, j) ≤ t^(2) ] (11)
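A minimal sketch of this appearance branch, assuming a per-track gallery capped at one hundred unit-norm features, is:

```python
# Sketch of the appearance branch: keep the last 100 unit-norm appearance
# features per track and score a detection by the minimum cosine distance (10).
import numpy as np
from collections import deque

class AppearanceGallery:
    def __init__(self, budget=100):
        self.features = deque(maxlen=budget)      # feature set R_i of track i

    def add(self, r):
        r = np.asarray(r, dtype=float)
        self.features.append(r / np.linalg.norm(r))   # constraint ||r_j|| = 1

    def cosine_distance(self, r_j):
        if not self.features:
            return 1.0                             # no gallery yet: maximal distance
        r_j = np.asarray(r_j, dtype=float)
        r_j = r_j / np.linalg.norm(r_j)
        # d2(i, j) = min over k of (1 - r_j^T r_k)
        return min(1.0 - float(r_j @ r_k) for r_k in self.features)
```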
S123, cascade matching:
fused data association model ci,jThe expression is shown in equation (12). Wherein, λ represents the regulation and control parameter of the model, and the value interval is [0,1 ]]For example, under the condition that the target is blocked for a long time or the motion amplitude of the camera is large, the Kalman filtering prediction effect is extremely poor and has no reference value, and at the moment, the fusion model gives up the predicted motion information, namely, the lambda is made to be 0. In addition, d with spatial position matching is added(3)(i, j) which functions as the matching metric bijIf the trace does not match the predetermined value, IoU is used to indicate that the trace is not matchedσMeasures to make up for this loss, matching measures bijThe expression is shown in formula (13).
Figure BDA0003536812200000161
Figure BDA0003536812200000162
Specifically, a cascade matching algorithm is used to match the detections with the predicted tracks. The cascade matching algorithm takes the predicted tracks and detection boxes as input, outputs a successfully matched set and an unmatched set by computing the matching metric, and iterates over the predicted tracks so that targets that appear frequently obtain matching priority, which effectively alleviates mismatching caused by probability dispersion. In addition, only prediction boxes that meet the requirements are selected for matching, which reduces the number of identity switches caused by long-term occlusion of a target and effectively improves the robustness of the algorithm.
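Combining the fused cost (12) with the gating metrics (13), one round of association could be sketched as follows; the use of scipy's linear_sum_assignment as the Hungarian step and the large cost assigned to gated-out pairs are implementation assumptions.

```python
# Sketch of fusing motion and appearance costs (12) and solving the assignment
# with the Hungarian algorithm; gate_matrix holds the binary metrics b_ij (13).
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(d_motion, d_appearance, gate_matrix, lam=0.5):
    # c_ij = lambda * d1(i, j) + (1 - lambda) * d2(i, j), lambda in [0, 1]
    cost = lam * d_motion + (1.0 - lam) * d_appearance
    cost = np.where(gate_matrix == 1, cost, 1e5)   # forbid gated-out pairs
    rows, cols = linear_sum_assignment(cost)
    matches = [(i, j) for i, j in zip(rows, cols) if gate_matrix[i, j] == 1]
    matched_js = {j for _, j in matches}
    unmatched_dets = [j for j in range(cost.shape[1]) if j not in matched_js]
    return matches, unmatched_dets
```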
S124, Hungarian-cascade joint matching algorithm:
the method comprises the following steps of inputting a predicted track set P ═ 1.. multidot.n, a detection frame set D ═ 1.. multidot.m, and a maximum continuous pause number A blockedmax
And outputting the successfully matched set M and the unsuccessfully matched set U.
Step 1, a data association model obtained by fusing motion information and appearance information, namely, a value obtained by calculation by using a formula (12) is given to a set Cm={ci,j};
And 2, fusing the matching measurement. That is, the value calculated by the formula (13) is given to the set Bm={bij};
Step 3, initializing an algorithm, and making the set M be an empty set;
step 4, initializing an unmatched detection frame set, and giving a detection frame set D to a set U;
step 5, starting from the predicted track matched to the target, traversing to AmaxThe predicted values of the unmatched targets and the set { x ] of successfully matched targets are obtainedi,j};
Step 6, updating the set, namely the { x to be obtainedi,jThe value of the sum is added to the set M, and the set U is removed from the successfully matched set xi,jAnd assigning the obtained difference set to the set U again, and completing a round of algorithm;
and 7, returning the M, U two sets as initial values to the step 5, and substituting the initial values into the next iteration.
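Steps 1-7 can be summarized by the following sketch, in which cost_fn stands for one Hungarian assignment over the fused costs of the selected tracks and the remaining detections (a hypothetical helper), and time_since_update is an assumed per-track age counter:

```python
# Sketch of the cascade in steps 1-7: tracks matched most recently get matching
# priority; "age" counts frames since a track last matched (names illustrative).
def cascade_match(tracks, detections, cost_fn, a_max):
    matches, unmatched = [], list(range(len(detections)))   # steps 3-4: M, U
    for age in range(a_max + 1):                  # step 5: traverse ages up to A_max
        if not unmatched:
            break
        candidates = [i for i, t in enumerate(tracks) if t.time_since_update == age]
        if not candidates:
            continue
        new_matches, unmatched = cost_fn(candidates, unmatched)  # steps 1-2 + Hungarian
        matches.extend(new_matches)               # step 6: update the sets M and U
    return matches, unmatched                     # step 7: feed the next iteration
```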
S125, IoU matching:
Specifically, IoU matching is performed on tracks in the unconfirmed state, unmatched tracks and unmatched detection boxes, as shown in equation (12), followed by the next iteration of the Hungarian algorithm.
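The IoU used for this supplementary matching is the standard intersection-over-union; a sketch with boxes given as (x1, y1, x2, y2) corner coordinates (an assumed format) is:

```python
# Sketch of the IoU used for the supplementary matching of unconfirmed tracks
# and leftover detections (boxes as (x1, y1, x2, y2)).
def iou(box_a, box_b):
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)
```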
S126, updating Kalman filtering parameters:
specifically, the parameters are updated and subsequently processed using Kalman filtering.
Aiming at the problem that existing airport surface monitoring equipment does not fully meet the requirement of accurate perception of surface targets, the multi-target tracking method and system based on airport surface surveillance video provided by the invention address the high workload and low efficiency of the visual monitoring methods currently used on the airport surface, use a deep learning network to perform fast multi-target tracking of the airport surface, improve the intelligence level of the surface video monitoring system to a certain extent, and reduce the dependence of surface monitoring on manual interpretation.
In addition, the invention also provides a multi-target tracking system of the monitoring video, which comprises the following steps:
the to-be-tracked real-time monitoring video acquisition module is used for acquiring a to-be-tracked real-time monitoring video;
the framing module is used for framing the real-time monitoring video to be tracked to obtain a video frame sequence to be tracked;
the target information group sequence determining module is used for inputting the video frame sequence to be tracked into the multi-target recognition model to obtain a target information group sequence; any target information group in the target information group sequence comprises the coordinates and the types of all targets in the same video frame to be tracked; the multi-target recognition model is obtained by training a YOLOv4-TS neural network by using a historical monitoring video; the YOLOv4-TS neural network is obtained by adding a spatial pyramid pooling module in the YOLOv4 neural network;
the tracking track determining module is used for determining the tracking tracks of a plurality of targets in the real-time monitoring video to be tracked by utilizing a Kalman filtering algorithm and a multi-target tracking model according to the target information group sequence; the multi-target tracking model is obtained by training a DeepSORT neural network by using historical monitoring videos.
The historical monitoring video acquisition module is used for acquiring historical monitoring videos;
the first historical monitoring video frame sequence extraction module is used for extracting a plurality of historical monitoring video frames from the historical monitoring video according to preset time length to serve as a first historical monitoring video frame sequence;
the annotated historical monitoring video frame sequence determining module is used for marking target information in each historical monitoring video frame in the first historical monitoring video frame sequence to obtain an annotated historical monitoring video frame sequence;
the historical target information group sequence determining module is used for determining a historical target information group sequence for marking a historical monitoring video frame sequence;
and the multi-target identification model determining module is used for training the YOLOv4-TS neural network by taking the marked historical monitoring video frame sequence as input and the historical target information group sequence as expected output to obtain the multi-target identification model.
As shown in fig. 8, taking an airport surface surveillance video as an example, the multi-target tracking system for surveillance video provided by the invention includes the following modules: a data acquisition module, which reads surveillance videos or pictures of the ASMD data set from the local disk; a YOLOv4-TS-based airport surface target detection module, which trains the YOLOv4-TS airport surface target detection model on the ASMD data set and, given an input airport surface monitoring image, obtains the detection box and coordinate information of each target; and an improved-DeepSORT-based multi-target tracking module, which trains the improved deep appearance model on the ASMD data set to improve the re-identification effect of the original network, and then uses the DeepSORT multi-target tracking algorithm, with the detections obtained by the target detection module as input, to obtain the tracking trajectories of multiple targets in the video.
Further, as shown in fig. 9, the data acquisition module includes: the data set unit of the target detection algorithm is used for training and evaluating the YOLOv4-TS target detection algorithm; the multi-target tracking data set unit is used for evaluating an improved DeepSORT multi-target tracking algorithm; and a Re-ID data set unit trains the improved Re-ID network.
Further, as shown in fig. 10, the YOLOv4-TS based airport surface target detection module includes: the CSPDarknet53 feature extraction network unit is used for extracting features of an input scene monitoring image; the spatial pyramid pooling unit is used for acquiring multi-scale feature information; and the path aggregation network unit is used for fusing different scale characteristics.
As shown in fig. 11, the improved DeepSORT-based multi-target tracking module includes: a Kalman filtering model building unit, used to predict motion tracks on the airport surface; a Hungarian matching unit, used to match detections with predicted tracks using motion information and appearance information; a cascade matching unit, used to further match detections with predicted tracks; an IoU matching unit, used to perform supplementary matching for unmatched values; and a Kalman filtering parameter updating unit, used to update the system state.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A multi-target tracking method for surveillance videos is characterized by comprising the following steps:
acquiring a real-time monitoring video to be tracked;
performing framing processing on a real-time monitoring video to be tracked to obtain a video frame sequence to be tracked;
inputting the video frame sequence to be tracked into a multi-target recognition model to obtain a target information group sequence; any target information group in the target information group sequence comprises the coordinates and the types of all targets in the same video frame to be tracked; the multi-target recognition model is obtained by training a YOLOv4-TS neural network by using a historical monitoring video; the YOLOv4-TS neural network is obtained by adding a spatial pyramid pooling module in a YOLOv4 neural network;
determining the tracking tracks of a plurality of targets in the real-time monitoring video to be tracked by using a Kalman filtering algorithm and a multi-target tracking model according to the target information group sequence; the multi-target tracking model is obtained by training a DeepSORT neural network by using historical monitoring videos.
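As a non-limiting illustration of the framing step of claim 1 (and of the interval-based frame extraction recited in claim 2 below), the following Python sketch splits a surveillance video into a frame sequence using OpenCV; the file name and the sampling stride are placeholders, not values recited in the claims.

```python
import cv2

def frame_sequence(video_path, stride=1):
    """Split a surveillance video into a frame sequence,
    keeping every `stride`-th frame."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % stride == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

frames = frame_sequence("surveillance_to_track.mp4", stride=1)
print(f"{len(frames)} frames extracted")
```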
2. The multi-target tracking method for the surveillance videos as claimed in claim 1, further comprising, before the obtaining of the real-time surveillance video to be tracked:
acquiring a historical monitoring video;
extracting a plurality of historical monitoring video frames from the historical monitoring video according to preset time length to serve as a first historical monitoring video frame sequence;
marking target information in each historical monitoring video frame in a first historical monitoring video frame sequence to obtain a marked historical monitoring video frame sequence;
determining a historical target information group sequence of an annotated historical monitoring video frame sequence;
and training the YOLOv4-TS neural network by taking the marked historical monitoring video frame sequence as input and the historical target information group sequence as expected output to obtain the multi-target recognition model.
3. The multi-target tracking method for the surveillance videos as claimed in claim 2, further comprising, before the obtaining of the real-time surveillance video to be tracked:
enlarging the convolution layer sizes, the residual modules and the network output size of the DeepSORT neural network to obtain an enlarged DeepSORT neural network;
carrying out dimensionality reduction on the network structure of the enlarged DeepSORT neural network to obtain an improved DeepSORT neural network;
performing framing processing on the historical monitoring video to obtain a plurality of historical monitoring video frames serving as a second historical monitoring video frame sequence;
carrying out related labeling on the same target information in the second historical monitoring video frame sequence by using a Darklabel tool to obtain historical tracking tracks of a plurality of targets;
and taking a plurality of historical target information group sequences as input, taking the historical tracking tracks of the plurality of targets as expected output, and training the improved DeepSORT neural network to obtain the multi-target tracking model.
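Claim 3 enlarges the convolution layers, residual modules and output size of the DeepSORT appearance (Re-ID) network and then reduces the dimensionality of the network structure, but does not fix concrete layer sizes. The PyTorch sketch below therefore only illustrates the general shape such a network could take; every channel count and the 256-dimensional embedding are assumptions of this illustration, not values from the claim.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))

    def forward(self, x):
        return torch.relu(x + self.body(x))

class ReIDNet(nn.Module):
    """Sketch of an enlarged appearance-embedding network for DeepSORT."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            ResidualBlock(64),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            ResidualBlock(128),
            nn.AdaptiveAvgPool2d(1))
        self.embed = nn.Linear(128, embed_dim)   # dimensionality reduction to the embedding

    def forward(self, x):
        f = self.backbone(x).flatten(1)
        # Unit-norm appearance feature, ready for cosine comparison in matching.
        return nn.functional.normalize(self.embed(f), dim=1)

crop = torch.randn(4, 3, 128, 64)     # batch of target crops (e.g. 128x64 patches)
print(ReIDNet()(crop).shape)          # torch.Size([4, 256])
```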
4. The multi-target tracking method for the surveillance videos as claimed in claim 1, further comprising, before obtaining the real-time surveillance video to be tracked:
acquiring a historical monitoring video which is the same as the monitoring scene of the real-time monitoring video to be tracked as a pre-training video;
and training the multi-target tracking model by using the pre-training video to obtain a plurality of initial tracking tracks in a monitoring scene.
5. The multi-target tracking method for the surveillance video according to claim 4, wherein the determining, according to the target information group sequence, the tracking trajectories of the multiple targets in the real-time surveillance video to be tracked by using a Kalman filtering algorithm and a multi-target tracking model specifically comprises:
making the iteration number m equal to 1;
determining the plurality of initial tracking tracks as the tracking tracks of the 0th iteration;
matching a plurality of targets in the mth target information group in the target information group sequence with a plurality of tracking tracks in the m-1 iteration by using a Hungarian algorithm and a cascade algorithm to obtain a first target-tracking track matching group and a first unmatched target group;
matching a plurality of targets in the first unmatched target group with a plurality of tracking tracks in the (m-1) th iteration by using an IoU matching algorithm to obtain a second target-tracking track matching group;
combining the first target-tracking track matching group and the second target-tracking track matching group into a total matching group;
updating the corresponding tracking tracks according to the coordinates of the targets in the total matching group to obtain a plurality of tracking tracks during the mth iteration;
performing real-time simulation display on a plurality of tracking tracks in the mth iteration by using a Kalman filtering algorithm;
and increasing the value of m by 1 and returning to the step of matching a plurality of targets in the mth target information group in the target information group sequence with a plurality of tracking tracks in the m-1 iteration by using a Hungarian algorithm and a cascade algorithm to obtain a first target-tracking track matching group and a first unmatched target group until the target information group sequence is traversed to obtain the tracking tracks of the plurality of targets in the time period of the real-time monitoring video to be tracked.
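The two matching passes of claim 5 are assignment problems. The sketch below illustrates one possible realisation using SciPy's Hungarian solver (`linear_sum_assignment`) for the first pass and a plain IoU cost for the supplementary pass; the cost matrix, box coordinates and thresholds are made-up toy values, and the cascade's ordering of tracks by age is omitted.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def hungarian_match(cost, max_cost):
    """Solve the assignment problem; pairs costlier than max_cost stay unmatched."""
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
    unmatched_tracks = sorted(set(range(cost.shape[0])) - {r for r, _ in matches})
    unmatched_dets = sorted(set(range(cost.shape[1])) - {c for _, c in matches})
    return matches, unmatched_tracks, unmatched_dets

# Toy data: 3 predicted tracks and 4 detections (boxes are [x1, y1, x2, y2]).
tracks = np.array([[0, 0, 50, 80], [100, 100, 160, 180], [200, 50, 240, 120]], float)
dets = np.array([[2, 3, 52, 82], [98, 102, 158, 178],
                 [300, 300, 340, 360], [205, 55, 245, 125]], float)

# First pass: combined motion/appearance cost (a random stand-in here).
cost = np.random.rand(len(tracks), len(dets))
first, un_tracks, un_dets = hungarian_match(cost, max_cost=0.7)

# Second pass: IoU matching of whatever remained unmatched.
second = []
if un_tracks and un_dets:
    iou_cost = 1.0 - np.array([[iou(tracks[r], dets[c]) for c in un_dets]
                               for r in un_tracks])
    pairs, _, _ = hungarian_match(iou_cost, max_cost=0.5)
    # Map the local indices back to the original track/detection indices.
    second = [(un_tracks[r], un_dets[c]) for r, c in pairs]

print("first pass:", first, "second pass:", second)
```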
6. The multi-target tracking method for the surveillance video according to claim 5, wherein the matching of the multiple targets in the mth target information group in the target information group sequence with the multiple tracking tracks in the (m-1) th iteration by using the Hungarian algorithm and the cascade algorithm to obtain a first target-tracking track matching group and a first unmatched target group specifically comprises:
determining any target in the mth target information group as a current target;
determining any one of the plurality of tracking tracks in the m-1 iteration as a current tracking track;
according to the coordinates of the current target, determining a motion information value of the current target and the current tracking track by using the formula
d^(1)(i, j) = (D_j - P_i)^T S_i^(-1) (D_j - P_i);
wherein d^(1)(i, j) is the motion information value of the ith tracking track and the jth target, D_j denotes the detection result of the jth target, P_i represents the ith tracking track, and S_i represents the covariance matrix between the average track position predicted by Kalman filtering and the detection position;
determining a first matching metric value of the current target and the current tracking track according to the motion information value;
according to the appearance characteristics of the current target, determining an appearance information value of the current target and the current tracking track by using the formula
d^(2)(i, j) = min{ 1 - r_j^T r_k^(i) : r_k^(i) ∈ R_i };
wherein d^(2)(i, j) is the appearance information value of the ith tracking track and the jth target, r_j^T is the transpose of the appearance feature vector r_j of the jth target, r_k^(i) represents the kth appearance feature stored for the ith tracking track, and R_i represents the feature set of the ith tracking track;
determining a second matching metric value of the current target and the current tracking track according to the appearance information value;
according to the first matching metric value and the second matching metric value, determining a total matching metric value by using the formula
b_ij = ∏_{m=1}^{2} b_ij^(m);
wherein b_ij is the total matching metric value of the ith tracking track and the jth target, and b_ij^(m) represents the mth matching metric value, m being 1 or 2;
judging whether the total matching metric value is 1 or not to obtain a first judgment result;
if the first judgment result is yes, adding the current target and the current tracking track into a first target-tracking track matching group;
if the first judgment result is negative, adding the current target into a first unmatched target group;
updating the current tracking track and returning to the step of determining, according to the coordinates of the current target, the motion information value of the current target and the current tracking track by using the formula d^(1)(i, j) = (D_j - P_i)^T S_i^(-1) (D_j - P_i), until the plurality of tracking tracks in the (m-1)th iteration are traversed; and updating the current target and returning to the step of determining any one of the plurality of tracking tracks in the (m-1)th iteration as the current tracking track, until the mth target information group is traversed, to obtain the first target-tracking track matching group and the first unmatched target group.
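A compact numerical illustration of the quantities in claim 6, together with the threshold tests of claims 7 and 8 below, is sketched here. The covariance, feature vectors and thresholds are toy values chosen for the example; only the form of the motion distance d^(1), the appearance distance d^(2) and the gate b_ij follows the claim.

```python
import numpy as np

def motion_distance(det_xy, track_xy, S):
    """Mahalanobis motion distance d1(i, j) between detection D_j and predicted track P_i."""
    d = np.asarray(det_xy) - np.asarray(track_xy)
    return float(d @ np.linalg.inv(S) @ d)

def appearance_distance(det_feature, track_features):
    """Cosine appearance distance d2(i, j): smallest 1 - r_j^T r_k over the feature
    gallery R_i stored for track i (features assumed unit-normalised)."""
    return float(min(1.0 - det_feature @ r for r in track_features))

def gate(d1, d2, t1=5.9915, t2=0.2):
    """b_ij = b1 * b2: a pair is admissible only if both metrics pass their thresholds.
    t1 = 5.9915 is the 95% chi-square quantile for the 2-D toy measurement used here
    (DeepSORT uses 9.4877 for its 4-D measurement); t2 is an assumed appearance threshold."""
    return int(d1 <= t1) * int(d2 <= t2)

S = np.array([[4.0, 0.0], [0.0, 4.0]])                 # innovation covariance from the Kalman filter
d1 = motion_distance([102.0, 203.0], [100.0, 200.0], S)
feat = np.array([1.0, 0.0])
gallery = [np.array([0.98, 0.2]) / np.linalg.norm([0.98, 0.2])]
d2 = appearance_distance(feat, gallery)
print(d1, d2, gate(d1, d2))   # both distances pass their thresholds, so the gate is 1
```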
7. The multi-target tracking method for the surveillance video according to claim 6, wherein the determining a first matching metric value of the current target and the current tracking trajectory according to the motion information value specifically comprises:
judging whether the motion information value is larger than a motion information threshold value or not to obtain a second judgment result;
if the second judgment result is negative, the first matching metric value is made to be 1;
if the second determination result is yes, the first matching metric value is set to 0.
8. The multi-target tracking method for the surveillance video according to claim 6, wherein the determining a second matching metric value of the current target and the current tracking trajectory according to the appearance information value specifically includes:
judging whether the appearance information value is larger than an appearance information threshold value or not to obtain a third judgment result;
if the third judgment result is negative, the second matching metric value is made to be 1;
if the third determination result is yes, the second matching metric value is set to 0.
9. A multi-target tracking system for surveillance video, the system comprising:
the to-be-tracked real-time monitoring video acquisition module is used for acquiring a to-be-tracked real-time monitoring video;
the framing module is used for framing the real-time monitoring video to be tracked to obtain a video frame sequence to be tracked;
the target information group sequence determining module is used for inputting the video frame sequence to be tracked into the multi-target recognition model to obtain a target information group sequence; any target information group in the target information group sequence comprises the coordinates and the types of all targets in the same video frame to be tracked; the multi-target recognition model is obtained by training a YOLOv4-TS neural network by using a historical monitoring video; the YOLOv4-TS neural network is obtained by adding a spatial pyramid pooling module in a YOLOv4 neural network;
the tracking track determining module is used for determining the tracking tracks of a plurality of targets in the real-time monitoring video to be tracked by utilizing a Kalman filtering algorithm and a multi-target tracking model according to the target information group sequence; the multi-target tracking model is obtained by training a DeepSORT neural network by using historical monitoring videos.
10. The multi-target tracking system for surveillance videos of claim 9, further comprising:
the historical monitoring video acquisition module is used for acquiring historical monitoring videos;
the first historical monitoring video frame sequence extraction module is used for extracting a plurality of historical monitoring video frames from the historical monitoring video according to preset time length to serve as a first historical monitoring video frame sequence;
the marked historical monitoring video frame sequence determining module is used for marking target information in each historical monitoring video frame in the first historical monitoring video frame sequence to obtain a marked historical monitoring video frame sequence;
the historical target information group sequence determining module is used for determining a historical target information group sequence of the marked historical monitoring video frame sequence;
and the multi-target recognition model determining module is used for training the YOLOv4-TS neural network by taking the marked historical monitoring video frame sequence as input and the historical target information group sequence as expected output, to obtain the multi-target recognition model.
CN202210220010.0A 2022-03-08 2022-03-08 Multi-target tracking method and system for monitoring video Pending CN114596340A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210220010.0A CN114596340A (en) 2022-03-08 2022-03-08 Multi-target tracking method and system for monitoring video

Publications (1)

Publication Number Publication Date
CN114596340A true CN114596340A (en) 2022-06-07

Family

ID=81808080

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210220010.0A Pending CN114596340A (en) 2022-03-08 2022-03-08 Multi-target tracking method and system for monitoring video

Country Status (1)

Country Link
CN (1) CN114596340A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113158983A (en) * 2021-05-18 2021-07-23 南京航空航天大学 Airport scene activity behavior recognition method based on infrared video sequence image
CN113269098A (en) * 2021-05-27 2021-08-17 中国人民解放军军事科学院国防科技创新研究院 Multi-target tracking positioning and motion state estimation method based on unmanned aerial vehicle
CN113420607A (en) * 2021-05-31 2021-09-21 西南电子技术研究所(中国电子科技集团公司第十研究所) Multi-scale target detection and identification method for unmanned aerial vehicle

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ZHOU Jiaqi et al.: "Multi-target tracking algorithm for UAVs based on fused data association", Ship Electronic Engineering, pages 48-54 *
JIANG Rongqi et al.: "Improved YOLOv4 small-target detection algorithm with embedded scSE modules", Journal of Graphics, page 2 *
ZHAO Duoduo; ZHANG Jianwu; FU Jianfeng: "Research on a real-time pedestrian flow counting method based on deep learning", Chinese Journal of Sensors and Actuators, no. 08 *
HUANG Yufu et al.: "Research on a fruit image recognition algorithm based on multi-scale feature fusion", Journal of Changchun University of Science and Technology (Natural Science Edition), page 1 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115880620A (en) * 2022-09-13 2023-03-31 中信重工开诚智能装备有限公司 Personnel counting method applied to cart early warning system
CN115880620B (en) * 2022-09-13 2023-11-07 中信重工开诚智能装备有限公司 Personnel counting method applied to cart early warning system
CN115464659A (en) * 2022-10-05 2022-12-13 哈尔滨理工大学 Mechanical arm grabbing control method based on deep reinforcement learning DDPG algorithm of visual information
CN115464659B (en) * 2022-10-05 2023-10-24 哈尔滨理工大学 Mechanical arm grabbing control method based on visual information deep reinforcement learning DDPG algorithm
CN116403162A (en) * 2023-04-11 2023-07-07 南京航空航天大学 Airport scene target behavior recognition method and system and electronic equipment
CN116403162B (en) * 2023-04-11 2023-10-27 南京航空航天大学 Airport scene target behavior recognition method and system and electronic equipment
CN118429948A (en) * 2024-07-05 2024-08-02 广州国交润万交通信息有限公司 Traffic signal lamp monitoring method and device based on video

Similar Documents

Publication Publication Date Title
CN110059558B (en) Orchard obstacle real-time detection method based on improved SSD network
CN109559320B (en) Method and system for realizing visual SLAM semantic mapping function based on hole convolution deep neural network
CN110660082B (en) Target tracking method based on graph convolution and trajectory convolution network learning
CN114596340A (en) Multi-target tracking method and system for monitoring video
CN109800689B (en) Target tracking method based on space-time feature fusion learning
CN111709416B (en) License plate positioning method, device, system and storage medium
JP6650657B2 (en) Method and system for tracking moving objects in video using fingerprints
CN111626128A (en) Improved YOLOv 3-based pedestrian detection method in orchard environment
CN112052802B (en) Machine vision-based front vehicle behavior recognition method
CN108022258B (en) Real-time multi-target tracking method based on single multi-frame detector and Kalman filtering
CN112101221A (en) Method for real-time detection and identification of traffic signal lamp
Javadi et al. Vehicle detection in aerial images based on 3D depth maps and deep neural networks
CN112991391A (en) Vehicle detection and tracking method based on radar signal and vision fusion
CN110532937B (en) Method for accurately identifying forward targets of train based on identification model and classification model
JP2014071902A5 (en)
CN105809716A (en) Superpixel and three-dimensional self-organizing background subtraction algorithm-combined foreground extraction method
CN110567324A (en) multi-target group threat degree prediction device and method based on DS evidence theory
CN116434088A (en) Lane line detection and lane auxiliary keeping method based on unmanned aerial vehicle aerial image
CN117949942B (en) Target tracking method and system based on fusion of radar data and video data
CN113689459B (en) Real-time tracking and mapping method based on GMM and YOLO under dynamic environment
CN114332444A (en) Complex starry sky background target identification method based on incremental drift clustering
Fan et al. Covered vehicle detection in autonomous driving based on faster rcnn
Jiang et al. Surveillance from above: A detection-and-prediction based multiple target tracking method on aerial videos
CN112069997A (en) Unmanned aerial vehicle autonomous landing target extraction method and device based on DenseHR-Net
CN106650814B (en) Outdoor road self-adaptive classifier generation method based on vehicle-mounted monocular vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220607