CN113744316A - Multi-target tracking method based on deep neural network - Google Patents

Multi-target tracking method based on deep neural network

Info

Publication number
CN113744316A
CN113744316A (application CN202111048838.4A)
Authority
CN
China
Prior art keywords
target
frame
video
target detection
original image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111048838.4A
Other languages
Chinese (zh)
Inventor
邢建川
蒋芷昕
孔渝峰
张栋
卢胜
陈洋
周春文
杨明兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202111048838.4A priority Critical patent/CN113744316A/en
Publication of CN113744316A publication Critical patent/CN113744316A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 - Image acquisition modality
    • G06T 2207/10016 - Video; Image sequence
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning
    • G06T 2207/20084 - Artificial neural networks [ANN]

Abstract

The invention discloses a multi-target tracking method based on a deep neural network, which comprises the following steps: collecting a video to be tested, preprocessing it, and extracting its original image frames; performing target detection on each original image frame, identifying the targets to be tracked, and obtaining the target detection frames of each original image frame; matching the target detection frames of two consecutive frames on the time axis, calculating the similarity of the targets to be tracked in the target detection frames, comparing the similarity of the targets in the two consecutive frames, and judging whether they are the same target to be tracked; if so, assigning an ID (identity) and outputting the tracking result; if not, performing matching and judgment again; and achieving continuous multi-target tracking of the video based on the IDs and the tracking results. The invention integrates motion features and appearance features into the computation of the loss matrix, improves the accuracy of next-frame target prediction, and reduces the ID-switch index, thereby truly achieving continuous tracking of targets.

Description

Multi-target tracking method based on deep neural network
Technical Field
The invention relates to the technical field of machine vision, in particular to a multi-target tracking method based on a deep neural network.
Background
Early image recognition and detection relied mainly on manually designed visual feature descriptors (such as color, shape and edges). However, such traditional hand-crafted methods depend on prior knowledge of existing data sets, are strongly limited, cover real-world objects only to a small degree, are insufficient to find salient objects in complex scenes or to delineate object boundaries accurately, and therefore struggle to reach satisfactory performance.
The multi-target tracking problem first appeared in radar signal detection. With deepening research in computer vision and the continuous improvement of target detection algorithms, detection-based multi-target tracking algorithms have also developed considerably. In addition, with the intensive research on deep learning and its vigorous development, deep neural networks trained on large amounts of data have been applied to object recognition and detection. Compared with traditional methods, replacing manual design with deep neural networks markedly improves robustness to the various interference factors encountered in image processing and has allowed object recognition and detection tasks to advance rapidly. However, although deep neural networks have improved the performance of target detection and recognition, they suffer from large parameter scales and complex computation and place high demands on computing and storage resources, which makes them difficult to deploy effectively and widely on resource-constrained mobile devices such as smartphones and vehicle-mounted equipment. In addition, when multi-target recognition and tracking is performed without integrating the appearance information of the tracked object into the association and matching stage, false detections and frequent ID (identity) switches easily occur when the tracked object is occluded, so continuous tracking of the object cannot truly be achieved.
In the future, detection-based multi-target tracking algorithms will continue to focus on accuracy and efficiency, aiming to improve precision markedly while maintaining running speed.
Disclosure of Invention
The invention aims to provide a multi-target tracking method based on a deep neural network that solves the problems in the prior art. While using the motion state information of a target, it additionally considers the apparent information of the target and fuses motion features and appearance features into the computation of the loss matrix, thereby improving the accuracy of next-frame target prediction, allowing the program to better cope with interference caused by target occlusion, and reducing the ID-switch index, so that continuous tracking of targets is truly achieved.
In order to achieve the purpose, the invention provides the following scheme: the invention provides a multi-target tracking method based on a deep neural network, which comprises the following steps:
collecting a video to be tested, preprocessing the video to be tested, and extracting an original image frame of the video to be tested;
carrying out target detection on each original image frame, identifying a target to be tracked, and acquiring a target detection frame of each original image frame;
matching the target detection frames in two consecutive frames of images on the time axis, calculating the similarity of the targets to be tracked in the target detection frames, comparing the similarity of the targets to be tracked in the two consecutive frames of images, and judging whether they are the same target to be tracked; if so, allocating an ID (identity) and outputting a tracking result; if not, matching and judging again;
and realizing continuous tracking of multiple targets in the video based on the ID and the tracking result.
Preferably, the preprocessing the video to be tested, and the extracting the original image frame of the video to be tested includes:
reading the video to be tested by adopting an OpenCV library; acquiring the frame rate and the total frame number of the video to be tested by a get method; and extracting the original image frame of the video to be tested frame by frame or frame skipping based on the frame rate and the total frame number combined with image frame acquisition requirements.
Preferably, the target detection frame is obtained by adopting a target detector, wherein the target detector is built by adopting a YOLO network.
Preferably, before matching the target detection frame in two consecutive images on the time axis, kalman filtering is performed on the target detection frame.
Preferably, the kalman filtering process includes:
in the moving process of the target to be tracked, calculating an initial predicted value of the target detection frame in the current original image frame based on the target detection frame in the previous original image frame, wherein the initial predicted value is a vector; and acquiring a true value of the target detection frame in the current original image frame, calculating a linear weighted value of the initial predicted value and the true value, and acquiring a position predicted value of the target detection frame in the current original image frame.
Preferably, matching the target detection frames in two consecutive frames of images on the time axis comprises:
calculating a geometric distance d^{(1)}(i,j) between the predicted value and the true value:
d^{(1)}(i,j) = (d_j - y_i)^T \Sigma^{-1} (d_j - y_i)
in the formula, y_i is the predicted value of the i-th target tracker and d_j is the true (detected) value of the j-th detection;
adopting a CNN network to extract the apparent information of the target detection frame, storing the apparent information as an apparent information matrix, and calculating the minimum cosine distance of the apparent information matrix to obtain an apparent distance d^{(2)}(i,j):
d^{(2)}(i,j) = \min\{ 1 - r_j^T r_k^{(i)} \mid r_k^{(i)} \in R_i \}
in the formula, r_j represents the j-th appearance vector extracted by the CNN network and r_k^{(i)} represents the k-th appearance vector in the i-th target tracker;
calculating a linear weighted value of the geometric distance and the apparent distance based on the geometric distance and the apparent distance of the two consecutive frames of images to obtain a loss matrix c_{i,j}:
c_{i,j} = \lambda d^{(1)}(i,j) + (1 - \lambda) d^{(2)}(i,j);
in the formula, i is the i-th tracking result, j is the j-th detection result, and \lambda is a manually set weight;
the target detection frames in the two consecutive frames of images on the time axis are associated only when c_{i,j} falls within the thresholds set by the two constraints of the geometric distance and the apparent distance.
Preferably, a convolutional neural network is adopted to perform multi-layer processing on the target detection frame of the original image frame.
Preferably, the tracking result includes the item type of the target to be tracked and the time period from appearance to disappearance.
The invention discloses the following technical effects:
the invention provides a multi-target tracking method based on a deep neural network, which increases the consideration of apparent information of a target while using the motion state information of the target, integrates motion characteristics and appearance characteristics into the calculation process of a loss matrix, improves the accuracy of target prediction of a next frame, enables a program to better cope with the interference caused by the problem of target shielding, reduces ID Switch indexes, thereby really realizing continuous tracking of the target, supports a multi-target tracking software program of mixed type tracking, can detect and track different types of moving objects in the same video without mutual interference, and simultaneously outputs the respective motion state information of the moving objects, thereby providing early-stage data support and basis for a machine vision task of the next step.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flow chart of a multi-target tracking method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of capturing each frame of image of an input video according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating the detection of an object on all frame images according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating a target detection frame marked on a frame image according to an embodiment of the present invention;
fig. 5 is a schematic image diagram of a video outputting a detection result according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The invention provides a multi-target tracking method based on a deep neural network, which comprises the following steps of:
s101, collecting a video to be tested, preprocessing the video to be tested, and extracting an original image frame of the video to be tested.
In this embodiment, an intersection with heavy traffic is selected as the test video scene. The video to be tested lasts 11.205 seconds, its size is 2.91 MB, and it contains 336 frames in total; the targets to be tracked are set to "car" and "truck", i.e., different types of vehicles travelling on the road are tracked. To preprocess the video, the existing OpenCV library is called in Python to read the complete video data, and the frame rate and total frame count of the original video are obtained with the get method. When the video contains many frames, acquiring images frame by frame takes a long time; in that case, whether to acquire the video images with frame skipping can be decided according to the requirements of the task at hand. The original image frames of the video to be tested are thus extracted, as shown in FIG. 2.
In this step, the original picture frames are extracted from the video so that subsequent target detection can be carried out with image-processing methods, and each processed frame is finally written back into the result video. Converting the multi-target tracking and recognition of a video into the processing of pictures is the key link that decomposes the video problem into a picture problem.
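As an illustration of this preprocessing step, the following is a minimal sketch using the OpenCV calls referred to above (cv2.VideoCapture and its get method); the file name and the frame-skipping interval are placeholders chosen for the example, not values fixed by the embodiment.

```python
import cv2

def extract_frames(video_path, frame_step=1):
    """Read a video and return its frames, optionally skipping frames.

    frame_step=1 extracts every frame; frame_step=3 keeps every third frame.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)                 # frame rate of the video
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))  # total number of frames
    print(f"fps={fps}, total_frames={total}")

    frames = []
    index = 0
    while True:
        ok, frame = cap.read()      # frame is a BGR numpy array
        if not ok:                  # end of video reached
            break
        if index % frame_step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

# Example usage (the path is a placeholder):
# frames = extract_frames("test_video.mp4", frame_step=1)
```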
S102, carrying out target detection on each frame of original image, identifying a target to be tracked, and obtaining a target detection frame of each frame of original image.
A target detector based on the YOLOv5m model is constructed: the detector is built on the YOLOv5m network and trained with the MS COCO (Microsoft Common Objects in Context) data set.
The target detector is used to recognize and detect the targets "car" and "truck" in every original image frame; each recognized target object is marked with a rectangular frame, yielding a target detection frame that represents the position information of the target object in the different image frames, as shown in FIGS. 3-4. Through these steps the target detector is constructed and the target detection frames are added to the images.
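The following sketch illustrates one possible way to obtain such a detector: it loads a COCO-pretrained YOLOv5m model through the public ultralytics/yolov5 torch.hub interface and keeps only the "car" and "truck" detections. The hub interface and the confidence threshold are assumptions of this illustration; the embodiment itself only specifies that a YOLOv5m network trained on MS COCO is used.

```python
import torch

# Load a YOLOv5m model pretrained on MS COCO (weights are downloaded on first use).
model = torch.hub.load('ultralytics/yolov5', 'yolov5m', pretrained=True)
model.conf = 0.4  # assumed confidence threshold for this illustration

TRACKED_CLASSES = {'car', 'truck'}

def detect_targets(frame):
    """Run the detector on one BGR frame; return [x1, y1, x2, y2, conf, name] per target."""
    results = model(frame[..., ::-1])   # the hub model expects RGB input
    boxes = []
    for *xyxy, conf, cls in results.xyxy[0].tolist():
        name = model.names[int(cls)]
        if name in TRACKED_CLASSES:     # keep only the classes to be tracked
            boxes.append([*xyxy, conf, name])
    return boxes
```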
S103, matching target detection frames in two continuous frames of images on a time axis, calculating the similarity of the target to be tracked according to the target detection frames, comparing the similarity of the target to be tracked in the two continuous frames of images on the time axis, judging whether the target is the same target to be tracked, if so, distributing an ID (identity) and outputting a tracking result; and if not, performing matching and judgment again.
Although the target detection frames produced by the target detector give the position of an object fairly accurately, interference terms are inevitably introduced during extraction, so the position coordinates of the target frames are not accurate enough for practical applications and a "denoising" step is still required.
In this embodiment, to facilitate later storage and use of the object information, the motion state of a target is defined as a vector of eight normal-distribution parameters:
x = [u, v, \gamma, h, \dot{u}, \dot{v}, \dot{\gamma}, \dot{h}]^T
where (u, v) is the center coordinate of the candidate frame, \gamma is its aspect ratio, h is its height, and the remaining four parameters describe their rates of change, all initialized to 0. Predicting the next image frame amounts to predicting the four variables (u, v, \gamma, h).
When designing the filter, two models are referred to: a Constant Velocity Motion Model and a Linear Observation Model, and the algorithm is implemented in two steps: a uniform-velocity hypothesis and a linear update. During the movement of the target, the position coordinates, velocity and other parameter values of the rectangular frame in the (T_{n-1})-th image frame, together with prior knowledge and experience, are used to obtain a hypothesis (prediction) for the T_n-th image frame. After the hypothesized value and the true (detected) value are obtained, the linear weighted value of the two vectors is taken as the final estimate of the current target.
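A minimal sketch of such a constant-velocity Kalman filter over the eight-dimensional state [u, v, γ, h, u̇, v̇, γ̇, ḣ] is given below; the process- and measurement-noise covariances are illustrative placeholder values, not parameters stated in the embodiment.

```python
import numpy as np

class ConstantVelocityKalman:
    """Constant-velocity Kalman filter over x = [u, v, gamma, h, du, dv, dgamma, dh]."""

    def __init__(self, measurement):
        # State: measured box parameters plus velocities initialised to zero.
        self.x = np.r_[measurement, np.zeros(4)].astype(float)
        self.P = np.eye(8)              # state covariance (placeholder)
        self.F = np.eye(8)              # constant-velocity transition matrix
        self.F[:4, 4:] = np.eye(4)      # position += velocity at each step
        self.H = np.eye(4, 8)           # only (u, v, gamma, h) are observed
        self.Q = 0.01 * np.eye(8)       # process noise (assumed value)
        self.R = 0.1 * np.eye(4)        # measurement noise (assumed value)

    def predict(self):
        """Uniform-velocity hypothesis for the next frame."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:4]               # predicted (u, v, gamma, h)

    def update(self, z):
        """Linear update: blend the prediction with the detected box z."""
        y = z - self.H @ self.x                     # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)    # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(8) - K @ self.H) @ self.P
        return self.x[:4]
```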
In this embodiment, the Hungarian algorithm is used to match the target frames of the T_n-th frame with the target frames of the (T_{n-1})-th frame pairwise.
The matching process usually has to follow certain rules and custom constraints to ensure that the hypotheses made are sufficiently reasonable and efficient. A loss matrix is defined to describe the cost required to match two elements from the two sets. Defining the loss matrix requires computing the geometric distance and the apparent distance between the target frames of the T_n-th frame and those of the (T_{n-1})-th frame.
1. Calculating geometric distance
The geometric distance is defined as the Mahalanobis distance between the vector parameter y_i predicted by the Kalman filter and the vector parameter d_j obtained directly from detection, i.e., the standard-deviation-normalized distance between the detected position and the average tracked position, as shown in equation (1):
d^{(1)}(i,j) = (d_j - y_i)^T \Sigma^{-1} (d_j - y_i)    (1)
where \Sigma^{-1} is the inverse of the covariance matrix of the multidimensional vector and the superscript T denotes matrix transposition.
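For illustration, equation (1) can be evaluated as in the following sketch, where the covariance matrix S passed in plays the role of \Sigma; the identity matrix in the usage comment is only a placeholder.

```python
import numpy as np

def mahalanobis_distance(y_pred, d_det, S):
    """Squared Mahalanobis distance between a predicted box y_pred and a detection d_det.

    y_pred, d_det: length-4 vectors (u, v, gamma, h); S: 4x4 covariance matrix.
    """
    diff = np.asarray(d_det, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(diff @ np.linalg.inv(S) @ diff)

# Illustrative usage with a placeholder covariance:
# d1 = mahalanobis_distance(track.predict(), detection_box, S=np.eye(4))
```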
2. Calculating an apparent distance
The Mahalanobis distance works well when the reliability and predictability of the object's motion are relatively high. However, in image or video scenes with strong interference and heavy noise (e.g., distorted raw data, or a fast-moving or severely shaking camera), associating targets with the Mahalanobis distance alone becomes much less effective and causes more frequent ID jumps.
To address this problem, this embodiment uses a CNN network to extract the object inside each target frame and represent every pixel of the image: the convolutional neural network applies convolution kernels to the target frame of each frame through multiple layers of local convolution and pooling operations, combining low-level features into a higher-level feature map. In this embodiment the high-level feature map is stored as a 128-dimensional apparent information matrix, the matrix is normalized, and its minimum Cosine Distance, i.e., the apparent distance, is calculated as shown in equation (2):
d^{(2)}(i,j) = \min\{ 1 - r_j^T r_k^{(i)} \mid r_k^{(i)} \in R_i \}    (2)
where r_j denotes the j-th appearance vector extracted by the CNN network, r_k^{(i)} denotes the k-th appearance vector in the i-th target tracker, and R_i represents the appearance of that object in different frames.
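A small sketch of the minimum-cosine-distance computation of equation (2) follows; it assumes the 128-dimensional appearance vectors have already been extracted by the CNN and only normalizes and compares them.

```python
import numpy as np

def min_cosine_distance(gallery, query):
    """Minimum cosine distance between a detection's appearance vector and a track's gallery.

    gallery: array of shape (K, 128), the stored appearance vectors r_k of one tracker.
    query:   array of shape (128,), the appearance vector r_j of the current detection.
    """
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    return float(np.min(1.0 - gallery @ query))
```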
3. Constructing a loss matrix
The geometric distance and the apparent distance of a target in two consecutive frames are obtained from equations (1) and (2); their linear weighted value is used as the final metric that defines the loss. The loss matrix is constructed as shown in equation (3):
c_{i,j} = \lambda d^{(1)}(i,j) + (1 - \lambda) d^{(2)}(i,j)    (3)
where i is the i-th tracking result, j is the j-th detection result, and \lambda is a manually set weight.
It follows from equation (3) that the data association between the targets of two consecutive frames is considered successful only when c_{i,j} falls within the thresholds set by both the geometric-distance and apparent-distance constraints.
The two distance values d^{(1)}(i,j) and d^{(2)}(i,j) must be normalized; when a value is much larger than 1, it is regarded as an error and discarded.
In addition, in this embodiment the similarity is obtained indirectly from the loss matrix: a large loss corresponds to a small similarity and a small loss to a large similarity, and the similarity is used to compare the targets in two consecutive frames.
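For illustration, the sketch below builds the loss matrix of equation (3) from precomputed, normalized geometric- and apparent-distance matrices (e.g. from the helpers sketched above) and solves the assignment with the Hungarian algorithm via scipy.optimize.linear_sum_assignment; the weight value lam=0.5 is an assumed example, not a value fixed by the embodiment.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(d1, d2, lam=0.5):
    """Hungarian assignment over the loss matrix c = lam*d1 + (1-lam)*d2.

    d1, d2: (num_tracks, num_detections) matrices of normalized geometric and
            apparent distances.
    Returns the (track_index, detection_index) matches and the loss matrix.
    """
    cost = lam * np.asarray(d1) + (1.0 - lam) * np.asarray(d2)   # equation (3)
    rows, cols = linear_sum_assignment(cost)                     # Hungarian algorithm
    return list(zip(rows.tolist(), cols.tolist())), cost
```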
In this embodiment, before calculating the loss matrix, the detection frames are filtered using the confidence matrix, and the detection frames and features with insufficient reliability are deleted. The confidence matrix is used for expressing the credibility of the observed value and also comprises a geometric part and an apparent part.
1. Geometric confidence
The geometric confidence is defined as shown in equation (4):
b_{i,j}^{(1)} = \mathbb{1}[\, d^{(1)}(i,j) \le t^{(1)} \,]    (4)
where t^{(1)} is a constant threshold obtained with a standard hypothesis-testing method from probability statistics. Equation (4) states that b_{i,j}^{(1)} = 1 when d^{(1)}(i,j) \le t^{(1)}, and b_{i,j}^{(1)} = 0 otherwise.
2. Apparent confidence
The apparent confidence is defined in the same way as the geometric confidence, as shown in equation (5):
b_{i,j}^{(2)} = \mathbb{1}[\, d^{(2)}(i,j) \le t^{(2)} \,]    (5)
3. final confidence
Combining the geometric confidence and the apparent confidence yields the confidence matrix, i.e., the final confidence of the target, as shown in equation (6):
b_{i,j} = \prod_{m=1}^{2} b_{i,j}^{(m)}    (6)
where b_{i,j}^{(1)} and b_{i,j}^{(2)} constrain the result with equal importance. Clearly, a detection result can be considered reasonable only when both the geometric and the apparent measures of the detected value are reasonable.
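A minimal sketch of the gating of equations (4)-(6) follows; the thresholds t1 and t2 are assumed example values (t1 is set to the 0.95 chi-square quantile for four degrees of freedom, a common choice, while the embodiment only states that the constants come from hypothesis-testing methods).

```python
import numpy as np

def confidence_matrix(d1, d2, t1=9.4877, t2=0.3):
    """Element-wise gate: admissible only if both distances fall under their thresholds.

    d1, d2: (num_tracks, num_detections) matrices of geometric and apparent distances.
    t1, t2: assumed thresholds for this illustration.
    """
    b1 = (np.asarray(d1) <= t1).astype(float)   # geometric confidence, equation (4)
    b2 = (np.asarray(d2) <= t2).astype(float)   # apparent confidence, equation (5)
    return b1 * b2                              # final confidence, equation (6)
```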
If the detection result is reasonable, the target objects in the two consecutive target frames are the same target to be tracked; an ID is then assigned and the tracking result is output. If the detection result is unreasonable, the target objects in the two frames are not the same target to be tracked, and matching and judgment are performed again.
In this embodiment, the detection frames are first filtered with the confidence matrix and the detection frames and features with insufficient reliability are deleted, so that detection frames with large deviations are filtered out in advance and do not affect the subsequent similarity calculation. The loss matrix is then computed and its result is used as the input of the Hungarian algorithm. Finally, the confidence matrix is used once more to filter the similarity results: among results with similar similarity, the one with the higher confidence is kept, which further improves accuracy.
S104, outputting and storing the multi-target tracking results, integrating the output object state information, and visualizing the results.
After the basic multi-target tracking of the video content is completed, the result video with the continuously tracked target frames and the corresponding IDs is output, as shown in FIG. 5. At the same time, the textual information of the targets is output, including the target types, IDs, appearance times and target-frame position coordinates, as shown in Table 1. By reading the text content, the time period from appearance to disappearance of each target is obtained and stored in plain-text format.
TABLE 1
[Table 1 is reproduced as an image in the original publication; it lists, for each tracked target, the target type, ID, appearance time and target-frame position coordinates.]
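As an illustration of this output step, the sketch below draws each tracked target frame and its ID onto the image, writes the annotated frames into a result video with cv2.VideoWriter, and logs the textual state information; the file names, codec and track-list format are assumptions of the example.

```python
import cv2

def write_results(frames, tracks_per_frame, fps, out_video="result.mp4", out_txt="result.txt"):
    """Write annotated frames to a video and the per-frame target states to a text file.

    tracks_per_frame: for each frame, a list of (track_id, class_name, x1, y1, x2, y2).
    """
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(out_video, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    with open(out_txt, "w") as log:
        for idx, (frame, tracks) in enumerate(zip(frames, tracks_per_frame)):
            for tid, name, x1, y1, x2, y2 in tracks:
                # Draw the target frame and its ID on the image.
                cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
                cv2.putText(frame, f"{name}#{tid}", (int(x1), int(y1) - 5),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
                log.write(f"frame={idx} id={tid} class={name} box=({x1},{y1},{x2},{y2})\n")
            writer.write(frame)
    writer.release()
```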
The multi-target tracking method of this embodiment was tested, and the results are shown in Table 2. They indicate that keeping the tracking of a single target continuous over a long time can be difficult (the Frag index is high), but this does not seriously affect the overall tracking of the same target (ID-SW stays at a low level). In addition, when the image quality is good, the false-detection rate of the program stays low and most target types are correctly recognized and detected (the FN and FP indices are low), and the MOTA result shows that the program performs well both in object recognition and detection and in the continuity of the tracking trajectories. This step processes and visualizes the output information so that it is displayed more intuitively, which is more favorable for practical application.
TABLE 2
Measure GT ID_SW FN FP Frag MOTA
Result 7 1 0 1 11 71.429%
The above-described embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solutions of the present invention can be made by those skilled in the art without departing from the spirit of the present invention, and the technical solutions of the present invention are within the scope of the present invention defined by the claims.

Claims (8)

1. A multi-target tracking method based on a deep neural network is characterized in that: the method comprises the following steps:
collecting a video to be tested, preprocessing the video to be tested, and extracting an original image frame of the video to be tested;
carrying out target detection on each original image frame, identifying a target to be tracked, and acquiring a target detection frame of each original image frame;
matching the target detection frames in two consecutive frames of images on the time axis, calculating the similarity of the targets to be tracked in the target detection frames, comparing the similarity of the targets to be tracked in the two consecutive frames of images, and judging whether they are the same target to be tracked; if so, allocating an ID (identity) and outputting a tracking result; if not, matching and judging again;
and realizing continuous tracking of multiple targets in the video based on the ID and the tracking result.
2. The deep neural network-based multi-target tracking method according to claim 1, wherein: preprocessing the video to be tested, and extracting the original image frame of the video to be tested comprises the following steps:
reading the video to be tested by adopting an OpenCV library; acquiring the frame rate and the total frame number of the video to be tested by a get method; and extracting the original image frame of the video to be tested frame by frame or frame skipping based on the frame rate and the total frame number combined with image frame acquisition requirements.
3. The deep neural network-based multi-target tracking method according to claim 1, wherein: and acquiring a target detection frame by adopting a target detector, wherein the target detector is built by adopting a YOLO network.
4. The deep neural network-based multi-target tracking method according to claim 1, wherein: and performing Kalman filtering on the target detection frame before matching the target detection frame in two continuous frames of images on a time axis.
5. The deep neural network-based multi-target tracking method according to claim 4, wherein: the Kalman filtering process comprises the following steps:
in the moving process of the target to be tracked, calculating an initial predicted value of the target detection frame in the current original image frame based on the target detection frame in the previous original image frame, wherein the initial predicted value is a vector; and acquiring a true value of the target detection frame in the current original image frame, calculating a linear weighted value of the initial predicted value and the true value, and acquiring a position predicted value of the target detection frame in the current original image frame.
6. The deep neural network-based multi-target tracking method according to claim 5, wherein: the target detection frame in two continuous frames of images on the matching time axis comprises the following steps:
calculating a geometric distance d between the predicted value and the real value based on the predicted value and the real value(1)(i,j):
Figure FDA0003252073190000021
In the formula, yiTo predict value, diIs the true value;
adopting a CNN network to extract the apparent information of the target detection frame, storing the apparent information as an apparent information matrix, calculating the minimum cosine distance of the apparent information matrix, and obtaining an apparent distance d(2)(i,j):
Figure FDA0003252073190000022
In the formula, rjRepresenting the extraction of the j-th appearance vector via the CNN network,
Figure FDA0003252073190000023
is shown in the ith orderA kth apparent vector in the target tracker;
calculating linear weighted values of the geometric distance and the apparent distance based on the geometric distance and the apparent distance of two continuous frames of images to obtain a loss matrix ci,j
ci,j=λd(1)(i,j)+(1-λ)d(2)(i,j);
In the formula, i is the ith tracking result, j is the jth detection result, and lambda is a weight value set manually;
when c is going toi,jAnd simultaneously, the target detection frames in two continuous frames of images on the time axis are associated when the target detection frames fall within the threshold set by the two constraints of the geometric distance and the apparent distance.
7. The deep neural network-based multi-target tracking method according to claim 1, wherein: and performing multilayer processing on the target detection frame of the original image frame by adopting a convolutional neural network.
8. The deep neural network-based multi-target tracking method according to claim 1, wherein: the tracking result comprises the item type of the target to be tracked and the time period from appearance to disappearance.
CN202111048838.4A 2021-09-08 2021-09-08 Multi-target tracking method based on deep neural network Pending CN113744316A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111048838.4A CN113744316A (en) 2021-09-08 2021-09-08 Multi-target tracking method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111048838.4A CN113744316A (en) 2021-09-08 2021-09-08 Multi-target tracking method based on deep neural network

Publications (1)

Publication Number Publication Date
CN113744316A true CN113744316A (en) 2021-12-03

Family

ID=78737056

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111048838.4A Pending CN113744316A (en) 2021-09-08 2021-09-08 Multi-target tracking method based on deep neural network

Country Status (1)

Country Link
CN (1) CN113744316A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972418A (en) * 2022-03-30 2022-08-30 北京航空航天大学 Maneuvering multi-target tracking method based on combination of nuclear adaptive filtering and YOLOX detection
CN115690163A (en) * 2023-01-04 2023-02-03 中译文娱科技(青岛)有限公司 Target tracking method, system and storage medium based on image content
CN116665177A (en) * 2023-07-31 2023-08-29 福思(杭州)智能科技有限公司 Data processing method, device, electronic device and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522843A (en) * 2018-11-16 2019-03-26 北京市商汤科技开发有限公司 A kind of multi-object tracking method and device, equipment and storage medium
CN109816690A (en) * 2018-12-25 2019-05-28 北京飞搜科技有限公司 Multi-target tracking method and system based on depth characteristic
CN110378259A (en) * 2019-07-05 2019-10-25 桂林电子科技大学 A kind of multiple target Activity recognition method and system towards monitor video
CN111127513A (en) * 2019-12-02 2020-05-08 北京交通大学 Multi-target tracking method
CN111882580A (en) * 2020-07-17 2020-11-03 元神科技(杭州)有限公司 Video multi-target tracking method and system
CN112001252A (en) * 2020-07-22 2020-11-27 北京交通大学 Multi-target tracking method based on heteromorphic graph network
CN112241969A (en) * 2020-04-28 2021-01-19 北京新能源汽车技术创新中心有限公司 Target detection tracking method and device based on traffic monitoring video and storage medium
CN112288773A (en) * 2020-10-19 2021-01-29 慧视江山科技(北京)有限公司 Multi-scale human body tracking method and device based on Soft-NMS
CN112308881A (en) * 2020-11-02 2021-02-02 西安电子科技大学 Ship multi-target tracking method based on remote sensing image
US20210065384A1 (en) * 2019-08-29 2021-03-04 Boe Technology Group Co., Ltd. Target tracking method, device, system and non-transitory computer readable storage medium
CN113034548A (en) * 2021-04-25 2021-06-25 安徽科大擎天科技有限公司 Multi-target tracking method and system suitable for embedded terminal

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522843A (en) * 2018-11-16 2019-03-26 北京市商汤科技开发有限公司 A kind of multi-object tracking method and device, equipment and storage medium
CN109816690A (en) * 2018-12-25 2019-05-28 北京飞搜科技有限公司 Multi-target tracking method and system based on depth characteristic
CN110378259A (en) * 2019-07-05 2019-10-25 桂林电子科技大学 A kind of multiple target Activity recognition method and system towards monitor video
US20210065384A1 (en) * 2019-08-29 2021-03-04 Boe Technology Group Co., Ltd. Target tracking method, device, system and non-transitory computer readable storage medium
CN111127513A (en) * 2019-12-02 2020-05-08 北京交通大学 Multi-target tracking method
CN112241969A (en) * 2020-04-28 2021-01-19 北京新能源汽车技术创新中心有限公司 Target detection tracking method and device based on traffic monitoring video and storage medium
CN111882580A (en) * 2020-07-17 2020-11-03 元神科技(杭州)有限公司 Video multi-target tracking method and system
CN112001252A (en) * 2020-07-22 2020-11-27 北京交通大学 Multi-target tracking method based on heteromorphic graph network
CN112288773A (en) * 2020-10-19 2021-01-29 慧视江山科技(北京)有限公司 Multi-scale human body tracking method and device based on Soft-NMS
CN112308881A (en) * 2020-11-02 2021-02-02 西安电子科技大学 Ship multi-target tracking method based on remote sensing image
CN113034548A (en) * 2021-04-25 2021-06-25 安徽科大擎天科技有限公司 Multi-target tracking method and system suitable for embedded terminal

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HASITH KARUNASEKERA et al.: "Multiple Object Tracking With Attention to Appearance, Structure, Motion and Size", vol. 7, pages 104423 - 104434, XP011739318, DOI: 10.1109/ACCESS.2019.2932301 *
师燕妮: "Research on video-based human target detection and tracking technology" (基于视频的人体目标检测与跟踪技术研究), no. 3, pages 138 - 1388 *
武玉伟 et al.: "Fundamentals and Applications of Deep Learning" (《深度学习基础与应用》), vol. 1, Xidian University Press, pages 74 - 76 *
王溜: "Research on pedestrian multi-target tracking algorithms based on deep learning" (基于深度学习的行人多目标跟踪算法研究), no. 8, pages 138 - 453 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972418A (en) * 2022-03-30 2022-08-30 北京航空航天大学 Maneuvering multi-target tracking method based on combination of nuclear adaptive filtering and YOLOX detection
CN114972418B (en) * 2022-03-30 2023-11-21 北京航空航天大学 Maneuvering multi-target tracking method based on combination of kernel adaptive filtering and YOLOX detection
CN115690163A (en) * 2023-01-04 2023-02-03 中译文娱科技(青岛)有限公司 Target tracking method, system and storage medium based on image content
CN115690163B (en) * 2023-01-04 2023-05-09 中译文娱科技(青岛)有限公司 Target tracking method, system and storage medium based on image content
CN116665177A (en) * 2023-07-31 2023-08-29 福思(杭州)智能科技有限公司 Data processing method, device, electronic device and storage medium
CN116665177B (en) * 2023-07-31 2023-10-13 福思(杭州)智能科技有限公司 Data processing method, device, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN107424171B (en) Block-based anti-occlusion target tracking method
CN113744316A (en) Multi-target tracking method based on deep neural network
CN112132119B (en) Passenger flow statistical method and device, electronic equipment and storage medium
US20230289979A1 (en) A method for video moving object detection based on relative statistical characteristics of image pixels
CN112528878A (en) Method and device for detecting lane line, terminal device and readable storage medium
CN107808138B (en) Communication signal identification method based on FasterR-CNN
EP2034426A1 (en) Moving image analyzing, method and system
CN108198201A (en) A kind of multi-object tracking method, terminal device and storage medium
CN106910204B (en) A kind of method and system to the automatic Tracking Recognition of sea ship
CN107452015A (en) A kind of Target Tracking System with re-detection mechanism
CN102915545A (en) OpenCV(open source computer vision library)-based video target tracking algorithm
Hadi et al. A computationally economic novel approach for real-time moving multi-vehicle detection and tracking toward efficient traffic surveillance
Lian et al. A novel method on moving-objects detection based on background subtraction and three frames differencing
CN109740609A (en) A kind of gauge detection method and device
CN113255444A (en) Training method of image recognition model, image recognition method and device
CN112686122B (en) Human body and shadow detection method and device, electronic equipment and storage medium
CN109978916B (en) Vibe moving target detection method based on gray level image feature matching
CN105208402B (en) A kind of frame of video complexity measure method based on Moving Objects and graphical analysis
CN112348011B (en) Vehicle damage assessment method and device and storage medium
Yu et al. Length-based vehicle classification in multi-lane traffic flow
Palaio et al. Multi-object tracking using an adaptive transition model particle filter with region covariance data association
CN112348112B (en) Training method and training device for image recognition model and terminal equipment
de Oliveira et al. Vehicle counting and trajectory detection based on particle filtering
CN114782500A (en) Kart race behavior analysis method based on multi-target tracking
CN114743257A (en) Method for detecting and identifying image target behaviors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination