CN112949615A - Multi-target tracking system and method based on fusion detection technology - Google Patents

Multi-target tracking system and method based on fusion detection technology

Info

Publication number
CN112949615A
CN112949615A (application CN202110519994.8A)
Authority
CN
China
Prior art keywords
target
frame image
coordinate information
tracking
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110519994.8A
Other languages
Chinese (zh)
Other versions
CN112949615B (en)
Inventor
卢朝晖
齐国栋
王润发
于慧敏
顾建波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lijia Electronic Technology Co ltd
Zhejiang University ZJU
Original Assignee
Zhejiang Lijia Electronic Technology Co ltd
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lijia Electronic Technology Co ltd, Zhejiang University ZJU
Priority to CN202110519994.8A
Publication of CN112949615A
Application granted
Publication of CN112949615B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/48Matching video sequences
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-target tracking method that fuses detection technology. The method models surveillance video frames and stably and accurately outputs the category and position of each target of interest while tracking it. Specifically, a target detector first classifies and localizes the targets in a frame; a tracker based on motion modeling then continuously predicts each target trajectory; the detection regions and motion-prediction regions are next fed into an appearance feature acquisition network to obtain their matching features; and data association between detections and motion predictions, together with correction of the final tracking result boxes, is completed from the target positions and matching features. The method detects target information accurately and quickly, the motion-prediction positions and the matching feature network make full use of target motion and appearance, and the matching and box correction make the multi-target tracking result more accurate.

Description

Multi-target tracking system and method based on fusion detection technology
Technical Field
The invention belongs to the technical field of intelligent identification, and particularly relates to a multi-target tracking method based on a fusion detection technology.
Background
Multi-target tracking is widely applied in engineering and plays a key role in tasks such as road traffic monitoring and illegal behavior identification. Given a video, traditional tracking methods require the tracked target boxes to be initialized manually; with the development of deep learning, tracking techniques driven by neural network detection are increasingly common.
Most current approaches adopt a one-stage or two-stage target detection algorithm. A one-stage detector maps features directly to the coordinate and category information of the targets, whereas a two-stage detector first performs coarse localization of all foreground targets with a region proposal network and then feeds the proposals into a classifier for refined localization and classification. One-stage detectors are fast but less accurate; two-stage detectors are accurate but slower.
A multi-target tracking algorithm localizes multiple targets in a video sequence from detection results and links them across frames to form trajectories, so it focuses on learning the differences between individual targets. A common multi-target tracking framework measures the distance between the appearance and motion information of detected targets and, combining associations between consecutive frames, assigns a consistent identity to each detection. Such methods are effective to an extent, but this design makes tracking performance depend one-sidedly on the detection results, and obtaining discriminative visual features introduces complex mechanisms and a large computational burden, so both the tracking quality and the efficiency are limited.
Disclosure of Invention
To solve these problems, the invention aims to provide a multi-target tracking method based on a fusion detection technology that offers both high accuracy and a high detection rate.
In order to achieve the purpose, the invention adopts the following technical scheme: a multi-target tracking system based on fusion detection technology, the system comprising: the system comprises a deep convolutional neural network, a target detection network, a Kalman filter model, a matching feature acquisition network and a tracking result correction network; the output end of the deep convolutional neural network is connected with the input end of a target detection network, the output end of the target detection network is respectively connected with the input end of a matched feature acquisition network and the input end of a Kalman filter model, the output end of the Kalman filter model is connected with the input end of the matched feature acquisition network, and the output end of the matched feature acquisition network is connected with the input end of a tracking result correction network.
The invention also provides a tracking method of the multi-target tracking system based on the fusion detection technology, which specifically comprises the following steps:
(1) collecting video frame images;
(2) inputting a first frame image and a second frame image of the video frame images into a deep convolutional neural network to obtain the features of each target in the first frame image and the second frame image;
(3) respectively inputting the features of each target in the first frame image and the second frame image into a target detector network, outputting the confidence score, category type, category score and coordinate information of each target in the first frame image and the second frame image, calculating the product of the confidence score and the category score of each target, and retaining the category type and corresponding coordinate information of the targets whose product is higher than a threshold;
(4) inputting the retained category type of the first frame image and the corresponding coordinate information into a Kalman filter model for target tracking prediction, and predicting the coordinate information corresponding to each category type in the second frame image;
(5) inputting the second frame image, the predicted coordinate information corresponding to each category type in the second frame image and the retained coordinate information of the second frame image into a matching feature acquisition network to obtain a predicted appearance matching feature and an appearance matching feature;
(6) calculating a distance metric from the coordinate information of the second frame image, the predicted coordinate information of the second frame image, the predicted appearance matching feature and the appearance matching feature, and matching the targets corresponding to the coordinate information of the second frame image and the predicted coordinate information of the second frame image by using the Hungarian algorithm;
(7) inputting the predicted appearance matching feature and the appearance matching feature of the second frame image obtained in step (5), together with the matched coordinate information and predicted coordinate information, into a tracking result correction network, and outputting the corrected coordinate information of the multi-target tracking;
(8) and sequentially repeating the steps (2) to (7) on the subsequent frame images until all the video frame images are tracked.
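A minimal structural sketch of this per-frame loop is given below, with every component passed in as a callable; the function and parameter names and the dictionary-based detection format are illustrative assumptions, not the actual implementation of the patent.

def track_frame(frame, backbone, detector, predict_tracks, matcher, associate, correct, thresh=0.5):
    """One pass of steps (2)-(7) for a single video frame; all modules are supplied by the caller."""
    feats = backbone(frame)                                    # step (2): deep CNN features
    dets = [d for d in detector(feats)                         # step (3): keep detections whose
            if d["conf"] * d["cls_score"] > thresh]            # confidence x class score passes the threshold
    preds = predict_tracks()                                   # step (4): Kalman motion predictions
    det_feats = matcher(frame, [d["box"] for d in dets])       # step (5): appearance matching features
    pred_feats = matcher(frame, [p["box"] for p in preds])
    pairs = associate(dets, preds, det_feats, pred_feats)      # step (6): Hungarian data association
    return [correct(det_feats[i], pred_feats[j], dets[i]["box"], preds[j]["box"])
            for i, j in pairs]                                 # step (7): corrected tracking boxes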
Further, the matching feature acquisition network is composed of a backbone network and a feature mapping module, and step (5) is specifically: the second frame image is input into the backbone network to obtain a complete feature map; the regions of the feature map corresponding to the predicted coordinate information for each category type in the second frame image and to the retained coordinate information of the second frame image are cropped out; the two groups of cropped features are input separately into the feature mapping module, which outputs a 1 x 128-dimensional predicted appearance matching feature and appearance matching feature.
Further, the distance metric d is calculated from IoU(P, P'), the ratio of the intersection of the coordinate information P of the second frame image and the predicted coordinate information P' of the second frame image to their union, together with the predicted appearance matching feature and the appearance matching feature.
Further, if the matching in step (6) fails, the unmatched motion prediction continues to be propagated for a certain number of frames; if a match is achieved within that window the track resumes, otherwise the track operation is suspended and no longer predicted. For a detection result without a matching motion prediction result, re-identification is performed to determine whether to restart an old target or to initialize a new target.
Further, the tracking method further comprises: after the tracking task for a frame image is completed, the appearance matching feature F_P of the detection result associated with each track is used to update the appearance matching feature F_T of that tracking track, where α denotes the learning rate, F_T before the update denotes the appearance matching feature of the track before the current frame is processed, and F_T after the update denotes the track appearance matching feature after incorporating the target appearance information of the current frame.
Compared with the prior art, the invention has the following beneficial effects: the detection part of the multi-target tracking method adopts an advanced target detection network and provides more accurate initialization and observation values for target tracking. For motion prediction, the invention adopts a Kalman filtering based motion model that iteratively updates using only the motion information of each target, predicting object positions accurately at very low computational cost. By fusing the appearance matching features and position information of the detection results and the tracking results, a more accurate tracking result is obtained through the tracking result correction network.
Drawings
FIG. 1 is a flow chart of a multi-target tracking method based on fusion detection technology.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. On the contrary, the invention is intended to cover alternatives, modifications, equivalents and alternatives which may be included within the spirit and scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of the present invention, certain specific details are set forth in order to provide a better understanding of the present invention. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details.
The invention provides a multi-target tracking system based on fusion detection technology, which comprises: the system comprises a deep convolutional neural network, a target detection network, a Kalman filter model, a matching feature acquisition network and a tracking result correction network; the output end of the deep convolutional neural network is connected with the input end of a target detection network, the output end of the target detection network is respectively connected with the input end of a matched feature acquisition network and the input end of a Kalman filter model, the output end of the Kalman filter model is connected with the input end of the matched feature acquisition network, and the output end of the matched feature acquisition network is connected with the input end of a tracking result correction network.
Referring to fig. 1, a flowchart of a tracking method of a multi-target tracking system based on a fusion detection technology is shown, which specifically includes the following steps:
(1) Surveillance video and video frame images are obtained through a road traffic surveillance camera.
(2) The first frame image and the second frame image of the video frame images are input into a deep convolutional neural network to obtain the features of each target in the first frame image and the second frame image; ResNet-50 is adopted as the deep convolutional neural network in the invention.
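As a concrete illustration of step (2), a ResNet-50 backbone with its classification head removed can produce the per-frame feature map; the use of torchvision and the input resolution below are assumptions for the sketch, not details taken from the patent.

import torch
import torchvision

# ResNet-50 feature extractor: drop the average pooling and fully connected layers.
backbone = torch.nn.Sequential(
    *list(torchvision.models.resnet50(weights=None).children())[:-2]
)
backbone.eval()

with torch.no_grad():
    frame = torch.randn(1, 3, 608, 1088)   # dummy video frame tensor (N, C, H, W)
    feature_map = backbone(frame)          # (1, 2048, H/32, W/32) convolutional features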
(3) The features of each object in the first frame image and the second frame image are input separately into the target detector network, which outputs the confidence score S, the category type C, the category score and the coordinate information P of each object in the first frame image and the second frame image. The defined category set is C = {pedestrian, Meituan takeaway electric vehicle, Ele.me takeaway electric vehicle, non-takeaway electric vehicle}. The product of the confidence score S and the category score is calculated for each object, and the category type and the corresponding coordinate information P of the objects whose product is higher than a threshold are retained.
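The retention rule of step (3) amounts to a simple score product test; the sketch below assumes a tuple layout for the detector output and an arbitrary threshold value, neither of which is specified by the patent.

CATEGORIES = ["pedestrian", "Meituan takeaway electric vehicle",
              "Ele.me takeaway electric vehicle", "non-takeaway electric vehicle"]

def filter_detections(detections, delta=0.4):
    """detections: iterable of (box, class_id, confidence_score, class_score)."""
    kept = []
    for box, cls_id, conf, cls_score in detections:
        if conf * cls_score > delta:                 # keep only high-quality detections
            kept.append((box, CATEGORIES[cls_id]))
    return kept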
(4) Motion modeling position prediction is based on a custom constant-velocity linear Kalman filter model. The observation vector of a target is set to the center coordinates, width and height of its bounding box, and the modeled target state vector appends four further terms: the moving speed of the bounding-box center in the x and y directions and the rates of change of the width and the height. The retained category type C of the first frame image and the corresponding coordinate information P are input into the Kalman filter model for target tracking prediction, and the coordinate information corresponding to each category type in the second frame image is predicted. Specifically:
(4.1) The retained category type of the first frame image and the corresponding coordinate information P initialize the observation vector of each target, with the motion speed and the width and height change rates set to 0;
(4.2) The motion state of the first frame image and the state transition matrix H are used to predict the state of the second frame image; the error matrix of the second frame image is predicted from the error matrix of the first frame image, the state transition matrix H and the noise covariance matrix Q.
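A constant-velocity Kalman prediction consistent with step (4) can be sketched as follows; the concrete noise covariance values and the unit time step are assumptions, and the standard prediction equations are used since the patent only names the matrices involved.

import numpy as np

# State x = [cx, cy, w, h, vx, vy, vw, vh]; the transition matrix (called H in the
# patent) adds each velocity term to the corresponding position term.
DT = 1.0
H_TRANS = np.eye(8)
H_TRANS[:4, 4:] = DT * np.eye(4)
Q = 1e-2 * np.eye(8)                       # process noise covariance (illustrative)

def kalman_predict(x, sigma):
    """Predict the next motion state and error matrix from the current ones."""
    x_pred = H_TRANS @ x
    sigma_pred = H_TRANS @ sigma @ H_TRANS.T + Q
    return x_pred, sigma_pred

# Initialization per step (4.1): velocities and change rates start at zero.
x0 = np.array([320.0, 240.0, 80.0, 160.0, 0.0, 0.0, 0.0, 0.0])
sigma0 = np.eye(8)
x1_pred, sigma1_pred = kalman_predict(x0, sigma0)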
(5) The predicted coordinate information corresponding to each category in the second frame image and the retained coordinate information P of the second frame image are input, together with the second frame image, into the matching feature acquisition network to obtain the predicted appearance matching feature and the appearance matching feature. The matching feature acquisition network consists of a backbone network and a feature mapping module, and step (5) is specifically: the second frame image is input into the backbone network to obtain a complete feature map; the regions of the feature map at the positions given by the predicted coordinate information and by the retained coordinate information P of the second frame image are cropped out; the two groups of cropped features are input separately into the feature mapping module, which outputs 1 x 128-dimensional predicted appearance matching features and appearance matching features. Compared with the original output feature map, the mapped features integrate and compress the effective information and are more efficient for measuring the similarity of appearance information at different positions.
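One way to realize the cropping and feature mapping of step (5) is sketched below with RoIAlign followed by a small linear head producing the 128-dimensional matching feature; the choice of RoIAlign, the crop size and the head structure are assumptions rather than the patent's exact modules.

import torch
import torch.nn.functional as F
import torchvision

class FeatureMapper(torch.nn.Module):
    """Crop backbone features at box locations and map each crop to a 1 x 128 feature."""
    def __init__(self, in_channels=2048, roi_size=7):
        super().__init__()
        self.roi_size = roi_size
        self.head = torch.nn.Sequential(
            torch.nn.Flatten(),
            torch.nn.Linear(in_channels * roi_size * roi_size, 128),
        )

    def forward(self, feature_map, boxes, stride=32):
        # boxes: (N, 4) tensor in image coordinates (x1, y1, x2, y2)
        rois = torchvision.ops.roi_align(
            feature_map, [boxes], output_size=self.roi_size, spatial_scale=1.0 / stride
        )
        return F.normalize(self.head(rois), dim=1)   # (N, 128) matching features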
The matching feature acquisition network realizes the matching and re-identification functions. A triplet loss is adopted during training: images of the same target in a video sequence form positive sample pairs and images of different targets form negative sample pairs. With an anchor sample whose appearance matching feature is F_a, a positive sample whose appearance matching feature is F_p and a negative sample whose appearance matching feature is F_n, the triplet loss is defined as
L_triplet = Σ_i [ d(F_a^i, F_p^i) − d(F_a^i, F_n^i) + α ]_+
where i denotes the i-th training sample, d(·,·) is the distance between appearance matching features, [·]_+ takes the original value when it is greater than 0 and 0 otherwise, and α is the minimum margin between different samples.
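The triplet training objective described above corresponds to the standard hinge form sketched below; the squared Euclidean distance and the margin value are assumptions where the patent only names the anchor, positive, negative and margin terms.

import torch

def triplet_loss(f_anchor, f_pos, f_neg, alpha=0.3):
    """Pull the anchor toward the positive feature and push it from the negative by at least alpha."""
    d_pos = torch.sum((f_anchor - f_pos) ** 2, dim=1)
    d_neg = torch.sum((f_anchor - f_neg) ** 2, dim=1)
    return torch.clamp(d_pos - d_neg + alpha, min=0).mean()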
(6) The distance metric d is calculated from the coordinate information P of the second frame image, the predicted coordinate information P' of the second frame image, the predicted appearance matching feature and the appearance matching feature. The calculation uses IoU(P, P'), the ratio of the intersection of the coordinate information P of the second frame image and the predicted coordinate information P' of the second frame image to their union, together with the two appearance matching features.
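The exact combination of position and appearance terms is not recoverable from the text; the sketch below assumes a weighted sum of (1 − IoU) and the cosine distance between the appearance matching features, which is a common choice for this kind of distance metric.

import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def distance(det_box, pred_box, det_feat, pred_feat, lam=0.5):
    """Assumed distance metric: weighted sum of box distance and appearance distance."""
    cos_sim = float(np.dot(det_feat, pred_feat) /
                    (np.linalg.norm(det_feat) * np.linalg.norm(pred_feat) + 1e-9))
    return lam * (1.0 - iou(det_box, pred_box)) + (1.0 - lam) * (1.0 - cos_sim)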
The Hungarian algorithm is then used to match the targets corresponding to the coordinate information P of the second frame image with those corresponding to the predicted coordinate information P' of the second frame image. If the matching fails for a motion prediction that has no corresponding detection result, it continues to be predicted for a certain number of frames until a match is achieved in some frame; if no match with a detection result is achieved after that number of frames, the track is suspended and no longer predicted. A detection result with no matching motion prediction is re-identified: the Euclidean distance between the appearance matching feature of each suspended track T and the appearance matching feature of the current detection result P is compared with a threshold δ; if the distance for some track is smaller than the threshold, that track's motion prediction is restarted and associated with the currently matched detection result, and if no track is below the threshold, a motion prediction model is established for a new target.
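The assignment itself can be carried out with the Hungarian algorithm as implemented in SciPy; the cost gating threshold below is an illustrative assumption used to mark leftover detections and motion predictions as unmatched.

import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(cost, max_cost=0.7):
    """cost: (num_detections, num_predictions) matrix of distance metric values."""
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
    matched_r = {r for r, _ in matches}
    matched_c = {c for _, c in matches}
    unmatched_dets = [r for r in range(cost.shape[0]) if r not in matched_r]
    unmatched_preds = [c for c in range(cost.shape[1]) if c not in matched_c]
    return matches, unmatched_dets, unmatched_preds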
For the matched pairs of coordinate information P of the second frame image and predicted coordinate information P' of the second frame image, the detection result in each matched pair is recorded as the observation value D; the Kalman gain is calculated in combination with the measurement noise covariance matrix R, and the motion state and the error matrix of the second frame image are updated accordingly.
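A sketch of this update step follows; the standard Kalman gain and update equations are assumed because the patent only names the observation D, the measurement noise covariance R, the motion state and the error matrix. H_OBS here maps the 8-dimensional state to the 4-dimensional observation.

import numpy as np

H_OBS = np.hstack([np.eye(4), np.zeros((4, 4))])   # observe [cx, cy, w, h] from the state
R = 1e-1 * np.eye(4)                               # measurement noise covariance (illustrative)

def kalman_update(x_pred, sigma_pred, d_obs):
    """Update the predicted motion state and error matrix with the matched detection D."""
    s = H_OBS @ sigma_pred @ H_OBS.T + R               # innovation covariance
    k = sigma_pred @ H_OBS.T @ np.linalg.inv(s)        # Kalman gain
    x_new = x_pred + k @ (d_obs - H_OBS @ x_pred)      # updated motion state
    sigma_new = (np.eye(8) - k @ H_OBS) @ sigma_pred   # updated error matrix
    return x_new, sigma_new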
(7) The appearance matching features of the second frame image and the matched coordinate information P and predicted coordinate information P' are input into the tracking result correction network, which outputs the corrected coordinate information of the multi-target tracking. The network outputs the offsets of the tracking result correction box relative to the motion-predicted position, namely the offsets of the correction result relative to the x and y coordinates of the center of the motion prediction box and the offsets of its width and height, thereby obtaining a more accurate tracking result.
The tracking correction network consists of 4 layers of 3 x 3 convolutional layers followed by fully connected layers; its output is an offset prediction of size (number of matched pairs) x 4, and Smooth L1 loss is adopted during training. Given the true value of the tracking result, the result of the motion prediction and the offsets output by the network, the loss is defined on the difference between the predicted offsets and the offsets of the true boxes relative to the motion predictions. Smooth L1 loss keeps the gradient from growing too large when the difference between the true and predicted values is large while remaining small when the difference is small, so the network learns a more stable offset regression capability. The final tracking result is obtained by applying the predicted offsets to the motion-predicted boxes.
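The correction step can be sketched as offset regression trained with Smooth L1 loss; the particular offset parameterization (center shift normalized by box size, log-scale width and height) is an assumption in the style of standard bounding-box regression, not a formula stated in the patent.

import torch

def apply_offsets(pred_box, offsets):
    """pred_box: (cx, cy, w, h) floats; offsets: tensor (dx, dy, dw, dh) from the correction network."""
    cx, cy, w, h = pred_box
    dx, dy, dw, dh = offsets
    return (cx + dx * w, cy + dy * h, w * torch.exp(dw), h * torch.exp(dh))

smooth_l1 = torch.nn.SmoothL1Loss()   # 0.5 * x^2 for |x| < 1, |x| - 0.5 otherwise
target_offsets = torch.tensor([0.02, -0.01, 0.05, 0.00])   # offsets of the true box w.r.t. the prediction
output_offsets = torch.tensor([0.03, -0.02, 0.01, 0.01])   # offsets regressed by the correction network
loss = smooth_l1(output_offsets, target_offsets)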
(8) Steps (2) to (7) are repeated sequentially on the subsequent frame images until all the video frame images are tracked.
The tracking method further comprises the following steps: after the tracking task of the frame image is completed, the shape matching feature F of the tracking corresponding detection result is usedPUpdate each stripShape matching feature F of tracking trajectoryT
Figure 113306DEST_PATH_IMAGE008
Wherein the content of the first and second substances,
Figure 114760DEST_PATH_IMAGE009
it is indicated that the learning rate is,
Figure 844819DEST_PATH_IMAGE010
representing the shape-matching features of the trace before processing the current frame,
Figure 654643DEST_PATH_IMAGE011
and showing the track shape matching characteristics after the track corresponding to the target shape information of the current frame is updated.
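A sketch of this per-frame feature update is given below as an exponential moving average; the exact placement of the learning rate α (here the weight on the new detection feature) is an assumption, since the text only names the terms involved.

def update_track_feature(f_track, f_det, alpha=0.1):
    """Blend the track's stored matching feature with the matched detection's feature."""
    return (1.0 - alpha) * f_track + alpha * f_det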
The tracking result is evaluated by a score computed from the numbers of missed detections, false detections and mismatches accumulated over the t frames.
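The reported percentages below are consistent with a MOTA-style score; the sketch assumes normalization by the total number of ground-truth objects, which the text does not state explicitly.

def tracking_score(missed, false_det, mismatched, ground_truth):
    """Each argument is a per-frame list of counts accumulated over the t frames."""
    errors = sum(missed) + sum(false_det) + sum(mismatched)
    return 1.0 - errors / max(sum(ground_truth), 1)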
The multi-target tracking system based on the fusion detection technology is used for multi-target tracking of pedestrians, Meituan takeaway electric vehicles, Ele.me takeaway electric vehicles and non-takeaway electric vehicles. In the image at the 1st second, the numbers of pedestrians, Meituan takeaway electric vehicles, Ele.me takeaway electric vehicles and non-takeaway electric vehicles are 2, 1, 2 and 6 respectively; the numbers of missed detections are 0, 0, 0, 0; the numbers of false detections are 0, 0, 0 and 1; the numbers of mismatches are 0, 0, 0, 1; the tracking result score is calculated to be 81.82%. In the image at 1.5 seconds, the numbers of pedestrians, Meituan takeaway electric vehicles, Ele.me takeaway electric vehicles and non-takeaway electric vehicles are 3, 3, 2 and 6 respectively; the numbers of missed detections are 0, 0, 0 and 1; the numbers of false detections are 0, 0, 0 and 1; the numbers of mismatches are 0, 0, 0, 1; the tracking result score is calculated to be 78.50%. The tracking scores of the images at other times are not described in detail. The tracking result scores show that the deep convolutional neural network and the target detection network used in the invention provide reliable results, and the Kalman filter model, the matching feature acquisition network and the tracking result correction network used in the invention make the matching and tracking results more accurate, with excellent performance.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (6)

1. A multi-target tracking system based on fusion detection technology, the system comprising: the system comprises a deep convolutional neural network, a target detection network, a Kalman filter model, a matching feature acquisition network and a tracking result correction network; the output end of the deep convolutional neural network is connected with the input end of a target detection network, the output end of the target detection network is respectively connected with the input end of a matched feature acquisition network and the input end of a Kalman filter model, the output end of the Kalman filter model is connected with the input end of the matched feature acquisition network, and the output end of the matched feature acquisition network is connected with the input end of a tracking result correction network.
2. The multi-target tracking system tracking method based on the fusion detection technology as claimed in claim 1, which is characterized by comprising the following steps:
(1) collecting video frame images;
(2) outputting a first frame image and a second frame image of the video frame image to a deep convolutional neural network to obtain the characteristics of each target in the first frame image and the second frame image;
(3) respectively inputting the characteristics of each target in the first frame image and the second frame image into a target detector network, outputting the confidence score, the category type, the category score and the coordinate information of each target in the first frame image and the second frame image, calculating the product of the confidence score and the category score of each target, and reserving the category type of the target with the product higher than a threshold value and the corresponding coordinate information;
(4) inputting the category type of the reserved first frame image and the corresponding coordinate information into a Kalman filter model for target tracking prediction, and predicting the coordinate information corresponding to the category type in the second frame image;
(5) inputting the second frame image, the coordinate information corresponding to the category type in the predicted second frame image and the reserved coordinate information of the second frame image into a matching feature acquisition network to obtain a predicted shape matching feature and a shape matching feature;
(6) calculating distance measurement according to the coordinate information of the second frame image, the coordinate information of the predicted second frame image, the predicted appearance matching characteristic and the appearance matching characteristic, and matching the coordinate information of the second frame image with a target corresponding to the coordinate information of the predicted second frame image by using a Hungarian algorithm;
(7) inputting the predicted shape matching feature and the shape matching feature of the second frame image obtained in the step (5) and the matched coordinate information and predicted coordinate information into a tracking result correction network, and outputting corrected coordinate information of multi-target tracking;
(8) and sequentially repeating the steps (2) to (7) on the subsequent frame images until all the video frame images are tracked.
3. The tracking method of the multi-target tracking system based on the fusion detection technology as claimed in claim 2, wherein the matching feature acquisition network is composed of a backbone network and a feature mapping module, and the step (5) is specifically as follows: inputting the second frame image into a backbone network to obtain a complete feature map, cutting the position part of the coordinate information corresponding to the category type in the predicted second frame image and the reserved coordinate information of the second frame image in the feature map, respectively inputting the two cut parts of features into a feature mapping module, and outputting a 1 x 128-dimensional predicted shape matching feature and a shape matching feature.
4. The tracking method of the multi-target tracking system based on the fusion detection technology as claimed in claim 2, wherein the distance metric is calculated from the ratio of the intersection of the coordinate information P of the second frame image and the predicted coordinate information of the second frame image to their union, together with the predicted shape matching feature and the shape matching feature.
5. The tracking method of the multi-target tracking system based on the fusion detection technology as claimed in claim 2, wherein in step (6), if the matching fails, the motion prediction result is continuously predicted for a certain number of frames until a match is achieved in some frame, or, if no match with a detection result is achieved after that number of frames, the track operation is suspended and no longer predicted; for a detection result without a matching motion prediction result, re-identification is performed to determine whether to restart an old target or to initialize a new target.
6. The tracking method of the multi-target tracking system based on the fusion detection technology as claimed in claim 2, characterized in that the tracking method further comprises: after the tracking task for a frame image is completed, the shape matching feature F_P of the detection result associated with each track is used to update the shape matching feature F_T of that tracking track, where α denotes the learning rate, F_T before the update denotes the shape matching feature of the track before the current frame is processed, and F_T after the update denotes the track shape matching feature after incorporating the target shape information of the current frame.
CN202110519994.8A 2021-05-13 2021-05-13 Multi-target tracking system and method based on fusion detection technology Active CN112949615B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110519994.8A CN112949615B (en) 2021-05-13 2021-05-13 Multi-target tracking system and method based on fusion detection technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110519994.8A CN112949615B (en) 2021-05-13 2021-05-13 Multi-target tracking system and method based on fusion detection technology

Publications (2)

Publication Number Publication Date
CN112949615A true CN112949615A (en) 2021-06-11
CN112949615B CN112949615B (en) 2021-08-17

Family

ID=76233798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110519994.8A Active CN112949615B (en) 2021-05-13 2021-05-13 Multi-target tracking system and method based on fusion detection technology

Country Status (1)

Country Link
CN (1) CN112949615B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115719368A (en) * 2022-11-29 2023-02-28 上海船舶运输科学研究所有限公司 Multi-target ship tracking method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563313A (en) * 2017-08-18 2018-01-09 北京航空航天大学 Multiple target pedestrian detection and tracking based on deep learning
US20190073783A1 (en) * 2006-01-04 2019-03-07 Mobileye Vision Technologies Ltd. Estimating distance to an object using a sequence of images recorded by a monocular camera
CN111127513A (en) * 2019-12-02 2020-05-08 北京交通大学 Multi-target tracking method
CN111739053A (en) * 2019-03-21 2020-10-02 四川大学 Online multi-pedestrian detection tracking method under complex scene
CN112288770A (en) * 2020-09-25 2021-01-29 航天科工深圳(集团)有限公司 Video real-time multi-target detection and tracking method and device based on deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190073783A1 (en) * 2006-01-04 2019-03-07 Mobileye Vision Technologies Ltd. Estimating distance to an object using a sequence of images recorded by a monocular camera
CN107563313A (en) * 2017-08-18 2018-01-09 北京航空航天大学 Multiple target pedestrian detection and tracking based on deep learning
CN111739053A (en) * 2019-03-21 2020-10-02 四川大学 Online multi-pedestrian detection tracking method under complex scene
CN111127513A (en) * 2019-12-02 2020-05-08 北京交通大学 Multi-target tracking method
CN112288770A (en) * 2020-09-25 2021-01-29 航天科工深圳(集团)有限公司 Video real-time multi-target detection and tracking method and device based on deep learning

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115719368A (en) * 2022-11-29 2023-02-28 上海船舶运输科学研究所有限公司 Multi-target ship tracking method and system
CN115719368B (en) * 2022-11-29 2024-05-17 上海船舶运输科学研究所有限公司 Multi-target ship tracking method and system

Also Published As

Publication number Publication date
CN112949615B (en) 2021-08-17

Similar Documents

Publication Publication Date Title
Dedeoğlu et al. Silhouette-based method for object classification and human action recognition in video
Mallikarjuna et al. Traffic data collection under mixed traffic conditions using video image processing
Meuter et al. A decision fusion and reasoning module for a traffic sign recognition system
CN111310583A (en) Vehicle abnormal behavior identification method based on improved long-term and short-term memory network
Lee et al. Learning discriminative appearance models for online multi-object tracking with appearance discriminability measures
CN106570490B (en) A kind of pedestrian's method for real time tracking based on quick clustering
CN106778712A (en) A kind of multi-target detection and tracking method
Dai et al. Instance segmentation enabled hybrid data association and discriminative hashing for online multi-object tracking
CN110781785A (en) Traffic scene pedestrian detection method improved based on fast RCNN algorithm
CN113256690B (en) Pedestrian multi-target tracking method based on video monitoring
CN112738470B (en) Method for detecting parking in highway tunnel
CN111882586A (en) Multi-actor target tracking method oriented to theater environment
Liu et al. Dynamic RGB-D SLAM based on static probability and observation number
Li et al. Bi-directional dense traffic counting based on spatio-temporal counting feature and counting-LSTM network
CN112232240A (en) Road sprinkled object detection and identification method based on optimized intersection-to-parallel ratio function
Chen et al. A video-based method with strong-robustness for vehicle detection and classification based on static appearance features and motion features
CN112949615B (en) Multi-target tracking system and method based on fusion detection technology
CN114820765A (en) Image recognition method and device, electronic equipment and computer readable storage medium
Abdullah et al. Vehicle counting using deep learning models: a comparative study
CN117011341A (en) Vehicle track detection method and system based on target tracking
Lafuente-Arroyo et al. Road sign tracking with a predictive filter solution
Dong et al. An automatic object detection and tracking method based on video surveillance
CN108830182B (en) Lane line detection method based on cascade convolution neural network
Buch et al. Urban vehicle tracking using a combined 3D model detector and classifier
CN112528937A (en) Method for detecting starting and stopping of video pumping unit

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant