CN107330920B - Monitoring video multi-target tracking method based on deep learning - Google Patents
- Publication number
- Publication number: CN107330920B (application CN201710504914.5A)
- Authority
- CN
- China
- Prior art keywords
- target
- similarity
- targets
- tracking
- spatial
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a monitoring video multi-target tracking method based on deep learning. The method first decodes a video to extract an image sequence, then preprocesses each image and inputs it into a trained Faster R-CNN network model, which extracts target position information and target spatial features from the corresponding network layers. The target position information and spatial features are input into an LSTM network to predict the position of each target at the next moment. The fused feature of a target at the next moment is obtained by fusing the target's spatial features; the position similarity and the spatial-feature similarity are given different weights and combined into a final similarity, which is used to judge the correspondence between the targets detected at the current moment and the targets in the tracking state at the previous moment. The method can reduce the missed-detection rate of multi-target tracking, improve its accuracy, and handle short-term target occlusion during tracking.
Description
Technical Field
The invention belongs to the technical field of target tracking, and particularly relates to a monitoring video multi-target tracking method based on deep learning.
Background
The construction of safe cities and the spread of high-definition cameras generate massive volumes of surveillance video. Identifying targets in this mass of video data by manpower alone is slow and difficult, and is further hampered by human factors such as visual fatigue. With the development of visual computing, from machine learning to deep learning, computers can understand the information in video more intelligently, and video intelligent analysis systems have emerged. Such a system analyzes the continuous image sequence, extracts the moving targets in each frame, determines the relation between targets in adjacent frames through tracking, determines the moving direction and speed of a target, extracts other characteristic information such as trajectory and gait, and provides the user with functions such as video retrieval, rapid target localization, viewing of target fragments in the video, and collection of target behavior information. Within such a system, a key technology is target tracking: determining the relation between targets in adjacent frames to obtain the complete motion sequence of a target over the whole video. This has become an important research direction. Multi-target tracking of surveillance video is the foundation of a video intelligent analysis system and has an important influence on deeper mining of video information.
Appearance is an important characteristic for describing a target, and target tracking algorithms can be divided into two major categories according to the appearance model. The first is the generative method: it establishes a spatial description of the target with a well-defined algorithm and then compares a number of candidate regions to find the best-matching region. Generative methods emphasize describing the target's appearance while ignoring the background, so drift occurs when the target is occluded. The second is the discriminative method, which follows the tracking-by-detection idea: target detection first extracts the foreground targets, converting tracking into a classification problem. The discriminative model makes full use of both foreground and background information and can better distinguish foreground from background, so it is more robust. However, during online learning and updating with samples, the performance of the classifier is easily affected by sample labeling errors, which causes misclassification.
Most existing deep-learning-based methods use a convolutional neural network to build an appearance model for target tracking. Focusing only on the spatial-domain information of individual images, they ignore the fact that target tracking processes a series of consecutive frames with a close time-sequence relation, and thus fail to mine the more effective temporal information.
Occlusion is a major problem for target tracking and can be divided into two cases. In the first, the target is occluded by background: when occlusion begins the target can still be partially detected, then gradually becomes undetectable until it reappears and is detected again. In the second, two or more targets overlap: when the overlap begins they can still be detected as separate targets, but their spatial features gradually converge; when they merge completely they are detected as a single target, and when they separate again they must be tracked without confusion. A common remedy is to divide the target into several spatial regions, each with its own tracker, so that when the target is partially occluded the remaining trackers continue to track; but running multiple trackers simultaneously makes tracking too slow.
Disclosure of Invention
Aiming at the defects or improvement requirements of the prior art, the invention provides a monitoring video multi-target tracking method based on deep learning. It uses a recurrent neural network to learn the motion rule of a target, fuses the target's features, predicts the target's position, and computes target similarity by combining the temporal and spatial aspects to match targets and realize tracking. The method can reduce the missed-detection rate of multi-target tracking, improve its accuracy, and handle short-term target occlusion during tracking.
In order to achieve the above object, according to one aspect of the present invention, there is provided a monitoring video multi-target tracking method based on deep learning, the method including:
(1) decoding the monitoring video according to a set interval time to obtain an image;
(2) inputting the images obtained by decoding into a trained fast R-CNN target detection network model to obtain the position information and the spatial characteristics of a plurality of targets;
(3) inputting the position information and the spatial characteristics of the target at a plurality of moments into an LSTM network model for LSTM network model offline training, and predicting the position of the target at the next moment by using the trained LSTM network model;
(4) fusing the spatial features of the target at multiple moments to obtain a fused feature; calculating the similarity between the fused feature of the target and the spatial feature extracted for the target at the current moment; calculating the similarity between the position rectangle $R_o$ predicted for the target by the LSTM method and the position rectangle $R_s$ of the target detected at the current moment; and judging the correspondence between newly detected targets and tracked targets by matching the combined spatial features and position information.
Further, the step (2) specifically includes:
(21) inputting the image into a trained Faster R-CNN network model, and extracting a plurality of target information through a network top classification layer and a window regression layer, wherein each target information comprises [ classification number, confidence coefficient, x, y, width and height ];
(22) filtering out targets with confidence degrees lower than a confidence degree threshold value, discarding [ class numbers and confidence degrees ] in target information from the remaining targets, and keeping position information [ x, y, width, height ], wherein the confidence degree threshold value range is 0.15-0.25, preferably 0.2;
(23) and in the region-of-interest pooling layer, extracting the spatial features of a plurality of targets according to the mapping relation of the region generated by the RPN algorithm.
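By way of illustration, the filtering in sub-step (22) can be sketched in a few lines of Python. This is a minimal sketch, not the patented implementation; the detection list layout follows the [classification number, confidence, x, y, width, height] format stated above, and the function name is hypothetical.

```python
def filter_detections(detections, conf_threshold=0.2):
    """Keep only confident detections and strip the fields that are
    independent of the time sequence (class id, confidence).

    Each detection is [class_id, confidence, x, y, width, height];
    the result keeps only the position info [x, y, width, height].
    The threshold lies in the 0.15-0.25 range, 0.2 preferred.
    """
    positions = []
    for det in detections:
        class_id, confidence, x, y, w, h = det
        if confidence >= conf_threshold:
            positions.append([x, y, w, h])
    return positions
```

The remaining [x, y, width, height] lists are exactly what step (3) feeds to the LSTM.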
Further, the specific method for the offline training of the LSTM network model in step (3) is as follows:
(31) decoding the training video at the same time interval to extract an image sequence;
(32) detecting each image in the image sequence through a Faster R-CNN network to obtain the position information and the spatial characteristics of a plurality of targets;
(33) the LSTM network is set to SN layers; each time, the spatial features and position information of the same target in SN consecutive images are taken out and input into the LSTM network; the loss function during training uses the mean squared error of the position information, computed as:

$$\mathrm{loss} = \frac{1}{4}\sum_{j=1}^{4}\left(L_{det}^{(j)} - L_{pred}^{(j)}\right)^2$$

where $L_{det}$ is the position at the next moment detected by the Faster R-CNN network, $L_{pred}$ is the predicted position output by the LSTM network, the index $j$ runs over the four position components, and position values are normalized to the interval $[0, 1]$;
(34) all targets in all training videos are trained through step (33), and the average loss $\mathrm{loss}_{avg}$ is computed as:

$$\mathrm{loss}_{avg} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{loss}_i$$

where N is the number of times all targets in all videos are input into the network and $\mathrm{loss}_i$ is the loss value obtained after each pass through the network in step (33). When $\mathrm{loss}_{avg}$ is smaller than the prediction threshold and the loss function has converged, training ends; the prediction threshold ranges from 0.15 to 0.25, preferably 0.2. Otherwise, another SN consecutive images are selected and step (33) is repeated.
Further, the step (3) is specifically: extracting the spatial features and target positions of the same target in SN consecutive images of the image sequence, inputting them into the trained LSTM network model, and, after LSTM processing, outputting the predicted position of the target at the next moment.
Further, the step (4) specifically includes:
(41) fusing the spatial features of the same target at adjacent consecutive moments to obtain the fused feature of the target, computed as:

$$F_{o,t_n} = \frac{F_{o,t_{n-1}} + F_{s,t_n}}{2}, \qquad F_{o,t_1} = F_{s,t_1}$$

where $t_n$ denotes the moment following $t_{n-1}$, $F_{o,t}$ denotes the fused feature of the target at time $t$, and $F_{s,t_1}$, $F_{s,t_2}$ denote the spatial features of the target at times $t_1$ and $t_2$;
(42) comparing the fused feature $F_{o,t}$ of the target with the spatial feature $F_{s,t}$ extracted for the target at the current moment, and computing the feature similarity F with the cosine similarity:

$$F = \frac{F_{o,t} \cdot F_{s,t}}{\left\| F_{o,t} \right\| \left\| F_{s,t} \right\|}$$
(43) comparing the position rectangle $R_o$ predicted by the LSTM method with the position rectangle $R_s$ of the target detected at the current moment, and computing the position similarity R with the area intersection-over-union (IoU) of the rectangles:

$$R = \frac{S_{inter}}{S_1 + S_2 - S_{inter}}$$

where $S_{inter} = (\min(r_1.r, r_2.r) - \max(r_1.l, r_2.l)) \times (\min(r_1.b, r_2.b) - \max(r_1.t, r_2.t))$; $r_1.t$, $r_1.b$, $r_1.l$, $r_1.r$ denote the top, bottom, left and right boundary values of rectangle $R_s$, and $r_2.t$, $r_2.b$, $r_2.l$, $r_2.r$ those of rectangle $R_o$; $S_{inter}$ is the overlapping area of the two rectangles, and $S_1$, $S_2$ are the areas of $R_s$ and $R_o$ respectively;
(44) combining the feature similarity and the position similarity, balanced by a feature-similarity weight $w_1$ and a position-similarity weight $w_2$, to obtain the comprehensive similarity:
diffs,o=w1F+w2R
(45) comparing a target detected at the current moment with all targets in the tracking state, and taking the target with the maximum comprehensive similarity $diff_{s,o}$ as the pending match; letting that maximum be diff, if diff is greater than the matching threshold the match is considered successful, where the matching threshold ranges from 0.6 to 0.7, preferably 0.65.
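Sub-steps (42)-(45) can be sketched as a short, self-contained Python fragment. It follows the cosine-similarity and IoU formulas above and the $diff_{s,o} = w_1 F + w_2 R$ matching rule; the function names and the (left, top, right, bottom) rectangle convention are assumptions for illustration.

```python
import math

def cosine_similarity(f1, f2):
    """Feature similarity F between fused and current spatial features."""
    dot = sum(a * b for a, b in zip(f1, f2))
    n1 = math.sqrt(sum(a * a for a in f1))
    n2 = math.sqrt(sum(b * b for b in f2))
    return dot / (n1 * n2)

def iou(r1, r2):
    """Position similarity R; rectangles as (left, top, right, bottom)."""
    w = min(r1[2], r2[2]) - max(r1[0], r2[0])
    h = min(r1[3], r2[3]) - max(r1[1], r2[1])
    inter = max(w, 0) * max(h, 0)
    s1 = (r1[2] - r1[0]) * (r1[3] - r1[1])
    s2 = (r2[2] - r2[0]) * (r2[3] - r2[1])
    return inter / (s1 + s2 - inter)

def match(detected, tracked, w1=0.6, w2=0.4, threshold=0.65):
    """Return the index of the best-matching tracked target, or None
    for a new target. Entries are (feature_vector, rect) pairs."""
    feat_s, rect_s = detected
    best_idx, best_diff = None, -1.0
    for i, (feat_o, rect_o) in enumerate(tracked):
        diff = w1 * cosine_similarity(feat_o, feat_s) + w2 * iou(rect_o, rect_s)
        if diff > best_diff:
            best_idx, best_diff = i, diff
    return best_idx if best_diff > threshold else None
```

The weights 0.6/0.4 and threshold 0.65 are the preferred values given later in the embodiment.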
Generally, compared with the prior art, the technical scheme of the invention has the following technical characteristics and beneficial effects:
(1) the method adopts a Faster R-CNN target detection algorithm to detect the position information and the spatial characteristics of a plurality of targets in a frame of image, the extracted spatial characteristics have stronger expression capability on the targets, and higher similarity is realized through the matching of appearance models;
(2) the method inputs target position information and target space characteristics into an LSTM network, learns the motion rule of the target by using an LSTM recurrent neural network, has strong prediction capability on the target position, and has higher similarity when the target position is matched;
(3) the technical scheme of the invention obtains the fusion characteristics of the target at the next moment by fusing the spatial characteristics of the target, obtains the final similarity by adding different weights in two aspects of position similarity and spatial characteristic similarity, and then judges the corresponding relation between a plurality of targets detected at the current moment and a plurality of targets in a tracking state before; the features of the two aspects of the predicted position and the fusion feature are combined for use, so that the accuracy of target tracking can be further improved.
Drawings
FIG. 1 is a schematic flow chart of the main process of the present invention;
FIG. 2 is a schematic view of a process for computing spatiotemporal features of an object in the method of the present invention;
FIG. 3 is a flow chart of the matching of targets in the method of the present invention;
FIG. 4 shows the result of the video Venice-1 tracking according to the present invention;
FIG. 5 shows the tracking result of the method of the present invention on the target-intensive video MOT 16-3.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The process of the method of the invention is shown in figure 1:
(1) video frame-by-frame decoding:
(11) setting a decoding interval frame number on the basis of extracting 4 frames per second, wherein if the video fps is 24, the interval frame number is 6;
(12) decoding the video to obtain an image in real time by using Opencv according to the decoding interval frame number;
(13) performing a pre-processing operation on the image, and scaling the image to 224 x 224 pixel size to adapt to the size of the Faster R-CNN network;
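A minimal sketch of sub-steps (11)-(13), assuming OpenCV's Python bindings; the helper names are illustrative and the 224x224 target size matches the Faster R-CNN input stated above.

```python
def interval_frames(fps, frames_kept_per_second=4):
    """Decoding interval: with fps=24 and 4 kept frames per second,
    the interval is 6 frames."""
    return int(round(fps / frames_kept_per_second))

def decode_video(path, target_size=(224, 224)):
    """Decode a video at the configured interval and resize each kept
    frame to the network input size. Requires OpenCV (cv2)."""
    import cv2  # imported here so the pure-Python helper above stays dependency-free
    cap = cv2.VideoCapture(path)
    interval = interval_frames(cap.get(cv2.CAP_PROP_FPS) or 24)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % interval == 0:
            frames.append(cv2.resize(frame, target_size))
        index += 1
    cap.release()
    return frames
```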
(2) fast R-CNN target detection:
the Faster R-CNN is a convolutional neural network, and the target detection by using the Faster R-CNN firstly needs to train model parameters offline and then process images online to obtain the position and the spatial characteristics of a target;
the process of training the model offline is as follows: marking out the target position and the target classification in the image, so that the network can perform back propagation through the set mark to determine the model parameters; the model training is a supervised training process, and the image sample of the training also uses the image extracted from the video to ensure the similarity with the actual used scene; the model training is a repeated iteration process, and the output error of the final model is within a certain range through feedback adjustment;
the on-line actual treatment process comprises the following steps:
(21) inputting the image into a trained Faster R-CNN network model, and extracting a plurality of target information through a network top classification layer and a window regression layer, wherein each target information comprises [ classification number, confidence coefficient, x, y, width and height ];
(22) filtering out targets with confidence lower than the confidence threshold 0.2; the [classification number, confidence] fields are discarded because they are independent of the time sequence, and the time-sequence-relevant position information [x, y, width, height] is kept;
(23) in the region-of-interest pooling layer, the spatial features of the targets are extracted according to the mapping relation of the regions generated by the RPN algorithm; since different targets differ in appearance, these features can be used to distinguish them;
(3) LSTM target prediction:
the LSTM is a cyclic neural network, and the LSTM is used for target prediction, and similarly, an offline training model is needed firstly, and then an online image is processed for prediction;
the process of training the model offline is as follows:
(31) decoding the training video according to the time interval of extracting 4 frames per second in the step (11), and extracting an image sequence;
(32) detecting each image in the image sequence through a Faster R-CNN network to obtain the position information and the spatial characteristics of a plurality of targets;
(33) the LSTM network is set to 6 layers; each time, the spatial features and position information of the same target in 6 consecutive adjacent images are taken out and input into the LSTM network; the loss function during training uses the mean squared error of the position information, computed as:

$$\mathrm{loss} = \frac{1}{4}\sum_{j=1}^{4}\left(L_{det}^{(j)} - L_{pred}^{(j)}\right)^2$$

where $L_{det}$ is the position at the next moment detected by the Faster R-CNN network, $L_{pred}$ is the predicted position output by the LSTM network, and the index $j$ runs over the four position components; position values are normalized to the interval $[0, 1]$;
(34) through continuous iteration, the position output by the LSTM network is enabled to be closer to the position of the target at the next moment, and the motion rule of the target is learned to predict the track.
The online actual processing is as follows: the spatial features and position information of the same target in 6 consecutive images are taken out and input into the network; the output after network processing is the predicted position of the target at the next moment;
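The per-target bookkeeping for this online step — keeping the last 6 observations of a target so they can be fed to the network as a sequence — can be sketched as below. This shows only the buffering; the LSTM itself is omitted, and the class name is hypothetical.

```python
from collections import deque

class TargetHistory:
    """Buffers the last SN (here 6) observations of one target so they
    can be fed to the LSTM as an input sequence."""
    SN = 6

    def __init__(self):
        self.positions = deque(maxlen=self.SN)  # [x, y, w, h] per frame
        self.features = deque(maxlen=self.SN)   # spatial feature per frame

    def add(self, position, feature):
        self.positions.append(position)
        self.features.append(feature)

    def ready(self):
        """True once SN consecutive observations are buffered."""
        return len(self.positions) == self.SN

    def lstm_input(self):
        """Sequence of (position, feature) pairs, oldest first."""
        return list(zip(self.positions, self.features))
```

The `maxlen` deque automatically drops the oldest observation, so the buffer always holds the most recent 6 consecutive frames.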
(4) target matching:
(41) fusing the spatial features of the same target at adjacent consecutive moments to obtain the fused feature of the target, computed as:

$$F_{o,t_n} = \frac{F_{o,t_{n-1}} + F_{s,t_n}}{2}, \qquad F_{o,t_1} = F_{s,t_1}$$

where $t_n$ denotes the moment following $t_{n-1}$, $F_{o,t}$ denotes the fused feature of the target at time $t$, and $F_{s,t_1}$, $F_{s,t_2}$ denote the spatial features of the target at times $t_1$ and $t_2$; the fused-feature calculation is illustrated in FIG. 2;
(42) comparing the fused feature $F_{o,t}$ of the target with the spatial feature $F_{s,t}$ extracted for the target at the current moment, and computing the feature similarity F with the cosine similarity:

$$F = \frac{F_{o,t} \cdot F_{s,t}}{\left\| F_{o,t} \right\| \left\| F_{s,t} \right\|}$$
(43) comparing the position rectangle $R_o$ predicted by the LSTM method with the position rectangle $R_s$ of the target detected at the current moment; the position similarity R is computed with the area intersection-over-union (IoU) of the rectangles. The IoU of rectangles $r_1$ and $r_2$, $IoU_{1,2}$, is computed as:

$$IoU_{1,2} = \frac{S_{inter}}{S_1 + S_2 - S_{inter}}$$

$$S_{inter} = (\min(r_1.r, r_2.r) - \max(r_1.l, r_2.l)) \times (\min(r_1.b, r_2.b) - \max(r_1.t, r_2.t))$$

where $r_1.t$, $r_1.b$, $r_1.l$, $r_1.r$ denote the top, bottom, left and right boundary values of rectangle $r_1$, $r_2.t$, $r_2.b$, $r_2.l$, $r_2.r$ those of rectangle $r_2$, $S_{inter}$ is the overlapping area of the two rectangles, and $S_1$, $S_2$ are the areas of the two rectangles respectively;
(44) combining the feature similarity and the position similarity, balanced by a feature-similarity weight $w_1 = 0.6$ and a position-similarity weight $w_2 = 0.4$ for the best effect; the final matching strategy is:
diffs,o=w1F+w2R
(45) comparing the similarity of a target detected at the current moment with all targets in the tracking state, and taking the target with the maximum comprehensive similarity $diff_{s,o}$ as the pending match; letting that maximum be diff, if diff is greater than the matching threshold 0.65 the match is considered successful, otherwise the match fails and the target is set as a new target. The flow chart of the whole process is shown in FIG. 3;
(46) after matching, the tracking state of the target needs to be updated. A target has four tracking states: the initial tracking state OS_BEGIN, the in-tracking state OS_TRACKING, the unmatched state OS_UNMATCH, and the tracking-end state OS_END. The unmatched state means the target is occluded, missed by the detector, or has just left the picture, and still needs to be tracked; the tracking-end state means the target may have failed to track or has left the video picture. The initial tracking state of a target is set during the matching process;
if the target is processed in the initial tracking state, updating the target to be in the tracking process after the current time is processed, and enabling the target to be matched at the next time;
if the target is in the tracking process state, checking whether the target is matched with the target under tracking at the current moment, if the target is matched successfully, adding the information into the target, updating the fusion characteristic of the target and predicting the position information of the target at the next moment; if the matching fails, setting the target to be in a non-matching state, and taking the predicted position as the position of the target at the current moment;
if the target is in the unmatched state, it was not successfully matched at the previous moment, and it must be checked whether it is matched at the current moment; if the match succeeds, the target is set to the in-tracking state, indicating that the occlusion has ended or the target has been detected again; if the match fails, the number of consecutive frames the target has been unmatched is checked: if it exceeds a certain number of frames, the target has left the picture or tracking has failed, and the tracking-end state is set; otherwise the current state is kept and the prediction continues to be used as the target's result at the current moment;
and if the target is in the tracking-end state, it is removed from the tracking queue and no further target matching is performed for it.
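The state-update rules of sub-step (46) amount to a small state machine, sketched below. The unmatched-frame cutoff is not specified numerically in the text ("a certain frame number"), so the value used here is an assumption, as is the function name.

```python
OS_BEGIN, OS_TRACKING, OS_UNMATCH, OS_END = range(4)
MAX_UNMATCHED_FRAMES = 10  # assumed cutoff; the text only says "a certain frame number"

def update_state(state, matched, unmatched_frames=0):
    """One step of the tracking-state update described above.
    Returns (new_state, new_unmatched_frame_count)."""
    if state == OS_BEGIN:
        return OS_TRACKING, 0                    # eligible for matching at the next moment
    if state == OS_TRACKING:
        if matched:
            return OS_TRACKING, 0                # update fused feature, predict next position
        return OS_UNMATCH, 1                     # use predicted position as the current result
    if state == OS_UNMATCH:
        if matched:
            return OS_TRACKING, 0                # occlusion ended or target re-detected
        if unmatched_frames + 1 > MAX_UNMATCHED_FRAMES:
            return OS_END, unmatched_frames + 1  # left the picture or tracking failed
        return OS_UNMATCH, unmatched_frames + 1  # keep using the prediction
    return OS_END, unmatched_frames              # removed from the tracking queue
```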
And (3) experimental test:
the host software environment of the experiment is Ubuntu 16.04LTS 64 bit, OpenCV 3.1.0 and CUDA 8.0, and the CPU in the hardware configuration is Intel Core i5-6500 and the GPU is GeForce GTX 1080. The evaluation method in MOT Challenge is used for selecting the following indexes:
FN: the number of missed detections; the lower the value, the better;
FP: the number of false alarms; the lower the value, the better;
IDSW: the number of identity switches over all targets; the lower the value, the better;
MOTA: multi-target tracking accuracy, computed from the three counts above (missed detections, false alarms and identity switches); it is the main comprehensive index in multi-target tracking evaluation, reflecting how many targets are tracked and how accurately they are matched; the higher the value, the better;
MOTP: multi-target tracking precision, computed from the average bounding-box overlap of all tracked targets, reflecting the positional accuracy of the results; the higher the value, the better;
Hz: the average number of image frames processed per second over a period of time, an index of the execution efficiency and speed of the tracking algorithm; the higher the value, the better;
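For reference, MOTA as used in MOT Challenge combines the three error counts as below. The text does not spell out the formula, so this sketch assumes the standard MOT Challenge definition.

```python
def mota(fn, fp, idsw, num_gt):
    """Multi-object tracking accuracy (standard MOT Challenge form,
    assumed here): 1 minus the sum of missed detections, false alarms
    and identity switches, normalized by the number of ground-truth
    object instances num_gt."""
    return 1.0 - (fn + fp + idsw) / num_gt
```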
The experiments compare three strategies on the MOT Challenge data set: M1 matches using the spatial deep features alone (as in most current convolutional-neural-network-based methods), M2 matches using the time-sequence positions alone, and M3 combines the two. The videos Venice-1 and MOT16-3 were tracked with each of the three methods; the video information is shown in Table 1 and the experimental results in Table 2 below.
As can be seen from tables 1 and 2, the video Venice-1 has moderate target density and high feature discrimination, good results can be obtained by using each strategy alone, and the accuracy can be improved by the multi-strategy method. For the video MOT16-3, the target density is high, and the accuracy is greatly improved by using a multi-strategy method.
Table 1 test video information table
Video | Venice-1 | MOT16-3
Resolution | 1920*1080 | 1920*1080
Duration (frames) | 450 | 1500
Number of targets | 4563 | 104556
Target density | Medium | Very high
Table 2 tracking results of different videos under different strategies
From the IDSW score, it can be seen that the value of the video MOT16-3 in method M2 is much larger than that in method M3, which indicates that in the case of high object density, the matching based on the position is prone to error, mainly because the high density results in high position overlapping rate of multiple objects and is not easy to distinguish. From FN, it can be seen that the values at method M2 are less than M1 for both videos, indicating that the prediction capability of LSTM can reduce the false negative rate.
The tracking result of the video Venice-1 is shown in FIG. 4, and the tracking result of the video MOT16-3 in a partial area is shown in FIG. 5, where a gray thin-line rectangular box represents the result of tracking after the target is detected, and a white thick-line rectangular box represents the result of predicting the target that is not detected or is in occlusion. The numbers above the rectangular frame represent the tracked target numbers, so that the matching conditions of the targets are compared, and the numbers at the lower right corner represent the frame numbers of the images in the video. The displayed tracking result shows that the pedestrians in the image are correctly tracked as the same target.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (4)
1. A monitoring video multi-target tracking method based on deep learning is characterized by comprising the following steps:
(1) decoding the monitoring video according to a set interval time to obtain an image;
(2) inputting the images obtained by decoding into a trained fast R-CNN target detection network model to obtain the position information and the spatial characteristics of a plurality of targets;
(3) inputting the position information and the spatial characteristics of the target at a plurality of moments into an LSTM network model for LSTM network model offline training, and predicting the position of the target at the next moment by using the trained LSTM network model;
(4) fusing the spatial features of the target at multiple moments to obtain a fused feature; calculating the similarity between the fused feature of the target and the spatial feature extracted for the target at the current moment; calculating the similarity between the position rectangle $R_o$ predicted for the target by the LSTM method and the position rectangle $R_s$ of the target detected at the current moment; and judging the correspondence between newly detected targets and tracked targets by matching the combined spatial features and position information; the step (4) is specifically as follows:
(41) fusing the spatial features of the same target at adjacent consecutive moments to obtain the fused feature of the target, computed as:

$$F_{o,t_n} = \frac{F_{o,t_{n-1}} + F_{s,t_n}}{2}, \qquad F_{o,t_1} = F_{s,t_1}$$

where $t_n$ denotes the moment following $t_{n-1}$, $F_{o,t}$ denotes the fused feature of the target at time $t$, and $F_{s,t_1}$, $F_{s,t_2}$ denote the spatial features of the target at times $t_1$ and $t_2$;
(42) comparing the fused feature $F_{o,t}$ of the target with the spatial feature $F_{s,t}$ extracted for the target at the current moment, and computing the feature similarity F with the cosine similarity:

$$F = \frac{F_{o,t} \cdot F_{s,t}}{\left\| F_{o,t} \right\| \left\| F_{s,t} \right\|}$$
(43) comparing the position rectangle R_o predicted by the LSTM method with the position rectangle R_s of the target detected at the current moment, and calculating the position similarity R using the area intersection-over-union (IOU) of the rectangles, expressed as follows:

R = S_inter / (S_1 + S_2 − S_inter)

wherein S_inter = (min(r_1.r, r_2.r) − max(r_1.l, r_2.l)) * (min(r_1.b, r_2.b) − max(r_1.t, r_2.t)); r_1.t, r_1.b, r_1.l, r_1.r respectively represent the top, bottom, left and right boundary values of rectangle R_s; r_2.t, r_2.b, r_2.l, r_2.r respectively represent the top, bottom, left and right boundary values of rectangle R_o; S_inter is the overlapping area of the two rectangles; and S_1 and S_2 respectively represent the areas of the two rectangles R_s and R_o;
(44) combining the feature similarity and the position similarity, balanced with a feature similarity weight w_1 and a position similarity weight w_2, to obtain the comprehensive similarity:

diff_s,o = w_1·F + w_2·R
(45) comparing a target detected at the current moment with all targets in the tracking state, and taking the target with the largest comprehensive similarity diff_s,o as the pending matching target; setting this maximum value as diff: if diff is greater than the matching threshold, the matching is considered successful; otherwise the matching fails and the detection is set as a new target.
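The similarity computation and matching of steps (41)-(45) can be sketched as follows. This is a minimal NumPy illustration only: the function names, the equal weights w_1 = w_2 = 0.5, and the 0.6 matching threshold are assumed example values, not values specified by the patent.

```python
import numpy as np

def cosine_similarity(f_o, f_s):
    # Feature similarity F between the fused feature and the current spatial feature.
    return float(np.dot(f_o, f_s) / (np.linalg.norm(f_o) * np.linalg.norm(f_s)))

def iou(r1, r2):
    # Position similarity R: area intersection-over-union of two rectangles.
    # Rectangles are (left, top, right, bottom) tuples.
    l = max(r1[0], r2[0]); t = max(r1[1], r2[1])
    r = min(r1[2], r2[2]); b = min(r1[3], r2[3])
    if r <= l or b <= t:
        return 0.0  # no overlap
    s_inter = (r - l) * (b - t)
    s1 = (r1[2] - r1[0]) * (r1[3] - r1[1])
    s2 = (r2[2] - r2[0]) * (r2[3] - r2[1])
    return s_inter / (s1 + s2 - s_inter)

def match_detection(det_feat, det_box, tracks, w1=0.5, w2=0.5, thresh=0.6):
    # Compare one detection against all tracked targets and pick the largest
    # comprehensive similarity diff = w1*F + w2*R; below the threshold the
    # detection starts a new track (returned track id is None).
    best_id, best_diff = None, -1.0
    for tid, (fused_feat, pred_box) in tracks.items():
        diff = w1 * cosine_similarity(fused_feat, det_feat) + w2 * iou(pred_box, det_box)
        if diff > best_diff:
            best_id, best_diff = tid, diff
    return (best_id, best_diff) if best_diff > thresh else (None, best_diff)
```

A detection that overlaps a predicted rectangle well and matches its fused appearance feature is assigned to that track; everything else is registered as a new target.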
2. The monitoring video multi-target tracking method based on deep learning as claimed in claim 1, wherein the step (2) specifically comprises:
(21) inputting the image into the trained Faster R-CNN network model, and extracting a plurality of target information items through the classification layer and window regression layer at the top of the network, wherein each target information item comprises [class number, confidence, x, y, width, height];
(22) filtering out targets whose confidence is lower than a confidence threshold, discarding [class number, confidence] from the information of the remaining targets, and retaining the position information [x, y, width, height];
(23) in the region-of-interest pooling layer, extracting the spatial features of the plurality of targets according to the mapping relation of the regions generated by the RPN algorithm.
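The filtering stage of steps (21)-(22) can be sketched as follows; the 0.7 confidence threshold is an assumed example value, not one given by the patent.

```python
def filter_detections(detections, conf_thresh=0.7):
    """Keep only confident detections and strip them down to position info.

    Each detection is [class_number, confidence, x, y, width, height];
    the returned entries keep only [x, y, width, height].
    """
    return [det[2:] for det in detections if det[1] >= conf_thresh]
```

The class number and confidence are only needed for filtering; tracking downstream works purely on positions and spatial features.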
3. The monitoring video multi-target tracking method based on deep learning as claimed in claim 1, wherein the offline training of the LSTM network model in the step (3) is specifically as follows:
(31) decoding the training video at the same time interval to extract an image sequence;
(32) detecting each image in the image sequence through a Faster R-CNN network to obtain the position information and the spatial characteristics of a plurality of targets;
(33) the LSTM network is set to SN layers; each time, the spatial features and position information of the same target in SN consecutive images are taken out and input into the LSTM network; the loss function in training uses the mean square error of the position information, calculated as follows:

loss = ‖L_det − L_pred‖²

wherein L_det is the position at the next moment detected by the Faster R-CNN network, L_pred is the predicted position output by the LSTM network, and the position values are normalized to the interval [0, 1];
(34) training on the targets in all training videos through step (33) and calculating the average loss loss_avg, as follows:

loss_avg = (1/N) · Σ_{i=1..N} loss_i

wherein N is the number of times all targets in all videos are input into the network, and loss_i is the loss value obtained after each pass through the network in step (33); when loss_avg is smaller than the prediction threshold and the loss function has converged, the training is finished; otherwise, another SN consecutive images are selected and step (33) is repeated.
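The loss bookkeeping of steps (33)-(34) can be sketched as follows; positions are assumed already normalized to [0, 1], and the 1e-3 prediction threshold is a placeholder, not a value from the patent.

```python
import numpy as np

def position_loss(l_det, l_pred):
    # Mean squared error between the next-frame position detected by
    # Faster R-CNN (L_det) and the position predicted by the LSTM (L_pred),
    # both given as [x, y, width, height] normalized to [0, 1].
    l_det, l_pred = np.asarray(l_det, float), np.asarray(l_pred, float)
    return float(np.mean((l_det - l_pred) ** 2))

def average_loss(losses):
    # loss_avg = (1/N) * sum(loss_i) over all N training passes.
    return sum(losses) / len(losses)

def training_finished(losses, pred_thresh=1e-3):
    # Stop training once the average loss falls below the prediction threshold.
    return average_loss(losses) < pred_thresh
```

In practice the convergence check would also look at the loss trend over recent iterations, not just the running average.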
4. The monitoring video multi-target tracking method based on deep learning as claimed in claim 1, wherein the step (3) is specifically as follows: extracting the spatial features and target positions of the same target in SN consecutive images of the image sequence, inputting them into the trained LSTM network model, and outputting the predicted position of the target at the next moment after processing by the LSTM network model.
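The prediction interface of step (3) takes SN consecutive positions in and returns one predicted next position. The trained LSTM itself is not reproduced here; a constant-velocity extrapolation stands in for it purely to illustrate that contract, and is not the patent's predictor.

```python
import numpy as np

def predict_next_position(positions):
    """positions: SN consecutive [x, y, width, height] rows for one target.

    The patent feeds these into a trained LSTM network model; here a
    constant-velocity extrapolation is used as a stand-in so the
    SN-frames-in, next-position-out interface can be exercised.
    """
    p = np.asarray(positions, dtype=float)
    if len(p) < 2:
        return p[-1]                     # too little history: hold position
    return p[-1] + (p[-1] - p[-2])       # last position plus last displacement
```

The returned rectangle plays the role of R_o in step (43): it is compared against the detected rectangle R_s via IOU.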
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710504914.5A CN107330920B (en) | 2017-06-28 | 2017-06-28 | Monitoring video multi-target tracking method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107330920A CN107330920A (en) | 2017-11-07 |
CN107330920B true CN107330920B (en) | 2020-01-03 |
Family
ID=60198399
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710504914.5A Active CN107330920B (en) | 2017-06-28 | 2017-06-28 | Monitoring video multi-target tracking method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107330920B (en) |
Families Citing this family (44)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108055501A (en) * | 2017-11-22 | 2018-05-18 | 天津市亚安科技有限公司 | A kind of target detection and the video monitoring system and method for tracking |
CN108229407A (en) * | 2018-01-11 | 2018-06-29 | 武汉米人科技有限公司 | A kind of behavioral value method and system in video analysis |
CN108280843A (en) * | 2018-01-24 | 2018-07-13 | 新华智云科技有限公司 | A kind of video object detecting and tracking method and apparatus |
CN108537818B (en) * | 2018-03-07 | 2020-08-14 | 上海交通大学 | Crowd trajectory prediction method based on cluster pressure LSTM |
CN108320297B (en) * | 2018-03-09 | 2020-06-19 | 湖北工业大学 | Video target real-time tracking method and system |
CN108509876B (en) * | 2018-03-16 | 2020-11-27 | 深圳市商汤科技有限公司 | Object detection method, device, apparatus, storage medium, and program for video |
CN108491816A (en) * | 2018-03-30 | 2018-09-04 | 百度在线网络技术(北京)有限公司 | The method and apparatus for carrying out target following in video |
CN110349182A (en) * | 2018-04-07 | 2019-10-18 | 苏州竺星信息科技有限公司 | A kind of personage's method for tracing based on video and positioning device |
CN108520530B (en) * | 2018-04-12 | 2020-01-14 | 厦门大学 | Target tracking method based on long-time and short-time memory network |
CN108764032B (en) * | 2018-04-18 | 2019-12-24 | 北京百度网讯科技有限公司 | Intelligent monitoring method and device for coal mine water exploration and drainage, computer equipment and storage medium |
DE102018206208A1 (en) * | 2018-04-23 | 2019-10-24 | Robert Bosch Gmbh | Method, device, product and computer program for operating a technical system |
CN108664930A (en) * | 2018-05-11 | 2018-10-16 | 西安天和防务技术股份有限公司 | A kind of intelligent multi-target detection tracking |
CN108664935A (en) * | 2018-05-14 | 2018-10-16 | 中山大学新华学院 | The method for tracking target and system of depth Spatial-temporal Information Fusion based on CUDA |
CN108805907B (en) * | 2018-06-05 | 2022-03-29 | 中南大学 | Pedestrian posture multi-feature intelligent identification method |
CN108875819B (en) * | 2018-06-08 | 2020-10-27 | 浙江大学 | Object and component joint detection method based on long-term and short-term memory network |
CN110688873A (en) * | 2018-07-04 | 2020-01-14 | 上海智臻智能网络科技股份有限公司 | Multi-target tracking method and face recognition method |
CN109063574B (en) * | 2018-07-05 | 2021-04-23 | 顺丰科技有限公司 | Method, system and equipment for predicting envelope frame based on deep neural network detection |
CN109344725B (en) * | 2018-09-04 | 2020-09-04 | 上海交通大学 | Multi-pedestrian online tracking method based on space-time attention mechanism |
CN109308469B (en) * | 2018-09-21 | 2019-12-10 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating information |
CN109241952B (en) * | 2018-10-26 | 2021-09-07 | 北京陌上花科技有限公司 | Figure counting method and device in crowded scene |
CN111104831B (en) * | 2018-10-29 | 2023-09-29 | 香港城市大学深圳研究院 | Visual tracking method, device, computer equipment and medium |
CN109409307B (en) * | 2018-11-02 | 2022-04-01 | 深圳龙岗智能视听研究院 | Online video behavior detection method based on space-time context analysis |
CN109584213B (en) * | 2018-11-07 | 2023-05-30 | 复旦大学 | Multi-target number selection tracking method |
CN109740416B (en) * | 2018-11-19 | 2021-02-12 | 深圳市华尊科技股份有限公司 | Target tracking method and related product |
CN109800689B (en) * | 2019-01-04 | 2022-03-29 | 西南交通大学 | Target tracking method based on space-time feature fusion learning |
CN109934096B (en) * | 2019-01-22 | 2020-12-11 | 浙江零跑科技有限公司 | Automatic driving visual perception optimization method based on characteristic time sequence correlation |
CN109934115B (en) * | 2019-02-18 | 2021-11-02 | 苏州市科远软件技术开发有限公司 | Face recognition model construction method, face recognition method and electronic equipment |
CN110033469B (en) * | 2019-04-01 | 2021-08-27 | 北京科技大学 | Sub-pixel edge detection method and system |
CN110276783B (en) * | 2019-04-23 | 2021-01-08 | 上海高重信息科技有限公司 | Multi-target tracking method and device and computer system |
CN110175538A (en) * | 2019-05-10 | 2019-08-27 | 国网福建省电力有限公司龙岩供电公司 | A kind of substation's Bird's Nest recognition methods and system based on machine learning |
CN110111358B (en) * | 2019-05-14 | 2022-05-24 | 西南交通大学 | Target tracking method based on multilayer time sequence filtering |
CN110333517B (en) * | 2019-07-11 | 2022-11-25 | 腾讯科技(深圳)有限公司 | Obstacle sensing method, obstacle sensing device and storage medium |
CN110598540B (en) * | 2019-08-05 | 2021-12-03 | 华中科技大学 | Method and system for extracting gait contour map in monitoring video |
CN110443829A (en) * | 2019-08-05 | 2019-11-12 | 北京深醒科技有限公司 | An anti-occlusion tracking algorithm based on motion features and similarity features |
CN110944295B (en) * | 2019-11-27 | 2021-09-21 | 恒安嘉新(北京)科技股份公司 | Position prediction method, position prediction device, storage medium and terminal |
CN111027505B (en) * | 2019-12-19 | 2022-12-23 | 吉林大学 | Hierarchical multi-target tracking method based on significance detection |
SG10201913754XA (en) * | 2019-12-30 | 2020-12-30 | Sensetime Int Pte Ltd | Image processing method and apparatus, electronic device, and storage medium |
US11631251B2 (en) | 2020-02-23 | 2023-04-18 | Tfi Digital Media Limited | Method and system for jockey and horse recognition and tracking |
CN112001252B (en) * | 2020-07-22 | 2024-04-12 | 北京交通大学 | Multi-target tracking method based on different composition network |
CN112070807B (en) * | 2020-11-11 | 2021-02-05 | 湖北亿咖通科技有限公司 | Multi-target tracking method and electronic device |
CN112529941B (en) * | 2020-12-17 | 2021-08-31 | 深圳市普汇智联科技有限公司 | Multi-target tracking method and system based on depth trajectory prediction |
CN112906545B (en) * | 2021-02-07 | 2023-05-05 | 广东省科学院智能制造研究所 | Real-time action recognition method and system for multi-person scene |
CN113569824B (en) * | 2021-09-26 | 2021-12-17 | 腾讯科技(深圳)有限公司 | Model processing method, related device, storage medium and computer program product |
CN114240997B (en) * | 2021-11-16 | 2023-07-28 | 南京云牛智能科技有限公司 | Intelligent building online trans-camera multi-target tracking method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106127802A (en) * | 2016-06-16 | 2016-11-16 | 南京邮电大学盐城大数据研究院有限公司 | A kind of movement objective orbit method for tracing |
CN106845430A (en) * | 2017-02-06 | 2017-06-13 | 东华大学 | Pedestrian detection and tracking based on acceleration region convolutional neural networks |
CN106875425A (en) * | 2017-01-22 | 2017-06-20 | 北京飞搜科技有限公司 | A kind of multi-target tracking system and implementation method based on deep learning |
Non-Patent Citations (3)
Title |
---|
Faster R-CNN: Towards Real-time Object Detection with Region Proposal Networks; Shaoqing Ren et al.; Proceedings of the 28th International Conference on Neural Information Processing Systems; 2015-12-12; Vol. 1; pp. 91-99 * |
MULTI-TARGET DETECTION IN CCTV FOOTAGE FOR TRACKING APPLICATIONS USING DEEP LEARNING TECHNIQUES; A. Dimou et al.; 2016 IEEE International Conference on Image Processing; 2016-08-19; pp. 1-5 * |
Research Status and Prospects of Deep-Learning-Based Target Tracking Methods; Luo Haibo et al.; Infrared and Laser Engineering; May 2017; Vol. 46, No. 5; pp. 1-7 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107330920B (en) | Monitoring video multi-target tracking method based on deep learning | |
CN109492581B (en) | Human body action recognition method based on TP-STG frame | |
Yang et al. | Spatio-temporal action detection with cascade proposal and location anticipation | |
KR101653278B1 (en) | Face tracking system using colar-based face detection method | |
Li et al. | Rapid and robust human detection and tracking based on omega-shape features | |
CN107230267B (en) | Intelligence In Baogang Kindergarten based on face recognition algorithms is registered method | |
CN105260749B (en) | Real-time target detection method based on direction gradient binary pattern and soft cascade SVM | |
Lin et al. | Social mil: Interaction-aware for crowd anomaly detection | |
CN113011367A (en) | Abnormal behavior analysis method based on target track | |
KR102132722B1 (en) | Tracking method and system multi-object in video | |
JP7136500B2 (en) | Pedestrian Re-identification Method for Random Occlusion Recovery Based on Noise Channel | |
CN112926522B (en) | Behavior recognition method based on skeleton gesture and space-time diagram convolution network | |
CN111191535B (en) | Pedestrian detection model construction method based on deep learning and pedestrian detection method | |
CN104616006A (en) | Surveillance video oriented bearded face detection method | |
Mao et al. | Training a scene-specific pedestrian detector using tracklets | |
CN103971100A (en) | Video-based camouflage and peeping behavior detection method for automated teller machine | |
CN111881775B (en) | Real-time face recognition method and device | |
CN109711232A (en) | Deep learning pedestrian recognition methods again based on multiple objective function | |
Heili et al. | Parameter estimation and contextual adaptation for a multi-object tracking CRF model | |
CN110852203B (en) | Multi-factor suspicious person identification method based on video feature learning | |
Kim et al. | Development of a real-time automatic passenger counting system using head detection based on deep learning | |
Moayed et al. | Traffic intersection monitoring using fusion of GMM-based deep learning classification and geometric warping | |
Li et al. | Pedestrian Motion Path Detection Method Based on Deep Learning and Foreground Detection | |
CN117058627B (en) | Public place crowd safety distance monitoring method, medium and system | |
Shi et al. | High-altitude parabolic detection method based on GMM model and SORT algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |