CN107330920B - Monitoring video multi-target tracking method based on deep learning - Google Patents

Monitoring video multi-target tracking method based on deep learning

Info

Publication number
CN107330920B
CN107330920B (application CN201710504914.5A)
Authority
CN
China
Prior art keywords
target
similarity
targets
tracking
spatial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710504914.5A
Other languages
Chinese (zh)
Other versions
CN107330920A (en)
Inventor
凌贺飞
李叶
李平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201710504914.5A priority Critical patent/CN107330920B/en
Publication of CN107330920A publication Critical patent/CN107330920A/en
Application granted granted Critical
Publication of CN107330920B publication Critical patent/CN107330920B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 - Image acquisition modality
    • G06T 2207/10016 - Video; Image sequence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20084 - Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a monitoring-video multi-target tracking method based on deep learning. The method first decodes the video to extract an image sequence, preprocesses each image and inputs it into a trained Faster R-CNN network model, which extracts target position information and target spatial features from the corresponding layers of the network; the target position information and spatial features are then input into an LSTM network to predict the position of each target at the next moment. The fused feature of a target at the next moment is obtained by fusing its spatial features, the position similarity and the spatial-feature similarity are combined with different weights to obtain the final similarity, and the correspondence between the targets detected at the current moment and the targets in the tracking state at the previous moment is then determined. The method can reduce the miss rate of multi-target tracking, improve the accuracy of multi-target tracking, and handle short-term target occlusion during tracking.

Description

Monitoring video multi-target tracking method based on deep learning
Technical Field
The invention belongs to the technical field of target tracking, and particularly relates to a monitoring video multi-target tracking method based on deep learning.
Background
The construction of safe cities and the spread of high-definition cameras have produced massive volumes of surveillance video. Identifying targets in such massive video data by manpower alone is extremely time-consuming, and human factors such as visual fatigue make it increasingly difficult. With the development of visual computing, from machine learning to deep learning, computers can understand the information in video more intelligently, and video intelligent analysis systems have emerged. A video intelligent analysis system analyzes the continuous image sequence, extracts the moving targets in each frame, determines the relationship between targets in adjacent frames through tracking, determines the moving direction and speed of a target, extracts further characteristic information such as trajectory and gait, and provides users with functions such as video retrieval, rapid target localization, viewing of target segments in the video, and collection of target behavior information. In such a system, a key technology is target tracking, i.e., how to determine the relationship between targets in adjacent frames and obtain the complete motion sequence of a target throughout the video; it has become an important research direction. Multi-target tracking of surveillance video is the foundation of a video intelligent analysis system and has an important influence on deeper mining of video information.
Appearance is an important characteristic for describing a target, and target tracking algorithms can be divided into two major categories according to the appearance model. The first category is generative methods. A generative method first builds a spatial description of the target with a well-defined algorithm and then compares a number of candidate regions to find the best-matching region. Generative methods emphasize the description of the target's appearance while ignoring the background, so drift occurs when the target is occluded. The second category is discriminative methods, which follow the tracking-by-detection idea: a detector first extracts the foreground targets, and tracking is converted into a classification problem. A discriminative model makes full use of both foreground and background information and can better distinguish the foreground from the background, so it is more robust. However, during online learning and updating with samples, the performance of the classifier is easily affected by sample labeling errors, which leads to misclassification.
Most existing deep-learning-based methods use a convolutional neural network to build an appearance model for target tracking. These methods attend only to the spatial-domain information of individual images; they ignore the fact that target tracking processes a series of consecutive frames with a close temporal relationship, and they do not mine the more informative temporal cues.
Occlusion is a major problem for target tracking. The occlusion problem can be divided into two cases. In the first case the target is occluded by background: when the occlusion begins the target can still be partially detected, it is then gradually lost, and it is detected again only after it reappears. In the second case two or more targets overlap: when the overlap begins they can still be detected as multiple targets but their spatial features gradually converge, when they merge completely they are detected as a single target, and when they separate again they must be tracked without confusing their identities. A common way to handle this is to divide the target into several spatial regions, each with its own tracker, so that when the target is partially occluded the remaining trackers can continue tracking; however, running multiple trackers simultaneously makes tracking too slow.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, the invention provides a monitoring-video multi-target tracking method based on deep learning, which uses a recurrent neural network to learn the motion law of a target, fuses the target's features, predicts the target's position, and computes target similarity by combining the temporal and spatial aspects to match targets and thereby realize target tracking. The method can reduce the miss rate of multi-target tracking, improve the accuracy of multi-target tracking, and handle short-term target occlusion during tracking.
In order to achieve the above object, according to one aspect of the present invention, there is provided a monitoring video multi-target tracking method based on deep learning, the method including:
(1) decoding the monitoring video according to a set interval time to obtain an image;
(2) inputting the decoded images into a trained Faster R-CNN target detection network model to obtain the position information and spatial features of a plurality of targets;
(3) inputting the position information and spatial features of the targets at a plurality of moments into an LSTM network model for offline training of the LSTM network model, and predicting the position of each target at the next moment with the trained LSTM network model;
(4) fusing the spatial features of the target at multiple moments to obtain a fused feature, calculating the similarity between the fused feature of the target and the spatial feature extracted for the target at the current moment, calculating the similarity between the position rectangle R_o predicted for the target by the LSTM method and the position rectangle R_s of the target detected at the current moment, and judging the correspondence between the newly detected targets and the tracked targets by jointly matching the spatial features and the position information.
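The four steps above form a per-frame loop. The sketch below illustrates that data flow only; `detector`, `predictor` and `tracker` are hypothetical stand-ins for the Faster R-CNN model, the LSTM position predictor and the matching logic, not the patented implementation itself.

```python
import cv2  # OpenCV, which the embodiment uses for decoding

def track_video(path, detector, predictor, tracker, interval=6):
    """Illustrative per-frame loop: decode -> detect -> predict -> match.
    All three helper objects are assumptions standing in for steps (2)-(4)."""
    cap = cv2.VideoCapture(path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % interval == 0:               # step (1): sample frames
            image = cv2.resize(frame, (224, 224))   # preprocessing
            detections = detector.detect(image)     # step (2): boxes + spatial features
            predictions = predictor.predict(tracker.active_targets())  # step (3)
            tracker.match(detections, predictions)  # step (4): similarity matching
        frame_idx += 1
    cap.release()
    return tracker.trajectories()
```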
Further, the step (2) specifically includes:
(21) inputting the image into the trained Faster R-CNN network model, and extracting the information of a plurality of targets through the classification layer and window regression layer at the top of the network, each piece of target information comprising [class number, confidence, x, y, width, height];
(22) filtering out targets whose confidence is lower than a confidence threshold, discarding the [class number, confidence] part of the target information for the remaining targets, and keeping the position information [x, y, width, height], wherein the confidence threshold ranges from 0.15 to 0.25, preferably 0.2;
(23) in the region-of-interest pooling layer, extracting the spatial features of the plurality of targets according to the mapping relation of the regions generated by the RPN algorithm.
Further, the specific method for the offline training of the LSTM network model in the step (3) is as follows:
(31) decoding the training video at the same time interval to extract an image sequence;
(32) detecting each image in the image sequence through a Faster R-CNN network to obtain the position information and the spatial characteristics of a plurality of targets;
(33) the LSTM network is set to SN layers; each time, the spatial features and position information of the same target in SN consecutive images are taken out and input into the LSTM network; the loss function in training uses the mean square error of the position information, computed as:
loss = (1/4) * Σ_{i=1..4} (L_det,i - L_pred,i)^2   (mean square error over the four position components [x, y, width, height])
where L_det is the position detected by the Faster R-CNN network at the next moment and L_pred is the predicted position output by the LSTM network; the position values are normalized into the interval [0, 1];
(34) training on the targets in all the test videos through step (33) and calculating the average loss loss_avg as:
loss_avg = (1/N) * Σ_{i=1..N} loss_i
where N is the number of times targets from all the videos are input into the network and loss_i is the loss value obtained after each pass through the network in step (33); when loss_avg is smaller than the prediction threshold and the loss function has converged, training is finished, the prediction threshold ranging from 0.15 to 0.25, preferably 0.2; otherwise, another SN consecutive images are selected and step (33) is repeated.
Further, the step (3) specifically includes: extracting the spatial features and positions of the same target in SN consecutive images of the image sequence, inputting them into the trained LSTM network model, and outputting the predicted position of the target at the next moment after processing by the LSTM network model.
Further, the step (4) specifically includes:
(41) fusing the spatial features of the same target at adjacent consecutive moments to obtain the fused feature of the target, computed as:
[fusion formula given as an image in the original and not reproduced here]
where t_n denotes the moment preceding t_{n-1}, F_o,t denotes the fused feature of the target at time t, and F_t1 and F_t2 denote the spatial features of the target at moments t_1 and t_2, respectively;
(42) comparing the fused feature F_o,t of the target with the spatial feature F_s,t extracted for the target at the current moment, and calculating the feature similarity F using the cosine similarity:
F = (F_o,t · F_s,t) / (||F_o,t|| * ||F_s,t||);
(43) comparing the position rectangle R_o predicted by the LSTM method with the position rectangle R_s of the target detected at the current moment, and calculating the position similarity R using the area intersection-over-union (IOU) of the rectangles:
R = IOU(R_s, R_o) = S_inter / (S_1 + S_2 - S_inter)
where S_inter = (min(r_1.r, r_2.r) - max(r_1.l, r_2.l)) * (min(r_1.b, r_2.b) - max(r_1.t, r_2.t)); r_1.t, r_1.b, r_1.l, r_1.r denote the top, bottom, left and right boundary values of rectangle R_s; r_2.t, r_2.b, r_2.l, r_2.r denote the top, bottom, left and right boundary values of rectangle R_o; S_inter is the overlapping area of the two rectangles; and S_1 and S_2 denote the areas of the rectangles R_s and R_o, respectively;
(44) combining the feature similarity and the position similarity, weighted by a feature similarity weight w_1 and a position similarity weight w_2, to obtain the comprehensive similarity:
diff_s,o = w_1 * F + w_2 * R
(45) comparing one target detected at the current moment with all targets in the tracking state, and taking the target with the largest comprehensive similarity diff_s,o as the candidate match; denoting this maximum value diff, the match is considered successful if diff is greater than the matching threshold, whose value ranges from 0.6 to 0.7, preferably 0.65.
Generally, compared with the prior art, the technical scheme of the invention has the following technical characteristics and beneficial effects:
(1) the method adopts the Faster R-CNN target detection algorithm to detect the position information and spatial features of multiple targets in a frame; the extracted spatial features have strong expressive power for the targets, so appearance-model matching yields higher similarity;
(2) the method inputs the target position information and target spatial features into an LSTM network and uses the LSTM recurrent neural network to learn the motion law of the target, giving strong predictive power for the target position, so position matching also yields higher similarity;
(3) the technical scheme of the invention obtains the fused feature of the target at the next moment by fusing the target's spatial features, obtains the final similarity by assigning different weights to the position similarity and the spatial-feature similarity, and then judges the correspondence between the targets detected at the current moment and the targets previously in the tracking state; using the predicted position and the fused feature together further improves the accuracy of target tracking.
Drawings
FIG. 1 is a schematic flow chart of the main process of the present invention;
FIG. 2 is a schematic view of a process for computing spatiotemporal features of an object in the method of the present invention;
FIG. 3 is a flow chart of the matching of targets in the method of the present invention;
FIG. 4 shows the tracking result of the method of the present invention on the video Venice-1;
FIG. 5 shows the tracking result of the method of the present invention on the target-dense video MOT16-3.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The process of the method of the invention is shown in figure 1:
(1) video frame-by-frame decoding:
(11) setting a decoding interval frame number on the basis of extracting 4 frames per second, wherein if the video fps is 24, the interval frame number is 6;
(12) decoding the video in real time with OpenCV according to the decoding interval frame number to obtain images;
(13) preprocessing each image and scaling it to 224 x 224 pixels to fit the input size of the Faster R-CNN network;
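As a concrete illustration of steps (11) to (13), the sketch below decodes a video with OpenCV at roughly 4 frames per second and rescales each sampled frame to 224 x 224; the generator form and the variable names are assumptions for illustration, not part of the patent text.

```python
import cv2

def decoded_frames(video_path, frames_per_second=4, size=(224, 224)):
    """Yield preprocessed frames sampled at the configured rate.
    For a 24 fps video this gives an interval of 6 frames, as in step (11)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 24.0        # fall back if fps is unknown
    interval = max(1, round(fps / frames_per_second))
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:
            yield cv2.resize(frame, size)          # step (13): scale to 224 x 224
        idx += 1
    cap.release()
```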
(2) Faster R-CNN target detection:
the Faster R-CNN is a convolutional neural network, and the target detection by using the Faster R-CNN firstly needs to train model parameters offline and then process images online to obtain the position and the spatial characteristics of a target;
the process of training the model offline is as follows: marking out the target position and the target classification in the image, so that the network can perform back propagation through the set mark to determine the model parameters; the model training is a supervised training process, and the image sample of the training also uses the image extracted from the video to ensure the similarity with the actual used scene; the model training is a repeated iteration process, and the output error of the final model is within a certain range through feedback adjustment;
the on-line actual treatment process comprises the following steps:
(21) inputting the image into the trained Faster R-CNN network model, and extracting the information of a plurality of targets through the classification layer and window regression layer at the top of the network, each piece of target information comprising [class number, confidence, x, y, width, height];
(22) filtering out targets whose confidence is lower than the confidence threshold of 0.2, discarding the [class number, confidence], which is independent of the time sequence, and keeping the position information [x, y, width, height], which is relevant to the time sequence;
(23) in the region-of-interest pooling layer, extracting the spatial features of the plurality of targets according to the mapping relation of the regions generated by the RPN algorithm; different targets differ in appearance, so these spatial features can be used to distinguish them;
(3) LSTM target prediction:
The LSTM is a recurrent neural network; using the LSTM for target prediction likewise requires training the model offline first and then processing images online for prediction;
the process of training the model offline is as follows:
(31) decoding the training video according to the time interval of extracting 4 frames per second in the step (11), and extracting an image sequence;
(32) detecting each image in the image sequence through a Faster R-CNN network to obtain the position information and the spatial characteristics of a plurality of targets;
(33) the LSTM network is set to 6 layers; each time, the spatial features and position information of the same target in 6 consecutive adjacent images are taken out and input into the LSTM network; the loss function in training uses the mean square error of the position information, computed as:
loss = (1/4) * Σ_{i=1..4} (L_det,i - L_pred,i)^2   (mean square error over the four position components [x, y, width, height])
where L_det is the position detected by the Faster R-CNN network at the next moment and L_pred is the predicted position output by the LSTM network; the position values are normalized into the interval [0, 1];
(34) through continuous iteration, the position output by the LSTM network comes closer to the actual position of the target at the next moment, and the network learns the motion law of the target so as to predict its trajectory.
The online processing is as follows: the spatial features and position information of the same target in 6 consecutive images are taken out and input into the network, and after processing by the network the output is the predicted position of the target at the next moment;
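A minimal PyTorch sketch of such a predictor is given below: it takes the spatial feature and normalized position of the same target at 6 consecutive moments and regresses the position at the next moment, trained with the mean-square-error loss of step (33). The feature dimension, hidden size, layer count and the use of PyTorch are assumptions; the patent does not specify them.

```python
import torch
import torch.nn as nn

SN = 6  # number of consecutive moments fed to the network, as in step (33)

class PositionLSTM(nn.Module):
    """Maps a sequence of [spatial feature, x, y, w, h] vectors to the
    predicted [x, y, w, h] of the target at the next moment, all in [0, 1]."""
    def __init__(self, feat_dim=256, hidden_dim=128, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim + 4, hidden_dim, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_dim, 4)

    def forward(self, seq):                      # seq: (batch, SN, feat_dim + 4)
        out, _ = self.lstm(seq)
        return torch.sigmoid(self.head(out[:, -1]))   # keep positions in [0, 1]

# Training criterion of step (33): mean square error between the predicted
# position and the position detected by Faster R-CNN at the next moment.
model = PositionLSTM()
criterion = nn.MSELoss()
seq = torch.rand(8, SN, 256 + 4)                 # a batch of 8 dummy sequences
l_det = torch.rand(8, 4)                         # detected next-moment positions
loss = criterion(model(seq), l_det)              # averaged into loss_avg in step (34)
loss.backward()
```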
(4) target matching:
(41) fusing the spatial features of the same target at adjacent consecutive moments to obtain the fused feature of the target, computed as:
[fusion formula given as an image in the original and not reproduced here]
where t_n denotes the moment preceding t_{n-1}, F_o,t denotes the fused feature of the target at time t, and F_t1 and F_t2 denote the spatial features of the target at moments t_1 and t_2, respectively; the fused-feature calculation is illustrated in FIG. 2;
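The exact fusion formula is given only as an image in the source; the sketch below shows one plausible reading, assumed here purely for illustration, in which the fused feature is a running average that weights recent frames more heavily.

```python
import numpy as np

def fuse_features(features):
    """features: spatial feature vectors of the same target at consecutive
    moments, oldest first. Assumed recursive averaging: each new frame's
    feature is averaged with the fused feature so far, so older frames decay."""
    fused = np.asarray(features[0], dtype=float)
    for f in features[1:]:
        fused = 0.5 * fused + 0.5 * np.asarray(f, dtype=float)
    return fused
```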
(42) comparing the fused feature F_o,t of the target with the spatial feature F_s,t extracted for the target at the current moment, and calculating the feature similarity F using the cosine similarity:
F = (F_o,t · F_s,t) / (||F_o,t|| * ||F_s,t||);
(43) comparing the position rectangle R_o predicted by the LSTM method with the position rectangle R_s of the target detected at the current moment, and calculating the position similarity R using the area intersection-over-union (IOU) of the rectangles:
R = IoU_1,2, where rectangles r_1 and r_2 correspond to R_s and R_o;
the IOU of rectangles r_1 and r_2, IoU_1,2, is calculated as:
S_inter = (min(r_1.r, r_2.r) - max(r_1.l, r_2.l)) * (min(r_1.b, r_2.b) - max(r_1.t, r_2.t))
IoU_1,2 = S_inter / (S_1 + S_2 - S_inter)
where r_1.t, r_1.b, r_1.l, r_1.r denote the top, bottom, left and right boundary values of rectangle r_1; r_2.t, r_2.b, r_2.l, r_2.r denote the top, bottom, left and right boundary values of rectangle r_2; S_inter is the overlapping area of the two rectangles; and S_1 and S_2 denote the areas of the two rectangles;
(44) combining the feature similarity and the position similarity with a feature similarity weight w_1 = 0.6 and a position similarity weight w_2 = 0.4, which gives the best balance; the final matching score is:
diff_s,o = w_1 * F + w_2 * R
(45) comparing one target detected at the current moment with all targets in the tracking state and taking the target with the highest similarity, i.e. the largest diff_s,o, as the candidate match; denoting this maximum value diff, if diff is greater than the matching threshold of 0.65 the match is considered successful; otherwise the match fails and the detection is treated as a new target. The flow of the whole matching process is shown in FIG. 3;
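Steps (42) to (45) amount to the small scoring routine sketched below, using the cosine feature similarity F, the IOU position similarity R, the weights w_1 = 0.6 and w_2 = 0.4, and the matching threshold 0.65 from the text; the box format (left, top, right, bottom) and the function names are assumptions for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def iou(r1, r2):
    """r1, r2: boxes as (left, top, right, bottom)."""
    inter_w = min(r1[2], r2[2]) - max(r1[0], r2[0])
    inter_h = min(r1[3], r2[3]) - max(r1[1], r2[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    s_inter = inter_w * inter_h
    s1 = (r1[2] - r1[0]) * (r1[3] - r1[1])
    s2 = (r2[2] - r2[0]) * (r2[3] - r2[1])
    return s_inter / (s1 + s2 - s_inter)

def match_detection(det_feature, det_box, tracked, w1=0.6, w2=0.4, threshold=0.65):
    """tracked: list of (fused_feature, predicted_box) for targets being tracked.
    Returns the index of the best-matching tracked target, or None (new target)."""
    best_idx, best_diff = None, -1.0
    for i, (fused_feature, predicted_box) in enumerate(tracked):
        F = cosine_similarity(fused_feature, det_feature)    # step (42)
        R = iou(det_box, predicted_box)                      # step (43)
        diff = w1 * F + w2 * R                               # step (44)
        if diff > best_diff:
            best_idx, best_diff = i, diff
    return best_idx if best_diff > threshold else None       # step (45)
```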
(46) after matching, the tracking state of each target needs to be updated. A target has four tracking states: the initial TRACKING state OS_BEGIN, the in-tracking state OS_TRACKING, the unmatched state OS_UNMATCH, and the tracking-ended state OS_END. A target in the unmatched state may be occluded, missed by the detector, or may have just left the picture, and still needs to be tracked; the tracking-ended state indicates that the target has either failed to track or left the video picture. The initial tracking state is set during the matching process.
If the target is in the initial tracking state, it is updated to the in-tracking state after the current moment is processed, so that it can be matched at the next moment.
If the target is in the in-tracking state, whether it is matched to a detection at the current moment is checked; if the match succeeds, the new information is added to the target, its fused feature is updated, and its position at the next moment is predicted; if the match fails, the target is set to the unmatched state and the predicted position is taken as its position at the current moment.
If the target is in the unmatched state, it was not matched at the previous moment, and whether it is matched at the current moment is checked; if the match succeeds, the target is set back to the in-tracking state, indicating that the occlusion has ended or the target has been detected again; if the match fails, the number of consecutive frames the target has remained unmatched is checked; if it exceeds a certain number of frames, the target is considered to have left the picture or the tracking to have failed, and the state is set to tracking ended; otherwise the current state is kept and the prediction continues to be used as the target's result at the current moment.
If the target is in the tracking-ended state, it is removed from the tracking queue and no longer takes part in target matching.
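The state transitions of step (46) can be summarized in the small update routine below; the state names follow the text, while the target object's methods, its `unmatched_frames` attribute, and the `max_unmatched` limit are assumptions (the text only says "a certain number of frames").

```python
from enum import Enum, auto

class TrackState(Enum):
    OS_BEGIN = auto()      # initial tracking state
    OS_TRACKING = auto()   # in the tracking process
    OS_UNMATCH = auto()    # not matched at the previous moment
    OS_END = auto()        # tracking ended

def update_state(target, matched, max_unmatched=10):
    """Apply the transitions of step (46) to one tracked target."""
    if target.state == TrackState.OS_BEGIN:
        target.state = TrackState.OS_TRACKING
    elif target.state == TrackState.OS_TRACKING:
        if matched:
            target.update_with_detection()           # add info, refresh fused feature
        else:
            target.state = TrackState.OS_UNMATCH
            target.use_predicted_position()          # prediction stands in for detection
    elif target.state == TrackState.OS_UNMATCH:
        if matched:
            target.state = TrackState.OS_TRACKING    # occlusion ended / re-detected
            target.update_with_detection()
        else:
            target.unmatched_frames += 1
            if target.unmatched_frames > max_unmatched:
                target.state = TrackState.OS_END     # left the picture or lost
            else:
                target.use_predicted_position()
    # OS_END targets are removed from the tracking queue elsewhere
```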
Experimental tests:
the host software environment of the experiment is Ubuntu 16.04LTS 64 bit, OpenCV 3.1.0 and CUDA 8.0, and the CPU in the hardware configuration is Intel Core i5-6500 and the GPU is GeForce GTX 1080. The evaluation method in MOT Challenge is used for selecting the following indexes:
FN: the number of missed detections; the lower the value, the better;
FP: the number of false alarms; the lower the value, the better;
IDSW: the number of identity switches over all targets; the lower the value, the better;
MOTA: the multi-target tracking accuracy computed from the three counts of missed detections, false alarms and identity switches (see the sketch after this list); it is the main comprehensive index among the multi-target tracking evaluation criteria, reflecting how many targets are tracked and how accurately targets are matched; the higher the value, the better;
MOTP: the multi-target tracking precision computed from the average bounding-box overlap rate of all tracked targets, reflecting the positional accuracy of the results; the higher the value, the better;
HZ: the average number of image frames processed per second over a period of time, an index of the execution efficiency and speed of the tracking algorithm; the higher the value, the better.
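For reference, the MOTA index described above combines the three error counts against the number of ground-truth objects; a short sketch of the standard CLEAR-MOT form, which the text appears to follow, is given below (GT, the total number of ground-truth objects over all frames, is an input the text does not list explicitly).

```python
def mota(fn, fp, idsw, gt):
    """Multi-object tracking accuracy: 1 minus the ratio of the three error
    counts (misses, false alarms, identity switches) to ground-truth objects."""
    return 1.0 - (fn + fp + idsw) / float(gt)
```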
The experiments compare, on the MOT Challenge data set, the approach used by most current methods, which matches targets using spatial features extracted by a convolutional neural network alone, against the proposed combination: M1 matches using the spatial deep features alone, M2 matches using the positions in the time sequence alone, and M3 combines the two. The videos Venice-1 and MOT16-3 were tracked with each of the three methods; the video information is shown in Table 1 and the experimental results in Table 2 below.
As can be seen from Tables 1 and 2, the video Venice-1 has moderate target density and high feature discriminability, so good results are obtained with each strategy alone, and the multi-strategy method still improves the accuracy. For the video MOT16-3 the target density is very high, and the multi-strategy method improves the accuracy considerably.
Table 1 Test video information
Video: Venice-1 | MOT16-3
Resolution: 1920*1080 | 1920*1080
Duration (frames): 450 | 1500
Number of targets: 4563 | 104556
Target density: medium | very high
Table 2 tracking results of different videos under different strategies
From the IDSW scores, the value of method M2 on video MOT16-3 is much larger than that of M3, which indicates that when the target density is high, matching based on position alone is error-prone, mainly because high density leads to a high position-overlap rate between targets that is hard to disambiguate. From FN, the values of method M2 are smaller than those of M1 for both videos, indicating that the prediction capability of the LSTM can reduce the miss rate.
The tracking result of the video Venice-1 is shown in FIG. 4, and the tracking result of a partial area of the video MOT16-3 is shown in FIG. 5, where a gray thin-line rectangular box represents the result of tracking after the target is detected, and a white thick-line rectangular box represents the predicted result for a target that is not detected or is occluded. The number above each rectangular box is the tracked target's identity, so the matching of targets can be compared, and the number in the lower right corner is the frame number of the image in the video. The displayed results show that each pedestrian in the images is correctly tracked as the same target across frames.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (4)

1. A monitoring video multi-target tracking method based on deep learning is characterized by comprising the following steps:
(1) decoding the monitoring video according to a set interval time to obtain an image;
(2) inputting the decoded images into a trained Faster R-CNN target detection network model to obtain the position information and spatial features of a plurality of targets;
(3) inputting the position information and the spatial characteristics of the target at a plurality of moments into an LSTM network model for LSTM network model offline training, and predicting the position of the target at the next moment by using the trained LSTM network model;
(4) fusing the spatial features of the target at a plurality of moments to obtain a fused feature, calculating the similarity between the fused feature of the target and the spatial feature extracted for the target at the current moment, calculating the similarity between the position rectangle R_o predicted for the target by the LSTM method and the position rectangle R_s of the target detected at the current moment, and judging the correspondence between the newly detected targets and the tracked targets by jointly matching the spatial features and the position information; the step (4) is specifically as follows:
(41) fusing the spatial features of the same target at adjacent consecutive moments to obtain the fused feature of the target, computed as:
[fusion formula given as an image in the original and not reproduced here]
where t_n denotes the moment preceding t_{n-1}, F_o,t denotes the fused feature of the target at time t, and F_t1 and F_t2 denote the spatial features of the target at moments t_1 and t_2, respectively;
(42) comparing the fused feature F_o,t of the target with the spatial feature F_s,t extracted for the target at the current moment, and calculating the feature similarity F using the cosine similarity:
F = (F_o,t · F_s,t) / (||F_o,t|| * ||F_s,t||);
(43) comparing the position rectangle R_o predicted by the LSTM method with the position rectangle R_s of the target detected at the current moment, and calculating the position similarity R using the area intersection-over-union (IOU) of the rectangles:
R = IOU(R_s, R_o) = S_inter / (S_1 + S_2 - S_inter)
where S_inter = (min(r_1.r, r_2.r) - max(r_1.l, r_2.l)) * (min(r_1.b, r_2.b) - max(r_1.t, r_2.t)); r_1.t, r_1.b, r_1.l, r_1.r denote the top, bottom, left and right boundary values of rectangle R_s; r_2.t, r_2.b, r_2.l, r_2.r denote the top, bottom, left and right boundary values of rectangle R_o; S_inter is the overlapping area of the two rectangles; and S_1 and S_2 denote the areas of the rectangles R_s and R_o, respectively;
(44) combining the feature similarity and the position similarity, weighted by a feature similarity weight w_1 and a position similarity weight w_2, to obtain the comprehensive similarity:
diff_s,o = w_1 * F + w_2 * R
(45) comparing one target detected at the current moment with all targets in the tracking state, and taking the target with the largest comprehensive similarity diff_s,o as the candidate match; denoting this maximum value diff, if diff is greater than the matching threshold the match is considered successful; otherwise the match fails and the detection is set as a new target.
2. The monitoring video multi-target tracking method based on deep learning as claimed in claim 1, wherein the step (2) specifically comprises:
(21) inputting the image into the trained Faster R-CNN network model, and extracting the information of a plurality of targets through the classification layer and window regression layer at the top of the network, each piece of target information comprising [class number, confidence, x, y, width, height];
(22) filtering out targets whose confidence is lower than a confidence threshold, discarding the [class number, confidence] part of the target information for the remaining targets, and keeping the position information [x, y, width, height];
(23) in the region-of-interest pooling layer, extracting the spatial features of the plurality of targets according to the mapping relation of the regions generated by the RPN algorithm.
3. The monitored video multi-target tracking method based on deep learning as claimed in claim 1, wherein the specific method of LSTM network model offline training in the step (3) is as follows:
(31) decoding the training video at the same time interval to extract an image sequence;
(32) detecting each image in the image sequence through a Faster R-CNN network to obtain the position information and the spatial characteristics of a plurality of targets;
(33) the LSTM network is set to SN layers; each time, the spatial features and position information of the same target in SN consecutive images are taken out and input into the LSTM network; the loss function in training uses the mean square error of the position information, computed as:
loss = (1/4) * Σ_{i=1..4} (L_det,i - L_pred,i)^2   (mean square error over the four position components [x, y, width, height])
where L_det is the position detected by the Faster R-CNN network at the next moment and L_pred is the predicted position output by the LSTM network; the position values are normalized into the interval [0, 1];
(34) training on the targets in all the test videos through step (33) and calculating the average loss loss_avg as:
loss_avg = (1/N) * Σ_{i=1..N} loss_i
where N is the number of times targets from all the videos are input into the network and loss_i is the loss value obtained after each pass through the network in step (33); when loss_avg is smaller than the prediction threshold and the loss function has converged, training is finished; otherwise, another SN consecutive images are selected and step (33) is repeated.
4. The monitoring video multi-target tracking method based on deep learning as claimed in claim 1, wherein the step (3) specifically comprises: extracting the spatial features and positions of the same target in SN consecutive images of the image sequence, inputting them into the trained LSTM network model, and outputting the predicted position of the target at the next moment after processing by the LSTM network model.
CN201710504914.5A 2017-06-28 2017-06-28 Monitoring video multi-target tracking method based on deep learning Active CN107330920B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710504914.5A CN107330920B (en) 2017-06-28 2017-06-28 Monitoring video multi-target tracking method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710504914.5A CN107330920B (en) 2017-06-28 2017-06-28 Monitoring video multi-target tracking method based on deep learning

Publications (2)

Publication Number Publication Date
CN107330920A CN107330920A (en) 2017-11-07
CN107330920B true CN107330920B (en) 2020-01-03

Family

ID=60198399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710504914.5A Active CN107330920B (en) 2017-06-28 2017-06-28 Monitoring video multi-target tracking method based on deep learning

Country Status (1)

Country Link
CN (1) CN107330920B (en)

Families Citing this family (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108055501A (en) * 2017-11-22 2018-05-18 天津市亚安科技有限公司 A kind of target detection and the video monitoring system and method for tracking
CN108229407A (en) * 2018-01-11 2018-06-29 武汉米人科技有限公司 A kind of behavioral value method and system in video analysis
CN108280843A (en) * 2018-01-24 2018-07-13 新华智云科技有限公司 A kind of video object detecting and tracking method and apparatus
CN108537818B (en) * 2018-03-07 2020-08-14 上海交通大学 Crowd trajectory prediction method based on cluster pressure LSTM
CN108320297B (en) * 2018-03-09 2020-06-19 湖北工业大学 Video target real-time tracking method and system
CN108509876B (en) * 2018-03-16 2020-11-27 深圳市商汤科技有限公司 Object detection method, device, apparatus, storage medium, and program for video
CN108491816A (en) * 2018-03-30 2018-09-04 百度在线网络技术(北京)有限公司 The method and apparatus for carrying out target following in video
CN110349182A (en) * 2018-04-07 2019-10-18 苏州竺星信息科技有限公司 A kind of personage's method for tracing based on video and positioning device
CN108520530B (en) * 2018-04-12 2020-01-14 厦门大学 Target tracking method based on long-time and short-time memory network
CN108764032B (en) * 2018-04-18 2019-12-24 北京百度网讯科技有限公司 Intelligent monitoring method and device for coal mine water exploration and drainage, computer equipment and storage medium
DE102018206208A1 (en) * 2018-04-23 2019-10-24 Robert Bosch Gmbh Method, device, product and computer program for operating a technical system
CN108664930A (en) * 2018-05-11 2018-10-16 西安天和防务技术股份有限公司 A kind of intelligent multi-target detection tracking
CN108664935A (en) * 2018-05-14 2018-10-16 中山大学新华学院 The method for tracking target and system of depth Spatial-temporal Information Fusion based on CUDA
CN108805907B (en) * 2018-06-05 2022-03-29 中南大学 Pedestrian posture multi-feature intelligent identification method
CN108875819B (en) * 2018-06-08 2020-10-27 浙江大学 Object and component joint detection method based on long-term and short-term memory network
CN110688873A (en) * 2018-07-04 2020-01-14 上海智臻智能网络科技股份有限公司 Multi-target tracking method and face recognition method
CN109063574B (en) * 2018-07-05 2021-04-23 顺丰科技有限公司 Method, system and equipment for predicting envelope frame based on deep neural network detection
CN109344725B (en) * 2018-09-04 2020-09-04 上海交通大学 Multi-pedestrian online tracking method based on space-time attention mechanism
CN109308469B (en) * 2018-09-21 2019-12-10 北京字节跳动网络技术有限公司 Method and apparatus for generating information
CN109241952B (en) * 2018-10-26 2021-09-07 北京陌上花科技有限公司 Figure counting method and device in crowded scene
CN111104831B (en) * 2018-10-29 2023-09-29 香港城市大学深圳研究院 Visual tracking method, device, computer equipment and medium
CN109409307B (en) * 2018-11-02 2022-04-01 深圳龙岗智能视听研究院 Online video behavior detection method based on space-time context analysis
CN109584213B (en) * 2018-11-07 2023-05-30 复旦大学 Multi-target number selection tracking method
CN109740416B (en) * 2018-11-19 2021-02-12 深圳市华尊科技股份有限公司 Target tracking method and related product
CN109800689B (en) * 2019-01-04 2022-03-29 西南交通大学 Target tracking method based on space-time feature fusion learning
CN109934096B (en) * 2019-01-22 2020-12-11 浙江零跑科技有限公司 Automatic driving visual perception optimization method based on characteristic time sequence correlation
CN109934115B (en) * 2019-02-18 2021-11-02 苏州市科远软件技术开发有限公司 Face recognition model construction method, face recognition method and electronic equipment
CN110033469B (en) * 2019-04-01 2021-08-27 北京科技大学 Sub-pixel edge detection method and system
CN110276783B (en) * 2019-04-23 2021-01-08 上海高重信息科技有限公司 Multi-target tracking method and device and computer system
CN110175538A (en) * 2019-05-10 2019-08-27 国网福建省电力有限公司龙岩供电公司 A kind of substation's Bird's Nest recognition methods and system based on machine learning
CN110111358B (en) * 2019-05-14 2022-05-24 西南交通大学 Target tracking method based on multilayer time sequence filtering
CN110333517B (en) * 2019-07-11 2022-11-25 腾讯科技(深圳)有限公司 Obstacle sensing method, obstacle sensing device and storage medium
CN110598540B (en) * 2019-08-05 2021-12-03 华中科技大学 Method and system for extracting gait contour map in monitoring video
CN110443829A (en) * 2019-08-05 2019-11-12 北京深醒科技有限公司 It is a kind of that track algorithm is blocked based on motion feature and the anti-of similarity feature
CN110944295B (en) * 2019-11-27 2021-09-21 恒安嘉新(北京)科技股份公司 Position prediction method, position prediction device, storage medium and terminal
CN111027505B (en) * 2019-12-19 2022-12-23 吉林大学 Hierarchical multi-target tracking method based on significance detection
SG10201913754XA (en) * 2019-12-30 2020-12-30 Sensetime Int Pte Ltd Image processing method and apparatus, electronic device, and storage medium
US11631251B2 (en) 2020-02-23 2023-04-18 Tfi Digital Media Limited Method and system for jockey and horse recognition and tracking
CN112001252B (en) * 2020-07-22 2024-04-12 北京交通大学 Multi-target tracking method based on different composition network
CN112070807B (en) * 2020-11-11 2021-02-05 湖北亿咖通科技有限公司 Multi-target tracking method and electronic device
CN112529941B (en) * 2020-12-17 2021-08-31 深圳市普汇智联科技有限公司 Multi-target tracking method and system based on depth trajectory prediction
CN112906545B (en) * 2021-02-07 2023-05-05 广东省科学院智能制造研究所 Real-time action recognition method and system for multi-person scene
CN113569824B (en) * 2021-09-26 2021-12-17 腾讯科技(深圳)有限公司 Model processing method, related device, storage medium and computer program product
CN114240997B (en) * 2021-11-16 2023-07-28 南京云牛智能科技有限公司 Intelligent building online trans-camera multi-target tracking method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106127802A (en) * 2016-06-16 2016-11-16 南京邮电大学盐城大数据研究院有限公司 A kind of movement objective orbit method for tracing
CN106845430A (en) * 2017-02-06 2017-06-13 东华大学 Pedestrian detection and tracking based on acceleration region convolutional neural networks
CN106875425A (en) * 2017-01-22 2017-06-20 北京飞搜科技有限公司 A kind of multi-target tracking system and implementation method based on deep learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106127802A (en) * 2016-06-16 2016-11-16 南京邮电大学盐城大数据研究院有限公司 A kind of movement objective orbit method for tracing
CN106875425A (en) * 2017-01-22 2017-06-20 北京飞搜科技有限公司 A kind of multi-target tracking system and implementation method based on deep learning
CN106845430A (en) * 2017-02-06 2017-06-13 东华大学 Pedestrian detection and tracking based on acceleration region convolutional neural networks

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Faster R-CNN:Towards Real-time Object Detection with Region Proposal Networks;Shaoqing Ren等;《Proceedings of the 28th International Conference on Neural Information Processing Systems》;20151212;第1卷;第91-99页 *
MULTI-TARGET DETECTION IN CCTV FOOTAGE FOR TRACKING APPLICATIONS USING DEEP LEARNING TECHNIQUES;A.Dimou等;《2016 IEEE International Conference on Image Processing》;20160819;第1-5页 *
Research status and prospect of target tracking methods based on deep learning; Luo Haibo et al.; Infrared and Laser Engineering; 20170531; Vol. 46, No. 5; pp. 1-7 *

Also Published As

Publication number Publication date
CN107330920A (en) 2017-11-07

Similar Documents

Publication Publication Date Title
CN107330920B (en) Monitoring video multi-target tracking method based on deep learning
CN109492581B (en) Human body action recognition method based on TP-STG frame
Yang et al. Spatio-temporal action detection with cascade proposal and location anticipation
KR101653278B1 (en) Face tracking system using colar-based face detection method
Li et al. Rapid and robust human detection and tracking based on omega-shape features
CN107230267B (en) Intelligence In Baogang Kindergarten based on face recognition algorithms is registered method
CN105260749B (en) Real-time target detection method based on direction gradient binary pattern and soft cascade SVM
Lin et al. Social mil: Interaction-aware for crowd anomaly detection
CN113011367A (en) Abnormal behavior analysis method based on target track
KR102132722B1 (en) Tracking method and system multi-object in video
JP7136500B2 (en) Pedestrian Re-identification Method for Random Occlusion Recovery Based on Noise Channel
CN112926522B (en) Behavior recognition method based on skeleton gesture and space-time diagram convolution network
CN111191535B (en) Pedestrian detection model construction method based on deep learning and pedestrian detection method
CN104616006A (en) Surveillance video oriented bearded face detection method
Mao et al. Training a scene-specific pedestrian detector using tracklets
CN103971100A (en) Video-based camouflage and peeping behavior detection method for automated teller machine
CN111881775B (en) Real-time face recognition method and device
CN109711232A (en) Deep learning pedestrian recognition methods again based on multiple objective function
Heili et al. Parameter estimation and contextual adaptation for a multi-object tracking CRF model
CN110852203B (en) Multi-factor suspicious person identification method based on video feature learning
Kim et al. Development of a real-time automatic passenger counting system using head detection based on deep learning
Moayed et al. Traffic intersection monitoring using fusion of GMM-based deep learning classification and geometric warping
Li et al. Pedestrian Motion Path Detection Method Based on Deep Learning and Foreground Detection
CN117058627B (en) Public place crowd safety distance monitoring method, medium and system
Shi et al. High-altitude parabolic detection method based on GMM model and SORT algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant