CN107330920B - Monitoring video multi-target tracking method based on deep learning - Google Patents
- Publication number
- Publication number: CN107330920B (application CN201710504914.5A)
- Authority
- CN
- China
- Prior art keywords
- target
- similarity
- targets
- tracking
- spatial
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a monitoring video multi-target tracking method based on deep learning. The method first decodes a video to extract an image sequence, then preprocesses each image and inputs it into a trained Faster R-CNN network model, which extracts target position information and target spatial features from the corresponding network layers. The target position information and spatial features are input into an LSTM network to predict the position of each target at the next moment. The fused feature of a target at the next moment is obtained by fusing the target's spatial features; the position similarity and the spatial-feature similarity are given different weights and combined into a final similarity, which is used to judge the correspondence between the targets detected at the current moment and the targets in the tracking state at the previous moment. The method can reduce the missed-detection rate of multi-target tracking, improve its accuracy, and handle short-term target occlusion during tracking.
Description
Technical Field
The invention belongs to the technical field of target tracking, and particularly relates to a monitoring video multi-target tracking method based on deep learning.
Background
The construction of safe cities and the spread of high-definition cameras generate massive volumes of surveillance video. Identifying targets in this mass of video data by manpower alone is slow and difficult, and is further hampered by human factors such as visual fatigue. With the development of visual computing, from machine learning to deep learning, computers can understand the information in video more intelligently, and video intelligent analysis systems have emerged. Such a system analyzes the continuous image sequence, extracts the moving targets in each frame, determines the relation between targets in adjacent frames through tracking, determines the moving direction and speed of a target, extracts other characteristic information such as trajectory and gait, and provides the user with functions such as video retrieval, rapid target localization, viewing of target fragments in the video, and collection of target behavior information. Within such a system, a key technology is target tracking: determining the relation between targets in adjacent frames to obtain the complete motion sequence of a target over the whole video. This has become an important research direction. Multi-target tracking of surveillance video is the foundation of a video intelligent analysis system and has an important influence on deeper mining of video information.
Appearance is an important characteristic for describing a target, and target tracking algorithms can be divided into two major categories according to the appearance model. The first is the generative method: it establishes a spatial description of the target with a well-defined algorithm and then compares a number of candidate regions to find the best-matching region. Generative methods emphasize describing the target's appearance while ignoring the background, so drift occurs when the target is occluded. The second is the discriminative method, which follows the tracking-by-detection idea: target detection first extracts the foreground targets, converting tracking into a classification problem. The discriminative model makes full use of both foreground and background information and can better distinguish foreground from background, so it is more robust. However, during online learning and updating with samples, the performance of the classifier is easily affected by sample labeling errors, which causes misclassification.
Most existing deep-learning-based methods use a convolutional neural network to build an appearance model for target tracking. Focusing only on the spatial-domain information of individual images, they ignore the fact that target tracking processes a series of consecutive frames with a close time-sequence relation, and thus fail to mine the more effective temporal information.
Occlusion is a major problem for target tracking and can be divided into two cases. In the first, the target is occluded by background: when occlusion begins the target can still be partially detected, then gradually becomes undetectable until it reappears and is detected again. In the second, two or more targets overlap: when the overlap begins they can still be detected as separate targets, but their spatial features gradually converge; when they merge completely they are detected as a single target, and when they separate again they must be tracked without confusion. A common remedy is to divide the target into several spatial regions, each with its own tracker, so that when the target is partially occluded the remaining trackers continue to track; but running multiple trackers simultaneously makes tracking too slow.
Disclosure of Invention
Aiming at the defects or improvement requirements of the prior art, the invention provides a monitoring video multi-target tracking method based on deep learning. It uses a recurrent neural network to learn the motion rule of a target, fuses the target's features, predicts the target's position, and computes target similarity by combining the temporal and spatial aspects to match targets and realize tracking. The method can reduce the missed-detection rate of multi-target tracking, improve its accuracy, and handle short-term target occlusion during tracking.
In order to achieve the above object, according to one aspect of the present invention, there is provided a monitoring video multi-target tracking method based on deep learning, the method including:
(1) decoding the monitoring video according to a set interval time to obtain an image;
(2) inputting the images obtained by decoding into a trained fast R-CNN target detection network model to obtain the position information and the spatial characteristics of a plurality of targets;
(3) inputting the position information and the spatial characteristics of the target at a plurality of moments into an LSTM network model for LSTM network model offline training, and predicting the position of the target at the next moment by using the trained LSTM network model;
(4) fusing the spatial features of the target at multiple moments to obtain a fused feature; calculating the similarity between the fused feature of the target and the spatial feature extracted for the target at the current moment; calculating the similarity between the position rectangle $R_o$ predicted for the target by the LSTM method and the position rectangle $R_s$ of the target detected at the current moment; and judging the correspondence between newly detected targets and tracked targets by matching the combined spatial features and position information.
Further, the step (2) specifically includes:
(21) inputting the image into a trained Faster R-CNN network model, and extracting a plurality of target information through a network top classification layer and a window regression layer, wherein each target information comprises [ classification number, confidence coefficient, x, y, width and height ];
(22) filtering out targets with confidence degrees lower than a confidence degree threshold value, discarding [ class numbers and confidence degrees ] in target information from the remaining targets, and keeping position information [ x, y, width, height ], wherein the confidence degree threshold value range is 0.15-0.25, preferably 0.2;
(23) and in the region-of-interest pooling layer, extracting the spatial features of a plurality of targets according to the mapping relation of the region generated by the RPN algorithm.
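By way of illustration, the filtering in sub-step (22) can be sketched in a few lines of Python. This is a minimal sketch, not the patented implementation; the detection list layout follows the [classification number, confidence, x, y, width, height] format stated above, and the function name is hypothetical.

```python
def filter_detections(detections, conf_threshold=0.2):
    """Keep only confident detections and strip the fields that are
    independent of the time sequence (class id, confidence).

    Each detection is [class_id, confidence, x, y, width, height];
    the result keeps only the position info [x, y, width, height].
    The threshold lies in the 0.15-0.25 range, 0.2 preferred.
    """
    positions = []
    for det in detections:
        class_id, confidence, x, y, w, h = det
        if confidence >= conf_threshold:
            positions.append([x, y, w, h])
    return positions
```

The remaining [x, y, width, height] lists are exactly what step (3) feeds to the LSTM.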
Further, the specific method for the offline training of the LSTM network model in step (3) is as follows:
(31) decoding the training video at the same time interval to extract an image sequence;
(32) detecting each image in the image sequence through a Faster R-CNN network to obtain the position information and the spatial characteristics of a plurality of targets;
(33) the LSTM network is set to SN layers; each time, the spatial features and position information of the same target in SN consecutive images are taken out and input into the LSTM network; the loss function during training uses the mean squared error of the position information, computed as:

$$\mathrm{loss} = \frac{1}{4}\sum_{j=1}^{4}\left(L_{det}^{(j)} - L_{pred}^{(j)}\right)^2$$

where $L_{det}$ is the position at the next moment detected by the Faster R-CNN network, $L_{pred}$ is the predicted position output by the LSTM network, the index $j$ runs over the four position components, and position values are normalized to the interval $[0, 1]$;
(34) all targets in all training videos are trained through step (33), and the average loss $\mathrm{loss}_{avg}$ is computed as:

$$\mathrm{loss}_{avg} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{loss}_i$$

where N is the number of times all targets in all videos are input into the network and $\mathrm{loss}_i$ is the loss value obtained after each pass through the network in step (33). When $\mathrm{loss}_{avg}$ is smaller than the prediction threshold and the loss function has converged, training ends; the prediction threshold ranges from 0.15 to 0.25, preferably 0.2. Otherwise, another SN consecutive images are selected and step (33) is repeated.
Further, the step (3) is specifically: extracting the spatial features and target positions of the same target in SN consecutive images of the image sequence, inputting them into the trained LSTM network model, and, after LSTM processing, outputting the predicted position of the target at the next moment.
Further, the step (4) specifically includes:
(41) fusing the spatial features of the same target at adjacent consecutive moments to obtain the fused feature of the target, computed as:

$$F_{o,t_n} = \frac{F_{o,t_{n-1}} + F_{s,t_n}}{2}, \qquad F_{o,t_1} = F_{s,t_1}$$

where $t_n$ denotes the moment following $t_{n-1}$, $F_{o,t}$ denotes the fused feature of the target at time $t$, and $F_{s,t_1}$, $F_{s,t_2}$ denote the spatial features of the target at times $t_1$ and $t_2$;
(42) comparing the fused feature $F_{o,t}$ of the target with the spatial feature $F_{s,t}$ extracted for the target at the current moment, and computing the feature similarity F with the cosine similarity:

$$F = \frac{F_{o,t} \cdot F_{s,t}}{\left\| F_{o,t} \right\| \left\| F_{s,t} \right\|}$$
(43) comparing the position rectangle $R_o$ predicted by the LSTM method with the position rectangle $R_s$ of the target detected at the current moment, and computing the position similarity R with the area intersection-over-union (IoU) of the rectangles:

$$R = \frac{S_{inter}}{S_1 + S_2 - S_{inter}}$$

where $S_{inter} = (\min(r_1.r, r_2.r) - \max(r_1.l, r_2.l)) \times (\min(r_1.b, r_2.b) - \max(r_1.t, r_2.t))$; $r_1.t$, $r_1.b$, $r_1.l$, $r_1.r$ denote the top, bottom, left and right boundary values of rectangle $R_s$, and $r_2.t$, $r_2.b$, $r_2.l$, $r_2.r$ those of rectangle $R_o$; $S_{inter}$ is the overlapping area of the two rectangles, and $S_1$, $S_2$ are the areas of $R_s$ and $R_o$ respectively;
(44) combining the feature similarity and the position similarity, balanced by a feature-similarity weight $w_1$ and a position-similarity weight $w_2$, to obtain the comprehensive similarity:
diffs,o=w1F+w2R
(45) comparing a target detected at the current moment with all targets in the tracking state, and taking the target with the maximum comprehensive similarity $diff_{s,o}$ as the pending match; letting that maximum be diff, if diff is greater than the matching threshold the match is considered successful, where the matching threshold ranges from 0.6 to 0.7, preferably 0.65.
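Sub-steps (42)-(45) can be sketched as a short, self-contained Python fragment. It follows the cosine-similarity and IoU formulas above and the $diff_{s,o} = w_1 F + w_2 R$ matching rule; the function names and the (left, top, right, bottom) rectangle convention are assumptions for illustration.

```python
import math

def cosine_similarity(f1, f2):
    """Feature similarity F between fused and current spatial features."""
    dot = sum(a * b for a, b in zip(f1, f2))
    n1 = math.sqrt(sum(a * a for a in f1))
    n2 = math.sqrt(sum(b * b for b in f2))
    return dot / (n1 * n2)

def iou(r1, r2):
    """Position similarity R; rectangles as (left, top, right, bottom)."""
    w = min(r1[2], r2[2]) - max(r1[0], r2[0])
    h = min(r1[3], r2[3]) - max(r1[1], r2[1])
    inter = max(w, 0) * max(h, 0)
    s1 = (r1[2] - r1[0]) * (r1[3] - r1[1])
    s2 = (r2[2] - r2[0]) * (r2[3] - r2[1])
    return inter / (s1 + s2 - inter)

def match(detected, tracked, w1=0.6, w2=0.4, threshold=0.65):
    """Return the index of the best-matching tracked target, or None
    for a new target. Entries are (feature_vector, rect) pairs."""
    feat_s, rect_s = detected
    best_idx, best_diff = None, -1.0
    for i, (feat_o, rect_o) in enumerate(tracked):
        diff = w1 * cosine_similarity(feat_o, feat_s) + w2 * iou(rect_o, rect_s)
        if diff > best_diff:
            best_idx, best_diff = i, diff
    return best_idx if best_diff > threshold else None
```

The weights 0.6/0.4 and threshold 0.65 are the preferred values given later in the embodiment.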
Generally, compared with the prior art, the technical scheme of the invention has the following technical characteristics and beneficial effects:
(1) the method adopts a Faster R-CNN target detection algorithm to detect the position information and the spatial characteristics of a plurality of targets in a frame of image, the extracted spatial characteristics have stronger expression capability on the targets, and higher similarity is realized through the matching of appearance models;
(2) the method inputs target position information and target space characteristics into an LSTM network, learns the motion rule of the target by using an LSTM recurrent neural network, has strong prediction capability on the target position, and has higher similarity when the target position is matched;
(3) the technical scheme of the invention obtains the fusion characteristics of the target at the next moment by fusing the spatial characteristics of the target, obtains the final similarity by adding different weights in two aspects of position similarity and spatial characteristic similarity, and then judges the corresponding relation between a plurality of targets detected at the current moment and a plurality of targets in a tracking state before; the features of the two aspects of the predicted position and the fusion feature are combined for use, so that the accuracy of target tracking can be further improved.
Drawings
FIG. 1 is a schematic flow chart of the main process of the present invention;
FIG. 2 is a schematic view of a process for computing spatiotemporal features of an object in the method of the present invention;
FIG. 3 is a flow chart of the matching of targets in the method of the present invention;
FIG. 4 shows the result of the video Venice-1 tracking according to the present invention;
FIG. 5 shows the tracking result of the method of the present invention on the target-intensive video MOT 16-3.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The process of the method of the invention is shown in figure 1:
(1) video frame-by-frame decoding:
(11) setting a decoding interval frame number on the basis of extracting 4 frames per second, wherein if the video fps is 24, the interval frame number is 6;
(12) decoding the video to obtain an image in real time by using Opencv according to the decoding interval frame number;
(13) performing a pre-processing operation on the image, and scaling the image to 224 x 224 pixel size to adapt to the size of the Faster R-CNN network;
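A minimal sketch of sub-steps (11)-(13), assuming OpenCV's Python bindings; the helper names are illustrative and the 224x224 target size matches the Faster R-CNN input stated above.

```python
def interval_frames(fps, frames_kept_per_second=4):
    """Decoding interval: with fps=24 and 4 kept frames per second,
    the interval is 6 frames."""
    return int(round(fps / frames_kept_per_second))

def decode_video(path, target_size=(224, 224)):
    """Decode a video at the configured interval and resize each kept
    frame to the network input size. Requires OpenCV (cv2)."""
    import cv2  # imported here so the pure-Python helper above stays dependency-free
    cap = cv2.VideoCapture(path)
    interval = interval_frames(cap.get(cv2.CAP_PROP_FPS) or 24)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % interval == 0:
            frames.append(cv2.resize(frame, target_size))
        index += 1
    cap.release()
    return frames
```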
(2) fast R-CNN target detection:
the Faster R-CNN is a convolutional neural network, and the target detection by using the Faster R-CNN firstly needs to train model parameters offline and then process images online to obtain the position and the spatial characteristics of a target;
the process of training the model offline is as follows: marking out the target position and the target classification in the image, so that the network can perform back propagation through the set mark to determine the model parameters; the model training is a supervised training process, and the image sample of the training also uses the image extracted from the video to ensure the similarity with the actual used scene; the model training is a repeated iteration process, and the output error of the final model is within a certain range through feedback adjustment;
the on-line actual treatment process comprises the following steps:
(21) inputting the image into a trained Faster R-CNN network model, and extracting a plurality of target information through a network top classification layer and a window regression layer, wherein each target information comprises [ classification number, confidence coefficient, x, y, width and height ];
(22) filtering out targets with confidence lower than the confidence threshold 0.2; the [classification number, confidence] fields are discarded because they are independent of the time sequence, and the time-sequence-relevant position information [x, y, width, height] is kept;
(23) in the region-of-interest pooling layer, the spatial features of the targets are extracted according to the mapping relation of the regions generated by the RPN algorithm; since different targets differ in appearance, these features can be used to distinguish them;
(3) LSTM target prediction:
the LSTM is a cyclic neural network, and the LSTM is used for target prediction, and similarly, an offline training model is needed firstly, and then an online image is processed for prediction;
the process of training the model offline is as follows:
(31) decoding the training video according to the time interval of extracting 4 frames per second in the step (11), and extracting an image sequence;
(32) detecting each image in the image sequence through a Faster R-CNN network to obtain the position information and the spatial characteristics of a plurality of targets;
(33) the LSTM network is set to 6 layers; each time, the spatial features and position information of the same target in 6 consecutive adjacent images are taken out and input into the LSTM network; the loss function during training uses the mean squared error of the position information, computed as:

$$\mathrm{loss} = \frac{1}{4}\sum_{j=1}^{4}\left(L_{det}^{(j)} - L_{pred}^{(j)}\right)^2$$

where $L_{det}$ is the position at the next moment detected by the Faster R-CNN network, $L_{pred}$ is the predicted position output by the LSTM network, and the index $j$ runs over the four position components; position values are normalized to the interval $[0, 1]$;
(34) through continuous iteration, the position output by the LSTM network is enabled to be closer to the position of the target at the next moment, and the motion rule of the target is learned to predict the track.
The online actual processing is as follows: the spatial features and position information of the same target in 6 consecutive images are taken out and input into the network; the output after network processing is the predicted position of the target at the next moment;
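The per-target bookkeeping for this online step — keeping the last 6 observations of a target so they can be fed to the network as a sequence — can be sketched as below. This shows only the buffering; the LSTM itself is omitted, and the class name is hypothetical.

```python
from collections import deque

class TargetHistory:
    """Buffers the last SN (here 6) observations of one target so they
    can be fed to the LSTM as an input sequence."""
    SN = 6

    def __init__(self):
        self.positions = deque(maxlen=self.SN)  # [x, y, w, h] per frame
        self.features = deque(maxlen=self.SN)   # spatial feature per frame

    def add(self, position, feature):
        self.positions.append(position)
        self.features.append(feature)

    def ready(self):
        """True once SN consecutive observations are buffered."""
        return len(self.positions) == self.SN

    def lstm_input(self):
        """Sequence of (position, feature) pairs, oldest first."""
        return list(zip(self.positions, self.features))
```

The `maxlen` deque automatically drops the oldest observation, so the buffer always holds the most recent 6 consecutive frames.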
(4) target matching:
(41) fusing the spatial features of the same target at adjacent consecutive moments to obtain the fused feature of the target, computed as:

$$F_{o,t_n} = \frac{F_{o,t_{n-1}} + F_{s,t_n}}{2}, \qquad F_{o,t_1} = F_{s,t_1}$$

where $t_n$ denotes the moment following $t_{n-1}$, $F_{o,t}$ denotes the fused feature of the target at time $t$, and $F_{s,t_1}$, $F_{s,t_2}$ denote the spatial features of the target at times $t_1$ and $t_2$; the fused-feature calculation is illustrated in FIG. 2;
(42) comparing the fused feature $F_{o,t}$ of the target with the spatial feature $F_{s,t}$ extracted for the target at the current moment, and computing the feature similarity F with the cosine similarity:

$$F = \frac{F_{o,t} \cdot F_{s,t}}{\left\| F_{o,t} \right\| \left\| F_{s,t} \right\|}$$
(43) comparing the position rectangle $R_o$ predicted by the LSTM method with the position rectangle $R_s$ of the target detected at the current moment; the position similarity R is computed with the area intersection-over-union (IoU) of the rectangles. The IoU of rectangles $r_1$ and $r_2$, $IoU_{1,2}$, is computed as:

$$IoU_{1,2} = \frac{S_{inter}}{S_1 + S_2 - S_{inter}}$$

$$S_{inter} = (\min(r_1.r, r_2.r) - \max(r_1.l, r_2.l)) \times (\min(r_1.b, r_2.b) - \max(r_1.t, r_2.t))$$

where $r_1.t$, $r_1.b$, $r_1.l$, $r_1.r$ denote the top, bottom, left and right boundary values of rectangle $r_1$, $r_2.t$, $r_2.b$, $r_2.l$, $r_2.r$ those of rectangle $r_2$, $S_{inter}$ is the overlapping area of the two rectangles, and $S_1$, $S_2$ are the areas of the two rectangles respectively;
(44) combining the feature similarity and the position similarity, balanced by a feature-similarity weight $w_1 = 0.6$ and a position-similarity weight $w_2 = 0.4$ for the best effect; the final matching strategy is:
diffs,o=w1F+w2R
(45) comparing the similarity of a target detected at the current moment with all targets in the tracking state, and taking the target with the maximum comprehensive similarity $diff_{s,o}$ as the pending match; letting that maximum be diff, if diff is greater than the matching threshold 0.65 the match is considered successful, otherwise the match fails and the target is set as a new target. The flow chart of the whole process is shown in FIG. 3;
(46) after matching, the tracking state of the target needs to be updated. A target has four tracking states: the initial tracking state OS_BEGIN, the in-tracking state OS_TRACKING, the unmatched state OS_UNMATCH, and the tracking-end state OS_END. The unmatched state means the target is occluded, missed by the detector, or has just left the picture, and still needs to be tracked; the tracking-end state means the target may have failed to track or has left the video picture. The initial tracking state of a target is set during the matching process;
if the target is processed in the initial tracking state, updating the target to be in the tracking process after the current time is processed, and enabling the target to be matched at the next time;
if the target is in the tracking process state, checking whether the target is matched with the target under tracking at the current moment, if the target is matched successfully, adding the information into the target, updating the fusion characteristic of the target and predicting the position information of the target at the next moment; if the matching fails, setting the target to be in a non-matching state, and taking the predicted position as the position of the target at the current moment;
if the target is in the unmatched state, it was not successfully matched at the previous moment, and it must be checked whether it is matched at the current moment; if the match succeeds, the target is set to the in-tracking state, indicating that the occlusion has ended or the target has been detected again; if the match fails, the number of consecutive frames the target has been unmatched is checked: if it exceeds a certain number of frames, the target has left the picture or tracking has failed, and the tracking-end state is set; otherwise the current state is kept and the prediction continues to be used as the target's result at the current moment;
and if the target is in the tracking-end state, it is removed from the tracking queue and no further target matching is performed for it.
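The state-update rules of sub-step (46) amount to a small state machine, sketched below. The unmatched-frame cutoff is not specified numerically in the text ("a certain frame number"), so the value used here is an assumption, as is the function name.

```python
OS_BEGIN, OS_TRACKING, OS_UNMATCH, OS_END = range(4)
MAX_UNMATCHED_FRAMES = 10  # assumed cutoff; the text only says "a certain frame number"

def update_state(state, matched, unmatched_frames=0):
    """One step of the tracking-state update described above.
    Returns (new_state, new_unmatched_frame_count)."""
    if state == OS_BEGIN:
        return OS_TRACKING, 0                    # eligible for matching at the next moment
    if state == OS_TRACKING:
        if matched:
            return OS_TRACKING, 0                # update fused feature, predict next position
        return OS_UNMATCH, 1                     # use predicted position as the current result
    if state == OS_UNMATCH:
        if matched:
            return OS_TRACKING, 0                # occlusion ended or target re-detected
        if unmatched_frames + 1 > MAX_UNMATCHED_FRAMES:
            return OS_END, unmatched_frames + 1  # left the picture or tracking failed
        return OS_UNMATCH, unmatched_frames + 1  # keep using the prediction
    return OS_END, unmatched_frames              # removed from the tracking queue
```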
And (3) experimental test:
the host software environment of the experiment is Ubuntu 16.04LTS 64 bit, OpenCV 3.1.0 and CUDA 8.0, and the CPU in the hardware configuration is Intel Core i5-6500 and the GPU is GeForce GTX 1080. The evaluation method in MOT Challenge is used for selecting the following indexes:
FN: the number of missed detections; the lower the value, the better;
FP: the number of false alarms; the lower the value, the better;
IDSW: the number of identity switches over all targets; the lower the value, the better;
MOTA: multi-target tracking accuracy, computed from the three counts above (missed detections, false alarms and identity switches); it is the main comprehensive index in multi-target tracking evaluation, reflecting how many targets are tracked and how accurately they are matched; the higher the value, the better;
MOTP: multi-target tracking precision, computed from the average bounding-box overlap of all tracked targets, reflecting the positional accuracy of the results; the higher the value, the better;
Hz: the average number of image frames processed per second over a period of time, an index of the execution efficiency and speed of the tracking algorithm; the higher the value, the better;
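For reference, MOTA as used in MOT Challenge combines the three error counts as below. The text does not spell out the formula, so this sketch assumes the standard MOT Challenge definition.

```python
def mota(fn, fp, idsw, num_gt):
    """Multi-object tracking accuracy (standard MOT Challenge form,
    assumed here): 1 minus the sum of missed detections, false alarms
    and identity switches, normalized by the number of ground-truth
    object instances num_gt."""
    return 1.0 - (fn + fp + idsw) / num_gt
```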
The experiments compare three strategies on the MOT Challenge data set: M1 matches using the spatial deep features alone (as in most current convolutional-neural-network-based methods), M2 matches using the time-sequence positions alone, and M3 combines the two. The videos Venice-1 and MOT16-3 were tracked with each of the three methods; the video information is shown in Table 1 and the experimental results in Table 2 below.
As can be seen from tables 1 and 2, the video Venice-1 has moderate target density and high feature discrimination, good results can be obtained by using each strategy alone, and the accuracy can be improved by the multi-strategy method. For the video MOT16-3, the target density is high, and the accuracy is greatly improved by using a multi-strategy method.
Table 1 test video information table
Video | Venice-1 | MOT16-3
Resolution | 1920*1080 | 1920*1080
Duration (frames) | 450 | 1500
Number of targets | 4563 | 104556
Target density | Medium | Very high
Table 2 tracking results of different videos under different strategies
From the IDSW score, it can be seen that the value of the video MOT16-3 in method M2 is much larger than that in method M3, which indicates that in the case of high object density, the matching based on the position is prone to error, mainly because the high density results in high position overlapping rate of multiple objects and is not easy to distinguish. From FN, it can be seen that the values at method M2 are less than M1 for both videos, indicating that the prediction capability of LSTM can reduce the false negative rate.
The tracking result of the video Venice-1 is shown in FIG. 4, and the tracking result of the video MOT16-3 in a partial area is shown in FIG. 5, where a gray thin-line rectangular box represents the result of tracking after the target is detected, and a white thick-line rectangular box represents the result of predicting the target that is not detected or is in occlusion. The numbers above the rectangular frame represent the tracked target numbers, so that the matching conditions of the targets are compared, and the numbers at the lower right corner represent the frame numbers of the images in the video. The displayed tracking result shows that the pedestrians in the image are correctly tracked as the same target.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (4)
1. A monitoring video multi-target tracking method based on deep learning is characterized by comprising the following steps:
(1) decoding the monitoring video according to a set interval time to obtain an image;
(2) inputting the images obtained by decoding into a trained fast R-CNN target detection network model to obtain the position information and the spatial characteristics of a plurality of targets;
(3) inputting the position information and the spatial characteristics of the target at a plurality of moments into an LSTM network model for LSTM network model offline training, and predicting the position of the target at the next moment by using the trained LSTM network model;
(4) fusing the spatial features of the target at multiple moments to obtain a fused feature; calculating the similarity between the fused feature of the target and the spatial feature extracted for the target at the current moment; calculating the similarity between the position rectangle $R_o$ predicted for the target by the LSTM method and the position rectangle $R_s$ of the target detected at the current moment; and judging the correspondence between newly detected targets and tracked targets by matching the combined spatial features and position information; the step (4) is specifically as follows:
(41) fusing the spatial features of the same target at adjacent consecutive moments to obtain the fused feature of the target, computed as:

$$F_{o,t_n} = \frac{F_{o,t_{n-1}} + F_{s,t_n}}{2}, \qquad F_{o,t_1} = F_{s,t_1}$$

where $t_n$ denotes the moment following $t_{n-1}$, $F_{o,t}$ denotes the fused feature of the target at time $t$, and $F_{s,t_1}$, $F_{s,t_2}$ denote the spatial features of the target at times $t_1$ and $t_2$;
(42) comparing the fused feature $F_{o,t}$ of the target with the spatial feature $F_{s,t}$ extracted for the target at the current moment, and computing the feature similarity F with the cosine similarity:

$$F = \frac{F_{o,t} \cdot F_{s,t}}{\left\| F_{o,t} \right\| \left\| F_{s,t} \right\|}$$
(43) comparing the position rectangle R_o predicted by the LSTM method with the position rectangle R_s of the target detected at the current moment, and calculating the position similarity R using the area intersection-over-union (IOU) of the rectangles, expressed as follows:

R = S_inter / (S_1 + S_2 − S_inter)

wherein S_inter = (min(r_1.r, r_2.r) − max(r_1.l, r_2.l)) * (min(r_1.b, r_2.b) − max(r_1.t, r_2.t)); r_1.t, r_1.b, r_1.l, r_1.r respectively represent the top, bottom, left and right boundary values of rectangle R_s; r_2.t, r_2.b, r_2.l, r_2.r respectively represent the top, bottom, left and right boundary values of rectangle R_o; S_inter is the overlapping area of the two rectangles; and S_1 and S_2 respectively represent the areas of the two rectangles R_s and R_o;
(44) combining the feature similarity and the position similarity, balanced with a feature similarity weight w_1 and a position similarity weight w_2, to obtain the comprehensive similarity:

diff_s,o = w_1·F + w_2·R
(45) comparing a target detected at the current moment with all targets in the tracking state, and taking the target with the largest comprehensive similarity diff_s,o as the pending matching target; setting this maximum value as diff: if diff is greater than the matching threshold, the matching is considered successful; otherwise the matching fails and the detection is set as a new target.
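The similarity computation and matching of steps (41)-(45) can be sketched as follows. This is a minimal NumPy illustration only: the function names, the equal weights w_1 = w_2 = 0.5, and the 0.6 matching threshold are assumed example values, not values specified by the patent.

```python
import numpy as np

def cosine_similarity(f_o, f_s):
    # Feature similarity F between the fused feature and the current spatial feature.
    return float(np.dot(f_o, f_s) / (np.linalg.norm(f_o) * np.linalg.norm(f_s)))

def iou(r1, r2):
    # Position similarity R: area intersection-over-union of two rectangles.
    # Rectangles are (left, top, right, bottom) tuples.
    l = max(r1[0], r2[0]); t = max(r1[1], r2[1])
    r = min(r1[2], r2[2]); b = min(r1[3], r2[3])
    if r <= l or b <= t:
        return 0.0  # no overlap
    s_inter = (r - l) * (b - t)
    s1 = (r1[2] - r1[0]) * (r1[3] - r1[1])
    s2 = (r2[2] - r2[0]) * (r2[3] - r2[1])
    return s_inter / (s1 + s2 - s_inter)

def match_detection(det_feat, det_box, tracks, w1=0.5, w2=0.5, thresh=0.6):
    # Compare one detection against all tracked targets and pick the largest
    # comprehensive similarity diff = w1*F + w2*R; below the threshold the
    # detection starts a new track (returned track id is None).
    best_id, best_diff = None, -1.0
    for tid, (fused_feat, pred_box) in tracks.items():
        diff = w1 * cosine_similarity(fused_feat, det_feat) + w2 * iou(pred_box, det_box)
        if diff > best_diff:
            best_id, best_diff = tid, diff
    return (best_id, best_diff) if best_diff > thresh else (None, best_diff)
```

A detection that overlaps a predicted rectangle well and matches its fused appearance feature is assigned to that track; everything else is registered as a new target.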
2. The monitoring video multi-target tracking method based on deep learning as claimed in claim 1, wherein the step (2) specifically comprises:
(21) inputting the image into the trained Faster R-CNN network model, and extracting a plurality of target information items through the classification layer and window regression layer at the top of the network, wherein each target information item comprises [class number, confidence, x, y, width, height];
(22) filtering out targets whose confidence is lower than a confidence threshold, discarding [class number, confidence] from the information of the remaining targets, and retaining the position information [x, y, width, height];
(23) in the region-of-interest pooling layer, extracting the spatial features of the plurality of targets according to the mapping relation of the regions generated by the RPN algorithm.
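The filtering stage of steps (21)-(22) can be sketched as follows; the 0.7 confidence threshold is an assumed example value, not one given by the patent.

```python
def filter_detections(detections, conf_thresh=0.7):
    """Keep only confident detections and strip them down to position info.

    Each detection is [class_number, confidence, x, y, width, height];
    the returned entries keep only [x, y, width, height].
    """
    return [det[2:] for det in detections if det[1] >= conf_thresh]
```

The class number and confidence are only needed for filtering; tracking downstream works purely on positions and spatial features.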
3. The monitoring video multi-target tracking method based on deep learning as claimed in claim 1, wherein the offline training of the LSTM network model in the step (3) is specifically as follows:
(31) decoding the training video at the same time interval to extract an image sequence;
(32) detecting each image in the image sequence through a Faster R-CNN network to obtain the position information and the spatial characteristics of a plurality of targets;
(33) the LSTM network is set to SN layers; each time, the spatial features and position information of the same target in SN consecutive images are taken out and input into the LSTM network; the loss function in training uses the mean square error of the position information, calculated as follows:

loss = ‖L_det − L_pred‖²

wherein L_det is the position at the next moment detected by the Faster R-CNN network, L_pred is the predicted position output by the LSTM network, and the position values are normalized to the interval [0, 1];
(34) training on the targets in all training videos through step (33) and calculating the average loss loss_avg, as follows:

loss_avg = (1/N) · Σ_{i=1..N} loss_i

wherein N is the number of times all targets in all videos are input into the network, and loss_i is the loss value obtained after each pass through the network in step (33); when loss_avg is smaller than the prediction threshold and the loss function has converged, the training is finished; otherwise, another SN consecutive images are selected and step (33) is repeated.
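The loss bookkeeping of steps (33)-(34) can be sketched as follows; positions are assumed already normalized to [0, 1], and the 1e-3 prediction threshold is a placeholder, not a value from the patent.

```python
import numpy as np

def position_loss(l_det, l_pred):
    # Mean squared error between the next-frame position detected by
    # Faster R-CNN (L_det) and the position predicted by the LSTM (L_pred),
    # both given as [x, y, width, height] normalized to [0, 1].
    l_det, l_pred = np.asarray(l_det, float), np.asarray(l_pred, float)
    return float(np.mean((l_det - l_pred) ** 2))

def average_loss(losses):
    # loss_avg = (1/N) * sum(loss_i) over all N training passes.
    return sum(losses) / len(losses)

def training_finished(losses, pred_thresh=1e-3):
    # Stop training once the average loss falls below the prediction threshold.
    return average_loss(losses) < pred_thresh
```

In practice the convergence check would also look at the loss trend over recent iterations, not just the running average.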
4. The monitoring video multi-target tracking method based on deep learning as claimed in claim 1, wherein the step (3) is specifically as follows: extracting the spatial features and target positions of the same target in SN consecutive images of the image sequence, inputting them into the trained LSTM network model, and outputting the predicted position of the target at the next moment after processing by the LSTM network model.
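The prediction interface of step (3) takes SN consecutive positions in and returns one predicted next position. The trained LSTM itself is not reproduced here; a constant-velocity extrapolation stands in for it purely to illustrate that contract, and is not the patent's predictor.

```python
import numpy as np

def predict_next_position(positions):
    """positions: SN consecutive [x, y, width, height] rows for one target.

    The patent feeds these into a trained LSTM network model; here a
    constant-velocity extrapolation is used as a stand-in so the
    SN-frames-in, next-position-out interface can be exercised.
    """
    p = np.asarray(positions, dtype=float)
    if len(p) < 2:
        return p[-1]                     # too little history: hold position
    return p[-1] + (p[-1] - p[-2])       # last position plus last displacement
```

The returned rectangle plays the role of R_o in step (43): it is compared against the detected rectangle R_s via IOU.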
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710504914.5A CN107330920B (en) | 2017-06-28 | 2017-06-28 | Monitoring video multi-target tracking method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107330920A CN107330920A (en) | 2017-11-07 |
CN107330920B true CN107330920B (en) | 2020-01-03 |
Family
ID=60198399
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710504914.5A Active CN107330920B (en) | 2017-06-28 | 2017-06-28 | Monitoring video multi-target tracking method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107330920B (en) |
Families Citing this family (44)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108055501A (en) * | 2017-11-22 | 2018-05-18 | 天津市亚安科技有限公司 | A kind of target detection and the video monitoring system and method for tracking |
CN108229407A (en) * | 2018-01-11 | 2018-06-29 | 武汉米人科技有限公司 | A kind of behavioral value method and system in video analysis |
CN108280843A (en) * | 2018-01-24 | 2018-07-13 | 新华智云科技有限公司 | A kind of video object detecting and tracking method and apparatus |
CN108537818B (en) * | 2018-03-07 | 2020-08-14 | 上海交通大学 | Crowd trajectory prediction method based on cluster pressure LSTM |
CN108320297B (en) * | 2018-03-09 | 2020-06-19 | 湖北工业大学 | Video target real-time tracking method and system |
CN108509876B (en) * | 2018-03-16 | 2020-11-27 | 深圳市商汤科技有限公司 | Object detection method, device, apparatus, storage medium, and program for video |
CN108491816A (en) * | 2018-03-30 | 2018-09-04 | 百度在线网络技术(北京)有限公司 | The method and apparatus for carrying out target following in video |
CN110349182A (en) * | 2018-04-07 | 2019-10-18 | 苏州竺星信息科技有限公司 | A kind of personage's method for tracing based on video and positioning device |
CN108520530B (en) * | 2018-04-12 | 2020-01-14 | 厦门大学 | Target tracking method based on long-time and short-time memory network |
CN108764032B (en) * | 2018-04-18 | 2019-12-24 | 北京百度网讯科技有限公司 | Intelligent monitoring method and device for coal mine water exploration and drainage, computer equipment and storage medium |
DE102018206208A1 (en) * | 2018-04-23 | 2019-10-24 | Robert Bosch Gmbh | Method, device, product and computer program for operating a technical system |
CN108664930A (en) * | 2018-05-11 | 2018-10-16 | 西安天和防务技术股份有限公司 | A kind of intelligent multi-target detection tracking |
CN108664935A (en) * | 2018-05-14 | 2018-10-16 | 中山大学新华学院 | The method for tracking target and system of depth Spatial-temporal Information Fusion based on CUDA |
CN108805907B (en) * | 2018-06-05 | 2022-03-29 | 中南大学 | Pedestrian posture multi-feature intelligent identification method |
CN108875819B (en) * | 2018-06-08 | 2020-10-27 | 浙江大学 | Object and component joint detection method based on long-term and short-term memory network |
CN110688873A (en) * | 2018-07-04 | 2020-01-14 | 上海智臻智能网络科技股份有限公司 | Multi-target tracking method and face recognition method |
CN109063574B (en) * | 2018-07-05 | 2021-04-23 | 顺丰科技有限公司 | Method, system and equipment for predicting envelope frame based on deep neural network detection |
CN109344725B (en) * | 2018-09-04 | 2020-09-04 | 上海交通大学 | Multi-pedestrian online tracking method based on space-time attention mechanism |
CN109308469B (en) * | 2018-09-21 | 2019-12-10 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating information |
CN109241952B (en) * | 2018-10-26 | 2021-09-07 | 北京陌上花科技有限公司 | Figure counting method and device in crowded scene |
CN111104831B (en) * | 2018-10-29 | 2023-09-29 | 香港城市大学深圳研究院 | Visual tracking method, device, computer equipment and medium |
CN109409307B (en) * | 2018-11-02 | 2022-04-01 | 深圳龙岗智能视听研究院 | Online video behavior detection method based on space-time context analysis |
CN109584213B (en) * | 2018-11-07 | 2023-05-30 | 复旦大学 | Multi-target number selection tracking method |
CN109740416B (en) * | 2018-11-19 | 2021-02-12 | 深圳市华尊科技股份有限公司 | Target tracking method and related product |
CN109800689B (en) * | 2019-01-04 | 2022-03-29 | 西南交通大学 | Target tracking method based on space-time feature fusion learning |
CN109934096B (en) * | 2019-01-22 | 2020-12-11 | 浙江零跑科技有限公司 | Automatic driving visual perception optimization method based on characteristic time sequence correlation |
CN109934115B (en) * | 2019-02-18 | 2021-11-02 | 苏州市科远软件技术开发有限公司 | Face recognition model construction method, face recognition method and electronic equipment |
CN110033469B (en) * | 2019-04-01 | 2021-08-27 | 北京科技大学 | Sub-pixel edge detection method and system |
CN110276783B (en) * | 2019-04-23 | 2021-01-08 | 上海高重信息科技有限公司 | Multi-target tracking method and device and computer system |
CN110175538A (en) * | 2019-05-10 | 2019-08-27 | 国网福建省电力有限公司龙岩供电公司 | A kind of substation's Bird's Nest recognition methods and system based on machine learning |
CN110111358B (en) * | 2019-05-14 | 2022-05-24 | 西南交通大学 | Target tracking method based on multilayer time sequence filtering |
CN110333517B (en) * | 2019-07-11 | 2022-11-25 | 腾讯科技(深圳)有限公司 | Obstacle sensing method, obstacle sensing device and storage medium |
CN110598540B (en) * | 2019-08-05 | 2021-12-03 | 华中科技大学 | Method and system for extracting gait contour map in monitoring video |
CN110443829A (en) * | 2019-08-05 | 2019-11-12 | 北京深醒科技有限公司 | An anti-occlusion tracking algorithm based on motion features and similarity features |
CN110944295B (en) * | 2019-11-27 | 2021-09-21 | 恒安嘉新(北京)科技股份公司 | Position prediction method, position prediction device, storage medium and terminal |
CN111027505B (en) * | 2019-12-19 | 2022-12-23 | 吉林大学 | Hierarchical multi-target tracking method based on significance detection |
SG10201913754XA (en) * | 2019-12-30 | 2020-12-30 | Sensetime Int Pte Ltd | Image processing method and apparatus, electronic device, and storage medium |
US11631251B2 (en) | 2020-02-23 | 2023-04-18 | Tfi Digital Media Limited | Method and system for jockey and horse recognition and tracking |
CN112001252B (en) * | 2020-07-22 | 2024-04-12 | 北京交通大学 | Multi-target tracking method based on different composition network |
CN112070807B (en) * | 2020-11-11 | 2021-02-05 | 湖北亿咖通科技有限公司 | Multi-target tracking method and electronic device |
CN112529941B (en) * | 2020-12-17 | 2021-08-31 | 深圳市普汇智联科技有限公司 | Multi-target tracking method and system based on depth trajectory prediction |
CN112906545B (en) * | 2021-02-07 | 2023-05-05 | 广东省科学院智能制造研究所 | Real-time action recognition method and system for multi-person scene |
CN113569824B (en) * | 2021-09-26 | 2021-12-17 | 腾讯科技(深圳)有限公司 | Model processing method, related device, storage medium and computer program product |
CN114240997B (en) * | 2021-11-16 | 2023-07-28 | 南京云牛智能科技有限公司 | Intelligent building online trans-camera multi-target tracking method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106127802A (en) * | 2016-06-16 | 2016-11-16 | 南京邮电大学盐城大数据研究院有限公司 | A kind of movement objective orbit method for tracing |
CN106845430A (en) * | 2017-02-06 | 2017-06-13 | 东华大学 | Pedestrian detection and tracking based on acceleration region convolutional neural networks |
CN106875425A (en) * | 2017-01-22 | 2017-06-20 | 北京飞搜科技有限公司 | A kind of multi-target tracking system and implementation method based on deep learning |
Non-Patent Citations (3)
Title |
---|
Faster R-CNN: Towards Real-time Object Detection with Region Proposal Networks; Shaoqing Ren et al.; Proceedings of the 28th International Conference on Neural Information Processing Systems; 2015-12-12; Vol. 1; pp. 91-99 * |
MULTI-TARGET DETECTION IN CCTV FOOTAGE FOR TRACKING APPLICATIONS USING DEEP LEARNING TECHNIQUES; A. Dimou et al.; 2016 IEEE International Conference on Image Processing; 2016-08-19; pp. 1-5 * |
Research Status and Prospects of Deep-Learning-Based Target Tracking Methods; Luo Haibo et al.; Infrared and Laser Engineering; May 2017; Vol. 46, No. 5; pp. 1-7 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107330920B (en) | Monitoring video multi-target tracking method based on deep learning | |
CN109492581B (en) | Human body action recognition method based on TP-STG frame | |
Yang et al. | Spatio-temporal action detection with cascade proposal and location anticipation | |
KR101653278B1 (en) | Face tracking system using colar-based face detection method | |
Li et al. | Rapid and robust human detection and tracking based on omega-shape features | |
CN107230267B (en) | Intelligence In Baogang Kindergarten based on face recognition algorithms is registered method | |
CN105260749B (en) | Real-time target detection method based on direction gradient binary pattern and soft cascade SVM | |
Lin et al. | Social mil: Interaction-aware for crowd anomaly detection | |
CN113011367A (en) | Abnormal behavior analysis method based on target track | |
KR102132722B1 (en) | Tracking method and system multi-object in video | |
JP7136500B2 (en) | Pedestrian Re-identification Method for Random Occlusion Recovery Based on Noise Channel | |
CN112926522B (en) | Behavior recognition method based on skeleton gesture and space-time diagram convolution network | |
CN111191535B (en) | Pedestrian detection model construction method based on deep learning and pedestrian detection method | |
CN104616006A (en) | Surveillance video oriented bearded face detection method | |
Mao et al. | Training a scene-specific pedestrian detector using tracklets | |
CN103971100A (en) | Video-based camouflage and peeping behavior detection method for automated teller machine | |
CN111881775B (en) | Real-time face recognition method and device | |
CN109711232A (en) | Deep learning pedestrian recognition methods again based on multiple objective function | |
Heili et al. | Parameter estimation and contextual adaptation for a multi-object tracking CRF model | |
CN110852203B (en) | Multi-factor suspicious person identification method based on video feature learning | |
Kim et al. | Development of a real-time automatic passenger counting system using head detection based on deep learning | |
Moayed et al. | Traffic intersection monitoring using fusion of GMM-based deep learning classification and geometric warping | |
Li et al. | Pedestrian Motion Path Detection Method Based on Deep Learning and Foreground Detection | |
CN117058627B (en) | Public place crowd safety distance monitoring method, medium and system | |
Shi et al. | High-altitude parabolic detection method based on GMM model and SORT algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |