CN115035159A - Video multi-target tracking method based on deep learning and time sequence feature enhancement - Google Patents


Info

Publication number
CN115035159A
CN115035159A (application number CN202210632698.3A)
Authority
CN
China
Prior art keywords
frame
reid
feature
target
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210632698.3A
Other languages
Chinese (zh)
Inventor
刘勇
林叶能
王蒙蒙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202210632698.3A priority Critical patent/CN115035159A/en
Publication of CN115035159A publication Critical patent/CN115035159A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G06T 7/277 - Analysis of motion involving stochastic approaches, e.g. using Kalman filters
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 - Proximity, similarity or dissimilarity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 - Image acquisition modality
    • G06T 2207/10016 - Video; Image sequence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20084 - Artificial neural networks [ANN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of computer vision and discloses a video multi-target tracking method based on deep learning and time sequence feature enhancement, which comprises the following steps: S1, preparing and processing a data set, and using the processed data as input data for model training and testing; S2, separating the target detection and ReID tasks in the model structure; S3, constructing a ReID task module that uses temporal information to improve the model structure; and S4, performing post-processing and inference of the model, and applying the improved model structure to the data-association matching process of multi-target tracking. The method alleviates the conflict between the target detection and ReID tasks during training by separating the detection and ReID branches, so that the two structures are more independent and detection performance improves while functional accuracy is maintained. Temporal information is exploited by combining the center-point features of historical frames and adding a feature enhancement module, thereby improving the model's multi-target tracking performance on unmanned aerial vehicle video sequences.

Description

Video multi-target tracking method based on deep learning and time sequence feature enhancement
Technical Field
The invention relates to the field of computer vision, and in particular to a video multi-target tracking method based on deep learning and time sequence feature enhancement.
Background
In recent years, with the rapid development of artificial intelligence, computer vision technology has penetrated many fields and its application scenarios have become increasingly widespread. Making full use of computer vision can greatly improve the efficiency of environmental monitoring and national defense and security surveillance, and accomplishing multi-target detection and tracking through computer vision has become one of the research hotspots of recent years. Multi-target tracking means that, in the continuous frame sequence of a video, all target objects in each frame are detected, attributes such as the position, size and velocity of each target are obtained, and the same individual is assigned the same ID in every frame, thereby completing the task of detecting and tracking multiple targets in the video sequence.
Tracking-by-detection (TBD) completes the subsequent tracking and matching on the basis of target detection results. Mainstream bounding-box-based detectors include the YOLO series, Faster R-CNN and RetinaNet, and in recent years many anchor-free, center-point-based detectors such as CenterNet, FCOS and CenterPoint have appeared. In a TBD pipeline, inter-frame data association and matching are required after detection; SORT and DeepSORT are relatively classical matching algorithms and can be combined with a simple detector to complete a multi-target tracking task. Multi-target tracking can also be realized within an anchor-free, joint-detection-and-tracking (JDT) framework, such as RetinaTrack and FairMOT, which combine detection and tracking, simplifying the model structure and improving real-time performance.
Multi-target tracking is a cross-frame video interpretation task, yet most models in current research make poor use of temporal information; relying only on the image information of the current frame has clear limitations, as targets lack any connection across frames. For example, if an object is occluded in a certain frame and data association relies only on single-frame information, the representation of the same object will often differ between frames, which can cause ID switches and reduce the accuracy of the model. Exploiting temporal information well can therefore greatly improve model performance. In addition, although JDT-paradigm models train detection and data association jointly to realize end-to-end multi-target tracking, target detection and tracking are essentially two different visual tasks: detection needs to distinguish multiple classes of objects, maximizing the distance between objects of different classes while minimizing the distance between objects of the same class, whereas tracking needs to maximize the distance between all individuals within the same class. If the two sub-tasks share too many parameters during training, training efficiency may drop and the performance of the trained model may degrade in some cases.
Disclosure of Invention
To solve the above problems, the invention provides a video multi-target tracking method based on deep learning and time sequence feature enhancement, using FairMOT as the original reference structure. For the overall model structure, the original detection and feature-generation structure is first split, and a feature enhancement module based on temporal information is added to the ReID branch, improving the model's ability to discriminate ReID information. For the model loss, compared with the original single-frame loss calculation, the detection output part is changed to double-frame output and the loss is computed on the outputs of adjacent frames simultaneously, improving the training efficiency and reducing the bias of the model.
To achieve this purpose, the invention provides a video multi-target tracking method based on deep learning and time sequence feature enhancement, which comprises the following steps:
S1, preparing and processing a data set, and using the processed data as input data for model training and testing;
S2, separating the target detection and ReID tasks in the model structure;
S3, constructing a ReID task module that uses temporal information to improve the model structure;
S4, performing post-processing and inference of the model, and applying the improved model structure to the data-association matching process of multi-target tracking.
Preferably, the step S1 specifically includes the following steps:
S11, collecting a set of unmanned aerial vehicle video sequences as the data set;
S12, annotating the data set in COCO format, which provides the frame number, the target ID, the coordinates of the top-left vertex of the bounding box, the width and height of the bounding box, whether the target is occluded, and whether the target should be ignored;
S13, counting the IDs of the data set per class;
S14, performing rotation and scaling on each image in the data set.
Preferably, the step S2 specifically includes the following steps:
S21, changing the decoder of the model's backbone network into two decoders with the same structure, used for the target detection and ReID tasks respectively;
S22, changing the model input to double-frame input, sharing parameters between the two frame images, and extracting features through an encoder;
S23, feeding the extracted features simultaneously into the two structurally identical decoders to perform the target detection and ReID tasks respectively.
Preferably, the step S23 specifically includes: in the target detection part, the previous-frame features produced by a decoder first pass through a multi-layer convolution, the resulting feature map is concatenated with the current-frame features produced by the decoder, and the output of the target detection branch is finally obtained through a heat map branch; in the ReID task part, a feature enhancement module is added, the adjacent-frame features produced by the decoder and the heat map of the previous frame are used as the inputs of this module, and the output of the ReID task branch is obtained after the module integrates this information.
Preferably, the step S3 is specifically divided into a training phase and an inference phase.
Preferably, the training phase specifically includes the following steps:
S311, obtaining, from the annotation information of the data set, the features at the corresponding positions of the previous-frame feature map, and computing the similarity between these features and the current feature map to obtain the feature distance between each object of the previous frame and each point of the current frame;
S312, performing feature fusion once the pairwise position correspondences between the previous-frame feature map and the current feature map have been obtained.
Preferably, the inference phase specifically includes the following steps:
S321, using the heat map to obtain the targets that may exist in the previous frame, and taking the ReID features at the positions corresponding to these targets as one of the inputs of the feature module;
S322, setting a threshold: if the distance between a previous-frame center point and the matched current-frame center point exceeds the threshold, the match is considered unreliable and is ignored; only high-confidence matching points are kept and fused with the current feature map.
Preferably, the step S3 is to perform feature fusion on the heat map of the previous frame, the feature map of the previous frame, and the feature map of the current frame.
Preferably, the step S4 specifically includes the following steps:
S41, taking three frames as one round: for the first frame, normalizing and standardizing the heat map and ReID features produced by the model, applying non-maximum suppression to the heat map, screening out candidate objects according to a set threshold, and assigning IDs to the objects of the first frame;
S42, repeating the first-frame operations on the second frame to obtain candidate objects, matching them with the first-frame objects by bounding-box IoU, keeping detections that meet expectations and giving them the same IDs, and retaining unmatched objects;
S43, for the third frame, adding ReID features on top of the second-frame procedure, computing the cosine distance between the ReID features of detected targets in adjacent frames, performing motion prediction with a Kalman filter, and carrying out data association by combining appearance and motion features;
S44, computing the IoU between the objects left unmatched in the third frame and the objects of the previous frame, regarding an object as a new target and giving it a new ID if the IoU is below a fixed threshold, and repeating these steps for every frame to complete the post-processing of video multi-target tracking.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a video multi-target tracking method based on deep learning and timing sequence characteristic enhancement, which can realize end-to-end detection and tracking of multi-class objects in an unmanned aerial vehicle video sequence, and firstly solves the problem that two tasks of target detection and ReID possibly have certain conflict during training, separates target detection and ReID branches, enables two structures to have independence and improves detection precision, utilizes timing sequence information, combines central point characteristics of historical frames and adds a characteristic enhancement module, and accordingly improves multi-target tracking performance of a model on the unmanned aerial vehicle video sequence.
Drawings
FIG. 1 shows the overall structure of the algorithm model of the present invention;
FIG. 2 shows the first improved model structure, in which the ReID task module is constructed using temporal information, according to the present invention;
FIG. 3 shows the second improved model structure, in which the ReID task module is constructed using temporal information, according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Aiming at the problems and shortcomings of the prior art, the invention provides a video multi-target tracking method based on deep learning and time sequence feature enhancement that improves the model structure in three ways: 1. the single-frame output of the original model is changed to double-frame output, improving the training efficiency and reducing the bias of the model; 2. the conflict between target detection and ReID task training under the JDT paradigm is resolved; 3. a feature enhancement module based on temporal information is constructed, improving the representation of ReID information and the training efficiency of the ReID branch while preserving tracking and matching accuracy.
The invention provides a video multi-target tracking method based on deep learning and time sequence feature enhancement, which comprises the following steps:
S1, preparing and processing a data set, and using the processed data as input data for model training and testing;
S2, separating the target detection and ReID tasks in the model structure;
S3, constructing a ReID task module that uses temporal information to improve the model structure;
S4, performing post-processing and inference of the model, and applying the improved model structure to the data-association matching process of multi-target tracking.
Each step is described in detail below.
Step S1: preparing and processing a data set and using the processed data as input for model training and testing. The image information and annotation information are processed to obtain the training data required by the multi-target tracking task, and this data serves as the model's input for training and testing. This step mainly comprises the following sub-steps:
S11, collecting a set of unmanned aerial vehicle video sequences as the data set. The data set comprises a training set and a validation set, each containing multiple unmanned aerial vehicle video sequences. Each video segment consists of optical images and contains different scenes and objects. All frames within a video have the same size and format, while different video sequences may have different image sizes and shooting modes. The data set may contain a variety of object categories, such as pedestrian, vehicle and truck.
S12, annotating the data set in COCO format (a COCO-format data set consists of a JSON file containing all details of each image, such as its size, its annotations, i.e. bounding box coordinates, and the label corresponding to each bounding box). Compared with a pure detection data set, an identity (ID) label is added, so that the annotation format mainly provides the frame number, the target ID, the coordinates of the top-left vertex of the bounding box, the width and height of the bounding box, whether the target is occluded, and whether the target should be ignored.
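As an illustration only (not part of the claimed method), a single annotation entry under such a scheme might look as follows; the field names are assumptions chosen for readability, not the exact schema used by the inventors.

```python
# Illustrative COCO-style annotation entry extended with tracking-specific fields.
# Field names are assumptions, not the patent's exact schema.
annotation = {
    "image_id": 152,                      # frame number within the video sequence
    "category_id": 1,                     # object class (e.g. pedestrian)
    "track_id": 7,                        # identity label used by the ReID branch
    "bbox": [482.0, 311.0, 36.0, 88.0],   # [top-left x, top-left y, width, height]
    "occluded": 0,                        # whether the target is occluded in this frame
    "ignore": 0,                          # whether the target should be ignored in training
}
```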
S13, counting the IDs of the data sets according to the categories;
Compared with single-class multi-target tracking, multi-class multi-target tracking requires additional processing of the data set. Because the original annotations contain multiple classes, the IDs of the different classes must be handled when constructing the training data; the main approach is to count IDs per class, i.e. the IDs of each class start from 0.
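One way such per-class renumbering could be implemented is sketched below; this is an assumed preprocessing helper, not code from the patent, and it reuses the illustrative annotation fields shown above.

```python
from collections import defaultdict

def reindex_ids_per_class(annotations):
    """Re-index track IDs so that the IDs of every class start from 0.

    `annotations` is assumed to be a list of dicts carrying 'category_id'
    and 'track_id' keys, as in the illustrative entry above.
    """
    next_free = defaultdict(int)   # next unused ID for each class
    mapping = {}                   # (class, original ID) -> per-class ID
    for ann in annotations:
        key = (ann["category_id"], ann["track_id"])
        if key not in mapping:
            mapping[key] = next_free[ann["category_id"]]
            next_free[ann["category_id"]] += 1
        ann["track_id"] = mapping[key]
    return annotations
```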
S14, rotating and zooming each image in the data set;
Compared with a single-class task, the number of target IDs of each class appearing in the whole video sequence must be counted, and these target IDs serve as the classification labels when training the ReID branch. Data augmentation is applied during training: each image fed into the model is rotated and scaled, which improves the training effect of the model.
Step S2: separating the target detection and ReID tasks in the model structure.
FairMOT, an end-to-end anchor-free multi-target tracking framework built on CenterNet, is used as the original reference structure. Its feature extraction part uses DLA-34 as the backbone network to extract features from the two-dimensional video images and then generates several head branches for the different visual tasks, namely a heat map branch, an offset branch, a width-height (wh) branch and a ReID branch, all of which share the extracted feature map. Compared with the feature extraction part of the original FairMOT framework, the invention takes into account that the target detection task and the ReID task conflict to a certain extent during training: detection aims to maximize the distance between objects of different categories and minimize the distance between objects of the same category, whereas ReID aims to maximize the distance between different individuals within the same category so as to achieve accurate re-identification. Training the two tasks together therefore often produces conflicts in some respects, and the model structure must be adjusted for this problem. This step mainly comprises the following sub-steps:
s21, changing a decoder of a backbone network on the model into two decoders with the same structure for target detection and ReID tasks respectively;
To address the conflict between ReID and target detection, the model of the invention separates the two branches; on the original FairMOT, the functions of these two structures are realized by four branches that share the same encoder and decoder. The structural adjustment here mainly improves the decoder part of the DLA-34 backbone network, replacing it with two decoders of identical structure used for target detection and ReID respectively; the two decoders do not share parameters, which reduces mutual interference between parameters during training. Fig. 1 shows the overall structure of the algorithm model of the invention.
S22, changing the model input into double-frame input, sharing parameters of the two frames of images, and extracting features through an encoder;
Compared with the single-frame input of FairMOT, the input of the model of the invention is changed to a pair of adjacent frames; in the testing stage, the first frame of a video sequence is handled by feeding the first-frame image twice. In the input part, the two frame images are processed with shared parameters and features are extracted by the DLA-34 encoder.
S23, simultaneously inputting the extracted features into the two decoders with the same structure to respectively perform target detection and ReID tasks;
The extracted features are fed into the two decoders simultaneously for target detection and ReID processing. Compared with the original FairMOT structure, the heat map branch of the detection part is restructured: the previous-frame features produced by the decoder first pass through a multi-layer convolution, the resulting feature map is concatenated with the current-frame features produced by the decoder, and the output of the branch is finally obtained through the heat map branch. On both branches the model of the invention outputs predictions for the adjacent frames, and during training the loss is computed on both predictions simultaneously. On the ReID branch, compared with the original FairMOT, a feature enhancement module is added: the adjacent-frame features produced by decoder B and the heat map of the previous frame are used as inputs of this module, and the ReID branch output is produced after the module integrates this information. A simplified sketch of this layout is given below.
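The PyTorch-style sketch below illustrates this layout under simplifying assumptions: the stand-in encoder/decoder blocks replace DLA-34, layer widths are arbitrary, and only a single heat map channel is shown. It is meant to convey the shared encoder, the two non-weight-sharing decoders and the previous/current-frame splice of the heat map branch, not the actual implementation.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Stand-in for a DLA-34 encoder/decoder stage; illustrative only.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU())

class DualBranchTracker(nn.Module):
    """Sketch of the layout described above: a shared encoder for the two input frames,
    two structurally identical decoders with separate weights (detection vs. ReID),
    and a heat map branch that splices convolved previous-frame features with
    current-frame features. The real model uses DLA-34 and the FairMOT heads."""

    def __init__(self, feat_ch=64, reid_dim=128, num_classes=1):
        super().__init__()
        self.encoder = conv_block(3, feat_ch)                 # shared across both frames
        self.det_decoder = conv_block(feat_ch, feat_ch)       # decoder A: detection
        self.reid_decoder = conv_block(feat_ch, feat_ch)      # decoder B: ReID, no weight sharing
        self.prev_conv = nn.Sequential(                       # multi-layer conv on previous-frame features
            conv_block(feat_ch, feat_ch), conv_block(feat_ch, feat_ch))
        self.heatmap_head = nn.Conv2d(2 * feat_ch, num_classes, 1)
        self.reid_head = nn.Conv2d(feat_ch, reid_dim, 1)

    def forward(self, frame_prev, frame_cur):
        # parameter-shared feature extraction for the two adjacent frames
        enc_prev, enc_cur = self.encoder(frame_prev), self.encoder(frame_cur)
        det_prev, det_cur = self.det_decoder(enc_prev), self.det_decoder(enc_cur)
        reid_prev, reid_cur = self.reid_decoder(enc_prev), self.reid_decoder(enc_cur)
        # detection branch: convolve previous-frame features, splice with current-frame ones
        heatmap = self.heatmap_head(torch.cat([self.prev_conv(det_prev), det_cur], dim=1)).sigmoid()
        # ReID branch: these feature maps feed the temporal enhancement module of step S3
        return heatmap, self.reid_head(reid_prev), self.reid_head(reid_cur)
```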
Step S3: constructing a ReID task module that uses temporal information to improve the model structure. The invention provides two improved model structures in which the ReID task module is built with temporal information; Fig. 2 shows the first of them.
The first method of building the ReID task module with temporal information to improve the model structure is divided into a training stage and an inference stage. The training stage specifically comprises the following steps:
S311, obtaining, from the annotation information of the data set, the features at the corresponding positions of the previous-frame feature map, and computing the similarity between these features and the current feature map to obtain the feature distance between each object of the previous frame and each point of the current frame;
In the training stage, annotation information, namely the number of objects present in both adjacent frames and their corresponding position indices, is fed in as auxiliary supervision. From this information the features at the corresponding positions of the previous-frame feature map are obtained, and their similarity to the current feature map is computed to give the feature distance between each object of the previous frame and each point of the current frame; the point with the minimum distance is kept and regarded as the likely position of that previous-frame object in the current frame image.
S312, performing feature fusion once the pairwise position correspondences between the previous-frame feature map and the current feature map have been obtained;
After the pairwise position correspondences have been obtained, feature fusion is carried out; the fusion mode chosen in this step is to add the corresponding feature matrices and take their average. A sketch of this training-stage enhancement is given below.
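A minimal sketch of this training-stage enhancement follows, assuming that similarity is measured with normalized dot products (cosine similarity) and that fusion is the simple average described above; tensor shapes and function names are illustrative.

```python
import torch
import torch.nn.functional as F

def enhance_reid_train(prev_feat, cur_feat, prev_centers):
    """Training-stage temporal enhancement sketch (illustrative, not the patent's code).

    prev_feat, cur_feat: (C, H, W) ReID feature maps of the previous / current frame.
    prev_centers: annotated (y, x) center positions of objects in the previous frame.
    """
    C, H, W = cur_feat.shape
    cur_flat = F.normalize(cur_feat.reshape(C, -1), dim=0)   # (C, H*W), unit-norm columns
    enhanced = cur_feat.clone()
    for (y, x) in prev_centers:
        obj = F.normalize(prev_feat[:, y, x], dim=0)          # previous-frame object feature
        sim = obj @ cur_flat                                   # similarity to every current location
        idx = int(sim.argmax())                                # minimum feature distance = maximum similarity
        yy, xx = divmod(idx, W)                                # matched position in the current frame
        # fusion: average the previous-frame feature with the matched current feature
        enhanced[:, yy, xx] = 0.5 * (cur_feat[:, yy, xx] + prev_feat[:, y, x])
    return enhanced
```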
The inference stage specifically comprises the following steps:
S321, using the heat map to obtain the targets that may exist in the previous frame, and taking the ReID features at the positions corresponding to these targets as one of the inputs of the feature module;
In the inference stage of the model there is no annotation information, so the previous-frame heat map produced by the model is used as auxiliary information: the targets that may exist in the previous frame are obtained from the heat map, and the ReID features at the positions corresponding to these targets are used as one of the inputs of the feature module.
S322, setting a threshold: if the distance between a previous-frame center point and the matched current-frame center point exceeds the threshold, the match is considered unreliable and is ignored; only high-confidence matching points are kept and fused with the current feature map;
In the inference stage, an object that appears in the previous frame may have disappeared in the current frame, so a distance constraint is added and a threshold is set: if the distance between a previous-frame center point and the matched current-frame center point is too large and exceeds the threshold, the match is considered unreliable and ignored; only high-confidence matching points are retained, and feature fusion consistent with the training stage produces the final output of the ReID branch. A sketch of this inference-stage variant is given below.
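The inference-stage variant can be sketched in the same style, with candidate centers taken from the previous-frame heat map and a center-point distance constraint; the score and distance thresholds here are placeholders, not values from the patent.

```python
import torch
import torch.nn.functional as F

def enhance_reid_infer(prev_feat, cur_feat, prev_heatmap, score_thr=0.4, dist_thr=50.0):
    """Inference-stage variant of the enhancement above (illustrative sketch).

    prev_heatmap: (H, W) heat map of the previous frame from the detection branch.
    Candidate centers are heat-map responses above score_thr; matches whose center
    displacement exceeds dist_thr pixels are treated as unreliable and skipped.
    """
    C, H, W = cur_feat.shape
    cur_flat = F.normalize(cur_feat.reshape(C, -1), dim=0)
    enhanced = cur_feat.clone()
    ys, xs = torch.nonzero(prev_heatmap > score_thr, as_tuple=True)
    for y, x in zip(ys.tolist(), xs.tolist()):
        obj = F.normalize(prev_feat[:, y, x], dim=0)
        idx = int((obj @ cur_flat).argmax())
        yy, xx = divmod(idx, W)
        # distance constraint: ignore matches whose center moved too far to be credible
        if ((yy - y) ** 2 + (xx - x) ** 2) ** 0.5 > dist_thr:
            continue
        enhanced[:, yy, xx] = 0.5 * (cur_feat[:, yy, xx] + prev_feat[:, y, x])
    return enhanced
```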
In the second method, the ReID task module built with temporal information improves the model structure by performing feature fusion of the heat map of the previous frame, the feature map of the previous frame and the feature map of the current frame; Fig. 3 shows this second improved model structure.
The second method takes three main inputs carrying temporal information, namely the previous-frame heat map and the feature maps of the adjacent frames; the three are spliced along the channel dimension and, after the multi-layer convolution of the ReID branch, used as the final output of that branch, as sketched below.
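A sketch of this second fusion scheme, assuming arbitrary channel sizes and a two-layer convolution after the splice, might look like this:

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Sketch of the second enhancement scheme: splice the previous-frame heat map,
    previous-frame features and current-frame features along the channel dimension,
    then apply a multi-layer convolution to produce the ReID branch output.
    Channel sizes are illustrative assumptions."""

    def __init__(self, feat_ch=64, heat_ch=1, reid_dim=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * feat_ch + heat_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, reid_dim, 3, padding=1),
        )

    def forward(self, prev_heatmap, prev_feat, cur_feat):
        # channel splicing of the three inputs, followed by the fusion convolutions
        x = torch.cat([prev_heatmap, prev_feat, cur_feat], dim=1)
        return self.fuse(x)
```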
In the training stage of the model, the previous-frame heat map input is also supplied with label information; in the inference stage, the heat map produced by the detection part of the model is used as the input of the ReID branch. The following formulas give the specific loss functions used to train the model:
[Formulas (1) to (4) are rendered as images in the original publication.]
Expression (1) is the training loss function of the heat map branch; expressions (2) and (3) are the loss functions of the detection box size and of the ReID branch, respectively; and expression (4) is the combination of the losses for the final model training, with the corresponding weight of each part.
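Because the four formulas appear only as images in the published text, and because the method takes FairMOT (itself built on CenterNet) as its reference structure, the standard forms of these losses are sketched below as an assumption; the exact expressions and weights used by the inventors may differ.

```latex
% Assumed forms following CenterNet/FairMOT conventions, not the patent's exact formulas.
% (1) Heat map branch: pixel-wise focal loss on the predicted heat map \hat{M}.
L_{\mathrm{heat}} = -\frac{1}{N}\sum_{xy}
  \begin{cases}
    (1-\hat{M}_{xy})^{\alpha}\log\hat{M}_{xy}, & M_{xy}=1,\\
    (1-M_{xy})^{\beta}\,\hat{M}_{xy}^{\alpha}\log(1-\hat{M}_{xy}), & \text{otherwise.}
  \end{cases}

% (2) Box branch: L1 loss on the box size s_i and center offset o_i of the N objects.
L_{\mathrm{box}} = \sum_{i=1}^{N}\bigl(\lVert s_i-\hat{s}_i\rVert_1 + \lVert o_i-\hat{o}_i\rVert_1\bigr)

% (3) ReID branch: softmax cross-entropy over the K identity labels.
L_{\mathrm{reid}} = -\sum_{i=1}^{N}\sum_{k=1}^{K} y_i(k)\,\log p_i(k)

% (4) Total loss: uncertainty-weighted combination with learnable weights w_1, w_2.
L_{\mathrm{total}} = \tfrac{1}{2}\Bigl(e^{-w_1}\,(L_{\mathrm{heat}}+L_{\mathrm{box}})
  + e^{-w_2}\,L_{\mathrm{reid}} + w_1 + w_2\Bigr)
```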
Step S4: post-processing and inference of the model, applying the improved model structure to the data-association matching process of multi-target tracking. This step mainly comprises the following sub-steps:
The post-processing part of the model is roughly consistent with the original FairMOT, and data association is accomplished mainly by DeepSORT. Compared with single-class multi-target tracking, the post-processing is adjusted for multiple classes: unlike the training phase, IDs are not assigned per class during post-processing, i.e. all classes are assigned IDs together in the order in which objects are detected.
S41, taking three frames as one round: for the first frame, normalizing and standardizing the heat map and ReID features produced by the model, applying non-maximum suppression to the heat map, screening out candidate objects according to a set threshold, and assigning IDs to the objects of the first frame;
S42, repeating the first-frame operations on the second frame to obtain candidate objects, matching them with the first-frame objects by bounding-box IoU, keeping detections that meet expectations and giving them the same IDs, and retaining unmatched objects;
The inference part uses DeepSORT as the main pipeline and takes three frames as one round. For the first frame, the heat map and ReID features produced by the model are normalized and standardized, non-maximum suppression is applied to the heat map, candidate objects are screened out according to a set threshold, and IDs are assigned to the first-frame objects. The first-frame operations are then repeated on the second frame to obtain candidate objects, which are matched with the first-frame objects by bounding-box IoU; detections that meet expectations are kept and given the same IDs, and unmatched objects are retained.
S43, for the third frame, adding ReID features on top of the second-frame procedure, computing the cosine distance between the ReID features of detected targets in adjacent frames, performing motion prediction with a Kalman filter, and carrying out data association by combining appearance and motion features;
S44, computing the IoU between the objects left unmatched in the third frame and the objects of the previous frame, regarding an object as a new target and giving it a new ID if the IoU is below a fixed threshold, and repeating these steps for every frame to complete the post-processing of video multi-target tracking. A simplified sketch of this association logic is given below.
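As an illustration of the association logic in steps S41 to S44, the sketch below matches existing tracks to new detections with bounding-box IoU alone (as in the second frame of a round) or with a mixture of IoU and ReID cosine distance (from the third frame on). Kalman-filter motion prediction, which would update each track's box before matching as in DeepSORT, is omitted for brevity; all names, weights and thresholds are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, reid_weight=0.0, cost_thr=0.7):
    """Match tracks to detections with the Hungarian algorithm.

    Each track/detection is a dict with a 'box' ([x1, y1, x2, y2]) and, when
    reid_weight > 0, an L2-normalised 'embedding'. reid_weight == 0 gives the
    pure IoU matching of the second frame; reid_weight > 0 mixes in the ReID
    cosine distance used from the third frame on. Returns matched index pairs
    plus unmatched track and detection indices (the latter become new IDs).
    """
    if not tracks or not detections:
        return [], list(range(len(tracks))), list(range(len(detections)))
    cost = np.zeros((len(tracks), len(detections)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(detections):
            c = 1.0 - iou(t["box"], d["box"])
            if reid_weight > 0.0:
                cos_dist = 1.0 - float(np.dot(t["embedding"], d["embedding"]))
                c = (1.0 - reid_weight) * c + reid_weight * cos_dist
            cost[i, j] = c
    rows, cols = linear_sum_assignment(cost)
    matches = [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] < cost_thr]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_tracks = [i for i in range(len(tracks)) if i not in matched_t]
    unmatched_dets = [j for j in range(len(detections)) if j not in matched_d]
    return matches, unmatched_tracks, unmatched_dets
```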
Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims. It should be understood that features described in different dependent claims and herein may be combined in ways different from those described in the original claims. It is also to be understood that features described in connection with individual embodiments may be used in other described embodiments.

Claims (9)

1. A video multi-target tracking method based on deep learning and time sequence feature enhancement is characterized by comprising the following steps:
s1, preparing and processing a data set, and taking the processed data as input data of model training and testing;
s2, separating target detection and ReID tasks in the model structure;
s3, constructing a ReID task module improvement model structure by using the time sequence information;
and S4, post-processing reasoning of the model, and applying the improved model structure to a data association matching process of multi-target tracking.
2. The method for video multi-target tracking based on deep learning and temporal feature enhancement as claimed in claim 1, wherein the step S1 specifically includes the following steps:
S11, collecting a set of unmanned aerial vehicle video sequences as the data set;
S12, annotating the data set in COCO format, which provides the frame number, the target ID, the coordinates of the top-left vertex of the bounding box, the width and height of the bounding box, whether the target is occluded, and whether the target should be ignored;
S13, counting the IDs of the data set per class;
S14, performing rotation and scaling on each image in the data set.
3. The method for video multi-target tracking based on deep learning and temporal feature enhancement as claimed in claim 1, wherein the step S2 specifically includes the following steps:
S21, changing the decoder of the model's backbone network into two decoders with the same structure, used for the target detection and ReID tasks respectively;
S22, changing the model input to double-frame input, sharing parameters between the two frame images, and extracting features through an encoder;
S23, feeding the extracted features simultaneously into the two structurally identical decoders to perform the target detection and ReID tasks respectively.
4. The method for video multi-target tracking based on deep learning and temporal feature enhancement as claimed in claim 3, wherein the step S23 is specifically as follows: in the target detection part, the previous-frame features produced by a decoder first pass through a multi-layer convolution, the resulting feature map is concatenated with the current-frame features produced by the decoder, and the output of the target detection branch is finally obtained through a heat map branch; in the ReID task part, a feature enhancement module is added, the adjacent-frame features produced by the decoder and the heat map of the previous frame are used as the inputs of this module, and the output of the ReID task branch is obtained after the module integrates this information.
5. The video multi-target tracking method based on deep learning and time series feature enhancement as claimed in claim 1, wherein the step S3 is specifically divided into a training phase and an inference phase.
6. The video multi-target tracking method based on deep learning and time sequence feature enhancement as claimed in claim 5, wherein the training stage specifically comprises the following steps:
S311, obtaining, from the annotation information of the data set, the features at the corresponding positions of the previous-frame feature map, and computing the similarity between these features and the current feature map to obtain the feature distance between each object of the previous frame and each point of the current frame;
S312, performing feature fusion once the pairwise position correspondences between the previous-frame feature map and the current feature map have been obtained.
7. The video multi-target tracking method based on deep learning and time series feature enhancement as claimed in claim 5, wherein the inference stage specifically comprises the following steps:
S321, using the heat map to obtain the targets that may exist in the previous frame, and taking the ReID features at the positions corresponding to these targets as one of the inputs of the feature module;
S322, setting a threshold: if the distance between a previous-frame center point and the matched current-frame center point exceeds the threshold, the match is considered unreliable and is ignored; only high-confidence matching points are kept and fused with the current feature map.
8. The method as claimed in claim 1, wherein the step S3 is specifically to perform feature fusion on the heat map of the previous frame, the feature map of the previous frame, and the feature map of the current frame.
9. The method for video multi-target tracking based on deep learning and temporal feature enhancement as claimed in claim 1, wherein the step S4 specifically includes the following steps:
S41, taking three frames as one round: for the first frame, normalizing and standardizing the heat map and ReID features produced by the model, applying non-maximum suppression to the heat map, screening out candidate objects according to a set threshold, and assigning IDs to the objects of the first frame;
S42, repeating the first-frame operations on the second frame to obtain candidate objects, matching them with the first-frame objects by bounding-box IoU, keeping detections that meet expectations and giving them the same IDs, and retaining unmatched objects;
S43, for the third frame, adding ReID features on top of the second-frame procedure, computing the cosine distance between the ReID features of detected targets in adjacent frames, performing motion prediction with a Kalman filter, and carrying out data association by combining appearance and motion features;
S44, computing the IoU between the objects left unmatched in the third frame and the objects of the previous frame, regarding an object as a new target and giving it a new ID if the IoU is below a fixed threshold, and repeating these steps for every frame to complete the post-processing of video multi-target tracking.
CN202210632698.3A 2022-06-06 2022-06-06 Video multi-target tracking method based on deep learning and time sequence feature enhancement Pending CN115035159A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210632698.3A CN115035159A (en) 2022-06-06 2022-06-06 Video multi-target tracking method based on deep learning and time sequence feature enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210632698.3A CN115035159A (en) 2022-06-06 2022-06-06 Video multi-target tracking method based on deep learning and time sequence feature enhancement

Publications (1)

Publication Number Publication Date
CN115035159A (en) 2022-09-09

Family

ID=83122322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210632698.3A Pending CN115035159A (en) 2022-06-06 2022-06-06 Video multi-target tracking method based on deep learning and time sequence feature enhancement

Country Status (1)

Country Link
CN (1) CN115035159A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116311003A (en) * 2023-05-23 2023-06-23 澳克多普有限公司 Video detection method and system based on dual-channel loading mechanism
CN116311003B (en) * 2023-05-23 2023-08-01 澳克多普有限公司 Video detection method and system based on dual-channel loading mechanism
CN117011343A (en) * 2023-08-09 2023-11-07 北京航空航天大学 Optical flow guiding multi-target tracking method for crowded scene
CN117011343B (en) * 2023-08-09 2024-04-05 北京航空航天大学 Optical flow guiding multi-target tracking method for crowded scene

Similar Documents

Publication Publication Date Title
Gonçalves et al. Real-time automatic license plate recognition through deep multi-task networks
CN111767882A (en) Multi-mode pedestrian detection method based on improved YOLO model
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
CN112836640B (en) Single-camera multi-target pedestrian tracking method
CN111639564B (en) Video pedestrian re-identification method based on multi-attention heterogeneous network
US11900646B2 (en) Methods for generating a deep neural net and for localising an object in an input image, deep neural net, computer program product, and computer-readable storage medium
EP3254238A1 (en) Method for re-identification of objects
CN115035159A (en) Video multi-target tracking method based on deep learning and time sequence feature enhancement
Wang et al. Traffic sign detection using a cascade method with fast feature extraction and saliency test
CN111626090B (en) Moving target detection method based on depth frame difference convolutional neural network
Czyżewski et al. Multi-stage video analysis framework
CN113269133A (en) Unmanned aerial vehicle visual angle video semantic segmentation method based on deep learning
Yin Object Detection Based on Deep Learning: A Brief Review
CN116824641B (en) Gesture classification method, device, equipment and computer storage medium
CN111931572B (en) Target detection method for remote sensing image
Zhang et al. Small target detection based on squared cross entropy and dense feature pyramid networks
Kalva et al. Smart Traffic monitoring system using YOLO and deep learning techniques
CN113095199A (en) High-speed pedestrian identification method and device
CN117437691A (en) Real-time multi-person abnormal behavior identification method and system based on lightweight network
CN115761552A (en) Target detection method, system, equipment and medium for airborne platform of unmanned aerial vehicle
CN114937248A (en) Vehicle tracking method and device for cross-camera, electronic equipment and storage medium
CN114927236A (en) Detection method and system for multiple target images
CN115100565A (en) Multi-target tracking method based on spatial correlation and optical flow registration
CN114663953A (en) Facial expression recognition method based on facial key points and deep neural network
Alomari et al. Smart real-time vehicle detection and tracking system using road surveillance cameras

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination