CN110956643A - Improved vehicle tracking method and system based on MDNet

Info

Publication number
CN110956643A
Authority
CN
China
Prior art keywords
mdnet
target
tracking
samples
candidate
Prior art date
2019-12-04
Legal status
Pending
Application number
CN201911227267.3A
Other languages
Chinese (zh)
Inventor
Li Aimin (李爱民)
Wang Jianwen (王建文)
Pang Yewen (逄业文)
Current Assignee
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date
2019-12-04
Filing date
2019-12-04
Publication date
2020-04-03
Application filed by Qilu University of Technology
Priority to CN201911227267.3A
Publication of CN110956643A
Legal status: Pending

Classifications

    • G06T7/215 Motion-based segmentation
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06T7/11 Region-based segmentation
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/20104 Interactive definition of region of interest [ROI]
    • G06T2207/30252 Vehicle exterior; Vicinity of vehicle

Abstract

The invention provides an improved vehicle tracking method and system based on MDNet. The improved MDNet tracking algorithm first applies Mask RCNN to perform instance segmentation on the video frame, and the candidate regions obtained by instance segmentation are used as the input of the improved MDNet algorithm, so that the foreground tracking target is strengthened, the tracking range is narrowed, and the background and the target can be clearly distinguished, improving both tracking real-time performance and tracking accuracy. Meanwhile, training and testing of the improved MDNet tracking algorithm are performed online, and the smaller network structure made possible by instance segmentation gives the improved MDNet tracking algorithm better robustness in target tracking.

Description

Improved vehicle tracking method and system based on MDNet
Technical Field
The disclosure relates to the technical field of vehicle tracking, in particular to an improved vehicle tracking method and system based on MDNet.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Computer vision is one of the most active disciplines in the field of artificial intelligence and has received wide attention from scholars at home and abroad. Visual target tracking, an important research branch of computer vision, has attracted particular interest from vision researchers. Since 2015, deep learning has entered the target tracking field: with deep learning, the features of a target can be extracted better, the target can be represented better, large changes in target appearance can be handled, tracker drift can be prevented, and the target region can be localized. Visual target tracking technology has been widely applied in many areas of civilian life and military affairs. Vehicle target tracking is a key problem in intelligent transportation research: an intelligent traffic system performs tasks such as traffic flow control and detection of vehicle traffic violations based on the acquired video images. Accurate detection and tracking of vehicle targets is therefore of great significance for traffic safety and intelligent vehicle management.
The inventors of the present disclosure have found that the algorithms currently in common use for tracking a moving vehicle mainly comprise optical-flow-based, motion-estimation-based, recognition-based, and deep-learning-based target tracking methods, and that the central difficulty of vehicle target tracking research is how to guarantee the robustness, real-time performance, and accuracy of the algorithm. Existing tracking algorithms perform well when tracking a moving vehicle against a simple background, but owing to the complexity of target motion and the time-varying nature of target features, the tracking effect deteriorates when the tracked target is occluded, rotated, changed in scale, or interfered with by the background, and a robust tracking result is difficult to obtain.
Disclosure of Invention
To overcome the deficiencies of the prior art, the present disclosure provides an improved vehicle tracking method and system based on MDNet, which improve the discrimination between background and foreground targets and thereby improve the real-time performance, accuracy, and robustness of vehicle tracking.
To achieve this purpose, the present disclosure adopts the following technical scheme:
the first aspect of the present disclosure provides an improved MDNet-based vehicle tracking method.
An improved vehicle tracking method based on MDNet, comprising the steps of:
preprocessing an acquired video sequence and inputting it into a Mask R-CNN neural network for instance segmentation to obtain candidate regions of the vehicle target to be tracked;
using the obtained candidate regions of the vehicle target to be tracked as input, performing target tracking with the MDNet network, specifically:
when predicting the target state in each frame of the video sequence, generating a number of candidate-region positive and negative samples following a Gaussian distribution around the target position predicted in the previous frame, then scoring the positive and negative samples with the MDNet network, and taking the candidate-region sample with the highest target score as the current optimal target state;
and averaging the scores of all the candidate-region samples and comparing the average with a preset threshold: when the average score is greater than the preset threshold, the target tracking is judged successful; otherwise, it is judged failed.
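For illustration only, a minimal Python sketch of this per-frame decision rule follows; the function name, the NumPy representation, and the 0.5 default threshold are assumptions rather than part of the claimed method (the embodiment below uses 0.5 as its failure threshold).

```python
import numpy as np

def judge_tracking(candidate_scores, threshold=0.5):
    """Per-frame decision sketched from the steps above: the highest-
    scoring candidate region becomes the current optimal target state,
    and tracking succeeds only if the average candidate score exceeds
    the preset threshold."""
    candidate_scores = np.asarray(candidate_scores)
    best_index = int(np.argmax(candidate_scores))   # optimal target state
    success = candidate_scores.mean() > threshold   # average vs. threshold
    return best_index, success
```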
As some possible implementations, a stochastic gradient descent method is used to train the convolutional neural network of MDNet: each time one video sequence is iterated, N1 frames are taken in turn, and in those N1 frames each frame contributes the bounding boxes of M1 positive samples and M2 negative samples, for a total of N1*M1 positive samples and N1*M2 negative samples; all the positive and negative samples form one mini-batch, and the segmentation results obtained by instance segmentation are resized uniformly to A*A as the input of the network.
As some possible implementations, the MDNet neural network uses the RReLU activation function and sets a dynamically changing learning rate according to the number of training rounds: a larger learning rate is used at the start of training, when far from the optimal solution, and the learning rate is gradually reduced as the number of iterations grows and the optimal solution is approached.
As some possible implementations, all the candidate-region samples are averaged to generate the target bounding box of the current frame; if tracking succeeds, the bounding box is fine-tuned, a number of positive and negative sample regions are generated from the target bounding box predicted for the current frame, forward propagation is performed on these sample regions separately, and the third-convolutional-layer features of these regions are saved.
As a further limitation, if the number of stored video frames exceeds a first preset number, the positive sample regions of the earliest frames beyond that number are discarded, and if it exceeds a second preset number, the negative sample regions of the earliest frames beyond that number are discarded.
As a further limitation, if tracking fails, a short-term update is performed: the positive and negative samples of the most recent preset number of frames are selected, and a preset number of rounds of iterative training are then performed;
at each iteration, the third-convolutional-layer features of S positive samples and T1 negative samples are randomly drawn to form a mini-batch, and the T1 negative samples are put through the MDNet network for a preset number of passes to compute their scores;
the T2 negative samples with the largest computed target scores are then selected from the T1 negative samples as hard negative samples, the scores of the positive samples and of the hard negative samples are computed separately, the loss is computed by forward propagation, and the parameters of the MDNet network are optimized.
As a further limitation, the video sequence is input into the Mask R-CNN neural network for instance segmentation as follows:
after the video sequence is input into the network, the corresponding feature map is obtained, and a number of candidate recognition regions are obtained in the feature map;
the candidate recognition regions are sent to an RPN for binary classification, and the candidate recognition regions that do not meet the requirements are filtered out;
and the RPN outputs the coordinates of some of the candidate recognition regions, which are input to ROI Pooling to output a feature map of size B*B for classification and localization, while an ROIAlign operation is performed on the remaining candidate recognition regions.
A second aspect of the present disclosure provides an improved MDNet based vehicle tracking system.
An improved vehicle tracking system based on MDNet, comprising:
an instance segmentation module configured to: preprocess an acquired video sequence and input it into a Mask R-CNN neural network for instance segmentation to obtain candidate regions of the vehicle target to be tracked; and
a target tracking module configured to: use the obtained candidate regions of the vehicle target to be tracked as input and perform target tracking with the MDNet network, specifically:
when predicting the target state in each frame of the video sequence, generate a number of candidate-region positive and negative samples following a Gaussian distribution around the target position predicted in the previous frame, then score the positive and negative samples with the MDNet network, and take the candidate-region sample with the highest target score as the current optimal target state;
and average the scores of all the candidate-region samples and compare the average with a preset threshold: when the average score is greater than the preset threshold, the target tracking is judged successful; otherwise, it is judged failed.
A third aspect of the present disclosure provides a medium having a program stored thereon, the program, when executed by a processor, implementing the steps in the MDNet-based improved vehicle tracking method according to the first aspect of the present disclosure.
A fourth aspect of the present disclosure provides an electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, the processor implementing the steps in the MDNet-based improved vehicle tracking method according to the first aspect of the present disclosure when executing the program.
Compared with the prior art, the beneficial effect of this disclosure is:
1. The present disclosure performs instance segmentation on the video frame and then uses the segmented foreground vehicle region as the input of the MDNet vehicle target tracking stage for subsequent tracking. Because instance segmentation yields a smaller tracking region, the network structure adopted by the disclosure can be correspondingly smaller, background and foreground targets are easier to distinguish, and tracking real-time performance and accuracy are improved.
2. The MDNet tracking algorithm is improved: the improved MDNet tracking algorithm first applies Mask RCNN to perform instance segmentation on the video frame and uses the candidate regions obtained by instance segmentation as the input of the improved MDNet tracking algorithm, which strengthens the foreground tracking target, narrows the tracking range, allows the background and the target to be clearly distinguished, and improves tracking real-time performance and accuracy.
3. The network architecture adopted by the present disclosure contains five hidden layers: three convolutional layers (conv1-conv3) and two fully connected layers (fc4-fc5). This smaller network structure achieves a more robust effect in target tracking.
Drawings
Fig. 1 is a schematic diagram of an overall implementation of the MDNet-based improved vehicle tracking method provided in embodiment 1 of the present disclosure.
Fig. 2 is a schematic flowchart of the Mask RCNN algorithm provided in embodiment 1 of the present disclosure.
Fig. 3 is a schematic diagram of an instance segmentation result provided in embodiment 1 of the present disclosure.
Fig. 4 is a schematic flowchart of the improved MDNet algorithm provided in embodiment 1 of the present disclosure.
Fig. 5 is a schematic diagram of a vehicle target tracking result provided in embodiment 1 of the present disclosure.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
Example 1:
As shown in fig. 1, embodiment 1 of the present disclosure provides an improved vehicle tracking method based on MDNet, in which Mask RCNN performs instance segmentation on the video frame and the candidate regions obtained by instance segmentation are used as the input of the improved MDNet algorithm, thereby strengthening the foreground tracking target, narrowing the tracking range, and distinguishing the background from the target more clearly; training and testing are performed online.
The adopted network architecture receives 107 × 107 RGB input and comprises five hidden layers: three convolutional layers (conv1-conv3) and two fully connected layers (fc4-fc5). This embodiment adopts a smaller network structure, which gives the network a more robust effect in target tracking.
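For illustration, a sketch of such a five-hidden-layer network in PyTorch follows. The patent fixes only the 107 × 107 RGB input, the conv1-conv3/fc4-fc5 layout, and the RReLU activation described later; the filter sizes and channel counts here follow the original MDNet paper's VGG-M-style layers and should be read as assumptions, not as the claimed design.

```python
import torch
import torch.nn as nn

class SmallMDNet(nn.Module):
    """Sketch of the five-hidden-layer network described above: three
    conv layers (conv1-conv3) and two fully connected layers (fc4-fc5)
    on 107x107 RGB input.  Filter sizes are VGG-M-style assumptions."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.RReLU(),   # conv1
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.RReLU(), # conv2
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 512, kernel_size=3, stride=1), nn.RReLU(),# conv3 (features cached for updates)
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 3 * 3, 512), nn.RReLU(),   # fc4
            nn.Linear(512, 512), nn.RReLU(),           # fc5
            nn.Linear(512, 2),                         # binary target/background scores
        )

    def forward(self, x):          # x: (N, 3, 107, 107)
        return self.fc(self.features(x))
```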
The method comprises the following specific steps:
step 1: preprocessing operations such as data labeling are performed on the video sequence, and the labeled video sequence is input into the neural network, as shown in fig. 2.
Step 2: after input, the corresponding feature map is obtained, and a number of candidate ROIs (regions of interest) are obtained in the feature map. The candidate ROIs are then sent to the RPN network for binary classification, and some of the candidate ROIs are filtered out.
The RPN proposes the coordinates of a number of ROIs in the form [x, y, w, h]; these are input to ROI Pooling (region-of-interest pooling), which outputs 7 × 7 feature maps for classification and localization. The purpose of ROI Pooling is to adjust ROIs of different sizes uniformly to smaller 7 × 7 feature maps. The remaining ROIs then undergo an ROIAlign (region-of-interest alignment) operation, which better resolves the region-mismatch problem of the ROI Pooling operation.
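As an aside, the difference between the two pooling operations can be sketched with torchvision's stock implementations; the feature-map size, stride, and box coordinates below are illustrative only.

```python
import torch
from torchvision.ops import roi_pool, roi_align

# Backbone feature map and one ROI in image coordinates (x1, y1, x2, y2);
# the stride-16 scale and all values are illustrative assumptions.
fmap = torch.randn(1, 256, 50, 50)
rois = [torch.tensor([[33.7, 14.2, 271.9, 250.1]])]

# ROI Pooling quantizes the ROI to the feature grid before pooling ...
pooled = roi_pool(fmap, rois, output_size=(7, 7), spatial_scale=1 / 16)

# ... whereas ROIAlign keeps fractional coordinates and samples
# bilinearly, avoiding the region mismatch noted above.
aligned = roi_align(fmap, rois, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape, aligned.shape)   # both: torch.Size([1, 256, 7, 7])
```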
Mask RCNN adopts an average binary cross-entropy loss function, and the loss function of Mask R-CNN can be described as follows:
$$L_{final} = L(\{p_i\}, \{t_i\}) + (L_{cls} + L_{box} + L_{mask}) \qquad (1)$$
where $L_{cls}$ and $L_{box}$ are the classification and regression losses, and $L_{mask}$ classifies each pixel; the mask branch output has $K \cdot m^2$ dimensions, K representing the number of classes and m being the size of the extracted ROI image.
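A hedged sketch of how the terms of equation (1) can be combined, with per-pixel average binary cross-entropy for the mask branch; the tensor shapes and function names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def mask_rcnn_loss(cls_logits, cls_targets, box_preds, box_targets,
                   mask_logits, mask_targets):
    """Combine the per-ROI terms of equation (1).  Shapes assumed:
    cls_logits (N, C), cls_targets (N,), box_preds/box_targets (N, 4),
    mask_logits (N, K, m, m), mask_targets (N, m, m) binary."""
    l_cls = F.cross_entropy(cls_logits, cls_targets)
    l_box = F.smooth_l1_loss(box_preds, box_targets)
    # L_mask: average binary cross-entropy over the m x m mask taken
    # from the channel of each ROI's ground-truth class, so classes do
    # not compete per pixel (the K*m^2-dimensional output above)
    n = mask_logits.size(0)
    picked = mask_logits[torch.arange(n), cls_targets]          # (N, m, m)
    l_mask = F.binary_cross_entropy_with_logits(picked, mask_targets.float())
    return l_cls + l_box + l_mask
```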
Finally, these ROIs are classified and masks are generated, yielding the segmented video frame sequence shown in fig. 3.
The loss function for training the RPN is described as follows:
$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*) \qquad (2)$$
In the above formula, i is the index of an anchor within a mini-batch, $p_i$ is the predicted probability that anchor i is a target, $p_i^*$ is the corresponding label, $t_i$ holds the four parameters of the predicted box, $t_i^*$ holds the parameters of the calibrated ground-truth box, $L_{cls}$ is the classification loss function, and $L_{reg}$ is the regression loss function.
The back-propagation of ROIAlign is described as follows:
$$\frac{\partial L}{\partial x_i} = \sum_r \sum_j \big[ d(i, i^*(r, j)) < 1 \big] (1 - \Delta h)(1 - \Delta w) \frac{\partial L}{\partial y_{rj}} \qquad (3)$$
where $d(\cdot)$ denotes the distance between two points, and $\Delta h$ and $\Delta w$ denote the differences between the ordinate and the abscissa of $x_i$ and $x_{i^*(r,j)}$.
Step 3: an instance segmentation operation is performed on the targets in the video image using Mask RCNN to obtain the candidate-region information of the vehicle target to be tracked.
The result obtained in step 2 is used as the input of the improved MDNet algorithm, as shown in fig. 4, which makes it easier to improve tracking efficiency and to distinguish the tracked target from the background effectively. This approach avoids the situation in which the spatial position information of the target is diluted as the network deepens, causing the tracking effect to deteriorate or even the tracked target to be lost.
The vehicle candidate region obtained by instance segmentation is much smaller than the original image, so tracking can be achieved with a smaller network depth. Through a multi-domain learning framework, MDNet separates domain-independent representation information from domain-specific information.
The CNN employed is trained by stochastic gradient descent (SGD), with each domain processed individually in each iteration. Because SGD is used for training, the video sequences are first shuffled. The original video sequence is arranged in frame order; each time one video sequence is iterated, 8 frames are taken in turn, and in those 8 frames each frame contributes the bounding boxes of 4 positive samples (IOU ≥ 0.7) and 12 negative samples (IOU ≤ 0.3), where IOU is the overlap ratio between a generated candidate box and the ground-truth box. The segmentation results obtained by instance segmentation are resized uniformly to 107 × 107 as the input of the network, and a mini-batch consists of the 32 positive samples and 96 negative samples.
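A minimal sketch of this mini-batch construction follows; `draw_candidates`, `iou`, and `crop107` are hypothetical helpers (an IOU implementation is sketched after the formula below), and the patent does not prescribe how candidate boxes are drawn.

```python
import numpy as np

def sample_minibatch(frames, gt_boxes, draw_candidates, iou, crop107):
    """One SGD mini-batch as described above: 8 frames taken in turn,
    4 positives (IOU >= 0.7) and 12 negatives (IOU <= 0.3) per frame,
    each patch resized to 107x107, giving 32 positives and 96
    negatives.  The three helpers are supplied by the caller."""
    pos, neg = [], []
    for f in range(8):
        cands = draw_candidates(gt_boxes[f])            # (C, 4) array of (x, y, w, h)
        overlaps = np.array([iou(c, gt_boxes[f]) for c in cands])
        pos += [crop107(frames[f], c) for c in cands[overlaps >= 0.7][:4]]
        neg += [crop107(frames[f], c) for c in cands[overlaps <= 0.3][:12]]
    return np.stack(pos), np.stack(neg)                 # 32 and 96 patches
```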
The IOU equation is defined as follows:
$$IOU = \frac{area(B_c \cap B_{gt})}{area(B_c \cup B_{gt})} \qquad (4)$$
where $B_c$ is the generated candidate box and $B_{gt}$ is the ground-truth box.
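Equation (4) translates directly into code; a sketch for (x, y, w, h) boxes, matching the candidate representation used above:

```python
def iou(box_a, box_b):
    """Overlap ratio of equation (4) for (x, y, w, h) boxes:
    intersection area divided by union area."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```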
the MDNet algorithm improvement method comprises the following steps: using the RReLU activation function, if the learning rate setting is too small, the entire network convergence process may become extremely slow; if the learning rate is set too large, the gradient may wander around the minimum and may even fail to converge to achieve the desired effect. In the present embodiment, the learning rate is not fixed, but a dynamically changing learning rate is set according to the number of training rounds. When training is started, a slightly larger learning rate is adopted when the distance from the optimal solution is far, and the learning rate is gradually reduced in the process of approaching the optimal solution along with the increase of the iteration times.
The activation function employed is expressed as follows:
$$f(x_{ji}) = \begin{cases} x_{ji}, & x_{ji} \ge 0 \\ a_{ji} x_{ji}, & x_{ji} < 0 \end{cases} \qquad (5)$$
where $a_{ji} \sim U(l, u)$, $l < u$ and $l, u \in [0, 1)$.
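PyTorch's nn.RReLU implements exactly the sampling of equation (5), and a step-decay scheduler is one concrete way to realize the dynamically decreasing learning rate; the l and u values, the schedule parameters, and the dummy model below are assumptions, since the patent fixes only the decay behaviour.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# RReLU as in equation (5): negative inputs are scaled by a_ji ~ U(l, u);
# l = 1/8 and u = 1/3 here are illustrative, the patent leaves them open.
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.RReLU(lower=0.125, upper=1.0 / 3.0),
    nn.Linear(512, 2),
)

# Dynamically decaying learning rate: relatively large while far from the
# optimum, shrinking as iterations accumulate.
opt = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
sched = optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.5)

for round_ in range(50):
    x = torch.randn(8, 512)              # dummy batch standing in for real features
    loss = model(x).pow(2).mean()        # placeholder loss for the sketch
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()                         # learning rate halves every 10 rounds
```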
Step 4: during target tracking, a simple network is maintained throughout, and two update methods, long-term updating and short-term updating, are adopted according to how quickly the target's appearance changes. Long-term updates are performed at regular intervals; short-term updates are performed when a potential tracking failure occurs, i.e., when the positive score of the predicted target is less than 0.5.
Step 5: when predicting the target state in each frame, N candidate samples are first drawn around the target of the previous frame, and the network then yields the positive-sample score $f^+(x^i)$ and the negative-sample score $f^-(x^i)$ for each sample. The sample with the largest score is taken as the current optimal target state $X^*$:
$$X^* = \arg\max_{x^i} f^+(x^i) \qquad (6)$$
Specifically: for each frame, 256 candidate regions following a Gaussian distribution are generated around the target position predicted in the previous frame, each generated candidate box being represented as (x, y, w, h). Each candidate box region is cropped from the original image and resized to 107 × 107 as the input of the network. The scores of the 256 candidate regions are computed by forward propagation, and the candidate region with the highest target score is selected. The candidate regions are averaged to generate the target bounding box of the current frame, and the average of the candidate-region scores is computed and compared with a preset threshold to determine whether tracking succeeded.
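A sketch of this prediction step follows; the Gaussian widths, the `crop107` helper, and averaging the top five candidates into the output box are assumptions (the text fixes only the 256 Gaussian candidates, the 107 × 107 resize, and the use of the mean score).

```python
import numpy as np
import torch

def predict_state(model, frame, prev_box, crop107, n=256):
    """Draw 256 Gaussian candidates (x, y, w, h) around the previous
    target, crop and resize each to 107x107, score them by forward
    propagation, and form the current box and mean score."""
    rng = np.random.default_rng()
    x, y, w, h = prev_box
    cands = np.stack([rng.normal(x, 0.1 * w, n),
                      rng.normal(y, 0.1 * h, n),
                      w * np.exp(rng.normal(0.0, 0.05, n)),
                      h * np.exp(rng.normal(0.0, 0.05, n))], axis=1)
    patches = torch.stack([crop107(frame, c) for c in cands])  # (256, 3, 107, 107)
    with torch.no_grad():
        scores = model(patches)[:, 1]                # target scores of all candidates
    top = torch.topk(scores, k=5).indices.numpy()
    box = cands[top].mean(axis=0)                    # averaged high-score candidates
    return box, scores.mean().item()                 # mean score vs. preset threshold
```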
Step 6: if tracking succeeds, bounding-box fine-tuning is performed, and 50 positive sample regions (IOU ≥ 0.7) and 200 negative sample regions (IOU ≤ 0.3) are generated from the target bounding box predicted for the current frame. These sample regions are then forward-propagated separately, and the conv3 (third convolutional layer) features of these regions are saved. If the number of stored video frames exceeds 100, the positive sample regions of the earliest frames are discarded; if it exceeds 20, the negative sample regions of the earliest frames are discarded.
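The 100-frame/20-frame retention policy maps naturally onto bounded queues; a sketch:

```python
from collections import deque

# Bounded feature stores implementing the retention policy above:
# positives are kept for the last 100 frames, negatives for the last 20.
pos_store = deque(maxlen=100)   # per frame: conv3 features of 50 positive regions
neg_store = deque(maxlen=20)    # per frame: conv3 features of 200 negative regions

def remember(pos_feats, neg_feats):
    """Append this frame's conv3 features; once a deque is full, the
    earliest frame's entries are discarded automatically."""
    pos_store.append(pos_feats)
    neg_store.append(neg_feats)
```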
If tracking fails, the aforementioned short-term update is performed. The positive and negative samples of the latest 20 frames are selected, and 15 rounds of iterative training are carried out, each iteration proceeding as before: the conv3 features of 32 positive samples and of 1024 negative samples are randomly drawn to form a mini-batch. The 1024 negative samples are put through the test model in 4 passes, their scores are computed, and the target scores are retained. The 96 negative samples with the largest computed target scores are then picked from the 1024 negative samples as hard negative samples. The training model is then invoked: the scores of the 32 positive samples and of the 96 hard negative samples are computed separately, the loss is computed by forward propagation, and the optimizer performs optimization and parameter updating.
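A sketch of this short-term update with hard negative mining; `model` here stands for the fully connected head applied to stored conv3 features (an assumption), and scoring the 1024 negatives in one pass rather than the 4 passes of 256 described above is a simplification.

```python
import torch
import torch.nn.functional as F

def short_term_update(model, optimizer, pos_feats, neg_feats, rounds=15):
    """Failure-recovery update sketched from the text above: per round,
    draw 32 positive and 1024 negative conv3 feature vectors, keep the
    96 negatives the model currently scores highest as hard negatives,
    and train on the resulting 32 + 96 mini-batch."""
    for _ in range(rounds):
        pos = pos_feats[torch.randperm(len(pos_feats))[:32]]
        neg = neg_feats[torch.randperm(len(neg_feats))[:1024]]
        with torch.no_grad():                        # test-mode scoring pass
            neg_scores = model(neg)[:, 1]            # target scores of negatives
        hard = neg[torch.topk(neg_scores, 96).indices]   # hardest 96 negatives
        batch = torch.cat([pos, hard])
        labels = torch.cat([torch.ones(len(pos)), torch.zeros(len(hard))]).long()
        loss = F.cross_entropy(model(batch), labels) # forward-propagation loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```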
Step 7: the tracked target is displayed, yielding the tracked video sequence shown in fig. 5.
Example 2:
Embodiment 2 of the present disclosure provides an improved vehicle tracking system based on MDNet, comprising:
an instance segmentation module configured to: preprocess an acquired video sequence and input it into a Mask R-CNN neural network for instance segmentation to obtain candidate regions of the vehicle target to be tracked; and
a target tracking module configured to: use the obtained candidate regions of the vehicle target to be tracked as input and perform target tracking with the MDNet network, specifically:
when predicting the target state in each frame of the video sequence, generate a number of candidate-region positive and negative samples following a Gaussian distribution around the target position predicted in the previous frame, then score the positive and negative samples with the MDNet network, and take the candidate-region sample with the highest target score as the current optimal target state;
and average the scores of all the candidate-region samples and compare the average with a preset threshold: when the average score is greater than the preset threshold, the target tracking is judged successful; otherwise, it is judged failed.
Example 3:
the embodiment 3 of the present disclosure provides a medium on which a program is stored, which when executed by a processor implements the steps in the MDNet-based improved vehicle tracking method according to the embodiment 1 of the present disclosure.
Example 4:
the embodiment 4 of the present disclosure provides an electronic device, which includes a memory, a processor, and a program stored on the memory and executable on the processor, and the processor executes the program to implement the steps in the MDNet-based improved vehicle tracking method according to the embodiment 1 of the present disclosure.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (10)

1. An improved vehicle tracking method based on MDNet, characterized by comprising the following steps:
preprocessing an acquired video sequence and inputting it into a Mask R-CNN neural network for instance segmentation to obtain candidate regions of the vehicle target to be tracked;
using the obtained candidate regions of the vehicle target to be tracked as input, performing target tracking with the MDNet network, specifically:
when predicting the target state in each frame of the video sequence, generating a number of candidate-region positive and negative samples following a Gaussian distribution around the target position predicted in the previous frame, then scoring the positive and negative samples with the MDNet network, and taking the candidate-region sample with the highest target score as the current optimal target state;
and averaging the scores of all the candidate-region samples and comparing the average with a preset threshold: when the average score is greater than the preset threshold, the target tracking is judged successful; otherwise, it is judged failed.
2. The MDNet-based improved vehicle tracking method of claim 1, wherein the convolutional neural network of the MDNet is trained by a stochastic gradient descent method; each time one video sequence is iterated, N1 frames are taken in turn, and in those N1 frames each frame contributes the bounding boxes of M1 positive samples and M2 negative samples, for a total of N1*M1 positive samples and N1*M2 negative samples; all the positive and negative samples form one mini-batch, and the segmentation results obtained by instance segmentation are resized uniformly to A*A as the input of the network.
3. The MDNet-based improved vehicle tracking method of claim 1, wherein the MDNet neural network uses an RReLU activation function and sets a dynamically changing learning rate according to the number of training rounds: a larger learning rate is used at the start of training, when far from the optimal solution, and the learning rate is gradually reduced as the number of iterations grows and the optimal solution is approached.
4. The MDNet-based improved vehicle tracking method of claim 1, wherein all the candidate-region samples are averaged to generate the target bounding box of the current frame; if tracking succeeds, the bounding box is fine-tuned, a number of positive and negative sample regions are generated from the target bounding box predicted for the current frame, forward propagation is performed on these sample regions separately, and the third-convolutional-layer features of these regions are saved.
5. The MDNet-based improved vehicle tracking method of claim 2, wherein if the number of stored video frames exceeds a first preset number, the positive sample regions of the earliest frames beyond that number are discarded, and if it exceeds a second preset number, the negative sample regions of the earliest frames beyond that number are discarded.
6. The MDNet-based improved vehicle tracking method of claim 2, wherein if tracking fails, a short-term update is performed: the positive and negative samples of the most recent preset number of frames are selected, and a preset number of rounds of iterative training are then performed;
at each iteration, the third-convolutional-layer features of S positive samples and T1 negative samples are randomly drawn to form a mini-batch, and the T1 negative samples are put through the MDNet network for a preset number of passes to compute their scores;
the T2 negative samples with the largest computed target scores are then selected from the T1 negative samples as hard negative samples, the scores of the positive samples and of the hard negative samples are computed separately, the loss is computed by forward propagation, and the parameters of the MDNet network are optimized.
7. The MDNet-based improved vehicle tracking method of claim 2, wherein the video sequence is input into the Mask R-CNN neural network for instance segmentation as follows:
after the video sequence is input into the network, the corresponding feature map is obtained, and a number of candidate recognition regions are obtained in the feature map;
the candidate recognition regions are sent to an RPN for binary classification, and the candidate recognition regions that do not meet the requirements are filtered out;
and the RPN outputs the coordinates of some of the candidate recognition regions, which are input to ROI Pooling to output a feature map of size B*B for classification and localization, while an ROIAlign operation is performed on the remaining candidate recognition regions.
8. An improved MDNet-based vehicle tracking system, comprising:
an instance segmentation module configured to: preprocess an acquired video sequence and input it into a Mask R-CNN neural network for instance segmentation to obtain candidate regions of the vehicle target to be tracked; and
a target tracking module configured to: use the obtained candidate regions of the vehicle target to be tracked as input and perform target tracking with the MDNet network, specifically:
when predicting the target state in each frame of the video sequence, generate a number of candidate-region positive and negative samples following a Gaussian distribution around the target position predicted in the previous frame, then score the positive and negative samples with the MDNet network, and take the candidate-region sample with the highest target score as the current optimal target state;
and average the scores of all the candidate-region samples and compare the average with a preset threshold: when the average score is greater than the preset threshold, the target tracking is judged successful; otherwise, it is judged failed.
9. A medium having a program stored thereon, wherein the program, when executed by a processor, implements the steps in the MDNet based improved vehicle tracking method of any of claims 1-7.
10. An electronic device comprising a memory, a processor, and a program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps in the MDNet-based improved vehicle tracking method of any one of claims 1-7.
CN201911227267.3A 2019-12-04 2019-12-04 Improved vehicle tracking method and system based on MDNet Pending CN110956643A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911227267.3A CN110956643A (en) 2019-12-04 2019-12-04 Improved vehicle tracking method and system based on MDNet

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911227267.3A CN110956643A (en) 2019-12-04 2019-12-04 Improved vehicle tracking method and system based on MDNet

Publications (1)

Publication Number Publication Date
CN110956643A 2020-04-03

Family

ID=69979741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911227267.3A Pending CN110956643A (en) 2019-12-04 2019-12-04 Improved vehicle tracking method and system based on MDNet

Country Status (1)

Country Link
CN (1) CN110956643A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109460793A (en) * 2018-11-15 2019-03-12 腾讯科技(深圳)有限公司 A kind of method of node-classification, the method and device of model training
CN109711295A (en) * 2018-12-14 2019-05-03 北京航空航天大学 A kind of remote sensing image offshore Ship Detection
CN110059605A (en) * 2019-04-10 2019-07-26 厦门美图之家科技有限公司 A kind of neural network training method calculates equipment and storage medium
CN110276321A (en) * 2019-06-11 2019-09-24 北方工业大学 Remote sensing video target tracking method and system
CN110348363A (en) * 2019-07-05 2019-10-18 西安邮电大学 The vehicle tracking algorithm for eliminating similar vehicle interference is merged based on multiframe angle information
CN110472594A (en) * 2019-08-20 2019-11-19 腾讯科技(深圳)有限公司 Method for tracking target, information insertion method and equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HYEONSEOB NAM et al.: "Mask R-CNN" *
码农教程: "MDNet Training and Online Tracking Process" (MDNet训练与在线跟踪过程, in Chinese) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709328A (en) * 2020-05-29 2020-09-25 北京百度网讯科技有限公司 Vehicle tracking method and device and electronic equipment
WO2021238062A1 (en) * 2020-05-29 2021-12-02 北京百度网讯科技有限公司 Vehicle tracking method and apparatus, and electronic device
CN111709328B (en) * 2020-05-29 2023-08-04 北京百度网讯科技有限公司 Vehicle tracking method and device and electronic equipment
CN113129337A (en) * 2021-04-14 2021-07-16 桂林电子科技大学 Background perception tracking method, computer readable storage medium and computer device
CN113129337B (en) * 2021-04-14 2022-07-19 桂林电子科技大学 Background perception tracking method, computer readable storage medium and computer device

Similar Documents

Publication Publication Date Title
CN109145713B (en) Small target semantic segmentation method combined with target detection
CN107767405B (en) Nuclear correlation filtering target tracking method fusing convolutional neural network
CN108647577B (en) Self-adaptive pedestrian re-identification method and system for difficult excavation
CN107145889B (en) Target identification method based on double CNN network with RoI pooling
CN113326731B (en) Cross-domain pedestrian re-identification method based on momentum network guidance
CN112069874B (en) Method, system, equipment and storage medium for identifying cells in embryo light microscope image
CN109598268A (en) A kind of RGB-D well-marked target detection method based on single flow depth degree network
CN110929848B (en) Training and tracking method based on multi-challenge perception learning model
CN111680702B (en) Method for realizing weak supervision image significance detection by using detection frame
CN107609512A (en) A kind of video human face method for catching based on neutral net
CN111274917B (en) Long-time target tracking method based on depth detection
CN111160407A (en) Deep learning target detection method and system
CN113076871A (en) Fish shoal automatic detection method based on target shielding compensation
CN111192294B (en) Target tracking method and system based on target detection
CN114693983B (en) Training method and cross-domain target detection method based on image-instance alignment network
CN114333040B (en) Multi-level target detection method and system
CN110956643A (en) Improved vehicle tracking method and system based on MDNet
CN112434599A (en) Pedestrian re-identification method based on random shielding recovery of noise channel
CN114139631A (en) Multi-target training object-oriented selectable ash box confrontation sample generation method
CN111931572B (en) Target detection method for remote sensing image
Duan [Retracted] Deep Learning‐Based Multitarget Motion Shadow Rejection and Accurate Tracking for Sports Video
Shobha et al. Deep learning assisted active net segmentation of vehicles for smart traffic management
CN116416152A (en) Image data-based processing method and device
CN113850166A (en) Ship image identification method and system based on convolutional neural network
CN110909670B (en) Unstructured road identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination