CN107945210B - Target tracking method based on deep learning and environment self-adaptation


Info

Publication number
CN107945210B
CN107945210B
Authority
CN
China
Prior art keywords
target
samples
frame
positive
positive sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711237457.4A
Other languages
Chinese (zh)
Other versions
CN107945210A (en)
Inventor
周圆
李孜孜
曹颖
杜晓婷
杨鸿宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201711237457.4A
Publication of CN107945210A
Application granted
Publication of CN107945210B
Current legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/20: Analysis of motion
    • G06T7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/248: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving reference images or patches
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/10: Image acquisition modality
    • G06T2207/10016: Video; image sequence
    • G06T2207/20: Special algorithmic details
    • G06T2207/20021: Dividing image into blocks, subimages or windows
    • G06T2207/20081: Training; learning
    • G06T2207/20084: Artificial neural networks [ANN]


Abstract

The invention discloses a target tracking algorithm based on deep learning and environment self-adaptation, which consists of two parts. The first part is preprocessing: information is extracted from each frame of the tracking video, and the positive and negative samples are then further screened through saliency detection and a convolutional neural network algorithm. The second part is a convolutional neural network realizing a VGG-style model: first, a three-layer convolutional network extracts target features; second, a fully connected layer classifies target and background; finally, the position of the target to be tracked is obtained, and the tracking process for the next frame begins. Compared with the prior art, (1) the method reduces computational complexity while accurately using the preprocessed image information, making the tracking more accurate, so the method is inventive; (2) the tracker can adapt to various complex environmental scenes and has wide application prospects.

Description

Target tracking method based on deep learning and environment self-adaptation
Technical Field
The invention relates to the field of target tracking in computer vision, and in particular to an environment-adaptive target tracking algorithm based on a deep learning method.
Background
Humans connect and communicate with the outside world through their senses, but human energy and field of view are very limited; as a result, human vision is severely constrained, and even inefficient, in many application fields. Today, thanks to the rapid development of digital computer technology, computer vision is attracting more and more attention: people intend to replace human eyes with computers, make computers intelligent enough to process visual information, and remedy the many shortcomings of human vision. Computer vision is a highly interdisciplinary subject that integrates fields such as artificial neural networks, psychology, physics, computer graphics, and mathematics.
Currently, target tracking is one of the most active topics in computer vision, and the field is attracting increasing attention. Target tracking has wide applications: motion analysis, behavior recognition, surveillance, and human-computer interaction all draw on it. It has important research value and great application prospects in science and engineering, and has attracted the interest of a large number of researchers at home and abroad.
Deep learning has been applied successfully in image processing and provides a new approach to target tracking. In target tracking, a deep architecture automatically learns more abstract, more essential features from the collected samples, and new sequences are then tested with them. Tracking techniques that incorporate deep learning have gradually surpassed traditional tracking methods in performance and have become a new trend in the field.
So far, no target tracking algorithm based on deep learning and environment adaptation has appeared in the papers and literature published at home and abroad.
Disclosure of Invention
Against this background, the invention provides a target tracking method based on deep learning and environment self-adaptation: a convolutional neural network adaptively adjusts its own parameters, so that the tracker achieves high accuracy in various tracking scenes while combining the preprocessing advantages of saliency detection.
The invention discloses a target tracking method based on deep learning and environment self-adaptation, which comprises the following steps:
Step 1: take an image of 107 × 107 pixels as input;
Step 2: preprocessing, comprising positive-sample preprocessing and negative-sample preprocessing. The positive-sample preprocessing step comprises: first, a sampling procedure is executed: according to the ground-truth value, a rectangle larger than the target's ground-truth box is taken around the target in the positive sample as the sampling frame, and the proportion of the positive sample's saliency map within the whole sampling frame is calculated; if the proportion is larger than a set threshold, the sample is kept as a pure positive sample, and if it is smaller than the threshold, the sample is discarded. Then, a saliency detection algorithm is used to detect the shape of the target and obtain a saliency map; the obtained saliency map is binarized, the binarized saliency map replaces the original frame image, and the binarized whole-frame image is sampled according to the preceding sampling procedure. The negative-sample preprocessing step comprises: negative samples are screened with a hard-example mining algorithm: the sampled examples are forward-propagated once through the convolutional neural network, the samples are ranked by loss, the samples with the largest loss are selected as "hard examples", and the network is trained with these samples. During offline multi-domain training, 50 positive samples and 200 negative samples are drawn from each frame, where positive samples have an overlap ratio of at least 0.7 with the ground-truth box and negative samples have an overlap ratio of at most 0.5, and the positive and negative samples are selected according to this standard; likewise, for online learning, $S_t^+$ positive samples and $S_t^-$ negative samples are collected, following the same overlap-ratio standard;
Step 3: a bounding-box regression model is adopted when the first frame is trained. Specifically: for the given first frame of the test video sequence, a three-layer convolutional network is used to train a linear bounding-box regression model that predicts the position of the target, and the target's features are extracted; in each subsequent frame of the video sequence, the bounding-box regression model is used to adjust the predicted bounding-box position of the corresponding target.
Compared with the prior art, the invention has the following effects:
(1) the method reduces computational complexity while accurately using the preprocessed image information, so the tracking result is more accurate; the method is therefore inventive;
(2) the tracker can adapt to various complex environmental scenes and has wide application prospects.
Drawings
FIG. 1 is the overall framework of the target tracking method based on deep learning and environment self-adaptation: FIG. 1(a) is the basic model of the tracking algorithm; FIG. 1(b) is the saliency detection model; FIG. 1(c) is the deep learning tracking model.
FIG. 2 shows the tracking test results on the Diving sequence.
FIG. 3 shows the tracking test results on the Ball sequence.
Detailed Description
The invention relates to a target tracking method based on deep learning and environment self-adaptation, which comprises two parts. One part is preprocessing: information is extracted from each frame of the tracking video, and the positive and negative samples are then further screened by saliency detection and a convolutional neural network algorithm. The other part is a convolutional neural network realizing a VGG-style model: first, a three-layer convolutional network extracts target features; second, a fully connected layer classifies target and background; finally, the position of the target to be tracked is obtained, and the tracking process for the next frame begins.
The specific process is described in detail as follows:
Step 1: an image of 107 × 107 pixels is adopted as input; this ensures that the feature map output by the convolutional layers matches the input size, i.e. that the input to the fully connected layer is a one-dimensional vector.
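As a concrete illustration of this architecture, the following is a minimal PyTorch sketch of a three-convolution-layer feature extractor followed by fully connected layers ending in a binary target/background classifier. The channel counts, kernel sizes and strides are assumptions chosen so that a 107 × 107 input flattens to a one-dimensional vector at the first fully connected layer (a common VGG-M-style configuration); the patent does not specify them.

```python
import torch
import torch.nn as nn

class TrackerNet(nn.Module):
    """Three conv layers for feature extraction, three fully connected
    layers (fc4, fc5, fc6) for target/background classification."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU(),    # 107 -> 51
            nn.MaxPool2d(kernel_size=3, stride=2),                   # 51 -> 25
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),  # 25 -> 11
            nn.MaxPool2d(kernel_size=3, stride=2),                   # 11 -> 5
            nn.Conv2d(256, 512, kernel_size=3, stride=1), nn.ReLU(), # 5 -> 3
        )
        self.fc = nn.Sequential(
            nn.Linear(512 * 3 * 3, 512), nn.ReLU(),  # fc4
            nn.Linear(512, 512), nn.ReLU(),          # fc5
            nn.Linear(512, 2),                       # fc6: (background, target)
        )

    def forward(self, x):                 # x: (N, 3, 107, 107)
        f = self.features(x)              # (N, 512, 3, 3)
        return self.fc(f.flatten(1))      # (N, 2) class scores
```

With this configuration, `TrackerNet()(torch.zeros(1, 3, 107, 107))` produces a 1 × 2 score vector, matching the requirement that the convolutional output flattens to a one-dimensional vector per sample.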
Step 2: preprocessing, comprising positive-sample preprocessing and negative-sample preprocessing.
(1) Positive-sample preprocessing: positive samples taken by the usual method are sometimes effectively negative samples that contain mostly background, and such "positive samples" introduce errors into the training of the convolutional neural network. The method therefore screens the collected positive samples to make them purer. The specific implementation is as follows:
First, according to the ground-truth value, a rectangle larger than the target's ground-truth box is taken around the target in the positive sample as the sampling frame. The proportion of the saliency map within the whole sampling frame is then calculated; if the proportion is greater than a set threshold, the sample is input into the network as a pure positive sample, and if it is less than the threshold, the sample is discarded. This ensures that the resulting positive samples are almost pure.
Then, a "saliency" detection is performed, i.e. the detection of an object that is salient in an area. The method comprises the following steps of roughly detecting the shape of a target by using a saliency detection algorithm, binarizing an obtained saliency map, inserting the obtained saliency map into an original image of one frame, sampling the whole image of the binarized frame according to the previous sampling process, and inspecting the target by using a saliency method.
The positive-sample screening in this step is a general-purpose screening method applicable to most tracking algorithms; applying this idea to the pre-trained network can exert a certain influence on the parameters of the whole network.
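As an illustration of the purity check, here is a minimal sketch, assuming the saliency map has already been binarized (1 = salient) and that boxes are given as (x, y, w, h) in pixel coordinates; the default threshold of 0.5 is an assumed value, since the patent leaves the threshold unspecified.

```python
import numpy as np

def is_pure_positive(saliency_bin: np.ndarray, box, threshold: float = 0.5):
    """Keep a positive sample only if salient (target) pixels fill a large
    enough fraction of the sampling frame taken around it."""
    x, y, w, h = box
    patch = saliency_bin[y:y + h, x:x + w]
    ratio = patch.mean() if patch.size else 0.0  # fraction of salient pixels
    return ratio > threshold                     # pure positive vs. discard
```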
(2) Negative-sample preprocessing
In tracking detection, most negative samples are redundant, and only a few representative negative samples are useful for training the tracker. With the usual SGD method, this easily causes the tracker to drift. The most common remedy for this problem is hard-example mining. For screening the negative samples, the hard-example mining idea is applied: the sampled examples are forward-propagated once through the convolutional neural network, the samples are ranked by loss, and the samples with the largest loss are selected. These samples are close enough to the positive samples while not being positive samples, and are therefore called "hard examples"; training the network with them lets the network better learn the difference between positive and negative samples.
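A minimal sketch of this screening step follows, assuming the TrackerNet classifier sketched above. Ranking negatives by their positive-class score is equivalent to ranking them by loss, since a negative is "hard" exactly when the network scores it as target-like; the `keep` count is an assumption.

```python
import torch

def mine_hard_negatives(model, neg_patches: torch.Tensor, keep: int = 96):
    """One forward pass over sampled negatives; return the `keep` samples
    with the largest loss, i.e. the most target-like negatives."""
    model.eval()
    with torch.no_grad():
        logits = model(neg_patches)   # (N, 2) scores
        pos_score = logits[:, 1]      # higher = more easily mistaken for target
    hard_idx = torch.argsort(pos_score, descending=True)[:keep]
    return neg_patches[hard_idx]
```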
Step 3: a bounding-box regression model is adopted when the first frame is trained. Specifically: for the given first frame of the test video sequence, a three-layer convolutional network is used to train a linear regression model that predicts the position of the target, and the target's features are extracted. In each subsequent frame of the video sequence, the regression model adjusts the position of the target's bounding box, and the fully connected layers classify target and background in the image to obtain the image block with the highest target probability; this block is regarded as the target to be tracked, which yields the position of the target, and the tracking process for the next frame then begins.
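The following sketch shows one way such a linear bounding-box regression can be realized: ridge regression from conv features of sampled boxes to R-CNN-style offsets relative to the ground truth. The use of scikit-learn's Ridge, the (center-x, center-y, w, h) box convention and the regularization strength are assumptions, not details taken from the patent.

```python
import numpy as np
from sklearn.linear_model import Ridge

class BBoxRegressor:
    """Linear regression from box features to (dx, dy, dw, dh) offsets."""
    def __init__(self, alpha: float = 1000.0):
        self.model = Ridge(alpha=alpha)

    def fit(self, feats, boxes, gt):
        # Normalized offsets from each sample box to the ground-truth box.
        dx = (gt[0] - boxes[:, 0]) / boxes[:, 2]
        dy = (gt[1] - boxes[:, 1]) / boxes[:, 3]
        dw = np.log(gt[2] / boxes[:, 2])
        dh = np.log(gt[3] / boxes[:, 3])
        self.model.fit(feats, np.stack([dx, dy, dw, dh], axis=1))

    def predict(self, feats, boxes):
        # Apply the predicted offsets to move each box toward the target.
        d = self.model.predict(feats)
        x = boxes[:, 0] + boxes[:, 2] * d[:, 0]
        y = boxes[:, 1] + boxes[:, 3] * d[:, 1]
        w = boxes[:, 2] * np.exp(d[:, 2])
        h = boxes[:, 3] * np.exp(d[:, 3])
        return np.stack([x, y, w, h], axis=1)
```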
In the positive-sample preprocessing, a long/short-term update strategy can also be adopted: the network is re-updated with positive samples collected over a period of time. When tracking the target, once a loss of tracking is detected, a short-term update strategy is used, in which the positive samples collected during the recent period are used to update the network. The negative samples used in both update strategies are those collected in the short-term update model. Let $T_s$ and $T_l$ be two frame-index sets, with the short-term set $T_s$ covering 20 frames and the long-term set $T_l$ covering 100 frames. The purpose of this strategy is to keep the samples as "fresh" as possible, which benefits the tracking result.
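A minimal sketch of this frame-index bookkeeping: both sets grow as frames are collected, and once a set exceeds its capacity (τs = 20 short-term frames, τl = 100 long-term frames) the oldest index is evicted, which keeps the stored samples "fresh".

```python
def update_index_sets(t: int, Ts: set, Tl: set,
                      tau_s: int = 20, tau_l: int = 100):
    """Add frame t to both index sets, evicting the oldest frame when
    a set exceeds its capacity."""
    Ts.add(t)
    Tl.add(t)
    if len(Ts) > tau_s:
        Ts.discard(min(Ts))  # drop the oldest short-term frame
    if len(Tl) > tau_l:
        Tl.discard(min(Tl))  # drop the oldest long-term frame

# Initialization as in the online algorithm below: Ts, Tl = {1}, {1}
```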
After the neural network is trained offline, the video sequence to be tested is tracked online; the overall tracking algorithm therefore requires an online tracking part. The online tracking algorithm proceeds as follows:
Input: the filters $\{w_1, \dots, w_5\}$ of the pre-trained convolutional neural network (CNN), and the initial target state $x_1$.
Output: the estimated target states $x_t^*$.
(1) Randomly initialize the weights $w_6$ of the 6th, fully connected layer, so that $w_6$ takes a random initial value;
(2) train the bounding-box regression model;
(3) extract positive samples $S_1^+$ and negative samples $S_1^-$;
(4) screen the positive samples using the saliency network;
(5) use the extracted positive samples $S_1^+$ and negative samples $S_1^-$ to update the fully connected weights $\{w_4, w_5, w_6\}$, where $w_4$, $w_5$ and $w_6$ denote the weights of the 4th, 5th and 6th fully connected layers, respectively;
(6) set the initial values of the long- and short-term sets: $T_s \leftarrow \{1\}$ and $T_l \leftarrow \{1\}$;
(7) repeat the following:
extract candidate target samples $x_t^i$;
find the optimal target state by the formula
$$x_t^* = \arg\max_{x_t^i} f^+(x_t^i),$$
which states that the candidate sample with the highest positive score $f^+$ through the convolutional neural network is the optimal target state $x_t^*$;
if $f^+(x_t^*) > 0.5$, extract training samples $S_t^+$ and $S_t^-$ and set $T_s \leftarrow T_s \cup \{t\}$, $T_l \leftarrow T_l \cup \{t\}$, where $t$ denotes the $t$-th frame and $T_s$ and $T_l$ denote the short-term and long-term index sets, respectively; assigning $T_s \cup \{t\}$ to $T_s$ and $T_l \cup \{t\}$ to $T_l$ updates the values of the two frame-index sets;
if the length of the short-term index set exceeds 20, i.e. $|T_s| > \tau_s$, remove the minimum element from the short-term set: $T_s \leftarrow T_s \setminus \{\min_{v \in T_s} v\}$, where $v$ denotes a value in the short-term index set;
if the length of the long-term index set exceeds 100, i.e. $|T_l| > \tau_l$, remove the minimum element from the long-term set: $T_l \leftarrow T_l \setminus \{\min_{v \in T_l} v\}$;
adjust the predicted target position $x_t^*$ using the bounding-box regression model;
if $f^+(x_t^*) < 0.5$, update the weights $\{w_4, w_5, w_6\}$ using the positive and negative samples in the short-term sets;
in other cases, update the weights $\{w_4, w_5, w_6\}$ using the positive samples collected in the long-term set together with the negative samples in the short-term set, consistent with the long/short-term update strategy described above.
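To make the per-frame search step concrete, here is a minimal sketch of selecting the optimal target state: score every candidate patch with the network, take the candidate with the highest positive score $f^+$, and compare that score against 0.5 to decide between collecting new samples and triggering a short-term re-update. It assumes the TrackerNet classifier sketched earlier and softmax-normalized scores, which is an implementation assumption.

```python
import torch
import torch.nn.functional as F

def select_target(model, candidates: torch.Tensor):
    """candidates: (N, 3, 107, 107) cropped candidate patches for frame t.
    Returns the index of the best candidate and its positive score f+."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(candidates), dim=1)  # (N, 2)
        f_pos = probs[:, 1]                          # f+(x_t^i) per candidate
    best = torch.argmax(f_pos)
    return best.item(), f_pos[best].item()

# idx, f_best = select_target(model, candidates)
# f_best > 0.5 -> tracking success: draw S_t+, S_t- and update Ts, Tl
# f_best < 0.5 -> tracking failure: re-update fc weights from short-term sets
```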
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
The target tracking method based on deep learning and environment self-adaptation proposed in this patent is verified below. The training error of the improved algorithm is compared with that of the algorithm before improvement through simulation experiments, and a large number of experimental results verify the effectiveness of the algorithm. The experimental results are presented in the form of tracked target boxes.
Candidate target generation: to generate candidate targets in each frame, $N = 256$ samples $x_t^1, \dots, x_t^N$ are selected, drawn from a Gaussian distribution centered on the previous target state $x_{t-1}^*$, with covariance parameterized by $(0.09\,r^2)$, where $r$ denotes the mean of the length and width of the target box in the previous frame. The size of each candidate target frame is 1.5 times that of the initial-state target frame.
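A minimal sketch of this candidate generation, assuming boxes are (center-x, center-y, w, h): positions are perturbed with standard deviation 0.3·r, so that the variance is 0.09·r²; the mild log-scale jitter on the box size is an assumption, since the text only specifies the positional term.

```python
import numpy as np

def sample_candidates(prev_box, n: int = 256, rng=None):
    """Draw n candidate states from a Gaussian centered on the previous
    target state (cx, cy, w, h)."""
    rng = rng or np.random.default_rng()
    cx, cy, w, h = prev_box
    r = (w + h) / 2.0                          # mean of width and height
    dx, dy = rng.normal(0.0, 0.3 * r, (2, n))  # std 0.3*r -> variance 0.09*r^2
    scale = 1.05 ** rng.normal(0.0, 0.5, n)    # assumed scale perturbation
    return np.stack([cx + dx, cy + dy, w * scale, h * scale], axis=1)
```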
Training data: during offline multi-domain training, 50 positive samples and 200 negative samples are adopted in each frame, where positive samples have an overlap ratio of at least 0.7 with the ground-truth box and negative samples have an overlap ratio of at most 0.5, and the positive and negative samples are selected according to this standard. Likewise, for online learning, $S_t^+$ positive samples and $S_t^-$ negative samples are collected, following the same overlap-ratio standard, except that when the first frame is sampled, $S_1^+$ positive samples and $S_1^-$ negative samples are taken. For the bounding-box regression, 1000 training samples are used.
Network learning: for the multi-domain network learning that trains the K branches, the learning rate of the convolutional layers is set to 0.0001 and the learning rate of the fully connected layers to 0.001. At the very beginning of training the fully connected layers, we iterate 30 times, with the learning rates of fully connected layers 4 and 5 set to 0.0001 and the learning rate of the sixth fully connected layer set to 0.001.
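These per-layer rates translate directly into optimizer parameter groups. Below is a sketch for the first-frame fine-tuning of the fully connected layers, assuming the TrackerNet module sketched earlier (fc4 and fc5 at 0.0001, fc6 at 0.001); the SGD momentum value is an assumption.

```python
import torch

model = TrackerNet()  # the module sketched in the detailed description

optimizer = torch.optim.SGD([
    {"params": model.fc[:4].parameters(), "lr": 0.0001},  # fc4 and fc5
    {"params": model.fc[4].parameters(),  "lr": 0.001},   # fc6 (classifier)
], momentum=0.9)

# During offline multi-domain training, the convolutional filters would be
# added as a further group with lr=0.0001 and the fc layers with lr=0.001.
```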
Table 1 shows the experimental results of the improved algorithm, with the "saliency" preprocessing network added; Table 2 shows the experimental results of the unimproved algorithm, without the preprocessing network.
TABLE 1. Training results of the improved algorithm (the table values are reproduced as an image in the original publication)
TABLE 2. Training results of the unimproved algorithm (the table values are reproduced as an image in the original publication)

Claims (1)

1. A target tracking method based on deep learning and environment self-adaptation is characterized by comprising the following steps:
Step (1): take an image of 107 × 107 pixels as input;
the pretreatment comprises positive sample pretreatment and negative sample treatment, wherein the positive sample pretreatment and the negative sample pretreatment are included; wherein, the step of positive sample pretreatment comprises: firstly, a sampling flow is executed: taking a rectangle larger than the grountruth value of the target around the target in the positive sample as a sampling frame according to the grountruth value, calculating the proportion of the saliency map of the positive sample in the whole sampling frame, if the proportion is larger than a set threshold value, taking the positive sample as a pure positive sample, and if the proportion is smaller than the set threshold value, discarding the positive sample; secondly, detecting the shape of the target by using a saliency detection algorithm to obtain a saliency map, binarizing the obtained saliency map, replacing the original frame image with the binarized saliency map, and sampling the binarized whole frame image according to the previous sampling process; the negative sample pretreatment step comprises the following steps: is difficult to useThe mining algorithm screens negative samples, the sampled samples are subjected to one-time forward propagation in a convolutional neural network, the samples with large loss are arranged in sequence, the selected samples with large loss are taken as 'difficult cases', and the network is trained by the samples; wherein: during off-line multi-domain training, 50 positive samples and 200 negative samples are adopted from each frame, the positive samples and the negative samples respectively have a coincidence rate which is more than or equal to 0.7 and less than or equal to 0.5 with a frame of a ground-route, and the positive samples and the negative samples are respectively selected according to the standard; likewise, for online learning, collection
Figure FDA0002697040570000011
A positive sample and
Figure FDA0002697040570000012
negative samples and follows the upper sample coincidence rate standard;
Step (3): a bounding-box regression model is adopted when the first frame is trained. Specifically: for the given first frame of the test video sequence, a three-layer convolutional network is used to train a linear bounding-box regression model that predicts the position of the target, and the target's features are extracted; in each subsequent frame of the video sequence, the bounding-box regression model is used to adjust the predicted bounding-box position of the corresponding target.
CN201711237457.4A 2017-11-30 2017-11-30 Target tracking method based on deep learning and environment self-adaptation Active CN107945210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711237457.4A CN107945210B (en) 2017-11-30 2017-11-30 Target tracking method based on deep learning and environment self-adaptation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711237457.4A CN107945210B (en) 2017-11-30 2017-11-30 Target tracking method based on deep learning and environment self-adaptation

Publications (2)

Publication Number Publication Date
CN107945210A CN107945210A (en) 2018-04-20
CN107945210B true CN107945210B (en) 2021-01-05

Family

ID=61946958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711237457.4A Active CN107945210B (en) 2017-11-30 2017-11-30 Target tracking method based on deep learning and environment self-adaptation

Country Status (1)

Country Link
CN (1) CN107945210B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109345559B (en) * 2018-08-30 2021-08-06 西安电子科技大学 Moving target tracking method based on sample expansion and depth classification network
CN109344793B (en) 2018-10-19 2021-03-16 北京百度网讯科技有限公司 Method, apparatus, device and computer readable storage medium for recognizing handwriting in the air
CN111192288B (en) * 2018-11-14 2023-08-04 天津大学青岛海洋技术研究院 Target tracking algorithm based on deformation sample generation network
CN109682392B (en) * 2018-12-28 2020-09-01 山东大学 Visual navigation method and system based on deep reinforcement learning
TWI749870B (en) * 2020-04-08 2021-12-11 四零四科技股份有限公司 Device of handling video content analysis
CN113538507B (en) * 2020-04-15 2023-11-17 南京大学 Single-target tracking method based on full convolution network online training
CN112465862B (en) * 2020-11-24 2024-05-24 西北工业大学 Visual target tracking method based on cross-domain depth convolution neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955718A (en) * 2014-05-15 2014-07-30 厦门美图之家科技有限公司 Image subject recognition method
CN104915972A (en) * 2014-03-13 2015-09-16 欧姆龙株式会社 Image processing apparatus, image processing method and program
CN106709936A (en) * 2016-12-14 2017-05-24 北京工业大学 Single target tracking method based on convolution neural network
EP3229206A1 (en) * 2016-04-04 2017-10-11 Xerox Corporation Deep data association for online multi-class multi-object tracking
CN107369166A (en) * 2017-07-13 2017-11-21 深圳大学 A kind of method for tracking target and system based on multiresolution neutral net

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915972A (en) * 2014-03-13 2015-09-16 欧姆龙株式会社 Image processing apparatus, image processing method and program
CN103955718A (en) * 2014-05-15 2014-07-30 厦门美图之家科技有限公司 Image subject recognition method
EP3229206A1 (en) * 2016-04-04 2017-10-11 Xerox Corporation Deep data association for online multi-class multi-object tracking
CN106709936A (en) * 2016-12-14 2017-05-24 北京工业大学 Single target tracking method based on convolution neural network
CN107369166A (en) * 2017-07-13 2017-11-21 深圳大学 A kind of method for tracking target and system based on multiresolution neutral net

Also Published As

Publication number Publication date
CN107945210A (en) 2018-04-20

Similar Documents

Publication Publication Date Title
CN107945210B (en) Target tracking method based on deep learning and environment self-adaptation
CN109543502B (en) Semantic segmentation method based on deep multi-scale neural network
CN111126386B (en) Sequence domain adaptation method based on countermeasure learning in scene text recognition
Liu et al. Learning converged propagations with deep prior ensemble for image enhancement
EP3620990A1 (en) Capturing network dynamics using dynamic graph representation learning
CN110048827B (en) Class template attack method based on deep learning convolutional neural network
EP3477550A1 (en) Vehicle license plate classification method and system based on deep learning, electronic apparatus, and storage medium
CN113011357B (en) Depth fake face video positioning method based on space-time fusion
CN110458084B (en) Face age estimation method based on inverted residual error network
CN112507990A (en) Video time-space feature learning and extracting method, device, equipment and storage medium
WO2020238353A1 (en) Data processing method and apparatus, storage medium, and electronic apparatus
CN108320306B (en) Video target tracking method fusing TLD and KCF
CN113344206A (en) Knowledge distillation method, device and equipment integrating channel and relation feature learning
Alqahtani et al. Pruning CNN filters via quantifying the importance of deep visual representations
US20220092407A1 (en) Transfer learning with machine learning systems
CN111489803B (en) Report form coding model generation method, system and equipment based on autoregressive model
CN113283524A (en) Anti-attack based deep neural network approximate model analysis method
CN112418032A (en) Human behavior recognition method and device, electronic equipment and storage medium
Saealal et al. Three-Dimensional Convolutional Approaches for the Verification of Deepfake Videos: The Effect of Image Depth Size on Authentication Performance
CN114417975A (en) Data classification method and system based on deep PU learning and class prior estimation
CN114399661A (en) Instance awareness backbone network training method
Fonseca et al. Model-agnostic approaches to handling noisy labels when training sound event classifiers
CN115862119A (en) Human face age estimation method and device based on attention mechanism
Masilamani et al. Art classification with pytorch using transfer learning
CN113962332A (en) Salient target identification method based on self-optimization fusion feedback

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant