CN108596958B - Target tracking method based on difficult positive sample generation - Google Patents

Target tracking method based on difficult positive sample generation

Info

Publication number
CN108596958B
Authority
CN
China
Prior art keywords
difficult
network
layer
action
positive sample
Prior art date
Legal status
Active
Application number
CN201810443211.0A
Other languages
Chinese (zh)
Other versions
CN108596958A (en)
Inventor
李成龙
杨芮
王逍
汤进
罗斌
Current Assignee
Anhui University
Original Assignee
Anhui University
Priority date
Filing date
Publication date
Application filed by Anhui University
Priority to CN201810443211.0A
Publication of CN108596958A
Application granted
Publication of CN108596958B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion
    • G06T7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model based on distances to training or reference patterns
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence
    • G06T2207/10024 - Color image
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G06T2207/20084 - Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target tracking method based on difficult positive sample generation. For each video in the training data, a variational auto-encoder is used to learn the corresponding manifold, i.e., a positive sample generation network, and the encoding of an input image is fine-tuned to generate a large number of positive samples. The positive samples are input into a difficult positive sample conversion network, and an agent is trained to learn to occlude the target object with background image blocks; the agent continuously adjusts the bounding box so that the samples become difficult to recognize, thereby generating difficult positive samples, and the occluded difficult positive samples are output. Based on the generated difficult positive samples, a twin network is trained to match the target image block with candidate image blocks and complete the localization of the target in the current frame, until the whole video is processed. Based on difficult positive sample generation, the invention learns the manifold distribution of the target directly from data, thereby obtaining a large number of diverse positive samples.

Description

Target tracking method based on difficult positive sample generation
Technical Field
The invention relates to a visual tracking technology, in particular to a target tracking method based on difficult positive sample generation.
Background
Currently, mainstream deep-learning-based tracking methods generally comprise the following steps: first, collecting a large number of manually annotated videos; second, densely sampling positive and negative samples near the first-frame annotation box of each video; third, training a binary classifier with the samples obtained in the previous step; fourth, determining candidate regions near the search box, classifying them, and selecting the region with the highest score as the tracking result; and fifth, repeating the above steps until the video ends.
Shortcomings of the prior art: as shown in Fig. 1, the existing dense sampling method yields insufficient sample diversity; difficult samples are few, and the model is too sensitive to challenging factors. Since visual tracking only gives one bounding box as the initial condition, and tracked targets are diverse, tracking methods based on deep learning cannot obtain enough training samples; this is a typical small-sample learning problem. As shown in Fig. 2, in existing annotated videos, the frames containing the various challenging factors are very few.
The diversity of positive samples obtained by conventional dense sampling is insufficient, so the model easily over-fits and is too sensitive to challenging factors. Existing difficult positive samples are obtained from the prediction results of the model, namely: a threshold range is set, all samples whose confidence falls within that range are selected, and they are used in the next cycle to continue fine-tuning the model, making it more robust. However, this selection depends on the predictions of the model, and those predictions are not all accurate, which introduces uncertainty into the tracking model.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: the samples obtained by the conventional dense sampling method are insufficient in diversity, difficult samples are few, and the model is too sensitive to challenging factors; the invention therefore provides a target tracking method based on difficult positive sample generation.
The invention solves the technical problem through the following technical scheme, comprising the following steps:
(1) acquiring annotated videos for training a deep tracking model;
(2) for each video in the training data, using a variational auto-encoder to learn the corresponding manifold, i.e., a positive sample generation network, and fine-tuning the encoding of the input image to generate a large number of positive samples;
(3) inputting the positive samples into a difficult positive sample conversion network and training an agent to learn to occlude the target object with background image blocks, the agent continuously adjusting the bounding box so that the samples become difficult to recognize, thereby generating difficult positive samples, and outputting the occluded difficult positive samples;
(4) based on the generated difficult positive samples, training a twin network to match the target image block with candidate image blocks and complete the localization of the target in the current frame, until the whole video is processed.
In step (1), a certain amount of tracking video is annotated manually; the annotation consists of finding, in each video frame, the same target object as in the first frame and giving the position of the target object in the current frame.
In step (2), the data is preprocessed and stored as an h5 file readable by the deep neural network; a sampled image block whose overlap with the ground truth exceeds a discrimination threshold is a positive sample, and one whose overlap is below a preset threshold is a negative sample.
The difficult positive sample conversion network comprises convolutional layers, which enhance the original signal features and reduce noise through the convolution operation;
pooling layers, which exploit the locality of images to reduce the number of features by down-sampling;
and fully connected layers, in which each neuron is connected to every neuron of the next layer and which perform ordinary classification.
The positive sample image to be occluded is taken as input; the convolution operation is performed from the input layer to the convolutional layer, where each neuron of the convolutional layer can be connected to a local receptive field of a certain size in the input layer, and the features of the image to be occluded are obtained after convolution; the step from the convolutional layer to the pooling layer reduces the number of features of the previous layer; the features obtained after the convolutional and pooling layers are classified by the fully connected layer, and the result is finally output after its computation; each output node of the fully connected layer gives the probability of the agent selecting the corresponding action, i.e., the probability that, in the current state, that action should be executed to change the occluding region.
The training process adopts the following main deep learning parameter settings: for the positive sample generation network, the initial learning rate is 0.001, the optimization algorithm is RMSprop, and the number of training iterations is 20,000; for the difficult positive sample conversion network, the mini-batch of training data is 100, the optimization method is Adam, and the initial learning rate is 1e-6; for the twin network subsequently used for tracking, the initial learning rate is 0.0001, the momentum is 0.9, and the weight decay parameter is 0.0005.
The deep learning process is as follows:
for the positive sample generation network, two fully connected layers are used to encode the input image, which is flattened into a column vector; the encoded features are then fed into two fully connected branches to estimate the mean and standard deviation, passed through three fully connected layers, and the reconstructed image is finally output;
for the difficult positive sample conversion network, a sample is input into a pre-trained VGG network, and the action selected by the agent in the current state is output through two fully connected layers; a new occluded sample is obtained by executing the action, and its similarity to the ground truth is computed; if the action is a movement action, a positive reward is given when the similarity decreases, otherwise a negative reward is given;
if the action is the stop action, a positive reward is given when the similarity is below a certain threshold, otherwise a negative reward is given;
for the twin network (i.e., a Siamese network) used for tracking, there are two branch networks, used respectively to encode the target to be searched and the candidate search region of the current frame, and the parameters of the two branches are shared;
the training of the network is based on positive and negative sample pairs, i.e.: if the overlap of two image blocks is greater than a certain threshold they are regarded as the same image block and given the label 1, otherwise they are regarded as different image blocks and given the label 0;
and the Margin Contrastive Loss function is used to measure the difference between the output of the model and the ground-truth pair; this difference can be back-propagated through the network layer by layer to train the parameters of the model.
The actions comprise moving actions and stopping actions; the movement action represents a change of the current observation region; the stop action means that the occlusion area of the current frame has been found, and the search process of the current video frame is stopped, specifically: move up, move down, move left, move right, zoom out, enlarge, thin, flatten, stop.
Suppose the score of the occluded target at time t is S_t and the score at the previous time is S_(t-1). The reward function for a movement action is set to:
R_move(s, s') = +1 if S_t < S_(t-1), and -1 otherwise,
where s and s' represent the current state and the state at the next instant, respectively.
The reward function for the stop action is:
R_stop(s) = +1 if S_t < φ, and -1 otherwise,
where φ is a preset threshold parameter. This reward function shows that if the agent chooses the stop action at the current moment, the similarity between the currently occluded sample and the true sample is calculated; if that similarity is below the threshold, a positive reward is given, otherwise a negative reward is given.
Compared with the prior art, the invention has the following advantages: based on difficult positive sample generation, the invention uses a generative model on the existing training data set to learn the manifold distribution of the target directly from the data, so that a large number of diverse positive samples can be obtained without additional manual annotation.
The generation of difficult positive samples is treated as a sequential decision problem, and a reinforcement learning algorithm is used to automatically learn how to occlude the target, simulating real occlusion situations and thereby obtaining more challenging positive samples.
With the model trained on difficult positive samples, the test time is not noticeably increased, while the robustness and tracking accuracy of the tracking algorithm are improved significantly.
Drawings
FIG. 1 is a schematic diagram of a dense sampling method commonly used in prior art visual tracking;
FIG. 2 is a representation of various challenging factors in a video;
FIG. 3 is a flow chart of a difficult positive sample generation method of the present invention;
FIG. 4 shows samples obtained by constructing and sampling the positive sample manifold according to the present invention;
panel (a) shows the process of encoding a real video frame on the learned target manifold, perturbing the code and then decoding a simulated target image; panel (b) shows the actions selected and performed by the agent at each moment, i.e., the process of occluding the target object with background image blocks;
FIG. 5 shows difficult positive samples obtained by the present invention;
FIG. 6 is a flow chart of the present invention;
FIG. 7 is a diagram illustrating the operation of bounding box transformation in the deep reinforcement learning algorithm of the present invention.
Detailed Description
The following examples are given for the detailed implementation and specific operation of the present invention, but the scope of the present invention is not limited to the following examples.
As shown in Figs. 3 to 7, the method of this embodiment, based on difficult positive sample generation, includes the following steps:
(1) acquiring annotated videos for training a deep tracking model;
(2) for each video in the training data, a variational auto-encoder is used to learn the corresponding manifold (i.e., a positive sample generation network); the network structure of the positive sample generation network consists of fully connected layers, the input of the network is an image matrix flattened into a column vector, the output is the reconstructed image vector, and size normalization is then carried out to obtain a conventional color image.
(3) For the reconstructed positive samples, this embodiment performs conversion into difficult positive samples; specifically, the generation of difficult positive samples is treated as a sequential decision problem, and this part of the network is learned with deep reinforcement learning. The network structure of the difficult positive sample conversion network comprises convolutional layers, pooling layers and fully connected layers. The input of the network is a color image; the agent continuously adjusts the bounding box (i.e., moving, scaling and similar operations) to make the sample difficult to recognize, thereby generating difficult positive samples, and the occluded difficult positive samples are output.
(4) Based on the generated difficult positive samples, this embodiment trains the twin network to match the target image block with candidate image blocks and complete the localization of the target in the current frame, until the whole video is processed.
The difficult positive sample conversion network may include three types of layers: convolutional, pooling and fully connected, as follows:
a convolutional layer (Convolutional Layer), which enhances the original signal features and reduces noise through the convolution operation; the specific convolution operation can be implemented with the prior art;
a pooling layer (Pooling Layer), which exploits the locality of images to reduce the number of features by down-sampling, and can use max pooling, mean pooling, stochastic pooling and similar modes; the specific implementation can follow the prior art;
a fully connected layer (Fully Connected Layer), in which each neuron is connected to every neuron of the next layer; it performs ordinary classification like a conventional Multi-Layer Perceptron (MLP) neural network.
The positive sample image to be occluded is taken as input; the convolution operation is performed from the input layer to the convolutional layer, where each neuron of the convolutional layer can be connected to a local receptive field of a certain size in the input layer, and the features of the image to be occluded are obtained after convolution. The step from the convolutional layer to the pooling layer may be called pooling, with the goal of reducing the number of features of the previous layer. The features obtained after the convolutional and pooling layers are classified by the fully connected layer, and the result is finally output after its computation.
Each output node of the fully connected layer gives the probability of the agent selecting the corresponding action, i.e., the probability that, in the current state, that action should be executed to change the occluding region.
By continuously training the parameters of the positive sample generation network and the difficult positive sample conversion network, ideal difficult positive samples can be obtained; deep learning thus completes the generation of difficult positive samples automatically, without manual participation.
The training process is implemented with the deep learning toolkits Keras and Caffe, and the main parameters are set as follows: for the positive sample generation network, the initial learning rate is 0.001, the optimization algorithm is RMSprop, and the number of training iterations is 20,000; for the difficult positive sample conversion network, the mini-batch of training data is 100, the optimization method is Adam, and the initial learning rate is 1e-6; for the twin network subsequently used for tracking, the initial learning rate is 0.0001, the momentum is 0.9, and the weight decay parameter is 0.0005.
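For reference, these hyper-parameter settings can be summarized in a small configuration block; this is only a restatement of the values listed above, and the dictionary keys are illustrative rather than part of the patent:

```python
# Summary of the training settings stated above; dictionary keys are illustrative.
TRAIN_CONFIG = {
    "positive_sample_generation_net": {"optimizer": "RMSprop", "lr": 1e-3, "iterations": 20000},
    "difficult_positive_conversion_net": {"optimizer": "Adam", "lr": 1e-6, "mini_batch": 100},
    "twin_tracking_net": {"optimizer": "SGD", "lr": 1e-4, "momentum": 0.9, "weight_decay": 5e-4},
}
```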
The embodiment can automatically generate the hard positive sample by using the deep learning network and train the deep tracking algorithm, and the specific operation can include the following steps:
collection of annotated data:
To better train the deep network, a certain amount of tracking video needs to be annotated manually; the annotation consists of finding, in each video frame, the same target object as in the first frame and giving the position of the target object in the current frame.
Preprocessing:
Since the method proposed by this embodiment requires a pre-trained tracking model, the data is preprocessed first and stored as an h5 file readable by the deep neural network. In this process, the discrimination thresholds for positive and negative samples are set as follows: a sampled image block whose overlap with the ground truth reaches 0.7 is regarded as a positive sample, and one whose overlap is below 0.5 is regarded as a negative sample.
Designing a deep neural network:
the network structure may comprise three components, respectively: convolutional layers (convolutional layers), pooling layers (PoolingLayer), and fully connected layers (fullonnectedLayer).
For the positive sample generation network, this embodiment uses two fully connected layers to encode the input image, which is flattened into a column vector; the encoded features are then fed into two fully connected branches to estimate the mean and standard deviation, passed through three fully connected layers, and the reconstructed image is finally output.
For the difficult positive sample conversion network, a sample is input into a pre-trained VGG network, and the action selected by the agent in the current state is output through two fully connected layers; a new occluded sample is obtained by executing the action, and its similarity to the ground truth is computed. If the action is a movement action, a positive reward is given when the similarity decreases, otherwise a negative reward is given; if the action is the stop action, a positive reward is given when the similarity is below a certain threshold, otherwise a negative reward is given.
For the twin network used for tracking, there are two branch networks, used respectively to encode the target to be searched and the candidate search region of the current frame, and the parameters of the two branches are shared. The training of the network is based on positive and negative sample pairs, i.e.: if the overlap of two image blocks is greater than a certain threshold they are regarded as the same image block and given the label 1, otherwise they are regarded as different image blocks and given the label 0. The Margin Contrastive Loss function is used to measure the difference between the model output and the true image pair; this difference can be back-propagated through the network layer by layer to train the parameters of the model.
Training of models
This embodiment may use existing deep network training tools to train the model, such as the Keras toolkit and Caffe. When using Caffe, a solver file can be defined, which specifies how the model is optimized (trained), i.e. the parameter back-propagation algorithm. The key parameters may include the base learning rate, the learning momentum, the weight penalty (weight decay) coefficient, and so on. The base learning rate can be set to 0.0001-0.01, the learning momentum can range from 0.9 to 0.99, and the weight penalty coefficient can range from 0.0001 to 0.001.
In a specific implementation, the three main network modules in this embodiment can operate in batches, identifying and tracking multiple target images at the same time. The three sub-network modules are described below:
the first subnetwork module utilizes a positive sample generation network to carry out positive sample expansion:
the module realizes the learning of the positive sample flow pattern by adopting a variational self-encoder network. This embodiment extracts the target object from the video frame, then unifies its resolution into 64 x 64, and then pulls it into a column vector with dimension 12288(64 x 3). The dimension of the middle fully-connected layer is 512 dimensions, and the dimension of the hidden layer coding is 2. The output dimension of the network after reconstruction is 12288, and then the resolution is adjusted to 64 × 3, i.e. the reconstructed image is obtained. In addition, in this embodiment, the network structure of the diversity self-encoder may also adopt a convolution structure. In order to obtain a cleaner stream, the present embodiment performs stream construction separately for each video. In other words, the present embodiment can perform the learning of the variational self-encoder for each training video by using the obtained target object.
The second sub-network module performs difficult positive sample conversion using the difficult positive sample conversion network:
The resolution of the input image is unified to 224 × 224 and it is input into the VGG network to obtain the feature expression of the corresponding image; the size of the feature map is 512 × 7 × 7 = 25088.
The deep Q-network immediately follows the VGG network; specifically, it is composed of three fully connected layers, whose dimensions are: 1024, 9.
The dimension of the final output corresponds to the length of the action list designed in this embodiment and represents the probability of selecting the corresponding actions.
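A minimal Keras sketch of this deep Q-network is given below. The pre-trained VGG-16 backbone, the 224 × 224 input and the 9-way output follow the text; the single 1024-dimensional hidden layer is taken from the dimensions listed above together with the earlier two-fully-connected-layer description, so the exact depth of the head should be read as an assumption:

```python
# Hedged sketch of the difficult positive sample conversion (deep Q) network.
from keras.applications.vgg16 import VGG16
from keras.layers import Input, Flatten, Dense
from keras.models import Model

state_img = Input(shape=(224, 224, 3))
vgg = VGG16(include_top=False, weights='imagenet', input_tensor=state_img)  # pre-trained feature extractor
features = Flatten()(vgg.output)                  # 512 x 7 x 7 = 25088-dimensional state features
hidden = Dense(1024, activation='relu')(features)
q_values = Dense(9)(hidden)                       # one Q value per action (8 movement actions + stop)
q_network = Model(state_img, q_values)

# Keeping the VGG backbone frozen is an assumption of this sketch, not stated in the text.
for layer in vgg.layers:
    layer.trainable = False
```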
this embodiment considers the difficult positive sample conversion process as a sequence decision process, specifically:
the state is as follows: the present embodiment may normalize the image to 224 × 224, then input the image into the VGG network, and then extract the features of the 8th layer as the state of the current step;
the actions are as follows: there are two types of actions in this embodiment, namely: a moving operation and a stopping operation; the movement action represents a change of the current observation region; the stop action indicates that an occlusion region of the current frame has been found and the search process of the current video frame is stopped. In the present embodiment, 8 moving actions and one stopping action are designed, as shown in fig. 7, which respectively are: move up, move down, move left, move right, zoom out, enlarge, thin, flatten, stop.
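For illustration, the sketch below applies each of the nine actions to an occluder bounding box. The action names follow the list above, but the step size (10% of the box size) and the exact geometric effect of "zoom out", "enlarge", "thin" and "flatten" are assumptions, since the text only names the actions:

```python
# Hedged sketch of the bounding-box transformations behind the nine actions (cf. Fig. 7).
ALPHA = 0.1   # assumed step size as a fraction of the current box size

def apply_action(box, action):
    """box = [x, y, w, h] of the occluding region; returns the adjusted box, or None for 'stop'."""
    x, y, w, h = box
    dx, dy = ALPHA * w, ALPHA * h
    moves = {
        "move_up":    [x, y - dy, w, h],
        "move_down":  [x, y + dy, w, h],
        "move_left":  [x - dx, y, w, h],
        "move_right": [x + dx, y, w, h],
        "zoom_out":   [x + dx / 2, y + dy / 2, w - dx, h - dy],   # shrink, keeping the centre
        "enlarge":    [x - dx / 2, y - dy / 2, w + dx, h + dy],
        "thin":       [x + dx / 2, y, w - dx, h],                 # reduce width only
        "flatten":    [x, y + dy / 2, w, h - dy],                 # reduce height only
    }
    return None if action == "stop" else moves[action]
```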
Rewarding: the goal of the agent in this embodiment is to receive the maximum reward, so the design of the reward function will be the key to the success of the strategy. Suppose that the score of an occluded object at time t is StThe score at the previous time is St-1The reward function for a movement action may be set as:
Figure BDA0001656565340000071
where s and s' represent the current and next instant states, respectively.
The stop action has no next-moment state, so this embodiment designs a separate reward function for it:
R_stop(s) = +1 if S_t < φ, and -1 otherwise,
where φ is a preset threshold parameter. This reward function states that if the agent chooses the stop action at the current moment, the similarity between the currently occluded sample and the true sample is calculated; a positive reward is given if the similarity is below the threshold, otherwise a negative reward is given.
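The two reward functions can be transcribed directly (a sketch; S_t is the matching score of the occluded sample against the true target, and phi is the preset threshold):

```python
# Hedged transcription of the reward functions above.
def movement_reward(score_t, score_t_prev):
    """Positive reward when the occlusion lowers the matching score, negative otherwise."""
    return 1.0 if score_t < score_t_prev else -1.0

def stop_reward(score_t, phi):
    """Reward for the stop action: positive when the occluded sample already scores below phi."""
    return 1.0 if score_t < phi else -1.0
```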
The agent continuously interacts with the environment to obtain a large number of training samples, which are stored in a replay (playback) unit; mini-batch samples are then drawn from it to learn the occlusion strategy. Besides the replay unit, another important way to break the correlation between data is the use of a target network: specifically, the model parameters are copied every τ steps, and the state at the current time and the state at the next time are input into the online network and the target network, respectively, to perceive the environment.
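A minimal sketch of the replay unit and the periodic target-network copy is shown below; the buffer capacity and the copy interval tau are illustrative values, not taken from the patent:

```python
# Hedged sketch: experience replay and target network update for the occlusion agent.
import random
from collections import deque

replay_buffer = deque(maxlen=50000)   # the "playback unit"; capacity is an assumption
TAU = 1000                            # copy the online weights to the target network every tau steps

def store_transition(state, action, reward, next_state, done):
    replay_buffer.append((state, action, reward, next_state, done))

def sample_minibatch(batch_size=100):
    return random.sample(list(replay_buffer), batch_size)

def maybe_update_target(step, online_net, target_net):
    """online_net perceives the current state; target_net evaluates the next state."""
    if step % TAU == 0:
        target_net.set_weights(online_net.get_weights())
```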
The third sub-network module trains the deep tracking model using the twin network.
The video frame to be tracked is convolved with the kernels of the first convolutional layer: the kernel size can be 3 × 3, the stride can be set to 1 pixel, the number of output feature maps can be 64, and the number of convolution kernel parameters is 64 × 3 × 3 × 3 = 1728;
through the first pooling layer (Pooling Layer), the pooling range size (kernel size) can be 2 × 2, moving 2 pixels each time;
through the second convolutional layer, the output of the previous layer is convolved with the kernels: the kernel size can be 3 × 3, the stride can be set to 1 pixel, the number of output feature maps can be 128, and the number of convolution kernel parameters is 128 × 3 × 3 × 64 = 73728;
through the second pooling layer (Pooling Layer), the pooling range size (kernel size) can be 2 × 2, moving 2 pixels each time;
through the third convolutional layer, the output of the previous layer is convolved with the kernels: the kernel size can be 3 × 3, the stride can be set to 1 pixel, the number of output feature maps can be 256, and the number of convolution kernel parameters is 256 × 3 × 3 × 128 = 294912;
through the fourth convolutional layer, the output of the previous layer is convolved with the kernels: the kernel size can be 3 × 3, the stride can be set to 1 pixel, the number of output feature maps can be 512, and the number of convolution kernel parameters is 512 × 3 × 3 × 256 = 1179648;
through the fifth convolutional layer, the output of the previous layer is convolved with the kernels: the kernel size can be 3 × 3, the stride can be set to 1 pixel, the number of output feature maps can be 512, and the number of convolution kernel parameters is 512 × 3 × 3 × 512 = 2359296;
through a region-of-interest pooling layer (ROI Pooling Layer), feature maps of different sizes are mapped into vectors of uniform dimension and then input into the next fully connected layer;
through the fully connected layer (Fully Connected Layer), the number of nodes can be 4096, and the number of related weight parameters can be 4096 × 4096 = 16777216;
the obtained features are normalized through an L2 normalization layer;
finally, a distance measure is computed between the features obtained from the two input branches, using the Margin Contrastive Loss, and the similarity between the two given image blocks is output.
In the implementation, each convolutional layer can be followed by a nonlinearity, and each fully connected layer can be followed by a nonlinearity and a dropout layer to avoid overfitting.
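A minimal Keras sketch of the matching part is given below: the two branches share one encoder, their L2-normalised features are compared by Euclidean distance, and the Margin Contrastive Loss described above is applied. The encoder constructor `build_branch` and the margin value are placeholders/assumptions:

```python
# Hedged sketch: twin (Siamese) matching with a margin contrastive loss.
from keras import backend as K
from keras.layers import Input, Lambda
from keras.models import Model

def euclidean_distance(tensors):
    a, b = tensors
    return K.sqrt(K.maximum(K.sum(K.square(a - b), axis=1, keepdims=True), K.epsilon()))

def margin_contrastive_loss(y_true, dist, margin=1.0):
    # label 1: same target (pull features together); label 0: different (push apart by at least `margin`)
    return K.mean(y_true * K.square(dist) +
                  (1.0 - y_true) * K.square(K.maximum(margin - dist, 0.0)))

def build_siamese(build_branch, input_shape):
    """build_branch returns the shared encoder, e.g. the conv / pool / fc / L2-norm stack above."""
    branch = build_branch(input_shape)
    target_patch = Input(shape=input_shape)
    candidate_patch = Input(shape=input_shape)
    dist = Lambda(euclidean_distance)([branch(target_patch), branch(candidate_patch)])
    model = Model([target_patch, candidate_patch], dist)
    model.compile(optimizer='sgd', loss=margin_contrastive_loss)  # SGD per the settings above
    return model
```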
By adopting the model provided by this embodiment, the robustness of the model can be improved significantly, and good experimental results are obtained on multiple public data sets.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A target tracking method based on difficult positive sample generation is characterized by comprising the following steps:
(1) acquiring annotated videos for training a deep tracking model;
(2) for each video in the training data, using a variational auto-encoder to learn the corresponding manifold, i.e., a positive sample generation network, and fine-tuning the encoding of the input image to generate a large number of positive samples;
(3) inputting the positive samples into a difficult positive sample conversion network and training an agent to learn to occlude the target object with background image blocks, the agent continuously adjusting the bounding box so that the samples become difficult to recognize, thereby generating difficult positive samples, and outputting the occluded difficult positive samples;
(4) based on the generated difficult positive samples, training a twin network to match the target image block with candidate image blocks and complete the localization of the target in the current frame, until the whole video is processed.
2. The method for tracking a target based on difficult positive sample generation according to claim 1, wherein in step (1) a certain amount of tracking video is annotated manually, the annotation comprising finding, in each video frame, the same target object as in the first frame and giving the position of the target object in the current frame.
3. The method for tracking a target based on difficult positive sample generation according to claim 1, wherein in step (2) the data is preprocessed and stored as an h5 file readable by the deep neural network; a sampled image block whose overlap with the ground truth exceeds a discrimination threshold is a positive sample, and one whose overlap is below a preset threshold is a negative sample.
4. The target tracking method based on difficult positive sample generation according to claim 1, wherein the difficult positive sample conversion network comprises convolutional layers, which enhance the original signal features and reduce noise through the convolution operation;
pooling layers, which exploit the locality of images to reduce the number of features by down-sampling;
and fully connected layers, in which each neuron is connected to every neuron of the next layer and which perform ordinary classification.
5. The method for tracking a target based on difficult positive sample generation according to claim 4, wherein the positive sample image to be occluded is taken as input; the convolution operation is performed from the input layer to the convolutional layer, where each neuron of the convolutional layer can be connected to a local receptive field of a certain size in the input layer, and the features of the image to be occluded are obtained after convolution; the step from the convolutional layer to the pooling layer reduces the number of features of the previous layer; the features obtained after the convolutional and pooling layers are classified by the fully connected layer, and the result is finally output after its computation; each output node of the fully connected layer gives the probability of the agent selecting the corresponding action, i.e., the probability that, in the current state, that action should be executed to change the occluding region.
6. The method for tracking a target based on difficult positive sample generation according to claim 1, wherein the training process adopts deep learning: for the positive sample generation network, the initial learning rate is 0.001, the optimization algorithm is RMSprop, and the number of training iterations is 20,000; for the difficult positive sample conversion network, the mini-batch of training data is 100, the optimization method is Adam, and the initial learning rate is 1e-6; for the twin network subsequently used for tracking, the initial learning rate is 0.0001, the momentum is 0.9, and the weight decay parameter is 0.0005.
7. The method for tracking the target based on the generation of the difficult positive sample as claimed in claim 6, wherein the deep learning process is as follows:
for the positive sample generation network, two fully connected layers are used to encode the input image, which is flattened into a column vector; the encoded features are then fed into two fully connected branches to estimate the mean and standard deviation, passed through three fully connected layers, and the reconstructed image is finally output;
for the difficult positive sample conversion network, a sample is input into a pre-trained VGG network, and the action selected by the agent in the current state is output through two fully connected layers; a new occluded sample is obtained by executing the action, and its similarity to the ground truth is computed; if the action is a movement action, a positive reward is given when the similarity decreases, otherwise a negative reward is given;
if the action is the stop action, a positive reward is given when the similarity is below a certain threshold, otherwise a negative reward is given;
for the twin network used for tracking, there are two branch networks, used respectively to encode the target to be searched and the candidate search region of the current frame, and the parameters of the two branches are shared;
the training of the network is based on positive and negative sample pairs, i.e.: if the overlap of two image blocks is greater than a certain threshold they are regarded as the same image block and given the label 1, otherwise they are regarded as different image blocks and given the label 0;
and the Margin Contrastive Loss function is used to measure the difference between the output of the model and the ground-truth pair; this difference can be back-propagated through the network layer by layer to train the parameters of the model.
8. The method of claim 5, wherein the actions comprise a moving action and a stopping action; the movement action represents a change of the current observation region; the stop action means that the occlusion area of the current frame has been found, and the search process of the current video frame is stopped, specifically: move up, move down, move left, move right, zoom out, enlarge, thin, flatten, stop.
9. The method for tracking a target based on difficult positive sample generation according to claim 8, wherein, assuming the score of the occluded object at time t is S_t and the score at the previous time is S_(t-1), the reward function for a movement action is set to:
R_move(s, s') = +1 if S_t < S_(t-1), and -1 otherwise,
where s represents the state at time t-1 and s' represents the state at time t.
10. The method for tracking a target based on difficult positive sample generation according to claim 8, wherein the reward function for the stop action is:
R_stop(s) = +1 if S_t < φ, and -1 otherwise,
wherein φ is a preset threshold parameter, and the reward function shows that if the agent chooses the stop action at the current moment, the similarity between the currently occluded sample and the true sample is calculated; if the similarity is below the threshold a positive reward is given, otherwise a negative reward is given.
CN201810443211.0A 2018-05-10 2018-05-10 Target tracking method based on difficult positive sample generation Active CN108596958B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810443211.0A CN108596958B (en) 2018-05-10 2018-05-10 Target tracking method based on difficult positive sample generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810443211.0A CN108596958B (en) 2018-05-10 2018-05-10 Target tracking method based on difficult positive sample generation

Publications (2)

Publication Number Publication Date
CN108596958A CN108596958A (en) 2018-09-28
CN108596958B true CN108596958B (en) 2021-06-04

Family

ID=63636958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810443211.0A Active CN108596958B (en) 2018-05-10 2018-05-10 Target tracking method based on difficult positive sample generation

Country Status (1)

Country Link
CN (1) CN108596958B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614979B (en) * 2018-10-11 2023-05-02 北京大学 Data augmentation method and image classification method based on selection and generation
CN109559358B (en) * 2018-10-22 2023-07-04 天津大学 Image sample up-sampling method based on convolution self-coding
CN111192288B (en) * 2018-11-14 2023-08-04 天津大学青岛海洋技术研究院 Target tracking algorithm based on deformation sample generation network
CN109800689B (en) * 2019-01-04 2022-03-29 西南交通大学 Target tracking method based on space-time feature fusion learning
CN109885482A (en) * 2019-01-16 2019-06-14 重庆大学 Software Defects Predict Methods based on the study of few sample data
CN109919183B (en) * 2019-01-24 2020-12-18 北京大学 Image identification method, device and equipment based on small samples and storage medium
CN109753975B (en) * 2019-02-02 2021-03-09 杭州睿琪软件有限公司 Training sample obtaining method and device, electronic equipment and storage medium
CN110084146B (en) * 2019-04-08 2021-06-04 清华大学 Pedestrian detection method and device based on shielding perception self-supervision learning
CN110349176B (en) * 2019-06-28 2021-04-06 华中科技大学 Target tracking method and system based on triple convolutional network and perceptual interference learning
CN110415271B (en) * 2019-06-28 2022-06-07 武汉大学 Appearance diversity-based method for tracking generation twin-resisting network target
CN110610197B (en) * 2019-08-19 2022-09-27 北京迈格威科技有限公司 Method and device for mining difficult sample and training model and electronic equipment
CN110852285B (en) * 2019-11-14 2023-04-18 腾讯科技(深圳)有限公司 Object detection method and device, computer equipment and storage medium
CN110991337B (en) * 2019-12-02 2023-08-25 山东浪潮科学研究院有限公司 Vehicle detection method based on self-adaptive two-way detection network
CN111565318A (en) * 2020-05-06 2020-08-21 中国科学院重庆绿色智能技术研究院 Video compression method based on sparse samples
CN111862158B (en) * 2020-07-21 2023-08-29 湖南师范大学 Staged target tracking method, device, terminal and readable storage medium
CN112801182B (en) * 2021-01-27 2022-11-04 安徽大学 RGBT target tracking method based on difficult sample perception
CN112784929B (en) * 2021-03-14 2023-03-28 西北工业大学 Small sample image classification method and device based on double-element group expansion
CN113077491B (en) * 2021-04-02 2023-05-02 安徽大学 RGBT target tracking method based on cross-modal sharing and specific representation form
CN113129337B (en) * 2021-04-14 2022-07-19 桂林电子科技大学 Background perception tracking method, computer readable storage medium and computer device
CN113258996B (en) * 2021-07-05 2021-09-17 南京华脉科技股份有限公司 Optical cable monitoring method in submarine cable production and laying process based on artificial intelligence


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004038602A1 (en) * 2002-10-24 2004-05-06 Warner-Lambert Company, Llc Integrated spectral data processing, data mining, and modeling system for use in diverse screening and biomarker discovery applications
CN102903122A (en) * 2012-09-13 2013-01-30 西北工业大学 Video object tracking method based on feature optical flow and online ensemble learning
CN103559237A (en) * 2013-10-25 2014-02-05 南京大学 Semi-automatic image annotation sample generating method based on target tracking
WO2016142285A1 (en) * 2015-03-06 2016-09-15 Thomson Licensing Method and apparatus for image search using sparsifying analysis operators

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Chao Wang, et al., "A system of automated training sample generation for visual-based car detection", 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012-12-24, pp. 4169-4167 *
C. Xu, et al., "Robust visual tracking via online multiple instance learning with fisher information", Pattern Recognition, 2015-12-31, vol. 48, no. 12, pp. 3917-3926 *
Ao Wei, et al., "Improved algorithm for the generalization ability of extreme learning machine based on simulated sample generation" (in Chinese), Journal of Nanjing University, 2018-01-31, vol. 54, no. 1, pp. 75-84 *

Also Published As

Publication number Publication date
CN108596958A (en) 2018-09-28

Similar Documents

Publication Publication Date Title
CN108596958B (en) Target tracking method based on difficult positive sample generation
Dai et al. Human action recognition using two-stream attention based LSTM networks
CN109961034B (en) Video target detection method based on convolution gating cyclic neural unit
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN109743642B (en) Video abstract generation method based on hierarchical recurrent neural network
CN113688723A (en) Infrared image pedestrian target detection method based on improved YOLOv5
CN110097028B (en) Crowd abnormal event detection method based on three-dimensional pyramid image generation network
CN112446342B (en) Key frame recognition model training method, recognition method and device
CN113313123B (en) Glance path prediction method based on semantic inference
CN109461177B (en) Monocular image depth prediction method based on neural network
CN115713679A (en) Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map
CN112288627A (en) Recognition-oriented low-resolution face image super-resolution method
CN115359372A (en) Unmanned aerial vehicle video moving object detection method based on optical flow network
CN113807356A (en) End-to-end low visibility image semantic segmentation method
Cai et al. Multiscale attentive image de-raining networks via neural architecture search
CN116993760A (en) Gesture segmentation method, system, device and medium based on graph convolution and attention mechanism
CN117197632A (en) Transformer-based electron microscope pollen image target detection method
CN111242003A (en) Video salient object detection method based on multi-scale constrained self-attention mechanism
CN116148864A (en) Radar echo extrapolation method based on DyConvGRU and Unet prediction refinement structure
Wang et al. Research on the multi-scale network crowd density estimation algorithm based on the attention mechanism
Zhang [Retracted] An Intelligent and Fast Dance Action Recognition Model Using Two‐Dimensional Convolution Network Method
CN115294353A (en) Crowd scene image subtitle description method based on multi-layer attribute guidance
CN113642596A (en) Brain network classification method based on community detection and double-path self-coding
Yang et al. Instance-aware detailed action labeling in videos
CN117593666B (en) Geomagnetic station data prediction method and system for aurora image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant