CN110009661B

CN110009661B - Video target tracking method

Info

Publication number: CN110009661B
Application number: CN201910249323.7A
Authority: CN
Inventors: 卢湖川; 高凯
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2019-03-29
Filing date: 2019-03-29
Publication date: 2022-03-29
Anticipated expiration: 2039-03-29
Also published as: CN110009661A

Abstract

The invention belongs to the technical field of image video target tracking, and provides a video target tracking method, which can continuously track a specific single target in a video and relates to the related knowledge of image processing. Firstly, a rapid target tracker is trained by using a deep mutual learning and knowledge distillation method. Secondly, when each frame comes, many particles are scattered around the previous frame, and the distribution of the particles is random. We then select a large image area that can contain all the particles. And sending the relative positions of the image area and the particles to a target tracker to obtain the final score, and selecting the result with the highest score. And the final result is used as a final result after the surrounding frame regression. Finally, the tracker is updated online each time tracking fails or after a certain time. The invention has the advantages that the traditional sampling method is changed, the sampling in the image layer is changed into the sampling in the characteristic layer, the speed is greatly improved, and the speed is improved under the condition of ensuring the precision.

Description

Video target tracking method

Technical Field

The invention belongs to the technical field of image video target tracking, can continuously track a specific single target in a video, and relates to the related knowledge of image processing.

Background

With the continuous development of image processing technology, video target tracking plays an important role in daily life due to the practicability of the video target tracking.

Video target tracking is mainly divided into two main categories: particle filter methods and related filter methods. The correlation filter method is to use the target characteristics to perform correlation matching around the previous frame target, wherein the obtained final corresponding highest place is the current frame target position. In the method, a High-speed tracking with kernel correlation filters' method is published in PAMI journal in 2014 by j.f. henriques et al, so that the calculation can be converted into a fourier domain, the calculation speed is greatly accelerated, and the speed real-time property is High. Then Danelljan M et al published in the ICCV conference in 2015 for Learning spatial Regularized Correlation Filters for Visual Tracking, which suppresses the edge information of the filter, so that the filter can more accurately find the target position, thereby further improving the precision. In the ECCV conference, Danelljan M et al in 2016 proposed Beyond correction Filters, Learning Continuous conversion Operators for Visual Tracking, which interpolates feature maps with different resolutions into a Continuous spatial domain through cubic interpolation, and then applies a Hessian matrix to obtain a target position with sub-pixel precision. In 2017, Danelljan M et al further proposed an ECO (Efficient Convolition Operators for Tracking) in the CVPR conference, which adopts Convolution operation of factorization and simplifies feature extraction, and finally obtains a faster and stronger tracker. The particle filter method is to spread a large number of particles around the target position of the previous frame and to judge whether the image block in the particles is the target to know the current position of the target. Because the scattered particles are enough, the obtained position information is more accurate, and the precision of the particle filter method is generally high. However, because a large number of particles need to be scattered, the calculation amount is large, and the speed of the method is generally slow, so that the expansibility of the method is poor. A typical representation is the Learning Multi-Domain conference for Visual Tracking published by Nam H et al in 2016 at the CVPR conference.

Although current particle filter algorithms achieve good results, there are several problems to be solved. First, the accuracy of the particle filter algorithm is not apprehended, and the particle filter algorithm exhibits an elegance in the accuracy of standing for ao due to the vigorous development of the correlation filter algorithm. Secondly, the conventional particle filter needs to make a judgment for each image block, and in order to maintain the accuracy, the particles must be enough, so that the particle filter algorithm is slow.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: for any given video, the target can be continuously tracked under the condition that only the position of the target given by the first frame is known and no other information exists, and the tracking is kept in the subsequent video sequence. Moreover, the method can also process the situations of complex scenes, light changes, similar objects, occlusion and the like in the video, which means that even if the object is in a complex scene and the situation that the object is occluded by the similar object occurs, the object can still be tracked.

The technical scheme of the invention is based on an observed conclusion that: in video target tracking, the position offset of a target in two frames of images before and after the target does not change greatly, and the shape change is not obvious. Therefore, the position of the target in the current image frame can be found by scattering particles around the target in the previous image frame and judging the image blocks in the particles, so that the target is continuously tracked. Moreover, the calculation complexity of the particle filter is very high, the tracking speed is very low and only 1FPS can be achieved, the traditional sampling method is changed, sampling on an image layer is changed into sampling on a characteristic layer, the speed is greatly improved, and the speed is improved under the condition of ensuring the precision. The method comprises the following specific steps:

a method for tracking video target comprises the following steps:

firstly, an off-line training stage:

step 1: training a classifier network by using a public database, wherein the input of the classifier network is an image block, and the output of the classifier network is the score of the image block, wherein the output of the image block is 1 if the image block is a target foreground, and the output of the image block is 0 if the image block is a background; the classifier network is used for distinguishing whether an input image block is a target foreground or a background and scoring the image block;

step 2: the method of deep mutual learning is adopted, two same classifier networks are used for training at the same time, connection is established in a network layer before a result, and KL divergence is used for mutual supervision, so that the two classifier networks obtain stronger classification capability;

and step 3: a knowledge distillation method is adopted, and the classifier network obtained by training is used as a teacher for guiding and training a new classifier network; the input of the new classifier network is the coordinates and shapes of the image blocks and the particles, and the output is the score of each particle frame; in the training process, each layer of the classifier network is connected, and MSE loss is used for mutual supervision, so that the new classifier network learns the capacity of the original classifier network;

and 4, step 4: similar to the step 2, deep mutual learning training is carried out on the new classifier network learned in the step 3 as a basis, and connection is established in a network layer before the result for mutual supervision, so that the new classifier network has excellent speed and obtains more accurate precision to obtain a final classifier network;

secondly, an online tracking stage:

and 5: for a given first frame image, a plurality of particles are taken around a target truth value, and the trained classifier network in the step 4 is fine-tuned by utilizing the obtained target truth value, so that the classifier can better adapt to the video; training a surrounding frame regressor capable of finely adjusting the final result according to the target truth values, wherein the input of the surrounding frame regressor is the characteristic of the particles, and the output is the adjusted target position;

step 6: scattering a large number of particles around a first frame image of a given video, wherein the particles are different in size and shape; because the target positions in two adjacent frames do not change suddenly, some particles in the scattered particles can well surround the target object;

and 7: inputting a large image block containing all particles and the coordinates of the particles into a classifier network, obtaining the score of each particle, and obtaining the five particles with the highest score, as shown in formula (8),

the score of the ith particle in the t frame of the video sequence is shown, and a plurality of particles with the highest score are selected

Taking the average value; sending the average value into a surrounding frame regressor, and finally taking an output result of the surrounding frame regressor as a tracking result;

and 8: storing the classifier network characteristics of the output result obtained each time, and when the score of the classifier network is lower than 0.5, finely adjusting the classifier network by using the stored characteristics and expanding resampling; when the score of the classifier network is low, the classifier network is not suitable for the frame at the moment, retraining is needed to better adapt to the target, or the target position moves greatly, and sampling is needed to obtain the target position at the moment;

and step 9: fine-tuning the output result by using the stored network characteristics every 20 frames; since the shape of the target changes significantly after too long, it cannot be tracked again using the result of the initial training, and a new classifier network needs to be trained to better adapt to the target.

The invention has the beneficial effects that: the method can accurately and quickly track the single target, and has excellent performance even in poor external environment conditions. Compared with common particle filtering, the method has the advantages that the speed is greatly improved and the real-time performance is guaranteed under the condition of almost equal precision.

Drawings

FIG. 1 is a block diagram of offline training. Fig. 1(a) shows two target trackers performing deep mutual learning training, where the upper and lower networks are the same. Fig. 1(b) is a diagram for guiding the training of a fast target tracker (lower target tracker) by using a trained target tracker (upper target tracker) as a teacher and adopting a knowledge distillation method. Fig. 1(c) shows two fast target trackers performing deep mutual learning training, and the upper and lower networks are the same.

Fig. 2 is the result of the target tracker on some videos. The first picture in each row is the first frame of the video. Where the green bounding box is the true value and the red bounding box is the trace result of our invention.

Detailed Description

The following further describes the specific embodiments of the present invention in combination with the technical solutions.

A method for tracking video target comprises the following steps:

firstly, an off-line training stage:

step 1: training a classifier network by using a public database, wherein the input of the classifier network is an image block, and the output of the classifier network is the score of the image block, wherein the output of the image block is 1 if the image block is a target foreground, and the output of the image block is 0 if the image block is a background; the classifier network is used for distinguishing input image blocksThe target foreground or the background is still the background, and the image blocks are scored; as in the formula (1), wherein,

representing the ith image block taken at the t-th frame of the video sequence,

representing image blocks

In the features of class m in the classifier network 1, m takes values of 1 and 2;

represents the output of class m in the softmax layer in the classifier network 1; equation (2) represents the supervision loss of a classifier network, where,

the true value label of the ith image block obtained from the t frame in the video sequence is represented, when the true value label is the same as the type output, the result is 1, otherwise, the result is 0;

represents the loss of the classifier network 1 in taking N image blocks in the image sequence;

step 2: training two same classifier networks simultaneously by adopting a deep mutual learning method, establishing connection in a network layer before a result, and performing mutual supervision by using KL divergenceThereby enabling the two classifier networks to obtain stronger classification capability; formula (3) D_KL(p₂‖p₁) Representing mutual supervision by KL divergence, wherein,

respectively representing the m-th class output in the softmax layer in the classifier network 1 and the classifier network 2, wherein the KL divergence is calculated by taking N image blocks from the image sequence; formula (4)

Respectively representing the final loss of the classifier network 1, the classifier network 2, wherein

Respectively representing the loss of N image blocks taken by the classifier network 1 and the classifier network 2 in the image sequence; lambda [ alpha ]₁、λ₂Is a hyper-parameter for adjusting the relationship between losses;

and step 3: a knowledge distillation method is adopted, and the classifier network obtained by training is used as a teacher for guiding and training a new classifier network; the input of the new classifier network is the coordinates and shapes of the image blocks and the particles, and the output is the score of each particle frame; in the training process, each layer of the classifier network is connected, and MSE loss is used for mutual supervision, so that the new classifier network learns the capacity of the original classifier network; as shown in the formula (5), in the formula,

respectively representing image blocks

In the classifier network Θ₁、Θ₂The output of the coordinate k in the ith layer, W, H respectively represent the width and height of the output of the ith layer in the classifier network,

representing an MSE loss function of the l layer of the classifier network; formula (6)

Respectively representing a classifier network Θ₁、Θ₂Network MSE loss obtained by superposing L layer loss, wherein alpha and beta are hyper-parameters and are used for adjusting loss ratio; the final supervision loss is the superposition of classification loss and MSE loss as shown in the following formula (7);

secondly, an online tracking stage:

Claims

1. A method for tracking video target is characterized by comprising the following steps:

firstly, an off-line training stage:

step 1: training a classifier network by using a public database, wherein the input of the classifier network is an image block, and the output of the classifier network is the score of the image block, wherein the output of the image block is 1 if the image block is a target foreground, and the output of the image block is 0 if the image block is a background; the classifier network is used for distinguishing whether an input image block is a target foreground or a background and scoring the image block; as in the formula (1), wherein,

representing the ith image block taken at the t-th frame of the video sequence,

representing image blocks

shows a classifier network 1 in the figureLoss of N image blocks in the image sequence;

step 2: using deep mutual learning method, training is carried out simultaneously by two same classifier networks

Establishing connection in a network layer before the result, and performing mutual supervision by using KL divergence, so that two classifier networks obtain stronger classification capability; formula (3) D_KL(p₂‖p₁) Representing mutual supervision by KL divergence, wherein,

Respectively representing the final supervision loss of the classifier network 1, 2, wherein

and step 3: a knowledge distillation method is adopted, and the classifier network obtained by training is used as a teacher for guiding and training a new classifier network; the input of the new classifier network is the coordinates, shape and image blocks of the particles, and the output is the score of each particle box; in the training process, each layer of the classifier network is connected, and MSE loss is used for mutual supervision, so that the new classifier network learns the capacity of the original classifier network; as shown in the formula (5), in the formula,

respectively representing image blocks

secondly, an online tracking stage:

and 7: inputting a large image block containing all particles into the classifier network, inputting the coordinates of all particles into the classifier network to obtain the score of each particle, selecting the five particles with the highest score, as in equation (8),

the score of the ith particle in the t frame of the video sequence is shown, and five particles with the highest score are selected