CN113963022B - Multi-outlet full convolution network target tracking method based on knowledge distillation

Info

Publication number: CN113963022B
Application number: CN202111221017.6A
Authority: CN
Inventors: 邬向前; 卜巍; 马丁
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2021-10-20
Filing date: 2021-10-20
Publication date: 2023-08-18
Anticipated expiration: 2041-10-20
Also published as: CN113963022A

Abstract

The invention discloses a target tracking method using a multi-outlet fully convolutional network based on knowledge distillation, comprising the following steps: step one, constructing a multi-outlet fully convolutional network based on knowledge distillation; and step two, training the multiple outlets based on knowledge distillation. The invention provides a knowledge-distillation-based multi-outlet fully convolutional structure for classification-based tracking. Through knowledge distillation, the earlier outlets are encouraged to imitate the probability outputs of the later outlets, which improves the discrimination capability of the earlier outlets. Multiple RoIAlign layers extract region features at different scales, and these region features are fused at each outlet to further improve discrimination. Different kinds of attention modules capture different target-specific information, improving the ability to distinguish the target from the background and from distractors. The invention achieves higher tracking accuracy while maintaining a relatively high processing speed.

Description

Multi-outlet full convolution network target tracking method based on knowledge distillation

Technical Field

The invention relates to a target tracking method, in particular to a target tracking method using a multi-outlet fully convolutional network based on knowledge distillation.

Background

Convolutional Neural Networks (CNNs) have been successfully applied to visual target tracking tasks by virtue of their advantages in extracting high-level semantic feature representations. However, although CNN-based tracking methods can achieve good positioning accuracy, the processing speed of most methods is slow.

Disclosure of Invention

In order to better balance the speed and the precision of a CNN-based tracker, the invention provides a target tracking method of a multi-outlet full convolution network based on knowledge distillation.

The object of the invention is achieved through the following technical scheme:

A target tracking method using a multi-outlet fully convolutional network based on knowledge distillation first selects the first three convolution layers of a VGG-M model pretrained on ImageNet and embeds two MIN modules after the first and second convolution layers, respectively, to increase the nonlinear representation of the features and to alleviate the vanishing gradients caused by ReLU. These three convolution layers and two MIN modules form a base network that extracts the feature representation of an input candidate sample. Then, three attention modules are introduced into the base network: two residual attention modules and one channel attention module. Finally, three outlets are set in the base network, corresponding to video frames of three different difficulty levels. The three outlets have the same structure: one RoIAlign layer for extracting candidate region features, and two convolution layers (conv_exit_1 and conv_exit_2) for classifying candidate regions. The method specifically comprises the following steps:

step one, constructing a multi-outlet full convolution network based on knowledge distillation, wherein the specific construction steps are as follows:

(1) Selecting the first three convolution layers of the pretrained VGG-M network and embedding two MIN modules after the first and second convolution layers, respectively, to increase the nonlinearity of the feature representation and to mitigate the effect of vanishing gradients; the three convolution layers and the two MIN modules together form the base network;

the overall flow of the MIN module is as follows:

where $x_{i,j}$ is the input centered on coordinates $(i, j)$, $ch$ is the channel index of feature $F$, and $w$ and $b$ are the feature weights and biases, respectively; $F$ is constructed by taking the maximum over $k$ maxout hidden pieces; the maxout unit acts as a max pooling layer across channels, selecting the maximum output as the input to the next layer; in addition, a batch normalization (BN) layer is introduced to avoid the effects of differences in data distribution;

(2) On the basis of the base network, three attention modules are added to increase the discrimination capability of the feature representation: two residual attention modules and one channel attention module. The channel attention module is added after the second residual attention module to enhance the sensitivity of the channels to the distinction between the target and the background. It takes the feature $F$ as its input, removes the spatial information of $F$ through a global pooling operation, obtains the channel dependencies through two fully connected layers, and computes the channel weight $w_c$ using a sigmoid function. The output $F^C(x)$ is obtained by multiplying $F(x)$ by the channel weight $w_c$:

$F^C(x) = w_c \cdot F(x)$;

the mathematical expression of the residual attention module is as follows:

where the activation uses the sigmoid function, $F^R(x)$ is the residual attention feature, and the two operator symbols in the expression denote element-wise multiplication and element-wise addition, respectively;

(3) Three outlets are arranged in the network, each with the same structure: a region feature extraction layer that extracts the features of each RoI, and two convolution layers, Conv_Exit_1 and Conv_Exit_2, that classify candidate samples as target or background;

step two, training a plurality of outlets based on knowledge distillation, wherein the specific steps are as follows:

(1) Given a teacher classifier $t$ and a student classifier $s$ learned from $t$, the learning process is optimized by minimizing the cross entropy of their outputs:

$[s^{1/temp}(x)]_c = \mathrm{softmax}(s(x)/temp)$,

$[t^{1/temp}(x)]_c = \mathrm{softmax}(t(x)/temp)$,

where $t(x)$ and $s(x)$ are the predictions of $t$ and $s$, respectively, $temp$ is a temperature parameter, $[t^{1/temp}(x)]_c$ and $[s^{1/temp}(x)]_c$ are the soft predictions of $t$ and $s$, respectively, and $C$ is the number of categories;

(2) The whole model is optimized by minimizing the classification loss $L_{cls}$ and the distillation loss $L_{dis}$ of the multi-outlet structure:

$L = L_{cls} + a L_{dis}$,

where $a$ is a hyperparameter that balances the two losses, and $L_{dis}$ is defined as follows:

where $Ex$ is the number of outlets, $T(e) \in Ex$ is the set of teacher outlets, and $cf(\cdot)$ is the classifier corresponding to each outlet.

Compared with the prior art, the invention has the following advantages:

1. The invention provides a knowledge-distillation-based multi-outlet fully convolutional structure for classification-based tracking. Through knowledge distillation, the earlier outlets are encouraged to imitate the probability outputs of the later outlets, which improves the discrimination capability of the earlier outlets.

2. The invention improves discrimination by extracting region features at different scales with multiple RoIAlign layers and fusing these region features at each outlet.

3. The invention uses different kinds of attention modules to capture different target-specific information, improving the ability to distinguish the target from the background and from distractors.

4. Compared with mainstream classification-based tracking methods, the method of the invention achieves higher tracking accuracy and a relatively high processing speed.

Drawings

FIG. 1 is a flow chart of the target tracking method using a multi-outlet fully convolutional network based on knowledge distillation according to the present invention;

FIG. 2 is an example of a simple, medium, and difficult frame;

FIG. 3 is a graph of output statistics for each outlet;

FIG. 4 is a comparison of the method of the present invention with other mainstream target tracking methods on the OTB-100 dataset;

FIG. 5 is a comparison of the method of the present invention with other mainstream target tracking methods on the UAV123 dataset;

FIG. 6 shows the output statistics of each outlet on the four datasets.

Detailed Description

The invention is further described below with reference to the accompanying drawings, but is not limited to this description; any modification or equivalent substitution that does not depart from the spirit and scope of the invention falls within its scope of protection.

The invention provides a target tracking method using a multi-outlet fully convolutional network based on knowledge distillation, named DMENet. In DMENet, different types of attention mechanisms are embedded at different levels of the fully convolutional network to capture more discriminative feature representations. In addition, three additional outlets are added to the fully convolutional network so that an accurate estimate of the target position in the current frame is obtained as early as possible. The entire DMENet is trained with a knowledge distillation strategy to improve the accuracy of the earlier outlets. Each outlet has a confidence score that decides whether processing of the video frame ends at the current outlet or is passed on to a deeper outlet.

Fig. 1 shows the overall structure of the entire network, which can be divided into three parts, specifically as follows:

The first part is the determination of the number of outlets. To determine the appropriate number of outlets, it is assumed that the tracking difficulty of targets in a video sequence falls into three categories: simple (the appearance of the target changes relatively little), medium (the appearance changes quickly but not severely), and difficult (the appearance changes relatively severely). It is further assumed that targets of these three difficulty levels can be located using low-level, mid-level, and high-level features, respectively. This assumption is verified on the OTB-100 dataset.

In the OTB-100 dataset, each video frame is assigned to one of three categories: simple, medium, or difficult. The classification is based on the average overlap ratio of the predicted bounding boxes output by 12 tracking methods. A frame is labeled simple when its average overlap ratio is greater than or equal to 0.7; the average overlap ratio thresholds for medium and difficult frames are 0.7 and 0.5, respectively. Fig. 2 shows examples of simple, medium, and difficult frames.
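The labeling rule can be sketched in a few lines of Python. This is a minimal illustration rather than the patented procedure: the function names are hypothetical, and the band boundaries assume one reading of the thresholds above (simple at 0.7 or more, medium from 0.5 up to 0.7, difficult below 0.5).

```python
def mean_overlap(overlaps):
    """Average the per-tracker overlap ratios measured on one frame."""
    if not overlaps:
        raise ValueError("at least one overlap ratio is required")
    return sum(overlaps) / len(overlaps)


def classify_frame(avg_overlap, simple_thr=0.7, difficult_thr=0.5):
    """Label a frame from the average overlap ratio of several trackers.

    Assumed bands (an interpretation of the text, not a confirmed rule):
    simple >= 0.7, medium in [0.5, 0.7), difficult < 0.5.
    """
    if not 0.0 <= avg_overlap <= 1.0:
        raise ValueError("average overlap ratio must lie in [0, 1]")
    if avg_overlap >= simple_thr:
        return "simple"
    if avg_overlap >= difficult_thr:
        return "medium"
    return "difficult"
```

For example, `classify_frame(mean_overlap([0.8, 0.9, 0.7]))` labels a frame on which the reference trackers agree well with the ground truth as simple.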

To count the actual output ratio of each outlet, three outlets of the network were trained without knowledge distillation. At each exit, a confidence score is set to determine whether to locate the target of the current frame at that exit (high confidence) or to proceed to the next exit (low confidence). That is, only if the confidence score for the current outlet reaches a threshold, the position prediction of the target may be output at this outlet. Fig. 3 shows statistics of simple/medium/difficult frames output at the first/second/third outlets, which justifies the assumption.
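The exit decision described above can be sketched as a simple cascade. The sketch is illustrative only: `track_with_early_exit`, the `(position, confidence)` return convention, and the default threshold of 0.5 are assumptions, since the text does not state the threshold value. In the actual network the deeper outlets would reuse the features already computed for the shallower ones.

```python
def track_with_early_exit(exit_fns, frame, threshold=0.5):
    """Run the outlets in order and stop at the first confident one.

    exit_fns: callables ordered from shallowest to deepest; each takes the
        frame and returns (position, confidence).
    threshold: hypothetical confidence threshold (not given in the text).
    Returns (position, confidence, index of the outlet that answered).
    """
    if not exit_fns:
        raise ValueError("at least one outlet is required")
    for index, exit_fn in enumerate(exit_fns):
        position, confidence = exit_fn(frame)
        is_last = index == len(exit_fns) - 1
        # The last outlet always answers: nothing deeper is left to try.
        if confidence >= threshold or is_last:
            return position, confidence, index
```

A simple frame would typically return with index 0, while a difficult frame falls through to the last outlet.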

The second part is the network structure. As shown in Fig. 1, the first three convolution layers of the VGG-M model pretrained on ImageNet are selected, and two MIN modules are embedded after the first and second convolution layers, respectively, to increase the nonlinear representation of the features and to alleviate the vanishing gradient problem caused by ReLU. The three convolution layers and the two MIN modules form a base network that extracts the feature representation of the target. Then, three attention modules are introduced into the base network: two residual attention modules and one channel attention module. Finally, three outlets are set in the base network, corresponding to video frames of the three difficulty levels. The three outlets have the same structure: one RoIAlign layer extracts candidate region features, and two convolution layers (conv_exit_1 and conv_exit_2) classify the candidate regions. Details of the MIN module, the attention modules, and the outlets are as follows.

MIN module: although classification-based tracking methods achieve good accuracy, some problems remain: (1) the discrimination capability of the model; (2) vanishing gradients and saturation during training. In most classification-based methods, the feature representation of the target is extracted by a lightweight network that cannot cope with nonlinear changes in the target. Furthermore, an inactive ReLU outputs a constant 0, which blocks the gradient and causes it to vanish. Changes in the data distribution during training may also saturate the activation function, which slows down training (especially in the online update phase).

To solve these problems, the present invention embeds two MIN modules after the first and second convolution layers, respectively. First, a two-layer multi-layer perceptron (MLP) is used to increase local nonlinearity. After each MLP there is one maxout unit, which overcomes the vanishing gradient problem that arises when ReLU is used. The maxout unit is expressed mathematically as follows:

where $x_{i,j}$ is the input centered on $(i, j)$, $ch$ is the channel index of feature $F$, and $w$ and $b$ are the feature weights and biases, respectively. $F$ is constructed by taking the maximum over $k$ maxout hidden pieces. The maxout unit acts as a max pooling layer across channels, selecting the maximum output as the input to the next layer. In addition, a batch normalization (BN) layer is introduced to avoid the effects of differences in data distribution. The overall flow of the MIN module is as follows:
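As a rough illustration of the maxout unit, the sketch below computes one output value as the maximum of $k$ linear pieces, which is the standard maxout formulation and is assumed here to match the definitions above. The function name and the list-based layout are hypothetical, and the BN layer and the MLP are left out.

```python
def maxout(x, weights, biases):
    """Maxout over k linear pieces for a single output channel.

    x: input vector, e.g. the flattened local patch centered on (i, j).
    weights: k weight vectors, each as long as x.
    biases: k scalars, one per piece.
    Returns max over m of (w_m . x + b_m).
    """
    if not weights or len(weights) != len(biases):
        raise ValueError("need k >= 1 weight vectors and one bias per vector")
    pieces = [
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    return max(pieces)
```

With the two pieces `[[1.0], [-1.0]]` and zero biases, the unit computes the absolute value of its input, so a negative input still selects a piece with a nonzero slope, whereas ReLU would output a constant 0 and pass no gradient.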

Attention module: since the ultimate goal of the present invention is to stop processing the current frame as early as possible during tracking, the later outlets should be more discriminative so that they can guide the earlier outlets. Therefore, two residual attention modules are added to the base network to enhance the discrimination capability of the deep features. First, a max pooling layer is used to enlarge the receptive field and capture global features. Second, bilinear interpolation restores the feature map to its original spatial resolution. The mathematical expression of the residual attention module is as follows:

where the activation uses the sigmoid function, $F^R(x)$ is the residual attention feature, and the two operator symbols in the expression denote element-wise multiplication and element-wise addition, respectively.
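One form that is consistent with these symbol definitions, and with the usual residual attention formulation, is written out below. It is an assumption, not the expression from the original publication: $M(x)$ is a hypothetical name for the mask branch (max pooling followed by bilinear upsampling), $\sigma$ is the sigmoid function, and $\odot$ and $\oplus$ stand for element-wise multiplication and addition.

```latex
F^{R}(x) = F(x) \oplus \bigl( \sigma\!\left( M(x) \right) \odot F(x) \bigr)
```

Under this reading the attention mask only rescales a copy of the feature, and the identity term $F(x)$ keeps the original response intact even where the mask is close to zero.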

After the second residual attention module, a channel attention module is added to enhance the sensitivity of the channels to the distinction between the target and the background. The channel attention module takes the feature $F$ as its input and removes the spatial information of $F$ through a global pooling operation. The channel dependencies are then obtained through two fully connected layers, and the channel weight $w_c$ is computed using a sigmoid function. The output $F^C(x)$ is obtained by multiplying $F(x)$ by the channel weight $w_c$:

$F^C(x) = w_c \cdot F(x)$. (4)
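The channel attention computation can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the global pooling is taken to be average pooling, a ReLU is placed between the two fully connected layers, and all function and parameter names are hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def linear(vec, weight, bias):
    """Fully connected layer; `weight` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def channel_attention(feature, w1, b1, w2, b2):
    """Reweight the channels of `feature` (a list of channels, each a 2-D list).

    Assumptions not stated in the text: average pooling as the global
    pooling and a ReLU between the two fully connected layers.
    Returns (reweighted feature, channel weights).
    """
    # Global pooling: one scalar per channel, spatial layout discarded.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]
    hidden = [max(0.0, h) for h in linear(pooled, w1, b1)]
    weights = [sigmoid(z) for z in linear(hidden, w2, b2)]
    # F^C(x) = w_c * F(x): every value of channel c is scaled by its weight.
    scaled = [
        [[w_c * v for v in row] for row in ch]
        for w_c, ch in zip(weights, feature)
    ]
    return scaled, weights
```

Because each weight lies in (0, 1), channels that respond to the target can be kept nearly unchanged while channels dominated by background are suppressed.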

Outlet: in the multi-outlet architecture of the present invention, each outlet comprises a RoIAlign layer and two convolution layers (conv_exit_1 and conv_exit_2); its output has two nodes, corresponding to the target and the background, respectively. Each outlet can therefore be regarded as a binary classifier. The output sizes of conv_exit_1 and conv_exit_2 are 3×3×128 and 1×1×2, respectively. The confidence score of an outlet decides whether the target is located at that outlet (high confidence) or the frame proceeds to the next outlet for further processing.

In most classification-based tracking methods, region of interest (RoI) features are extracted from high-level features; however, high-level features lack the detailed information needed to locate the target accurately. To supplement the region features with detail, the present invention stacks the region features of the preceding outlets onto the current outlet. Specifically, a region feature has size 3×3×ch, where ch is the number of feature channels. For the current outlet, the region features of the preceding outlets are concatenated along the channel axis.
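The fusion step amounts to a concatenation along the channel axis. The sketch below shows it for nested lists indexed as `[row][col][channel]`; the function name and the data layout are assumptions chosen for illustration.

```python
def concat_channels(region_features):
    """Concatenate region features along the channel axis.

    region_features: features ordered from the earliest outlet to the
        current one, each indexed [row][col][channel]. All must share the
        same spatial size (3 x 3 in the text); channel counts may differ.
    """
    if not region_features:
        raise ValueError("at least one region feature is required")
    rows, cols = len(region_features[0]), len(region_features[0][0])
    for feat in region_features:
        if len(feat) != rows or any(len(row) != cols for row in feat):
            raise ValueError("all region features must share the same spatial size")
    return [
        [
            [value for feat in region_features for value in feat[r][c]]
            for c in range(cols)
        ]
        for r in range(rows)
    ]
```

Fusing a 3×3×2 feature with a 3×3×3 feature gives a 3×3×5 feature, so the first convolution layer of a later outlet sees the low-level detail alongside its own region feature.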

The third part is multi-outlet training based on knowledge distillation. Given a teacher classifier $t$ and a student classifier $s$ learned from $t$, the learning process can be optimized by minimizing the cross entropy of their outputs:

where $t(x)$ and $s(x)$ are the predictions of $t$ and $s$, respectively. $temp$ is a temperature parameter that controls the softness of the output of the teacher $t$. $[t^{1/temp}(x)]_c$ and $[s^{1/temp}(x)]_c$ are the soft predictions of $t$ and $s$, respectively. $C$ is the number of categories. The distillation loss of the multi-outlet structure is then defined as follows:
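The soft predictions and their cross entropy can be sketched as follows. The code assumes the common distillation form, in which the cross entropy is the negative sum over classes of the teacher's soft prediction times the logarithm of the student's soft prediction; the function names are hypothetical.

```python
import math


def soft_predictions(logits, temp=1.0):
    """Compute softmax(logits / temp); a larger temp gives a softer output."""
    if temp <= 0:
        raise ValueError("temp must be positive")
    scaled = [z / temp for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(teacher_logits, student_logits, temp=1.0):
    """Cross entropy between the softened teacher and student outputs."""
    teacher = soft_predictions(teacher_logits, temp)
    student = soft_predictions(student_logits, temp)
    return -sum(t_c * math.log(s_c) for t_c, s_c in zip(teacher, student))
```

Raising `temp` flattens both distributions, so the student also learns how the teacher ranks the less likely class instead of seeing only a near one-hot target.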

where $Ex$ is the number of outlets and $T(e) \in Ex$ is the set of teacher outlets. Here, all outlets are set to learn from the last outlet. $cf(\cdot)$ is the classifier corresponding to each outlet. Finally, the whole model is optimized by minimizing the classification loss $L_{cls}$ and the distillation loss $L_{dis}$:

$L = L_{cls} + a L_{dis}$, (7)

where $a$ is a hyperparameter that balances the two losses. In the experiments, $a = 1$.
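A minimal sketch of the combined objective for a single sample is given below. It rests on several assumptions that the text does not spell out: the last outlet is the only teacher (as stated above), the distillation term is summed over the earlier outlets, the classification term is a cross entropy summed over all outlets, and the class index of the target is arbitrary. All function names are hypothetical.

```python
import math


def softmax(logits, temp=1.0):
    scaled = [z / temp for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def classification_loss(logits, label):
    """Cross entropy of one outlet against a hard label (class index)."""
    return -math.log(softmax(logits)[label])


def distillation_loss(exit_logits, temp=1.0):
    """Sum of soft cross entropies from each earlier outlet to the last one."""
    teacher = softmax(exit_logits[-1], temp)
    loss = 0.0
    for student_logits in exit_logits[:-1]:
        student = softmax(student_logits, temp)
        loss -= sum(t * math.log(s) for t, s in zip(teacher, student))
    return loss


def total_loss(exit_logits, label, a=1.0, temp=1.0):
    """L = L_cls + a * L_dis, with L_cls summed over all outlets."""
    l_cls = sum(classification_loss(logits, label) for logits in exit_logits)
    return l_cls + a * distillation_loss(exit_logits, temp)
```

With `a=1.0`, as in the experiments, both terms contribute equally; setting `a=0.0` recovers training without knowledge distillation.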

Experimental results

The invention is implemented in PyTorch and runs on a machine equipped with an Intel(R) 4790K CPU and an NVIDIA Tesla K40c GPU. For the offline training phase, the ImageNet-Vid dataset is used. Eight video frames are randomly selected from a given video, and 64 positive samples and 192 negative samples are drawn from each frame. Relative to the ground-truth bounding box, positive samples have an overlap ratio greater than or equal to 0.7, and negative samples have an overlap ratio between 0 and 0.5. Training runs for 1000 cycles at a learning rate of 0.0001. For the online training phase, 500 positive samples and 5000 negative samples are drawn from the first frame to initialize the model. Once the estimated target position in the current frame is obtained, 96 positive samples and 192 negative samples are collected. Every 10 frames, the model is updated using the collected positive and negative samples.
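The sample selection rule can be sketched as follows, assuming that the thresholds refer to the intersection over union (IoU) with the ground-truth box and that boxes are given as `(x, y, width, height)`. The function names and the box format are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def label_candidate(candidate, ground_truth, pos_thr=0.7, neg_thr=0.5):
    """Return 'positive', 'negative', or None for a candidate box.

    Candidates whose overlap falls strictly between neg_thr and pos_thr
    are assumed to be discarded, so the function returns None for them.
    """
    overlap = iou(candidate, ground_truth)
    if overlap >= pos_thr:
        return "positive"
    if overlap <= neg_thr:
        return "negative"
    return None
```

Leaving a gap between the two thresholds keeps ambiguous boxes out of both sample sets, which makes the binary classification at each outlet cleaner.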

To validate its performance, the method of the present invention, named DMENet, was evaluated on four public datasets (OTB-100, UAV123, LaSOT, and VOT2018).

Fig. 4 shows the results of a comparison between the method of the present invention and 11 other mainstream tracking methods on the OTB-100 dataset. The compared methods are: VITAL, SiamRPN++, MDNet, KYS, DiMP, PrDiMP, DAT, DaSiamRPN, ATOM, TRACA, and UDT. As shown in Fig. 4, DMENet achieves the highest success rate (Success) score. In addition, both the precision (Precision) and the success rate of DMENet are higher than those of VITAL, a classification-based tracking method.

Unlike OTB-100, whose videos are captured in everyday scenes, the videos in UAV123 are captured from a drone platform. Fig. 5 shows the comparison between DMENet and other mainstream methods on the UAV123 dataset; DMENet achieves competitive results among all compared tracking methods.

The LaSOT dataset consists of 1400 video sequences. On this dataset, tracking methods are evaluated mainly in terms of success rate (Success). All methods were tested on a test set containing 280 videos. Table 1 shows the success rate of each method. As shown in Table 1, the success rate of DMENet is far higher than that of the other classification-based tracking methods, namely VITAL and MDNet.

Table 1. Comparison of success rates on the LaSOT dataset

The VOT2018 dataset contains 60 video sequences, and the evaluation criteria are accuracy (Ar), robustness (Rr), and expected average overlap (EAO). As shown in Table 2, DMENet ranks high among all compared tracking methods, with competitive results.

Table 2. Comparison results on VOT2018

Fig. 6 shows the output statistics of each outlet with and without knowledge distillation, where E denotes an outlet and E w/Dis denotes an outlet trained with knowledge distillation. As Fig. 6 shows, with knowledge distillation more frames are output at the earlier outlets, which increases the running speed of the algorithm.

Claims

1. A method for tracking a target of a multi-outlet full convolution network based on knowledge distillation, the method comprising the steps of:

(1) Selecting the first three convolution layers of the VGG-M pre-training network, respectively embedding two MIN modules into the first convolution layer and the second convolution layer, and forming a basic network by the three convolution layers and the two MIN modules together;

(2) On the basis of the basic network, three attention modules are added to increase the discrimination capability of the feature representation, comprising two residual attention modules and one channel attention module; after the second residual attention module, a channel attention module is added to enhance the sensitivity of the channels to the distinction between the target and the background; the channel attention module takes the feature $F$ as its input, removes the spatial information of $F$ through a global pooling operation, obtains the channel dependencies through two fully connected layers, and computes the channel weight $w_c$ using a sigmoid function; the output $F^C(x)$ is obtained by multiplying $F(x)$ by the channel weight $w_c$;

$[s^{1/temp}(x)]_c = \mathrm{softmax}(s(x)/temp)$,

$[t^{1/temp}(x)]_c = \mathrm{softmax}(t(x)/temp)$,

$L = L_{cls} + a L_{dis}$,

where $a$ is a hyperparameter used to balance the two losses.

2. The target tracking method for a knowledge distillation based multi-outlet full convolution network according to claim 1, wherein the overall flow of the MIN module is as follows:

where $x_{i,j}$ is the input centered on coordinates $(i, j)$, $ch$ is the channel index of feature $F$, and $w$ and $b$ are the feature weights and biases, respectively.

3. The method for target tracking for a knowledge distillation based multi-outlet full convolution network according to claim 1, characterized in that the mathematical expression of the residual attention module is as follows:

where the activation uses the sigmoid function, $F^R(x)$ is the residual attention feature, and the two operator symbols in the expression denote element-wise multiplication and element-wise addition, respectively.

4. The target tracking method for a knowledge distillation based multi-outlet full convolution network according to claim 1, characterized in that $L_{dis}$ is defined as follows: