CN110570450B - Target tracking method based on cascade context-aware framework


Info

Publication number
CN110570450B
CN110570450B (application CN201910882861.XA)
Authority
CN
China
Prior art keywords
context
target
map
aware
framework
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910882861.XA
Other languages
Chinese (zh)
Other versions
CN110570450A (en)
Inventor
邬向前
卜巍
马丁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN201910882861.XA priority Critical patent/CN110570450B/en
Publication of CN110570450A publication Critical patent/CN110570450A/en
Application granted granted Critical
Publication of CN110570450B publication Critical patent/CN110570450B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Abstract

The invention discloses a target tracking method based on a cascaded context-aware framework built from two networks: an image-based context-aware network (ICANet) and an image block-based context-aware network (PCANet). The framework progressively models the varied changes between targets and their context information. The first network focuses on the most discriminative information between the target and its context and on the coarse structure of the target, while the second network focuses on the fine structural information of the target itself. From the output of the two networks, namely the final context-aware map (FCA map), a positioning box of the target can be flexibly generated, and the target can be effectively distinguished from surrounding background information such as interferents. The FCA map obtained by the invention can be flexibly embedded into various tracking frameworks.

Description

Target tracking method based on cascade context-aware framework
Technical Field
The invention relates to a target tracking method, in particular to a target tracking method based on a cascaded context-aware framework.
Background
Based on the powerful representation capabilities of convolutional neural networks (CNNs), researchers have proposed many CNN-based trackers. Among them, most trackers use a rectangular box to mark the position of the target; in this case, the target model inevitably contains some context information, and ignoring context information has a significant impact on tracking performance. First, learning the target model from a limited spatial region may result in overfitting and is not robust to rapid changes in the appearance of the target. Second, the lack of true negative examples greatly impairs the robustness of the tracker against complex backgrounds; in particular, when similar visual information exists in the target and its context, the risk of tracking drift greatly increases. Third, when the context information is not fully considered, it is difficult for the tracker to effectively handle occlusion of the target.
Most existing target tracking algorithms consider only the context information within a local range around the target and pay little attention to the context information of the whole input image. As a result, interferents and background information present across the whole image are ignored, which degrades the robustness of the tracking algorithm.
Disclosure of Invention
In order to reduce background interference to the tracker, attend to the context information in every corner of the whole image, and solve the problems of existing target tracking algorithms, the invention provides a target tracking method based on a cascaded context-aware framework.
The purpose of the invention is realized by the following technical scheme:
a target tracking method based on a cascade context-aware framework comprises the following steps:
step one, constructing a cascade-based context-aware framework (CAT) comprising two sub-networks: an image-based context-aware network (ICANet) and an image block-based context-aware network (PCANet), wherein: the input of ICANet is the whole image, used to capture background information over the entire input image, while PCANet is used to distinguish similar interferents within the local range of the target;
step two, learning an image-level context perception map (ICA map) through ICANet, and capturing the most discriminative features between the target and the surrounding context and the approximate structure information of the target;
step three, learning an image block-level context perception map (PCA map) through PCANet, and acquiring the target's own structural information and suppressing interferent information based on the PCA map;
step four, after obtaining the ICA map and the PCA map, mapping the pixels of the PCA map to the ICA map so as to obtain a final context perception map (FCA map);
step five, based on the final context-aware map (FCA map), obtaining a positioning box of the target using one of two strategies (a high-level sketch of the full pipeline follows), wherein:
strategy one, applying a sigmoid to each pixel of the FCA map, obtaining a binary mask by binarizing the FCA map (threshold 0.5), and generating a bounding box as the axis-aligned bounding rectangle of the binary mask;
and strategy two, embedding the FCA map into a Bayesian framework, i.e., calculating the maximum a posteriori estimate according to the probability that each candidate sample belongs to the target.
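As an illustration, the five steps above can be summarized in a minimal Python sketch. This is only a schematic reading of the pipeline, not the patented implementation: ICANet and PCANet are assumed to be callables returning single-channel score maps the size of their input, the frame is assumed grayscale, and `patch_size`, the cropping logic, and the pixel projection are simplified assumptions.

```python
import numpy as np

def cat_track_frame(frame, icanet, pcanet, patch_size=100):
    # Step two: image-level context-aware map over the whole frame.
    ica_map = icanet(frame)                       # H x W score map

    # Step three: crop an image block centered on the highest ICA response
    # and run the image block-level network on it.
    cy, cx = np.unravel_index(np.argmax(ica_map), ica_map.shape)
    half = patch_size // 2
    y0, x0 = max(cy - half, 0), max(cx - half, 0)
    pca_map = pcanet(frame[y0:y0 + patch_size, x0:x0 + patch_size])

    # Step four: project the PCA-map pixels back into the ICA map (FCA map).
    fca_map = ica_map.copy()
    h, w = pca_map.shape
    fca_map[y0:y0 + h, x0:x0 + w] = pca_map

    # Step five (strategy one): sigmoid, binarize at 0.5, axis-aligned box.
    mask = 1.0 / (1.0 + np.exp(-fca_map)) > 0.5
    ys, xs = np.nonzero(mask)
    return (xs.min(), ys.min(), xs.max(), ys.max()) if ys.size else None
```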
Compared with the prior art, the invention has the following advantages:
1. The invention provides a context-aware framework based on the cascade of two networks, ICANet and PCANet. The framework progressively models the varied changes between targets and their context information. The first network focuses on the most discriminative information between the target and its context and on the coarse structure of the target, while the second network focuses on the fine structural information of the target itself. From the output of the two networks, namely the final context-aware map, a positioning box of the target can be flexibly generated, and the target can be effectively distinguished from surrounding background information such as interferents.
2. The FCA map obtained by the invention can be flexibly embedded into various tracking frameworks.
Drawings
FIG. 1 is a general flow diagram of the CAT framework proposed by the present invention;
FIG. 2 is an architecture of ICANet;
FIG. 3 is an architecture of a PCANet;
FIG. 4 shows visualization results: (a) the label, (b) the visualization result of the FCA map without L_Boundary, and (c) the visualization result of the FCA map with L_Boundary added;
FIG. 5 shows precision and success rate plots on the OTB100 dataset: (a) precision, (b) success rate;
FIG. 6 shows precision and success rate plots on the TC128 dataset: (a) precision, (b) success rate;
fig. 7 is a visualization of the CAT tracker proposed by the present invention in a challenging sequence.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings, but not limited thereto, and any modification or equivalent replacement of the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention shall be covered by the protection scope of the present invention.
The invention provides a target tracking method based on a cascade context-aware framework, which comprises the following steps:
1. image level context aware network (ICANet)
The present invention recognizes that a recurrent structure is very important for generating a context map of an object, because it helps the network know the position of the object across consecutive frames. As shown in FIG. 1, the recurrent structure generates an image-level context-aware map (ICA map) in a recurrent fashion. The entire network consists of one feature extractor (the five convolutional layers conv1-conv5 of VGG-M) and five additional modules, each consisting of a convolutional layer, an average pooling layer, a convolutional LSTM unit, and a deconvolution layer.
For ICANet, separating the target from the background is treated as a binary classification problem. In most cases, there is contrasting information between the target and its context. To capture this contrast, the present invention proposes to subtract the local mean of the features from the features themselves; the mean is computed by an average pooling layer with a kernel size of 3 × 3.
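A minimal PyTorch sketch of this contrast operation follows; the function name is illustrative, and the stride/padding values are assumptions chosen so the feature size is unchanged:

```python
import torch.nn.functional as F

def contrast_layer(x):
    # Subtract the local mean (3x3 average pooling) from the features,
    # highlighting where the target differs from its surrounding context.
    local_mean = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)
    return x - local_mean
```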
In most cases, the context changes relatively slowly compared with the appearance of the target. Therefore, the present invention selects the LSTM to model this long-term dependence. As shown in FIG. 2, the convolutional LSTM cell (pink rectangle) is formed by an input gate I_t, a forget gate F_t, a cell state C_t, and an output gate O_t. In the time dimension, the relationship between the gates and the states can be expressed as:

I_t = σ(W_xi * X_t + W_hi * H_{t−1} + b_i)
F_t = σ(W_xf * X_t + W_hf * H_{t−1} + b_f)
C_t = F_t ⊙ C_{t−1} + I_t ⊙ tanh(W_xc * X_t + W_hc * H_{t−1} + b_c)
O_t = σ(W_xo * X_t + W_ho * H_{t−1} + b_o)
H_t = O_t ⊙ tanh(C_t)   (1)

where X_t is the feature generated by the contrast layer; the cell state C_t is input to the next LSTM step; the hidden output is denoted by H_t; * is the convolution operation; ⊙ is the element-wise (dot) product; σ is the sigmoid function; tanh is the hyperbolic tangent; W_* are the parameters to be learned; and b_* are bias terms. The output of the LSTM is concatenated with the corresponding feature map and sent to the deconvolution layer. After the five additional modules, feature maps of different sizes are upsampled to the input size. Finally, a convolutional layer with kernel size 1 × 1 is appended after the last deconvolution to produce a single-channel score map. For the loss function, the invention treats the output as a likelihood probability; since the distribution of target/background pixels is imbalanced, a class-balanced cross-entropy loss function is employed for training:

L_CE = −β Σ_{k=1}^{K} Q_k log P_k − (1−β) Σ_{k=1}^{K} (1−Q_k) log(1−P_k)   (2)

where K is the total number of training pixels, Q_k is the Gaussian-shaped label, P_k is the prediction probability, and β is the class-balancing weight.
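The convolutional LSTM of Eq. (1) can be sketched in PyTorch as follows. Fusing the four gates into a single convolution is an implementation convenience, and the channel counts and kernel size are assumptions:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One convolution computes all four gates (I, F, O, and the
        # candidate cell input) at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x_t, state):
        h_prev, c_prev = state
        z = self.gates(torch.cat([x_t, h_prev], dim=1))
        i, f, o, g = torch.chunk(z, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_t = f * c_prev + i * torch.tanh(g)   # cell state C_t
        h_t = o * torch.tanh(c_t)              # hidden output H_t
        return h_t, c_t
```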
2. Image block level context-aware network (PCANet)
The structure of ICANet is based on 2D CNNs and convolutional LSTMs, which usually focus on capturing coarser, long-term temporal dependence. However, such an architecture may lack the ability to represent finer structural information within local spatiotemporal windows. Furthermore, the output of ICANet is a Gaussian-shaped map, which in some cases cannot describe the exact contour of the target.
Figure 3 shows the network structure of PCANet. The present invention crops an image block from the current frame; the center of the image block is located at the highest-response area of the ICA map. PCANet consists of a feature extractor (the first three convolutional layers of ICANet) and three additional modules. Each additional module consists of a convolutional layer for reducing the feature size, an RNN unit for modeling the target's own structure, and a deconvolution layer for progressively restoring the feature to the input size.
PCANet aims to obtain the structure of the target itself. However, the resolution of the target features is low, and the target occupies only a small portion of the image. In order to capture the complete structure of the target, a high-resolution feature map needs to be constructed. The present invention meets this need by expanding the receptive field of each activation. To this end, the max-pooling layers after conv1 and conv2 in the VGG-M network are deleted. After this operation, the output feature map of conv3 is four times larger than that in the original VGG-M network. This enables the extraction of high-resolution features and improves the quality of the constructed structure.
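A sketch of the resulting high-resolution feature extractor is given below; the channel counts follow VGG-M (96, 256, 512), while the exact kernel/stride settings are simplified assumptions:

```python
import torch.nn as nn

# conv1-conv3 of a VGG-M-style extractor with the max-pooling layers after
# conv1 and conv2 removed, so conv3 keeps 4x the original spatial resolution.
high_res_extractor = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU(inplace=True),
    # (max-pooling removed here)
    nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(inplace=True),
    # (max-pooling removed here)
    nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
)
```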
The structure of the target itself is then built upon the RNN units. In each RNN unit, several directed RNNs are used to model the target's own structure, i.e., the topology of an undirected graph is approximated by a combination of directed graphs. In the RNN unit, the undirected graph is decomposed into four directed graphs: right (G_1), left (G_2), up (G_3), and down (G_4). By executing the RNNs, the hidden states h^n (n = 1, ..., 4) are calculated, and the sum of the outputs of all hidden layers is fed to the output layer. This process can be expressed as:

h^n_{v_i} = tanh(U_n x_{v_i} + W_n Σ_{v_j ∈ P(v_i)} h^n_{v_j} + b_n)
o_{v_i} = Σ_{n=1}^{4} V_n h^n_{v_i} + c   (3)

where U_n, W_n, and V_n are the matrix parameters corresponding to G_n, b_n and c are bias terms, and P(v_i) denotes the set of predecessors of vertex v_i in G_n. The output of the RNN unit is then input to a deconvolution layer to expand the feature map. Finally, the output is a single-channel score map of the same size as the input.
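For illustration, one of the four directed RNNs of Eq. (3), the left-to-right sweep G_1, can be sketched as follows; the other three directions are obtained by flipping the feature map, and implementing the projections as 1 × 1 convolutions is an assumption of this sketch:

```python
import torch
import torch.nn as nn

class DirectedRNN(nn.Module):
    """One directed RNN (G_1, left-to-right); each pixel's predecessor is
    its left neighbour, so the sweep runs column by column."""
    def __init__(self, ch, hid):
        super().__init__()
        self.U = nn.Conv2d(ch, hid, 1)               # input projection U_n (bias = b_n)
        self.W = nn.Conv2d(hid, hid, 1, bias=False)  # recurrent projection W_n

    def forward(self, x):
        proj = self.U(x)                              # (B, hid, H, W)
        h = torch.zeros_like(proj[..., :1])           # initial hidden column
        cols = []
        for j in range(proj.shape[-1]):               # sweep left to right
            h = torch.tanh(proj[..., j:j + 1] + self.W(h))
            cols.append(h)
        return torch.cat(cols, dim=-1)                # hidden states h^1
```

The full RNN unit would run four such sweeps and combine their hidden states through the output projections V_n and bias c, as in Eq. (3).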
In order to emphasize the boundary of the object, the invention proposes a boundary loss L_Boundary as an auxiliary loss; the total loss includes the class-balanced cross-entropy loss and the boundary loss L_Boundary. To calculate L_Boundary, the boundaries of the prediction and the ground truth first need to be extracted. Here, the Sobel filter detects boundaries as a 3 × 3 convolution. Mathematically, the Sobel kernels can be expressed as:

S_x = | −1  0  +1 |      S_y = | −1 −2 −1 |
      | −2  0  +2 |            |  0  0  0 |
      | −1  0  +1 |            | +1 +2 +1 |   (4)

which encode the horizontal and vertical gradients, respectively. The full Sobel filter is constructed by stacking the two kernels. L_Boundary is calculated as the mean squared error between the boundaries extracted from the label q_k and those extracted from the prediction p_k. The overall loss function for training the proposed PCANet is then:

L = L_CE + L_Boundary   (5)
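A sketch of L_Boundary follows, implementing the Sobel kernels of Eq. (4) as 3 × 3 convolutions. Taking the MSE over gradient magnitudes (rather than over the raw filter responses) is an assumption of this sketch:

```python
import torch
import torch.nn.functional as F

SOBEL_X = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)  # vertical-gradient kernel

def boundary_loss(pred, label):
    # pred/label: (B, 1, H, W) score map and Gaussian-shaped label.
    def edges(m):
        gx = F.conv2d(m, SOBEL_X, padding=1)
        gy = F.conv2d(m, SOBEL_Y, padding=1)
        return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)
    return F.mse_loss(edges(pred), edges(label))
```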
FIG. 4 shows a visualization of the PCA map. As can be seen from FIG. 4, PCANet focuses more on the finer structure of the target.
3. Determination of target position
To estimate the position of the target, the final context-aware map (FCA map) is constructed by projecting the two results, i.e., the result of PCANet is mapped onto the result of ICANet by pixel-value mapping. Then, two different strategies are considered to generate the rectangular box of the target:
(1) Given the FCA map, a sigmoid is applied to each pixel of the FCA map. Then, a binary mask is obtained by binarizing the FCA map (threshold 0.5). From this binary mask, a bounding box (denoted as Seg) is generated as the axis-aligned bounding rectangle.
(2) The FCA map is embedded in a Bayesian framework, i.e., the maximum a posteriori estimate is calculated based on the likelihood that the candidate sample belongs to the target. In order to obtain a clearer and more accurate target description, the detailed information of the target (denoted as ICA) is described using independent component analysis (ICA).
ICA is a method for extracting a desired signal from among the source signals under the guidance of a reference. To obtain a reference, the input frame is first convolved with a Laplacian-of-Gaussian filter to produce a boundary map. The reference m_r is then obtained by element-wise multiplication of the boundary map with the FCA map. Given m_r as the reference and m_s as the signal, the desired signal is obtained by the projection s = w^T m_s. The goal is to maximize the negentropy J(s):
J(s) ≈ ρ [ E{G(s)} − E{G(ν)} ]^2   (6)
subject to ε(s, m_r) ≤ δ   (7)

where G(·) is a non-quadratic function, ρ is a constant, ν is a Gaussian variable, ε(·) is a normalization (closeness) function, δ is a threshold, and E[·] denotes expectation. The results of the ICA are then input to the appearance model in a Bayesian framework. In this framework, the position of the target is denoted l_t = (x, y, σ), where x, y, and σ denote the center-point coordinates and the scale of the rectangular box, respectively. All candidate samples are normalized to a standard size.
To this end, the confidence of the r-th candidate sample is determined by summing all pixel values in its heat map:

c_r = Σ_k h_r(k)   (8)

where h_r(k) is the k-th pixel value of the heat map of the r-th candidate sample.
the final position is calculated by:
l_t* = argmax_r c_r   (9)

where l_t* is the optimal appearance state of the target in the current frame.
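A minimal sketch of this candidate-scoring loop is given below. It omits the normalization of candidates to a standard size, and the box size `base` and the per-parameter sigmas are illustrative assumptions:

```python
import numpy as np

def bayesian_locate(fca_map, prev_state, n_candidates=600,
                    sigmas=(10.0, 10.0, 0.01)):
    # prev_state = (x, y, scale, base); boxes are base*scale pixels square.
    x0, y0, s0, base = prev_state
    rng = np.random.default_rng()
    best, best_conf = prev_state, -np.inf
    for _ in range(n_candidates):
        x = x0 + rng.normal(0.0, sigmas[0])
        y = y0 + rng.normal(0.0, sigmas[1])
        s = s0 + rng.normal(0.0, sigmas[2])
        w = h = max(int(base * s), 1)
        x1, y1 = int(max(x - w / 2, 0)), int(max(y - h / 2, 0))
        conf = fca_map[y1:y1 + h, x1:x1 + w].sum()   # Eq. (8)
        if conf > best_conf:
            best_conf, best = conf, (x, y, s, base)
    return best                                      # argmax of Eq. (9)
```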
4. Online update
Online update strategies play an important role in the tracking process. For ICANet, the input is the entire image. Since ICANet is trained on sequences with a maximum length of 16 frames, the LSTM state is reset after every 16 frames; the state of the LSTM is then set to the output of the first forward pass, which encodes the information of the tracked target. For PCANet, the network is updated frame by frame using the estimated binary mask.
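The 16-frame reset rule can be sketched as follows, assuming a hypothetical API icanet(frame, state) -> (ica_map, state):

```python
def track_sequence(frames, icanet):
    first_state, state = None, None
    for t, frame in enumerate(frames):
        if t > 0 and t % 16 == 0:
            # Reset: re-seed the LSTM with the first-pass state, which
            # encodes the information of the tracked target.
            state = first_state
        ica_map, state = icanet(frame, state)
        if t == 0:
            first_state = state
        yield ica_map
```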
5. Details of training
For ICANet, the weights are initialized using VGG-M, and the other parameters are initialized randomly from a normal distribution. The Adam optimizer is used with a learning rate of 10^-4. ICANet is trained in two stages. In the first stage, ICANet is trained for 300 epochs with a batch size of 16 frames on the CDnet2014 dataset. Then, ICANet is fine-tuned for 200 epochs on the DAVIS2016 dataset with a batch size of 16 frames. For PCANet, the feature extractor is initialized using the first three convolutional layers of ICANet, and training uses a learning rate of 10^-5 for about 300 iterations. All parameters are fixed throughout the experiments.
6. Results and analysis of the experiments
To evaluate the performance of CAT, the present invention uses standard evaluation metrics. On the OTB100 dataset, the algorithm is evaluated using the commonly used one-pass evaluation (OPE), with precision and success rate plots as indices. For the precision metric, the estimated position must lie within a certain threshold distance of the ground-truth position; typically, the threshold distance is set to 20 pixels. The success rate measures the overlap rate between the predicted bounding box and the ground-truth bounding box. Precision and success rate plots are also used for the TC128 dataset. On the VOT2016 dataset, each tracker is evaluated by the metrics of accuracy ranking (A), robustness ranking (R), and expected average overlap (EAO).
1. Implementation details
The tracking method proposed by the present invention is implemented using the MatConvNet toolkit and runs on a PC with an Intel(R) Core(TM) i7-4790K CPU and an NVIDIA Tesla K40c GPU. The input sizes of ICANet and PCANet are 300 × 300 and 100 × 100, respectively. The LSTM layer has 1024 cells. All new layers are initialized using the MSRA initialization method. The label for ICANet is generated using a two-dimensional Gaussian function with a peak of 1.0. In PCANet, the state of the target in the first frame is initialized by GrabCut. The dimensions of the hidden layers of the RNN are set to 512, 256, and 128. For the Bayesian-framework tracking strategy, 600 candidates are generated for each frame using a Gaussian distribution model. The variances of the candidate position parameters are set to {10, 0.01}, respectively.
2. Self-comparative experiment
To verify the effectiveness of the various components of CAT, six CAT variants were designed and evaluated on OTB100. The gray lines represent variants of ICANet using the Seg strategy; the white lines show variants of ICANet using the ICA strategy. The precision and success scores are listed in the last two columns of Table 1. For the Seg strategy, "ICANet + PCANet" and "ICANet + PCANet + L_Boundary" improve the precision measure by 4% and 4.4%, respectively. For the ICA strategy, "ICANet + PCANet" and "ICANet + PCANet + L_Boundary" improve the precision measure by 5.2% and 5.8%, respectively. The results show that the proposed framework can improve performance when context and boundary information are fully considered. The invention selects the "ICANet + PCANet + ICA + L_Boundary" architecture for comparison with other state-of-the-art trackers on the following three public datasets.
TABLE 1: precision and success scores of the six CAT variants on OTB100 [table not reproduced]
3. Experimental results on OTB100 data set
The CAT tracker proposed by the present invention is compared with 16 recently published trackers on the OTB100 dataset: SiamRPN++, DSLT, DAT, DaSiamRPN, MCPF, TADT, ACT, MetaCREST, PTAV, CREST, TRACA, CNN-SVM, BACF, ACFN, CFNet, and UDT. Tracking performance is measured by one-pass evaluation (OPE) based on two metrics: center position error and overlap ratio; the results are shown in FIG. 5. According to FIG. 5, the CAT tracker exhibits competitive performance on this dataset, with precision and success rate values on OTB100 of 0.909 and 0.697, respectively.
4. Experimental results on the TC128 data set
The invention is evaluated on the TC128 dataset, which contains 128 videos; the results compared with 12 state-of-the-art trackers are shown in FIG. 6. Among all compared methods, the method of the present invention improves the precision score from 0.8073 (the best prior tracker) to 0.8153. FIG. 6(b) shows the success rate over all 128 videos in the TC128 dataset. The CAT tracker outperforms the state-of-the-art methods with an AUC score of 0.6138. This result verifies the robustness of the proposed CAT.
5. Experimental results on VOT2016 dataset
The present invention evaluates the performance of CAT on the VOT2016 dataset. The VOT2016 report shows that, under the EAO metric, the state-of-the-art bound is set to 0.251, and trackers with EAO values exceeding this bound are defined as state-of-the-art. The CAT tracker is compared with 7 state-of-the-art trackers, including ECO, C-COT, Staple, MDNet, CREST, SiamFC, and ECO-hc. As shown in Table 2, the CAT tracker obtains a high ranking among all compared trackers.
TABLE 2
Tracker  CAT    ECO    C-COT  Staple  MDNet  CREST  SiamFC  ECO-hc
EAO      0.332  0.367  0.331  0.295   0.257  0.283  0.235   0.322
A        0.57   0.55   0.54   0.54    0.54   0.51   0.53    0.54
R        0.23   0.20   0.24   0.38    0.34   0.25   0.46    0.30
6. Analysis and discussion
The qualitative results of the proposed CAT tracker on a subset of challenging sequences are shown in FIG. 7. The proposed CAT is able to successfully cope with both in-plane and out-of-plane rotations of the target. The ICA map generated by ICANet captures the most discriminative features for separating foreground and background, i.e., it retains the most robust features over a long time span. The recurrent unit in PCANet is more robust to occlusion of the target. In addition, the proposed PCANet can effectively capture structural changes of the target. Compared with BACF and DaSiamRPN, the proposed CAT obtains better tracking results under illumination variation and complex backgrounds, because context information from every corner of the entire image is extracted. Meanwhile, the FCA map captures both coarse-grained and fine-grained information of the target, so the proposed tracker performs better than BACF on sequences with small target sizes.

Claims (7)

1. A cascade-based context-aware framework target tracking method is characterized by comprising the following steps:
step one, constructing a cascade-based context-aware framework CAT comprising two sub-networks: an image-based context-aware network ICANet and an image block-based context-aware network PCANet, wherein: the input of ICANet is the whole image, used to capture background information over the entire input image, while PCANet is used to distinguish similar interferents within the local range of the target;
the ICANet is comprised of a feature extractor and five additional modules, wherein: the feature extractor comprises five convolutional layers in the VGG-M, and each additional module consists of a convolutional layer, an average pooling layer, a convolution LSTM unit and a deconvolution layer;
the PCANet is comprised of a feature extractor and three additional modules, wherein: the feature extractor comprises the first three convolutional layers in ICANet, each additional module consists of a convolutional layer for reducing the feature size, an RNN unit for modeling the structure of itself, and an anti-convolutional layer for progressively increasing the feature to the input size;
step two, learning an image-level context perception map ICA map through ICANet, and capturing the most discriminative features between a target and the surrounding context and the approximate structural information of the target;
step three, learning an image block-level context perception map PCA map through PCANet, and acquiring the target's own structural information and suppressing interferent information based on the PCA map;
step four, after obtaining the ICA map and the PCA map, mapping the pixels of the PCA map to the ICA map to obtain a final context perception map FCA map;
and step five, obtaining a positioning frame of the target based on the final FCA map.
2. The cascade-based context-aware framework target tracking method according to claim 1, wherein the kernel size of the average pooling layer is 3 x 3.
3. The cascade-based context-aware framework target tracking method according to claim 1, wherein the convolutional LSTM unit is composed of an input gate I_t, a forget gate F_t, a cell state C_t, and an output gate O_t, and the relationship between the gates and the states in the time dimension is expressed as:

I_t = σ(W_xi * X_t + W_hi * H_{t−1} + b_i)
F_t = σ(W_xf * X_t + W_hf * H_{t−1} + b_f)
C_t = F_t ⊙ C_{t−1} + I_t ⊙ tanh(W_xc * X_t + W_hc * H_{t−1} + b_c)
O_t = σ(W_xo * X_t + W_ho * H_{t−1} + b_o)
H_t = O_t ⊙ tanh(C_t)   (1)

where X_t is the feature generated by the contrast layer, C_t is the cell state, H_t denotes the hidden output, * is the convolution operation, ⊙ is the element-wise (dot) product, σ is the sigmoid function, W_* are the parameters to be learned, b_* are bias terms, and tanh is the hyperbolic tangent operation.
4. The cascade-based context-aware framework target tracking method according to claim 1, wherein in step one, ICANet employs a class-balanced cross-entropy loss function for training:

L_CE = −β Σ_{k=1}^{K} Q_k log P_k − (1−β) Σ_{k=1}^{K} (1−Q_k) log(1−P_k)   (2)

where K is the total number of training pixels, Q_k is the Gaussian-shaped label, P_k is the prediction probability, and β is the class-balancing weight.
5. The cascade-based context-aware framework target tracking method according to claim 1, wherein the overall loss function for training the PCANet is calculated by the following formula:

L = L_CE + L_Boundary   (5)

where L_Boundary is the boundary loss and L_CE is the class-balanced cross-entropy loss of equation (2), with q_k the label, K the total number of training pixels, and p_k the prediction probability.
6. The cascade-based context-aware framework target tracking method according to claim 1, wherein in step five, the method for obtaining the positioning box of the target based on the final FCA map is as follows: a sigmoid is applied to each pixel of the FCA map, a binary mask is then obtained by binarizing the FCA map, and a bounding box is generated from the binary mask as its axis-aligned bounding rectangle.
7. The cascade-based context-aware framework target tracking method according to claim 1, wherein in step five, the method for obtaining the positioning box of the target based on the final FCA map is as follows: the FCA map is embedded into a Bayesian framework, i.e., the maximum a posteriori estimate is calculated according to the probability that the candidate sample belongs to the target.
CN201910882861.XA 2019-09-18 2019-09-18 Target tracking method based on cascade context-aware framework Active CN110570450B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910882861.XA CN110570450B (en) 2019-09-18 2019-09-18 Target tracking method based on cascade context-aware framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910882861.XA CN110570450B (en) 2019-09-18 2019-09-18 Target tracking method based on cascade context-aware framework

Publications (2)

Publication Number Publication Date
CN110570450A CN110570450A (en) 2019-12-13
CN110570450B true CN110570450B (en) 2023-03-24

Family

ID=68780920

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910882861.XA Active CN110570450B (en) 2019-09-18 2019-09-18 Target tracking method based on cascade context-aware framework

Country Status (1)

Country Link
CN (1) CN110570450B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222470A (en) * 2020-01-09 2020-06-02 北京航空航天大学 Visible light remote sensing image ship detection method based on multivariate Gaussian distribution and PCANet
CN113761976A (en) * 2020-06-04 2021-12-07 华为技术有限公司 Scene semantic analysis method based on global guide selective context network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105447473A (en) * 2015-12-14 2016-03-30 江苏大学 PCANet-CNN-based arbitrary attitude facial expression recognition method
CN107622225A (en) * 2017-07-27 2018-01-23 成都信息工程大学 Face identification method based on independent component analysis network
KR20180069220A (en) * 2016-12-15 2018-06-25 현대자동차주식회사 Algorithm for discrimination of target by using informations from radar
CN110070562A (en) * 2019-04-02 2019-07-30 西北工业大学 A kind of context-sensitive depth targets tracking

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105447473A (en) * 2015-12-14 2016-03-30 江苏大学 PCANet-CNN-based arbitrary attitude facial expression recognition method
KR20180069220A (en) * 2016-12-15 2018-06-25 현대자동차주식회사 Algorithm for discrimination of target by using informations from radar
CN107622225A (en) * 2017-07-27 2018-01-23 成都信息工程大学 Face identification method based on independent component analysis network
CN110070562A (en) * 2019-04-02 2019-07-30 西北工业大学 A kind of context-sensitive depth targets tracking

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An image edge detection method robust to illumination and shadow; Zhang Jinxia; Journal of Lanzhou University of Technology; June 2007; Vol. 33, No. 3; pp. 100-103 *
Dual-space feature extraction algorithm based on PCA and ICA; Wang Weidong et al.; Journal of Image and Graphics; November 2008; Vol. 13, No. 11; pp. 2163-2169 *

Also Published As

Publication number Publication date
CN110570450A (en) 2019-12-13

Similar Documents

Publication Publication Date Title
CN108416266B (en) Method for rapidly identifying video behaviors by extracting moving object through optical flow
CN108550161B (en) Scale self-adaptive kernel-dependent filtering rapid target tracking method
Han et al. Visual object tracking via sample-based Adaptive Sparse Representation (AdaSR)
US8989442B2 (en) Robust feature fusion for multi-view object tracking
CN111027493B (en) Pedestrian detection method based on deep learning multi-network soft fusion
Jia et al. Visual tracking via adaptive structural local sparse appearance model
CN107633226B (en) Human body motion tracking feature processing method
CN108038435B (en) Feature extraction and target tracking method based on convolutional neural network
CN110738207A (en) character detection method for fusing character area edge information in character image
CN107516316B (en) Method for segmenting static human body image by introducing focusing mechanism into FCN
KR101409810B1 (en) Real-time object tracking method in moving camera by using particle filter
CN114758288A (en) Power distribution network engineering safety control detection method and device
Zhang et al. A swarm intelligence based searching strategy for articulated 3D human body tracking
CN113657560A (en) Weak supervision image semantic segmentation method and system based on node classification
CN110570450B (en) Target tracking method based on cascade context-aware framework
CN105405138A (en) Water surface target tracking method based on saliency detection
Bellavia et al. HarrisZ+: Harris corner selection for next-gen image matching pipelines
CN111738164A (en) Pedestrian detection method based on deep learning
Yu et al. Automatic segmentation of golden pomfret based on fusion of multi-head self-attention and channel-attention mechanism
CN113657225B (en) Target detection method
Juang et al. Moving object recognition by a shape-based neural fuzzy network
Hu et al. Multi-task l0 gradient minimization for visual tracking
CN111986233B (en) Large-scene minimum target remote sensing video tracking method based on feature self-learning
Li et al. Research on hybrid information recognition algorithm and quality of golf swing
Mei et al. Fast template matching in multi-modal image under pixel distribution mapping

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant