CN110135365B

CN110135365B - Robust object tracking method based on hallucination adversarial network

Info

Publication number: CN110135365B
Application number: CN201910418050.4A
Authority: CN
Inventors: 王菡子; 吴强强; 严严
Original assignee: Xiamen University
Current assignee: Xiamen University
Priority date: 2019-05-20
Filing date: 2019-05-20
Publication date: 2021-04-06
Anticipated expiration: 2039-05-20
Also published as: CN110135365A

Abstract

Robust target tracking method based on hallucination adversarial network, involving computer vision technology. First, a new hallucination adversarial network is proposed, which aims to learn the nonlinear deformation between sample pairs, and apply the learned deformation to the new target to generate new target deformation samples. In order to effectively train the proposed hallucination adversarial network, a deformation reconstruction loss is proposed. Based on the hallucination adversarial network trained offline, a target tracking method based on the hallucination adversarial network is proposed, which can effectively alleviate the overfitting problem of the deep neural network due to online update in the target tracking process. In addition, in order to further improve the quality of deformation transfer, a selective deformation transfer method is proposed, which further improves the tracking accuracy. The proposed object tracking method achieves competitive results on current mainstream object tracking datasets.

Description

Robust object tracking method based on hallucination adversarial network

技术领域technical field

本发明涉及计算机视觉技术，尤其是涉及一种基于幻觉对抗网络的鲁棒目标跟踪方法。The invention relates to computer vision technology, in particular to a robust target tracking method based on hallucination confrontation network.

背景技术Background technique

近几年，深度神经网络在计算机视觉领域的应用取得了巨大成功。目标跟踪作为计算机视觉领域的基础问题之一，其在当前许多计算机视觉任务中均扮演了十分重要的角色，如无人驾驶、增强现实、机器人等领域。近来，基于深度神经网络的目标跟踪算法研究受到了国内外研究者的广泛关注。然而，与其他计算机视觉任务所不同(如目标检测和语义分割)，深度神经网络在目标跟踪任务中的应用仍然十分的有效，主要原因为目标跟踪任务本身存在一定的特殊性，其缺少多样化的在线目标训练样本，因此极大地限制了深度神经网络的泛化性，进而影响跟踪结果。同时，目标跟踪任务旨在于跟踪任意目标，其对于要跟踪的目标不提前给出任何先验知识，这一点也对于深度神经网络离线训练数据集的选择带来了巨大挑战。因此，提出一个具有强泛化性的基于深度神经网络的目标跟踪算法具有重要的现实意义。In recent years, the application of deep neural networks in the field of computer vision has achieved great success. As one of the basic problems in the field of computer vision, object tracking plays a very important role in many current computer vision tasks, such as unmanned driving, augmented reality, robotics and other fields. Recently, the research of target tracking algorithm based on deep neural network has received extensive attention from researchers at home and abroad. However, unlike other computer vision tasks (such as object detection and semantic segmentation), the application of deep neural networks in object tracking tasks is still very effective. Therefore, the generalization of deep neural networks is greatly limited, which in turn affects the tracking results. At the same time, the target tracking task is designed to track any target, and it does not give any prior knowledge of the target to be tracked in advance, which also brings great challenges to the selection of offline training datasets for deep neural networks. Therefore, it is of great practical significance to propose a target tracking algorithm based on deep neural network with strong generalization.

为了缓解上述问题，当前国内外的研究者们主要提出了两种类型的解决方法。第一类方法将目标跟踪任务看作是一个模板匹配的问题，其具体实现往往采用深度孪生网络，将目标模板和搜索区域同时作为深度孪生网络的输入，最后得到搜索区域中与目标模板最为相似的子区域位置。基于相似度计算的深度孪生网络可以通过使用大量标注的目标跟踪数据集进行完全离线的训练，因此其可以避免由于在线训练样本过少所带来的过拟合问题。在基于深度孪生网络的目标跟踪算法中，其开创性的算法为SiamFC。基于SiamFC，研究者们提出了许多改进算法，其包括使用区域建议窗口生成网络的SiamRPN、使用动态记忆网络的MemSiamFC、使用更深层次骨架网络的SiamRPN++等。由于SiamFC类型的跟踪算法能避免耗时的在线训练步骤，因此其往往能达到远超实时的跟踪速度。然而，由于此类算法缺少对于目标表观变化在线学习的过程，其精度仍然较为受限(如在OTB数据集上的精度结果)。研究者们所提出的另一类方法旨在于利用有限的在线样本来学习鲁棒的神经网络分类器。此类方法的一般思路为使用迁移学习领域的方法来缓解过拟合问题，其较为代表性的方法为H.Nam等人于2016年提出的MDNet。MDNet首先使用多域离线学习来学习较好的分类器初始模型参数，然后在跟踪过程中，通过收集目标的正负样本来进一步训练分类器。近来，基于MDNet，研究者们提出了使用对抗学习的VITAL、学习不同层次目标表征的BranchOut、使用RNN的SANet等。相比于前一类方法，此类方法比上一类方法能达到更高的跟踪精度。然而，由于极为有限的在线样本(尤其是目标样本)，使得此类方法的在线学习十分受限，仍易造成过拟合的问题，进而影响跟踪性能。因此，设计一种简单有效的方法来缓解深度目标跟踪算法在跟踪过程中发生的过拟合问题，具有非常重大的意义。In order to alleviate the above problems, researchers at home and abroad have mainly proposed two types of solutions. The first method regards the target tracking task as a template matching problem, and its specific implementation often uses a deep siamese network, taking the target template and the search area as the input of the deep siamese network at the same time, and finally obtains the search area that is most similar to the target template. sub-region location. The deep Siamese network based on similarity calculation can be trained completely offline by using a large number of labeled target tracking data sets, so it can avoid the overfitting problem caused by too few online training samples. In the target tracking algorithm based on deep Siamese network, its pioneering algorithm is SiamFC. Based on SiamFC, researchers have proposed many improved algorithms, including SiamRPN using region proposal window generation network, MemSiamFC using dynamic memory network, SiamRPN++ using deeper skeleton network, etc. Since SiamFC-type tracking algorithms avoid time-consuming online training steps, they often achieve tracking speeds far beyond real-time. However, since such algorithms lack an online learning process for target appearance changes, their accuracy is still limited (such as the accuracy results on the OTB dataset). Another class of methods proposed by researchers aims to learn robust neural network classifiers with limited online samples. The general idea of such methods is to use methods in the field of transfer learning to alleviate the overfitting problem, and a more representative method is MDNet proposed by H. Nam et al. in 2016. MDNet first uses multi-domain offline learning to learn better initial model parameters of the classifier, and then further trains the classifier by collecting positive and negative samples of the target during the tracking process. Recently, based on MDNet, researchers have proposed VITAL using adversarial learning, BranchOut learning target representations at different levels, SANet using RNN, etc. Compared with the previous method, this method can achieve higher tracking accuracy than the previous method. However, due to the extremely limited online samples (especially target samples), the online learning of such methods is very limited, and it is still prone to overfitting, which in turn affects the tracking performance. Therefore, it is of great significance to design a simple and effective method to alleviate the overfitting problem of deep target tracking algorithms during the tracking process.

与当前的目标跟踪算法相比，人类可以轻而易举的对移动的目标进行跟踪。虽然人脑的机制到目前为止还没被完全的探索清楚，但我们可以确定的是通过人类以前的学习经历，人脑衍生出了无与伦比的想象机制。人类可以从平时看到的各类事物中学习到相似的动作或变换，从而将这种相似的变换施加到不同的目标，以此想象出新的目标在不同姿态或动作下的样子。这样的想象机制与机器学习中的数据增强方法极为的类似，人脑可以类比为一个视觉分类器，然后使用想象机制来得到不同状态下的目标样本，从而训练出一个鲁棒的视觉分类器。Compared with current object tracking algorithms, humans can track moving objects with ease. Although the mechanism of the human brain has not been fully explored so far, we can be sure that through the previous learning experience of human beings, the human brain has derived an unparalleled imagination mechanism. Humans can learn similar actions or transformations from various things they see at ordinary times, so as to apply such similar transformations to different targets, so as to imagine the appearance of new targets in different poses or actions. Such an imaginary mechanism is very similar to the data augmentation method in machine learning. The human brain can be compared to a visual classifier, and then the imaginary mechanism is used to obtain target samples in different states, thereby training a robust visual classifier.

发明内容SUMMARY OF THE INVENTION

本发明的目的在于提供基于幻觉对抗网络的鲁棒目标跟踪方法。The purpose of the present invention is to provide a robust target tracking method based on hallucination confrontation network.

本发明包括以下步骤：The present invention includes the following steps:

1)在有标注的目标跟踪数据集中收集大量形变样本对作为训练样本集合；1) Collect a large number of deformation sample pairs in the labeled target tracking data set as a training sample set;

在步骤1)中，所述在有标注的目标跟踪数据集中收集大量形变样本对作为训练样本集合具体过程可为：标记视频序列收集大量目标样本对，一对样本包含同一个目标；在视频序列a中，首先在第t帧选取目标样本

然后在后20帧内随机选取一帧中的目标样本作为

用于构成一组形变样本对

选取大量的形变样本对构成训练样本集合；所述数据集为Fei-Fei Li等人在2015年提出的ILSVRC-2015视频目标检测数据集。In step 1), the specific process of collecting a large number of deformation sample pairs in the marked target tracking data set as a training sample set may be: marking a video sequence to collect a large number of target sample pairs, and a pair of samples contains the same target; In a, first select the target sample in the t-th frame

Then randomly select the target sample in one frame in the next 20 frames as

used to form a set of deformed sample pairs

A large number of deformed sample pairs are selected to form a training sample set; the data set is the ILSVRC-2015 video target detection data set proposed by Fei-Fei Li et al. in 2015.

2)对步骤1)中所得到的训练样本集合中的所有样本进行特征提取，得到训练样本特征集合；2) Feature extraction is performed on all samples in the training sample set obtained in step 1) to obtain a training sample feature set;

在步骤2)中，所述特征提取的步骤可为：首先将目标样本使用双线形插值方法改变大小至107×107×3，然后使用神经网络特征提取器φ(·)对所有插值后的目标样本进行特征提取；所述特征提取器φ(·)的结构可为在Imagenet数据集上预训练的VGG-M模型的前三层卷积层。In step 2), the feature extraction step may be as follows: first, use the bilinear interpolation method to change the size of the target sample to 107×107×3, and then use the neural network feature extractor Feature extraction is performed on the target sample; the structure of the feature extractor φ(·) can be the first three convolutional layers of the VGG-M model pre-trained on the Imagenet dataset.

3)使用步骤2)中得到的训练样本特征集合、对抗损失和所提出的形变重构损失来离线训练所提出的幻觉对抗网络；3) using the training sample feature set, adversarial loss and the proposed deformation reconstruction loss obtained in step 2) to train the proposed hallucination adversarial network offline;

在步骤3)中，所述训练的过程可为：首先从训练样本特征集合中选取两组训练样本特征对，表示为

和

使用幻觉对抗网络学习

和

间的形变，并将此形变施加到

用以生成关于目标b新的形变样本，使用对抗损失保证生成的样本分布与目标b分布相近：In step 3), the training process can be as follows: first, two sets of training sample feature pairs are selected from the training sample feature set, which are expressed as

and

Learning with Illusion Adversarial Networks

and

deformation between and apply this deformation to

It is used to generate new deformed samples about the target b, and the adversarial loss is used to ensure that the generated sample distribution is similar to the distribution of the target b:

其中，

E_n和D_e分表表示所提出的对抗幻想器中的编码器和解码器部分；为了使得生成样本有效编码形变z^a，提出形变重构损失对生成样本进行约束：in,

_En and D _e sub-tables represent the encoder and decoder parts in the proposed adversarial fantasy device; in order to make the generated samples effectively encode the deformation za , ^a deformation reconstruction loss is proposed to constrain the generated samples:

其中，

最终，用于离线训练所提出的幻觉对抗网络的总损失函数为：in,

Finally, the total loss function for offline training of the proposed hallucination adversarial network is:

l_all＝l_adv+λl_def, (公式三)l _all =l _adv +λl _def , (Formula 3)

其中，λ为用于平衡两项损失的超参数；where λ is the hyperparameter used to balance the two losses;

所述幻觉对抗网络的离线训练可包括以下子步骤：The offline training of the hallucination confrontation network may include the following sub-steps:

3.1公式(三)中的参数λ设置为0.5；3.1 The parameter λ in formula (3) is set to 0.5;

3.2在训练中，使用的优化器为Adam(D.P.Kingma,and J.L.Ba,“Adam:A methodfor stochastic optimization,”in Proceedings of the International Conferenceon Learning Representations,2014)，迭代次数为5×10⁵，学习率为2×10^-4；3.2 In training, the optimizer used is Adam (DPKingma, and JLBa, “Adam: A method for stochastic optimization,” in Proceedings of the International Conferenceon Learning Representations, 2014), the number of iterations is 5×10 ⁵ , and the learning rate is 2 × ^10-4 ;

3.3所提出的幻觉对抗网络的编码器和解码器结构均为隐层节点数为2048的三层感知机，编码器输入层节点为9216，编码器输出层节点为64；解码器输入层节点为4672；判别网络同样为隐层节点数为2048的三层感知机，其输入节点数为9216，输出节点数为1。3.3 The encoder and decoder structures of the proposed hallucination adversarial network are both three-layer perceptrons with 2048 hidden layer nodes, 9216 encoder input layer nodes, 64 encoder output layer nodes; decoder input layer nodes are 4672; the discriminant network is also a three-layer perceptron with 2048 hidden layer nodes, the number of input nodes is 9216, and the number of output nodes is 1.

4)给定测试视频中的第一帧标注图像，采集目标样本，并在目标样本周围采用高斯和随机采样方式进行正负样本的采样；4) Given the first frame annotated image in the test video, collect target samples, and use Gaussian and random sampling methods to sample positive and negative samples around the target samples;

在步骤4)中，所述采样的细节可为：在每一次迭代训练中，正负样本比例按照1︰3的比例进行采样，即32个正样本和96个负样本，正样本判定标准为所采样样本和目标样本的区域重叠率大于0.7，负样本的判定标准为所采样样本和目标样本的区域重叠率低于0.5。In step 4), the details of the sampling can be: in each iterative training, the ratio of positive and negative samples is sampled at a ratio of 1:3, that is, 32 positive samples and 96 negative samples, and the positive sample determination standard is The regional overlap ratio of the sampled sample and the target sample is greater than 0.7, and the criterion for negative samples is that the regional overlap ratio of the sampled sample and the target sample is lower than 0.5.

5)使用所提出的选择性形变迁移方法对跟踪目标进行待迁移样本对的选择；5) Use the proposed selective deformation migration method to select the pair of samples to be migrated for the tracking target;

在步骤5)中，所述待迁移样本对的选择的过程可为：定义N_s表示用于收集形变样本对的数据集中视频片断的数目，s_i为视频片断的身份标识，其中，

表示视频片断s_i中对应样本的个数；对于视频片断s_i的特征表达ψ(s_i)，可以通过如下方式计算得到：In step 5), the process of selecting the pair of samples to be migrated may be: defining N _s to represent the number of video clips in the data set used to collect deformation sample pairs, and s _i to be the identification of the video clips, wherein,

Represents the number of corresponding samples in the video segment s _i ; for the feature expression ψ(s _i ) of the video segment s _i , it can be calculated as follows:

其中，

为深度特征提取器，对于目标特征

计算其余每个视频片断表征ψ(s_i)间的欧式距离，选取距离最近的T个视频片断；在选择的T个视频片断中，采用与步骤1)中相同的方式收集大量的形变样本对，构成集合D_S，用于后续目标形变迁移；in,

is the deep feature extractor, for the target feature

Calculate the Euclidean distance between the remaining video clips representing ψ(s _i ), and select the T video clips with the nearest distance; in the selected T video clips, collect a large number of deformation sample pairs in the same way as in step 1). , forming a set D _S for subsequent target deformation migration;

所述选择性形变迁移方法可包括以下子步骤：The selective deformation transfer method may include the following sub-steps:

5.1在计算视频片断的特征表达时，所使用的深度特征提取器

为去掉全连接层的ResNet34模型；5.1 The deep feature extractor used when computing the feature representation of video clips

To remove the ResNet34 model of the fully connected layer;

5.2在选择相似视频片断时，参数T设置为2×10³。5.2 When selecting similar video clips, the parameter T is set to 2×10 ³ .

6)基于选择得到的待迁移样本对，使用离线训练好的幻觉对抗网络生成形变的正样本；6) Based on the selected sample pairs to be migrated, use the offline trained hallucination confrontation network to generate deformed positive samples;

在步骤6)中，所述基于选择得到的待迁移样本对，使用离线训练好的幻觉对抗网络生成形变的正样本的具体步骤可为：在每一次迭代训练中，从集合D_S随机选择64对样本对，每一对样本对与目标样本输入对抗幻想，生成对应形变样本，最终，对于每一次迭代，共计生成64个正样本。In step 6), based on the selected sample pairs to be migrated, the specific steps of using the offline trained hallucination confrontation network to generate deformed positive samples may be: in each iteration training, randomly select 64 samples from the set D _S For sample pairs, each pair of samples is input against the target sample to generate corresponding deformation samples, and finally, for each iteration, a total of 64 positive samples are generated.

7)使用空间采样的正负样本和生成的正样本共同对分类器进行训练，其产生的分类误差损失用于同时更新分类器和幻觉对抗网络；7) Use the positive and negative samples of spatial sampling and the generated positive samples to jointly train the classifier, and the resulting classification error loss is used to update the classifier and the hallucination confrontation network at the same time;

在步骤7)中，所述使用空间采样的正负样本和生成的正样本共同对分类器进行训练，其产生的分类误差损失用于同时更新分类器和幻觉对抗网络的具体方法可为：将生成的64个正样本、32个空间采样的正样本和96个空间采样的负样本共同输入分类器，计算二分类的交叉熵损失，然后使用Adam优化器，通过反向传播算法同时更新分类器和幻觉对抗网络。In step 7), the classifier is jointly trained using the positive and negative samples of spatial sampling and the generated positive samples, and the classification error loss generated by the classifier is used to update the classifier and the hallucination confrontation network at the same time. The specific method can be: The generated 64 positive samples, 32 spatially sampled positive samples, and 96 spatially sampled negative samples are input to the classifier, and the cross-entropy loss of the binary classification is calculated, and then the Adam optimizer is used to update the classifier at the same time through the back-propagation algorithm. And the hallucinations adversarial network.

8)给定新的测试帧，使用训练好的分类器置信度最高的区域作为目标位置，完成当前帧的跟踪；8) Given a new test frame, use the region with the highest confidence of the trained classifier as the target position to complete the tracking of the current frame;

在步骤8)中，所述给定新的测试帧，使用训练好的分类器置信度最高的区域作为目标位置，完成当前帧的跟踪的具体过程可为：在当前测试帧，同时使用随机采样和高斯采样在上一帧估计的目标位置处进行样本采样；采样的样本输入分类器得到其对应的目标置信度。In step 8), the given new test frame, using the region with the highest confidence of the trained classifier as the target position, the specific process of completing the tracking of the current frame can be: in the current test frame, use random sampling at the same time With Gaussian sampling, sample sampling is performed at the target position estimated in the previous frame; the sampled sample is input to the classifier to obtain its corresponding target confidence.

本发明旨在于将人脑的想象机制用于当前基于深度学习的目标跟踪算法，提出一种新的基于幻觉对抗网络的鲁棒目标跟踪方法。本发明首先提出一种新的幻觉对抗网络，旨在于学习样本对间的非线性形变，并将学习到的形变施加在新目标以此来生成新的目标形变样本。为了能有效训练所提出的幻觉对抗网络，提出形变重构损失。基于离线训练的幻觉对抗网络，提出基于幻觉对抗网络的目标跟踪方法，该方法能有效缓解深度神经网络在目标跟踪过程中由于在线更新发生的过拟合问题。此外，为了能进一步提升形变迁移质量，提出选择性性变迁移方法，进一步提升了跟踪精度。本发明提出的目标跟踪方法在当前主流目标跟踪数据集上取得了具有竞争力的结果。The invention aims to apply the imagination mechanism of the human brain to the current target tracking algorithm based on deep learning, and proposes a new robust target tracking method based on the hallucination confrontation network. The present invention first proposes a new hallucination confrontation network, which aims to learn the nonlinear deformation between pairs of samples, and apply the learned deformation to a new target to generate a new target deformation sample. In order to effectively train the proposed hallucination adversarial network, a deformation reconstruction loss is proposed. Based on the hallucination adversarial network trained offline, a target tracking method based on the hallucination adversarial network is proposed, which can effectively alleviate the overfitting problem of the deep neural network due to online update in the target tracking process. In addition, in order to further improve the quality of deformation transfer, a selective deformation transfer method is proposed, which further improves the tracking accuracy. The target tracking method proposed by the present invention has achieved competitive results on the current mainstream target tracking data sets.

附图说明Description of drawings

图1为本发明实施例的流程示意图。FIG. 1 is a schematic flowchart of an embodiment of the present invention.

具体实施方式Detailed ways

下面结合附图和实施例对本发明的方法作详细说明，本实施例在以本发明技术方案为前提下进行实施，给出了实施方式和具体操作过程，但本发明的保护范围不限于下述的实施例。The method of the present invention will be described in detail below in conjunction with the accompanying drawings and examples. The present example is implemented on the premise of the technical solution of the present invention, and the implementation manner and specific operation process are given, but the protection scope of the present invention is not limited to the following example.

参见图1，本发明实施例包括以下步骤：Referring to Fig. 1, the embodiment of the present invention includes the following steps:

A.在有标注的目标跟踪数据集中收集大量形变样本对作为训练样本集合。具体过程如下：标记视频序列来收集大量目标样本对(一对样本包含同一个目标)。如在视频序列a中，首先在第t帧选取目标样本

然后在后20帧内随机选取一帧中的目标样本作为

以此来构成一组形变样本对

按照上述步骤，选取大量的形变样本对来构成训练样本集合。A. Collect a large number of deformation sample pairs in an annotated target tracking dataset as a training sample set. The specific process is as follows: label the video sequence to collect a large number of target sample pairs (a pair of samples contains the same target). As in the video sequence a, first select the target sample in the t-th frame

Then randomly select the target sample in one frame in the next 20 frames as

This constitutes a set of deformation sample pairs

According to the above steps, a large number of deformation sample pairs are selected to form a training sample set.

B.对步骤A中所得到的训练样本集合中的所有样本进行特征提取，得到训练样本特征集合。特征提取步骤如下：首先将目标样本使用双线形插值方法改变大小至107×107×3，然后使用神经网络特征提取器φ(·)对所有插值后的目标样本进行特征提取。B. Perform feature extraction on all samples in the training sample set obtained in step A to obtain a training sample feature set. The feature extraction steps are as follows: first, the target sample is changed to 107×107×3 using the bilinear interpolation method, and then the feature extraction is performed on all the interpolated target samples by using the neural network feature extractor φ(·).

C.使用步骤B中得到的训练样本特征集合、对抗损失和所提出的形变重构损失来离线训练所提出的幻觉对抗网络。其具体训练过程描述如下：首先从训练样本特征集合中选取两组训练样本特征对，表示为

和

使用幻觉对抗网络学习

和

间的形变，并将此形变施加到

用以生成关于目标b新的形变样本。使用对抗损失保证生成的样本分布与目标b分布相近：C. Use the training sample feature set obtained in step B, the adversarial loss, and the proposed deformation reconstruction loss to train the proposed hallucination adversarial network offline. The specific training process is described as follows: First, two sets of training sample feature pairs are selected from the training sample feature set, expressed as

and

Learning with Illusion Adversarial Networks

and

deformation between and apply this deformation to

Used to generate new deformation samples about target b. Using an adversarial loss ensures that the generated sample distribution is close to the target b distribution:

其中，

E_n和D_e分表表示所提出的幻觉对抗网络中的编码器和解码器部分。为了使得生成样本有效编码形变z^a，提出形变重构损失对生成样本进行约束：in,

The _En and _De sub-tables represent the encoder and decoder parts in the proposed hallucination adversarial network. In order to make the generated samples effectively encode the deformation za, ^a deformation reconstruction loss is proposed to constrain the generated samples:

其中，

l_all＝l_adv+λl_def, (公式三)l _all =l _adv +λl _def , (Formula 3)

其中，λ为用于平衡两项损失的超参数。where λ is the hyperparameter used to balance the two losses.

D、给定测试视频中的第一帧标注图像，采集目标样本，并在目标样本周围采用高斯和随机采样方式进行正负样本的采样。采样细节描述如下：在每一次迭代训练中，正负样本比例按照1︰3的比例进行采样，即32个正样本和96个负样本。正样本判定标准为所采样样本和目标样本的区域重叠率大于0.7，负样本的判定标准为所采样样本和目标样本的区域重叠率低于0.5。D. Given the first frame annotated image in the test video, collect target samples, and use Gaussian and random sampling methods to sample positive and negative samples around the target samples. The sampling details are described as follows: In each iteration of training, the ratio of positive and negative samples is sampled in a ratio of 1:3, that is, 32 positive samples and 96 negative samples. The criterion for positive samples is that the regional overlap ratio between the sampled sample and the target sample is greater than 0.7, and the criterion for negative samples is that the regional overlap ratio between the sampled sample and the target sample is less than 0.5.

E、使用所提出的选择性形变迁移方法来对跟踪目标进行待迁移样本对的选择。具体选择过程描述如下：定义N_s表示用于收集形变样本对的数据集中视频片断的数目，s_i为视频片断的身份标识，其中，

表示视频片断s_i中对应样本的个数。对于视频片断s_i的特征表达ψ(s_i)，可以通过如下方式计算得到：E. Use the proposed selective deformation transfer method to select pairs of samples to be migrated for the tracking target. The specific selection process is described as follows: define N _s to represent the number of video clips in the data set used to collect deformation sample pairs, s _i to be the identity of the video clips, where,

Indicates the number of corresponding samples in the video segment _si . For the feature expression ψ(s _i ) of the video segment _si , it can be calculated as follows:

其中，

为深度特征提取器。对于目标特征

计算其余每个视频片断表征ψ(s_i)间的欧式距离，选取距离最近的T个视频片断。在选择的T个视频片断中，采用与步骤A中相同的方式收集大量的形变样本对，构成集合D_S，用于后续目标形变迁移。in,

is a deep feature extractor. For target features

Calculate the Euclidean distance between the representation ψ(s _i ) of each of the remaining video clips, and select the T video clips with the closest distance. In the selected T video clips, a large number of deformation sample pairs are collected in the same manner as in step A to form a set D _S for subsequent target deformation migration.

F、基于选择得到的待迁移样本对，使用离线训练好的幻觉对抗网络生成形变的正样本。具体生成步骤如下：在每一次迭代训练中，从集合D_S随机选择64对样本对，每一对样本对与目标样本输入对抗幻想，生成对应形变样本。最终，对于每一次迭代，共计生成64个正样本。F. Based on the selected sample pairs to be migrated, use the offline trained hallucination confrontation network to generate deformed positive samples. The specific generation steps are as follows: in each iterative training, 64 pairs of samples are randomly selected from the set D _S , and each pair of samples is input against the target sample to generate corresponding deformation samples. Finally, for each iteration, a total of 64 positive samples are generated.

G、使用空间采样的正负样本和生成的正样本来共同对分类器进行训练，其产生的分类误差损失用于同时更新分类器和幻觉对抗网络。具体优化过程如下：将生成的64个正样本、32个空间采样的正样本和96个空间采样的负样本共同输入分类器，计算二分类的交叉熵损失，然后使用Adam优化器，通过反向传播算法同时更新分类器和幻觉对抗网络。G. The classifier is jointly trained using the spatially sampled positive and negative samples and the generated positive samples, and the resulting classification error loss is used to simultaneously update the classifier and the hallucination adversarial network. The specific optimization process is as follows: Input the generated 64 positive samples, 32 spatially sampled positive samples and 96 spatially sampled negative samples into the classifier, calculate the cross-entropy loss of the binary classification, and then use the Adam optimizer to reverse The propagation algorithm updates both the classifier and the hallucination adversarial network.

H、给定新的测试帧，使用训练好的分类器置信度最高的区域作为目标位置，完成当前帧的跟踪。具体过程如下：在当前测试帧，同时使用随机采样和高斯采样在上一帧估计的目标位置处进行样本采样。采样的样本输入分类器得到其对应的目标置信度。H. Given a new test frame, use the region with the highest confidence of the trained classifier as the target position to complete the tracking of the current frame. The specific process is as follows: in the current test frame, use random sampling and Gaussian sampling to sample samples at the target position estimated in the previous frame. The sampled samples are input to the classifier to obtain their corresponding target confidences.

表1为本发明与其他9个目标跟踪算法在OTB-2013数据集上所取得的精度和成功率对比。本发明方法在主流的数据集上取得了优异的跟踪结果。Table 1 compares the accuracy and success rate achieved by the present invention and other 9 target tracking algorithms on the OTB-2013 data set. The method of the present invention achieves excellent tracking results on mainstream datasets.

表1Table 1

方法method 精度(％)Accuracy (%) 成功率(％)Success rate(%) 本发明this invention 95.195.1 69.669.6 VITAL(2018)VITAL(2018) 92.992.9 69.469.4 MCPF(2017)MCPF (2017) 91.691.6 67.767.7 CCOT(2016)CCOT (2016) 91.291.2 67.867.8 MDNet(2016)MDNet (2016) 90.990.9 66.866.8 CREST(2017)CREST (2017) 90.890.8 67.367.3 MetaSDNet(2018)MetaSDNet (2018) 90.590.5 68.468.4 ADNet(2017)ADNet (2017) 90.390.3 65.965.9 TRACA(2018)TRACA (2018) 89.889.8 65.265.2 HCFT(2015)HCFT (2015) 89.189.1 60.560.5

在表1中：In Table 1:

VITAL对应为Y.Song等人提出的方法(Y.Song,C.Ma,X.Wu,L.Gong,L.Bao,W.Zuo,C.Shen,R.Lau,and M.-H.Yang,“VITAL:VIsual Tracking via Adversarial Learning,”in Proceedings of the IEEE Conference on Computer Vision and PatternRecognition,2018,pp.8990-8999.)VITAL corresponds to the method proposed by Y.Song et al. (Y.Song, C.Ma, X.Wu, L.Gong, L.Bao, W.Zuo, C.Shen, R.Lau, and M.-H. Yang, "VITAL: VIsual Tracking via Adversarial Learning," in Proceedings of the IEEE Conference on Computer Vision and PatternRecognition, 2018, pp.8990-8999.)

MCPF对应为T.Zhang等人提出的方法(T.Zhang,C.Xu,and M.-H.Yang,“Multi-Task Correlation Particle Filter for Robust Object Tracking,”in Proceedingsof the IEEE Conference on Computer Vision and Pattern Recognition,2017,pp.4819-4827.)MCPF corresponds to the method proposed by T. Zhang et al. (T. Zhang, C. Xu, and M.-H. Yang, "Multi-Task Correlation Particle Filter for Robust Object Tracking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp.4819-4827.)

CCOT对应为M.Danelljan等人提出的方法(M.Danelljan,A.Robinson,F.S.Khan,and M.Felsberg,“Beyond Correlation Filters:Learning Continuous ConvolutionOperators for Visual Tracking,”in Proceedings of the European Conference onComputer Vision,2016,pp.472-488.)CCOT corresponds to the method proposed by M. Danelljan et al. (M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg, "Beyond Correlation Filters: Learning Continuous ConvolutionOperators for Visual Tracking," in Proceedings of the European Conference on Computer Vision, 2016, pp.472-488.)

MDNet对应为H.Nam等人提出的方法(H.Nam and B.Han,“Learning Multi-domainConvolutional Neural Networks for Visual Tracking,”in Proceedings of the IEEEConference on Computer Vision and Pattern Recognition,2016,pp.817-825.)MDNet corresponds to the method proposed by H.Nam et al. (H.Nam and B.Han, "Learning Multi-domain Convolutional Neural Networks for Visual Tracking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp.817- 825.)

CREST对应为Y.Song等人提出的方法(Y.Song,C.Ma,L.Gong,J.Zhang,R.～W.H.Lau,and M.-H.Yang,“CREST:Convolutional Residual Learning for VisualTracking,”in Proceedings of the IEEE International Conference on ComputerVision,2017,pp.2555-2564.)CREST corresponds to the method proposed by Y. Song et al. (Y. Song, C. Ma, L. Gong, J. Zhang, R. ~ W. H. Lau, and M.-H. Yang, "CREST: Convolutional Residual Learning for Visual Tracking ,” in Proceedings of the IEEE International Conference on ComputerVision, 2017, pp.2555-2564.)

MetaSDNet对应为E.Park等人提出的方法(E.Park and A.C.Berg,“Meta-Tracker:Fast and Robust Online Adaptation for Visual Object Trackers,”inProceedings of the European Conference on Computer Vision,2018,pp.569-585.)MetaSDNet corresponds to the method proposed by E.Park et al. (E.Park and A.C.Berg, "Meta-Tracker: Fast and Robust Online Adaptation for Visual Object Trackers," in Proceedings of the European Conference on Computer Vision, 2018, pp.569- 585.)

ADNet对应为S.Yun等人提出的方法(S.Yun,J.Choi,Y.Yoo,K.Yun,and J.Y.Choi,“Action-decision Networks for Visual Tracking with Deep ReinforcementLearning,”in Proceedings of the IEEE Conference on Computer Vision andPattern Recognition,2017,pp.2711-2720.)ADNet corresponds to the method proposed by S. Yun et al. (S. Yun, J. Choi, Y. Yoo, K. Yun, and J. Y. Choi, "Action-decision Networks for Visual Tracking with Deep Reinforcement Learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp.2711-2720.)

TRACA对应为J.Choi等人提出的方法(J.Choi,H.J.Chang,T.Fischer,S.Yun,andJ.Y.Choi,“Context-aware Deep Feature Compression for High-speed VisualTracking,”in Proceedings of the IEEE Conference on Computer Vision andPattern Recognition,2018,pp.479-488)。TRACA corresponds to the method proposed by J. Choi et al. (J. Choi, H. J. Chang, T. Fischer, S. Yun, and J. Y. Choi, "Context-aware Deep Feature Compression for High-speed Visual Tracking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 479-488).

HCFT对应为C.Ma等人提出的方法(C.Ma,J.-B.Huang,X.Yang,and M.-H.Yang,“Hierarchical Convolutional Features for Visual Tracking,”in Proceedings ofthe IEEE International Conference on Computer Vision,2015,3074-3082)。HCFT corresponds to the method proposed by C.Ma et al. (C.Ma, J.-B.Huang, X.Yang, and M.-H.Yang, "Hierarchical Convolutional Features for Visual Tracking," in Proceedings of the IEEE International Conference on Computer Vision, 2015, 3074-3082).

Claims

1. The robust target tracking method based on hallucination confrontation network is characterized in that comprising the following steps:

1) Collect a large number of deformation sample pairs in the labeled target tracking data set as a training sample set;

2) Feature extraction is performed on all samples in the training sample set obtained in step 1) to obtain a training sample feature set;

3) using the training sample feature set, adversarial loss and the proposed deformation reconstruction loss obtained in step 2) to train the proposed hallucination adversarial network offline;

The training process is as follows: first, two sets of training sample feature pairs are selected from the training sample feature set, which are expressed as

and

Learning with Illusion Adversarial Networks

and

deformation between and apply this deformation to

in,

_En and D _e represent the encoder and decoder parts in the proposed adversarial fantasy device, respectively; in order to make the generated samples effectively encode the deformation za , ^a deformation reconstruction loss is proposed to constrain the generated samples:

in,

where λ is the hyperparameter used to balance the two losses;

The offline training of the hallucination confrontation network includes the following sub-steps:

The parameter λ in 3.1 (formula 3) is set to 0.5;

3.2 In training, the optimizer used is Adam, the number of iterations is 5×10 ⁵ , and the learning rate is 2×10 ^-4 ;

3.3 The encoder and decoder structures of the proposed hallucination adversarial network are both three-layer perceptrons with 2048 hidden layer nodes, 9216 encoder input layer nodes, 64 encoder output layer nodes; decoder input layer nodes are 4672; the discriminant network is also a three-layer perceptron with 2048 hidden layer nodes, the number of input nodes is 9216, and the number of output nodes is 1;

4) Given the first frame annotated image in the test video, collect target samples, and use Gaussian and random sampling methods to sample positive and negative samples around the target samples;

5) Use the proposed selective deformation migration method to select the pair of samples to be migrated for the tracking target;

The process of selecting the pair of samples to be migrated is: defining N _s to represent the number of video clips in the data set used to collect the pairs of deformed samples, and s _i to be the identification of the video clips, where,

Represents the number of corresponding samples in the video segment s _i ; for the feature expression ψ(s _i ) of the video segment s _i , it is calculated by the following method:

in,

is the deep feature extractor, for the target feature

The selective deformation migration method includes the following sub-steps:

5.1 The deep feature extractor used when computing the feature representation of video clips

To remove the ResNet34 model of the fully connected layer;

5.2 When selecting similar video clips, the parameter T is set to 2×10 ³ ;

6) Based on the selected sample pairs to be migrated, use the offline trained hallucination confrontation network to generate deformed positive samples;

7) Use the positive and negative samples of spatial sampling and the generated positive samples to jointly train the classifier, and the resulting classification error loss is used to update the classifier and the hallucination confrontation network at the same time;

8) Given a new test frame, use the region with the highest confidence of the trained classifier as the target position to complete the tracking of the current frame.

2. the robust target tracking method based on hallucination confrontation network as claimed in claim 1, it is characterized in that in step 1) in, described in the target tracking data set with labeling, collect a large amount of deformation samples pair as training sample collection The concrete process is: : Mark the video sequence to collect a large number of target sample pairs, and a pair of samples contains the same target; in the video sequence a, first select the target sample in the t-th frame

Then randomly select the target sample in one frame in the next 20 frames as

used to form a set of deformed sample pairs

A large number of deformation sample pairs are selected to form a training sample set; the data set is the ILSVRC-2015 video target detection data set proposed by Fei-Fei Li in 2015.

3. the robust target tracking method based on hallucination confrontation network as claimed in claim 1, it is characterized in that in step 2), the step of described feature extraction is: at first the target sample is changed size to 107 using bilinear interpolation method ×107×3, and then use the neural network feature extractor φ(·) to perform feature extraction on all interpolated target samples; the structure of the feature extractor φ(·) is the VGG-M pre-trained on the Imagenet dataset The first three convolutional layers of the model.

4. the robust target tracking method based on hallucination confrontation network as claimed in claim 1 is characterized in that in step 4), the details of the sampling are: in each iteration training, the ratio of positive and negative samples is 1:3 The ratio of samples to be sampled, that is, 32 positive samples and 96 negative samples, the determination standard of positive samples is that the regional overlap rate of the sampled samples and the target samples is greater than 0.7, and the determination standard of negative samples is the regional overlap ratio of the sampled samples and the target samples. below 0.5.

5. the robust target tracking method based on hallucination confrontation network as claimed in claim 1, it is characterized in that in step 6) in, described based on the sample pair to be migrated that obtains by selection, use offline trained hallucination confrontation network to generate deformation The specific steps of positive samples are: in each iterative training, randomly select 64 pairs of samples from the set D _S , each pair of sample pairs and the target sample input against fantasy, generate corresponding deformation samples, and finally, for each iteration, a total of Generate 64 positive samples.

6. the robust target tracking method based on hallucination confrontation network as claimed in claim 1, is characterized in that in step 7) in, described using the positive and negative samples of space sampling and the positive samples generated to jointly train the classifier, its The resulting classification error loss is used to update the classifier and the hallucination adversarial network at the same time. The cross-entropy loss for classification is then used to update both the classifier and the hallucination adversarial network through a back-propagation algorithm using the Adam optimizer.

7. the robust target tracking method based on hallucination confrontation network as claimed in claim 1, is characterized in that in step 8) in, described given new test frame, use the region with the highest confidence of trained classifier as target The specific process of completing the tracking of the current frame is: in the current test frame, use random sampling and Gaussian sampling to sample samples at the target position estimated in the previous frame; the sampled samples are input into the classifier to obtain the corresponding target confidence. .