CN112785626A - Twin network small target tracking method based on multi-scale feature fusion - Google Patents
Twin network small target tracking method based on multi-scale feature fusion
- Publication number
- CN112785626A (application number CN202110111717.3A)
- Authority
- CN
- China
- Prior art keywords
- layer
- size
- feature
- convolution
- fusion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20112—Image segmentation details
- G06T2207/20132—Image cropping
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a twin network small target tracking method based on multi-scale feature fusion. The multi-scale feature fusion module and the optimized twin neural network take full account of the respective strengths of a deep neural network structure: its lower layers favour precise localization of the target, while its higher layers can capture the target's semantic information. Through effective fusion across the different levels, low-level information is fully exploited and the problem that the convolution operations of a deep network discard the information of small targets is avoided; this addresses the small-target challenge in the tracking process and achieves a good tracking effect.
Description
Technical Field
The invention relates to visual recognition technology, and in particular to a twin network small target tracking method based on multi-scale feature fusion.
Background Art
Moving object tracking means that, given the position of a target of interest in the first frame of a video sequence, the tracker continues to locate the target accurately and in real time in the subsequent frames and returns its position. In recent years, the theory and methods of object tracking have developed rapidly; tracking is an important research direction in computer vision and has been successfully applied in many fields such as video surveillance, autonomous driving and semantic segmentation. The emergence of deep learning has greatly advanced tracking, but small-target tracking remains a major challenge; in particular, how to track small targets accurately and in real time in complex scenes is a key research problem.
At present, the difficulty of small-target tracking stems mainly from two aspects. First, as the depth of a neural network increases, the features of small objects become very hard to extract, so obtaining a good feature representation is difficult. Second, during tracking, small targets tend to drift suddenly and by large amounts compared with normal-sized targets, for example because of camera shake. Current research focuses only on tracking results for normal-sized objects on general-purpose datasets and neglects the small-target tracking problem.
Existing small-target tracking algorithms are all based on traditional machine learning and have severe limitations in both accuracy and real-time performance. Deep neural networks, thanks to their depth, can extract high-level semantic information and thus represent features better; however, for small objects, as the number of layers grows the repeated convolution operations gradually cause the network to lose the positional information of the small target.
Therefore, by using the deep neural network structure of the twin network and starting from multi-scale feature fusion, the complementary feature information of different network layers can be fused to achieve real-time, robust tracking of small objects in complex scenes and environments. However, applying existing twin networks raises the following problems: how to fuse the multi-scale features of different network layers effectively, and the fact that existing deep neural networks give blurred target positions and carry little semantic information for small objects, which ultimately makes small-target features hard to obtain.
Summary of the Invention
Purpose of the invention: the purpose of the present invention is to overcome the shortcomings of the prior art and to provide a twin network small target tracking method based on multi-scale feature fusion. The invention introduces an original, feature-pyramid-like multi-scale feature fusion module built on a fully convolutional twin network to increase the network's robustness to scale changes, enabling end-to-end training and accurate, real-time tracking of small-scale targets.
Technical solution: the twin network small target tracking method based on multi-scale feature fusion of the present invention comprises the following steps:
Step (1): the template image x and the image y to be searched are each resized and pre-processed with data augmentation in turn, to obtain corresponding cropped training sample pairs of fixed size, which are fed into the template branch and the search branch of the twin network structure respectively;
Step (2): the template branch and the search branch share a feature extractor, i.e. a multi-scale feature fusion module is used to obtain multi-scale fused feature vectors; this comprises two stages: bottom-up feature extraction and top-down lateral feature fusion;
For bottom-up feature extraction, an optimized twin network structure is built; it contains five convolutional layers whose outputs are denoted {C1, C2, C3, C4, C5} in turn;
For top-down lateral fusion, a higher-level feature is first upsampled to enlarge its size and then fused with the feature of the next lower layer; iterating this process generates the multi-scale fused feature maps of the template branch and of the search branch respectively;
Step (3): the template feature map and the search feature map obtained in step (2) are fed into a similarity function, and a cross-correlation operation is performed to obtain a response map; the position with the higher score on the response map is taken as the position where the target objects of the two images are most similar, which determines the target location, i.e. the target position in the image to be searched (the frame that needs to be tracked);
Step (4): the response map is enlarged to the size of the original image y to be searched (for example 255*255), and the response map is then analysed to obtain the final tracking result; multiplying the position with the highest score by the total stride of the five convolutional layers of the optimized twin network structure gives the position of the current target on the image to be searched.
Further, the specific method of resizing the template image x in step (1) is as follows:
The size of the target box in the first frame is known during tracking; let it be (x_min, y_min, w, h). The size of the template image x is then computed from the first-frame target box, i.e. a square region centred on the target to be tracked is cropped out, according to the following formula:
s(w+2p) × s(h+2p) = A
where (x_min, y_min) are the coordinates of the lower-left corner of the target box, w and h are the width and height of the box, p denotes the context margin added around the target, s is the resize scale factor, and A is set to 127*127. The target box is enlarged by the above operation and then resized to 127*127 to obtain the template image x.
In the present invention, the first frame of a video is called the template frame (i.e. the template image x); all subsequent frames are frames in which the target position is to be searched (i.e. images y to be searched), and every position is expressed by four coordinates: the lower-left corner plus the width and height.
The specific method of resizing the image y to be searched is:
the centre of the target box predicted in the previous frame is taken as the cropping centre; the side length of the search box is then determined proportionally from the side length of the square region cropped for the template image x; finally, the crop is resized to 255*255.
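By way of illustration, the following Python sketch performs the cropping and resizing just described. The context margin p = (w + h)/4, the 255/127 scaling of the search crop, and the use of OpenCV for resizing are assumptions not fixed by the text, and padding of crops that leave the frame is omitted for brevity.

```python
import numpy as np
import cv2  # assumed resizing backend


def crop_and_resize(frame, box, out_size):
    """Crop a square region centred on box = (x_min, y_min, w, h) and resize it.

    The square side follows s(w+2p) x s(h+2p) = A; p = (w+h)/4 is an assumed
    context margin, and out-of-frame padding (e.g. with the mean colour) is omitted.
    """
    x_min, y_min, w, h = box
    cx, cy = x_min + w / 2.0, y_min + h / 2.0        # target centre
    p = (w + h) / 4.0                                # assumed context margin
    side = np.sqrt((w + 2 * p) * (h + 2 * p))        # side of the template square
    if out_size != 127:                              # search crop: scaled proportionally
        side *= out_size / 127.0
    x1, y1 = int(round(cx - side / 2)), int(round(cy - side / 2))
    x2, y2 = int(round(cx + side / 2)), int(round(cy + side / 2))
    patch = frame[max(y1, 0):y2, max(x1, 0):x2]
    return cv2.resize(patch, (out_size, out_size))


# z = crop_and_resize(first_frame, first_box, 127)   # 127*127 template image x
# x = crop_and_resize(current_frame, prev_box, 255)  # 255*255 image y to be searched
```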
Further, in step (2) an optimized twin network structure is built for bottom-up feature extraction; it is configured as follows (a code sketch follows the list):
①. The first layer is a convolutional layer: the image is convolved with 11*11*96 kernels at a stride of 2, followed by a 3*3 max-pooling operation and batch normalization; the output is C1;
②. The second layer is a convolutional layer: 5*5*256 kernels with a stride of 1 are applied as two groups on two GPUs, followed by a 3*3 max-pooling operation and batch normalization to extract feature information; the output is C2;
③. The third layer is a convolutional layer: a grouped convolution with 3*3*192 kernels is applied, batch normalization is applied again, and the output is C3;
④. The fourth layer is a convolutional layer: a grouped convolution with 3*3*192 kernels is applied, batch normalization is applied again, and the output is C4;
⑤. The fifth layer is a convolutional layer using only a 3*3*128 convolution; its final output is the 256-dimensional high-level semantic feature C5.
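A minimal PyTorch sketch of this bottom-up branch is given below. Kernel sizes and the pool-then-batch-normalization order follow the text; the pooling strides, the ReLU activations, the group counts and the total channel widths (the text gives per-group kernel counts, e.g. 192 per group in layers 3-4 and 128 per group in layer 5) are assumptions in the usual AlexNet/SiamFC style. With these settings a 127*127 template yields a 6*6 C5 map and a 255*255 search image a 22*22 C5 map, consistent with the sizes quoted in step (3).

```python
import torch.nn as nn


class Backbone(nn.Module):
    """Bottom-up feature extractor producing C1-C5 (C2-C5 are reused by the fusion)."""

    def __init__(self):
        super().__init__()
        # layer 1: 11*11*96 conv, stride 2, then 3*3 max pooling and batch normalization
        self.layer1 = nn.Sequential(nn.Conv2d(3, 96, 11, stride=2),
                                    nn.MaxPool2d(3, 2), nn.BatchNorm2d(96), nn.ReLU(inplace=True))
        # layer 2: 5*5 grouped conv (two groups of 128 filters), 3*3 max pooling, batch norm
        self.layer2 = nn.Sequential(nn.Conv2d(96, 256, 5, groups=2),
                                    nn.MaxPool2d(3, 2), nn.BatchNorm2d(256), nn.ReLU(inplace=True))
        # layers 3-4: 3*3 grouped convs (192 filters per group), batch norm
        self.layer3 = nn.Sequential(nn.Conv2d(256, 384, 3, groups=2),
                                    nn.BatchNorm2d(384), nn.ReLU(inplace=True))
        self.layer4 = nn.Sequential(nn.Conv2d(384, 384, 3, groups=2),
                                    nn.BatchNorm2d(384), nn.ReLU(inplace=True))
        # layer 5: 3*3 grouped conv only (128 filters per group -> 256-d output C5)
        self.layer5 = nn.Conv2d(384, 256, 3, groups=2)

    def forward(self, x):
        c1 = self.layer1(x)
        c2 = self.layer2(c1)
        c3 = self.layer3(c2)
        c4 = self.layer4(c3)
        c5 = self.layer5(c4)
        return c2, c3, c4, c5
```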
Further, the specific method of top-down lateral feature fusion in step (2) is as follows (see the sketch after this list):
(A) Interpolation is used: starting from the pixels of the fifth-layer feature map, 2× upsampling (nearest-neighbour upsampling) inserts new elements between the pixels so that the map reaches the feature size of the fourth layer, enlarging the high-level feature for the next fusion step; the feature sizes of the fourth, third and second layers are then enlarged in turn in the same way;
(B) A 1*1 convolution is applied to the C5 layer to obtain the low-resolution feature P5; a 1×1 convolution kernel is then used to change the number of channels of the fourth-layer feature map C4 produced in the bottom-up stage, fixing its channels uniformly to 256-d to facilitate the subsequent feature fusion; next, the processed fourth-layer result is added to the upsampled fifth-layer result, and a 3*3 convolution kernel is applied to the fused result to suppress the aliasing that upsampling may introduce; the final result is denoted P4;
Iterating process (B) finally generates more accurate feature maps, giving the multi-scale fused feature maps of the template branch and of the branch to be searched respectively.
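A sketch of this top-down stage under the same assumptions follows (channel widths match the backbone sketch above; upsampling is done to the exact size of the lower layer, since the valid convolutions make the ratio between adjacent layers only approximately 2×). Which of the fused maps ultimately feeds the correlation is not fixed here.

```python
import torch.nn as nn
import torch.nn.functional as F


class TopDownFusion(nn.Module):
    """Top-down lateral fusion of C2-C5 into P2-P5 (feature-pyramid style)."""

    def __init__(self, in_channels=(256, 384, 384, 256), out_channels=256):
        super().__init__()
        # 1*1 lateral convolutions fixing every level to 256-d
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        # 3*3 smoothing convolutions applied after each addition to suppress aliasing
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in range(3)])

    def forward(self, c2, c3, c4, c5):
        p5 = self.lateral[3](c5)
        p4 = self.smooth[2](self.lateral[2](c4) +
                            F.interpolate(p5, size=c4.shape[-2:], mode='nearest'))
        p3 = self.smooth[1](self.lateral[1](c3) +
                            F.interpolate(p4, size=c3.shape[-2:], mode='nearest'))
        p2 = self.smooth[0](self.lateral[0](c2) +
                            F.interpolate(p3, size=c2.shape[-2:], mode='nearest'))
        return p2, p3, p4, p5
```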
Further, in step (3) the multi-scale fused feature maps corresponding to the template branch and to the branch to be searched are combined by a cross-correlation operation to obtain the response map. Concretely, the multi-scale fused features of the template branch and of the branch to be searched, of sizes 22*22*256 and 6*6*256, are used; the 6*6*256 feature is taken as a convolution kernel and convolved over the 22*22*256 feature to obtain a 17*17 response map, on which the tracked target position receives a higher score;
During training, after the 17*17 response map is obtained, positive and negative samples are determined: a position on the search image whose distance to the target is less than R counts as a positive sample, and otherwise as a negative sample;
Finally, the whole deep network is trained with a binary cross-entropy logistic loss using stochastic gradient descent, with the number of training iterations set to 50, the mini-batch size set to 8, and the learning rate decayed from 10^-2 to 10^-8;
The above similarity function is:

f(z, x) = φ(z) ⋆ φ(x) + b·1

where φ(z), the fused template feature, is the convolution kernel, which is convolved over the fused search feature φ(x), and b·1 denotes the bias value b taken at every position of the score map.
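The cross-correlation itself takes only a few lines of PyTorch: the fused template feature acts as the kernel that is slid over the fused search feature. The grouped-convolution trick for mini-batches and the scalar handling of the bias term b·1 are implementation choices, not something the text prescribes.

```python
import torch
import torch.nn.functional as F


def cross_correlation(template_feat, search_feat, bias=0.0):
    """f(z, x) = phi(z) * phi(x) + b·1 for a single pair: a (1, 256, 6, 6) template
    feature slid over a (1, 256, 22, 22) search feature gives a (1, 1, 17, 17) map."""
    return F.conv2d(search_feat, template_feat) + bias


def batched_cross_correlation(template_feat, search_feat, bias=0.0):
    """Same operation for a mini-batch, using one convolution group per sample."""
    n, c, h, w = template_feat.shape
    search = search_feat.view(1, n * c, *search_feat.shape[-2:])
    response = F.conv2d(search, template_feat, groups=n)
    return response.view(n, 1, *response.shape[-2:]) + bias


z = torch.randn(1, 256, 6, 6)          # fused template feature
x = torch.randn(1, 256, 22, 22)        # fused search feature
print(cross_correlation(z, x).shape)   # torch.Size([1, 1, 17, 17])
```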
Beneficial effects: the present invention provides a multi-scale feature fusion module that takes full account of the fact that, in a deep neural network structure, the lower layers favour precise localization of the target while the higher layers can capture the target's semantic information; through effective fusion of the different levels, low-level information is fully exploited and the problem that the convolution operations of a deep network discard the information of small targets is avoided. In addition, the present invention optimizes the existing twin network structure, giving a visual target tracking method that can track small objects accurately and in real time.
In summary, the present invention comprehensively and effectively fuses the features of different network layers, solves the small-target challenge in the tracking process, and thus achieves a good tracking effect.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the overall flow of the present invention;
Fig. 2 is a schematic structural diagram of the multi-scale feature fusion module of the branch to be searched in an embodiment of the present invention;
Fig. 3 is a comparative schematic diagram of an embodiment of the present invention;
where Fig. 3(a) is the visualized feature map obtained with the present invention, and Fig. 3(b) is the visualized feature map obtained with an existing twin network.
Detailed Description of the Embodiments
The technical solution of the present invention is described in detail below, but the scope of protection of the present invention is not limited to the embodiments.
In practical applications of target tracking, targets captured by the camera sometimes need to be tracked from medium or high altitude; how to track a target continuously and accurately in such long-range scenes has always been a difficult research problem in the tracking field.
Based on the optimized twin network, the present invention performs feature fusion with a top-down multi-scale fusion method and solves the difficulty of tracking small objects in the prior art. As shown in Fig. 1, the twin network small target tracking method based on multi-scale feature fusion of the present invention comprises the following steps:
Step (1): the template image x and the image y to be searched are each resized and pre-processed with data augmentation in turn, to obtain corresponding cropped training sample pairs of fixed size, which are fed into the template branch and the search branch of the twin network structure respectively;
During tracking, let the size of the target box in the first frame be (x_min, y_min, w, h); the size of the template image x is then computed from the first-frame target box, i.e. a square region centred on the target to be tracked is cropped out, according to the following formula:
s(w+2p) × s(h+2p) = A
where s is the resize scale factor and A is set to 127*127; the target box is enlarged by the above operation and then resized to 127*127 to obtain the template image x;
During training, the specific method of resizing the image y to be searched is:
the centre of the target box predicted in the previous frame is taken as the cropping centre; the side length of the search box is then determined proportionally from the side length of the square region cropped for the template image x; finally, the crop is resized to 255*255;
Step (2): the template branch and the search branch share a feature extractor, i.e. a multi-scale feature fusion module is used to obtain multi-scale fused feature vectors; this comprises two stages: bottom-up feature extraction and top-down lateral feature fusion;
As shown in Fig. 2, an optimized twin network structure is built for bottom-up feature extraction; it is configured as follows:
①. The first layer is a convolutional layer: the image is convolved with 11*11*96 kernels at a stride of 2, followed by a 3*3 max-pooling operation and batch normalization; the output is C1;
②. The second layer is a convolutional layer: 5*5*256 kernels with a stride of 1 are applied as two groups on two GPUs, followed by a 3*3 max-pooling operation and batch normalization to extract feature information; the output is C2;
③. The third layer is a convolutional layer: a grouped convolution with 3*3*192 kernels is applied, batch normalization is applied again, and the output is C3;
④. The fourth layer is a convolutional layer: a grouped convolution with 3*3*192 kernels is applied, batch normalization is applied again, and the output is C4;
⑤. The fifth layer is a convolutional layer using only a 3*3*128 convolution; its final output is the 256-dimensional high-level semantic feature C5.
The specific method of top-down lateral feature fusion is as follows:
(A) Interpolation is used: starting from the pixels of the fifth-layer feature map, 2× upsampling (nearest-neighbour upsampling) inserts new elements between the pixels so that the map reaches the feature size of the fourth layer, enlarging the high-level feature for the next fusion step; the feature sizes of the fourth, third and second layers are then enlarged in turn in the same way;
(B) A 1*1 convolution is applied to the C5 layer to obtain the low-resolution feature P5; a 1×1 convolution kernel is then used to change the number of channels of the fourth-layer feature map C4 produced in the bottom-up stage, fixing its channels uniformly to 256-d to facilitate the subsequent feature fusion; next, the processed fourth-layer result is added to the upsampled fifth-layer result, and a 3*3 convolution kernel is applied to the fused result to suppress the aliasing that upsampling may introduce; the final result is denoted P4;
Iterating process (B) finally generates more accurate feature maps, giving the multi-scale fused feature maps of the template branch and of the branch to be searched respectively;
Step (3): the multi-scale fused feature maps corresponding to the template branch and to the branch to be searched are combined by a cross-correlation operation to obtain the response map. Concretely, the multi-scale fused features of the template branch and of the branch to be searched, of sizes 22*22*256 and 6*6*256, are used; the 6*6*256 feature is taken as a convolution kernel and convolved over the 22*22*256 feature to obtain a 17*17 response map, on which the tracked target position receives a higher score;
During training, after the response map is obtained, positive and negative samples must be determined: a position on the search image whose distance to the target is less than R counts as a positive sample, and otherwise as a negative sample;
Finally, the whole deep network is trained with a binary cross-entropy logistic loss using stochastic gradient descent, with the number of training iterations set to 50, the mini-batch size set to 8, and the learning rate decayed from 10^-2 to 10^-8 (a training sketch is given after the similarity function below);
The above similarity function is:

f(z, x) = φ(z) ⋆ φ(x) + b·1

where φ(z), the fused template feature, is the convolution kernel, which is convolved over the fused search feature φ(x), and b·1 denotes the bias value b taken at every position of the score map;
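A training-loop sketch matching these figures (50 epochs, mini-batches of 8 supplied by the data loader, binary cross-entropy logistic loss, SGD with the learning rate decayed from 10^-2 to 10^-8) is given below. The momentum value, the radius R expressed in response-map cells, and the geometric decay schedule are assumptions.

```python
import torch
import torch.nn as nn


def make_label_map(size=17, radius=2.0):
    """Positive (1) within radius R of the response-map centre, negative (0) outside."""
    ys, xs = torch.meshgrid(torch.arange(size), torch.arange(size), indexing="ij")
    c = (size - 1) / 2.0
    dist = torch.sqrt((xs - c) ** 2 + (ys - c) ** 2)
    return (dist <= radius).float()


def train(model, loader, epochs=50, lr_start=1e-2, lr_end=1e-8):
    criterion = nn.BCEWithLogitsLoss()                  # binary cross-entropy logistic loss
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_start, momentum=0.9)
    gamma = (lr_end / lr_start) ** (1.0 / max(epochs - 1, 1))
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
    labels = make_label_map()                           # same label map for every pair
    for _ in range(epochs):
        for z_img, x_img in loader:                     # template / search image pairs
            response = model(z_img, x_img).squeeze(1)   # (N, 17, 17) raw scores
            loss = criterion(response, labels.expand_as(response))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```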
Step (4): the response map is enlarged to the original image size, and the response map is then analysed to obtain the final tracking result; multiplying the position with the highest score by the total stride of the five convolutional layers of the optimized twin network structure gives the position of the current target on the image to be searched.
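A sketch of this read-out step: the displacement of the response-map peak from the map centre is scaled by the total stride and applied to the centre of the 255*255 search crop. The total stride of 8 and the displacement-from-centre convention are assumptions consistent with the backbone sketch above.

```python
def locate_target(response, total_stride=8, search_size=255):
    """Map the peak of a (1, 1, H, W) torch response map back to search-image coordinates."""
    _, _, h, w = response.shape
    flat_idx = response.view(-1).argmax().item()
    ry, rx = divmod(flat_idx, w)
    # displacement from the response-map centre, scaled back to image pixels
    dx = (rx - (w - 1) / 2.0) * total_stride
    dy = (ry - (h - 1) / 2.0) * total_stride
    cx = (search_size - 1) / 2.0 + dx
    cy = (search_size - 1) / 2.0 + dy
    return cx, cy  # estimated target centre on the image to be searched
```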
As shown in Fig. 3, the target localization obtained with the method of the present invention is accurate and the visualized features are clearer.
It can be seen from the above embodiment that the present invention treats target tracking as learning a similarity-measurement problem. The template image x and the image y to be searched are fed into the twin network structure and undergo the same transformation; the designed multi-scale feature fusion module obtains the corresponding feature vector of each branch; finally, the template feature map is used as a convolution kernel to perform a cross-correlation operation on the search feature, generating a response map that compares the similarity between the two. A position with high similarity returns a high score, i.e. the target position; otherwise a low score is returned.
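Tying the pieces together, a minimal end-to-end module built from the Backbone, TopDownFusion and batched_cross_correlation sketches above might look as follows; which fused pyramid level feeds the correlation is an assumption (the 6*6 / 22*22 sizes quoted in step (3) correspond to the C5 resolution, i.e. level index 3 here).

```python
import torch.nn as nn


class TwinTracker(nn.Module):
    """Shared backbone + top-down fusion for both branches, followed by cross-correlation."""

    def __init__(self, level=3):
        super().__init__()
        self.backbone = Backbone()        # sketch defined above
        self.fusion = TopDownFusion()     # sketch defined above
        self.level = level                # index into (p2, p3, p4, p5); 3 = C5 resolution

    def embed(self, img):
        c2, c3, c4, c5 = self.backbone(img)
        pyramid = self.fusion(c2, c3, c4, c5)
        return pyramid[self.level]

    def forward(self, template_img, search_img):
        z = self.embed(template_img)      # e.g. (N, 256, 6, 6)
        x = self.embed(search_img)        # e.g. (N, 256, 22, 22)
        return batched_cross_correlation(z, x)   # (N, 1, 17, 17) response map
```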
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110111717.3A CN112785626A (en) | 2021-01-27 | 2021-01-27 | Twin network small target tracking method based on multi-scale feature fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110111717.3A CN112785626A (en) | 2021-01-27 | 2021-01-27 | Twin network small target tracking method based on multi-scale feature fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112785626A true CN112785626A (en) | 2021-05-11 |
Family
ID=75758302
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110111717.3A Pending CN112785626A (en) | 2021-01-27 | 2021-01-27 | Twin network small target tracking method based on multi-scale feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112785626A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109191491A (en) * | 2018-08-03 | 2019-01-11 | 华中科技大学 | The method for tracking target and system of the twin network of full convolution based on multilayer feature fusion |
CN111179307A (en) * | 2019-12-16 | 2020-05-19 | 浙江工业大学 | Visual target tracking method for full-volume integral and regression twin network structure |
CN111291679A (en) * | 2020-02-06 | 2020-06-16 | 厦门大学 | A Siamese Network-Based Target Tracking Method for Target-Specific Response Attention |
CN111489361A (en) * | 2020-03-30 | 2020-08-04 | 中南大学 | Real-time visual object tracking method based on deep feature aggregation of Siamese network |
CN111681259A (en) * | 2020-05-17 | 2020-09-18 | 天津理工大学 | Vehicle tracking model establishment method based on detection network without anchor mechanism |
CN111898504A (en) * | 2020-07-20 | 2020-11-06 | 南京邮电大学 | A Target Tracking Method and System Based on Siamese Recurrent Neural Network |
CN112184752A (en) * | 2020-09-08 | 2021-01-05 | 北京工业大学 | Video target tracking method based on pyramid convolution |
Non-Patent Citations (4)
Title |
---|
Cui Zhoujuan et al.: "Lightweight Siamese attention network target tracking for UAVs", Acta Optica Sinica *
Yang Zhe et al.: "Target tracking algorithm based on a Siamese network fusing multiple templates", Computer Engineering and Applications *
Wu Yuwei: "Fundamentals and Applications of Deep Learning", 30 April 2020, Beijing: Beijing Institute of Technology Press *
Dong Hongyi: "Deep Learning: Object Detection in Practice with PyTorch", 31 January 2020, Beijing: China Machine Press *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113223053A (en) * | 2021-05-27 | 2021-08-06 | 广东技术师范大学 | Anchor-free target tracking method based on fusion of twin network and multilayer characteristics |
CN113627488A (en) * | 2021-07-13 | 2021-11-09 | 武汉大学 | Twin network online update-based single target tracking method and device |
CN113627488B (en) * | 2021-07-13 | 2023-07-21 | 武汉大学 | Single target tracking method and device based on twin network online update |
CN113808166A (en) * | 2021-09-15 | 2021-12-17 | 西安电子科技大学 | Single-target tracking method based on clustering difference and depth twin convolutional neural network |
CN113808166B (en) * | 2021-09-15 | 2023-04-18 | 西安电子科技大学 | Single-target tracking method based on clustering difference and depth twin convolutional neural network |
CN114372999A (en) * | 2021-12-20 | 2022-04-19 | 浙江大华技术股份有限公司 | Object detection method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110738207B (en) | Character detection method for fusing character area edge information in character image | |
CN112785626A (en) | Twin network small target tracking method based on multi-scale feature fusion | |
CN113744311A (en) | Twin neural network moving target tracking method based on full-connection attention module | |
CN114170410B (en) | Point cloud part classification method based on PointNet graph convolution and KNN search | |
CN110738673A (en) | Visual SLAM method based on example segmentation | |
CN111161317A (en) | Single-target tracking method based on multiple networks | |
CN113076871A (en) | Fish shoal automatic detection method based on target shielding compensation | |
CN109767456A (en) | A target tracking method based on SiameseFC framework and PFP neural network | |
CN115984969A (en) | Lightweight pedestrian tracking method in complex scene | |
Lu et al. | Cross stage partial connections based weighted bi-directional feature pyramid and enhanced spatial transformation network for robust object detection | |
CN111523447A (en) | Vehicle tracking method, device, electronic equipment and storage medium | |
CN110310305B (en) | A target tracking method and device based on BSSD detection and Kalman filtering | |
CN109242019B (en) | Rapid detection and tracking method for optical small target on water surface | |
CN116935486A (en) | Sign language identification method and system based on skeleton node and image mode fusion | |
CN109740552A (en) | A Target Tracking Method Based on Parallel Feature Pyramid Neural Network | |
CN113112450B (en) | A method for small target detection in remote sensing images guided by image pyramid | |
CN112990066B (en) | Remote sensing image solid waste identification method and system based on multi-strategy enhancement | |
CN111242003A (en) | Video salient object detection method based on multi-scale constrained self-attention mechanism | |
CN115713546A (en) | Lightweight target tracking algorithm for mobile terminal equipment | |
CN116468895A (en) | Similarity matrix guided few-sample semantic segmentation method and system | |
CN103093211B (en) | Based on the human body motion tracking method of deep nuclear information image feature | |
CN109034237A (en) | Winding detection method based on convolutional Neural metanetwork road sign and sequence search | |
CN110634160B (en) | 3D Keypoint Extraction Model Construction and Pose Recognition Method of Target in 2D Graphics | |
CN111444913A (en) | License plate real-time detection method based on edge-guided sparse attention mechanism | |
CN112801020B (en) | Pedestrian re-identification method and system based on background grayscale |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210511 |