CN112785626A - Twin network small target tracking method based on multi-scale feature fusion - Google Patents
Twin network small target tracking method based on multi-scale feature fusion
- Publication number
- CN112785626A (application number CN202110111717.3A)
- Authority
- CN
- China
- Prior art keywords
- layer
- size
- feature
- convolution
- fusion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20112—Image segmentation details
- G06T2207/20132—Image cropping
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a twin network small target tracking method based on multi-scale feature fusion. The multi-scale feature fusion module and the optimized twin neural network take full account of the respective strengths of a deep neural network structure: its lower layers favour precise localization of the target, while its higher layers can capture the target's semantic information. Through effective fusion across the different levels, low-level information is fully exploited and the problem that the convolution operations of a deep network discard the information of small targets is avoided; this addresses the small-target challenge in the tracking process and achieves a good tracking effect.
Description
Technical Field
The invention relates to visual recognition technology, and in particular to a twin network small target tracking method based on multi-scale feature fusion.
Background Art
Moving object tracking means that, given the position of a target of interest in the first frame of a video sequence, the tracker continues to locate the target accurately and in real time in the subsequent frames and returns its position. In recent years, the theory and methods of object tracking have developed rapidly; tracking is an important research direction in computer vision and has been successfully applied in many fields such as video surveillance, autonomous driving and semantic segmentation. The emergence of deep learning has greatly advanced tracking, but small-target tracking remains a major challenge; in particular, how to track small targets accurately and in real time in complex scenes is a key research problem.
At present, the difficulty of small-target tracking stems mainly from two aspects. First, as the depth of a neural network increases, the features of small objects become very hard to extract, so obtaining a good feature representation is difficult. Second, during tracking, small targets tend to drift suddenly and by large amounts compared with normal-sized targets, for example because of camera shake. Current research focuses only on tracking results for normal-sized objects on general-purpose datasets and neglects the small-target tracking problem.
Existing small-target tracking algorithms are all based on traditional machine learning and have severe limitations in both accuracy and real-time performance. Deep neural networks, thanks to their depth, can extract high-level semantic information and thus represent features better; however, for small objects, as the number of layers grows the repeated convolution operations gradually cause the network to lose the positional information of the small target.
Therefore, by using the deep neural network structure of the twin network and starting from multi-scale feature fusion, the complementary feature information of different network layers can be fused to achieve real-time, robust tracking of small objects in complex scenes and environments. However, applying existing twin networks raises the following problems: how to fuse the multi-scale features of different network layers effectively, and the fact that existing deep neural networks give blurred target positions and carry little semantic information for small objects, which ultimately makes small-target features hard to obtain.
Summary of the Invention
Purpose of the invention: the purpose of the present invention is to overcome the shortcomings of the prior art and to provide a twin network small target tracking method based on multi-scale feature fusion. The invention introduces an original, feature-pyramid-like multi-scale feature fusion module built on a fully convolutional twin network to increase the network's robustness to scale changes, enabling end-to-end training and accurate, real-time tracking of small-scale targets.
Technical solution: the twin network small target tracking method based on multi-scale feature fusion of the present invention comprises the following steps:
Step (1): the template image x and the image y to be searched are each resized and pre-processed with data augmentation in turn, to obtain corresponding cropped training sample pairs of fixed size, which are fed into the template branch and the search branch of the twin network structure respectively;
Step (2): the template branch and the search branch share a feature extractor, i.e. a multi-scale feature fusion module is used to obtain multi-scale fused feature vectors; this comprises two stages: bottom-up feature extraction and top-down lateral feature fusion;
For bottom-up feature extraction, an optimized twin network structure is built; it contains five convolutional layers whose outputs are denoted {C1, C2, C3, C4, C5} in turn;
For top-down lateral fusion, a higher-level feature is first upsampled to enlarge its size and then fused with the feature of the next lower layer; iterating this process generates the multi-scale fused feature maps of the template branch and of the search branch respectively;
Step (3): the template feature map and the search feature map obtained in step (2) are fed into a similarity function, and a cross-correlation operation is performed to obtain a response map; the position with the higher score on the response map is taken as the position where the target objects of the two images are most similar, which determines the target location, i.e. the target position in the image to be searched (the frame that needs to be tracked);
Step (4): the response map is enlarged to the size of the original image y to be searched (for example 255*255), and the response map is then analysed to obtain the final tracking result; multiplying the position with the highest score by the total stride of the five convolutional layers of the optimized twin network structure gives the position of the current target on the image to be searched.
Further, the specific method of resizing the template image x in step (1) is as follows:
The size of the target box in the first frame is known during tracking; let it be (x_min, y_min, w, h). The size of the template image x is then computed from the first-frame target box, i.e. a square region centred on the target to be tracked is cropped out, according to the following formula:
s(w+2p) × s(h+2p) = A
where (x_min, y_min) are the coordinates of the lower-left corner of the target box, w and h are the width and height of the box, p denotes the context margin added around the target, s is the resize scale factor, and A is set to 127*127. The target box is enlarged by the above operation and then resized to 127*127 to obtain the template image x.
In the present invention, the first frame of a video is called the template frame (i.e. the template image x); all subsequent frames are frames in which the target position is to be searched (i.e. images y to be searched), and every position is expressed by four coordinates: the lower-left corner plus the width and height.
The specific method of resizing the image y to be searched is:
the centre of the target box predicted in the previous frame is taken as the cropping centre; the side length of the search box is then determined proportionally from the side length of the square region cropped for the template image x; finally, the crop is resized to 255*255.
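By way of illustration, the following Python sketch performs the cropping and resizing just described. The context margin p = (w + h)/4, the 255/127 scaling of the search crop, and the use of OpenCV for resizing are assumptions not fixed by the text, and padding of crops that leave the frame is omitted for brevity.

```python
import numpy as np
import cv2  # assumed resizing backend


def crop_and_resize(frame, box, out_size):
    """Crop a square region centred on box = (x_min, y_min, w, h) and resize it.

    The square side follows s(w+2p) x s(h+2p) = A; p = (w+h)/4 is an assumed
    context margin, and out-of-frame padding (e.g. with the mean colour) is omitted.
    """
    x_min, y_min, w, h = box
    cx, cy = x_min + w / 2.0, y_min + h / 2.0        # target centre
    p = (w + h) / 4.0                                # assumed context margin
    side = np.sqrt((w + 2 * p) * (h + 2 * p))        # side of the template square
    if out_size != 127:                              # search crop: scaled proportionally
        side *= out_size / 127.0
    x1, y1 = int(round(cx - side / 2)), int(round(cy - side / 2))
    x2, y2 = int(round(cx + side / 2)), int(round(cy + side / 2))
    patch = frame[max(y1, 0):y2, max(x1, 0):x2]
    return cv2.resize(patch, (out_size, out_size))


# z = crop_and_resize(first_frame, first_box, 127)   # 127*127 template image x
# x = crop_and_resize(current_frame, prev_box, 255)  # 255*255 image y to be searched
```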
Further, in step (2) an optimized twin network structure is built for bottom-up feature extraction; it is configured as follows (a code sketch follows the list):
①. The first layer is a convolutional layer: the image is convolved with 11*11*96 kernels at a stride of 2, followed by a 3*3 max-pooling operation and batch normalization; the output is C1;
②. The second layer is a convolutional layer: 5*5*256 kernels with a stride of 1 are applied as two groups on two GPUs, followed by a 3*3 max-pooling operation and batch normalization to extract feature information; the output is C2;
③. The third layer is a convolutional layer: a grouped convolution with 3*3*192 kernels is applied, batch normalization is applied again, and the output is C3;
④. The fourth layer is a convolutional layer: a grouped convolution with 3*3*192 kernels is applied, batch normalization is applied again, and the output is C4;
⑤. The fifth layer is a convolutional layer using only a 3*3*128 convolution; its final output is the 256-dimensional high-level semantic feature C5.
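A minimal PyTorch sketch of this bottom-up branch is given below. Kernel sizes and the pool-then-batch-normalization order follow the text; the pooling strides, the ReLU activations, the group counts and the total channel widths (the text gives per-group kernel counts, e.g. 192 per group in layers 3-4 and 128 per group in layer 5) are assumptions in the usual AlexNet/SiamFC style. With these settings a 127*127 template yields a 6*6 C5 map and a 255*255 search image a 22*22 C5 map, consistent with the sizes quoted in step (3).

```python
import torch.nn as nn


class Backbone(nn.Module):
    """Bottom-up feature extractor producing C1-C5 (C2-C5 are reused by the fusion)."""

    def __init__(self):
        super().__init__()
        # layer 1: 11*11*96 conv, stride 2, then 3*3 max pooling and batch normalization
        self.layer1 = nn.Sequential(nn.Conv2d(3, 96, 11, stride=2),
                                    nn.MaxPool2d(3, 2), nn.BatchNorm2d(96), nn.ReLU(inplace=True))
        # layer 2: 5*5 grouped conv (two groups of 128 filters), 3*3 max pooling, batch norm
        self.layer2 = nn.Sequential(nn.Conv2d(96, 256, 5, groups=2),
                                    nn.MaxPool2d(3, 2), nn.BatchNorm2d(256), nn.ReLU(inplace=True))
        # layers 3-4: 3*3 grouped convs (192 filters per group), batch norm
        self.layer3 = nn.Sequential(nn.Conv2d(256, 384, 3, groups=2),
                                    nn.BatchNorm2d(384), nn.ReLU(inplace=True))
        self.layer4 = nn.Sequential(nn.Conv2d(384, 384, 3, groups=2),
                                    nn.BatchNorm2d(384), nn.ReLU(inplace=True))
        # layer 5: 3*3 grouped conv only (128 filters per group -> 256-d output C5)
        self.layer5 = nn.Conv2d(384, 256, 3, groups=2)

    def forward(self, x):
        c1 = self.layer1(x)
        c2 = self.layer2(c1)
        c3 = self.layer3(c2)
        c4 = self.layer4(c3)
        c5 = self.layer5(c4)
        return c2, c3, c4, c5
```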
Further, the specific method of top-down lateral feature fusion in step (2) is as follows (see the sketch after this list):
(A) Interpolation is used: starting from the pixels of the fifth-layer feature map, 2× upsampling (nearest-neighbour upsampling) inserts new elements between the pixels so that the map reaches the feature size of the fourth layer, enlarging the high-level feature for the next fusion step; the feature sizes of the fourth, third and second layers are then enlarged in turn in the same way;
(B) A 1*1 convolution is applied to the C5 layer to obtain the low-resolution feature P5; a 1×1 convolution kernel is then used to change the number of channels of the fourth-layer feature map C4 produced in the bottom-up stage, fixing its channels uniformly to 256-d to facilitate the subsequent feature fusion; next, the processed fourth-layer result is added to the upsampled fifth-layer result, and a 3*3 convolution kernel is applied to the fused result to suppress the aliasing that upsampling may introduce; the final result is denoted P4;
Iterating process (B) finally generates more accurate feature maps, giving the multi-scale fused feature maps of the template branch and of the branch to be searched respectively.
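A sketch of this top-down stage under the same assumptions follows (channel widths match the backbone sketch above; upsampling is done to the exact size of the lower layer, since the valid convolutions make the ratio between adjacent layers only approximately 2×). Which of the fused maps ultimately feeds the correlation is not fixed here.

```python
import torch.nn as nn
import torch.nn.functional as F


class TopDownFusion(nn.Module):
    """Top-down lateral fusion of C2-C5 into P2-P5 (feature-pyramid style)."""

    def __init__(self, in_channels=(256, 384, 384, 256), out_channels=256):
        super().__init__()
        # 1*1 lateral convolutions fixing every level to 256-d
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        # 3*3 smoothing convolutions applied after each addition to suppress aliasing
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in range(3)])

    def forward(self, c2, c3, c4, c5):
        p5 = self.lateral[3](c5)
        p4 = self.smooth[2](self.lateral[2](c4) +
                            F.interpolate(p5, size=c4.shape[-2:], mode='nearest'))
        p3 = self.smooth[1](self.lateral[1](c3) +
                            F.interpolate(p4, size=c3.shape[-2:], mode='nearest'))
        p2 = self.smooth[0](self.lateral[0](c2) +
                            F.interpolate(p3, size=c2.shape[-2:], mode='nearest'))
        return p2, p3, p4, p5
```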
Further, in step (3) the multi-scale fused feature maps corresponding to the template branch and to the branch to be searched are combined by a cross-correlation operation to obtain the response map. Concretely, the multi-scale fused features of the template branch and of the branch to be searched, of sizes 22*22*256 and 6*6*256, are used; the 6*6*256 feature is taken as a convolution kernel and convolved over the 22*22*256 feature to obtain a 17*17 response map, on which the tracked target position receives a higher score;
During training, after the 17*17 response map is obtained, positive and negative samples are determined: a position on the search image whose distance to the target is less than R counts as a positive sample, and otherwise as a negative sample;
Finally, the whole deep network is trained with a binary cross-entropy logistic loss using stochastic gradient descent, with the number of training iterations set to 50, the mini-batch size set to 8, and the learning rate decayed from 10^-2 to 10^-8;
The above similarity function is:

f(z, x) = φ(z) ⋆ φ(x) + b·1

where φ(z), the fused template feature, is the convolution kernel, which is convolved over the fused search feature φ(x), and b·1 denotes the bias value b taken at every position of the score map.
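The cross-correlation itself takes only a few lines of PyTorch: the fused template feature acts as the kernel that is slid over the fused search feature. The grouped-convolution trick for mini-batches and the scalar handling of the bias term b·1 are implementation choices, not something the text prescribes.

```python
import torch
import torch.nn.functional as F


def cross_correlation(template_feat, search_feat, bias=0.0):
    """f(z, x) = phi(z) * phi(x) + b·1 for a single pair: a (1, 256, 6, 6) template
    feature slid over a (1, 256, 22, 22) search feature gives a (1, 1, 17, 17) map."""
    return F.conv2d(search_feat, template_feat) + bias


def batched_cross_correlation(template_feat, search_feat, bias=0.0):
    """Same operation for a mini-batch, using one convolution group per sample."""
    n, c, h, w = template_feat.shape
    search = search_feat.view(1, n * c, *search_feat.shape[-2:])
    response = F.conv2d(search, template_feat, groups=n)
    return response.view(n, 1, *response.shape[-2:]) + bias


z = torch.randn(1, 256, 6, 6)          # fused template feature
x = torch.randn(1, 256, 22, 22)        # fused search feature
print(cross_correlation(z, x).shape)   # torch.Size([1, 1, 17, 17])
```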
Beneficial effects: the present invention provides a multi-scale feature fusion module that takes full account of the fact that, in a deep neural network structure, the lower layers favour precise localization of the target while the higher layers can capture the target's semantic information; through effective fusion of the different levels, low-level information is fully exploited and the problem that the convolution operations of a deep network discard the information of small targets is avoided. In addition, the present invention optimizes the existing twin network structure, giving a visual target tracking method that can track small objects accurately and in real time.
In summary, the present invention comprehensively and effectively fuses the features of different network layers, solves the small-target challenge in the tracking process, and thus achieves a good tracking effect.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the overall flow of the present invention;
Fig. 2 is a schematic structural diagram of the multi-scale feature fusion module of the branch to be searched in an embodiment of the present invention;
Fig. 3 is a comparative schematic diagram of an embodiment of the present invention;
where Fig. 3(a) is the visualized feature map obtained with the present invention, and Fig. 3(b) is the visualized feature map obtained with an existing twin network.
Detailed Description of the Embodiments
The technical solution of the present invention is described in detail below, but the scope of protection of the present invention is not limited to the embodiments.
In practical applications of target tracking, targets captured by the camera sometimes need to be tracked from medium or high altitude; how to track a target continuously and accurately in such long-range scenes has always been a difficult research problem in the tracking field.
Based on the optimized twin network, the present invention performs feature fusion with a top-down multi-scale fusion method and solves the difficulty of tracking small objects in the prior art. As shown in Fig. 1, the twin network small target tracking method based on multi-scale feature fusion of the present invention comprises the following steps:
Step (1): the template image x and the image y to be searched are each resized and pre-processed with data augmentation in turn, to obtain corresponding cropped training sample pairs of fixed size, which are fed into the template branch and the search branch of the twin network structure respectively;
During tracking, let the size of the target box in the first frame be (x_min, y_min, w, h); the size of the template image x is then computed from the first-frame target box, i.e. a square region centred on the target to be tracked is cropped out, according to the following formula:
s(w+2p) × s(h+2p) = A
where s is the resize scale factor and A is set to 127*127; the target box is enlarged by the above operation and then resized to 127*127 to obtain the template image x;
During training, the specific method of resizing the image y to be searched is:
the centre of the target box predicted in the previous frame is taken as the cropping centre; the side length of the search box is then determined proportionally from the side length of the square region cropped for the template image x; finally, the crop is resized to 255*255;
Step (2): the template branch and the search branch share a feature extractor, i.e. a multi-scale feature fusion module is used to obtain multi-scale fused feature vectors; this comprises two stages: bottom-up feature extraction and top-down lateral feature fusion;
As shown in Fig. 2, an optimized twin network structure is built for bottom-up feature extraction; it is configured as follows:
①. The first layer is a convolutional layer: the image is convolved with 11*11*96 kernels at a stride of 2, followed by a 3*3 max-pooling operation and batch normalization; the output is C1;
②. The second layer is a convolutional layer: 5*5*256 kernels with a stride of 1 are applied as two groups on two GPUs, followed by a 3*3 max-pooling operation and batch normalization to extract feature information; the output is C2;
③. The third layer is a convolutional layer: a grouped convolution with 3*3*192 kernels is applied, batch normalization is applied again, and the output is C3;
④. The fourth layer is a convolutional layer: a grouped convolution with 3*3*192 kernels is applied, batch normalization is applied again, and the output is C4;
⑤. The fifth layer is a convolutional layer using only a 3*3*128 convolution; its final output is the 256-dimensional high-level semantic feature C5.
The specific method of top-down lateral feature fusion is as follows:
(A) Interpolation is used: starting from the pixels of the fifth-layer feature map, 2× upsampling (nearest-neighbour upsampling) inserts new elements between the pixels so that the map reaches the feature size of the fourth layer, enlarging the high-level feature for the next fusion step; the feature sizes of the fourth, third and second layers are then enlarged in turn in the same way;
(B) A 1*1 convolution is applied to the C5 layer to obtain the low-resolution feature P5; a 1×1 convolution kernel is then used to change the number of channels of the fourth-layer feature map C4 produced in the bottom-up stage, fixing its channels uniformly to 256-d to facilitate the subsequent feature fusion; next, the processed fourth-layer result is added to the upsampled fifth-layer result, and a 3*3 convolution kernel is applied to the fused result to suppress the aliasing that upsampling may introduce; the final result is denoted P4;
Iterating process (B) finally generates more accurate feature maps, giving the multi-scale fused feature maps of the template branch and of the branch to be searched respectively;
Step (3): the multi-scale fused feature maps corresponding to the template branch and to the branch to be searched are combined by a cross-correlation operation to obtain the response map. Concretely, the multi-scale fused features of the template branch and of the branch to be searched, of sizes 22*22*256 and 6*6*256, are used; the 6*6*256 feature is taken as a convolution kernel and convolved over the 22*22*256 feature to obtain a 17*17 response map, on which the tracked target position receives a higher score;
During training, after the response map is obtained, positive and negative samples must be determined: a position on the search image whose distance to the target is less than R counts as a positive sample, and otherwise as a negative sample;
Finally, the whole deep network is trained with a binary cross-entropy logistic loss using stochastic gradient descent, with the number of training iterations set to 50, the mini-batch size set to 8, and the learning rate decayed from 10^-2 to 10^-8 (a training sketch is given after the similarity function below);
The above similarity function is:

f(z, x) = φ(z) ⋆ φ(x) + b·1

where φ(z), the fused template feature, is the convolution kernel, which is convolved over the fused search feature φ(x), and b·1 denotes the bias value b taken at every position of the score map;
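A training-loop sketch matching these figures (50 epochs, mini-batches of 8 supplied by the data loader, binary cross-entropy logistic loss, SGD with the learning rate decayed from 10^-2 to 10^-8) is given below. The momentum value, the radius R expressed in response-map cells, and the geometric decay schedule are assumptions.

```python
import torch
import torch.nn as nn


def make_label_map(size=17, radius=2.0):
    """Positive (1) within radius R of the response-map centre, negative (0) outside."""
    ys, xs = torch.meshgrid(torch.arange(size), torch.arange(size), indexing="ij")
    c = (size - 1) / 2.0
    dist = torch.sqrt((xs - c) ** 2 + (ys - c) ** 2)
    return (dist <= radius).float()


def train(model, loader, epochs=50, lr_start=1e-2, lr_end=1e-8):
    criterion = nn.BCEWithLogitsLoss()                  # binary cross-entropy logistic loss
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_start, momentum=0.9)
    gamma = (lr_end / lr_start) ** (1.0 / max(epochs - 1, 1))
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
    labels = make_label_map()                           # same label map for every pair
    for _ in range(epochs):
        for z_img, x_img in loader:                     # template / search image pairs
            response = model(z_img, x_img).squeeze(1)   # (N, 17, 17) raw scores
            loss = criterion(response, labels.expand_as(response))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```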
Step (4): the response map is enlarged to the original image size, and the response map is then analysed to obtain the final tracking result; multiplying the position with the highest score by the total stride of the five convolutional layers of the optimized twin network structure gives the position of the current target on the image to be searched.
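A sketch of this read-out step: the displacement of the response-map peak from the map centre is scaled by the total stride and applied to the centre of the 255*255 search crop. The total stride of 8 and the displacement-from-centre convention are assumptions consistent with the backbone sketch above.

```python
def locate_target(response, total_stride=8, search_size=255):
    """Map the peak of a (1, 1, H, W) torch response map back to search-image coordinates."""
    _, _, h, w = response.shape
    flat_idx = response.view(-1).argmax().item()
    ry, rx = divmod(flat_idx, w)
    # displacement from the response-map centre, scaled back to image pixels
    dx = (rx - (w - 1) / 2.0) * total_stride
    dy = (ry - (h - 1) / 2.0) * total_stride
    cx = (search_size - 1) / 2.0 + dx
    cy = (search_size - 1) / 2.0 + dy
    return cx, cy  # estimated target centre on the image to be searched
```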
As shown in Fig. 3, the target localization obtained with the method of the present invention is accurate and the visualized features are clearer.
It can be seen from the above embodiment that the present invention treats target tracking as learning a similarity-measurement problem. The template image x and the image y to be searched are fed into the twin network structure and undergo the same transformation; the designed multi-scale feature fusion module obtains the corresponding feature vector of each branch; finally, the template feature map is used as a convolution kernel to perform a cross-correlation operation on the search feature, generating a response map that compares the similarity between the two. A position with high similarity returns a high score, i.e. the target position; otherwise a low score is returned.
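Tying the pieces together, a minimal end-to-end module built from the Backbone, TopDownFusion and batched_cross_correlation sketches above might look as follows; which fused pyramid level feeds the correlation is an assumption (the 6*6 / 22*22 sizes quoted in step (3) correspond to the C5 resolution, i.e. level index 3 here).

```python
import torch.nn as nn


class TwinTracker(nn.Module):
    """Shared backbone + top-down fusion for both branches, followed by cross-correlation."""

    def __init__(self, level=3):
        super().__init__()
        self.backbone = Backbone()        # sketch defined above
        self.fusion = TopDownFusion()     # sketch defined above
        self.level = level                # index into (p2, p3, p4, p5); 3 = C5 resolution

    def embed(self, img):
        c2, c3, c4, c5 = self.backbone(img)
        pyramid = self.fusion(c2, c3, c4, c5)
        return pyramid[self.level]

    def forward(self, template_img, search_img):
        z = self.embed(template_img)      # e.g. (N, 256, 6, 6)
        x = self.embed(search_img)        # e.g. (N, 256, 22, 22)
        return batched_cross_correlation(z, x)   # (N, 1, 17, 17) response map
```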
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110111717.3A CN112785626A (en) | 2021-01-27 | 2021-01-27 | Twin network small target tracking method based on multi-scale feature fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110111717.3A CN112785626A (en) | 2021-01-27 | 2021-01-27 | Twin network small target tracking method based on multi-scale feature fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112785626A true CN112785626A (en) | 2021-05-11 |
Family
ID=75758302
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110111717.3A Pending CN112785626A (en) | 2021-01-27 | 2021-01-27 | Twin network small target tracking method based on multi-scale feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112785626A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109191491A (en) * | 2018-08-03 | 2019-01-11 | 华中科技大学 | The method for tracking target and system of the twin network of full convolution based on multilayer feature fusion |
CN111179307A (en) * | 2019-12-16 | 2020-05-19 | 浙江工业大学 | Visual target tracking method for full-volume integral and regression twin network structure |
CN111291679A (en) * | 2020-02-06 | 2020-06-16 | 厦门大学 | A Siamese Network-Based Target Tracking Method for Target-Specific Response Attention |
CN111489361A (en) * | 2020-03-30 | 2020-08-04 | 中南大学 | Real-time visual object tracking method based on deep feature aggregation of Siamese network |
CN111681259A (en) * | 2020-05-17 | 2020-09-18 | 天津理工大学 | Vehicle tracking model establishment method based on detection network without anchor mechanism |
CN111898504A (en) * | 2020-07-20 | 2020-11-06 | 南京邮电大学 | A Target Tracking Method and System Based on Siamese Recurrent Neural Network |
CN112184752A (en) * | 2020-09-08 | 2021-01-05 | 北京工业大学 | Video target tracking method based on pyramid convolution |
Non-Patent Citations (4)
Title |
---|
Cui Zhoujuan et al.: "Lightweight Siamese attention network target tracking for UAVs", Acta Optica Sinica *
Yang Zhe et al.: "Target tracking algorithm based on a Siamese network fusing multiple templates", Computer Engineering and Applications *
Wu Yuwei: "Fundamentals and Applications of Deep Learning", 30 April 2020, Beijing: Beijing Institute of Technology Press *
Dong Hongyi: "Deep Learning: Object Detection in Practice with PyTorch", 31 January 2020, Beijing: China Machine Press *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113223053A (en) * | 2021-05-27 | 2021-08-06 | 广东技术师范大学 | Anchor-free target tracking method based on fusion of twin network and multilayer characteristics |
CN113627488A (en) * | 2021-07-13 | 2021-11-09 | 武汉大学 | Twin network online update-based single target tracking method and device |
CN113627488B (en) * | 2021-07-13 | 2023-07-21 | 武汉大学 | Single target tracking method and device based on twin network online update |
CN113808166A (en) * | 2021-09-15 | 2021-12-17 | 西安电子科技大学 | Single-target tracking method based on clustering difference and depth twin convolutional neural network |
CN113808166B (en) * | 2021-09-15 | 2023-04-18 | 西安电子科技大学 | Single-target tracking method based on clustering difference and depth twin convolutional neural network |
CN114372999A (en) * | 2021-12-20 | 2022-04-19 | 浙江大华技术股份有限公司 | Object detection method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110738207B (en) | Character detection method for fusing character area edge information in character image | |
CN112785626A (en) | Twin network small target tracking method based on multi-scale feature fusion | |
CN113744311A (en) | Twin neural network moving target tracking method based on full-connection attention module | |
CN114170410B (en) | Point cloud part classification method based on PointNet graph convolution and KNN search | |
CN110738673A (en) | Visual SLAM method based on example segmentation | |
CN111161317A (en) | Single-target tracking method based on multiple networks | |
CN113076871A (en) | Fish shoal automatic detection method based on target shielding compensation | |
CN109767456A (en) | A target tracking method based on SiameseFC framework and PFP neural network | |
CN115984969A (en) | Lightweight pedestrian tracking method in complex scene | |
Lu et al. | Cross stage partial connections based weighted bi-directional feature pyramid and enhanced spatial transformation network for robust object detection | |
CN111523447A (en) | Vehicle tracking method, device, electronic equipment and storage medium | |
CN110310305B (en) | A target tracking method and device based on BSSD detection and Kalman filtering | |
CN109242019B (en) | Rapid detection and tracking method for optical small target on water surface | |
CN116935486A (en) | Sign language identification method and system based on skeleton node and image mode fusion | |
CN109740552A (en) | A Target Tracking Method Based on Parallel Feature Pyramid Neural Network | |
CN113112450B (en) | A method for small target detection in remote sensing images guided by image pyramid | |
CN112990066B (en) | Remote sensing image solid waste identification method and system based on multi-strategy enhancement | |
CN111242003A (en) | Video salient object detection method based on multi-scale constrained self-attention mechanism | |
CN115713546A (en) | Lightweight target tracking algorithm for mobile terminal equipment | |
CN116468895A (en) | Similarity matrix guided few-sample semantic segmentation method and system | |
CN103093211B (en) | Based on the human body motion tracking method of deep nuclear information image feature | |
CN109034237A (en) | Winding detection method based on convolutional Neural metanetwork road sign and sequence search | |
CN110634160B (en) | 3D Keypoint Extraction Model Construction and Pose Recognition Method of Target in 2D Graphics | |
CN111444913A (en) | License plate real-time detection method based on edge-guided sparse attention mechanism | |
CN112801020B (en) | Pedestrian re-identification method and system based on background grayscale |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210511 |