CN114708306A - Single-target tracking method and device and storage medium - Google Patents

Single-target tracking method and device and storage medium

Info

Publication number
CN114708306A
CN114708306A (application CN202210240068.1A)
Authority
CN
China
Prior art keywords
target
frame
prediction
search
area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210240068.1A
Other languages
Chinese (zh)
Other versions
CN114708306B (en)
Inventor
范保杰
郭小宾
蒋国平
徐丰羽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202210240068.1A priority Critical patent/CN114708306B/en
Priority claimed from CN202210240068.1A external-priority patent/CN114708306B/en
Publication of CN114708306A publication Critical patent/CN114708306A/en
Application granted
Publication of CN114708306B publication Critical patent/CN114708306B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/20: Analysis of motion
    • G06T7/269: Analysis of motion using gradient-based methods
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/10: Image acquisition modality
    • G06T2207/10016: Video; Image sequence
    • G06T2207/20: Special algorithmic details
    • G06T2207/20084: Artificial neural networks [ANN]
    • G06T2207/20112: Image segmentation details
    • G06T2207/20132: Image cropping

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a single-target tracking method, device, and storage medium. The method extracts features of a target region and a search region with a transformer backbone network, and fuses the target-region and search-region features with a transformer-based encoder-decoder architecture to serve the subsequent prediction task; while the target-region and search-region feature encodings are fed into the transformer decoder, M (e.g., 100) target pre-selection boxes based on the target position of the previous frame are added. An IoU prediction module performs N iterations of optimization on the target prediction boxes decoded by the transformer decoder to obtain optimized prediction boxes, then computes the IoU between each optimized prediction box and the annotation box, and takes the average of the three optimized prediction boxes with the highest IoU as the final prediction result. Accuracy is improved while speed is preserved.

Description

Single-target tracking method and device and storage medium
Technical Field
The invention belongs to the technical field of target tracking and relates to a single-target tracking method, device, and storage medium, in particular to a transformer-based single-target tracking method, device, and storage medium using IoU prediction.
Background
Single-target tracking is a very challenging task in computer vision. Given the target information in the first frame, the tracker must automatically determine the state of the target in subsequent frames. Unlike the target detection task, the target's features become available only at inference time, meaning there is no prior information about the target, such as its class or surrounding environment. Thanks to the rapid development of deep learning in recent years, the field of target tracking has produced many research results.
The prior art has the following defect: when facing complex conditions (e.g., blur, scale change, color shift, and fast motion), most existing trackers handle these scenes poorly.
Disclosure of Invention
The purpose is as follows: in order to overcome the defects in the prior art, the invention provides a single-target tracking method, device, and storage medium. IoU prediction is performed based on the transformer, making full use of the transformer's strong ability to fuse local and global information, while avoiding the intricate hyperparameter design and post-processing of traditional trackers and extracting target features more efficiently.
The technical scheme is as follows: in order to solve the technical problems, the technical scheme adopted by the invention is as follows:
In a first aspect, a single-target tracking method is provided, comprising:
acquiring video data;
determining a target region in the first frame of the video data, and performing feature extraction through a transformer backbone network and encoding through a transformer encoder on the target region to obtain the target-region feature encoding;
generating M target prior boxes with a Gaussian function according to the target region of the first frame, where M is an integer greater than 3;
determining a search region according to the target region of the first frame;
performing transformer backbone-network feature extraction and transformer encoder encoding on the search region to obtain the search-region feature encoding;
inputting the target-region feature encoding, the search-region feature encoding, and the M target prior boxes into a transformer decoder for decoding, obtaining the intersection-over-union (IoU) values between the M output target prediction boxes and the annotation box;
optimizing the network parameters with the IoU loss, and performing N iterations of optimization on the target prediction boxes by gradient descent to obtain optimized prediction boxes;
and selecting the average of the optimized prediction boxes with the top-three IoU values as the final target tracking box.
In some embodiments, determining a target region in the first frame of the video data comprises:
labeling the target in the first frame with a rectangular box, where the target region comprises a target position and a target size determined by the position and size of the rectangular box, respectively.
In some embodiments, determining a search region according to the target region of the first frame of the video data comprises:
crop_sz = √(w × h) × s
where crop_sz denotes the size of the search region cropped from the current frame, w denotes the width of the target box in the previous frame, h denotes its height, and s denotes a search factor.
In some embodiments, as shown in fig. 2 and fig. 3, inputting the target-region feature encoding, the search-region feature encoding, and the M target prior boxes into a transformer decoder for decoding, and obtaining the IoU values between the M output target prediction boxes and the annotation box, comprises: concatenating the target-region feature encoding and the search-region feature encoding along the channel dimension to obtain a feature map; obtaining M target prediction boxes from the feature map and the M target prior boxes; and then obtaining the IoU values between the M target prediction boxes and the annotation box through a feed-forward network (FFN).
In some embodiments, optimizing the network parameters with the IoU loss and performing N iterations of optimization on the target prediction boxes by gradient descent uses:
L_IoU = 1 - |A ∩ B| / |A ∪ B|
where A is a prediction box and B is the annotation box.
In some embodiments, M is 100.
In a second aspect, the present invention provides a single-target tracking device, comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to the first aspect.
In a third aspect, the present invention provides a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of the first aspect.
Beneficial effects: compared with other transformer-based target tracking algorithms, the single-target tracking method and system provided by the invention jointly feed the template and the search image into a transformer decoder while adding 100 target prior boxes based on the target position of the previous frame. The insight is that in a tracking task the target's state changes little between consecutive frames, so the 100 target prior boxes give the network a positional prior, improving accuracy while preserving speed. Tracking performance is ensured by effectively using the template-frame information and giving the model an appropriate prior. In addition, the method avoids the careful anchor design required by traditional anchor-based methods and eliminates the complex post-processing steps of most target tracking algorithms, improving tracking accuracy while remaining real-time.
Drawings
FIG. 1 is a flow chart of a single target tracking method according to an embodiment of the invention;
fig. 2 and 3 are schematic diagrams of a network of a single-target tracking system according to an embodiment of the invention.
Detailed Description
The invention is further described below with reference to the figures and examples. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
In the description of the present invention, "several" means one or more and "a plurality" means two or more; "exceeding", "more than", "less than", and the like are understood to exclude the stated number, while "above", "below", "within", and the like are understood to include it. Where "first" and "second" are used, they merely distinguish technical features and are not to be understood as indicating or implying relative importance, the number of the indicated technical features, or their precedence.
In the description of the present invention, reference to the description of "one embodiment", "some embodiments", "illustrative embodiments", "examples", "specific examples", or "some examples", etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Example 1
A single-target tracking method, comprising:
acquiring video data;
determining a target region in the first frame of the video data, and performing feature extraction through a transformer backbone network and encoding through a transformer encoder on the target region to obtain the target-region feature encoding;
generating M target prior boxes with a Gaussian function according to the target region of the first frame, where M is an integer greater than 3;
determining a search region according to the target region of the first frame;
performing transformer backbone-network feature extraction and transformer encoder encoding on the search region to obtain the search-region feature encoding;
inputting the target-region feature encoding, the search-region feature encoding, and the M target prior boxes into a transformer decoder for decoding, obtaining the intersection-over-union (IoU) values between the M output target prediction boxes and the annotation box;
optimizing the network parameters with the IoU loss, and performing N iterations of optimization on the target prediction boxes by gradient descent to obtain optimized prediction boxes;
and computing the IoU of the optimized prediction boxes by a network forward pass, then selecting the average of the top-three boxes by IoU value as the final target tracking box.
In some embodiments, determining a target region in the first frame of the video data comprises:
labeling the target in the first frame with a rectangular box, where the target region comprises a target position and a target size determined by the position and size of the rectangular box, respectively.
While passing through the backbone network, the initial frame image also passes through a Position Encoding module, which records the relative or absolute position of each image pixel in the sequence. The position encoding is computed as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
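For illustration, a minimal PyTorch sketch of this sinusoidal position encoding is given below (the function name and the even-d_model assumption are ours, not from the patent):

```python
import math
import torch

def sinusoidal_position_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Build the PE table described above; d_model is assumed even."""
    pe = torch.zeros(seq_len, d_model)
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))               # 10000^(-2i/d)
    pe[:, 0::2] = torch.sin(pos * div)   # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(pos * div)   # PE(pos, 2i+1)
    return pe
```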
the transform encoder structure specifically comprises six identical encoders. Each encoder consists of Multi-Head Self-attachment, a Norm layer and a Feed-Forward Network, and each structural formula is expressed as follows: the Self-orientation formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Multi-Head Self-Attention:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W
where head_i is the output of the i-th Self-Attention head, h is the number of heads, and W is a weight matrix.
Feed-Forward Network:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
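A compact sketch of one encoder layer assembled from these formulas (the post-norm residual arrangement, the layer sizes, and adding the position encoding to queries and keys are assumptions; the patent only fixes the number of layers at six):

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: Multi-Head Self-Attention -> Norm -> FFN -> Norm."""
    def __init__(self, d_model: int = 256, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        # FFN(x) = max(0, x W1 + b1) W2 + b2
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        q = k = x + pos                           # position encoding on queries/keys
        x = self.norm1(x + self.attn(q, k, x)[0]) # residual + Norm
        return self.norm2(x + self.ffn(x))        # residual + Norm
```

Six such layers stacked would form the encoder described above.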
In some embodiments, generating M target prior boxes with a Gaussian function according to the target region of the first frame of the video data comprises using the following Gaussian function:
f(x) = (1 / (σ√(2π))) · exp(-(x - μ)² / (2σ²))
where μ = 0 and σ takes the values 0.05 and 0.5, respectively. In some embodiments, M is 100: 100 target prior boxes are generated around the target box of the previous frame.
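As an illustrative sketch, the prior boxes could be sampled as follows; which σ perturbs the center and which the size is our assumption, since the patent only gives μ = 0 and σ = 0.05 / 0.5:

```python
import torch

def sample_prior_boxes(prev_box: torch.Tensor, m: int = 100,
                       sigma_center: float = 0.05,
                       sigma_scale: float = 0.5) -> torch.Tensor:
    """Jitter the previous frame's box (cx, cy, w, h) with Gaussian noise (mu = 0).

    Assumption: sigma = 0.05 perturbs the center (relative to box size)
    and sigma = 0.5 perturbs the size on a log scale.
    """
    cx, cy, w, h = prev_box
    noise_c = torch.randn(m, 2) * sigma_center   # relative center offsets
    noise_s = torch.randn(m, 2) * sigma_scale    # log-scale size jitter
    centers = torch.stack([cx + noise_c[:, 0] * w,
                           cy + noise_c[:, 1] * h], dim=1)
    sizes = torch.stack([w * torch.exp(noise_s[:, 0]),
                         h * torch.exp(noise_s[:, 1])], dim=1)
    return torch.cat([centers, sizes], dim=1)    # (m, 4) prior boxes

# e.g. priors = sample_prior_boxes(torch.tensor([160., 160., 64., 48.]))
```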
In some embodiments, determining a search region according to the target region of the first frame of the video data comprises:
crop_sz = √(w × h) × s
where crop_sz denotes the size of the search region cropped from the current frame, w denotes the width of the target box in the previous frame, h denotes its height, and s denotes a search factor.
Further, in some embodiments, after the search region is determined, the cropped picture is resized to a fixed 320 × 320 with a PyTorch function and then fed into the transformer backbone network for feature extraction.
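A minimal sketch of this cropping-and-resizing step, assuming torchvision for the image operations and an illustrative search factor s (the patent does not fix s):

```python
import math
import torch
import torchvision.transforms.functional as TF

def crop_search_region(frame: torch.Tensor, prev_box, s: float = 4.0) -> torch.Tensor:
    """Crop a square region of side crop_sz = sqrt(w * h) * s centered on the
    previous target position, then resize to a fixed 320 x 320 (frame is CxHxW)."""
    cx, cy, w, h = prev_box
    crop_sz = int(math.ceil(math.sqrt(w * h) * s))
    top, left = int(cy - crop_sz / 2), int(cx - crop_sz / 2)
    patch = TF.crop(frame, top, left, crop_sz, crop_sz)  # zero-pads out-of-bounds area
    return TF.resize(patch, [320, 320])
```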
In some embodiments, inputting the target-region feature encoding, the search-region feature encoding, and the M target prior boxes into a transformer decoder for decoding, and obtaining the IoU values between the M output target prediction boxes and the annotation box, comprises: concatenating the target-region feature encoding and the search-region feature encoding along the channel dimension to obtain a feature map; obtaining M target prediction boxes from the feature map and the M target prior boxes; and then obtaining the IoU values between the M target prediction boxes and the annotation box through a feed-forward network (FFN).
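The following sketch shows the shape of such a fusion-and-prediction head; all module sizes, the box-embedding layer, and equal template/search sequence lengths are assumptions, not the patent's exact architecture:

```python
import torch
import torch.nn as nn

class IoUPredictionHead(nn.Module):
    """Concatenate template/search encodings on the channel dimension, decode the
    M prior boxes as queries, and regress one IoU value per predicted box."""
    def __init__(self, d_model: int = 256, n_heads: int = 8, n_layers: int = 6):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, n_heads)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.box_embed = nn.Linear(4, d_model)        # embed (cx, cy, w, h) priors
        self.fuse = nn.Linear(2 * d_model, d_model)   # channel-wise concat -> d_model
        self.iou_ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                     nn.Linear(d_model, 1))

    def forward(self, feat_t, feat_s, priors):
        # feat_t, feat_s: (L, B, d_model), assumed same length L; priors: (M, B, 4)
        memory = self.fuse(torch.cat([feat_t, feat_s], dim=-1))
        queries = self.box_embed(priors)              # (M, B, d_model)
        decoded = self.decoder(queries, memory)
        return self.iou_ffn(decoded).squeeze(-1)      # (M, B) predicted IoU values
```

Because the predicted IoU is a differentiable function of the input box coordinates, this head also supports the gradient-based box refinement described below.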
In some embodiments, optimizing the network parameters with the IoU loss and performing N iterations of optimization on the target prediction boxes by gradient descent uses:
L_IoU = 1 - |A ∩ B| / |A ∪ B|
where A is a prediction box and B is the annotation box.
The target prediction boxes are iteratively optimized N times using gradient descent; in some embodiments, N is 10. The specific process is as follows:
Step 1: the network parameters θ_0 are optimized with the IoU loss, and the step size is set to α = 1;
Step 2: the M target prediction boxes A = {A_1, ..., A_M} are passed through the feed-forward network to obtain the IoU loss with respect to the annotation box, denoted L_IoU;
Step 3: the gradient ∂L_IoU / ∂A is computed;
Step 4: the M target prediction boxes are updated: A ← A - α · ∂L_IoU / ∂A;
Step 5: after N iterations, the optimized boxes A^N = {A_1^N, ..., A_M^N} are obtained.
This optimization process only iteratively refines the target prediction boxes; it does not update the network parameters.
Finally, the prediction boxes obtained after the N optimization steps are scored by a forward IoU computation, and the best boxes are selected and averaged as the final prediction box A_final used to track the target.
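An illustrative sketch of this refinement loop for a single frame (batch size 1), reusing the hypothetical IoUPredictionHead above; N = 10 and α = 1 follow the text, everything else is an assumption:

```python
import torch

def refine_boxes(iou_head, feat_t, feat_s, boxes: torch.Tensor,
                 n_steps: int = 10, alpha: float = 1.0) -> torch.Tensor:
    """Iteratively refine M boxes (shape (M, 1, 4)) by gradient descent on
    L_IoU = 1 - predicted IoU; the network weights are never updated."""
    boxes = boxes.clone().requires_grad_(True)
    for _ in range(n_steps):
        loss = (1.0 - iou_head(feat_t, feat_s, boxes)).sum()  # sum of L_IoU over boxes
        grad, = torch.autograd.grad(loss, boxes)
        boxes = (boxes - alpha * grad).detach().requires_grad_(True)
    with torch.no_grad():                                     # final forward IoU pass
        iou = iou_head(feat_t, feat_s, boxes).squeeze(-1)     # (M,)
        top3 = iou.topk(3).indices
    return boxes.detach().squeeze(1)[top3].mean(dim=0)        # averaged A_final
```

Only the box coordinates receive gradients here; the IoU head stays frozen, matching the note above that the network parameters are not updated.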
Example 2
In a second aspect, the present embodiment provides a single-target tracking apparatus, comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to embodiment 1.
Example 3
In a third aspect, the present embodiment provides a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of embodiment 1.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (8)

1. A single-target tracking method, comprising:
acquiring video data;
determining a target region in the first frame of the video data, and performing feature extraction through a transformer backbone network and encoding through a transformer encoder on the target region to obtain the target-region feature encoding;
generating M target prior boxes with a Gaussian function according to the target region of the first frame, wherein M is an integer greater than 3;
determining a search region according to the target region of the first frame;
performing transformer backbone-network feature extraction and transformer encoder encoding on the search region to obtain the search-region feature encoding;
inputting the target-region feature encoding, the search-region feature encoding, and the M target prior boxes into a transformer decoder for decoding, obtaining the intersection-over-union (IoU) values between the M output target prediction boxes and the annotation box;
optimizing the network parameters with the IoU loss, and performing N iterations of optimization on the target prediction boxes by gradient descent to obtain optimized prediction boxes;
and selecting the average of the optimized prediction boxes with the top-three IoU values as the final target tracking box.
2. The single-target tracking method according to claim 1, wherein determining a target region in the first frame of the video data comprises:
labeling the target in the first frame with a rectangular box, wherein the target region comprises a target position and a target size determined by the position and size of the rectangular box, respectively.
3. The single-target tracking method according to claim 1 or 2, wherein determining the search region according to the target region of the first frame of the video data comprises:
crop_sz = √(w × h) × s
wherein crop_sz denotes the size of the search region cropped from the current frame, w denotes the width of the target box in the previous frame, h denotes its height, and s denotes a search factor.
4. The single-target tracking method according to claim 1, wherein inputting the target-region feature encoding, the search-region feature encoding, and the M target prior boxes into a transformer decoder for decoding, and obtaining the IoU values between the M output target prediction boxes and the annotation box, comprises: concatenating the target-region feature encoding and the search-region feature encoding along the channel dimension to obtain a feature map; obtaining M target prediction boxes from the feature map and the M target prior boxes; and then obtaining the IoU values between the M target prediction boxes and the annotation box through a feed-forward network.
5. The single-target tracking method according to claim 1, wherein optimizing the network parameters with the IoU loss and performing N iterations of optimization on the target prediction boxes by gradient descent uses:
L_IoU = 1 - |A ∩ B| / |A ∪ B|
wherein A is a prediction box and B is the annotation box.
6. The method of claim 5, wherein M is 100.
7. A single-target tracking device, comprising a processor and a storage medium;
the storage medium is configured to store instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to any one of claims 1 to 6.
8. A storage medium having a computer program stored thereon, the computer program, when being executed by a processor, implementing the steps of the method of any one of claims 1 to 6.
CN202210240068.1A 2022-03-10 Single-target tracking method, single-target tracking device and storage medium Active CN114708306B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210240068.1A CN114708306B (en) 2022-03-10 Single-target tracking method, single-target tracking device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210240068.1A CN114708306B (en) 2022-03-10 Single-target tracking method, single-target tracking device and storage medium

Publications (2)

Publication Number Publication Date
CN114708306A (en) 2022-07-05
CN114708306B (en) 2024-10-29


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111192291A (en) * 2019-12-06 2020-05-22 东南大学 Target tracking method based on cascade regression and twin network
US20200327680A1 (en) * 2019-04-12 2020-10-15 Beijing Moviebook Science and Technology Co., Ltd. Visual target tracking method and apparatus based on deep adversarial training
WO2020228353A1 (en) * 2019-05-13 2020-11-19 深圳先进技术研究院 Motion acceleration-based image search method, system, and electronic device
CN112541944A (en) * 2020-12-10 2021-03-23 山东师范大学 Probability twin target tracking method and system based on conditional variational encoder
CN113256678A (en) * 2021-04-26 2021-08-13 中国人民解放军32802部队 Target tracking method based on self-attention transformation network


Similar Documents

Publication Publication Date Title
KR102279350B1 (en) Method and device for generating image data set to be used for learning cnn capable of detecting obstruction in autonomous driving circumstance, and testing method, and testing device using the same
Jiang et al. Matching by linear programming and successive convexification
US9947077B2 (en) Video object tracking in traffic monitoring
US11184558B1 (en) System for automatic video reframing
CN112270710B (en) Pose determining method, pose determining device, storage medium and electronic equipment
CN111814753A (en) Target detection method and device under foggy weather condition
CN110443279B (en) Unmanned aerial vehicle image vehicle detection method based on lightweight neural network
JP2020038666A (en) Method for generating data set for learning for detection of obstacle in autonomous driving circumstances and computing device, learning method, and learning device using the same
JP2020038668A (en) Method for generating image data set for cnn learning for detecting obstacle in autonomous driving circumstances and computing device
WO2021164515A1 (en) Detection method and apparatus for tampered image
Dong et al. Temporal feature augmented network for video instance segmentation
CN116740413A (en) Deep sea biological target detection method based on improved YOLOv5
CN117132914A (en) Method and system for identifying large model of universal power equipment
CN116052108A (en) Transformer-based traffic scene small sample target detection method and device
CN112070181B (en) Image stream-based cooperative detection method and device and storage medium
CN117788544A (en) Image depth estimation method based on lightweight attention mechanism
CN114708306B (en) Single-target tracking method, single-target tracking device and storage medium
CN114708306A (en) Single-target tracking method and device and storage medium
Xiong et al. Distortion map-guided feature rectification for efficient video semantic segmentation
EP4235492A1 (en) A computer-implemented method, data processing apparatus and computer program for object detection
Cho et al. Deep photo-geometric loss for relative camera pose estimation
JP2024068729A (en) Learning device, parameter adjustment method and recording medium
Shi et al. Small object detection algorithm incorporating swin transformer for tea buds
CN113987270A (en) Method, device, terminal and storage medium for determining similar video clips
Zhou et al. LEDet: localization estimation detector with data augmentation for ship detection based on unmanned surface vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant