CN117649582B

CN117649582B - Single-stream single-stage network target tracking method and system based on cascaded attention

Info

Publication number: CN117649582B
Application number: CN202410106560.9A
Authority: CN
Inventors: 王员云; 司英振
Original assignee: Nanchang Institute of Technology
Current assignee: Nanchang Institute of Technology
Priority date: 2024-01-25
Filing date: 2024-01-25
Publication date: 2024-04-19
Anticipated expiration: 2044-01-25
Also published as: CN117649582A

Abstract

The present invention proposes a single-stream single-stage network target tracking method and system based on cascaded attention, the method comprising: firstly forming a single-stream single-stage overall model, inputting a template image and a search image into the single-stream single-stage overall model, performing feature extraction to obtain local feature information, and using cascaded attention to aggregate local semantic information to achieve feature enhancement, and then performing cross-attention calculation to achieve communication, obtaining a result feature map, and repeatedly extracting the result feature map several times in an iterative manner to obtain a final result feature map to predict the target location and target state, and achieve target tracking according to the target location, and at the same time determine whether to use the predicted target state as an online template in the next stage of online tracking process according to the target state. The present invention can reduce the computational redundancy of multi-head attention while coping with changes in the appearance of objects in complex scenes, thereby improving the performance of target tracking.

Description

Single-stream single-stage network target tracking method and system based on cascaded attention

技术领域Technical Field

本发明涉及计算机视觉与图像处理技术领域，特别涉及一种基于级联注意力的单流单阶段网络目标跟踪方法与系统。The present invention relates to the technical field of computer vision and image processing, and in particular to a single-stream single-stage network target tracking method and system based on cascaded attention.

背景技术Background technique

在计算机视觉与图像处理领域中，视觉跟踪是计算机视觉中的一项基础研究任务，其重点是仅使用其初始外观作为参考，精确定位每个视频帧中的任意目标。它应用于各个领域，包括视觉定位、自动驾驶系统和智能城市技术。然而，由于现实世界场景中存在许多具有挑战性的因素，如部分遮挡、物体离开视野、背景杂乱、视点变化和比例变化，设计一个稳健的跟踪器仍然是一个重大挑战。Visual tracking is a fundamental research task in computer vision, which focuses on accurately locating an arbitrary target in each video frame using only its initial appearance as a reference. It is applied in various fields, including visual localization, autonomous driving systems, and smart city technologies. However, designing a robust tracker remains a major challenge due to many challenging factors in real-world scenes, such as partial occlusions, objects leaving the field of view, background clutter, viewpoint changes, and scale changes.

目前，跟踪模型通常采用双流双阶段的模型架构。在这种方法中，分别提取来自模板和搜索区域的特征。然而，这种方法有一定的缺点，主要归因于传统注意力机制的高计算复杂度。此外，在提取具有全局上下文的特征时，往往会忽略局部特征信息。最近，单流架构已经成为一种可行的替代方案。这些架构带来了更快的处理和增强的特征融合能力，在跟踪性能方面取得了显著成功。其有效性背后的原因在于模型架构能够在早期阶段在模板和搜索区域之间建立不受阻碍的信息流，特别是从原始图像对。这有助于提取目标特定特征，并防止判别信息的丢失。Currently, tracking models usually adopt a two-stream two-stage model architecture. In this approach, features from the template and the search region are extracted separately. However, this approach has certain disadvantages, mainly attributed to the high computational complexity of the traditional attention mechanism. In addition, local feature information is often ignored when extracting features with global context. Recently, single-stream architectures have emerged as a viable alternative. These architectures bring faster processing and enhanced feature fusion capabilities, achieving significant success in tracking performance. The reason behind their effectiveness lies in the model architecture's ability to establish an unhindered information flow between the template and the search region at an early stage, especially from the original image pair. This helps in extracting target-specific features and prevents the loss of discriminative information.

Transformer首次提出了一种用于自然语言处理的基于自注意机制的编码器-解码器模块。它通过计算三元组的注意力权重来探索序列中的长程依赖关系。基于出色的特征融合能力，Transformer结构已成功应用于视觉跟踪，并取得了令人鼓舞的效果。在基于Transformer的跟踪器中，全局上下文信息得到了充分的探索，然而，局部信息没有得到充分利用，为了改进注意力机制，提出了一个新的注意力模块，称为级联注意力。其核心思想是增强输入注意力头部的特征的多样性。Transformer first proposed an encoder-decoder module based on self-attention mechanism for natural language processing. It explores long-range dependencies in sequences by calculating the attention weights of triplets. Based on its excellent feature fusion ability, the Transformer structure has been successfully applied to visual tracking and achieved encouraging results. In the Transformer-based tracker, the global context information is fully explored, however, the local information is not fully utilized. In order to improve the attention mechanism, a new attention module called cascade attention is proposed. The core idea is to enhance the diversity of features of the input attention head.

发明内容Summary of the invention

鉴于上述状况，本发明的主要目的是为了提出一种基于级联注意力的单流单阶段网络目标跟踪方法与系统，以解决上述技术问题。In view of the above situation, the main purpose of the present invention is to propose a single-stream single-stage network target tracking method and system based on cascaded attention to solve the above technical problems.

本发明提出了一种基于级联注意力的单流单阶段网络目标跟踪方法，所述方法包括如下步骤：The present invention proposes a single-stream single-stage network target tracking method based on cascaded attention, the method comprising the following steps:

步骤1、在单流单阶段框架下，基于Transformer网络模型以及特征增强模块，构建得到主干特征提取与融合模块，主干特征提取与融合模块、头部角点模块、和分数头部预测模块构成单流单阶段整体模型；Step 1: In a single-stream single-stage framework, based on the Transformer network model and the feature enhancement module, a backbone feature extraction and fusion module is constructed. The backbone feature extraction and fusion module, the head corner module, and the fractional head prediction module constitute a single-stream single-stage overall model;

步骤2、获取模板图像以及搜索图片，模板图像包括包含有若干所需跟踪目标的初始模板和若干包含目标状态的在线模板；Step 2: Obtain a template image and a search image, wherein the template image includes an initial template containing a number of target objects to be tracked and a number of online templates containing target states;

步骤3、将模板图像以及搜索图片输入至单流单阶段整体模型中，通过主干特征提取与融合模块提取模板图像以及搜索图片对应的局部特征信息；Step 3: Input the template image and the search image into the single-stream single-stage overall model, and extract the local feature information corresponding to the template image and the search image through the backbone feature extraction and fusion module;

将局部特征信息输入特征增强模块中，利用级联注意力对局部语义信息进行聚合以实现特征增强，得到模板图像和搜索图片的全局上下文信息；The local feature information is input into the feature enhancement module, and the local semantic information is aggregated using cascaded attention to achieve feature enhancement, and the global context information of the template image and the search image is obtained;

对模板图像和搜索图片的全局上下文信息进行交叉注意力计算以实现通信，获得结果特征图；Perform cross-attention calculation on the global context information of the template image and the search image to achieve communication and obtain the resulting feature map;

步骤4、将结果特征图分割为模板图像和搜索图片，并作为下一阶段的输入，采用迭代的方式重复步骤3若干次，得到最终的结果特征图；Step 4: Segment the result feature map into a template image and a search image, and use them as the input for the next stage. Repeat step 3 several times in an iterative manner to obtain the final result feature map.

步骤5、将最终的结果特征图输入头部角点模块中预测每个目标位置的置信度得分，并根据置信度得分确定跟踪目标所在位置以实现目标跟踪；Step 5: Input the final result feature map into the head corner point module to predict the confidence score of each target position, and determine the position of the tracking target according to the confidence score to achieve target tracking;

并将结果特征图输入分数头部预测模块中，以预测每个目标状态的置信度得分，根据目标状态的置信度得分来确定是否将所预测的目标状态作为下一阶段在线跟踪过程中的在线模板；The resulting feature map is input into the score head prediction module to predict the confidence score of each target state, and the confidence score of the target state is used to determine whether to use the predicted target state as an online template in the next stage of online tracking.

步骤6、利用大规模数据集为基础重复步骤2至步骤4，对单流单阶段整体模型进行预训练以优化模型参数；Step 6: Repeat steps 2 to 4 based on a large-scale dataset to pre-train the single-stream single-stage overall model to optimize model parameters.

步骤7、利用训练好的单流单阶段整体模型对视频序列进行目标在线跟踪。Step 7: Use the trained single-stream single-stage overall model to track the target online in the video sequence.

本发明还提出一种基于级联注意力的单流单阶段网络目标跟踪系统，其中，所述系统应用如上所述的基于级联注意力的单流单阶段网络目标跟踪方法，所述系统包括：The present invention also proposes a single-stream single-stage network target tracking system based on cascaded attention, wherein the system applies the single-stream single-stage network target tracking method based on cascaded attention as described above, and the system comprises:

构建模块，用于：Building blocks for:

在单流单阶段框架下，基于Transformer网络模型以及特征增强模块，构建得到主干特征提取与融合模块，主干特征提取与融合模块、头部角点模块、和分数头部预测模块构成单流单阶段整体模型；In the single-stream single-stage framework, based on the Transformer network model and the feature enhancement module, a backbone feature extraction and fusion module is constructed. The backbone feature extraction and fusion module, the head corner module, and the fractional head prediction module constitute the single-stream single-stage overall model.

学习模块，用于：Learning modules for:

获取模板图像以及搜索图片，模板图像包括包含有若干所需跟踪目标的初始模板和若干包含目标状态的在线模板；Acquire a template image and a search image, wherein the template image includes an initial template containing a number of desired tracking targets and a number of online templates containing target states;

将模板图像以及搜索图片输入至单流单阶段整体模型中，通过主干特征提取与融合模块提取模板图像以及搜索图片对应的局部特征信息；The template image and the search image are input into the single-stream single-stage overall model, and the local feature information corresponding to the template image and the search image is extracted through the backbone feature extraction and fusion module;

提取模块，用于：Extraction module for:

将结果特征图分割为模板图像和搜索图片，并作为下一阶段的输入，采用迭代的方式重复特征提取若干次，得到最终的结果特征图；The resulting feature map is divided into a template image and a search image, and used as the input for the next stage. The feature extraction is repeated several times in an iterative manner to obtain the final result feature map.

计算模块，用于：Compute module for:

将最终的结果特征图输入头部角点模块中预测每个目标位置的置信度得分，并根据置信度得分确定跟踪目标所在位置以实现目标跟踪；The final result feature map is input into the head corner module to predict the confidence score of each target position, and the position of the tracking target is determined according to the confidence score to achieve target tracking;

预训练模块，用于：Pre-trained modules for:

利用大规模数据集为基础对单流单阶段整体模型进行预训练以优化模型参数；Pre-train the single-stream single-stage overall model based on large-scale datasets to optimize model parameters;

跟踪模块，用于：Tracking module for:

利用训练好的单流单阶段整体模型对视频序列进行目标在线跟踪。The trained single-stream single-stage holistic model is used to perform online object tracking in video sequences.

相较于现有技术，本发明的有益效果如下：Compared with the prior art, the beneficial effects of the present invention are as follows:

1、本发明利用级联注意力为每个头提供不同的输入分割，然后将输出特征级联到这些头上。这种方法不仅减少了多头注意力的计算冗余，而且通过增加网络深度来增强模型的容量。1. The present invention uses cascaded attention to provide different input splits to each head, and then cascades the output features to these heads. This method not only reduces the computational redundancy of multi-head attention, but also enhances the capacity of the model by increasing the network depth.

2、本发明引入了在线模板更新的分数头模块，在线根据搜索图片的预测得分来修正在线模板图片，可以在复杂场景中应对对象外观的变化，使其能够更好地处理跟踪过程中的严重遮挡、尺度变化和背景复杂等困难，有效捕捉时间信息和处理对象外观变化，进而提高目标跟踪的性能。2. The present invention introduces a score head module for online template updating, which corrects the online template image based on the predicted score of the search image. It can cope with changes in object appearance in complex scenes, enable it to better handle difficulties such as severe occlusion, scale changes and complex backgrounds in the tracking process, effectively capture temporal information and handle changes in object appearance, thereby improving target tracking performance.

本发明的附加方面与优点将在下面的描述中部分给出，部分将从下面的描述中变得明显，或通过本发明的实施例了解到。Additional aspects and advantages of the present invention will be given in part in the following description and in part will be obvious from the following description or learned through embodiments of the present invention.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1为本发明提出的一种基于级联注意力的单流单阶段网络目标跟踪方法的流程图；FIG1 is a flow chart of a single-stream single-stage network target tracking method based on cascaded attention proposed by the present invention;

图2为本发明提出的一种基于级联注意力模块的单流单阶段网络目标跟踪框架的结构图；FIG2 is a structural diagram of a single-stream single-stage network target tracking framework based on a cascaded attention module proposed by the present invention;

图3为本发明特征增强模块的原理图；FIG3 is a schematic diagram of a feature enhancement module of the present invention;

图4为本发明级联注意力的原理图；FIG4 is a schematic diagram of the cascaded attention of the present invention;

图5为本发明提出的一种基于级联注意力的单流单阶段网络目标跟踪系统的结构示意图。FIG5 is a schematic diagram of the structure of a single-stream single-stage network target tracking system based on cascaded attention proposed by the present invention.

具体实施方式Detailed ways

下面详细描述本发明的实施例，所述实施例的示例在附图中示出，其中自始至终相同或类似的标号表示相同或类似的元件或具有相同或类似功能的元件。下面通过参考附图描述的实施例是示例性的，仅用于解释本发明，而不能理解为对本发明的限制。Embodiments of the present invention are described in detail below, examples of which are shown in the accompanying drawings, wherein the same or similar reference numerals throughout represent the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain the present invention, and cannot be understood as limiting the present invention.

参照下面的描述和附图，将清楚本发明的实施例的这些和其他方面。在这些描述和附图中，具体公开了本发明的实施例中的一些特定实施方式，来表示实施本发明的实施例的原理的一些方式，但是应当理解，本发明的实施例的范围不受此限制。These and other aspects of the embodiments of the present invention will be apparent with reference to the following description and accompanying drawings. In these descriptions and accompanying drawings, some specific implementations of the embodiments of the present invention are specifically disclosed to represent some ways of implementing the principles of the embodiments of the present invention, but it should be understood that the scope of the embodiments of the present invention is not limited thereto.

请参阅图1至图2，本实施例提供了一种基于级联注意力的单流单阶段网络目标跟踪方法，所述方法包括如下步骤：Referring to FIG. 1 and FIG. 2 , this embodiment provides a single-stream single-stage network target tracking method based on cascaded attention, and the method comprises the following steps:

将局部特征信息输入特征增强模块中，利用级联注意力对局部语义信息进行聚合以实现特征增强，得到模板图像和搜索图片的全局上下文信息；The local feature information is input into the feature enhancement module, and the local semantic information is aggregated using cascaded attention to achieve feature enhancement, thus obtaining the global context information of the template image and the search image.

在所述步骤3中，将模板图像以及搜索图片输入至单流单阶段整体模型中，通过主干特征提取与融合模块提取模板图像以及搜索图片对应的局部特征信息的方法具体包括如下步骤：In step 3, the template image and the search image are input into the single-stream single-stage overall model, and the method of extracting local feature information corresponding to the template image and the search image through the backbone feature extraction and fusion module specifically includes the following steps:

对于每个模板图像，其初始形状为状，首先对其进行词嵌入操作，再经过卷积操作，得到结果，其形状为/>，接着把它们从二维拉伸为一维，即重塑为，在此令/>，其中，/>表示模板令牌的长度，/>分别表示初始模板图像的长、宽、通道数。/>分别表示经过卷积后的模板图像的长、宽、通道数；For each template image, its initial shape is First, we perform word embedding operation on it, and then perform convolution operation to get the result, which has the shape of /> , and then stretch them from two dimensions to one dimension, that is, reshape them into , hereby order/> , where /> Indicates the length of the template token, /> Respectively represent the length, width, and number of channels of the initial template image. /> Respectively represent the length, width, and number of channels of the template image after convolution;

对于每个搜索图像，其初始形状为状，也同样的会对其进行词嵌入操作，再经过卷积操作，得到结果，其形状为/>状，接着把它们从二维拉伸为一维，即重塑为/>，在此令/>，其中，/>表示搜索令牌的长度，/>分别表示初始搜索图像的长、宽、通道数。/>分别表示经过卷积后的搜索图像的长、宽、通道数，且/>，/>，/>。For each search image, its initial shape is The word embedding operation is also performed on it, and then the convolution operation is performed to obtain the result, which has the shape of /> shape, and then stretch them from two dimensions to one dimension, that is, reshape them into/> , hereby order/> , where /> Indicates the length of the search token, /> Respectively represent the length, width, and number of channels of the initial search image. /> Respectively represent the length, width, and number of channels of the search image after convolution, and/> ,/> ,/> .

请参阅请参阅图3至图4，在所述步骤3中，利用级联注意力对局部语义信息进行聚合以实现特征增强，得到模板图像和搜索图片的全局上下文信息的方法具体包括如下步骤：Please refer to FIG. 3 and FIG. 4. In step 3, the method of aggregating local semantic information by using cascade attention to achieve feature enhancement and obtaining global context information of the template image and the search image specifically includes the following steps:

总输入令牌包括模板令牌和搜索令牌，模板令牌包括一个初始模板令牌和若干在线模板令牌，将模板令牌和搜索令牌进行拼接，拼接过程存在如下关系式：The total input token includes template tokens and search tokens. Template tokens include an initial template token and several online template tokens. The template tokens and search tokens are concatenated. The concatenation process has the following relationship:

； ;

其中，表示拼接函数，其功能默认按照通道的维度进行拼接，目的是拼接多个令牌得到最终的输入，/>表示总输入令牌，/>表示总输入令牌中的初始模板令牌，表示总输入令牌中的若干在线模板令牌，其形状为/>，/>表示总输入令牌中的搜索令牌，其形状为/>；in, Represents the concatenation function, which by default performs concatenation according to the channel dimension, with the purpose of concatenating multiple tokens to obtain the final input./> Represents the total input tokens, /> represents the initial template token among the total input tokens, Represents several online template tokens in the total input tokens, and its shape is/> ,/> Represents the search token in the total input tokens, and its shape is /> ;

将总输入令牌转换为二维图像，转换过程存在如下关系式：Convert the total input tokens into a two-dimensional image. The conversion process has the following relationship:

； ;

其中，表示第/>个二维图像，/>表示第/>个输入令牌，/>表示将一维向量转变为二维图片的函数；譬如X的形状为/>，经过改函数后形状变为/>，其中；/>；in, Indicates the first/> A two-dimensional image, /> Indicates the first/> input tokens, /> Represents a function that transforms a one-dimensional vector into a two-dimensional image; for example, the shape of X is/> , after changing the function, the shape becomes/> , where; /> ;

再将二维图像输入到自注意力增强函数中进行特征提取，得到每个图像对应的增强令牌，利用自注意力增强函数进行特征提取的过程存在如下关系式：Then the two-dimensional image is input into the self-attention enhancement function for feature extraction to obtain the enhanced token corresponding to each image. The process of feature extraction using the self-attention enhancement function has the following relationship:

； ;

其中，表示自注意力增强函数（Self Attention Enhancement Module），表示第/>个增强令牌；in, Represents the Self Attention Enhancement Module, Indicates the first/> Enhancement tokens;

将关于模板图像部分的增强令牌进行连接，增强令牌连接的过程存在如下关系式：The enhanced tokens of the template image part are connected. The process of enhancing the token connection has the following relationship:

； ;

其中，表示模板图像部分的令牌拼接所得的模板结果令牌，/>表示拼接函数，其功能默认按照通道的维度进行拼接，目的是拼接多个令牌得到最终的模板结果令牌。in, The template result token obtained by concatenating the tokens representing the template image part, /> Represents a concatenation function, which by default performs concatenation according to the channel dimension, with the goal of concatenating multiple tokens to obtain the final template result token.

在所述步骤3中，对模板图像和搜索图片的全局上下文信息进行交叉注意力计算以实现通信，获得结果特征图的方法具体包括如下步骤：In step 3, cross attention calculation is performed on the global context information of the template image and the search image to achieve communication, and the method for obtaining the result feature map specifically includes the following steps:

利用模板结果令牌和搜索图像对应的增强令牌生成查询（Query）、键(Key)、值(Value)，查询、键、值的生成过程存在如下关系式：The query, key, and value are generated using the template result token and the enhanced token corresponding to the search image. The generation process of query, key, and value has the following relationship:

； ;

其中，表示关于增强令牌的查询、键、值，/>分别表示对关于增强令牌的查询、键、值的卷积操作；in, Represents the query, key, value about the enhanced token, /> denote the convolution operations on the query, key, and value of the enhanced token, respectively;

对查询、键、值进行交叉注意力计算，交叉注意力计算过程存在如下关系式：Cross-attention calculation is performed on query, key, and value. The cross-attention calculation process has the following relationship:

； ;

其中，表示经过交叉注意力后的搜索令牌，/>表示键的维度，/>表示矩阵转置状态，/>表示用于计算注意力权重的softmax函数，以将原始的注意力分数转化为概率分布，确保所有位置的权重都在0到1之间，并且它们的总和等于1；in, represents the search token after cross attention, /> Indicates the dimension of the key, /> Indicates the matrix transposed state, /> represents the softmax function used to calculate the attention weights to transform the original attention scores into probability distributions, ensuring that the weights of all positions are between 0 and 1 and their sum is equal to 1;

将交叉注意力计算结果与模板结果令牌进行拼接，得到总令牌，将交叉注意力计算结果与模板结果令牌进行拼接的过程存在如下关系式：The cross attention calculation result is concatenated with the template result token to obtain the total token. The process of concatenating the cross attention calculation result with the template result token has the following relationship:

； ;

其中，表示经过交叉注意力后的总令牌；in, represents the total tokens after cross attention;

将总令牌依次通过层归一化和多层感知器，得到结果特征图，将总令牌依次通过层归一化和多层感知器的过程存在如下关系式：The total tokens are passed through layer normalization and multi-layer perceptron in turn to obtain the resulting feature map. The process of passing the total tokens through layer normalization and multi-layer perceptron in turn has the following relationship:

； ;

其中，表示暂存结果状态，/>表示当前计算部分的输出，/>表示多层感知器，/>表示层归一化函数（Layer Norm）。in, Indicates the temporary storage result status, /> Indicates the output of the current calculation part, /> represents a multilayer perceptron, /> Represents the layer normalization function (Layer Norm).

进一步的，将二维图像输入到自注意力增强函数中进行特征提取，得到每个图像对应的增强令牌的方法具体包括如下步骤：Furthermore, the two-dimensional image is input into the self-attention enhancement function for feature extraction, and the method for obtaining the enhanced token corresponding to each image specifically includes the following steps:

将第个二维图像/>和第/>个注意力头输出令牌/>相加作为新的第/>个二维图像/>，其过程表达式如下：The first A two-dimensional image/> and the /> Attention heads output tokens /> Add as the new /> A two-dimensional image/> , the process expression is as follows:

； ;

其中，表示第/>个注意力头，/>表示新的第/>个二维图像；in, Indicates the first/> Attention head, /> Indicates the new /> A two-dimensional image;

采用自注意力的方式将第个二维图像/>作为新的输入来计算新的第/>个注意力头，以下记为/>，其过程表达式如下：Using self-attention method, A two-dimensional image/> As new input to calculate the new / > Attention head, denoted as/> , the process expression is as follows:

； ;

其中，表示自注意力函数，/>分别表示对关于第i个增强令牌的查询、键、值的卷积操作，/>表示键的维度；in, represents the self-attention function,/> Respectively represent the convolution operations on the query, key, and value of the i -th enhanced token, /> represents the dimension of the key;

在以卷积运算的形式连接所有新的注意力头的输出之后，再应用普通卷积操作，来增强特征的局部信息，得到每个图像对应的增强令牌，其过程表达式如下：After connecting the outputs of all new attention heads in the form of convolution operations, we apply normal convolution operations to enhance the local information of the features and obtain the enhanced tokens corresponding to each image. The process expression is as follows:

； ;

其中，表示普通卷积操作。in, Represents a normal convolution operation.

在本步骤中，将第个二维图像/>和第/>个注意力头输出令牌/>相加作为新的第/>个二维图像/>，采用自注意力的方式将第/>个二维图像/>作为新的输入来计算新的第/>个注意力头。再以卷积运算的形式连接所有头之后，也应用/>即普通卷积操作，来增强特征的局部信息。这使得自注意力机制能够全面捕捉局部和全局关系，进一步增强特征表示。In this step, A two-dimensional image/> and the /> Attention heads output tokens /> Add as the new /> A two-dimensional image/> , using the self-attention method to A two-dimensional image/> As new input to calculate the new / > After connecting all the heads in the form of convolution operation, we also apply /> That is, ordinary convolution operations are used to enhance the local information of features. This enables the self-attention mechanism to fully capture local and global relationships and further enhance feature representation.

将结果特征图输入分数头部预测模块中，以预测每个目标状态的置信度得分，根据目标状态的置信度得分来确定是否将所预测的目标状态作为下一阶段在线跟踪过程中的在线模板的方法具体包括如下步骤：The result feature map is input into the score head prediction module to predict the confidence score of each target state. The method of determining whether to use the predicted target state as an online template in the next stage of online tracking process according to the confidence score of the target state specifically includes the following steps:

将学习的分数令牌生成参与搜索感兴趣区域令牌的查询，其过程表达式如下：The fraction token that will be learned Generate a query to search for region of interest tokens. The process expression is as follows:

； ;

其中，表示对/>的查询卷积操作，/>表示关于搜索感兴趣区域令牌的查询，/>表示学习的分数令牌，其中形状为1×C；in, Expressing support for/> The query convolution operation, /> Represents a query to search for region of interest tokens, /> represents a learned fraction token, where the shape is 1×C;

再从结果特征图中自适应的提取重要区域特征，并利用重要区域特征生成键、值，其过程表达式如下：Then, the important regional features are adaptively extracted from the result feature map, and the key and value are generated using the important regional features. The process expression is as follows:

； ;

其中，表示结果特征图，/>、/>分别表示关于重要区域特征键、值，/>表示用于自适应提取重要区域特征的ROI函数，/>表示重要区域特征，/>表示对重要区域特征的键卷积操作，/>表示对重要区域特征的值卷积操作；in, Represents the result feature map, /> 、/> Respectively represent the key and value of important regional features,/> Represents the ROI function used to adaptively extract important regional features,/> Indicates important regional features, /> Represents the key convolution operation on important regional features,/> Represents the value convolution operation on the important regional features;

再对利用查询、键、值计算注意力权重，并将注意力权重依次通过多层感知器和Sigmoid激活函数，得到预测得分，其过程表达式如下：Then, the attention weight is calculated using the query, key, and value, and the attention weight is passed through the multi-layer perceptron and the Sigmoid activation function in turn to obtain the prediction score. The process expression is as follows:

； ;

其中，表示注意力输出的注意力权重，/>表示用于生成0到1得分的激活函数，/>表示置信度得分；in, represents the attention weight of the attention output,/> represents the activation function used to generate a score from 0 to 1, /> represents the confidence score;

当置信度得分低于0.5时，该在线模板将被视为阴性，表示不会被更新，否则，视为阳性，表示对在线模板进行更新，将所预测的目标状态作为下一阶段在线跟踪过程中的在线模板。When the confidence score is lower than 0.5, the online template will be regarded as negative, indicating that it will not be updated. Otherwise, it will be regarded as positive, indicating that the online template will be updated and the predicted target state will be used as the online template in the next stage of online tracking process.

在本步骤中，可学习的分数令牌被用作参与搜索ROI令牌的查询，使得分令牌能够对挖掘的目标信息进行编码。接下来，得分令牌关注初始目标令牌的所有位置，以隐式地将挖掘的目标与第一目标进行比较。最后，得分是由MLP层和Sigmoid激活函数产生并根据得分更新在线模板。通过这种方式，我们能够有效地筛选和更新在线模板，提高跟踪系统的准确性和稳定性。In this step, the fractional tokens that can be learned is used as a query to participate in the search for ROI tokens, so that the score token can encode the mined target information. Next, the score token focuses on all positions of the initial target token to implicitly compare the mined target with the first target. Finally, the score is generated by the MLP layer and the Sigmoid activation function and the online template is updated according to the score. In this way, we are able to effectively screen and update the online template and improve the accuracy and stability of the tracking system.

请参照图5，本实施例还提供一种基于级联注意力的单流单阶段网络目标跟踪系统，其中，所述系统应用如上所述的基于级联注意力的单流单阶段网络目标跟踪方法，所述系统包括：Referring to FIG. 5 , this embodiment further provides a single-stream single-stage network target tracking system based on cascaded attention, wherein the system applies the single-stream single-stage network target tracking method based on cascaded attention as described above, and the system includes:

构建模块，用于：Building blocks for:

学习模块，用于：Learning modules for:

提取模块，用于：Extraction module for:

计算模块，用于：Compute module for:

预训练模块，用于：Pre-trained modules for:

跟踪模块，用于：Tracking module for:

在本说明书的描述中，参考术语“一个实施例”、“一些实施例”、 “示例”、“具体示例”、或“一些示例”等的描述意指结合该实施例或示例描述的具体特征、结构、材料或者特点包含于本发明的至少一个实施例或示例中。在本说明书中，对上述术语的示意性表述不一定指的是相同的实施例或示例。而且，描述的具体特征、结构、材料或者特点可以在任何的一个或多个实施例或示例中以合适的方式结合。In the description of this specification, the description with reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that the specific features, structures, materials or characteristics described in conjunction with the embodiment or example are included in at least one embodiment or example of the present invention. In this specification, the schematic representation of the above terms does not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described can be combined in any one or more embodiments or examples in a suitable manner.

以上所述实施例仅表达了本发明的几种实施方式，其描述较为具体和详细，但并不能因此而理解为对本发明专利范围的限制。应当指出的是，对于本领域的普通技术人员来说，在不脱离本发明构思的前提下，还可以做出若干变形和改进，这些都属于本发明的保护范围。因此，本发明专利的保护范围应以所附权利要求为准。The above-mentioned embodiments only express several implementation methods of the present invention, and the description thereof is relatively specific and detailed, but it cannot be understood as limiting the scope of the patent of the present invention. It should be pointed out that, for ordinary technicians in this field, several variations and improvements can be made without departing from the concept of the present invention, which all belong to the protection scope of the present invention. Therefore, the protection scope of the patent of the present invention shall be subject to the attached claims.

Claims

1.A method for single-stream single-stage network target tracking based on cascade attention, the method comprising the steps of:

Step 1, under a single-stream single-stage framework, constructing and obtaining a trunk feature extraction and fusion module based on a Transformer network model and a feature enhancement module, wherein the trunk feature extraction and fusion module, a head corner module and a fractional head prediction module form a single-stream single-stage integral model;

Step 2, obtaining a template image and searching pictures, wherein the template image comprises an initial template containing a plurality of required tracking targets and a plurality of online templates containing target states;

Step 3, inputting the template image and the search picture into a single-stream single-stage integral model, and extracting local feature information corresponding to the template image and the search picture through a trunk feature extraction and fusion module;

inputting the local feature information into a feature enhancement module, and aggregating the local semantic information by using cascade attention to realize feature enhancement, so as to obtain global context information of a template image and a search picture;

Performing cross attention calculation on global context information of the template image and the search picture to realize communication, and obtaining a result feature map;

Step 4, dividing the result feature map into a template image and a search picture, and repeating the step 3 for a plurality of times in an iterative mode to obtain a final result feature map;

step 5, inputting the final result feature map into a head corner module to predict the confidence score of each target position, and determining the position of the tracking target according to the confidence score so as to realize target tracking;

inputting the result feature map into a score head prediction module to predict the confidence score of each target state, and determining whether to take the predicted target state as an online template in the online tracking process of the next stage according to the confidence score of the target state;

step 6, repeating the steps 2 to 4 by using the large-scale data set as a basis, and pre-training the single-flow single-stage integral model to optimize model parameters;

Step 7, performing target online tracking on the video sequence by using the trained single-stream single-stage integral model;

In the step 3, the template image and the search picture are input into a single-stream single-stage integral model, and the method for extracting the local feature information corresponding to the template image and the search picture through the trunk feature extraction and fusion module specifically comprises the following steps:

for each template image, its initial shape is Firstly, word embedding operation is carried out on the shape, and then convolution operation is carried out to obtain a result, wherein the shape is/>They are then stretched from two dimensions to one dimension, i.e. remoldedLet/>, hereWherein/>Representing the length of the template token,/>Respectively representing the length, width and channel number of the initial template image,/>Respectively representing the length, width and channel number of the template image after convolution;

for each search image, its initial shape is Similarly, word embedding operation is performed on the obtained product, and convolution operation is performed to obtain a result, wherein the shape is/>They are then stretched from two dimensions to one dimension, i.e. remolded/>Let/>, hereWherein/>Representing the length of the search token,/>Respectively representing the length, width and channel number of the initial search image,/>Respectively representing the length, width and channel number of the convolved search template image, and/>，/>，/>；

In the step 3, the method for aggregating the local semantic information by using the cascade attention to realize feature enhancement and obtaining the global context information of the template image and the search picture specifically comprises the following steps:

The total input tokens comprise template tokens and search tokens, the template tokens comprise an initial template token and a plurality of online template tokens, the template tokens and the search tokens are spliced, and the splicing process has the following relational expression:

；

wherein, Representing a stitching function,/>Representing the total input token,/>Representing an initial template token in the total input token,/>Representing several online template tokens in the total input token, whose shape is/>，/>Representing a search token of the total input tokens, the shape of which is/>；

The total input token is converted into a two-dimensional image, and the conversion process has the following relation:

；

wherein, Represents the/>Two-dimensional image,/>Represents the/>Input tokens,/>A function representing converting the one-dimensional vector into a two-dimensional picture;

Inputting the two-dimensional images into the self-attention enhancing function to perform feature extraction, obtaining an enhancing token corresponding to each image, wherein the process of performing feature extraction by using the self-attention enhancing function has the following relation:

；

wherein, Representing a self-attention enhancing function,/>Represents the/>A plurality of enhancement tokens;

Connecting the enhancement tokens related to the template image part, wherein the enhancement token connection process has the following relation:

；

wherein, Template result token obtained by token stitching representing a template image portion,/>Representing a splicing function;

The method for carrying out cross attention calculation on global context information of the template image and the search picture to realize communication, and obtaining the result feature map specifically comprises the following steps:

generating a query, a key and a value by using the template result token and the enhanced token corresponding to the search image, wherein the generation process of the query, the key and the value has the following relational expression

；

Wherein,Representing queries, keys, values, and/or the like about enhanced tokensRespectively representing convolution operations on queries, keys, values about the enhanced token;

The cross attention calculation is carried out on the query, the key and the value, and the cross attention calculation process has the following relation:

；

wherein, Representing search tokens after cross-attention,/>Representing the dimensions of the key,/>Represents the transposed state of the matrix,Representing a softmax function for calculating the attention weights to convert the original attention score into a probability distribution, ensuring that the weights for all locations are between 0 and 1 and that their sum equals 1;

splicing the cross attention computing result and the template result token to obtain a total token, wherein the process of splicing the cross attention computing result and the template result token has the following relational expression:

；

wherein, Representing the total token after cross-attention;

sequentially passing the total tokens through the layer normalization and the multi-layer perceptron to obtain a result feature map, wherein the process of sequentially passing the total tokens through the layer normalization and the multi-layer perceptron has the following relational expression:

；

wherein, Representing the status of temporary storage results,/>Representing the output of the currently calculated part,/>Representing a multi-layer perceptron,/>Representing a layer normalization function;

The method for inputting the two-dimensional images into the self-attention enhancement function to extract the characteristics and obtaining the enhancement tokens corresponding to each image specifically comprises the following steps:

Will be the first Two-dimensional image/>And/>Individual attention header output tokens/>Add as new/>Two-dimensional image/>The process expression is as follows:

；

wherein, Represents the/>Attention header,/>Represents a new/>A two-dimensional image;

Self-attentive mode is adopted to make the first Two-dimensional image/>Calculate a new/>, as a new inputAttention heads, hereinafter referred to as/>The process expression is as follows:

；

wherein, Representing a self-attention function,/>Respectively representing convolution operations on queries, keys, values with respect to the ith enhanced token,/>Representing the dimensions of the key;

after the output of all new attention heads is connected in a convolution operation mode, the common convolution operation is applied to strengthen the local information of the features, and an enhanced token corresponding to each image is obtained, wherein the process expression is as follows:

；

wherein, Representing a normal convolution operation;

in the step 5, the result feature map is input into a score head prediction module to predict the confidence score of each target state, and the method for determining whether to take the predicted target state as an online template in the next stage online tracking process according to the confidence score of the target state specifically comprises the following steps:

score tokens to be learned A query participating in searching for a region of interest token is generated, the process expression of which is as follows:

；

wherein, Representation pair/>Is a query convolution operation,/>Representing a query about searching for a region of interest token,/>A score token representing learning, wherein the shape is 1 xc;

And extracting important region features from the result feature map in a self-adaptive manner, and generating keys and values by using the important region features, wherein the process expression is as follows:

；

wherein, Representing the resulting feature map,/>、/>Respectively representing characteristic keys and values related to important areas,/>Representing an ROI function for adaptively extracting important region features,/>Representing important regional features,/>Key convolution operations representing features of important regions,/>A value convolution operation representing a feature of the region of interest;

And calculating attention weight by using query, key and value, and sequentially passing the attention weight through a multi-layer perceptron and a Sigmoid activation function to obtain a prediction score, wherein the process expression is as follows:

；

wherein, Attention weight representing attention output,/>Representing an activation function for generating a 0 to 1 score,/>Representing a confidence score;

When the confidence score is lower than 0.5, the online template is considered negative, which indicates that the online template is not updated, otherwise, the online template is considered positive, which indicates that the online template is updated, and the predicted target state is taken as the online template in the online tracking process of the next stage.

2. A cascade attention-based single-flow single-phase network object tracking system, wherein the system applies the cascade attention-based single-flow single-phase network object tracking method as claimed in claim 1, the system comprising:

A construction module for:

Under a single-flow single-stage framework, constructing and obtaining a trunk feature extraction and fusion module based on a Transformer network model and a feature enhancement module, wherein the trunk feature extraction and fusion module, the head corner module and the fractional head prediction module form a single-flow single-stage integral model;

a learning module for:

acquiring a template image and a search picture, wherein the template image comprises an initial template containing a plurality of required tracking targets and a plurality of online templates containing target states;

inputting the template image and the search picture into a single-stream single-stage integral model, and extracting local feature information corresponding to the template image and the search picture through a trunk feature extraction and fusion module;

An extraction module for:

dividing the result feature map into a template image and a search picture, and repeating feature extraction for a plurality of times in an iterative mode to obtain a final result feature map;

A calculation module for:

inputting the final result feature map into a head corner module to predict the confidence score of each target position, and determining the position of a tracking target according to the confidence score so as to realize target tracking;

a pre-training module for:

Pre-training the single-flow single-stage integral model by using the large-scale data set as a base to optimize model parameters;

A tracking module for:

and carrying out target online tracking on the video sequence by utilizing the trained single-stream single-stage integral model.