CN117649582A - Single-flow single-stage network target tracking method and system based on cascade attention

Info

Publication number: CN117649582A
Application number: CN202410106560.9A
Authority: CN
Prior art keywords: representing, attention, template, token, module
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN117649582B (en)
Inventors: 王员云, 司英振
Assignee (original and current): Nanchang Institute of Technology
Priority / filing date: 2024-01-25
Application filed by Nanchang Institute of Technology; granted as CN117649582B

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a single-stream single-stage network target tracking method and system based on cascade attention. The method first forms a single-stream single-stage overall model. A template image and a search picture are input into the model, and feature extraction is carried out to obtain local feature information; the local semantic information is aggregated using cascade attention to achieve feature enhancement, after which a cross-attention calculation realizes communication and yields a result feature map. The result feature map is re-extracted several times in an iterative manner to obtain a final result feature map, from which the target position and the target state are predicted. Target tracking is realized according to the target position, and the target state determines whether the predicted state is used as an online template in the next stage of online tracking. The invention reduces the computational redundancy of multi-head attention while coping with changes in object appearance in complex scenes, thereby improving target tracking performance.

Description

Single-flow single-stage network target tracking method and system based on cascade attention
Technical Field
The invention relates to the technical field of computer vision and image processing, in particular to a single-flow single-stage network target tracking method and system based on cascade attention.
Background
In the field of computer vision and image processing, visual tracking is a fundamental research task: it aims to precisely locate an arbitrary target in each video frame using only its initial appearance as a reference. It is applied in various fields, including visual positioning, autonomous driving systems, and smart-city technology. However, due to the many challenging factors in real-world scenes, such as partial occlusion, targets leaving the field of view, background clutter, viewpoint changes, and scale changes, designing a robust tracker remains a significant challenge.
Currently, tracking models typically adopt a dual-stream, dual-stage architecture, in which features from the template and the search region are extracted separately. This approach has certain drawbacks, mainly the high computational complexity of traditional attention mechanisms; in addition, local feature information is often ignored when features are extracted with global context. Recently, single-stream architectures have become a viable alternative. These architectures offer faster processing and enhanced feature-fusion capability, with significant success in tracking performance. The reason behind their effectiveness is that the model can establish an unobstructed information flow between the template and the search region at an early stage, in particular from the raw image pair. This helps extract target-specific features and prevents the loss of discriminative information.
The Transformer first proposed a self-attention-based encoder-decoder module for natural language processing. It explores long-range dependencies in a sequence by computing attention weights over query-key-value triples. Owing to this excellent feature-fusion capability, the Transformer structure has been successfully applied to visual tracking with encouraging results. In Transformer-based trackers, global context information is fully explored; however, local information is not fully utilized. To improve the attention mechanism, a new attention module, called cascade attention, is therefore proposed. Its core idea is to enhance the diversity of the features fed to the attention heads.
Disclosure of Invention
In view of the above, the main objective of the present invention is to provide a method and a system for single-stream single-stage network target tracking based on cascade attention, so as to solve the above technical problems.
The invention provides a single-stream single-stage network target tracking method based on cascade attention, which comprises the following steps:
step 1, under a single-stream single-stage framework, constructing a backbone feature extraction and fusion module based on a Transformer network model and a feature enhancement module, wherein the backbone feature extraction and fusion module, a corner head module and a score head prediction module form a single-stream single-stage overall model;
step 2, obtaining a template image and a search picture, wherein the template image comprises an initial template containing the required tracking target and several online templates containing target states;
step 3, inputting the template image and the search picture into the single-stream single-stage overall model, and extracting the local feature information corresponding to the template image and the search picture through the backbone feature extraction and fusion module;
inputting the local feature information into the feature enhancement module, and aggregating the local semantic information using cascade attention to achieve feature enhancement, thereby obtaining global context information of the template image and the search picture;
performing a cross-attention calculation on the global context information of the template image and the search picture to realize communication, and obtaining a result feature map;
step 4, dividing the result feature map into a template image and a search picture, and repeating step 3 several times in an iterative manner to obtain a final result feature map;
step 5, inputting the final result feature map into the corner head module to predict the confidence score of each target position, and determining the position of the tracking target according to the confidence scores, so as to realize target tracking;
inputting the result feature map into the score head prediction module to predict the confidence score of each target state, and determining, according to the confidence score of the target state, whether to take the predicted target state as an online template in the next stage of online tracking;
step 6, repeating steps 2 to 4 on a large-scale data set to pre-train the single-stream single-stage overall model and optimize the model parameters;
and step 7, carrying out online target tracking on a video sequence using the trained single-stream single-stage overall model (a simplified sketch of this overall flow is given after this list).
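Purely for illustration, the overall flow of steps 2 to 7 can be sketched in PyTorch as follows. Every name and choice here (SingleStreamTracker, the TransformerEncoderLayer stand-in for the cascade-attention blocks, the linear corner and score heads, patch size 16, width 768) is a hypothetical simplification, not the patent's actual implementation:

```python
import torch
import torch.nn as nn

class SingleStreamTracker(nn.Module):
    """Toy stand-in for the single-stream single-stage overall model."""
    def __init__(self, dim=768, blocks=4):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, 16, 16)             # step 3: local feature tokens
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, 8, batch_first=True)
            for _ in range(blocks))                         # steps 3-4 stand-in
        self.corner_head = nn.Linear(dim, 4)                # step 5: box prediction
        self.score_head = nn.Linear(dim, 1)                 # step 5: target-state score

    def forward(self, templates, search):
        toks = [self.embed(t).flatten(2).transpose(1, 2) for t in templates + [search]]
        x = torch.cat(toks, dim=1)                          # joint template/search tokens
        for blk in self.blocks:                             # step 4: iterated extraction
            x = blk(x)
        search_tok = x[:, -toks[-1].shape[1]:]              # split off search part
        box = self.corner_head(search_tok.mean(1))          # target position
        score = torch.sigmoid(self.score_head(x.mean(1)))   # target-state confidence
        return box, score

tracker = SingleStreamTracker()
templates = [torch.randn(1, 3, 128, 128)]                   # initial + online templates
box, score = tracker(templates, torch.randn(1, 3, 256, 256))
update_template = score.item() >= 0.5                       # step 5 online-template rule
```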
The invention also provides a single-stream single-stage network target tracking system based on cascade attention, wherein the system applies the single-stream single-stage network target tracking method based on cascade attention described above, and the system comprises:
a construction module for:
under the single-stream single-stage framework, constructing a backbone feature extraction and fusion module based on a Transformer network model and a feature enhancement module, wherein the backbone feature extraction and fusion module, the corner head module and the score head prediction module form a single-stream single-stage overall model;
a learning module for:
acquiring a template image and a search picture, wherein the template image comprises an initial template containing the required tracking target and several online templates containing target states;
inputting the template image and the search picture into the single-stream single-stage overall model, and extracting the local feature information corresponding to the template image and the search picture through the backbone feature extraction and fusion module;
inputting the local feature information into the feature enhancement module, and aggregating the local semantic information using cascade attention to achieve feature enhancement, thereby obtaining global context information of the template image and the search picture;
performing a cross-attention calculation on the global context information of the template image and the search picture to realize communication, and obtaining a result feature map;
an extraction module for:
dividing the result feature map into a template image and a search picture, and repeating the feature extraction several times in an iterative manner to obtain a final result feature map;
a calculation module for:
inputting the final result feature map into the corner head module to predict the confidence score of each target position, and determining the position of the tracking target according to the confidence scores, so as to realize target tracking;
inputting the result feature map into the score head prediction module to predict the confidence score of each target state, and determining, according to the confidence score of the target state, whether to take the predicted target state as an online template in the next stage of online tracking;
a pre-training module for:
pre-training the single-stream single-stage overall model on a large-scale data set to optimize the model parameters;
a tracking module for:
carrying out online target tracking on a video sequence using the trained single-stream single-stage overall model.
Compared with the prior art, the invention has the following beneficial effects:
1. The present invention utilizes cascade attention to provide a different input split for each head and then cascades the output features across the heads. This approach not only reduces the computational redundancy of multi-head attention, but also enhances model capacity by increasing network depth.
2. The invention introduces the score head module for updating the online template, correcting the online template picture online according to the prediction score of the search picture. It can cope with changes in object appearance in complex scenes, better handle difficulties in the tracking process such as severe occlusion, scale change and complex background, and effectively capture temporal information while handling appearance changes, thereby further improving target tracking performance.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a flow chart of the single-stream single-stage network target tracking method based on cascade attention;
FIG. 2 is a block diagram of the single-stream single-stage network target tracking framework based on the cascade attention module according to the present invention;
FIG. 3 is a schematic diagram of the feature enhancement module of the present invention;
FIG. 4 is a schematic diagram of the cascade attention of the present invention;
FIG. 5 is a schematic structural diagram of the single-stream single-stage network target tracking system based on cascade attention according to the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
These and other aspects of embodiments of the invention will be apparent from and elucidated with reference to the description and drawings described hereinafter. In the description and drawings, particular implementations of embodiments of the invention are disclosed in detail as being indicative of some of the ways in which the principles of embodiments of the invention may be employed, but it is understood that the scope of the embodiments of the invention is not limited correspondingly.
Referring to fig. 1 to 2, the present embodiment provides a single-stream single-stage network target tracking method based on cascade attention, which comprises the following steps:
step 1, under a single-stream single-stage framework, constructing a backbone feature extraction and fusion module based on a Transformer network model and a feature enhancement module, wherein the backbone feature extraction and fusion module, a corner head module and a score head prediction module form a single-stream single-stage overall model;
step 2, obtaining a template image and a search picture, wherein the template image comprises an initial template containing the required tracking target and several online templates containing target states;
step 3, inputting the template image and the search picture into the single-stream single-stage overall model, and extracting the local feature information corresponding to the template image and the search picture through the backbone feature extraction and fusion module;
inputting the local feature information into the feature enhancement module, and aggregating the local semantic information using cascade attention to achieve feature enhancement, thereby obtaining global context information of the template image and the search picture;
performing a cross-attention calculation on the global context information of the template image and the search picture to realize communication, and obtaining a result feature map;
In step 3, the template image and the search picture are input into the single-stream single-stage overall model, and the method for extracting the local feature information corresponding to the template image and the search picture through the backbone feature extraction and fusion module specifically comprises the following steps:
for each template image, its initial shape is $H_z \times W_z \times C_z$; a word-embedding operation is first performed on it, followed by a convolution operation, giving a result of shape $H_z' \times W_z' \times C$; the result is then stretched from two dimensions to one dimension, i.e. reshaped to $N_z \times C$, where $N_z = H_z' \times W_z'$ denotes the length of the template token, $H_z$, $W_z$, $C_z$ respectively denote the height, width and channel number of the initial template image, and $H_z'$, $W_z'$, $C$ respectively denote the height, width and channel number of the template image after convolution;
for each search image, its initial shape is $H_x \times W_x \times C_x$; similarly, a word-embedding operation is performed on it, followed by a convolution operation, giving a result of shape $H_x' \times W_x' \times C$; the result is then stretched from two dimensions to one dimension, i.e. reshaped to $N_x \times C$, where $N_x = H_x' \times W_x'$ denotes the length of the search token, $H_x$, $W_x$, $C_x$ respectively denote the height, width and channel number of the initial search image, and $H_x'$, $W_x'$, $C$ respectively denote the height, width and channel number of the search image after convolution.
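A minimal PyTorch sketch of this tokenization is given below; the patch size of 16, the embedding width of 768, the input resolutions and the module name are illustrative assumptions, not values fixed by the patent:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Word-embedding + convolution, then flatten the 2-D feature map to tokens."""
    def __init__(self, in_ch=3, embed_dim=768, patch=16):
        super().__init__()
        # a strided convolution plays the role of the word-embedding/conv step
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, img):                    # img: (B, C_z, H_z, W_z)
        f = self.proj(img)                     # (B, C, H'_z, W'_z)
        return f.flatten(2).transpose(1, 2)    # (B, N_z, C) with N_z = H'_z * W'_z

embed = PatchEmbed()
template = torch.randn(1, 3, 128, 128)         # initial template image
search = torch.randn(1, 3, 256, 256)           # search picture
z_tok, x_tok = embed(template), embed(search)
print(z_tok.shape, x_tok.shape)                # (1, 64, 768) (1, 256, 768)
```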
Referring to fig. 3 to 4, in step 3, the method for aggregating the local semantic information by cascade attention to achieve feature enhancement and obtain the global context information of the template image and the search picture specifically comprises the following steps:
the total input token comprises template tokens and a search token, the template tokens comprising an initial template token and several online template tokens; the template tokens and the search token are concatenated, the concatenation process having the following relation:
$X = \mathrm{Concat}(Z_0, Z_1, \ldots, Z_m, S)$
where $\mathrm{Concat}(\cdot)$ denotes the concatenation function, which by default concatenates along the channel dimension, its purpose being to splice multiple tokens to obtain the final input; $X$ denotes the total input token; $Z_0$ denotes the initial template token in the total input token; $Z_1, \ldots, Z_m$ denote the several online template tokens in the total input token, each of shape $N_z \times C$; and $S$ denotes the search token in the total input token, of shape $N_x \times C$;
each part of the total input token is converted into a two-dimensional image, the conversion process having the following relation:
$I_i = \mathrm{Img2D}(X_i)$
where $I_i$ denotes the $i$-th two-dimensional image, $X_i$ denotes the $i$-th input token, and $\mathrm{Img2D}(\cdot)$ denotes the function converting a one-dimensional vector into a two-dimensional picture; for example, an $X_i$ of shape $N \times C$ becomes of shape $H' \times W' \times C$, where $N = H' \times W'$;
the two-dimensional images are input into the self-attention enhancement function for feature extraction, obtaining an enhancement token corresponding to each image, the process of feature extraction with the self-attention enhancement function having the following relation:
$E_i = \mathrm{SAE}(I_i)$
where $\mathrm{SAE}(\cdot)$ denotes the self-attention enhancement function (Self Attention Enhancement Module) and $E_i$ denotes the $i$-th enhancement token;
the enhancement tokens related to the template image parts are connected, the enhancement-token connection process having the following relation:
$E_z = \mathrm{Concat}(E_0, E_1, \ldots, E_m)$
where $E_z$ denotes the template result token obtained by concatenating the tokens of the template image parts, and $\mathrm{Concat}(\cdot)$ denotes the concatenation function, which by default concatenates along the channel dimension, its purpose being to splice multiple tokens into the final template result token.
In step 3, the method for performing the cross-attention calculation on the global context information of the template image and the search picture to realize communication and obtain the result feature map specifically comprises the following steps:
a query (Query), key (Key) and value (Value) are generated from the template result token and the enhancement token corresponding to the search image, the generation process having the following relations:
$Q = W_Q(E_x), \quad K = W_K(E_z), \quad V = W_V(E_z)$
where $Q$, $K$, $V$ denote the query, key and value about the enhancement tokens, and $W_Q$, $W_K$, $W_V$ respectively denote the convolution operations producing the query, key and value from the enhancement tokens;
the cross-attention calculation is carried out on the query, key and value, the cross-attention calculation process having the following relation:
$E_x' = \mathrm{softmax}\!\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V$
where $E_x'$ denotes the search token after cross-attention, $d_k$ denotes the dimension of the key, $T$ denotes the matrix transpose, and $\mathrm{softmax}(\cdot)$ denotes the softmax function used to compute the attention weights, converting the raw attention scores into a probability distribution so that the weights of all locations lie between 0 and 1 and sum to 1;
the cross-attention calculation result and the template result token are spliced to obtain the total token, the splicing process having the following relation:
$F = \mathrm{Concat}(E_z, E_x')$
where $F$ denotes the total token after cross-attention;
the total token is passed sequentially through layer normalization and a multi-layer perceptron to obtain the result feature map, the process having the following relations:
$\hat{F} = \mathrm{LN}(F), \qquad Y = F + \mathrm{MLP}(\hat{F})$
where $\hat{F}$ denotes the temporarily stored result, $Y$ denotes the output of the current calculation part, $\mathrm{MLP}(\cdot)$ denotes the multi-layer perceptron, and $\mathrm{LN}(\cdot)$ denotes the layer normalization function (Layer Norm).
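For clarity, this communication step can be sketched in PyTorch as follows. The assignment of the query to the search tokens and of the key/value to the template result token, the residual placement around the LayerNorm + MLP, and all module names are assumptions made for illustration, not the patent's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFuse(nn.Module):
    """Cross-attention communication followed by LayerNorm + MLP (a sketch)."""
    def __init__(self, dim=768):
        super().__init__()
        self.wq = nn.Linear(dim, dim)   # W_Q: query from the search enhancement token
        self.wk = nn.Linear(dim, dim)   # W_K: key from the template result token
        self.wv = nn.Linear(dim, dim)   # W_V: value from the template result token
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, e_z, e_x):        # e_z: (B, N_z, C) templates; e_x: (B, N_x, C)
        q, k, v = self.wq(e_x), self.wk(e_z), self.wv(e_z)
        attn = F.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        e_x = attn @ v                           # search token after cross-attention
        f = torch.cat([e_z, e_x], dim=1)         # total token F
        return f + self.mlp(self.norm(f))        # Y = F + MLP(LN(F))

fuse = CrossFuse()
out = fuse(torch.randn(1, 128, 768), torch.randn(1, 256, 768))
print(out.shape)                                 # (1, 384, 768) result feature map
```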
Further, the method for inputting the two-dimensional images into the self-attention enhancement function for feature extraction and obtaining the enhancement token corresponding to each image specifically comprises the following steps:
the $i$-th two-dimensional image $X_i$ and the output token $H_{i-1}$ of the $(i-1)$-th attention head are added to form the new $i$-th two-dimensional image $X_i'$, the process expression being as follows:
$X_i' = X_i + H_{i-1}$
where $H_{i-1}$ denotes the $(i-1)$-th attention head and $X_i'$ denotes the new $i$-th two-dimensional image;
the new $i$-th attention head, hereinafter denoted $H_i$, is calculated from the $i$-th two-dimensional image $X_i'$ in a self-attention manner, the process expression being as follows:
$H_i = \mathrm{Attn}\!\left(W_i^Q(X_i'), W_i^K(X_i'), W_i^V(X_i')\right) = \mathrm{softmax}\!\left(\dfrac{Q_iK_i^T}{\sqrt{d_k}}\right)V_i$
where $\mathrm{Attn}(\cdot)$ denotes the self-attention function, $W_i^Q$, $W_i^K$, $W_i^V$ respectively denote the convolution operations producing the query, key and value of the $i$-th enhancement token, and $d_k$ denotes the dimension of the key;
after the outputs of all the new attention heads are connected by a convolution operation, an ordinary convolution operation is applied to strengthen the local information of the features, obtaining the enhancement token corresponding to each image, the process expression being as follows:
$E = \mathrm{Conv}\!\left(\mathrm{Concat}(H_1, H_2, \ldots, H_h)\right)$
where $\mathrm{Conv}(\cdot)$ denotes an ordinary convolution operation.
In this step, the $i$-th two-dimensional image $X_i$ and the output token $H_{i-1}$ of the $(i-1)$-th attention head are added to serve as the new $i$-th two-dimensional image $X_i'$, and the new $i$-th attention head is calculated from $X_i'$ in a self-attention manner. After all the heads are connected in the form of a convolution operation, $\mathrm{Conv}(\cdot)$, i.e. an ordinary convolution operation, is also applied to enhance the local information of the features. This enables the self-attention mechanism to comprehensively capture local and global relationships, further enhancing the feature representation.
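A minimal sketch of this cascaded-head computation follows, under the assumption of per-head linear projections (equivalent to 1×1 convolutions on tokens) and a depthwise 3×3 convolution for the local-information step; both operators and all names are illustrative choices the patent does not fix:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadedAttention(nn.Module):
    """Cascade attention sketch: each head's input split is augmented with the
    previous head's output; all head outputs are then fused by convolutions."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.heads, self.hd = heads, dim // heads
        self.qkv = nn.ModuleList(
            nn.ModuleDict({k: nn.Linear(self.hd, self.hd) for k in ("q", "k", "v")})
            for _ in range(heads))
        self.proj = nn.Linear(dim, dim)          # connects all new head outputs
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # local info

    def forward(self, x, hw):                    # x: (B, N, C); hw: (H', W') of image
        splits = x.chunk(self.heads, dim=-1)     # one input split per head
        outs, prev = [], 0
        for xi, p in zip(splits, self.qkv):
            xi = xi + prev                       # X'_i = X_i + H_{i-1}
            q, k, v = p["q"](xi), p["k"](xi), p["v"](xi)
            a = F.softmax(q @ k.transpose(-2, -1) / self.hd ** 0.5, dim=-1)
            prev = a @ v                         # H_i
            outs.append(prev)
        e = self.proj(torch.cat(outs, dim=-1))   # concatenate all new heads
        B, N, C = e.shape; H, W = hw
        img = e.transpose(1, 2).reshape(B, C, H, W)
        return self.local(img).reshape(B, C, N).transpose(1, 2)  # ordinary conv

att = CascadedAttention()
tok = torch.randn(1, 64, 768)                    # an 8x8 template token map
print(att(tok, (8, 8)).shape)                    # (1, 64, 768) enhancement token
```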
Step 4, dividing the result feature map into a template image and a search picture, and repeating step 3 several times in an iterative manner to obtain a final result feature map;
step 5, inputting the final result feature map into the corner head module to predict the confidence score of each target position, and determining the position of the tracking target according to the confidence scores, so as to realize target tracking;
inputting the result feature map into the score head prediction module to predict the confidence score of each target state, and determining, according to the confidence score of the target state, whether to take the predicted target state as an online template in the next stage of online tracking;
the method for inputting the result feature map into the score head prediction module to predict the confidence score of each target state, and determining according to the confidence score of the target state whether to take the predicted target state as an online template in the next stage of online tracking, specifically comprises the following steps:
the learnable score token $s$ generates a query participating in attending to the search region-of-interest tokens, the process expression being as follows:
$Q_s = W_Q^s(s)$
where $W_Q^s$ denotes the query convolution operation on $s$, $Q_s$ denotes the query about the search region-of-interest tokens, and $s$ denotes the learnable score token, of shape $1 \times C$;
important region features are adaptively extracted from the result feature map, and keys and values are generated from the important region features, the process expressions being as follows:
$R = \mathrm{ROI}(Y), \quad K_r = W_K^r(R), \quad V_r = W_V^r(R)$
where $Y$ denotes the result feature map; $K_r$, $V_r$ respectively denote the feature keys and values; $\mathrm{ROI}(\cdot)$ denotes the ROI function used to adaptively extract the important region features; $R$ denotes the important region features; $W_K^r$ denotes the key convolution operation on the important region features; and $W_V^r$ denotes the value convolution operation on the important region features;
the attention weight is calculated from the query, key and value, and passed sequentially through a multi-layer perceptron and a Sigmoid activation function to obtain the prediction score, the process expressions being as follows:
$A = \mathrm{softmax}\!\left(\dfrac{Q_sK_r^T}{\sqrt{d_k}}\right)V_r, \qquad \mathrm{score} = \mathrm{Sigmoid}(\mathrm{MLP}(A))$
where $A$ denotes the attention weight output by the attention, $\mathrm{Sigmoid}(\cdot)$ denotes the activation function used to generate a score between 0 and 1, and $\mathrm{score}$ denotes the confidence score;
when the confidence score is lower than 0.5, the online template is considered negative, indicating that the online template is not updated; otherwise it is considered positive, indicating that the online template is updated and the predicted target state is taken as the online template in the next stage of online tracking.
In this step, a learnable score token $s$ is used as a query attending to the search ROI tokens, so that the score token can encode the mined target information. Next, the score token attends to all locations of the initial target token to implicitly compare the mined target with the first target. Finally, the score is generated by the MLP layer and the Sigmoid activation function, and the online template is updated according to the score. In this way, online templates can be effectively screened and updated, improving the accuracy and stability of the tracking system.
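The score-head logic, including the 0.5 update threshold, can be sketched as follows; the top-k token selection standing in for the adaptive ROI step, the MLP width, and all module names are assumptions made for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoreHead(nn.Module):
    """Score head sketch: a learnable score token queries ROI features of the
    result feature map; an MLP + Sigmoid turns the output into a 0-1 score."""
    def __init__(self, dim=768):
        super().__init__()
        self.score_token = nn.Parameter(torch.zeros(1, 1, dim))  # shape 1 x C
        self.wq = nn.Linear(dim, dim)        # query from the score token
        self.wk = nn.Linear(dim, dim)        # key from important-region features
        self.wv = nn.Linear(dim, dim)        # value from important-region features
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def roi(self, feat, k=16):
        # stand-in for the adaptive ROI step: keep the k highest-energy tokens
        idx = feat.norm(dim=-1).topk(k, dim=-1).indices
        return feat.gather(1, idx.unsqueeze(-1).expand(-1, -1, feat.shape[-1]))

    def forward(self, result_feat):          # (B, N, C) result feature map
        r = self.roi(result_feat)            # important region features R
        q = self.wq(self.score_token)
        k, v = self.wk(r), self.wv(r)
        a = F.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1) @ v
        return torch.sigmoid(self.mlp(a)).squeeze(-1)  # confidence in (0, 1)

head = ScoreHead()
conf = head(torch.randn(1, 256, 768))
update_online_template = conf.item() >= 0.5   # threshold rule from the text
print(conf.item(), update_online_template)
```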
Step 6, repeating steps 2 to 4 on a large-scale data set to pre-train the single-stream single-stage overall model and optimize the model parameters;
and step 7, carrying out online target tracking on a video sequence using the trained single-stream single-stage overall model.
Referring to fig. 5, the present embodiment further provides a single-stream single-stage network target tracking system based on cascade attention, where the system applies the single-stream single-stage network target tracking method based on cascade attention described above, and the system comprises:
a construction module for:
under the single-stream single-stage framework, constructing a backbone feature extraction and fusion module based on a Transformer network model and a feature enhancement module, wherein the backbone feature extraction and fusion module, the corner head module and the score head prediction module form a single-stream single-stage overall model;
a learning module for:
acquiring a template image and a search picture, wherein the template image comprises an initial template containing the required tracking target and several online templates containing target states;
inputting the template image and the search picture into the single-stream single-stage overall model, and extracting the local feature information corresponding to the template image and the search picture through the backbone feature extraction and fusion module;
inputting the local feature information into the feature enhancement module, and aggregating the local semantic information using cascade attention to achieve feature enhancement, thereby obtaining global context information of the template image and the search picture;
performing a cross-attention calculation on the global context information of the template image and the search picture to realize communication, and obtaining a result feature map;
an extraction module for:
dividing the result feature map into a template image and a search picture, and repeating the feature extraction several times in an iterative manner to obtain a final result feature map;
a calculation module for:
inputting the final result feature map into the corner head module to predict the confidence score of each target position, and determining the position of the tracking target according to the confidence scores, so as to realize target tracking;
inputting the result feature map into the score head prediction module to predict the confidence score of each target state, and determining, according to the confidence score of the target state, whether to take the predicted target state as an online template in the next stage of online tracking;
a pre-training module for:
pre-training the single-stream single-stage overall model on a large-scale data set to optimize the model parameters;
a tracking module for:
carrying out online target tracking on a video sequence using the trained single-stream single-stage overall model.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing examples illustrate only a few embodiments of the invention and are described in detail herein without thereby limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (7)

1. A method for single-stream single-stage network target tracking based on cascade attention, the method comprising the following steps:
step 1, under a single-stream single-stage framework, constructing a backbone feature extraction and fusion module based on a Transformer network model and a feature enhancement module, wherein the backbone feature extraction and fusion module, a corner head module and a score head prediction module form a single-stream single-stage overall model;
step 2, obtaining a template image and a search picture, wherein the template image comprises an initial template containing the required tracking target and several online templates containing target states;
step 3, inputting the template image and the search picture into the single-stream single-stage overall model, and extracting the local feature information corresponding to the template image and the search picture through the backbone feature extraction and fusion module;
inputting the local feature information into the feature enhancement module, and aggregating the local semantic information using cascade attention to achieve feature enhancement, thereby obtaining global context information of the template image and the search picture;
performing a cross-attention calculation on the global context information of the template image and the search picture to realize communication, and obtaining a result feature map;
step 4, dividing the result feature map into a template image and a search picture, and repeating step 3 several times in an iterative manner to obtain a final result feature map;
step 5, inputting the final result feature map into the corner head module to predict the confidence score of each target position, and determining the position of the tracking target according to the confidence scores, so as to realize target tracking;
inputting the result feature map into the score head prediction module to predict the confidence score of each target state, and determining, according to the confidence score of the target state, whether to take the predicted target state as an online template in the next stage of online tracking;
step 6, repeating steps 2 to 4 on a large-scale data set to pre-train the single-stream single-stage overall model and optimize the model parameters;
and step 7, carrying out online target tracking on a video sequence using the trained single-stream single-stage overall model.
2. The single-stream single-stage network target tracking method based on cascade attention according to claim 1, wherein in step 3, the template image and the search picture are input into the single-stream single-stage overall model, and the method for extracting the local feature information corresponding to the template image and the search picture through the backbone feature extraction and fusion module specifically comprises the following steps:
for each template image, its initial shape is $H_z \times W_z \times C_z$; a word-embedding operation is first performed on it, followed by a convolution operation, giving a result of shape $H_z' \times W_z' \times C$; the result is then stretched from two dimensions to one dimension, i.e. reshaped to $N_z \times C$, where $N_z = H_z' \times W_z'$ denotes the length of the template token, $H_z$, $W_z$, $C_z$ respectively denote the height, width and channel number of the initial template image, and $H_z'$, $W_z'$, $C$ respectively denote the height, width and channel number of the template image after convolution;
for each search image, its initial shape is $H_x \times W_x \times C_x$; similarly, a word-embedding operation is performed on it, followed by a convolution operation, giving a result of shape $H_x' \times W_x' \times C$; the result is then stretched from two dimensions to one dimension, i.e. reshaped to $N_x \times C$, where $N_x = H_x' \times W_x'$ denotes the length of the search token, $H_x$, $W_x$, $C_x$ respectively denote the height, width and channel number of the initial search image, and $H_x'$, $W_x'$, $C$ respectively denote the height, width and channel number of the search image after convolution.
3. The single-stream single-stage network target tracking method based on cascade attention according to claim 2, wherein in step 3, the method for aggregating the local semantic information by cascade attention to achieve feature enhancement and obtain the global context information of the template image and the search picture specifically comprises the following steps:
the total input token comprises template tokens and a search token, the template tokens comprising an initial template token and several online template tokens; the template tokens and the search token are concatenated, the concatenation process having the following relation:
$X = \mathrm{Concat}(Z_0, Z_1, \ldots, Z_m, S)$
where $\mathrm{Concat}(\cdot)$ denotes the concatenation function, $X$ denotes the total input token, $Z_0$ denotes the initial template token in the total input token, $Z_1, \ldots, Z_m$ denote the several online template tokens in the total input token, each of shape $N_z \times C$, and $S$ denotes the search token in the total input token, of shape $N_x \times C$;
each part of the total input token is converted into a two-dimensional image, the conversion process having the following relation:
$I_i = \mathrm{Img2D}(X_i)$
where $I_i$ denotes the $i$-th two-dimensional image, $X_i$ denotes the $i$-th input token, and $\mathrm{Img2D}(\cdot)$ denotes the function converting a one-dimensional vector into a two-dimensional picture;
the two-dimensional images are input into the self-attention enhancement function for feature extraction, obtaining an enhancement token corresponding to each image, the process of feature extraction with the self-attention enhancement function having the following relation:
$E_i = \mathrm{SAE}(I_i)$
where $\mathrm{SAE}(\cdot)$ denotes the self-attention enhancement function and $E_i$ denotes the $i$-th enhancement token;
the enhancement tokens related to the template image parts are connected, the enhancement-token connection process having the following relation:
$E_z = \mathrm{Concat}(E_0, E_1, \ldots, E_m)$
where $E_z$ denotes the template result token obtained by concatenating the tokens of the template image parts, and $\mathrm{Concat}(\cdot)$ denotes the concatenation function.
4. The single-stream single-stage network target tracking method based on cascade attention according to claim 3, wherein in step 3, the method for performing the cross-attention calculation on the global context information of the template image and the search picture to realize communication and obtain the result feature map specifically comprises the following steps:
a query, key and value are generated from the template result token and the enhancement token corresponding to the search image, the generation process having the following relations:
$Q = W_Q(E_x), \quad K = W_K(E_z), \quad V = W_V(E_z)$
where $Q$, $K$, $V$ denote the query, key and value about the enhancement tokens, and $W_Q$, $W_K$, $W_V$ respectively denote the convolution operations producing the query, key and value from the enhancement tokens;
the cross-attention calculation is carried out on the query, key and value, the cross-attention calculation process having the following relation:
$E_x' = \mathrm{softmax}\!\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V$
where $E_x'$ denotes the search token after cross-attention, $d_k$ denotes the dimension of the key, $T$ denotes the matrix transpose, and $\mathrm{softmax}(\cdot)$ denotes the softmax function used to compute the attention weights, converting the raw attention scores into a probability distribution so that the weights of all locations lie between 0 and 1 and sum to 1;
the cross-attention calculation result and the template result token are spliced to obtain the total token, the splicing process having the following relation:
$F = \mathrm{Concat}(E_z, E_x')$
where $F$ denotes the total token after cross-attention;
the total token is passed sequentially through layer normalization and a multi-layer perceptron to obtain the result feature map, the process having the following relations:
$\hat{F} = \mathrm{LN}(F), \qquad Y = F + \mathrm{MLP}(\hat{F})$
where $\hat{F}$ denotes the temporarily stored result, $Y$ denotes the output of the current calculation part, $\mathrm{MLP}(\cdot)$ denotes the multi-layer perceptron, and $\mathrm{LN}(\cdot)$ denotes the layer normalization function.
5. The single-stream single-stage network target tracking method based on cascade attention according to claim 4, wherein the method for inputting the two-dimensional images into the self-attention enhancement function for feature extraction and obtaining the enhancement token corresponding to each image specifically comprises the following steps:
the $i$-th two-dimensional image $X_i$ and the output token $H_{i-1}$ of the $(i-1)$-th attention head are added to form the new $i$-th two-dimensional image $X_i'$, the process expression being as follows:
$X_i' = X_i + H_{i-1}$
where $H_{i-1}$ denotes the $(i-1)$-th attention head and $X_i'$ denotes the new $i$-th two-dimensional image;
the new $i$-th attention head, hereinafter denoted $H_i$, is calculated from the $i$-th two-dimensional image $X_i'$ in a self-attention manner, the process expression being as follows:
$H_i = \mathrm{Attn}\!\left(W_i^Q(X_i'), W_i^K(X_i'), W_i^V(X_i')\right) = \mathrm{softmax}\!\left(\dfrac{Q_iK_i^T}{\sqrt{d_k}}\right)V_i$
where $\mathrm{Attn}(\cdot)$ denotes the self-attention function, $W_i^Q$, $W_i^K$, $W_i^V$ respectively denote the convolution operations producing the query, key and value of the $i$-th enhancement token, and $d_k$ denotes the dimension of the key;
after the outputs of all the new attention heads are connected by a convolution operation, an ordinary convolution operation is applied to strengthen the local information of the features, obtaining the enhancement token corresponding to each image, the process expression being as follows:
$E = \mathrm{Conv}\!\left(\mathrm{Concat}(H_1, H_2, \ldots, H_h)\right)$
where $\mathrm{Conv}(\cdot)$ denotes an ordinary convolution operation.
6. The single-stream single-stage network target tracking method based on cascade attention according to claim 5, wherein in step 5, the method for inputting the result feature map into the score head prediction module to predict the confidence score of each target state, and determining according to the confidence score of the target state whether to take the predicted target state as an online template in the next stage of online tracking, comprises the following steps:
the learnable score token $s$ generates a query participating in attending to the search region-of-interest tokens, the process expression being as follows:
$Q_s = W_Q^s(s)$
where $W_Q^s$ denotes the query convolution operation on $s$, $Q_s$ denotes the query about the search region-of-interest tokens, and $s$ denotes the learnable score token, of shape $1 \times C$;
important region features are adaptively extracted from the result feature map, and keys and values are generated from the important region features, the process expressions being as follows:
$R = \mathrm{ROI}(Y), \quad K_r = W_K^r(R), \quad V_r = W_V^r(R)$
where $Y$ denotes the result feature map; $K_r$, $V_r$ respectively denote the feature keys and values; $\mathrm{ROI}(\cdot)$ denotes the ROI function used to adaptively extract the important region features; $R$ denotes the important region features; $W_K^r$ denotes the key convolution operation on the important region features; and $W_V^r$ denotes the value convolution operation on the important region features;
the attention weight is calculated from the query, key and value, and passed sequentially through a multi-layer perceptron and a Sigmoid activation function to obtain the prediction score, the process expressions being as follows:
$A = \mathrm{softmax}\!\left(\dfrac{Q_sK_r^T}{\sqrt{d_k}}\right)V_r, \qquad \mathrm{score} = \mathrm{Sigmoid}(\mathrm{MLP}(A))$
where $A$ denotes the attention weight output by the attention, $\mathrm{Sigmoid}(\cdot)$ denotes the activation function used to generate a score between 0 and 1, and $\mathrm{score}$ denotes the confidence score;
when the confidence score is lower than 0.5, the online template is considered negative, indicating that the online template is not updated; otherwise it is considered positive, indicating that the online template is updated and the predicted target state is taken as the online template in the next stage of online tracking.
7. A single-stream single-stage network target tracking system based on cascade attention, wherein the system applies the single-stream single-stage network target tracking method based on cascade attention according to any one of claims 1 to 6, the system comprising:
a construction module for:
under the single-stream single-stage framework, constructing a backbone feature extraction and fusion module based on a Transformer network model and a feature enhancement module, wherein the backbone feature extraction and fusion module, the corner head module and the score head prediction module form a single-stream single-stage overall model;
a learning module for:
acquiring a template image and a search picture, wherein the template image comprises an initial template containing the required tracking target and several online templates containing target states;
inputting the template image and the search picture into the single-stream single-stage overall model, and extracting the local feature information corresponding to the template image and the search picture through the backbone feature extraction and fusion module;
inputting the local feature information into the feature enhancement module, and aggregating the local semantic information using cascade attention to achieve feature enhancement, thereby obtaining global context information of the template image and the search picture;
performing a cross-attention calculation on the global context information of the template image and the search picture to realize communication, and obtaining a result feature map;
an extraction module for:
dividing the result feature map into a template image and a search picture, and repeating the feature extraction several times in an iterative manner to obtain a final result feature map;
a calculation module for:
inputting the final result feature map into the corner head module to predict the confidence score of each target position, and determining the position of the tracking target according to the confidence scores, so as to realize target tracking;
inputting the result feature map into the score head prediction module to predict the confidence score of each target state, and determining, according to the confidence score of the target state, whether to take the predicted target state as an online template in the next stage of online tracking;
a pre-training module for:
pre-training the single-stream single-stage overall model on a large-scale data set to optimize the model parameters;
a tracking module for:
carrying out online target tracking on a video sequence using the trained single-stream single-stage overall model.
CN202410106560.9A 2024-01-25 2024-01-25 Single-flow single-stage network target tracking method and system based on cascade attention Active CN117649582B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410106560.9A CN117649582B (en) 2024-01-25 2024-01-25 Single-flow single-stage network target tracking method and system based on cascade attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410106560.9A CN117649582B (en) 2024-01-25 2024-01-25 Single-flow single-stage network target tracking method and system based on cascade attention

Publications (2)

Publication Number Publication Date
CN117649582A true CN117649582A (en) 2024-03-05
CN117649582B CN117649582B (en) 2024-04-19

Family ID: 90049767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410106560.9A Active CN117649582B (en) 2024-01-25 2024-01-25 Single-flow single-stage network target tracking method and system based on cascade attention

Country Status (1)

Country Link
CN (1) CN117649582B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230184927A1 (en) * 2021-12-15 2023-06-15 Anhui University Contextual visual-based sar target detection method and apparatus, and storage medium
CN115619822A (en) * 2022-09-14 2023-01-17 浙江工业大学 Tracking method based on object-level transformation neural network
CN116030097A (en) * 2023-02-28 2023-04-28 南昌工程学院 Target tracking method and system based on dual-attention feature fusion network
CN116485839A (en) * 2023-04-06 2023-07-25 常州工学院 Visual tracking method based on attention self-adaptive selection of transducer
CN116109678A (en) * 2023-04-10 2023-05-12 南昌工程学院 Method and system for tracking target based on context self-attention learning depth network
CN117036770A (en) * 2023-05-19 2023-11-10 北京交通大学 Detection model training and target detection method and system based on cascade attention
CN117315293A (en) * 2023-09-26 2023-12-29 杭州电子科技大学 Transformer-based space-time context target tracking method and system
CN117274883A (en) * 2023-11-20 2023-12-22 南昌工程学院 Target tracking method and system based on multi-head attention optimization feature fusion network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
TIANLING BIAN et al., "VTT: Long-term Visual Tracking with Transformers", 2020 25th International Conference on Pattern Recognition (ICPR), 15 January 2021
YANG Kang et al., "Real-time visual tracking based on dual-attention Siamese network", Journal of Computer Applications, no. 06, 15 January 2019
WANG Yuanyun et al., "Research on object tracking algorithm based on kernel extended dictionary learning", Journal of Nanchang Institute of Technology, vol. 41, no. 4, 31 August 2022

Also Published As

Publication number Publication date
CN117649582B (en) 2024-04-19

Similar Documents

Publication Publication Date Title
CN111489358B (en) Three-dimensional point cloud semantic segmentation method based on deep learning
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
Tian et al. Cctrans: Simplifying and improving crowd counting with transformer
Chandio et al. Precise single-stage detector
Zhang et al. Recent progresses on object detection: a brief review
CN112163498B (en) Method for establishing pedestrian re-identification model with foreground guiding and texture focusing functions and application of method
Xu et al. Aligning correlation information for domain adaptation in action recognition
Shen et al. Vehicle detection in aerial images based on lightweight deep convolutional network and generative adversarial network
CN112070768B (en) Anchor-Free based real-time instance segmentation method
Wang et al. Multiscale deep alternative neural network for large-scale video classification
CN113744311A (en) Twin neural network moving target tracking method based on full-connection attention module
Gao et al. Co-saliency detection with co-attention fully convolutional network
CN115222998B (en) Image classification method
CN112036260A (en) Expression recognition method and system for multi-scale sub-block aggregation in natural environment
CN116862949A (en) Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement
CN112766378A (en) Cross-domain small sample image classification model method focusing on fine-grained identification
Wang et al. TF-SOD: a novel transformer framework for salient object detection
Liu et al. Dunhuang murals contour generation network based on convolution and self-attention fusion
CN116597267B (en) Image recognition method, device, computer equipment and storage medium
CN110942463B (en) Video target segmentation method based on generation countermeasure network
CN117011515A (en) Interactive image segmentation model based on attention mechanism and segmentation method thereof
CN117649582B (en) Single-flow single-stage network target tracking method and system based on cascade attention
Huang et al. Bidirectional tracking scheme for visual object tracking based on recursive orthogonal least squares
CN116403133A (en) Improved vehicle detection algorithm based on YOLO v7
CN113869154B (en) Video actor segmentation method according to language description

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant