CN117649582A - Single-flow single-stage network target tracking method and system based on cascade attention

Info

Publication number: CN117649582A
Application number: CN202410106560.9A
Authority: CN
Prior art keywords: representing, attention, template, token, module
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN117649582B (en)
Inventors: 王员云, 司英振
Assignee (original and current): Nanchang Institute of Technology
Priority / filing date: 2024-01-25
Application filed by Nanchang Institute of Technology; granted as CN117649582B

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a single-stream single-stage network target tracking method and system based on cascade attention. The method first forms a single-stream single-stage overall model. A template image and a search picture are input into the model, and feature extraction is carried out to obtain local feature information; the local semantic information is aggregated using cascade attention to achieve feature enhancement, after which a cross-attention calculation realizes communication and yields a result feature map. The result feature map is re-extracted several times in an iterative manner to obtain a final result feature map, from which the target position and the target state are predicted. Target tracking is realized according to the target position, and the target state determines whether the predicted state is used as an online template in the next stage of online tracking. The invention reduces the computational redundancy of multi-head attention while coping with changes in object appearance in complex scenes, thereby improving target tracking performance.

Description

Single-flow single-stage network target tracking method and system based on cascade attention
Technical Field
The invention relates to the technical field of computer vision and image processing, in particular to a single-flow single-stage network target tracking method and system based on cascade attention.
Background
In the field of computer vision and image processing, visual tracking is a fundamental research task: it aims to precisely locate an arbitrary target in each video frame using only its initial appearance as a reference. It is applied in various fields, including visual positioning, autonomous driving systems, and smart-city technology. However, due to the many challenging factors in real-world scenes, such as partial occlusion, targets leaving the field of view, background clutter, viewpoint changes, and scale changes, designing a robust tracker remains a significant challenge.
Currently, tracking models typically adopt a dual-stream, dual-stage architecture, in which features from the template and the search region are extracted separately. This approach has certain drawbacks, mainly the high computational complexity of traditional attention mechanisms; in addition, local feature information is often ignored when features are extracted with global context. Recently, single-stream architectures have become a viable alternative. These architectures offer faster processing and enhanced feature-fusion capability, with significant success in tracking performance. The reason behind their effectiveness is that the model can establish an unobstructed information flow between the template and the search region at an early stage, in particular from the raw image pair. This helps extract target-specific features and prevents the loss of discriminative information.
The Transformer first proposed a self-attention-based encoder-decoder module for natural language processing. It explores long-range dependencies in a sequence by computing attention weights over query-key-value triples. Owing to this excellent feature-fusion capability, the Transformer structure has been successfully applied to visual tracking with encouraging results. In Transformer-based trackers, global context information is fully explored; however, local information is not fully utilized. To improve the attention mechanism, a new attention module, called cascade attention, is therefore proposed. Its core idea is to enhance the diversity of the features fed to the attention heads.
Disclosure of Invention
In view of the above, the main objective of the present invention is to provide a method and a system for single-stream single-stage network target tracking based on cascade attention, so as to solve the above technical problems.
The invention provides a single-stream single-stage network target tracking method based on cascade attention, which comprises the following steps:
step 1, under a single-stream single-stage framework, constructing a backbone feature extraction and fusion module based on a Transformer network model and a feature enhancement module, wherein the backbone feature extraction and fusion module, a corner head module and a score head prediction module form a single-stream single-stage overall model;
step 2, obtaining a template image and a search picture, wherein the template image comprises an initial template containing the required tracking target and several online templates containing target states;
step 3, inputting the template image and the search picture into the single-stream single-stage overall model, and extracting the local feature information corresponding to the template image and the search picture through the backbone feature extraction and fusion module;
inputting the local feature information into the feature enhancement module, and aggregating the local semantic information using cascade attention to achieve feature enhancement, thereby obtaining global context information of the template image and the search picture;
performing a cross-attention calculation on the global context information of the template image and the search picture to realize communication, and obtaining a result feature map;
step 4, dividing the result feature map into a template image and a search picture, and repeating step 3 several times in an iterative manner to obtain a final result feature map;
step 5, inputting the final result feature map into the corner head module to predict the confidence score of each target position, and determining the position of the tracking target according to the confidence scores, so as to realize target tracking;
inputting the result feature map into the score head prediction module to predict the confidence score of each target state, and determining, according to the confidence score of the target state, whether to take the predicted target state as an online template in the next stage of online tracking;
step 6, repeating steps 2 to 4 on a large-scale data set to pre-train the single-stream single-stage overall model and optimize the model parameters;
and step 7, carrying out online target tracking on a video sequence using the trained single-stream single-stage overall model (a simplified sketch of this overall flow is given after this list).
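Purely for illustration, the overall flow of steps 2 to 7 can be sketched in PyTorch as follows. Every name and choice here (SingleStreamTracker, the TransformerEncoderLayer stand-in for the cascade-attention blocks, the linear corner and score heads, patch size 16, width 768) is a hypothetical simplification, not the patent's actual implementation:

```python
import torch
import torch.nn as nn

class SingleStreamTracker(nn.Module):
    """Toy stand-in for the single-stream single-stage overall model."""
    def __init__(self, dim=768, blocks=4):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, 16, 16)             # step 3: local feature tokens
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, 8, batch_first=True)
            for _ in range(blocks))                         # steps 3-4 stand-in
        self.corner_head = nn.Linear(dim, 4)                # step 5: box prediction
        self.score_head = nn.Linear(dim, 1)                 # step 5: target-state score

    def forward(self, templates, search):
        toks = [self.embed(t).flatten(2).transpose(1, 2) for t in templates + [search]]
        x = torch.cat(toks, dim=1)                          # joint template/search tokens
        for blk in self.blocks:                             # step 4: iterated extraction
            x = blk(x)
        search_tok = x[:, -toks[-1].shape[1]:]              # split off search part
        box = self.corner_head(search_tok.mean(1))          # target position
        score = torch.sigmoid(self.score_head(x.mean(1)))   # target-state confidence
        return box, score

tracker = SingleStreamTracker()
templates = [torch.randn(1, 3, 128, 128)]                   # initial + online templates
box, score = tracker(templates, torch.randn(1, 3, 256, 256))
update_template = score.item() >= 0.5                       # step 5 online-template rule
```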
The invention also provides a single-stream single-stage network target tracking system based on cascade attention, wherein the system applies the single-stream single-stage network target tracking method based on cascade attention described above, and the system comprises:
a construction module for:
under the single-stream single-stage framework, constructing a backbone feature extraction and fusion module based on a Transformer network model and a feature enhancement module, wherein the backbone feature extraction and fusion module, the corner head module and the score head prediction module form a single-stream single-stage overall model;
a learning module for:
acquiring a template image and a search picture, wherein the template image comprises an initial template containing the required tracking target and several online templates containing target states;
inputting the template image and the search picture into the single-stream single-stage overall model, and extracting the local feature information corresponding to the template image and the search picture through the backbone feature extraction and fusion module;
inputting the local feature information into the feature enhancement module, and aggregating the local semantic information using cascade attention to achieve feature enhancement, thereby obtaining global context information of the template image and the search picture;
performing a cross-attention calculation on the global context information of the template image and the search picture to realize communication, and obtaining a result feature map;
an extraction module for:
dividing the result feature map into a template image and a search picture, and repeating the feature extraction several times in an iterative manner to obtain a final result feature map;
a calculation module for:
inputting the final result feature map into the corner head module to predict the confidence score of each target position, and determining the position of the tracking target according to the confidence scores, so as to realize target tracking;
inputting the result feature map into the score head prediction module to predict the confidence score of each target state, and determining, according to the confidence score of the target state, whether to take the predicted target state as an online template in the next stage of online tracking;
a pre-training module for:
pre-training the single-stream single-stage overall model on a large-scale data set to optimize the model parameters;
a tracking module for:
carrying out online target tracking on a video sequence using the trained single-stream single-stage overall model.
Compared with the prior art, the invention has the following beneficial effects:
1. The present invention utilizes cascade attention to provide a different input split for each head and then cascades the output features across the heads. This approach not only reduces the computational redundancy of multi-head attention, but also enhances model capacity by increasing network depth.
2. The invention introduces the score head module for updating the online template, correcting the online template picture online according to the prediction score of the search picture. It can cope with changes in object appearance in complex scenes, better handle difficulties in the tracking process such as severe occlusion, scale change and complex background, and effectively capture temporal information while handling appearance changes, thereby further improving target tracking performance.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a flow chart of the single-stream single-stage network target tracking method based on cascade attention;
FIG. 2 is a block diagram of the single-stream single-stage network target tracking framework based on the cascade attention module according to the present invention;
FIG. 3 is a schematic diagram of the feature enhancement module of the present invention;
FIG. 4 is a schematic diagram of the cascade attention of the present invention;
FIG. 5 is a schematic structural diagram of the single-stream single-stage network target tracking system based on cascade attention according to the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
These and other aspects of embodiments of the invention will be apparent from and elucidated with reference to the description and drawings described hereinafter. In the description and drawings, particular implementations of embodiments of the invention are disclosed in detail as being indicative of some of the ways in which the principles of embodiments of the invention may be employed, but it is understood that the scope of the embodiments of the invention is not limited correspondingly.
Referring to fig. 1 to 2, the present embodiment provides a single-stream single-stage network target tracking method based on cascade attention, which comprises the following steps:
step 1, under a single-stream single-stage framework, constructing a backbone feature extraction and fusion module based on a Transformer network model and a feature enhancement module, wherein the backbone feature extraction and fusion module, a corner head module and a score head prediction module form a single-stream single-stage overall model;
step 2, obtaining a template image and a search picture, wherein the template image comprises an initial template containing the required tracking target and several online templates containing target states;
step 3, inputting the template image and the search picture into the single-stream single-stage overall model, and extracting the local feature information corresponding to the template image and the search picture through the backbone feature extraction and fusion module;
inputting the local feature information into the feature enhancement module, and aggregating the local semantic information using cascade attention to achieve feature enhancement, thereby obtaining global context information of the template image and the search picture;
performing a cross-attention calculation on the global context information of the template image and the search picture to realize communication, and obtaining a result feature map;
In step 3, the template image and the search picture are input into the single-stream single-stage overall model, and the method for extracting the local feature information corresponding to the template image and the search picture through the backbone feature extraction and fusion module specifically comprises the following steps:
for each template image, its initial shape is $H_z \times W_z \times C_z$; a word-embedding operation is first performed on it, followed by a convolution operation, giving a result of shape $H_z' \times W_z' \times C$; the result is then stretched from two dimensions to one dimension, i.e. reshaped to $N_z \times C$, where $N_z = H_z' \times W_z'$ denotes the length of the template token, $H_z$, $W_z$, $C_z$ respectively denote the height, width and channel number of the initial template image, and $H_z'$, $W_z'$, $C$ respectively denote the height, width and channel number of the template image after convolution;
for each search image, its initial shape is $H_x \times W_x \times C_x$; similarly, a word-embedding operation is performed on it, followed by a convolution operation, giving a result of shape $H_x' \times W_x' \times C$; the result is then stretched from two dimensions to one dimension, i.e. reshaped to $N_x \times C$, where $N_x = H_x' \times W_x'$ denotes the length of the search token, $H_x$, $W_x$, $C_x$ respectively denote the height, width and channel number of the initial search image, and $H_x'$, $W_x'$, $C$ respectively denote the height, width and channel number of the search image after convolution.
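A minimal PyTorch sketch of this tokenization is given below; the patch size of 16, the embedding width of 768, the input resolutions and the module name are illustrative assumptions, not values fixed by the patent:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Word-embedding + convolution, then flatten the 2-D feature map to tokens."""
    def __init__(self, in_ch=3, embed_dim=768, patch=16):
        super().__init__()
        # a strided convolution plays the role of the word-embedding/conv step
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, img):                    # img: (B, C_z, H_z, W_z)
        f = self.proj(img)                     # (B, C, H'_z, W'_z)
        return f.flatten(2).transpose(1, 2)    # (B, N_z, C) with N_z = H'_z * W'_z

embed = PatchEmbed()
template = torch.randn(1, 3, 128, 128)         # initial template image
search = torch.randn(1, 3, 256, 256)           # search picture
z_tok, x_tok = embed(template), embed(search)
print(z_tok.shape, x_tok.shape)                # (1, 64, 768) (1, 256, 768)
```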
Referring to fig. 3 to 4, in step 3, the method for aggregating the local semantic information by cascade attention to achieve feature enhancement and obtain the global context information of the template image and the search picture specifically comprises the following steps:
the total input token comprises template tokens and a search token, the template tokens comprising an initial template token and several online template tokens; the template tokens and the search token are concatenated, the concatenation process having the following relation:
$X = \mathrm{Concat}(Z_0, Z_1, \ldots, Z_m, S)$
where $\mathrm{Concat}(\cdot)$ denotes the concatenation function, which by default concatenates along the channel dimension, its purpose being to splice multiple tokens to obtain the final input; $X$ denotes the total input token; $Z_0$ denotes the initial template token in the total input token; $Z_1, \ldots, Z_m$ denote the several online template tokens in the total input token, each of shape $N_z \times C$; and $S$ denotes the search token in the total input token, of shape $N_x \times C$;
each part of the total input token is converted into a two-dimensional image, the conversion process having the following relation:
$I_i = \mathrm{Img2D}(X_i)$
where $I_i$ denotes the $i$-th two-dimensional image, $X_i$ denotes the $i$-th input token, and $\mathrm{Img2D}(\cdot)$ denotes the function converting a one-dimensional vector into a two-dimensional picture; for example, an $X_i$ of shape $N \times C$ becomes of shape $H' \times W' \times C$, where $N = H' \times W'$;
the two-dimensional images are input into the self-attention enhancement function for feature extraction, obtaining an enhancement token corresponding to each image, the process of feature extraction with the self-attention enhancement function having the following relation:
$E_i = \mathrm{SAE}(I_i)$
where $\mathrm{SAE}(\cdot)$ denotes the self-attention enhancement function (Self Attention Enhancement Module) and $E_i$ denotes the $i$-th enhancement token;
the enhancement tokens related to the template image parts are connected, the enhancement-token connection process having the following relation:
$E_z = \mathrm{Concat}(E_0, E_1, \ldots, E_m)$
where $E_z$ denotes the template result token obtained by concatenating the tokens of the template image parts, and $\mathrm{Concat}(\cdot)$ denotes the concatenation function, which by default concatenates along the channel dimension, its purpose being to splice multiple tokens into the final template result token.
In step 3, the method for performing the cross-attention calculation on the global context information of the template image and the search picture to realize communication and obtain the result feature map specifically comprises the following steps:
a query (Query), key (Key) and value (Value) are generated from the template result token and the enhancement token corresponding to the search image, the generation process having the following relations:
$Q = W_Q(E_x), \quad K = W_K(E_z), \quad V = W_V(E_z)$
where $Q$, $K$, $V$ denote the query, key and value about the enhancement tokens, and $W_Q$, $W_K$, $W_V$ respectively denote the convolution operations producing the query, key and value from the enhancement tokens;
the cross-attention calculation is carried out on the query, key and value, the cross-attention calculation process having the following relation:
$E_x' = \mathrm{softmax}\!\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V$
where $E_x'$ denotes the search token after cross-attention, $d_k$ denotes the dimension of the key, $T$ denotes the matrix transpose, and $\mathrm{softmax}(\cdot)$ denotes the softmax function used to compute the attention weights, converting the raw attention scores into a probability distribution so that the weights of all locations lie between 0 and 1 and sum to 1;
the cross-attention calculation result and the template result token are spliced to obtain the total token, the splicing process having the following relation:
$F = \mathrm{Concat}(E_z, E_x')$
where $F$ denotes the total token after cross-attention;
the total token is passed sequentially through layer normalization and a multi-layer perceptron to obtain the result feature map, the process having the following relations:
$\hat{F} = \mathrm{LN}(F), \qquad Y = F + \mathrm{MLP}(\hat{F})$
where $\hat{F}$ denotes the temporarily stored result, $Y$ denotes the output of the current calculation part, $\mathrm{MLP}(\cdot)$ denotes the multi-layer perceptron, and $\mathrm{LN}(\cdot)$ denotes the layer normalization function (Layer Norm).
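For clarity, this communication step can be sketched in PyTorch as follows. The assignment of the query to the search tokens and of the key/value to the template result token, the residual placement around the LayerNorm + MLP, and all module names are assumptions made for illustration, not the patent's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFuse(nn.Module):
    """Cross-attention communication followed by LayerNorm + MLP (a sketch)."""
    def __init__(self, dim=768):
        super().__init__()
        self.wq = nn.Linear(dim, dim)   # W_Q: query from the search enhancement token
        self.wk = nn.Linear(dim, dim)   # W_K: key from the template result token
        self.wv = nn.Linear(dim, dim)   # W_V: value from the template result token
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, e_z, e_x):        # e_z: (B, N_z, C) templates; e_x: (B, N_x, C)
        q, k, v = self.wq(e_x), self.wk(e_z), self.wv(e_z)
        attn = F.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        e_x = attn @ v                           # search token after cross-attention
        f = torch.cat([e_z, e_x], dim=1)         # total token F
        return f + self.mlp(self.norm(f))        # Y = F + MLP(LN(F))

fuse = CrossFuse()
out = fuse(torch.randn(1, 128, 768), torch.randn(1, 256, 768))
print(out.shape)                                 # (1, 384, 768) result feature map
```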
Further, the method for inputting the two-dimensional images into the self-attention enhancement function for feature extraction and obtaining the enhancement token corresponding to each image specifically comprises the following steps:
the $i$-th two-dimensional image $X_i$ and the output token $H_{i-1}$ of the $(i-1)$-th attention head are added to form the new $i$-th two-dimensional image $X_i'$, the process expression being as follows:
$X_i' = X_i + H_{i-1}$
where $H_{i-1}$ denotes the $(i-1)$-th attention head and $X_i'$ denotes the new $i$-th two-dimensional image;
the new $i$-th attention head, hereinafter denoted $H_i$, is calculated from the $i$-th two-dimensional image $X_i'$ in a self-attention manner, the process expression being as follows:
$H_i = \mathrm{Attn}\!\left(W_i^Q(X_i'), W_i^K(X_i'), W_i^V(X_i')\right) = \mathrm{softmax}\!\left(\dfrac{Q_iK_i^T}{\sqrt{d_k}}\right)V_i$
where $\mathrm{Attn}(\cdot)$ denotes the self-attention function, $W_i^Q$, $W_i^K$, $W_i^V$ respectively denote the convolution operations producing the query, key and value of the $i$-th enhancement token, and $d_k$ denotes the dimension of the key;
after the outputs of all the new attention heads are connected by a convolution operation, an ordinary convolution operation is applied to strengthen the local information of the features, obtaining the enhancement token corresponding to each image, the process expression being as follows:
$E = \mathrm{Conv}\!\left(\mathrm{Concat}(H_1, H_2, \ldots, H_h)\right)$
where $\mathrm{Conv}(\cdot)$ denotes an ordinary convolution operation.
In this step, the $i$-th two-dimensional image $X_i$ and the output token $H_{i-1}$ of the $(i-1)$-th attention head are added to serve as the new $i$-th two-dimensional image $X_i'$, and the new $i$-th attention head is calculated from $X_i'$ in a self-attention manner. After all the heads are connected in the form of a convolution operation, $\mathrm{Conv}(\cdot)$, i.e. an ordinary convolution operation, is also applied to enhance the local information of the features. This enables the self-attention mechanism to comprehensively capture local and global relationships, further enhancing the feature representation.
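A minimal sketch of this cascaded-head computation follows, under the assumption of per-head linear projections (equivalent to 1×1 convolutions on tokens) and a depthwise 3×3 convolution for the local-information step; both operators and all names are illustrative choices the patent does not fix:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadedAttention(nn.Module):
    """Cascade attention sketch: each head's input split is augmented with the
    previous head's output; all head outputs are then fused by convolutions."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.heads, self.hd = heads, dim // heads
        self.qkv = nn.ModuleList(
            nn.ModuleDict({k: nn.Linear(self.hd, self.hd) for k in ("q", "k", "v")})
            for _ in range(heads))
        self.proj = nn.Linear(dim, dim)          # connects all new head outputs
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # local info

    def forward(self, x, hw):                    # x: (B, N, C); hw: (H', W') of image
        splits = x.chunk(self.heads, dim=-1)     # one input split per head
        outs, prev = [], 0
        for xi, p in zip(splits, self.qkv):
            xi = xi + prev                       # X'_i = X_i + H_{i-1}
            q, k, v = p["q"](xi), p["k"](xi), p["v"](xi)
            a = F.softmax(q @ k.transpose(-2, -1) / self.hd ** 0.5, dim=-1)
            prev = a @ v                         # H_i
            outs.append(prev)
        e = self.proj(torch.cat(outs, dim=-1))   # concatenate all new heads
        B, N, C = e.shape; H, W = hw
        img = e.transpose(1, 2).reshape(B, C, H, W)
        return self.local(img).reshape(B, C, N).transpose(1, 2)  # ordinary conv

att = CascadedAttention()
tok = torch.randn(1, 64, 768)                    # an 8x8 template token map
print(att(tok, (8, 8)).shape)                    # (1, 64, 768) enhancement token
```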
Step 4, dividing the result feature map into a template image and a search picture, and repeating step 3 several times in an iterative manner to obtain a final result feature map;
step 5, inputting the final result feature map into the corner head module to predict the confidence score of each target position, and determining the position of the tracking target according to the confidence scores, so as to realize target tracking;
inputting the result feature map into the score head prediction module to predict the confidence score of each target state, and determining, according to the confidence score of the target state, whether to take the predicted target state as an online template in the next stage of online tracking;
the method for inputting the result feature map into the score head prediction module to predict the confidence score of each target state, and determining according to the confidence score of the target state whether to take the predicted target state as an online template in the next stage of online tracking, specifically comprises the following steps:
the learnable score token $s$ generates a query participating in attending to the search region-of-interest tokens, the process expression being as follows:
$Q_s = W_Q^s(s)$
where $W_Q^s$ denotes the query convolution operation on $s$, $Q_s$ denotes the query about the search region-of-interest tokens, and $s$ denotes the learnable score token, of shape $1 \times C$;
important region features are adaptively extracted from the result feature map, and keys and values are generated from the important region features, the process expressions being as follows:
$R = \mathrm{ROI}(Y), \quad K_r = W_K^r(R), \quad V_r = W_V^r(R)$
where $Y$ denotes the result feature map; $K_r$, $V_r$ respectively denote the feature keys and values; $\mathrm{ROI}(\cdot)$ denotes the ROI function used to adaptively extract the important region features; $R$ denotes the important region features; $W_K^r$ denotes the key convolution operation on the important region features; and $W_V^r$ denotes the value convolution operation on the important region features;
the attention weight is calculated from the query, key and value, and passed sequentially through a multi-layer perceptron and a Sigmoid activation function to obtain the prediction score, the process expressions being as follows:
$A = \mathrm{softmax}\!\left(\dfrac{Q_sK_r^T}{\sqrt{d_k}}\right)V_r, \qquad \mathrm{score} = \mathrm{Sigmoid}(\mathrm{MLP}(A))$
where $A$ denotes the attention weight output by the attention, $\mathrm{Sigmoid}(\cdot)$ denotes the activation function used to generate a score between 0 and 1, and $\mathrm{score}$ denotes the confidence score;
when the confidence score is lower than 0.5, the online template is considered negative, indicating that the online template is not updated; otherwise it is considered positive, indicating that the online template is updated and the predicted target state is taken as the online template in the next stage of online tracking.
In this step, a learnable score token $s$ is used as a query attending to the search ROI tokens, so that the score token can encode the mined target information. Next, the score token attends to all locations of the initial target token to implicitly compare the mined target with the first target. Finally, the score is generated by the MLP layer and the Sigmoid activation function, and the online template is updated according to the score. In this way, online templates can be effectively screened and updated, improving the accuracy and stability of the tracking system.
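The score-head logic, including the 0.5 update threshold, can be sketched as follows; the top-k token selection standing in for the adaptive ROI step, the MLP width, and all module names are assumptions made for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoreHead(nn.Module):
    """Score head sketch: a learnable score token queries ROI features of the
    result feature map; an MLP + Sigmoid turns the output into a 0-1 score."""
    def __init__(self, dim=768):
        super().__init__()
        self.score_token = nn.Parameter(torch.zeros(1, 1, dim))  # shape 1 x C
        self.wq = nn.Linear(dim, dim)        # query from the score token
        self.wk = nn.Linear(dim, dim)        # key from important-region features
        self.wv = nn.Linear(dim, dim)        # value from important-region features
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def roi(self, feat, k=16):
        # stand-in for the adaptive ROI step: keep the k highest-energy tokens
        idx = feat.norm(dim=-1).topk(k, dim=-1).indices
        return feat.gather(1, idx.unsqueeze(-1).expand(-1, -1, feat.shape[-1]))

    def forward(self, result_feat):          # (B, N, C) result feature map
        r = self.roi(result_feat)            # important region features R
        q = self.wq(self.score_token)
        k, v = self.wk(r), self.wv(r)
        a = F.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1) @ v
        return torch.sigmoid(self.mlp(a)).squeeze(-1)  # confidence in (0, 1)

head = ScoreHead()
conf = head(torch.randn(1, 256, 768))
update_online_template = conf.item() >= 0.5   # threshold rule from the text
print(conf.item(), update_online_template)
```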
Step 6, repeating steps 2 to 4 on a large-scale data set to pre-train the single-stream single-stage overall model and optimize the model parameters;
and step 7, carrying out online target tracking on a video sequence using the trained single-stream single-stage overall model.
Referring to fig. 5, the present embodiment further provides a single-stream single-stage network target tracking system based on cascade attention, where the system applies the single-stream single-stage network target tracking method based on cascade attention described above, and the system comprises:
a construction module for:
under the single-stream single-stage framework, constructing a backbone feature extraction and fusion module based on a Transformer network model and a feature enhancement module, wherein the backbone feature extraction and fusion module, the corner head module and the score head prediction module form a single-stream single-stage overall model;
a learning module for:
acquiring a template image and a search picture, wherein the template image comprises an initial template containing the required tracking target and several online templates containing target states;
inputting the template image and the search picture into the single-stream single-stage overall model, and extracting the local feature information corresponding to the template image and the search picture through the backbone feature extraction and fusion module;
inputting the local feature information into the feature enhancement module, and aggregating the local semantic information using cascade attention to achieve feature enhancement, thereby obtaining global context information of the template image and the search picture;
performing a cross-attention calculation on the global context information of the template image and the search picture to realize communication, and obtaining a result feature map;
an extraction module for:
dividing the result feature map into a template image and a search picture, and repeating the feature extraction several times in an iterative manner to obtain a final result feature map;
a calculation module for:
inputting the final result feature map into the corner head module to predict the confidence score of each target position, and determining the position of the tracking target according to the confidence scores, so as to realize target tracking;
inputting the result feature map into the score head prediction module to predict the confidence score of each target state, and determining, according to the confidence score of the target state, whether to take the predicted target state as an online template in the next stage of online tracking;
a pre-training module for:
pre-training the single-stream single-stage overall model on a large-scale data set to optimize the model parameters;
a tracking module for:
carrying out online target tracking on a video sequence using the trained single-stream single-stage overall model.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing examples illustrate only a few embodiments of the invention and are described in detail herein without thereby limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (7)

1. A method for single-stream single-stage network target tracking based on cascade attention, the method comprising the following steps:
step 1, under a single-stream single-stage framework, constructing a backbone feature extraction and fusion module based on a Transformer network model and a feature enhancement module, wherein the backbone feature extraction and fusion module, a corner head module and a score head prediction module form a single-stream single-stage overall model;
step 2, obtaining a template image and a search picture, wherein the template image comprises an initial template containing the required tracking target and several online templates containing target states;
step 3, inputting the template image and the search picture into the single-stream single-stage overall model, and extracting the local feature information corresponding to the template image and the search picture through the backbone feature extraction and fusion module;
inputting the local feature information into the feature enhancement module, and aggregating the local semantic information using cascade attention to achieve feature enhancement, thereby obtaining global context information of the template image and the search picture;
performing a cross-attention calculation on the global context information of the template image and the search picture to realize communication, and obtaining a result feature map;
step 4, dividing the result feature map into a template image and a search picture, and repeating step 3 several times in an iterative manner to obtain a final result feature map;
step 5, inputting the final result feature map into the corner head module to predict the confidence score of each target position, and determining the position of the tracking target according to the confidence scores, so as to realize target tracking;
inputting the result feature map into the score head prediction module to predict the confidence score of each target state, and determining, according to the confidence score of the target state, whether to take the predicted target state as an online template in the next stage of online tracking;
step 6, repeating steps 2 to 4 on a large-scale data set to pre-train the single-stream single-stage overall model and optimize the model parameters;
and step 7, carrying out online target tracking on a video sequence using the trained single-stream single-stage overall model.
2. The single-stream single-stage network target tracking method based on cascade attention according to claim 1, wherein in step 3, the template image and the search picture are input into the single-stream single-stage overall model, and the method for extracting the local feature information corresponding to the template image and the search picture through the backbone feature extraction and fusion module specifically comprises the following steps:
for each template image, its initial shape is $H_z \times W_z \times C_z$; a word-embedding operation is first performed on it, followed by a convolution operation, giving a result of shape $H_z' \times W_z' \times C$; the result is then stretched from two dimensions to one dimension, i.e. reshaped to $N_z \times C$, where $N_z = H_z' \times W_z'$ denotes the length of the template token, $H_z$, $W_z$, $C_z$ respectively denote the height, width and channel number of the initial template image, and $H_z'$, $W_z'$, $C$ respectively denote the height, width and channel number of the template image after convolution;
for each search image, its initial shape is $H_x \times W_x \times C_x$; similarly, a word-embedding operation is performed on it, followed by a convolution operation, giving a result of shape $H_x' \times W_x' \times C$; the result is then stretched from two dimensions to one dimension, i.e. reshaped to $N_x \times C$, where $N_x = H_x' \times W_x'$ denotes the length of the search token, $H_x$, $W_x$, $C_x$ respectively denote the height, width and channel number of the initial search image, and $H_x'$, $W_x'$, $C$ respectively denote the height, width and channel number of the search image after convolution.
3. The single-stream single-stage network target tracking method based on cascade attention according to claim 2, wherein in step 3, the method for aggregating the local semantic information by cascade attention to achieve feature enhancement and obtain the global context information of the template image and the search picture specifically comprises the following steps:
the total input token comprises template tokens and a search token, the template tokens comprising an initial template token and several online template tokens; the template tokens and the search token are concatenated, the concatenation process having the following relation:
$X = \mathrm{Concat}(Z_0, Z_1, \ldots, Z_m, S)$
where $\mathrm{Concat}(\cdot)$ denotes the concatenation function, $X$ denotes the total input token, $Z_0$ denotes the initial template token in the total input token, $Z_1, \ldots, Z_m$ denote the several online template tokens in the total input token, each of shape $N_z \times C$, and $S$ denotes the search token in the total input token, of shape $N_x \times C$;
each part of the total input token is converted into a two-dimensional image, the conversion process having the following relation:
$I_i = \mathrm{Img2D}(X_i)$
where $I_i$ denotes the $i$-th two-dimensional image, $X_i$ denotes the $i$-th input token, and $\mathrm{Img2D}(\cdot)$ denotes the function converting a one-dimensional vector into a two-dimensional picture;
the two-dimensional images are input into the self-attention enhancement function for feature extraction, obtaining an enhancement token corresponding to each image, the process of feature extraction with the self-attention enhancement function having the following relation:
$E_i = \mathrm{SAE}(I_i)$
where $\mathrm{SAE}(\cdot)$ denotes the self-attention enhancement function and $E_i$ denotes the $i$-th enhancement token;
the enhancement tokens related to the template image parts are connected, the enhancement-token connection process having the following relation:
$E_z = \mathrm{Concat}(E_0, E_1, \ldots, E_m)$
where $E_z$ denotes the template result token obtained by concatenating the tokens of the template image parts, and $\mathrm{Concat}(\cdot)$ denotes the concatenation function.
4. The single-stream single-stage network target tracking method based on cascade attention according to claim 3, wherein in step 3, the method for performing the cross-attention calculation on the global context information of the template image and the search picture to realize communication and obtain the result feature map specifically comprises the following steps:
a query, key and value are generated from the template result token and the enhancement token corresponding to the search image, the generation process having the following relations:
$Q = W_Q(E_x), \quad K = W_K(E_z), \quad V = W_V(E_z)$
where $Q$, $K$, $V$ denote the query, key and value about the enhancement tokens, and $W_Q$, $W_K$, $W_V$ respectively denote the convolution operations producing the query, key and value from the enhancement tokens;
the cross-attention calculation is carried out on the query, key and value, the cross-attention calculation process having the following relation:
$E_x' = \mathrm{softmax}\!\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V$
where $E_x'$ denotes the search token after cross-attention, $d_k$ denotes the dimension of the key, $T$ denotes the matrix transpose, and $\mathrm{softmax}(\cdot)$ denotes the softmax function used to compute the attention weights, converting the raw attention scores into a probability distribution so that the weights of all locations lie between 0 and 1 and sum to 1;
the cross-attention calculation result and the template result token are spliced to obtain the total token, the splicing process having the following relation:
$F = \mathrm{Concat}(E_z, E_x')$
where $F$ denotes the total token after cross-attention;
the total token is passed sequentially through layer normalization and a multi-layer perceptron to obtain the result feature map, the process having the following relations:
$\hat{F} = \mathrm{LN}(F), \qquad Y = F + \mathrm{MLP}(\hat{F})$
where $\hat{F}$ denotes the temporarily stored result, $Y$ denotes the output of the current calculation part, $\mathrm{MLP}(\cdot)$ denotes the multi-layer perceptron, and $\mathrm{LN}(\cdot)$ denotes the layer normalization function.
5. The single-stream single-stage network target tracking method based on cascade attention according to claim 4, wherein the method for inputting the two-dimensional images into the self-attention enhancement function for feature extraction and obtaining the enhancement token corresponding to each image specifically comprises the following steps:
the $i$-th two-dimensional image $X_i$ and the output token $H_{i-1}$ of the $(i-1)$-th attention head are added to form the new $i$-th two-dimensional image $X_i'$, the process expression being as follows:
$X_i' = X_i + H_{i-1}$
where $H_{i-1}$ denotes the $(i-1)$-th attention head and $X_i'$ denotes the new $i$-th two-dimensional image;
the new $i$-th attention head, hereinafter denoted $H_i$, is calculated from the $i$-th two-dimensional image $X_i'$ in a self-attention manner, the process expression being as follows:
$H_i = \mathrm{Attn}\!\left(W_i^Q(X_i'), W_i^K(X_i'), W_i^V(X_i')\right) = \mathrm{softmax}\!\left(\dfrac{Q_iK_i^T}{\sqrt{d_k}}\right)V_i$
where $\mathrm{Attn}(\cdot)$ denotes the self-attention function, $W_i^Q$, $W_i^K$, $W_i^V$ respectively denote the convolution operations producing the query, key and value of the $i$-th enhancement token, and $d_k$ denotes the dimension of the key;
after the outputs of all the new attention heads are connected by a convolution operation, an ordinary convolution operation is applied to strengthen the local information of the features, obtaining the enhancement token corresponding to each image, the process expression being as follows:
$E = \mathrm{Conv}\!\left(\mathrm{Concat}(H_1, H_2, \ldots, H_h)\right)$
where $\mathrm{Conv}(\cdot)$ denotes an ordinary convolution operation.
6. The single-stream single-stage network target tracking method based on cascade attention according to claim 5, wherein in step 5, the method for inputting the result feature map into the score head prediction module to predict the confidence score of each target state, and determining according to the confidence score of the target state whether to take the predicted target state as an online template in the next stage of online tracking, comprises the following steps:
the learnable score token $s$ generates a query participating in attending to the search region-of-interest tokens, the process expression being as follows:
$Q_s = W_Q^s(s)$
where $W_Q^s$ denotes the query convolution operation on $s$, $Q_s$ denotes the query about the search region-of-interest tokens, and $s$ denotes the learnable score token, of shape $1 \times C$;
important region features are adaptively extracted from the result feature map, and keys and values are generated from the important region features, the process expressions being as follows:
$R = \mathrm{ROI}(Y), \quad K_r = W_K^r(R), \quad V_r = W_V^r(R)$
where $Y$ denotes the result feature map; $K_r$, $V_r$ respectively denote the feature keys and values; $\mathrm{ROI}(\cdot)$ denotes the ROI function used to adaptively extract the important region features; $R$ denotes the important region features; $W_K^r$ denotes the key convolution operation on the important region features; and $W_V^r$ denotes the value convolution operation on the important region features;
the attention weight is calculated from the query, key and value, and passed sequentially through a multi-layer perceptron and a Sigmoid activation function to obtain the prediction score, the process expressions being as follows:
$A = \mathrm{softmax}\!\left(\dfrac{Q_sK_r^T}{\sqrt{d_k}}\right)V_r, \qquad \mathrm{score} = \mathrm{Sigmoid}(\mathrm{MLP}(A))$
where $A$ denotes the attention weight output by the attention, $\mathrm{Sigmoid}(\cdot)$ denotes the activation function used to generate a score between 0 and 1, and $\mathrm{score}$ denotes the confidence score;
when the confidence score is lower than 0.5, the online template is considered negative, indicating that the online template is not updated; otherwise it is considered positive, indicating that the online template is updated and the predicted target state is taken as the online template in the next stage of online tracking.
7. A single-stream single-stage network target tracking system based on cascade attention, wherein the system applies the single-stream single-stage network target tracking method based on cascade attention according to any one of claims 1 to 6, the system comprising:
a construction module for:
under the single-stream single-stage framework, constructing a backbone feature extraction and fusion module based on a Transformer network model and a feature enhancement module, wherein the backbone feature extraction and fusion module, the corner head module and the score head prediction module form a single-stream single-stage overall model;
a learning module for:
acquiring a template image and a search picture, wherein the template image comprises an initial template containing the required tracking target and several online templates containing target states;
inputting the template image and the search picture into the single-stream single-stage overall model, and extracting the local feature information corresponding to the template image and the search picture through the backbone feature extraction and fusion module;
inputting the local feature information into the feature enhancement module, and aggregating the local semantic information using cascade attention to achieve feature enhancement, thereby obtaining global context information of the template image and the search picture;
performing a cross-attention calculation on the global context information of the template image and the search picture to realize communication, and obtaining a result feature map;
an extraction module for:
dividing the result feature map into a template image and a search picture, and repeating the feature extraction several times in an iterative manner to obtain a final result feature map;
a calculation module for:
inputting the final result feature map into the corner head module to predict the confidence score of each target position, and determining the position of the tracking target according to the confidence scores, so as to realize target tracking;
inputting the result feature map into the score head prediction module to predict the confidence score of each target state, and determining, according to the confidence score of the target state, whether to take the predicted target state as an online template in the next stage of online tracking;
a pre-training module for:
pre-training the single-stream single-stage overall model on a large-scale data set to optimize the model parameters;
a tracking module for:
carrying out online target tracking on a video sequence using the trained single-stream single-stage overall model.
CN202410106560.9A 2024-01-25 2024-01-25 Single-flow single-stage network target tracking method and system based on cascade attention Active CN117649582B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410106560.9A CN117649582B (en) 2024-01-25 2024-01-25 Single-flow single-stage network target tracking method and system based on cascade attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410106560.9A CN117649582B (en) 2024-01-25 2024-01-25 Single-flow single-stage network target tracking method and system based on cascade attention

Publications (2)

Publication Number Publication Date
CN117649582A true CN117649582A (en) 2024-03-05
CN117649582B CN117649582B (en) 2024-04-19

Family ID: 90049767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410106560.9A Active CN117649582B (en) 2024-01-25 2024-01-25 Single-flow single-stage network target tracking method and system based on cascade attention

Country Status (1)

Country Link
CN (1) CN117649582B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230184927A1 (en) * 2021-12-15 2023-06-15 Anhui University Contextual visual-based sar target detection method and apparatus, and storage medium
CN115619822A (en) * 2022-09-14 2023-01-17 浙江工业大学 Tracking method based on object-level transformation neural network
CN116030097A (en) * 2023-02-28 2023-04-28 南昌工程学院 Target tracking method and system based on dual-attention feature fusion network
CN116485839A (en) * 2023-04-06 2023-07-25 常州工学院 Visual tracking method based on attention self-adaptive selection of transducer
CN116109678A (en) * 2023-04-10 2023-05-12 南昌工程学院 Method and system for tracking target based on context self-attention learning depth network
CN117036770A (en) * 2023-05-19 2023-11-10 北京交通大学 Detection model training and target detection method and system based on cascade attention
CN117315293A (en) * 2023-09-26 2023-12-29 杭州电子科技大学 Transformer-based space-time context target tracking method and system
CN117274883A (en) * 2023-11-20 2023-12-22 南昌工程学院 Target tracking method and system based on multi-head attention optimization feature fusion network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
TIANLING BIAN et al., "VTT: Long-term Visual Tracking with Transformers", 2020 25th International Conference on Pattern Recognition (ICPR), 15 January 2021
YANG Kang et al., "Real-time visual tracking based on dual-attention Siamese network", Journal of Computer Applications, no. 06, 15 January 2019
WANG Yuanyun et al., "Research on object tracking algorithm based on kernel extended dictionary learning", Journal of Nanchang Institute of Technology, vol. 41, no. 4, 31 August 2022

Also Published As

Publication number Publication date
CN117649582B (en) 2024-04-19

Similar Documents

Publication Publication Date Title
CN111489358B (en) Three-dimensional point cloud semantic segmentation method based on deep learning
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
Tian et al. Cctrans: Simplifying and improving crowd counting with transformer
Chandio et al. Precise single-stage detector
Zhang et al. Recent progresses on object detection: a brief review
CN112163498B (en) Method for establishing pedestrian re-identification model with foreground guiding and texture focusing functions and application of method
Xu et al. Aligning correlation information for domain adaptation in action recognition
Shen et al. Vehicle detection in aerial images based on lightweight deep convolutional network and generative adversarial network
CN112070768B (en) Anchor-Free based real-time instance segmentation method
Wang et al. Multiscale deep alternative neural network for large-scale video classification
CN113744311A (en) Twin neural network moving target tracking method based on full-connection attention module
Gao et al. Co-saliency detection with co-attention fully convolutional network
CN115222998B (en) Image classification method
CN112036260A (en) Expression recognition method and system for multi-scale sub-block aggregation in natural environment
CN116862949A (en) Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement
CN112766378A (en) Cross-domain small sample image classification model method focusing on fine-grained identification
Wang et al. TF-SOD: a novel transformer framework for salient object detection
Liu et al. Dunhuang murals contour generation network based on convolution and self-attention fusion
CN116597267B (en) Image recognition method, device, computer equipment and storage medium
CN110942463B (en) Video target segmentation method based on generation countermeasure network
CN117011515A (en) Interactive image segmentation model based on attention mechanism and segmentation method thereof
CN117649582B (en) Single-flow single-stage network target tracking method and system based on cascade attention
Huang et al. Bidirectional tracking scheme for visual object tracking based on recursive orthogonal least squares
CN116403133A (en) Improved vehicle detection algorithm based on YOLO v7
CN113869154B (en) Video actor segmentation method according to language description

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant