CN117649582B - Single-flow single-stage network target tracking method and system based on cascade attention - Google Patents

Single-flow single-stage network target tracking method and system based on cascade attention

Info

Publication number
CN117649582B
CN117649582B (application CN202410106560.9A)
Authority
CN
China
Prior art keywords
representing
attention
template
token
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410106560.9A
Other languages
Chinese (zh)
Other versions
CN117649582A (en
Inventor
王员云
司英振
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang Institute of Technology
Original Assignee
Nanchang Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang Institute of Technology filed Critical Nanchang Institute of Technology
Priority to CN202410106560.9A priority Critical patent/CN117649582B/en
Publication of CN117649582A publication Critical patent/CN117649582A/en
Application granted granted Critical
Publication of CN117649582B publication Critical patent/CN117649582B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 10/806: Image or video recognition or understanding; fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level, of extracted features
    • G06N 3/045: Computing arrangements based on biological models; neural networks; combinations of networks
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06V 10/26: Image preprocessing; segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region; detection of occlusion
    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning; using neural networks
    • G06V 20/70: Scenes; scene-specific elements; labelling scene content, e.g. deriving syntactic or semantic representations
    • G06V 2201/07: Indexing scheme relating to image or video recognition or understanding; target detection


Abstract

The invention provides a single-flow single-stage network target tracking method and system based on cascade attention. The method first forms a single-flow single-stage overall model, inputs a template image and a search picture into the model, and performs feature extraction to obtain local feature information; the local semantic information is aggregated using cascade attention to achieve feature enhancement, and cross-attention calculation is then performed to realize communication between the template and the search region, yielding a result feature map. The result feature map is re-extracted several times in an iterative manner to obtain a final result feature map, from which the target position and the target state are predicted; target tracking is realized according to the target position, and the target state determines whether the prediction is used as an online template in the next stage of online tracking. The invention reduces the computational redundancy of multi-head attention while coping with changes in object appearance in complex scenes, thereby improving target tracking performance.

Description

Single-flow single-stage network target tracking method and system based on cascade attention
Technical Field
The invention relates to the technical field of computer vision and image processing, in particular to a single-flow single-stage network target tracking method and system based on cascade attention.
Background
In the field of computer vision and image processing, visual tracking is a fundamental research task whose goal is to precisely locate an arbitrary target in every video frame using only its initial appearance as a reference. It is applied in various fields, including visual positioning, autonomous driving systems and smart-city technology. However, owing to the many challenging factors in real-world scenes, such as partial occlusion, targets leaving the field of view, background clutter, viewpoint changes and scale changes, designing a robust tracker remains a significant challenge.
Currently, tracking models typically adopt a dual-stream, dual-stage architecture in which features of the template and of the search region are extracted separately. This approach has certain drawbacks, mainly the high computational complexity of traditional attention mechanisms; moreover, local feature information is often ignored when features are extracted with global context. Recently, single-stream architectures have become a viable alternative: they offer faster processing and stronger feature-fusion capability, with notable success in tracking performance. Their effectiveness stems from establishing an unobstructed information flow between the template and the search region at an early stage, in particular starting from the original image pair, which helps extract target-specific features and prevents the loss of discriminative information.
The Transformer first proposed a self-attention-based encoder-decoder module for natural language processing. It explores long-range dependencies in a sequence by computing attention weights over query-key-value triples. Owing to its excellent feature-fusion capability, the Transformer structure has been successfully applied to visual tracking with encouraging results. In Transformer-based trackers, global context information is fully explored, but local information is not fully exploited. To improve the attention mechanism, a new attention module, called cascade attention, is therefore proposed; its core idea is to enhance the diversity of the features fed to each attention head.
Disclosure of Invention
In view of the above, the main objective of the present invention is to provide a method and a system for single-stream single-stage network target tracking based on cascade attention, so as to solve the above technical problems.
The invention provides a single-flow single-stage network target tracking method based on cascade attention, which comprises the following steps:
Step 1, under a single-stream single-stage framework, constructing and obtaining a trunk feature extraction and fusion module based on a Transformer network model and a feature enhancement module, wherein the trunk feature extraction and fusion module, a head corner module and a score head prediction module form a single-stream single-stage integral model;
Step 2, obtaining a template image and a search picture, wherein the template image comprises an initial template containing the required tracking targets and a plurality of online templates containing target states;
Step 3, inputting the template image and the search picture into a single-stream single-stage integral model, and extracting local feature information corresponding to the template image and the search picture through a trunk feature extraction and fusion module;
inputting the local feature information into a feature enhancement module, and aggregating the local semantic information by using cascade attention to realize feature enhancement, so as to obtain global context information of a template image and a search picture;
Performing cross attention calculation on global context information of the template image and the search picture to realize communication, and obtaining a result feature map;
Step 4, dividing the result feature map into a template image and a search picture, and repeating the step 3 for a plurality of times in an iterative mode to obtain a final result feature map;
step 5, inputting the final result feature map into a head corner module to predict the confidence score of each target position, and determining the position of the tracking target according to the confidence score so as to realize target tracking;
inputting the result feature map into a score head prediction module to predict the confidence score of each target state, and determining whether to take the predicted target state as an online template in the online tracking process of the next stage according to the confidence score of the target state;
step 6, repeating the steps 2 to 4 by using the large-scale data set as a basis, and pre-training the single-flow single-stage integral model to optimize model parameters;
and 7, carrying out target online tracking on the video sequence by utilizing the trained single-stream single-stage integral model.
The invention also provides a single-flow single-stage network target tracking system based on cascade attention, wherein the system applies the single-flow single-stage network target tracking method based on cascade attention, and the system comprises:
A construction module for:
Under a single-flow single-stage framework, constructing and obtaining a trunk feature extraction and fusion module based on a Transformer network model and a feature enhancement module, wherein the trunk feature extraction and fusion module, the head corner module and the score head prediction module form a single-stream single-stage integral model;
a learning module for:
acquiring a template image and a search picture, wherein the template image comprises an initial template containing a plurality of required tracking targets and a plurality of online templates containing target states;
inputting the template image and the search picture into a single-stream single-stage integral model, and extracting local feature information corresponding to the template image and the search picture through a trunk feature extraction and fusion module;
inputting the local feature information into a feature enhancement module, and aggregating the local semantic information by using cascade attention to realize feature enhancement, so as to obtain global context information of a template image and a search picture;
Performing cross attention calculation on global context information of the template image and the search picture to realize communication, and obtaining a result feature map;
An extraction module for:
dividing the result feature map into a template image and a search picture, and repeating feature extraction for a plurality of times in an iterative mode to obtain a final result feature map;
A calculation module for:
inputting the final result feature map into a head corner module to predict the confidence score of each target position, and determining the position of a tracking target according to the confidence score so as to realize target tracking;
inputting the result feature map into a score head prediction module to predict the confidence score of each target state, and determining whether to take the predicted target state as an online template in the online tracking process of the next stage according to the confidence score of the target state;
a pre-training module for:
Pre-training the single-flow single-stage integral model by using the large-scale data set as a base to optimize model parameters;
A tracking module for:
and carrying out target online tracking on the video sequence by utilizing the trained single-stream single-stage integral model.
Compared with the prior art, the invention has the following beneficial effects:
1. The present invention utilizes cascading attention to provide different input splits for each head and then cascades output features onto those heads. This approach not only reduces computational redundancy for multi-head attention, but also enhances the capacity of the model by increasing network depth.
2. The invention introduces the fractional head module for updating the online template, corrects the online template picture on line according to the prediction score of the search picture, can cope with the change of the appearance of the object in a complex scene, can better handle the difficulties of serious shielding, scale change, complex background and the like in the tracking process, effectively captures time information and processes the change of the appearance of the object, and further improves the performance of target tracking.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a flow chart of a single-flow single-stage network target tracking method based on cascade attention;
FIG. 2 is a block diagram of a single-flow single-phase network target tracking framework based on a cascade attention module according to the present invention;
FIG. 3 is a schematic diagram of a feature enhancement module of the present invention;
FIG. 4 is a schematic diagram of the cascade attention of the present invention;
Fig. 5 is a schematic structural diagram of a single-flow single-stage network target tracking system based on cascade attention according to the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
These and other aspects of embodiments of the invention will be apparent from and elucidated with reference to the description and drawings described hereinafter. In the description and drawings, particular implementations of embodiments of the invention are disclosed in detail as being indicative of some of the ways in which the principles of embodiments of the invention may be employed, but it is understood that the scope of the embodiments of the invention is not limited correspondingly.
Referring to fig. 1 to 2, the present embodiment provides a single-flow single-stage network target tracking method based on cascade attention, which includes the following steps:
Step 1, under a single-stream single-stage framework, constructing and obtaining a trunk feature extraction and fusion module based on a Transformer network model and a feature enhancement module, wherein the trunk feature extraction and fusion module, a head corner module and a score head prediction module form a single-stream single-stage integral model;
Step 2, obtaining a template image and a search picture, wherein the template image comprises an initial template containing the required tracking targets and a plurality of online templates containing target states;
Step 3, inputting the template image and the search picture into a single-stream single-stage integral model, and extracting local feature information corresponding to the template image and the search picture through a trunk feature extraction and fusion module;
inputting the local feature information into a feature enhancement module, and aggregating the local semantic information by using cascade attention to realize feature enhancement, so as to obtain global context information of a template image and a search picture;
Performing cross attention calculation on global context information of the template image and the search picture to realize communication, and obtaining a result feature map;
In the step 3, the template image and the search picture are input into a single-stream single-stage integral model, and the method for extracting the local feature information corresponding to the template image and the search picture through the trunk feature extraction and fusion module specifically comprises the following steps:
for each template image, its initial shape is $H_z \times W_z \times 3$; a word-embedding operation is first performed on it, followed by a convolution operation, giving a result of shape $H'_z \times W'_z \times C$; the result is then stretched from two dimensions to one dimension, i.e. reshaped into $N_z \times C$, where $N_z = H'_z \times W'_z$; wherein $N_z$ represents the length of the template token, $H_z$, $W_z$ and $3$ respectively represent the length, width and channel number of the initial template image, and $H'_z$, $W'_z$ and $C$ respectively represent the length, width and channel number of the template image after convolution;
for each search image, its initial shape is $H_x \times W_x \times 3$; similarly, a word-embedding operation is performed on it, followed by a convolution operation, giving a result of shape $H'_x \times W'_x \times C$; the result is then stretched from two dimensions to one dimension, i.e. reshaped into $N_x \times C$, where $N_x = H'_x \times W'_x$; wherein $N_x$ represents the length of the search token, $H_x$, $W_x$ and $3$ respectively represent the length, width and channel number of the initial search image, and $H'_x$, $W'_x$ and $C$ respectively represent the length, width and channel number of the search image after convolution;
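For illustration only, the following is a minimal PyTorch sketch of the word-embedding and flattening step described above; the 16x down-sampling stride, the 768-channel embedding and the image sizes are illustrative assumptions rather than values fixed by this description.

```python
# Sketch of the patch/word embedding: a strided convolution maps an H x W x 3
# image to an H' x W' x C feature map, which is then flattened to N x C tokens.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, in_chans=3, embed_dim=768, stride=16):  # sizes are assumptions
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=stride, stride=stride)

    def forward(self, img):                        # img: (B, 3, H, W)
        feat = self.proj(img)                      # (B, C, H', W')
        return feat.flatten(2).transpose(1, 2)     # (B, N, C) with N = H' * W'

embed = PatchEmbed()
template = torch.randn(1, 3, 128, 128)             # template image
search = torch.randn(1, 3, 256, 256)               # search picture
z_tokens, x_tokens = embed(template), embed(search)
print(z_tokens.shape, x_tokens.shape)              # (1, 64, 768) and (1, 256, 768)
```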
Referring to fig. 3 to 4, in the step 3, the method for aggregating local semantic information by using cascade attention to achieve feature enhancement and obtaining global context information of a template image and a search picture specifically includes the following steps:
The total input tokens comprise template tokens and search tokens, the template tokens comprising an initial template token and a plurality of online template tokens; the template tokens and the search tokens are spliced, and the splicing process has the following relational expression:

$$X = \mathrm{Concat}(Z_0, Z_1, \dots, Z_n, S)$$

wherein $\mathrm{Concat}(\cdot)$ represents the splicing function, which by default splices along the channel dimension and serves to splice the several tokens into the final input; $X$ represents the total input token; $Z_0$ represents the initial template token in the total input token; $Z_1, \dots, Z_n$ represent the several online template tokens in the total input token, each of shape $N_z \times C$; $S$ represents the search token in the total input token, of shape $N_x \times C$.
The input tokens are converted into two-dimensional images, and the conversion process has the following relation:

$$X_i^{2D} = \mathrm{Reshape2D}(X_i)$$

wherein $X_i^{2D}$ represents the $i$-th two-dimensional image, $X_i$ represents the $i$-th input token, and $\mathrm{Reshape2D}(\cdot)$ represents the function that converts a one-dimensional vector into a two-dimensional picture; for example, if $X_i$ has shape $N \times C$, after the function its shape becomes $H \times W \times C$, where $N = H \times W$.
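The splicing and the one-dimensional to two-dimensional conversion can be sketched as follows; the helper name reshape_2d and the concatenation along the token dimension are assumptions made for the example, not notation taken from this description.

```python
# Sketch of token splicing and the 1-D -> 2-D conversion used before cascade attention.
import torch

def reshape_2d(tokens, h, w):
    """Convert a (B, N, C) token sequence into a (B, C, H, W) map, with N = H * W."""
    b, n, c = tokens.shape
    assert n == h * w
    return tokens.transpose(1, 2).reshape(b, c, h, w)

C = 768                                   # embedding width (assumption)
z0 = torch.randn(1, 64, C)                # initial template token, N_z = 8 * 8
z1 = torch.randn(1, 64, C)                # one online template token
s = torch.randn(1, 256, C)                # search token, N_x = 16 * 16

x_total = torch.cat([z0, z1, s], dim=1)   # splice the tokens into the total input
z0_2d = reshape_2d(z0, 8, 8)              # two-dimensional image form of one token
s_2d = reshape_2d(s, 16, 16)
print(x_total.shape, z0_2d.shape, s_2d.shape)
```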
The two-dimensional images are input into the self-attention enhancement function for feature extraction, obtaining the enhancement token corresponding to each image; the feature extraction with the self-attention enhancement function has the following relation:

$$E_i = \mathrm{SAEM}(X_i^{2D})$$

wherein $\mathrm{SAEM}(\cdot)$ represents the self-attention enhancement function (Self Attention Enhancement Module) and $E_i$ represents the $i$-th enhancement token.
The enhancement tokens belonging to the template image part are connected, and the connection process has the following relation:

$$Z_E = \mathrm{Concat}(E_0, E_1, \dots, E_n)$$

wherein $Z_E$ represents the template result token obtained by splicing the tokens of the template image part, and $\mathrm{Concat}(\cdot)$ represents the splicing function, which by default splices along the channel dimension and serves to splice the several tokens into the final template result token.
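The enhancement stage can be orchestrated as in the sketch below: each token group is enhanced independently and the template results are re-spliced. A plain multi-head self-attention layer stands in here for the cascade self-attention enhancement module detailed further below; it is a placeholder, not the module itself.

```python
# Sketch of the feature-enhancement stage over the token groups.
import torch
import torch.nn as nn

C = 768
saem = nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True)  # stand-in for SAEM

def enhance(tokens):
    out, _ = saem(tokens, tokens, tokens)   # self-attention within one token group
    return out

z0, z1 = torch.randn(1, 64, C), torch.randn(1, 64, C)   # template token groups
s = torch.randn(1, 256, C)                               # search token

e_z0, e_z1, e_s = enhance(z0), enhance(z1), enhance(s)   # per-group enhancement
z_result = torch.cat([e_z0, e_z1], dim=1)                # template result token
print(z_result.shape, e_s.shape)
```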
In the step 3, the method for performing cross attention calculation on global context information of the template image and the search picture to realize communication and obtaining the result feature map specifically comprises the following steps:
a query (Query), key (Key) and value (Value) are generated using the template result token and the enhancement token corresponding to the search image; the generation process of the query, key and value has the following relational expressions:

$$Q = W_Q(E_S), \qquad K = W_K(Z_E), \qquad V = W_V(Z_E)$$

wherein $Q$, $K$ and $V$ represent the query, key and value about the enhanced tokens, $E_S$ represents the enhancement token corresponding to the search image, and $W_Q$, $W_K$ and $W_V$ respectively represent the convolution operations producing the query, key and value from the enhanced tokens;
cross-attention calculation is performed on the query, key and value, and the cross-attention calculation process has the following relation:

$$S' = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

wherein $S'$ represents the search token after cross-attention, $d_k$ represents the dimension of the key, $\top$ represents the matrix transpose, and $\mathrm{Softmax}$ represents the softmax function used to calculate the attention weights, converting the original attention scores into a probability distribution so that the weights of all locations lie between 0 and 1 and sum to 1;
the cross-attention calculation result and the template result token are spliced to obtain the total token, and the splicing process has the following relational expression:

$$X' = \mathrm{Concat}(Z_E, S')$$

wherein $X'$ represents the total token after cross-attention;
the total token is passed sequentially through layer normalization and a multi-layer perceptron to obtain the result feature map, and the process has the following relational expressions:

$$X_{tmp} = \mathrm{LN}(X'), \qquad Y = X' + \mathrm{MLP}(X_{tmp})$$

wherein $X_{tmp}$ represents the temporarily stored result, $Y$ represents the output of the currently calculated part, i.e. the result feature map, $\mathrm{MLP}(\cdot)$ represents the multi-layer perceptron, and $\mathrm{LN}(\cdot)$ represents the layer normalization function (Layer Norm).
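A minimal sketch of this communication step is given below, assuming the query is taken from the enhanced search token and the key and value from the template result token (one possible reading of the text above); the linear projections, residual placement and layer widths are further assumptions.

```python
# Sketch of cross-attention between search and template tokens followed by
# LayerNorm + MLP, producing the result feature map of one stage.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossMix(nn.Module):
    def __init__(self, dim=768, mlp_ratio=4):
        super().__init__()
        self.q = nn.Linear(dim, dim)        # query from the enhanced search token
        self.k = nn.Linear(dim, dim)        # key from the template result token
        self.v = nn.Linear(dim, dim)        # value from the template result token
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, z_result, e_s):
        q, k, v = self.q(e_s), self.k(z_result), self.v(z_result)
        attn = F.softmax(q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5), dim=-1)
        s_cross = attn @ v                               # search token after cross-attention
        x_total = torch.cat([z_result, s_cross], dim=1)  # total token
        return x_total + self.mlp(self.norm(x_total))    # LayerNorm + MLP with residual

block = CrossMix()
out = block(torch.randn(1, 128, 768), torch.randn(1, 256, 768))
print(out.shape)   # (1, 384, 768): split back into template and search parts for the next iteration
```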
Further, the method for inputting the two-dimensional images into the self-attention enhancing function to extract the features and obtain the enhancing tokens corresponding to each image specifically comprises the following steps:
The $i$-th two-dimensional image $X_i^{2D}$ and the output token $T_{i-1}$ of the $(i-1)$-th attention head are added to form the new $i$-th two-dimensional image $\hat{X}_i$; the process expression is as follows:

$$\hat{X}_i = X_i^{2D} + T_{i-1}$$

wherein $T_{i-1}$ represents the output of the $(i-1)$-th attention head and $\hat{X}_i$ represents the new $i$-th two-dimensional image;
in a self-attention manner, the new $i$-th two-dimensional image $\hat{X}_i$ is taken as a new input to calculate the new $i$-th attention head, hereinafter referred to as $T_i$; the process expression is as follows:

$$T_i = \mathrm{Attn}\big(W_Q^{i}\hat{X}_i,\; W_K^{i}\hat{X}_i,\; W_V^{i}\hat{X}_i\big) = \mathrm{Softmax}\!\left(\frac{(W_Q^{i}\hat{X}_i)(W_K^{i}\hat{X}_i)^{\top}}{\sqrt{d_k}}\right)W_V^{i}\hat{X}_i$$

wherein $\mathrm{Attn}(\cdot)$ represents the self-attention function, $W_Q^{i}$, $W_K^{i}$ and $W_V^{i}$ respectively represent the convolution operations producing the query, key and value for the $i$-th enhanced token, and $d_k$ represents the dimension of the key;
after the outputs of all the new attention heads are connected in the form of a convolution operation, an ordinary convolution operation is applied to strengthen the local information of the features, obtaining the enhancement token corresponding to each image; the process expression is as follows:

$$E = \mathrm{Conv}\big(\mathrm{Concat}(T_1, T_2, \dots, T_h)\big)$$

wherein $\mathrm{Conv}(\cdot)$ represents an ordinary convolution operation and $h$ represents the number of attention heads.
In this step, the $i$-th two-dimensional image $X_i^{2D}$ and the output token $T_{i-1}$ of the $(i-1)$-th attention head are added to form the new $i$-th two-dimensional image $\hat{X}_i$, and $\hat{X}_i$ is taken as a new input to calculate the new $i$-th attention head $T_i$ in a self-attention manner. After all the heads are connected in the form of a convolution operation, an ordinary convolution operation is applied to strengthen the local information of the features. This enables the self-attention mechanism to comprehensively capture local and global relationships, further enhancing the feature representation.
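The cascade can be sketched as below, with each head receiving its own split of the input plus the output of the previous head, and a final ordinary convolution strengthening local information; the head count, the per-head 1x1 projections and the 3x3 output convolution are assumptions of this sketch.

```python
# Sketch of cascade attention over a two-dimensional feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadeAttention(nn.Module):
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        # per-head projections producing query, key and value from that head's input
        self.qkv = nn.ModuleList(
            [nn.Conv2d(head_dim, 3 * head_dim, 1) for _ in range(num_heads)])
        self.proj = nn.Conv2d(dim, dim, 3, padding=1)   # ordinary convolution for local information

    def forward(self, x):                      # x: (B, C, H, W) two-dimensional image
        outs, prev = [], 0
        for i, split in enumerate(x.chunk(self.num_heads, dim=1)):
            inp = split + prev                 # cascade: add the previous head's output
            q, k, v = self.qkv[i](inp).chunk(3, dim=1)
            b, c, h, w = q.shape
            q, k, v = (t.flatten(2).transpose(1, 2) for t in (q, k, v))
            attn = F.softmax(q @ k.transpose(-2, -1) / (c ** 0.5), dim=-1)
            head = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
            outs.append(head)
            prev = head
        return self.proj(torch.cat(outs, dim=1))   # enhancement token in 2-D form

attn = CascadeAttention()
print(attn(torch.randn(1, 768, 8, 8)).shape)       # (1, 768, 8, 8)
```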
Step 4, dividing the result feature map into a template image and a search picture, and repeating the step 3 for a plurality of times in an iterative mode to obtain a final result feature map;
step 5, inputting the final result feature map into a head corner module to predict the confidence score of each target position, and determining the position of the tracking target according to the confidence score so as to realize target tracking;
inputting the result feature map into a score head prediction module to predict the confidence score of each target state, and determining whether to take the predicted target state as an online template in the online tracking process of the next stage according to the confidence score of the target state;
The method for inputting the result feature map into the score head prediction module to predict the confidence score of each target state and determining whether to take the predicted target state as an online template in the online tracking process of the next stage according to the confidence score of the target state specifically comprises the following steps:
the learnable score token $s$ is used to generate a query that participates in attending to the search region-of-interest token; the process expression is as follows:

$$Q_s = W_Q^{s}(s)$$

wherein $W_Q^{s}$ represents the query convolution operation on $s$, $Q_s$ represents the query about the search region-of-interest token, and $s$ represents the learnable score token, whose shape is $1 \times C$;
important-region features are adaptively extracted from the result feature map, and keys and values are generated from the important-region features; the process expression is as follows:

$$F_r = \mathrm{ROI}(Y), \qquad K_s = W_K^{s}(F_r), \qquad V_s = W_V^{s}(F_r)$$

wherein $Y$ represents the result feature map, $K_s$ and $V_s$ respectively represent the key and value of the important-region features, $\mathrm{ROI}(\cdot)$ represents the ROI function used to adaptively extract important-region features, $F_r$ represents the important-region features, $W_K^{s}$ represents the key convolution operation on the important-region features, and $W_V^{s}$ represents the value convolution operation on the important-region features;
the attention weight is calculated from the query, key and value, and the attention output is passed sequentially through a multi-layer perceptron and a Sigmoid activation function to obtain the prediction score; the process expression is as follows:

$$A = \mathrm{Softmax}\!\left(\frac{Q_s K_s^{\top}}{\sqrt{d_k}}\right)V_s, \qquad \mathit{score} = \mathrm{Sigmoid}\big(\mathrm{MLP}(A)\big)$$

wherein $A$ represents the attention output, $\mathrm{Sigmoid}(\cdot)$ represents the activation function used to generate a score between 0 and 1, and $\mathit{score}$ represents the confidence score.
When the confidence score is lower than 0.5, the online template is considered negative, which indicates that the online template is not updated, otherwise, the online template is considered positive, which indicates that the online template is updated, and the predicted target state is taken as the online template in the online tracking process of the next stage.
In this step, a learnable score token $s$ is used as a query to attend to the search ROI tokens, so that the score token can encode the mined target information. Next, the score token attends to all locations of the initial target token to implicitly compare the mined target with the initial target. Finally, the score is generated by the MLP layer and the Sigmoid activation function, and the online template is updated according to the score. In this way, online templates can be effectively screened and updated, improving the accuracy and stability of the tracking system.
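A minimal sketch of this score head is shown below; adaptive average pooling stands in for the ROI extraction of important-region features, and the pooling size and layer widths are assumptions of the sketch.

```python
# Sketch of the score head: a learnable score token queries pooled region features,
# and an MLP + Sigmoid turns the attention output into a confidence score.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoreHead(nn.Module):
    def __init__(self, dim=768, pool=4):
        super().__init__()
        self.score_token = nn.Parameter(torch.zeros(1, 1, dim))   # learnable token of shape 1 x C
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.pool = pool

    def forward(self, feat_map):               # feat_map: (B, C, H, W) result feature map
        roi = F.adaptive_avg_pool2d(feat_map, self.pool)          # stand-in for ROI feature extraction
        roi = roi.flatten(2).transpose(1, 2)                      # (B, pool*pool, C)
        q = self.q(self.score_token.expand(feat_map.size(0), -1, -1))
        k, v = self.k(roi), self.v(roi)
        attn = F.softmax(q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5), dim=-1)
        out = attn @ v                                            # attention output
        return torch.sigmoid(self.mlp(out)).squeeze(-1)           # confidence score in (0, 1)

head = ScoreHead()
score = head(torch.randn(1, 768, 16, 16))
print(float(score), bool(score.item() > 0.5))                     # update online template if > 0.5
```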
Step 6, repeating the steps 2 to 4 by using the large-scale data set as a basis, and pre-training the single-flow single-stage integral model to optimize model parameters;
and 7, carrying out target online tracking on the video sequence by utilizing the trained single-stream single-stage integral model.
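For illustration, the online tracking loop of steps 5 to 7 can be sketched as below; `model`, `crop_template` and the dummy inputs are hypothetical stand-ins for the trained overall model and the template-cropping step, not components defined in this description.

```python
# Sketch of online tracking with score-based online-template updating.
import torch

def crop_template(frame, box):
    # stand-in: a real tracker would crop and resize the region given by the predicted box
    return frame

class DummyModel:
    # stand-in for the trained single-stream single-stage overall model
    def __call__(self, templates, frame):
        box = torch.tensor([0.4, 0.4, 0.6, 0.6])   # fake box from the head corner module
        conf = torch.rand(1).item()                 # fake confidence from the score head
        return box, conf

def track(model, frames, init_template, max_online=2, thresh=0.5):
    online_templates, results = [], []
    for frame in frames:
        templates = [init_template] + online_templates
        box, conf = model(templates, frame)          # predicted position and state confidence
        results.append(box)
        if conf > thresh:                            # positive: refresh the online template
            online_templates.append(crop_template(frame, box))
            online_templates = online_templates[-max_online:]
        # otherwise the prediction is treated as negative and the templates are kept
    return results

frames = [torch.randn(3, 256, 256) for _ in range(5)]
print(len(track(DummyModel(), frames, torch.randn(3, 128, 128))))
```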
Referring to fig. 5, the present embodiment further provides a single-flow single-stage network target tracking system based on cascade attention, where the system applies the single-flow single-stage network target tracking method based on cascade attention as described above, and the system includes:
A construction module for:
Under a single-flow single-stage framework, constructing and obtaining a trunk feature extraction and fusion module based on a Transformer network model and a feature enhancement module, wherein the trunk feature extraction and fusion module, the head corner module and the score head prediction module form a single-stream single-stage integral model;
a learning module for:
acquiring a template image and a search picture, wherein the template image comprises an initial template containing a plurality of required tracking targets and a plurality of online templates containing target states;
inputting the template image and the search picture into a single-stream single-stage integral model, and extracting local feature information corresponding to the template image and the search picture through a trunk feature extraction and fusion module;
inputting the local feature information into a feature enhancement module, and aggregating the local semantic information by using cascade attention to realize feature enhancement, so as to obtain global context information of a template image and a search picture;
Performing cross attention calculation on global context information of the template image and the search picture to realize communication, and obtaining a result feature map;
An extraction module for:
dividing the result feature map into a template image and a search picture, and repeating feature extraction for a plurality of times in an iterative mode to obtain a final result feature map;
A calculation module for:
inputting the final result feature map into a head corner module to predict the confidence score of each target position, and determining the position of a tracking target according to the confidence score so as to realize target tracking;
inputting the result feature map into a score head prediction module to predict the confidence score of each target state, and determining whether to take the predicted target state as an online template in the online tracking process of the next stage according to the confidence score of the target state;
a pre-training module for:
Pre-training the single-flow single-stage integral model by using the large-scale data set as a base to optimize model parameters;
A tracking module for:
and carrying out target online tracking on the video sequence by utilizing the trained single-stream single-stage integral model.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing examples illustrate only a few embodiments of the invention and are described in detail herein without thereby limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (2)

1.A method for single-stream single-stage network target tracking based on cascade attention, the method comprising the steps of:
Step 1, under a single-stream single-stage framework, constructing and obtaining a trunk feature extraction and fusion module based on a Transformer network model and a feature enhancement module, wherein the trunk feature extraction and fusion module, a head corner module and a score head prediction module form a single-stream single-stage integral model;
Step 2, obtaining a template image and a search picture, wherein the template image comprises an initial template containing the required tracking targets and a plurality of online templates containing target states;
Step 3, inputting the template image and the search picture into a single-stream single-stage integral model, and extracting local feature information corresponding to the template image and the search picture through a trunk feature extraction and fusion module;
inputting the local feature information into a feature enhancement module, and aggregating the local semantic information by using cascade attention to realize feature enhancement, so as to obtain global context information of a template image and a search picture;
Performing cross attention calculation on global context information of the template image and the search picture to realize communication, and obtaining a result feature map;
Step 4, dividing the result feature map into a template image and a search picture, and repeating the step 3 for a plurality of times in an iterative mode to obtain a final result feature map;
step 5, inputting the final result feature map into a head corner module to predict the confidence score of each target position, and determining the position of the tracking target according to the confidence score so as to realize target tracking;
inputting the result feature map into a score head prediction module to predict the confidence score of each target state, and determining whether to take the predicted target state as an online template in the online tracking process of the next stage according to the confidence score of the target state;
step 6, repeating the steps 2 to 4 by using the large-scale data set as a basis, and pre-training the single-flow single-stage integral model to optimize model parameters;
Step 7, performing target online tracking on the video sequence by using the trained single-stream single-stage integral model;
In the step 3, the template image and the search picture are input into a single-stream single-stage integral model, and the method for extracting the local feature information corresponding to the template image and the search picture through the trunk feature extraction and fusion module specifically comprises the following steps:
for each template image, its initial shape is $H_z \times W_z \times 3$; a word-embedding operation is first performed on it, followed by a convolution operation, giving a result of shape $H'_z \times W'_z \times C$; the result is then stretched from two dimensions to one dimension, i.e. reshaped into $N_z \times C$, where $N_z = H'_z \times W'_z$; wherein $N_z$ represents the length of the template token, $H_z$, $W_z$ and $3$ respectively represent the length, width and channel number of the initial template image, and $H'_z$, $W'_z$ and $C$ respectively represent the length, width and channel number of the template image after convolution;
for each search image, its initial shape is $H_x \times W_x \times 3$; similarly, a word-embedding operation is performed on it, followed by a convolution operation, giving a result of shape $H'_x \times W'_x \times C$; the result is then stretched from two dimensions to one dimension, i.e. reshaped into $N_x \times C$, where $N_x = H'_x \times W'_x$; wherein $N_x$ represents the length of the search token, $H_x$, $W_x$ and $3$ respectively represent the length, width and channel number of the initial search image, and $H'_x$, $W'_x$ and $C$ respectively represent the length, width and channel number of the search image after convolution;
In the step 3, the method for aggregating the local semantic information by using the cascade attention to realize feature enhancement and obtaining the global context information of the template image and the search picture specifically comprises the following steps:
The total input tokens comprise template tokens and search tokens, the template tokens comprising an initial template token and a plurality of online template tokens; the template tokens and the search tokens are spliced, and the splicing process has the following relational expression:

$$X = \mathrm{Concat}(Z_0, Z_1, \dots, Z_n, S)$$

wherein $\mathrm{Concat}(\cdot)$ represents the splicing function; $X$ represents the total input token; $Z_0$ represents the initial template token in the total input token; $Z_1, \dots, Z_n$ represent the several online template tokens in the total input token, each of shape $N_z \times C$; $S$ represents the search token in the total input token, of shape $N_x \times C$;
the input tokens are converted into two-dimensional images, and the conversion process has the following relation:

$$X_i^{2D} = \mathrm{Reshape2D}(X_i)$$

wherein $X_i^{2D}$ represents the $i$-th two-dimensional image, $X_i$ represents the $i$-th input token, and $\mathrm{Reshape2D}(\cdot)$ represents the function that converts a one-dimensional vector into a two-dimensional picture;
the two-dimensional images are input into the self-attention enhancement function for feature extraction, obtaining the enhancement token corresponding to each image; the feature extraction with the self-attention enhancement function has the following relation:

$$E_i = \mathrm{SAEM}(X_i^{2D})$$

wherein $\mathrm{SAEM}(\cdot)$ represents the self-attention enhancement function and $E_i$ represents the $i$-th enhancement token;
the enhancement tokens belonging to the template image part are connected, and the connection process has the following relation:

$$Z_E = \mathrm{Concat}(E_0, E_1, \dots, E_n)$$

wherein $Z_E$ represents the template result token obtained by splicing the tokens of the template image part, and $\mathrm{Concat}(\cdot)$ represents the splicing function;
The method for carrying out cross attention calculation on global context information of the template image and the search picture to realize communication, and obtaining the result feature map specifically comprises the following steps:
generating a query, a key and a value by using the template result token and the enhancement token corresponding to the search image, wherein the generation process of the query, the key and the value has the following relational expressions:

$$Q = W_Q(E_S), \qquad K = W_K(Z_E), \qquad V = W_V(Z_E)$$

wherein $Q$, $K$ and $V$ represent the query, key and value about the enhanced tokens, $E_S$ represents the enhancement token corresponding to the search image, and $W_Q$, $W_K$ and $W_V$ respectively represent the convolution operations producing the query, key and value from the enhanced tokens;
performing cross-attention calculation on the query, key and value, wherein the cross-attention calculation process has the following relation:

$$S' = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

wherein $S'$ represents the search token after cross-attention, $d_k$ represents the dimension of the key, $\top$ represents the matrix transpose, and $\mathrm{Softmax}$ represents the softmax function used to calculate the attention weights, converting the original attention scores into a probability distribution so that the weights of all locations lie between 0 and 1 and sum to 1;
splicing the cross-attention calculation result and the template result token to obtain the total token, wherein the splicing process has the following relational expression:

$$X' = \mathrm{Concat}(Z_E, S')$$

wherein $X'$ represents the total token after cross-attention;
passing the total token sequentially through layer normalization and a multi-layer perceptron to obtain the result feature map, wherein the process has the following relational expressions:

$$X_{tmp} = \mathrm{LN}(X'), \qquad Y = X' + \mathrm{MLP}(X_{tmp})$$

wherein $X_{tmp}$ represents the temporarily stored result, $Y$ represents the output of the currently calculated part, i.e. the result feature map, $\mathrm{MLP}(\cdot)$ represents the multi-layer perceptron, and $\mathrm{LN}(\cdot)$ represents the layer normalization function;
The method for inputting the two-dimensional images into the self-attention enhancement function to extract the characteristics and obtaining the enhancement tokens corresponding to each image specifically comprises the following steps:
adding the $i$-th two-dimensional image $X_i^{2D}$ and the output token $T_{i-1}$ of the $(i-1)$-th attention head to form the new $i$-th two-dimensional image $\hat{X}_i$, wherein the process expression is as follows:

$$\hat{X}_i = X_i^{2D} + T_{i-1}$$

wherein $T_{i-1}$ represents the output of the $(i-1)$-th attention head and $\hat{X}_i$ represents the new $i$-th two-dimensional image;
in a self-attention manner, taking the new $i$-th two-dimensional image $\hat{X}_i$ as a new input to calculate the new $i$-th attention head, hereinafter referred to as $T_i$, wherein the process expression is as follows:

$$T_i = \mathrm{Attn}\big(W_Q^{i}\hat{X}_i,\; W_K^{i}\hat{X}_i,\; W_V^{i}\hat{X}_i\big) = \mathrm{Softmax}\!\left(\frac{(W_Q^{i}\hat{X}_i)(W_K^{i}\hat{X}_i)^{\top}}{\sqrt{d_k}}\right)W_V^{i}\hat{X}_i$$

wherein $\mathrm{Attn}(\cdot)$ represents the self-attention function, $W_Q^{i}$, $W_K^{i}$ and $W_V^{i}$ respectively represent the convolution operations producing the query, key and value for the $i$-th enhanced token, and $d_k$ represents the dimension of the key;
after the outputs of all the new attention heads are connected in the form of a convolution operation, applying an ordinary convolution operation to strengthen the local information of the features and obtain the enhancement token corresponding to each image, wherein the process expression is as follows:

$$E = \mathrm{Conv}\big(\mathrm{Concat}(T_1, T_2, \dots, T_h)\big)$$

wherein $\mathrm{Conv}(\cdot)$ represents an ordinary convolution operation and $h$ represents the number of attention heads;
in the step 5, the result feature map is input into a score head prediction module to predict the confidence score of each target state, and the method for determining whether to take the predicted target state as an online template in the next stage online tracking process according to the confidence score of the target state specifically comprises the following steps:
using the learnable score token $s$ to generate a query that participates in attending to the search region-of-interest token, wherein the process expression is as follows:

$$Q_s = W_Q^{s}(s)$$

wherein $W_Q^{s}$ represents the query convolution operation on $s$, $Q_s$ represents the query about the search region-of-interest token, and $s$ represents the learnable score token, whose shape is $1 \times C$;
adaptively extracting important-region features from the result feature map and generating keys and values from the important-region features, wherein the process expression is as follows:

$$F_r = \mathrm{ROI}(Y), \qquad K_s = W_K^{s}(F_r), \qquad V_s = W_V^{s}(F_r)$$

wherein $Y$ represents the result feature map, $K_s$ and $V_s$ respectively represent the key and value of the important-region features, $\mathrm{ROI}(\cdot)$ represents the ROI function used to adaptively extract important-region features, $F_r$ represents the important-region features, $W_K^{s}$ represents the key convolution operation on the important-region features, and $W_V^{s}$ represents the value convolution operation on the important-region features;
calculating the attention weight from the query, key and value, and passing the attention output sequentially through a multi-layer perceptron and a Sigmoid activation function to obtain the prediction score, wherein the process expression is as follows:

$$A = \mathrm{Softmax}\!\left(\frac{Q_s K_s^{\top}}{\sqrt{d_k}}\right)V_s, \qquad \mathit{score} = \mathrm{Sigmoid}\big(\mathrm{MLP}(A)\big)$$

wherein $A$ represents the attention output, $\mathrm{Sigmoid}(\cdot)$ represents the activation function used to generate a score between 0 and 1, and $\mathit{score}$ represents the confidence score;
When the confidence score is lower than 0.5, the online template is considered negative, which indicates that the online template is not updated, otherwise, the online template is considered positive, which indicates that the online template is updated, and the predicted target state is taken as the online template in the online tracking process of the next stage.
2. A cascade attention-based single-flow single-phase network object tracking system, wherein the system applies the cascade attention-based single-flow single-phase network object tracking method as claimed in claim 1, the system comprising:
A construction module for:
Under a single-flow single-stage framework, constructing and obtaining a trunk feature extraction and fusion module based on a Transformer network model and a feature enhancement module, wherein the trunk feature extraction and fusion module, the head corner module and the score head prediction module form a single-stream single-stage integral model;
a learning module for:
acquiring a template image and a search picture, wherein the template image comprises an initial template containing a plurality of required tracking targets and a plurality of online templates containing target states;
inputting the template image and the search picture into a single-stream single-stage integral model, and extracting local feature information corresponding to the template image and the search picture through a trunk feature extraction and fusion module;
inputting the local feature information into a feature enhancement module, and aggregating the local semantic information by using cascade attention to realize feature enhancement, so as to obtain global context information of a template image and a search picture;
Performing cross attention calculation on global context information of the template image and the search picture to realize communication, and obtaining a result feature map;
An extraction module for:
dividing the result feature map into a template image and a search picture, and repeating feature extraction for a plurality of times in an iterative mode to obtain a final result feature map;
A calculation module for:
inputting the final result feature map into a head corner module to predict the confidence score of each target position, and determining the position of a tracking target according to the confidence score so as to realize target tracking;
inputting the result feature map into a score head prediction module to predict the confidence score of each target state, and determining whether to take the predicted target state as an online template in the online tracking process of the next stage according to the confidence score of the target state;
a pre-training module for:
Pre-training the single-flow single-stage integral model by using the large-scale data set as a base to optimize model parameters;
A tracking module for:
and carrying out target online tracking on the video sequence by utilizing the trained single-stream single-stage integral model.
CN202410106560.9A 2024-01-25 2024-01-25 Single-flow single-stage network target tracking method and system based on cascade attention Active CN117649582B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410106560.9A CN117649582B (en) 2024-01-25 2024-01-25 Single-flow single-stage network target tracking method and system based on cascade attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410106560.9A CN117649582B (en) 2024-01-25 2024-01-25 Single-flow single-stage network target tracking method and system based on cascade attention

Publications (2)

Publication Number Publication Date
CN117649582A CN117649582A (en) 2024-03-05
CN117649582B true CN117649582B (en) 2024-04-19

Family

ID=90049767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410106560.9A Active CN117649582B (en) 2024-01-25 2024-01-25 Single-flow single-stage network target tracking method and system based on cascade attention

Country Status (1)

Country Link
CN (1) CN117649582B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118691852A (en) * 2024-08-28 2024-09-24 南昌工程学院 Single-flow single-stage target tracking method and system based on double softmax attention

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115619822A (en) * 2022-09-14 2023-01-17 浙江工业大学 Tracking method based on object-level transformation neural network
CN116030097A (en) * 2023-02-28 2023-04-28 南昌工程学院 Target tracking method and system based on dual-attention feature fusion network
CN116109678A (en) * 2023-04-10 2023-05-12 南昌工程学院 Method and system for tracking target based on context self-attention learning depth network
CN116485839A (en) * 2023-04-06 2023-07-25 常州工学院 Visual tracking method based on attention self-adaptive selection of transducer
CN117036770A (en) * 2023-05-19 2023-11-10 北京交通大学 Detection model training and target detection method and system based on cascade attention
CN117274883A (en) * 2023-11-20 2023-12-22 南昌工程学院 Target tracking method and system based on multi-head attention optimization feature fusion network
CN117315293A (en) * 2023-09-26 2023-12-29 杭州电子科技大学 Transformer-based space-time context target tracking method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114202696B (en) * 2021-12-15 2023-01-24 安徽大学 SAR target detection method and device based on context vision and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115619822A (en) * 2022-09-14 2023-01-17 浙江工业大学 Tracking method based on object-level transformation neural network
CN116030097A (en) * 2023-02-28 2023-04-28 南昌工程学院 Target tracking method and system based on dual-attention feature fusion network
CN116485839A (en) * 2023-04-06 2023-07-25 常州工学院 Visual tracking method based on attention self-adaptive selection of transducer
CN116109678A (en) * 2023-04-10 2023-05-12 南昌工程学院 Method and system for tracking target based on context self-attention learning depth network
CN117036770A (en) * 2023-05-19 2023-11-10 北京交通大学 Detection model training and target detection method and system based on cascade attention
CN117315293A (en) * 2023-09-26 2023-12-29 杭州电子科技大学 Transformer-based space-time context target tracking method and system
CN117274883A (en) * 2023-11-20 2023-12-22 南昌工程学院 Target tracking method and system based on multi-head attention optimization feature fusion network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
VTT: Long-term Visual Tracking with Transformers;Tianling Bian等;《2020 25th International Conference on Pattern Recognition (ICPR)》;20210115;全文 *
Real-time visual tracking based on dual attention Siamese network; Yang Kang et al.; Journal of Computer Applications; 2019-01-15 (No. 06); full text *
Research on target tracking algorithm based on kernel extension dictionary learning; Wang Yuanyun et al.; Journal of Nanchang Institute of Technology; 2022-08-31; Vol. 41 (No. 4); full text *

Also Published As

Publication number Publication date
CN117649582A (en) 2024-03-05

Similar Documents

Publication Publication Date Title
CN111489358B (en) Three-dimensional point cloud semantic segmentation method based on deep learning
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
Zhang et al. Recent progresses on object detection: a brief review
CN117649582B (en) Single-flow single-stage network target tracking method and system based on cascade attention
Wang et al. Multiscale deep alternative neural network for large-scale video classification
CN115222998B (en) Image classification method
CN115908517B (en) Low-overlapping point cloud registration method based on optimization of corresponding point matching matrix
Ma et al. Relative-position embedding based spatially and temporally decoupled Transformer for action recognition
Wang et al. TF-SOD: a novel transformer framework for salient object detection
CN116862949A (en) Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement
CN117994623A (en) Image feature vector acquisition method
Wang et al. EMAT: Efficient feature fusion network for visual tracking via optimized multi-head attention
CN116597267B (en) Image recognition method, device, computer equipment and storage medium
Yang et al. An effective and lightweight hybrid network for object detection in remote sensing images
CN117876679A (en) Remote sensing image scene segmentation method based on convolutional neural network
Sajol et al. A ConvNeXt V2 Approach to Document Image Analysis: Enhancing High-Accuracy Classification
CN110942463B (en) Video target segmentation method based on generation countermeasure network
Huang et al. Bidirectional tracking scheme for visual object tracking based on recursive orthogonal least squares
CN116403133A (en) Improved vehicle detection algorithm based on YOLO v7
CN114972851A (en) Remote sensing image-based ship target intelligent detection method
CN113869154B (en) Video actor segmentation method according to language description
CN113627245B (en) CRTS target detection method
Liu et al. Adversarial erasing attention for person re-identification in camera networks under complex environments
Huang et al. SOAda-YOLOR: Small Object Adaptive YOLOR Algorithm for Road Object Detection
Wang et al. Visual tracking using transformer with a combination of convolution and attention

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant