CN116030097A - Target tracking method and system based on dual-attention feature fusion network - Google Patents
- Publication number: CN116030097A (application CN202310172562.3A)
- Authority: CN (China)
- Legal status: Granted
Classifications
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a target tracking method and system based on a dual-attention feature fusion network, wherein the method comprises the following steps: constructing a Transformer-based multi-scale feature fusion network; learning the features in the template feature map through the encoder to obtain high-confidence target proposal boxes; inputting the target proposal boxes into the decoder, learning and fusing the search region features, and obtaining the proposal box with the highest confidence value; quickly focusing attention on the region of interest, capturing structured spatial information and local information, and exploring global context information by utilizing the structured spatial information in the encoder; and sending the features obtained by fusing the template features and the search region features to a prediction head to obtain the position of maximum response of the tracking target in the search region for tracking. The invention enables the tracker to cope well with severe occlusion, scale variation, complex backgrounds, and other difficulties during tracking, and achieves more accurate and robust tracking.
Description
Technical Field
The invention relates to the technical field of computer vision and image processing, in particular to a target tracking method and system based on a dual-attention feature fusion network.
Background
Video tracking is an important computer vision task with wide application in autonomous driving, visual localization, video surveillance, pedestrian tracking, and similar areas. The purpose of video tracking is to predict the location of an object of interest in subsequent frames, given only its location in the first frame. Video tracking remains a very challenging task due to limited training data and the many difficulties of real-world scenes, such as occlusion, deformation, background clutter, and scale variation.
At present, Siamese-network trackers based on convolutional neural networks are widely applied in the field of visual tracking. A convolutional neural network extracts features of a target through a specific network and then performs subsequent processing on the target, such as classification and detection, according to the extracted features. In Siamese-network-based trackers, the use of convolutional neural networks greatly improves tracker performance. Furthermore, context information has proven critical for many computer vision tasks, such as object tracking. A Transformer can explore the rich context information between successive frames by using attention in an encoder-decoder architecture, thus achieving better tracking performance.
However, in tracking algorithms that use the Transformer structure, because every point must capture global context information, much key local information is inadvertently lost, which strongly affects tracker performance. Therefore, how to efficiently explore the context information between consecutive frames without losing a large amount of useful local information becomes a key factor in improving tracker performance.
Disclosure of Invention
In view of the above situation, a main object of the present invention is to solve the problem that some prior-art visual tracking algorithms ignore the close relationship between global context information and local information, so that a large amount of local information is lost; moreover, the use of self-attention causes a large amount of redundant computation, making it difficult to handle the effects of complex appearance changes, occlusion, and the like.
An embodiment of the invention provides a target tracking method based on a dual-attention feature fusion network, wherein the method comprises the following steps:
step one, convolution initialization:
under the Siamese network framework, initializing the template branch image from the first frame and the search region image from each subsequent search frame, and obtaining the template image features and the search region features respectively through a four-layer deep convolutional neural network;
step two, feature learning:
constructing a Transformer-based multi-scale feature fusion network from box attention and instance attention;
learning the template image features through a box-attention-based Transformer encoder to obtain multi-scale high-confidence target proposal boxes;
inputting the multi-scale high-confidence target proposal boxes into an instance-attention-based Transformer decoder while learning the search region features, and fusing the learned template image features with the learned search region features to obtain the proposal box with the highest confidence value;
step three, network training:
training the Transformer-based multi-scale feature fusion network with a large-scale dataset and adjusting the model parameters of the network;
step four, learning and aggregation:
learning, with the trained Transformer-based multi-scale feature fusion network, the local regions of the target features on the template branch image and of the target features on the search region image so as to obtain the corresponding local semantic information respectively, and then aggregating the local semantic information through a multi-head box attention module and a multi-head instance attention module respectively to obtain global context information;
step five, target box calculation:
generating boxes of interest with the Transformer encoder in the Transformer-based multi-scale feature fusion network by applying geometric transformations to a predefined reference window, thereby capturing multi-scale high-confidence target proposal boxes, and refining these proposal boxes with the Transformer decoder to obtain the candidate box with the maximum confidence score; the box attention samples a grid inside each candidate box and computes the attention weights of the sampled features within the grid features;
step six, target tracking:
sending the features obtained by fusing the template image features and the search region features to a classification-regression prediction head to obtain the position of maximum response of the tracking target in the search region, thereby achieving tracking.
The invention also provides a target tracking system based on the dual-attention feature fusion network, wherein the system executes the above target tracking method based on the dual-attention feature fusion network, and the system comprises:
an initialization convolution module for:
under the Siamese network framework, initializing the template branch image from the first frame and the search region image from each subsequent search frame, and obtaining the template image features and the search region features respectively through a four-layer deep convolutional neural network;
a feature learning module for:
constructing a Transformer-based multi-scale feature fusion network from box attention and instance attention;
learning the template image features through a box-attention-based Transformer encoder to obtain multi-scale high-confidence target proposal boxes;
inputting the multi-scale high-confidence target proposal boxes into an instance-attention-based Transformer decoder while learning the search region features, and fusing the learned template image features with the learned search region features to obtain the proposal box with the highest confidence value;
a network training module for:
training the Transformer-based multi-scale feature fusion network with a large-scale dataset and adjusting the model parameters of the network;
a learning aggregation module for:
learning, with the trained Transformer-based multi-scale feature fusion network, the local regions of the target features on the template branch image and of the target features on the search region image so as to obtain the corresponding local semantic information respectively, and then aggregating the local semantic information through a multi-head box attention module and a multi-head instance attention module respectively to obtain global context information;
a target box calculation module for:
generating boxes of interest with the Transformer encoder in the Transformer-based multi-scale feature fusion network by applying geometric transformations to a predefined reference window, thereby capturing multi-scale high-confidence target proposal boxes, and refining these proposal boxes with the Transformer decoder to obtain the candidate box with the maximum confidence score; the box attention samples a grid inside each candidate box and computes the attention weights of the sampled features within the grid features;
a target tracking module for:
sending the features obtained by fusing the template image features and the search region features to a classification-regression prediction head to obtain the position of maximum response of the tracking target in the search region, thereby achieving tracking.
The invention provides a target tracking method and system based on a dual-attention feature fusion network, wherein the method comprises: constructing a Transformer-based multi-scale feature fusion network from box attention and instance attention under a Siamese network framework; learning the features in the template feature map with a box-attention-based encoder to obtain multi-scale high-confidence target proposal boxes; inputting the proposal boxes from the encoder into an instance-attention-based decoder, learning the features in the search region feature map at the same time, and fusing the template features with the search region features to obtain the proposal box with the highest confidence value; pre-training the Transformer-based multi-scale feature fusion network, then using the trained network so that the encoder quickly focuses attention on the region of interest and captures a large amount of structured spatial information and local information, while the decoder exploits the structured spatial information in the encoder to explore global context information; and sending the features obtained by fusing the template features and the search region features to a prediction head to obtain the position of maximum response of the tracking target in the search region, thereby achieving tracking. By fully combining the advantages of box attention and instance attention to construct the Transformer-based multi-scale feature fusion network, the invention enables the tracker to cope well with severe occlusion, scale variation, complex backgrounds, and other difficulties during tracking, and to achieve more accurate and robust tracking.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a flow chart of a target tracking method based on a dual attention feature fusion network according to the present invention;
FIG. 2 is a schematic block diagram of a target tracking method based on a dual attention feature fusion network according to the present invention;
FIG. 3 is a schematic structural diagram of a target tracking system based on a dual attention feature fusion network according to the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
These and other aspects of embodiments of the invention will be apparent from and elucidated with reference to the description and drawings described hereinafter. In the description and drawings, particular implementations of embodiments of the invention are disclosed in detail as being indicative of some of the ways in which the principles of embodiments of the invention may be employed, but it is understood that the scope of the embodiments of the invention is not limited correspondingly. On the contrary, the embodiments of the invention include all alternatives, modifications and equivalents as may be included within the spirit and scope of the appended claims.
Referring to fig. 1, the present invention provides a target tracking method based on a dual-attention feature fusion network, wherein the method includes the following steps:
s101, initializing convolution:
under the twin network frame, initializing a template branch image of a first frame and a search area image of a subsequent search frame, and respectively obtaining template image features and search area features through a four-layer deep convolutional neural network.
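As an illustrative sketch only (not the patented backbone), the weight sharing between the two branches in S101 can be mimicked with a toy single-channel, four-layer strided 3×3 convolution in NumPy; the crop sizes (127/255), stride, and ReLU activation are assumptions borrowed from common Siamese trackers:

```python
import numpy as np

def conv2d_relu(x, w, stride=2):
    """Valid cross-correlation of a single-channel map x with kernel w, then ReLU."""
    kh, kw = w.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = max(np.sum(patch * w), 0.0)
    return out

def backbone(x, kernels):
    """Four-layer deep CNN: each layer a strided 3x3 conv + ReLU."""
    for w in kernels:
        x = conv2d_relu(x, w)
    return x

rng = np.random.default_rng(0)
kernels = [rng.standard_normal((3, 3)) * 0.1 for _ in range(4)]  # weights shared by both branches

template = rng.standard_normal((127, 127))  # first-frame template crop (assumed size)
search = rng.standard_normal((255, 255))    # search-region crop (assumed size)

z_feat = backbone(template, kernels)  # template image features
x_feat = backbone(search, kernels)    # search region features
print(z_feat.shape, x_feat.shape)
```

Because both branches run the same `kernels`, template and search features live in the same embedding space, which is what makes their later fusion meaningful.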
S102, feature learning:
constructing a Transformer-based multi-scale feature fusion network from box attention and instance attention;
learning the template image features through a box-attention-based Transformer encoder to obtain multi-scale high-confidence target proposal boxes;
inputting the multi-scale high-confidence target proposal boxes into an instance-attention-based Transformer decoder while learning the search region features, and fusing the learned template image features with the learned search region features to obtain the proposal box with the highest confidence value.
In this step, the calculation formula of the box attention is expressed as:

Att_i(q, k, v) = softmax(q k^T / sqrt(d)) v

where Att_i denotes the box attention of the i-th head, softmax(·) denotes the normalization function, ^T denotes the transpose operation, q ∈ R^(HW×d) denotes the query vector, k ∈ R^(m²×d) denotes the key vector, v ∈ R^(m²×d) denotes the value vector, R denotes the set of real numbers, HW denotes the height-times-width of the input feature map, m denotes the side length of the grid feature map, C denotes the number of channels, and d denotes the per-head feature dimension.
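The per-head box attention in this step can be sketched numerically; the shapes (HW query positions, an m×m key/value grid, head dimension d) are illustrative, and the scaled-dot-product-with-softmax form is the standard attention assumed here:

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def box_attention_head(q, k, v, d):
    """Att_i(q, k, v) = softmax(q k^T / sqrt(d)) v for one attention head."""
    scores = q @ k.T / np.sqrt(d)   # (HW, m*m) attention coefficients
    sim = softmax(scores, axis=-1)  # similarity scores, rows sum to 1
    return sim @ v                  # (HW, d) features aggregated from the grid

d, m, HW = 8, 4, 49                 # head dim, grid side length, H*W query positions
rng = np.random.default_rng(1)
q = rng.standard_normal((HW, d))    # one query per spatial position
k = rng.standard_normal((m * m, d))  # keys sampled on the m x m grid
v = rng.standard_normal((m * m, d))  # values sampled on the m x m grid

out = box_attention_head(q, k, v, d)
print(out.shape)
```

Each output row is a convex combination of the grid value vectors, so the attention only re-weights sampled local features rather than inventing new ones.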
S103, network training:
training the Transformer-based multi-scale feature fusion network with a large-scale dataset and adjusting the model parameters of the network.
S104, learning and aggregation:
learning, with the trained Transformer-based multi-scale feature fusion network, the local regions of the target features on the template branch image and of the target features on the search region image so as to obtain the corresponding local semantic information respectively, and then aggregating the local semantic information through a multi-head box attention module and a multi-head instance attention module respectively to obtain global context information.
S105, target box calculation:
generating boxes of interest with the Transformer encoder in the Transformer-based multi-scale feature fusion network by applying geometric transformations to a predefined reference window, thereby capturing multi-scale high-confidence target proposal boxes, and refining these proposal boxes with the Transformer decoder to obtain the candidate box with the maximum confidence score.
Here, the box attention samples a grid inside each candidate box and computes the attention weights of the sampled features within the grid features.
In particular, as shown in FIG. 2, the generation principle of the multi-head box attention module and the multi-head instance attention module of the present invention can be seen. In this embodiment, within these modules the box attention of the i-th attention head is calculated as follows:
S1051, given a query vector q_i, use bilinear interpolation to extract from the box of interest b a grid feature map v of size m × m.
It should be noted that the query vector q_i is obtained from the query q through a linear projection transformation.
S1052, use the position attention module to convert the grid feature map v into a region of interest, so that the region of interest adapts to changes in the target's appearance;
S1053, generate the box attention coefficients by matrix multiplication between the query vector q_i and the key vector k_i;
S1054, normalize the box attention coefficients with the softmax function to obtain the similarity score S between the query vector q_i and the key vector k_i, and obtain the final box attention O_i by multiplying the similarity score S with the linear transformation matrix W_v of the grid feature map v.
As a supplement, the grid feature map v is extracted at a fixed resolution: regardless of the size of the box of interest b, the bilinear interpolation always yields an m × m grid of feature vectors.
in calculating the frame attention, the location attention module may focus the frame attention on the necessary regions, thereby more accurately predicting the high confidence value target suggestion frame. The location attention module may vector the queryReference window +.>By translating or scaling the geometric transformation into a region of interest, it can use spatial information to predict the target suggestion box in the grid feature map. The using method of the position attention module comprises the following steps:
by means ofRepresenting query vector +.>Reference window of->, wherein ,/>Respectively representing the abscissa and the ordinate of the central position of the reference window, ">The width and the height of the reference window are respectively represented;
using a first transfer functionFor reference window->Performing a conversion, a first conversion function->Query vector +.>And reference Window->As an input, for moving the center position of the reference window;
using a second transfer functionFor reference window->Make adjustments, second transfer function->Query vector +.>And reference Window->As an input for adjusting the size of the reference window.
In the present embodiment, the calculation formula corresponding to the first transformation function t_move is expressed as:

t_move(q, b) = (x + Δx, y + Δy, w, h)

where Δx denotes the abscissa offset of the center position of the reference window b, and Δy denotes the ordinate offset of the center position of the reference window b;

the calculation formula corresponding to the second transformation function t_scale is expressed as:

t_scale(q, b) = (x, y, w + Δw, h + Δh)

where Δw denotes the width offset of the reference window b, and Δh denotes the height offset of the reference window b.

Furthermore, the offset parameters (Δx, Δy, Δw, Δh) are obtained by linear projection of the query vector q; the corresponding calculation formula is expressed as:

Δx = (W_x q + b_x) / τ, Δy = (W_y q + b_y) / τ, Δw = (W_w q + b_w) / τ, Δh = (W_h q + b_h) / τ

where W_x denotes the linear projection parameter of the abscissa x, W_y denotes the linear projection parameter of the ordinate y, W_w denotes the linear projection parameter of the reference window width, W_h denotes the linear projection parameter of the reference window height, b_x denotes the bias of the linear projection of the abscissa x, b_y denotes the bias of the linear projection of the ordinate y, b_w denotes the linear projection bias of the reference window width, b_h denotes the linear projection bias of the reference window height, and τ denotes a temperature parameter.

The converted reference window b̂ is determined jointly by the first transformation function t_move and the second transformation function t_scale; the corresponding calculation formula is expressed as:

b̂ = t_scale(q, t_move(q, b))

where b̂ denotes the converted reference window and t(·) denotes the transformation function operation.
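A minimal sketch of the reference-window translation and scaling, with the offsets produced by a linear projection of the query vector divided by a temperature; the projection weights, biases, and temperature value here are hypothetical:

```python
import numpy as np

def transform_window(q, box, W, bias, tau=2.0):
    """Move then rescale a reference window box = (cx, cy, w, h)."""
    dx, dy, dw, dh = (W @ q + bias) / tau  # offsets via linear projection + temperature
    cx, cy, w, h = box
    cx, cy = cx + dx, cy + dy              # first function: move the center
    return (cx, cy, w + dw, h + dh)        # second function: adjust the size

rng = np.random.default_rng(2)
d = 8
q = rng.standard_normal(d)              # query vector
W = rng.standard_normal((4, d)) * 0.1   # linear projection parameters (hypothetical)
bias = np.zeros(4)                      # linear projection biases (hypothetical)

window = (10.0, 10.0, 4.0, 4.0)         # predefined reference window
roi = transform_window(q, window, W, bias)  # query-dependent box of interest
print(roi)
```

With zero weights and biases the transformation is the identity, so the reference window itself is the default region of interest before the query has learned where to look.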
In the present invention, the multi-head instance attention module operates as follows:
generate multi-scale high-confidence target proposal boxes with the box attention in the box-attention-based Transformer encoder and send them to the Transformer decoder;
refine the high-confidence target proposal boxes with the instance attention in the Transformer decoder, wherein each decoder layer contains instance attention, and an added instance normalization layer with a residual structure is connected at each forward propagation layer;
according to the high-confidence target proposal boxes in the Transformer decoder, the instance attention takes the grid features inside the proposal boxes as input, so as to obtain the proposal box with the highest confidence value.
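A heavily simplified sketch of how instance attention might score the grid features of each proposal and keep the most confident one; the attention pooling and the scoring weights are illustrative assumptions, not the patented decoder:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def refine_proposals(inst_q, proposal_feats, w_score):
    """Score each proposal's grid features with an instance query and
    return the index of the proposal with the highest confidence."""
    confidences = []
    for g in proposal_feats:                  # g: (m*m, d) grid features of one proposal
        attn = softmax(g @ inst_q)            # instance attention weights over the grid
        pooled = attn @ g                     # attention-pooled proposal descriptor
        confidences.append(pooled @ w_score)  # scalar confidence score (assumed linear head)
    confidences = np.asarray(confidences)
    return int(np.argmax(confidences)), confidences

rng = np.random.default_rng(3)
m, d, n = 4, 8, 5                             # grid side, feature dim, number of proposals
proposal_feats = [rng.standard_normal((m * m, d)) for _ in range(n)]
inst_q = rng.standard_normal(d)               # instance query (hypothetical)
w_score = rng.standard_normal(d)              # scoring weights (hypothetical)

best, conf = refine_proposals(inst_q, proposal_feats, w_score)
print(best, conf.shape)
```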
S106, target tracking:
sending the features obtained by fusing the template image features and the search region features to a classification-regression prediction head to obtain the position of maximum response of the tracking target in the search region, thereby achieving tracking.
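The final localization step reduces to taking the peak of the response map produced by the prediction head; a minimal sketch (the 15 × 15 map size is an assumption):

```python
import numpy as np

def locate_target(response):
    """Return the (row, col) of the maximum response in the search region."""
    idx = int(np.argmax(response))          # flat index of the peak
    return divmod(idx, response.shape[1])   # convert to 2-D coordinates

response = np.zeros((15, 15))
response[9, 4] = 1.0             # peak where the fused features match the target best
print(locate_target(response))   # → (9, 4)
```

In a full tracker this peak position would then be mapped back to image coordinates and combined with the regressed box size to produce the frame's tracking result.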
Referring to FIG. 3, the present invention further provides a target tracking system based on a dual-attention feature fusion network, wherein the system executes the above target tracking method based on the dual-attention feature fusion network, and the system comprises:
an initialization convolution module for:
under the Siamese network framework, initializing the template branch image from the first frame and the search region image from each subsequent search frame, and obtaining the template image features and the search region features respectively through a four-layer deep convolutional neural network;
a feature learning module for:
constructing a Transformer-based multi-scale feature fusion network from box attention and instance attention;
learning the template image features through a box-attention-based Transformer encoder to obtain multi-scale high-confidence target proposal boxes;
inputting the multi-scale high-confidence target proposal boxes into an instance-attention-based Transformer decoder while learning the search region features, and fusing the learned template image features with the learned search region features to obtain the proposal box with the highest confidence value;
a network training module for:
training the Transformer-based multi-scale feature fusion network with a large-scale dataset and adjusting the model parameters of the network;
a learning aggregation module for:
learning, with the trained Transformer-based multi-scale feature fusion network, the local regions of the target features on the template branch image and of the target features on the search region image so as to obtain the corresponding local semantic information respectively, and then aggregating the local semantic information through a multi-head box attention module and a multi-head instance attention module respectively to obtain global context information;
a target box calculation module for:
generating boxes of interest with the Transformer encoder in the Transformer-based multi-scale feature fusion network by applying geometric transformations to a predefined reference window, thereby capturing multi-scale high-confidence target proposal boxes, and refining these proposal boxes with the Transformer decoder to obtain the candidate box with the maximum confidence score; the box attention samples a grid inside each candidate box and computes the attention weights of the sampled features within the grid features;
a target tracking module for:
sending the features obtained by fusing the template image features and the search region features to a classification-regression prediction head to obtain the position of maximum response of the tracking target in the search region, thereby achieving tracking.
The invention provides a target tracking method and system based on a dual-attention feature fusion network, wherein the method comprises: constructing a Transformer-based multi-scale feature fusion network from box attention and instance attention under a Siamese network framework; learning the features in the template feature map with a box-attention-based encoder to obtain multi-scale high-confidence target proposal boxes; inputting the proposal boxes from the encoder into an instance-attention-based decoder, learning the features in the search region feature map at the same time, and fusing the template features with the search region features to obtain the proposal box with the highest confidence value; pre-training the Transformer-based multi-scale feature fusion network, then using the trained network so that the encoder quickly focuses attention on the region of interest and captures a large amount of structured spatial information and local information, while the decoder exploits the structured spatial information in the encoder to explore global context information; and sending the features obtained by fusing the template features and the search region features to a prediction head to obtain the position of maximum response of the tracking target in the search region, thereby achieving tracking. By fully combining the advantages of box attention and instance attention to construct the Transformer-based multi-scale feature fusion network, the invention enables the tracker to cope well with severe occlusion, scale variation, complex backgrounds, and other difficulties during tracking, and to achieve more accurate and robust tracking.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing examples illustrate only a few embodiments of the invention, and although they are described in some detail, they should not thereby be construed as limiting the scope of the invention. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the spirit of the invention, all of which fall within the protection scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.
Claims (10)
1. A method for tracking a target based on a dual attention feature fusion network, the method comprising the steps of:
step one, initializing convolution:
under the twin network frame, initializing a template branch image of a first frame and a search area image of a subsequent search frame, and respectively obtaining template image features and search area features through a four-layer deep convolutional neural network;
step two, feature learning:
constructing a Transformer-based multi-scale feature fusion network through frame attention and instance attention;
learning the template image features through a frame-attention-based Transformer encoder to obtain multi-scale high-confidence-value target suggestion frames;
inputting the multi-scale high-confidence-value target suggestion frames into an instance-attention-based Transformer decoder while learning the search area features, and fusing the learned template image features with the learned search area features to obtain the target suggestion frame with the highest confidence value;
step three, network training:
training the Transformer-based multi-scale feature fusion network with a large-scale data set, and adjusting the model parameters of the Transformer-based multi-scale feature fusion network model;
step four, learning and aggregation:
using the trained Transformer-based multi-scale feature fusion network to learn local areas of the target features on the template branch image and of the target features on the search area image so as to respectively obtain corresponding local semantic information, and then aggregating the local semantic information through a multi-head frame attention module and a multi-head instance attention module respectively to obtain global context information;
step five, calculating a target frame:
using the Transformer encoder in the Transformer-based multi-scale feature fusion network to perform geometric transformation through a predefined reference window to generate frames of interest, thereby capturing multi-scale high-confidence-value target suggestion frames, and refining the multi-scale high-confidence-value target suggestion frames with the Transformer decoder to obtain the candidate frame with the maximum confidence score; the frame attention samples a grid in each candidate frame and calculates the attention weights of the sampled features within the grid features;
step six, target tracking:
sending the features obtained by fusing the template image features and the search area features to a classification-regression prediction head to obtain the maximum response position of the tracking target in the search area, thereby achieving tracking.
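For illustration only (not part of the claims), the final localization step above, taking the maximum response position of a classification score map over the search area, can be sketched in numpy as follows; the map size and the backbone stride are hypothetical values:

```python
import numpy as np

def locate_target(response_map, stride=8):
    """Return search-area pixel coordinates of the peak response.

    response_map : 2-D score map from the classification head (hypothetical).
    stride       : assumed backbone stride mapping map cells back to pixels.
    """
    r, c = np.unravel_index(np.argmax(response_map), response_map.shape)
    return int(c) * stride, int(r) * stride  # (x, y) in search-area pixels

# toy response map with a single peak at row 10, column 15
m = np.zeros((25, 25))
m[10, 15] = 1.0
print(locate_target(m))  # (120, 80) with stride 8
```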
2. The dual-attention feature fusion network based target tracking method of claim 1, wherein in step two, the calculation formula of the frame attention is expressed as:

head_i = BoxAttn(Q, K, V) = σ(Q·K^T / √d)·V

wherein head_i represents the frame attention of the i-th head, BoxAttn(·) represents the frame attention function, Q represents the query vector, K represents the key vector, V represents the value vector, σ represents the normalization (softmax) function, T represents the transpose operation, Q ∈ R^{HW×d}, K, V ∈ R^{G²×d}, R represents the set of real numbers, HW represents the product of the height and width of the input feature map, G represents the side length of the grid feature map, C represents the number of channels, and d represents the feature dimension of a single head.
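As a minimal numpy sketch of the single-head frame attention formula in claim 2 (not the patented implementation; all shapes are hypothetical), with HW query positions attending to a G×G grid of keys and values:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax (the normalization function sigma)
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def frame_attention(Q, K, V):
    """sigma(Q K^T / sqrt(d)) V for one attention head.

    Q: (HW, d) queries; K, V: (G*G, d) grid keys/values.
    """
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(16, 8))   # HW = 16 query positions, d = 8
K = rng.normal(size=(9, 8))    # G = 3, so G*G = 9 grid cells
V = rng.normal(size=(9, 8))
out = frame_attention(Q, K, V)
print(out.shape)  # (16, 8)
```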
3. The dual-attention feature fusion network based target tracking method of claim 2, wherein in the multi-head frame attention module, the method for calculating the frame attention of the i-th attention head comprises the following steps:

given a query vector q and its reference window b, extracting a grid feature map of size G×G from the frame of interest by bilinear interpolation;

converting the grid feature map into a region of interest with the position attention module, so that the region of interest adapts to changes in the appearance of the target;

generating the resulting frame attention coefficients by matrix multiplication between the query vector q and the key vector k;
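The bilinear grid sampling step in claim 3 can be sketched in numpy as follows; this is an illustrative reconstruction under assumed conventions (box given as center/size, single-channel feature map), not the patented implementation:

```python
import numpy as np

def bilinear_sample(fmap, x, y):
    """Bilinearly interpolate a (H, W) feature map at continuous (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, fmap.shape[1] - 1), min(y0 + 1, fmap.shape[0] - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wy) * ((1 - wx) * fmap[y0, x0] + wx * fmap[y0, x1])
            + wy * ((1 - wx) * fmap[y1, x0] + wx * fmap[y1, x1]))

def grid_features(fmap, box, G=3):
    """Sample a G x G grid of features inside box = (cx, cy, w, h)."""
    cx, cy, w, h = box
    # grid-cell centers, evenly spaced inside the box
    xs = cx - w / 2 + (np.arange(G) + 0.5) * w / G
    ys = cy - h / 2 + (np.arange(G) + 0.5) * h / G
    return np.array([[bilinear_sample(fmap, x, y) for x in xs] for y in ys])

fmap = np.arange(36, dtype=float).reshape(6, 6)   # toy feature map
g = grid_features(fmap, (3.0, 3.0, 2.0, 2.0), G=3)
print(g.shape)  # (3, 3)
```

Since the toy map is linear in (x, y), the center sample equals the map value at the box center (row 3, column 3).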
5. The dual-attention feature fusion network based target tracking method of claim 4, wherein the method for using the position attention module when calculating the frame attention comprises the following steps:

representing the reference window of the query vector q as b = (x, y, w, h), wherein x and y respectively represent the abscissa and ordinate of the center position of the reference window, and w and h respectively represent the width and height of the reference window;

transforming the reference window b with a first transfer function τ₁, the first transfer function τ₁ taking the query vector q and the reference window b as input and being used to move the center position of the reference window;
6. The dual-attention feature fusion network based target tracking method of claim 5, wherein the first transfer function τ₁ has a corresponding calculation formula expressed as:

τ₁(q, b) = (x + Δx, y + Δy, w, h)

wherein Δx represents the abscissa offset of the center position of the reference window b, and Δy represents the ordinate offset of the center position of the reference window b;
7. The dual-attention feature fusion network based target tracking method of claim 6, wherein the offset parameters are obtained by linear projection of the query vector q, with the corresponding calculation formulas expressed as:

Δx = (W_x·q + b_x)/τ, Δy = (W_y·q + b_y)/τ, Δw = (W_w·q + b_w)/τ, Δh = (W_h·q + b_h)/τ

wherein W_x represents the linear projection parameter of the abscissa x, W_y represents the linear projection parameter of the ordinate y, W_w represents the linear projection parameter of the reference window width, W_h represents the linear projection parameter of the reference window height, b_x represents the bias of the linear projection of the abscissa x, b_y represents the bias of the linear projection of the ordinate y, b_w represents the bias of the linear projection of the reference window width, b_h represents the bias of the linear projection of the reference window height, and τ represents a temperature parameter.
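The center-offset computation of claims 6 and 7 can be sketched in numpy as below; this is an illustrative reconstruction with hypothetical projection weights and temperature, not the patented parameterization:

```python
import numpy as np

def shift_window(q, box, W_x, b_x, W_y, b_y, tau=2.0):
    """Move a reference window's center by offsets linearly projected from q.

    box = (x, y, w, h); each offset is (W . q + b) / tau, with tau a
    temperature parameter that damps the projected displacement.
    """
    x, y, w, h = box
    dx = (W_x @ q + b_x) / tau
    dy = (W_y @ q + b_y) / tau
    return (x + dx, y + dy, w, h)   # tau_1: center moves, size unchanged

q = np.ones(4)                      # toy query vector
W_x = np.full(4, 0.5)               # hypothetical projection weights
W_y = np.full(4, -0.5)
print(shift_window(q, (10.0, 10.0, 4.0, 4.0), W_x, 0.0, W_y, 0.0))
# center shifts by (+1, -1): (11.0, 9.0, 4.0, 4.0)
```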
8. The dual-attention feature fusion network based target tracking method of claim 7, wherein the reference window b is jointly determined by the first transfer function τ₁ and the second transfer function τ₂, with the corresponding calculation formula expressed as the composition:

b' = τ₂(τ₁(q, b))
9. The dual-attention feature fusion network based target tracking method of claim 8, wherein the method for operating the multi-head instance attention module comprises the following steps:

generating high-confidence-value target suggestion frames containing multiple scales by using the frame attention in the frame-attention-based Transformer encoder and sending them to the Transformer decoder;

refining the high-confidence-value target suggestion frames by using the instance attention in the Transformer decoder, wherein each decoder layer in the Transformer decoder contains instance attention, and an added instance normalization layer with a residual structure is connected at each forward propagation layer;

according to the high-confidence-value target suggestion frames in the Transformer decoder, the instance attention takes the grid features in the high-confidence-value target suggestion frames as input, so as to obtain the target suggestion frame with the highest confidence value.
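A decoder layer of the kind described in claim 9 (instance attention over one proposal's grid features, with residual connections and a normalization after the feed-forward step) might be sketched as follows; this is a rough numpy analogue with a stand-in feed-forward, not the patented layer:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # per-feature normalization (stand-in for the instance normalization layer)
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decoder_layer(query, grid_feats):
    """Instance attention over a proposal's grid features, followed by a
    residual connection, normalization, and a stand-in feed-forward block.
    """
    d = query.shape[-1]
    attn = softmax(query @ grid_feats.T / np.sqrt(d)) @ grid_feats
    x = layer_norm(query + attn)          # residual + norm after attention
    ff = np.maximum(0.0, x)               # hypothetical feed-forward (ReLU)
    return layer_norm(x + ff)             # residual + norm after feed-forward

rng = np.random.default_rng(1)
q = rng.normal(size=(1, 8))    # one proposal query
g = rng.normal(size=(9, 8))    # its G = 3 grid features
print(decoder_layer(q, g).shape)  # (1, 8)
```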
10. A dual-attention feature fusion network based target tracking system, wherein the system performs the dual-attention feature fusion network based target tracking method as claimed in any one of claims 1 to 9, the system comprising:
an initialization convolution module for:
under a Siamese network framework, initializing the template branch image of the first frame and the search area image of each subsequent search frame, and respectively obtaining template image features and search area features through a four-layer deep convolutional neural network;
the feature learning module is used for:
constructing a Transformer-based multi-scale feature fusion network through frame attention and instance attention;
learning the template image features through a frame-attention-based Transformer encoder to obtain multi-scale high-confidence-value target suggestion frames;
inputting the multi-scale high-confidence-value target suggestion frames into an instance-attention-based Transformer decoder while learning the search area features, and fusing the learned template image features with the learned search area features to obtain the target suggestion frame with the highest confidence value;
the network training module is used for:
training the Transformer-based multi-scale feature fusion network with a large-scale data set, and adjusting the model parameters of the Transformer-based multi-scale feature fusion network model;
a learning aggregation module for:
using the trained Transformer-based multi-scale feature fusion network to learn local areas of the target features on the template branch image and of the target features on the search area image so as to respectively obtain corresponding local semantic information, and then aggregating the local semantic information through a multi-head frame attention module and a multi-head instance attention module respectively to obtain global context information;
the target frame calculation module is used for:
using the Transformer encoder in the Transformer-based multi-scale feature fusion network to perform geometric transformation through a predefined reference window to generate frames of interest, thereby capturing multi-scale high-confidence-value target suggestion frames, and refining the multi-scale high-confidence-value target suggestion frames with the Transformer decoder to obtain the candidate frame with the maximum confidence score; the frame attention samples a grid in each candidate frame and calculates the attention weights of the sampled features within the grid features;
a target tracking module for:
sending the features obtained by fusing the template image features and the search area features to a classification-regression prediction head to obtain the maximum response position of the tracking target in the search area, thereby achieving tracking.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310172562.3A CN116030097B (en) | 2023-02-28 | 2023-02-28 | Target tracking method and system based on dual-attention feature fusion network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116030097A true CN116030097A (en) | 2023-04-28 |
CN116030097B CN116030097B (en) | 2023-05-30 |
Family
ID=86079674
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310172562.3A Active CN116030097B (en) | 2023-02-28 | 2023-02-28 | Target tracking method and system based on dual-attention feature fusion network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116030097B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116403006A (en) * | 2023-06-07 | 2023-07-07 | 南京军拓信息科技有限公司 | Real-time visual target tracking method, device and storage medium |
CN116664624A (en) * | 2023-06-01 | 2023-08-29 | 中国石油大学(华东) | Target tracking method and tracker based on decoupling classification and regression characteristics |
CN117649582A (en) * | 2024-01-25 | 2024-03-05 | 南昌工程学院 | Single-flow single-stage network target tracking method and system based on cascade attention |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210081673A1 (en) * | 2019-09-12 | 2021-03-18 | Nec Laboratories America, Inc | Action recognition with high-order interaction through spatial-temporal object tracking |
CN112560695A (en) * | 2020-12-17 | 2021-03-26 | 中国海洋大学 | Underwater target tracking method, system, storage medium, equipment, terminal and application |
CN113705588A (en) * | 2021-10-28 | 2021-11-26 | 南昌工程学院 | Twin network target tracking method and system based on convolution self-attention module |
CN115063445A (en) * | 2022-08-18 | 2022-09-16 | 南昌工程学院 | Target tracking method and system based on multi-scale hierarchical feature representation |
Non-Patent Citations (2)
Title |
---|
MING GAO et al.: "A novel visual tracking ConvNet for autonomous vehicles", IEEE Transactions on Intelligent Transportation Systems * |
WANG Jun et al.: "A survey of object tracking algorithms based on Siamese neural networks" (in Chinese), Wanfang Platform * |
Also Published As
Publication number | Publication date |
---|---|
CN116030097B (en) | 2023-05-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN116030097B (en) | Target tracking method and system based on dual-attention feature fusion network | |
CN110097568B (en) | Video object detection and segmentation method based on space-time dual-branch network | |
Liu et al. | Abcnet v2: Adaptive bezier-curve network for real-time end-to-end text spotting | |
Zhao et al. | Object detection with deep learning: A review | |
Sirohi et al. | Efficientlps: Efficient lidar panoptic segmentation | |
CN113705588B (en) | Twin network target tracking method and system based on convolution self-attention module | |
CN110070074B (en) | Method for constructing pedestrian detection model | |
Zhao et al. | Transformer3D-Det: Improving 3D object detection by vote refinement | |
CN113673425B (en) | Multi-view target detection method and system based on Transformer | |
CN110781262B (en) | Semantic map construction method based on visual SLAM | |
CN109493364A (en) | A kind of target tracking algorism of combination residual error attention and contextual information | |
WO2023154320A1 (en) | Thermal anomaly identification on building envelopes as well as image classification and object detection | |
CN113177549B (en) | Few-sample target detection method and system based on dynamic prototype feature fusion | |
CN116109678B (en) | Method and system for tracking target based on context self-attention learning depth network | |
CN113628244A (en) | Target tracking method, system, terminal and medium based on label-free video training | |
He et al. | Learning scene dynamics from point cloud sequences | |
CN115908908A (en) | Remote sensing image gathering type target identification method and device based on graph attention network | |
CN112418235A (en) | Point cloud semantic segmentation method based on expansion nearest neighbor feature enhancement | |
Zhao et al. | Similarity-aware fusion network for 3d semantic segmentation | |
Huang et al. | Task-wise sampling convolutions for arbitrary-oriented object detection in aerial images | |
Kalash et al. | Relative saliency and ranking: Models, metrics, data and benchmarks | |
Gao et al. | Spatio-temporal contextual learning for single object tracking on point clouds | |
Tian et al. | TSRN: two-stage refinement network for temporal action segmentation | |
CN112668543B (en) | Isolated word sign language recognition method based on hand model perception | |
Tong et al. | DKD–DAD: a novel framework with discriminative kinematic descriptor and deep attention-pooled descriptor for action recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||