CN116030097A - Target tracking method and system based on dual-attention feature fusion network - Google Patents

Target tracking method and system based on dual-attention feature fusion network

Info

Publication number
CN116030097A
CN116030097A (application CN202310172562.3A)
Authority
CN
China
Prior art keywords
attention
frame
features
target
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310172562.3A
Other languages
Chinese (zh)
Other versions
CN116030097B (en
Inventor
王军
赖昌旺
王员云
秦永
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang Institute of Technology
Original Assignee
Nanchang Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang Institute of Technology filed Critical Nanchang Institute of Technology
Priority to CN202310172562.3A priority Critical patent/CN116030097B/en
Publication of CN116030097A publication Critical patent/CN116030097A/en
Application granted granted Critical
Publication of CN116030097B publication Critical patent/CN116030097B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a target tracking method and system based on a dual-attention feature fusion network. The method comprises the following steps: constructing a Transformer-based multi-scale feature fusion network; learning the features in the template feature map through the encoder to obtain high-confidence-value target suggestion frames; inputting the target suggestion frames into a decoder and fusing them with the learned search-area features to obtain the target suggestion frame with the highest confidence value; quickly focusing attention on the region of interest, capturing structured spatial information and local information, and exploring global context information by using the structured spatial information in the encoder; and sending the features obtained by fusing the template features with the search-area features to a prediction head to obtain the maximum response position of the tracking target in the search area for tracking. The invention enables the tracker to cope well with difficulties such as severe occlusion, scale change and complex background during tracking, and achieves more accurate and robust tracking.

Description

Target tracking method and system based on dual-attention feature fusion network
Technical Field
The invention relates to the technical field of computer vision and image processing, in particular to a target tracking method and system based on a dual-attention feature fusion network.
Background
Video tracking is an important computer vision task with wide applications in automatic driving, visual positioning, video monitoring, pedestrian tracking and the like. The purpose of video tracking is to predict the location of an object of interest in subsequent frames, given its initialization in the first frame. Video tracking remains a very challenging task because of limited training data and the many challenges posed by real-world scenes, such as occlusion, deformation, complex background and scale changes.
At present, twin-network trackers based on convolutional neural networks are widely used in the field of visual tracking. A convolutional neural network extracts the features of a target through a specific network and then performs subsequent processing on the target, such as classification and detection, according to the extracted features. In twin-network-based trackers, the use of convolutional neural networks greatly improves tracker performance. Furthermore, context information is critical for many computer vision tasks, such as object tracking. The Transformer can explore the rich context information between successive frames by using the attention in its encoder-decoder architecture, thus achieving better tracking performance.
However, in tracking algorithms that use the Transformer structure, every point needs to capture global context information, so much key local information is inevitably lost, which greatly affects tracker performance. Therefore, how to efficiently explore the context information between consecutive frames without losing a large amount of useful local information becomes a key factor in improving tracker performance.
Disclosure of Invention
In view of the above situation, a main object of the present invention is to address the problem that some visual tracking algorithms in the prior art ignore the close relation between global context information and local information, so that a large amount of local information is lost, and that the use of self-attention introduces a large amount of redundant computation, which makes it difficult to handle complex appearance changes, occlusion and the like.
The embodiment of the invention provides a target tracking method based on a dual-attention feature fusion network, wherein the method comprises the following steps of:
step one, initializing convolution:
under the twin network framework, initializing the template branch image of the first frame and the search area image of a subsequent search frame, and obtaining the template image features and the search area features respectively through a four-layer deep convolutional neural network;
step two, feature learning:
constructing a Transformer-based multi-scale feature fusion network through frame attention and example attention;
learning the template image features through a frame-attention-based Transformer encoder to obtain multi-scale high-confidence-value target suggestion frames;
inputting the multi-scale high-confidence-value target suggestion frames into an example-attention-based Transformer decoder, simultaneously learning the features of the search region, and fusing the learned template image features with the learned search region features to obtain the target suggestion frame with the highest confidence value;
step three, network training:
training the multi-scale feature fusion network based on the Transformer by utilizing a large-scale data set, and adjusting model parameters in the multi-scale feature fusion network model based on the Transformer;
step four, learning and aggregation:
the trained multi-scale feature fusion network based on the Transformer is utilized to learn local areas of target features on the template branch image and target features on the search area image so as to respectively obtain corresponding local semantic information, and then the local semantic information is respectively aggregated through a multi-head frame attention module and a multi-head example attention module so as to obtain global context information;
step five, calculating a target frame:
performing geometric transformation through a predefined reference window by utilizing the Transformer encoder in the Transformer-based multi-scale feature fusion network to generate frames of interest, thereby capturing target suggestion frames containing multi-scale high confidence values, and refining these target suggestion frames by utilizing the Transformer decoder to obtain the candidate frame with the maximum confidence score; the frame attention samples a grid in each candidate frame and calculates the attention weight of the sampled features within the grid features;
step six, target tracking:
and sending the features obtained by fusing the template image features and the search area features to a classification regression prediction head to obtain the maximum response position of the tracking target in the search area, so as to track.
The invention also provides a target tracking system based on the dual-attention feature fusion network, wherein the system executes the target tracking method based on the dual-attention feature fusion network, and the system comprises the following steps:
an initialization convolution module for:
under the twin network framework, initializing the template branch image of the first frame and the search area image of a subsequent search frame, and obtaining the template image features and the search area features respectively through a four-layer deep convolutional neural network;
the feature learning module is used for:
constructing a Transformer-based multi-scale feature fusion network through frame attention and example attention;
learning the template image features through a frame-attention-based Transformer encoder to obtain multi-scale high-confidence-value target suggestion frames;
inputting the multi-scale high-confidence-value target suggestion frames into an example-attention-based Transformer decoder, simultaneously learning the features of the search region, and fusing the learned template image features with the learned search region features to obtain the target suggestion frame with the highest confidence value;
the network training module is used for:
training the multi-scale feature fusion network based on the Transformer by utilizing a large-scale data set, and adjusting model parameters in the multi-scale feature fusion network model based on the Transformer;
a learning aggregation module for:
the trained multi-scale feature fusion network based on the Transformer is utilized to learn local areas of target features on the template branch image and target features on the search area image so as to respectively obtain corresponding local semantic information, and then the local semantic information is respectively aggregated through a multi-head frame attention module and a multi-head example attention module so as to obtain global context information;
the target frame calculation module is used for:
performing geometric transformation through a predefined reference window by utilizing the Transformer encoder in the Transformer-based multi-scale feature fusion network to generate frames of interest, thereby capturing target suggestion frames containing multi-scale high confidence values, and refining these target suggestion frames by utilizing the Transformer decoder to obtain the candidate frame with the maximum confidence score; the frame attention samples a grid in each candidate frame and calculates the attention weight of the sampled features within the grid features;
a target tracking module for:
and sending the features obtained by fusing the template image features and the search area features to a classification regression prediction head to obtain the maximum response position of the tracking target in the search area, so as to track.
The invention provides a target tracking method and system based on a dual-attention feature fusion network. The method comprises: constructing a Transformer-based multi-scale feature fusion network through frame attention and example attention under a twin network framework; learning the features in the template feature map by a frame-attention-based encoder to obtain multi-scale high-confidence-value target suggestion frames; inputting the target suggestion frames from the encoder into an example-attention-based decoder, simultaneously learning the features in the search area feature map, and fusing the template features with the search area features to obtain the target suggestion frame with the highest confidence value; pre-training the Transformer-based multi-scale feature fusion network, and using the trained network so that the encoder quickly focuses attention on the region of interest and captures a large amount of structured spatial information and local information, while the decoder explores global context information by using the structured spatial information in the encoder; and sending the features obtained by fusing the template features and the search area features to a prediction head to obtain the maximum response position of the tracking target in the search area, so as to track the target. The method fully combines the advantages of frame attention and example attention to construct the Transformer-based multi-scale feature fusion network, so that the tracker can cope well with difficulties such as severe occlusion, scale change and complex background during tracking, and achieves more accurate and robust tracking.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a flow chart of a target tracking method based on a dual attention feature fusion network according to the present invention;
FIG. 2 is a schematic block diagram of a target tracking method based on a dual attention feature fusion network according to the present invention;
fig. 3 is a schematic structural diagram of a target tracking system based on a dual attention feature fusion network according to the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
These and other aspects of embodiments of the invention will be apparent from and elucidated with reference to the description and drawings described hereinafter. In the description and drawings, particular implementations of embodiments of the invention are disclosed in detail as being indicative of some of the ways in which the principles of embodiments of the invention may be employed, but it is understood that the scope of the embodiments of the invention is not limited correspondingly. On the contrary, the embodiments of the invention include all alternatives, modifications and equivalents as may be included within the spirit and scope of the appended claims.
Referring to fig. 1, the present invention provides a target tracking method based on a dual-attention feature fusion network, wherein the method includes the following steps:
s101, initializing convolution:
under the twin network framework, initializing the template branch image of the first frame and the search area image of a subsequent search frame, and obtaining the template image features and the search area features respectively through a four-layer deep convolutional neural network.
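For illustration, the following is a minimal PyTorch-style sketch of a four-layer convolutional backbone shared by the template branch and the search branch in step S101; the channel widths, strides and input sizes are assumptions made for this example, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class FourLayerBackbone(nn.Module):
    """Illustrative four-layer convolutional backbone shared by both branches."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        chs = [in_ch, base, base * 2, base * 4, base * 4]   # assumed channel widths
        layers = []
        for i in range(4):
            layers += [nn.Conv2d(chs[i], chs[i + 1], kernel_size=3,
                                 stride=2 if i < 3 else 1, padding=1),
                       nn.BatchNorm2d(chs[i + 1]),
                       nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)

backbone = FourLayerBackbone()
template = torch.randn(1, 3, 127, 127)    # first-frame template branch image
search = torch.randn(1, 3, 255, 255)      # search-area image of a later frame
template_feat = backbone(template)        # template image features
search_feat = backbone(search)            # search-area features
```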
S102, feature learning:
constructing a Transformer-based multi-scale feature fusion network through frame attention and example attention;
learning the template image features through a frame-attention-based Transformer encoder to obtain multi-scale high-confidence-value target suggestion frames;
inputting the multi-scale high-confidence-value target suggestion frames into an example-attention-based Transformer decoder, simultaneously learning the features of the search region, and fusing the learned template image features with the learned search region features to obtain the target suggestion frame with the highest confidence value.
In this step, the calculation formula of the frame attention is expressed as:
$$\mathrm{FrameAttn}_i(q,k,v)=\mathrm{softmax}\!\left(\frac{qk^{T}}{\sqrt{d}}\right)v$$
where FrameAttn_i denotes the frame attention of the i-th head, softmax(·) denotes the normalization function, the superscript T denotes the transpose operation, q denotes the query vector, k denotes the key vector, v denotes the value vector, q ∈ R^{HW×d}, k ∈ R^{G²×d}, v ∈ R^{G²×d}, R denotes the real number set, HW denotes the height times the width of the input feature map, G denotes the side length of the grid feature map, C denotes the number of channels, and d denotes the feature dimension of a single head.
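As a reading aid for the formula above, the sketch below computes one frame-attention head as scaled dot-product attention of HW queries over G×G grid keys and values; it illustrates the reconstructed formula under the assumed shapes, not the patent's actual implementation.

```python
import torch

def frame_attention_head(q, k, v):
    """q: (HW, d) queries; k, v: (G*G, d) grid keys/values; returns (HW, d)."""
    d = q.shape[-1]
    scores = q @ k.transpose(0, 1) / d ** 0.5      # (HW, G*G) frame attention coefficients
    weights = torch.softmax(scores, dim=-1)         # normalization function
    return weights @ v

HW, G, d = 16 * 16, 7, 64
out = frame_attention_head(torch.randn(HW, d), torch.randn(G * G, d), torch.randn(G * G, d))
print(out.shape)  # torch.Size([256, 64])
```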
S103, network training:
training the multi-scale feature fusion network based on the Transformer by utilizing a large-scale data set, and adjusting model parameters in the multi-scale feature fusion network model based on the Transformer.
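Step S103 only states that the network is trained offline on a large-scale data set; the loop below is a hedged sketch of what such training typically looks like, where the data loader interface, the loss terms and the optimizer settings are assumptions rather than details given by the patent.

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, loader, optimizer):
    # Assumed loader yielding (template, search, cls_label, box_target) tuples
    # sampled from a large-scale tracking data set.
    model.train()
    for template, search, cls_label, box_target in loader:
        cls_logits, box_pred = model(template, search)        # assumed model interface
        loss = F.cross_entropy(cls_logits, cls_label) + F.l1_loss(box_pred, box_target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Example setup (hypothetical model and loader):
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
# for epoch in range(num_epochs):
#     train_one_epoch(model, train_loader, optimizer)
```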
S104, learning and aggregation:
and learning the local areas of the target features on the template branch image and the target features on the search area image by using the trained multi-scale feature fusion network based on the Transformer so as to respectively obtain corresponding local semantic information, and then respectively aggregating the local semantic information through a multi-head frame attention module and a multi-head example attention module so as to obtain global context information.
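The aggregation in step S104 can be pictured as the usual multi-head pattern: each head attends to a local region, and the per-head outputs are concatenated and linearly projected into one global representation. The sketch below assumes that standard interface and is not taken from the patent.

```python
import torch
import torch.nn as nn

class MultiHeadAggregate(nn.Module):
    """Concatenate per-head local outputs and project them to a global context vector."""
    def __init__(self, num_heads, head_dim, model_dim):
        super().__init__()
        self.proj = nn.Linear(num_heads * head_dim, model_dim)

    def forward(self, head_outputs):
        # head_outputs: list of (HW, head_dim) tensors, one per attention head
        return self.proj(torch.cat(head_outputs, dim=-1))     # (HW, model_dim)

agg = MultiHeadAggregate(num_heads=8, head_dim=32, model_dim=256)
global_ctx = agg([torch.randn(256, 32) for _ in range(8)])
print(global_ctx.shape)  # torch.Size([256, 256])
```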
S105, calculating a target frame:
performing geometric transformation through a predefined reference window by utilizing the Transformer encoder in the Transformer-based multi-scale feature fusion network to generate frames of interest, thereby capturing target suggestion frames containing multi-scale high confidence values, and refining these target suggestion frames by utilizing the Transformer decoder to obtain the candidate frame with the maximum confidence score.
Wherein the frame attention samples a grid in each candidate frame and calculates the attention weight of the sampled feature in the grid feature.
In particular, the generation principle of the multi-head frame attention module and the multi-head example attention module of the present invention is shown in fig. 2. In this embodiment, in the multi-head frame attention module and the multi-head example attention module, the frame attention of the i-th attention head is calculated by the following steps:
s1051, give an investigation ofPolling vector
Figure SMS_18
Is->
Figure SMS_19
Use bilinear interpolation from the block of interest +.>
Figure SMS_20
The extract has a size of +.>
Figure SMS_21
Grid feature map->
Figure SMS_22
。/>
It should be noted that the above query vector
Figure SMS_23
Is inquiry->
Figure SMS_24
Obtained through linear projection transformation.
S1052, the position attention module is used to transform the grid feature map t into a region of interest, so that the region of interest adapts to changes in the appearance of the target;
S1053, the frame attention coefficients are generated by computing the matrix multiplication between the query vector q and the key vector k;
S1054, the frame attention coefficients are normalized with the softmax function to obtain the similarity score S between the query vector q and the key vector k; the final frame attention FrameAttn_i is obtained by multiplying the similarity score S with the linear transformation matrix W of the grid feature map t.
To supplement, the grid feature map t satisfies t ∈ R^{G×G×d}, and the frame attention FrameAttn_i of a single head satisfies FrameAttn_i ∈ R^{HW×d}.
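The bilinear sampling in step S1051 can be written with torch.nn.functional.grid_sample: a G×G grid of points is laid over the box of interest and the feature map is sampled at those points. This is a hedged sketch under assumed normalized box coordinates, not the patent's code.

```python
import torch
import torch.nn.functional as F

def sample_grid_from_box(feat, box, G=7):
    """feat: (1, C, H, W) feature map; box: (cx, cy, w, h) in normalized [0, 1] coords."""
    cx, cy, w, h = box
    xs = torch.linspace(cx - w / 2, cx + w / 2, G)
    ys = torch.linspace(cy - h / 2, cy + h / 2, G)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack((gx, gy), dim=-1).unsqueeze(0) * 2 - 1   # map [0,1] -> [-1,1]
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)  # (1, C, G, G)

feat = torch.randn(1, 256, 32, 32)
grid_feat = sample_grid_from_box(feat, (0.5, 0.5, 0.3, 0.4))
print(grid_feat.shape)  # torch.Size([1, 256, 7, 7])
```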
In calculating the frame attention, the position attention module can focus the frame attention on the necessary regions, thereby predicting the high-confidence-value target suggestion frame more accurately. The position attention module transforms the reference window of the query vector q into a region of interest through geometric transformations of translation and scaling, so that spatial information can be used to predict the target suggestion frame in the grid feature map. The position attention module is used as follows:
the reference window of the query vector q is denoted b = (x, y, w, h), where x and y respectively represent the abscissa and the ordinate of the center position of the reference window, and w and h respectively represent the width and the height of the reference window;
a first transfer function f_t is used to convert the reference window b; the first transfer function f_t takes the query vector q and the reference window b as input and is used to move the center position of the reference window;
a second transfer function f_s is used to adjust the reference window b; the second transfer function f_s takes the query vector q and the reference window b as input and is used to adjust the size of the reference window.
In the present embodiment, the calculation formula corresponding to the first transfer function f_t is expressed as:
$$f_t(q,b)=(x+\Delta x,\ y+\Delta y,\ w,\ h)$$
where Δx represents the abscissa offset of the center position of the reference window b and Δy represents the ordinate offset of the center position of the reference window b;
the calculation formula corresponding to the second transfer function f_s is expressed as:
$$f_s(q,b)=(x,\ y,\ w+\Delta w,\ h+\Delta h)$$
where Δw represents the width offset of the reference window b and Δh represents the height offset of the reference window b.
Furthermore, the offset parameters Δx, Δy, Δw, Δh are obtained by linear projection of the query vector q, and the corresponding calculation formulas are expressed as:
$$\Delta x=\frac{qW_x+B_x}{\tau},\quad \Delta y=\frac{qW_y+B_y}{\tau},\quad \Delta w=\frac{qW_w+B_w}{\tau},\quad \Delta h=\frac{qW_h+B_h}{\tau}$$
where W_x represents the linear projection parameter of the abscissa x, W_y represents the linear projection parameter of the ordinate y, W_w represents the linear projection parameter of the reference window width, W_h represents the linear projection parameter of the reference window height, B_x represents the bias of the linear projection of the abscissa x, B_y represents the bias of the linear projection of the ordinate y, B_w represents the linear projection bias of the reference window width, B_h represents the linear projection bias of the reference window height, and τ represents a temperature parameter.
The converted reference window is determined jointly by the first transfer function f_t and the second transfer function f_s, and the corresponding calculation formula is expressed as:
$$\hat{b}=F\big(f_t(q,b),\,f_s(q,b)\big)$$
where \hat{b} represents the converted reference window and F(·) represents the transfer function operation.
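The position-attention geometry above can be sketched as a small module: the four offsets are produced by a linear projection of the query divided by the temperature, the first transfer step moves the window center, and the second adjusts its size. Because the patent's offset formulas appear only as images in the source, additive offsets are assumed here to match the reconstruction above; this is one plausible reading, not the definitive formula.

```python
import torch
import torch.nn as nn

class WindowTransform(nn.Module):
    """Predict (dx, dy, dw, dh) from the query and apply them to a reference window."""
    def __init__(self, d_model, tau=2.0):
        super().__init__()
        self.offset = nn.Linear(d_model, 4)   # linear projection of the query vector
        self.tau = tau                        # temperature parameter

    def forward(self, q, window):
        # q: (N, d_model) query vectors; window: (N, 4) as (x, y, w, h)
        dx, dy, dw, dh = (self.offset(q) / self.tau).unbind(dim=-1)
        x, y, w, h = window.unbind(dim=-1)
        x, y = x + dx, y + dy                 # first transfer function: move the center
        w, h = w + dw, h + dh                 # second transfer function: adjust the size
        return torch.stack((x, y, w, h), dim=-1)

wt = WindowTransform(d_model=256)
new_windows = wt(torch.randn(10, 256), torch.rand(10, 4))
print(new_windows.shape)  # torch.Size([10, 4])
```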
In the present invention, the operation method of the multi-head example attention module comprises the following steps:
generating multi-scale high-confidence-value target suggestion frames by using the frame attention in the frame-attention-based Transformer encoder and sending them to the Transformer decoder;
refining the high-confidence-value target suggestion frames by utilizing the example attention in the Transformer decoder, wherein each decoder layer in the Transformer decoder contains example attention, and an instance normalization layer with a residual structure is added after each forward propagation layer;
according to the high-confidence-value target suggestion frames in the Transformer decoder, the example attention takes the grid features within the high-confidence-value target suggestion frames as input, so as to obtain the target suggestion frame with the highest confidence value.
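A hedged sketch of one decoder layer as described in the list above: attention from the proposal queries over the grid features sampled inside the proposal frames, followed by a forward-propagation (feed-forward) layer whose output passes through an instance-normalization layer wired as a residual branch. Standard multi-head attention stands in for the example attention here, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class ExampleAttentionDecoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=8, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.inst_norm = nn.InstanceNorm1d(d_model)

    def forward(self, proposal_queries, grid_feats):
        # proposal_queries: (B, N, d); grid_feats: (B, N * G * G, d) sampled in the proposal frames
        attn_out, _ = self.attn(proposal_queries, grid_feats, grid_feats)
        x = proposal_queries + attn_out                        # attention with residual
        y = self.ffn(x)                                        # forward-propagation layer
        y = self.inst_norm(y.transpose(1, 2)).transpose(1, 2)  # instance normalization
        return x + y                                           # residual structure

layer = ExampleAttentionDecoderLayer()
out = layer(torch.randn(2, 10, 256), torch.randn(2, 10 * 49, 256))
print(out.shape)  # torch.Size([2, 10, 256])
```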
S106, target tracking:
and sending the features obtained by fusing the template image features and the search area features to a classification regression prediction head to obtain the maximum response position of the tracking target in the search area, so as to track.
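The classification-regression prediction head of step S106 can be sketched as two small convolutional branches on the fused feature map: one produces the response (classification) map whose maximum gives the tracking target's position in the search area, the other regresses the box; the layer sizes and the box parameterization are assumptions made for this example.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    def __init__(self, in_ch=256):
        super().__init__()
        self.cls = nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(in_ch, 1, 1))   # response / classification map
        self.reg = nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(in_ch, 4, 1))   # box regression (assumed l, t, r, b)

    def forward(self, fused):
        score = self.cls(fused)                             # (B, 1, H, W)
        boxes = self.reg(fused)                             # (B, 4, H, W)
        max_idx = score.flatten(2).argmax(dim=-1)           # index of the maximum response
        return score, boxes, max_idx

head = PredictionHead()
score, boxes, max_idx = head(torch.randn(1, 256, 16, 16))
print(score.shape, boxes.shape)  # torch.Size([1, 1, 16, 16]) torch.Size([1, 4, 16, 16])
```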
Referring to fig. 3, the present invention further provides a target tracking system based on a dual-attention feature fusion network, wherein the system executes the target tracking method based on the dual-attention feature fusion network, the system includes:
an initialization convolution module for:
under the twin network framework, initializing the template branch image of the first frame and the search area image of a subsequent search frame, and obtaining the template image features and the search area features respectively through a four-layer deep convolutional neural network;
the feature learning module is used for:
constructing a Transformer-based multi-scale feature fusion network through frame attention and example attention;
learning the template image features through a frame-attention-based Transformer encoder to obtain multi-scale high-confidence-value target suggestion frames;
inputting the multi-scale high-confidence-value target suggestion frames into an example-attention-based Transformer decoder, simultaneously learning the features of the search region, and fusing the learned template image features with the learned search region features to obtain the target suggestion frame with the highest confidence value;
the network training module is used for:
training the multi-scale feature fusion network based on the Transformer by utilizing a large-scale data set, and adjusting model parameters in the multi-scale feature fusion network model based on the Transformer;
a learning aggregation module for:
the trained multi-scale feature fusion network based on the Transformer is utilized to learn local areas of target features on the template branch image and target features on the search area image so as to respectively obtain corresponding local semantic information, and then the local semantic information is respectively aggregated through a multi-head frame attention module and a multi-head example attention module so as to obtain global context information;
the target frame calculation module is used for:
performing geometric transformation through a predefined reference window by utilizing the Transformer encoder in the Transformer-based multi-scale feature fusion network to generate frames of interest, thereby capturing target suggestion frames containing multi-scale high confidence values, and refining these target suggestion frames by utilizing the Transformer decoder to obtain the candidate frame with the maximum confidence score; the frame attention samples a grid in each candidate frame and calculates the attention weight of the sampled features within the grid features;
a target tracking module for:
and sending the features obtained by fusing the template image features and the search area features to a classification regression prediction head to obtain the maximum response position of the tracking target in the search area, so as to track.
The invention provides a target tracking method and system based on a dual-attention feature fusion network. The method comprises: constructing a Transformer-based multi-scale feature fusion network through frame attention and example attention under a twin network framework; learning the features in the template feature map by a frame-attention-based encoder to obtain multi-scale high-confidence-value target suggestion frames; inputting the target suggestion frames from the encoder into an example-attention-based decoder, simultaneously learning the features in the search area feature map, and fusing the template features with the search area features to obtain the target suggestion frame with the highest confidence value; pre-training the Transformer-based multi-scale feature fusion network, and using the trained network so that the encoder quickly focuses attention on the region of interest and captures a large amount of structured spatial information and local information, while the decoder explores global context information by using the structured spatial information in the encoder; and sending the features obtained by fusing the template features and the search area features to a prediction head to obtain the maximum response position of the tracking target in the search area, so as to track the target. The method fully combines the advantages of frame attention and example attention to construct the Transformer-based multi-scale feature fusion network, so that the tracker can cope well with difficulties such as severe occlusion, scale change and complex background during tracking, and achieves more accurate and robust tracking.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, the steps may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, Programmable Gate Arrays (PGAs), Field Programmable Gate Arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing examples illustrate only a few embodiments of the invention and are described in detail herein without thereby limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (10)

1. A method for tracking a target based on a dual attention feature fusion network, the method comprising the steps of:
step one, initializing convolution:
under the twin network framework, initializing the template branch image of the first frame and the search area image of a subsequent search frame, and obtaining the template image features and the search area features respectively through a four-layer deep convolutional neural network;
step two, feature learning:
constructing a Transformer-based multi-scale feature fusion network through frame attention and example attention;
learning the template image features through a frame-attention-based Transformer encoder to obtain multi-scale high-confidence-value target suggestion frames;
inputting the multi-scale high-confidence-value target suggestion frames into an example-attention-based Transformer decoder, simultaneously learning the features of the search region, and fusing the learned template image features with the learned search region features to obtain the target suggestion frame with the highest confidence value;
step three, network training:
training the multi-scale feature fusion network based on the Transformer by utilizing a large-scale data set, and adjusting model parameters in the multi-scale feature fusion network model based on the Transformer;
step four, learning and aggregation:
the trained multi-scale feature fusion network based on the Transformer is utilized to learn local areas of target features on the template branch image and target features on the search area image so as to respectively obtain corresponding local semantic information, and then the local semantic information is respectively aggregated through a multi-head frame attention module and a multi-head example attention module so as to obtain global context information;
step five, calculating a target frame:
performing geometric transformation through a predefined reference window by utilizing the Transformer encoder in the Transformer-based multi-scale feature fusion network to generate frames of interest, thereby capturing target suggestion frames containing multi-scale high confidence values, and refining these target suggestion frames by utilizing the Transformer decoder to obtain the candidate frame with the maximum confidence score; the frame attention samples a grid in each candidate frame and calculates the attention weight of the sampled features within the grid features;
step six, target tracking:
and sending the features obtained by fusing the template image features and the search area features to a classification regression prediction head to obtain the maximum response position of the tracking target in the search area, so as to track.
2. The method of claim 1, wherein in the second step, the calculation formula of the frame attention is expressed as:
$$\mathrm{FrameAttn}_i(q,k,v)=\mathrm{softmax}\!\left(\frac{qk^{T}}{\sqrt{d}}\right)v$$
where FrameAttn_i denotes the frame attention of the i-th head, softmax(·) denotes the normalization function, the superscript T denotes the transpose operation, q denotes the query vector, k denotes the key vector, v denotes the value vector, q ∈ R^{HW×d}, k ∈ R^{G²×d}, v ∈ R^{G²×d}, R denotes the real number set, HW denotes the height times the width of the input feature map, G denotes the side length of the grid feature map, C denotes the number of channels, and d denotes the feature dimension of a single head.
3. The dual attention feature fusion network based object tracking method of claim 2, wherein in the multi-head frame attention module, the frame attention of the i-th attention head is calculated by the following steps:
given the query vector q and its box of interest, extracting a grid feature map t of size G×G from the box of interest by bilinear interpolation;
transforming the grid feature map t into a region of interest with the position attention module, so that the region of interest adapts to changes in the appearance of the target;
generating the frame attention coefficients by computing the matrix multiplication between the query vector q and the key vector k;
normalizing the frame attention coefficients with the softmax function to obtain the similarity score S between the query vector q and the key vector k, and obtaining the final frame attention FrameAttn_i by multiplying the similarity score S with the linear transformation matrix W of the grid feature map t.
4. A dual attention feature fusion network based object tracking method as defined in claim 3, in which the grid feature map t satisfies t ∈ R^{G×G×d}, and the frame attention FrameAttn_i of a single head satisfies FrameAttn_i ∈ R^{HW×d}.
5. the method for tracking an object based on a dual attention feature fusion network of claim 4, wherein the method for using the location attention module when calculating the frame attention comprises the steps of:
by means of
Figure QLYQS_39
Representing query vector +.>
Figure QLYQS_40
Reference window of->
Figure QLYQS_41
, wherein ,/>
Figure QLYQS_42
Respectively representing the abscissa and the ordinate of the central position of the reference window, ">
Figure QLYQS_43
The width and the height of the reference window are respectively represented;
using a first transfer function
Figure QLYQS_44
For reference window->
Figure QLYQS_45
Performing a conversion, a first conversion function->
Figure QLYQS_46
Query vector +.>
Figure QLYQS_47
And reference Window->
Figure QLYQS_48
As an input, for moving the center position of the reference window;
using a second transfer function
Figure QLYQS_49
For reference window->
Figure QLYQS_50
Make adjustments, second transfer function->
Figure QLYQS_51
Query vector +.>
Figure QLYQS_52
And reference Window->
Figure QLYQS_53
As an input for adjusting the size of the reference window.
6. The dual attention feature fusion network based object tracking method of claim 5, wherein the calculation formula corresponding to the first transfer function f_t is expressed as:
$$f_t(q,b)=(x+\Delta x,\ y+\Delta y,\ w,\ h)$$
where Δx represents the abscissa offset of the center position of the reference window b and Δy represents the ordinate offset of the center position of the reference window b;
the calculation formula corresponding to the second transfer function f_s is expressed as:
$$f_s(q,b)=(x,\ y,\ w+\Delta w,\ h+\Delta h)$$
where Δw represents the width offset of the reference window b and Δh represents the height offset of the reference window b.
7. The dual attention feature fusion network based target tracking method of claim 6, wherein the offset parameters Δx, Δy, Δw, Δh are obtained by linear projection of the query vector q, and the corresponding calculation formulas are expressed as:
$$\Delta x=\frac{qW_x+B_x}{\tau},\quad \Delta y=\frac{qW_y+B_y}{\tau},\quad \Delta w=\frac{qW_w+B_w}{\tau},\quad \Delta h=\frac{qW_h+B_h}{\tau}$$
where W_x represents the linear projection parameter of the abscissa x, W_y represents the linear projection parameter of the ordinate y, W_w represents the linear projection parameter of the reference window width, W_h represents the linear projection parameter of the reference window height, B_x represents the bias of the linear projection of the abscissa x, B_y represents the bias of the linear projection of the ordinate y, B_w represents the linear projection bias of the reference window width, B_h represents the linear projection bias of the reference window height, and τ represents a temperature parameter.
8. The dual attention feature fusion network based target tracking method of claim 7, wherein the converted reference window is determined jointly by the first transfer function f_t and the second transfer function f_s, and the corresponding calculation formula is expressed as:
$$\hat{b}=F\big(f_t(q,b),\,f_s(q,b)\big)$$
where \hat{b} represents the converted reference window and F(·) represents the transfer function operation.
9. The method for tracking an object based on a dual attention feature fusion network of claim 8, wherein the method for operating the multi-head example attention module comprises the following steps:
generating multi-scale high-confidence-value target suggestion frames by using the frame attention in the frame-attention-based Transformer encoder and sending them to the Transformer decoder;
refining the high-confidence-value target suggestion frames by utilizing the example attention in the Transformer decoder, wherein each decoder layer in the Transformer decoder contains example attention, and an instance normalization layer with a residual structure is added after each forward propagation layer;
according to the high-confidence-value target suggestion frames in the Transformer decoder, the example attention takes the grid features within the high-confidence-value target suggestion frames as input, so as to obtain the target suggestion frame with the highest confidence value.
10. A dual attention feature fusion network based object tracking system, wherein the system performs the dual attention feature fusion network based object tracking method as claimed in any one of claims 1 to 9, the system comprising:
an initialization convolution module for:
under the twin network framework, initializing the template branch image of the first frame and the search area image of a subsequent search frame, and obtaining the template image features and the search area features respectively through a four-layer deep convolutional neural network;
the feature learning module is used for:
constructing a Transformer-based multi-scale feature fusion network through frame attention and example attention;
learning the template image features through a frame-attention-based Transformer encoder to obtain multi-scale high-confidence-value target suggestion frames;
inputting the multi-scale high-confidence-value target suggestion frames into an example-attention-based Transformer decoder, simultaneously learning the features of the search region, and fusing the learned template image features with the learned search region features to obtain the target suggestion frame with the highest confidence value;
the network training module is used for:
training the multi-scale feature fusion network based on the Transformer by utilizing a large-scale data set, and adjusting model parameters in the multi-scale feature fusion network model based on the Transformer;
a learning aggregation module for:
the trained multi-scale feature fusion network based on the Transformer is utilized to learn local areas of target features on the template branch image and target features on the search area image so as to respectively obtain corresponding local semantic information, and then the local semantic information is respectively aggregated through a multi-head frame attention module and a multi-head example attention module so as to obtain global context information;
the target frame calculation module is used for:
performing geometric transformation through a predefined reference window by utilizing the Transformer encoder in the Transformer-based multi-scale feature fusion network to generate frames of interest, thereby capturing target suggestion frames containing multi-scale high confidence values, and refining these target suggestion frames by utilizing the Transformer decoder to obtain the candidate frame with the maximum confidence score; the frame attention samples a grid in each candidate frame and calculates the attention weight of the sampled features within the grid features;
a target tracking module for:
and sending the features obtained by fusing the template image features and the search area features to a classification regression prediction head to obtain the maximum response position of the tracking target in the search area, so as to track.
CN202310172562.3A 2023-02-28 2023-02-28 Target tracking method and system based on dual-attention feature fusion network Active CN116030097B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310172562.3A CN116030097B (en) 2023-02-28 2023-02-28 Target tracking method and system based on dual-attention feature fusion network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310172562.3A CN116030097B (en) 2023-02-28 2023-02-28 Target tracking method and system based on dual-attention feature fusion network

Publications (2)

Publication Number Publication Date
CN116030097A (en) 2023-04-28
CN116030097B (en) 2023-05-30

Family

ID=86079674

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310172562.3A Active CN116030097B (en) 2023-02-28 2023-02-28 Target tracking method and system based on dual-attention feature fusion network

Country Status (1)

Country Link
CN (1) CN116030097B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116403006A (en) * 2023-06-07 2023-07-07 南京军拓信息科技有限公司 Real-time visual target tracking method, device and storage medium
CN116664624A (en) * 2023-06-01 2023-08-29 中国石油大学(华东) Target tracking method and tracker based on decoupling classification and regression characteristics
CN117649582A (en) * 2024-01-25 2024-03-05 南昌工程学院 Single-flow single-stage network target tracking method and system based on cascade attention

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210081673A1 (en) * 2019-09-12 2021-03-18 Nec Laboratories America, Inc Action recognition with high-order interaction through spatial-temporal object tracking
CN112560695A (en) * 2020-12-17 2021-03-26 中国海洋大学 Underwater target tracking method, system, storage medium, equipment, terminal and application
CN113705588A (en) * 2021-10-28 2021-11-26 南昌工程学院 Twin network target tracking method and system based on convolution self-attention module
CN115063445A (en) * 2022-08-18 2022-09-16 南昌工程学院 Target tracking method and system based on multi-scale hierarchical feature representation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MING GAO et al.: "A novel visual tracking convnet for autonomous vehicles", IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS *
王军 et al.: "基于孪生神经网络的目标跟踪算法综述" (A survey of target tracking algorithms based on Siamese neural networks), Wanfang Platform (万方平台) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116664624A (en) * 2023-06-01 2023-08-29 中国石油大学(华东) Target tracking method and tracker based on decoupling classification and regression characteristics
CN116664624B (en) * 2023-06-01 2023-10-27 中国石油大学(华东) Target tracking method and tracker based on decoupling classification and regression characteristics
CN116403006A (en) * 2023-06-07 2023-07-07 南京军拓信息科技有限公司 Real-time visual target tracking method, device and storage medium
CN116403006B (en) * 2023-06-07 2023-08-29 南京军拓信息科技有限公司 Real-time visual target tracking method, device and storage medium
CN117649582A (en) * 2024-01-25 2024-03-05 南昌工程学院 Single-flow single-stage network target tracking method and system based on cascade attention
CN117649582B (en) * 2024-01-25 2024-04-19 南昌工程学院 Single-flow single-stage network target tracking method and system based on cascade attention

Also Published As

Publication number Publication date
CN116030097B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
CN116030097B (en) Target tracking method and system based on dual-attention feature fusion network
CN110097568B (en) Video object detection and segmentation method based on space-time dual-branch network
Liu et al. Abcnet v2: Adaptive bezier-curve network for real-time end-to-end text spotting
Zhao et al. Object detection with deep learning: A review
Sirohi et al. Efficientlps: Efficient lidar panoptic segmentation
CN113705588B (en) Twin network target tracking method and system based on convolution self-attention module
CN110070074B (en) Method for constructing pedestrian detection model
Zhao et al. Transformer3D-Det: Improving 3D object detection by vote refinement
CN113673425B (en) Multi-view target detection method and system based on Transformer
CN110781262B (en) Semantic map construction method based on visual SLAM
CN109493364A (en) A kind of target tracking algorism of combination residual error attention and contextual information
WO2023154320A1 (en) Thermal anomaly identification on building envelopes as well as image classification and object detection
CN113177549B (en) Few-sample target detection method and system based on dynamic prototype feature fusion
CN116109678B (en) Method and system for tracking target based on context self-attention learning depth network
CN113628244A (en) Target tracking method, system, terminal and medium based on label-free video training
He et al. Learning scene dynamics from point cloud sequences
CN115908908A (en) Remote sensing image gathering type target identification method and device based on graph attention network
CN112418235A (en) Point cloud semantic segmentation method based on expansion nearest neighbor feature enhancement
Zhao et al. Similarity-aware fusion network for 3d semantic segmentation
Huang et al. Task-wise sampling convolutions for arbitrary-oriented object detection in aerial images
Kalash et al. Relative saliency and ranking: Models, metrics, data and benchmarks
Gao et al. Spatio-temporal contextual learning for single object tracking on point clouds
Tian et al. TSRN: two-stage refinement network for temporal action segmentation
CN112668543B (en) Isolated word sign language recognition method based on hand model perception
Tong et al. DKD–DAD: a novel framework with discriminative kinematic descriptor and deep attention-pooled descriptor for action recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant