CN116030097A - Target tracking method and system based on dual-attention feature fusion network - Google Patents

Target tracking method and system based on dual-attention feature fusion network

Info

Publication number
CN116030097A
CN116030097A (application CN202310172562.3A)
Authority
CN
China
Prior art keywords
attention
frame
features
target
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310172562.3A
Other languages
Chinese (zh)
Other versions
CN116030097B (en
Inventor
王军
赖昌旺
王员云
秦永
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang Institute of Technology
Original Assignee
Nanchang Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang Institute of Technology filed Critical Nanchang Institute of Technology
Priority to CN202310172562.3A priority Critical patent/CN116030097B/en
Publication of CN116030097A publication Critical patent/CN116030097A/en
Application granted granted Critical
Publication of CN116030097B publication Critical patent/CN116030097B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a target tracking method and system based on a dual-attention feature fusion network. The method comprises the following steps: constructing a Transformer-based multi-scale feature fusion network; learning the features in the template feature map through the encoder to obtain high-confidence-value target suggestion frames; inputting the target suggestion frames into a decoder and fusing them with the learned search-area features to obtain the target suggestion frame with the highest confidence value; quickly focusing attention on the region of interest, capturing structured spatial information and local information, and exploring global context information by using the structured spatial information in the encoder; and sending the features obtained by fusing the template features with the search-area features to a prediction head to obtain the maximum response position of the tracking target in the search area for tracking. The invention enables the tracker to cope well with difficulties such as severe occlusion, scale change and complex background during tracking, and achieves more accurate and robust tracking.

Description

Target tracking method and system based on dual-attention feature fusion network
Technical Field
The invention relates to the technical field of computer vision and image processing, in particular to a target tracking method and system based on a dual-attention feature fusion network.
Background
Video tracking is an important computer vision task with wide applications in automatic driving, visual positioning, video monitoring, pedestrian tracking and the like. The purpose of video tracking is to predict the location of an object of interest in subsequent frames, given its initialization in the first frame. Video tracking remains a very challenging task because of limited training data and the many challenges posed by real-world scenes, such as occlusion, deformation, complex background and scale changes.
At present, twin-network trackers based on convolutional neural networks are widely used in the field of visual tracking. A convolutional neural network extracts the features of a target through a specific network and then performs subsequent processing on the target, such as classification and detection, according to the extracted features. In twin-network-based trackers, the use of convolutional neural networks greatly improves tracker performance. Furthermore, context information is critical for many computer vision tasks, such as object tracking. The Transformer can explore the rich context information between successive frames by using the attention in its encoder-decoder architecture, thus achieving better tracking performance.
However, in tracking algorithms that use the Transformer structure, every point needs to capture global context information, so much key local information is inevitably lost, which greatly affects tracker performance. Therefore, how to efficiently explore the context information between consecutive frames without losing a large amount of useful local information becomes a key factor in improving tracker performance.
Disclosure of Invention
In view of the above situation, a main object of the present invention is to address the problem that some visual tracking algorithms in the prior art ignore the close relation between global context information and local information, so that a large amount of local information is lost, and that the use of self-attention introduces a large amount of redundant computation, which makes it difficult to handle complex appearance changes, occlusion and the like.
The embodiment of the invention provides a target tracking method based on a dual-attention feature fusion network, wherein the method comprises the following steps of:
step one, initializing convolution:
under the twin network framework, initializing the template branch image of the first frame and the search area image of a subsequent search frame, and obtaining the template image features and the search area features respectively through a four-layer deep convolutional neural network;
step two, feature learning:
constructing a Transformer-based multi-scale feature fusion network through frame attention and example attention;
learning the template image features through a frame-attention-based Transformer encoder to obtain multi-scale high-confidence-value target suggestion frames;
inputting the multi-scale high-confidence-value target suggestion frames into an example-attention-based Transformer decoder, simultaneously learning the features of the search region, and fusing the learned template image features with the learned search region features to obtain the target suggestion frame with the highest confidence value;
step three, network training:
training the multi-scale feature fusion network based on the Transformer by utilizing a large-scale data set, and adjusting model parameters in the multi-scale feature fusion network model based on the Transformer;
step four, learning and aggregation:
the trained multi-scale feature fusion network based on the Transformer is utilized to learn local areas of target features on the template branch image and target features on the search area image so as to respectively obtain corresponding local semantic information, and then the local semantic information is respectively aggregated through a multi-head frame attention module and a multi-head example attention module so as to obtain global context information;
step five, calculating a target frame:
performing geometric transformation through a predefined reference window by utilizing the Transformer encoder in the Transformer-based multi-scale feature fusion network to generate frames of interest, thereby capturing target suggestion frames containing multi-scale high confidence values, and refining these target suggestion frames by utilizing the Transformer decoder to obtain the candidate frame with the maximum confidence score; the frame attention samples a grid in each candidate frame and calculates the attention weight of the sampled features within the grid features;
step six, target tracking:
and sending the features obtained by fusing the template image features and the search area features to a classification regression prediction head to obtain the maximum response position of the tracking target in the search area, so as to track.
The invention also provides a target tracking system based on the dual-attention feature fusion network, wherein the system executes the target tracking method based on the dual-attention feature fusion network, and the system comprises the following steps:
an initialization convolution module for:
under the twin network framework, initializing the template branch image of the first frame and the search area image of a subsequent search frame, and obtaining the template image features and the search area features respectively through a four-layer deep convolutional neural network;
the feature learning module is used for:
constructing a Transformer-based multi-scale feature fusion network through frame attention and example attention;
learning the template image features through a frame-attention-based Transformer encoder to obtain multi-scale high-confidence-value target suggestion frames;
inputting the multi-scale high-confidence-value target suggestion frames into an example-attention-based Transformer decoder, simultaneously learning the features of the search region, and fusing the learned template image features with the learned search region features to obtain the target suggestion frame with the highest confidence value;
the network training module is used for:
training the multi-scale feature fusion network based on the Transformer by utilizing a large-scale data set, and adjusting model parameters in the multi-scale feature fusion network model based on the Transformer;
a learning aggregation module for:
the trained multi-scale feature fusion network based on the Transformer is utilized to learn local areas of target features on the template branch image and target features on the search area image so as to respectively obtain corresponding local semantic information, and then the local semantic information is respectively aggregated through a multi-head frame attention module and a multi-head example attention module so as to obtain global context information;
the target frame calculation module is used for:
performing geometric transformation through a predefined reference window by utilizing the Transformer encoder in the Transformer-based multi-scale feature fusion network to generate frames of interest, thereby capturing target suggestion frames containing multi-scale high confidence values, and refining these target suggestion frames by utilizing the Transformer decoder to obtain the candidate frame with the maximum confidence score; the frame attention samples a grid in each candidate frame and calculates the attention weight of the sampled features within the grid features;
a target tracking module for:
and sending the features obtained by fusing the template image features and the search area features to a classification regression prediction head to obtain the maximum response position of the tracking target in the search area, so as to track.
The invention provides a target tracking method and system based on a dual-attention feature fusion network. The method comprises: constructing a Transformer-based multi-scale feature fusion network through frame attention and example attention under a twin network framework; learning the features in the template feature map by a frame-attention-based encoder to obtain multi-scale high-confidence-value target suggestion frames; inputting the target suggestion frames from the encoder into an example-attention-based decoder, simultaneously learning the features in the search area feature map, and fusing the template features with the search area features to obtain the target suggestion frame with the highest confidence value; pre-training the Transformer-based multi-scale feature fusion network, and using the trained network so that the encoder quickly focuses attention on the region of interest and captures a large amount of structured spatial information and local information, while the decoder explores global context information by using the structured spatial information in the encoder; and sending the features obtained by fusing the template features and the search area features to a prediction head to obtain the maximum response position of the tracking target in the search area, so as to track the target. The method fully combines the advantages of frame attention and example attention to construct the Transformer-based multi-scale feature fusion network, so that the tracker can cope well with difficulties such as severe occlusion, scale change and complex background during tracking, and achieves more accurate and robust tracking.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a flow chart of a target tracking method based on a dual attention feature fusion network according to the present invention;
FIG. 2 is a schematic block diagram of a target tracking method based on a dual attention feature fusion network according to the present invention;
fig. 3 is a schematic structural diagram of a target tracking system based on a dual attention feature fusion network according to the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
These and other aspects of embodiments of the invention will be apparent from and elucidated with reference to the description and drawings described hereinafter. In the description and drawings, particular implementations of embodiments of the invention are disclosed in detail as being indicative of some of the ways in which the principles of embodiments of the invention may be employed, but it is understood that the scope of the embodiments of the invention is not limited correspondingly. On the contrary, the embodiments of the invention include all alternatives, modifications and equivalents as may be included within the spirit and scope of the appended claims.
Referring to fig. 1, the present invention provides a target tracking method based on a dual-attention feature fusion network, wherein the method includes the following steps:
s101, initializing convolution:
under the twin network framework, initializing the template branch image of the first frame and the search area image of a subsequent search frame, and obtaining the template image features and the search area features respectively through a four-layer deep convolutional neural network.
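For illustration, the following is a minimal PyTorch-style sketch of a four-layer convolutional backbone shared by the template branch and the search branch in step S101; the channel widths, strides and input sizes are assumptions made for this example, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class FourLayerBackbone(nn.Module):
    """Illustrative four-layer convolutional backbone shared by both branches."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        chs = [in_ch, base, base * 2, base * 4, base * 4]   # assumed channel widths
        layers = []
        for i in range(4):
            layers += [nn.Conv2d(chs[i], chs[i + 1], kernel_size=3,
                                 stride=2 if i < 3 else 1, padding=1),
                       nn.BatchNorm2d(chs[i + 1]),
                       nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)

backbone = FourLayerBackbone()
template = torch.randn(1, 3, 127, 127)    # first-frame template branch image
search = torch.randn(1, 3, 255, 255)      # search-area image of a later frame
template_feat = backbone(template)        # template image features
search_feat = backbone(search)            # search-area features
```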
S102, feature learning:
constructing a Transformer-based multi-scale feature fusion network through frame attention and example attention;
learning the template image features through a frame-attention-based Transformer encoder to obtain multi-scale high-confidence-value target suggestion frames;
inputting the multi-scale high-confidence-value target suggestion frames into an example-attention-based Transformer decoder, simultaneously learning the features of the search region, and fusing the learned template image features with the learned search region features to obtain the target suggestion frame with the highest confidence value.
In this step, the calculation formula of the frame attention is expressed as:
$$\mathrm{FrameAttn}_i(q,k,v)=\mathrm{softmax}\!\left(\frac{qk^{T}}{\sqrt{d}}\right)v$$
where FrameAttn_i denotes the frame attention of the i-th head, softmax(·) denotes the normalization function, the superscript T denotes the transpose operation, q denotes the query vector, k denotes the key vector, v denotes the value vector, q ∈ R^{HW×d}, k ∈ R^{G²×d}, v ∈ R^{G²×d}, R denotes the real number set, HW denotes the height times the width of the input feature map, G denotes the side length of the grid feature map, C denotes the number of channels, and d denotes the feature dimension of a single head.
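As a reading aid for the formula above, the sketch below computes one frame-attention head as scaled dot-product attention of HW queries over G×G grid keys and values; it illustrates the reconstructed formula under the assumed shapes, not the patent's actual implementation.

```python
import torch

def frame_attention_head(q, k, v):
    """q: (HW, d) queries; k, v: (G*G, d) grid keys/values; returns (HW, d)."""
    d = q.shape[-1]
    scores = q @ k.transpose(0, 1) / d ** 0.5      # (HW, G*G) frame attention coefficients
    weights = torch.softmax(scores, dim=-1)         # normalization function
    return weights @ v

HW, G, d = 16 * 16, 7, 64
out = frame_attention_head(torch.randn(HW, d), torch.randn(G * G, d), torch.randn(G * G, d))
print(out.shape)  # torch.Size([256, 64])
```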
S103, network training:
training the multi-scale feature fusion network based on the Transformer by utilizing a large-scale data set, and adjusting model parameters in the multi-scale feature fusion network model based on the Transformer.
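Step S103 only states that the network is trained offline on a large-scale data set; the loop below is a hedged sketch of what such training typically looks like, where the data loader interface, the loss terms and the optimizer settings are assumptions rather than details given by the patent.

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, loader, optimizer):
    # Assumed loader yielding (template, search, cls_label, box_target) tuples
    # sampled from a large-scale tracking data set.
    model.train()
    for template, search, cls_label, box_target in loader:
        cls_logits, box_pred = model(template, search)        # assumed model interface
        loss = F.cross_entropy(cls_logits, cls_label) + F.l1_loss(box_pred, box_target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Example setup (hypothetical model and loader):
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
# for epoch in range(num_epochs):
#     train_one_epoch(model, train_loader, optimizer)
```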
S104, learning and aggregation:
and learning the local areas of the target features on the template branch image and the target features on the search area image by using the trained multi-scale feature fusion network based on the Transformer so as to respectively obtain corresponding local semantic information, and then respectively aggregating the local semantic information through a multi-head frame attention module and a multi-head example attention module so as to obtain global context information.
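The aggregation in step S104 can be pictured as the usual multi-head pattern: each head attends to a local region, and the per-head outputs are concatenated and linearly projected into one global representation. The sketch below assumes that standard interface and is not taken from the patent.

```python
import torch
import torch.nn as nn

class MultiHeadAggregate(nn.Module):
    """Concatenate per-head local outputs and project them to a global context vector."""
    def __init__(self, num_heads, head_dim, model_dim):
        super().__init__()
        self.proj = nn.Linear(num_heads * head_dim, model_dim)

    def forward(self, head_outputs):
        # head_outputs: list of (HW, head_dim) tensors, one per attention head
        return self.proj(torch.cat(head_outputs, dim=-1))     # (HW, model_dim)

agg = MultiHeadAggregate(num_heads=8, head_dim=32, model_dim=256)
global_ctx = agg([torch.randn(256, 32) for _ in range(8)])
print(global_ctx.shape)  # torch.Size([256, 256])
```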
S105, calculating a target frame:
performing geometric transformation through a predefined reference window by utilizing the Transformer encoder in the Transformer-based multi-scale feature fusion network to generate frames of interest, thereby capturing target suggestion frames containing multi-scale high confidence values, and refining these target suggestion frames by utilizing the Transformer decoder to obtain the candidate frame with the maximum confidence score.
Wherein the frame attention samples a grid in each candidate frame and calculates the attention weight of the sampled feature in the grid feature.
In particular, the generation principle of the multi-head frame attention module and the multi-head example attention module of the present invention is shown in fig. 2. In this embodiment, in the multi-head frame attention module and the multi-head example attention module, the frame attention of the i-th attention head is calculated by the following steps:
s1051, give an investigation ofPolling vector
Figure SMS_18
Is->
Figure SMS_19
Use bilinear interpolation from the block of interest +.>
Figure SMS_20
The extract has a size of +.>
Figure SMS_21
Grid feature map->
Figure SMS_22
。/>
It should be noted that the above query vector
Figure SMS_23
Is inquiry->
Figure SMS_24
Obtained through linear projection transformation.
S1052, the position attention module is used to transform the grid feature map t into a region of interest, so that the region of interest adapts to changes in the appearance of the target;
S1053, the frame attention coefficients are generated by computing the matrix multiplication between the query vector q and the key vector k;
S1054, the frame attention coefficients are normalized with the softmax function to obtain the similarity score S between the query vector q and the key vector k; the final frame attention FrameAttn_i is obtained by multiplying the similarity score S with the linear transformation matrix W of the grid feature map t.
To supplement, the grid feature map t satisfies t ∈ R^{G×G×d}, and the frame attention FrameAttn_i of a single head satisfies FrameAttn_i ∈ R^{HW×d}.
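The bilinear sampling in step S1051 can be written with torch.nn.functional.grid_sample: a G×G grid of points is laid over the box of interest and the feature map is sampled at those points. This is a hedged sketch under assumed normalized box coordinates, not the patent's code.

```python
import torch
import torch.nn.functional as F

def sample_grid_from_box(feat, box, G=7):
    """feat: (1, C, H, W) feature map; box: (cx, cy, w, h) in normalized [0, 1] coords."""
    cx, cy, w, h = box
    xs = torch.linspace(cx - w / 2, cx + w / 2, G)
    ys = torch.linspace(cy - h / 2, cy + h / 2, G)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack((gx, gy), dim=-1).unsqueeze(0) * 2 - 1   # map [0,1] -> [-1,1]
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)  # (1, C, G, G)

feat = torch.randn(1, 256, 32, 32)
grid_feat = sample_grid_from_box(feat, (0.5, 0.5, 0.3, 0.4))
print(grid_feat.shape)  # torch.Size([1, 256, 7, 7])
```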
In calculating the frame attention, the position attention module can focus the frame attention on the necessary regions, thereby predicting the high-confidence-value target suggestion frame more accurately. The position attention module transforms the reference window of the query vector q into a region of interest through geometric transformations of translation and scaling, so that spatial information can be used to predict the target suggestion frame in the grid feature map. The position attention module is used as follows:
the reference window of the query vector q is denoted b = (x, y, w, h), where x and y respectively represent the abscissa and the ordinate of the center position of the reference window, and w and h respectively represent the width and the height of the reference window;
a first transfer function f_t is used to convert the reference window b; the first transfer function f_t takes the query vector q and the reference window b as input and is used to move the center position of the reference window;
a second transfer function f_s is used to adjust the reference window b; the second transfer function f_s takes the query vector q and the reference window b as input and is used to adjust the size of the reference window.
In the present embodiment, the calculation formula corresponding to the first transfer function f_t is expressed as:
$$f_t(q,b)=(x+\Delta x,\ y+\Delta y,\ w,\ h)$$
where Δx represents the abscissa offset of the center position of the reference window b and Δy represents the ordinate offset of the center position of the reference window b;
the calculation formula corresponding to the second transfer function f_s is expressed as:
$$f_s(q,b)=(x,\ y,\ w+\Delta w,\ h+\Delta h)$$
where Δw represents the width offset of the reference window b and Δh represents the height offset of the reference window b.
Furthermore, the offset parameters Δx, Δy, Δw, Δh are obtained by linear projection of the query vector q, and the corresponding calculation formulas are expressed as:
$$\Delta x=\frac{qW_x+B_x}{\tau},\quad \Delta y=\frac{qW_y+B_y}{\tau},\quad \Delta w=\frac{qW_w+B_w}{\tau},\quad \Delta h=\frac{qW_h+B_h}{\tau}$$
where W_x represents the linear projection parameter of the abscissa x, W_y represents the linear projection parameter of the ordinate y, W_w represents the linear projection parameter of the reference window width, W_h represents the linear projection parameter of the reference window height, B_x represents the bias of the linear projection of the abscissa x, B_y represents the bias of the linear projection of the ordinate y, B_w represents the linear projection bias of the reference window width, B_h represents the linear projection bias of the reference window height, and τ represents a temperature parameter.
The converted reference window is determined jointly by the first transfer function f_t and the second transfer function f_s, and the corresponding calculation formula is expressed as:
$$\hat{b}=F\big(f_t(q,b),\,f_s(q,b)\big)$$
where \hat{b} represents the converted reference window and F(·) represents the transfer function operation.
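The position-attention geometry above can be sketched as a small module: the four offsets are produced by a linear projection of the query divided by the temperature, the first transfer step moves the window center, and the second adjusts its size. Because the patent's offset formulas appear only as images in the source, additive offsets are assumed here to match the reconstruction above; this is one plausible reading, not the definitive formula.

```python
import torch
import torch.nn as nn

class WindowTransform(nn.Module):
    """Predict (dx, dy, dw, dh) from the query and apply them to a reference window."""
    def __init__(self, d_model, tau=2.0):
        super().__init__()
        self.offset = nn.Linear(d_model, 4)   # linear projection of the query vector
        self.tau = tau                        # temperature parameter

    def forward(self, q, window):
        # q: (N, d_model) query vectors; window: (N, 4) as (x, y, w, h)
        dx, dy, dw, dh = (self.offset(q) / self.tau).unbind(dim=-1)
        x, y, w, h = window.unbind(dim=-1)
        x, y = x + dx, y + dy                 # first transfer function: move the center
        w, h = w + dw, h + dh                 # second transfer function: adjust the size
        return torch.stack((x, y, w, h), dim=-1)

wt = WindowTransform(d_model=256)
new_windows = wt(torch.randn(10, 256), torch.rand(10, 4))
print(new_windows.shape)  # torch.Size([10, 4])
```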
In the present invention, the operation method of the multi-head example attention module comprises the following steps:
generating multi-scale high-confidence-value target suggestion frames by using the frame attention in the frame-attention-based Transformer encoder and sending them to the Transformer decoder;
refining the high-confidence-value target suggestion frames by utilizing the example attention in the Transformer decoder, wherein each decoder layer in the Transformer decoder contains example attention, and an instance normalization layer with a residual structure is added after each forward propagation layer;
according to the high-confidence-value target suggestion frames in the Transformer decoder, the example attention takes the grid features within the high-confidence-value target suggestion frames as input, so as to obtain the target suggestion frame with the highest confidence value.
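A hedged sketch of one decoder layer as described in the list above: attention from the proposal queries over the grid features sampled inside the proposal frames, followed by a forward-propagation (feed-forward) layer whose output passes through an instance-normalization layer wired as a residual branch. Standard multi-head attention stands in for the example attention here, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class ExampleAttentionDecoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=8, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.inst_norm = nn.InstanceNorm1d(d_model)

    def forward(self, proposal_queries, grid_feats):
        # proposal_queries: (B, N, d); grid_feats: (B, N * G * G, d) sampled in the proposal frames
        attn_out, _ = self.attn(proposal_queries, grid_feats, grid_feats)
        x = proposal_queries + attn_out                        # attention with residual
        y = self.ffn(x)                                        # forward-propagation layer
        y = self.inst_norm(y.transpose(1, 2)).transpose(1, 2)  # instance normalization
        return x + y                                           # residual structure

layer = ExampleAttentionDecoderLayer()
out = layer(torch.randn(2, 10, 256), torch.randn(2, 10 * 49, 256))
print(out.shape)  # torch.Size([2, 10, 256])
```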
S106, target tracking:
and sending the features obtained by fusing the template image features and the search area features to a classification regression prediction head to obtain the maximum response position of the tracking target in the search area, so as to track.
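The classification-regression prediction head of step S106 can be sketched as two small convolutional branches on the fused feature map: one produces the response (classification) map whose maximum gives the tracking target's position in the search area, the other regresses the box; the layer sizes and the box parameterization are assumptions made for this example.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    def __init__(self, in_ch=256):
        super().__init__()
        self.cls = nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(in_ch, 1, 1))   # response / classification map
        self.reg = nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(in_ch, 4, 1))   # box regression (assumed l, t, r, b)

    def forward(self, fused):
        score = self.cls(fused)                             # (B, 1, H, W)
        boxes = self.reg(fused)                             # (B, 4, H, W)
        max_idx = score.flatten(2).argmax(dim=-1)           # index of the maximum response
        return score, boxes, max_idx

head = PredictionHead()
score, boxes, max_idx = head(torch.randn(1, 256, 16, 16))
print(score.shape, boxes.shape)  # torch.Size([1, 1, 16, 16]) torch.Size([1, 4, 16, 16])
```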
Referring to fig. 3, the present invention further provides a target tracking system based on a dual-attention feature fusion network, wherein the system executes the target tracking method based on the dual-attention feature fusion network, the system includes:
an initialization convolution module for:
under the twin network framework, initializing the template branch image of the first frame and the search area image of a subsequent search frame, and obtaining the template image features and the search area features respectively through a four-layer deep convolutional neural network;
the feature learning module is used for:
constructing a Transformer-based multi-scale feature fusion network through frame attention and example attention;
learning the template image features through a frame-attention-based Transformer encoder to obtain multi-scale high-confidence-value target suggestion frames;
inputting the multi-scale high-confidence-value target suggestion frames into an example-attention-based Transformer decoder, simultaneously learning the features of the search region, and fusing the learned template image features with the learned search region features to obtain the target suggestion frame with the highest confidence value;
the network training module is used for:
training the multi-scale feature fusion network based on the Transformer by utilizing a large-scale data set, and adjusting model parameters in the multi-scale feature fusion network model based on the Transformer;
a learning aggregation module for:
the trained multi-scale feature fusion network based on the Transformer is utilized to learn local areas of target features on the template branch image and target features on the search area image so as to respectively obtain corresponding local semantic information, and then the local semantic information is respectively aggregated through a multi-head frame attention module and a multi-head example attention module so as to obtain global context information;
the target frame calculation module is used for:
performing geometric transformation through a predefined reference window by utilizing the Transformer encoder in the Transformer-based multi-scale feature fusion network to generate frames of interest, thereby capturing target suggestion frames containing multi-scale high confidence values, and refining these target suggestion frames by utilizing the Transformer decoder to obtain the candidate frame with the maximum confidence score; the frame attention samples a grid in each candidate frame and calculates the attention weight of the sampled features within the grid features;
a target tracking module for:
and sending the features obtained by fusing the template image features and the search area features to a classification regression prediction head to obtain the maximum response position of the tracking target in the search area, so as to track.
The invention provides a target tracking method and system based on a dual-attention feature fusion network. The method comprises: constructing a Transformer-based multi-scale feature fusion network through frame attention and example attention under a twin network framework; learning the features in the template feature map by a frame-attention-based encoder to obtain multi-scale high-confidence-value target suggestion frames; inputting the target suggestion frames from the encoder into an example-attention-based decoder, simultaneously learning the features in the search area feature map, and fusing the template features with the search area features to obtain the target suggestion frame with the highest confidence value; pre-training the Transformer-based multi-scale feature fusion network, and using the trained network so that the encoder quickly focuses attention on the region of interest and captures a large amount of structured spatial information and local information, while the decoder explores global context information by using the structured spatial information in the encoder; and sending the features obtained by fusing the template features and the search area features to a prediction head to obtain the maximum response position of the tracking target in the search area, so as to track the target. The method fully combines the advantages of frame attention and example attention to construct the Transformer-based multi-scale feature fusion network, so that the tracker can cope well with difficulties such as severe occlusion, scale change and complex background during tracking, and achieves more accurate and robust tracking.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, the steps may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, Programmable Gate Arrays (PGAs), Field Programmable Gate Arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing examples illustrate only a few embodiments of the invention and are described in detail herein without thereby limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (10)

1. A method for tracking a target based on a dual attention feature fusion network, the method comprising the steps of:
step one, initializing convolution:
under the twin network framework, initializing the template branch image of the first frame and the search area image of a subsequent search frame, and obtaining the template image features and the search area features respectively through a four-layer deep convolutional neural network;
step two, feature learning:
constructing a Transformer-based multi-scale feature fusion network through frame attention and example attention;
learning the template image features through a frame-attention-based Transformer encoder to obtain multi-scale high-confidence-value target suggestion frames;
inputting the multi-scale high-confidence-value target suggestion frames into an example-attention-based Transformer decoder, simultaneously learning the features of the search region, and fusing the learned template image features with the learned search region features to obtain the target suggestion frame with the highest confidence value;
step three, network training:
training the multi-scale feature fusion network based on the Transformer by utilizing a large-scale data set, and adjusting model parameters in the multi-scale feature fusion network model based on the Transformer;
step four, learning and aggregation:
the trained multi-scale feature fusion network based on the Transformer is utilized to learn local areas of target features on the template branch image and target features on the search area image so as to respectively obtain corresponding local semantic information, and then the local semantic information is respectively aggregated through a multi-head frame attention module and a multi-head example attention module so as to obtain global context information;
step five, calculating a target frame:
performing geometric transformation through a predefined reference window by utilizing the Transformer encoder in the Transformer-based multi-scale feature fusion network to generate frames of interest, thereby capturing target suggestion frames containing multi-scale high confidence values, and refining these target suggestion frames by utilizing the Transformer decoder to obtain the candidate frame with the maximum confidence score; the frame attention samples a grid in each candidate frame and calculates the attention weight of the sampled features within the grid features;
step six, target tracking:
and sending the features obtained by fusing the template image features and the search area features to a classification regression prediction head to obtain the maximum response position of the tracking target in the search area, so as to track.
2. The method of claim 1, wherein in the second step, the calculation formula of the frame attention is expressed as:
$$\mathrm{FrameAttn}_i(q,k,v)=\mathrm{softmax}\!\left(\frac{qk^{T}}{\sqrt{d}}\right)v$$
where FrameAttn_i denotes the frame attention of the i-th head, softmax(·) denotes the normalization function, the superscript T denotes the transpose operation, q denotes the query vector, k denotes the key vector, v denotes the value vector, q ∈ R^{HW×d}, k ∈ R^{G²×d}, v ∈ R^{G²×d}, R denotes the real number set, HW denotes the height times the width of the input feature map, G denotes the side length of the grid feature map, C denotes the number of channels, and d denotes the feature dimension of a single head.
3. The dual attention feature fusion network based object tracking method of claim 2, wherein in the multi-head frame attention module, the frame attention of the i-th attention head is calculated by the following steps:
given the query vector q and its box of interest, extracting a grid feature map t of size G×G from the box of interest by bilinear interpolation;
transforming the grid feature map t into a region of interest with the position attention module, so that the region of interest adapts to changes in the appearance of the target;
generating the frame attention coefficients by computing the matrix multiplication between the query vector q and the key vector k;
normalizing the frame attention coefficients with the softmax function to obtain the similarity score S between the query vector q and the key vector k, and obtaining the final frame attention FrameAttn_i by multiplying the similarity score S with the linear transformation matrix W of the grid feature map t.
4. A dual attention feature fusion network based object tracking method as defined in claim 3, in which the grid feature map t satisfies t ∈ R^{G×G×d}, and the frame attention FrameAttn_i of a single head satisfies FrameAttn_i ∈ R^{HW×d}.
5. the method for tracking an object based on a dual attention feature fusion network of claim 4, wherein the method for using the location attention module when calculating the frame attention comprises the steps of:
by means of
Figure QLYQS_39
Representing query vector +.>
Figure QLYQS_40
Reference window of->
Figure QLYQS_41
, wherein ,/>
Figure QLYQS_42
Respectively representing the abscissa and the ordinate of the central position of the reference window, ">
Figure QLYQS_43
The width and the height of the reference window are respectively represented;
using a first transfer function
Figure QLYQS_44
For reference window->
Figure QLYQS_45
Performing a conversion, a first conversion function->
Figure QLYQS_46
Query vector +.>
Figure QLYQS_47
And reference Window->
Figure QLYQS_48
As an input, for moving the center position of the reference window;
using a second transfer function
Figure QLYQS_49
For reference window->
Figure QLYQS_50
Make adjustments, second transfer function->
Figure QLYQS_51
Query vector +.>
Figure QLYQS_52
And reference Window->
Figure QLYQS_53
As an input for adjusting the size of the reference window.
6. The dual attention feature fusion network based object tracking method of claim 5, wherein the calculation formula corresponding to the first transfer function f_t is expressed as:
$$f_t(q,b)=(x+\Delta x,\ y+\Delta y,\ w,\ h)$$
where Δx represents the abscissa offset of the center position of the reference window b and Δy represents the ordinate offset of the center position of the reference window b;
the calculation formula corresponding to the second transfer function f_s is expressed as:
$$f_s(q,b)=(x,\ y,\ w+\Delta w,\ h+\Delta h)$$
where Δw represents the width offset of the reference window b and Δh represents the height offset of the reference window b.
7. The dual attention feature fusion network based target tracking method of claim 6, wherein the offset parameters Δx, Δy, Δw, Δh are obtained by linear projection of the query vector q, and the corresponding calculation formulas are expressed as:
$$\Delta x=\frac{qW_x+B_x}{\tau},\quad \Delta y=\frac{qW_y+B_y}{\tau},\quad \Delta w=\frac{qW_w+B_w}{\tau},\quad \Delta h=\frac{qW_h+B_h}{\tau}$$
where W_x represents the linear projection parameter of the abscissa x, W_y represents the linear projection parameter of the ordinate y, W_w represents the linear projection parameter of the reference window width, W_h represents the linear projection parameter of the reference window height, B_x represents the bias of the linear projection of the abscissa x, B_y represents the bias of the linear projection of the ordinate y, B_w represents the linear projection bias of the reference window width, B_h represents the linear projection bias of the reference window height, and τ represents a temperature parameter.
8. The dual attention feature fusion network based target tracking method of claim 7, wherein the converted reference window is determined jointly by the first transfer function f_t and the second transfer function f_s, and the corresponding calculation formula is expressed as:
$$\hat{b}=F\big(f_t(q,b),\,f_s(q,b)\big)$$
where \hat{b} represents the converted reference window and F(·) represents the transfer function operation.
9. The method for tracking an object based on a dual attention feature fusion network of claim 8, wherein the method for operating the multi-head example attention module comprises the following steps:
generating multi-scale high-confidence-value target suggestion frames by using the frame attention in the frame-attention-based Transformer encoder and sending them to the Transformer decoder;
refining the high-confidence-value target suggestion frames by utilizing the example attention in the Transformer decoder, wherein each decoder layer in the Transformer decoder contains example attention, and an instance normalization layer with a residual structure is added after each forward propagation layer;
according to the high-confidence-value target suggestion frames in the Transformer decoder, the example attention takes the grid features within the high-confidence-value target suggestion frames as input, so as to obtain the target suggestion frame with the highest confidence value.
10. A dual attention feature fusion network based object tracking system, wherein the system performs the dual attention feature fusion network based object tracking method as claimed in any one of claims 1 to 9, the system comprising:
an initialization convolution module for:
under the twin network framework, initializing the template branch image of the first frame and the search area image of a subsequent search frame, and obtaining the template image features and the search area features respectively through a four-layer deep convolutional neural network;
the feature learning module is used for:
constructing a Transformer-based multi-scale feature fusion network through frame attention and example attention;
learning the template image features through a frame-attention-based Transformer encoder to obtain multi-scale high-confidence-value target suggestion frames;
inputting the multi-scale high-confidence-value target suggestion frames into an example-attention-based Transformer decoder, simultaneously learning the features of the search region, and fusing the learned template image features with the learned search region features to obtain the target suggestion frame with the highest confidence value;
the network training module is used for:
training the multi-scale feature fusion network based on the Transformer by utilizing a large-scale data set, and adjusting model parameters in the multi-scale feature fusion network model based on the Transformer;
a learning aggregation module for:
the trained multi-scale feature fusion network based on the Transformer is utilized to learn local areas of target features on the template branch image and target features on the search area image so as to respectively obtain corresponding local semantic information, and then the local semantic information is respectively aggregated through a multi-head frame attention module and a multi-head example attention module so as to obtain global context information;
the target frame calculation module is used for:
performing geometric transformation through a predefined reference window by utilizing the Transformer encoder in the Transformer-based multi-scale feature fusion network to generate frames of interest, thereby capturing target suggestion frames containing multi-scale high confidence values, and refining these target suggestion frames by utilizing the Transformer decoder to obtain the candidate frame with the maximum confidence score; the frame attention samples a grid in each candidate frame and calculates the attention weight of the sampled features within the grid features;
a target tracking module for:
and sending the features obtained by fusing the template image features and the search area features to a classification regression prediction head to obtain the maximum response position of the tracking target in the search area, so as to track.
CN202310172562.3A 2023-02-28 2023-02-28 Target tracking method and system based on dual-attention feature fusion network Active CN116030097B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310172562.3A CN116030097B (en) 2023-02-28 2023-02-28 Target tracking method and system based on dual-attention feature fusion network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310172562.3A CN116030097B (en) 2023-02-28 2023-02-28 Target tracking method and system based on dual-attention feature fusion network

Publications (2)

Publication Number Publication Date
CN116030097A (en) 2023-04-28
CN116030097B (en) 2023-05-30

Family

ID=86079674

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310172562.3A Active CN116030097B (en) 2023-02-28 2023-02-28 Target tracking method and system based on dual-attention feature fusion network

Country Status (1)

Country Link
CN (1) CN116030097B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116403006A (en) * 2023-06-07 2023-07-07 南京军拓信息科技有限公司 Real-time visual target tracking method, device and storage medium
CN116664624A (en) * 2023-06-01 2023-08-29 中国石油大学(华东) Target tracking method and tracker based on decoupling classification and regression characteristics
CN117649582A (en) * 2024-01-25 2024-03-05 南昌工程学院 Single-flow single-stage network target tracking method and system based on cascade attention

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210081673A1 (en) * 2019-09-12 2021-03-18 Nec Laboratories America, Inc Action recognition with high-order interaction through spatial-temporal object tracking
CN112560695A (en) * 2020-12-17 2021-03-26 中国海洋大学 Underwater target tracking method, system, storage medium, equipment, terminal and application
CN113705588A (en) * 2021-10-28 2021-11-26 南昌工程学院 Twin network target tracking method and system based on convolution self-attention module
CN115063445A (en) * 2022-08-18 2022-09-16 南昌工程学院 Target tracking method and system based on multi-scale hierarchical feature representation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MING GAO et al.: "A novel visual tracking convnet for autonomous vehicles", IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS *
王军 et al.: "基于孪生神经网络的目标跟踪算法综述" (A survey of target tracking algorithms based on Siamese neural networks), Wanfang Platform (万方平台) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116664624A (en) * 2023-06-01 2023-08-29 中国石油大学(华东) Target tracking method and tracker based on decoupling classification and regression characteristics
CN116664624B (en) * 2023-06-01 2023-10-27 中国石油大学(华东) Target tracking method and tracker based on decoupling classification and regression characteristics
CN116403006A (en) * 2023-06-07 2023-07-07 南京军拓信息科技有限公司 Real-time visual target tracking method, device and storage medium
CN116403006B (en) * 2023-06-07 2023-08-29 南京军拓信息科技有限公司 Real-time visual target tracking method, device and storage medium
CN117649582A (en) * 2024-01-25 2024-03-05 南昌工程学院 Single-flow single-stage network target tracking method and system based on cascade attention
CN117649582B (en) * 2024-01-25 2024-04-19 南昌工程学院 Single-flow single-stage network target tracking method and system based on cascade attention

Also Published As

Publication number Publication date
CN116030097B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
CN116030097B (en) Target tracking method and system based on dual-attention feature fusion network
CN110097568B (en) Video object detection and segmentation method based on space-time dual-branch network
Liu et al. Abcnet v2: Adaptive bezier-curve network for real-time end-to-end text spotting
Zhao et al. Object detection with deep learning: A review
Sirohi et al. Efficientlps: Efficient lidar panoptic segmentation
CN113705588B (en) Twin network target tracking method and system based on convolution self-attention module
CN110070074B (en) Method for constructing pedestrian detection model
Zhao et al. Transformer3D-Det: Improving 3D object detection by vote refinement
CN113673425B (en) Multi-view target detection method and system based on Transformer
CN110781262B (en) Semantic map construction method based on visual SLAM
CN109493364A (en) A kind of target tracking algorism of combination residual error attention and contextual information
WO2023154320A1 (en) Thermal anomaly identification on building envelopes as well as image classification and object detection
CN113177549B (en) Few-sample target detection method and system based on dynamic prototype feature fusion
CN116109678B (en) Method and system for tracking target based on context self-attention learning depth network
CN113628244A (en) Target tracking method, system, terminal and medium based on label-free video training
He et al. Learning scene dynamics from point cloud sequences
CN115908908A (en) Remote sensing image gathering type target identification method and device based on graph attention network
CN112418235A (en) Point cloud semantic segmentation method based on expansion nearest neighbor feature enhancement
Zhao et al. Similarity-aware fusion network for 3d semantic segmentation
Huang et al. Task-wise sampling convolutions for arbitrary-oriented object detection in aerial images
Kalash et al. Relative saliency and ranking: Models, metrics, data and benchmarks
Gao et al. Spatio-temporal contextual learning for single object tracking on point clouds
Tian et al. TSRN: two-stage refinement network for temporal action segmentation
CN112668543B (en) Isolated word sign language recognition method based on hand model perception
Tong et al. DKD–DAD: a novel framework with discriminative kinematic descriptor and deep attention-pooled descriptor for action recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant