CN117036417A - Multi-scale Transformer target tracking method based on spatio-temporal template updating - Google Patents
Multi-scale Transformer target tracking method based on spatio-temporal template updating
- Publication number
- CN117036417A CN117036417A CN202311171271.9A CN202311171271A CN117036417A CN 117036417 A CN117036417 A CN 117036417A CN 202311171271 A CN202311171271 A CN 202311171271A CN 117036417 A CN117036417 A CN 117036417A
- Authority
- CN
- China
- Prior art keywords
- layer
- token
- template
- image
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/042—Knowledge-based neural networks; Logical representations of neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention discloses a multi-scale Transformer target tracking method based on spatio-temporal template updating, which comprises the following steps: extracting features of the initial template image, the dynamic template image and the search area image with a Shunted Transformer to obtain three corresponding tokens; concatenating the three tokens and feeding them sequentially into a Transformer-based encoder and decoder for feature fusion to obtain a fused feature sequence; performing bounding box prediction on the fused feature sequence through a classification branch and a regression branch, and outputting the tracking result; and, after the number of frames processed by the tracker reaches the update interval, updating the dynamic template from the fused feature sequence through the confidence branch. By using the Shunted Transformer as the feature extraction backbone, the method can learn multi-scale features and improve the representation of the target; meanwhile, adding the dynamic template allows the latest state of the target to be captured, effectively meeting challenges such as occlusion and deformation of the target.
Description
Technical Field
The invention belongs to the field of image recognition and target tracking, and in particular relates to a multi-scale Transformer target tracking method based on spatio-temporal template updating.
Background
Visual target tracking is a hot research area of computer vision with broad application prospects in fields such as the military, autonomous driving and medical treatment. Visual object tracking refers to automatically giving the position and shape of an object of interest in the subsequent frames of a video, given a bounding box marking that object in the first frame. However, in complex real scenes the tracking process is affected by environmental factors such as background clutter, occlusion and deformation, so designing a tracking algorithm that operates efficiently in real scenes is a difficult task.
Based on their working modes, current deep-learning target tracking algorithms are mainly divided into tracking algorithms based on convolutional neural networks and tracking algorithms based on Transformers. Siamese-network trackers describe the target with convolution operations; although they perform well when modeling local relations of image content, they cannot effectively process global context information. Transformer-based algorithms can process both local and global information with the attention mechanism, which strengthens the characterization capability and improves the robustness of the tracker. However, most of these algorithms handle video frames in isolation and do not fully mine the temporal and spatial information in the video. In real-world scenarios the tracker also needs to overcome challenges such as occlusion, object deformation and scale change, and these challenges are further amplified as the time span grows, so developing a high-precision, robust and real-time tracker remains extremely challenging.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides a multi-scale Transformer target tracking method based on spatio-temporal template updating, which better meets challenges such as occlusion and target deformation and enhances the robustness of the tracker.
In order to solve the above technical problems, the invention provides the following technical scheme: a multi-scale Transformer target tracking method based on spatio-temporal template updating, comprising the following steps:
s1, extracting features of an initial template image, a dynamic template image and a search area image to obtain an initial template token, a dynamic template token and a search area token;
S2, concatenating the initial template token, the dynamic template token and the search area token, and then sequentially inputting them into the encoder and decoder of a Transformer for feature fusion to obtain a fused feature sequence;
S3, performing bounding box prediction on the fused feature sequence through a classification branch and a regression branch, and outputting the tracking result;
S4, after the number of processed frames reaches the update interval, updating the dynamic template from the fused feature sequence through the confidence branch.
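The S1–S4 pipeline above can be summarized as a per-frame tracking loop. The following is a hypothetical sketch, not the patent's implementation: the `extract`, `fuse`, `predict_box`, `confidence` and `crop_template` callables stand in for the Shunted Transformer backbone, the Transformer encoder-decoder, the prediction heads and the template-cropping step described in the text.

```python
# Hypothetical sketch of the S1-S4 tracking loop; all callables are
# placeholders for the modules described in the patent text.

def track(frames, init_template, update_interval=200, tau=0.5,
          extract=None, fuse=None, predict_box=None, confidence=None,
          crop_template=None):
    """Run the tracker over a list of frames and return one box per frame."""
    dynamic_template = init_template          # starts as the initial template
    f_init = extract(init_template)           # S1: initial template token (computed once)
    results = []
    for idx, frame in enumerate(frames):
        f_dyn = extract(dynamic_template)     # S1: dynamic template token
        f_search = extract(frame)             # S1: search area token
        fused = fuse(f_init, f_dyn, f_search) # S2: encoder-decoder fusion
        box = predict_box(fused)              # S3: classification + regression heads
        results.append(box)
        # S4: confidence-gated template update at fixed intervals
        if idx > 0 and idx % update_interval == 0 and confidence(fused) > tau:
            dynamic_template = crop_template(frame, box)
    return results
```

With stub functions in place of the networks, the loop produces one box per frame and only refreshes the dynamic template when the interval and confidence conditions both hold.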
Further, the step S1 includes the following sub-steps:
S101, judging from the video whether the current frame is the first frame; if so, extracting features of the initial template image and proceeding to step S103; otherwise proceeding to step S102;
S102, extracting features of the search area image and, when the dynamic template image has been updated, extracting features of the dynamic template image; then proceeding to step S103;
S103, applying a new patch embedding to an input image of size H×W: the input image is passed sequentially through a first convolution layer, a second convolution layer and a non-overlapping projection layer to obtain a feature sequence of size H/4×W/4; the first convolution layer serves as the first layer of the patch embedding;
S104, feature extraction comprises three stages, each containing a linear embedding layer and a Shunted Transformer module; each Shunted Transformer module comprises a shunted self-attention layer and a detail-specific feed-forward layer. The output feature sequence obtained in step S103 serves as the input sequence F and is first projected into query Q, key K and value V tensors. Multi-scale token aggregation (MTA) is used: for the different heads indexed by i, the key K and the value V are downsampled to different sizes so as to preserve both fine-grained and coarser details:
Q_i = X W_i^Q,
K_i, V_i = MTA(X; r_i) W_i^K, MTA(X; r_i) W_i^V,
V_i = V_i + LE(V_i),
wherein MTA(·; r_i) is the multi-scale token aggregation layer in the i-th head, r_i is its downsampling rate, and W_i^Q, W_i^K, W_i^V are the parameters of the linear projections of the i-th head; LE(·) is the local enhancement component applied by the MTA to the value V through depthwise convolution; Q_i, K_i, V_i respectively denote the query, key and value of the i-th head, and X is the feature representation of the input image. Different heads i have different (not exactly identical) rates r_i, so their computational costs differ.
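The per-head downsampling of K and V can be sketched in NumPy. This is an illustrative simplification, not the patent's implementation: MTA is approximated by average pooling over r×r token windows, the local enhancement LE(·) is omitted, and the projection matrices are plain arrays.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mta(x, r):
    """Multi-scale token aggregation (simplified): average-pool r x r token
    windows of an (H, W, C) feature map and flatten to (H*W/r^2, C)."""
    H, W, C = x.shape
    pooled = x.reshape(H // r, r, W // r, r, C).mean(axis=(1, 3))
    return pooled.reshape(-1, C)

def shunted_head(x, wq, wk, wv, r):
    """One shunted self-attention head: K and V come from tokens
    downsampled by rate r, while Q keeps the full token resolution."""
    H, W, C = x.shape
    q = x.reshape(-1, C) @ wq                 # (H*W, d): full resolution
    pooled = mta(x, r)                        # (H*W/r^2, C): coarser tokens
    k, v = pooled @ wk, pooled @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v                           # (H*W, d)
```

Heads with a large r attend over fewer, coarser key/value tokens (cheaper, coarse context), while heads with r = 1 keep full-resolution keys and values, which is the multi-granularity behavior the text describes.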
S105, the multi-head self-attention then uses H independent attention heads to compute the shunted self-attention in parallel:
h_i = Softmax(Q_i K_i^T / √d_h) V_i,
wherein h_i denotes the i-th self-attention head and d_h is the head dimension;
S106, local details are supplemented in the feed-forward layer by adding a detail-specific layer between its two fully connected layers:
x′ = FC(x; θ_1),
x″ = FC(σ(x′ + DS(x′; θ_1)); θ_2),
wherein DS(·; θ) is the detail-specific layer with parameters θ, realized by depthwise convolution; FC(·; θ) is a fully connected layer with parameters θ; σ(·) denotes the activation function that applies a nonlinear transformation to its input; x denotes the feature sequence input to the feed-forward layer, x′ the output of x after the first fully connected layer, and x″ the output of x′ at the second fully connected layer after passing through the detail-specific layer and the activation, i.e. the output feature sequence of the whole feed-forward layer;
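The detail-specific feed-forward layer can be sketched as follows. This is a minimal NumPy illustration under assumed shapes (tokens kept on their 2D grid), with DS(·) as a per-channel 3×3 depthwise convolution and a plain ReLU standing in for σ(·); the actual kernel size and activation are implementation choices not fixed by the text.

```python
import numpy as np

def depthwise_conv3x3(x, kernel):
    """Per-channel 3x3 convolution (the DS layer), zero-padded.
    x: (H, W, C) feature map, kernel: (3, 3, C) per-channel weights."""
    H, W, C = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += padded[i:i + H, j:j + W] * kernel[i, j]
    return out

def detail_specific_ffn(x, w1, w2, kernel):
    """x'' = FC(sigma(x' + DS(x'; theta1)); theta2) with x' = FC(x; theta1);
    sigma is a ReLU here for simplicity (an assumption)."""
    x1 = x @ w1                                        # first fully connected layer
    x1 = np.maximum(0.0, x1 + depthwise_conv3x3(x1, kernel))
    return x1 @ w2                                     # second fully connected layer
```

The depthwise branch injects local spatial detail between the two channel-mixing layers while leaving the input/output token shape unchanged.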
S107, steps S104 to S106 are executed iteratively twice more to obtain the output feature sequence of the input sequence F.
Further, in step S103, the first convolution layer has a 7×7 kernel with stride 2, the second convolution layer has a 3×3 kernel, and the non-overlapping projection layer has stride 1.
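The H/4×W/4 output size of the patch embedding can be checked with the standard convolution arithmetic. Note an assumption in this sketch: the text gives strides only for the first convolution (2) and the projection (1), so the second 3×3 convolution is assumed to have stride 2 and padding 1 for the overall downsampling to reach 1/4.

```python
def conv_out(n, k, s, p):
    """Output length of a 1D convolution axis: floor((n + 2p - k)/s) + 1."""
    return (n + 2 * p - k) // s + 1

def patch_embed_size(h, w):
    # First conv: 7x7 kernel, stride 2 (padding 3 assumed) -> H/2 x W/2.
    h, w = conv_out(h, 7, 2, 3), conv_out(w, 7, 2, 3)
    # Second conv: 3x3 kernel; stride 2 and padding 1 are assumed here so
    # the overall downsampling reaches H/4 x W/4 as the text states.
    h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
    # Non-overlapping projection with stride 1 keeps the spatial size.
    return h, w
```

For a 224×224 input this yields a 56×56 token grid, i.e. exactly H/4×W/4.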
Further, the step S2 includes the following sub-steps:
S201, the initial template token, the dynamic template token and the search area token are concatenated along the spatial dimension to obtain a mixed representation:
f^0 = Concat(f_z, f_dz, f_x),
wherein f_z, f_dz and f_x respectively denote the initial template token, the dynamic template token and the search area token obtained through feature extraction.
S202, the mixed representation f^{l-1} is first layer-normalized and fed into the multi-head self-attention (MSA) module, and the result is added to f^{l-1} through a residual connection to obtain f′^l; f′^l is then layer-normalized and fed into the feed-forward network (FFN), and the output is added to f′^l through a residual connection to obtain the output f^l of one Transformer block:
f′^l = f^{l-1} + MSA(LN(f^{l-1})),
f^l = f′^l + FFN(LN(f′^l)).
S203, when the mixed representation leaves the encoder, a decoupling operation is used to separate the initial template token, the dynamic template token and the search area token:
(f_z^L, f_dz^L, f_x^L) = Decouple(f^L).
S204, the decoder takes the outputs of the encoder as input, namely the decoupled search area token f_x^L and the undecoupled mixed representation f^L. Both inputs are first layer-normalized, and the multi-head cross-attention between f_x^L and f^L is then computed to obtain the final representation:
f′_vm = f_x^L + MCA(LN(f_x^L), LN(f^L)),
f_vm = f′_vm + FFN(LN(f′_vm)),
wherein MCA(·) denotes the multi-head cross-attention module, FFN(·) the feed-forward network and LN(·) layer normalization.
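The pre-norm encoder block (S202) and the decoder's cross-attention (S204) can be sketched together in NumPy. This is a single-head illustration without learned Q/K/V projections, an assumption made purely to keep the residual structure visible; it is not the full multi-head module.

```python
import numpy as np

def ln(x, eps=1e-5):
    """Layer normalization over the channel dimension."""
    m, v = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - m) / np.sqrt(v + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attn(q_in, kv_in, d):
    """Single-head attention: q_in attends to kv_in
    (self-attention when both are the same sequence)."""
    a = softmax(q_in @ kv_in.T / np.sqrt(d))
    return a @ kv_in

def encoder_block(f, ffn):
    f1 = f + attn(ln(f), ln(f), f.shape[-1])        # pre-norm MSA + residual
    return f1 + ffn(ln(f1))                         # pre-norm FFN + residual

def decoder_block(fx, f_all, ffn):
    """Search tokens fx query the full mixed representation f_all (MCA)."""
    f1 = fx + attn(ln(fx), ln(f_all), fx.shape[-1])
    return f1 + ffn(ln(f1))
```

The decoder output keeps the search-token length, which is what the prediction heads in S3 consume.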
Further, the step S3 includes the following sub-steps:
S301, the output sequence of the decoder is sent to the classification branch, which uses the IoU-aware classification score as the training target and VariFocal Loss as the training loss function; the classification loss is expressed as
L_cls = L_VariFocal(p, IoU(b, b̂)),
wherein p denotes the predicted IoU-aware classification score, b the predicted bounding box and b̂ the real bounding box;
S302, the output sequence of the decoder is sent to the regression branch, which uses the generalized IoU (GIoU) loss; the regression loss is expressed as
L_reg = p · L_GIoU(b, b̂),
wherein the GIoU loss is weighted by p to emphasize samples with a high classification score.
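The score-weighted GIoU regression loss can be sketched directly from its definition. The averaging over samples below is an assumption; the original does not state the reduction.

```python
def giou(a, b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # smallest enclosing box of the two
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c = (cx2 - cx1) * (cy2 - cy1)
    return inter / union - (c - union) / c

def weighted_giou_loss(preds, gts, scores):
    """Regression loss: the GIoU loss (1 - GIoU) weighted by the IoU-aware
    classification score p, averaged over samples."""
    total = sum(p * (1.0 - giou(b, g)) for b, g, p in zip(preds, gts, scores))
    return total / len(preds)
```

GIoU equals 1 for identical boxes and decreases toward -1 as boxes separate, so the weighted loss vanishes for perfect, high-confidence predictions.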
Further, the step S4 includes the following sub-steps:
S401, when the number of video frames reaches the specified interval t, the output sequence of the decoder is input into the confidence branch, which then determines whether the target state in the current frame is reliable. Specifically, it is judged whether the confidence score s is higher than a set threshold τ: if so, the target state of the current frame is considered reliable and the dynamic template is updated; otherwise the dynamic template is not updated. The target state of the current frame is thus judged reliable under the condition s > τ.
S402, when the dynamic template is updated, an image of the dynamic template size is cropped, centred on the target predicted in the search area of the current frame, to serve as the new dynamic template.
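Steps S401 and S402 reduce to a gating test plus a clamped crop window. A minimal sketch, assuming the crop is clamped to the frame bounds (the original does not specify how out-of-frame regions are handled):

```python
def maybe_update(frame_idx, interval, conf_score, tau):
    """S401: update only at the fixed interval and when the confidence
    branch judges the current target state reliable (score > tau)."""
    return frame_idx % interval == 0 and conf_score > tau

def crop_dynamic_template(frame_w, frame_h, box, tmpl_size):
    """S402: return the (x1, y1, x2, y2) crop window of side tmpl_size
    centred on the predicted box, clamped to stay inside the frame."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    half = tmpl_size / 2.0
    x1 = min(max(cx - half, 0.0), frame_w - tmpl_size)
    y1 = min(max(cy - half, 0.0), frame_h - tmpl_size)
    return (x1, y1, x1 + tmpl_size, y1 + tmpl_size)
```

The gate keeps unreliable frames (occlusion, drift) from contaminating the dynamic template, which is the point of the confidence branch.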
In another aspect, the invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of any method of the invention when executing the computer program.
The invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the method of any of the invention.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
1. multi-scale feature information is retained during feature extraction, and the obtained feature tokens can learn the relations between objects of different scales, improving the characterization capability of the tracker;
2. the latest state of the target can be captured through the dynamic template, and by combining temporal and spatial information, challenges such as occlusion and target deformation can be effectively met, enhancing the robustness of the tracker.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a flow chart of the operation of the feature extraction module of the present invention.
Detailed Description
For a better understanding of the technical content of the present invention, specific examples are set forth below, along with the accompanying drawings.
Aspects of the invention are described herein with reference to the drawings, in which many illustrative embodiments are shown. The embodiments of the invention are not limited to those shown in the drawings. It is to be understood that the invention can be carried out through any of the various concepts and embodiments described above and below, since the disclosed concepts and embodiments are not limited to any particular implementation. Additionally, some aspects of the disclosure may be used alone or in any suitable combination with other aspects of the disclosure.
As shown in fig. 1, the multi-scale Transformer target tracking method based on spatio-temporal template updating of the invention comprises the following steps:
s1, extracting features of an initial template image, a dynamic template image and a search area image to obtain an initial template token, a dynamic template token and a search area token;
S2, concatenating the initial template token, the dynamic template token and the search area token, and then sequentially inputting them into the encoder and decoder of a Transformer for feature fusion to obtain a fused feature sequence;
S3, performing bounding box prediction on the fused feature sequence through a classification branch and a regression branch, and outputting the tracking result;
S4, after the number of processed frames reaches the update interval, updating the dynamic template from the fused feature sequence through the confidence branch.
Referring to fig. 2, further, as a preferred technical solution of the multi-scale Transformer target tracking method based on spatio-temporal template updating of the invention, step S1 comprises the following sub-steps:
S101, judging from the video whether the current frame is the first frame; if so, extracting features of the initial template image and proceeding to step S103; otherwise proceeding to step S102;
S102, extracting features of the search area image and, when the dynamic template image has been updated, extracting features of the dynamic template image; then proceeding to step S103;
S103, applying a new patch embedding to an input image of size H×W: the input image is passed sequentially through a first convolution layer, a second convolution layer and a non-overlapping projection layer to obtain a feature sequence of size H/4×W/4; the first convolution layer serves as the first layer of the patch embedding. The first convolution layer has a 7×7 kernel with stride 2, the second convolution layer has a 3×3 kernel, and the non-overlapping projection layer has stride 1.
S104, feature extraction comprises three stages, each containing a linear embedding layer and a Shunted Transformer module; each Shunted Transformer module comprises a shunted self-attention layer and a detail-specific feed-forward layer. The output feature sequence obtained in step S103 serves as the input sequence F and is first projected into query Q, key K and value V tensors. Multi-scale token aggregation (MTA) is used: for the different heads indexed by i, the key K and the value V are downsampled to different sizes so as to preserve both fine-grained and coarser details:
Q_i = X W_i^Q,
K_i, V_i = MTA(X; r_i) W_i^K, MTA(X; r_i) W_i^V,
V_i = V_i + LE(V_i),
wherein MTA(·; r_i) is the multi-scale token aggregation layer in the i-th head, r_i is its downsampling rate, and W_i^Q, W_i^K, W_i^V are the parameters of the linear projections of the i-th head; LE(·) is the local enhancement component applied by the MTA to the value V through depthwise convolution; Q_i, K_i, V_i respectively denote the query, key and value of the i-th head, and X is the feature representation of the input image. Different heads i have different rates r_i, so their computational costs differ.
S105, the multi-head self-attention then uses H independent attention heads to compute the shunted self-attention in parallel:
h_i = Softmax(Q_i K_i^T / √d_h) V_i,
wherein h_i denotes the i-th self-attention head and d_h is the head dimension;
S106, local details are supplemented in the feed-forward layer by adding a detail-specific layer between its two fully connected layers:
x′ = FC(x; θ_1),
x″ = FC(σ(x′ + DS(x′; θ_1)); θ_2),
wherein DS(·; θ) is the detail-specific layer with parameters θ, realized by depthwise convolution; FC(·; θ) is a fully connected layer with parameters θ; σ(·) denotes the activation function that applies a nonlinear transformation to its input; x denotes the feature sequence input to the feed-forward layer, x′ the output of x after the first fully connected layer, and x″ the output of x′ at the second fully connected layer after passing through the detail-specific layer and the activation, i.e. the output feature sequence of the whole feed-forward layer;
S107, steps S104 to S106 are executed iteratively twice more to obtain the output feature sequence of the input sequence F.
Further, as a preferred technical solution of the multi-scale Transformer target tracking method based on spatio-temporal template updating of the invention, step S2 comprises the following sub-steps:
S201, the initial template token, the dynamic template token and the search area token are concatenated along the spatial dimension to obtain a mixed representation:
f^0 = Concat(f_z, f_dz, f_x),
wherein f_z, f_dz and f_x respectively denote the initial template token, the dynamic template token and the search area token obtained through feature extraction.
S202, the encoder comprises a series of Transformer blocks. The mixed representation f^{l-1} is first layer-normalized and fed into the multi-head self-attention (MSA) module, and the result is added to f^{l-1} through a residual connection to obtain f′^l; f′^l is then layer-normalized and fed into the feed-forward network (FFN), and the output is added to f′^l through a residual connection to obtain the output f^l of one Transformer block:
f′^l = f^{l-1} + MSA(LN(f^{l-1})),
f^l = f′^l + FFN(LN(f′^l)).
S203, when the mixed representation leaves the encoder, a decoupling operation is used to separate the initial template token, the dynamic template token and the search area token:
(f_z^L, f_dz^L, f_x^L) = Decouple(f^L).
S204, the decoder takes the outputs of the encoder as input, namely the decoupled search area token f_x^L and the undecoupled mixed representation f^L. Both inputs are first layer-normalized, and the multi-head cross-attention between f_x^L and f^L is then computed to obtain the final representation:
f′_vm = f_x^L + MCA(LN(f_x^L), LN(f^L)),
f_vm = f′_vm + FFN(LN(f′_vm)),
wherein MCA(·) denotes the multi-head cross-attention module, FFN(·) the feed-forward network and LN(·) layer normalization.
Further, as a preferred technical solution of the multi-scale Transformer target tracking method based on spatio-temporal template updating of the invention, step S3 comprises the following sub-steps:
S301, the output sequence of the decoder is sent to the classification branch, which uses the IoU-aware classification score as the training target and VariFocal Loss as the training loss function; the classification loss is expressed as
L_cls = L_VariFocal(p, IoU(b, b̂)),
wherein p denotes the predicted IoU-aware classification score, b the predicted bounding box and b̂ the real bounding box;
S302, the output sequence of the decoder is sent to the regression branch, which uses the generalized IoU (GIoU) loss; the regression loss is expressed as
L_reg = p · L_GIoU(b, b̂),
wherein the GIoU loss is weighted by p to emphasize samples with a high classification score.
Further, as a preferred technical solution of the multi-scale Transformer target tracking method based on spatio-temporal template updating of the invention, step S4 comprises the following sub-steps:
S401, when the number of video frames reaches the specified interval t, the output sequence of the decoder is input into the confidence branch, which then determines whether the target state in the current frame is reliable. Specifically, it is judged whether the confidence score s is higher than a set threshold τ: if so, the target state of the current frame is considered reliable and the dynamic template is updated; otherwise the dynamic template is not updated. The target state of the current frame is thus judged reliable under the condition s > τ.
S402, when the dynamic template is updated, an image of the dynamic template size is cropped, centred on the target predicted in the search area of the current frame, to serve as the new dynamic template.
In another aspect, the invention proposes an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, said processor implementing the steps of the method of the invention when executing said computer program.
The invention also proposes a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method according to the invention.
While the invention has been described in terms of preferred embodiments, it is not intended to be limiting. Those skilled in the art will appreciate that various modifications and adaptations can be made without departing from the spirit and scope of the present invention. Accordingly, the scope of the invention is defined by the appended claims.
Claims (8)
1. A multi-scale Transformer target tracking method based on spatio-temporal template updating, characterized by comprising the following steps:
s1, extracting features of an initial template image, a dynamic template image and a search area image to obtain an initial template token, a dynamic template token and a search area token;
S2, concatenating the initial template token, the dynamic template token and the search area token, and then sequentially inputting them into the encoder and decoder of a Transformer for feature fusion to obtain a fused feature sequence;
S3, performing bounding box prediction on the fused feature sequence through a classification branch and a regression branch, and outputting the tracking result;
S4, after the number of processed frames reaches the update interval, updating the dynamic template from the fused feature sequence through the confidence branch.
2. The multi-scale Transformer target tracking method based on spatio-temporal template updating according to claim 1, characterized in that step S1 comprises the following sub-steps:
S101, judging from the video whether the current frame is the first frame; if so, extracting features of the initial template image and proceeding to step S103; otherwise proceeding to step S102;
S102, extracting features of the search area image and, when the dynamic template image has been updated, extracting features of the dynamic template image; then proceeding to step S103;
S103, applying a new patch embedding to an input image of size H×W: the input image is passed sequentially through a first convolution layer, a second convolution layer and a non-overlapping projection layer to obtain a feature sequence of size H/4×W/4; the first convolution layer serves as the first layer of the patch embedding;
S104, feature extraction comprises three stages, each containing a linear embedding layer and a Shunted Transformer module; each Shunted Transformer module comprises a shunted self-attention layer and a detail-specific feed-forward layer. The output feature sequence obtained in step S103 serves as the input sequence F and is first projected into query Q, key K and value V tensors. Multi-scale token aggregation (MTA) is used: for the different heads indexed by i, the key K and the value V are downsampled to different sizes so as to preserve both fine-grained and coarser details:
Q_i = X W_i^Q,
K_i, V_i = MTA(X; r_i) W_i^K, MTA(X; r_i) W_i^V,
V_i = V_i + LE(V_i),
wherein MTA(·; r_i) is the multi-scale token aggregation layer in the i-th head, r_i is its downsampling rate, and W_i^Q, W_i^K, W_i^V are the parameters of the linear projections of the i-th head; LE(·) is the local enhancement component applied by the MTA to the value V through depthwise convolution; Q_i, K_i, V_i respectively denote the query, key and value of the i-th head, and X is the feature representation of the input image. Different heads i have different (not exactly identical) rates r_i, so their computational costs differ.
S105, the multi-head self-attention then uses H independent attention heads to compute the shunted self-attention in parallel:
h_i = Softmax(Q_i K_i^T / √d_h) V_i,
wherein h_i denotes the i-th self-attention head and d_h is the head dimension;
s106, supplementing local details in the feedforward layer by adding a data specific layer between two fully connected layers of the feedforward layer, wherein the formula is as follows:
x′ = FC(x; θ_1),
x″ = FC(σ(x′ + DS(x′; θ)); θ_2).
wherein DS(·; θ) is the detail-specific layer with parameter θ, implemented by depth-wise convolution; FC(·; θ) is a fully connected layer with parameter θ; σ(·) denotes the activation function applying a nonlinear transformation to its input; x denotes the feature sequence input to the feed-forward layer, x′ the feature sequence output by the first fully connected layer, and x″ the feature sequence output by the second fully connected layer after the detail-specific layer and activation, i.e., the output of the whole feed-forward layer;
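A minimal NumPy sketch of the detail-specific feed-forward layer, assuming σ is ReLU and DS is a 3×3 depth-wise convolution with zero padding (the kernel shape and dimensions are illustrative assumptions, not stated in the claim):

```python
import numpy as np

def detail_specific_ffn(x, W1, b1, W2, b2, dw):
    # x'' = FC(sigma(x' + DS(x'; theta)); theta_2), with DS a 3x3 depth-wise
    # convolution over the token grid and sigma taken here to be ReLU.
    n, _ = x.shape
    h = w = int(np.sqrt(n))
    x1 = x @ W1 + b1                              # x' = FC(x; theta_1)
    c = x1.shape[-1]
    grid = np.pad(x1.reshape(h, w, c), ((1, 1), (1, 1), (0, 0)))
    ds = np.zeros((h, w, c))
    for i in range(3):                            # depth-wise 3x3 conv:
        for j in range(3):                        # one kernel per channel
            ds += grid[i:i + h, j:j + w, :] * dw[i, j, :]
    act = np.maximum(x1 + ds.reshape(n, c), 0.0)  # sigma(x' + DS(x'))
    return act @ W2 + b2                          # x'' = FC(...; theta_2)
```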
S107, iteratively executing steps S104 to S106 twice to obtain the feature sequence of the input sequence F.
3. The method according to claim 2, wherein in step S103 the first convolution layer has a 7×7 kernel with a stride of 2, the second convolution layer has a 3×3 kernel, and the non-overlapping projection layer has a stride of 1.
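The H/4 × W/4 output size of the patch-embedding stem can be verified with a small size calculation; note the claim does not state the stride or padding of the second convolution, so a stride of 2 with 'same'-style padding is assumed here so that the stem reaches H/4:

```python
def conv_out(size, kernel, stride, padding):
    # Spatial output size of a convolution layer.
    return (size + 2 * padding - kernel) // stride + 1

H = 256                      # example input height (width is analogous)
h1 = conv_out(H, 7, 2, 3)    # first conv: 7x7 kernel, stride 2 -> H/2
h2 = conv_out(h1, 3, 2, 1)   # second conv: 3x3, stride 2 assumed -> H/4
# the non-overlapping projection with stride 1 then keeps the H/4 grid
assert h2 == H // 4
```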
4. The multi-scale Transformer target tracking method based on spatio-temporal template updating according to claim 2, wherein step S2 comprises the following sub-steps:
S201, splicing the initial template token, the dynamic template token and the search region token, each obtained by feature extraction, along the spatial dimension to obtain a mixed representation f^0;
S202, for the l-th Transformer block, the mixed representation f^(l-1) is first layer-normalized and input into the multi-head self-attention MSA module, and the result is added to f^(l-1) to obtain f̂^l; f̂^l is then layer-normalized and input into the feed-forward network FFN, and the output is added to f̂^l to obtain the output f^l of the Transformer block, as follows:
f̂^l = MSA(LN(f^(l-1))) + f^(l-1),
f^l = FFN(LN(f̂^l)) + f̂^l;
S203, when the mixed representation f^L is output by the encoder, the initial template token, the dynamic template token and the search region token are separated again by a decoupling operation;
s204, the decoder takes the output of the encoder as input, namely the decoupled search area tokenAnd an uncoupled mixed representation f L All inputs are first layer normalized and then +.>And f L To obtain a final characterization, the formula is as follows:
f vm =f v ′ m +FFN(LN(f v ′ m ));
wherein MCA(·) represents the multi-head cross-attention module, FFN(·) the feed-forward network, and LN(·) layer normalization.
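Steps S201 to S204 can be sketched end-to-end as below; this is a toy single-head NumPy illustration with identity attention projections, a ReLU stand-in for the FFN, and arbitrary token counts, not the claimed network:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ln(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def attend(q, kv):
    # single-head (cross-)attention with identity projections, for brevity
    return softmax(q @ kv.T / np.sqrt(q.shape[-1])) @ kv

n_z0, n_zd, n_x, c = 4, 4, 8, 16
rng = np.random.default_rng(3)
f0 = rng.normal(size=(n_z0 + n_zd + n_x, c))     # S201: spliced mixed tokens
fh = f0 + attend(ln(f0), ln(f0))                 # MSA(LN(f)) + f
fL = fh + np.maximum(ln(fh), 0.0)                # FFN (sketched) + residual
fx = fL[n_z0 + n_zd:]                            # S203: decouple search tokens
fvm1 = fx + attend(ln(fx), ln(fL))               # MCA(LN(f_x), LN(f^L)) + f_x
fvm = fvm1 + np.maximum(ln(fvm1), 0.0)           # f_vm = f'_vm + FFN(LN(f'_vm))
```

The point of the cross-attention step is that the search-region tokens can still query the full mixed representation, including both templates, after being decoupled.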
5. The multi-scale Transformer target tracking method based on spatio-temporal template updating according to claim 3, wherein step S3 comprises the following sub-steps:
S301, sending the output sequence of the decoder into the classification branch, using the IoU-aware classification score as the training target and the VariFocal Loss as the training loss function; the classification loss is expressed as
L_cls = L_VFL(p, q),
wherein p represents the predicted IoU-aware classification score, b represents the predicted bounding box, b̂ the ground-truth bounding box, and q is the target score, namely the IoU between b and b̂ for positive samples and 0 otherwise;
s302, the output sequence of the decoder is sent to a regression branch, and the generalized IoU loss is used, and the regression loss is expressed as
Wherein the GIoU penalty is weighted by p to emphasize high class score samples.
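A hedged sketch of the two losses, assuming the common VariFocal formulation with the IoU of the predicted and ground-truth boxes as the target q, and the p-weighted GIoU term described above (the α, γ defaults are illustrative assumptions):

```python
import numpy as np

def iou_giou(b, g):
    # boxes as (x1, y1, x2, y2); returns (IoU, GIoU)
    ix = max(0.0, min(b[2], g[2]) - max(b[0], g[0]))
    iy = max(0.0, min(b[3], g[3]) - max(b[1], g[1]))
    inter = ix * iy
    union = (b[2] - b[0]) * (b[3] - b[1]) + (g[2] - g[0]) * (g[3] - g[1]) - inter
    iou = inter / union
    hull = (max(b[2], g[2]) - min(b[0], g[0])) * (max(b[3], g[3]) - min(b[1], g[1]))
    return iou, iou - (hull - union) / hull

def varifocal(p, q, alpha=0.75, gamma=2.0):
    # VariFocal loss with IoU-aware target q (q = 0 for negative samples)
    if q > 0:
        return -(q * np.log(p) + (1 - q) * np.log(1 - p)) * q
    return -alpha * p ** gamma * np.log(1 - p)

def tracking_loss(p, b_pred, b_gt):
    iou, giou = iou_giou(b_pred, b_gt)
    return varifocal(p, iou) + p * (1.0 - giou)   # GIoU term weighted by p
```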
6. The multi-scale Transformer target tracking method based on spatio-temporal template updating according to claim 3, wherein step S4 comprises the following sub-steps:
S401, when the number of video frames reaches a specified interval t, inputting the output sequence of the decoder into the confidence branch and judging whether the target state of the current frame is reliable, specifically: if the confidence score s is higher than a set threshold τ, the target state of the current frame is deemed reliable and the dynamic template is updated; otherwise, the dynamic template is not updated. That is, the template is updated if and only if s > τ;
S402, when the dynamic template is updated, an image of the dynamic template size, centred on the target predicted in the search region of the current frame, is cropped as the new dynamic template.
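The update rule of S401 and S402 might be sketched as follows; the function name, argument layout and the clipping of the crop at the frame border are illustrative assumptions:

```python
import numpy as np

def maybe_update_template(frame, cx, cy, tpl_size, score, frame_idx, t, tau, old_tpl):
    # Every t frames, crop a new dynamic template of size tpl_size x tpl_size
    # centred on the predicted target if the confidence score exceeds tau.
    if frame_idx % t != 0 or score <= tau:
        return old_tpl                        # keep the previous template
    h, w = frame.shape[:2]
    half = tpl_size // 2
    x0 = int(np.clip(cx - half, 0, w - tpl_size))  # clip crop to frame bounds
    y0 = int(np.clip(cy - half, 0, h - tpl_size))
    return frame[y0:y0 + tpl_size, x0:x0 + tpl_size]
```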
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 6.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311171271.9A CN117036417A (en) | 2023-09-12 | 2023-09-12 | Multi-scale transducer target tracking method based on space-time template updating |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117036417A true CN117036417A (en) | 2023-11-10 |
Family
ID=88622920
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311171271.9A Pending CN117036417A (en) | 2023-09-12 | 2023-09-12 | Multi-scale transducer target tracking method based on space-time template updating |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117036417A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117333514A (en) * | 2023-12-01 | 2024-01-02 | 科大讯飞股份有限公司 | Single-target video tracking method, device, storage medium and equipment |
CN117333514B (en) * | 2023-12-01 | 2024-04-16 | 科大讯飞股份有限公司 | Single-target video tracking method, device, storage medium and equipment |
CN117522925A (en) * | 2024-01-05 | 2024-02-06 | 成都合能创越软件有限公司 | Method and system for judging object motion state in mobile camera under attention mechanism |
CN117522925B (en) * | 2024-01-05 | 2024-04-16 | 成都合能创越软件有限公司 | Method and system for judging object motion state in mobile camera under attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||