LU102992B1 - Siamese network target tracking method based on channel and spatial attention mechanisms - Google Patents

Siamese network target tracking method based on channel and spatial attention mechanisms

Info

Publication number
LU102992B1
Authority
LU
Luxembourg
Prior art keywords
target
channel
network model
spatial attention
target images
Prior art date
Application number
LU102992A
Other languages
French (fr)
Inventor
Jun Wang
Yuanyun Wang
Original Assignee
Nanchang Inst Tech
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang Inst Tech filed Critical Nanchang Inst Tech
Priority to LU102992A priority Critical patent/LU102992B1/en
Application granted granted Critical
Publication of LU102992B1 publication Critical patent/LU102992B1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/09 - Supervised learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a siamese network target tracking method based on channel and spatial attention mechanisms, including the following steps: processing a video or image data set to obtain a plurality of target images having a uniform image size; constructing and obtaining a new backbone network model on the basis of a convolutional neural network model, a channel attention mechanism and a spatial attention mechanism; extracting training samples from the plurality of target images to train the new backbone network model; extracting deep features of target image samples from the plurality of target images by the well-trained new backbone network model, and performing similarity matching on the deep features of the target image samples in a target image candidate region to obtain a plurality of target candidate blocks; and utilizing a target candidate block with a maximum similarity score acquired, thereby performing target tracking.

Description

SIAMESE NETWORK TARGET TRACKING METHOD BASED ON CHANNEL AND SPATIAL ATTENTION MECHANISMS
TECHNICAL FIELD
[01] The invention relates to the technical field of computer vision, and in particular, to a siamese network target tracking method based on channel and spatial attention mechanisms.
BACKGROUND ART
[02] Target tracking, as an important topic in computer vision, has practical applications in fields such as autonomous driving, video surveillance, video analysis, medical treatment and military science; however, target tracking against a complex background usually suffers from challenges such as target deformation, motion blurring and occlusion.
[03] Generally, target tracking algorithms include discriminative algorithms and generative algorithms. In recent years, deep learning and tracking algorithms based on siamese networks have also attracted wide attention. A siamese network architecture is used for template matching of the detected target candidate samples, and the location of the target image is obtained by calculating the maximum similarity between a target region and a candidate region.
[04] However, a convolutional neural network model, a channel attention mechanism and a spatial attention mechanism have not been combined simultaneously in the prior art, and accuracy and robustness are unsatisfactory.
SUMMARY
[05] The invention aims to solve the problems that vision target tracking does not simultaneously combine a convolutional neural network model, a channel attention mechanism and a spatial attention mechanism, and that target tracking accuracy and robustness are unsatisfactory.
[06] The invention provides a siamese network target tracking method based on channel and spatial attention mechanisms, comprising the following steps:
[07] step I: processing a video or image data set to obtain a plurality of target images having a uniform image size;
[08] step II: constructing and obtaining a new backbone network model on the basis of a convolutional neural network model, a channel attention mechanism and a spatial attention mechanism;
[09] step III: extracting training samples from the plurality of target images to train the new backbone network model;
[10] step IV: extracting deep features of target image samples from the plurality of target images by the well-trained new backbone network model, and performing similarity matching on the deep features of the target image samples in a target image candidate region to obtain a plurality of target candidate blocks, with each target candidate block corresponding to a similarity score;
[11] step V: utilizing a target candidate block with a maximum similarity score acquired, thereby performing target tracking.
[12] In the invention, GOT-10k is used as a training set to adjust model parameters for off-line training, which can more accurately represent a target in a video; and then, the lightweight convolutional neural network model is used for feature extraction. An appearance model of a tracking algorithm designed in the invention has better robustness and accuracy.
[13] Provided is the siamese network target tracking method based on channel and spatial attention mechanisms, wherein in the step of constructing and obtaining the new backbone network model on the basis of the convolutional neural network model, the channel attention mechanism and the spatial attention mechanism,
[14] the plurality of target images are taken as a training data set for training, wherein the training data set includes 560 motion objects and 87 motion pattern categories;
[15] a stochastic gradient descent method is used for training and construction, wherein the momentum is set as 0.9.
[16] Provided is the siamese network target tracking method based on channel and spatial attention mechanisms, wherein the sizes of target image features respectively extracted by a template branch and a search branch in the siamese network framework are “6x6x128” and “22x22x128”.
BRIEF DESCRIPTION OF THE DRAWINGS
[17] FIG. 1 is a flow diagram of a siamese network target tracking method based on channel and spatial attention mechanisms according to the invention;
[18] FIG. 2 is a principle diagram of a siamese network target tracking method based on channel and spatial attention mechanisms according to the invention.
DETAILED DESCRIPTION OF THE EMBODIMENTS
[19] The invention discloses a siamese network target tracking method based on channel and spatial attention mechanisms, as shown in FIG. 1 and FIG. 2, and the method includes the following steps:
[20] S101, a video or image data set is processed to obtain a plurality of target images having a uniform image size.
[21] Images in a video or image data set are processed to have a uniform size, which is not only convenient for subsequent input, but also convenient for extracting deep features of images having uniform sizes in a tracking stage.
[22] S102, a new backbone network model is constructed and obtained on the basis of a convolutional neural network model, a channel attention mechanism and a spatial attention mechanism.
[23] The new backbone network model is a siamese network framework which includes a template branch and a search branch. As shown in FIG. 2, Z corresponds to the template branch, and X corresponds to the search branch.
[24] In the dotted-line box in the middle of FIG. 2, the convolutional neural network model, a channel attention module and a spatial attention module are combined to construct the new backbone network model. Therein, the convolutional neural network model includes a convolutional layer 1, a convolutional layer 2, a convolutional layer 3, a convolutional layer 4 and a convolutional layer 5. Therein, the channel attention module and the spatial attention module are located between the convolutional layer 1 and the convolutional layer 2. This structure is used in a subsequent step for processing the deep features of target image samples extracted from the target images.
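By way of illustration, the following is a minimal PyTorch sketch of such a backbone, assuming an AlexNet-style five-layer convolutional stack; the intermediate channel widths, strides and the reduction ratio of the attention gate are assumptions, while the layer count, the attention placement between layer 1 and layer 2, and the 128-channel output follow the description. With these assumed strides, 127x127x3 and 255x255x3 inputs yield 6x6x128 and 22x22x128 feature maps, matching the sizes stated elsewhere in this description. Formula-faithful channel and spatial attention modules are sketched separately further below.

```python
import torch
import torch.nn as nn

class CBAMGate(nn.Module):
    """Channel gate followed by a spatial gate (minimal stand-in; the
    formula-faithful channel/spatial modules are sketched further below)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling descriptor
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

class AttentionBackbone(nn.Module):
    """Five convolutional layers, no fully connected layer; the attention
    gate sits between convolutional layer 1 (plus pooling) and layer 2."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(3, 96, 11, stride=2), nn.ReLU(inplace=True),
                                   nn.MaxPool2d(3, stride=2))
        self.attn = CBAMGate(96)
        self.conv2 = nn.Sequential(nn.Conv2d(96, 256, 5), nn.ReLU(inplace=True),
                                   nn.MaxPool2d(3, stride=2))
        self.conv3 = nn.Sequential(nn.Conv2d(256, 384, 3), nn.ReLU(inplace=True))
        self.conv4 = nn.Sequential(nn.Conv2d(384, 384, 3), nn.ReLU(inplace=True))
        self.conv5 = nn.Conv2d(384, 128, 3)  # 128-channel output features

    def forward(self, x):
        x = self.attn(self.conv1(x))
        for layer in (self.conv2, self.conv3, self.conv4, self.conv5):
            x = layer(x)
        return x
```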
[25] S103, training samples are extracted from the plurality of target images to train the new backbone network model.
[26] During training, the picture size needs to be determined according to the model complexity and the available GPU memory. In the invention, the size of the sample images input into the template branch is 127x127x3, and the size of the sample images input into the search branch is 255x255x3.
[27] It shall be additionally described that:
[28] when a child window for searching the target images extends beyond the boundary of the target images, the missing image part is filled with the RGB mean value. In a subsequent testing stage (including step S104 and step S105), target images of the two channels will be respectively introduced into the template branch and the search branch of the siamese network framework, so as to acquire the deep features of the target image samples.
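As a hedged illustration of the mean-value fill described above, the following NumPy sketch crops a square search window around a target centre and fills any part that falls outside the frame with the per-image RGB mean; the function name and the exact crop convention are illustrative assumptions.

```python
import numpy as np

def crop_with_mean_fill(image, cx, cy, size):
    """Crop a square window of side `size` centred at (cx, cy); pixels that
    fall outside the frame are filled with the per-image RGB mean value."""
    h, w, _ = image.shape
    rgb_mean = image.reshape(-1, 3).mean(axis=0)

    half = size // 2
    x1, y1 = cx - half, cy - half
    x2, y2 = x1 + size, y1 + size

    patch = np.empty((size, size, 3), dtype=image.dtype)
    patch[:] = rgb_mean                        # pre-fill with the RGB mean

    # Overlap between the requested window and the actual frame.
    ix1, iy1 = max(x1, 0), max(y1, 0)
    ix2, iy2 = min(x2, w), min(y2, h)
    if ix1 < ix2 and iy1 < iy2:
        patch[iy1 - y1:iy2 - y1, ix1 - x1:ix2 - x1] = image[iy1:iy2, ix1:ix2]
    return patch
```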
[29] In the step of taking the plurality of target images as a training data set for training, the training data set includes 560 motion objects and 87 motion pattern categories. Moreover, the training data set provides video clips of over 10,000 real-world motion objects and over 1,500,000 manually annotated bounding boxes. The new backbone network model designed above can be trained end to end on the large-scale data set GOT-10k.
[30] In addition, a stochastic gradient descent (SGD) method is used for training and construction, wherein the momentum is set to 0.9. The learning rate is reduced over the iterations from an initial learning rate of 0.01 to a final learning rate of 0.00001. The new backbone network model disclosed in the invention is trained for 50 epochs in total, the weight decay is set to 0.0005, and the batch size is 16.
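A hedged PyTorch sketch of this training configuration is given below: SGD with momentum 0.9 and weight decay 0.0005, and a learning rate decayed from 0.01 to 0.00001 over 50 epochs. The geometric per-epoch decay is one common reading of the schedule and is an assumption, as are the placeholder model, loss and data loader shown in the commented loop.

```python
import torch
from torch.optim.lr_scheduler import ExponentialLR

def build_optimizer(model, epochs=50, lr_start=1e-2, lr_end=1e-5):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_start,
                                momentum=0.9, weight_decay=5e-4)
    # Decay the learning rate geometrically so it reaches lr_end in the last epoch.
    gamma = (lr_end / lr_start) ** (1.0 / (epochs - 1))
    scheduler = ExponentialLR(optimizer, gamma=gamma)
    return optimizer, scheduler

# Typical epoch loop (batch size 16 is set in the DataLoader; names are placeholders):
# loader = torch.utils.data.DataLoader(got10k_pairs, batch_size=16, shuffle=True)
# for epoch in range(50):
#     for z, x, label in loader:
#         loss = criterion(tracker(z, x), label)
#         optimizer.zero_grad(); loss.backward(); optimizer.step()
#     scheduler.step()
```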
[31] S104, the deep features of the target image samples are extracted from the plurality of target images by the well-trained new backbone network model, and similarity matching is performed on the deep features of the target image samples in a target image candidate region to obtain a plurality of target candidate blocks, with each target candidate block corresponding to a similarity score.
[32] Specifically, for the above-mentioned new backbone network model, the convolutional neural network model (CNN model) includes 5 convolutional layers, without any fully-connected layer. The channel attention mechanism and the spatial attention mechanism consist of the channel attention module and the spatial attention module. According to their successive placement positions, the channel attention module and the spatial attention module are constructed behind the first convolutional layer and a pooling layer. A 7x7 convolution kernel is used as the receptive field of the spatial attention module.
[33] In the siamese network framework, the following relation holds:

$h(L_{k\tau}x) = L_{\tau}h(x)$  (1)

[34] In the formula, $h$ represents the mapping function of the input/output signals, $k$ represents the stride, $\tau$ is the shift value of an effective region in the input/output signals, $L_{k\tau}$ and $L_{\tau}$ both represent translation operators, and $x$ represents an input target image.
[35] Moreover, a convolutional inline function $p(\cdot)$ is usually used to associate the two input target images $X$ and $Z$ with each other to generate an output response diagram used for representing a similarity score between the deep features of the target image samples after the two input target images pass through the siamese network framework.
[36] Therein, the formula expression of the similarity score is as follows:

$f(Z,X) = p(Z) * p(X) + b\mathbb{1}$  (2)

[37] In the formula, $f(Z,X)$ represents the similarity score between the two input target images, $b\mathbb{1}$ represents a bias value $b \in \mathbb{R}$ taken at every position, $\mathbb{R}$ represents the set of real numbers, $p(Z)$ and $p(X)$ represent the output features of the two input target images after passing through the siamese network framework, $Z$ and $X$ represent the two input target images, and $p(\cdot)$ represents the convolutional inline function.
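A hedged sketch of how formula (2) is typically evaluated is given below: the template features $p(Z)$ act as a convolution kernel slid over the search features $p(X)$ (a cross-correlation), and a scalar bias is added at every position. The exact implementation is an assumption; the feature sizes follow the 6x6x128 and 22x22x128 values stated in this description.

```python
import torch
import torch.nn.functional as F

def similarity_map(feat_z, feat_x, bias=0.0):
    """f(Z, X) = p(Z) * p(X) + b·1, realised as a cross-correlation.

    feat_z: template features, shape (1, 128, 6, 6)   -> p(Z)
    feat_x: search features,   shape (1, 128, 22, 22) -> p(X)
    Returns a (1, 1, 17, 17) response map of similarity scores.
    """
    return F.conv2d(feat_x, feat_z) + bias

# Example with the feature sizes stated in the description:
z = torch.randn(1, 128, 6, 6)
x = torch.randn(1, 128, 22, 22)
print(similarity_map(z, x).shape)   # torch.Size([1, 1, 17, 17])
```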
[38] Specifically, in the step of extracting the deep features of the target image samples from the plurality of target images by the well-trained new backbone network model, the channel attention mechanism includes the following steps:
[39] A1. Features of the target images of two channels are obtained by max-pooling and global average-pooling.
[40] In the invention, the size of the input target images Z is "HxWxC", the features of the target images of the two channels are obtained by the max-pooling and the global average-pooling, and the size of the features of the target images of the two channels is "1x1xC".
[41] B1. The features of the target images of the two channels obtained after the max-pooling and the global average-pooling are input into a multi-layer perceptron network, and a feature vector is obtained by element-wise summation.
[42] Specifically, the features of the target images of the two channels obtained after the max-pooling and the global average-pooling are input into a multi-layer perceptron network (namely, MLP), wherein the first-layer neuron number is C/r, the activation function is ReLU, and the second-layer neuron number is C; the network parameters of the two layers are shared. The feature vector is output after element-wise summation.
[43] Therein, the feature vector is $WR_{e}(F_{avg}(Z)) + WR_{e}(F_{max}(Z))$.
[44] C1. The feature vector is processed by a Sigmoid activation function to obtain a first weight coefficient, and the first weight coefficient is multiplied with the input target image Z to obtain a first weighted new feature.
[45] In this step, the first weight coefficient is represented as follows:

$\theta_{c}(Z) = \sigma(WR_{e}(F_{avg}(Z)) + WR_{e}(F_{max}(Z)))$  (3)

[46] In the formula, $\theta_{c}(Z)$ is the first weight coefficient, $\sigma$ represents the Sigmoid activation function, $W$ represents a weight of the shared multi-layer perceptron network, $R_{e}$ represents the ReLU function, $F_{avg}(\cdot)$ is the global average-pooling function, and $F_{max}(\cdot)$ is the max-pooling function;
[47] the first weighted new feature is represented as follows:

$F_{c} = \theta_{c}(Z) \otimes Z$  (4)

[48] In the formula, $F_{c}$ represents the first weighted new feature, $\otimes$ represents element-wise multiplication, and $Z$ represents the input target images.
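A hedged PyTorch sketch of the channel attention module corresponding to formulas (3) and (4) is shown below; the reduction ratio r, the batch-first tensor layout and the exact placement of the ReLU inside the shared perceptron are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """theta_c(Z) = sigmoid(W·R_e(F_avg(Z)) + W·R_e(F_max(Z))), F_c = theta_c(Z) ⊗ Z."""
    def __init__(self, channels, reduction=16):        # reduction ratio r is assumed
        super().__init__()
        self.mlp = nn.Sequential(                       # shared two-layer perceptron W
            nn.Linear(channels, channels // reduction), # first layer: C/r neurons
            nn.ReLU(inplace=True),                      # activation R_e
            nn.Linear(channels // reduction, channels)) # second layer: C neurons

    def forward(self, z):
        b, c, _, _ = z.shape
        avg = self.mlp(z.mean(dim=(2, 3)))              # F_avg(Z): 1x1xC descriptor
        mx = self.mlp(z.amax(dim=(2, 3)))               # F_max(Z): 1x1xC descriptor
        theta_c = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return theta_c * z                              # F_c, the first weighted feature
```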
[49] Following the channel attention module, the spatial attention module is introduced to focus on which features are significant in the input target images. Specifically, in the step of extracting the deep features of the target image samples from the plurality of target images by the well-trained new backbone network model, the spatial attention mechanism includes the following steps:
[50] A2. The features of the target images of the two channels are obtained by the max-pooling and the global average-pooling, and the features of the target images of the two channels are spliced by a first convolutional layer.
[51] Similar to the channel attention module, the size of the input target images Z is "HxWxC". The size of each of the features of the target images of the two channels, obtained by the max-pooling and the global average-pooling along the channel dimension, is "HxWx1", and the two features are spliced together and processed by a standard convolutional layer (the first convolutional layer).
[52] B2. The spliced features of the target images of the two channels are calculated by a second convolutional layer and the Sigmoid activation function to obtain a second weight coefficient $\theta_{s}(Z')$.
[53] Then, the weight coefficient $\theta_{s}(Z')$ is obtained by the 7x7 convolutional layer and the Sigmoid activation function. Finally, the weight coefficient $\theta_{s}(Z')$ is multiplied with the input feature $Z'$ to obtain a second weighted new feature $F_{s}$.
[54] Therein, the second weight coefficient is represented as follows:

$\theta_{s}(Z') = \sigma(f^{7\times7}([F_{avg}(Z'); F_{max}(Z')]))$  (5)

[55] In the formula, $\theta_{s}(Z')$ is the second weight coefficient, $f^{7\times7}$ represents that the receptive field of the convolution kernel is 7x7, $Z'$ represents the first weighted new feature obtained above, and $\sigma$ represents the Sigmoid activation function.
[56] C2. The second weight coefficient is multiplied with the first weighted new feature to obtain a second weighted new feature.
[57] The second weighted new feature is represented as follows:

$F_{s} = \theta_{s}(Z') \otimes Z'$  (6)

[58] In the formula, $F_{s}$ is the second weighted new feature.
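For completeness, a hedged PyTorch sketch of the spatial attention module corresponding to formulas (5) and (6) follows; padding the 7x7 convolution so that the spatial size is preserved is an assumption. In the backbone sketched earlier, ChannelAttention followed by SpatialAttention would take the place of the combined gate between convolutional layer 1 and convolutional layer 2.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """theta_s(Z') = sigmoid(f7x7([F_avg(Z'); F_max(Z')])), F_s = theta_s(Z') ⊗ Z'."""
    def __init__(self):
        super().__init__()
        # 7x7 receptive field; padding assumed so that the spatial size is preserved.
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, z_prime):
        avg = z_prime.mean(dim=1, keepdim=True)         # F_avg(Z'): HxWx1 map
        mx = z_prime.amax(dim=1, keepdim=True)          # F_max(Z'): HxWx1 map
        theta_s = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return theta_s * z_prime                        # F_s, the second weighted feature
```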
[59] Further, the similarity matching is performed in the target image candidate region. In other words, the similarities of all translated child windows are calculated in a dense grid, and the specific operation is as shown in Formula (2). Namely, mutual association is performed by the convolutional inline function $p(\cdot)$ to generate the output response diagram used for representing the similarity score between the deep features of the target image samples after the two input target images pass through the siamese network framework.
[60] The target candidate blocks described here are all obtained by the search branch, and the corresponding size is “22x22x128”. The above-mentioned similarity score is obtained by similarity comparison between the target candidate blocks (which are essentially target image features) in the search branch and the sample image features in the template branch.
[61] S105, a target candidate block with a maximum similarity score acquired thereby is used for target tracking.
[62] This step specifically includes: the similarity between the deep features of the target image samples (in the template branch) and the deep features of the candidate target image samples (in the search branch) is calculated and compared, and the target image region with the maximum similarity score found in subsequent frames is determined as the expected result, thereby achieving target tracking.
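A hedged sketch of this selection step is shown below: the location of the maximum value in the response map is mapped back to search-region pixel coordinates. The total stride of 8 and the centre-offset convention are assumptions that depend on the backbone configuration.

```python
import torch

def locate_target(response, stride=8, search_size=255):
    """Pick the candidate with the maximum similarity score and map its
    response-map position back to search-region pixel coordinates."""
    response = response.squeeze()                       # e.g. a (17, 17) score map
    idx = torch.argmax(response)
    row, col = divmod(idx.item(), response.shape[1])
    centre = (response.shape[0] - 1) / 2.0
    # Displacement of the best candidate from the search-region centre, in pixels.
    dx = (col - centre) * stride
    dy = (row - centre) * stride
    return search_size / 2.0 + dx, search_size / 2.0 + dy
```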

Claims (3)

WHAT IS CLAIMED IS:
1. A siamese network target tracking method based on channel and spatial attention mechanisms, comprising the following steps: step I: processing a video or image data set to obtain a plurality of target images having a uniform image size; step II: constructing and obtaining a new backbone network model on the basis of a convolutional neural network model, a channel attention mechanism and a spatial attention mechanism; step III: extracting training samples from the plurality of target images to train the new backbone network model; step IV: extracting deep features of target image samples from the plurality of target images by the well-trained new backbone network model, and performing similarity matching on the deep features of the target image samples in a target image candidate region to obtain a plurality of target candidate blocks, with each target candidate block corresponding to a similarity score; step V: utilizing a target candidate block with a maximum similarity score acquired, thereby performing target tracking.
2. The siamese network target tracking method based on the channel and spatial attention mechanisms according to claim 1, wherein the new backbone network model is a siamese network framework which comprises a template branch and a search branch; and the step of extracting the training samples from the plurality of target images comprises: when expanding a child window for searching the target images beyond the scope of the target images, filling an image missing part by an RGB mean value.
3. The siamese network target tracking method based on the channel and spatial attention mechanisms according to claim 2, wherein the sizes of target image features respectively extracted by the template branch and the search branch in the siamese network framework are "6x6x128" and "22x22x128".
LU102992A 2022-08-02 2022-08-02 Siamese network target tracking method based on channel and spatial attention mechanisms LU102992B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
LU102992A LU102992B1 (en) 2022-08-02 2022-08-02 Siamese network target tracking method based on channel and spatial attention mechanisms

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
LU102992A LU102992B1 (en) 2022-08-02 2022-08-02 Siamese network target tracking method based on channel and spatial attention mechanisms

Publications (1)

Publication Number Publication Date
LU102992B1 true LU102992B1 (en) 2024-02-02

Family

ID=89720120

Family Applications (1)

Application Number Title Priority Date Filing Date
LU102992A LU102992B1 (en) 2022-08-02 2022-08-02 Siamese network target tracking method based on channel and spatial attention mechanisms

Country Status (1)

Country Link
LU (1) LU102992B1 (en)


Legal Events

Date Code Title Description
FG Patent granted

Effective date: 20240202