CN113283407A - Twin network target tracking method based on channel and space attention mechanism

Info

Publication number: CN113283407A
Application number: CN202110828947.1A
Authority: CN (China)
Prior art keywords: target, attention mechanism, channel, target image, network model
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 王军, 孟晨晨, 邓承志, 王员云, 张珮芸
Current Assignee: Nanchang Institute of Technology
Original Assignee: Nanchang Institute of Technology
Filing date: 2021-07-22
Publication date: 2021-08-20
Application filed by Nanchang Institute of Technology
Priority to CN202110828947.1A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The invention provides a twin network target tracking method based on channel and spatial attention mechanisms, comprising the following steps: processing a video or image data set to obtain a plurality of target images of uniform size; constructing a novel backbone network model from a convolutional neural network model, a channel attention mechanism and a spatial attention mechanism; extracting training samples from the target images to train the novel backbone network model; extracting deep features of a target image sample from the target images with the trained model, and performing similarity matching of those deep features in the target image candidate region to obtain a plurality of target candidate blocks, each with a corresponding similarity score; and tracking the target using the candidate block with the maximum similarity score. The appearance model of the tracking algorithm designed by the invention offers better robustness and accuracy.

Description

Twin network target tracking method based on channel and space attention mechanism
Technical Field
The invention relates to the technical field of computer vision, in particular to a twin network target tracking method based on a channel and space attention mechanism.
Background
Target tracking is an important topic in computer vision, with practical applications in automatic driving, video monitoring, video analysis, medical treatment, the military and other fields. Because application scenes are wide-ranging and complex, targets tracked against cluttered backgrounds often deform, and challenges such as motion blur and occlusion arise. Given the demands of commercial, industrial, military and medical applications, research on target tracking technology is extremely valuable.
Generally, target tracking algorithms fall into two types: discriminative and generative. Discriminative model-based algorithms can effectively distinguish the tracked object from the surrounding background: they compare the target image sample with candidate samples in a given search region using a learned similarity function. In recent years, with the advent of large-scale labeled public image data sets and rapid advances in computer hardware and software, deep learning has been highly successful across image processing. In particular, discriminative correlation filters based on deep learning have been successfully applied to target tracking because of their fast operation. Tracking algorithms based on twin networks have also gained wide attention in target tracking tasks: a twin network architecture performs template matching against detected target candidate samples and computes the highest similarity between the target region and the candidate region to obtain the position of the target image.
However, the prior art does not combine a convolutional neural network model with channel and spatial attention mechanisms simultaneously when performing visual target tracking, and the resulting accuracy and robustness are not ideal.
Disclosure of Invention
In view of the above, there is a need to address the problem that prior-art visual target tracking, which does not simultaneously combine a convolutional neural network model with channel and spatial attention mechanisms, achieves accuracy and robustness that are not ideal.
The embodiment of the invention provides a twin network target tracking method based on a channel and space attention mechanism, wherein the method comprises the following steps:
Step 1: processing the video or image data set to obtain a plurality of target images of uniform image size;
Step 2: constructing a novel backbone network model based on a convolutional neural network model, a channel attention mechanism and a spatial attention mechanism;
Step 3: extracting training samples from the plurality of target images to train the novel backbone network model;
Step 4: extracting deep features of a target image sample from the plurality of target images using the trained novel backbone network model, and performing similarity matching on those deep features in the target image candidate region to obtain a plurality of target candidate blocks, each corresponding to a similarity score;
Step 5: tracking the target using the acquired target candidate block with the maximum similarity score.
The invention provides a twin network target tracking method based on channel and spatial attention mechanisms. A video or image data set is first processed to obtain target images of uniform size. A novel backbone network model is then constructed jointly from a convolutional neural network model, a channel attention mechanism and a spatial attention mechanism. Training samples are extracted from the target images to train the model. The trained model extracts deep features of the target image sample from the target images, similarity matching in the target image candidate region yields a plurality of target candidate blocks, and the candidate block with the maximum similarity score is finally used for target tracking.
The method uses GOT-10k as the training set to adjust the model parameters during offline training, so that targets in the video are represented more accurately, and performs feature extraction with a lightweight convolutional neural network model. The appearance model of the tracking algorithm designed by the invention offers better robustness and accuracy.
The twin network target tracking method based on the channel and space attention mechanism is characterized in that the novel backbone network model is a twin network framework, and the twin network framework comprises a template branch and a search branch;
wherein the step of extracting training samples from the plurality of target images comprises:
when the sub-window searching the target image extends beyond the range of the target image, the missing image portion is filled with the average RGB values.
In the twin network target tracking method based on the channel and space attention mechanism, within the twin network framework, the method comprises:
inputting a target image through the template branch and the search branch respectively, and acquiring deep features of a target image sample according to the template branch and the search branch;
in the twin network framework, the following formula holds:

$h(L_{k\tau}x) = L_{\tau}h(x)$

where $h$ denotes the mapping function of the input-output signal, $k$ denotes the stride length, $\tau$ is the translation value of the active area in the input-output signal, $L_{k\tau}$ and $L_{\tau}$ both represent translation operators, and $x$ represents the input target image.
The twin network target tracking method based on the channel and space attention mechanism is characterized in that in Step 4, the similarity score is expressed by the following formula:

$f(z, x) = \varphi(z) * \varphi(x) + b\mathbb{1}$

where $f(z, x)$ represents the similarity score between the two input target images, $b\mathbb{1}$ denotes a signal taking the deviation value $b$ at every location, $b \in \mathbb{R}$ with $\mathbb{R}$ the set of real numbers, $\varphi(z)$ and $\varphi(x)$ represent the output features of the two input target images after passing through the twin network framework, $z$ and $x$ represent the two input target images, and $\varphi$ represents the convolution embedding function.
In the twin network target tracking method based on the channel and space attention mechanism, in the step of extracting deep features of target image samples from the plurality of target images using the trained novel backbone network model, the channel attention mechanism performs the following steps:
obtaining the features of the target images of the two channels through maximum pooling and global average pooling;

inputting the features obtained after the maximum pooling and the global average pooling into a multilayer perceptron network, and obtaining a feature vector after element-wise summation;

passing the feature vector through a Sigmoid activation function to obtain a first weight coefficient, and multiplying the first weight coefficient with the input target image $Z$ to obtain a first weighted new feature.
The twin network target tracking method based on the channel and space attention mechanism is characterized in that the first weight coefficient is expressed as:

$M_c(Z) = \sigma\big(W_1(\delta(W_0(\mathrm{AvgPool}(Z)))) + W_1(\delta(W_0(\mathrm{MaxPool}(Z))))\big)$

where $M_c(Z)$ is the first weight coefficient, $\sigma$ denotes the Sigmoid activation function, $W_0$ and $W_1$ represent the weights of the shared multilayer perceptron network, $\delta$ represents the ReLU function, $\mathrm{AvgPool}$ is the global average pooling function, $\mathrm{MaxPool}$ is the maximum pooling function, and $Z$ represents the input target image;

the first weighted new feature is represented as:

$Z' = M_c(Z) \otimes Z$

where $Z'$ represents the first weighted new feature and $\otimes$ represents multiplication at the element level.
In the twin network target tracking method based on the channel and spatial attention mechanism, in the step of extracting deep features of target image samples from the plurality of target images using the trained novel backbone network model, the spatial attention mechanism performs the following steps:
obtaining the features of the target images of the two channels through maximum pooling and global average pooling, and splicing the features of the two channels through the first convolutional layer;

calculating the spliced features through a second convolutional layer and a Sigmoid activation function to obtain a second weight coefficient;

multiplying the second weight coefficient and the first weighted new feature to obtain a second weighted new feature.
The twin network target tracking method based on the channel and space attention mechanism is characterized in that the second weight coefficient is expressed as:

$M_s(Z') = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(Z'); \mathrm{MaxPool}(Z')])\big)$

where $M_s(Z')$ is the second weight coefficient, $f^{7\times 7}$ denotes a convolution whose kernel has a 7 × 7 perceptual domain, $Z'$ is the first weighted new feature, and $\sigma$ represents the Sigmoid activation function;

the second weighted new feature is represented as:

$Z'' = M_s(Z') \otimes Z'$

where $Z''$ is the second weighted new feature.
In the twin network target tracking method based on the channel and space attention mechanism, in the step of constructing the novel backbone network model based on the convolutional neural network model, the channel attention mechanism and the spatial attention mechanism:

training is performed with the plurality of target images as a training data set, wherein the training data set contains 560 moving objects and 87 motion pattern classes;

a stochastic gradient descent method is used in the training construction, with the momentum set to 0.9.
In the twin network target tracking method based on the channel and space attention mechanism, the target image features extracted by the template branch and the search branch in the twin network framework have sizes 6 × 6 × 128 and 22 × 22 × 128 respectively.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a flow chart of a twin network target tracking method based on a channel and space attention mechanism according to the present invention;
FIG. 2 is a schematic diagram of a twin network target tracking method based on a channel and space attention mechanism according to the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
These and other aspects of embodiments of the invention will be apparent with reference to the following description and attached drawings. In the description and drawings, particular embodiments of the invention have been disclosed in detail as being indicative of some of the ways in which the principles of the embodiments of the invention may be practiced, but it is understood that the scope of the embodiments of the invention is not limited correspondingly. On the contrary, the embodiments of the invention include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.
In the prior art, when visual target tracking is carried out, a convolutional neural network model, a channel attention mechanism and a space attention mechanism are not combined at the same time, and the accuracy and robustness of target tracking are not ideal.
In order to solve the technical problem, the present invention provides a twin network target tracking method based on channel and spatial attention mechanism, please refer to fig. 1 and fig. 2, wherein the method includes the following steps:
s101, processing the video or image data set to obtain a plurality of target images with uniform image sizes.
In this step, the images in the video or image data set need to be processed to a uniform size. Working with uniformly sized target images makes it convenient to input them and extract their deep features in the subsequent tracking stage.
And S102, constructing and obtaining a novel backbone network model based on the convolutional neural network model, the channel attention mechanism and the space attention mechanism.
In this embodiment, the novel backbone network model is a twin network framework comprising a template branch and a search branch. As shown in FIG. 2, $z$ corresponds to the template branch and $x$ corresponds to the search branch.

Additionally, in FIG. 2, a convolutional neural network model, a channel attention module and a spatial attention module are integrated to construct the novel backbone network model within the middle dashed box. The convolutional neural network model comprises convolutional layers 1 through 5, with the channel attention module and the spatial attention module located between convolutional layer 1 and convolutional layer 2. These modules process the deep features of the target image sample extracted from the target image in the subsequent steps.
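For concreteness, a minimal PyTorch sketch of such a backbone follows. The patent fixes only the overall structure (five convolutional layers, 128 output channels, attention between layers 1 and 2) and the feature sizes; the kernel sizes, strides, channel widths and batch normalization below are illustrative assumptions in the style of an AlexNet-like tracking backbone, chosen so that a 127 × 127 × 3 template yields a 6 × 6 × 128 feature map and a 255 × 255 × 3 search image yields 22 × 22 × 128. The ChannelAttention and SpatialAttention modules are sketched under formulas (3) and (5) below.

```python
import torch.nn as nn

class AttentionBackbone(nn.Module):
    """Sketch of the novel backbone: 5 conv layers, with channel and
    spatial attention modules inserted between conv1 and conv2."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Sequential(                  # conv layer 1 + pooling
            nn.Conv2d(3, 96, kernel_size=11, stride=2),
            nn.BatchNorm2d(96), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2))
        self.channel_att = ChannelAttention(96)      # see formula (3) sketch
        self.spatial_att = SpatialAttention()        # see formula (5) sketch
        self.conv2 = nn.Sequential(
            nn.Conv2d(96, 256, kernel_size=5),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2))
        self.conv3 = nn.Sequential(
            nn.Conv2d(256, 384, kernel_size=3),
            nn.BatchNorm2d(384), nn.ReLU(inplace=True))
        self.conv4 = nn.Sequential(
            nn.Conv2d(384, 384, kernel_size=3),
            nn.BatchNorm2d(384), nn.ReLU(inplace=True))
        self.conv5 = nn.Conv2d(384, 128, kernel_size=3)  # 128-channel output

    def forward(self, x):
        x = self.conv1(x)
        x = self.channel_att(x) * x   # Z'  = M_c(Z)  ⊗ Z,  formula (4)
        x = self.spatial_att(x) * x   # Z'' = M_s(Z') ⊗ Z', formula (6)
        return self.conv5(self.conv4(self.conv3(self.conv2(x))))
```

With these assumed strides, the template branch produces a 6 × 6 × 128 feature map and the search branch a 22 × 22 × 128 map, matching the sizes stated later in this description.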
S103, extracting training samples from the target images to train the novel backbone network model.
During training, the picture size needs to be determined according to the complexity of the model and the size of the video memory. In the present invention, the sample image input to the template branch has size 127 × 127 × 3, and the target image input to the search branch has size 255 × 255 × 3.
In the step of extracting the training samples from the plurality of target images, it is additionally noted that:
when the sub-window searching for the target image extends beyond the range of the target image, the missing image portion is filled with the average RGB values. In the subsequent testing stage (including step S104 and step S105), the target images of the two channels are respectively introduced into the template branch and the search branch of the twin network framework to obtain the deep features of the target image sample.
It should be noted that the plurality of target images serve as the training data set, which contains 560 moving objects and 87 motion pattern classes. The training data set also provides video clips of over 10,000 real-world moving objects with more than 1.5 million manually labeled bounding boxes. The novel backbone network model designed above can be trained end to end on the large-scale GOT-10k data set.
In addition, a stochastic gradient descent (SGD) method is used in the training construction, with the momentum set to 0.9. The learning rate decreases at each iteration from an initial value of 0.01 to a final value of 0.00001. The novel backbone network model proposed in the invention is trained for a total of 50 epochs, with the weight decay set to 0.0005 and a batch size of 16.
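These hyperparameters map directly onto a standard optimizer setup. The sketch below assumes a geometric (exponential) decay so that the learning rate falls from 0.01 to 0.00001 over the 50 epochs, which is one natural reading of "decreases from the initial learning rate to the final learning rate".

```python
import torch

EPOCHS, LR_INITIAL, LR_FINAL = 50, 1e-2, 1e-5

model = AttentionBackbone()                       # from the earlier sketch
optimizer = torch.optim.SGD(model.parameters(), lr=LR_INITIAL,
                            momentum=0.9, weight_decay=5e-4)
# Decay factor chosen so the rate reaches LR_FINAL at the last epoch.
gamma = (LR_FINAL / LR_INITIAL) ** (1.0 / (EPOCHS - 1))
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

# Training loop outline (batch size 16, per the patent):
# for epoch in range(EPOCHS):
#     for z, x, labels in loader:
#         ...forward, loss, backward, optimizer.step()...
#     scheduler.step()
```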
And S104, extracting deep features of the target image sample from the plurality of target images by using the trained novel backbone network model, and performing similarity matching on the deep features of the target image sample in a target image candidate region to obtain a plurality of target candidate blocks, wherein each target candidate block corresponds to a similarity score.
Specifically, in the novel backbone network model, the convolutional neural network model (CNN model) comprises 5 convolutional layers but no fully connected layer. The channel and spatial attention mechanisms consist of a channel attention module and a spatial attention module, arranged in the order channel attention module then spatial attention module and constructed after the first convolutional and pooling layers. The receptive field of the spatial attention module employs a 7 × 7 convolution kernel.
In the twin network framework, the following formula holds:

$h(L_{k\tau}x) = L_{\tau}h(x)$    (1)

where $h$ denotes the mapping function of the input-output signal, $k$ denotes the stride length, $\tau$ is the translation value of the active area in the input-output signal, $L_{k\tau}$ and $L_{\tau}$ both represent translation operators, and $x$ represents the input target image.
Furthermore, a convolution embedding function $\varphi$ is typically used to correlate the two input target images $z$ and $x$ and generate an output response map representing the similarity score between the deep features of the target image sample after the two input target images pass through the twin network framework.
The formula for the similarity score is expressed as:

$f(z, x) = \varphi(z) * \varphi(x) + b\mathbb{1}$    (2)

where $f(z, x)$ represents the similarity score between the two input target images, $b\mathbb{1}$ denotes a signal taking the deviation value $b$ at every location, $b \in \mathbb{R}$ with $\mathbb{R}$ the set of real numbers, $\varphi(z)$ and $\varphi(x)$ represent the output features of the two input target images after passing through the twin network framework, $z$ and $x$ represent the two input target images, and $\varphi$ represents the convolution embedding function.
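Formula (2) is the cross-correlation of the two branch features plus a constant bias. A minimal PyTorch sketch follows; the grouped-convolution trick for batching the correlation is a common implementation device, not something the patent prescribes.

```python
import torch
import torch.nn.functional as F

def similarity_map(phi_z, phi_x, b=0.0):
    """Formula (2): f(z, x) = phi(z) ⋆ phi(x) + b·1.
    phi_z: (B, C, 6, 6) template features; phi_x: (B, C, 22, 22) search
    features. Returns a (B, 1, 17, 17) response map of similarity scores."""
    B, C, h, w = phi_z.shape
    # Each template acts as the convolution kernel for its own search image.
    x = phi_x.reshape(1, B * C, phi_x.size(2), phi_x.size(3))
    out = F.conv2d(x, phi_z, groups=B)            # (1, B, 17, 17)
    return out.reshape(B, 1, out.size(2), out.size(3)) + b
```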
For the channel attention module described above, each channel of the feature map acts as a specific detector when extracting the relevant features of the input target image. Measures therefore need to be taken so that the channel attention module focuses on the specific features that are useful for the input target image.
Specifically, in the step of extracting deep features of the target image sample from the plurality of target images by using the trained novel backbone network model, the channel attention mechanism performs the following steps:
and A1, obtaining the characteristics of the target images of the two channels through maximum pooling and global average pooling.
In the present invention, the input target image $Z$ has size H × W × C. Max pooling and global average pooling are used to obtain the features of the target images of the two channels, each of size 1 × 1 × C.
B1: inputting the features of the target images of the two channels obtained after the maximum pooling and the global average pooling into the multilayer perceptron network, and summing element-wise to obtain a feature vector.
The features obtained after the maximum pooling and the global average pooling are then input into a multilayer perceptron network (MLP), in which the first layer has C/r neurons with a ReLU activation function and the second layer has C neurons; the parameters of the two layers are shared between the two pooled inputs. Element-wise summation of the two outputs yields a feature vector.
The resulting feature vector has size 1 × 1 × C.
C1: passing the feature vector through a Sigmoid activation function to obtain a first weight coefficient, and multiplying the first weight coefficient with the input target image $Z$ to obtain a first weighted new feature.
In this step, the first weight coefficient is expressed as:

$M_c(Z) = \sigma\big(W_1(\delta(W_0(\mathrm{AvgPool}(Z)))) + W_1(\delta(W_0(\mathrm{MaxPool}(Z))))\big)$    (3)

where $M_c(Z)$ is the first weight coefficient, $\sigma$ denotes the Sigmoid activation function, $W_0$ and $W_1$ represent the weights of the shared multilayer perceptron network, $\delta$ represents the ReLU function, $\mathrm{AvgPool}$ is the global average pooling function, and $\mathrm{MaxPool}$ is the maximum pooling function.

The first weighted new feature is represented as:

$Z' = M_c(Z) \otimes Z$    (4)

where $Z'$ represents the first weighted new feature, $\otimes$ represents multiplication at the element level, and $Z$ represents the input target image.
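The following is a minimal PyTorch sketch of formulas (3) and (4). The reduction ratio r is an assumed hyperparameter (the patent states only that the first MLP layer has C/r neurons), and 1 × 1 convolutions stand in for the shared fully connected layers, a standard equivalence for 1 × 1 × C inputs.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Returns the first weight coefficient M_c(Z) of formula (3); the
    caller forms the first weighted feature Z' = M_c(Z) ⊗ Z, formula (4)."""
    def __init__(self, channels, r=16):
        super().__init__()
        # Shared two-layer perceptron: C -> C/r (ReLU) -> C.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1, bias=False))

    def forward(self, z):
        avg = self.mlp(torch.mean(z, dim=(2, 3), keepdim=True))  # AvgPool branch
        mx = self.mlp(torch.amax(z, dim=(2, 3), keepdim=True))   # MaxPool branch
        return torch.sigmoid(avg + mx)            # M_c(Z), shape (B, C, 1, 1)

# z_prime = ChannelAttention(C)(z) * z            # formula (4)
```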
Further, after the channel attention module, a spatial attention module is introduced to focus on which features in the input target image are meaningful. Specifically, in the step of extracting deep features of the target image sample from the plurality of target images by using the trained novel backbone network model, the spatial attention mechanism performs the following steps:
and A2, obtaining the characteristics of the target images of the two channels through maximum pooling and global average pooling, and splicing the characteristics of the target images of the two channels through the first convolution layer.
Similar to the channel attention module, the input target image $Z'$ has size H × W × C. Max pooling and global average pooling along the channel dimension are used to obtain the features of the target images of the two channels, each of size H × W × 1, and these are spliced together through a standard convolutional layer (the first convolutional layer).
B2: calculating the spliced features of the target images of the two channels through a second convolutional layer and a Sigmoid activation function to obtain a second weight coefficient.
Then the second weight coefficient $M_s(Z')$ is obtained through the 7 × 7 convolutional layer and a Sigmoid activation function. Finally, the weight coefficient $M_s(Z')$ is multiplied with the input target image $Z'$ to obtain the second weighted new feature $Z''$.
The second weight coefficient is expressed as:

$M_s(Z') = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(Z'); \mathrm{MaxPool}(Z')])\big)$    (5)

where $M_s(Z')$ is the second weight coefficient, $f^{7\times 7}$ denotes a convolution whose kernel has a 7 × 7 perceptual domain, $Z'$ is the first weighted new feature, and $\sigma$ represents the Sigmoid activation function.
C2: multiplying the second weight coefficient and the first weighted new feature to obtain a second weighted new feature.
The second weighted new feature is represented as:

$Z'' = M_s(Z') \otimes Z'$    (6)

where $Z''$ is the second weighted new feature.
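A matching PyTorch sketch of formulas (5) and (6); padding of 3 is assumed so that the 7 × 7 convolution preserves the spatial size of Z'.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Returns the second weight coefficient M_s(Z') of formula (5); the
    caller forms the second weighted feature Z'' = M_s(Z') ⊗ Z', formula (6)."""
    def __init__(self):
        super().__init__()
        # One convolution with a 7 x 7 receptive field over the two pooled maps.
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, z_prime):
        avg = torch.mean(z_prime, dim=1, keepdim=True)  # H x W x 1 average map
        mx = torch.amax(z_prime, dim=1, keepdim=True)   # H x W x 1 max map
        cat = torch.cat([avg, mx], dim=1)               # splice the two maps
        return torch.sigmoid(self.conv(cat))            # M_s(Z')

# z_double_prime = SpatialAttention()(z_prime) * z_prime   # formula (6)
```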
For step S104, in summary: in the test tracking phase, the convolutional features of the target images from the two branches of the original twin network structure contain no background context information. The tracker therefore has difficulty distinguishing the target from complex background information and is prone to tracking drift and failure. For this reason, the trained novel backbone network model is used to extract deep features of the target image sample that are distinguished from the background information, so as to focus on important features and suppress useless information.
The sequential channel attention module and spatial attention module then assign greater weight to these features, and play an important role in improving the discriminative ability of the tracker. Finally, the target image features extracted by the template branch and the search branch in the twin network framework have sizes 6 × 6 × 128 and 22 × 22 × 128 respectively.
Similarity matching is then performed in the target image candidate region; that is, the similarity of all translated sub-windows is calculated on a dense grid, as in formula (2) above: the convolution embedding function $\varphi$ correlates the two inputs to generate an output response map representing the similarity score between the deep features of the target image sample after the two input target images pass through the twin network framework.

The target candidate blocks described here are all obtained from the search branch, with corresponding size 22 × 22 × 128. The similarity score is obtained by comparing the similarity between the target candidate block in the search branch (which is in essence also a target image feature) and the sample image feature in the template branch.
And S105, performing target tracking by using the acquired target candidate block with the maximum similarity score.
In this step, the target candidate block with the maximum similarity score is used for target tracking. Specifically, the similarity between the deep features of the target image sample (in the template branch) and the deep features of the candidate target image samples (in the search branch) is calculated and compared, and the region with the highest similarity score in the subsequent frame is taken as the predicted target image, thereby realizing target tracking.
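One common way to turn the response map into a predicted position is to take its peak and map the peak's offset back to search-image coordinates. The sketch below assumes a total network stride of 8 (2 from conv1 and 2 from each of the two pooling layers in the earlier backbone sketch) and omits the response upsampling and cosine-window penalty that trackers of this family often add.

```python
import torch

def locate_target(response, total_stride=8, search_center=127.0):
    """Map the peak of a (1, 1, 17, 17) response map to (x, y) pixel
    coordinates in the 255 x 255 search image (stride value assumed)."""
    _, _, H, W = response.shape
    idx = torch.argmax(response.reshape(-1)).item()
    row, col = divmod(idx, W)
    # Offset of the peak from the map center, scaled by the network stride.
    dx = (col - (W - 1) / 2) * total_stride
    dy = (row - (H - 1) / 2) * total_stride
    return search_center + dx, search_center + dy
```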
The invention provides a twin network target tracking method based on channel and spatial attention mechanisms. A video or image data set is first processed to obtain target images of uniform size; a novel backbone network model is then constructed jointly from a convolutional neural network model, a channel attention mechanism and a spatial attention mechanism; training samples are extracted from the target images to train the model; the trained model extracts deep features of the target image sample from the target images, similarity matching in the target image candidate region yields the corresponding similarity scores, and target tracking is finally performed with the target candidate block having the maximum similarity score.
The method uses GOT-10k as the training set to adjust the model parameters during offline training, so that targets in the video are represented more accurately, and performs feature extraction with a lightweight convolutional neural network model. The appearance model of the tracking algorithm designed by the invention offers better robustness and accuracy.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the appended claims.

Claims (10)

1. A twin network target tracking method based on a channel and space attention mechanism is characterized by comprising the following steps:
Step 1: processing the video or image data set to obtain a plurality of target images of uniform image size;
Step 2: constructing a novel backbone network model based on a convolutional neural network model, a channel attention mechanism and a spatial attention mechanism;
Step 3: extracting training samples from the plurality of target images to train the novel backbone network model;
Step 4: extracting deep features of a target image sample from the plurality of target images using the trained novel backbone network model, and performing similarity matching on those deep features in the target image candidate region to obtain a plurality of target candidate blocks, each corresponding to a similarity score;
Step 5: tracking the target using the acquired target candidate block with the maximum similarity score.
2. The twin network target tracking method based on the channel and space attention mechanism as claimed in claim 1, wherein the novel backbone network model is a twin network framework, the twin network framework comprises a template branch and a search branch;
wherein the step of extracting training samples from the plurality of target images comprises:
when the sub-window searching the target image extends beyond the range of the target image, the missing image portion is filled with the average RGB values.
3. The twin network target tracking method based on the channel and space attention mechanism as claimed in claim 2, wherein in the twin network framework, the method comprises:
inputting a target image through the template branch and the search branch respectively, and acquiring deep features of a target image sample according to the template branch and the search branch;
in the twin network framework, the following formula holds:

$h(L_{k\tau}x) = L_{\tau}h(x)$

where $h$ denotes the mapping function of the input-output signal, $k$ denotes the stride length, $\tau$ is the translation value of the active area in the input-output signal, $L_{k\tau}$ and $L_{\tau}$ both represent translation operators, and $x$ represents the input target image.
4. The twin network target tracking method based on the channel and space attention mechanism as claimed in claim 3, wherein in Step 4 the formula for the similarity score is expressed as:

$f(z, x) = \varphi(z) * \varphi(x) + b\mathbb{1}$

where $f(z, x)$ represents the similarity score between the two input target images, $b\mathbb{1}$ denotes a signal taking the deviation value $b$ at every location, $b \in \mathbb{R}$ with $\mathbb{R}$ the set of real numbers, $\varphi(z)$ and $\varphi(x)$ represent the output features of the two input target images after passing through the twin network framework, $z$ and $x$ represent the two input target images, and $\varphi$ represents the convolution embedding function.
5. The method of claim 4, wherein in the step of extracting deep features of the target image samples from the plurality of target images by using the trained novel backbone network model, the channel attention mechanism performs the following steps:
obtaining the features of the target images of the two channels through maximum pooling and global average pooling;

inputting the features obtained after the maximum pooling and the global average pooling into a multilayer perceptron network, and obtaining a feature vector after element-wise summation;

and obtaining a first weight coefficient by passing the feature vector through a Sigmoid activation function, and multiplying the first weight coefficient with an input target image to obtain a first weighted new feature.
6. The twin network target tracking method based on the channel and space attention mechanism as claimed in claim 5, wherein the first weight coefficient is expressed as:
$M_c(Z) = \sigma\big(W_1(\delta(W_0(\mathrm{AvgPool}(Z)))) + W_1(\delta(W_0(\mathrm{MaxPool}(Z))))\big)$

where $M_c(Z)$ is the first weight coefficient, $\sigma$ denotes the Sigmoid activation function, $W_0$ and $W_1$ represent the weights of the shared multilayer perceptron network, $\delta$ represents the ReLU function, $\mathrm{AvgPool}$ is the global average pooling function, $\mathrm{MaxPool}$ is the maximum pooling function, and $Z$ represents the input target image;

the first weighted new feature is represented as:

$Z' = M_c(Z) \otimes Z$

where $Z'$ represents the first weighted new feature and $\otimes$ represents multiplication at the element level.
7. The twin network target tracking method based on channel and spatial attention mechanism of claim 6, wherein in the step of extracting deep features of target image samples from the plurality of target images by using the trained novel backbone network model, the spatial attention mechanism performs the following steps:
obtaining the features of the target images of the two channels through maximum pooling and global average pooling, and splicing the features of the two channels through the first convolutional layer;

calculating the spliced features through a second convolutional layer and a Sigmoid activation function to obtain a second weight coefficient;

and multiplying the second weight coefficient and the first weighted new feature to obtain a second weighted new feature.
8. The twin network target tracking method based on the channel and space attention mechanism as claimed in claim 7,
the second weight coefficient is expressed as:

$M_s(Z') = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(Z'); \mathrm{MaxPool}(Z')])\big)$

where $M_s(Z')$ is the second weight coefficient, $f^{7\times 7}$ denotes a convolution whose kernel has a 7 × 7 perceptual domain, $Z'$ is the first weighted new feature, and $\sigma$ represents the Sigmoid activation function;

the second weighted new feature is represented as:

$Z'' = M_s(Z') \otimes Z'$

where $Z''$ is the second weighted new feature.
9. The twin network target tracking method based on channel and space attention mechanism as claimed in claim 1, wherein in the step of constructing a new backbone network model based on the convolutional neural network model, the channel attention mechanism and the space attention mechanism,
training with the plurality of target images as a training data set, wherein the training data set contains 560 moving objects and 87 motion pattern classes;
a stochastic gradient descent method is used in the training construction, with the momentum set to 0.9.
10. The twin network target tracking method based on the channel and space attention mechanism as claimed in claim 2, wherein the target image features extracted by the template branch and the search branch in the twin network framework have sizes 6 × 6 × 128 and 22 × 22 × 128 respectively.


Legal Events

- PB01: Publication
- SE01: Entry into force of request for substantive examination
- RJ01: Rejection of invention patent application after publication (application publication date: 2021-08-20)