CN113592900A - Target tracking method and system based on attention mechanism and global reasoning - Google Patents

Target tracking method and system based on attention mechanism and global reasoning

Info

Publication number
CN113592900A
Authority
CN
China
Prior art keywords
feature map, attention mechanism, map, target tracking, global
Prior art date
Legal status
Pending
Application number
CN202110656309.6A
Other languages
Chinese (zh)
Inventor
鲍华 (Bao Hua)
束平 (Shu Ping)
许克应 (Xu Keying)
Current Assignee
Anhui University
Original Assignee
Anhui University
Priority date
Filing date
Publication date
Application filed by Anhui University
Priority to CN202110656309.6A
Publication of CN113592900A
Legal status: Pending

Classifications

    • G06T7/251 — Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving models
    • G06N3/045 — Neural network architectures; combinations of networks
    • G06N3/048 — Neural network architectures; activation functions
    • G06N3/08 — Neural networks; learning methods
    • G06N5/04 — Knowledge-based models; inference or reasoning models
    • G06T7/75 — Determining position or orientation of objects or cameras using feature-based methods involving models
    • G06T2207/10016 — Image acquisition modality; video; image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target tracking method and system based on an attention mechanism and global reasoning, belonging to the technical field of computer vision. Target tracking is performed with a target tracking model based on a twin network; the model comprises a template branch and a search branch, each of which comprises a backbone network, a parallel attention mechanism and a global reasoning module. The target tracking method comprises the following steps: acquiring an initial frame picture and a current frame picture, and taking them respectively as the inputs of the template branch and the search branch to obtain a first score map and a second score map; performing a weighted summation of the first score map and the second score map to obtain a regression map; and determining the position of the target according to the regression map. Compared with existing tracking algorithms, the method achieves a better tracking effect.

Description

Target tracking method and system based on attention mechanism and global reasoning
Technical Field
The invention relates to the technical field of computer vision, in particular to a target tracking method and a target tracking system based on an attention mechanism and global reasoning.
Background
Target tracking is one of the challenging problems in the field of computer vision and the basis for higher-level visual understanding and scene analysis. Target tracking technology is widely used in video surveillance, human-computer interaction, robotics, video editing and autonomous driving. The visual object tracking task is to track a moving target continuously and stably in subsequent frames, given the target position and size in the initial frame. Owing to interference such as scale change, rotation, deformation and rapid motion of the target, as well as changes in background illumination, long-term stable target tracking remains a difficult task.
In recent years, research on the visual tracking task has focused on two aspects: the speed of the algorithm on the one hand, and the accuracy of tracking on the other. In terms of speed, correlation filtering is one of the most successful tracking frameworks; relying mainly on the fast Fourier transform and simple handcrafted features, it runs at speeds close to 700 frames per second. However, such methods often struggle in complex situations, where performance degrades sharply. In terms of accuracy, target tracking methods based on deep learning have shown very strong results. Compared with correlation filtering algorithms, their tracking performance is greatly improved and they handle the most difficult scenes better, but their speed is low.
To address the low tracking speed of deep-learning-based target tracking algorithms, target tracking algorithms based on twin (Siamese) networks have been proposed. Researchers first formulated target tracking with a twin network by converting the tracking problem into a patch-matching problem solved with a neural network; an end-to-end twin network tracking algorithm, SiamFC, was then proposed, whose speed led to many twin-network-based target trackers appearing in the following years.
Target tracking methods based on twin networks combine high speed with good accuracy and have therefore attracted strong attention, but existing twin-network tracking algorithms still have shortcomings. Taking the typical twin networks SiamFC and SiamRPN as examples, two disadvantages stand out. First, they use a shallow network structure, extract insufficient features, and do not attend well to the tracked target itself, so tracking may fail in the face of certain challenges. Second, neither considers context information, which easily leads to tracking failure for heavily occluded or strongly deformed objects.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and to obtain a better tracking effect.
In order to achieve the above object, in one aspect the invention adopts a target tracking method based on an attention mechanism and global inference, which performs target tracking with a target tracking model based on a twin network. The target tracking model includes a template branch and a search branch, each comprising a backbone network, a parallel attention mechanism and a global inference module, and the method includes:
acquiring an initial frame picture and a current frame picture, and respectively taking the initial frame picture and the current frame picture as the input of a template branch and a search branch to obtain a first score map and a second score map;
carrying out weighted summation on the first score map and the second score map to obtain a regression map;
and determining the position of the target according to the regression map.
Further, the backbone network adopts a ResNeXt network structure and is configured to perform feature extraction on the input initial frame picture or current frame picture, obtaining a feature map that serves as the input of the attention mechanism.
Further, the attention mechanism includes a channel attention mechanism and a spatial attention mechanism, wherein:
the spatial attention mechanism is used for processing the input feature map to obtain a first feature map;
the channel attention mechanism is used for processing the input feature map to obtain a second feature map;
and adding the first feature map and the second feature map in parallel to obtain an attention feature map, and using the attention feature map as the input of the global reasoning module.
Further, the global reasoning module is used for projecting the features of the attention feature map onto the nodes of an interaction space, performing reasoning, and then mapping the features of the interaction-space nodes back to the original space to obtain a mapped feature map; the mapped feature map is added to the attention feature map to obtain the new feature map.
Further, taking the initial frame picture and the current frame picture as the inputs of the template branch and the search branch respectively to obtain the first score map and the second score map includes:
and performing cross-correlation operation on the new feature graph output by the global reasoning module in the template branch and the new feature graph output by the global reasoning module in the search branch to respectively obtain the first score graph and the second score graph.
On the other hand, the target tracking system based on the attention mechanism and the global reasoning comprises a picture acquisition module and a target tracking module, wherein:
the image acquisition module is used for acquiring an initial frame image and a current frame image;
the target tracking module is used for processing the initial frame image and the current frame image by using a target tracking model to determine the position of a target, the target tracking model comprises a template branch and a search branch, the template branch and the search branch both comprise a trunk network, a parallel attention mechanism and a global reasoning module, the template branch and the search branch respectively process the initial frame image and the current frame image to obtain a first score map and a second score map, and the first score map and the second score map are subjected to weighted summation to obtain a regression map and determine the position of the target.
Further, the backbone network adopts a ResNeXt network structure and is configured to perform feature extraction on the input initial frame picture or current frame picture, obtaining a feature map that serves as the input of the attention mechanism.
Further, the attention mechanism includes a channel attention mechanism and a spatial attention mechanism, wherein:
the spatial attention mechanism is used for processing the input feature map to obtain a first feature map;
the channel attention mechanism is used for processing the input feature map to obtain a second feature map;
and adding the first feature map and the second feature map in parallel to obtain an attention feature map, and using the attention feature map as the input of the global reasoning module.
Further, the global reasoning module is used for projecting the features of the attention feature map onto the nodes of an interaction space, performing reasoning, and then mapping the features of the interaction-space nodes back to the original space to obtain a mapped feature map; the mapped feature map is added to the attention feature map to obtain the new feature map.
Compared with the prior art, the invention has the following technical effects: the invention uses a deeper network structure and adds a parallel attention mechanism, so that the extracted features are richer; at the same time, the added global reasoning module better accounts for global context information, thereby achieving a better tracking effect.
Drawings
The following detailed description of embodiments of the invention refers to the accompanying drawings in which:
FIG. 1 is a flow diagram of a method for target tracking based on attentional mechanisms and global reasoning;
FIG. 2 is an overall tracking block diagram of the target tracking method, which includes three parts, namely a backbone network, a parallel attention mechanism and a global reasoning module;
FIG. 3 is a block diagram of a spatial attention mechanism;
FIG. 4 is a block diagram of a channel attention mechanism;
FIG. 5 is a block diagram of a global inference module;
FIG. 6 is a comparative evaluation of the tracking algorithm of the invention and five other high-performance mainstream algorithms on the OTB100 benchmark data set, where (a) is the success rate plot and (b) is the precision plot;
FIG. 7 shows the precision plots of the tracking algorithm of the invention and five other high-performance mainstream algorithms under various challenges on the OTB100 data set;
FIG. 8 shows the success rate plots of the tracking algorithm of the invention and five other high-performance mainstream algorithms under various challenges on the OTB100 data set;
FIG. 9 is a qualitative comparison of the tracking algorithm of the invention with three other tracking algorithms on four video sequences in OTB100.
Detailed Description
To further illustrate the features of the present invention, refer to the following detailed description of the invention and the accompanying drawings. The drawings are for reference and illustration purposes only and are not intended to limit the scope of the present disclosure.
As shown in fig. 1 to 2, the present embodiment discloses a target tracking method based on an attention mechanism and global inference, which performs target tracking with a target tracking model based on a twin network. The target tracking model includes a template branch and a search branch, each comprising a backbone network, a parallel attention mechanism and a global inference module. The method includes the following steps S1 to S3:
s1, acquiring an initial frame picture and a current frame picture, and respectively taking the initial frame picture and the current frame picture as the input of a template branch and a search branch to obtain a first score map and a second score map;
s2, carrying out weighted summation on the first score map and the second score map to obtain a regression map;
and S3, determining the position of the target according to the regression map.
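Before detailing each component, the wiring of one branch can be sketched as follows. This is a minimal PyTorch sketch under the assumption that the three modules described in the remainder of this section are available as `backbone`, `attention` and `glore`; the class name and composition are illustrative, not the patent's literal implementation.

```python
import torch
import torch.nn as nn

class TrackerBranch(nn.Module):
    """One branch of the twin network: backbone -> parallel attention ->
    global reasoning. Both the template branch and the search branch share
    this structure (a sketch; module internals are given further below)."""
    def __init__(self, backbone: nn.Module, attention: nn.Module, glore: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.attention = attention
        self.glore = glore

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(img)      # feature extraction
        feat = self.attention(feat)    # parallel spatial + channel attention
        return self.glore(feat)        # global reasoning with context
```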
It should be noted that the backbone network for extracting features in the target tracking model adopts the ResNeXt network structure and is used to perform feature extraction on the input initial frame picture or current frame picture, producing a feature map that serves as the input of the attention mechanism. The details are as follows:
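The patent does not specify the ResNeXt depth or cardinality, so the sketch below assumes torchvision's resnext50_32x4d (a hypothetical choice) truncated before the final stage, so that a dense spatial feature map is produced rather than a classification vector:

```python
import torch
import torch.nn as nn
from torchvision.models import resnext50_32x4d

class ResNeXtBackbone(nn.Module):
    """Truncated ResNeXt feature extractor (a sketch; the exact variant and
    cut-off layer are assumptions, not stated in the patent)."""
    def __init__(self):
        super().__init__()
        net = resnext50_32x4d(weights=None)
        # Keep the stem and the first three stages; drop the stride-32 stage
        # and the classification head so a feature map remains.
        self.features = nn.Sequential(
            net.conv1, net.bn1, net.relu, net.maxpool,
            net.layer1, net.layer2, net.layer3,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.features(x)

# e.g. a 127x127 template crop -> a (1, 1024, 8, 8) feature map
feat = ResNeXtBackbone()(torch.randn(1, 3, 127, 127))
```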
the twin network has two branches, a template branch that takes as input a given initial frame picture, and a search branch that takes as input a picture of the current frame. The two branches are subjected to feature extraction through a complete convolution network, then cross-correlation operation is performed, and finally a score graph is obtained, wherein the specific situation can be represented by the following formula:
$$S(z, x) = \varphi(z) \star \varphi(x) + b \cdot I$$

where z denotes the template picture, x denotes the search picture, $\varphi(\cdot)$ denotes the feature map generated by the convolutional neural network, $\star$ denotes the cross-correlation operation, b is an offset value, I is an identity matrix, and S(z, x) is the final score map. The feature maps obtained from the two branches are cross-correlated to obtain S(z, x), and the position with the highest score is the position of the target.
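A minimal sketch of this cross-correlation in PyTorch, treating the template feature map as the convolution kernel (batch size 1 and the feature sizes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def xcorr(template_feat: torch.Tensor, search_feat: torch.Tensor) -> torch.Tensor:
    """Cross-correlate template and search features, SiamFC-style:
    S(z, x) = phi(z) * phi(x) + b. The template feature map acts as the
    convolution kernel (a sketch; in practice b is a learnable offset)."""
    b = torch.zeros(1)  # offset value
    # template_feat: (1, C, Hz, Wz), search_feat: (1, C, Hx, Wx)
    score = F.conv2d(search_feat, template_feat)  # (1, 1, Hx-Hz+1, Wx-Wz+1)
    return score + b

score = xcorr(torch.randn(1, 256, 6, 6), torch.randn(1, 256, 22, 22))
print(score.shape)  # torch.Size([1, 1, 17, 17])
```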
As a further preferred technical solution, the attention mechanism includes a channel attention mechanism and a spatial attention mechanism, wherein:
the spatial attention mechanism is used for processing the input feature map to obtain a first feature map;
the channel attention mechanism is used for processing the input feature map to obtain a second feature map;
and adding the first feature map and the second feature map in parallel to obtain an attention feature map, and using the attention feature map as the input of the global reasoning module.
As shown in fig. 3, the spatial attention mechanism in this embodiment adopts a compact, efficient spatial attention module with a small computational cost. The input feature map $F_{SA} \in \mathbb{R}^{C\times H\times W}$ is divided according to spatial position:

$$F_{SA} = [f_{1,1}, f_{1,2}, \ldots, f_{i,j}, \ldots, f_{H,W}]$$

where $f_{i,j} \in \mathbb{R}^{C\times 1\times 1}$ denotes the feature tensor at spatial position (i, j), with $i \in \{1, 2, \ldots, H\}$ and $j \in \{1, 2, \ldots, W\}$. The feature map $F_{SA}$ is fed into two branches: one branch generates the weight coefficients while the other remains unchanged. Finally, each weight coefficient is multiplied with the feature tensor at the corresponding position of the divided feature map, and the processed feature map $\hat{F}_{SA}$ is output as follows:

$$\hat{f}_{i,j} = \sigma(\mu_{i,j}) \cdot f_{i,j}$$

where $\mu_{i,j}$ is obtained from the feature tensor $f_{i,j}$ by a 1×1 convolution, and $\sigma(\cdot)$ denotes the sigmoid activation function.
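A minimal PyTorch sketch of this spatial attention module, assuming the weight map $\mu$ is produced by a single 1×1 convolution as described above:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention: a 1x1 convolution produces one weight mu_{i,j} per
    spatial position, a sigmoid squashes it, and the weight rescales the
    feature tensor f_{i,j} at that position (a sketch)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)  # C -> 1 weight map

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.sigmoid(self.conv(x))  # (B, 1, H, W)
        return weights * x                     # broadcast over channels

out = SpatialAttention(256)(torch.randn(2, 256, 22, 22))
```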
As shown in FIG. 4, the channel attention mechanism in this embodiment divides the input feature map $F_{CA}$ into two branches. One branch keeps the original feature map unchanged; the other passes through global average pooling, a 1×1 convolution that compresses the channels, a 1×1 convolution that expands them back, and a sigmoid activation function, finally generating the weight coefficients. The original feature map is then weighted by the generated coefficients to obtain a new feature map.
The input feature map $F_{CA} \in \mathbb{R}^{C\times H\times W}$ is divided according to the number of channels:

$$F_{CA} = [f_1, f_2, \ldots, f_k, \ldots, f_C]$$

where $f_k \in \mathbb{R}^{1\times H\times W}$ and $k \in \{1, 2, \ldots, C\}$.
After global average pooling of the feature map, a feature tensor $z \in \mathbb{R}^{C\times 1\times 1}$ is generated; the value of its k-th channel is given by:

$$z_k = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} f_k(i, j)$$
The generated feature tensor then passes through two 1×1 convolution operations to obtain a new feature tensor z′, expressed by the following formula:

$$z' = W_1(\delta(W_2 z))$$

where $W_2 \in \mathbb{R}^{\frac{C}{r}\times C}$ is the weight of the first convolutional layer (the channel compression), $W_1 \in \mathbb{R}^{C\times\frac{C}{r}}$ is the weight of the second convolutional layer (the channel expansion), and $\delta(\cdot)$ is the ReLU activation function. The final feature map is obtained as follows:

$$\hat{F}_{CA} = \sigma(z') \cdot F_{CA}$$
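A minimal PyTorch sketch of this channel attention branch; the reduction ratio r of the compression convolution is an assumption, since the patent does not state it:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: global average pooling, a 1x1 convolution that
    compresses C -> C/r, ReLU, a 1x1 convolution that expands back to C,
    then a sigmoid weight per channel (a sketch; r is an assumption)."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                    # z_k per channel
        self.compress = nn.Conv2d(channels, channels // r, 1)  # W2
        self.expand = nn.Conv2d(channels // r, channels, 1)    # W1
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.pool(x)                               # (B, C, 1, 1)
        zp = self.expand(self.relu(self.compress(z)))  # z' = W1(delta(W2 z))
        return torch.sigmoid(zp) * x                   # per-channel reweighting

out = ChannelAttention(256)(torch.randn(2, 256, 22, 22))
```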
Finally, in the parallel attention mechanism, the feature map $\hat{F}_{SA}$ generated by the spatial attention mechanism and the feature map $\hat{F}_{CA}$ generated by the channel attention mechanism are added in parallel to yield a new feature map F, as shown in the following equation:

$$F = \hat{F}_{SA} + \hat{F}_{CA}$$
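Combining the two attention branches is then a single element-wise addition; the sketch below simply reuses the SpatialAttention and ChannelAttention modules sketched above.

```python
import torch.nn as nn

class ParallelAttention(nn.Module):
    """Parallel attention: both branches see the same input and their
    outputs are summed, F = F_SA + F_CA (reusing the sketches above)."""
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = SpatialAttention(channels)
        self.channel = ChannelAttention(channels)

    def forward(self, x):
        return self.spatial(x) + self.channel(x)
```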
as a further preferred technical solution, the global inference module is configured to project the features of the attention feature map onto nodes of an interaction space to form a completely connected map, perform inference, and then map the features of the interaction space nodes to an original space to obtain a new feature map; and adding the new feature map and the attention feature map to obtain a new feature map.
As shown in fig. 5, the global inference module consists of five convolutions: two for size reduction and expansion on the input feature map X and the output feature map Y (leftmost and rightmost), one for generating the bi-directional projection B between the coordinate space and the latent interaction space (top), and two for global reasoning based on the graph $A_g$ in the interaction space (middle). Here V encodes the region features as graph nodes, and $W_g$ represents the parameters of the graph convolution. The input feature map is reshaped into $X \in \mathbb{R}^{L\times c}$, where c is the number of channels and $L = h \times w$, and is mapped to the interaction space by the linear transformation:

$$V = BX$$

where $B \in \mathbb{R}^{N\times L}$ is the projection matrix and $V \in \mathbb{R}^{N\times c}$ gathers the features of the N interaction-space nodes.
The graph convolution in the interaction space can be expressed as:

$$Z = GVW_g = ((I - A_g)V)W_g$$

where G and $A_g$ denote the N×N node adjacency matrix, which diffuses information between the nodes; $W_g$ denotes the state update parameters; and $V \in \mathbb{R}^{N\times c}$ denotes the node matrix.
The interaction space is then mapped back to the original space to obtain a new feature map, specifically via the linear mapping:

$$\tilde{X} = B^{\mathsf{T}} Z$$

Finally, the new feature map is added to the original feature map to obtain the final feature map carrying context information.
In this embodiment, a global reasoning module is added to each of the two branches of the twin network, and each performs a cross-correlation operation with the other branch to obtain a score map; the resulting score maps are weighted and summed according to the following formula:
$$S(z, x) = \phi S_1(z, x) + (1 - \phi) S_2(z, x)$$

where $S_1(z, x)$ is the score map obtained with the global reasoning module added to the template branch, $S_2(z, x)$ is the score map obtained with the global reasoning module added to the search branch, and $\phi$ is the weight coefficient, with $\phi = 0.5$ in this embodiment; S(z, x) is the final output score map.
The embodiment also discloses a target tracking system based on the attention mechanism and the global reasoning, which comprises a picture acquisition module and a target tracking module, wherein:
the image acquisition module is used for acquiring an initial frame image and a current frame image;
the target tracking module is used for processing the initial frame image and the current frame image by using a target tracking model to determine the position of a target, the target tracking model comprises a template branch and a search branch, the template branch and the search branch both comprise a trunk network, a parallel attention mechanism and a global reasoning module, the template branch and the search branch respectively process the initial frame image and the current frame image to obtain a first score map and a second score map, and the first score map and the second score map are subjected to weighted summation to obtain a regression map and determine the position of the target.
As a further preferred technical solution, the backbone network adopts a ResNeXt network structure and is configured to perform feature extraction on the input initial frame picture or current frame picture, obtaining a feature map that serves as the input of the attention mechanism.
As a further preferred technical solution, the attention mechanism includes a channel attention mechanism and a spatial attention mechanism, wherein:
the spatial attention mechanism is used for processing the input feature map to obtain a first feature map;
the channel attention mechanism is used for processing the input feature map to obtain a second feature map;
and adding the first feature map and the second feature map in parallel to obtain an attention feature map, and using the attention feature map as the input of the global reasoning module.
As a further preferred technical solution, the global inference module is configured to project the features of the attention feature map onto the nodes of an interaction space, perform inference, and then map the features of the interaction-space nodes back to the original space to obtain a mapped feature map; the mapped feature map is added to the attention feature map to obtain the new feature map.
The target tracking system based on attention mechanism and global inference disclosed in this embodiment has the same technical features and technical effects as the target tracking method based on attention mechanism and global inference disclosed in the above embodiment, and details are not repeated here.
Compared with existing tracking algorithms, the invention achieves a better tracking effect, as verified by the following experiments:
the data set for the experiments of this example was OTB100[ Yi Wu, Jongwood Lim, and Ming-Hsua Yang. object tracking bearing mark. IEEE Transactions on Pattern Analysis and Machine Analysis, 37(9): 1834-1848,2015.2 ]. It consists of 100 video frames and was proposed in 2015. Different data sets are also labeled with different attributes. There are 11 different attributes. These attributes may represent common difficulties in the field of target tracking. Such as Illumination Variation (IV), Scale Variation (SV), Occlusion (OCC), Deformation (DEF), Motion Blur (MB), Fast Motion (FM), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), background similarity (BC), Low Resolution (LR).
The quality of a tracking algorithm on the OTB100 data set is assessed by its precision plot and success plot. The precision plot shows the percentage of frames for which the distance between the target center estimated by the tracking algorithm and the manually annotated target center is less than a given threshold. Since the precision plot cannot reflect changes in the size and scale of the target object, the success plot was proposed. The success plot shows, for a given overlap-rate threshold, the percentage of frames whose overlap rate exceeds that threshold, where the overlap rate is calculated by the following formula:
$$O = \frac{|B \cap G|}{|B \cup G|}$$

where O is the overlap rate, B is the bounding-box area obtained by the tracking algorithm, G is the ground-truth bounding-box area, ∩ is the intersection operation, and ∪ is the union operation.
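A small self-contained sketch of this overlap computation for axis-aligned boxes given as (x, y, w, h):

```python
def overlap_rate(b, g):
    """Overlap (coincidence) rate O = |B ∩ G| / |B ∪ G| between a tracked
    box b and a ground-truth box g, each given as (x, y, w, h)."""
    x1, y1 = max(b[0], g[0]), max(b[1], g[1])
    x2 = min(b[0] + b[2], g[0] + g[2])
    y2 = min(b[1] + b[3], g[1] + g[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)   # intersection area
    union = b[2] * b[3] + g[2] * g[3] - inter        # union area
    return inter / union if union > 0 else 0.0

print(overlap_rate((0, 0, 10, 10), (5, 5, 10, 10)))  # 25/175 ≈ 0.143
```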
As shown in FIG. 6, the tracking algorithm of the invention is tested on the OTB100 data set and the results are compared with five mainstream algorithms of recent years: SiamRPN [Li B, Yan J, Wu W, et al. High performance visual tracking with siamese region proposal network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 8971-8980], CFNet [Valmadre J, Bertinetto L, Henriques J, et al. End-to-end representation learning for correlation filter based tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 2805-2813], SiamFC3s [Wang L, Ouyang W, Wang X, et al. Visual tracking with fully convolutional networks. Proceedings of the IEEE International Conference on Computer Vision, 2015: 3119-3127], Staple [Bertinetto L, Valmadre J, Golodetz S, et al. Staple: Complementary learners for real-time tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 1401-1409] and fDSST [Danelljan M, Häger G, Khan F S, et al. Discriminative scale space tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 39(8): 1561-1575]. The algorithm of the invention achieves a better effect. Compared with the SiamFC algorithm, the disclosed tracking algorithm improves both average success rate and precision, with the success rate higher by 6.9 percentage points and the precision higher by 9.6 percentage points. Compared with the SiamRPN algorithm it also improves, with the success rate and precision higher by 1.8 and 2.4 percentage points respectively.
As shown in fig. 7 and 8, the five classic tracking algorithms and the algorithm of the invention are compared in terms of success rate and precision under the different attributes of OTB100. The tracking challenges represented in panels (a), (b), (c), (d), (e), (f), (g) and (h) of fig. 7 and 8 are, respectively: background clutter, deformation, low resolution, motion blur, occlusion, out-of-plane rotation, out-of-view and scale variation. Fig. 7 and 8 show that the tracking algorithm of the invention outperforms the SiamRPN, CFNet, SiamFC3s, Staple and fDSST algorithms in both success rate and precision in the face of the above challenges.
As shown in fig. 9, four challenging video sequences in the OTB100 data set are selected, and the results of the tracking algorithm of the invention are compared with the ground truth and with the results of SiamFC and Staple. It can be seen that the tracking algorithm of the invention has significant advantages in handling occlusion, deformation, motion blur and scale change.
The tracking challenges in the "Bolt2" video sequence are deformation and background clutter. All of the above algorithms handle the deformation challenge fairly well, but under the background-clutter challenge SiamFC performs worse, as shown in frames 235 and 252, while the algorithm of the invention still performs well.
The tracking challenges in the "Box" video sequence include illumination change, scale change, occlusion, motion blur, in-plane rotation, out-of-plane rotation, out-of-view, background clutter and low resolution. For illumination change, out-of-view and low resolution, the three tracking algorithms behave similarly in the figure, but for the six challenges of scale change, occlusion, motion blur, in-plane rotation, out-of-plane rotation and background clutter, the algorithm of the invention performs better. In the figure, the SiamFC algorithm loses the target in frames 43, 357 and 945, and the Staple algorithm loses it in frame 641, whereas the tracking state of the proposed algorithm remains stable throughout and the target is never lost.
The tracking challenges present in the "DragonBaby" video sequence are scale change, occlusion, motion blur, fast motion, in-plane rotation, out-of-plane rotation and out-of-view. As shown, all three algorithms can basically follow the target at frame 19, but only the tracking algorithm of the invention keeps following it when motion blur and fast motion occur at frame 44, occlusion at frame 48, and out-of-plane rotation at frame 80.
The tracking challenges present in the "Girl2" video sequence are scale variation, occlusion, deformation, motion blur and out-of-plane rotation. As shown in the figure, when occlusion is about to occur at frame 107, the tracking states of the three algorithms are similar, but after the occlusion only the tracking algorithm of the invention still tracks the target at frame 239. Similarly, the proposed algorithm still keeps up with the target under the deformation in frames 842 and 927, where the other two algorithms perform poorly.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (9)

1. A target tracking method based on an attention mechanism and global reasoning, characterized in that a target tracking model based on a twin network is used for target tracking, the target tracking model comprises a template branch and a search branch, the template branch and the search branch each comprise a backbone network, a parallel attention mechanism and a global reasoning module, and the target tracking method comprises the following steps:
acquiring an initial frame picture and a current frame picture, and respectively taking the initial frame picture and the current frame picture as the input of a template branch and a search branch to obtain a first score map and a second score map;
carrying out weighted summation on the first score map and the second score map to obtain a regression map;
and determining the position of the target according to the regression map.
2. The target tracking method based on attention mechanism and global inference as claimed in claim 1, wherein the backbone network adopts a ResNeXt network structure and is used to perform feature extraction on the input initial frame picture or current frame picture, obtaining a feature map that serves as the input of the attention mechanism.
3. The method of target tracking based on attention mechanism and global inference as claimed in claim 1, wherein said attention mechanism comprises a channel attention mechanism and a spatial attention mechanism, wherein:
the spatial attention mechanism is used for processing the input feature map to obtain a first feature map;
the channel attention mechanism is used for processing the input feature map to obtain a second feature map;
and adding the first feature map and the second feature map in parallel to obtain an attention feature map, and using the attention feature map as the input of the global reasoning module.
4. The target tracking method based on attention mechanism and global inference as claimed in claim 3, wherein the global inference module is used for projecting the features of the attention feature map onto the nodes of the interaction space, performing inference, and then mapping the features of the interaction-space nodes back to the original space to obtain a mapped feature map; the mapped feature map is added to the attention feature map to obtain the new feature map.
5. The method for tracking targets based on attention mechanism and global inference as claimed in claim 4, wherein taking the initial frame picture and the current frame picture as the inputs of the template branch and the search branch respectively to obtain the first score map and the second score map comprises:
and performing cross-correlation operation on the new feature graph output by the global reasoning module in the template branch and the new feature graph output by the global reasoning module in the search branch to respectively obtain the first score graph and the second score graph.
6. A target tracking system based on attention mechanism and global reasoning is characterized by comprising a picture acquisition module and a target tracking module, wherein:
the image acquisition module is used for acquiring an initial frame image and a current frame image;
the target tracking module is used for processing the initial frame image and the current frame image by using a target tracking model to determine the position of a target, the target tracking model comprises a template branch and a search branch, the template branch and the search branch both comprise a trunk network, a parallel attention mechanism and a global reasoning module, the template branch and the search branch respectively process the initial frame image and the current frame image to obtain a first score map and a second score map, and the first score map and the second score map are subjected to weighted summation to obtain a regression map and determine the position of the target.
7. The target tracking system based on attention mechanism and global inference as claimed in claim 6, wherein the backbone network adopts a ResNeXt network structure and is used to perform feature extraction on the input initial frame picture or current frame picture, obtaining a feature map that serves as the input of the attention mechanism.
8. The target tracking system based on attention mechanism and global inference of claim 6, characterized in that the attention mechanism comprises a channel attention mechanism and a spatial attention mechanism, wherein:
the spatial attention mechanism is used for processing the input feature map to obtain a first feature map;
the channel attention mechanism is used for processing the input feature map to obtain a second feature map;
and adding the first feature map and the second feature map in parallel to obtain an attention feature map, and using the attention feature map as the input of the global reasoning module.
9. The target tracking system based on attention mechanism and global inference as claimed in claim 8, wherein the global inference module is used for projecting the features of the attention feature map onto the nodes of the interaction space, performing inference, and then mapping the features of the interaction-space nodes back to the original space to obtain a mapped feature map; the mapped feature map is added to the attention feature map to obtain the new feature map.
CN202110656309.6A 2021-06-11 2021-06-11 Target tracking method and system based on attention mechanism and global reasoning Pending CN113592900A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110656309.6A CN113592900A (en) 2021-06-11 2021-06-11 Target tracking method and system based on attention mechanism and global reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110656309.6A CN113592900A (en) 2021-06-11 2021-06-11 Target tracking method and system based on attention mechanism and global reasoning

Publications (1)

Publication Number Publication Date
CN113592900A true CN113592900A (en) 2021-11-02

Family

ID=78243779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110656309.6A Pending CN113592900A (en) 2021-06-11 2021-06-11 Target tracking method and system based on attention mechanism and global reasoning

Country Status (1)

Country Link
CN (1) CN113592900A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115018878A (en) * 2022-04-21 2022-09-06 哈尔滨工业大学 Attention mechanism-based target tracking method in complex scene, storage medium and equipment
CN115661207A (en) * 2022-11-14 2023-01-31 南昌工程学院 Target tracking method and system based on space consistency matching and weight learning


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109978921A (en) * 2019-04-01 2019-07-05 南京信息工程大学 A kind of real-time video target tracking algorithm based on multilayer attention mechanism
CN110472495A (en) * 2019-07-08 2019-11-19 南京邮电大学盐城大数据研究院有限公司 A kind of deep learning face identification method based on graphical inference global characteristics
CN111192292A (en) * 2019-12-27 2020-05-22 深圳大学 Target tracking method based on attention mechanism and twin network and related equipment
CN111192270A (en) * 2020-01-03 2020-05-22 中山大学 Point cloud semantic segmentation method based on point global context reasoning
CN111354017A (en) * 2020-03-04 2020-06-30 江南大学 Target tracking method based on twin neural network and parallel attention module
CN111428699A (en) * 2020-06-10 2020-07-17 南京理工大学 Driving fatigue detection method and system combining pseudo-3D convolutional neural network and attention mechanism
CN112560695A (en) * 2020-12-17 2021-03-26 中国海洋大学 Underwater target tracking method, system, storage medium, equipment, terminal and application

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yunpeng Chen: "Graph-Based Global Reasoning Networks", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9 January 2020 (2020-01-09), pages 433-442 *


Similar Documents

Publication Publication Date Title
CN108090919B (en) Improved kernel correlation filtering tracking method based on super-pixel optical flow and adaptive learning factor
Tang et al. Real-time neural radiance talking portrait synthesis via audio-spatial decomposition
CN108830170B (en) End-to-end target tracking method based on layered feature representation
CN110135365B (en) Robust target tracking method based on illusion countermeasure network
Kim et al. Fast pedestrian detection in surveillance video based on soft target training of shallow random forest
CN110942476A (en) Improved three-dimensional point cloud registration method and system based on two-dimensional image guidance and readable storage medium
CN113592900A (en) Target tracking method and system based on attention mechanism and global reasoning
CN112288627A (en) Recognition-oriented low-resolution face image super-resolution method
CN111415318B (en) Unsupervised related filtering target tracking method and system based on jigsaw task
CN112163498A (en) Foreground guiding and texture focusing pedestrian re-identification model establishing method and application thereof
CN111488932A (en) Self-supervision video time-space characterization learning method based on frame rate perception
CN111882581B (en) Multi-target tracking method for depth feature association
CN112183675A (en) Twin network-based tracking method for low-resolution target
CN112785626A (en) Twin network small target tracking method based on multi-scale feature fusion
Yu et al. Background subtraction based on GAN and domain adaptation for VHR optical remote sensing videos
CN111968155A (en) Target tracking method based on segmented target mask updating template
Wani et al. Deep learning-based video action recognition: a review
Savadi Hosseini et al. A hybrid deep learning architecture using 3d cnns and grus for human action recognition
CN115063717B (en) Video target detection and tracking method based on real scene modeling of key area
CN116734834A (en) Positioning and mapping method and device applied to dynamic scene and intelligent equipment
CN115512263A (en) Dynamic visual monitoring method and device for falling object
CN114972426A (en) Single-target tracking method based on attention and convolution
Wang et al. A spatio-temporal attention convolution block for action recognition
Ouanan et al. Pubface: Celebrity face identification based on deep learning
Sun et al. Research on robot target recognition based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination