CN113393496A

CN113393496A - Target tracking method based on space-time attention mechanism

Info

Publication number: CN113393496A
Application number: CN202110755862.5A
Authority: CN
Inventors: 后弘毅; 陆保国; 褚孔统
Original assignee: CETC 28 Research Institute
Current assignee: CETC 28 Research Institute
Priority date: 2021-07-05
Filing date: 2021-07-05
Publication date: 2021-09-14

Abstract

The invention provides a target tracking method based on a space-time attention mechanism, which comprises the following steps: constructing a network model for acquiring a template image and an image to be tracked of a target to be tracked; constructing a channel attention mechanism model, and fusing the channel attention mechanism model into a network model; constructing a space attention mechanism model, and fusing the space attention mechanism model into a network model; training a network model fused into the channel attention mechanism model and the space attention mechanism model according to a loss function; and tracking the target in the video by using the network model obtained by training to obtain a target tracking result. Compared with the prior art, the tracking performance is greatly improved, the target can still be stably tracked when the tracked target is subjected to complex environments such as shielding, deformation, background interference and the like, the target drifting problem in the tracking process is effectively solved, and a more stable tracking result is provided for a user.

Description

Target tracking method based on space-time attention mechanism

Technical Field

The invention relates to the technical field of electronic signal detection, in particular to a target tracking method based on a space-time attention mechanism.

Background

The target tracking technology is one of hot research directions in the field of computer vision, and has wide application in various fields such as intelligent monitoring, man-machine interaction, unmanned driving and the like. The target tracking means that in a continuous video sequence, the position relation of an object to be tracked is established to obtain the complete motion track of the object, and the position and the size of the target in a subsequent sequence image are calculated according to the target coordinate position of a first frame of a given image. The target tracking technology can provide basis for behavior understanding, reasoning, decision making and the like, is the basis of high-level video processing tasks such as subsequent target recognition, behavior analysis, video compression coding and video understanding, and is a necessary premise for executing high-level intelligent behaviors. Although the target tracking technology has advanced sufficiently in recent years, many efficient algorithms are proposed to solve the challenging problems in a specific scene, but problems such as occlusion, illumination change, scale change, background interference and the like still exist, so the research of the target tracking technology is still a difficult task.

In the target tracking method based on the full convolution twin network, the method carries out target tracking through a template matching strategy, the characteristic discrimination is insufficient, only the target information of the first frame image is used in the tracking process, and the performance of the target is reduced when the target is challenged by deformation, shielding and the like. In addition, the twin network only retains the image characteristics of the first frame, so that the target characteristics are prevented from being polluted, but the method cannot capture the change of the target in the subsequent frames. Therefore, when the target is greatly deformed, the score of the response value corresponding to the real position of the target may become low, increasing the risk of target loss.

Disclosure of Invention

The purpose of the invention is as follows: aiming at the defects in the prior art, the invention aims to provide a target tracking method based on a space-time attention mechanism, which is used for realizing a video target tracking method.

The technical scheme is as follows: the invention improves the discrimination capability of the characteristics through a twin network architecture; then, improved channel attention and spatial attention mechanisms were introduced, with different weights applied to features at different channels and spatial locations, focusing on features at spatial locations and channel locations that are advantageous for target tracking. In addition, an efficient online target template updating mechanism is provided, and the image features of the first frame and the image features with higher confidence in the subsequent tracking image frame are fused.

The invention provides a target tracking method based on a space-time attention mechanism, which comprises the following steps:

step 1, constructing a network model for acquiring a template image of a target to be tracked and an image to be tracked;

step 2, constructing a channel attention mechanism model, and fusing the channel attention mechanism model into a network model;

step 3, constructing a space attention mechanism model, and fusing the space attention mechanism model into a network model;

step 4, training a network model fused into the channel attention mechanism model and the space attention mechanism model according to a loss function;

and 5, tracking the target in the video by using the network model obtained by training to obtain a target tracking result.

Further, in one implementation, the network model includes a template branch and a search branch; the template branch is used for obtaining a template image as a first frame of a target to be tracked, and the template image is obtained by initialization, namely, the image of the first frame is initialized to the template image; the searching branch is used for receiving a searching image in the tracking process, and the searching image is a current frame of the target to be tracked in the tracking video; recording a template image of a first frame received by the template branch as z, recording a search image of a current frame received by the search branch as x, and recording a feature extraction network as x

Further, in one implementation, the step 2 includes:

step 2-1, performing feature compression on image features through a Global Average Pooling layer (Global Average Pooling) in the network model to obtain a feature vector with the size of 1 × 1 × C, wherein C is the number of channels, and in the invention, the image features refer to features obtained by performing feature extraction on a template image on a template branch;

2-2, respectively performing dimensionality reduction and dimensionality enhancement on the feature vector subjected to feature compression through two fully-connected layers in the network model and corresponding activation functions, and outputting the feature vector with the size of 1 multiplied by C;

step 2-3, generating channel weight alpha corresponding to each channel by using a Sigmoid function;

step 2-4, inputting image characteristics of the channel attention mechanism model

And obtaining a new channel weight corresponding to each channel according to the product of each channel and the channel weight alpha corresponding to the channel. In the invention, the favorable channel characteristics are selected according to the obtained new channel weight, and the favorable channel characteristics are used for carrying out template matching with the image to be tracked, thereby determining the target position.

Further, in one implementation, the step 3 includes:

step 3-1, sending the template image of the first frame into a space attention mechanism network to obtain the weight of each pixel point of a feature map, wherein the feature map is an image feature;

step 3-2, carrying out template matching on the weighted first frame image characteristic and the image to be tracked, namely, inputting the weighted first frame image into the spatial attention mechanism network, inputting each frame of image to be tracked into a network model, and inputting the weight of each pixel point in the characteristic graph and the input image characteristic

Multiplying to obtain a response graph of the target, and performing template matching;

and 3-3, taking the position corresponding to the response graph with the highest score as the position of the target to be tracked.

Further, in one implementation, the step 4 includes:

step 4-1, a space-time attention module obtained through the channel attention and the space attention of the steps 2-1 to 3-3 receives a pair of template images z and a search image x as input in an off-line training process based on a space-time attention mechanism target tracking algorithm;

step 4-2, respectively sending the template image z to a channel attention mechanism model and a space attention mechanism model for feature selection, wherein the channel attention mechanism model generates a channel weight alpha according to input image features, and the space attention mechanism network generates a weight beta according to the input image features;

step 4-3, obtaining a weighted feature map h (z) according to the following formula:

step 4-4, according to the following formula, the network model uses the weighted feature graph h (z) as a convolution kernel to send the search image x to a feature extraction network

Feature map obtained by feature extraction

Performing a sliding convolution operation:

wherein f (z, x) is a final response graph for performing feature fusion cross-correlation operation on the template image z and the search image x; in the invention, the final response graph is a graph finally obtained after the template image branch is cross-correlated with the search image branch through the attention mechanism module.

And 4-5, obtaining a final network model by continuously optimizing the loss function according to the following formula:

l(y,v)＝log(1+exp(-yv)) (3)

where l (y, v) is the loss function, y is ground-truth, and v is the predictor.

Further, in one implementation, the step 5 includes:

step 5-1, sending the first frame template image into a feature extraction network and an attention mechanism network for feature extraction;

step 5-2, performing similarity calculation on the feature graph obtained by each subsequent frame of search image through the feature extraction network and the template image by using convolution operation, namely adopting

And a Correlation calculation mode, wherein the similarity between the template image and the search image is calculated and obtained according to the following formula:

sim(A,B)＝AB/||A||||B||；

and 5-3, obtaining a response graph, and determining the final position of the target to be tracked according to the position corresponding to the response graph with the highest score, namely obtaining the target tracking result. In the invention, the higher the similarity score obtained by calculation is, the target which needs to be tracked is, and the lower the similarity score is, the target is lost.

In the prior art, the performance of the target is reduced when the target is challenged by deformation, shielding and the like, and the risk of target loss is increased.

In order to deal with the problem, the invention introduces a channel attention mechanism and a space attention mechanism, so that the algorithm focuses more on the space position and the characteristic which is beneficial to target tracking on the channel position. Specifically, the invention provides an efficient online updating mechanism, which fuses the image characteristics of the first frame with the image characteristics with higher reliability in the subsequent tracking image frame, and reduces the risk of tracking failure when the target is subjected to the challenge problems of shielding, deformation and the like. In the present invention, the image feature with higher confidence, i.e. the frame with better target condition, specifically, the selection of the frame with better target condition is realized through the aforementioned correlation step related to the channel attention mechanism. And the fusion is to send the image to the network model with template branch and search branch separately and utilize the network model to perform feature fusion. The experimental results show that: the method provided by the invention has higher precision on the OTB2013 and OTB2015 data sets.

Compared with the prior art, the invention has the following remarkable advantages:

firstly, when a tracked target is subjected to complex environments such as shielding, deformation, background interference and the like, the target can still be stably tracked;

secondly, the target drift problem in the tracking process is effectively solved, and a more stable tracking result is provided for a user.

Drawings

In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious to those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.

FIG. 1 is a schematic diagram of an algorithm architecture of a target tracking method based on a spatiotemporal attention mechanism according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a workflow of a channel attention mechanism in a space-time attention mechanism-based target tracking method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram illustrating a workflow of a spatial attention mechanism in a target tracking method based on a spatiotemporal attention mechanism according to an embodiment of the present invention;

FIG. 4a is a schematic diagram of a first tracking result of an exemplary target of a target tracking method based on a spatiotemporal attention mechanism according to an embodiment of the present invention;

FIG. 4b is a diagram illustrating a second tracking result of an exemplary target of a target tracking method based on a spatiotemporal attention mechanism according to an embodiment of the present invention;

FIG. 4c is a schematic diagram of a third tracking result of an exemplary target of the target tracking method based on the spatiotemporal attention mechanism according to the embodiment of the present invention;

FIG. 4d is a diagram illustrating a fourth tracking result of an exemplary target of a target tracking method based on a spatiotemporal attention mechanism according to an embodiment of the present invention;

FIG. 4e is a diagram illustrating a fifth tracking result of an exemplary target of the target tracking method based on the spatiotemporal attention mechanism according to an embodiment of the present invention;

fig. 4f is a schematic diagram of a sixth tracking result of an exemplary target of the target tracking method based on the spatiotemporal attention mechanism according to the embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

As shown in fig. 1 to fig. 3, the embodiment of the invention discloses a target tracking method based on a space-time attention mechanism, which is applied to a long-time target tracking scene, and the space attention mechanism can effectively capture global information and can better track a target. The method comprises the following steps:

The embodiment provides a time-based clockIn the target tracking method of the empty attention machine system, the network model comprises a template branch and a search branch; the template branch is used for obtaining a template image as a first frame of a target to be tracked, and the template image is obtained by initialization, namely, the image of the first frame is initialized to the template image; the searching branch is used for receiving a searching image in the tracking process, and the searching image is a current frame of the target to be tracked in the tracking video; recording a template image of a first frame received by the template branch as z, recording a search image of a current frame received by the search branch as x, and recording a feature extraction network as x

As shown in fig. 2, in the target tracking method based on the spatiotemporal attention mechanism provided in this embodiment, the step 2 includes:

step 2-1, performing feature compression on image features through a Global Average Pooling layer (Global Average Pooling) in the network model to obtain a feature vector with the size of 1 × 1 × C, wherein C is the number of channels, and in the embodiment, the image features refer to features obtained by performing feature extraction on a template image on a template branch;

And obtaining a new channel weight corresponding to each channel according to the product of each channel and the channel weight alpha corresponding to the channel. In this embodiment, the favorable channel feature is selected according to the obtained new channel weight, and the favorable channel feature is used to perform template matching with the image to be tracked, thereby ensuring that the image to be tracked is accurately trackedAnd (4) determining the target position.

As shown in fig. 3, in the target tracking method based on the spatiotemporal attention mechanism provided in this embodiment, the step 3 includes:

In the target tracking method based on the spatio-temporal attention mechanism provided in this embodiment, the step 4 includes:

Feature map obtained by feature extraction

Performing a sliding convolution operation:

wherein f (z, x) is a final response graph for performing feature fusion cross-correlation operation on the template image z and the search image x; in this embodiment, the final response graph is a graph obtained by performing cross-correlation on the template image branch and the search image branch through the attention mechanism module.

l(y,v)＝log(1+exp(-yv)) (3)

where l (y, v) is the loss function, y is ground-truth, and v is the predictor.

In the target tracking method based on the spatiotemporal attention mechanism provided in this embodiment, the step 5 includes:

sim(A，B)＝AB/||A||||B||；

and 5-3, obtaining a response graph, and determining the final position of the target to be tracked according to the position corresponding to the response graph with the highest score, namely obtaining the target tracking result. In this embodiment, the higher the similarity score obtained by calculation is, the target that needs to be tracked is, and the lower the similarity score is, the target is lost.

In the prior art, the performance of the target is reduced when the target is challenged by deformation, shielding and the like, and the risk of target loss is increased. The invention provides a target tracking method based on a space-time attention mechanism, which is suitable for stably tracking a target under complex environments such as shielding, illumination change, background interference and the like. The method is characterized in that: firstly, a twin network architecture is adopted to improve the discrimination capability of the features; then, improved channel attention and spatial attention mechanisms were introduced, with different weights applied to features at different channels and spatial locations, focusing on features at spatial locations and channel locations that are beneficial for target tracking. In addition, an efficient online target template updating mechanism is provided, and the image features of the first frame and the image features with higher confidence in the subsequent tracking image frame are fused for reducing the risk of target drift. And repeating the steps until one section of video tracking is completed. Finally, the proposed tracking method was tested on OTB2013 and OTB2015 datasets. The experimental results show that: compared with the performance of the current mainstream tracking algorithm, the method has the advantage that the performance is improved by 7.6%. As shown in fig. 4a to 4f, the target tracking method based on the spatio-temporal attention mechanism provided by the present embodiment visualizes part of the challenging scenes with fast motion, occlusion, scale change, and the like.

In specific implementation, the present invention further provides a computer storage medium, where the computer storage medium may store a program, and the program may include some or all of the steps in the embodiments of the target tracking method based on the spatiotemporal attention mechanism provided in the present invention when executed. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), or the like.

Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented as software plus a required general purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.

The same and similar parts in the various embodiments in this specification may be referred to each other. The above-described embodiments of the present invention should not be construed as limiting the scope of the present invention.

Claims

1. A target tracking method based on a space-time attention mechanism is characterized by comprising the following steps:

2. The space-time attention mechanism-based target tracking method according to claim 1, wherein the network model comprises a template branch and a search branch; the template branch is used for obtaining a template image as a first frame of a target to be tracked, and the template image is obtained through initialization; the searching branch is used for receiving a searching image in the tracking process, and the searching image is a current frame of the target to be tracked in the tracking video; recording a template image of a first frame received by the template branch as z, recording a search image of a current frame received by the search branch as x, and recording a feature extraction network as x

3. The space-time attention mechanism-based target tracking method according to claim 1, wherein the step 2 comprises:

step 2-1, performing feature compression on image features through a global average pooling layer in the network model to obtain a feature vector with the size of 1 × 1 × C, wherein C is the number of channels;

And obtaining a new channel weight corresponding to each channel according to the product of each channel and the channel weight alpha corresponding to the channel.

4. The space-time attention mechanism-based target tracking method according to claim 1, wherein the step 3 comprises:

5. The space-time attention mechanism-based target tracking method according to claim 1, wherein the step 4 comprises:

Feature map obtained by feature extraction

Performing a sliding convolution operation:

wherein f (z, x) is a final response graph for performing feature fusion cross-correlation operation on the template image z and the search image x;

l(y,v)＝log(1+exp(-yv)) (3)

where l (y, v) is the loss function, y is ground-truth, and v is the predictor.

6. The space-time attention mechanism-based target tracking method according to claim 1, wherein the step 5 comprises:

A Correlation calculation mode, which is to calculate and obtain the template image and search according to the following formulaSimilarity between chordal images:

sim(A,B)＝AB/||A||||B||；

and 5-3, obtaining a response graph, and determining the final position of the target to be tracked according to the position corresponding to the response graph with the highest score, namely obtaining the target tracking result.