CN115578421B

CN115578421B - Target tracking algorithm based on multi-graph attention machine mechanism

Info

Publication number: CN115578421B
Application number: CN202211438781.3A
Authority: CN
Inventors: 齐玉娟; 闫石磊; 叶志鹏; 王延江; 刘宝弟
Original assignee: China University of Petroleum East China
Current assignee: China University of Petroleum East China
Priority date: 2022-11-17
Filing date: 2022-11-17
Publication date: 2023-03-14
Anticipated expiration: 2042-11-17
Also published as: CN115578421A

Abstract

The invention discloses a target tracking algorithm based on a multi-graph attention machine mechanism, which belongs to the technical field of general image data processing or generation and is used for tracking a target in a video, wherein a first frame picture and a subsequent frame in the video are respectively used as the input of a template branch and a search branch, and feature extraction is carried out on the first frame picture and the subsequent frame picture through a twin network; inputting the output characteristics obtained in the last step into a graph attention module to perform cross-correlation operation; inputting the output obtained in the last step into an anchor-free tracking head network, obtaining the classification score of each pixel point in the characteristic diagram through classification branching, obtaining the distance relation between each pixel point and the center of a target through centrality branching, and obtaining the target frame information corresponding to each pixel point through regression branching; and multiplying the classification score by the central degree branch to obtain an accurate classification score, finding the pixel point with the highest score and the corresponding target frame information thereof to obtain the position of the target of the current frame, and repeating the steps.

Description

Target tracking algorithm based on multi-graph attention machine mechanism

Technical Field

The invention discloses a target tracking algorithm based on a multi-graph attention machine mechanism, and belongs to the technical field of general image data processing or generation.

Background

Target tracking is one of three main flow directions of computer vision, and is always concerned by people, and along with the continuous deep research on the target tracking, the application field of the target tracking also becomes wider, and the target tracking is applied to the fields of intelligent monitoring, vehicle tracking, man-machine interaction and the like. In practical applications, various complex and changeable scenes are often encountered, such as the target is blocked, the background is complex and changeable, the appearance of the target is changed, the motion blur and the like, and the existing tracker cannot well deal with the problems, so that the moving target tracking still faces huge challenges, and people need to continuously explore and improve a target tracking algorithm.

Single object tracking refers to a given object to be tracked in a first frame of a video and then tracking the object in subsequent frames. The previous research is mainly based on an algorithm based on correlation filtering, and with the development of deep learning, the strong feature extraction capability of a convolutional neural network is also widely concerned by people, and the research direction of target tracking is gradually changed to the deep learning direction.

In the research process of the target tracking algorithm based on deep learning, branches gradually appear, wherein the target tracking algorithm based on the twin network enables the tracker to reasonably balance the tracking speed and the tracking precision by virtue of unique advantages of the target tracking algorithm. However, when the existing tracker faces the situations of fuzzy targets, disordered backgrounds and the like, the characteristics of the targets are difficult to extract accurately, and the positions of the targets cannot be detected accurately. On the other hand, most twin network trackers perform similarity matching with a search area by taking the characteristics of the whole template picture as a kernel, the state of a target in the tracking process is not fixed, when the target is deformed or shielded, the global characteristics of the target can change, and the accuracy of a final result can be influenced by performing global similarity matching.

Disclosure of Invention

The invention aims to provide a target tracking algorithm based on a multi-graph attention machine system, and aims to solve the problems that in the prior art, the target tracking algorithm cannot be accurately positioned to the position of a target due to the change of global characteristics of the target, the existing network characteristic extraction capability cannot cope with the complexity and the variability of a target background, and the like.

A multi-graph attention machine mechanism based target tracking algorithm, comprising:

s1, respectively taking a first frame picture and a subsequent frame in a video as input of a template branch and a search branch, and performing feature extraction on the first frame picture and the subsequent frame through a twin network;

s2, inputting the output characteristics obtained in the S1 into a graph attention module to perform cross-correlation operation;

s3, inputting the output obtained in the S2 into an anchor-free tracking head network, obtaining the classification score of each pixel point in the characteristic diagram through classification branches, obtaining the distance relation between each pixel point and a target center through centrality branches, and obtaining target frame information corresponding to each pixel point through regression branches;

s4, multiplying the classification fraction obtained in the step S3 by the central degree branch to obtain an accurate classification fraction, finding out the pixel point with the highest fraction and the corresponding target frame information thereof, and obtaining the position of the target of the current frame;

and S5, repeating the steps from S1 to S4 until the positions of the targets in all the subsequent frames of the video are obtained.

The twin network in the S1 is GoogleNet sharing weight, an Inception V3 structure is used, the twin network is combined with a SimAM attention mechanism, and the specific operation is as follows:

adjusting the InceptitionV 3 structure of GoogleNet, wherein only the convolution and pooling layer in front of the InceptitionV 3 and the InceptitionA, the InceptitionB and the InceptitionC are used, and the following Inception module and other network layers are not used;

an attention module is added, one SimAM attention module is added after each of the three included modules, and one SimAM attention module is added after the first and third included modules.

The concrete construction process of the graph attention module of the S2 comprises the following steps:

s2.1, composing the feature images of the template frame and the search frame, and enabling each feature image to be in the feature images

The size part is used as a node to construct a corresponding bipartite graph

Wherein the node setVFrom the template subgraph

Node (a) of

And searching subgraphs

Node (a) of

The components of the composition are as follows,

is also a set of nodes, which are,

；

s2.2. According to the constructed bipartite graphGFor is to

And

solving the similarity of the nodes, and respectively operating the nodes by using three graph attention modules to obtain corresponding similarity graphs;

s2.3, obtaining three similarity graphs

Normalized by softmax, respectively

Middle node pair

The attention of the middle node is obtained

Arbitrary node ofjBy polymerization of

；

S2.4. Polymerization characteristics to be obtained

And

the linear characteristics of the corresponding nodes are fused to obtain characteristic expression

；

S2.5, obtaining all nodes through the operationjIs characterized by expression of

Corresponding three complete characteristic maps are obtainedFAnd fusing the two to obtain a final feature expression for subsequent positioning and tracking.

S3, the tracking head network is divided into a classification branch and a regression branch, and the classification branch distinguishes the category of the target and positions the target; the regression branch regresses a target frame of the target to obtain scale information of the target;

the response graph obtained by the classification branch is shown as

WhereinRWhich represents the size of the response map,

respectively representing the height and the width of the response graph, 2 representing the number of channels of the response graph, and storing classification scores of all pixel points in the two channels, wherein the classification scores are respectively the probability of a positive sample and the probability of a negative sample;

the final response graph of the regression branch is

Wherein each pixel point is in one-to-one correspondence with a pixel point of the classification response map, each point (i，j) The corresponding four channels contain the distance of the point from each edge of the bounding box, denoted as

，

Is shown byi，j) A corresponding set of four channels is provided,

respectively, the distance of the point from the four sides of the bounding box.

The classification branch and the centrality branch use a cross entropy loss function to calculate the accuracy of the classification and the accuracy of the centrality score, respectively, the regression branch uses an IOU loss function, the final loss of the whole network

Expressed as:

wherein

、

And

respectively set as 1, 1 and 2,

and

respectively represent classification loss, middleHeart loss and regression loss.

S4 response diagram of the central branch is

The center degree score of each pixel point isC(i，j) Will beC(i，j) And multiplying the classification score to further obtain a more accurate target score.

Compared with the prior art, the method has the advantages that the Inception V3 structure of GoogleNet is used and modified to be more suitable for the model provided by the invention, the training parameters are reduced, and simultaneously, the method is combined with the SimAM attention mechanism, so that the target feature extraction capability in the complex background and target blurring process is greatly improved without adding new parameters, and the subsequent target position positioning accuracy is improved; by constructing a plurality of bipartite graphs on the characteristic graphs of the template branches and the search branches, the traditional global matching mode taking the whole template picture as a core is converted into local characteristic matching, the problem of inaccurate characteristic matching when a target is deformed, shielded and the like in the tracking process is effectively solved, the accuracy of classifying each pixel point in the characteristic graphs is improved, and the tracking accuracy of the tracker is improved.

Drawings

FIG. 1 is a technical flow chart of the present invention.

Fig. 2 is an overall block diagram of the present invention.

FIG. 3 is a schematic diagram of the SimAM attention mechanism of the present invention.

FIG. 4 is a block diagram of the map attention module of the present invention.

Fig. 5 is a graph comparing the accuracy of the present invention and existing tracking algorithms on the UAV 123.

Fig. 6 is a graph comparing the success rate of the present invention and existing tracking algorithms on the UAV 123.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention are described clearly and completely below, and it is obvious that the described embodiments are a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

s3, inputting the output obtained in the S2 into an anchor-free tracking head network, obtaining a classification score of each pixel point in the characteristic diagram through classification branches, obtaining a distance relation between each pixel point and a target center through centrality branches, and obtaining target frame information corresponding to each pixel point through regression branches;

The twin network in S1 is GoogleNet sharing weight, an Inception V3 structure is used, the twin network is combined with a SimAM attention mechanism, and the specific operation is as follows:

The size part is used as a node to construct a corresponding bipartite graph

Wherein the node setVFrom template subgraph

Node (a) of

And searching subgraphs

Node (a) of

The components of the composition are as follows,

is also a set of nodes, which are,

；

s2.2. According to the constructed bipartite graphGTo pair

And

s2.3, obtaining three similarity graphs

Normalized by softmax, respectively

Middle node pair

The attention of the middle node is obtained

Arbitrary node ofjBy polymerization of

；

S2.4. Polymerization characteristics to be obtained

And

；

the response graph obtained by the classification branch is shown as

WhereinRWhich represents the size of the response map,

the final response of the regression branch is plotted as

，

Is represented by (i，j) A corresponding set of four channels is provided,

Expressed as:

in which

、

And

are respectively set as 1, 1 and 2,

and

the classification loss, center loss and regression loss are indicated, respectively.

S4 response diagram of the central branch is

Now explaining part of the english meaning in the present invention, googleNet: a deep learning network architecture, inclusion v3: a neural network structure, simAM: a three-dimensional attention mechanism, incorporated b, incorporated c: a specific network module in GoogleNet, padding: filling, IOU: one measure is a criterion for detecting the accuracy of a corresponding object in a particular data set. IoU is the result of dividing the overlapping portion of two regions by the collective portion of the two regions, and this is compared to the IoU calculation by a set threshold. UAV123: a data set for testing tracker performance. CNN: convolutional neural network, group truth: artificially mark the approximate range of the object to be detected in the training set images, resNet: residual neural network, alexNet: a deep learning network architecture, GOT10K: a data set for testing tracker performance. COCO, imageNet DET, imageNet VID, and YouTube-BB: target tracking a common training set, data sets siamcat, siamCar, KCF, ocean-only, CFNet, MDNet, ECO, siamFC, SPM, siamRPN + +, siamFC + +, CGACD, siamBAN, siamRPN, siamww for training the network: some more advanced tracking algorithms for target tracking direction.

The technical process of the invention is as shown in figure 1, and an integral network of the model is constructed, wherein the integral network consists of a feature extraction module, a graph attention module and a tracking head network. The characteristic extraction module consists of two CNNs shared by weights, and is used for respectively extracting the characteristics of the template picture and the search area; the graph attention module is mainly used for solving the similarity between the template picture and the search area and embedding the characteristic information of the template into the search area; the tracking head network consists of classification and regression branches and is used for positioning and tracking the target, and the twin network structure of the invention is shown in table 1 and fig. 2.

TABLE 1

The SimAM attention mechanism is inspired by the human brain attention mechanism, and a 3D attention weight can be derived for the feature map without additional parameters, as shown in fig. 3, which is described in detail as follows: in neuroscience, information-rich neurons typically exhibit different firing patterns than peripheral neurons, and activation of neurons typically inhibits peripheral neurons, i.e., spatial domain inhibition. Neurons with spatial domain inhibitory effects should therefore be given higher importance. To find these neurons, one can measure the linear separability between one target neuron and the other neurons. Based on the findings of neuroscience, the SimAM defines an energy function, the minimization of the energy function is equivalent to training the linear separability between the neuron t and other neurons in the same channel, and a final energy function is obtained by adopting a binary label and adding a regular term. The lower the energy in the energy function, the more the neuron t differs from the peripheral neurons, the higher the importance. Thus, the importance of the neuron can be determined by

Thus obtaining the compound. Inspired by the importance of the energy function and the mining neurons, enhancement processing of features is required as defined by the attention mechanism. The whole feature extraction process can be represented by the following operations:

wherein

Which represents a convolution of the signals of the first and second,zandxrepresenting the input of the template branch and the search branch respectively,

and

the characteristic diagram obtained after characteristic extraction by inclusion V3 is shown.

The invention uses three graph attention modules to operate the graph respectively, and the obtained similarity graph can be expressed as follows:

(ii) a Wherein

And

respectively represent

And

the vector of nodes of (a) is,

、

is passing through

The convolution of (2) linearizes the node vector.

In order to solve the problems that a moving target is often exposed to illumination change, motion blur and the like. According to the invention, a bipartite graph is established by the characteristics of the template picture and the characteristics of the search area, the local relation between the nodes is established, and then similarity calculation is carried out by a plurality of graph attention modules, wherein the detailed process is shown in FIG. 4.

Characteristics of polymerization

：

，k=1,2,3, followed by polymerization features obtained

And

the linearized characteristics of the corresponding nodes in the tree are fused to obtain more expressive characteristics

，

，k =1,2,3, whereincatRepresenting the concatenation of features.

Obtaining the feature expression of all the nodes j by the operation

Three corresponding complete feature maps F are obtained and fused to obtain a final feature expression for subsequent positioning and tracking.

In which

Showing the channel-wise stitching of the three signatures,

is a three characteristic diagramThen by

And fusing the characteristic information by the convolution kernel with the size.

For faster regression networks, the classification branch will take the cross entropy loss function and the regression branch will take the IOU loss function. The upper left corner and the lower right corner of the bounding box of the target are respectively expressed by (

) And (a)

) And (4) showing. Any point in the search area (x，y) The distance around the bounding box can be expressed as:

，

； lis the distance of any point from the left bounding box,ris the distance of any point from the right bounding box,tthe distance from any point to the upper bounding box,bis the distance from any point to the lower bounding box.

And calculating the difference between the group truth bounding box and the prediction box through an IOU loss function, and regressing the target box.

According to the survey, the score of the classification branch does not necessarily accurately represent the position of the target, and most of the high-quality target frames are generated at the center of the target. The invention decides to add a central degree branch to the classification branch to further evaluate the classification score, and the response graph of the central degree branch

The centrality score C of each pixel point (C:)i，j) Expressed as:

，

for the purpose of the index function,

；

by mixingC(i，j) And the classification scores are multiplied to further obtain more accurate target scores, so that the positioning is more accurate.

The method of the present invention was tested experimentally on GOT10K and UAV123 and compared to some of the currently more advanced trackers. When comparing on the UAV123 data set, the model used by the invention is trained by only one data set GOT10K, and the other trackers are models trained by four data sets COCO, imageNet DET, imageNet VID and YouTube-BB.

UAV123 contains 123 fully annotated high definition video datasets and fiducials captured from low altitude aerial perspectives. The method comprises the attributes of aspect ratio change, background clutter, camera motion, rapid motion, complete shielding, illumination change, low resolution, out-of-view, partial shielding, similar targets, scale change and view angle change, and can better test the comprehensive performance of the tracker. The GOT10K test set consists of 180 video sequences, comprises 84 moving objects and 32 motion modes, can enable a test experiment to be closer to reality, and can better evaluate the performance of a tracker.

The tracker of the present invention was tested and evaluated with advanced trackers such as SiamGat, siamCar, KCF, ocean, etc. on GOT10K, with the final results shown in table 2.

TABLE 2

AO represents the prediction box and the true target box of the trackerThe average rate of overlap between the two,

and

respectively representing the probability that the overlap ratio of the prediction frame and the real target frame is more than 50% and more than 75% in the prediction frame successfully tracked to the target, which can more accurately evaluate the tracking precision of the tracker. It can be seen from the table that the tracker of the present invention achieves good results in terms of overall performance.

Comparing the tracker of the present invention with advanced trackers such as Ocean, siamRPN + +, siamCAR, etc., the resulting tracking accuracy map and tracking accuracy map are shown in fig. 5 and 6, respectively, fig. 5 is an OPE accuracy map on the UAV123, including accuracy and position error thresholds, and it can be seen from the figures that the model of the present invention has great advantages in both accuracy and precision, when smaller data sets are used.

From the test results on the two data sets of GOT10K and UAV123, it can be seen that the tracker of the present invention has a very good improvement in the comprehensive performance, which also verifies the effectiveness of the algorithm proposed by the present invention.

Although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that: it is to be understood that modifications may be made to the technical solutions described in the foregoing embodiments, or equivalents may be substituted for some or all of the technical features thereof, without departing from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A target tracking method based on a multi-graph attention machine mechanism is characterized by comprising the following steps:

s5, repeating S1 to S4 until the positions of the targets in all subsequent frames of the video are obtained;

The size part is used as a node to construct a corresponding bipartite graph

Wherein the node setVFrom template subgraph

Node (a) of

And searching subgraphs

Node (a) of

The components of the composition are as follows,

is also a set of nodes, which are,

；

s2.2. According to the constructed bipartite graphGTo pair

And

s2.3, obtaining three similarity graphs

Normalized by softmax, respectively

Middle node pair

The attention of the middle node is obtained

Arbitrary node ofjBy polymerization of

；

S2.4. Polymerization characteristics to be obtained

And

；

Corresponding three complete characteristic maps are obtainedFFusing the two to obtain final feature expression for subsequent positioning and tracking;

the response graph obtained by the classification branch is shown as

WhereinRWhich represents the size of the response map and,

the final response graph of the regression branch is

，

Is shown byi，j) A corresponding set of four channels is provided,

2. The method for tracking the target based on the multi-graph attention machine mechanism according to claim 1, wherein the twin network in S1 is google net sharing the weight, and an inclusion v3 structure is used, and the twin network is combined with the SimAM attention machine mechanism, and the method specifically operates as follows:

3. The method of claim 1, wherein the classification branch and the centrality branch use cross entropy loss function to calculate the accuracy of classification and the accuracy of centrality score, respectively, the regression branch uses IOU loss function, and the final loss of the whole network

Expressed as:

wherein

、

And

are respectively set as 1, 1 and 2,

and

respectively represent the classificationLoss, center loss, and regression loss.

4. The method for tracking the target based on the multi-graph attention machine mechanism as claimed in claim 3, wherein the response graph of the centrality branch of S4 is