CN115187629A - Method for fusing target tracking features by using graph attention network

Info

Publication number
CN115187629A
CN115187629A (application CN202210569792.9A)
Authority
CN
China
Prior art keywords
features
network
module
target tracking
frame image
Prior art date
Legal status
Pending
Application number
CN202210569792.9A
Other languages
Chinese (zh)
Inventor
郑忠龙
张大伟
林飞龙
Current Assignee
Zhejiang Normal University CJNU
Original Assignee
Zhejiang Normal University CJNU
Priority date
Filing date
Publication date
Application filed by Zhejiang Normal University CJNU
Priority to CN202210569792.9A
Publication of CN115187629A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion
    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]

Abstract

The invention provides a method for fusing target tracking features by using a graph attention network, belonging to the technical field of computer vision. The method comprises the following steps: S1: cropping the current frame image to obtain a search area, and cropping the previous frame image to obtain a template image; S2: inputting the search area into a data enhancement module to obtain several groups of first training samples, and inputting the template image into the data enhancement module to obtain several groups of second training samples; S3: inputting a group of first training samples into a feature extraction module to extract search area features, and inputting a group of second training samples into the feature extraction module to extract template features; S4: inputting the search area features and the template features into a graph attention module to obtain fusion features; S5: inputting the fusion features into a regression module to obtain the final predicted bounding box coordinates. The method offers advantages in both performance and model complexity.

Description

Method for fusing target tracking features by using graph attention network
Technical Field
The invention relates to the technical field of computer vision, and in particular to a method for fusing target tracking features by using a graph attention network.
Background
Visual object tracking refers to detecting, extracting, identifying and tracking a moving object in an image sequence to obtain its position, so that subsequent processing and analysis can be performed. It has wide applications in the real world, such as intelligent video surveillance, human-computer interaction, robot visual navigation, virtual reality and medical diagnosis. Target tracking remains a challenging visual problem because the appearance of the target changes greatly across the image sequence due to deformation, occlusion, illumination changes and the like.
In recent years, deep learning methods have been successfully applied to the target tracking task because convolutional neural networks can learn effective target feature representations. Wang Naiyan et al. first proposed a target tracking model based on a convolutional neural network (CNN), SO-DLT, which uses an AlexNet-like network structure to extract target features. Since then, various CNN-based methods have emerged that significantly improve tracking performance, and there has been a trend towards designing deeper networks in pursuit of better performance.
SiamFC proposed a new fully convolutional Siamese network, which uses the Siamese structure to compute the similarity between two inputs with a convolutional network; however, as the appearance of the target keeps changing and the background becomes increasingly complex, a model that predicts only a single scale cannot adapt to the scale change of the target. SiamRPN applied the RPN module from object detection to the tracking task, i.e. classification and regression branches; the regression branch allows the algorithm to discard the original feature pyramid and solves the problem of unreliable response map scores, improving accuracy and speed at the same time while converting the original similarity calculation problem into regression and classification problems. DaSiamRPN introduced detection data sets to expand the training set and improve the generalization ability of the model, and also added hard negative samples of the same and different classes to improve its discrimination ability. However, the above trackers all build their networks on AlexNet-like architectures; attempts to train trackers with more complex architectures, such as ResNet, did not improve performance. SiamRPN++ analyzed existing trackers and found that the tracking networks proposed so far need to satisfy strict translation invariance, which padding in deep networks destroys. Current target tracking methods therefore cannot guarantee tracking performance and high efficiency at the same time, and cannot balance the relationship between performance and speed.
Chinese patent application CN111161311A, published 2020-05-15, discloses a visual multi-target tracking method and device based on deep learning. The method comprises: sequentially acquiring candidate detection boxes of tracking targets in the current video frame through a target detection network model, recording their coordinate positions and obtaining the corresponding template images; taking every frame of the video except the first frame as the image of the area to be searched; and inputting each template image together with the image of the area to be searched into a target tracking network model built from a Siamese convolutional neural network, so as to obtain the tracking result for each tracking target. In that method and device, the template images obtained with the detection network and the images of the areas to be searched are fed into the Siamese tracking network to obtain the tracking results at low computational cost, achieving real-time and accurate multi-target tracking. However, the target tracking method of the above patent cannot adapt to scale changes of the target, cannot guarantee tracking performance and high efficiency at the same time, and cannot balance the relationship between performance and speed.
Disclosure of Invention
The present invention is directed to a method for fusing target tracking features using a graph attention network, which overcomes the above-mentioned shortcomings of the prior art.
The invention provides a method for fusing target tracking features by using a graph attention network, which comprises the following steps:
s1: obtaining a current frame image and a previous frame image, cutting the current frame image according to the size and the position of a bounding box predicted by the previous frame image to obtain a search area, and cutting the previous frame image to obtain a template image;
s2: inputting the search area into a data enhancement module to obtain a plurality of groups of first training samples, and inputting the template image into the data enhancement module to obtain a plurality of groups of second training samples;
s3: inputting a group of first training samples into a feature extraction module to extract the features of a search area, and inputting a group of second training samples into the feature extraction module to extract template features;
s4: inputting the search area features and the template features into a graph attention module to obtain fusion features;
s5: and inputting the fusion features into a regression module to obtain the final predicted bounding box coordinates.
Further, step S1 comprises: F'_t = C(F_t), F'_{t-1} = C(F_{t-1}), where F_t is the current frame image, F_{t-1} is the previous frame image, and C(·) is the cropping operation. If the bounding box predicted in the previous frame image is centered at (x_{t-1}, y_{t-1}) with width W_{t-1} and height H_{t-1}, then the current frame image and the previous frame image are both cropped with (x_{t-1}, y_{t-1}) as the center and 2W_{t-1}, 2H_{t-1} as the width and height.
Further, step S2 comprises: obtaining the first training samples as Aug(F'_t) and the second training samples as Aug(F'_{t-1}), where Aug(·) is the data enhancement operation of the data enhancement module; within Aug(·), the crop of the current frame image at each time step is sampled according to a Laplace distribution.
Further, step S3 comprises: f_s = ResNet(F'_t), f_t = ResNet(F'_{t-1}), where f_s is the search area feature and f_t is the template feature; the feature extraction module comprises a feature extraction network used to extract the search area features and the template features, the feature extraction network being ResNet18 without its final average pooling layer and softmax layer.
Further, step S4 comprises: f̂ = F_0(f_s, f_t), where f̂ is the fusion feature and F_0(·) is the fusion operation of the graph attention module GAM.
Further, step S5 comprises: f = F_3(F_2(F_1(f̂))), [left, bottom, right, top] = F_4(f), where F_i(·) is the i-th group of fully connected layers, each group being provided with a ReLU activation function and a Dropout operation, F_4(·) is the output layer, and [left, bottom, right, top] are the horizontal and vertical coordinates of the lower-left and upper-right corners of the predicted bounding box.
Further, in the data enhancement operation, the parameters of the Laplace distribution are set to b_c = 1/5 and b_s = 1/5, where b_c and b_s are the variation scales of the bounding box center and size, respectively.
Further, in the data enhancement operation, the aspect-ratio variation scales of the bounding box are constrained to α_w, α_h ∈ (0.6, 1.4) to prevent the bounding box from being stretched excessively.
Further, the graph attention module GAM comprises linear transformation operations and a similarity calculation operation, and its specific operations are as follows: the search area feature f_s and the template feature f_t are split into two sets of 1 × 1 × c nodes, N_s and N_t. For each pair of nodes n_s^i ∈ N_s and n_t^j ∈ N_t, after a linear transformation, the similarity between the two nodes is calculated with the inner product e_{ij} = (W_s n_s^i)^T (W_t n_t^j), where W_s and W_t are linear transformation matrices and e_{ij} denotes the similarity between nodes n_s^i and n_t^j. After all node pairs have been computed, e_{ij} is fed into a softmax function, a_{ij} = exp(e_{ij}) / Σ_k exp(e_{ik}), where a_{ij} is the normalized similarity score. After the normalized similarity scores are obtained, one more linear transformation is applied to the two sets of nodes, and the template information embedded into each search area node is computed from the similarity scores: g_i = Σ_j a_{ij} W_g n_t^j, where W_g is a linear transformation matrix and g_i is the template information aggregated to node n_s^i. After the same kind of linear transformation is applied to n_s^i, the aggregated feature g_i and the transformed search area node feature are concatenated to give n̂_s^i, the enhanced node feature.
Further, the feature extraction module also comprises an SGD optimizer used to fine-tune the parameters of the feature extraction network during training.
The method for fusing target tracking features by using the graph attention network has the following beneficial effects:
the invention provides a deep regression network for tracking based on graph attention, which can improve the tracking performance and keep high efficiency, can construct a local relation between the characteristics of a search area and the characteristics of a template, can integrate effective target information, enhances the characteristic representation of a tracked target, better balances the relation between the performance and the speed, and has superiority in the aspects of performance and model complexity.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention. In the drawings, like reference numerals are used to indicate like elements. The drawings in the following description are directed to some, but not all embodiments of the invention. For a person skilled in the art, other figures can be derived from these figures without inventive effort.
FIG. 1 is a flow chart of a method of fusing target tracking features using a graph attention network in accordance with the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
Please refer to fig. 1. The method for fusing target tracking features by using the graph attention network comprises the following steps:
s1: obtaining a current frame image and a previous frame image, cutting the current frame image according to the size and the position of a bounding box predicted by the previous frame image to obtain a search area, and cutting the previous frame image to obtain a template image;
s2: inputting the search area into a data enhancement module to obtain a plurality of groups of first training samples, and inputting the template image into the data enhancement module to obtain a plurality of groups of second training samples;
s3: inputting a group of first training samples into a feature extraction module to extract the features of a search area, and inputting a group of second training samples into the feature extraction module to extract template features;
s4: inputting the search area features and the template features into a graph attention module to obtain fusion features;
s5: and inputting the fusion features into a regression module to obtain the final predicted bounding box coordinates.
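For orientation only, the five steps above can be sketched as a single tracking-step function in Python/PyTorch style. The function and argument names (crop, augment, backbone, gam, regressor) are placeholders for the modules described in the following paragraphs and are not names used in this application; the sketch only illustrates the data flow S1-S5.

def track_one_step(frame_t, frame_prev, prev_box, crop, augment, backbone, gam, regressor):
    # S1: crop the current frame around the previously predicted box to get the
    #     search area, and crop the previous frame to get the template image.
    search = crop(frame_t, prev_box)
    template = crop(frame_prev, prev_box)
    # S2: data enhancement produces several groups of training samples.
    search_samples = augment(search)
    template_samples = augment(template)
    # S3: extract search area features and template features with the backbone.
    f_s = backbone(search_samples[0])
    f_t = backbone(template_samples[0])
    # S4: fuse the two feature sets with the graph attention module.
    fused = gam(f_s, f_t)
    # S5: regress the final bounding box [left, bottom, right, top].
    return regressor(fused)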
The step S1 comprises the following steps: f' t =C(F t ),F′ t =C(F t ) In which F is t For the current frame image, F t-1 C (-) for the previous frame, if the previous frame predicts the bounding box and
Figure BDA0003658685160000061
is a center, and has a width and a height of W t-1 、H t-1 Then the current frame image and the previous frame image are processed to obtain the image
Figure BDA0003658685160000062
Is a center, 2W t-1 、2H t-1 And cutting for width and height.
The step S2 comprises: obtaining the first training samples as Aug(F'_t) and the second training samples as Aug(F'_{t-1}), where Aug(·) is the data enhancement operation of the data enhancement module, F'_t is the search area sample and F'_{t-1} is the template image sample. Within Aug(·), the crops are sampled according to a Laplace distribution; sampling and cropping the search area sample and the template image sample in this way yields ten groups of corresponding training samples for each.
The step S3 comprises: f_s = ResNet(F'_t), f_t = ResNet(F'_{t-1}), where f_s is the search area feature and f_t is the template feature. The feature extraction module comprises a feature extraction network used to extract the search area features and the template features; the feature extraction network is ResNet18, used without its final average pooling layer and softmax layer.
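One common way to obtain such a truncated ResNet18 is shown below using torchvision; the use of torchvision and of ImageNet pre-trained weights are assumptions of this sketch, while dropping the final average pooling and fully connected (softmax) layers follows the description above.

import torch
import torch.nn as nn
from torchvision.models import resnet18

# Keep all convolutional stages of ResNet18 and drop the last two children
# (the average pooling layer and the fully connected classifier), so the
# network outputs a spatial feature map instead of class scores.
backbone = resnet18(pretrained=True)
feature_extractor = nn.Sequential(*list(backbone.children())[:-2])

x = torch.randn(1, 3, 128, 128)      # e.g. a search-area crop
f_s = feature_extractor(x)           # feature map of shape (1, 512, 4, 4)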
The step S4 comprises: f̂ = F_0(f_s, f_t), where f̂ is the fusion feature and F_0(·) is the fusion operation of the graph attention module GAM.
The step S5 comprises: f = F_3(F_2(F_1(f̂))), [left, bottom, right, top] = F_4(f), where F_i(·) is the i-th group of fully connected layers (there are three groups in total), each group being provided with a ReLU activation function and a Dropout operation, F_4(·) is the output layer, and [left, bottom, right, top] are the horizontal and vertical coordinates of the lower-left and upper-right corners of the predicted bounding box. The number of nodes in each group of fully connected layers in the regression module may be 4096, and the number of nodes in the output layer may be 4.
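A minimal sketch of such a regression head is given below; the flattened input dimension in_dim and the dropout probability are assumptions of the sketch, while the three 4096-node fully connected groups with ReLU and Dropout and the 4-node output layer follow the description above.

import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    # Three fully connected groups (Linear + ReLU + Dropout) followed by a
    # 4-node output layer that predicts [left, bottom, right, top].
    def __init__(self, in_dim, hidden=4096, p=0.5):
        super().__init__()
        blocks = []
        dims = [in_dim, hidden, hidden, hidden]
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            blocks += [nn.Linear(d_in, d_out), nn.ReLU(inplace=True), nn.Dropout(p)]
        self.fc = nn.Sequential(*blocks)      # F1, F2, F3
        self.out = nn.Linear(hidden, 4)       # F4: output layer

    def forward(self, fused):
        f = self.fc(torch.flatten(fused, start_dim=1))
        return self.out(f)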
In the data enhancement operation, the parameters of the Laplace distribution are set to b_c = 1/5 and b_s = 1/5, where b_c and b_s are the variation scales of the bounding box center and size, respectively.
In the data enhancement operation, the aspect-ratio variation scales of the bounding box are constrained to α_w, α_h ∈ (0.6, 1.4) to prevent the bounding box from being stretched excessively.
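By way of illustration, the Laplace-distributed jitter of the bounding box center and size could be sketched as below. Making the center jitter proportional to the box size and clipping the per-axis scale factors are assumptions of this sketch; the values b_c = b_s = 1/5 and the (0.6, 1.4) limits follow the description above.

import numpy as np

def sample_jittered_boxes(cx, cy, w, h, b_c=1/5, b_s=1/5, n_samples=10, rng=None):
    # Draw center offsets and size scales from Laplace distributions and clip
    # the width/height scale factors to (0.6, 1.4) to avoid over-stretching.
    rng = rng if rng is not None else np.random.default_rng()
    boxes = []
    for _ in range(n_samples):
        dx = rng.laplace(0.0, b_c) * w
        dy = rng.laplace(0.0, b_c) * h
        a_w = float(np.clip(1.0 + rng.laplace(0.0, b_s), 0.6, 1.4))
        a_h = float(np.clip(1.0 + rng.laplace(0.0, b_s), 0.6, 1.4))
        boxes.append((cx + dx, cy + dy, w * a_w, h * a_h))
    return boxes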
In this application the entire tracking network is divided into a feature extraction module, a graph attention module, and a regression module. The minimum batch size is set to 50 and the initial learning rate to 1e-5; the learning rate is reduced by a decay factor of 0.1 every 1 × 10^5 iterations, and the network weights are optimized with an L1 loss function. The target tracking network framework is implemented on the PyTorch platform, and the experiments are carried out on an Nvidia 1080Ti graphics card.
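Under the stated settings, the optimizer, learning-rate schedule and loss could be set up as sketched below; the SGD momentum value and the helper name are assumptions of this sketch, while the initial learning rate of 1e-5, the decay factor of 0.1 every 1 × 10^5 iterations and the L1 loss follow the description above.

import torch
import torch.nn as nn

def build_training(model, lr=1e-5, step_size=100_000, gamma=0.1):
    # SGD fine-tunes all parameters (including the ResNet18 backbone); the
    # learning rate decays by a factor of 0.1 every 1e5 iterations and the
    # predicted box coordinates are regressed with an L1 loss.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=step_size, gamma=gamma)
    criterion = nn.L1Loss()
    return optimizer, scheduler, criterion

# One training iteration (pred_box, gt_box: tensors of shape (batch, 4)):
#   loss = criterion(pred_box, gt_box)
#   optimizer.zero_grad(); loss.backward(); optimizer.step(); scheduler.step()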
The graph attention module GAM comprises several linear transformation operations and a similarity calculation operation. Its specific operations are as follows: the search area feature f_s and the template feature f_t are split into two sets of 1 × 1 × c nodes, N_s and N_t. For each pair of nodes n_s^i ∈ N_s and n_t^j ∈ N_t, after a linear transformation, the similarity between the two nodes is calculated with the inner product e_{ij} = (W_s n_s^i)^T (W_t n_t^j), where W_s and W_t are linear transformation matrices and e_{ij} denotes the similarity between nodes n_s^i and n_t^j. After all node pairs have been computed, e_{ij} is fed into a softmax function, a_{ij} = exp(e_{ij}) / Σ_k exp(e_{ik}), where a_{ij} is the normalized similarity score. After the normalized similarity scores are obtained, one more linear transformation is applied to the two sets of nodes, and the template information embedded into each search area node is computed from the similarity scores: g_i = Σ_j a_{ij} W_g n_t^j, where W_g is a linear transformation matrix and g_i is the template information aggregated to node n_s^i, i.e. the aggregated feature. After the same kind of linear transformation is applied to n_s^i, the aggregated feature g_i and the transformed search area node feature are concatenated to give the enhanced node feature n̂_s^i, from which a richer feature representation is obtained. Through the graph attention module mechanism, the invention establishes the relation between the search area and the parts of the target template and describes the attention the network assigns to each part, so that richer and more effective features are obtained.
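Read literally, the node-wise operations described above amount to an attention step over 1 × 1 × c nodes, which can be sketched as the module below. Implementing the linear transformations as 1 × 1 convolutions, the final ReLU and the exact tensor reshaping are assumptions about one possible realisation; the inner-product similarity, softmax normalisation, weighted aggregation, concatenation and the 512 → 256 down-sampling layer follow the description in this application.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionFusion(nn.Module):
    # Treats every spatial position of f_s and f_t as a 1x1xc node, aggregates
    # template information to each search node via softmax-normalised
    # inner-product similarity, and concatenates it with the transformed
    # search node feature before reducing the channels back to c.
    def __init__(self, channels=256):
        super().__init__()
        self.w_s = nn.Conv2d(channels, channels, 1)        # transform of search nodes (W_s)
        self.w_t = nn.Conv2d(channels, channels, 1)        # transform of template nodes (W_t)
        self.w_g = nn.Conv2d(channels, channels, 1)        # transform before aggregation (W_g)
        self.w_v = nn.Conv2d(channels, channels, 1)        # transform of search nodes before concat
        self.down = nn.Conv2d(2 * channels, channels, 1)   # 512 -> 256 down-sampling layer

    def forward(self, f_s, f_t):
        b, c, hs, ws = f_s.shape
        q = self.w_s(f_s).flatten(2).transpose(1, 2)       # (b, Ns, c) search nodes
        k = self.w_t(f_t).flatten(2)                       # (b, c, Nt) template nodes
        e = torch.bmm(q, k)                                # (b, Ns, Nt) similarities e_ij
        a = F.softmax(e, dim=-1)                           # normalised scores a_ij over template nodes
        v = self.w_g(f_t).flatten(2).transpose(1, 2)       # (b, Nt, c)
        g = torch.bmm(a, v)                                # (b, Ns, c) aggregated template info g_i
        g = g.transpose(1, 2).reshape(b, c, hs, ws)
        fused = torch.cat([g, self.w_v(f_s)], dim=1)       # concatenation, 2c channels
        return F.relu(self.down(fused))                    # enhanced node features (b, c, hs, ws)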
The feature extraction module uses ResNet18 as the backbone network for extracting the search area features f_s and the template features f_t, and during training SGD is used to fine-tune the parameters of the backbone network.
In this embodiment, the data set, the evaluation index, and the implementation details are sequentially performed as follows:
(1) Data set
The training split of the GOT-10k data set is selected as the training data, using 9335 video sequences; the search area and the template image are obtained by cropping two adjacent frames randomly sampled from the corresponding video sequence. For testing, GOT-10k is also used as the benchmark data set.
(2) Evaluation index
For fairness of evaluation, the tracker is trained strictly following the GOT-10k protocol: the model is trained using only the GOT-10k training set and evaluated on the GOT-10k test set. The performance of the model is evaluated with the GOT-10k metrics AO, SR_{0.5} and SR_{0.75}.
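For reference, AO is the average overlap (IoU) between the predicted and ground-truth boxes over all frames, and SR_{0.5} / SR_{0.75} are the fractions of frames whose overlap exceeds 0.5 / 0.75. A small sketch of these metrics is given below; the [left, top, right, bottom] box convention is an assumption of the sketch.

import numpy as np

def iou(box_a, box_b):
    # boxes in [left, top, right, bottom] format
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter = max(0.0, min(ax2, bx2) - max(ax1, bx1)) * max(0.0, min(ay2, by2) - max(ay1, by1))
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter + 1e-9)

def got10k_metrics(pred_boxes, gt_boxes):
    overlaps = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    ao = overlaps.mean()               # average overlap
    sr_05 = (overlaps > 0.5).mean()    # success rate at IoU > 0.5
    sr_075 = (overlaps > 0.75).mean()  # success rate at IoU > 0.75
    return ao, sr_05, sr_075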
(3) Implementation details
When training the model, a video sequence is randomly selected from the training set, and two adjacent frames are then randomly selected from it. The two frames are cropped according to the position and size of the bounding box in the earlier frame to obtain the search area and the template image. In the feature extraction module, ResNet18 is used as the backbone network, without the final average pooling layer and softmax layer. In the GAM, the input features have 256 channels; after the linear transformations and similarity computation, the output feature g_i still has 256 channels, the concatenated feature has 512 channels, and a down-sampling layer finally reduces it back to 256 channels. The minimum batch size is set to 50 and the initial learning rate to 1e-5; the learning rate is reduced by a decay factor of 0.1 every 1 × 10^5 iterations, and the network weights are optimized with an L1 loss function. The training data are expanded by random sampling from a Laplace distribution. The target tracking network framework is implemented on the PyTorch platform, and the experiments are carried out on an Nvidia 1080Ti graphics card.
The present embodiment verifies the proposed method of fusing target tracking features using the graph attention network and the effectiveness of GAM through ablation experiments.
The influence of the backbone network on the model performance is analyzed firstly, then the effectiveness of the deeper convolutional neural network is proved by fine-tuning network parameters, and finally GAM is added to verify the effectiveness of the module.
Table 1 shows the results of the ablation experiments, which indicate that different backbone networks greatly affect the speed and performance of the tracker. However, when ResNet18 is used as the backbone network without fine-tuning, the performance is worse than with AlexNet without fine-tuning. The main reason is that the ResNet18 structure is deeper and its pre-trained parameters are not well suited to the tracking task. After using ResNet18 and fine-tuning it, the performance improves greatly while the tracking speed still meets real-time requirements. In addition, to study the effectiveness of the GAM, the module was added with all other conditions unchanged; as shown in Table 1, the model improves performance by 1.7%, 2.5% and 1.4%, respectively, while the tracking speed increases from 35.20 fps to 36.02 fps. The tracking performance is improved without affecting the tracking speed, which demonstrates the effectiveness of the module.
TABLE 1. Influence of the backbone network and the GAM on various performance indicators of the model on the GOT-10k data set
The model is compared with 10 advanced tracking methods on the GOT-10k data set, including SiamRPN, DaSiamRPN, CFNet, VITAL, SiamFC, GOTURN, CCOT, ECO and MDNet. As shown in Table 2, the quantitative results indicate that the invention performs well relative to these current advanced tracking methods.
TABLE 2. Performance of the proposed model and 10 comparison models on the GOT-10k data set
The invention provides a deep regression tracker that performs feature fusion based on a graph attention network and comprises a feature extraction module, a graph attention module and a regression module. The graph attention module constructs the relation between the search area and the parts of the target template, fuses effective target information, enhances the feature representation of the tracked target, and describes the attention the network assigns to each part, so that richer and more effective features are obtained. Comparisons with other advanced trackers on a benchmark data set show that the invention achieves better results, better balances the relationship between performance and speed, and offers advantages in both performance and model complexity.
The above-described aspects may be implemented individually or in various combinations, and such variations are within the scope of the present invention.
It should be noted that in the description of the present application, the terms "upper end", "lower end" and "bottom end" indicating the orientation or positional relationship are based on the orientation or positional relationship shown in the drawings, or the orientation or positional relationship which the product of the application is conventionally placed in use, and are only for convenience of describing the present application and simplifying the description, but do not indicate or imply that the device referred to must have a specific orientation, be constructed in a specific orientation and be operated, and thus should not be construed as limiting the present application. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
Finally, it should be noted that: the above examples are only for illustrating the technical solution of the present invention, and are not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for fusing target tracking features using a graph attention network, comprising the steps of:
s1: acquiring a current frame image and a previous frame image, cutting the current frame image according to the size and the position of a bounding box predicted by the previous frame image to acquire a search area, and cutting the previous frame image to acquire a template image;
s2: inputting the search area into a data enhancement module to obtain a plurality of groups of first training samples, and inputting the template image into the data enhancement module to obtain a plurality of groups of second training samples;
s3: inputting a group of first training samples into a feature extraction module to extract the features of a search area, and inputting a group of second training samples into the feature extraction module to extract template features;
s4: inputting the search area features and the template features into a graph attention module to obtain fusion features;
s5: and inputting the fusion features into a regression module to obtain the final predicted bounding box coordinates.
2. The method for fusing target tracking features using a graph attention network as claimed in claim 1, wherein step S1 comprises: F'_t = C(F_t), F'_{t-1} = C(F_{t-1}), where F_t is the current frame image, F_{t-1} is the previous frame image, and C(·) is the cropping operation; if the bounding box predicted in the previous frame image is centered at (x_{t-1}, y_{t-1}) with width W_{t-1} and height H_{t-1}, then the current frame image and the previous frame image are both cropped with (x_{t-1}, y_{t-1}) as the center and 2W_{t-1}, 2H_{t-1} as the width and height.
3. The method for fusing target tracking features using a graph attention network as claimed in claim 2, wherein step S2 comprises: obtaining the first training samples as Aug(F'_t) and the second training samples as Aug(F'_{t-1}), where Aug(·) is the data enhancement operation of the data enhancement module; within Aug(·), the crop of the current frame image at each time step is sampled according to a Laplace distribution.
4. The method for fusing target tracking features using a graph attention network as claimed in claim 3, wherein step S3 comprises: f_s = ResNet(F'_t), f_t = ResNet(F'_{t-1}), where f_s is the search area feature and f_t is the template feature; the feature extraction module comprises a feature extraction network used to extract the search area features and the template features, the feature extraction network being ResNet18 without its final average pooling layer and softmax layer.
5. The method for fusing target tracking features using a graph attention network as claimed in claim 4, wherein step S4 comprises: f̂ = F_0(f_s, f_t), where f̂ is the fusion feature and F_0(·) is the fusion operation of the graph attention module GAM.
6. The method for fusing target tracking features using a graph attention network as claimed in claim 5, wherein step S5 comprises: f = F_3(F_2(F_1(f̂))), [left, bottom, right, top] = F_4(f), where F_i(·) is the i-th group of fully connected layers, each group being provided with a ReLU activation function and a Dropout operation, F_4(·) is the output layer, and [left, bottom, right, top] are the horizontal and vertical coordinates of the lower-left and upper-right corners of the predicted bounding box.
7. The method for fusing target tracking features using a graph attention network as claimed in any one of claims 3 to 6, wherein: in the data enhancement operation, the parameters of the Laplace distribution are set to b_c = 1/5 and b_s = 1/5, where b_c and b_s are the variation scales of the bounding box center and size, respectively.
8. The method for fusing target tracking features using a graph attention network as claimed in claim 7, wherein: in the data enhancement operation, the aspect-ratio variation scales of the bounding box are constrained to α_w, α_h ∈ (0.6, 1.4) to prevent the bounding box from being stretched excessively.
9. The method for fusing target tracking features using a graph attention network as claimed in claim 5 or 6, wherein the graph attention module GAM comprises linear transformation operations and a similarity calculation operation, and the specific operations of the GAM comprise: splitting the search area feature f_s and the template feature f_t into two sets of 1 × 1 × c nodes, N_s and N_t, respectively; for each pair of nodes n_s^i ∈ N_s and n_t^j ∈ N_t, after a linear transformation, calculating the similarity between the two nodes with the inner product e_{ij} = (W_s n_s^i)^T (W_t n_t^j), where W_s and W_t are linear transformation matrices and e_{ij} denotes the similarity between nodes n_s^i and n_t^j; after all node pairs have been computed, feeding e_{ij} into a softmax function, a_{ij} = exp(e_{ij}) / Σ_k exp(e_{ik}), where a_{ij} is the normalized similarity score; after the normalized similarity scores are obtained, applying one more linear transformation to the two sets of nodes and computing, from the similarity scores, the template information embedded into each search area node: g_i = Σ_j a_{ij} W_g n_t^j, where W_g is a linear transformation matrix and g_i is the template information aggregated to node n_s^i; and, after applying the same kind of linear transformation to n_s^i, concatenating the aggregated feature g_i with the transformed search area node feature to obtain the enhanced node feature n̂_s^i.
10. The method for fusing target tracking features using a graph attention network as claimed in claim 4, wherein: the feature extraction module further comprises an SGD optimizer used to fine-tune the parameters of the feature extraction network during training.
CN202210569792.9A 2022-05-24 2022-05-24 Method for fusing target tracking features by using graph attention network Pending CN115187629A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210569792.9A CN115187629A (en) 2022-05-24 2022-05-24 Method for fusing target tracking features by using graph attention network

Publications (1)

Publication Number Publication Date
CN115187629A 2022-10-14

Family

ID=83514297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210569792.9A Pending CN115187629A (en) 2022-05-24 2022-05-24 Method for fusing target tracking features by using graph attention network

Country Status (1)

Country Link
CN (1) CN115187629A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115578421A (en) * 2022-11-17 2023-01-06 中国石油大学(华东) Target tracking algorithm based on multi-graph attention mechanism
CN115578421B (en) * 2022-11-17 2023-03-14 中国石油大学(华东) Target tracking algorithm based on multi-graph attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination