CN111860248B - Visual target tracking method based on twin gradual attention-guided fusion network - Google Patents

Visual target tracking method based on twin gradual attention-guided fusion network

Info

Publication number
CN111860248B
Authority
CN
China
Prior art keywords
attention
feature
features
layer
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010653263.8A
Other languages
Chinese (zh)
Other versions
CN111860248A (en)
Inventor
宋晓宁
范颖
冯振华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ditu (Suzhou) Biotechnology Co.,Ltd.
Original Assignee
Shanghai Litu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Litu Information Technology Co ltd filed Critical Shanghai Litu Information Technology Co ltd
Priority to CN202010653263.8A priority Critical patent/CN111860248B/en
Publication of CN111860248A publication Critical patent/CN111860248A/en
Application granted granted Critical
Publication of CN111860248B publication Critical patent/CN111860248B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148: Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a visual target tracking method based on a twin gradual attention-guided fusion network, which comprises the following steps: using a twin backbone network to extract multi-layer feature representations of the example sample and the search sample; defining a feature aggregation module that encodes and fuses the information of each layer step by step and selectively combines high-level and shallow-level feature information to generate a comprehensive enhanced feature representation; and introducing an attention guidance module into the feature aggregation module to refine the feature representation and suppress interference caused by noise, so that the fused features are more representative. The invention has the beneficial effects that high-level semantic information is transmitted from the deep convolutional layers to the shallow convolutional layers in a top-down, layer-by-layer fusion manner, and a spatial-channel attention module is additionally introduced to refine the whole network, so that the network can selectively integrate feature information from multiple layers to generate strong features; the effectiveness of the network is demonstrated by extensive evaluation.

Description

Visual target tracking method based on twin gradual attention-guided fusion network
Technical Field
The invention relates to the technical field of visual target tracking, in particular to a visual target tracking method based on a twin gradual attention-guided fusion network.
Background
Visual target tracking is one of the fundamental problems of computer vision. It has received increasing attention in recent years and is widely applied in video surveillance, human-computer interaction, video editing and the like. However, visual target tracking still faces great challenges due to practical factors such as scale change, occlusion, deformation, fast motion and background clutter.
Most existing tracking methods are developed on the basis of two successful frameworks. The first is correlation filtering. Correlation-filtering-based algorithms achieve high computational efficiency and quite good tracking accuracy thanks to the fast Fourier transform. A notable example is MOSSE, the first method to use correlation filters for target tracking, which runs at about 700 frames per second. However, for complex tracking scenarios, the performance of such trackers typically degrades significantly. To meet various challenges and achieve competitive tracking performance, correlation filtering trackers were subsequently developed with a focus on kernel functions, motion information, multi-dimensional features, multi-scale estimation, mitigation of boundary effects, and deep convolutional features. Deep convolutional features in particular, which have stronger representation capability than hand-crafted features, have become an important means of improving tracking accuracy.
Unfortunately, however, the visual tracking community still faces several problems. First, for the second framework, tracking based on twin (Siamese) networks, the discrimination capability of these tracking algorithms depends to a large extent on the feature extraction capability of the twin network. Second, in twin networks the feature representation used for the matching similarity calculation is usually extracted from the last convolutional layer, and such deep feature representations are often low in resolution, which is not conducive to accurate localization.
Disclosure of Invention
This section is for the purpose of summarizing some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. In this section, as well as in the abstract and the title of the invention of this application, simplifications or omissions may be made to avoid obscuring the purpose of the section, the abstract and the title, and such simplifications or omissions are not intended to limit the scope of the invention.
The present invention has been made in view of the above-mentioned conventional problems.
Therefore, the technical problem solved by the invention is to provide a visual target tracking method based on a twin gradual attention-guided fusion network for target tracking.
In order to solve the above technical problem, the invention provides the following technical scheme: a visual target tracking method based on a twin gradual attention-guided fusion network, comprising the following steps: using a twin backbone network to extract multi-layer feature representations of the example sample and the search sample; defining a feature aggregation module that encodes and fuses the information of each layer step by step and selectively combines high-level and shallow-level feature information to generate a comprehensive enhanced feature representation; introducing an attention guidance module into the feature aggregation module to refine the feature representation and suppress interference caused by noise, so that the fused features are more representative; and constructing a twin gradual attention-guided fusion network tracker for visual target tracking.
As a preferred solution of the visual target tracking method based on the twin gradual attention-directed fusion network according to the present invention, wherein: the twin backbone network is constructed by the following steps: employing a modified ResNet22; dividing the network into 3 stages, with an overall stride of 8 and 22 convolutional layers in total; when a convolutional layer uses padding, using a cropping operation to remove the feature computations affected by zero padding while keeping the structure of the internal blocks unchanged; following the original ResNet for feature downsampling in the first two stages of the network; in the third stage, performing downsampling by max pooling with stride 2 instead of by a convolutional layer, this pooling layer being located in the first block of the stage, layer2-1.
As a preferred solution of the visual target tracking method based on the twin gradual attention-directed fusion network according to the present invention, wherein: the twin backbone network comprises two identical branches, an example branch and a search branch, respectively, wherein the example branch receives an input of an example sample; the search branch receives an input of a search sample; the example branch and the search branch share parameters in a convolutional neural network to ensure that the same transformation is used for both samples.
As a preferred solution of the visual target tracking method based on the twin gradual attention-directed fusion network according to the present invention, wherein: the feature aggregation module receives two parts of input, namely a low-level feature from a corresponding layer and a high-level attention feature generated by the attention guidance module at a previous layer; the upsampled high-level attention features are fused with low-level features using a cascading operation.
As a preferred solution of the visual target tracking method based on the twin gradual attention-directed fusion network according to the present invention, wherein: the feature aggregation module comprises the following steps: the high-level attention feature A_l generated by the attention guidance module, where l is the layer index, is sent into a deconvolution layer deconv; the deconvolution layer upsamples the feature to the same spatial size as the current low-level feature S_{l-1}; the upsampled high-level attention feature and the low-level feature are cascaded together, and feature channel compression is performed through a 1 × 1 convolution operation conv to generate the comprehensive feature Ŝ_{l-1}.
As a preferred solution of the visual target tracking method based on the twin gradual attention-directed fusion network according to the present invention, wherein: the method comprises the following steps: the fused comprehensive feature Ŝ_{l-1} is input into the attention guidance module to generate the attention feature A_{l-1}, which refines the feature and reduces noise, i.e. guides the attention feature A_{l-1} to focus on more representative information; the procedure is represented as follows:
Ŝ_{l-1} = conv(cat(deconv(A_l), S_{l-1}))
A_{l-1} = AttGuidModule(Ŝ_{l-1})
where cat(·) denotes the cascade operation and AttGuidModule(·) denotes the attention guidance module.
As a preferred solution of the visual target tracking method based on the twin gradual attention-directed fusion network according to the present invention, wherein: the method comprises the following steps: the generated attention feature A_{l-1} is taken as the high-level feature of layer (l-2); this high-level feature and the low-level feature S_{l-2} are input together into the feature aggregation module of the corresponding layer to repeat the feature fusion process; the output features of the last three convolutional layers of the 3rd stage of the example branch network, layer2-2, layer2-3 and layer2-4, are taken as the convolutional features S_l, l ∈ {1, 2, 3}, and the attention features A_1, A_2 and A_3 are generated layer by layer, wherein the attention feature A_3 is generated by inputting the top-level feature S_3 of the example branch network directly into the attention guidance module, i.e. A_3 = AttGuidModule(S_3); the finally generated features A_1 and A_2 and the top-level feature S_3 are respectively used for the matching similarity calculation with the output features of the corresponding layers of the search branch network.
As a preferred solution of the visual target tracking method based on the twin gradual attention-directed fusion network according to the present invention, wherein: the attention guidance module comprises a space attention mechanism and a channel attention mechanism; wherein, a spatial attention mechanism is adopted to guide the fusion feature to pay attention to a more representative feature region, a channel attention mechanism is adopted to strengthen the feature from the aspect of a channel, and the feature representation of a specific target is improved by distributing larger weight to a feature channel which has higher response to the target.
As a preferred solution of the visual target tracking method based on the twin gradual attention-directed fusion network according to the present invention, wherein: the spatial attention mechanism comprises the following steps: the input feature is denoted F ∈ R^{W×H×C}, where C, W and H represent the channel, width and height dimensions, respectively; the input feature is compressed using a 3 × 3 convolutional layer; a Sigmoid function is used for normalization to generate the attention map m ∈ R^{W×H×1} as m = σ(conv_{3×3}(F)), where σ(·) is the Sigmoid function and conv_{3×3}(·) denotes a 3 × 3 convolution operation; the final spatial attention feature F_sa ∈ R^{W×H×C} is calculated as F_sa = m ⊗ F, where ⊗ denotes element-wise multiplication.
As a preferred solution of the visual target tracking method based on the twin gradual attention-directed fusion network according to the present invention, wherein: the channel attention mechanism comprises the following steps: the spatial dependence of the input feature is compressed by global average pooling; channels that respond to the particular target are highlighted using a 1 × 1 convolutional layer and a Sigmoid function, given an input feature F ∈ R^{W×H×C}; the channel attention map u ∈ R^{C×1×1} is calculated as u = σ(conv_{1×1}(gap(F))), where conv_{1×1}(·) denotes a 1 × 1 convolution operation, gap(·) is channel-wise global average pooling and σ(·) denotes the Sigmoid function; the channel attention map u is applied to the input feature to guide the generation of a more representative attention feature as F_ca = u ⊗ F, where ⊗ denotes element-wise multiplication; the attention guidance module is formed by connecting the spatial and channel attention mechanisms in series, and the whole module process can be expressed as AttGuidModule(F) = u ⊗ (m ⊗ F).
The invention has the beneficial effects that: high-level semantic information is transmitted from the deep convolutional layers to the shallow convolutional layers in a top-down, layer-by-layer fusion manner, and a spatial-channel attention module is additionally introduced to refine the whole network, so that the network can selectively integrate feature information from multiple layers to generate strong features; the effectiveness of the network is demonstrated by extensive evaluation.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise. Wherein:
FIG. 1 is a schematic diagram of a twin gradual attention-directed fusion network framework according to the present invention;
FIG. 2 is a schematic diagram of a feature aggregation module detail in accordance with the present invention;
FIG. 3 is a schematic diagram of details of an attention guidance module according to the present invention;
FIG. 4 is a schematic diagram of an assessment success map on an OTB2013 dataset according to the present invention;
FIG. 5 is a schematic diagram of an assessment accuracy map on the OTB2013 dataset according to the present invention;
fig. 6 is a schematic diagram of an assessment success map on an OTB2015 data set according to the invention;
FIG. 7 is a schematic illustration of an evaluation accuracy map on an OTB2015 dataset according to the invention;
FIG. 8 is a schematic diagram of the evaluation success map of each module according to the present invention on an OTB2013 data set;
FIG. 9 is a schematic diagram of the modules of the present invention evaluating an accuracy map on an OTB2013 dataset;
fig. 10 is a schematic diagram of the evaluation success map of each module according to the present invention on an OTB2015 data set;
fig. 11 is a schematic diagram of modules of the present invention evaluating an exact map on an OTB2015 data set.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, specific embodiments accompanied with figures are described in detail below, and it is apparent that the described embodiments are a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present invention, shall fall within the protection scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.
Furthermore, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
The present invention will be described in detail with reference to the drawings, wherein the cross-sectional views illustrating the structure of the device are not enlarged partially in general scale for convenience of illustration, and the drawings are only exemplary and should not be construed as limiting the scope of the present invention. In addition, the three-dimensional dimensions of length, width and depth should be included in the actual fabrication.
Meanwhile, in the description of the present invention, it should be noted that the terms "upper, lower, inner and outer" and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation and operate, and thus, cannot be construed as limiting the present invention. Furthermore, the terms first, second, or third are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The terms "mounted, connected and connected" in the present invention are to be understood broadly, unless otherwise explicitly specified or limited, for example: can be fixedly connected, detachably connected or integrally connected; they may be mechanically, electrically, or directly connected, or indirectly connected through intervening media, or may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
Example 1
As target tracking technology has evolved, the accuracy of correlation-filtering-based trackers has improved dramatically on a variety of tracking benchmarks and challenges, but this has also exposed other problems. The convolutional neural networks (CNNs) that most trackers use for feature extraction are pre-trained offline on image classification data sets rather than for the tracking problem itself, so they are affected by training noise during tracking and sometimes cannot adapt well to a specific target. Some trackers overcome this problem through online learning, fine-tuning the network parameters during tracking, for example with stochastic gradient descent (SGD); however, this online learning and updating process is computationally very expensive, so such trackers cannot meet real-time requirements (>30 FPS). Therefore, in order to balance tracking performance and speed, another popular target tracking framework, twin-network-based target tracking, has emerged in recent years. These twin-network-based tracking algorithms (e.g., SiamFC) use a twin convolutional network to extract feature representations of the target template and the search area for comparison, thereby converting the tracking problem into a matching learning problem. Since the framework is trained end-to-end, it is easier to train on tracking data sets, so these tracking methods can provide highly accurate performance without any fine-tuning or online update process. Several recent extensions, which use region proposal networks (RPN) for classification and detection after feature extraction with the twin network, have achieved excellent results on a variety of benchmark data sets.
In this embodiment, a new end-to-end deep architecture is proposed, named the twin progressive attention-guided fusion network (SiamPAGF). An advanced backbone network is employed to achieve better feature discrimination capability. In addition, the deep feature representations of the backbone network, although low in resolution, have a high semantic level and can effectively distinguish different types of targets, while the shallow feature representations have high resolution and capture abundant structural detail, which is very useful for accurate localization. Therefore, this embodiment gradually fuses high-level semantic information and shallow spatial structure information in a top-down manner, and introduces an attention guidance module to refine the features, suppress fusion noise and progressively improve the feature learning capability of the network. In this way, stronger feature representations can be learned for computing the matching similarity, effectively improving tracking performance.
Twin-network-based trackers are leading the latest tracking trend because they achieve considerably high tracking accuracy at high tracking speed, but conventional methods still cannot flexibly cope with target scale changes. Referring to the illustration of fig. 1, the overall structure of the twin gradual attention guidance fusion network framework in the visual target tracking method based on a twin gradual attention-guided fusion network proposed in this embodiment is illustrated. As shown in fig. 1, SiamPAGF is composed of a twin backbone network, feature aggregation modules and attention guidance modules: the twin backbone network extracts the multi-layer feature representations of the example sample and the search sample, the feature aggregation modules gradually integrate the multi-layer feature information in a top-down manner, and the attention guidance module in each feature aggregation module uses a group of spatial-channel attentions to refine the network and guide it to intelligently select the feature representation information to be integrated.
Specifically, the method comprises the following steps,
s1: utilizing a twin backbone network is responsible for extracting the multi-layer feature representation of the example sample and the search sample. The twin backbone network in this step comprises the following construction steps,
using a modified ResNet 22;
defining a network to be divided into 3 stages, wherein the overall step length is 8 and the network consists of 22 convolutional layers;
when the convolution layer uses padding, using a clipping operation to eliminate the feature calculation affected by zero padding, and simultaneously keeping the structure of the internal block unchanged;
feature down-sampling following the original ResNet in the first two phases of the network;
in the third stage, downsampling is performed by the largest pooling with step size 2 instead of the convolutional layer, and this layer is located in the first block of this stage, layer 2-1.
Further, the twin backbone network comprises two identical branches, an example branch and a search branch, respectively, wherein the example branch receives an input of an example sample; the search branch receives input of a search sample; the example branch and the search branch share parameters in the convolutional neural network to ensure that the same transformation is used for both samples.
In addition, although deep neural networks have proved effective in twin-network-based trackers in recent years, such trackers rely on the fully convolutional property and are only suitable when the backbone network contains no padding operation. The original ResNet can learn very strong feature representations, but the padding it uses introduces a position bias in the twin tracking framework, which reduces the matching similarity between the target and the search sample and degrades tracking performance.
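A minimal PyTorch sketch of this cropping idea is given below; the block structure, channel sizes and the one-pixel crop are illustrative assumptions rather than the exact layers of the modified ResNet22.

```python
import torch
import torch.nn as nn

class CroppedResidualUnit(nn.Module):
    """Residual unit that crops away border features touched by zero padding."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.body(x) + x)
        # Crop the outermost border so that activations influenced by zero
        # padding never reach the similarity computation; the crop width can
        # be matched to the padding used inside the block.
        return out[:, :, 1:-1, 1:-1]

if __name__ == "__main__":
    feat = torch.randn(1, 64, 15, 15)
    print(CroppedResidualUnit(64)(feat).shape)  # torch.Size([1, 64, 13, 13])
```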
S2: a feature aggregation module is defined to gradually encode and fuse the information of each layer and to selectively combine high-level and shallow-level feature information, generating a comprehensive enhanced feature representation. In this step, due to downsampling operations such as pooling and convolution, the resolution of the feature map extracted from the last layer of the twin backbone network is greatly reduced (for example, the feature map at the last layer of the example branch is 5 × 5). Although the deep features provide a good understanding of semantic information, they make it difficult to locate the target accurately. Therefore, this embodiment uses not only the high-level features of the top layer but also the low-level features containing spatial structure information for the matching similarity calculation, which is very beneficial for the final target tracking result. It should also be noted, however, that processing each layer of features independently, i.e., directly using shallow and deep features for the computation, is often not effective.
To this end, with reference to the schematic of fig. 2, a feature aggregation module is proposed in the network. The feature aggregation module receives two parts of input, namely low-level features from a corresponding layer and high-level attention features generated by an attention guide module at a previous layer; and the upsampled high-level attention features are fused with the low-level features by adopting a cascading operation.
Further, the feature aggregation module comprises the following steps:
the high-level attention feature A_l generated by the attention guidance module, where l is the layer index, is sent into a deconvolution layer deconv;
the deconvolution layer upsamples the feature to the same spatial size as the current low-level feature S_{l-1};
the upsampled high-level attention feature and the low-level feature are cascaded together, and feature channel compression is performed through a 1 × 1 convolution operation conv to generate the comprehensive feature Ŝ_{l-1}, as illustrated by the sketch below.
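The following is a minimal PyTorch sketch of this aggregation step (deconv upsampling, cascade, 1 × 1 compression); the module name, channel sizes and the stride-2 deconvolution are assumptions for illustration, not the patent's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAggregation(nn.Module):
    """Fuse a high-level attention feature into the current low-level feature."""

    def __init__(self, low_channels: int, high_channels: int, out_channels: int):
        super().__init__()
        # deconv: upsample the high-level attention feature
        self.deconv = nn.ConvTranspose2d(high_channels, high_channels,
                                         kernel_size=2, stride=2)
        # conv: 1x1 convolution compressing the concatenated channels
        self.fuse = nn.Conv2d(low_channels + high_channels, out_channels,
                              kernel_size=1)

    def forward(self, low_feat: torch.Tensor, high_att: torch.Tensor) -> torch.Tensor:
        up = self.deconv(high_att)
        # Match the spatial size of the low-level feature exactly before cascading.
        if up.shape[-2:] != low_feat.shape[-2:]:
            up = F.interpolate(up, size=low_feat.shape[-2:],
                               mode="bilinear", align_corners=False)
        fused = torch.cat([up, low_feat], dim=1)   # cat(.)
        return self.fuse(fused)                    # conv(.) -> comprehensive feature
```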
S3: and an attention guide module is introduced into the feature aggregation module, so that feature representation is refined, interference caused by noise is suppressed, and fused features are more representative. Referring to the schematic of fig. 3, the present embodiment combines the low-level features and the upper-level attention features in a cascade manner in the feature aggregation module to obtain a comprehensive feature expression. The method is simple and intuitive, but also brings certain defects. For example, due to the contradiction between different levels of feature information, some noise is inevitably brought to affect the final tracking effect. Therefore, an attention-guiding module is proposed here to further refine the features and suppress the interference caused by noise, so that the fused features are more representative.
Further, the method also comprises the following steps:
the fused comprehensive feature Ŝ_{l-1} is input into the attention guidance module to generate the attention feature A_{l-1}, which refines the feature and reduces noise, i.e. guides the attention feature A_{l-1} to focus on more representative information;
the procedure is represented as follows:
Ŝ_{l-1} = conv(cat(deconv(A_l), S_{l-1}))
A_{l-1} = AttGuidModule(Ŝ_{l-1})
where cat(·) denotes the cascade operation, AttGuidModule(·) denotes the attention guidance module, and l is the layer index.
Further, the method comprises the following steps:
the generated attention feature A_{l-1} is taken as the high-level feature of layer (l-2);
this high-level feature and the low-level feature S_{l-2} are input together into the feature aggregation module of the corresponding layer to repeat the feature fusion process;
the output features of the last three convolutional layers of the 3rd stage of the example branch network, layer2-2, layer2-3 and layer2-4, are taken as the convolutional features S_l, l ∈ {1, 2, 3};
the attention features A_1, A_2 and A_3 are generated layer by layer, wherein the attention feature A_3 is generated by inputting the top-level feature S_3 of the example branch network directly into the attention guidance module, i.e. A_3 = AttGuidModule(S_3);
the finally generated features A_1 and A_2 and the top-level feature S_3 are respectively used for the matching similarity calculation with the output features of the corresponding layers of the search branch network, which is optimal for the final tracking effect; the overall top-down flow is sketched below.
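The top-down wiring described above can be sketched as follows; feature-aggregation-style and attention-guidance-style callables are passed in as arguments, and the toy stand-ins in the demo only exercise the data flow, they are not the actual modules.

```python
import torch
import torch.nn.functional as F

def progressive_fusion(convs, fams, att_guides):
    """convs: [S1, S2, S3] from shallow to deep; fams: two feature aggregation
    modules; att_guides: three attention guidance modules.
    Returns the features used for matching: (A1, A2, S3)."""
    s1, s2, s3 = convs
    a3 = att_guides[2](s3)        # A3 = AttGuidModule(S3)
    s2_hat = fams[1](s2, a3)      # S^_2 = conv(cat(deconv(A3), S2))
    a2 = att_guides[1](s2_hat)    # A2 = AttGuidModule(S^_2)
    s1_hat = fams[0](s1, a2)      # S^_1 = conv(cat(deconv(A2), S1))
    a1 = att_guides[0](s1_hat)    # A1 = AttGuidModule(S^_1)
    return a1, a2, s3             # A1, A2 and the top-level S3 go to matching

if __name__ == "__main__":
    # Toy stand-ins that only exercise the wiring: identity attention guidance
    # and a "fusion" that resizes the high-level map and adds it to the low one.
    ident = lambda x: x
    toy_fam = lambda low, high: low + F.interpolate(
        high, size=low.shape[-2:], mode="bilinear", align_corners=False)
    feats = [torch.randn(1, 512, n, n) for n in (13, 9, 5)]
    a1, a2, s3 = progressive_fusion(feats, [toy_fam, toy_fam], [ident] * 3)
    print(a1.shape, a2.shape, s3.shape)
```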
S4: and constructing a twin gradual attention guide fusion network tracker for visual target tracking.
It should be noted that attention methods help computer vision models focus on important features. The attention guidance module includes a spatial attention mechanism and a channel attention mechanism: the spatial attention mechanism is used to guide the fused features to focus on more representative feature regions, and the channel attention mechanism is used to enhance the features from the channel perspective, improving the feature representation for a particular target by assigning greater weight to feature channels that respond more strongly to the target.
The spatial attention mechanism comprises the following steps:
the input feature is denoted F ∈ R^{W×H×C}, where C, W and H represent the channel, width and height dimensions, respectively;
the input feature is compressed using a 3 × 3 convolutional layer;
a Sigmoid function is used for normalization to generate the attention map m ∈ R^{W×H×1} as m = σ(conv_{3×3}(F)), where σ(·) is the Sigmoid function and conv_{3×3}(·) denotes a 3 × 3 convolution operation;
the final spatial attention feature F_sa ∈ R^{W×H×C} is calculated as F_sa = m ⊗ F, where ⊗ denotes element-wise multiplication.
Further, to mitigate noise interference in the fused features, the spatial attention mechanism deals with the problem from a spatial perspective. In fact, considering that each channel of the convolutional network features will respond to different semantics, the features can also be enhanced from the channel perspective. For example, the feature representation for a particular target is improved by assigning greater weight to feature channels that have a higher response to the target.
Specifically, the channel attention mechanism comprises the following steps:
the spatial dependence of the input feature is compressed by global average pooling;
channels that respond to the particular target are highlighted using a 1 × 1 convolutional layer and a Sigmoid function, given an input feature F ∈ R^{W×H×C};
the channel attention map u ∈ R^{C×1×1} is calculated as u = σ(conv_{1×1}(gap(F))), where conv_{1×1}(·) denotes a 1 × 1 convolution operation, gap(·) is channel-wise global average pooling and σ(·) denotes the Sigmoid function; the channel attention map u is applied to the input feature to guide the generation of a more representative attention feature as F_ca = u ⊗ F, where ⊗ denotes element-wise multiplication;
the attention guidance module is formed by connecting the spatial and channel attention mechanisms in series, and the whole module process can be expressed as AttGuidModule(F) = u ⊗ (m ⊗ F); a module-level sketch is given below.
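A compact PyTorch sketch of the attention guidance module as read from the formulas above (spatial map m followed by channel map u, applied in series); applying the channel attention to the spatially attended feature is one reading of the series connection, and the layer shapes are assumptions.

```python
import torch
import torch.nn as nn

class AttentionGuidance(nn.Module):
    """Spatial attention followed by channel attention, applied in series."""

    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv2d(channels, 1, kernel_size=3, padding=1)  # conv_3x3
        self.gap = nn.AdaptiveAvgPool2d(1)                               # gap(.)
        self.channel = nn.Conv2d(channels, channels, kernel_size=1)      # conv_1x1

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        m = torch.sigmoid(self.spatial(f))               # m = sigma(conv_3x3(F)), one map per location
        f_sa = m * f                                     # F_sa = m (x) F
        u = torch.sigmoid(self.channel(self.gap(f_sa)))  # u = sigma(conv_1x1(gap(.)))
        return u * f_sa                                  # channel map applied to the spatial result
```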
In the embodiment, a strong backbone network is deployed and an attention mechanism is introduced to capture the characteristics with richer representation capability; secondly, a twin gradual attention guiding fusion network framework is provided, convolution features of different layers are gradually modulated and fused in a top-down mode, and multi-layer feature information is effectively utilized.
Effective convolutional features are crucial for twin-network-based target tracking algorithms, but how to learn powerful feature representations remains a challenging task. Most existing twin-network-based tracking algorithms compute similarity using only the last-layer semantic features of the backbone network for target tracking; however, using the deep feature space alone is often insufficient.
Example 2
To verify the effectiveness of the above method, which selectively integrates multi-layer feature information to compute similarity for target tracking, the tracker SiamPAGF proposed in the above embodiment is evaluated on five common tracking benchmarks, namely OTB2013, OTB50, OTB2015, VOT2016 and VOT2017. Experiments show that SiamPAGF achieves a very high relative improvement over the baseline tracker. The tracker was implemented using Python 3.6 and the PyTorch 0.4.1 framework, and the experiments were performed on a GeForce RTX 2080 GPU.
Details of the experiment:
training process: the network training set is from a large object tracking data set GOT10K, which comprises 10000 video segments of moving objects in the real world, and is divided into 560 categories, and the bounding boxes of the objects are all manually marked, and the total number is over 150 ten thousand. The data set is preprocessed in the same way as the baseline tracker SiamDW, with example image sizes and search image sizes 127 × 127, 255 × 255, respectively. During training, the twin backbone network was initialized by the ResNet22 model pre-trained on the ImageNet classification dataset, using a random gradient descent method, with a momentum of 0.9 and a weight decay of 0.0005. The learning rate decays exponentially from 0.01 to 0.00001, the whole process comprises 100 generations, and training is carried out on 4 GPUs, each GPU iteratively processing 8 images each time.
The testing process comprises the following steps: during tracking, the target area is first cropped as the example sample in the initial frame according to the given annotation, and then resized to 127 × 127 pixels as input to the example branch network. For each subsequent frame, the search area centered at the previous frame's position is cropped and resized to 255 × 255 pixels. After feature extraction by the search branch network, the example features and the search features are correlated to generate a similarity score map of size 17 × 17. To deal with target scale changes, the target is searched over three scales, and the scale is updated by linear interpolation with a factor of 0.3291.
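The correlation step that produces the 17 × 17 similarity score map can be sketched as a plain cross-correlation of the example feature over the search feature; the feature sizes in the demo are assumptions chosen so the output matches the 17 × 17 map mentioned above.

```python
import torch
import torch.nn.functional as F

def cross_correlation(template_feat: torch.Tensor, search_feat: torch.Tensor) -> torch.Tensor:
    """template_feat: (1, C, h, w); search_feat: (1, C, H, W).
    Returns a (1, 1, H-h+1, W-w+1) similarity score map."""
    return F.conv2d(search_feat, template_feat)

if __name__ == "__main__":
    z = torch.randn(1, 512, 5, 5)      # example-branch feature (illustrative size)
    x = torch.randn(1, 512, 21, 21)    # search-branch feature (illustrative size)
    print(cross_correlation(z, x).shape)  # torch.Size([1, 1, 17, 17])
```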
Comparison with advanced tracking methods:
the experiments were evaluated on OTB:
tracker siapagf was evaluated on 5 common tracking benchmarks (OTB2013, OTB50, OTB2015, VOT2016, and VOT2017), respectively. To show fairness in comparison, these methods are implemented using author-provided trace benchmark results or using parameter settings given by the author.
The evaluation was performed on the OTB tracking benchmarks (OTB2013, OTB50 and OTB2015), which are among the most widely used public tracking benchmarks at present; OTB2013 and OTB50 each consist of 50 fully annotated sequences, whereas the OTB2015 data set is an extension of OTB2013 containing 100 video sequences. The OTB benchmarks use the area under the curve (AUC) of the success plot and the precision plot as evaluation criteria and adopt the one-pass evaluation (OPE) strategy. FIGS. 4-7 show the success and precision plots of the OTB2013/2015 benchmark tests at different thresholds compared with the baseline tracker SiamDW and some advanced tracking methods. It can clearly be seen that the tracker SiamPAGF of the present method outperforms the baseline on both of these data sets. In addition, SiamPAGF is also compared with more advanced tracking methods on the three OTB data sets, such as ECO-HC, SiamRPN, StructSiam, SA-Siam, DaSiamRPN, SiamBM and C-RPN. Detailed experimental results are shown in Table 1 below; the proposed SiamPAGF performs very competitively among all of these trackers on the three OTB2013/50/2015 tracking benchmarks.
Table 1: AUC score results on the OTB benchmarks.
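For reference, the OTB success (AUC) and precision metrics used in this evaluation can be computed as in the following sketch; the threshold grid and the 20-pixel precision threshold follow common OTB practice and are assumptions, not values taken from the table.

```python
import numpy as np

def success_auc(ious: np.ndarray, thresholds=np.linspace(0, 1, 21)) -> float:
    # Fraction of frames whose IoU exceeds each threshold, averaged over thresholds.
    curve = [(ious > t).mean() for t in thresholds]
    return float(np.mean(curve))

def precision_at(center_errors: np.ndarray, pixel_threshold: float = 20.0) -> float:
    # Fraction of frames whose predicted center lies within the pixel threshold.
    return float((center_errors <= pixel_threshold).mean())
```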
The experiment was evaluated on VOT:
the tracker siamagaf evaluates on two VOT reference datasets, VOT2016 and VOT 2017. The evaluation is performed by the VOT official toolkit, in terms of accuracy (A), robustness (R), and Expected Average Overlap (EAO). In this process, SiamPAGF is not only compared to twin network based trackers, but is also evaluated against some top ranked trackers in the VOT2016 or VOT 2017. The results of the experiments on VOT2016 and VOT2017 are specifically shown in table 2 below.
Table 2: VOT benchmark evaluation results.
To verify the tracker of the present invention and analyze the effectiveness of each module, the baseline SiamDW and several variants of SiamPAGF were evaluated on the OTB benchmarks; the detailed experimental results can be seen in figs. 8-11.
Verification of the feature aggregation module: to investigate the effectiveness of the proposed feature aggregation module, a SiamPAGF-NAG tracker was first implemented, which uses only the feature aggregation module without the attention guidance module. Compared with the baseline tracker, its AUC scores on the OTB2013/2015 benchmark data sets improve by 2% and 0.7%, reaching 0.684/0.659 respectively. Another variant tracker, SiamPAGF-NF, was constructed that uses the shallow and deep features directly for the similarity calculation. The experimental results show that this straightforward calculation leads to a significant decrease in AUC scores on both OTB benchmark data sets, 0.639/0.617 respectively. The feature aggregation module of the method can therefore effectively utilize multi-layer feature information and is very useful for the tracking results.
Verification of the attention guidance module: the proposed method uses the guidance module to refine the fused features and suppress noise from both the spatial and the channel perspective. To verify the individual contribution of the different components of this module to the tracking effect, experimental comparisons were carried out under different conditions. Table 3 below gives the experimental results with different attention modules. It can be observed that, compared with SiamPAGF-NAG (the tracker without the attention guidance module), integrating either the channel attention (CA) or the spatial attention (SA) module alone is not as effective as serializing the two attention modules (i.e. the proposed attention guidance module). This shows that although spatial attention or channel attention alone only slightly improves the final tracking result or even degrades performance, their combination improves the tracking effect very well.
Table 3: experimental results of different deformation trackers on OTB basis.
In this embodiment, the twin-network-based progressive attention-guided fusion tracking model (SiamPAGF) uses a top-down approach, through several progressive feature aggregation modules, to effectively encode and fuse deep semantic information and shallow spatial structure information, and uses the spatial and channel attention in the attention guidance module to reduce the feature redundancy produced by the fusion. In this way, multi-layer feature information can be selectively integrated to compute similarity for target tracking. The proposed tracker SiamPAGF was evaluated on five common tracking benchmarks, OTB2013, OTB50, OTB2015, VOT2016 and VOT2017. Experiments show that SiamPAGF achieves a very high relative improvement over the baseline tracker.
It should be recognized that embodiments of the present invention can be realized and implemented by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer readable memory. The methods may be implemented in a computer program using standard programming techniques, including a non-transitory computer-readable storage medium configured with the computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner, according to the methods and figures described in the detailed description. Each program may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on a programmed application specific integrated circuit for this purpose.
Further, the operations of processes described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes described herein (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) collectively executed on one or more processors, by hardware, or combinations thereof. The computer program includes a plurality of instructions executable by one or more processors.
Further, the method may be implemented in any type of computing platform operatively connected to a suitable interface, including but not limited to a personal computer, mini computer, mainframe, workstation, networked or distributed computing environment, separate or integrated computer platform, or in communication with a charged particle tool or other imaging device, and the like. Aspects of the invention may be embodied in machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, optically read and/or write storage medium, RAM, ROM, or the like, such that it may be read by a programmable computer, which when read by the storage medium or device, is operative to configure and operate the computer to perform the procedures described herein. Further, the machine-readable code, or portions thereof, may be transmitted over a wired or wireless network. The invention described herein includes these and other different types of non-transitory computer-readable storage media when such media include instructions or programs that implement the steps described above in conjunction with a microprocessor or other data processor. The invention also includes the computer itself when programmed according to the methods and techniques described herein. A computer program can be applied to input data to perform the functions described herein to transform the input data to generate output data that is stored to non-volatile memory. The output information may also be applied to one or more output devices, such as a display. In a preferred embodiment of the invention, the transformed data represents physical and tangible objects, including particular visual depictions of physical and tangible objects produced on a display.
As used in this application, the terms "component," "module," "system," and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being: a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of example, both an application running on a computing device and the computing device can be a component. One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between two or more computers. In addition, these components can execute from various computer readable media having various data structures thereon. The components may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the internet with other systems by way of the signal).
It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.

Claims (7)

1. A visual target tracking method based on a twin gradual attention-guided fusion network, characterized in that it comprises the following steps:
using a twin backbone network to extract multi-layer feature representations of the example sample and the search sample;
defining a feature aggregation module that encodes and fuses the information of each layer step by step and selectively combines high-level and shallow-level feature information to generate a comprehensive enhanced feature representation;
introducing an attention guidance module into the feature aggregation module to refine the feature representation and suppress interference caused by noise, so that the fused features are more representative;
constructing a twin gradual attention-guided fusion network tracker for tracking a visual target;
wherein the twin backbone network is constructed by the following steps:
using a modified ResNet22;
dividing the network into 3 stages, with an overall stride of 8 and 22 convolutional layers in total;
when a convolutional layer uses padding, using a cropping operation to remove the feature computations affected by zero padding while keeping the structure of the internal blocks unchanged;
following the original ResNet for feature downsampling in the first two stages of the network;
in the third stage, performing downsampling by max pooling with stride 2 instead of by a convolutional layer, this pooling layer being located in the first block of the stage, i.e., layer2-1;
the twin backbone network comprises two identical branches, an example branch and a search branch, respectively, wherein,
the example branch receives an input of an example sample;
the search branch receives an input of a search sample;
the example branch and the search branch share parameters in a convolutional neural network to ensure the same transformation is used for both samples;
the feature aggregation module receives two parts of input, namely low-level features from the corresponding layer and high-level attention features generated by the attention guidance module at the previous layer;
the upsampled high-level attention features are fused with low-level features using a cascading operation.
2. The twin progressive attention-guided fusion network-based visual target tracking method of claim 1, wherein: the feature aggregation module comprises the following steps:
the high-level attention feature A_l generated by the attention guidance module, where l is the layer index, is sent into a deconvolution layer deconv;
the deconvolution layer upsamples the feature to the same spatial size as the current low-level feature S_{l-1};
the upsampled high-level attention feature and the low-level feature are cascaded together, and feature channel compression is performed through a 1 × 1 convolution operation conv to generate the comprehensive feature Ŝ_{l-1}.
3. The twin progressive attention-guided fusion network-based visual target tracking method of claim 2, wherein the method comprises the following steps:
the fused comprehensive feature Ŝ_{l-1} is input into the attention guidance module to generate the attention feature A_{l-1}, which refines the feature and reduces noise, i.e. guides the attention feature A_{l-1} to focus on more representative information;
the procedure is represented as follows:
Ŝ_{l-1} = conv(cat(deconv(A_l), S_{l-1}))
A_{l-1} = AttGuidModule(Ŝ_{l-1})
where cat(·) denotes the cascade operation and AttGuidModule(·) denotes the attention guidance module.
4. The twin progressive attention-guided fusion network-based visual target tracking method of claim 3, wherein the method comprises the following steps:
the generated attention feature A_{l-1} is taken as the high-level feature of layer (l-2);
this high-level feature and the low-level feature S_{l-2} are input together into the feature aggregation module of the corresponding layer to repeat the feature fusion process;
the output features of the last three convolutional layers of the 3rd stage, layer2-2, layer2-3 and layer2-4, are taken as the convolutional features S_l, l ∈ {1, 2, 3};
the attention features A_1, A_2 and A_3 are generated layer by layer, wherein the attention feature A_3 is generated by inputting the top-level feature S_3 of the example branch network directly into the attention guidance module, i.e. A_3 = AttGuidModule(S_3);
the finally generated features A_1 and A_2 and the top-level feature S_3 are respectively used for the matching similarity calculation with the output features of the corresponding layers of the search branch network.
5. The twin progressive attention-guided fusion network-based visual target tracking method of claim 4, wherein: the attention guidance module comprises a spatial attention mechanism and a channel attention mechanism; wherein,
a spatial attention mechanism is employed to guide the fused features to focus on more representative feature regions,
features are enhanced from a channel perspective using a channel attention mechanism to improve the representation of features to a particular target by assigning greater weight to feature channels that are more responsive to the target.
6. The twin progressive attention-guided fusion network-based visual target tracking method of claim 5, wherein: the spatial attention mechanism comprises the following steps:
the input feature is denoted F ∈ R^{W×H×C}, where C, W and H represent the channel, width and height dimensions, respectively;
the input feature is compressed using a 3 × 3 convolutional layer;
a Sigmoid function is used for normalization to generate the attention map m ∈ R^{W×H×1} as m = σ(conv_{3×3}(F)), where σ(·) is the Sigmoid function and conv_{3×3}(·) denotes a 3 × 3 convolution operation;
the final spatial attention feature F_sa ∈ R^{W×H×C} is calculated as F_sa = m ⊗ F, where ⊗ denotes element-wise multiplication.
7. The twin progressive attention-guided fusion network-based visual target tracking method of claim 6, wherein: the channel attention mechanism comprises the following steps:
the spatial dependence of the input feature is compressed by global average pooling;
channels that respond to the particular target are highlighted using a 1 × 1 convolutional layer and a Sigmoid function, given an input feature F ∈ R^{W×H×C};
the channel attention map u ∈ R^{C×1×1} is calculated as u = σ(conv_{1×1}(gap(F))), where conv_{1×1}(·) denotes a 1 × 1 convolution operation, gap(·) is channel-wise global average pooling and σ(·) denotes the Sigmoid function; the channel attention map u is applied to the input feature to guide the generation of a more representative attention feature as F_ca = u ⊗ F, where ⊗ denotes element-wise multiplication;
the attention guidance module is formed by connecting the spatial and channel attention mechanisms in series, and the whole module process can be expressed as AttGuidModule(F) = u ⊗ (m ⊗ F).
CN202010653263.8A 2020-07-08 2020-07-08 Visual target tracking method based on twin gradual attention-guided fusion network Active CN111860248B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010653263.8A CN111860248B (en) 2020-07-08 2020-07-08 Visual target tracking method based on twin gradual attention-guided fusion network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010653263.8A CN111860248B (en) 2020-07-08 2020-07-08 Visual target tracking method based on twin gradual attention-guided fusion network

Publications (2)

Publication Number Publication Date
CN111860248A CN111860248A (en) 2020-10-30
CN111860248B true CN111860248B (en) 2021-06-25

Family

ID=73153183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010653263.8A Active CN111860248B (en) 2020-07-08 2020-07-08 Visual target tracking method based on twin gradual attention-guided fusion network

Country Status (1)

Country Link
CN (1) CN111860248B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112884799A (en) * 2021-02-02 2021-06-01 电子科技大学 Target tracking method in complex scene based on twin neural network
CN112884802B (en) * 2021-02-24 2023-05-12 电子科技大学 Attack resistance method based on generation
CN113344971B (en) * 2021-05-21 2023-03-28 河南科技大学 Twin infrared target tracking method fused with Kalman filtering

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111192291A (en) * 2019-12-06 2020-05-22 东南大学 Target tracking method based on cascade regression and twin network
CN111291679A (en) * 2020-02-06 2020-06-16 厦门大学 Target specific response attention target tracking method based on twin network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111192291A (en) * 2019-12-06 2020-05-22 东南大学 Target tracking method based on cascade regression and twin network
CN111291679A (en) * 2020-02-06 2020-06-16 厦门大学 Target specific response attention target tracking method based on twin network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Progressive Attention Guided Recurrent Network for Salient Object Detection;Xiaoning Zhang等;《2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition》;20181217;第714-722页 *

Also Published As

Publication number Publication date
CN111860248A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
CN111860248B (en) Visual target tracking method based on twin gradual attention-guided fusion network
US10242289B2 (en) Method for analysing media content
EP4105877A1 (en) Image enhancement method and image enhancement apparatus
US11928893B2 (en) Action recognition method and apparatus, computer storage medium, and computer device
WO2019217281A1 (en) Video object tracking
Rangesh et al. Driver gaze estimation in the real world: Overcoming the eyeglass challenge
CN112446270A (en) Training method of pedestrian re-identification network, and pedestrian re-identification method and device
JP7305869B2 (en) Pedestrian detection method and device, computer readable storage medium and chip
CN112258524B (en) Multi-branch image segmentation method, device, medium and electronic equipment
CN111696110B (en) Scene segmentation method and system
CN112434608B (en) Human behavior identification method and system based on double-current combined network
CN110992401A (en) Target tracking method and device, computer equipment and storage medium
Din et al. Effective removal of user-selected foreground object from facial images using a novel GAN-based network
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN114445715A (en) Crop disease identification method based on convolutional neural network
CN113052873A (en) Single-target tracking method for on-line self-supervision learning scene adaptation
US11367206B2 (en) Edge-guided ranking loss for monocular depth prediction
CN112347936A (en) Rapid target detection method based on depth separable convolution
CN117541749A (en) Human face optimization method for human body 3D reconstruction
CN112465847A (en) Edge detection method, device and equipment based on clear boundary prediction
CN112101456A (en) Attention feature map acquisition method and device and target detection method and device
CN111860249A (en) Visual target tracking method based on multi-level aggregation and attention twin network
CN111738069A (en) Face detection method and device, electronic equipment and storage medium
CN117011655A (en) Adaptive region selection feature fusion based method, target tracking method and system
CN116129417A (en) Digital instrument reading detection method based on low-quality image

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20230406

Address after: Room 2871, Building 1, No. 388 Huqiu Road, Huqiu Street, Gusu District, Suzhou City, Jiangsu Province, 215008

Patentee after: Ditu (Suzhou) Biotechnology Co.,Ltd.

Address before: Room 6037, building 3, 112-118 Gaoyi Road, Baoshan District, Shanghai

Patentee before: Shanghai Litu Information Technology Co.,Ltd.