CN113658218A - Dual-template dense twin network tracking method and device and storage medium

Info

Publication number: CN113658218A (application CN202110811344.0A)
Authority: CN (China)
Prior art keywords: template, new, image, dense, branch
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN113658218B (en)
Inventors: 胡栋, 张虎, 张庆军
Current and original assignee: Nanjing University of Posts and Telecommunications
Application CN202110811344.0A filed by Nanjing University of Posts and Telecommunications
Publication of CN113658218A; application granted and published as CN113658218B

Classifications

    • G06T7/246 Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06F18/22 Pattern recognition; matching criteria, e.g. proximity measures
    • G06N3/045 Neural networks; combinations of networks
    • G06N3/08 Neural networks; learning methods
    • G06T2207/10016 Image acquisition modality: video; image sequence
    • G06T2207/20081 Special algorithmic details: training; learning
    • G06T2207/20084 Special algorithmic details: artificial neural networks [ANN]
    • G06T2207/20221 Special algorithmic details: image fusion; image merging


Abstract

The invention belongs to the technical field of video analysis and discloses a dual-template dense twin network target tracking method based on global context. Based on the twin network framework, the AlexNet network in the twin network is replaced by a deeper dense convolutional network, in which features from different layers are concatenated along the channel dimension to realize feature reuse, and a global attention module is added after the network to capture context-dependent information. In addition, a new template-updating scheme is designed: a frame with better performance is extracted from the historical tracking results and processed to serve as a new target template, and the original target template and the new template are fused by a spatiotemporal attention mechanism to obtain the final template features.

Description

Dual-template dense twin network tracking method and device and storage medium
Technical Field
The invention relates to a dual-template dense twin network tracking method, device and storage medium, and belongs to the technical field of video analysis.
Background
Visual target tracking is an important component of computer vision and is widely applied in intelligent video surveillance, intelligent transportation, autonomous driving, military reconnaissance and other areas. Target tracking technology therefore has important research significance in both the civil and the military security fields.
In 2016, Luca Bertinetto et al. began using twin (Siamese) networks for target tracking and proposed the SiamFC algorithm, with which target tracking formally entered the twin network era. SiamFC adopts a twin network structure: one branch extracts target template information, the other branch extracts search region features, the two sets of features are then cross-correlated, and the target position is determined from the maximum of the response map. Guo et al. proposed the dynamic twin network algorithm DSiam based on SiamFC, which can learn target appearance changes online and suppress irrelevant background, improving the capability for online target updating. He et al. improved the feature extraction network and proposed the SA-Siam algorithm, which comprehensively utilizes the semantic and appearance features of the image.
However, in real-life scenes, owing to the complexity and uncertainty of the environment, the tracking performance of a tracker degrades when the target undergoes background clutter or target deformation during video target tracking.
Disclosure of Invention
The invention aims to overcome the defects in the prior art, provides a dual-template dense twin network tracking method, device and storage medium, and solves the technical problem that the tracking performance of a tracker degrades when the target undergoes background clutter or target deformation in video target tracking.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme:
in a first aspect, the invention provides a dual-template dense twin network tracking method, which comprises the following steps:
acquiring and preprocessing an original template image, a new template image and a search image;
inputting the preprocessed original template image, new template image and search image into a pre-constructed and trained twin network model to obtain an original template feature map, a new template feature map and a search image feature map;
passing the original template feature map and the new template feature map through a pre-added spatiotemporal attention module to obtain an original template weight map and a new template weight map, respectively;
performing weighted fusion of the original template weight map with the original template feature map, and of the new template weight map with the new template feature map, to obtain a fused template feature map;
and performing a cross-correlation operation between the fused template feature map and the search image feature map to obtain a response map, and determining the target position from the maximum of the response map.
Further, the twin network model includes a template branch, a new template branch and a search branch, and the twin network adopts a dense convolutional network;
a global attention module is added after the template branch and the new template branch,
and a shared spatiotemporal attention module is added after the template branch and the new template branch;
the dense convolutional network comprises a convolutional layer, dense block 1, a transition layer, dense block 2, a transition layer, dense block 3 and dense block 4; the template branch and the new template branch are identical in structure.
Further, preprocessing the original template image, the new template image and the search image includes:
giving an initial target frame in the original template image, the new template image and the search image to obtain the initial target center and target scale, and calculating the sizes of the adjusted original template image, new template image and search image from the initial target frame, initial target center and target scale.
Further, obtaining the original template weight map and the new template weight map includes:
calculating the similarity between the new template feature map and the original template feature map using a cosine similarity measure, and performing a softmax operation at each position of the new template feature map and the original template feature map to obtain the original template weight map and the new template weight map.
In a second aspect, the invention provides a dual-template dense twin network tracking method, which comprises the following steps:
acquiring and preprocessing an original template image, a new template image and a search image;
inputting the preprocessed original template image, new template image and search image into a pre-constructed and trained twin network model to obtain a first original template feature map, a first new template feature map and a search image feature map;
inputting the first original template feature map and the first new template feature map into a pre-added global attention module to obtain a second original template feature map and a second new template feature map;
inputting the second original template feature map and the second new template feature map into a pre-added spatiotemporal attention module to obtain an original template weight map and a new template weight map, respectively;
performing weighted fusion of the original template weight map with the second original template feature map, and of the new template weight map with the second new template feature map, to obtain a fused template feature map;
and performing a cross-correlation operation between the fused template feature map and the search image feature map to obtain a response map, and determining the target position from the maximum of the response map.
Further, the twin network model includes a template branch, a new template branch and a search branch, and the twin network adopts a dense convolutional network;
a global attention module is added after the template branch and the new template branch,
and a shared spatiotemporal attention module is added after the template branch and the new template branch;
the dense convolutional network comprises a convolutional layer, dense block 1, a transition layer, dense block 2, a transition layer, dense block 3 and dense block 4; the template branch and the new template branch are identical in structure.
In a third aspect, the invention provides a dual-template dense twin network tracking device, comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to any of the above.
In a fourth aspect, the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the methods described above.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention designs a new template that complements the original template in features, realizing dynamic updating of the target template. After selection, the original template and the new template are input into the twin network to obtain their respective feature maps; the two feature maps are then passed through the spatiotemporal attention module to obtain their respective weight maps, and the weight maps are weighted and fused with the corresponding feature maps to obtain a fused template feature map.
2. The method is based on the twin network framework and concatenates features between layers along the channel dimension to realize feature reuse, thereby improving the generalization capability of the features.
3. A global attention module is added after the twin network template branches to aggregate global context information of the target, so that the network outputs feature maps with richer semantic information, further enhancing the robustness of the network to target appearance changes.
Drawings
FIG. 1 is a flowchart of a method for tracking a dual-template dense twin network according to an embodiment of the present invention;
FIG. 2 is a flow chart of an algorithm provided by an embodiment of the present invention;
FIG. 3 is a diagram of a network architecture provided by an embodiment of the present invention;
FIG. 4 is a diagram of a global attention module provided by an embodiment of the present invention;
FIG. 5 is a block diagram of a spatiotemporal attention module provided by an embodiment of the present invention;
FIG. 6 is a precision plot provided by an embodiment of the present invention;
FIG. 7 is a success rate plot provided by an embodiment of the present invention;
FIG. 8 is a precision plot under background clutter provided by an embodiment of the present invention;
FIG. 9 is a success rate plot under background clutter provided by an embodiment of the present invention;
FIG. 10 is a diagram of partial tracking results provided by an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only intended to illustrate the technical solutions of the present invention more clearly and do not thereby limit the protection scope of the present invention.
Example 1
As shown in FIG. 1, the dual-template dense twin network tracking method provided in this embodiment includes:
acquiring and preprocessing an original template image, a new template image and a search image;
inputting the preprocessed original template image, new template image and search image into a pre-constructed and trained twin network model to obtain an original template feature map, a new template feature map and a search image feature map;
passing the original template feature map and the new template feature map through a pre-added spatiotemporal attention module to obtain an original template weight map and a new template weight map, respectively;
performing weighted fusion of the original template weight map with the original template feature map, and of the new template weight map with the new template feature map, to obtain a fused template feature map;
and performing a cross-correlation operation between the fused template feature map and the search image feature map to obtain a response map, and determining the target position from the maximum of the response map.
Specifically, the twin network model includes a template branch, a new template branch and a search branch, and the twin network adopts a dense convolutional network;
a global attention module is added after the template branch and the new template branch,
and a shared spatiotemporal attention module is added after the template branch and the new template branch;
the dense convolutional network comprises a convolutional layer, dense block 1, a transition layer, dense block 2, a transition layer, dense block 3 and dense block 4; the template branch and the new template branch are identical in structure.
Specifically, preprocessing the original template image, the new template image and the search image includes:
giving an initial target frame in the original template image, the new template image and the search image to obtain the initial target center and target scale, and calculating the sizes of the adjusted original template image, new template image and search image from the initial target frame, initial target center and target scale.
Specifically, obtaining the original template weight map and the new template weight map includes:
calculating the similarity between the new template feature map and the original template feature map using a cosine similarity measure, and performing a softmax operation at each position of the new template feature map and the original template feature map to obtain the original template weight map and the new template weight map.
In this embodiment, a new template is designed to complement the original template in features, realizing dynamic updating of the target template. After selection, the original template and the new template are input into the twin network to obtain their respective feature maps; the two feature maps are then passed through the spatiotemporal attention module to obtain their respective weight maps, and the weight maps are weighted and fused with the corresponding feature maps to obtain a fused template feature map. This algorithm, which updates the target template in real time, maintains a good tracking effect even when the target is deformed.
Example 2
In order to make the objects, implementation and advantages of the present invention clearer, the sequence Singer1 from the public test set OTB Benchmark is taken as an example below, and the specific implementation of the present invention is further described in detail with reference to the accompanying drawings, as follows:
the embodiment provides a double-template dense twin network tracking method. The network of the method has three inputs, an original template image, here a first frame image, a new template image and a search image. And obtaining a feature map with global context information by the template image and the new template image through the same dense network and the global attention module. And then the two feature maps are subjected to weight calibration through a space-time attention module to obtain respective weights, and then the weights are summed with the corresponding feature maps to obtain a fused template feature map. And extracting the self characteristic diagram of the search image of the search branch through a same dense network, then performing cross correlation on the fused template characteristic diagram and the characteristic diagram of the search image to obtain a final response diagram, and finally determining the final position of the tracking target according to the response diagram.
The method comprises the following steps:
step 1, adjusting and training a twin network structure:
structure adjustment: the AlexNet network in the original twin network is replaced by a dense convolutional network, the dense convolutional network comprises a convolutional layer, dense blocks 1, a transition layer, dense blocks 2, a transition layer, dense blocks 3 and dense blocks 4, a new template branch completely consistent with a template branch structure is added, a global attention module is added after the template branch and the new template branch, as shown in figure 4, a common space-time attention module is added after the template branch and the new template branch, as shown in figure 5, the improved network structure is as shown in figure 3, the improved network model is trained by an ImageNet data set, and improved network parameters are obtained.
Training process: a logistic regression objective is optimized by stochastic gradient descent. The network is trained for 50 epochs, each containing 50000 sample pairs, with the batch size set to 8. The parameters of the neural network are initialized from a Gaussian distribution, stochastic gradient descent with momentum 0.9 is used, and the learning rate is exponentially decayed from 10⁻³ to 10⁻⁸.
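A sketch of this training setup in PyTorch follows, using only the hyper-parameters stated above; `model` and `train_loader` are assumed placeholders, and the exact form of the logistic loss is an assumption.

```python
import torch

def logistic_loss(response, labels):
    """Mean logistic loss log(1 + exp(-y * v)); labels are +1/-1 maps (assumed)."""
    return torch.log1p(torch.exp(-labels * response)).mean()

def train(model, train_loader, epochs=50, lr_start=1e-3, lr_end=1e-8):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_start, momentum=0.9)
    # Exponential decay taking the learning rate from 1e-3 to 1e-8 in 50 epochs.
    gamma = (lr_end / lr_start) ** (1.0 / epochs)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
    for epoch in range(epochs):                   # each epoch: 50000 sample pairs
        for z, z_new, x, labels in train_loader:  # batch size 8
            response = model(z, z_new, x)
            loss = logistic_loss(response, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```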
Step 2: in the first frame template image of Singer1, the initial target frame is given as (48, 98, 40, 142), where the initial target center is pos = (48, 98) and the target scale is target = 40 × 142. The template image, the new template image and the search image input to the network are read, and, given an initial target frame (μ, v, w, h), the target position is pos = (μ, v) and the target scale is target = (w, h). A standard template image can then be generated by the following formula:

$$s(w + 2p) \times s(h + 2p) = A$$

where A = 127², s is a scale factor and p denotes the context margin around the target. The crop is expanded and resized to generate a 127 × 127 template image; a new template image of the same size and a search image of size 255 × 255 are generated in the same way.
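The cropping rule can be sketched in Python as below. Only the constraint A = 127² is stated in the text; the context margin p = (w + h)/4 and the mean-value border padding are assumptions borrowed from the usual SiamFC preprocessing.

```python
import numpy as np
import cv2

def make_template(image, pos, target, out_size=127):
    """Crop a context-padded square around the target and resize it.

    Implements s*(w + 2p) * s*(h + 2p) = out_size**2 with p = (w + h) / 4
    (assumed SiamFC-style rule). pos = (cx, cy), target = (w, h).
    """
    w, h = target
    p = (w + h) / 4.0
    side = int(round(np.sqrt((w + 2 * p) * (h + 2 * p))))  # square crop side
    cx, cy = pos
    half = side // 2
    # Pad with the channel-wise mean so crops near the border stay square.
    mean = image.mean(axis=(0, 1))
    padded = cv2.copyMakeBorder(image, side, side, side, side,
                                cv2.BORDER_CONSTANT, value=mean.tolist())
    x0, y0 = int(cx - half + side), int(cy - half + side)
    patch = padded[y0:y0 + side, x0:x0 + side]
    return cv2.resize(patch, (out_size, out_size))

# Singer1, first frame: initial box (48, 98, 40, 142)
# template = make_template(frame0, pos=(48, 98), target=(40, 142))
```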
Step 3: the preprocessed template image, new template image and search image are input into the network, and their respective feature maps are extracted by the dense convolutional network.
Step 4: feature maps with global context information are extracted from the template feature map and the new template feature map by the global attention module. First, context modeling is performed on the feature map: the features of all positions are aggregated to form a global context feature, mainly using a 1 × 1 convolution and a softmax function. A feature transformation is then applied to capture the interdependency among channels, mainly using a 1 × 1 convolution, layer normalization and a ReLU function. Finally, the global context feature is merged into the features of all positions. The process can be expressed by the following formula:

$$z_i = x_i + \delta\left(\sum_{j} \frac{e^{W_k x_j}}{\sum_{m} e^{W_k x_m}}\, x_j\right)$$

where $z_i$ is the output feature map at position i, $x_i$ is the input feature map at position i, i is the position index, $e^{W_k x_j} / \sum_m e^{W_k x_m}$ is the weight of the global attention pooling, $\delta(\cdot) = W_{v2}\,\mathrm{ReLU}(\mathrm{LN}(W_{v1}(\cdot)))$ denotes the feature transformation, $W_{v2}$, $W_{v1}$ and $W_k$ are all 1 × 1 convolutions, m and j enumerate all positions, and $\mathrm{LN}(\cdot)$ denotes layer normalization.
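The module can be sketched in PyTorch as a GCNet-style block, as below; the bottleneck ratio is an assumption not fixed by the text.

```python
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    """GCNet-style global attention module (sketch).

    The text fixes the 1x1 convolutions W_k, W_v1, W_v2, layer
    normalization and ReLU; the bottleneck ratio is assumed.
    """
    def __init__(self, channels, ratio=4):
        super().__init__()
        self.w_k = nn.Conv2d(channels, 1, kernel_size=1)            # attention logits
        self.transform = nn.Sequential(                              # delta(.)
            nn.Conv2d(channels, channels // ratio, kernel_size=1),   # W_v1
            nn.LayerNorm([channels // ratio, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // ratio, channels, kernel_size=1),   # W_v2
        )

    def forward(self, x):
        n, c, h, w = x.shape
        # alpha_j = softmax over all spatial positions j of W_k x_j
        alpha = self.w_k(x).view(n, 1, h * w).softmax(dim=-1)
        # Global context: attention-weighted sum of features over positions.
        ctx = torch.bmm(x.view(n, c, h * w), alpha.transpose(1, 2))
        ctx = ctx.view(n, c, 1, 1)
        # z_i = x_i + delta(context), broadcast to every position i.
        return x + self.transform(ctx)
```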
Step 5: the two template feature maps with global context information are fused by the spatiotemporal attention mechanism to obtain the fused template features. Spatial attention is first used to derive the weight magnitudes at different spatial positions. Specifically, a bottleneck subnetwork first maps the feature map $f$ from one space to another, generating the embedded features $\tilde{f}$. Then, the feature vectors of the original template and the new template at a spatial position p are taken out, the cosine similarity between each of them and the feature vector of the original template is computed, and a softmax operation is performed, giving the feature weight of each template at position p. The same operation is performed for every position p of the feature map, producing a spatial weight map.

The similarity $w_i(p)$ between the new template features and the original template features is calculated using the cosine similarity measure:

$$w_i(p) = \frac{\tilde{f}_i(p) \cdot \tilde{f}_z(p)}{\left\|\tilde{f}_i(p)\right\| \left\|\tilde{f}_z(p)\right\|}$$

where p is the spatial position on the feature map, z denotes the original template, i denotes the new template or the original template, $\tilde{f}_i(p)$ is the embedded feature of the new or original template at position p, and $\tilde{f}_z(p)$ is the embedded feature of the original template at position p.
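A sketch of this spatial weighting follows; `embed` stands for the assumed bottleneck subnetwork, and the per-position weights are later broadcast over channels.

```python
import torch
import torch.nn.functional as F

def spatial_weight_maps(f_z, f_new, embed):
    """Cosine-similarity spatial attention over two templates (sketch).

    f_z, f_new: template feature maps with global context, (1, C, H, W)
    returns:    weight maps w_z, w_new of shape (1, 1, H, W)
    """
    e_z, e_new = embed(f_z), embed(f_new)
    # Cosine similarity with the original template at every position p.
    # For the original template itself this is identically 1 before softmax.
    s_z = F.cosine_similarity(e_z, e_z, dim=1)      # (1, H, W), all ones
    s_new = F.cosine_similarity(e_new, e_z, dim=1)  # w_new(p) before softmax
    # Softmax across the two templates at each position p.
    w = torch.stack([s_z, s_new], dim=0).softmax(dim=0)
    return w[0].unsqueeze(1), w[1].unsqueeze(1)
```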
After the weight maps output by the spatial attention module are obtained, channel attention is added after the spatial attention in order to further enhance network performance.
The spatial attention module outputs two 9 × 9 × 128 weight maps: the original target template weight map and the new template weight map. For each weight map, global average pooling along the channel dimension first yields a 1 × 1 × 128 weight vector. To some extent this vector has a global receptive field: it characterizes the global distribution of responses in the channel dimension while shielding the influence of the spatial distribution of information. The vector then passes sequentially through two fully connected layers of size 1 × 1 × 128, which learn the channel weights from the correlation between channels; the activation function after the first fully connected layer is ReLU and that after the second is Sigmoid. The 1 × 1 × 128 weight vector obtained after the second fully connected layer thus assigns, via the Sigmoid function, a weight to each channel, and each channel of the weight map output by the spatial attention module is multiplied by the weight of the corresponding channel to obtain the final weight map. Finally, the template features extracted by the feature extraction network are multiplied by the final weight maps and summed to obtain the final fused template features.
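The channel-attention step matches a squeeze-and-excitation layout and can be sketched as below; the two 128-wide fully connected layers without a reduction bottleneck follow the description, while everything else is illustrative.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention over a spatial weight map (sketch):
    GAP -> FC -> ReLU -> FC -> Sigmoid, then per-channel rescaling."""
    def __init__(self, channels=128):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels),  # FC1: 1x1x128 -> 1x1x128
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels),  # FC2: 1x1x128 -> 1x1x128
            nn.Sigmoid(),
        )

    def forward(self, w_spatial):
        # w_spatial: spatial-attention weight map, e.g. (n, 128, 9, 9)
        n, c, _, _ = w_spatial.shape
        v = w_spatial.mean(dim=(2, 3))       # global average pooling -> (n, 128)
        scale = self.fc(v).view(n, c, 1, 1)  # one weight per channel
        return w_spatial * scale             # final weight map
```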
Step 6: the fused template features are cross-correlated with the previously obtained search image features to obtain the final response map, from which the target position is determined.
Step 7: the average peak-to-correlation energy (APCE) of the response map is calculated. If it is larger than the average APCE of the previous T frames, the new template image is updated; otherwise it is not. The APCE values of the response maps from frame t-30 to frame t are calculated, and the frame with the largest APCE among these 30 frames is selected as the new target template.
The average peak-to-correlation energy, which reflects the degree of oscillation of the score map, can be expressed by the following formula:

$$\mathrm{APCE} = \frac{\left|s_{\max} - s_{\min}\right|^{2}}{\frac{1}{|M|}\sum_{v \in M}\left(s[v] - s_{\min}\right)^{2}}$$

where $s_{\max}$ and $s_{\min}$ are the maximum and minimum values in the score map s, $s[v]$ is each predicted value, and |M| is the number of elements in the score map.
A frame is selected from the previous prediction results as the new template according to the average peak-to-correlation energy:

$$n^{*} = \arg\max_{t-30 \le n \le t} \mathrm{APCE}_{n}$$

where $\mathrm{APCE}_{n}$ is the average peak-to-correlation energy of the score map of the n-th frame.
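The update criterion can be sketched as follows; the bookkeeping of past frames and score maps is an illustrative assumption.

```python
import numpy as np

def apce(score_map):
    """Average peak-to-correlation energy of a response (score) map."""
    s_max, s_min = score_map.max(), score_map.min()
    return (s_max - s_min) ** 2 / np.mean((score_map - s_min) ** 2)

def pick_new_template(frames, score_maps):
    """Select, from the last 30 frames, the frame whose response map has
    the largest APCE; its tracked patch becomes the new target template."""
    values = [apce(s) for s in score_maps[-30:]]
    best = int(np.argmax(values))
    return frames[-30:][best]
```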
Step 8: judge whether the current frame is the last frame. If so, the procedure ends; otherwise, return to step 2, read a new frame of the search image, regenerate the standard search image and continue target tracking. The algorithm flow chart is shown in FIG. 2.
To verify the effect of the method of the invention, the following verification experiments were performed:
the invention adopts 100 recognized and marked video sequences on an OTB (Online Tracking benchmark) platform to test, and the video sequences simulate various conditions in a real scene, including illumination change, scale transformation, partial or serious shielding, deformation and the like. Table 1 shows the hardware and software simulation environment for the experiments of the present invention.
Table 1. Hardware and software simulation environment for the experiments

CPU: Intel Xeon W-2133
GPU: Nvidia GeForce RTX 2080 Ti
Memory: 32 GB
Operating system: Windows 10
Development environment: PyTorch
Programming language: Python 3.6
On the OTB test platform there are two main evaluation criteria: precision (precision plot) and success rate (success plot).

In the target tracking process, precision reflects whether the tracker can accurately follow the target in subsequent frames. The target center obtained by the algorithm is called the predicted value, and the manually annotated target position is called the ground truth; precision is computed from the average Euclidean distance between them. The smaller this distance is relative to a given threshold, the closer the prediction is to the ground truth and the more accurate the tracking result. The precision curve represents the proportion of frames, out of the total number of frames, in which the distance between the predicted value and the ground truth is within the given threshold, and the precision values at different thresholds form the final precision plot.
The success rate is measured by the overlap accuracy between the candidate target box obtained by tracking and the manually annotated real region. If B denotes the predicted target bounding box and B* the manually annotated ground-truth bounding box, the overlap score is:

$$S = \frac{\left|B \cap B^{*}\right|}{\left|B \cup B^{*}\right|}$$

where $\left|B \cap B^{*}\right|$ is the area of the intersection of regions B and B* and $\left|B \cup B^{*}\right|$ is the area of their union. Whether the algorithm tracks the target object in a frame can be judged by testing whether the overlap score exceeds a given threshold; as the threshold varies from 0 to 1, the success rate changes, yielding the success rate plot.
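Both OTB criteria can be sketched as below; `preds` and `gts` are assumed lists of (x, y, w, h) boxes, the 20-pixel precision threshold follows the text, and the success plot sweeps the overlap threshold from 0 to 1.

```python
import numpy as np

def precision(pred_centers, gt_centers, threshold=20.0):
    """Fraction of frames whose center error is within `threshold` pixels."""
    d = np.linalg.norm(np.asarray(pred_centers) - np.asarray(gt_centers), axis=1)
    return float(np.mean(d <= threshold))

def overlap_score(b, b_star):
    """IoU overlap S = |B n B*| / |B u B*| for (x, y, w, h) boxes."""
    x1, y1 = max(b[0], b_star[0]), max(b[1], b_star[1])
    x2 = min(b[0] + b[2], b_star[0] + b_star[2])
    y2 = min(b[1] + b[3], b_star[1] + b_star[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = b[2] * b[3] + b_star[2] * b_star[3] - inter
    return inter / union

def success_rate(preds, gts, threshold=0.5):
    """Fraction of frames whose overlap score exceeds one chosen threshold."""
    scores = [overlap_score(b, bs) for b, bs in zip(preds, gts)]
    return float(np.mean([s > threshold for s in scores]))
```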
As shown in FIG. 6, the precision of the invention reaches 84.8% at a threshold of 20 pixels. Compared with the SRDCF, Staple, CFNet and SiamFC algorithms, the precision of the invention is improved by 5.9%, 6.5%, 7.0% and 8.3%, respectively. As can be seen in FIG. 7, the success rate of the invention reaches 63.2%, an improvement of 3.4%, 4.5%, 5.0% and 5.4% over the SRDCF, CFNet, SiamFC and Staple algorithms, respectively. The remaining two plots show the precision and success rate calculated on the sequences with the background clutter challenge. As can be seen from FIG. 8, under background clutter the precision of the invention is improved by 2.3%, 4.1%, 6.1% and 10.0% over the Staple, SRDCF, CFNet and SiamFC algorithms, respectively. As can be seen from FIG. 9, under background clutter the success rate of the invention is improved by 0.8%, 1.5%, 3.3% and 5.7% over the Staple, SRDCF, CFNet and SiamFC algorithms, respectively. These data show that the invention achieves excellent results.
FIG. 10 shows partial tracking results of the invention; the three selected sequences involve challenges such as background clutter, target deformation and scale variation. For the Bird1 sequence, all algorithms can still track the target at frame 10, but from frame 208 onward, owing to the change in target appearance, the comparison algorithms other than the proposed one gradually lose the target or track a wrong one, showing that the tracking effect of the invention is the best. For the Jump sequence, CFNet has already lost the target by frame 10, SRDCF loses it at frame 29, Staple loses it at frame 62, and by frame 116 every tracker except the invention has lost the target. For the Skiing sequence, the target appearance changes more markedly: SiamFC and Staple have lost the target by frame 25, CFNet loses it at frame 42, and up to frame 60 only SRDCF and the invention can still track the target accurately.
The invention provides a dual-template dense twin network tracking algorithm based on global context. The AlexNet network is replaced by a dense convolutional network with stronger feature extraction capability, improving the generalization of the features. To further improve the expressiveness of the target appearance features, a global attention module is added after the twin network template branches; it aggregates global context information of the target and thereby improves the robustness of the deep features to target appearance changes. A new template is designed to complement the original template, realizing dynamic template updating, and this real-time template-updating algorithm maintains a good tracking effect even when the target is deformed.
Example 3
The embodiment provides a dual-template dense twin network tracking method, which comprises the following steps:
acquiring and preprocessing an original template image, a new template image and a search image;
inputting the preprocessed original template image, new template image and search image into a pre-constructed and trained twin network model to obtain a first original template feature map, a first new template feature map and a search image feature map;
inputting the first original template feature map and the first new template feature map into a pre-added global attention module to obtain a second original template feature map and a second new template feature map;
inputting the second original template feature map and the second new template feature map into a pre-added spatiotemporal attention module to obtain an original template weight map and a new template weight map, respectively;
performing weighted fusion of the original template weight map with the second original template feature map, and of the new template weight map with the second new template feature map, to obtain a fused template feature map;
and performing a cross-correlation operation between the fused template feature map and the search image feature map to obtain a response map, and determining the target position from the maximum of the response map.
Further, the twin network model includes a template branch, a new template branch and a search branch, and the twin network adopts a dense convolutional network;
a global attention module is added after the template branch and the new template branch,
and a shared spatiotemporal attention module is added after the template branch and the new template branch;
the dense convolutional network comprises a convolutional layer, dense block 1, a transition layer, dense block 2, a transition layer, dense block 3 and dense block 4; the template branch and the new template branch are identical in structure.
Example 4
The invention provides a dual-template dense twin network tracking device, comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to any one of:
acquiring and preprocessing an original template image, a new template image and a search image;
inputting the preprocessed original template image, new template image and search image into a pre-constructed and trained twin network model to obtain an original template feature map, a new template feature map and a search image feature map;
passing the original template feature map and the new template feature map through a pre-added spatiotemporal attention module to obtain an original template weight map and a new template weight map, respectively;
performing weighted fusion of the original template weight map with the original template feature map, and of the new template weight map with the new template feature map, to obtain a fused template feature map;
and performing a cross-correlation operation between the fused template feature map and the search image feature map to obtain a response map, and determining the target position from the maximum of the response map.
Example 5
The invention provides a computer-readable storage medium on which a computer program is stored which, when executed by a processor, carries out the steps of any one of the following methods:
acquiring and preprocessing an original template image, a new template image and a search image;
inputting the preprocessed original template image, new template image and search image into a pre-constructed and trained twin network model to obtain an original template feature map, a new template feature map and a search image feature map;
passing the original template feature map and the new template feature map through a pre-added spatiotemporal attention module to obtain an original template weight map and a new template weight map, respectively;
performing weighted fusion of the original template weight map with the original template feature map, and of the new template weight map with the new template feature map, to obtain a fused template feature map;
and performing a cross-correlation operation between the fused template feature map and the search image feature map to obtain a response map, and determining the target position from the maximum of the response map.
The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as falling within the protection scope of the present invention.

Claims (8)

1. A dual-template dense twin network tracking method, characterized by comprising the following steps:
acquiring and preprocessing an original template image, a new template image and a search image;
inputting the preprocessed original template image, new template image and search image into a pre-constructed and trained twin network model to obtain an original template feature map, a new template feature map and a search image feature map;
passing the original template feature map and the new template feature map through a pre-added spatiotemporal attention module to obtain an original template weight map and a new template weight map, respectively;
performing weighted fusion of the original template weight map with the original template feature map, and of the new template weight map with the new template feature map, to obtain a fused template feature map;
and performing a cross-correlation operation between the fused template feature map and the search image feature map to obtain a response map, and determining the target position from the maximum of the response map.
2. The dual-template dense twin network tracking method of claim 1, wherein the twin network model comprises a template branch, a new template branch and a search branch, and the twin network adopts a dense convolutional network;
a global attention module is added after the template branch and the new template branch,
and a shared spatiotemporal attention module is added after the template branch and the new template branch;
the dense convolutional network comprises a convolutional layer, dense block 1, a transition layer, dense block 2, a transition layer, dense block 3 and dense block 4; the template branch and the new template branch are identical in structure.
3. The dual-template dense twin network tracking method of claim 1, wherein preprocessing the original template image, the new template image and the search image comprises:
giving an initial target frame in the original template image, the new template image and the search image to obtain the initial target center and target scale, and calculating the sizes of the adjusted original template image, new template image and search image from the initial target frame, initial target center and target scale.
4. The dual-template dense twin network tracking method of claim 1, wherein obtaining the original template weight map and the new template weight map comprises:
calculating the similarity between the new template feature map and the original template feature map using a cosine similarity measure, and performing a softmax operation at each position of the new template feature map and the original template feature map to obtain the original template weight map and the new template weight map.
5. A dual-template dense twin network tracking method, characterized by comprising the following steps:
acquiring and preprocessing an original template image, a new template image and a search image;
inputting the preprocessed original template image, new template image and search image into a pre-constructed and trained twin network model to obtain a first original template feature map, a first new template feature map and a search image feature map;
inputting the first original template feature map and the first new template feature map into a pre-added global attention module to obtain a second original template feature map and a second new template feature map;
inputting the second original template feature map and the second new template feature map into a pre-added spatiotemporal attention module to obtain an original template weight map and a new template weight map, respectively;
performing weighted fusion of the original template weight map with the second original template feature map, and of the new template weight map with the second new template feature map, to obtain a fused template feature map;
and performing a cross-correlation operation between the fused template feature map and the search image feature map to obtain a response map, and determining the target position from the maximum of the response map.
6. The dual-template dense twin network tracking method of claim 5, wherein the twin network model comprises a template branch, a new template branch and a search branch, and the twin network adopts a dense convolutional network;
a global attention module is added after the template branch and the new template branch,
and a shared spatiotemporal attention module is added after the template branch and the new template branch;
the dense convolutional network comprises a convolutional layer, dense block 1, a transition layer, dense block 2, a transition layer, dense block 3 and dense block 4; the template branch and the new template branch are identical in structure.
7. A dual-template dense twin network tracking device, characterized by comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to any one of claims 1 to 6.
8. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
CN202110811344.0A 2021-07-19 2021-07-19 Dual-template dense twin network tracking method, device and storage medium Active CN113658218B (en)

Priority Applications (1)

Application Number: CN202110811344.0A; Priority Date: 2021-07-19; Filing Date: 2021-07-19
Title: Dual-template dense twin network tracking method, device and storage medium

Publications (2)

Publication Number Publication Date
CN113658218A 2021-11-16
CN113658218B (en) 2023-10-13

Family

ID=78477667

Family Applications (1)

Application Number: CN202110811344.0A (Active, granted as CN113658218B); Priority Date: 2021-07-19; Filing Date: 2021-07-19

Country Status (1)

Country Link
CN (1) CN113658218B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110335290A (en) * 2019-06-04 2019-10-15 大连理工大学 Twin candidate region based on attention mechanism generates network target tracking method
CN111179314A (en) * 2019-12-30 2020-05-19 北京工业大学 Target tracking method based on residual dense twin network
CN111291679A (en) * 2020-02-06 2020-06-16 厦门大学 Target specific response attention target tracking method based on twin network
CN111354017A (en) * 2020-03-04 2020-06-30 江南大学 Target tracking method based on twin neural network and parallel attention module
CN111462175A (en) * 2020-03-11 2020-07-28 华南理工大学 Space-time convolution twin matching network target tracking method, device, medium and equipment
US20200327679A1 (en) * 2019-04-12 2020-10-15 Beijing Moviebook Science and Technology Co., Ltd. Visual target tracking method and apparatus based on deeply and densely connected neural network
CN112164094A (en) * 2020-09-22 2021-01-01 江南大学 Fast video target tracking method based on twin network
CN112258554A (en) * 2020-10-07 2021-01-22 大连理工大学 Double-current hierarchical twin network target tracking method based on attention mechanism
CN112348849A (en) * 2020-10-27 2021-02-09 南京邮电大学 Twin network video target tracking method and device
CN112560695A (en) * 2020-12-17 2021-03-26 中国海洋大学 Underwater target tracking method, system, storage medium, equipment, terminal and application
CN112785624A (en) * 2021-01-18 2021-05-11 苏州科技大学 RGB-D characteristic target tracking method based on twin network
CN112837344A (en) * 2019-12-18 2021-05-25 沈阳理工大学 Target tracking method for generating twin network based on conditional confrontation

Also Published As

Publication number Publication date
CN113658218B (en) 2023-10-13


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant