CN112541468A - Target tracking method based on dual-template response fusion - Google Patents

Target tracking method based on dual-template response fusion

Info

Publication number
CN112541468A
Authority
CN
China
Prior art keywords: module, response, fusion, frame, target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011524190.9A
Other languages
Chinese (zh)
Other versions
CN112541468B (en)
Inventor
史殿习
王宁
刘聪
杨文婧
杨绍武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202011524190.9A priority Critical patent/CN112541468B/en
Publication of CN112541468A publication Critical patent/CN112541468A/en
Application granted granted Critical
Publication of CN112541468B publication Critical patent/CN112541468B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection


Abstract

The invention discloses a target tracking method based on dual-template response fusion, aiming to solve the problems that a fixed template reduces the accuracy of the template while a dynamically updated template reduces its robustness. First, a target tracking system consisting of a feature extraction module, a cross-correlation response module, a response fusion module and a target frame output module is constructed; then a training set is selected and the target tracking system is trained; finally, the trained target tracking system performs target tracking on each frame of the video sequence, including feature extraction, cross-correlation response calculation, cross-correlation response fusion, and prediction of the target position and size, to obtain the target tracking result. During tracking, the twin tracking network uses both the template initialized from the first frame of the video and the template dynamically updated with the tracking result of each subsequent frame, so that the advantages and disadvantages of the two templates complement each other, improving target tracking precision while ensuring robustness and real-time performance.

Description

Target tracking method based on dual-template response fusion
Technical Field
The invention belongs to the field of computer vision target tracking, and particularly relates to a target tracking method based on dual-template response fusion within a twin (Siamese) network structure.
Background
Target tracking is an important task in the field of computer vision and can be divided into single-target tracking and multi-target tracking according to the number of targets, with single-target tracking having the wider range of application. Specifically, for a video sequence, single-target tracking initializes the tracker with a given target bounding box in the first frame of the sequence, then predicts the location of the target in subsequent video frames and draws a bounding box around it. Target tracking plays an important role in military, agricultural, security and other technical fields; with the rapid development of artificial intelligence technology and the growing demands of practical applications, ever higher performance is required of target tracking algorithms, so research on target tracking technology is very necessary.
Single-target tracking algorithms fall into two main categories: generative tracking and discriminant tracking. Generative tracking focuses primarily on characterizing the intrinsic distribution of the target appearance data and does not explicitly discriminate the target from the background. Discriminant tracking treats the tracking problem as a classification problem, separating the target from the background by learning a classifier. Because discriminant tracking algorithms have a strong ability to distinguish foreground from background, their accuracy and robustness are higher than those of generative tracking algorithms.
The two most popular discriminant tracking approaches at present are correlation-filtering-based tracking and deep-learning-based tracking. Tracking algorithms based on deep learning can extract deep features of the target; these features carry richer semantic information and express the target appearance more strongly. Therefore, current deep-learning-based methods achieve better accuracy and robustness. Among current deep-learning-based tracking methods, the twin (Siamese) network structure is the most commonly adopted. A twin network is a pair of convolutional neural networks with identical parameters. A twin-network tracker uses the twin network as its feature extraction network: one network branch serves as the template branch, takes the GT (ground-truth target bounding box) in the first frame of the video as input, and extracts the target features as the template; the other branch serves as the detection branch, which first extracts a detection area from each frame after the first and then takes the detection area as network input to output the detection area features. In the subsequent classification and regression network, candidate boxes are extracted from the detection area and matched against the template features; the candidate box with the highest matching score is selected as the target box, and its length and width are refined by regression.
Currently, in twin network based tracking methods, the template is either fixed or dynamically updated.
Template fixing initializes the template with the GT of the first frame of the video and keeps the template unchanged during the subsequent tracking process. Since the template is initialized with GT, the template is absolutely correct; but in subsequent video frames the target may change in size, shape and so on, causing its features to change. A fixed template therefore cannot capture the latest features of the target, so the template can no longer be correctly matched with the target features, leading to inaccurate tracking.
Dynamic template updating initializes the template in the first frame and, during the subsequent tracking process, updates the template according to the target box predicted in each frame. Because the template is updated with the tracking result of every frame, it always contains the latest features of the target, which can improve tracking accuracy; however, if tracking is lost or drifts, the updated template is contaminated by background features, the template is no longer robust, and tracking fails.
Therefore, how to balance the accuracy and robustness of the template in twin-network-based target tracking, so that the template always acquires the latest features of the target during tracking while not being corrupted by tracking drift, thereby improving the overall accuracy and robustness of the target tracking method, is a hot issue studied by researchers in this field.
Disclosure of Invention
The invention aims to solve the technical problem that the template fixing mode in the twin network tracking method may reduce the accuracy of the template, and the template dynamic updating mode may reduce the robustness of the template.
The invention provides a target tracking method based on a dual template on the basis of the existing twin network tracking method, namely, an initial template (a template initialized by a first frame of a video) and an updated template (a template dynamically updated by using a tracking result in each frame later) are simultaneously used in the twin tracking network, so that the advantages and disadvantages of the two templates are complemented, and the target tracking precision is improved.
In order to solve the above technical problems, the technical scheme of the invention is as follows. First, a target tracking system consisting of a feature extraction module, a cross-correlation response module, a response fusion module and a target frame output module is constructed. Then ImageNet VID and DET (see "Russakovsky O, Deng J, Su H, et al. ImageNet Large Scale Visual Recognition Challenge [J]. International Journal of Computer Vision, 2015, 115(3): 211-252."), YouTube-BoundingBoxes (see "Real E, Shlens J, Mazzocchi S, et al. YouTube-BoundingBoxes: A large high-precision human-annotated data set for object detection in video [C]// CVPR, 2017.") and GOT-10k (see "Huang L, Zhao X, Huang K. GOT-10k: A large high-diversity benchmark for generic object tracking in the wild [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.") are selected as training sets to train the target tracking system. Finally, the trained target tracking system performs feature extraction, cross-correlation response calculation, cross-correlation response fusion and target position and size prediction on each frame of the video sequence to obtain the target tracking result.
The specific technical scheme of the invention is as follows:
the method comprises the following steps of firstly, building a target tracking system, wherein the target tracking system is composed of a feature extraction module, a cross-correlation response module, a response fusion module and a target frame output module.
The feature extraction module is connected with the cross-correlation response module and consists of a convolutional neural network sub-module and a linear fusion sub-module. The convolutional neural network sub-module is used for extracting features of the input image: it receives from the outside the template image Z_0 of the first frame of the video and the target search area image of each subsequent frame (the target search area image of the i-th frame is denoted X_i), performs feature extraction on Z_0 and X_i respectively, and sends the extracted initial template features z_0 and search area features x_i together to the cross-correlation response module, while also sending the initial template features z_0 to the linear fusion sub-module. The convolutional neural network sub-module is a modified AlexNet (see "Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks [C]// Advances in Neural Information Processing Systems. 2012: 1097-1105."). The modified AlexNet comprises 16 layers in total: 5 convolutional layers, 2 max-pooling layers, 5 BatchNorm layers and 4 ReLU activation function layers, wherein the convolutional layers are the 1st, 5th, 9th, 12th and 15th layers, the pooling layers are the 3rd and 7th layers, the BatchNorm layers are the 2nd, 6th, 10th, 13th and 16th layers, and the remaining layers are ReLU activation function layers. Z_0 has a size of 127 × 127 × 3 and its image content is the target to be tracked; the feature extraction module extracts from Z_0 the initial template features z_0 with size 6 × 6 × 256. X_i has a size of 287 × 287 × 3, and the tracking task is to find in X_i the target most similar to Z_0; the feature extraction module extracts from X_i the search area features x_i with size 26 × 26 × 256. The meaning of the dimensions is: the first two values are the length and width, and the third value is the number of channels; e.g., 6 × 6 × 256 represents 256 channels, each of size 6 × 6.
The linear fusion sub-module takes as input the initial template features z_0, the fused template features z_{i-1}^fuse of the previous frame (i.e., frame i-1), and the tracked target features z_{i-1} of frame i-1, where z_0 is the output of the convolutional neural network sub-module, z_{i-1}^fuse is the output of the linear fusion sub-module during the tracking of frame i-1, and z_{i-1} is the feature of the tracking result image of frame i-1. The linear fusion sub-module fuses the three features z_0, z_{i-1}^fuse and z_{i-1} by linear weighting to obtain the fused template features z_i^fuse of the current frame (i.e., frame i), and sends z_i^fuse to the cross-correlation response module. In the first frame, the fused template features are the initial template features z_0; thereafter, the fused template features of frame i are used in the target tracking task of frame i+1 to obtain the target tracking result of frame i+1.
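For illustration, a minimal PyTorch-style sketch of the 16-layer convolutional neural network sub-module described above follows. Only the layer ordering (convolution, BatchNorm, max-pooling, ReLU) is taken from the text; the kernel sizes, strides and intermediate channel widths are assumptions.

```python
import torch
import torch.nn as nn

class ModifiedAlexNet(nn.Module):
    """Sketch of the modified AlexNet backbone: conv layers at positions 1, 5, 9, 12, 15,
    max-pooling at 3 and 7, BatchNorm at 2, 6, 10, 13, 16, ReLU elsewhere.
    Kernel sizes, strides and channel widths are assumed, not quoted from the text."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=2),   # layer 1  (conv)
            nn.BatchNorm2d(96),                           # layer 2  (BatchNorm)
            nn.MaxPool2d(kernel_size=3, stride=2),        # layer 3  (max-pool)
            nn.ReLU(inplace=True),                        # layer 4  (ReLU)
            nn.Conv2d(96, 256, kernel_size=5),            # layer 5  (conv)
            nn.BatchNorm2d(256),                          # layer 6  (BatchNorm)
            nn.MaxPool2d(kernel_size=3, stride=2),        # layer 7  (max-pool)
            nn.ReLU(inplace=True),                        # layer 8  (ReLU)
            nn.Conv2d(256, 384, kernel_size=3),           # layer 9  (conv)
            nn.BatchNorm2d(384),                          # layer 10 (BatchNorm)
            nn.ReLU(inplace=True),                        # layer 11 (ReLU)
            nn.Conv2d(384, 384, kernel_size=3),           # layer 12 (conv)
            nn.BatchNorm2d(384),                          # layer 13 (BatchNorm)
            nn.ReLU(inplace=True),                        # layer 14 (ReLU)
            nn.Conv2d(384, 256, kernel_size=3),           # layer 15 (conv)
            nn.BatchNorm2d(256),                          # layer 16 (BatchNorm)
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (N, 3, 127, 127) template or (N, 3, 287, 287) search region
        return self.features(image)
```

With these assumed kernel sizes and strides, a 127 × 127 × 3 template input produces a 6 × 6 × 256 feature map and a 287 × 287 × 3 search region produces a 26 × 26 × 256 feature map, matching the feature sizes stated above.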
The cross-correlation response module is connected with the feature extraction module and the response fusion module. The module consists of two parallel branches, namely a first classification branch and a first regression branch; both branches are convolutional neural networks with identical network structures but different parameters. The first classification branch is used for generating classification cross-correlation responses and consists of two convolution sub-modules with the same structure: a classification kernel sub-module and a classification search sub-module, each comprising 1 convolutional layer, 1 BatchNorm layer and 1 ReLU activation function layer. The classification kernel sub-module first receives z_0 from the convolutional neural network sub-module and generates the integrated initial template features z_0^cls; it then receives z_i^fuse from the linear fusion sub-module and generates the further integrated fused template features z_i^cls, where cls is an abbreviation for classification. Both z_0^cls and z_i^cls have size 4 × 4 × 256. The classification search sub-module receives x_i from the convolutional neural network sub-module and integrates x_i to obtain the integrated search region features x_i^cls with size 24 × 24 × 256. The first classification branch then takes z_0^cls as the convolution kernel and x_i^cls as the convolved region and performs a convolution operation to obtain the classification cross-correlation response of the initial template, R_init^cls; it then takes z_i^cls as the convolution kernel and x_i^cls as the convolved region and performs a convolution operation to obtain the classification cross-correlation response of the fused template, R_fuse^cls. Both R_init^cls and R_fuse^cls have size 22 × 22 × 256. Finally, the first classification branch outputs R_init^cls and R_fuse^cls to the response fusion module.
The first regression branch is used to generate regression cross-correlation responses. Like the first classification branch, the first regression branch also contains two sub-modules: a regression kernel sub-module and a regression search sub-module, whose network structures are the same as that of the classification kernel sub-module of the first classification branch. The regression kernel sub-module first receives z_0 from the convolutional neural network sub-module and generates the integrated initial template features z_0^reg; it then receives z_i^fuse from the linear fusion sub-module and generates the integrated fused template features z_i^reg, where reg is an abbreviation for regression. Both z_0^reg and z_i^reg have size 4 × 4 × 256. The regression search sub-module receives x_i from the convolutional neural network sub-module and integrates x_i to obtain the integrated search region features x_i^reg with size 24 × 24 × 256. The first regression branch then takes z_0^reg as the convolution kernel and x_i^reg as the convolved region and performs a convolution operation to obtain the regression cross-correlation response of the initial template, R_init^reg; it then takes z_i^reg as the convolution kernel and x_i^reg as the convolved region and performs a convolution operation to obtain the regression cross-correlation response of the fused template, R_fuse^reg. Both R_init^reg and R_fuse^reg have size 22 × 22 × 256. Finally, the first regression branch outputs R_init^reg and R_fuse^reg to the response fusion module.
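The cross-correlation operation used by the first classification and regression branches can be sketched as below: the integrated template features act as a convolution kernel slid over the integrated search region features. A channel-wise (depthwise) correlation is assumed here so that the response keeps 256 channels, as stated for R_init^cls, R_fuse^cls, R_init^reg and R_fuse^reg; the text itself only states that the template features are used as the convolution kernel.

```python
import torch
import torch.nn.functional as F

def template_xcorr(template: torch.Tensor, search: torch.Tensor) -> torch.Tensor:
    """Cross-correlation with the template features as the convolution kernel.

    template: (C, Ht, Wt) integrated template features, e.g. 256 x 4 x 4
    search:   (C, Hs, Ws) integrated search region features, e.g. 256 x 24 x 24
    returns:  (C, Hs - Ht + 1, Ws - Wt + 1) response map

    Depthwise grouping (groups=C) is an assumption made so the output keeps C channels.
    """
    c = template.size(0)
    kernel = template.unsqueeze(1)        # (C, 1, Ht, Wt): one kernel per channel
    x = search.unsqueeze(0)               # (1, C, Hs, Ws)
    response = F.conv2d(x, kernel, groups=c)
    return response.squeeze(0)
```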
The response fusion module is a convolutional neural network connected with the cross-correlation response module and the target frame output module. The module consists of two parallel neural network branches: a second classification branch and a second regression branch. The second classification branch comprises 5 layers, of which the 1st and 3rd layers are convolutional layers, the 2nd and 4th layers are BatchNorm layers, and the last layer is a ReLU activation function layer. The second classification branch receives the two classification cross-correlation responses R_init^cls and R_fuse^cls from the first classification branch, stacks R_init^cls and R_fuse^cls in the channel dimension to generate a classification stacking response of size 22 × 22 × 512, then performs classification fusion on the classification stacking response to obtain the 22 × 22 × 256 classification fusion response R^cls, and sends R^cls to the target frame output module. The network structure of the second regression branch is the same as that of the second classification branch. The second regression branch receives the two regression cross-correlation responses R_init^reg and R_fuse^reg from the first regression branch, stacks R_init^reg and R_fuse^reg in the channel dimension to generate a regression stacking response of size 22 × 22 × 512, then performs regression fusion on the regression stacking response to obtain the 22 × 22 × 256 regression fusion response R^reg, and sends R^reg to the target frame output module.
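A minimal sketch of one response fusion branch (stacking the two cross-correlation responses along the channel dimension and fusing them with the 5-layer convolutional branch described above) is given below; the kernel sizes and the exact placement of the residual connection mentioned later in step 4.8 are assumptions.

```python
import torch
import torch.nn as nn

class ResponseFusionBranch(nn.Module):
    """Second classification (or regression) branch: conv, BatchNorm, conv, BatchNorm, ReLU.
    Kernel sizes and the residual operand are assumed, not quoted from the text."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),  # layer 1 (conv)
            nn.BatchNorm2d(channels),                                     # layer 2 (BatchNorm)
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),      # layer 3 (conv)
            nn.BatchNorm2d(channels),                                     # layer 4 (BatchNorm)
            nn.ReLU(inplace=True),                                        # layer 5 (ReLU)
        )

    def forward(self, r_init: torch.Tensor, r_fuse: torch.Tensor) -> torch.Tensor:
        # r_init, r_fuse: (N, 256, 22, 22) cross-correlation responses
        stacked = torch.cat([r_init, r_fuse], dim=1)   # 512-channel stacking response
        fused = self.fuse(stacked)                     # 256-channel fusion response
        return fused + r_fuse                          # residual connection (operand assumed)
```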
The target frame output module is connected with the response fusion module and consists of two parallel neural network branches: a third classification branch and a third regression branch. The third classification branch comprises 4 layers, of which the 1st and 4th layers are convolutional layers (with 1 × 1 convolution kernels), the 2nd layer is a BatchNorm layer, and the 3rd layer is a ReLU activation function layer. The third classification branch receives the classification fusion response R^cls from the second classification branch and performs response classification on R^cls to obtain the classification result A^cls, whose dimension is 22 × 22 × 2k, where k is the number of anchor frames in the Region Proposal Network (RPN, see "Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks [C]// Advances in Neural Information Processing Systems. 2015: 91-99."); 2k means that there are k anchor frames and each anchor frame corresponds to 2 classification values, which respectively represent the probability that the image in the anchor frame is the target and is not the target. The network structure of the third regression branch is the same as that of the third classification branch. The third regression branch receives the regression fusion response R^reg from the second regression branch and performs response regression on R^reg to obtain the regression result A^reg, whose dimension is 22 × 22 × 4k, where k is the same as in the third classification branch, i.e. the number of anchor frames in the RPN; 4k means that there are k anchor frames and each anchor frame corresponds to 4 regression values dx, dy, dw and dh, which respectively represent the correction values of the x and y coordinates and of the length and width of the corresponding original anchor frame. The target frame output module selects the anchor frame with the maximum target probability in the classification result as the target prediction frame, takes out the four correction values in the regression result corresponding to that anchor frame, and uses them to correct the position and size of the anchor frame; the corrected anchor frame is the tracking frame of the target.
Secondly, preparing a training data set of the target tracking system, wherein the method comprises the following steps:
the training data set of the system is divided into two parts: first training data set T1And a second training data set T2,T1For training feature extraction module, cross-correlation response module and target box output module, T2For training the response fusion module.
2.1 Select 100000 positive sample pairs from VID and YTB (i.e., YouTube-BoundingBoxes) by: sampling each video sequence of VID and YTB, randomly selecting a frame from a video sequence as the template image, and randomly selecting a frame within a range of no more than one hundred frames after the template image as the search area image; the two images selected in this way form 1 positive sample pair, and 100000 positive sample pairs are generated in this sampling manner. VID and YTB are video data sets in which each video contains a specific target, and the makers of the data sets have annotated a target frame for every video frame; the target frame is annotated by the coordinates of the upper-left corner point of a rectangular frame and the length and width of the rectangular frame, and the rectangular frame marks the target position.
2.2 select 100000 negative sample pairs from VID and YTB by: randomly selecting one frame from one video sequence of VID and YTB as a template image, and randomly selecting one frame from the other video sequence as a search area image, and taking the two images selected in the way as 1 negative sample pair. 100000 negative sample pairs are generated in this sampling manner.
2.3 Choose 100000 positive sample pairs from DET and COCO by: randomly selecting two different images of the same object from DET and COCO as 1 positive sample pair; 100000 positive sample pairs are generated in this sampling manner. One sample in the positive sample pair serves as the template image and the other as the search area image. DET and COCO are target detection data sets containing target box labels.
2.4 select 100000 negative sample pairs from DET and COCO by:
2.4.1 select two images of the same kind but not the same object from DET and COCO, one as template image and the other as search area image, to get 1 negative sample pair. 50000 negative sample pairs were generated in this sampling manner.
2.4.2 select two different objects of different classes, one as a template image and the other as a search area image, from DET and COCO to obtain 1 negative sample pair. 50000 negative sample pairs were generated in this sampling manner.
100000 negative sample pairs are finally obtained through 2.4.1 and 2.4.2.
2.5 Scale the template images in all positive and negative sample pairs to a size of 127 × 127 × 3 and all search area images to a size of 287 × 287 × 3. Take all finally obtained positive and negative sample pairs as the first training data set T_1.
2.6 Choose the training set of GOT-10k as the second training data set T_2.
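For illustration, the pair-sampling rules of steps 2.1 and 2.2 can be sketched as below; the frame containers are hypothetical placeholders, and only the sampling constraints (same sequence within one hundred frames for a positive pair, two different sequences for a negative pair) follow the text.

```python
import random

def sample_positive_pair(video_frames):
    """Step 2.1: template frame plus a search frame at most 100 frames later
    from the same video sequence."""
    t = random.randrange(len(video_frames))
    s = random.randrange(t, min(t + 100, len(video_frames) - 1) + 1)
    return video_frames[t], video_frames[s]

def sample_negative_pair(video_a_frames, video_b_frames):
    """Step 2.2: template frame and search frame taken from two different
    video sequences."""
    return random.choice(video_a_frames), random.choice(video_b_frames)
```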
Thirdly, training a target tracking system by using a training data set, wherein the specific method comprises the following steps:
3.1 initializing the parameters of the feature extraction module by using AlexNet network parameters pre-trained on ImageNet, and initializing the parameters of the cross-correlation response module, the response fusion module and the target frame output module by using a Kaiming initialization method.
3.2 Use T_1 to train the feature extraction module, the cross-correlation response module and the target frame output module, as follows:
3.2.1 Set the total number of training iteration rounds to 50 and initialize epoch = 1; initialize the data batch input size batchsize = 128; initialize the learning rate lr = 0.01 and set it to 0.0005 in the last round, with the learning rate decayed exponentially during training; initialize the hyper-parameter λ = 1.2; define the number of samples in T_1 as Len(T_1).
3.2.2 Use T_1 to train the feature extraction module, the cross-correlation response module and the target frame output module, as follows:
3.2.2.1 Initialize the variable d = 1.
3.2.2.2 Take the d-th to the (d + batchsize)-th pictures of T_1 as training data and input them into the feature extraction module; the data flow is feature extraction module → cross-correlation response module → target frame output module. Use the stochastic gradient descent (SGD) method (see "LeCun Y, et al. Backpropagation applied to handwritten zip code recognition [J]. Neural Computation, 1989.") to train the feature extraction module, the cross-correlation response module and the target frame output module by minimizing the loss function, so as to update the network parameters of the convolutional neural network sub-module, the cross-correlation response module and the target frame output module. The loss function is a combination of the classification loss L_cls and the regression loss L_reg, of the form:
L = L_cls + λ·L_reg
where L is the total loss function, L_cls is the classification loss function, obtained by computing a cross-entropy loss between the true target box and the predicted boxes in the search area, and L_reg is the regression loss function, obtained by computing the SmoothL1 loss between each predicted box and the true target box.
3.2.2.3 Let d = d + batchsize. If d > Len(T_1), go to 3.2.2.4; if d ≤ Len(T_1), go to 3.2.2.2.
3.2.2.4 If epoch ≤ 10, go to 3.2.2.5; if epoch > 10, go to 3.2.2.6.
3.2.2.5 Set all 16 layers of network parameters of the convolutional neural network sub-module to fixed (i.e., not trained), go to 3.2.2.6.
3.2.2.6 Set the parameters of the first 11 layers of the convolutional neural network sub-module to fixed and the parameters of the last 5 layers to trainable, go to 3.2.2.7.
3.2.2.7 If epoch < 50, let epoch = epoch + 1 and lr = 0.5 × lr, go to 3.2.2.1; if epoch = 50, go to 3.2.2.8.
3.2.2.8 Take the parameters of the convolutional neural network sub-module, the cross-correlation response module and the target frame output module obtained after training and updating as the parameters of the convolutional neural network sub-module, the cross-correlation response module and the target frame output module networks.
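A schematic sketch of the first training stage (step 3.2) follows, assuming hypothetical placeholders for the tracking system, the batches of T_1 and the anchor/target matching; only the schedule stated above (50 epochs, SGD, batch size 128, λ = 1.2, learning rate halved each epoch, first 11 backbone layers frozen after epoch 10) is taken from the text.

```python
import torch
import torch.nn as nn

def combined_loss(cls_logits, reg_preds, cls_targets, reg_targets, lam=1.2):
    """L = L_cls + lambda * L_reg: cross-entropy for classification,
    SmoothL1 for box regression (step 3.2.2.2)."""
    l_cls = nn.functional.cross_entropy(cls_logits, cls_targets)
    l_reg = nn.functional.smooth_l1_loss(reg_preds, reg_targets)
    return l_cls + lam * l_reg

def train_stage_one(tracking_system, batches_of_T1, backbone_layers):
    """Schematic loop; tracking_system, batches_of_T1 and backbone_layers are placeholders."""
    lr = 0.01
    optimizer = torch.optim.SGD(tracking_system.parameters(), lr=lr)
    for epoch in range(1, 51):                             # 50 epochs in total
        # Backbone freezing schedule (3.2.2.4-3.2.2.6): everything frozen up to epoch 10,
        # afterwards only the last 5 of the 16 backbone layers are trainable.
        for idx, layer in enumerate(backbone_layers):
            trainable = epoch > 10 and idx >= 11
            for p in layer.parameters():
                p.requires_grad = trainable
        for templates, searches, cls_t, reg_t in batches_of_T1:   # batch size 128
            cls_out, reg_out = tracking_system(templates, searches)
            loss = combined_loss(cls_out, reg_out, cls_t, reg_t)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        lr *= 0.5                                          # halve the learning rate (3.2.2.7)
        for group in optimizer.param_groups:
            group["lr"] = lr
```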
3.3 Input all video frames of T_2 into the convolutional neural network sub-module and the cross-correlation response module, and store the outputs for each video frame of T_2: the initial template response R_init^cls, the fused template response R_fuse^cls and the GT response R_GT^cls of the classification branch (the GT response is the response between the target template corresponding to the ground truth and the search area), and the initial template response R_init^reg, the fused template response R_fuse^reg and the GT response R_GT^reg of the regression branch. Take the six classification and regression responses corresponding to each video frame of T_2 as the third training data set T_2′.
3.4 Use T_2′ to train the response fusion module, as follows:
3.4.1 Set the total number of training iteration rounds to 50 and initialize epoch = 1; initialize the data batch input size batchsize = 64; initialize the learning rate lr = 10e-6 and set it to 10e-9 in the last round; define the number of samples in T_2′ as Len(T_2′).
3.4.2 Use T_2′ to train the response fusion module; the specific steps are as follows:
3.4.2.1 Initialize the variable d = 1.
3.4.2.2 Take the d-th to the (d + batchsize)-th pictures of T_2′ as training data and train the response fusion module with the SGD algorithm to optimize its parameters. The loss function of the fusion module is the Euclidean distance loss, of the form:
L_fusion = ||R − R_GT||_2
where R is the fusion response and R_GT is the response between the GT box and the search area; the purpose of using the Euclidean distance is to make the fusion response as similar as possible to the GT box response.
3.4.2.3 Let d = d + batchsize. If d > Len(T_2′), go to 3.4.2.4; if d ≤ Len(T_2′), go to 3.4.2.2.
3.4.2.4 If epoch < 50, let epoch = epoch + 1 and lr = 0.5 × lr, go to 3.4.2.1; if epoch = 50, go to 3.4.2.5.
3.4.2.5 Take the parameters of the response fusion module obtained after the last round of training as the network parameters of the final response fusion module.
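The fusion-module training of step 3.4 can be sketched as below. Interpreting the Euclidean distance loss as the L2 norm of the difference between the fusion response and the GT response is an assumption, and the fusion module interface and batch iterator are hypothetical placeholders; the schedule (50 epochs, batch size 64, SGD, learning rate halved each epoch from the stated initial value) follows the text.

```python
import torch

def fusion_loss(fused_response: torch.Tensor, gt_response: torch.Tensor) -> torch.Tensor:
    """Euclidean-distance loss of step 3.4.2.2; the L2-norm form is an assumption."""
    return torch.norm(fused_response - gt_response, p=2)

def train_fusion_module(fusion_module, batches_of_T2_prime):
    optimizer = torch.optim.SGD(fusion_module.parameters(), lr=10e-6)  # initial lr as written in 3.4.1
    for epoch in range(1, 51):                                         # 50 epochs in total
        for r_init, r_fused_template, r_gt in batches_of_T2_prime:     # batch size 64; placeholder batches
            fused = fusion_module(r_init, r_fused_template)
            loss = fusion_loss(fused, r_gt)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        for group in optimizer.param_groups:
            group["lr"] *= 0.5                                         # halve the learning rate (3.4.2.4)
```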
Fourthly, tracking the target by using the trained target tracking system, wherein the method comprises the following steps:
4.1 Acquire the video stream I_0, …, I_i, …, I_N in real time from the camera, and the target tracking system processes each frame in turn, where I_i is the i-th frame in the video and N is the total number of video frames. Initialize the variable i = 1.
4.2 The feature extraction module obtains from the first frame I_0 the target image Z_0 of size 127 × 127 × 3 and performs feature extraction on Z_0 to obtain the initial template features z_0 of size 6 × 6 × 256.
4.3 If i = 1, let z_1^fuse = z_0 and go to 4.6; if i > 1, go to 4.4.
4.4 Before the tracking of frame i, the target tracking system has obtained the fused template features z_{i-1}^fuse used in the tracking of frame i-1 and the tracking result Z_{i-1}. The feature extraction module performs feature extraction on Z_{i-1} to obtain the tracking result features z_{i-1} of frame i-1.
4.5 The linear fusion sub-module of the feature extraction module fuses the initial template z_0, the fused template z_{i-1}^fuse used in the tracking of frame i-1 and the tracking result features z_{i-1} of frame i-1 to generate the fused template features z_i^fuse for the tracking of frame i. The fusion is a linear weighted combination of these features, where λ1 = 0.99 and λ2 = 0.01 are preset weighting parameters.
4.6 The feature extraction module, centered at the center coordinates of Z_{i-1} in I_{i-1}, selects from I_i an image area of size 287 × 287 × 3 as the target search area X_i, and performs feature extraction on X_i to obtain the search area features x_i of size 26 × 26 × 256.
4.7 The cross-correlation response module receives z_0, z_i^fuse and x_i from the feature extraction module. The first classification branch first processes z_0 and x_i to obtain the initial template classification response R_init^cls, and then processes z_i^fuse and x_i to obtain the fused template classification response R_fuse^cls. The first regression branch first processes z_0 and x_i to obtain the initial template regression response R_init^reg, and then processes z_i^fuse and x_i to obtain the fused template regression response R_fuse^reg. All four responses are 22 × 22 × 256 in size.
4.8 The response fusion module receives R_init^cls, R_fuse^cls, R_init^reg and R_fuse^reg. The second classification branch fuses R_init^cls and R_fuse^cls to obtain the classification fusion response R^cls, and the second regression branch fuses R_init^reg and R_fuse^reg to obtain the regression fusion response R^reg. Both fusions adopt a residual-connection fusion mode: R_init^cls and R_fuse^cls are stacked in the channel dimension and input into the second classification branch for fusion, a residual connection is then applied to the fused result, and the classification fusion response R^cls is generated after the residual connection; likewise, R_init^reg and R_fuse^reg are stacked in the channel dimension and input into the second regression branch for fusion, a residual connection is then applied to the fused result, and the regression fusion response R^reg is finally generated.
4.9 The target frame output module receives R^cls and R^reg. The third classification branch processes R^cls to obtain the classification result A^cls with dimension 22 × 22 × 2k, where k is the number of anchor frames; 22 × 22 × 2k indicates that there are k anchor frames at each point of the 22 × 22 grid and that each anchor frame corresponds to 2 classification values, which respectively represent the probability that the anchor frame is the target and is not the target. The third regression branch processes R^reg to obtain the regression result A^reg with dimension 22 × 22 × 4k; 22 × 22 × 4k indicates that there are k anchor frames at each point of the 22 × 22 grid and that each anchor frame corresponds to 4 regression values dx, dy, dw and dh, which respectively represent the correction values of the center position coordinates x and y and the correction values of the length and width from the anchor frame to the actual target frame.
4.10 The third classification branch counts over the classification result A^cls and obtains the anchor frame (x, y, w, h) with the maximum target probability, where x and y are the coordinates of the center point of the anchor frame in the original image and w and h are the length and width of the anchor frame.
4.11 The third regression branch finds in the regression result A^reg the correction values (dx, dy, dw, dh) corresponding to the anchor frame (x, y, w, h) obtained in 4.10 and corrects the anchor frame with these values, obtaining the corrected frame (x*, y*, w*, h*), which is the target frame; (x*, y*) are the coordinates of the center of the target frame and (w*, h*) are its length and width. With this target frame the target image Z_i of frame i is obtained; Z_i is the tracking result of frame i.
4.12 If i < N, let i = i + 1 and go to 4.4; if i = N, go to 4.13.
4.13 The tracking results Z_0, Z_1, …, Z_N of all frames of the video sequence have been obtained, and the process ends.
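The anchor selection and correction of steps 4.10 and 4.11 can be illustrated with the sketch below. The exact correction formula is not reproduced in this text, so the standard Faster R-CNN style decoding used here is an assumption rather than the invention's own formula.

```python
import math

def select_best_anchor(target_probs, anchors):
    """Step 4.10: pick the anchor (x, y, w, h) with the highest target probability.
    target_probs and anchors are aligned placeholder sequences."""
    best = max(range(len(anchors)), key=lambda i: target_probs[i])
    return anchors[best], best

def decode_anchor(x, y, w, h, dx, dy, dw, dh):
    """Step 4.11: apply the corrections (dx, dy, dw, dh) to the anchor (x, y, w, h).
    The Faster R-CNN parameterisation below is assumed, not quoted from the text."""
    cx = x + dx * w          # corrected center x
    cy = y + dy * h          # corrected center y
    cw = w * math.exp(dw)    # corrected width
    ch = h * math.exp(dh)    # corrected height
    return cx, cy, cw, ch
```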
The invention can achieve the following technical effects:
1. On the basis of a single-template twin-network tracking system, the invention adds a fusion template to the tracking system. The fusion template continuously absorbs the latest tracking results of the system, keeping the template more accurate, which strengthens template matching and improves tracking precision.
2. The invention adopts a dual-template strategy, using the initial template and the fusion template simultaneously during target tracking. In this way, when the tracking system drifts or loses the target during tracking, the initial template preserves the purity of the template, ensuring the robustness of the whole target tracking system.
3. The invention improves tracking precision while meeting real-time requirements: on an NVIDIA GTX 1050 Ti, the tracking speed of the method is 68.3 FPS.
Drawings
FIG. 1 is a general flow diagram of the present invention.
FIG. 2 is a logical block diagram of a target tracking system constructed in accordance with the present invention.
Detailed Description
Fig. 1 is a general flow chart of the present invention, and as shown in fig. 1, the present invention comprises the following steps:
firstly, a target tracking system is built, and as shown in fig. 2, the system is composed of a feature extraction module, a cross-correlation response module, a response fusion module and a target frame output module.
The feature extraction module is connected with the cross-correlation response module and consists of a convolutional neural network sub-module and a linear fusion sub-module. The convolutional neural network sub-module is used for extracting features of the input image: it receives from the outside the template image Z_0 of the first frame of the video and the target search area image of each subsequent frame (the target search area image of the i-th frame is denoted X_i), performs feature extraction on Z_0 and X_i respectively, and sends the extracted initial template features z_0 and search area features x_i together to the cross-correlation response module, while also sending the initial template features z_0 to the linear fusion sub-module. The convolutional neural network sub-module is a modified AlexNet comprising 16 layers in total: 5 convolutional layers, 2 max-pooling layers, 5 BatchNorm layers and 4 ReLU activation function layers, wherein the convolutional layers are the 1st, 5th, 9th, 12th and 15th layers, the pooling layers are the 3rd and 7th layers, the BatchNorm layers are the 2nd, 6th, 10th, 13th and 16th layers, and the remaining layers are ReLU activation function layers. Z_0 has a size of 127 × 127 × 3 and its image content is the target to be tracked; the feature extraction module extracts from Z_0 the initial template features z_0 with size 6 × 6 × 256. X_i has a size of 287 × 287 × 3, and the tracking task is to find in X_i the target most similar to Z_0; the feature extraction module extracts from X_i the search area features x_i with size 26 × 26 × 256.
The linear fusion sub-module takes as input the initial template features z_0, the fused template features z_{i-1}^fuse of the previous frame (i.e., frame i-1), and the tracked target features z_{i-1} of frame i-1, where z_0 is the output of the convolutional neural network sub-module, z_{i-1}^fuse is the output of the linear fusion sub-module during the tracking of frame i-1, and z_{i-1} is the feature of the tracking result image of frame i-1. The linear fusion sub-module fuses the three features z_0, z_{i-1}^fuse and z_{i-1} by linear weighting to obtain the fused template features z_i^fuse of the current frame (i.e., frame i), and sends z_i^fuse to the cross-correlation response module. In the first frame, the fused template features are the initial template features z_0; thereafter, the fused template features of frame i are used in the target tracking task of frame i+1 to obtain the target tracking result of frame i+1.
The cross-correlation response module is connected with the feature extraction module and the response fusion module. The module consists of two parallel branches, namely a first classification branch and a first regression branch; both branches are convolutional neural networks with identical network structures but different parameters. The first classification branch is used for generating classification cross-correlation responses and consists of two convolution sub-modules with the same structure: a classification kernel sub-module and a classification search sub-module, each comprising 1 convolutional layer, 1 BatchNorm layer and 1 ReLU activation function layer. The classification kernel sub-module first receives z_0 from the convolutional neural network sub-module and generates the integrated initial template features z_0^cls; it then receives z_i^fuse from the linear fusion sub-module and generates the further integrated fused template features z_i^cls, where cls is an abbreviation for classification. Both z_0^cls and z_i^cls have size 4 × 4 × 256. The classification search sub-module receives x_i from the convolutional neural network sub-module and integrates x_i to obtain the integrated search region features x_i^cls with size 24 × 24 × 256. The first classification branch then takes z_0^cls as the convolution kernel and x_i^cls as the convolved region and performs a convolution operation to obtain the classification cross-correlation response of the initial template, R_init^cls; it then takes z_i^cls as the convolution kernel and x_i^cls as the convolved region and performs a convolution operation to obtain the classification cross-correlation response of the fused template, R_fuse^cls. Both R_init^cls and R_fuse^cls have size 22 × 22 × 256. Finally, the first classification branch outputs R_init^cls and R_fuse^cls to the response fusion module.
The first regression branch is used to generate regression cross-correlation responses. Like the first classification branch, the first regression branch also contains two sub-modules: a regression kernel sub-module and a regression search sub-module, whose network structures are the same as that of the classification kernel sub-module of the first classification branch. The regression kernel sub-module first receives z_0 from the convolutional neural network sub-module and generates the integrated initial template features z_0^reg; it then receives z_i^fuse from the linear fusion sub-module and generates the integrated fused template features z_i^reg, where reg is an abbreviation for regression. Both z_0^reg and z_i^reg have size 4 × 4 × 256. The regression search sub-module receives x_i from the convolutional neural network sub-module and integrates x_i to obtain the integrated search region features x_i^reg with size 24 × 24 × 256. The first regression branch then takes z_0^reg as the convolution kernel and x_i^reg as the convolved region and performs a convolution operation to obtain the regression cross-correlation response of the initial template, R_init^reg; it then takes z_i^reg as the convolution kernel and x_i^reg as the convolved region and performs a convolution operation to obtain the regression cross-correlation response of the fused template, R_fuse^reg. Both R_init^reg and R_fuse^reg have size 22 × 22 × 256. Finally, the first regression branch outputs R_init^reg and R_fuse^reg to the response fusion module.
The response fusion module is a convolutional neural network connected with the cross-correlation response module and the target frame output module. The module consists of two parallel neural network branches: a second classification branch and a second regression branch. The second classification branch comprises 5 layers, of which the 1st and 3rd layers are convolutional layers, the 2nd and 4th layers are BatchNorm layers, and the last layer is a ReLU activation function layer. The second classification branch receives the two classification cross-correlation responses R_init^cls and R_fuse^cls from the first classification branch, stacks R_init^cls and R_fuse^cls in the channel dimension to generate a classification stacking response of size 22 × 22 × 512, then performs classification fusion on the classification stacking response to obtain the 22 × 22 × 256 classification fusion response R^cls, and sends R^cls to the target frame output module. The network structure of the second regression branch is the same as that of the second classification branch. The second regression branch receives the two regression cross-correlation responses R_init^reg and R_fuse^reg from the first regression branch, stacks R_init^reg and R_fuse^reg in the channel dimension to generate a regression stacking response of size 22 × 22 × 512, then performs regression fusion on the regression stacking response to obtain the 22 × 22 × 256 regression fusion response R^reg, and sends R^reg to the target frame output module.
The target frame output module is connected with the response fusion module and consists of two parallel neural network branches: a third classification branch and a third regression branch. The third classification branch comprises 4 layers, of which the 1st and 4th layers are convolutional layers (with 1 × 1 convolution kernels), the 2nd layer is a BatchNorm layer, and the 3rd layer is a ReLU activation function layer. The third classification branch receives the classification fusion response R^cls from the second classification branch and performs response classification on R^cls to obtain the classification result A^cls, whose dimension is 22 × 22 × 2k, where k is the number of anchor frames in the Region Proposal Network (RPN); 2k means that there are k anchor frames and each anchor frame corresponds to 2 classification values, which respectively represent the probability that the image in the anchor frame is the target and is not the target. The network structure of the third regression branch is the same as that of the third classification branch. The third regression branch receives the regression fusion response R^reg from the second regression branch and performs response regression on R^reg to obtain the regression result A^reg, whose dimension is 22 × 22 × 4k, where k is the same as in the third classification branch, i.e. the number of anchor frames in the RPN; 4k means that there are k anchor frames and each anchor frame corresponds to 4 regression values dx, dy, dw and dh, which respectively represent the correction values of the x and y coordinates and of the length and width of the corresponding original anchor frame. The target frame output module selects the anchor frame with the maximum target probability in the classification result as the target prediction frame, takes out the four correction values in the regression result corresponding to that anchor frame, and uses them to correct the position and size of the anchor frame; the corrected anchor frame is the tracking frame of the target.
Secondly, preparing a training data set of the target tracking system, wherein the method comprises the following steps:
the training data set of the system is divided into two parts: first training data set T1And a second training data set T2,T1For training feature extraction module, cross-correlation response module and target box output module, T2For training the response fusion module.
2.1 Select 100000 positive sample pairs from VID and YTB by: sampling each video sequence of VID and YTB, randomly selecting a frame from a video sequence as the template image, and randomly selecting a frame within a range of no more than one hundred frames after the template image as the search area image; the two images selected in this way form 1 positive sample pair, and 100000 positive sample pairs are generated in this sampling manner. VID and YTB are video data sets in which each video contains a specific target, and the makers of the data sets have annotated a target frame for every video frame; the target frame is annotated by the coordinates of the upper-left corner point of a rectangular frame and the length and width of the rectangular frame, and the rectangular frame marks the target position.
2.2 select 100000 negative sample pairs from VID and YTB by: randomly selecting one frame from one video sequence of VID and YTB as a template image, and randomly selecting one frame from the other video sequence as a search area image, and taking the two images selected in the way as 1 negative sample pair. 100000 negative sample pairs are generated in this sampling manner.
2.3 Choose 100000 positive sample pairs from DET and COCO by: randomly selecting two different images of the same object from DET and COCO as 1 positive sample pair; 100000 positive sample pairs are generated in this sampling manner. One sample in the positive sample pair serves as the template image and the other as the search area image. DET and COCO are target detection data sets containing target box labels.
2.4 select 100000 negative sample pairs from DET and COCO by:
2.4.1 select two images of the same kind but not the same object from DET and COCO, one as template image and the other as search area image, to get 1 negative sample pair. 50000 negative sample pairs were generated in this sampling manner.
2.4.2 select two different objects of different classes, one as a template image and the other as a search area image, from DET and COCO to obtain 1 negative sample pair. 50000 negative sample pairs were generated in this sampling manner.
100000 negative sample pairs are finally obtained through 2.4.1 and 2.4.2.
2.5 Scale the template images in all positive and negative sample pairs to a size of 127 × 127 × 3 and all search area images to a size of 287 × 287 × 3. Take all finally obtained positive and negative sample pairs as the first training data set T_1.
2.6 Choose the training set of GOT-10k as the second training data set T_2.
Thirdly, training a target tracking system by using a training data set, wherein the specific method comprises the following steps:
3.1 initializing the parameters of the feature extraction module by using AlexNet network parameters pre-trained on ImageNet, and initializing the parameters of the cross-correlation response module, the response fusion module and the target frame output module by using a Kaiming initialization method.
3.2 Use T1 to train the feature extraction module, the cross-correlation response module and the target frame output module by the following steps:
3.2.1 Set the total number of training iteration rounds to 50 and initialize epoch to 1; initialize the data batch input size batchsize to 128; initialize the learning rate lr to 0.01, set the learning rate of the last round to 0.0005, and decay the learning rate exponentially during training; initialize the hyper-parameter λ to 1.2; define the number of samples in T1 as Len(T1).
3.2.2 Use T1 to train the feature extraction module, the cross-correlation response module and the target frame output module by the following steps:
3.2.2.1 initialization variable d is 1.
3.2.2.2 Take the d-th to the (d + batchsize)-th sample pairs of T1 as training data and input them into the feature extraction module; the data flow is feature extraction module → cross-correlation response module → target frame output module. Train the feature extraction module, the cross-correlation response module and the target frame output module with the stochastic gradient descent (SGD) method to minimize the loss function, so as to update the network parameters of the convolutional neural network sub-module, the cross-correlation response module and the target frame output module. The loss function is a combination of the classification loss L_cls and the regression loss L_reg, of the form:

L = L_cls + λ·L_reg

where L is the total loss function, L_cls is the classification loss function, obtained by computing the cross-entropy loss between the true target box and the predicted boxes in the search area, and L_reg is the regression loss function, obtained by computing the SmoothL1 loss between each predicted box and the true target box (a sketch of this loss and of the epoch-dependent layer freezing is given after step 3.2.2.8 below).
3.2.2.3 Let d = d + batchsize. If d > Len(T1), go to 3.2.2.4; if d ≤ Len(T1), go to 3.2.2.2.
3.2.2.4 If epoch ≤ 10, go to 3.2.2.5; if epoch > 10, go to 3.2.2.6.
3.2.2.5 Set all 16 layers of network parameters of the convolutional neural network sub-module to fixed (i.e., untrained), go to 3.2.2.7.
3.2.2.6 Set the first 11 layers of parameters of the convolutional neural network sub-module to fixed and the last 5 layers to trainable, go to 3.2.2.7.
3.2.2.7 If epoch < 50, let epoch = epoch + 1 and lr = 0.5 × lr, go to 3.2.2.1; if epoch = 50, go to 3.2.2.8.
3.2.2.8 Take the parameters of the convolutional neural network sub-module, the cross-correlation response module and the target frame output module updated after training as the final parameters of these modules.
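The loss of 3.2.2.2 and the layer-freezing schedule of 3.2.2.4-3.2.2.7 can be sketched in PyTorch as follows. This is a simplified reading under the assumption that the backbone exposes its 16 layers in order; the model, loader and variable names are placeholders.

import torch
import torch.nn.functional as F

def tracking_loss(cls_logits, cls_labels, reg_pred, reg_target, lam=1.2):
    # L = L_cls + lambda * L_reg, with cross-entropy classification loss and
    # SmoothL1 regression loss (illustrative shapes: (N, 2), (N,), (N, 4), (N, 4)).
    l_cls = F.cross_entropy(cls_logits, cls_labels)
    l_reg = F.smooth_l1_loss(reg_pred, reg_target)
    return l_cls + lam * l_reg

def set_backbone_trainable(backbone, epoch):
    # Epochs 1-10: the whole 16-layer backbone stays fixed; afterwards the first
    # 11 layers stay fixed and the last 5 layers become trainable.
    layers = list(backbone.children())
    for idx, layer in enumerate(layers):
        trainable = epoch > 10 and idx >= len(layers) - 5
        for p in layer.parameters():
            p.requires_grad = trainable

# One possible outer loop for stage-one training (3.2.2):
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# for epoch in range(1, 51):
#     set_backbone_trainable(model.backbone, epoch)
#     for template, search, cls_labels, reg_targets in loader_T1:  # batches of 128 sample pairs
#         cls_logits, reg_pred = model(template, search)
#         loss = tracking_loss(cls_logits, cls_labels, reg_pred, reg_targets)
#         optimizer.zero_grad(); loss.backward(); optimizer.step()
#     for g in optimizer.param_groups:
#         g['lr'] *= 0.5  # halved each round as in 3.2.2.7

The cross-entropy and SmoothL1 calls mirror the L_cls and L_reg terms named in the text, and the freezing indices follow the 16-layer backbone of claim 2.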
3.3 Input all video frames of T2 into the convolutional neural network sub-module and the cross-correlation response module, and store the outputs for every video frame of T2: the initial template classification response R^0_cls, the fused template classification response R^f_cls and the GT classification response R^GT_cls of the classification branch (the GT response is the response computed between the target template corresponding to the ground truth and the search area), and the initial template regression response R^0_reg, the fused template regression response R^f_reg and the GT regression response R^GT_reg of the regression branch. Take the six classification and regression responses corresponding to each video frame of T2 as a third training data set T2′.
3.4 Use T2′ to train the response fusion module by the following steps:
3.4.1 Set the total number of training iteration rounds to 50 and initialize epoch to 1; initialize the data batch input size batchsize to 64; initialize the learning rate lr to 10e-6 and set the learning rate of the last round to 10e-9; define the number of samples in T2′ as Len(T2′).
3.4.2 Use T2′ to train the response fusion module; the specific steps are as follows:
3.4.2.1 initializes a variable d to 1.
3.4.2.2 Take the d-th to the (d + batchsize)-th samples of T2′ as training data and train the response fusion module with the SGD algorithm to optimize its parameters. The loss function of the fusion module is the Euclidean distance loss, of the form:

L_fuse = ‖R^F − R^GT‖

where R^F is the fusion response output by the response fusion module and R^GT is the GT response, i.e. the response between the GT box and the search area; the purpose of using the Euclidean distance is to make the fusion response as similar as possible to the GT response (a sketch of this loss is given after step 3.4.2.5 below).
3.4.2.3 Let d = d + batchsize. If d > Len(T2′), go to 3.4.2.4; if d ≤ Len(T2′), go to 3.4.2.2.
3.4.2.4 If epoch < 50, let epoch = epoch + 1 and lr = 0.5 × lr, go to 3.4.2.1; if epoch = 50, go to 3.4.2.5.
3.4.2.5, using the parameters of the response fusion module obtained after the last round of training as the network parameters of the final response fusion module.
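A short sketch of the Euclidean-distance objective used in 3.4.2.2; the function name and the use of the Frobenius norm over the whole response map are assumptions consistent with the text.

import torch

def fusion_loss(fused_response, gt_response):
    # Pull the fusion response produced by the response fusion module toward
    # the GT response computed between the GT box and the search area.
    return torch.norm(fused_response - gt_response)

# loss = fusion_loss(r_fuse_cls, r_gt_cls) + fusion_loss(r_fuse_reg, r_gt_reg)  # one possible combination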
Fourthly, tracking the target by using the trained target tracking system, wherein the method comprises the following steps:
4.1 Acquire the video stream I_0, …, I_i, …, I_N from the camera in real time; the target tracking system processes each frame in turn. Here I_i is the i-th frame of the video and N is the total number of video frames. Initialize the variable i to 1.
4.2 The feature extraction module obtains the target image Z_0 of size 127 × 127 × 3 from the first frame I_0, and performs feature extraction on Z_0 to obtain the initial template feature z_0 of size 6 × 6 × 256.
4.3 If i = 1, let the fusion template feature ẑ_1 = z_0 and go to 4.6; if i > 1, go to 4.4.
4.4 Before tracking the i-th frame, the target tracking system has the fusion template feature ẑ_{i-1} used in tracking the (i-1)-th frame and the tracking result Z_{i-1} of the (i-1)-th frame. The feature extraction module performs feature extraction on Z_{i-1} to obtain the tracking result feature z_{i-1} of the (i-1)-th frame.
4.5 The linear fusion sub-module of the feature extraction module fuses the initial template z_0, the fusion template ẑ_{i-1} used in tracking the (i-1)-th frame and the (i-1)-th frame tracking result feature z_{i-1} to generate the fusion template feature ẑ_i used for tracking the i-th frame. The fusion linearly weights the features with the preset parameters λ1 = 0.99 and λ2 = 0.01.
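The exact weighted combination in 4.5 is given as a formula image in the original publication; the sketch below is one plausible reading in which the fusion template is carried forward and refreshed with a small fraction of the newest tracking result, using the stated weights λ1 = 0.99 and λ2 = 0.01. The update rule itself is an assumption.

def update_fusion_template(z0, z_fused_prev, z_prev_result, lam1=0.99, lam2=0.01):
    # z0: initial template feature; z_fused_prev: fusion template used for frame i-1;
    # z_prev_result: feature of the frame i-1 tracking result.
    # Assumed combination: keep most of the running template, blend in the newest result.
    return lam1 * z_fused_prev + lam2 * z_prev_result

# z_fused = z0                                            # frame 1 (step 4.3)
# z_fused = update_fusion_template(z0, z_fused, z_prev)   # every later frame (step 4.5)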
4.6 Taking the center coordinate of Z_{i-1} in I_{i-1} as the center, the feature extraction module selects on I_i an image area of size 287 × 287 × 3 as the target search area X_i, and performs feature extraction on X_i to obtain the search area feature x_i of size 26 × 26 × 256.
4.7 The cross-correlation response module receives z_0, ẑ_i and x_i from the feature extraction module. The first classification branch first processes z_0 and x_i to obtain the initial template classification response R^0_cls, and then processes ẑ_i and x_i to obtain the fused template classification response R^f_cls. The first regression branch first processes z_0 and x_i to obtain the initial template regression response R^0_reg, and then processes ẑ_i and x_i to obtain the fused template regression response R^f_reg. The four responses all have a size of 22 × 22 × 256.
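As claim 1 states, each response in 4.7 is obtained by a convolution in which the integrated template feature serves as the kernel and the integrated search feature as the convolved region. A per-sample PyTorch sketch of this cross-correlation follows; it produces a single-channel response map, whereas the patented branches keep 256 response channels through their own convolution sub-modules, so the shapes and variable names here are illustrative.

import torch
import torch.nn.functional as F

def cross_correlation(template_feat, search_feat):
    # template_feat: (C, th, tw) used as the convolution kernel;
    # search_feat:   (C, H, W)   used as the convolved region.
    kernel = template_feat.unsqueeze(0)          # (1, C, th, tw)
    region = search_feat.unsqueeze(0)            # (1, C, H, W)
    return F.conv2d(region, kernel).squeeze(0)   # (1, H - th + 1, W - tw + 1)

# r0_cls = cross_correlation(z0_cls, x_cls)   # initial-template classification response
# rf_cls = cross_correlation(zf_cls, x_cls)   # fused-template classification response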
4.8 The response fusion module receives R^0_cls, R^f_cls, R^0_reg and R^f_reg. The second classification branch fuses R^0_cls and R^f_cls to obtain the classification fusion response R^F_cls, and the second regression branch fuses R^0_reg and R^f_reg to obtain the regression fusion response R^F_reg. Both fusions adopt a residual-connection fusion mode: the two classification (respectively regression) cross-correlation responses are stacked in the channel dimension, the stacked response is input into the second classification (respectively regression) branch for fusion, and the branch output is combined with the corresponding cross-correlation response through a residual connection, finally generating the classification fusion response R^F_cls (respectively the regression fusion response R^F_reg).
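A sketch of the residual-connection fusion performed by the second classification branch (the second regression branch is structurally identical): the two cross-correlation responses are stacked on the channel dimension, fused by a small convolutional branch laid out as in claim 5, and a residual connection adds one of the inputs back to the branch output. The kernel sizes, the padding, and the choice of the fused-template response as the residual term are assumptions.

import torch
import torch.nn as nn

class ResponseFusionBranch(nn.Module):
    # Second classification (or regression) branch: conv, BN, conv, BN, ReLU (claim 5),
    # mapping the 512-channel stacked response back to 256 channels.
    def __init__(self, channels=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, r_init, r_fused):
        stacked = torch.cat([r_init, r_fused], dim=1)  # 22 x 22 x 512 stacking response
        return self.fuse(stacked) + r_fused            # residual connection (assumed on r_fused)

# fuse_cls = ResponseFusionBranch()
# r_F_cls = fuse_cls(r0_cls, rf_cls)   # classification fusion response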
4.9 The target frame output module receives R^F_cls and R^F_reg. The third classification branch processes R^F_cls to obtain the classification result A_cls of dimension 22 × 22 × 2k, where k is the number of anchor boxes; 22 × 22 × 2k means that there are k anchor boxes at each point of the 22 × 22 grid and each anchor box corresponds to 2 classification values, which respectively represent the probability that the anchor box is the target and is not the target. The third regression branch processes R^F_reg to obtain the regression result A_reg of dimension 22 × 22 × 4k; 22 × 22 × 4k means that there are k anchor boxes at each point of the 22 × 22 grid and each anchor box corresponds to 4 regression values dx, dy, dw and dh, which respectively represent the corrections of the anchor box center coordinates x and y and of its length and width towards the actual target box.
4.10 The third classification branch searches the classification result A_cls for the anchor box (x, y, w, h) with the highest target probability, where x and y are the coordinates of the anchor box center in the original image and w and h are the length and width of the anchor box.
4.11 The third regression branch looks up in the regression result A_reg the correction values (dx, dy, dw, dh) corresponding to the anchor box (x, y, w, h) obtained in 4.10 and corrects the anchor box with them: dx and dy correct the center coordinates and dw and dh correct the length and width. The corrected box (x′, y′, w′, h′) is the target box, where (x′, y′) are the coordinates of the target box center and w′ and h′ are its length and width. The target image Z_i of the i-th frame is obtained from this target box; Z_i is the tracking result of the i-th frame.
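The correction in 4.11 applies dx and dy to the anchor center and dw and dh to the anchor size. The decoding below uses the standard RPN-style form (additive center shift scaled by the anchor size, exponential size correction); since the patent's formula images are not reproduced here, that exact form is an assumption.

import math

def decode_anchor(x, y, w, h, dx, dy, dw, dh):
    # (x, y, w, h): highest-scoring anchor from 4.10;
    # (dx, dy, dw, dh): its correction values from the regression result.
    cx = x + dx * w          # corrected center x
    cy = y + dy * h          # corrected center y
    cw = w * math.exp(dw)    # corrected width (assumed exponential form)
    ch = h * math.exp(dh)    # corrected height (assumed exponential form)
    return cx, cy, cw, ch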
4.12 If i < N, let i = i + 1 and go to 4.4; if i = N, go to 4.13.
4.13 The tracking results Z_0, Z_1, …, Z_N of all frames of the video sequence have been obtained, and the process ends.
The target tracking field uses the success rate (SUC) and the precision (PRE) to characterize tracking accuracy. SUC denotes the overlap score between the predicted target box and the GT box, and PRE denotes the percentage of frames whose center location error is below a given threshold over the total number of frames. Higher SUC and PRE indicate better tracking performance. Tracking speed is measured in FPS (frames per second), i.e. the number of frames processed per second; the larger the FPS, the faster the tracking.
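For reference, the overlap behind SUC and the center-error criterion behind PRE can be computed per frame as sketched below; the 20-pixel threshold and the (center x, center y, width, height) box format are common conventions and are assumptions here, not values taken from the patent.

def iou(box_a, box_b):
    # Overlap between two boxes given as (cx, cy, w, h).
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def precision(pred_boxes, gt_boxes, threshold=20.0):
    # Fraction of frames whose center location error is below `threshold` pixels.
    hits = sum(1 for p, g in zip(pred_boxes, gt_boxes)
               if ((p[0] - g[0]) ** 2 + (p[1] - g[1]) ** 2) ** 0.5 < threshold)
    return hits / len(pred_boxes)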
Table 1 shows the results of comparing the present invention with eight other high-performance target tracking methods on an OTB-100 dataset.
TABLE 1 comparison of test indexes of the present invention on OTB-100 data set with other eight high-performance target tracking methods
The first row of Table 1 lists the abbreviations of the eight tracking algorithms compared, the second row gives the SUC values measured by these algorithms on the OTB-100 dataset, and the third row gives the PRE values. Bold font indicates the best value. As can be seen from Table 1, the present invention outperforms these eight high-performance algorithms in both SUC and PRE; compared with DaSiamRPN, the best of the eight algorithms, the present invention improves SUC by 2.8% and PRE by 1.4%. The invention thus improves target tracking accuracy.
While the present invention has been described with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Any modification which does not depart from the functional and structural principles of the present invention is intended to be included within the scope of the claims.

Claims (11)

1. A target tracking method based on dual-template response fusion is characterized by comprising the following steps:
firstly, a target tracking system is set up, and the target tracking system consists of a feature extraction module, a cross-correlation response module, a response fusion module and a target frame output module;
the feature extraction module is connected with the cross-correlation response module and consists of a convolutional neural network sub-module and a linear fusion sub-module; the convolutional neural network sub-module is used for extracting features of the input image: it receives from the outside the template image Z_0 of the first frame of the video and then the target search area image of each frame, performs feature extraction on Z_0 and on the target search area image X_i of the i-th frame respectively, sends the resulting initial template feature z_0 and search area feature x_i together to the cross-correlation response module, and at the same time sends the initial template feature z_0 to the linear fusion sub-module; Z_0 is the target to be tracked, and z_0 is the initial template feature obtained by the feature extraction module performing feature extraction on Z_0; the task of tracking is to find in X_i the target most similar to Z_0, and the feature extraction module performs feature extraction on X_i to obtain the search area feature x_i;
the linear fusion sub-module takes as input the initial template feature z_0, the fusion template feature ẑ_{i-1} of the previous frame, i.e. the (i-1)-th frame, and the tracked target feature z_{i-1} of the (i-1)-th frame, where z_0 is the output of the convolutional neural network sub-module, ẑ_{i-1} is the output of the linear fusion sub-module in the tracking of the (i-1)-th frame, and z_{i-1} is the feature of the tracking result image of the (i-1)-th frame; the linear fusion sub-module fuses the three features z_0, ẑ_{i-1} and z_{i-1} by linear weighting to obtain the fusion template feature ẑ_i of the current frame, i.e. the i-th frame, and sends ẑ_i to the cross-correlation response module; in the first frame the fusion template feature is the initial template feature z_0; thereafter, the fusion template feature of the i-th frame is used in the target tracking task of the (i+1)-th frame to obtain the target tracking result of the (i+1)-th frame;
the cross-correlation response module is connected with the feature extraction module and the response fusion module; this module consists of two parallel branches, a first classification branch and a first regression branch, both of which are convolutional neural networks with identical network structures but different network parameters; the first classification branch is used for generating classification cross-correlation responses and consists of two convolution sub-modules with the same structure, a classification kernel sub-module and a classification search sub-module; the classification kernel sub-module first receives z_0 from the convolutional neural network sub-module and generates the integrated initial template feature [z_0]_cls, and then receives ẑ_i from the linear fusion sub-module and generates the further integrated fusion template feature [ẑ_i]_cls; the classification search sub-module receives x_i from the convolutional neural network sub-module and integrates x_i to obtain the integrated search area feature [x_i]_cls; the first classification branch then performs a convolution operation with [z_0]_cls as the convolution kernel and [x_i]_cls as the convolved region to obtain the classification cross-correlation response R^0_cls of the initial template, and then performs a convolution operation with [ẑ_i]_cls as the convolution kernel and [x_i]_cls as the convolved region to obtain the classification cross-correlation response R^f_cls of the fusion template; finally, the first classification branch outputs R^0_cls and R^f_cls to the response fusion module;
the first regression branch is used for generating regression cross-correlation responses; the first regression branch contains two sub-modules, a regression kernel sub-module and a regression search sub-module; the regression kernel sub-module first receives z_0 from the convolutional neural network sub-module and generates the integrated initial template feature [z_0]_reg, and then receives ẑ_i from the linear fusion sub-module and generates the integrated fusion template feature [ẑ_i]_reg; the regression search sub-module receives x_i from the convolutional neural network sub-module and integrates x_i to obtain the integrated search area feature [x_i]_reg; the first regression branch then performs a convolution operation with [z_0]_reg as the convolution kernel and [x_i]_reg as the convolved region to obtain the regression cross-correlation response R^0_reg of the initial template, and then performs a convolution operation with [ẑ_i]_reg as the convolution kernel and [x_i]_reg as the convolved region to obtain the regression cross-correlation response R^f_reg of the fusion template; finally, the first regression branch outputs R^0_reg and R^f_reg to the response fusion module;
the response fusion module is a convolutional neural network and is connected with the cross-correlation response module and the target frame output module; this module consists of two parallel neural network branches, a second classification branch and a second regression branch; the second classification branch receives the two classification cross-correlation responses R^0_cls and R^f_cls from the first classification branch, stacks R^0_cls and R^f_cls in the channel dimension to generate a classification stacking response, then performs classification fusion on the classification stacking response to obtain the classification fusion response R^F_cls, and sends R^F_cls to the target frame output module; the second regression branch receives the two regression cross-correlation responses R^0_reg and R^f_reg from the first regression branch, stacks R^0_reg and R^f_reg in the channel dimension to generate a regression stacking response, then performs regression fusion on the regression stacking response to obtain the regression fusion response R^F_reg, and sends R^F_reg to the target frame output module;
the target frame output module is connected with the response fusion module; the target frame output module consists of two parallel neural network branches, a third classification branch and a third regression branch; the third classification branch receives the classification fusion response R^F_cls from the second classification branch and performs response classification on R^F_cls to obtain the classification result A_cls; the size of A_cls is 22 × 22 × 2k, where k is the number of anchor boxes in the region proposal network (RPN); 2k means that there are k anchor boxes and each anchor box corresponds to 2 classification values, which respectively represent the probability that the image in the anchor box is the target and is not the target; the third regression branch receives the regression fusion response R^F_reg from the second regression branch and performs response regression on R^F_reg to obtain the regression result A_reg; the size of A_reg is 22 × 22 × 4k; 4k means that there are k anchor boxes and each anchor box corresponds to 4 regression values dx, dy, dw and dh, which respectively represent the correction values of the x and y coordinates and of the length and width of the corresponding original anchor box; the target frame output module selects the anchor box with the highest target probability in the classification result as the target prediction box, takes out the four correction values in the regression result corresponding to this anchor box to correct the position and size of the anchor box, and the corrected anchor box is the tracking box of the target;
secondly, preparing a training data set of the target tracking system, the training data set being divided into two parts, a first training data set T1 and a second training data set T2, by the following steps:
2.1 select 100000 positive sample pairs from VID and YTB as follows: sample each video sequence of VID and YTB, randomly select one frame from a video sequence as the template image, randomly select one frame within a range of no more than one hundred frames after the template image as the search area image, and take the two images selected in this way as 1 positive sample pair; 100000 positive sample pairs are generated in this sampling manner; VID and YTB are video data sets, each video contains a specific target, and each video frame is marked with a target box, the target box being given by the coordinates of the upper-left corner of a rectangular box and the length and width of the rectangular box, the rectangular box framing the target position;
2.2 select 100000 negative sample pairs from VID and YTB by: randomly selecting a frame from a certain video sequence of VID and YTB as a template image, randomly selecting a frame from another video sequence as a search area image, and taking the two images selected in the way as 1 negative sample pair; generating 100000 negative sample pairs in the sampling mode;
2.3 select 100000 positive sample pairs from DET and COCO as follows: randomly select two different images of the same object from DET and COCO as 1 positive sample pair; 100000 positive sample pairs are generated in this sampling manner; one sample in the positive sample pair serves as the template image and the other as the search area image; DET and COCO are target detection data sets containing target box labels;
2.4 select 100000 negative sample pairs from DET and COCO by:
2.4.1 selecting two images of the same type but not the same object from DET and COCO respectively, wherein one image is used as a template image, and the other image is used as a search area image to obtain 1 negative sample pair; 50000 negative sample pairs are generated by the sampling mode;
2.4.2 select from DET and COCO images of two objects of different classes, one as the template image and the other as the search area image, to obtain 1 negative sample pair; 50000 negative sample pairs are generated in this sampling manner;
2.5 scaling the template images in all positive and negative sample pairs to 127 × 127 × 3 size, all search area images to 287 × 287 × 3 size; taking all the positive and negative sample pairs after scaling as T1
2.6 choosing the training set of GOT-10k as T2
Thirdly, training a target tracking system by using a training data set, wherein the specific method comprises the following steps:
3.1 initializing the parameters of the feature extraction module by using AlexNet network parameters pre-trained on ImageNet, and initializing the parameters of the cross-correlation response module, the response fusion module and the target frame output module by using a Kaiming initialization method;
3.2 use T1 to train the feature extraction module, the cross-correlation response module and the target frame output module by the following steps:
3.2.1 set the total number of training iteration rounds to 50 and initialize epoch to 1; initialize the data batch input size batchsize to 128; initialize the learning rate lr to 0.01, set the learning rate of the last round to 0.0005, and decay the learning rate exponentially during training; initialize the hyper-parameter λ to 1.2; define the number of samples in T1 as Len(T1);
3.2.2 use T1 to train the feature extraction module, the cross-correlation response module and the target frame output module, and take the parameters of the convolutional neural network sub-module, the cross-correlation response module and the target frame output module updated after training as the parameters of these modules;
3.3 input all video frames of T2 into the convolutional neural network sub-module and the cross-correlation response module, and store the outputs for every video frame of T2: the initial template classification response R^0_cls, the fused template classification response R^f_cls and the GT classification response R^GT_cls of the classification branch, and the initial template regression response R^0_reg, the fused template regression response R^f_reg and the GT regression response R^GT_reg of the regression branch; take the six classification and regression responses corresponding to each video frame of T2 as a third training data set T2′;
3.4 use T2′ to train the response fusion module, and take the response fusion module parameters obtained by training as the network parameters of the final response fusion module;
fourthly, tracking the target by using the trained target tracking system, wherein the method comprises the following steps:
4.1 acquire the video stream I_0, …, I_i, …, I_N from the camera in real time; the target tracking system processes each frame in turn, where I_i is the i-th frame of the video and N is the total number of video frames; initialize the variable i to 1;
4.2 the feature extraction module obtains the target image Z_0 from the first frame I_0 and performs feature extraction on Z_0 to obtain the initial template feature z_0;
4.3 if i = 1, let the fusion template feature ẑ_1 = z_0 and go to 4.6; if i > 1, go to 4.4;
4.4 use the feature extraction module to perform feature extraction on Z_{i-1} to obtain the tracking result feature z_{i-1} of the (i-1)-th frame;
4.5 the linear fusion sub-module of the feature extraction module fuses the initial template z_0, the fusion template ẑ_{i-1} used in tracking the (i-1)-th frame and the (i-1)-th frame tracking result feature z_{i-1} to generate the fusion template feature ẑ_i used for tracking the i-th frame;
4.6 taking the center coordinate of Z_{i-1} in I_{i-1} as the center, the feature extraction module selects on I_i the target search area X_i and performs feature extraction on X_i to obtain the search area feature x_i;
4.7 the cross-correlation response module receives z_0, ẑ_i and x_i from the feature extraction module; the first classification branch first processes z_0 and x_i to obtain the initial template classification response R^0_cls, and then processes ẑ_i and x_i to obtain the fused template classification response R^f_cls; the first regression branch first processes z_0 and x_i to obtain the initial template regression response R^0_reg, and then processes ẑ_i and x_i to obtain the fused template regression response R^f_reg;
4.8 the response fusion module receives R^0_cls, R^f_cls, R^0_reg and R^f_reg; the second classification branch fuses R^0_cls and R^f_cls to obtain the classification fusion response R^F_cls, and the second regression branch fuses R^0_reg and R^f_reg to obtain the regression fusion response R^F_reg;
4.9 the target frame output module receives R^F_cls and R^F_reg; the third classification branch processes R^F_cls to obtain the classification result A_cls, whose dimension is 22 × 22 × 2k, where k is the number of anchor boxes; 22 × 22 × 2k means that there are k anchor boxes at each point of the 22 × 22 grid and each anchor box corresponds to 2 classification values, which respectively represent the probability that the anchor box is the target and is not the target; the third regression branch processes R^F_reg to obtain the regression result A_reg, whose dimension is 22 × 22 × 4k; 22 × 22 × 4k means that there are k anchor boxes at each point of the 22 × 22 grid and each anchor box corresponds to 4 regression values dx, dy, dw and dh, which respectively represent the corrections of the anchor box center coordinates x and y and of its length and width towards the actual target box;
4.10 the third classification branch searches the classification result A_cls for the anchor box (x, y, w, h) with the highest target probability, where x and y are the coordinates of the anchor box center in the original image and w and h are the length and width of the anchor box;
4.11 the third regression branch looks up in the regression result A_reg the correction values (dx, dy, dw, dh) corresponding to the anchor box (x, y, w, h) and corrects the anchor box with them, dx and dy correcting the center coordinates and dw and dh correcting the length and width; the corrected box (x′, y′, w′, h′) is the target box, where (x′, y′) are the coordinates of the target box center and w′ and h′ are its length and width; the target image Z_i of the i-th frame is obtained from this target box, and Z_i is the tracking result of the i-th frame;
4.12 if i < N, let i = i + 1 and go to 4.4; if i = N, go to 4.13;
4.13 the tracking results Z_0, Z_1, …, Z_N of all frames of the video sequence have been obtained, and the process ends.
2. The target tracking method based on dual-template response fusion of claim 1, wherein the convolutional neural network submodule is a modified AlexNet, the modified AlexNet comprises 5 convolutional layers, 2 maximum pooling layers, 5 BatchNorm layers and 4 ReLU activation function layers, and comprises 16 layers, wherein the convolutional layers are respectively the 1 st, 5 th, 9 th, 12 th and 15 th layers, the pooling layers are respectively the 3 rd and 7 th layers, the BatchNorm layers are respectively the 2 nd, 6 th, 10 th, 13 th and 16 th layers, and the rest layers are all ReLU activation function layers.
3. The target tracking method based on dual-template response fusion of claim 1, wherein Z_0 has a size of 127 × 127 × 3 and the initial template feature z_0 has a size of 6 × 6 × 256; X_i has a size of 287 × 287 × 3 and the search area feature x_i has a size of 26 × 26 × 256; the meaning of the sizes is: the first two values are the length and width of the image respectively, and the third value represents the number of channels.
4. The method of claim 1, wherein the classification kernel module and the classification search submodule of the first classification branch each comprise 1 convolution layer, 1 BatchNorm layer, and 1 ReLU activation function layer; the network structures of the regression kernel module and the regression search submodule are respectively the same as the network structure of the classification kernel module.
5. The method of claim 1, wherein the second classification branch comprises 5 layers, wherein the 1 st and 3 rd layers are convolutional layers, the 2 nd and 4 th layers are BatchNorm layers, and the last layer is a ReLU activation function layer; the network structure of the second regression branch is the same as the second classification branch.
6. The target tracking method based on dual-template response fusion of claim 1, wherein the third classification branch comprises 4 layers, wherein the 1 st and 4 th layers are convolution layers (convolution kernel size 1 x 1), the 2 nd layer is a BatchNorm layer, and the 3 rd layer is a ReLU activation function layer; the network structure of the third regression branch is the same as that of the third classification branch.
7. The target tracking method based on dual-template response fusion of claim 1, wherein [z_0]_cls and [ẑ_i]_cls both have a size of 4 × 4 × 256, [x_i]_cls has a size of 24 × 24 × 256, and R^0_cls and R^f_cls both have a size of 22 × 22 × 256; [z_0]_reg and [ẑ_i]_reg both have a size of 4 × 4 × 256, [x_i]_reg has a size of 24 × 24 × 256, and R^0_reg and R^f_reg both have a size of 22 × 22 × 256; the classification stacking response has a size of 22 × 22 × 512 and the classification fusion response R^F_cls has a size of 22 × 22 × 256; the regression stacking response has a size of 22 × 22 × 512 and the regression fusion response R^F_reg has a size of 22 × 22 × 256.
8. The target tracking method based on dual-template response fusion of claim 1, wherein step 3.2.2 uses T1 to train the feature extraction module, the cross-correlation response module and the target frame output module by the following method:
3.2.2.1 initialize the variable d to 1;
3.2.2.2 take the d-th to the (d + batchsize)-th sample pairs of T1 as training data and input them into the feature extraction module, the data flow being feature extraction module → cross-correlation response module → target frame output module; train the feature extraction module, the cross-correlation response module and the target frame output module with the stochastic gradient descent method to minimize the loss function, so as to update the network parameters of the convolutional neural network sub-module, the cross-correlation response module and the target frame output module; the loss function is a combination of the classification loss L_cls and the regression loss L_reg, of the form:
L = L_cls + λ·L_reg
where L is the total loss function, L_cls is the classification loss function, obtained by computing the cross-entropy loss between the true target box and the predicted boxes in the search area, and L_reg is the regression loss function, obtained by computing the SmoothL1 loss between each predicted box and the true target box;
3.2.2.3 let d = d + batchsize; if d > Len(T1), go to 3.2.2.4; if d ≤ Len(T1), go to 3.2.2.2;
3.2.2.4 if epoch ≤ 10, go to 3.2.2.5; if epoch > 10, go to 3.2.2.6;
3.2.2.5 set all 16 layers of network parameters of the convolutional neural network sub-module to fixed, go to 3.2.2.7;
3.2.2.6 set the first 11 layers of parameters of the convolutional neural network sub-module to fixed and the last 5 layers to trainable, go to 3.2.2.7;
3.2.2.7 if epoch < 50, let epoch = epoch + 1 and lr = 0.5 × lr, go to 3.2.2.1; if epoch = 50, go to 3.2.2.8;
3.2.2.8 take the parameters of the convolutional neural network sub-module, the cross-correlation response module and the target frame output module updated after training as the final parameters of these modules.
9. The target tracking method based on dual-template response fusion of claim 1, wherein step 3.4 uses T2′ to train the response fusion module by the following method:
3.4.1 set the total number of training iteration rounds to 50 and initialize epoch to 1; initialize the data batch input size batchsize to 64; initialize the learning rate lr to 10e-6 and set the learning rate of the last round to 10e-9; define the number of samples in T2′ as Len(T2′);
3.4.2 use T2′ to train the response fusion module; the specific steps are as follows:
3.4.2.1 initialize the variable d to 1;
3.4.2.2 take the d-th to the (d + batchsize)-th samples of T2′ as training data and train the response fusion module with the SGD algorithm to optimize its parameters; the loss function of the fusion module is the Euclidean distance loss, of the form:
L_fuse = ‖R^F − R^GT‖
where R^F is the fusion response output by the response fusion module and R^GT is the GT response, i.e. the response between the GT box and the search area; the purpose of using the Euclidean distance is to make the fusion response as similar as possible to the GT response;
3.4.2.3 let d = d + batchsize; if d > Len(T2′), go to 3.4.2.4; if d ≤ Len(T2′), go to 3.4.2.2;
3.4.2.4 if epoch < 50, let epoch = epoch + 1 and lr = 0.5 × lr, go to 3.4.2.1; if epoch = 50, go to 3.4.2.5;
3.4.2.5, using the parameters of the response fusion module obtained after the last round of training as the network parameters of the final response fusion module.
10. The target tracking method based on dual-template response fusion of claim 1, wherein in step 4.5 the linear fusion sub-module of the feature extraction module fuses the initial template z_0, the fusion template ẑ_{i-1} used in tracking the (i-1)-th frame and the (i-1)-th frame tracking result feature z_{i-1} by linear weighting to generate the fusion template feature ẑ_i used for tracking the i-th frame, wherein λ1 = 0.99 and λ2 = 0.01 are the preset weighting parameters.
11. The target tracking method based on dual-template response fusion of claim 1, wherein in step 4.8 the second classification branch of the response fusion module fuses R^0_cls and R^f_cls to obtain the classification fusion response R^F_cls, and the second regression branch fuses R^0_reg and R^f_reg to obtain the regression fusion response R^F_reg; both fusions adopt a residual-connection fusion mode: the two classification (respectively regression) cross-correlation responses are stacked in the channel dimension, the stacked response is input into the second classification (respectively regression) branch for fusion, and the output of the branch is combined with the corresponding cross-correlation response through a residual connection, finally generating the classification fusion response R^F_cls (respectively the regression fusion response R^F_reg).

