CN112541468A - Target tracking method based on dual-template response fusion - Google Patents

Target tracking method based on dual-template response fusion

Info

Publication number
CN112541468A
Authority
CN
China
Prior art keywords: module, response, fusion, frame, target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011524190.9A
Other languages
Chinese (zh)
Other versions
CN112541468B (en)
Inventor
史殿习
王宁
刘聪
杨文婧
杨绍武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202011524190.9A priority Critical patent/CN112541468B/en
Publication of CN112541468A publication Critical patent/CN112541468A/en
Application granted granted Critical
Publication of CN112541468B publication Critical patent/CN112541468B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection


Abstract

The invention discloses a target tracking method based on dual-template response fusion, aiming to solve the problems that a fixed template reduces the accuracy of the template while a dynamically updated template reduces its robustness. First, a target tracking system consisting of a feature extraction module, a cross-correlation response module, a response fusion module and a target frame output module is constructed; then a training set is selected and the target tracking system is trained; finally, the trained target tracking system performs target tracking on each frame of the video sequence, including feature extraction, cross-correlation response calculation, cross-correlation response fusion, and prediction of the target position and size, to obtain the target tracking result. During tracking, the twin tracking network uses both the template initialized from the first frame of the video and the template dynamically updated with the tracking result of each subsequent frame, so that the advantages and disadvantages of the two templates complement each other, improving target tracking precision while ensuring robustness and real-time performance.

Description

Target tracking method based on dual-template response fusion
Technical Field
The invention belongs to the field of computer vision target tracking, and particularly relates to a target tracking method based on dual-template response fusion within a twin (Siamese) network structure.
Background
Target tracking is an important task in the field of computer vision and can be divided into single-target tracking and multi-target tracking according to the number of targets, with single-target tracking having the wider range of application. Specifically, for a video sequence, single-target tracking initializes the tracker with a given target bounding box in the first frame of the sequence, then predicts the location of the target in subsequent video frames and draws a bounding box around it. Target tracking plays an important role in military, agricultural, security and other technical fields; with the rapid development of artificial intelligence technology and the growing demands of practical applications, ever higher performance is required of target tracking algorithms, so research on target tracking technology is very necessary.
Single-target tracking algorithms fall into two main categories: generative tracking and discriminant tracking. Generative tracking focuses primarily on characterizing the intrinsic distribution of the target appearance data and does not explicitly discriminate the target from the background. Discriminant tracking treats the tracking problem as a classification problem, separating the target from the background by learning a classifier. Because discriminant tracking algorithms have a strong ability to distinguish foreground from background, their accuracy and robustness are higher than those of generative tracking algorithms.
The two most popular discriminant tracking approaches at present are correlation-filtering-based tracking and deep-learning-based tracking. Tracking algorithms based on deep learning can extract deep features of the target; these features carry richer semantic information and express the target appearance more strongly. Therefore, current deep-learning-based methods achieve better accuracy and robustness. Among current deep-learning-based tracking methods, the twin (Siamese) network structure is the most commonly adopted. A twin network is a pair of convolutional neural networks with identical parameters. A twin-network tracker uses the twin network as its feature extraction network: one network branch serves as the template branch, takes the GT (ground-truth target bounding box) in the first frame of the video as input, and extracts the target features as the template; the other branch serves as the detection branch, which first extracts a detection area from each frame after the first and then takes the detection area as network input to output the detection area features. In the subsequent classification and regression network, candidate boxes are extracted from the detection area and matched against the template features; the candidate box with the highest matching score is selected as the target box, and its length and width are refined by regression.
Currently, in twin network based tracking methods, the template is either fixed or dynamically updated.
Template fixing initializes the template with the GT of the first frame of the video and keeps the template unchanged during the subsequent tracking process. Since the template is initialized with GT, the template is absolutely correct; but in subsequent video frames the target may change in size, shape and so on, causing its features to change. A fixed template therefore cannot capture the latest features of the target, so the template can no longer be correctly matched with the target features, leading to inaccurate tracking.
Dynamic template updating initializes the template in the first frame and, during the subsequent tracking process, updates the template according to the target box predicted in each frame. Because the template is updated with the tracking result of every frame, it always contains the latest features of the target, which can improve tracking accuracy; however, if tracking is lost or drifts, the updated template is contaminated by background features, the template is no longer robust, and tracking fails.
Therefore, how to balance the accuracy and robustness of the template in twin-network-based target tracking, so that the template always acquires the latest features of the target during tracking while not being corrupted by tracking drift, thereby improving the overall accuracy and robustness of the target tracking method, is a hot issue studied by researchers in this field.
Disclosure of Invention
The invention aims to solve the technical problem that the template fixing mode in the twin network tracking method may reduce the accuracy of the template, and the template dynamic updating mode may reduce the robustness of the template.
The invention provides a target tracking method based on a dual template on the basis of the existing twin network tracking method, namely, an initial template (a template initialized by a first frame of a video) and an updated template (a template dynamically updated by using a tracking result in each frame later) are simultaneously used in the twin tracking network, so that the advantages and disadvantages of the two templates are complemented, and the target tracking precision is improved.
In order to solve the above technical problems, the technical scheme of the invention is as follows. First, a target tracking system consisting of a feature extraction module, a cross-correlation response module, a response fusion module and a target frame output module is constructed. Then ImageNet VID and DET (see "Russakovsky O, Deng J, Su H, et al. ImageNet Large Scale Visual Recognition Challenge [J]. International Journal of Computer Vision, 2015, 115(3): 211-252."), YouTube-BoundingBoxes (see "Real E, Shlens J, Mazzocchi S, et al. YouTube-BoundingBoxes: A large high-precision human-annotated data set for object detection in video [C]// CVPR, 2017.") and GOT-10k (see "Huang L, Zhao X, Huang K. GOT-10k: A large high-diversity benchmark for generic object tracking in the wild [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.") are selected as training sets to train the target tracking system. Finally, the trained target tracking system performs feature extraction, cross-correlation response calculation, cross-correlation response fusion and target position and size prediction on each frame of the video sequence to obtain the target tracking result.
The specific technical scheme of the invention is as follows:
the method comprises the following steps of firstly, building a target tracking system, wherein the target tracking system is composed of a feature extraction module, a cross-correlation response module, a response fusion module and a target frame output module.
The feature extraction module is connected with the cross-correlation response module and consists of a convolutional neural network sub-module and a linear fusion sub-module. The convolutional neural network sub-module is used for extracting features of the input image: it receives from the outside the template image Z_0 of the first frame of the video and the target search area image of each subsequent frame (the target search area image of the i-th frame is denoted X_i), performs feature extraction on Z_0 and X_i respectively, and sends the extracted initial template features z_0 and search area features x_i together to the cross-correlation response module, while also sending the initial template features z_0 to the linear fusion sub-module. The convolutional neural network sub-module is a modified AlexNet (see "Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks [C]// Advances in Neural Information Processing Systems. 2012: 1097-1105."). The modified AlexNet comprises 16 layers in total: 5 convolutional layers, 2 max-pooling layers, 5 BatchNorm layers and 4 ReLU activation function layers, wherein the convolutional layers are the 1st, 5th, 9th, 12th and 15th layers, the pooling layers are the 3rd and 7th layers, the BatchNorm layers are the 2nd, 6th, 10th, 13th and 16th layers, and the remaining layers are ReLU activation function layers. Z_0 has a size of 127 × 127 × 3 and its image content is the target to be tracked; the feature extraction module extracts from Z_0 the initial template features z_0 with size 6 × 6 × 256. X_i has a size of 287 × 287 × 3, and the tracking task is to find in X_i the target most similar to Z_0; the feature extraction module extracts from X_i the search area features x_i with size 26 × 26 × 256. The meaning of the dimensions is: the first two values are the length and width, and the third value is the number of channels; e.g., 6 × 6 × 256 represents 256 channels, each of size 6 × 6.
The linear fusion sub-module takes as input the initial template features z_0, the fused template features z_{i-1}^fuse of the previous frame (i.e., frame i-1), and the tracked target features z_{i-1} of frame i-1, where z_0 is the output of the convolutional neural network sub-module, z_{i-1}^fuse is the output of the linear fusion sub-module during the tracking of frame i-1, and z_{i-1} is the feature of the tracking result image of frame i-1. The linear fusion sub-module fuses the three features z_0, z_{i-1}^fuse and z_{i-1} by linear weighting to obtain the fused template features z_i^fuse of the current frame (i.e., frame i), and sends z_i^fuse to the cross-correlation response module. In the first frame, the fused template features are the initial template features z_0; thereafter, the fused template features of frame i are used in the target tracking task of frame i+1 to obtain the target tracking result of frame i+1.
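For illustration, a minimal PyTorch-style sketch of the 16-layer convolutional neural network sub-module described above follows. Only the layer ordering (convolution, BatchNorm, max-pooling, ReLU) is taken from the text; the kernel sizes, strides and intermediate channel widths are assumptions.

```python
import torch
import torch.nn as nn

class ModifiedAlexNet(nn.Module):
    """Sketch of the modified AlexNet backbone: conv layers at positions 1, 5, 9, 12, 15,
    max-pooling at 3 and 7, BatchNorm at 2, 6, 10, 13, 16, ReLU elsewhere.
    Kernel sizes, strides and channel widths are assumed, not quoted from the text."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=2),   # layer 1  (conv)
            nn.BatchNorm2d(96),                           # layer 2  (BatchNorm)
            nn.MaxPool2d(kernel_size=3, stride=2),        # layer 3  (max-pool)
            nn.ReLU(inplace=True),                        # layer 4  (ReLU)
            nn.Conv2d(96, 256, kernel_size=5),            # layer 5  (conv)
            nn.BatchNorm2d(256),                          # layer 6  (BatchNorm)
            nn.MaxPool2d(kernel_size=3, stride=2),        # layer 7  (max-pool)
            nn.ReLU(inplace=True),                        # layer 8  (ReLU)
            nn.Conv2d(256, 384, kernel_size=3),           # layer 9  (conv)
            nn.BatchNorm2d(384),                          # layer 10 (BatchNorm)
            nn.ReLU(inplace=True),                        # layer 11 (ReLU)
            nn.Conv2d(384, 384, kernel_size=3),           # layer 12 (conv)
            nn.BatchNorm2d(384),                          # layer 13 (BatchNorm)
            nn.ReLU(inplace=True),                        # layer 14 (ReLU)
            nn.Conv2d(384, 256, kernel_size=3),           # layer 15 (conv)
            nn.BatchNorm2d(256),                          # layer 16 (BatchNorm)
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (N, 3, 127, 127) template or (N, 3, 287, 287) search region
        return self.features(image)
```

With these assumed kernel sizes and strides, a 127 × 127 × 3 template input produces a 6 × 6 × 256 feature map and a 287 × 287 × 3 search region produces a 26 × 26 × 256 feature map, matching the feature sizes stated above.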
The cross-correlation response module is connected with the feature extraction module and the response fusion module. The module consists of two parallel branches, namely a first classification branch and a first regression branch; both branches are convolutional neural networks with identical network structures but different parameters. The first classification branch is used for generating classification cross-correlation responses and consists of two convolution sub-modules with the same structure: a classification kernel sub-module and a classification search sub-module, each comprising 1 convolutional layer, 1 BatchNorm layer and 1 ReLU activation function layer. The classification kernel sub-module first receives z_0 from the convolutional neural network sub-module and generates the integrated initial template features z_0^cls; it then receives z_i^fuse from the linear fusion sub-module and generates the further integrated fused template features z_i^cls, where cls is an abbreviation for classification. Both z_0^cls and z_i^cls have size 4 × 4 × 256. The classification search sub-module receives x_i from the convolutional neural network sub-module and integrates x_i to obtain the integrated search region features x_i^cls with size 24 × 24 × 256. The first classification branch then takes z_0^cls as the convolution kernel and x_i^cls as the convolved region and performs a convolution operation to obtain the classification cross-correlation response of the initial template, R_init^cls; it then takes z_i^cls as the convolution kernel and x_i^cls as the convolved region and performs a convolution operation to obtain the classification cross-correlation response of the fused template, R_fuse^cls. Both R_init^cls and R_fuse^cls have size 22 × 22 × 256. Finally, the first classification branch outputs R_init^cls and R_fuse^cls to the response fusion module.
The first regression branch is used to generate regression cross-correlation responses. Like the first classification branch, the first regression branch also contains two sub-modules: a regression kernel sub-module and a regression search sub-module, whose network structures are the same as that of the classification kernel sub-module of the first classification branch. The regression kernel sub-module first receives z_0 from the convolutional neural network sub-module and generates the integrated initial template features z_0^reg; it then receives z_i^fuse from the linear fusion sub-module and generates the integrated fused template features z_i^reg, where reg is an abbreviation for regression. Both z_0^reg and z_i^reg have size 4 × 4 × 256. The regression search sub-module receives x_i from the convolutional neural network sub-module and integrates x_i to obtain the integrated search region features x_i^reg with size 24 × 24 × 256. The first regression branch then takes z_0^reg as the convolution kernel and x_i^reg as the convolved region and performs a convolution operation to obtain the regression cross-correlation response of the initial template, R_init^reg; it then takes z_i^reg as the convolution kernel and x_i^reg as the convolved region and performs a convolution operation to obtain the regression cross-correlation response of the fused template, R_fuse^reg. Both R_init^reg and R_fuse^reg have size 22 × 22 × 256. Finally, the first regression branch outputs R_init^reg and R_fuse^reg to the response fusion module.
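The cross-correlation operation used by the first classification and regression branches can be sketched as below: the integrated template features act as a convolution kernel slid over the integrated search region features. A channel-wise (depthwise) correlation is assumed here so that the response keeps 256 channels, as stated for R_init^cls, R_fuse^cls, R_init^reg and R_fuse^reg; the text itself only states that the template features are used as the convolution kernel.

```python
import torch
import torch.nn.functional as F

def template_xcorr(template: torch.Tensor, search: torch.Tensor) -> torch.Tensor:
    """Cross-correlation with the template features as the convolution kernel.

    template: (C, Ht, Wt) integrated template features, e.g. 256 x 4 x 4
    search:   (C, Hs, Ws) integrated search region features, e.g. 256 x 24 x 24
    returns:  (C, Hs - Ht + 1, Ws - Wt + 1) response map

    Depthwise grouping (groups=C) is an assumption made so the output keeps C channels.
    """
    c = template.size(0)
    kernel = template.unsqueeze(1)        # (C, 1, Ht, Wt): one kernel per channel
    x = search.unsqueeze(0)               # (1, C, Hs, Ws)
    response = F.conv2d(x, kernel, groups=c)
    return response.squeeze(0)
```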
The response fusion module is a convolutional neural network connected with the cross-correlation response module and the target frame output module. The module consists of two parallel neural network branches: a second classification branch and a second regression branch. The second classification branch comprises 5 layers, of which the 1st and 3rd layers are convolutional layers, the 2nd and 4th layers are BatchNorm layers, and the last layer is a ReLU activation function layer. The second classification branch receives the two classification cross-correlation responses R_init^cls and R_fuse^cls from the first classification branch, stacks R_init^cls and R_fuse^cls in the channel dimension to generate a classification stacking response of size 22 × 22 × 512, then performs classification fusion on the classification stacking response to obtain the 22 × 22 × 256 classification fusion response R^cls, and sends R^cls to the target frame output module. The network structure of the second regression branch is the same as that of the second classification branch. The second regression branch receives the two regression cross-correlation responses R_init^reg and R_fuse^reg from the first regression branch, stacks R_init^reg and R_fuse^reg in the channel dimension to generate a regression stacking response of size 22 × 22 × 512, then performs regression fusion on the regression stacking response to obtain the 22 × 22 × 256 regression fusion response R^reg, and sends R^reg to the target frame output module.
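A minimal sketch of one response fusion branch (stacking the two cross-correlation responses along the channel dimension and fusing them with the 5-layer convolutional branch described above) is given below; the kernel sizes and the exact placement of the residual connection mentioned later in step 4.8 are assumptions.

```python
import torch
import torch.nn as nn

class ResponseFusionBranch(nn.Module):
    """Second classification (or regression) branch: conv, BatchNorm, conv, BatchNorm, ReLU.
    Kernel sizes and the residual operand are assumed, not quoted from the text."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),  # layer 1 (conv)
            nn.BatchNorm2d(channels),                                     # layer 2 (BatchNorm)
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),      # layer 3 (conv)
            nn.BatchNorm2d(channels),                                     # layer 4 (BatchNorm)
            nn.ReLU(inplace=True),                                        # layer 5 (ReLU)
        )

    def forward(self, r_init: torch.Tensor, r_fuse: torch.Tensor) -> torch.Tensor:
        # r_init, r_fuse: (N, 256, 22, 22) cross-correlation responses
        stacked = torch.cat([r_init, r_fuse], dim=1)   # 512-channel stacking response
        fused = self.fuse(stacked)                     # 256-channel fusion response
        return fused + r_fuse                          # residual connection (operand assumed)
```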
The target frame output module is connected with the response fusion module and consists of two parallel neural network branches: a third classification branch and a third regression branch. The third classification branch comprises 4 layers, of which the 1st and 4th layers are convolutional layers (with 1 × 1 convolution kernels), the 2nd layer is a BatchNorm layer, and the 3rd layer is a ReLU activation function layer. The third classification branch receives the classification fusion response R^cls from the second classification branch and performs response classification on R^cls to obtain the classification result A^cls, whose dimension is 22 × 22 × 2k, where k is the number of anchor frames in the Region Proposal Network (RPN, see "Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks [C]// Advances in Neural Information Processing Systems. 2015: 91-99."); 2k means that there are k anchor frames and each anchor frame corresponds to 2 classification values, which respectively represent the probability that the image in the anchor frame is the target and is not the target. The network structure of the third regression branch is the same as that of the third classification branch. The third regression branch receives the regression fusion response R^reg from the second regression branch and performs response regression on R^reg to obtain the regression result A^reg, whose dimension is 22 × 22 × 4k, where k is the same as in the third classification branch, i.e. the number of anchor frames in the RPN; 4k means that there are k anchor frames and each anchor frame corresponds to 4 regression values dx, dy, dw and dh, which respectively represent the correction values of the x and y coordinates and of the length and width of the corresponding original anchor frame. The target frame output module selects the anchor frame with the maximum target probability in the classification result as the target prediction frame, takes out the four correction values in the regression result corresponding to that anchor frame, and uses them to correct the position and size of the anchor frame; the corrected anchor frame is the tracking frame of the target.
Secondly, preparing a training data set of the target tracking system, wherein the method comprises the following steps:
the training data set of the system is divided into two parts: first training data set T1And a second training data set T2,T1For training feature extraction module, cross-correlation response module and target box output module, T2For training the response fusion module.
2.1 Select 100000 positive sample pairs from VID and YTB (i.e., YouTube-BoundingBoxes) by: sampling each video sequence of VID and YTB, randomly selecting a frame from a video sequence as the template image, and randomly selecting a frame within a range of no more than one hundred frames after the template image as the search area image; the two images selected in this way form 1 positive sample pair, and 100000 positive sample pairs are generated in this sampling manner. VID and YTB are video data sets in which each video contains a specific target, and the makers of the data sets have annotated a target frame for every video frame; the target frame is annotated by the coordinates of the upper-left corner point of a rectangular frame and the length and width of the rectangular frame, and the rectangular frame marks the target position.
2.2 select 100000 negative sample pairs from VID and YTB by: randomly selecting one frame from one video sequence of VID and YTB as a template image, and randomly selecting one frame from the other video sequence as a search area image, and taking the two images selected in the way as 1 negative sample pair. 100000 negative sample pairs are generated in this sampling manner.
2.3 Choose 100000 positive sample pairs from DET and COCO by: randomly selecting two different images of the same object from DET and COCO as 1 positive sample pair; 100000 positive sample pairs are generated in this sampling manner. One sample in the positive sample pair serves as the template image and the other as the search area image. DET and COCO are target detection data sets containing target box labels.
2.4 select 100000 negative sample pairs from DET and COCO by:
2.4.1 select two images of the same kind but not the same object from DET and COCO, one as template image and the other as search area image, to get 1 negative sample pair. 50000 negative sample pairs were generated in this sampling manner.
2.4.2 select two different objects of different classes, one as a template image and the other as a search area image, from DET and COCO to obtain 1 negative sample pair. 50000 negative sample pairs were generated in this sampling manner.
100000 negative sample pairs are finally obtained through 2.4.1 and 2.4.2.
2.5 Scale the template images in all positive and negative sample pairs to a size of 127 × 127 × 3 and all search area images to a size of 287 × 287 × 3. Take all finally obtained positive and negative sample pairs as the first training data set T_1.
2.6 Choose the training set of GOT-10k as the second training data set T_2.
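For illustration, the pair-sampling rules of steps 2.1 and 2.2 can be sketched as below; the frame containers are hypothetical placeholders, and only the sampling constraints (same sequence within one hundred frames for a positive pair, two different sequences for a negative pair) follow the text.

```python
import random

def sample_positive_pair(video_frames):
    """Step 2.1: template frame plus a search frame at most 100 frames later
    from the same video sequence."""
    t = random.randrange(len(video_frames))
    s = random.randrange(t, min(t + 100, len(video_frames) - 1) + 1)
    return video_frames[t], video_frames[s]

def sample_negative_pair(video_a_frames, video_b_frames):
    """Step 2.2: template frame and search frame taken from two different
    video sequences."""
    return random.choice(video_a_frames), random.choice(video_b_frames)
```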
Thirdly, training a target tracking system by using a training data set, wherein the specific method comprises the following steps:
3.1 initializing the parameters of the feature extraction module by using AlexNet network parameters pre-trained on ImageNet, and initializing the parameters of the cross-correlation response module, the response fusion module and the target frame output module by using a Kaiming initialization method.
3.2 Use T_1 to train the feature extraction module, the cross-correlation response module and the target frame output module, as follows:
3.2.1 Set the total number of training iteration rounds to 50 and initialize epoch = 1; initialize the data batch input size batchsize = 128; initialize the learning rate lr = 0.01 and set it to 0.0005 in the last round, with the learning rate decayed exponentially during training; initialize the hyper-parameter λ = 1.2; define the number of samples in T_1 as Len(T_1).
3.2.2 Use T_1 to train the feature extraction module, the cross-correlation response module and the target frame output module, as follows:
3.2.2.1 Initialize the variable d = 1.
3.2.2.2 Take the d-th to the (d + batchsize)-th pictures of T_1 as training data and input them into the feature extraction module; the data flow is feature extraction module → cross-correlation response module → target frame output module. Use the stochastic gradient descent (SGD) method (see "LeCun Y, et al. Backpropagation applied to handwritten zip code recognition [J]. Neural Computation, 1989.") to train the feature extraction module, the cross-correlation response module and the target frame output module by minimizing the loss function, so as to update the network parameters of the convolutional neural network sub-module, the cross-correlation response module and the target frame output module. The loss function is a combination of the classification loss L_cls and the regression loss L_reg, of the form:
L = L_cls + λ·L_reg
where L is the total loss function, L_cls is the classification loss function, obtained by computing a cross-entropy loss between the true target box and the predicted boxes in the search area, and L_reg is the regression loss function, obtained by computing the SmoothL1 loss between each predicted box and the true target box.
3.2.2.3 Let d = d + batchsize. If d > Len(T_1), go to 3.2.2.4; if d ≤ Len(T_1), go to 3.2.2.2.
3.2.2.4 If epoch ≤ 10, go to 3.2.2.5; if epoch > 10, go to 3.2.2.6.
3.2.2.5 Set all 16 layers of network parameters of the convolutional neural network sub-module to fixed (i.e., not trained), go to 3.2.2.6.
3.2.2.6 Set the parameters of the first 11 layers of the convolutional neural network sub-module to fixed and the parameters of the last 5 layers to trainable, go to 3.2.2.7.
3.2.2.7 If epoch < 50, let epoch = epoch + 1 and lr = 0.5 × lr, go to 3.2.2.1; if epoch = 50, go to 3.2.2.8.
3.2.2.8 Take the parameters of the convolutional neural network sub-module, the cross-correlation response module and the target frame output module obtained after training and updating as the parameters of the convolutional neural network sub-module, the cross-correlation response module and the target frame output module networks.
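A schematic sketch of the first training stage (step 3.2) follows, assuming hypothetical placeholders for the tracking system, the batches of T_1 and the anchor/target matching; only the schedule stated above (50 epochs, SGD, batch size 128, λ = 1.2, learning rate halved each epoch, first 11 backbone layers frozen after epoch 10) is taken from the text.

```python
import torch
import torch.nn as nn

def combined_loss(cls_logits, reg_preds, cls_targets, reg_targets, lam=1.2):
    """L = L_cls + lambda * L_reg: cross-entropy for classification,
    SmoothL1 for box regression (step 3.2.2.2)."""
    l_cls = nn.functional.cross_entropy(cls_logits, cls_targets)
    l_reg = nn.functional.smooth_l1_loss(reg_preds, reg_targets)
    return l_cls + lam * l_reg

def train_stage_one(tracking_system, batches_of_T1, backbone_layers):
    """Schematic loop; tracking_system, batches_of_T1 and backbone_layers are placeholders."""
    lr = 0.01
    optimizer = torch.optim.SGD(tracking_system.parameters(), lr=lr)
    for epoch in range(1, 51):                             # 50 epochs in total
        # Backbone freezing schedule (3.2.2.4-3.2.2.6): everything frozen up to epoch 10,
        # afterwards only the last 5 of the 16 backbone layers are trainable.
        for idx, layer in enumerate(backbone_layers):
            trainable = epoch > 10 and idx >= 11
            for p in layer.parameters():
                p.requires_grad = trainable
        for templates, searches, cls_t, reg_t in batches_of_T1:   # batch size 128
            cls_out, reg_out = tracking_system(templates, searches)
            loss = combined_loss(cls_out, reg_out, cls_t, reg_t)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        lr *= 0.5                                          # halve the learning rate (3.2.2.7)
        for group in optimizer.param_groups:
            group["lr"] = lr
```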
3.3 Input all video frames of T_2 into the convolutional neural network sub-module and the cross-correlation response module, and store the outputs for each video frame of T_2: the initial template response R_init^cls, the fused template response R_fuse^cls and the GT response R_GT^cls of the classification branch (the GT response is the response between the target template corresponding to the ground truth and the search area), and the initial template response R_init^reg, the fused template response R_fuse^reg and the GT response R_GT^reg of the regression branch. Take the six classification and regression responses corresponding to each video frame of T_2 as the third training data set T_2′.
3.4 Use T_2′ to train the response fusion module, as follows:
3.4.1 Set the total number of training iteration rounds to 50 and initialize epoch = 1; initialize the data batch input size batchsize = 64; initialize the learning rate lr = 10e-6 and set it to 10e-9 in the last round; define the number of samples in T_2′ as Len(T_2′).
3.4.2 Use T_2′ to train the response fusion module; the specific steps are as follows:
3.4.2.1 Initialize the variable d = 1.
3.4.2.2 Take the d-th to the (d + batchsize)-th pictures of T_2′ as training data and train the response fusion module with the SGD algorithm to optimize its parameters. The loss function of the fusion module is the Euclidean distance loss, of the form:
L_fusion = ||R − R_GT||_2
where R is the fusion response and R_GT is the response between the GT box and the search area; the purpose of using the Euclidean distance is to make the fusion response as similar as possible to the GT box response.
3.4.2.3 Let d = d + batchsize. If d > Len(T_2′), go to 3.4.2.4; if d ≤ Len(T_2′), go to 3.4.2.2.
3.4.2.4 If epoch < 50, let epoch = epoch + 1 and lr = 0.5 × lr, go to 3.4.2.1; if epoch = 50, go to 3.4.2.5.
3.4.2.5 Take the parameters of the response fusion module obtained after the last round of training as the network parameters of the final response fusion module.
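The fusion-module training of step 3.4 can be sketched as below. Interpreting the Euclidean distance loss as the L2 norm of the difference between the fusion response and the GT response is an assumption, and the fusion module interface and batch iterator are hypothetical placeholders; the schedule (50 epochs, batch size 64, SGD, learning rate halved each epoch from the stated initial value) follows the text.

```python
import torch

def fusion_loss(fused_response: torch.Tensor, gt_response: torch.Tensor) -> torch.Tensor:
    """Euclidean-distance loss of step 3.4.2.2; the L2-norm form is an assumption."""
    return torch.norm(fused_response - gt_response, p=2)

def train_fusion_module(fusion_module, batches_of_T2_prime):
    optimizer = torch.optim.SGD(fusion_module.parameters(), lr=10e-6)  # initial lr as written in 3.4.1
    for epoch in range(1, 51):                                         # 50 epochs in total
        for r_init, r_fused_template, r_gt in batches_of_T2_prime:     # batch size 64; placeholder batches
            fused = fusion_module(r_init, r_fused_template)
            loss = fusion_loss(fused, r_gt)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        for group in optimizer.param_groups:
            group["lr"] *= 0.5                                         # halve the learning rate (3.4.2.4)
```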
Fourthly, tracking the target by using the trained target tracking system, wherein the method comprises the following steps:
4.1 Acquire the video stream I_0, …, I_i, …, I_N in real time from the camera, and the target tracking system processes each frame in turn, where I_i is the i-th frame in the video and N is the total number of video frames. Initialize the variable i = 1.
4.2 The feature extraction module obtains from the first frame I_0 the target image Z_0 of size 127 × 127 × 3 and performs feature extraction on Z_0 to obtain the initial template features z_0 of size 6 × 6 × 256.
4.3 If i = 1, let z_1^fuse = z_0 and go to 4.6; if i > 1, go to 4.4.
4.4 Before the tracking of frame i, the target tracking system has obtained the fused template features z_{i-1}^fuse used in the tracking of frame i-1 and the tracking result Z_{i-1}. The feature extraction module performs feature extraction on Z_{i-1} to obtain the tracking result features z_{i-1} of frame i-1.
4.5 The linear fusion sub-module of the feature extraction module fuses the initial template z_0, the fused template z_{i-1}^fuse used in the tracking of frame i-1 and the tracking result features z_{i-1} of frame i-1 to generate the fused template features z_i^fuse for the tracking of frame i. The fusion is a linear weighted combination of these features, where λ1 = 0.99 and λ2 = 0.01 are preset weighting parameters.
4.6 The feature extraction module, centered at the center coordinates of Z_{i-1} in I_{i-1}, selects from I_i an image area of size 287 × 287 × 3 as the target search area X_i, and performs feature extraction on X_i to obtain the search area features x_i of size 26 × 26 × 256.
4.7 The cross-correlation response module receives z_0, z_i^fuse and x_i from the feature extraction module. The first classification branch first processes z_0 and x_i to obtain the initial template classification response R_init^cls, and then processes z_i^fuse and x_i to obtain the fused template classification response R_fuse^cls. The first regression branch first processes z_0 and x_i to obtain the initial template regression response R_init^reg, and then processes z_i^fuse and x_i to obtain the fused template regression response R_fuse^reg. All four responses are 22 × 22 × 256 in size.
4.8 The response fusion module receives R_init^cls, R_fuse^cls, R_init^reg and R_fuse^reg. The second classification branch fuses R_init^cls and R_fuse^cls to obtain the classification fusion response R^cls, and the second regression branch fuses R_init^reg and R_fuse^reg to obtain the regression fusion response R^reg. Both fusions adopt a residual-connection fusion mode: R_init^cls and R_fuse^cls are stacked in the channel dimension and input into the second classification branch for fusion, a residual connection is then applied to the fused result, and the classification fusion response R^cls is generated after the residual connection; likewise, R_init^reg and R_fuse^reg are stacked in the channel dimension and input into the second regression branch for fusion, a residual connection is then applied to the fused result, and the regression fusion response R^reg is finally generated.
4.9 The target frame output module receives R^cls and R^reg. The third classification branch processes R^cls to obtain the classification result A^cls with dimension 22 × 22 × 2k, where k is the number of anchor frames; 22 × 22 × 2k indicates that there are k anchor frames at each point of the 22 × 22 grid and that each anchor frame corresponds to 2 classification values, which respectively represent the probability that the anchor frame is the target and is not the target. The third regression branch processes R^reg to obtain the regression result A^reg with dimension 22 × 22 × 4k; 22 × 22 × 4k indicates that there are k anchor frames at each point of the 22 × 22 grid and that each anchor frame corresponds to 4 regression values dx, dy, dw and dh, which respectively represent the correction values of the center position coordinates x and y and the correction values of the length and width from the anchor frame to the actual target frame.
4.10 The third classification branch counts over the classification result A^cls and obtains the anchor frame (x, y, w, h) with the maximum target probability, where x and y are the coordinates of the center point of the anchor frame in the original image and w and h are the length and width of the anchor frame.
4.11 The third regression branch finds in the regression result A^reg the correction values (dx, dy, dw, dh) corresponding to the anchor frame (x, y, w, h) obtained in 4.10 and corrects the anchor frame with these values, obtaining the corrected frame (x*, y*, w*, h*), which is the target frame; (x*, y*) are the coordinates of the center of the target frame and (w*, h*) are its length and width. With this target frame the target image Z_i of frame i is obtained; Z_i is the tracking result of frame i.
4.12 If i < N, let i = i + 1 and go to 4.4; if i = N, go to 4.13.
4.13 The tracking results Z_0, Z_1, …, Z_N of all frames of the video sequence have been obtained, and the process ends.
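The anchor selection and correction of steps 4.10 and 4.11 can be illustrated with the sketch below. The exact correction formula is not reproduced in this text, so the standard Faster R-CNN style decoding used here is an assumption rather than the invention's own formula.

```python
import math

def select_best_anchor(target_probs, anchors):
    """Step 4.10: pick the anchor (x, y, w, h) with the highest target probability.
    target_probs and anchors are aligned placeholder sequences."""
    best = max(range(len(anchors)), key=lambda i: target_probs[i])
    return anchors[best], best

def decode_anchor(x, y, w, h, dx, dy, dw, dh):
    """Step 4.11: apply the corrections (dx, dy, dw, dh) to the anchor (x, y, w, h).
    The Faster R-CNN parameterisation below is assumed, not quoted from the text."""
    cx = x + dx * w          # corrected center x
    cy = y + dy * h          # corrected center y
    cw = w * math.exp(dw)    # corrected width
    ch = h * math.exp(dh)    # corrected height
    return cx, cy, cw, ch
```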
The invention can achieve the following technical effects:
1. On the basis of a single-template twin-network tracking system, the invention adds a fusion template to the tracking system. The fusion template continuously absorbs the latest tracking results of the system, keeping the template more accurate, which strengthens template matching and improves tracking precision.
2. The invention adopts a dual-template strategy, using the initial template and the fusion template simultaneously during target tracking. In this way, when the tracking system drifts or loses the target during tracking, the initial template preserves the purity of the template, ensuring the robustness of the whole target tracking system.
3. The invention improves tracking precision while meeting real-time requirements: on an NVIDIA GTX 1050 Ti, the tracking speed of the method is 68.3 FPS.
Drawings
FIG. 1 is a general flow diagram of the present invention.
FIG. 2 is a logical block diagram of a target tracking system constructed in accordance with the present invention.
Detailed Description
Fig. 1 is a general flow chart of the present invention, and as shown in fig. 1, the present invention comprises the following steps:
firstly, a target tracking system is built, and as shown in fig. 2, the system is composed of a feature extraction module, a cross-correlation response module, a response fusion module and a target frame output module.
The feature extraction module is connected with the cross-correlation response module and consists of a convolutional neural network sub-module and a linear fusion sub-module. The convolutional neural network sub-module is used for extracting features of the input image: it receives from the outside the template image Z_0 of the first frame of the video and the target search area image of each subsequent frame (the target search area image of the i-th frame is denoted X_i), performs feature extraction on Z_0 and X_i respectively, and sends the extracted initial template features z_0 and search area features x_i together to the cross-correlation response module, while also sending the initial template features z_0 to the linear fusion sub-module. The convolutional neural network sub-module is a modified AlexNet comprising 16 layers in total: 5 convolutional layers, 2 max-pooling layers, 5 BatchNorm layers and 4 ReLU activation function layers, wherein the convolutional layers are the 1st, 5th, 9th, 12th and 15th layers, the pooling layers are the 3rd and 7th layers, the BatchNorm layers are the 2nd, 6th, 10th, 13th and 16th layers, and the remaining layers are ReLU activation function layers. Z_0 has a size of 127 × 127 × 3 and its image content is the target to be tracked; the feature extraction module extracts from Z_0 the initial template features z_0 with size 6 × 6 × 256. X_i has a size of 287 × 287 × 3, and the tracking task is to find in X_i the target most similar to Z_0; the feature extraction module extracts from X_i the search area features x_i with size 26 × 26 × 256.
The linear fusion sub-module takes as input the initial template features z_0, the fused template features z_{i-1}^fuse of the previous frame (i.e., frame i-1), and the tracked target features z_{i-1} of frame i-1, where z_0 is the output of the convolutional neural network sub-module, z_{i-1}^fuse is the output of the linear fusion sub-module during the tracking of frame i-1, and z_{i-1} is the feature of the tracking result image of frame i-1. The linear fusion sub-module fuses the three features z_0, z_{i-1}^fuse and z_{i-1} by linear weighting to obtain the fused template features z_i^fuse of the current frame (i.e., frame i), and sends z_i^fuse to the cross-correlation response module. In the first frame, the fused template features are the initial template features z_0; thereafter, the fused template features of frame i are used in the target tracking task of frame i+1 to obtain the target tracking result of frame i+1.
The cross-correlation response module is connected with the feature extraction module and the response fusion module. The module consists of two parallel branches, namely a first classification branch and a first regression branch; both branches are convolutional neural networks with identical network structures but different parameters. The first classification branch is used for generating classification cross-correlation responses and consists of two convolution sub-modules with the same structure: a classification kernel sub-module and a classification search sub-module, each comprising 1 convolutional layer, 1 BatchNorm layer and 1 ReLU activation function layer. The classification kernel sub-module first receives z_0 from the convolutional neural network sub-module and generates the integrated initial template features z_0^cls; it then receives z_i^fuse from the linear fusion sub-module and generates the further integrated fused template features z_i^cls, where cls is an abbreviation for classification. Both z_0^cls and z_i^cls have size 4 × 4 × 256. The classification search sub-module receives x_i from the convolutional neural network sub-module and integrates x_i to obtain the integrated search region features x_i^cls with size 24 × 24 × 256. The first classification branch then takes z_0^cls as the convolution kernel and x_i^cls as the convolved region and performs a convolution operation to obtain the classification cross-correlation response of the initial template, R_init^cls; it then takes z_i^cls as the convolution kernel and x_i^cls as the convolved region and performs a convolution operation to obtain the classification cross-correlation response of the fused template, R_fuse^cls. Both R_init^cls and R_fuse^cls have size 22 × 22 × 256. Finally, the first classification branch outputs R_init^cls and R_fuse^cls to the response fusion module.
The first regression branch is used to generate regression cross-correlation responses. Like the first classification branch, the first regression branch also contains two sub-modules: a regression kernel sub-module and a regression search sub-module, whose network structures are the same as that of the classification kernel sub-module of the first classification branch. The regression kernel sub-module first receives z_0 from the convolutional neural network sub-module and generates the integrated initial template features z_0^reg; it then receives z_i^fuse from the linear fusion sub-module and generates the integrated fused template features z_i^reg, where reg is an abbreviation for regression. Both z_0^reg and z_i^reg have size 4 × 4 × 256. The regression search sub-module receives x_i from the convolutional neural network sub-module and integrates x_i to obtain the integrated search region features x_i^reg with size 24 × 24 × 256. The first regression branch then takes z_0^reg as the convolution kernel and x_i^reg as the convolved region and performs a convolution operation to obtain the regression cross-correlation response of the initial template, R_init^reg; it then takes z_i^reg as the convolution kernel and x_i^reg as the convolved region and performs a convolution operation to obtain the regression cross-correlation response of the fused template, R_fuse^reg. Both R_init^reg and R_fuse^reg have size 22 × 22 × 256. Finally, the first regression branch outputs R_init^reg and R_fuse^reg to the response fusion module.
The response fusion module is a convolutional neural network connected with the cross-correlation response module and the target frame output module. The module consists of two parallel neural network branches: a second classification branch and a second regression branch. The second classification branch comprises 5 layers, of which the 1st and 3rd layers are convolutional layers, the 2nd and 4th layers are BatchNorm layers, and the last layer is a ReLU activation function layer. The second classification branch receives the two classification cross-correlation responses R_init^cls and R_fuse^cls from the first classification branch, stacks R_init^cls and R_fuse^cls in the channel dimension to generate a classification stacking response of size 22 × 22 × 512, then performs classification fusion on the classification stacking response to obtain the 22 × 22 × 256 classification fusion response R^cls, and sends R^cls to the target frame output module. The network structure of the second regression branch is the same as that of the second classification branch. The second regression branch receives the two regression cross-correlation responses R_init^reg and R_fuse^reg from the first regression branch, stacks R_init^reg and R_fuse^reg in the channel dimension to generate a regression stacking response of size 22 × 22 × 512, then performs regression fusion on the regression stacking response to obtain the 22 × 22 × 256 regression fusion response R^reg, and sends R^reg to the target frame output module.
The target frame output module is connected with the response fusion module and consists of two parallel neural network branches: a third classification branch and a third regression branch. The third classification branch comprises 4 layers, of which the 1st and 4th layers are convolutional layers (with 1 × 1 convolution kernels), the 2nd layer is a BatchNorm layer, and the 3rd layer is a ReLU activation function layer. The third classification branch receives the classification fusion response R^cls from the second classification branch and performs response classification on R^cls to obtain the classification result A^cls, whose dimension is 22 × 22 × 2k, where k is the number of anchor frames in the Region Proposal Network (RPN); 2k means that there are k anchor frames and each anchor frame corresponds to 2 classification values, which respectively represent the probability that the image in the anchor frame is the target and is not the target. The network structure of the third regression branch is the same as that of the third classification branch. The third regression branch receives the regression fusion response R^reg from the second regression branch and performs response regression on R^reg to obtain the regression result A^reg, whose dimension is 22 × 22 × 4k, where k is the same as in the third classification branch, i.e. the number of anchor frames in the RPN; 4k means that there are k anchor frames and each anchor frame corresponds to 4 regression values dx, dy, dw and dh, which respectively represent the correction values of the x and y coordinates and of the length and width of the corresponding original anchor frame. The target frame output module selects the anchor frame with the maximum target probability in the classification result as the target prediction frame, takes out the four correction values in the regression result corresponding to that anchor frame, and uses them to correct the position and size of the anchor frame; the corrected anchor frame is the tracking frame of the target.
Secondly, preparing a training data set of the target tracking system, wherein the method comprises the following steps:
the training data set of the system is divided into two parts: first training data set T1And a second training data set T2,T1For training feature extraction module, cross-correlation response module and target box output module, T2For training the response fusion module.
2.1 Select 100000 positive sample pairs from VID and YTB by: sampling each video sequence of VID and YTB, randomly selecting a frame from a video sequence as the template image, and randomly selecting a frame within a range of no more than one hundred frames after the template image as the search area image; the two images selected in this way form 1 positive sample pair, and 100000 positive sample pairs are generated in this sampling manner. VID and YTB are video data sets in which each video contains a specific target, and the makers of the data sets have annotated a target frame for every video frame; the target frame is annotated by the coordinates of the upper-left corner point of a rectangular frame and the length and width of the rectangular frame, and the rectangular frame marks the target position.
2.2 select 100000 negative sample pairs from VID and YTB by: randomly selecting one frame from one video sequence of VID and YTB as a template image, and randomly selecting one frame from the other video sequence as a search area image, and taking the two images selected in the way as 1 negative sample pair. 100000 negative sample pairs are generated in this sampling manner.
2.3 Choose 100000 positive sample pairs from DET and COCO by: randomly selecting two different images of the same object from DET and COCO as 1 positive sample pair; 100000 positive sample pairs are generated in this sampling manner. One sample in the positive sample pair serves as the template image and the other as the search area image. DET and COCO are target detection data sets containing target box labels.
2.4 select 100000 negative sample pairs from DET and COCO by:
2.4.1 select two images of the same kind but not the same object from DET and COCO, one as template image and the other as search area image, to get 1 negative sample pair. 50000 negative sample pairs were generated in this sampling manner.
2.4.2 select two different objects of different classes, one as a template image and the other as a search area image, from DET and COCO to obtain 1 negative sample pair. 50000 negative sample pairs were generated in this sampling manner.
100000 negative sample pairs are finally obtained through 2.4.1 and 2.4.2.
2.5 Scale the template images in all positive and negative sample pairs to a size of 127 × 127 × 3 and all search area images to a size of 287 × 287 × 3. Take all finally obtained positive and negative sample pairs as the first training data set T_1.
2.6 Choose the training set of GOT-10k as the second training data set T_2.
Thirdly, training a target tracking system by using a training data set, wherein the specific method comprises the following steps:
3.1 initializing the parameters of the feature extraction module by using AlexNet network parameters pre-trained on ImageNet, and initializing the parameters of the cross-correlation response module, the response fusion module and the target frame output module by using a Kaiming initialization method.
3.2 Use T1 to train the feature extraction module, the cross-correlation response module and the target frame output module by the following steps:
3.2.1 Set the total number of training iteration rounds to 50 and initialize epoch to 1; initialize the data batch input size batchsize to 128; initialize the learning rate lr to 0.01, set the learning rate of the last round to 0.0005, and decay the learning rate exponentially during training; initialize the hyper-parameter λ to 1.2; define the number of samples in T1 as Len(T1).
3.2.2 Use T1 to train the feature extraction module, the cross-correlation response module and the target frame output module by the following steps:
3.2.2.1 initialization variable d is 1.
3.2.2.2 Take the d-th to the (d + batchsize)-th sample pairs of T1 as training data and input them into the feature extraction module; the data flow is feature extraction module → cross-correlation response module → target frame output module. Train the feature extraction module, the cross-correlation response module and the target frame output module with the stochastic gradient descent (SGD) method to minimize the loss function, so as to update the network parameters of the convolutional neural network sub-module, the cross-correlation response module and the target frame output module. The loss function is a combination of the classification loss L_cls and the regression loss L_reg, of the form:

L = L_cls + λ·L_reg

where L is the total loss function, L_cls is the classification loss function, obtained by computing the cross-entropy loss between the true target box and the predicted boxes in the search area, and L_reg is the regression loss function, obtained by computing the SmoothL1 loss between each predicted box and the true target box (a sketch of this loss and of the epoch-dependent layer freezing is given after step 3.2.2.8 below).
3.2.2.3 Let d = d + batchsize. If d > Len(T1), go to 3.2.2.4; if d ≤ Len(T1), go to 3.2.2.2.
3.2.2.4 If epoch ≤ 10, go to 3.2.2.5; if epoch > 10, go to 3.2.2.6.
3.2.2.5 Set all 16 layers of network parameters of the convolutional neural network sub-module to fixed (i.e., untrained), go to 3.2.2.7.
3.2.2.6 Set the first 11 layers of parameters of the convolutional neural network sub-module to fixed and the last 5 layers to trainable, go to 3.2.2.7.
3.2.2.7 If epoch < 50, let epoch = epoch + 1 and lr = 0.5 × lr, go to 3.2.2.1; if epoch = 50, go to 3.2.2.8.
3.2.2.8 Take the parameters of the convolutional neural network sub-module, the cross-correlation response module and the target frame output module updated after training as the final parameters of these modules.
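The loss of 3.2.2.2 and the layer-freezing schedule of 3.2.2.4-3.2.2.7 can be sketched in PyTorch as follows. This is a simplified reading under the assumption that the backbone exposes its 16 layers in order; the model, loader and variable names are placeholders.

import torch
import torch.nn.functional as F

def tracking_loss(cls_logits, cls_labels, reg_pred, reg_target, lam=1.2):
    # L = L_cls + lambda * L_reg, with cross-entropy classification loss and
    # SmoothL1 regression loss (illustrative shapes: (N, 2), (N,), (N, 4), (N, 4)).
    l_cls = F.cross_entropy(cls_logits, cls_labels)
    l_reg = F.smooth_l1_loss(reg_pred, reg_target)
    return l_cls + lam * l_reg

def set_backbone_trainable(backbone, epoch):
    # Epochs 1-10: the whole 16-layer backbone stays fixed; afterwards the first
    # 11 layers stay fixed and the last 5 layers become trainable.
    layers = list(backbone.children())
    for idx, layer in enumerate(layers):
        trainable = epoch > 10 and idx >= len(layers) - 5
        for p in layer.parameters():
            p.requires_grad = trainable

# One possible outer loop for stage-one training (3.2.2):
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# for epoch in range(1, 51):
#     set_backbone_trainable(model.backbone, epoch)
#     for template, search, cls_labels, reg_targets in loader_T1:  # batches of 128 sample pairs
#         cls_logits, reg_pred = model(template, search)
#         loss = tracking_loss(cls_logits, cls_labels, reg_pred, reg_targets)
#         optimizer.zero_grad(); loss.backward(); optimizer.step()
#     for g in optimizer.param_groups:
#         g['lr'] *= 0.5  # halved each round as in 3.2.2.7

The cross-entropy and SmoothL1 calls mirror the L_cls and L_reg terms named in the text, and the freezing indices follow the 16-layer backbone of claim 2.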
3.3 Input all video frames of T2 into the convolutional neural network sub-module and the cross-correlation response module, and store the outputs for every video frame of T2: the initial template classification response R^0_cls, the fused template classification response R^f_cls and the GT classification response R^GT_cls of the classification branch (the GT response is the response computed between the target template corresponding to the ground truth and the search area), and the initial template regression response R^0_reg, the fused template regression response R^f_reg and the GT regression response R^GT_reg of the regression branch. Take the six classification and regression responses corresponding to each video frame of T2 as a third training data set T2′.
3.4 Use T2′ to train the response fusion module by the following steps:
3.4.1 Set the total number of training iteration rounds to 50 and initialize epoch to 1; initialize the data batch input size batchsize to 64; initialize the learning rate lr to 10e-6 and set the learning rate of the last round to 10e-9; define the number of samples in T2′ as Len(T2′).
3.4.2 Use T2′ to train the response fusion module; the specific steps are as follows:
3.4.2.1 initializes a variable d to 1.
3.4.2.2 Take the d-th to the (d + batchsize)-th samples of T2′ as training data and train the response fusion module with the SGD algorithm to optimize its parameters. The loss function of the fusion module is the Euclidean distance loss, of the form:

L_fuse = ‖R^F − R^GT‖

where R^F is the fusion response output by the response fusion module and R^GT is the GT response, i.e. the response between the GT box and the search area; the purpose of using the Euclidean distance is to make the fusion response as similar as possible to the GT response (a sketch of this loss is given after step 3.4.2.5 below).
3.4.2.3 Let d = d + batchsize. If d > Len(T2′), go to 3.4.2.4; if d ≤ Len(T2′), go to 3.4.2.2.
3.4.2.4 If epoch < 50, let epoch = epoch + 1 and lr = 0.5 × lr, go to 3.4.2.1; if epoch = 50, go to 3.4.2.5.
3.4.2.5, using the parameters of the response fusion module obtained after the last round of training as the network parameters of the final response fusion module.
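A short sketch of the Euclidean-distance objective used in 3.4.2.2; the function name and the use of the Frobenius norm over the whole response map are assumptions consistent with the text.

import torch

def fusion_loss(fused_response, gt_response):
    # Pull the fusion response produced by the response fusion module toward
    # the GT response computed between the GT box and the search area.
    return torch.norm(fused_response - gt_response)

# loss = fusion_loss(r_fuse_cls, r_gt_cls) + fusion_loss(r_fuse_reg, r_gt_reg)  # one possible combination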
Fourthly, tracking the target by using the trained target tracking system, wherein the method comprises the following steps:
4.1 Acquire the video stream I_0, …, I_i, …, I_N from the camera in real time; the target tracking system processes each frame in turn. Here I_i is the i-th frame of the video and N is the total number of video frames. Initialize the variable i to 1.
4.2 The feature extraction module obtains the target image Z_0 of size 127 × 127 × 3 from the first frame I_0, and performs feature extraction on Z_0 to obtain the initial template feature z_0 of size 6 × 6 × 256.
4.3 If i = 1, let the fusion template feature ẑ_1 = z_0 and go to 4.6; if i > 1, go to 4.4.
4.4 Before tracking the i-th frame, the target tracking system has the fusion template feature ẑ_{i-1} used in tracking the (i-1)-th frame and the tracking result Z_{i-1} of the (i-1)-th frame. The feature extraction module performs feature extraction on Z_{i-1} to obtain the tracking result feature z_{i-1} of the (i-1)-th frame.
4.5 The linear fusion sub-module of the feature extraction module fuses the initial template z_0, the fusion template ẑ_{i-1} used in tracking the (i-1)-th frame and the (i-1)-th frame tracking result feature z_{i-1} to generate the fusion template feature ẑ_i used for tracking the i-th frame. The fusion linearly weights the features with the preset parameters λ1 = 0.99 and λ2 = 0.01.
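The exact weighted combination in 4.5 is given as a formula image in the original publication; the sketch below is one plausible reading in which the fusion template is carried forward and refreshed with a small fraction of the newest tracking result, using the stated weights λ1 = 0.99 and λ2 = 0.01. The update rule itself is an assumption.

def update_fusion_template(z0, z_fused_prev, z_prev_result, lam1=0.99, lam2=0.01):
    # z0: initial template feature; z_fused_prev: fusion template used for frame i-1;
    # z_prev_result: feature of the frame i-1 tracking result.
    # Assumed combination: keep most of the running template, blend in the newest result.
    return lam1 * z_fused_prev + lam2 * z_prev_result

# z_fused = z0                                            # frame 1 (step 4.3)
# z_fused = update_fusion_template(z0, z_fused, z_prev)   # every later frame (step 4.5)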
4.6 Taking the center coordinate of Z_{i-1} in I_{i-1} as the center, the feature extraction module selects on I_i an image area of size 287 × 287 × 3 as the target search area X_i, and performs feature extraction on X_i to obtain the search area feature x_i of size 26 × 26 × 256.
4.7 The cross-correlation response module receives z_0, ẑ_i and x_i from the feature extraction module. The first classification branch first processes z_0 and x_i to obtain the initial template classification response R^0_cls, and then processes ẑ_i and x_i to obtain the fused template classification response R^f_cls. The first regression branch first processes z_0 and x_i to obtain the initial template regression response R^0_reg, and then processes ẑ_i and x_i to obtain the fused template regression response R^f_reg. The four responses all have a size of 22 × 22 × 256.
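As claim 1 states, each response in 4.7 is obtained by a convolution in which the integrated template feature serves as the kernel and the integrated search feature as the convolved region. A per-sample PyTorch sketch of this cross-correlation follows; it produces a single-channel response map, whereas the patented branches keep 256 response channels through their own convolution sub-modules, so the shapes and variable names here are illustrative.

import torch
import torch.nn.functional as F

def cross_correlation(template_feat, search_feat):
    # template_feat: (C, th, tw) used as the convolution kernel;
    # search_feat:   (C, H, W)   used as the convolved region.
    kernel = template_feat.unsqueeze(0)          # (1, C, th, tw)
    region = search_feat.unsqueeze(0)            # (1, C, H, W)
    return F.conv2d(region, kernel).squeeze(0)   # (1, H - th + 1, W - tw + 1)

# r0_cls = cross_correlation(z0_cls, x_cls)   # initial-template classification response
# rf_cls = cross_correlation(zf_cls, x_cls)   # fused-template classification response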
4.8 The response fusion module receives R^0_cls, R^f_cls, R^0_reg and R^f_reg. The second classification branch fuses R^0_cls and R^f_cls to obtain the classification fusion response R^F_cls, and the second regression branch fuses R^0_reg and R^f_reg to obtain the regression fusion response R^F_reg. Both fusions adopt a residual-connection fusion mode: the two classification (respectively regression) cross-correlation responses are stacked in the channel dimension, the stacked response is input into the second classification (respectively regression) branch for fusion, and the branch output is combined with the corresponding cross-correlation response through a residual connection, finally generating the classification fusion response R^F_cls (respectively the regression fusion response R^F_reg).
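A sketch of the residual-connection fusion performed by the second classification branch (the second regression branch is structurally identical): the two cross-correlation responses are stacked on the channel dimension, fused by a small convolutional branch laid out as in claim 5, and a residual connection adds one of the inputs back to the branch output. The kernel sizes, the padding, and the choice of the fused-template response as the residual term are assumptions.

import torch
import torch.nn as nn

class ResponseFusionBranch(nn.Module):
    # Second classification (or regression) branch: conv, BN, conv, BN, ReLU (claim 5),
    # mapping the 512-channel stacked response back to 256 channels.
    def __init__(self, channels=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, r_init, r_fused):
        stacked = torch.cat([r_init, r_fused], dim=1)  # 22 x 22 x 512 stacking response
        return self.fuse(stacked) + r_fused            # residual connection (assumed on r_fused)

# fuse_cls = ResponseFusionBranch()
# r_F_cls = fuse_cls(r0_cls, rf_cls)   # classification fusion response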
4.9 The target frame output module receives R^F_cls and R^F_reg. The third classification branch processes R^F_cls to obtain the classification result A_cls of dimension 22 × 22 × 2k, where k is the number of anchor boxes; 22 × 22 × 2k means that there are k anchor boxes at each point of the 22 × 22 grid and each anchor box corresponds to 2 classification values, which respectively represent the probability that the anchor box is the target and is not the target. The third regression branch processes R^F_reg to obtain the regression result A_reg of dimension 22 × 22 × 4k; 22 × 22 × 4k means that there are k anchor boxes at each point of the 22 × 22 grid and each anchor box corresponds to 4 regression values dx, dy, dw and dh, which respectively represent the corrections of the anchor box center coordinates x and y and of its length and width towards the actual target box.
4.10 The third classification branch searches the classification result A_cls for the anchor box (x, y, w, h) with the highest target probability, where x and y are the coordinates of the anchor box center in the original image and w and h are the length and width of the anchor box.
4.11 The third regression branch looks up in the regression result A_reg the correction values (dx, dy, dw, dh) corresponding to the anchor box (x, y, w, h) obtained in 4.10 and corrects the anchor box with them: dx and dy correct the center coordinates and dw and dh correct the length and width. The corrected box (x′, y′, w′, h′) is the target box, where (x′, y′) are the coordinates of the target box center and w′ and h′ are its length and width. The target image Z_i of the i-th frame is obtained from this target box; Z_i is the tracking result of the i-th frame.
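The correction in 4.11 applies dx and dy to the anchor center and dw and dh to the anchor size. The decoding below uses the standard RPN-style form (additive center shift scaled by the anchor size, exponential size correction); since the patent's formula images are not reproduced here, that exact form is an assumption.

import math

def decode_anchor(x, y, w, h, dx, dy, dw, dh):
    # (x, y, w, h): highest-scoring anchor from 4.10;
    # (dx, dy, dw, dh): its correction values from the regression result.
    cx = x + dx * w          # corrected center x
    cy = y + dy * h          # corrected center y
    cw = w * math.exp(dw)    # corrected width (assumed exponential form)
    ch = h * math.exp(dh)    # corrected height (assumed exponential form)
    return cx, cy, cw, ch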
4.12 If i < N, let i = i + 1 and go to 4.4; if i = N, go to 4.13.
4.13 The tracking results Z_0, Z_1, …, Z_N of all frames of the video sequence have been obtained, and the process ends.
The target tracking field uses the success rate (SUC) and the precision (PRE) to characterize tracking accuracy. SUC denotes the overlap score between the predicted target box and the GT box, and PRE denotes the percentage of frames whose center location error is below a given threshold over the total number of frames. Higher SUC and PRE indicate better tracking performance. Tracking speed is measured in FPS (frames per second), i.e. the number of frames processed per second; the larger the FPS, the faster the tracking.
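For reference, the overlap behind SUC and the center-error criterion behind PRE can be computed per frame as sketched below; the 20-pixel threshold and the (center x, center y, width, height) box format are common conventions and are assumptions here, not values taken from the patent.

def iou(box_a, box_b):
    # Overlap between two boxes given as (cx, cy, w, h).
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def precision(pred_boxes, gt_boxes, threshold=20.0):
    # Fraction of frames whose center location error is below `threshold` pixels.
    hits = sum(1 for p, g in zip(pred_boxes, gt_boxes)
               if ((p[0] - g[0]) ** 2 + (p[1] - g[1]) ** 2) ** 0.5 < threshold)
    return hits / len(pred_boxes)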
Table 1 shows the results of comparing the present invention with eight other high-performance target tracking methods on an OTB-100 dataset.
TABLE 1 comparison of test indexes of the present invention on OTB-100 data set with other eight high-performance target tracking methods
The first row of Table 1 lists the abbreviations of the eight tracking algorithms compared, the second row gives the SUC values measured by these algorithms on the OTB-100 dataset, and the third row gives the PRE values. Bold font indicates the best value. As can be seen from Table 1, the present invention outperforms these eight high-performance algorithms in both SUC and PRE; compared with DaSiamRPN, the best of the eight algorithms, the present invention improves SUC by 2.8% and PRE by 1.4%. The invention thus improves target tracking accuracy.
While the present invention has been described with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Any modification which does not depart from the functional and structural principles of the present invention is intended to be included within the scope of the claims.

Claims (11)

1. A target tracking method based on dual-template response fusion is characterized by comprising the following steps:
firstly, a target tracking system is set up, and the target tracking system consists of a feature extraction module, a cross-correlation response module, a response fusion module and a target frame output module;
the feature extraction module is connected with the cross-correlation response module and consists of a convolutional neural network sub-module and a linear fusion sub-module; the convolutional neural network sub-module is used for extracting features of the input image: it receives from the outside the template image Z_0 of the first frame of the video and then the target search area image of each frame, performs feature extraction on Z_0 and on the target search area image X_i of the i-th frame respectively, sends the resulting initial template feature z_0 and search area feature x_i together to the cross-correlation response module, and at the same time sends the initial template feature z_0 to the linear fusion sub-module; Z_0 is the target to be tracked, and z_0 is the initial template feature obtained by the feature extraction module performing feature extraction on Z_0; the task of tracking is to find in X_i the target most similar to Z_0, and the feature extraction module performs feature extraction on X_i to obtain the search area feature x_i;
the linear fusion sub-module takes as input the initial template feature z_0, the fusion template feature ẑ_{i-1} of the previous frame, i.e. the (i-1)-th frame, and the tracked target feature z_{i-1} of the (i-1)-th frame, where z_0 is the output of the convolutional neural network sub-module, ẑ_{i-1} is the output of the linear fusion sub-module in the tracking of the (i-1)-th frame, and z_{i-1} is the feature of the tracking result image of the (i-1)-th frame; the linear fusion sub-module fuses the three features z_0, ẑ_{i-1} and z_{i-1} by linear weighting to obtain the fusion template feature ẑ_i of the current frame, i.e. the i-th frame, and sends ẑ_i to the cross-correlation response module; in the first frame the fusion template feature is the initial template feature z_0; thereafter, the fusion template feature of the i-th frame is used in the target tracking task of the (i+1)-th frame to obtain the target tracking result of the (i+1)-th frame;
the cross-correlation response module is connected with the feature extraction module and the response fusion module; this module consists of two parallel branches, a first classification branch and a first regression branch, both of which are convolutional neural networks with identical network structures but different network parameters; the first classification branch is used for generating classification cross-correlation responses and consists of two convolution sub-modules with the same structure, a classification kernel sub-module and a classification search sub-module; the classification kernel sub-module first receives z_0 from the convolutional neural network sub-module and generates the integrated initial template feature [z_0]_cls, and then receives ẑ_i from the linear fusion sub-module and generates the further integrated fusion template feature [ẑ_i]_cls; the classification search sub-module receives x_i from the convolutional neural network sub-module and integrates x_i to obtain the integrated search area feature [x_i]_cls; the first classification branch then performs a convolution operation with [z_0]_cls as the convolution kernel and [x_i]_cls as the convolved region to obtain the classification cross-correlation response R^0_cls of the initial template, and then performs a convolution operation with [ẑ_i]_cls as the convolution kernel and [x_i]_cls as the convolved region to obtain the classification cross-correlation response R^f_cls of the fusion template; finally, the first classification branch outputs R^0_cls and R^f_cls to the response fusion module;
the first regression branch is used for generating regression cross-correlation responses; the first regression branch contains two sub-modules, a regression kernel sub-module and a regression search sub-module; the regression kernel sub-module first receives z_0 from the convolutional neural network sub-module and generates the integrated initial template feature [z_0]_reg, and then receives ẑ_i from the linear fusion sub-module and generates the integrated fusion template feature [ẑ_i]_reg; the regression search sub-module receives x_i from the convolutional neural network sub-module and integrates x_i to obtain the integrated search area feature [x_i]_reg; the first regression branch then performs a convolution operation with [z_0]_reg as the convolution kernel and [x_i]_reg as the convolved region to obtain the regression cross-correlation response R^0_reg of the initial template, and then performs a convolution operation with [ẑ_i]_reg as the convolution kernel and [x_i]_reg as the convolved region to obtain the regression cross-correlation response R^f_reg of the fusion template; finally, the first regression branch outputs R^0_reg and R^f_reg to the response fusion module;
the response fusion module is a convolutional neural network and is connected with the cross-correlation response module and the target frame output module; this module consists of two parallel neural network branches, a second classification branch and a second regression branch; the second classification branch receives the two classification cross-correlation responses R^0_cls and R^f_cls from the first classification branch, stacks R^0_cls and R^f_cls in the channel dimension to generate a classification stacking response, then performs classification fusion on the classification stacking response to obtain the classification fusion response R^F_cls, and sends R^F_cls to the target frame output module; the second regression branch receives the two regression cross-correlation responses R^0_reg and R^f_reg from the first regression branch, stacks R^0_reg and R^f_reg in the channel dimension to generate a regression stacking response, then performs regression fusion on the regression stacking response to obtain the regression fusion response R^F_reg, and sends R^F_reg to the target frame output module;
the target frame output module is connected with the response fusion module; the target frame output module consists of two parallel neural network branches, a third classification branch and a third regression branch; the third classification branch receives the classification fusion response R^F_cls from the second classification branch and performs response classification on R^F_cls to obtain the classification result A_cls; the size of A_cls is 22 × 22 × 2k, where k is the number of anchor boxes in the region proposal network (RPN); 2k means that there are k anchor boxes and each anchor box corresponds to 2 classification values, which respectively represent the probability that the image in the anchor box is the target and is not the target; the third regression branch receives the regression fusion response R^F_reg from the second regression branch and performs response regression on R^F_reg to obtain the regression result A_reg; the size of A_reg is 22 × 22 × 4k; 4k means that there are k anchor boxes and each anchor box corresponds to 4 regression values dx, dy, dw and dh, which respectively represent the correction values of the x and y coordinates and of the length and width of the corresponding original anchor box; the target frame output module selects the anchor box with the highest target probability in the classification result as the target prediction box, takes out the four correction values in the regression result corresponding to this anchor box to correct the position and size of the anchor box, and the corrected anchor box is the tracking box of the target;
secondly, preparing a training data set of the target tracking system, the training data set being divided into two parts, a first training data set T1 and a second training data set T2, by the following steps:
2.1 select 100000 positive sample pairs from VID and YTB as follows: sample each video sequence of VID and YTB, randomly select one frame from a video sequence as the template image, randomly select one frame within a range of no more than one hundred frames after the template image as the search area image, and take the two images selected in this way as 1 positive sample pair; 100000 positive sample pairs are generated in this sampling manner; VID and YTB are video data sets, each video contains a specific target, and each video frame is marked with a target box, the target box being given by the coordinates of the upper-left corner of a rectangular box and the length and width of the rectangular box, the rectangular box framing the target position;
2.2 select 100000 negative sample pairs from VID and YTB by: randomly selecting a frame from a certain video sequence of VID and YTB as a template image, randomly selecting a frame from another video sequence as a search area image, and taking the two images selected in the way as 1 negative sample pair; generating 100000 negative sample pairs in the sampling mode;
2.3 select 100000 positive sample pairs from DET and COCO as follows: randomly select two different images of the same object from DET and COCO as 1 positive sample pair; 100000 positive sample pairs are generated in this sampling manner; one sample in the positive sample pair serves as the template image and the other as the search area image; DET and COCO are target detection data sets containing target box labels;
2.4 select 100000 negative sample pairs from DET and COCO by:
2.4.1 selecting two images of the same type but not the same object from DET and COCO respectively, wherein one image is used as a template image, and the other image is used as a search area image to obtain 1 negative sample pair; 50000 negative sample pairs are generated by the sampling mode;
2.4.2 select from DET and COCO images of two objects of different classes, one as the template image and the other as the search area image, to obtain 1 negative sample pair; 50000 negative sample pairs are generated in this sampling manner;
2.5 scaling the template images in all positive and negative sample pairs to 127 × 127 × 3 size, all search area images to 287 × 287 × 3 size; taking all the positive and negative sample pairs after scaling as T1
2.6 choosing the training set of GOT-10k as T2
Thirdly, training a target tracking system by using a training data set, wherein the specific method comprises the following steps:
3.1 initializing the parameters of the feature extraction module by using AlexNet network parameters pre-trained on ImageNet, and initializing the parameters of the cross-correlation response module, the response fusion module and the target frame output module by using a Kaiming initialization method;
3.2 use T1 to train the feature extraction module, the cross-correlation response module and the target frame output module by the following steps:
3.2.1 set the total number of training iteration rounds to 50 and initialize epoch to 1; initialize the data batch input size batchsize to 128; initialize the learning rate lr to 0.01, set the learning rate of the last round to 0.0005, and decay the learning rate exponentially during training; initialize the hyper-parameter λ to 1.2; define the number of samples in T1 as Len(T1);
3.2.2 use T1 to train the feature extraction module, the cross-correlation response module and the target frame output module, and take the parameters of the convolutional neural network sub-module, the cross-correlation response module and the target frame output module updated after training as the parameters of these modules;
3.3 input all video frames of T2 into the convolutional neural network sub-module and the cross-correlation response module, and store the outputs for every video frame of T2: the initial template classification response R^0_cls, the fused template classification response R^f_cls and the GT classification response R^GT_cls of the classification branch, and the initial template regression response R^0_reg, the fused template regression response R^f_reg and the GT regression response R^GT_reg of the regression branch; take the six classification and regression responses corresponding to each video frame of T2 as a third training data set T2′;
3.4 use T2′ to train the response fusion module, and take the response fusion module parameters obtained by training as the network parameters of the final response fusion module;
fourthly, tracking the target by using the trained target tracking system, wherein the method comprises the following steps:
4.1 acquire the video stream I_0, …, I_i, …, I_N from the camera in real time; the target tracking system processes each frame in turn, where I_i is the i-th frame of the video and N is the total number of video frames; initialize the variable i to 1;
4.2 the feature extraction module obtains the target image Z_0 from the first frame I_0 and performs feature extraction on Z_0 to obtain the initial template feature z_0;
4.3 if i = 1, let the fusion template feature ẑ_1 = z_0 and go to 4.6; if i > 1, go to 4.4;
4.4 use the feature extraction module to perform feature extraction on Z_{i-1} to obtain the tracking result feature z_{i-1} of the (i-1)-th frame;
4.5 the linear fusion sub-module of the feature extraction module fuses the initial template z_0, the fusion template ẑ_{i-1} used in tracking the (i-1)-th frame and the (i-1)-th frame tracking result feature z_{i-1} to generate the fusion template feature ẑ_i used for tracking the i-th frame;
4.6 taking the center coordinate of Z_{i-1} in I_{i-1} as the center, the feature extraction module selects on I_i the target search area X_i and performs feature extraction on X_i to obtain the search area feature x_i;
4.7 the cross-correlation response module receives z_0, ẑ_i and x_i from the feature extraction module; the first classification branch first processes z_0 and x_i to obtain the initial template classification response R^0_cls, and then processes ẑ_i and x_i to obtain the fused template classification response R^f_cls; the first regression branch first processes z_0 and x_i to obtain the initial template regression response R^0_reg, and then processes ẑ_i and x_i to obtain the fused template regression response R^f_reg;
4.8 the response fusion module receives R^0_cls, R^f_cls, R^0_reg and R^f_reg; the second classification branch fuses R^0_cls and R^f_cls to obtain the classification fusion response R^F_cls, and the second regression branch fuses R^0_reg and R^f_reg to obtain the regression fusion response R^F_reg;
4.9 the target frame output module receives R^F_cls and R^F_reg; the third classification branch processes R^F_cls to obtain the classification result A_cls, whose dimension is 22 × 22 × 2k, where k is the number of anchor boxes; 22 × 22 × 2k means that there are k anchor boxes at each point of the 22 × 22 grid and each anchor box corresponds to 2 classification values, which respectively represent the probability that the anchor box is the target and is not the target; the third regression branch processes R^F_reg to obtain the regression result A_reg, whose dimension is 22 × 22 × 4k; 22 × 22 × 4k means that there are k anchor boxes at each point of the 22 × 22 grid and each anchor box corresponds to 4 regression values dx, dy, dw and dh, which respectively represent the corrections of the anchor box center coordinates x and y and of its length and width towards the actual target box;
4.10 the third classification branch searches the classification result A_cls for the anchor box (x, y, w, h) with the highest target probability, where x and y are the coordinates of the anchor box center in the original image and w and h are the length and width of the anchor box;
4.11 the third regression branch looks up in the regression result A_reg the correction values (dx, dy, dw, dh) corresponding to the anchor box (x, y, w, h) and corrects the anchor box with them, dx and dy correcting the center coordinates and dw and dh correcting the length and width; the corrected box (x′, y′, w′, h′) is the target box, where (x′, y′) are the coordinates of the target box center and w′ and h′ are its length and width; the target image Z_i of the i-th frame is obtained from this target box, and Z_i is the tracking result of the i-th frame;
4.12 if i < N, let i = i + 1 and go to 4.4; if i = N, go to 4.13;
4.13 the tracking results Z_0, Z_1, …, Z_N of all frames of the video sequence have been obtained, and the process ends.
2. The target tracking method based on dual-template response fusion of claim 1, wherein the convolutional neural network submodule is a modified AlexNet, the modified AlexNet comprises 5 convolutional layers, 2 maximum pooling layers, 5 BatchNorm layers and 4 ReLU activation function layers, and comprises 16 layers, wherein the convolutional layers are respectively the 1 st, 5 th, 9 th, 12 th and 15 th layers, the pooling layers are respectively the 3 rd and 7 th layers, the BatchNorm layers are respectively the 2 nd, 6 th, 10 th, 13 th and 16 th layers, and the rest layers are all ReLU activation function layers.
3. The target tracking method based on dual-template response fusion of claim 1, wherein Z_0 has a size of 127 × 127 × 3 and the initial template feature z_0 has a size of 6 × 6 × 256; X_i has a size of 287 × 287 × 3 and the search area feature x_i has a size of 26 × 26 × 256; the meaning of the sizes is: the first two values are the length and width of the image respectively, and the third value represents the number of channels.
4. The method of claim 1, wherein the classification kernel module and the classification search submodule of the first classification branch each comprise 1 convolution layer, 1 BatchNorm layer, and 1 ReLU activation function layer; the network structures of the regression kernel module and the regression search submodule are respectively the same as the network structure of the classification kernel module.
5. The method of claim 1, wherein the second classification branch comprises 5 layers, wherein the 1 st and 3 rd layers are convolutional layers, the 2 nd and 4 th layers are BatchNorm layers, and the last layer is a ReLU activation function layer; the network structure of the second regression branch is the same as the second classification branch.
6. The target tracking method based on dual-template response fusion of claim 1, wherein the third classification branch comprises 4 layers, wherein the 1 st and 4 th layers are convolution layers (convolution kernel size 1 x 1), the 2 nd layer is a BatchNorm layer, and the 3 rd layer is a ReLU activation function layer; the network structure of the third regression branch is the same as that of the third classification branch.
7. The target tracking method based on dual-template response fusion of claim 1, wherein [z_0]_cls and [ẑ_i]_cls both have a size of 4 × 4 × 256, [x_i]_cls has a size of 24 × 24 × 256, and R^0_cls and R^f_cls both have a size of 22 × 22 × 256; [z_0]_reg and [ẑ_i]_reg both have a size of 4 × 4 × 256, [x_i]_reg has a size of 24 × 24 × 256, and R^0_reg and R^f_reg both have a size of 22 × 22 × 256; the classification stacking response has a size of 22 × 22 × 512 and the classification fusion response R^F_cls has a size of 22 × 22 × 256; the regression stacking response has a size of 22 × 22 × 512 and the regression fusion response R^F_reg has a size of 22 × 22 × 256.
8. The target tracking method based on dual-template response fusion of claim 1, wherein step 3.2.2 uses T1 to train the feature extraction module, the cross-correlation response module and the target frame output module by the following method:
3.2.2.1 initialize the variable d to 1;
3.2.2.2 take the d-th to the (d + batchsize)-th sample pairs of T1 as training data and input them into the feature extraction module, the data flow being feature extraction module → cross-correlation response module → target frame output module; train the feature extraction module, the cross-correlation response module and the target frame output module with the stochastic gradient descent method to minimize the loss function, so as to update the network parameters of the convolutional neural network sub-module, the cross-correlation response module and the target frame output module; the loss function is a combination of the classification loss L_cls and the regression loss L_reg, of the form:
L = L_cls + λ·L_reg
where L is the total loss function, L_cls is the classification loss function, obtained by computing the cross-entropy loss between the true target box and the predicted boxes in the search area, and L_reg is the regression loss function, obtained by computing the SmoothL1 loss between each predicted box and the true target box;
3.2.2.3 let d = d + batchsize; if d > Len(T1), go to 3.2.2.4; if d ≤ Len(T1), go to 3.2.2.2;
3.2.2.4 if epoch ≤ 10, go to 3.2.2.5; if epoch > 10, go to 3.2.2.6;
3.2.2.5 set all 16 layers of network parameters of the convolutional neural network sub-module to fixed, go to 3.2.2.7;
3.2.2.6 set the first 11 layers of parameters of the convolutional neural network sub-module to fixed and the last 5 layers to trainable, go to 3.2.2.7;
3.2.2.7 if epoch < 50, let epoch = epoch + 1 and lr = 0.5 × lr, go to 3.2.2.1; if epoch = 50, go to 3.2.2.8;
3.2.2.8 take the parameters of the convolutional neural network sub-module, the cross-correlation response module and the target frame output module updated after training as the final parameters of these modules.
9. The target tracking method based on dual-template response fusion of claim 1, wherein step 3.4 uses T2′ to train the response fusion module by the following method:
3.4.1 set the total number of training iteration rounds to 50 and initialize epoch to 1; initialize the data batch input size batchsize to 64; initialize the learning rate lr to 10e-6 and set the learning rate of the last round to 10e-9; define the number of samples in T2′ as Len(T2′);
3.4.2 use T2′ to train the response fusion module; the specific steps are as follows:
3.4.2.1 initialize the variable d to 1;
3.4.2.2 take the d-th to the (d + batchsize)-th samples of T2′ as training data and train the response fusion module with the SGD algorithm to optimize its parameters; the loss function of the fusion module is the Euclidean distance loss, of the form:
L_fuse = ‖R^F − R^GT‖
where R^F is the fusion response output by the response fusion module and R^GT is the GT response, i.e. the response between the GT box and the search area; the purpose of using the Euclidean distance is to make the fusion response as similar as possible to the GT response;
3.4.2.3 let d = d + batchsize; if d > Len(T2′), go to 3.4.2.4; if d ≤ Len(T2′), go to 3.4.2.2;
3.4.2.4 if epoch < 50, let epoch = epoch + 1 and lr = 0.5 × lr, go to 3.4.2.1; if epoch = 50, go to 3.4.2.5;
3.4.2.5, using the parameters of the response fusion module obtained after the last round of training as the network parameters of the final response fusion module.
10. The target tracking method based on dual-template response fusion of claim 1, wherein in step 4.5 the linear fusion sub-module of the feature extraction module fuses the initial template z_0, the fusion template ẑ_{i-1} used in tracking the (i-1)-th frame and the (i-1)-th frame tracking result feature z_{i-1} by linear weighting to generate the fusion template feature ẑ_i used for tracking the i-th frame, wherein λ1 = 0.99 and λ2 = 0.01 are the preset weighting parameters.
11. The target tracking method based on dual-template response fusion of claim 1, wherein in step 4.8 the second classification branch of the response fusion module fuses R^0_cls and R^f_cls to obtain the classification fusion response R^F_cls, and the second regression branch fuses R^0_reg and R^f_reg to obtain the regression fusion response R^F_reg; both fusions adopt a residual-connection fusion mode: the two classification (respectively regression) cross-correlation responses are stacked in the channel dimension, the stacked response is input into the second classification (respectively regression) branch for fusion, and the output of the branch is combined with the corresponding cross-correlation response through a residual connection, finally generating the classification fusion response R^F_cls (respectively the regression fusion response R^F_reg).

