CN113902991A - Twin network target tracking method based on cascade feature fusion

Twin network target tracking method based on cascade feature fusion

Info

Publication number
CN113902991A
Authority
CN
China
Prior art keywords
template, network, target, target tracking, cascade
Prior art date
2021-10-09
Legal status
Pending
Application number
CN202111175907.8A
Other languages
Chinese (zh)
Inventor
韩明
王敬涛
孟军英
杨争艳
Current Assignee
Shijiazhuang University
Original Assignee
Shijiazhuang University
Priority date
2021-10-09
Filing date
2021-10-09
Publication date
2022-01-07
Application filed by Shijiazhuang University filed Critical Shijiazhuang University
Priority to CN202111175907.8A
Publication of CN113902991A
Legal status: Pending

Classifications

    • G06F18/214: Pattern recognition; analysing; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/253: Pattern recognition; analysing; fusion techniques of extracted features
    • G06N3/045: Computing arrangements based on biological models; neural networks; combinations of networks
    • G06T7/11: Image analysis; segmentation; region-based segmentation
    • G06T7/246: Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments

Abstract

The invention provides a twin network target tracking method based on cascade feature fusion, which comprises the following steps: an improved five-stage ResNet-50 network is adopted as the backbone of the twin network, and the template branch and the search branch of the backbone are used to extract shallow and deep features of the template image and the search image, respectively; the last three residual blocks Res3, Res4 and Res5 of the residual network are fused in a cascade manner in both the template branch and the search branch, giving three feature maps R3, R4 and R5 for each branch; cross-correlation is computed between the corresponding feature maps R3, R4 and R5 of the two branches, and the cross-correlated features are then classified and regressed through an anchor-free network. The invention effectively fuses the shallow appearance features and the deep semantic features of the target and improves target identification and localization.

Description

Twin network target tracking method based on cascade feature fusion
Technical Field
The invention belongs to the technical field of target tracking, and particularly relates to a twin network target tracking method based on cascade feature fusion.
Background
Target tracking is one of the main research topics in computer vision. It has attracted increasing attention and is widely applied in intelligent traffic management, video surveillance, autonomous driving, military reconnaissance and other fields. The task of target tracking is to estimate the trajectory of a target in an image sequence. Most current target tracking algorithms rely on the first frame, so the tracker must be built from very limited training data while remaining adaptable to various appearance changes. However, under illumination change, target rotation, scale change, similar-background interference, occlusion and other conditions, it is difficult to extract rich feature information during tracking, which easily causes tracking drift or tracking loss. The most popular target trackers at present are based on deep learning and correlation filters.
With the development of deep learning, trackers based on the Siamese (twin) network architecture have attracted wide attention for their excellent tracking performance, especially the good balance between tracking accuracy and speed. A twin network algorithm uses two network branches to extract the features of the target and of the candidate targets respectively, and converts the target tracking problem into a similarity calculation problem.
Attention mechanisms and twin networks are widely applied to various target tracking tasks. Although twin-network-based target tracking has developed considerably, visual target tracking algorithms still suffer from several problems. First, most Siamese trackers use a shallow classification network (such as AlexNet) as the backbone and fail to exploit the stronger feature extraction capability of deeper network structures. Second, during matching and tracking, only the last layer of features, which contains more semantic information, is used, and the influence of low-level spatial features on tracking performance has not been fully explored; although some algorithms adopt feature fusion, most are limited to fusing channel and spatial features or simply applying deep and shallow features, so the resolution of the deep features is low and the semantic information is insufficiently exploited. Finally, most algorithms rely on the first frame as the template image; when illumination change, target deformation, similar-background interference or occlusion occurs, the template easily fails and the target is lost.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a twin network target tracking method based on cascade feature fusion.
In order to solve the above technical problems, the present invention comprises:
a twin network target tracking method based on cascade feature fusion comprises the following steps:
s1, adopting an improved five-stage ResNet-50 network as a backbone network of a twin network, and respectively extracting shallow and deep features of a template image and a search image by utilizing a template branch and a search branch of the backbone network;
s2, carrying out cascade fusion on the last three residual blocks Res3, Res4 and Res5 of the residual error network of the template branch and the search branch respectively to obtain three characteristic maps R3, R4 and R5 of the two branches respectively;
s3, respectively carrying out cross-correlation calculation on three feature maps R3, R4 and R5 of the two branches, and then carrying out classification and regression on the features subjected to cross-correlation calculation through an anchor-free frame network.
Further, in step S1, the improvement on the ResNet-50 network includes: reducing the original strides of the residual blocks Res4 and Res5 from 16 and 32 pixels to 8 pixels and enlarging the receptive field by dilated convolution; training the whole network with a spatial-aware sampling strategy; and reducing the channels of the multi-layer feature maps to 256 by 1 × 1 convolutions and cropping the central 7 × 7 region of the template feature, where each feature cell can still capture the entire target region.
Further, in step S1, the template branch and the search branch of the backbone network have the same convolution structure and the same network parameters.
Further, in the step S2, the step-by-step cascade fusion of the three residual blocks Res3, Res4 and Res5 includes the following steps:
s2-1, marking a feature map obtained by 3-by-3 convolution of the residual block Res5 as R5;
S2-2, firstly, the output features of the residual block Res4 are passed sequentially through a 3 × 3 convolution (256 channels) and a Softmax function; then the feature map R5 is passed sequentially through a 3 × 3 convolution (256 channels), a ReLU function and another 3 × 3 convolution (256 channels); finally, pixel-by-pixel addition and a ReLU operation fuse the resulting global features with the channel-identified features at each position to obtain the feature map R4;
and S2-3, in the same step as the step S2-2, fusing the residual block Res3 and the feature map R4 to obtain a feature map R3.
Further, in step S3, the backbone network is pre-trained on the ImageNet-1K dataset, and the whole network is trained with images from the ILSVRC dataset; during training, a frame is randomly selected from the ILSVRC dataset, a 127 × 127 region containing the target is cropped as the target template, and a 255 × 255 search region is cropped from a search image to generate a training pair, with a maximum interval of 50 frames; the classification and regression of each target and position are realized through this training.
Further, in the step S3, negative sample sampling is added in the target tracking process.
Further, in the target tracking process, the method for distinguishing the positive sample and the negative sample of the sample comprises the following steps:
For the real (ground-truth) box marked on each image in the training set, Tw denotes the width, Th the height, (x1, y1) the coordinates of the upper left corner, (x0, y0) the center coordinates and (x2, y2) the coordinates of the lower right corner.
Centered at (x0, y0), an outer ellipse E1 is constructed whose semi-axis lengths a1 and b1 are determined by Tw and Th:
(xi − x0)² / a1² + (yj − y0)² / b1² = 1    (1)
where (xi, yj) denotes the coordinate position of a sampling point.
Then, again centered at (x0, y0), an inner ellipse E2 is constructed with smaller semi-axis lengths a2 and b2, likewise determined by Tw and Th:
(xi − x0)² / a2² + (yj − y0)² / b2² = 1    (2)
A sampling point (xi, yj) is a positive sample if it lies inside E2, a negative sample if it lies outside E1, and is ignored if it lies between the two ellipses; the positions marked as positive samples are used for tracking-box regression.
Further, in step S3, the regression target calculated in the regression branch is represented by the distances from the target position to the boundaries of the tracking box;
the tracking box is calculated as follows:
l = x − x1, t = y − y1, r = x2 − x, b = y2 − y    (3)
wherein l, t, r and b respectively represent the distances from the target position (x, y) to the left, top, right and bottom boundaries of the tracking box;
then the IoU between the predicted tracking box and the real box is calculated; only the IoU of positive samples is calculated, and the IoU in all other cases is set to 0; the regression loss function is then defined as the IoU loss accumulated over the positive positions:
Lreg = Σ(x, y) [1 − IoU(P(x, y), G(x, y))]    (4)
wherein P(x, y) denotes the tracking box predicted at position (x, y) and G(x, y) represents the target real border;
the classification loss Lcls is expressed using the SmoothL1 loss function:
SmoothL1(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise    (6)
the overall loss function of the network is expressed as:
L = λ1·Lcls + λ2·Lreg    (7)
wherein λ1 and λ2 are hyperparameters.
Further, in step S3, a template updating mechanism is introduced in the target tracking process, and a similarity threshold method is used to update the template in real time; the dynamic template updating strategy is as follows:
a template-updating sub-network is learned with a simple recurrent neural network, and the template update formula is:
Si = F(S0, Ti, Si−1)    (8)
wherein S0 is the template of the first frame, which remains the most reliable template throughout the update process, Ti is the template extracted from the i-th frame, Si−1 is the accumulated historical template up to frame i−1, Si is the best updated template to be matched in the next frame, and F is the activation function; for the first frame, Ti and Si−1 are both set to S0;
for threshold-based updating, the Average Peak Correlation Energy (APCE) is evaluated against a threshold to decide whether to update the template; the APCE is calculated as:
APCE = |Fmax − Fmin|² / mean( Σw,h (Fw,h − Fmin)² )    (9)
wherein Fmax and Fmin are the maximum and minimum values in the response map and Fw,h is the response value at coordinate (w, h).
Further, updating is restricted by setting a similarity threshold between the new template and the old template: whether to update is judged through an APCE threshold, formula (10), which compares the current APCE with its historical mean mean(APCE) under a preset threshold η; when formula (10) is satisfied, the target has probably undergone a large change;
to prevent misjudgment, the template similarity is then compared: the ratio of the response values of the convolution operations between the templates is taken as the similarity S, formula (11);
and if formula (10) is satisfied and the similarity between the templates is smaller than the threshold set by formula (11), the template is updated through formula (8).
The invention has the beneficial effects that:
the invention provides a twin network target tracking method based on cascade feature fusion, which has the main advantages that as the shallow feature of a multilayer neural network has high resolution and is suitable for target positioning, and the deep feature has rich semantic information and is suitable for target classification, the method has the following main advantages:
1) in order to better utilize the feature extraction capability of a deep network, the ResNet-50 network is improved, network step length, receptive field, space sampling strategies and the like are optimized, model parameters and calculated quantity are reduced, and therefore the tracking speed of the model is improved;
2) the method adopts a cascade feature fusion strategy to cascade and fuse three layers of features of the final stage of the improved ResNet-50 network step by step, fully utilizes the effective extraction of high-layer semantic information and shallow spatial information, and realizes the accurate representation of multiple features of the target, thereby realizing the accurate tracking of the target in a complex environment;
3) aiming at the problem that most algorithms only use the first frame as a target template in the target tracking process, so that the target template is degraded in the tracking process, a template updating mechanism is introduced, and the template is updated in real time by using a similarity threshold method;
4) compared experiments are carried out on OBT2015, VOT2016 and VOT2018 standard data sets, and compared with other methods, experimental results show that the method is high in tracking accuracy, strong in robustness in complex scenes and strong in competitive advantages compared with other algorithms.
Drawings
FIG. 1 is a diagram of a cascaded feature fusion network architecture of the present invention;
FIG. 2 is a block diagram of a feature fusion module of the present invention;
fig. 3 is a graph comparing the ablation experiment results of the method of the invention with other methods on OTB2015;
FIG. 4 is a comparison graph of the precision results of the method of the present invention and other methods in 11 different scenarios;
fig. 5 is a graph comparing the results of the method of the invention with other methods on OTB2015 video sequences;
fig. 6 is a graph comparing EAO results on VOT2016 and VOT2018 for the method of the present invention with other methods.
Detailed Description
For the purpose of promoting an understanding of the invention, reference will now be made in detail to the present embodiments of the invention, examples of which are illustrated in the accompanying drawings. It should be understood by those skilled in the art that the examples are only for the understanding of the present invention and should not be construed as the specific limitations of the present invention.
The invention is based on SiamRPN, and the overall pipeline is shown in Fig. 1. The network architecture comprises a backbone network built from an improved ResNet-50, a feature fusion module, a result prediction module and a template updating module. The backbone network is mainly responsible for extracting the shallow and deep features of the template image and the search image; the feature fusion module fuses the features of the last three stages of the template branch and the search branch step by step in a cascade manner; and a template updating module is introduced into the template branch, updating the template dynamically through a similarity threshold method so that the template adapts as the tracking time increases.
Multi-feature fusion can effectively improve algorithm accuracy in target segmentation and target tracking; performing convolution operations on an image in a neural network yields different shallow appearance features and deep semantic features. Because convolution features at different levels have different characteristics, features from different levels can complement each other, so feature fusion is a direct way to improve tracking accuracy. Subsequent twin network trackers such as SiamFC mostly exploit only the last layer of features for target tracking and ignore the shallow features, so a large amount of fine detail information is lost; in particular, when the background and the target share the same or similar semantics, tracking loss easily occurs. Multi-layer feature fusion refers to fusing features of different layers along the channel dimension; more features on a channel can be obtained either by element-wise addition on the channels or by directly concatenating them.
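As a small illustration of these two fusion styles, the following PyTorch snippet (a minimal sketch; the tensor shapes are illustrative assumptions, not values taken from the patent) contrasts element-wise addition with concatenation along the channel dimension:

```python
import torch

# Two feature maps from different layers, assumed to have already been
# brought to the same spatial resolution and channel count.
shallow = torch.randn(1, 256, 31, 31)  # shallow layer: appearance detail
deep = torch.randn(1, 256, 31, 31)     # deep layer: semantic information

fused_add = shallow + deep                     # element-wise addition, still 256 channels
fused_cat = torch.cat([shallow, deep], dim=1)  # channel concatenation, 512 channels

print(fused_add.shape)  # torch.Size([1, 256, 31, 31])
print(fused_cat.shape)  # torch.Size([1, 512, 31, 31])
```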
As shown in fig. 1, the twin network target tracking method based on the cascade feature fusion provided by the invention comprises the following steps:
S1, an improved five-stage ResNet-50 network is adopted as the backbone of the twin network to construct the target tracking framework. As the network deepens, ResNet-50 can extract deeper image feature information, and the shallow and deep features of the template image and the search image are extracted by the template branch and the search branch of the backbone network, respectively.
Residual networks greatly relax the limit on network depth, so the backbones of target detection and semantic segmentation tasks have gradually been replaced by ResNet structures, and high-resolution feature extraction is achieved by adding padding to the backbone. However, if the AlexNet backbone in a Siamese framework is simply replaced by VGG, ResNet or another deeper network, performance drops. The performance of a twin network tracker can be obviously improved by using a deeper model together with a more reasonable training strategy.
The improvements of the invention to the ResNet-50 network are as follows: 1) since middle- and shallow-level visual features perform well in Siamese trackers, to balance the accuracy and efficiency of target tracking, the original strides of the residual blocks Res4 and Res5 are reduced from 16 and 32 pixels to 8 pixels, and the receptive field is enlarged by dilated convolution; 2) the whole network is trained with a spatial-aware sampling strategy, which addresses the damage to strict translation invariance caused by padding in deep networks; 3) to reduce the number of parameters, the channels of the multi-layer feature maps are reduced to 256 by 1 × 1 convolutions; and since the spatial size of the template feature is 15 × 15, the central 7 × 7 region is cropped as the template feature to reduce the computational burden, while each feature cell can still capture the entire target region.
The template branch and the search branch of the backbone network have the same convolution structure and the same network parameters.
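The third improvement above (channel reduction plus cropping of the central template region) can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions: the module name AdjustLayer, the use of BatchNorm and the input channel count are illustrative, not details given in the patent.

```python
import torch
import torch.nn as nn

class AdjustLayer(nn.Module):
    """Reduce a backbone feature map to 256 channels with a 1x1 convolution
    and, for the template branch, crop the central 7x7 region."""
    def __init__(self, in_channels: int, out_channels: int = 256):
        super().__init__()
        self.downsample = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x: torch.Tensor, crop_center: bool = False) -> torch.Tensor:
        x = self.downsample(x)
        if crop_center and x.size(-1) >= 7:
            c = (x.size(-1) - 7) // 2
            x = x[:, :, c:c + 7, c:c + 7]  # keep the central 7x7 template region
        return x

# Example: a 15x15 template feature from Res4 (1024 channels in ResNet-50).
adjust = AdjustLayer(1024)
template_feat = adjust(torch.randn(1, 1024, 15, 15), crop_center=True)
print(template_feat.shape)  # torch.Size([1, 256, 7, 7])
```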
And S2, carrying out cascade fusion on the last three residual blocks Res3, Res4 and Res5 of the residual networks of the template branch and the search branch step by step respectively to obtain three characteristic maps R3, R4 and R5 of the two branches respectively for the subsequent target tracking process.
The shallow features mainly contain spatial information such as color, shape and edges, which matters for calibrating the target position, while the deep features contain more semantic information, which matters for handling similar distractors, occlusion and deformation during tracking. Therefore, the invention makes full use of the features of the last three stages of the ResNet-50 network and fuses them step by step.
The feature fusion module performs step-by-step cascade fusion of the last three residual blocks Res3, Res4 and Res5 of the residual network to extract image features. It realizes global feature extraction based on channel identification and effectively fuses the features of different layers through gradual fusion; the structure of the feature fusion module is shown in Fig. 2 and is described below taking R5 and Res4 as an example.
S2-1, the signature obtained by convolving the residual block Res5 by 3 × 3 is denoted as R5, and the signature keeps the spatial resolution unchanged and changes the number of channels to 256.
S2-2, firstly, sequentially carrying out convolution kernel operation on the output feature after passing through a residual block Res4 and a convolution kernel of a 3 x 3 convolution kernel (the number of channels is 256) and a Softmax function operation to sense the attention feature weight of each feature point in the overall feature, wherein the feature weight mainly refers to the weight of each feature point in the context feature; then, the feature map R5 is subjected to convolution kernel operation of a convolution kernel with 3 × 3 (with the number of channels being 256), a ReLU function and another convolution kernel with 3 × 3 (with the number of channels being 256) in sequence, so as to realize feature conversion and obtain the dependency between channels; finally, pixel-by-pixel addition and ReLU operation are adopted, the overall characteristics obtained by the operation and the characteristics identified by the channel are fused at each position, and a fusion result, namely a characteristic map R4 with richer semantics and the same resolution is obtained;
and S2-3, in the same step as the step S2-2, fusing the residual block Res3 and the feature map R4 to obtain a feature map R3.
By using a feature fusion mechanism, richer context feature information and feature graphs with the same resolution can be obtained, so that the effect of follow-up target tracking is improved.
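A minimal PyTorch sketch of one such fusion step is given below. It follows the textual description (a 3 × 3 convolution plus Softmax producing attention weights for the Res4 features, a 3 × 3 conv / ReLU / 3 × 3 conv transform of R5, and pixel-wise addition followed by ReLU); how the attention weights are applied back onto the features, and taking the Softmax over spatial positions, are assumptions rather than details given in the patent. The same module would be applied again to fuse Res3 with R4.

```python
import torch
import torch.nn as nn

class CascadeFusion(nn.Module):
    """One step of the cascade fusion (e.g. fusing the Res4 output with R5)."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.attention = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, res_feat: torch.Tensor, deeper_feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = res_feat.shape
        # Attention weight of each feature point (Softmax over spatial positions: an assumption).
        attn = self.attention(res_feat).view(b, c, h * w).softmax(dim=-1).view(b, c, h, w)
        weighted = res_feat * attn
        # Channel-wise transformation of the deeper feature map (e.g. R5).
        transformed = self.transform(deeper_feat)
        # Pixel-wise addition followed by ReLU gives the fused map (e.g. R4).
        return self.relu(weighted + transformed)

fuse = CascadeFusion()
r5 = torch.randn(1, 256, 15, 15)    # deeper feature map R5
res4 = torch.randn(1, 256, 15, 15)  # adjusted output of residual block Res4
r4 = fuse(res4, r5)
print(r4.shape)  # torch.Size([1, 256, 15, 15])
```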
S3, respectively carrying out cross-correlation calculation on three feature maps R3, R4 and R5 of the two branches, and then carrying out classification and regression on the features subjected to cross-correlation calculation through an anchor-free frame network.
The network is trained end to end; the backbone is pre-trained on the ImageNet-1K dataset and the whole network is trained with images from the ILSVRC dataset. The ILSVRC dataset contains approximately 4500 videos with more than one million annotations describing different tracking scenes. During training, a frame is randomly selected from the ILSVRC dataset, a 127 × 127 region containing the target is cropped as the target template, and a 255 × 255 search region is cropped from a search image to generate a training pair, with a maximum interval of 50 frames between the two frames. The classification and regression of each target and position are realized through this training.
In the target tracking process most samples are positive samples, and a padding method is used, which results in a loss of semantic information. Although existing training methods enhance the discrimination ability of the model, it still has difficulty distinguishing similar distractors in the image; therefore negative sample sampling is added in the target tracking process to learn similar distractors with different semantics.
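The following sketch outlines how such a training pair could be cropped. The 127 × 127 / 255 × 255 sizes and the 50-frame limit come from the text above; the context margin around the target (2× and 4× the larger box side) and the border replication are assumptions made for illustration. Drawing the two frames from different videos would give the negative pairs mentioned above.

```python
import random
import cv2
import numpy as np

def crop_square(img: np.ndarray, center: tuple, size: int, out_size: int) -> np.ndarray:
    """Crop a square patch of side `size` around `center` and resize it to `out_size`."""
    x, y = center
    x1, y1 = int(round(x - size / 2)), int(round(y - size / 2))
    pad = max(0, -x1, -y1, x1 + size - img.shape[1], y1 + size - img.shape[0])
    if pad > 0:  # replicate the border when the crop leaves the image (assumption)
        img = cv2.copyMakeBorder(img, pad, pad, pad, pad, cv2.BORDER_REPLICATE)
        x1, y1 = x1 + pad, y1 + pad
    return cv2.resize(img[y1:y1 + size, x1:x1 + size], (out_size, out_size))

def sample_pair(frames: list, boxes: list, max_interval: int = 50):
    """Pick two frames of the same video at most `max_interval` apart and crop a
    127x127 template and a 255x255 search region around the annotated targets."""
    i = random.randrange(len(frames))
    j = random.randrange(max(0, i - max_interval), min(len(frames), i + max_interval + 1))
    (tx, ty, tw, th), (sx, sy, sw, sh) = boxes[i], boxes[j]
    template = crop_square(frames[i], (tx + tw / 2, ty + th / 2), int(2 * max(tw, th)), 127)
    search = crop_square(frames[j], (sx + sw / 2, sy + sh / 2), int(4 * max(sw, sh)), 255)
    return template, search
```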
In the target tracking process, the method for distinguishing the positive sample and the negative sample of the sample comprises the following steps:
For the real (ground-truth) box marked on each image in the training set, Tw denotes the width, Th the height, (x1, y1) the coordinates of the upper left corner, (x0, y0) the center coordinates and (x2, y2) the coordinates of the lower right corner.
Centered at (x0, y0), an outer ellipse E1 is constructed whose semi-axis lengths a1 and b1 are determined by Tw and Th:
(xi − x0)² / a1² + (yj − y0)² / b1² = 1    (1)
where (xi, yj) denotes the coordinate position of a sampling point.
Then, again centered at (x0, y0), an inner ellipse E2 is constructed with smaller semi-axis lengths a2 and b2, likewise determined by Tw and Th:
(xi − x0)² / a2² + (yj − y0)² / b2² = 1    (2)
A sampling point (xi, yj) is a positive sample if it lies inside E2, a negative sample if it lies outside E1, and is ignored if it lies between the two ellipses; the positions marked as positive samples are used for tracking-box regression.
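A sketch of this ellipse-based label assignment is shown below. The semi-axis fractions outer_frac and inner_frac are placeholders: the actual axis lengths of equations (1) and (2) are given in the original as equation images that are not reproduced here, so the values 0.5 and 0.25 are assumptions.

```python
import numpy as np

def assign_labels(points: np.ndarray, box: tuple,
                  outer_frac: float = 0.5, inner_frac: float = 0.25) -> np.ndarray:
    """Ellipse-based label assignment for an anchor-free head.

    `points` is an (N, 2) array of (x, y) sample positions and `box` is the
    ground-truth (x1, y1, x2, y2). Points inside the inner ellipse E2 are
    positive (+1), points outside the outer ellipse E1 are negative (0),
    and points in between are ignored (-1)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    tw, th = x2 - x1, y2 - y1
    dx, dy = points[:, 0] - cx, points[:, 1] - cy

    e1 = (dx / (outer_frac * tw)) ** 2 + (dy / (outer_frac * th)) ** 2  # outer ellipse E1
    e2 = (dx / (inner_frac * tw)) ** 2 + (dy / (inner_frac * th)) ** 2  # inner ellipse E2

    labels = np.full(len(points), -1, dtype=np.int64)  # ignored by default
    labels[e2 <= 1.0] = 1   # inside E2: positive
    labels[e1 > 1.0] = 0    # outside E1: negative
    return labels

# Example: label a 25x25 grid of response-map positions mapped back to the image.
xs, ys = np.meshgrid(np.arange(25) * 8 + 31, np.arange(25) * 8 + 31)
pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
print(np.bincount(assign_labels(pts, (60, 80, 160, 180)) + 1))  # counts of ignored/negative/positive
```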
The regression target calculated in the regression branch is represented by the distances from the target position to the boundaries of the tracking box, which are calculated as follows:
l = x − x1, t = y − y1, r = x2 − x, b = y2 − y    (3)
wherein l, t, r and b respectively represent the distances from the target position (x, y) to the left, top, right and bottom boundaries of the tracking box.
Then the IoU (Intersection over Union) between the predicted tracking box and the real box is calculated; only the IoU of positive samples is computed, and the IoU is set to 0 in all other cases. The regression loss function is then defined as the IoU loss accumulated over the positive positions:
Lreg = Σ(x, y) [1 − IoU(P(x, y), G(x, y))]    (4)
wherein P(x, y) denotes the tracking box predicted at position (x, y) and G(x, y) represents the target real border.
The classification loss Lcls is expressed using the SmoothL1 loss function:
SmoothL1(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise    (6)
The overall loss function of the network is expressed as:
L = λ1·Lcls + λ2·Lreg    (7)
wherein λ1 and λ2 are hyperparameters. After repeated parameter-tuning experiments, λ1 = 1 and λ2 = 2 are adopted.
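The sketch below shows one way the combined objective of equation (7) could be evaluated, with λ1 = 1 and λ2 = 2 as stated above. Treating the regression term as the mean IoU loss over positive positions and applying SmoothL1 directly to the classification scores are assumptions made for illustration.

```python
import torch

def iou(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """IoU between boxes given as (l, t, r, b) distances from the same positions."""
    pred_area = (pred[:, 0] + pred[:, 2]) * (pred[:, 1] + pred[:, 3])
    gt_area = (gt[:, 0] + gt[:, 2]) * (gt[:, 1] + gt[:, 3])
    inter_w = torch.min(pred[:, 0], gt[:, 0]) + torch.min(pred[:, 2], gt[:, 2])
    inter_h = torch.min(pred[:, 1], gt[:, 1]) + torch.min(pred[:, 3], gt[:, 3])
    inter = inter_w.clamp(min=0) * inter_h.clamp(min=0)
    return inter / (pred_area + gt_area - inter + 1e-6)

def total_loss(cls_pred, cls_target, reg_pred, reg_target, pos_mask,
               lambda1: float = 1.0, lambda2: float = 2.0) -> torch.Tensor:
    """Overall loss L = lambda1 * Lcls + lambda2 * Lreg (equation (7));
    Lcls uses SmoothL1 and Lreg is an IoU loss over positive positions only."""
    l_cls = torch.nn.functional.smooth_l1_loss(cls_pred, cls_target)
    if pos_mask.any():
        l_reg = (1.0 - iou(reg_pred[pos_mask], reg_target[pos_mask])).mean()
    else:
        l_reg = reg_pred.sum() * 0.0  # no positives: regression term vanishes
    return lambda1 * l_cls + lambda2 * l_reg

# Toy example with 5 candidate positions, 2 of them positive.
cls_pred, cls_target = torch.rand(5), torch.tensor([1.0, 1.0, 0.0, 0.0, 0.0])
reg_pred, reg_target = torch.rand(5, 4) * 20, torch.rand(5, 4) * 20
loss = total_loss(cls_pred, cls_target, reg_pred, reg_target,
                  pos_mask=torch.tensor([True, True, False, False, False]))
print(loss.item())
```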
Most twin-network-based target tracking methods use the first-frame image as the template, judge whether a candidate is the tracked target through similarity matching with subsequent frames, and never update the template during tracking. Because a fixed template is used, when the target undergoes severe rotation, occlusion, deformation or similar changes, the template matching similarity becomes low and tracking fails; template updating is therefore very necessary in the target tracking process. However, updating the template every frame would, on the one hand, cause tracking drift due to overly frequent updates and, on the other hand, reduce the overall real-time performance of the network.
To solve the problem of dynamic template updating, the invention learns a template-updating sub-network with a simple recurrent neural network (RNN); the template update formula is:
Si = F(S0, Ti, Si−1)    (8)
wherein S0 is the template of the first frame, which remains the most reliable template throughout the update process, Ti is the template extracted from the i-th frame, Si−1 is the accumulated historical template up to frame i−1, Si is the best updated template to be matched in the next frame, and F is the activation function.
For the first frame, Ti and Si−1 are both set to S0; the updated template therefore depends not only on the template of the previous frame but also on the template extracted from the current frame.
For threshold-based updating, the invention evaluates the Average Peak Correlation Energy (APCE) to decide whether to update the template. The APCE is calculated as:
APCE = |Fmax − Fmin|² / mean( Σw,h (Fw,h − Fmin)² )    (9)
wherein Fmax and Fmin are the maximum and minimum values in the response map and Fw,h is the response value at coordinate (w, h).
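A direct transcription of equation (9) into code, as a minimal sketch:

```python
import torch

def apce(response: torch.Tensor) -> torch.Tensor:
    """Average Peak Correlation Energy of a 2-D response map (equation (9))."""
    f_max, f_min = response.max(), response.min()
    return (f_max - f_min) ** 2 / ((response - f_min) ** 2).mean()

# A sharp, single-peaked response yields a much larger APCE than a flat or multi-peak one.
sharp = torch.zeros(17, 17); sharp[8, 8] = 1.0
flat = torch.rand(17, 17)
print(apce(sharp).item(), apce(flat).item())
```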
Under normal conditions, when the foreground target is tracked well, the peak of the response map is high, the APCE value is large and the response map is unimodal; when the target changes drastically or is occluded, the APCE value drops and multiple peaks appear. To avoid updating the template too frequently, the invention restricts updating by setting a similarity threshold between the old and new templates. Whether to update is first judged through an APCE threshold, formula (10), which compares the current APCE with its historical mean mean(APCE) under a preset threshold η; when formula (10) is satisfied, the target has probably undergone a large change.
To prevent misjudgment, the template similarity is then compared: the ratio of the response values of the convolution operations between the templates is taken as the similarity S, formula (11).
If formula (10) is satisfied and the similarity between the templates is smaller than the threshold set by formula (11), the template is updated through formula (8).
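The decision logic can be sketched as follows. The exact forms of formulas (10) and (11) are given as images in the original and are not reproduced here, so the comparison APCE < η · mean(APCE), the way the similarity value is obtained, and the stand-in for the learned sub-network F of equation (8) are all assumptions, marked as such in the comments.

```python
import torch

def should_update(apce_value: float, apce_history: list, eta: float,
                  similarity: float, sim_threshold: float) -> bool:
    """Gate the template update with the APCE test (formula (10)) and the
    template-similarity test (formula (11))."""
    apce_mean = sum(apce_history) / max(len(apce_history), 1)
    large_change = apce_value < eta * apce_mean          # assumed form of formula (10)
    return large_change and similarity < sim_threshold   # update only when both hold

def update_template(s0: torch.Tensor, ti: torch.Tensor, si_prev: torch.Tensor) -> torch.Tensor:
    """Placeholder for the learned sub-network F of equation (8); a simple
    weighted combination stands in for the recurrent update (an assumption)."""
    return torch.tanh(0.5 * s0 + 0.3 * ti + 0.2 * si_prev)

# Example: decide whether to refresh the template for the current frame.
s0 = torch.randn(256, 7, 7); si = s0.clone(); ti = torch.randn(256, 7, 7)
if should_update(apce_value=8.0, apce_history=[20.0, 22.0, 19.0], eta=0.6,
                 similarity=0.4, sim_threshold=0.5):
    si = update_template(s0, ti, si)
```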
The dynamic template makes full use of the rich information in historical frames and builds a more stable model, giving the network stronger robustness to drastic target changes, especially under occlusion.
The effectiveness of the present invention is further illustrated by comparative experiments.
1) Experimental environment: the algorithm runs on a platform configured with an Intel(R) Xeon(R) CPU E5-2660 v2 @ 3.50 GHz × 40, two NVIDIA GTX 1080Ti GPUs and 24 GB of total memory.
The invention trains on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset and uses the LaSOT dataset to train the template updating module. End-to-end training is performed on the ILSVRC video dataset, which can be safely used to train a deep tracking model without overfitting to the video domains used by the tracking benchmarks. Two frames containing the same object are randomly selected; before entering the tracking network, the template frame is resized to 127 × 127 and the search frame to 255 × 255. LaSOT is a large video dataset with 1400 sequences, 280 of which form the test set. It provides dense, high-quality annotations and contains many deformation and occlusion cases, which is convenient for training the template update; 20 sequences from 20 categories are randomly selected from the LaSOT dataset as the training set for the template-updating sub-network.
The algorithm is evaluated on the widely used OTB2015, VOT2016 and VOT2018 benchmark datasets and compared with current mainstream algorithms to test its accuracy and robustness. Again, before entering the tracking network, the template frame is resized to 127 × 127 and the search frame to 255 × 255. OTB2015 is one of the most common benchmarks for visual target tracking, with 100 fully annotated video sequences; two evaluation indices are used on this dataset, the tracking precision and the area under the curve (AUC) of the success-rate plot. VOT2016 and VOT2018 are widely used benchmarks for visual target tracking; both contain 60 sequences with different challenge factors, and the VOT2018 dataset is labeled with rotated tracking boxes and evaluated using a reset-based approach.
2) Quantitative experiments on the OTB2015 dataset
For the experiments on the OTB2015 benchmark dataset, the proposed method is evaluated mainly through tracking precision and success rate.
Let the center of the predicted target box be (xp, yp) and the center of the real bounding box be (xr, yr). The tracking precision is measured by the Euclidean distance between the two:
d = sqrt( (xp − xr)² + (yp − yr)² )
A smaller d indicates higher tracking precision. The precision criterion is the proportion of frames whose Euclidean distance d is smaller than a set threshold T among all tracked frames; in the invention T is set to 20 pixels.
The target tracking success rate is based on the IoU between the area of the predicted target box, Areap, and the area of the real target bounding box, Arear:
IOU = |Areap ∩ Arear| / |Areap ∪ Arear|
A larger IoU indicates a higher tracking success rate. The success-rate plot shows the proportion of video frames whose overlap is greater than a threshold t, t ∈ [0, 1]; in the invention the threshold t is taken as 0.5.
Both the tracking precision and the tracking success rate reported in the invention are calculated from the area under the curve (AUC) scores.
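For completeness, a sketch of how the precision and success metrics above could be computed for a sequence of predicted and ground-truth boxes (boxes are assumed to be in (x, y, w, h) format):

```python
import numpy as np

def center_error(pred_boxes: np.ndarray, gt_boxes: np.ndarray) -> np.ndarray:
    """Euclidean distance between predicted and ground-truth box centers."""
    pc = pred_boxes[:, :2] + pred_boxes[:, 2:] / 2
    gc = gt_boxes[:, :2] + gt_boxes[:, 2:] / 2
    return np.linalg.norm(pc - gc, axis=1)

def overlap(pred_boxes: np.ndarray, gt_boxes: np.ndarray) -> np.ndarray:
    """IoU between predicted and ground-truth boxes."""
    x1 = np.maximum(pred_boxes[:, 0], gt_boxes[:, 0])
    y1 = np.maximum(pred_boxes[:, 1], gt_boxes[:, 1])
    x2 = np.minimum(pred_boxes[:, 0] + pred_boxes[:, 2], gt_boxes[:, 0] + gt_boxes[:, 2])
    y2 = np.minimum(pred_boxes[:, 1] + pred_boxes[:, 3], gt_boxes[:, 1] + gt_boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred_boxes[:, 2] * pred_boxes[:, 3] + gt_boxes[:, 2] * gt_boxes[:, 3] - inter
    return inter / np.maximum(union, 1e-9)

def precision_at(pred, gt, threshold=20.0):
    """Fraction of frames whose center error is below `threshold` pixels."""
    return float((center_error(pred, gt) <= threshold).mean())

def success_at(pred, gt, threshold=0.5):
    """Fraction of frames whose overlap exceeds `threshold`."""
    return float((overlap(pred, gt) > threshold).mean())
```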
Ablation experiment: to evaluate the effectiveness and accuracy of the algorithm, 7 mainstream tracking algorithms are selected for comparison with the proposed algorithm, namely SiamRPN, DaSiamRPN, SiamRPN++, GradNet, SiamVGG, SiamFC and FDSST. The comparative results are shown in Fig. 3, where Fig. 3(a) is the success-rate comparison and Fig. 3(b) is the precision comparison.
As can be seen from Fig. 3, the success rate and precision of the proposed algorithm are 0.702 and 0.749, respectively. The success rate is 0.084 higher than the baseline SiamRPN, 0.02 higher than SiamVGG and 0.031 higher than DaSiamRPN; the precision is 0.037 higher than SiamRPN, 0.011 higher than SiamRPN++ and 0.018 higher than DaSiamRPN.
The experimental comparison shows that the proposed algorithm is clearly improved in both precision and success rate, indicating that the cascade feature fusion and template updating mechanisms of the invention are effective. Meanwhile, the algorithm runs at 41 fps on the OTB2015 dataset, which is sufficient for stable real-time tracking of the target.
Quantitative experiments: to further demonstrate the adaptability of the proposed algorithm to complex environments, further quantitative experiments were carried out. The OTB2015 benchmark dataset covers 11 attribute scenarios: illumination variation, scale variation, occlusion, deformation, motion blur, fast motion, in-plane rotation, out-of-plane rotation, out-of-view, background clutter and low resolution. The comparison of the proposed algorithm with the above 7 algorithms in these 11 scenarios is shown by precision plots in Fig. 4, where Fig. 4(a) is the comparison in the background-clutter scene, Fig. 4(b) in the illumination-variation scene, Fig. 4(c) in the deformation scene, Fig. 4(d) in the scale-variation scene, Fig. 4(e) in the occlusion scene, Fig. 4(f) in the out-of-view scene, Fig. 4(g) in the fast-motion scene, Fig. 4(h) in the motion-blur scene, Fig. 4(i) in the in-plane-rotation scene, Fig. 4(j) in the out-of-plane-rotation scene and Fig. 4(k) in the low-resolution scene.
As can be seen from Fig. 4, the proposed algorithm ranks second, with relatively lower precision, in two scenes, occlusion and low resolution, and outperforms the other 7 algorithms in the remaining 9 scenes, which fully demonstrates its effectiveness.
When illumination variation, occlusion, deformation, rotation or similar-background interference occurs, the semantics of the target change with the scene. The cascade features deepen the extraction of semantic features, so the semantic information of the target is richer and the precision of the algorithm is higher: 0.896 under illumination variation, 0.830 under occlusion, 0.867 under deformation, 0.892 for in-plane rotation, 0.881 for out-of-plane rotation and 0.833 under similar-background interference. The per-scene precision comparison in Fig. 4 shows relatively high tracking precision, which fully indicates that the template update in the proposed algorithm has a positive effect: the tracker obtains more effective and accurate semantic information and updates the template in time, realizing accurate and effective tracking.
Qualitative analysis experiment: in this experiment, the proposed algorithm is compared with SiamRPN++, DaSiamRPN and SiamFC. From OTB2015, 4 groups of video sequences with representative scenes are selected: ClifBar, Jogging, Lemming and MotorRolling. These four groups of sequences contain complex scenes such as motion blur, target rotation, scale change, similarity between target and background, illumination change and occlusion; the tracking results of the compared algorithms are shown in Fig. 5, where Fig. 5(a) shows the ClifBar (frames 80, 156, 260, 461) sequence, Fig. 5(b) the Jogging (30, 45, 53, 63) sequence, Fig. 5(c) the Lemming (218, 298, 345, 380, 986) sequence and Fig. 5(d) the MotorRolling (1, 25, 42, 71) sequence.
Complex conditions such as similar-background interference, target rotation, target scale change, illumination change, motion blur and occlusion appear in the ClifBar, Lemming and MotorRolling video sequences. As can be seen from Fig. 5, the proposed algorithm effectively extracts the semantic and positional features of the target through cascade feature fusion and enhances the accurate expression of its important features, so it can still localize the target accurately and track it effectively under these conditions. The tracking accuracy of the SiamRPN and DaSiamRPN algorithms is relatively poor, and their overlap and success rates drop under blur and rotation; SiamFC loses the target, and although it re-locates and re-tracks the target under the subsequent simpler background, its overall performance is poor.
In the Jogging and Lemming video sequences, which mainly verify performance under occlusion, the proposed algorithm adopts a template updating mechanism and can therefore still track the target accurately when it is occluded, whereas the SiamRPN, DaSiamRPN and SiamFC algorithms lose the target; when the target reappears, SiamRPN and DaSiamRPN manage to re-track it but with a low overlap rate, while SiamFC fails completely.
The above qualitative analysis shows that the proposed algorithm adapts effectively to changes in complex environments, further proving its effectiveness and its strong robustness in coping with such environments.
3) Experiments on the VOT2016 and VOT2018 datasets
To verify that the proposed algorithm can cope with illumination change, occlusion, scale change, similar background, target rotation and other challenges under complex conditions, its performance on VOT2016 and VOT2018 is tested and compared with advanced algorithms of recent years. The assessment is performed with the official VOT (Visual Object Tracking) toolkit, and the evaluation metrics include Accuracy, Robustness and Expected Average Overlap (EAO).
The test results are shown in Table 1, and the EAO comparison of the different algorithms on VOT2016 and VOT2018 is shown in Fig. 6. As can be seen from Table 1, on VOT2016 the proposed algorithm outperforms DaSiamRPN, SPM and the other algorithms; it matches SPM in accuracy and ECO in robustness while surpassing the remaining trackers, and improves on the baseline SiamRPN by 6% in accuracy and 6% in robustness. On VOT2018 its accuracy is slightly lower than SiamRPN++ and ranks second, on a par with SiamRPN++, while improving on the baseline SiamRPN by 10% in accuracy and 23% in robustness. The EAO comparison in Fig. 6 shows that the proposed algorithm is higher than the other algorithms on VOT2016 and slightly lower than SiamRPN++ on VOT2018, ranking second. From this analysis, the proposed algorithm is highly competitive among the compared trackers.
TABLE 1 test results of different algorithms on VOT2016 and VOT2018
In conclusion, the invention provides an end-to-end twin network target tracking method based on cascade feature fusion. ResNet-50 is used as the backbone network and is improved by reducing model parameters and increasing computation speed, which strengthens the feature extraction capability of the tracker. The features of the last three stages of ResNet-50 are then fused step by step in a cascade manner by the feature fusion module, realizing an effective fusion of the shallow appearance features and the deep semantic features of the target and improving target identification and localization. Meanwhile, to address template degradation and adapt in real time to changes in the appearance and state of the target, a template updating mechanism is introduced and the update is governed by a similarity threshold. The model training compensates for the shortcomings of individual features in tracking, and experiments on OTB2015, VOT2016 and VOT2018 show that the proposed cascade feature fusion network effectively improves the generality of the tracker and performs excellently in complex scenes such as fast motion, motion blur, occlusion, similar background, illumination change and deformation.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A twin network target tracking method based on cascade feature fusion is characterized by comprising the following steps:
s1, adopting an improved five-stage ResNet-50 network as a backbone network of a twin network, and respectively extracting shallow and deep features of a template image and a search image by utilizing a template branch and a search branch of the backbone network;
s2, carrying out cascade fusion on the last three residual blocks Res3, Res4 and Res5 of the residual error network of the template branch and the search branch respectively to obtain three characteristic maps R3, R4 and R5 of the two branches respectively;
s3, respectively carrying out cross-correlation calculation on three feature maps R3, R4 and R5 of the two branches, and then carrying out classification and regression on the features subjected to cross-correlation calculation through an anchor-free frame network.
2. The twin network target tracking method based on cascade feature fusion as claimed in claim 1, wherein in the step S1, the improvement of the ResNet-50 network comprises: reducing the original strides of the residual blocks Res4 and Res5 from 16 and 32 pixels to 8 pixels and enlarging the receptive field by dilated convolution; training the whole network with a spatial-aware sampling strategy; and reducing the channels of the multi-layer feature maps to 256 by 1 × 1 convolutions and cropping the central 7 × 7 region of the template feature, where each feature cell can still capture the entire target region.
3. The twin network target tracking method based on cascade feature fusion as claimed in claim 1, wherein in step S1, the template branch and the search branch of the backbone network have the same convolution structure and the same network parameters.
4. The twin network target tracking method based on cascade feature fusion as claimed in claim 1, wherein in the step S2, the cascade fusion of three residual blocks Res3, Res4 and Res5 is performed in a stepwise manner, comprising the following steps:
s2-1, marking a feature map obtained by 3-by-3 convolution of the residual block Res5 as R5;
S2-2, firstly, the output features of the residual block Res4 are passed sequentially through a 3 × 3 convolution (256 channels) and a Softmax function; then the feature map R5 is passed sequentially through a 3 × 3 convolution (256 channels), a ReLU function and another 3 × 3 convolution (256 channels); finally, pixel-by-pixel addition and a ReLU operation fuse the resulting global features with the channel-identified features at each position to obtain the feature map R4;
and S2-3, in the same step as the step S2-2, fusing the residual block Res3 and the feature map R4 to obtain a feature map R3.
5. The twin network target tracking method based on cascade feature fusion of claim 1, wherein in step S3, the backbone network is pre-trained on the ImageNet-1K dataset and the whole network is trained with images from the ILSVRC dataset; during training, a frame is randomly selected from the ILSVRC dataset, a 127 × 127 region containing the target is cropped as the target template, and a 255 × 255 search region is cropped from a search image to generate a training pair, with a maximum interval of 50 frames; the classification and regression of each target and position are realized through this training.
6. The twin network target tracking method based on cascade feature fusion as claimed in claim 1, wherein in step S3, negative sample sampling is added in the target tracking process.
7. The twin network target tracking method based on cascade feature fusion as claimed in claim 6, wherein in the target tracking process, the positive and negative samples of the samples are distinguished by:
For the real (ground-truth) box marked on each image in the training set, Tw denotes the width, Th the height, (x1, y1) the coordinates of the upper left corner, (x0, y0) the center coordinates and (x2, y2) the coordinates of the lower right corner;
centered at (x0, y0), an outer ellipse E1 is constructed whose semi-axis lengths a1 and b1 are determined by Tw and Th:
(xi − x0)² / a1² + (yj − y0)² / b1² = 1    (1)
wherein (xi, yj) denotes the coordinate position of a sampling point;
then, again centered at (x0, y0), an inner ellipse E2 is constructed with smaller semi-axis lengths a2 and b2, likewise determined by Tw and Th:
(xi − x0)² / a2² + (yj − y0)² / b2² = 1    (2)
a sampling point (xi, yj) is a positive sample if it lies inside E2, a negative sample if it lies outside E1, and is ignored if it lies between the two ellipses; the positions marked as positive samples are used for tracking-box regression.
8. The twin network target tracking method based on cascade feature fusion according to claim 1, wherein in the step S3, the regression target calculated in the regression branch is represented by the distances from the target position to the boundaries of the tracking box;
the tracking box is calculated as follows:
l = x − x1, t = y − y1, r = x2 − x, b = y2 − y    (3)
wherein l, t, r and b respectively represent the distances from the target position (x, y) to the left, top, right and bottom boundaries of the tracking box;
then the IoU between the predicted tracking box and the real box is calculated; only the IoU of positive samples is calculated, and the IoU in all other cases is set to 0; the regression loss function is then defined as the IoU loss accumulated over the positive positions:
Lreg = Σ(x, y) [1 − IoU(P(x, y), G(x, y))]    (4)
wherein P(x, y) denotes the tracking box predicted at position (x, y) and G(x, y) represents the target real border;
the classification loss Lcls is expressed using the SmoothL1 loss function:
SmoothL1(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise    (6)
the overall loss function of the network is expressed as:
L = λ1·Lcls + λ2·Lreg    (7)
wherein λ1 and λ2 are hyperparameters.
9. The twin network target tracking method based on cascade feature fusion as claimed in claim 1, wherein in step S3, a template updating mechanism is introduced in the target tracking process, and a similarity threshold method is used to update the template in real time; the template dynamic updating strategy comprises the following steps:
learning a template-updating sub-network by using a simple recurrent neural network, wherein the template updating formula is:
Si = F(S0, Ti, Si−1)    (8)
wherein S0 is the template of the first frame, which remains the most reliable template throughout the update process, Ti is the template extracted from the i-th frame, Si−1 is the accumulated historical template up to frame i−1, Si is the best updated template to be matched in the next frame, and F is the activation function; for the first frame, Ti and Si−1 are both set to S0;
for threshold-based updating, the Average Peak Correlation Energy (APCE) is evaluated against a threshold to decide whether to update the template, the APCE being calculated as:
APCE = |Fmax − Fmin|² / mean( Σw,h (Fw,h − Fmin)² )    (9)
wherein Fmax and Fmin are the maximum and minimum values in the response map and Fw,h is the response value at coordinate (w, h).
10. The twin network target tracking method based on cascade feature fusion as claimed in claim 9, wherein updating is restricted by setting a similarity threshold between the new template and the old template: whether to update is judged through an APCE threshold, formula (10), which compares the current APCE with its historical mean mean(APCE) under a preset threshold η; when formula (10) is satisfied, the target has probably undergone a large change;
the template similarity is then compared to prevent misjudgment: the ratio of the response values of the convolution operations between the templates is taken as the similarity S, formula (11);
and if formula (10) is satisfied and the similarity between the templates is smaller than the threshold set by formula (11), the template is updated through formula (8).
CN202111175907.8A 2021-10-09 2021-10-09 Twin network target tracking method based on cascade characteristic fusion Pending CN113902991A (en)

Priority Applications (1)

Application Number: CN202111175907.8A | Priority date: 2021-10-09 | Filing date: 2021-10-09 | Title: Twin network target tracking method based on cascade feature fusion (CN113902991A)

Applications Claiming Priority (1)

Application Number: CN202111175907.8A | Priority date: 2021-10-09 | Filing date: 2021-10-09 | Title: Twin network target tracking method based on cascade feature fusion (CN113902991A)

Publications (1)

Publication Number Publication Date
CN113902991A true CN113902991A (en) 2022-01-07

Family

ID=79190693

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111175907.8A Pending CN113902991A (en) 2021-10-09 2021-10-09 Twin network target tracking method based on cascade characteristic fusion

Country Status (1)

Country Link
CN (1) CN113902991A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114926498A (en) * 2022-04-26 2022-08-19 电子科技大学 Rapid target tracking method based on space-time constraint and learnable feature matching
CN114926498B (en) * 2022-04-26 2023-05-23 电子科技大学 Rapid target tracking method based on space-time constraint and leachable feature matching
WO2024093209A1 (en) * 2022-11-05 2024-05-10 北京化工大学 Method for dynamic target tracking by legged robot
CN116128533A (en) * 2023-03-06 2023-05-16 广西螺霸王食品科技有限公司 Food sales data management system
CN116128533B (en) * 2023-03-06 2023-07-28 广西螺霸王食品科技有限公司 Food sales data management system
CN116644351A (en) * 2023-06-13 2023-08-25 石家庄学院 Data processing method and system based on artificial intelligence
CN116644351B (en) * 2023-06-13 2024-04-02 石家庄学院 Data processing method and system based on artificial intelligence
CN117197249A (en) * 2023-11-08 2023-12-08 北京观微科技有限公司 Target position determining method, device, electronic equipment and storage medium
CN117197249B (en) * 2023-11-08 2024-01-30 北京观微科技有限公司 Target position determining method, device, electronic equipment and storage medium
CN117252904A (en) * 2023-11-15 2023-12-19 南昌工程学院 Target tracking method and system based on long-range space perception and channel enhancement
CN117252904B (en) * 2023-11-15 2024-02-09 南昌工程学院 Target tracking method and system based on long-range space perception and channel enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination