CN113902991A - Twin network target tracking method based on cascade feature fusion

Twin network target tracking method based on cascade feature fusion

Info

Publication number
CN113902991A
Authority
CN
China
Prior art keywords
template, network, target, target tracking, cascade
Prior art date
2021-10-09
Legal status
Pending
Application number
CN202111175907.8A
Other languages
Chinese (zh)
Inventor
韩明
王敬涛
孟军英
杨争艳
Current Assignee
Shijiazhuang University
Original Assignee
Shijiazhuang University
Priority date
2021-10-09
Filing date
2021-10-09
Publication date
2022-01-07
Application filed by Shijiazhuang University filed Critical Shijiazhuang University
Priority to CN202111175907.8A
Publication of CN113902991A
Legal status: Pending

Classifications

    • G06F18/214: Pattern recognition; analysing; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/253: Pattern recognition; analysing; fusion techniques of extracted features
    • G06N3/045: Computing arrangements based on biological models; neural networks; combinations of networks
    • G06T7/11: Image analysis; segmentation; region-based segmentation
    • G06T7/246: Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments

Abstract

The invention provides a twin network target tracking method based on cascade feature fusion, which comprises the following steps: an improved five-stage ResNet-50 network is adopted as the backbone of the twin network, and the template branch and the search branch of the backbone are used to extract shallow and deep features of the template image and the search image, respectively; the last three residual blocks Res3, Res4 and Res5 of the residual network are fused in a cascade manner in both the template branch and the search branch, giving three feature maps R3, R4 and R5 for each branch; cross-correlation is computed between the corresponding feature maps R3, R4 and R5 of the two branches, and the cross-correlated features are then classified and regressed through an anchor-free network. The invention effectively fuses the shallow appearance features and the deep semantic features of the target and improves target identification and localization.

Description

Twin network target tracking method based on cascade feature fusion
Technical Field
The invention belongs to the technical field of target tracking, and particularly relates to a twin network target tracking method based on cascade feature fusion.
Background
Target tracking is one of the main research topics in computer vision. It has attracted increasing attention and is widely applied in intelligent traffic management, video surveillance, autonomous driving, military reconnaissance and other fields. The task of target tracking is to estimate the trajectory of a target in an image sequence. Most current target tracking algorithms rely on the first frame, so the tracker must be built from very limited training data while remaining adaptable to various appearance changes. However, under illumination change, target rotation, scale change, similar-background interference, occlusion and other conditions, it is difficult to extract rich feature information during tracking, which easily causes tracking drift or tracking loss. The most popular target trackers at present are based on deep learning and correlation filters.
With the development of deep learning, trackers based on the Siamese (twin) network architecture have attracted wide attention for their excellent tracking performance, especially the good balance between tracking accuracy and speed. A twin network algorithm uses two network branches to extract the features of the target and of the candidate targets respectively, and converts the target tracking problem into a similarity calculation problem.
Attention mechanisms and twin networks are widely applied to various target tracking tasks. Although twin-network-based target tracking has developed considerably, visual target tracking algorithms still suffer from several problems. First, most Siamese trackers use a shallow classification network (such as AlexNet) as the backbone and fail to exploit the stronger feature extraction capability of deeper network structures. Second, during matching and tracking, only the last layer of features, which contains more semantic information, is used, and the influence of low-level spatial features on tracking performance has not been fully explored; although some algorithms adopt feature fusion, most are limited to fusing channel and spatial features or simply applying deep and shallow features, so the resolution of the deep features is low and the semantic information is insufficiently exploited. Finally, most algorithms rely on the first frame as the template image; when illumination change, target deformation, similar-background interference or occlusion occurs, the template easily fails and the target is lost.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a twin network target tracking method based on cascade feature fusion.
In order to solve the above technical problems, the present invention comprises:
a twin network target tracking method based on cascade feature fusion comprises the following steps:
s1, adopting an improved five-stage ResNet-50 network as a backbone network of a twin network, and respectively extracting shallow and deep features of a template image and a search image by utilizing a template branch and a search branch of the backbone network;
s2, carrying out cascade fusion on the last three residual blocks Res3, Res4 and Res5 of the residual error network of the template branch and the search branch respectively to obtain three characteristic maps R3, R4 and R5 of the two branches respectively;
s3, respectively carrying out cross-correlation calculation on three feature maps R3, R4 and R5 of the two branches, and then carrying out classification and regression on the features subjected to cross-correlation calculation through an anchor-free frame network.
Further, in step S1, the improvement on the ResNet-50 network includes: reducing the original strides of the residual blocks Res4 and Res5 from 16 and 32 pixels to 8 pixels and enlarging the receptive field by dilated convolution; training the whole network with a spatial-aware sampling strategy; and reducing the channels of the multi-layer feature maps to 256 by 1 × 1 convolutions and cropping the central 7 × 7 region of the template feature, where each feature cell can still capture the entire target region.
Further, in step S1, the template branch and the search branch of the backbone network have the same convolution structure and the same network parameters.
Further, in the step S2, the step-by-step cascade fusion of the three residual blocks Res3, Res4 and Res5 includes the following steps:
s2-1, marking a feature map obtained by 3-by-3 convolution of the residual block Res5 as R5;
S2-2, firstly, the output features of the residual block Res4 are passed sequentially through a 3 × 3 convolution (256 channels) and a Softmax function; then the feature map R5 is passed sequentially through a 3 × 3 convolution (256 channels), a ReLU function and another 3 × 3 convolution (256 channels); finally, pixel-by-pixel addition and a ReLU operation fuse the resulting global features with the channel-identified features at each position to obtain the feature map R4;
and S2-3, in the same step as the step S2-2, fusing the residual block Res3 and the feature map R4 to obtain a feature map R3.
Further, in step S3, the backbone network is pre-trained on the ImageNet-1K dataset, and the whole network is trained with images from the ILSVRC dataset; during training, a frame is randomly selected from the ILSVRC dataset, a 127 × 127 region containing the target is cropped as the target template, and a 255 × 255 search region is cropped from a search image to generate a training pair, with a maximum interval of 50 frames; the classification and regression of each target and position are realized through this training.
Further, in the step S3, negative sample sampling is added in the target tracking process.
Further, in the target tracking process, the method for distinguishing the positive sample and the negative sample of the sample comprises the following steps:
For the real (ground-truth) box marked on each image in the training set, Tw denotes the width, Th the height, (x1, y1) the coordinates of the upper left corner, (x0, y0) the center coordinates and (x2, y2) the coordinates of the lower right corner.
Centered at (x0, y0), an outer ellipse E1 is constructed whose semi-axis lengths a1 and b1 are determined by Tw and Th:
(xi − x0)² / a1² + (yj − y0)² / b1² = 1    (1)
where (xi, yj) denotes the coordinate position of a sampling point.
Then, again centered at (x0, y0), an inner ellipse E2 is constructed with smaller semi-axis lengths a2 and b2, likewise determined by Tw and Th:
(xi − x0)² / a2² + (yj − y0)² / b2² = 1    (2)
A sampling point (xi, yj) is a positive sample if it lies inside E2, a negative sample if it lies outside E1, and is ignored if it lies between the two ellipses; the positions marked as positive samples are used for tracking-box regression.
Further, in step S3, the regression target calculated in the regression branch is represented by the distances from the target position to the boundaries of the tracking box;
the tracking box is calculated as follows:
l = x − x1, t = y − y1, r = x2 − x, b = y2 − y    (3)
wherein l, t, r and b respectively represent the distances from the target position (x, y) to the left, top, right and bottom boundaries of the tracking box;
then the IoU between the predicted tracking box and the real box is calculated; only the IoU of positive samples is calculated, and the IoU in all other cases is set to 0; the regression loss function is then defined as the IoU loss accumulated over the positive positions:
Lreg = Σ(x, y) [1 − IoU(P(x, y), G(x, y))]    (4)
wherein P(x, y) denotes the tracking box predicted at position (x, y) and G(x, y) represents the target real border;
the classification loss Lcls is expressed using the SmoothL1 loss function:
SmoothL1(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise    (6)
the overall loss function of the network is expressed as:
L = λ1·Lcls + λ2·Lreg    (7)
wherein λ1 and λ2 are hyperparameters.
Further, in step S3, a template updating mechanism is introduced in the target tracking process, and a similarity threshold method is used to update the template in real time; the dynamic template updating strategy is as follows:
a template-updating sub-network is learned with a simple recurrent neural network, and the template update formula is:
Si = F(S0, Ti, Si−1)    (8)
wherein S0 is the template of the first frame, which remains the most reliable template throughout the update process, Ti is the template extracted from the i-th frame, Si−1 is the accumulated historical template up to frame i−1, Si is the best updated template to be matched in the next frame, and F is the activation function; for the first frame, Ti and Si−1 are both set to S0;
for threshold-based updating, the Average Peak Correlation Energy (APCE) is evaluated against a threshold to decide whether to update the template; the APCE is calculated as:
APCE = |Fmax − Fmin|² / mean( Σw,h (Fw,h − Fmin)² )    (9)
wherein Fmax and Fmin are the maximum and minimum values in the response map and Fw,h is the response value at coordinate (w, h).
Further, updating is restricted by setting a similarity threshold between the new template and the old template: whether to update is judged through an APCE threshold, formula (10), which compares the current APCE with its historical mean mean(APCE) under a preset threshold η; when formula (10) is satisfied, the target has probably undergone a large change;
to prevent misjudgment, the template similarity is then compared: the ratio of the response values of the convolution operations between the templates is taken as the similarity S, formula (11);
and if formula (10) is satisfied and the similarity between the templates is smaller than the threshold set by formula (11), the template is updated through formula (8).
The invention has the beneficial effects that:
the invention provides a twin network target tracking method based on cascade feature fusion, which has the main advantages that as the shallow feature of a multilayer neural network has high resolution and is suitable for target positioning, and the deep feature has rich semantic information and is suitable for target classification, the method has the following main advantages:
1) in order to better utilize the feature extraction capability of a deep network, the ResNet-50 network is improved, network step length, receptive field, space sampling strategies and the like are optimized, model parameters and calculated quantity are reduced, and therefore the tracking speed of the model is improved;
2) the method adopts a cascade feature fusion strategy to cascade and fuse three layers of features of the final stage of the improved ResNet-50 network step by step, fully utilizes the effective extraction of high-layer semantic information and shallow spatial information, and realizes the accurate representation of multiple features of the target, thereby realizing the accurate tracking of the target in a complex environment;
3) aiming at the problem that most algorithms only use the first frame as a target template in the target tracking process, so that the target template is degraded in the tracking process, a template updating mechanism is introduced, and the template is updated in real time by using a similarity threshold method;
4) compared experiments are carried out on OBT2015, VOT2016 and VOT2018 standard data sets, and compared with other methods, experimental results show that the method is high in tracking accuracy, strong in robustness in complex scenes and strong in competitive advantages compared with other algorithms.
Drawings
FIG. 1 is a diagram of a cascaded feature fusion network architecture of the present invention;
FIG. 2 is a block diagram of a feature fusion module of the present invention;
fig. 3 is a graph comparing the ablation experiment results of the method of the invention with other methods on OTB2015;
FIG. 4 is a comparison graph of the precision results of the method of the present invention and other methods in 11 different scenarios;
fig. 5 is a graph comparing the results of the method of the invention with other methods on OTB2015 video sequences;
fig. 6 is a graph comparing EAO results on VOT2016 and VOT2018 for the method of the present invention with other methods.
Detailed Description
For the purpose of promoting an understanding of the invention, reference will now be made in detail to the present embodiments of the invention, examples of which are illustrated in the accompanying drawings. It should be understood by those skilled in the art that the examples are only for the understanding of the present invention and should not be construed as the specific limitations of the present invention.
The invention is based on SiamRPN, and the overall pipeline is shown in Fig. 1. The network architecture comprises a backbone network built from an improved ResNet-50, a feature fusion module, a result prediction module and a template updating module. The backbone network is mainly responsible for extracting the shallow and deep features of the template image and the search image; the feature fusion module fuses the features of the last three stages of the template branch and the search branch step by step in a cascade manner; and a template updating module is introduced into the template branch, updating the template dynamically through a similarity threshold method so that the template adapts as the tracking time increases.
Multi-feature fusion can effectively improve algorithm accuracy in target segmentation and target tracking; performing convolution operations on an image in a neural network yields different shallow appearance features and deep semantic features. Because convolution features at different levels have different characteristics, features from different levels can complement each other, so feature fusion is a direct way to improve tracking accuracy. Subsequent twin network trackers such as SiamFC mostly exploit only the last layer of features for target tracking and ignore the shallow features, so a large amount of fine detail information is lost; in particular, when the background and the target share the same or similar semantics, tracking loss easily occurs. Multi-layer feature fusion refers to fusing features of different layers along the channel dimension; more features on a channel can be obtained either by element-wise addition on the channels or by directly concatenating them.
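As a small illustration of these two fusion styles, the following PyTorch snippet (a minimal sketch; the tensor shapes are illustrative assumptions, not values taken from the patent) contrasts element-wise addition with concatenation along the channel dimension:

```python
import torch

# Two feature maps from different layers, assumed to have already been
# brought to the same spatial resolution and channel count.
shallow = torch.randn(1, 256, 31, 31)  # shallow layer: appearance detail
deep = torch.randn(1, 256, 31, 31)     # deep layer: semantic information

fused_add = shallow + deep                     # element-wise addition, still 256 channels
fused_cat = torch.cat([shallow, deep], dim=1)  # channel concatenation, 512 channels

print(fused_add.shape)  # torch.Size([1, 256, 31, 31])
print(fused_cat.shape)  # torch.Size([1, 512, 31, 31])
```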
As shown in fig. 1, the twin network target tracking method based on the cascade feature fusion provided by the invention comprises the following steps:
S1, an improved five-stage ResNet-50 network is adopted as the backbone of the twin network to construct the target tracking framework. As the network deepens, ResNet-50 can extract deeper image feature information, and the shallow and deep features of the template image and the search image are extracted by the template branch and the search branch of the backbone network, respectively.
Residual networks greatly relax the limit on network depth, so the backbones of target detection and semantic segmentation tasks have gradually been replaced by ResNet structures, and high-resolution feature extraction is achieved by adding padding to the backbone. However, if the AlexNet backbone in a Siamese framework is simply replaced by VGG, ResNet or another deeper network, performance drops. The performance of a twin network tracker can be obviously improved by using a deeper model together with a more reasonable training strategy.
The improvements of the invention to the ResNet-50 network are as follows: 1) since middle- and shallow-level visual features perform well in Siamese trackers, to balance the accuracy and efficiency of target tracking, the original strides of the residual blocks Res4 and Res5 are reduced from 16 and 32 pixels to 8 pixels, and the receptive field is enlarged by dilated convolution; 2) the whole network is trained with a spatial-aware sampling strategy, which addresses the damage to strict translation invariance caused by padding in deep networks; 3) to reduce the number of parameters, the channels of the multi-layer feature maps are reduced to 256 by 1 × 1 convolutions; and since the spatial size of the template feature is 15 × 15, the central 7 × 7 region is cropped as the template feature to reduce the computational burden, while each feature cell can still capture the entire target region.
The template branch and the search branch of the backbone network have the same convolution structure and the same network parameters.
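The third improvement above (channel reduction plus cropping of the central template region) can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions: the module name AdjustLayer, the use of BatchNorm and the input channel count are illustrative, not details given in the patent.

```python
import torch
import torch.nn as nn

class AdjustLayer(nn.Module):
    """Reduce a backbone feature map to 256 channels with a 1x1 convolution
    and, for the template branch, crop the central 7x7 region."""
    def __init__(self, in_channels: int, out_channels: int = 256):
        super().__init__()
        self.downsample = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x: torch.Tensor, crop_center: bool = False) -> torch.Tensor:
        x = self.downsample(x)
        if crop_center and x.size(-1) >= 7:
            c = (x.size(-1) - 7) // 2
            x = x[:, :, c:c + 7, c:c + 7]  # keep the central 7x7 template region
        return x

# Example: a 15x15 template feature from Res4 (1024 channels in ResNet-50).
adjust = AdjustLayer(1024)
template_feat = adjust(torch.randn(1, 1024, 15, 15), crop_center=True)
print(template_feat.shape)  # torch.Size([1, 256, 7, 7])
```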
And S2, carrying out cascade fusion on the last three residual blocks Res3, Res4 and Res5 of the residual networks of the template branch and the search branch step by step respectively to obtain three characteristic maps R3, R4 and R5 of the two branches respectively for the subsequent target tracking process.
The shallow features mainly contain spatial information such as color, shape and edges, which matters for calibrating the target position, while the deep features contain more semantic information, which matters for handling similar distractors, occlusion and deformation during tracking. Therefore, the invention makes full use of the features of the last three stages of the ResNet-50 network and fuses them step by step.
The feature fusion module performs step-by-step cascade fusion of the last three residual blocks Res3, Res4 and Res5 of the residual network to extract image features. It realizes global feature extraction based on channel identification and effectively fuses the features of different layers through gradual fusion; the structure of the feature fusion module is shown in Fig. 2 and is described below taking R5 and Res4 as an example.
S2-1, the signature obtained by convolving the residual block Res5 by 3 × 3 is denoted as R5, and the signature keeps the spatial resolution unchanged and changes the number of channels to 256.
S2-2, firstly, sequentially carrying out convolution kernel operation on the output feature after passing through a residual block Res4 and a convolution kernel of a 3 x 3 convolution kernel (the number of channels is 256) and a Softmax function operation to sense the attention feature weight of each feature point in the overall feature, wherein the feature weight mainly refers to the weight of each feature point in the context feature; then, the feature map R5 is subjected to convolution kernel operation of a convolution kernel with 3 × 3 (with the number of channels being 256), a ReLU function and another convolution kernel with 3 × 3 (with the number of channels being 256) in sequence, so as to realize feature conversion and obtain the dependency between channels; finally, pixel-by-pixel addition and ReLU operation are adopted, the overall characteristics obtained by the operation and the characteristics identified by the channel are fused at each position, and a fusion result, namely a characteristic map R4 with richer semantics and the same resolution is obtained;
and S2-3, in the same step as the step S2-2, fusing the residual block Res3 and the feature map R4 to obtain a feature map R3.
By using a feature fusion mechanism, richer context feature information and feature graphs with the same resolution can be obtained, so that the effect of follow-up target tracking is improved.
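A minimal PyTorch sketch of one such fusion step is given below. It follows the textual description (a 3 × 3 convolution plus Softmax producing attention weights for the Res4 features, a 3 × 3 conv / ReLU / 3 × 3 conv transform of R5, and pixel-wise addition followed by ReLU); how the attention weights are applied back onto the features, and taking the Softmax over spatial positions, are assumptions rather than details given in the patent. The same module would be applied again to fuse Res3 with R4.

```python
import torch
import torch.nn as nn

class CascadeFusion(nn.Module):
    """One step of the cascade fusion (e.g. fusing the Res4 output with R5)."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.attention = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, res_feat: torch.Tensor, deeper_feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = res_feat.shape
        # Attention weight of each feature point (Softmax over spatial positions: an assumption).
        attn = self.attention(res_feat).view(b, c, h * w).softmax(dim=-1).view(b, c, h, w)
        weighted = res_feat * attn
        # Channel-wise transformation of the deeper feature map (e.g. R5).
        transformed = self.transform(deeper_feat)
        # Pixel-wise addition followed by ReLU gives the fused map (e.g. R4).
        return self.relu(weighted + transformed)

fuse = CascadeFusion()
r5 = torch.randn(1, 256, 15, 15)    # deeper feature map R5
res4 = torch.randn(1, 256, 15, 15)  # adjusted output of residual block Res4
r4 = fuse(res4, r5)
print(r4.shape)  # torch.Size([1, 256, 15, 15])
```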
S3, respectively carrying out cross-correlation calculation on three feature maps R3, R4 and R5 of the two branches, and then carrying out classification and regression on the features subjected to cross-correlation calculation through an anchor-free frame network.
The network is trained end to end; the backbone is pre-trained on the ImageNet-1K dataset and the whole network is trained with images from the ILSVRC dataset. The ILSVRC dataset contains approximately 4500 videos with more than one million annotations describing different tracking scenes. During training, a frame is randomly selected from the ILSVRC dataset, a 127 × 127 region containing the target is cropped as the target template, and a 255 × 255 search region is cropped from a search image to generate a training pair, with a maximum interval of 50 frames between the two frames. The classification and regression of each target and position are realized through this training.
In the target tracking process most samples are positive samples, and a padding method is used, which results in a loss of semantic information. Although existing training methods enhance the discrimination ability of the model, it still has difficulty distinguishing similar distractors in the image; therefore negative sample sampling is added in the target tracking process to learn similar distractors with different semantics.
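The following sketch outlines how such a training pair could be cropped. The 127 × 127 / 255 × 255 sizes and the 50-frame limit come from the text above; the context margin around the target (2× and 4× the larger box side) and the border replication are assumptions made for illustration. Drawing the two frames from different videos would give the negative pairs mentioned above.

```python
import random
import cv2
import numpy as np

def crop_square(img: np.ndarray, center: tuple, size: int, out_size: int) -> np.ndarray:
    """Crop a square patch of side `size` around `center` and resize it to `out_size`."""
    x, y = center
    x1, y1 = int(round(x - size / 2)), int(round(y - size / 2))
    pad = max(0, -x1, -y1, x1 + size - img.shape[1], y1 + size - img.shape[0])
    if pad > 0:  # replicate the border when the crop leaves the image (assumption)
        img = cv2.copyMakeBorder(img, pad, pad, pad, pad, cv2.BORDER_REPLICATE)
        x1, y1 = x1 + pad, y1 + pad
    return cv2.resize(img[y1:y1 + size, x1:x1 + size], (out_size, out_size))

def sample_pair(frames: list, boxes: list, max_interval: int = 50):
    """Pick two frames of the same video at most `max_interval` apart and crop a
    127x127 template and a 255x255 search region around the annotated targets."""
    i = random.randrange(len(frames))
    j = random.randrange(max(0, i - max_interval), min(len(frames), i + max_interval + 1))
    (tx, ty, tw, th), (sx, sy, sw, sh) = boxes[i], boxes[j]
    template = crop_square(frames[i], (tx + tw / 2, ty + th / 2), int(2 * max(tw, th)), 127)
    search = crop_square(frames[j], (sx + sw / 2, sy + sh / 2), int(4 * max(sw, sh)), 255)
    return template, search
```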
In the target tracking process, the method for distinguishing the positive sample and the negative sample of the sample comprises the following steps:
For the real (ground-truth) box marked on each image in the training set, Tw denotes the width, Th the height, (x1, y1) the coordinates of the upper left corner, (x0, y0) the center coordinates and (x2, y2) the coordinates of the lower right corner.
Centered at (x0, y0), an outer ellipse E1 is constructed whose semi-axis lengths a1 and b1 are determined by Tw and Th:
(xi − x0)² / a1² + (yj − y0)² / b1² = 1    (1)
where (xi, yj) denotes the coordinate position of a sampling point.
Then, again centered at (x0, y0), an inner ellipse E2 is constructed with smaller semi-axis lengths a2 and b2, likewise determined by Tw and Th:
(xi − x0)² / a2² + (yj − y0)² / b2² = 1    (2)
A sampling point (xi, yj) is a positive sample if it lies inside E2, a negative sample if it lies outside E1, and is ignored if it lies between the two ellipses; the positions marked as positive samples are used for tracking-box regression.
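A sketch of this ellipse-based label assignment is shown below. The semi-axis fractions outer_frac and inner_frac are placeholders: the actual axis lengths of equations (1) and (2) are given in the original as equation images that are not reproduced here, so the values 0.5 and 0.25 are assumptions.

```python
import numpy as np

def assign_labels(points: np.ndarray, box: tuple,
                  outer_frac: float = 0.5, inner_frac: float = 0.25) -> np.ndarray:
    """Ellipse-based label assignment for an anchor-free head.

    `points` is an (N, 2) array of (x, y) sample positions and `box` is the
    ground-truth (x1, y1, x2, y2). Points inside the inner ellipse E2 are
    positive (+1), points outside the outer ellipse E1 are negative (0),
    and points in between are ignored (-1)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    tw, th = x2 - x1, y2 - y1
    dx, dy = points[:, 0] - cx, points[:, 1] - cy

    e1 = (dx / (outer_frac * tw)) ** 2 + (dy / (outer_frac * th)) ** 2  # outer ellipse E1
    e2 = (dx / (inner_frac * tw)) ** 2 + (dy / (inner_frac * th)) ** 2  # inner ellipse E2

    labels = np.full(len(points), -1, dtype=np.int64)  # ignored by default
    labels[e2 <= 1.0] = 1   # inside E2: positive
    labels[e1 > 1.0] = 0    # outside E1: negative
    return labels

# Example: label a 25x25 grid of response-map positions mapped back to the image.
xs, ys = np.meshgrid(np.arange(25) * 8 + 31, np.arange(25) * 8 + 31)
pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
print(np.bincount(assign_labels(pts, (60, 80, 160, 180)) + 1))  # counts of ignored/negative/positive
```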
The regression target calculated in the regression branch is represented by the distances from the target position to the boundaries of the tracking box, which are calculated as follows:
l = x − x1, t = y − y1, r = x2 − x, b = y2 − y    (3)
wherein l, t, r and b respectively represent the distances from the target position (x, y) to the left, top, right and bottom boundaries of the tracking box.
Then the IoU (Intersection over Union) between the predicted tracking box and the real box is calculated; only the IoU of positive samples is computed, and the IoU is set to 0 in all other cases. The regression loss function is then defined as the IoU loss accumulated over the positive positions:
Lreg = Σ(x, y) [1 − IoU(P(x, y), G(x, y))]    (4)
wherein P(x, y) denotes the tracking box predicted at position (x, y) and G(x, y) represents the target real border.
The classification loss Lcls is expressed using the SmoothL1 loss function:
SmoothL1(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise    (6)
The overall loss function of the network is expressed as:
L = λ1·Lcls + λ2·Lreg    (7)
wherein λ1 and λ2 are hyperparameters. After repeated parameter-tuning experiments, λ1 = 1 and λ2 = 2 are adopted.
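The sketch below shows one way the combined objective of equation (7) could be evaluated, with λ1 = 1 and λ2 = 2 as stated above. Treating the regression term as the mean IoU loss over positive positions and applying SmoothL1 directly to the classification scores are assumptions made for illustration.

```python
import torch

def iou(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """IoU between boxes given as (l, t, r, b) distances from the same positions."""
    pred_area = (pred[:, 0] + pred[:, 2]) * (pred[:, 1] + pred[:, 3])
    gt_area = (gt[:, 0] + gt[:, 2]) * (gt[:, 1] + gt[:, 3])
    inter_w = torch.min(pred[:, 0], gt[:, 0]) + torch.min(pred[:, 2], gt[:, 2])
    inter_h = torch.min(pred[:, 1], gt[:, 1]) + torch.min(pred[:, 3], gt[:, 3])
    inter = inter_w.clamp(min=0) * inter_h.clamp(min=0)
    return inter / (pred_area + gt_area - inter + 1e-6)

def total_loss(cls_pred, cls_target, reg_pred, reg_target, pos_mask,
               lambda1: float = 1.0, lambda2: float = 2.0) -> torch.Tensor:
    """Overall loss L = lambda1 * Lcls + lambda2 * Lreg (equation (7));
    Lcls uses SmoothL1 and Lreg is an IoU loss over positive positions only."""
    l_cls = torch.nn.functional.smooth_l1_loss(cls_pred, cls_target)
    if pos_mask.any():
        l_reg = (1.0 - iou(reg_pred[pos_mask], reg_target[pos_mask])).mean()
    else:
        l_reg = reg_pred.sum() * 0.0  # no positives: regression term vanishes
    return lambda1 * l_cls + lambda2 * l_reg

# Toy example with 5 candidate positions, 2 of them positive.
cls_pred, cls_target = torch.rand(5), torch.tensor([1.0, 1.0, 0.0, 0.0, 0.0])
reg_pred, reg_target = torch.rand(5, 4) * 20, torch.rand(5, 4) * 20
loss = total_loss(cls_pred, cls_target, reg_pred, reg_target,
                  pos_mask=torch.tensor([True, True, False, False, False]))
print(loss.item())
```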
Most twin-network-based target tracking methods use the first-frame image as the template, judge whether a candidate is the tracked target through similarity matching with subsequent frames, and never update the template during tracking. Because a fixed template is used, when the target undergoes severe rotation, occlusion, deformation or similar changes, the template matching similarity becomes low and tracking fails; template updating is therefore very necessary in the target tracking process. However, updating the template every frame would, on the one hand, cause tracking drift due to overly frequent updates and, on the other hand, reduce the overall real-time performance of the network.
To solve the problem of dynamic template updating, the invention learns a template-updating sub-network with a simple recurrent neural network (RNN); the template update formula is:
Si = F(S0, Ti, Si−1)    (8)
wherein S0 is the template of the first frame, which remains the most reliable template throughout the update process, Ti is the template extracted from the i-th frame, Si−1 is the accumulated historical template up to frame i−1, Si is the best updated template to be matched in the next frame, and F is the activation function.
For the first frame, Ti and Si−1 are both set to S0; the updated template therefore depends not only on the template of the previous frame but also on the template extracted from the current frame.
For threshold-based updating, the invention evaluates the Average Peak Correlation Energy (APCE) to decide whether to update the template. The APCE is calculated as:
APCE = |Fmax − Fmin|² / mean( Σw,h (Fw,h − Fmin)² )    (9)
wherein Fmax and Fmin are the maximum and minimum values in the response map and Fw,h is the response value at coordinate (w, h).
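A direct transcription of equation (9) into code, as a minimal sketch:

```python
import torch

def apce(response: torch.Tensor) -> torch.Tensor:
    """Average Peak Correlation Energy of a 2-D response map (equation (9))."""
    f_max, f_min = response.max(), response.min()
    return (f_max - f_min) ** 2 / ((response - f_min) ** 2).mean()

# A sharp, single-peaked response yields a much larger APCE than a flat or multi-peak one.
sharp = torch.zeros(17, 17); sharp[8, 8] = 1.0
flat = torch.rand(17, 17)
print(apce(sharp).item(), apce(flat).item())
```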
Under normal conditions, when the foreground target is tracked well, the peak of the response map is high, the APCE value is large and the response map is unimodal; when the target changes drastically or is occluded, the APCE value drops and multiple peaks appear. To avoid updating the template too frequently, the invention restricts updating by setting a similarity threshold between the old and new templates. Whether to update is first judged through an APCE threshold, formula (10), which compares the current APCE with its historical mean mean(APCE) under a preset threshold η; when formula (10) is satisfied, the target has probably undergone a large change.
To prevent misjudgment, the template similarity is then compared: the ratio of the response values of the convolution operations between the templates is taken as the similarity S, formula (11).
If formula (10) is satisfied and the similarity between the templates is smaller than the threshold set by formula (11), the template is updated through formula (8).
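The decision logic can be sketched as follows. The exact forms of formulas (10) and (11) are given as images in the original and are not reproduced here, so the comparison APCE < η · mean(APCE), the way the similarity value is obtained, and the stand-in for the learned sub-network F of equation (8) are all assumptions, marked as such in the comments.

```python
import torch

def should_update(apce_value: float, apce_history: list, eta: float,
                  similarity: float, sim_threshold: float) -> bool:
    """Gate the template update with the APCE test (formula (10)) and the
    template-similarity test (formula (11))."""
    apce_mean = sum(apce_history) / max(len(apce_history), 1)
    large_change = apce_value < eta * apce_mean          # assumed form of formula (10)
    return large_change and similarity < sim_threshold   # update only when both hold

def update_template(s0: torch.Tensor, ti: torch.Tensor, si_prev: torch.Tensor) -> torch.Tensor:
    """Placeholder for the learned sub-network F of equation (8); a simple
    weighted combination stands in for the recurrent update (an assumption)."""
    return torch.tanh(0.5 * s0 + 0.3 * ti + 0.2 * si_prev)

# Example: decide whether to refresh the template for the current frame.
s0 = torch.randn(256, 7, 7); si = s0.clone(); ti = torch.randn(256, 7, 7)
if should_update(apce_value=8.0, apce_history=[20.0, 22.0, 19.0], eta=0.6,
                 similarity=0.4, sim_threshold=0.5):
    si = update_template(s0, ti, si)
```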
The dynamic template makes full use of the rich information in historical frames and builds a more stable model, giving the network stronger robustness to drastic target changes, especially under occlusion.
The effectiveness of the present invention is further illustrated by comparative experiments.
1) Experimental environment: the algorithm runs on a platform configured with an Intel(R) Xeon(R) CPU E5-2660 v2 @ 3.50 GHz × 40, two NVIDIA GTX 1080Ti GPUs and 24 GB of total memory.
The invention trains on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset and uses the LaSOT dataset to train the template updating module. End-to-end training is performed on the ILSVRC video dataset, which can be safely used to train a deep tracking model without overfitting to the video domains used by the tracking benchmarks. Two frames containing the same object are randomly selected; before entering the tracking network, the template frame is resized to 127 × 127 and the search frame to 255 × 255. LaSOT is a large video dataset with 1400 sequences, 280 of which form the test set. It provides dense, high-quality annotations and contains many deformation and occlusion cases, which is convenient for training the template update; 20 sequences from 20 categories are randomly selected from the LaSOT dataset as the training set for the template-updating sub-network.
The algorithm is evaluated on the widely used OTB2015, VOT2016 and VOT2018 benchmark datasets and compared with current mainstream algorithms to test its accuracy and robustness. Again, before entering the tracking network, the template frame is resized to 127 × 127 and the search frame to 255 × 255. OTB2015 is one of the most common benchmarks for visual target tracking, with 100 fully annotated video sequences; two evaluation indices are used on this dataset, the tracking precision and the area under the curve (AUC) of the success-rate plot. VOT2016 and VOT2018 are widely used benchmarks for visual target tracking; both contain 60 sequences with different challenge factors, and the VOT2018 dataset is labeled with rotated tracking boxes and evaluated using a reset-based approach.
2) Quantitative experiments on the OTB2015 dataset
For the experiments on the OTB2015 benchmark dataset, the proposed method is evaluated mainly through tracking precision and success rate.
Let the center of the predicted target box be (xp, yp) and the center of the real bounding box be (xr, yr). The tracking precision is measured by the Euclidean distance between the two:
d = sqrt( (xp − xr)² + (yp − yr)² )
A smaller d indicates higher tracking precision. The precision criterion is the proportion of frames whose Euclidean distance d is smaller than a set threshold T among all tracked frames; in the invention T is set to 20 pixels.
The target tracking success rate is based on the IoU between the area of the predicted target box, Areap, and the area of the real target bounding box, Arear:
IOU = |Areap ∩ Arear| / |Areap ∪ Arear|
A larger IoU indicates a higher tracking success rate. The success-rate plot shows the proportion of video frames whose overlap is greater than a threshold t, t ∈ [0, 1]; in the invention the threshold t is taken as 0.5.
Both the tracking precision and the tracking success rate reported in the invention are calculated from the area under the curve (AUC) scores.
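For completeness, a sketch of how the precision and success metrics above could be computed for a sequence of predicted and ground-truth boxes (boxes are assumed to be in (x, y, w, h) format):

```python
import numpy as np

def center_error(pred_boxes: np.ndarray, gt_boxes: np.ndarray) -> np.ndarray:
    """Euclidean distance between predicted and ground-truth box centers."""
    pc = pred_boxes[:, :2] + pred_boxes[:, 2:] / 2
    gc = gt_boxes[:, :2] + gt_boxes[:, 2:] / 2
    return np.linalg.norm(pc - gc, axis=1)

def overlap(pred_boxes: np.ndarray, gt_boxes: np.ndarray) -> np.ndarray:
    """IoU between predicted and ground-truth boxes."""
    x1 = np.maximum(pred_boxes[:, 0], gt_boxes[:, 0])
    y1 = np.maximum(pred_boxes[:, 1], gt_boxes[:, 1])
    x2 = np.minimum(pred_boxes[:, 0] + pred_boxes[:, 2], gt_boxes[:, 0] + gt_boxes[:, 2])
    y2 = np.minimum(pred_boxes[:, 1] + pred_boxes[:, 3], gt_boxes[:, 1] + gt_boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred_boxes[:, 2] * pred_boxes[:, 3] + gt_boxes[:, 2] * gt_boxes[:, 3] - inter
    return inter / np.maximum(union, 1e-9)

def precision_at(pred, gt, threshold=20.0):
    """Fraction of frames whose center error is below `threshold` pixels."""
    return float((center_error(pred, gt) <= threshold).mean())

def success_at(pred, gt, threshold=0.5):
    """Fraction of frames whose overlap exceeds `threshold`."""
    return float((overlap(pred, gt) > threshold).mean())
```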
Ablation experiment: to evaluate the effectiveness and accuracy of the algorithm, 7 mainstream tracking algorithms are selected for comparison with the proposed algorithm, namely SiamRPN, DaSiamRPN, SiamRPN++, GradNet, SiamVGG, SiamFC and FDSST. The comparative results are shown in Fig. 3, where Fig. 3(a) is the success-rate comparison and Fig. 3(b) is the precision comparison.
As can be seen from Fig. 3, the success rate and precision of the proposed algorithm are 0.702 and 0.749, respectively. The success rate is 0.084 higher than the baseline SiamRPN, 0.02 higher than SiamVGG and 0.031 higher than DaSiamRPN; the precision is 0.037 higher than SiamRPN, 0.011 higher than SiamRPN++ and 0.018 higher than DaSiamRPN.
The experimental comparison shows that the proposed algorithm is clearly improved in both precision and success rate, indicating that the cascade feature fusion and template updating mechanisms of the invention are effective. Meanwhile, the algorithm runs at 41 fps on the OTB2015 dataset, which is sufficient for stable real-time tracking of the target.
Quantitative experiments: to further demonstrate the adaptability of the proposed algorithm to complex environments, further quantitative experiments were carried out. The OTB2015 benchmark dataset covers 11 attribute scenarios: illumination variation, scale variation, occlusion, deformation, motion blur, fast motion, in-plane rotation, out-of-plane rotation, out-of-view, background clutter and low resolution. The comparison of the proposed algorithm with the above 7 algorithms in these 11 scenarios is shown by precision plots in Fig. 4, where Fig. 4(a) is the comparison in the background-clutter scene, Fig. 4(b) in the illumination-variation scene, Fig. 4(c) in the deformation scene, Fig. 4(d) in the scale-variation scene, Fig. 4(e) in the occlusion scene, Fig. 4(f) in the out-of-view scene, Fig. 4(g) in the fast-motion scene, Fig. 4(h) in the motion-blur scene, Fig. 4(i) in the in-plane-rotation scene, Fig. 4(j) in the out-of-plane-rotation scene and Fig. 4(k) in the low-resolution scene.
As can be seen from Fig. 4, the proposed algorithm ranks second, with relatively lower precision, in two scenes, occlusion and low resolution, and outperforms the other 7 algorithms in the remaining 9 scenes, which fully demonstrates its effectiveness.
When illumination variation, occlusion, deformation, rotation or similar-background interference occurs, the semantics of the target change with the scene. The cascade features deepen the extraction of semantic features, so the semantic information of the target is richer and the precision of the algorithm is higher: 0.896 under illumination variation, 0.830 under occlusion, 0.867 under deformation, 0.892 for in-plane rotation, 0.881 for out-of-plane rotation and 0.833 under similar-background interference. The per-scene precision comparison in Fig. 4 shows relatively high tracking precision, which fully indicates that the template update in the proposed algorithm has a positive effect: the tracker obtains more effective and accurate semantic information and updates the template in time, realizing accurate and effective tracking.
Qualitative analysis experiment: in this experiment, the proposed algorithm is compared with SiamRPN++, DaSiamRPN and SiamFC. From OTB2015, 4 groups of video sequences with representative scenes are selected: ClifBar, Jogging, Lemming and MotorRolling. These four groups of sequences contain complex scenes such as motion blur, target rotation, scale change, similarity between target and background, illumination change and occlusion; the tracking results of the compared algorithms are shown in Fig. 5, where Fig. 5(a) shows the ClifBar (frames 80, 156, 260, 461) sequence, Fig. 5(b) the Jogging (30, 45, 53, 63) sequence, Fig. 5(c) the Lemming (218, 298, 345, 380, 986) sequence and Fig. 5(d) the MotorRolling (1, 25, 42, 71) sequence.
Complex conditions such as similar-background interference, target rotation, target scale change, illumination change, motion blur and occlusion appear in the ClifBar, Lemming and MotorRolling video sequences. As can be seen from Fig. 5, the proposed algorithm effectively extracts the semantic and positional features of the target through cascade feature fusion and enhances the accurate expression of its important features, so it can still localize the target accurately and track it effectively under these conditions. The tracking accuracy of the SiamRPN and DaSiamRPN algorithms is relatively poor, and their overlap and success rates drop under blur and rotation; SiamFC loses the target, and although it re-locates and re-tracks the target under the subsequent simpler background, its overall performance is poor.
In the Jogging and Lemming video sequences, which mainly verify performance under occlusion, the proposed algorithm adopts a template updating mechanism and can therefore still track the target accurately when it is occluded, whereas the SiamRPN, DaSiamRPN and SiamFC algorithms lose the target; when the target reappears, SiamRPN and DaSiamRPN manage to re-track it but with a low overlap rate, while SiamFC fails completely.
The above qualitative analysis shows that the proposed algorithm adapts effectively to changes in complex environments, further proving its effectiveness and its strong robustness in coping with such environments.
3) Experiments on the VOT2016 and VOT2018 datasets
To verify that the proposed algorithm can cope with illumination change, occlusion, scale change, similar background, target rotation and other challenges under complex conditions, its performance on VOT2016 and VOT2018 is tested and compared with advanced algorithms of recent years. The assessment is performed with the official VOT (Visual Object Tracking) toolkit, and the evaluation metrics include Accuracy, Robustness and Expected Average Overlap (EAO).
The test results are shown in Table 1, and the EAO comparison of the different algorithms on VOT2016 and VOT2018 is shown in Fig. 6. As can be seen from Table 1, on VOT2016 the proposed algorithm outperforms DaSiamRPN, SPM and the other algorithms; it matches SPM in accuracy and ECO in robustness while surpassing the remaining trackers, and improves on the baseline SiamRPN by 6% in accuracy and 6% in robustness. On VOT2018 its accuracy is slightly lower than SiamRPN++ and ranks second, on a par with SiamRPN++, while improving on the baseline SiamRPN by 10% in accuracy and 23% in robustness. The EAO comparison in Fig. 6 shows that the proposed algorithm is higher than the other algorithms on VOT2016 and slightly lower than SiamRPN++ on VOT2018, ranking second. From this analysis, the proposed algorithm is highly competitive among the compared trackers.
TABLE 1 test results of different algorithms on VOT2016 and VOT2018
In conclusion, the invention provides an end-to-end twin network target tracking method based on cascade feature fusion. ResNet-50 is used as the backbone network and is improved by reducing model parameters and increasing computation speed, which strengthens the feature extraction capability of the tracker. The features of the last three stages of ResNet-50 are then fused step by step in a cascade manner by the feature fusion module, realizing an effective fusion of the shallow appearance features and the deep semantic features of the target and improving target identification and localization. Meanwhile, to address template degradation and adapt in real time to changes in the appearance and state of the target, a template updating mechanism is introduced and the update is governed by a similarity threshold. The model training compensates for the shortcomings of individual features in tracking, and experiments on OTB2015, VOT2016 and VOT2018 show that the proposed cascade feature fusion network effectively improves the generality of the tracker and performs excellently in complex scenes such as fast motion, motion blur, occlusion, similar background, illumination change and deformation.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A twin network target tracking method based on cascade feature fusion is characterized by comprising the following steps:
s1, adopting an improved five-stage ResNet-50 network as a backbone network of a twin network, and respectively extracting shallow and deep features of a template image and a search image by utilizing a template branch and a search branch of the backbone network;
s2, carrying out cascade fusion on the last three residual blocks Res3, Res4 and Res5 of the residual error network of the template branch and the search branch respectively to obtain three characteristic maps R3, R4 and R5 of the two branches respectively;
s3, respectively carrying out cross-correlation calculation on three feature maps R3, R4 and R5 of the two branches, and then carrying out classification and regression on the features subjected to cross-correlation calculation through an anchor-free frame network.
2. The twin network target tracking method based on cascade feature fusion as claimed in claim 1, wherein in the step S1, the improvement of the ResNet-50 network comprises: reducing the original strides of the residual blocks Res4 and Res5 from 16 and 32 pixels to 8 pixels and enlarging the receptive field by dilated convolution; training the whole network with a spatial-aware sampling strategy; and reducing the channels of the multi-layer feature maps to 256 by 1 × 1 convolutions and cropping the central 7 × 7 region of the template feature, where each feature cell can still capture the entire target region.
3. The twin network target tracking method based on cascade feature fusion as claimed in claim 1, wherein in step S1, the template branch and the search branch of the backbone network have the same convolution structure and the same network parameters.
4. The twin network target tracking method based on cascade feature fusion as claimed in claim 1, wherein in the step S2, the cascade fusion of three residual blocks Res3, Res4 and Res5 is performed in a stepwise manner, comprising the following steps:
s2-1, marking a feature map obtained by 3-by-3 convolution of the residual block Res5 as R5;
S2-2, firstly, the output features of the residual block Res4 are passed sequentially through a 3 × 3 convolution (256 channels) and a Softmax function; then the feature map R5 is passed sequentially through a 3 × 3 convolution (256 channels), a ReLU function and another 3 × 3 convolution (256 channels); finally, pixel-by-pixel addition and a ReLU operation fuse the resulting global features with the channel-identified features at each position to obtain the feature map R4;
and S2-3, in the same step as the step S2-2, fusing the residual block Res3 and the feature map R4 to obtain a feature map R3.
5. The twin network target tracking method based on cascade feature fusion of claim 1, wherein in step S3, the backbone network is pre-trained on the ImageNet-1K dataset and the whole network is trained with images from the ILSVRC dataset; during training, a frame is randomly selected from the ILSVRC dataset, a 127 × 127 region containing the target is cropped as the target template, and a 255 × 255 search region is cropped from a search image to generate a training pair, with a maximum interval of 50 frames; the classification and regression of each target and position are realized through this training.
6. The twin network target tracking method based on cascade feature fusion as claimed in claim 1, wherein in step S3, negative sample sampling is added in the target tracking process.
7. The twin network target tracking method based on cascade feature fusion as claimed in claim 6, wherein in the target tracking process, the positive and negative samples of the samples are distinguished by:
For the real (ground-truth) box marked on each image in the training set, Tw denotes the width, Th the height, (x1, y1) the coordinates of the upper left corner, (x0, y0) the center coordinates and (x2, y2) the coordinates of the lower right corner;
centered at (x0, y0), an outer ellipse E1 is constructed whose semi-axis lengths a1 and b1 are determined by Tw and Th:
(xi − x0)² / a1² + (yj − y0)² / b1² = 1    (1)
wherein (xi, yj) denotes the coordinate position of a sampling point;
then, again centered at (x0, y0), an inner ellipse E2 is constructed with smaller semi-axis lengths a2 and b2, likewise determined by Tw and Th:
(xi − x0)² / a2² + (yj − y0)² / b2² = 1    (2)
a sampling point (xi, yj) is a positive sample if it lies inside E2, a negative sample if it lies outside E1, and is ignored if it lies between the two ellipses; the positions marked as positive samples are used for tracking-box regression.
8. The twin network target tracking method based on cascade feature fusion according to claim 1, wherein in the step S3, the regression target calculated in the regression branch is represented by the distances from the target position to the boundaries of the tracking box;
the tracking box is calculated as follows:
l = x − x1, t = y − y1, r = x2 − x, b = y2 − y    (3)
wherein l, t, r and b respectively represent the distances from the target position (x, y) to the left, top, right and bottom boundaries of the tracking box;
then the IoU between the predicted tracking box and the real box is calculated; only the IoU of positive samples is calculated, and the IoU in all other cases is set to 0; the regression loss function is then defined as the IoU loss accumulated over the positive positions:
Lreg = Σ(x, y) [1 − IoU(P(x, y), G(x, y))]    (4)
wherein P(x, y) denotes the tracking box predicted at position (x, y) and G(x, y) represents the target real border;
the classification loss Lcls is expressed using the SmoothL1 loss function:
SmoothL1(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise    (6)
the overall loss function of the network is expressed as:
L = λ1·Lcls + λ2·Lreg    (7)
wherein λ1 and λ2 are hyperparameters.
9. The twin network target tracking method based on cascade feature fusion as claimed in claim 1, wherein in step S3, a template updating mechanism is introduced in the target tracking process, and a similarity threshold method is used to update the template in real time; the template dynamic updating strategy comprises the following steps:
learning a template-updating sub-network by using a simple recurrent neural network, wherein the template updating formula is:
Si = F(S0, Ti, Si−1)    (8)
wherein S0 is the template of the first frame, which remains the most reliable template throughout the update process, Ti is the template extracted from the i-th frame, Si−1 is the accumulated historical template up to frame i−1, Si is the best updated template to be matched in the next frame, and F is the activation function; for the first frame, Ti and Si−1 are both set to S0;
for threshold-based updating, the Average Peak Correlation Energy (APCE) is evaluated against a threshold to decide whether to update the template, the APCE being calculated as:
APCE = |Fmax − Fmin|² / mean( Σw,h (Fw,h − Fmin)² )    (9)
wherein Fmax and Fmin are the maximum and minimum values in the response map and Fw,h is the response value at coordinate (w, h).
10. The twin network target tracking method based on cascade feature fusion as claimed in claim 9, wherein updating is restricted by setting a similarity threshold between the new template and the old template: whether to update is judged through an APCE threshold, formula (10), which compares the current APCE with its historical mean mean(APCE) under a preset threshold η; when formula (10) is satisfied, the target has probably undergone a large change;
the template similarity is then compared to prevent misjudgment: the ratio of the response values of the convolution operations between the templates is taken as the similarity S, formula (11);
and if formula (10) is satisfied and the similarity between the templates is smaller than the threshold set by formula (11), the template is updated through formula (8).
CN202111175907.8A 2021-10-09 2021-10-09 Twin network target tracking method based on cascade characteristic fusion Pending CN113902991A (en)

Priority Applications (1)

Application Number: CN202111175907.8A | Priority date: 2021-10-09 | Filing date: 2021-10-09 | Title: Twin network target tracking method based on cascade feature fusion (CN113902991A)

Applications Claiming Priority (1)

Application Number: CN202111175907.8A | Priority date: 2021-10-09 | Filing date: 2021-10-09 | Title: Twin network target tracking method based on cascade feature fusion (CN113902991A)

Publications (1)

Publication Number Publication Date
CN113902991A true CN113902991A (en) 2022-01-07

Family

ID=79190693

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111175907.8A Pending CN113902991A (en) 2021-10-09 2021-10-09 Twin network target tracking method based on cascade characteristic fusion

Country Status (1)

Country Link
CN (1) CN113902991A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114926498A (en) * 2022-04-26 2022-08-19 电子科技大学 Rapid target tracking method based on space-time constraint and learnable feature matching
CN114926498B (en) * 2022-04-26 2023-05-23 电子科技大学 Rapid target tracking method based on space-time constraint and leachable feature matching
WO2024093209A1 (en) * 2022-11-05 2024-05-10 北京化工大学 Method for dynamic target tracking by legged robot
CN116128533A (en) * 2023-03-06 2023-05-16 广西螺霸王食品科技有限公司 Food sales data management system
CN116128533B (en) * 2023-03-06 2023-07-28 广西螺霸王食品科技有限公司 Food sales data management system
CN116644351A (en) * 2023-06-13 2023-08-25 石家庄学院 Data processing method and system based on artificial intelligence
CN116644351B (en) * 2023-06-13 2024-04-02 石家庄学院 Data processing method and system based on artificial intelligence
CN117197249A (en) * 2023-11-08 2023-12-08 北京观微科技有限公司 Target position determining method, device, electronic equipment and storage medium
CN117197249B (en) * 2023-11-08 2024-01-30 北京观微科技有限公司 Target position determining method, device, electronic equipment and storage medium
CN117252904A (en) * 2023-11-15 2023-12-19 南昌工程学院 Target tracking method and system based on long-range space perception and channel enhancement
CN117252904B (en) * 2023-11-15 2024-02-09 南昌工程学院 Target tracking method and system based on long-range space perception and channel enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination