CN108734151A - Robust long-range target tracking method based on correlation filtering and deep Siamese network - Google Patents
Robust long-range target tracking method based on correlation filtering and deep Siamese network
- Publication number: CN108734151A
- Application number: CN201810613931.7A
- Authority
- CN
- China
- Prior art keywords
- target
- correlation filtering
- model
- twin network
- tracking
- Prior art date: 2018-06-14
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Abstract
A robust long-range target tracking method based on correlation filtering and a deep Siamese network, relating to computer vision. By combining correlation filtering and a deep Siamese network under a unified tracking framework, challenges in long videos such as target occlusion and disappearance from the field of view can be handled effectively. In this method, the proposed expert evaluation mechanism, based on a D-expert and a C-expert, evaluates and screens the candidate target positions generated jointly by the correlation filter and the deep Siamese network to obtain the best tracking result. This result is then used to update the correlation filter tracker, which effectively prevents the tracker from being updated with erroneous samples. The proposed target tracking method is robust to the various challenges in long videos and can track a target stably over long periods.
Description
Technical Field
The invention relates to computer vision, and in particular to a robust long-range target tracking method based on correlation filtering and a deep Siamese network.
Background
As a fundamental research topic in computer vision, target tracking is widely applied in video surveillance, human-computer interaction, virtual reality, intelligent robotics, autonomous driving and other fields. After years of research, a large number of excellent target tracking algorithms have emerged in this field. According to the length of the video to be processed, these algorithms can be divided into short-range and long-range target tracking algorithms. In practical applications, the target often undergoes long-time occlusion, rotation, illumination changes and similar challenges, so short-range tracking algorithms cannot track the target accurately over long periods. Studying a robust long-range tracking algorithm that can effectively handle occlusion, disappearance from the field of view and other challenges, so that the tracker can follow the target accurately for a long time, is therefore of significant practical importance.
In recent years, research on correlation-filtering-based target tracking has made remarkable progress. In 2010, Bolme first proposed the MOSSE tracking algorithm based on correlation filtering, transferring the ridge regression problem to the frequency domain via the Fourier transform and greatly increasing computation speed. In 2012, Henriques proposed the CSK algorithm, which constructs training samples with a cyclic shift matrix and further improves tracking speed. In 2014, Henriques proposed the KCF tracking algorithm, replacing the grayscale features used in CSK with multi-channel HOG features and effectively improving tracking accuracy. Although these methods achieve far beyond real-time tracking speed, their accuracy remains low and hardly meets practical requirements. To further improve tracking accuracy, several correlation filtering tracking algorithms based on CNN features have been proposed in the last few years. In 2015, Ma proposed the HCF tracking algorithm which, under the KCF tracking framework, replaces HOG features with more robust CNN features, greatly improving the accuracy of correlation filtering tracking. Danelljan made correlation filtering tracking more accurate by addressing its boundary-effect problem and introducing CNN features into the correlation filtering framework. In 2016, Qi proposed an improved Hedge algorithm to fuse multiple correlation filtering models trained with different CNN features, obtaining a more robust tracking result. In the same year, Danelljan proposed the C-COT tracking algorithm, which effectively fuses CNN feature maps of different resolutions by training continuous convolution kernels and achieves higher tracking accuracy. To further improve the accuracy and speed of C-COT, Danelljan proposed a more efficient convolution operation in 2017, solving the feature-sparsity problem caused by the original convolution operation and greatly improving both accuracy and speed. Although correlation filtering tracking algorithms based on CNN features show a certain robustness, they cannot handle challenges such as long-time occlusion and disappearance from the field of view in long videos: once the target is occluded, the tracker is wrongly updated over a long period and finally loses the target.
To handle the challenges of long videos more robustly, a representative algorithm is the TLD algorithm proposed by Kalal in 2010. Unlike conventional tracking algorithms, TLD consists of two parts, a tracker and a detector. The tracker localizes the target with an optical flow method, while the detector uses a random fern classifier; the former provides online training samples for the latter, and the latter relocates the target after a tracking failure and reinitializes the tracker. This structure solves, to a certain extent, short-time target occlusion, disappearance from the field of view and similar problems. In 2015, Ma proposed a long-range tracking algorithm (LCT) based on correlation filtering with a structure similar to TLD; LCT uses correlation filtering as the tracker and a random fern classifier as the detector, and because correlation filtering models appearance changes more effectively, it tracks non-rigid objects better than TLD. However, since the trackers and detectors in LCT and TLD must be updated online, when the target in a long video is occluded for a long time or leaves the field of view, both may fail after prolonged erroneous updates, causing the tracking to fail. How to design a robust long-range tracking algorithm that can effectively handle long-time occlusion or loss of the target in long videos is therefore of great significance.
Deep learning has been widely used in computer vision research in recent years. In 2012, Krizhevsky won the ImageNet competition by a wide margin with AlexNet, igniting enthusiasm for deep learning. In the following years, deep learning was successfully applied to object detection, saliency detection, semantic segmentation, metric learning, pedestrian re-identification and various other fields. By contrast, its application to target tracking has certain limitations, for two main reasons: (1) the lack of online training samples; and (2) the heavy time cost of training a model online. The earliest deep learning approach to target tracking was the DLT tracking algorithm proposed by Wang in 2013, which collects positive and negative samples online with particle filtering to train the network. In the following years, various target tracking algorithms based on convolutional neural networks, such as SO-DLT, FCNT, MDNet and SANet, were proposed in succession. These algorithms train the network by collecting positive and negative samples during tracking; although they can reach high accuracy, they are far from meeting the requirement of real-time tracking. In 2016, Bertinetto proposed using a deep Siamese network for target tracking, trained offline on the ILSVRC dataset. During tracking, only the target in the first frame is used as the template, and the region most similar to the template is found in the test frame as the tracking result. This method achieves far beyond real-time speed, but because it lacks online updating, its tracking performance is not ideal when the target appearance changes drastically.
Disclosure of Invention
The invention aims to provide a robust long-range target tracking method based on correlation filtering and a deep Siamese network. By combining correlation filtering and a deep Siamese network under a unified tracking framework, the method can effectively handle challenges in long videos such as target occlusion and disappearance from the field of view. An expert evaluation mechanism based on a D-expert and a C-expert evaluates and screens the candidate target positions generated jointly by the correlation filter and the deep Siamese network to obtain the best target tracking result, and this result is used to update the correlation filter tracker, effectively preventing the tracker from being corrupted by erroneous samples. The method is robust to the various challenges in long videos and can track a target stably for a long time.
The invention comprises the following steps:
1) A frame of training video is given and a training area is defined centered on the target; the training area completely contains the target and part of the background;
In step 1), the training area may be defined as follows: construct a rectangular training area centered on the target, with length and width set in proportion to the length and width of the target so that the area contains the target and part of the background; if the rectangular training area extends beyond the video frame, the missing part is filled with the average pixel value.
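By way of illustration only, a minimal Python/NumPy sketch of this cropping step follows. The helper name and the context factor `pad` are assumptions (the patent states only that the region contains the target plus part of the background); out-of-frame pixels are filled with the frame's average pixel value as described above.

```python
import numpy as np

def crop_training_region(frame, cx, cy, w, h, pad=2.0):
    # pad is an assumed context factor (> 1), not specified by the patent
    W, H = int(round(w * pad)), int(round(h * pad))
    x0, y0 = int(round(cx - W / 2)), int(round(cy - H / 2))
    # start from a canvas filled with the frame's mean pixel, so any part
    # of the crop that falls outside the frame stays mean-filled
    mean_pixel = frame.reshape(-1, frame.shape[2]).mean(axis=0)
    region = np.tile(mean_pixel, (H, W, 1)).astype(frame.dtype)
    fx0, fy0 = max(x0, 0), max(y0, 0)
    fx1 = min(x0 + W, frame.shape[1])
    fy1 = min(y0 + H, frame.shape[0])
    if fx1 > fx0 and fy1 > fy0:
        region[fy0 - y0:fy1 - y0, fx0 - x0:fx1 - x0] = frame[fy0:fy1, fx0:fx1]
    return region
```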
2) Extract CNN features from the training area obtained in step 1) using a pre-trained VGG-Net-19 model;
In step 2), the specific procedure may be: resize the rectangular training area obtained in step 1) with bilinear interpolation to the input size required by the network (224×224×3), and take the output of layer l (the conv3-4, conv4-4 and conv5-4 layers of the VGG-Net-19 model) as the extracted CNN feature, denoted x_l ∈ R^{M×N×D}, where M, N and D are the length, width and number of channels of the feature map, respectively.
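A hedged sketch of this step using torchvision's pre-trained VGG-19 is shown below; the layer indices 16/25/34 for conv3-4/conv4-4/conv5-4 refer to torchvision's `vgg19().features` ordering and should be verified against the model actually used.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

VGG = models.vgg19(pretrained=True).features.eval()  # weights=... on newer torchvision
CONV_IDX = {16: "conv3-4", 25: "conv4-4", 34: "conv5-4"}
MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)  # ImageNet stats
STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

@torch.no_grad()
def extract_cnn_features(region):
    # region: (H, W, 3) uint8 array; bilinearly resized to the 224x224x3 input
    x = torch.from_numpy(region).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    x = F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)
    x = (x - MEAN) / STD
    feats = {}
    for i, layer in enumerate(VGG):
        x = layer(x)
        if i in CONV_IDX:
            # x_l as an (M, N, D) array, as defined in step 2)
            feats[CONV_IDX[i]] = x.squeeze(0).permute(1, 2, 0).numpy()
        if i == 34:
            break
    return feats
```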
3) Train a correlation filtering model with the CNN features obtained in step 2), according to equation (1):

W* = argmin_W Σ_{m,n} ||W · x_l(m,n) − y(m,n)||² + λ||W||²   (1)

where λ is the regularization parameter and y(m,n) is a continuous Gaussian label, y(m,n) = exp(−((m − M/2)² + (n − N/2)²)/(2σ²)), with σ the bandwidth of the Gaussian kernel. Equation (1) is a typical ridge regression and has the closed-form solution

W^d = (Y ⊙ X̄^d) / (Σ_{i=1}^{D} X^i ⊙ X̄^i + λ)   (2)

where W^d is the trained correlation filter on channel d, X̄^d is the complex conjugate of X^d, X^d and Y are the discrete Fourier transforms of x_l^d and of the Gaussian label y respectively, and ⊙ denotes element-wise multiplication;
In step 3), λ = 10⁻⁴ and σ = 10⁻¹.
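The closed-form training of equations (1)-(2) can be sketched in NumPy as follows (an illustration, not the patent's reference implementation). Since the patent does not state whether σ = 10⁻¹ is absolute or relative to the feature-map size, σ is left as a parameter here.

```python
import numpy as np

def gaussian_label(M, N, sigma):
    # continuous Gaussian label y(m, n), peaking at the window centre
    m = np.arange(M)[:, None] - M / 2
    n = np.arange(N)[None, :] - N / 2
    return np.exp(-(m ** 2 + n ** 2) / (2 * sigma ** 2))

def train_correlation_filter(x, sigma, lam=1e-4):
    # x: (M, N, D) CNN feature map x_l from step 2)
    # equation (2): W^d = (Y * conj(X^d)) / (sum_i X^i * conj(X^i) + lam)
    M, N, D = x.shape
    X = np.fft.fft2(x, axes=(0, 1))
    Y = np.fft.fft2(gaussian_label(M, N, sigma))
    denom = (X * np.conj(X)).real.sum(axis=2, keepdims=True) + lam
    return Y[:, :, None] * np.conj(X) / denom
```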
4) Given a frame of test video, apply the trained correlation filtering model to the search area to obtain a response map, and take the position with the maximum response value as the initial target position;
In step 4), the specific procedure may be: extract the CNN features of the search region at layer l of the VGG-Net-19 model in the test video, denoted z_l, of the same size as x_l; the response map of the correlation filtering model on z_l is computed by equation (3):

f_l = F⁻¹(Σ_{d=1}^{D} W_l^d ⊙ Z_l^d)   (3)

where f_l is the response map of the correlation filtering model on the layer-l features, F⁻¹ denotes the inverse Fourier transform, and Z_l^d is the discrete Fourier transform of z_l^d. To improve tracking robustness, the VGG-Net-19 model can extract features of different layers for target localization. Given the feature maps of L layers in total, the response maps of the correlation filtering model on the different layer features are obtained through equation (3), denoted {f_l}_{l=1}^{L}. The initial target position estimated by the correlation filtering model may then be calculated as

p̂_CF = argmax_{(m,n)} Σ_{l=1}^{L} γ_l f_l(m, n)   (4)

where p̂_CF is the target position estimated by the correlation filtering model and γ_l is the weight of the response map on the layer-l features.
Estimating the target position with the correlation filtering model for a given frame of test video comprises the following sub-steps:
a. the number of CNN feature layers L used in formula (4) is set to 3, namely conv3-4, conv4-4 and conv5-4 layers in the VGG-Net-19 model;
b. the weights corresponding to the conv3-4, conv4-4 and conv5-4 layer features of equation (4) are set to 0.5, 1, 0.02, respectively.
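Under the conventions above, the per-layer response maps of equation (3) and their weighted fusion in equation (4) can be sketched as follows; upsampling the coarser layers to a common grid is implied by the patent but not spelled out, so it is marked as an assumption here.

```python
import numpy as np
from scipy.ndimage import zoom

GAMMAS = {"conv3-4": 0.5, "conv4-4": 1.0, "conv5-4": 0.02}  # sub-step b

def locate_target(filters, feats_z, out_shape, gammas=GAMMAS):
    # filters / feats_z: dicts mapping layer name to W_l and z_l
    fused = np.zeros(out_shape)
    for name, z in feats_z.items():
        Z = np.fft.fft2(z, axes=(0, 1))
        # equation (3): f_l = IFFT( sum_d W_l^d * Z_l^d )
        f = np.fft.ifft2((filters[name] * Z).sum(axis=2)).real
        # assumption: bilinearly upsample each response map to out_shape
        fy, fx = out_shape[0] / f.shape[0], out_shape[1] / f.shape[1]
        fused += gammas[name] * zoom(f, (fy, fx), order=1)
    # equation (4): position of the maximum of the weighted response sum
    return np.unravel_index(fused.argmax(), fused.shape)
```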
5) In the test video, construct a search scale pyramid centered on the target of the previous frame;
In step 5), the specific procedure may be: centered on the target of the previous frame, construct Q scale factors around the scale of the previous frame's target; multiplying these factors by the original target scale yields Q search areas of different scales, which are resized with bilinear interpolation to a common size of 255×255×3, denoted {z_q}_{q=1}^{Q}, with Q = 36.
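A sketch of this scale pyramid, reusing the crop helper sketched under step 1); the ratio between adjacent scale factors (`step`) is an assumed value, since the patent fixes only Q = 36.

```python
import cv2
import numpy as np

def build_scale_pyramid(frame, cx, cy, w, h, Q=36, step=1.02):
    # Q scale factors centred on the previous frame's scale
    factors = step ** (np.arange(Q) - (Q - 1) / 2)
    regions = []
    for s in factors:
        r = crop_training_region(frame, cx, cy, w * s, h * s, pad=2.0)
        # bilinear resize of every search area to the common 255x255x3 size
        regions.append(cv2.resize(r, (255, 255), interpolation=cv2.INTER_LINEAR))
    return factors, np.stack(regions)
```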
6) Using the pre-trained deep Siamese network and taking the target in the first video frame as the template, match the template on each scale obtained in step 5) to get the candidate target position with the highest confidence at each scale, then sort the candidates by confidence to obtain the K candidate target positions with the highest confidence; the computation is

ŝ_q = max_{(m,n)} S(o, z_q)(m, n),  q = 1, …, Q   (5)

where o is the target template, z_q the search region at scale q, and S(·,·) the similarity measurement function learned offline by the deep Siamese network, which returns a similarity map; ŝ_q is the best similarity value of the target template at the q-th search scale. Sorting {ŝ_q} yields the top K candidate target positions, denoted as the set E = {e_k}_{k=1}^{K}. Let e_0 = p̂_CF be the position estimated by the correlation filtering model; the set U = {e_k}_{k=0}^{K} then represents all candidate target positions.
In step 6), the parameter K of the deep Siamese network is set to 1 (see: Luca Bertinetto et al., 2016, ECCV Workshops).
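Given the Q similarity maps returned by the Siamese network, the candidate selection of equation (5) amounts to a peak search per scale followed by sorting, as in this sketch:

```python
import numpy as np

def top_k_candidates(sim_maps, factors, K=1):
    # sim_maps: the Q similarity maps S(o, z_q); factors: the Q scale factors
    peaks = []
    for q, sim in enumerate(sim_maps):
        idx = np.unravel_index(sim.argmax(), sim.shape)
        peaks.append((float(sim[idx]), idx, factors[q]))  # (s_q, position, scale)
    peaks.sort(key=lambda t: t[0], reverse=True)  # sort by confidence
    return peaks[:K]  # the set E of equation (5)
```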
7) Evaluate the candidate target positions obtained in step 6) with the D-expert based on depth similarity to obtain the best candidate target position;
In step 7), the evaluation may proceed as follows: construct an online target appearance model consisting mainly of three types of samples: (1) the target sample in the first frame; (2) target samples with high confidence collected during tracking; (3) the target sample in the previous frame. Extract the fully-connected-layer features of the three samples with the VGG-Net-19 model, denoted f(v_0), f(v_1) and f(v_2) respectively, and let the set V represent the three types of samples. Similarly, extract fully-connected features from the candidate targets in U, denoted f(e_k). For e_k ∈ U, the D-expert calculates its cumulative similarity distance on V:

d(e_k) = Σ_{v ∈ V} ||f(e_k) − f(v)||₂   (6)

By comparing the cumulative similarity distances, the best candidate target obtained by the deep Siamese network search can be calculated by equation (7):

e* = argmin_{e_k ∈ E} d(e_k)   (7)

The D-expert further evaluates the target position estimated by correlation filtering against the best candidate obtained by the Siamese network search over the different scale ranges:

r_D = sign(d(e_0) − d(e*))   (8)

where r_D is the evaluation value of the D-expert and sign(·) is the sign function. If r_D = 1, the cumulative distance on the appearance model of the best candidate obtained by the Siamese network search is smaller than that of the candidate estimated by correlation filtering, so the best candidate is more reliable and the subsequent evaluation is carried out; otherwise, the candidate estimated by correlation filtering is taken as the final tracking result;
the on-line target apparent model size can be set to | V0|=|V1|=|V2|=1。
8) Evaluate the target position obtained in step 4) and the best candidate target position obtained in step 7) with the C-expert based on correlation filtering, obtain the best target tracking result, and complete the tracking. The C-expert uses two correlation filtering models for the evaluation, denoted W_1 and W_t: the former is trained only on the first video frame and keeps the original target model, while the latter is trained and updated throughout the tracking process and accounts for object deformation. Let R_t(m, n) and R_1(m, n) denote the response values of W_t and W_1 at position (m, n), respectively. The C-expert evaluates the target position p̂_CF estimated by the correlation filtering model against the best target position e* obtained by the deep Siamese network search:

r_C = sign(max(R_t(e*) − R_t(p̂_CF), R_1(e*) − R_1(p̂_CF)))   (9)

where r_C is the evaluation value of the C-expert. If r_C = 1, e* is selected as the final tracking result; otherwise, p̂_CF is the final tracking result. If R_t(e*) > R_t(p̂_CF), the best position from the deep Siamese network search accounts for more object deformation and is more reliable; if R_1(e*) > R_1(p̂_CF), the correlation filtering model has probably been updated incorrectly, and the result obtained by the deep Siamese network, having the higher response value on the original model, has higher confidence and is therefore selected as the final tracking result.
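A sketch of the C-expert decision; note that equation (9) here is reconstructed from the textual description above, so this logic is an interpretation rather than a verbatim implementation.

```python
def c_expert(R_t, R_1, p_cf, p_s):
    # R_t, R_1: response maps of the continuously updated and the
    # first-frame-only correlation filters; p_cf, p_s: (m, n) positions
    if R_t[p_s] > R_t[p_cf]:   # Siamese result models more deformation
        return p_s, 1
    if R_1[p_s] > R_1[p_cf]:   # updated filter was likely corrupted
        return p_s, 1
    return p_cf, -1            # keep the correlation-filter estimate
```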
According to the invention, correlation filtering and a deep Siamese network are combined under a unified tracking framework, so that challenges in long videos such as target occlusion and disappearance from the field of view can be handled effectively. In the tracking method, the proposed expert evaluation mechanism based on the D-expert and C-expert effectively evaluates and screens the candidate target positions generated jointly by the correlation filter and the deep Siamese network to obtain the best target tracking result, and this result is used to update the correlation filter tracker, effectively preventing it from being updated with erroneous samples. The proposed target tracking method is robust to the various challenges in long videos and can track the target stably for a long time.
Drawings
Fig. 1 is a schematic overall flow chart of an embodiment of the present invention.
FIG. 2 is a diagram of qualitative tracking results on a video with target occlusion according to an embodiment of the present invention. The rectangular box marks the target tracking result obtained by the invention.
Detailed Description
The method of the present invention is described in detail below with reference to the accompanying drawings and examples.
Referring to fig. 1, an implementation of an embodiment of the invention includes the steps of:
1) Given a frame of training video, a training area is defined centered on the target; the training area completely contains the target and part of the background. The area is defined as follows: construct a rectangular training area centered on the target, with length and width set in proportion to the length and width of the target; if the rectangular area extends beyond the video frame, the missing part is filled with the average pixel value.
2) Extract the CNN features of the training area obtained in step 1) using the pre-trained VGG-Net-19 model. The specific process is as follows: resize the rectangular training area obtained in step 1) with bilinear interpolation so that it matches the input size required by the network (224×224×3), and take the output of layer l (the conv3-4, conv4-4 and conv5-4 layers of the VGG-Net-19 model), denoted x_l ∈ R^{M×N×D}, where M, N and D are the length, width and number of channels of the feature map, respectively.
3) Train the correlation filtering model using the CNN features obtained in step 2), according to equation (1):

W* = argmin_W Σ_{m,n} ||W · x_l(m,n) − y(m,n)||² + λ||W||²   (1)

where λ is the regularization parameter and y(m,n) is a continuous Gaussian label, y(m,n) = exp(−((m − M/2)² + (n − N/2)²)/(2σ²)), with σ the bandwidth of the Gaussian kernel. Equation (1) is a typical ridge regression with the closed-form solution

W^d = (Y ⊙ X̄^d) / (Σ_{i=1}^{D} X^i ⊙ X̄^i + λ)   (2)

where W^d is the trained correlation filter on channel d, X̄^d is the complex conjugate of X^d, X^d and Y are the discrete Fourier transforms of x_l^d and of the Gaussian label y respectively, and ⊙ denotes element-wise multiplication.
4) Given a frame of test video, apply the trained correlation filtering model to the search area to obtain a response map, and take the position with the maximum value in the response map as the initial target position. The specific process is as follows: extract the CNN features of the search region at layer l of the VGG-Net-19 model in the test video, denoted z_l, of the same size as x_l. The response map of the correlation filtering model on z_l is computed by equation (3):

f_l = F⁻¹(Σ_{d=1}^{D} W_l^d ⊙ Z_l^d)   (3)

where f_l is the response map of the correlation filtering model on the layer-l features, F⁻¹ denotes the inverse Fourier transform, and Z_l^d is the discrete Fourier transform of z_l^d. To improve the robustness of tracking, the target is localized with features from different layers. Given the feature maps of L layers in total, the response maps of the correlation filtering model on the different layer features are obtained through equation (3), denoted {f_l}_{l=1}^{L}. The initial target position estimated by the correlation filtering model is

p̂_CF = argmax_{(m,n)} Σ_{l=1}^{L} γ_l f_l(m, n)   (4)

where p̂_CF is the target position estimated by the correlation filtering model and γ_l is the weight of the response map on the layer-l features.
5) In the test video, construct a search scale pyramid centered on the target of the previous frame. The specific process is as follows: centered on the target of the previous frame, construct Q scale factors around its scale; multiplying these factors by the original target scale yields Q search areas of different scales, which are resized with bilinear interpolation to a common size of 255×255×3, denoted {z_q}_{q=1}^{Q}.
6) Using the pre-trained deep Siamese network and taking the target in the first video frame as the template, match the template on each scale obtained in step 5) to obtain the candidate target position with the highest confidence at each scale, then sort by confidence to obtain the K candidate target positions with the highest confidence. The computation is

ŝ_q = max_{(m,n)} S(o, z_q)(m, n),  q = 1, …, Q   (5)

where o is the target template, z_q the search region at scale q, and S(·,·) the similarity measurement function learned offline by the deep Siamese network, which returns a similarity map; ŝ_q is the best similarity value of the target template at the q-th search scale. Sorting {ŝ_q} yields the top K candidate target positions, denoted as the set E = {e_k}_{k=1}^{K}. Let e_0 = p̂_CF be the position estimated by the correlation filtering model; the set U = {e_k}_{k=0}^{K} then represents all candidate target positions.
7) Evaluate the candidate target positions obtained in step 6) with the D-expert based on depth similarity to obtain the best candidate target position. The specific process is as follows: construct an online target appearance model consisting mainly of three types of samples: (1) the target sample in the first frame; (2) target samples with high confidence collected during tracking; (3) the latest tracking result. Extract the fully-connected-layer features of the three samples with the VGG-Net-19 model, denoted f(v_0), f(v_1) and f(v_2) respectively, and let the set V represent the three types of samples. Similarly, extract fully-connected features from the candidate targets in U, denoted f(e_k). For e_k ∈ U, the D-expert calculates its cumulative similarity distance on V:

d(e_k) = Σ_{v ∈ V} ||f(e_k) − f(v)||₂   (6)

By comparing the cumulative similarity distances, the best candidate target obtained by the deep Siamese network search is calculated by equation (7):

e* = argmin_{e_k ∈ E} d(e_k)   (7)

The D-expert further evaluates the target position estimated by correlation filtering against the best candidate obtained by the Siamese network search over the different scale ranges:

r_D = sign(d(e_0) − d(e*))   (8)

where r_D is the evaluation value of the D-expert and sign(·) is the sign function. If r_D = 1, the cumulative distance on the appearance model of the best candidate obtained by the Siamese network search is smaller than that of the candidate estimated by correlation filtering, so the best candidate is more reliable and the subsequent evaluation is carried out; otherwise, the candidate estimated by correlation filtering is taken as the final tracking result.
8) Evaluate the target position obtained in step 4) and the best candidate target position obtained in step 7) with the C-expert based on correlation filtering to obtain the best target tracking result and complete the tracking. The C-expert uses two correlation filtering models for the evaluation, denoted W_1 and W_t: the former is trained only on the first video frame and keeps the original target model, while the latter is trained and updated throughout the tracking process and accounts for target deformation. Let R_t(m, n) and R_1(m, n) denote the response values of W_t and W_1 at position (m, n), respectively. The C-expert evaluates the target position p̂_CF estimated by the correlation filtering model against the best target position e* obtained by the deep Siamese network search:

r_C = sign(max(R_t(e*) − R_t(p̂_CF), R_1(e*) − R_1(p̂_CF)))   (9)

where r_C is the evaluation value of the C-expert. If r_C = 1, e* is selected as the final tracking result; otherwise, p̂_CF is the final tracking result. If R_t(e*) > R_t(p̂_CF), the best position from the deep Siamese network search accounts for more target deformation and is more reliable, and it is taken as the best tracking result; if R_1(e*) > R_1(p̂_CF), the correlation filtering model has probably been updated incorrectly, and the result obtained by the deep Siamese network, having the higher response value on the original model, has higher confidence and is therefore selected as the final tracking result.
The overall framework of the invention is shown in FIG. 1. FIG. 2 shows qualitative tracking results on a video with target occlusion according to an embodiment of the invention, where the rectangular box marks the result of the method. As can be seen from the figure, the method can effectively handle challenges such as target occlusion and disappearance from the field of view in long videos.
Table 1 compares the precision, success rate and speed of the invention with 11 other target tracking methods on the OTB-2013 data set. The invention obtains good tracking results on this mainstream data set.
TABLE 1

Method | Precision (%) | Success rate (%) | Speed (FPS)
---|---|---|---
The invention | 91.5 | 65.6 | 8.9
CF2 (2015) | 89.1 | 60.5 | 10.5
HDT (2016) | 88.9 | 60.3 | 11.1
SiamFC (2016) | 80.1 | 60.6 | 68.1
Staple (2016) | 79.3 | 60.0 | 62.4
SRDCF (2015) | 83.8 | 62.6 | 3.8
KCF (2015) | 74.1 | 51.3 | 205.3
DSST (2014) | 74.0 | 55.4 | 23.6
CSK (2012) | 54.5 | 39.8 | 458.0
IVT (2008) | 49.9 | 35.8 | 40.1
LCT (2015) | 84.8 | 62.8 | 21.0
CT (2012) | 40.6 | 30.6 | 53.9
In Table 1:
KCF corresponds to the method proposed by J. F. Henriques et al. (J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-Speed Tracking with Kernelized Correlation Filters," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 583-596, 2015.)
DSST corresponds to the method proposed by M. Danelljan et al. (M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg, "Discriminative Scale Space Tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 8, pp. 1561-1575, 2016.)
Staple corresponds to the method proposed by L. Bertinetto et al. (L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. S. Torr, "Staple: Complementary Learners for Real-Time Tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1401-1409.)
SRDCF corresponds to the method proposed by M. Danelljan et al. (M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg, "Learning Spatially Regularized Correlation Filters for Visual Tracking," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 4310-4318.)
SiamFC corresponds to the method proposed by L. Bertinetto et al. (L. Bertinetto, J. Valmadre, J. Henriques, A. Vedaldi, and P. H. S. Torr, "Fully-Convolutional Siamese Networks for Object Tracking," in Proc. Eur. Conf. Comput. Vis. Workshops, 2016, pp. 850-865.)
CF2 corresponds to the method proposed by C. Ma et al. (C. Ma, J.-B. Huang, X. K. Yang, and M.-H. Yang, "Hierarchical Convolutional Features for Visual Tracking," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 3074-3082.)
HDT corresponds to the method proposed by Y. K. Qi et al. (Y. K. Qi, S. P. Zhang, L. Qin, H. X. Yao, Q. M. Huang, J. Lim, and M.-H. Yang, "Hedged Deep Tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4303-4311.)
LCT corresponds to the method proposed by C. Ma et al. (C. Ma, X. K. Yang, C. Y. Zhang, and M.-H. Yang, "Long-Term Correlation Tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 5388-5396.)
CSK corresponds to the method proposed by J. F. Henriques et al. (J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "Exploiting the Circulant Structure of Tracking-by-Detection with Kernels," in Proc. Eur. Conf. Comput. Vis., 2012, pp. 702-715.)
CT corresponds to the method proposed by K. H. Zhang et al. (K. H. Zhang, L. Zhang, and M.-H. Yang, "Real-Time Compressive Tracking," in Proc. Eur. Conf. Comput. Vis., 2012, pp. 864-877.)
IVT corresponds to the method proposed by D. A. Ross et al. (D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, "Incremental Learning for Robust Visual Tracking," Int. J. Comput. Vis., vol. 77, no. 1, pp. 125-141, 2008.)
Claims (10)
1. A robust long-range target tracking method based on correlation filtering and a deep Siamese network, characterized by comprising the following steps:
1) a frame of training video is given, a training area is defined centered on the target, and the training area completely contains the target and part of the background;
2) CNN features are extracted from the training area obtained in step 1) using a pre-trained VGG-Net-19 model;
3) a correlation filtering model is trained using the CNN features obtained in step 2), according to equation (1):

W* = argmin_W Σ_{m,n} ||W · x_l(m,n) − y(m,n)||² + λ||W||²   (1)

wherein λ is the regularization parameter and y(m,n) is a continuous Gaussian label, y(m,n) = exp(−((m − M/2)² + (n − N/2)²)/(2σ²)), with σ the bandwidth of the Gaussian kernel; equation (1) is a typical ridge regression, solved in closed form as

W^d = (Y ⊙ X̄^d) / (Σ_{i=1}^{D} X^i ⊙ X̄^i + λ)   (2)

wherein W^d is the trained correlation filter on channel d, X̄^d is the complex conjugate of X^d, X^d and Y are the discrete Fourier transforms of x_l^d and of the Gaussian label y respectively, and ⊙ denotes element-wise multiplication;
4) a frame of test video is given, the trained correlation filtering model is applied to the search area to obtain a response map, and the position with the maximum response value in the response map is determined as the initial target position;
5) in the test video, a search scale pyramid is constructed centered on the target of the previous frame;
6) using the pre-trained deep Siamese network and taking the target in the first video frame as the template, the template is matched on each scale obtained in step 5) to obtain the candidate target position with the highest confidence at each scale, and the candidates are sorted by confidence to obtain the K candidate target positions with the highest confidence; the computation is

ŝ_q = max_{(m,n)} S(o, z_q)(m, n),  q = 1, …, Q   (5)

wherein o is the target template, z_q the search region at scale q, and S(·,·) the similarity measurement function learned offline by the deep Siamese network, which returns a similarity map; ŝ_q is the best similarity value of the target template at the q-th search scale; sorting {ŝ_q} yields the first K candidate target positions, denoted as the set E = {e_k}_{k=1}^{K}; letting e_0 = p̂_CF be the position estimated by the correlation filtering model, the set U = {e_k}_{k=0}^{K} represents all candidate target positions;
7) the candidate target positions obtained in step 6) are evaluated with the D-expert based on depth similarity to obtain the best candidate target position;
8) the target position obtained in step 4) and the best candidate target position obtained in step 7) are evaluated respectively with the C-expert based on correlation filtering to obtain the best target tracking result and complete the tracking; the C-expert uses two correlation filtering models for the evaluation, denoted W_1 and W_t, wherein the former is trained only on the first video frame and keeps the original target model, and the latter is trained and updated throughout the tracking process and accounts for object deformation; let R_t(m, n) and R_1(m, n) denote the response values of W_t and W_1 at position (m, n), respectively; the C-expert evaluates the target position p̂_CF estimated by the correlation filtering model against the best target position e* obtained by the deep Siamese network search:

r_C = sign(max(R_t(e*) − R_t(p̂_CF), R_1(e*) − R_1(p̂_CF)))   (9)

wherein r_C is the evaluation value of the C-expert; if r_C = 1, e* is selected as the final tracking result; otherwise, p̂_CF is the final tracking result; if R_t(e*) > R_t(p̂_CF), the best position from the deep Siamese network search accounts for more object deformation and is more reliable; if R_1(e*) > R_1(p̂_CF), the correlation filtering model has probably been updated incorrectly, and the result obtained by the deep Siamese network, having the higher response value on the original model, has higher confidence and is therefore selected as the final tracking result.
2. The robust long-range target tracking method based on correlation filtering and a deep Siamese network according to claim 1, wherein in step 1), the training area is defined as follows: a rectangular training area is constructed centered on the target, with length and width set in proportion to the length and width of the target; if the rectangular training area extends beyond the training video frame, the missing part is filled with the average pixel value.
3. The robust long-range target tracking method based on correlation filtering and a deep Siamese network according to claim 1, wherein in step 2), the specific process of extracting the CNN features from the training area obtained in step 1) with the pre-trained VGG-Net-19 model is: the rectangular training area obtained in step 1) is resized with bilinear interpolation to the input size required by the network (224×224×3), and the output of layer l is taken as the extracted CNN feature, denoted x_l ∈ R^{M×N×D}, wherein M, N and D are the length, width and number of channels of the feature map, respectively; the layers l correspond to the conv3-4, conv4-4 and conv5-4 layers of the VGG-Net-19 model.
4. The robust long-range target tracking method based on correlation filtering and a deep Siamese network according to claim 1, wherein in step 3), λ = 10⁻⁴ and σ = 10⁻¹.
5. The robust long-range target tracking method based on correlation filtering and a deep Siamese network according to claim 1, wherein in step 4), the specific process of determining the position with the largest response value in the response map as the initial target position is: the CNN features of the search region at layer l of the VGG-Net-19 model are extracted in the test video, denoted z_l, of the same size as x_l; the response map of the correlation filtering model on z_l is calculated by equation (3):

f_l = F⁻¹(Σ_{d=1}^{D} W_l^d ⊙ Z_l^d)   (3)

wherein f_l is the response map of the correlation filtering model on the layer-l features, F⁻¹ denotes the inverse Fourier transform, and Z_l^d is the discrete Fourier transform of z_l^d; to improve tracking robustness, the VGG-Net-19 model is used to extract features of different layers for target localization; given the feature maps of L layers in total, the response maps of the correlation filtering model on the different layer features are obtained through equation (3), denoted {f_l}_{l=1}^{L}; the initial target position estimated by the correlation filtering model is calculated as

p̂_CF = argmax_{(m,n)} Σ_{l=1}^{L} γ_l f_l(m, n)   (4)

wherein p̂_CF is the target position estimated by the correlation filtering model and γ_l is the weight of the response map on the layer-l features.
6. The robust long-range target tracking method based on correlation filtering and a deep Siamese network according to claim 1, wherein in step 4), the target position is estimated for the given frame of test video using the correlation filtering model, comprising the following sub-steps:
a. the number of CNN feature layers L used in formula (4) is set to 3, namely conv3-4, conv4-4 and conv5-4 layers in the VGG-Net-19 model;
b. the weights corresponding to the conv3-4, conv4-4 and conv5-4 layer features of equation (4) are set to 0.5, 1, 0.02, respectively.
7. The robust long-range target tracking method based on correlation filtering and a deep Siamese network according to claim 1, wherein in step 5), the specific process of constructing the search scale pyramid is: centered on the target of the previous frame, Q scale factors are constructed around the scale of the previous frame's target; multiplying them by the original target scale yields Q search areas of different scales, which are resized with bilinear interpolation to a common size of 255×255×3, denoted {z_q}_{q=1}^{Q}, with Q = 36.
8. The robust long-range target tracking method based on correlation filtering and a deep Siamese network according to claim 1, wherein in step 6), the parameter K of the deep Siamese network is set to 1.
9. The robust long-range target tracking method based on correlation filtering and a deep Siamese network according to claim 1, wherein in step 7), the candidate target positions obtained in step 6) are evaluated with the D-expert based on depth similarity, and the specific process of obtaining the best candidate target position is: an online target appearance model is constructed, consisting mainly of three types of samples: (1) the target sample in the first frame; (2) target samples with high confidence collected during tracking; (3) the target sample in the previous frame; the fully-connected-layer features of the three samples are extracted with the VGG-Net-19 model, denoted f(v_0), f(v_1) and f(v_2) respectively, and the set V represents the three types of samples; similarly, fully-connected features are extracted from the candidate targets in U, denoted f(e_k); for e_k ∈ U, the D-expert calculates its cumulative similarity distance on V:

d(e_k) = Σ_{v ∈ V} ||f(e_k) − f(v)||₂   (6)

by comparing the cumulative similarity distances, the best candidate target obtained by the deep Siamese network search is calculated by equation (7):

e* = argmin_{e_k ∈ E} d(e_k)   (7)

the D-expert further evaluates the target position estimated by correlation filtering against the best candidate obtained by the Siamese network search over the different scale ranges:

r_D = sign(d(e_0) − d(e*))   (8)

wherein r_D is the evaluation value of the D-expert and sign(·) is the sign function; if r_D = 1, the cumulative distance on the appearance model of the best candidate obtained by the Siamese network search is smaller than that of the candidate estimated by correlation filtering, so the best candidate is more reliable and the subsequent evaluation is carried out; otherwise, the candidate estimated by correlation filtering is taken as the final tracking result.
10. The robust long-range target tracking method based on correlation filtering and a deep Siamese network according to claim 9, wherein the sizes of the online target appearance model are set to |V_0| = |V_1| = |V_2| = 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810613931.7A CN108734151B (en) | 2018-06-14 | 2018-06-14 | Robust long-range target tracking method based on correlation filtering and deep Siamese network
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810613931.7A CN108734151B (en) | 2018-06-14 | 2018-06-14 | Robust long-range target tracking method based on correlation filtering and deep Siamese network
Publications (2)
Publication Number | Publication Date |
---|---|
CN108734151A true CN108734151A (en) | 2018-11-02 |
CN108734151B CN108734151B (en) | 2020-04-14 |
Family
ID=63929665
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810613931.7A Active CN108734151B (en) | 2018-06-14 | 2018-06-14 | Robust long-range target tracking method based on correlation filtering and depth twin network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108734151B (en) |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109543615A (en) * | 2018-11-23 | 2019-03-29 | 长沙理工大学 | A kind of double learning model method for tracking target based on multi-stage characteristics |
CN109598684A (en) * | 2018-11-21 | 2019-04-09 | 华南理工大学 | In conjunction with the correlation filtering tracking of twin network |
CN109711316A (en) * | 2018-12-21 | 2019-05-03 | 广东工业大学 | A kind of pedestrian recognition methods, device, equipment and storage medium again |
CN109727272A (en) * | 2018-11-20 | 2019-05-07 | 南京邮电大学 | A kind of method for tracking target based on double branch's space-time regularization correlation filters |
CN109784155A (en) * | 2018-12-10 | 2019-05-21 | 西安电子科技大学 | Visual target tracking method, intelligent robot based on verifying and mechanism for correcting errors |
CN109859244A (en) * | 2019-01-22 | 2019-06-07 | 西安微电子技术研究所 | A kind of visual tracking method based on convolution sparseness filtering |
CN110033012A (en) * | 2018-12-28 | 2019-07-19 | 华中科技大学 | A kind of production method for tracking target based on channel characteristics weighted convolution neural network |
CN110033477A (en) * | 2019-04-04 | 2019-07-19 | 中设设计集团股份有限公司 | A kind of road vehicle LBP feature correlation filtering tracking suitable for blocking scene |
CN110033478A (en) * | 2019-04-12 | 2019-07-19 | 北京影谱科技股份有限公司 | Visual target tracking method and device based on depth dual training |
CN110070562A (en) * | 2019-04-02 | 2019-07-30 | 西北工业大学 | A kind of context-sensitive depth targets tracking |
CN110210503A (en) * | 2019-06-14 | 2019-09-06 | 厦门历思科技服务有限公司 | A kind of seal recognition methods and device and equipment |
CN110415271A (en) * | 2019-06-28 | 2019-11-05 | 武汉大学 | One kind fighting twin network target tracking method based on the multifarious generation of appearance |
CN110675429A (en) * | 2019-09-24 | 2020-01-10 | 湖南人文科技学院 | Long-range and short-range complementary target tracking method based on twin network and related filter |
CN110781778A (en) * | 2019-10-11 | 2020-02-11 | 珠海格力电器股份有限公司 | Access control method and device, storage medium and home system |
CN110942471A (en) * | 2019-10-30 | 2020-03-31 | 电子科技大学 | Long-term target tracking method based on space-time constraint |
CN111091585A (en) * | 2020-03-19 | 2020-05-01 | 腾讯科技(深圳)有限公司 | Target tracking method, device and storage medium |
CN111260687A (en) * | 2020-01-10 | 2020-06-09 | 西北工业大学 | Aerial video target tracking method based on semantic perception network and related filtering |
CN111931571A (en) * | 2020-07-07 | 2020-11-13 | 华中科技大学 | Video character target tracking method based on online enhanced detection and electronic equipment |
CN112085765A (en) * | 2020-09-15 | 2020-12-15 | 浙江理工大学 | Video target tracking method combining particle filtering and metric learning |
CN112348849A (en) * | 2020-10-27 | 2021-02-09 | 南京邮电大学 | Twin network video target tracking method and device |
CN112507859A (en) * | 2020-12-05 | 2021-03-16 | 西北工业大学 | Visual tracking method for mobile robot |
CN112560695A (en) * | 2020-12-17 | 2021-03-26 | 中国海洋大学 | Underwater target tracking method, system, storage medium, equipment, terminal and application |
CN112634316A (en) * | 2020-12-30 | 2021-04-09 | 河北工程大学 | Target tracking method, device, equipment and storage medium |
CN115061574A (en) * | 2022-07-06 | 2022-09-16 | 陈伟 | Human-computer interaction system based on visual core algorithm |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104616324A (en) * | 2015-03-06 | 2015-05-13 | 厦门大学 | Target tracking method based on adaptive appearance model and point-set distance metric learning |
CN105868789A (en) * | 2016-04-07 | 2016-08-17 | 厦门大学 | Object discovery method based on image area convergence measurement |
CN106952288A (en) * | 2017-03-31 | 2017-07-14 | 西北工业大学 | Based on convolution feature and global search detect it is long when block robust tracking method |
CN107154024A (en) * | 2017-05-19 | 2017-09-12 | 南京理工大学 | Dimension self-adaption method for tracking target based on depth characteristic core correlation filter |
CN107992826A (en) * | 2017-12-01 | 2018-05-04 | 广州优亿信息科技有限公司 | A kind of people stream detecting method based on the twin network of depth |
-
2018
- 2018-06-14 CN CN201810613931.7A patent/CN108734151B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104616324A (en) * | 2015-03-06 | 2015-05-13 | 厦门大学 | Target tracking method based on adaptive appearance model and point-set distance metric learning |
CN105868789A (en) * | 2016-04-07 | 2016-08-17 | 厦门大学 | Object discovery method based on image area convergence measurement |
CN106952288A (en) * | 2017-03-31 | 2017-07-14 | 西北工业大学 | Based on convolution feature and global search detect it is long when block robust tracking method |
CN107154024A (en) * | 2017-05-19 | 2017-09-12 | 南京理工大学 | Dimension self-adaption method for tracking target based on depth characteristic core correlation filter |
CN107992826A (en) * | 2017-12-01 | 2018-05-04 | 广州优亿信息科技有限公司 | A kind of people stream detecting method based on the twin network of depth |
Non-Patent Citations (1)
Title |
---|
Bing Liu et al.: "Convolutional neural networks based scale-adaptive kernelized correlation filter for robust visual object tracking", 2017 International Conference on Security, Pattern Analysis, and Cybernetics (SPAC) *
Cited By (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109727272A (en) * | 2018-11-20 | 2019-05-07 | 南京邮电大学 | A kind of method for tracking target based on double branch's space-time regularization correlation filters |
CN109727272B (en) * | 2018-11-20 | 2022-08-12 | 南京邮电大学 | Target tracking method based on double-branch space-time regularization correlation filter |
CN109598684A (en) * | 2018-11-21 | 2019-04-09 | 华南理工大学 | In conjunction with the correlation filtering tracking of twin network |
CN109543615A (en) * | 2018-11-23 | 2019-03-29 | 长沙理工大学 | A kind of double learning model method for tracking target based on multi-stage characteristics |
CN109543615B (en) * | 2018-11-23 | 2022-10-28 | 长沙理工大学 | Double-learning-model target tracking method based on multi-level features |
CN109784155B (en) * | 2018-12-10 | 2022-04-29 | 西安电子科技大学 | Visual target tracking method based on verification and error correction mechanism and intelligent robot |
CN109784155A (en) * | 2018-12-10 | 2019-05-21 | 西安电子科技大学 | Visual target tracking method, intelligent robot based on verifying and mechanism for correcting errors |
CN109711316B (en) * | 2018-12-21 | 2022-10-21 | 广东工业大学 | Pedestrian re-identification method, device, equipment and storage medium |
CN109711316A (en) * | 2018-12-21 | 2019-05-03 | 广东工业大学 | A kind of pedestrian recognition methods, device, equipment and storage medium again |
CN110033012A (en) * | 2018-12-28 | 2019-07-19 | 华中科技大学 | A kind of production method for tracking target based on channel characteristics weighted convolution neural network |
CN109859244A (en) * | 2019-01-22 | 2019-06-07 | 西安微电子技术研究所 | A kind of visual tracking method based on convolution sparseness filtering |
CN109859244B (en) * | 2019-01-22 | 2022-07-08 | 西安微电子技术研究所 | Visual tracking method based on convolution sparse filtering |
CN110070562A (en) * | 2019-04-02 | 2019-07-30 | 西北工业大学 | A kind of context-sensitive depth targets tracking |
CN110033477A (en) * | 2019-04-04 | 2019-07-19 | 中设设计集团股份有限公司 | A kind of road vehicle LBP feature correlation filtering tracking suitable for blocking scene |
CN110033477B (en) * | 2019-04-04 | 2021-05-11 | 华设设计集团股份有限公司 | Road vehicle LBP feature-dependent filtering tracking method suitable for occlusion scene |
CN110033478A (en) * | 2019-04-12 | 2019-07-19 | 北京影谱科技股份有限公司 | Visual target tracking method and device based on depth dual training |
CN110210503A (en) * | 2019-06-14 | 2019-09-06 | 厦门历思科技服务有限公司 | A kind of seal recognition methods and device and equipment |
CN110210503B (en) * | 2019-06-14 | 2021-01-01 | 厦门历思科技服务有限公司 | Seal identification method, device and equipment |
CN110415271A (en) * | 2019-06-28 | 2019-11-05 | 武汉大学 | One kind fighting twin network target tracking method based on the multifarious generation of appearance |
CN110415271B (en) * | 2019-06-28 | 2022-06-07 | 武汉大学 | Appearance diversity-based method for tracking generation twin-resisting network target |
CN110675429A (en) * | 2019-09-24 | 2020-01-10 | 湖南人文科技学院 | Long-range and short-range complementary target tracking method based on twin network and related filter |
CN110781778B (en) * | 2019-10-11 | 2021-04-20 | 珠海格力电器股份有限公司 | Access control method and device, storage medium and home system |
CN110781778A (en) * | 2019-10-11 | 2020-02-11 | 珠海格力电器股份有限公司 | Access control method and device, storage medium and home system |
CN110942471B (en) * | 2019-10-30 | 2022-07-01 | 电子科技大学 | Long-term target tracking method based on space-time constraint |
CN110942471A (en) * | 2019-10-30 | 2020-03-31 | 电子科技大学 | Long-term target tracking method based on space-time constraint |
CN111260687A (en) * | 2020-01-10 | 2020-06-09 | 西北工业大学 | Aerial video target tracking method based on semantic perception network and related filtering |
CN111260687B (en) * | 2020-01-10 | 2022-09-27 | 西北工业大学 | Aerial video target tracking method based on semantic perception network and related filtering |
CN111091585A (en) * | 2020-03-19 | 2020-05-01 | 腾讯科技(深圳)有限公司 | Target tracking method, device and storage medium |
CN111931571B (en) * | 2020-07-07 | 2022-05-17 | 华中科技大学 | Video character target tracking method based on online enhanced detection and electronic equipment |
CN111931571A (en) * | 2020-07-07 | 2020-11-13 | 华中科技大学 | Video character target tracking method based on online enhanced detection and electronic equipment |
CN112085765B (en) * | 2020-09-15 | 2024-05-31 | 浙江理工大学 | Video target tracking method combining particle filtering and metric learning |
CN112085765A (en) * | 2020-09-15 | 2020-12-15 | 浙江理工大学 | Video target tracking method combining particle filtering and metric learning |
CN112348849B (en) * | 2020-10-27 | 2023-06-20 | 南京邮电大学 | Twin network video target tracking method and device |
CN112348849A (en) * | 2020-10-27 | 2021-02-09 | 南京邮电大学 | Twin network video target tracking method and device |
CN112507859A (en) * | 2020-12-05 | 2021-03-16 | 西北工业大学 | Visual tracking method for mobile robot |
CN112507859B (en) * | 2020-12-05 | 2024-01-12 | 西北工业大学 | Visual tracking method for mobile robot |
CN112560695B (en) * | 2020-12-17 | 2023-03-24 | 中国海洋大学 | Underwater target tracking method, system, storage medium, equipment, terminal and application |
CN112560695A (en) * | 2020-12-17 | 2021-03-26 | 中国海洋大学 | Underwater target tracking method, system, storage medium, equipment, terminal and application |
CN112634316A (en) * | 2020-12-30 | 2021-04-09 | 河北工程大学 | Target tracking method, device, equipment and storage medium |
CN115061574B (en) * | 2022-07-06 | 2023-03-31 | 大连厚仁科技有限公司 | Human-computer interaction system based on visual core algorithm |
CN115061574A (en) * | 2022-07-06 | 2022-09-16 | 陈伟 | Human-computer interaction system based on visual core algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN108734151B (en) | 2020-04-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108734151B (en) | Robust long-range target tracking method based on correlation filtering and deep Siamese network | |
CN110298404B (en) | Target tracking method based on triple twin Hash network learning | |
CN104574445B (en) | A kind of method for tracking target | |
CN110084836B (en) | Target tracking method based on deep convolution characteristic hierarchical response fusion | |
CN112184752A (en) | Video target tracking method based on pyramid convolution | |
CN110197502B (en) | Multi-target tracking method and system based on identity re-identification | |
CN111476817A (en) | Multi-target pedestrian detection tracking method based on yolov3 | |
CN109816689A (en) | A kind of motion target tracking method that multilayer convolution feature adaptively merges | |
CN108961308B (en) | Residual error depth characteristic target tracking method for drift detection | |
CN110210551A (en) | A kind of visual target tracking method based on adaptive main body sensitivity | |
CN109461172A (en) | Manually with the united correlation filtering video adaptive tracking method of depth characteristic | |
CN109035172B (en) | Non-local mean ultrasonic image denoising method based on deep learning | |
Wan et al. | Unmanned aerial vehicle video-based target tracking algorithm using sparse representation | |
CN111311647B (en) | Global-local and Kalman filtering-based target tracking method and device | |
CN106295564B (en) | A kind of action identification method of neighborhood Gaussian structures and video features fusion | |
CN108062531A (en) | A kind of video object detection method that convolutional neural networks are returned based on cascade | |
CN107103326A (en) | The collaboration conspicuousness detection method clustered based on super-pixel | |
CN107689052A (en) | Visual target tracking method based on multi-model fusion and structuring depth characteristic | |
CN111583300B (en) | Target tracking method based on enrichment target morphological change update template | |
CN109087337B (en) | Long-time target tracking method and system based on hierarchical convolution characteristics | |
CN104408760A (en) | Binocular-vision-based high-precision virtual assembling system algorithm | |
CN107622507B (en) | Air target tracking method based on deep learning | |
CN103985143A (en) | Discriminative online target tracking method based on videos in dictionary learning | |
CN111968155B (en) | Target tracking method based on segmented target mask updating template | |
Li et al. | Robust object tracking with discrete graph-based multiple experts |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |