CN113160247A - Anti-noise twin network target tracking method based on frequency separation - Google Patents

Anti-noise twin network target tracking method based on frequency separation

Info

Publication number
CN113160247A
Authority
CN
China
Prior art keywords
target
convolution
feature map
graph
map
Prior art date
Legal status
Granted
Application number
CN202110433521.6A
Other languages
Chinese (zh)
Other versions
CN113160247B (en)
Inventor
Chen Fei
Wang Zhiwei
Current Assignee
Fuzhou University
Original Assignee
Fuzhou University
Priority date
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202110433521.6A priority Critical patent/CN113160247B/en
Publication of CN113160247A publication Critical patent/CN113160247A/en
Application granted granted Critical
Publication of CN113160247B publication Critical patent/CN113160247B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an anti-noise twin network target tracking method based on frequency separation. First, a convolutional neural network is used to extract features from the tracking target and from the search-area image of each subsequent frame. An Octave convolution structure is then used to further generate high-dimensional feature maps for the search-area feature map and the template feature map; after the cross-correlation operation is completed, the corresponding cross-correlation response maps are fused to obtain a target position regression map, an object-aware classification result map is obtained from the target position regression information, a conventional classification map is obtained in the same way, the final classification result map is obtained, and the determination of the target position is completed. The invention uses the exchange of high-frequency and low-frequency information to strengthen the anti-noise capability of the network and introduces a new feature fusion method that aggregates local and global context information, solving the problem of poor tracking performance of existing target tracking methods in noisy environments.

Description

Anti-noise twin network target tracking method based on frequency separation
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to an anti-noise twin network target tracking method based on frequency separation.
Background
Target tracking has attracted wide attention because of its broad applications in autonomous driving, traffic flow monitoring, surveillance, robotics, human-machine interaction, medical diagnostic systems, and activity recognition. The task is to determine the position of an object in subsequent video frames given its initial position. In recent years, twin (Siamese) network trackers have drawn considerable attention for their balance of speed and accuracy. Pioneering work used a twin network to learn a similarity measure between the target object and candidate images, thereby modeling tracking as a search for the target over the entire image. A series of twin-network-based trackers subsequently achieved better performance; among them, trackers based on preselected anchor frames (anchors) gained a clear advantage in accuracy by introducing a region proposal network. For noisy image data, the low-frequency information carries a large amount of noise, and target tracking algorithms suffer from reduced accuracy and drift of the target frame. The reason is that, on the one hand, the added noise destabilizes target feature extraction and, on the other hand, the noise directly harms the accuracy of the subsequent target position regression and classification. With the development of deep learning, the field of image denoising has progressed rapidly, and existing denoising algorithms based on convolutional neural networks generally use a neural network to learn the mapping from a noisy image to a clean image. Directly applying such a convolutional-neural-network-based denoising algorithm to a noisy target tracking task, however, greatly increases the amount of computation. In addition, for most computer vision tasks it is very difficult to obtain enough paired clean and noisy images, so improving the noise immunity of the target tracking network itself becomes another way to solve the above problems.
Disclosure of Invention
In view of this, the present invention provides an anti-noise twin network target tracking method based on frequency separation. First, features are extracted from the tracking target and from the search-area map of each subsequent frame with a convolutional neural network. An Octave convolution structure is then used to further generate high-dimensional feature maps for the search-area feature map x and the template feature map z; after the cross-correlation operation is completed, the corresponding cross-correlation response maps are fused to obtain a target position regression map, an object-aware classification result map is obtained from the target position regression information, a conventional classification map is obtained in the same way, the final classification result map is obtained, and the determination of the target position is completed. The invention uses the exchange of high-frequency and low-frequency information to strengthen the anti-noise capability of the network and introduces a new feature fusion method that aggregates local and global context information, solving the problem of poor tracking performance of existing target tracking methods in noisy environments.
The invention specifically adopts the following technical scheme:
An anti-noise twin network target tracking method based on frequency separation is characterized in that: first, a convolutional neural network is used to extract features from the tracking target and from the search-area image of each subsequent frame; then an Octave convolution structure is used to further generate high-dimensional feature maps for the search-area feature map x and the template feature map z; after the cross-correlation operation is completed, the cross-correlation response maps are fused to obtain a target position regression map, an object-aware classification result map is obtained from the target position regression information, a conventional classification map is obtained in the same way, the final classification result map is obtained, and the determination of the target position is completed.
Further, the method comprises the following steps:
Step 1: inputting the initial-frame target into the base convolutional neural network to extract features, and obtaining and storing the template feature map;
Step 2: cropping a search-area map in the subsequent frame according to the target position of the previous frame, and inputting it into the base convolutional neural network for feature extraction;
Step 3: performing the cross-correlation operation on the template feature map and the search-area feature map to generate a regression cross-correlation response map and a classification cross-correlation response map;
Step 4: performing a convolution operation on the regression cross-correlation response map to generate a target position regression result map;
Step 5: performing a convolution operation on the classification cross-correlation response map to generate a conventional classification result map;
Step 6: generating a symmetric classification result map from the target position regression result map;
Step 7: adding the conventional classification result map and the symmetric classification result map to obtain the final classification result map, and selecting the position regression values in the target position regression result map at the position of the maximum classification value to determine the target position.
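The following is a minimal sketch of step 7, assuming PyTorch tensors and the 25 × 25 map size used in the detailed embodiment below; the function name, shapes, and return convention are illustrative assumptions and not part of the claimed method.

    import torch

    def select_target(class_regular, class_symmetric, reg):
        # class_regular, class_symmetric: [25, 25] foreground scores;
        # reg: [25, 25, 4] per-position regression values (distances to the borders).
        cls = class_regular + class_symmetric      # final classification result map
        idx = int(torch.argmax(cls))               # flat index of the maximum classification value
        row, col = divmod(idx, cls.size(1))        # convert the flat index to a 2-D position
        return (row, col), reg[row, col]           # selected position and its regression values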
Also provided is an anti-noise twin network target tracking method based on frequency separation, characterized by comprising the following steps:
Step S1: an object to be tracked is specified in the first frame of the video image, and the specified target is cropped in the current frame to generate the target template picture; features are extracted from the target template picture with the base convolutional neural network model to obtain the template feature map z;
Step S2: the target search-area picture of the subsequent frame is cropped and its features are extracted with the base convolutional neural network model to obtain the search-area feature map x of the subsequent frame; further feature extraction is then performed on the search-area feature map x and the template feature map z with three independent Octave convolution structures to obtain the feature maps (x11, x12, x13) and (z11, z12, z13), where identical subscripts indicate that the maps were generated with the same Octave convolution structure;
Step S3: using z11 as the convolution kernel, a convolution operation is performed on x11 to obtain the cross-correlation response map R1; using z12 as the convolution kernel, a convolution operation is performed on x12 to obtain the cross-correlation response map R2; using z13 as the convolution kernel, a convolution operation is performed on x13 to obtain the cross-correlation response map R3;
Step S4: the feature fusion operation is performed on the cross-correlation response maps R1 and R2 to obtain the feature fusion result map R4; R3 and R4 are then fused to obtain the feature map R'; convolution operations with five convolution kernels are applied to the feature map R' to obtain the target tracking position regression map Reg with final output size [25 × 25 × 4]; Reg represents the straight-line distances from each pixel point in the search area to the borders of the predicted target;
Step S5: for the template feature map z and the search-area feature map x, further feature extraction is performed on the search-area feature map x and the template feature map z with three independent Octave convolution structures whose parameters differ from those used in step S2, obtaining the feature maps {x21, x22, x23} and {z21, z22, z23}, where identical subscripts indicate generation by the same Octave convolution structure;
Step S6: using z21 as the convolution kernel, a convolution operation is performed on x21 to obtain the cross-correlation response map C1; using z22 as the convolution kernel, a convolution operation is performed on x22 to obtain the cross-correlation response map C2; using z23 as the convolution kernel, a convolution operation is performed on x23 to obtain the cross-correlation response map C3;
Step S7: the fixed sampling positions of the convolution kernel are aligned to the predicted regression box; each position α = (dx, dy) on the classification map has a corresponding regression prediction frame (x1, x2, y1, y2) in the target tracking position regression map Reg, where (x1, x2, y1, y2) represents the distances from the position to the target frame; using (x1, x2, y1, y2), the candidate box M = (mx, my, mw, mh) is obtained, where (mx, my) represents the coordinates of the target center point and (mw, mh) represents the width and height of the candidate box; features are further sampled from the candidate box M to obtain the classification score of the predicted position α = (dx, dy), and the object-aware classification result map Class1 is obtained in this way;
Step S8: the feature fusion operation is performed on the cross-correlation response maps C1 and C2 to obtain the feature fusion result map C4; C3 and C4 are then fused to obtain the feature map C'1; convolution operations with five convolution kernels are applied to the feature map C'1 to obtain the conventional classification map Class2 with final output size [25 × 25 × 1]; soft selection is performed on Class1 and Class2 with the parameter ratio, using the selection equation Class = ratio × Class1 + (1 - ratio) × Class2, to obtain the final comprehensive target classification map Class, where any point α ∈ Class satisfies 0 ≤ α ≤ 1 and represents the probability that α is the target foreground;
Step S9: the position with the maximum target foreground probability value is selected in the target tracking foreground-background classification map Class, and the corresponding position in the target tracking position regression map Reg is determined to obtain the corresponding target frame information (x1, x2, y1, y2), where (x1, x2, y1, y2) represents the distances from the position to the target frame.
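The cross-correlation in steps S3 and S6 uses the template feature map as a convolution kernel that slides over the search-area feature map. The sketch below shows one depthwise (per-channel) form of this operation; the PyTorch framework, channel count, and map sizes are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def cross_correlation(x, z):
        # x: search-area features [1, C, Hx, Wx]; z: template features [1, C, Hz, Wz].
        # Each template channel is used as the kernel for the matching channel of x.
        c = x.size(1)
        kernel = z.view(c, 1, z.size(2), z.size(3))
        return F.conv2d(x, kernel, groups=c)       # response map [1, C, Hx-Hz+1, Wx-Wz+1]

    # Example: a 31 x 31 search-area map correlated with a 7 x 7 template gives a 25 x 25 response.
    x11 = torch.randn(1, 256, 31, 31)
    z11 = torch.randn(1, 256, 7, 7)
    R1 = cross_correlation(x11, z11)               # [1, 256, 25, 25]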
Further, step S2 specifically includes the following steps:
Step S21: the high-low frequency division operation is performed on the search-area feature map x; taking x as the input feature map, a preliminary low-frequency feature map Xlow1 with the length and width reduced by half is first generated with an average pooling operation of size 2 × 2, and a conventional convolution operation is then applied to Xlow1 to generate the low-frequency feature map Xl1 with half the number of channels; a convolution operation is applied to x to generate the high-frequency feature map Xh1 with half the number of channels and unchanged length and width;
Step S22: for the high-frequency feature map Xh1, an average pooling operation of size 2 × 2 is performed first and a convolution operation then generates the low-frequency feature map Xl2; for the low-frequency feature map Xl1, a convolution operation generates the low-frequency feature map Xl3 of unchanged size; Xl2 and Xl3 are added to generate the low-frequency feature map Xl4; for the high-frequency feature map Xh1, a convolution operation generates the high-frequency feature map Xh2 of unchanged size; for the low-frequency feature map Xl1, a convolution operation is performed first and an upsampling operation with an upsampling rate of 2 then generates the high-frequency feature map Xh3; Xh2 and Xh3 are added to generate the high-frequency feature map Xh4;
Step S23: for the high-frequency feature map Xh4, a convolution operation generates the feature map Xh5 whose number of output channels equals the number of channels of the input feature map; for the low-frequency feature map Xl4, a convolution operation generates the feature map Xl5 whose number of output channels equals the number of channels of the input feature map, and an upsampling operation with an upsampling rate of 2 then generates the high-frequency feature map Xh6; Xh5 and Xh6 are added to generate the Octave convolution structure result x11;
Step S24: steps S21 to S23 are repeated to generate the Octave convolution results x12 and x13, respectively;
Step S25: following steps S21 to S24, the feature maps (z11, z12, z13) are generated with the template feature map z as the input feature map.
The specific operation for step S5 is similar to step S2.
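A minimal sketch of the Octave-style structure of steps S21 to S23 follows, assuming a PyTorch implementation. The kernel sizes, padding, and channel bookkeeping are illustrative choices; only the high-/low-frequency split, the exchange between the two branches, and the merge back to the input channel count follow the steps above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class OctaveBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            half = channels // 2
            # S21: split the input into a high-frequency and a low-frequency branch
            self.to_h = nn.Conv2d(channels, half, 3, padding=1)
            self.to_l = nn.Conv2d(channels, half, 3, padding=1)
            # S22: exchange information between the two branches
            self.h2l = nn.Conv2d(half, half, 3, padding=1)
            self.l2l = nn.Conv2d(half, half, 3, padding=1)
            self.h2h = nn.Conv2d(half, half, 3, padding=1)
            self.l2h = nn.Conv2d(half, half, 3, padding=1)
            # S23: project both branches back to the input channel count
            self.out_h = nn.Conv2d(half, channels, 3, padding=1)
            self.out_l = nn.Conv2d(half, channels, 3, padding=1)

        def forward(self, x):
            # S21: low-frequency map at half resolution, high-frequency map at full resolution
            x_l1 = self.to_l(F.avg_pool2d(x, 2))
            x_h1 = self.to_h(x)
            # S22: high-to-low, low-to-low, high-to-high and low-to-high paths, added per branch
            x_l4 = self.h2l(F.avg_pool2d(x_h1, 2)) + self.l2l(x_l1)
            x_h4 = self.h2h(x_h1) + F.interpolate(self.l2h(x_l1), scale_factor=2)
            # S23: restore the input channel count on both branches and fuse them
            return self.out_h(x_h4) + F.interpolate(self.out_l(x_l4), scale_factor=2)

    # Example: an input of shape [1, 256, 30, 30] yields x11 of the same shape.
    x11 = OctaveBlock(256)(torch.randn(1, 256, 30, 30))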
Further, step S4 specifically includes the following steps:
Step S41: the cross-correlation response maps R1 and R2 are added and the resulting feature map is denoted X; the local context weight map L(X) = f(δ(f(X))) and the global context weight map G(X) = f(δ(f(GPooling(X)))) of X are computed, where δ is the ReLU activation function, f denotes a point-wise convolution, and GPooling denotes the global average pooling operation; the attention weight map A(X) = L(X) + G(X) is obtained;
Step S42: the cross-correlation response maps R1 and R2 are fused, the fusion result being R4 = R1 * A(X) + R2 * (1 - A(X));
Step S43: with reference to step S41, the cross-correlation response maps R3 and R4 are fused to obtain R'; convolution operations with five convolution kernels are applied to the feature map R' to obtain the target tracking position regression map Reg with output size [25 × 25 × 4].
Further, in step S1, the base convolutional neural network model is obtained by training a convolutional neural network on a picture data set of the same type as the images to be tracked.
Compared with the prior art, the invention and the optimized scheme thereof have the following beneficial effects:
1) A twin Octave convolution feature representation method is introduced, which suppresses the redundant part of the low-frequency information while retaining the high-frequency information, and the exchange between high-frequency and low-frequency information gives the features extracted by the model stronger noise immunity; at the same time, the template feature map and the search-area feature map are processed with the same frequency-division structure, which further improves the self-similarity of the features.
2) A fusion method that aggregates global and local contexts is introduced; it generates fusion weights of the same size as the cross-correlation response maps for point-wise multiplication, so that a dynamic soft selection is performed at the element level and the model gains stronger adaptive capability.
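The context-aggregating fusion of steps S41 and S42 can be sketched as follows, again assuming PyTorch. The bottleneck width of the point-wise convolutions and the final sigmoid that keeps the blend weights in [0, 1] are assumptions; the description itself only states A(X) = L(X) + G(X).

    import torch
    import torch.nn as nn

    class ContextFusion(nn.Module):
        def __init__(self, channels, reduction=4):
            super().__init__()
            mid = max(channels // reduction, 1)
            # f(.) in step S41: point-wise (1 x 1) convolutions; delta: ReLU
            self.local_ctx = nn.Sequential(
                nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True),
                nn.Conv2d(mid, channels, 1))
            self.global_ctx = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),               # GPooling: global average pooling
                nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True),
                nn.Conv2d(mid, channels, 1))

        def forward(self, r1, r2):
            x = r1 + r2                                 # S41: element-wise addition
            a = self.local_ctx(x) + self.global_ctx(x)  # A(X) = L(X) + G(X)
            a = torch.sigmoid(a)                        # assumption: squash the weights to [0, 1]
            return r1 * a + r2 * (1.0 - a)              # S42: R4 = R1 * A(X) + R2 * (1 - A(X))

    # Example: fusing two [1, 256, 25, 25] cross-correlation response maps.
    R4 = ContextFusion(256)(torch.randn(1, 256, 25, 25), torch.randn(1, 256, 25, 25))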
Drawings
The invention is described in further detail below with reference to the following figures and detailed description:
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a schematic illustration of a parathyroid target in accordance with an embodiment of the present invention;
FIG. 3 is a graph showing the effect of tracking parathyroid gland in accordance with an embodiment of the present invention.
Detailed Description
In order to make the features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail as follows:
As shown in FIG. 1 to FIG. 3, the present embodiment provides an anti-noise twin network target tracking method based on frequency separation, which is implemented by the following steps:
Step S1: an object to be tracked is specified in the current frame of the video image, and the specified target area is cropped in the current frame to generate the target template picture. Features are extracted from the target template picture with the base convolutional neural network model to obtain the template feature map z;
Step S2: the target search-area picture of the subsequent frame is cropped and its features are extracted with the base convolutional neural network model to obtain the search-area feature map x of the subsequent frame; further feature extraction is then performed on the search-area feature map x and the template feature map z with three independent Octave convolution structures to obtain the feature maps (x11, x12, x13) and (z11, z12, z13), where identical subscripts indicate that the maps were generated with the same Octave convolution structure;
Step S3: using z11 as the convolution kernel, a convolution operation is performed on x11 to obtain the cross-correlation response map R1; using z12 as the convolution kernel, a convolution operation is performed on x12 to obtain the cross-correlation response map R2; using z13 as the convolution kernel, a convolution operation is performed on x13 to obtain the cross-correlation response map R3;
Step S4: the feature fusion operation is performed on the cross-correlation response maps R1 and R2 to obtain the feature fusion result map R4; R3 and R4 are then fused to obtain the feature map R'; convolution operations with five convolution kernels are applied to the feature map R' to obtain the target tracking position regression map Reg with final output size [25 × 25 × 4]. Reg represents the straight-line distances from each pixel point in the search area to the borders of the predicted target.
Step S5: for the template feature map z and the search-area feature map x, further feature extraction is performed with three independent Octave convolution structures to obtain the feature maps {x21, x22, x23} and {z21, z22, z23}, where identical subscripts indicate generation by the same Octave convolution structure; it should be noted that the three independent Octave convolution structures used in this step have the same structure as those used in step S2 but do not share their parameters;
Step S6: using z21 as the convolution kernel, a convolution operation is performed on x21 to obtain the cross-correlation response map C1; using z22 as the convolution kernel, a convolution operation is performed on x22 to obtain the cross-correlation response map C2; using z23 as the convolution kernel, a convolution operation is performed on x23 to obtain the cross-correlation response map C3;
Step S7: the fixed sampling positions of the convolution kernel are aligned to the predicted regression box; each position α = (dx, dy) on the classification map has a corresponding regression prediction frame (x1, x2, y1, y2) in the target tracking position regression map Reg, where (x1, x2, y1, y2) represents the distances from the position to the target frame. Using (x1, x2, y1, y2), the candidate box M = (mx, my, mw, mh) is obtained, where (mx, my) represents the coordinates of the target center point and (mw, mh) represents the width and height of the candidate box; features are further sampled from the candidate box M to obtain the classification score of the predicted position α = (dx, dy), and the target symmetric classification result map Class1 is obtained in this way;
Step S8: the feature fusion operation is performed on the cross-correlation response maps C1 and C2 to obtain the feature fusion result map C4; C3 and C4 are then fused to obtain the feature map C'1; convolution operations with five convolution kernels are applied to the feature map C'1 to obtain the conventional classification map Class2 with final output size [25 × 25 × 1]; soft selection is performed on Class1 and Class2 with the parameter ratio, using the selection equation Class = ratio × Class1 + (1 - ratio) × Class2, to obtain the final comprehensive target classification map Class, where any point α ∈ Class satisfies 0 ≤ α ≤ 1 and represents the probability that α is the target foreground;
Step S9: the position with the maximum target foreground probability value is selected in the target tracking foreground-background classification map Class, and the corresponding position in the target tracking position regression map Reg is determined to obtain the corresponding target frame information (x1, x2, y1, y2), where (x1, x2, y1, y2) represents the distances from the position to the target frame.
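The conversion from the regression values to the candidate box M in step S7 can be sketched as below. The sign convention (x1 and x2 as distances to the left and right borders, y1 and y2 to the top and bottom borders) is an assumption; the description only states that (x1, x2, y1, y2) are distances from the position α = (dx, dy) to the target frame.

    def decode_candidate_box(dx, dy, x1, x2, y1, y2):
        # Assumed convention: x1 = distance to the left border, x2 = to the right border,
        # y1 = to the top border, y2 = to the bottom border, measured from position (dx, dy).
        mw = x1 + x2                    # candidate box width
        mh = y1 + y2                    # candidate box height
        mx = dx + (x2 - x1) / 2.0       # candidate box center, x
        my = dy + (y2 - y1) / 2.0       # candidate box center, y
        return mx, my, mw, mh           # M = (mx, my, mw, mh)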
Specifically, in this embodiment, step S2 specifically includes the following steps:
Step S21: the high-low frequency division operation is performed on the search-area feature map. Taking the search-area feature map as the input feature map, a preliminary low-frequency feature map with the length and width reduced by half is first generated with an average pooling operation of size 2 × 2, and a conventional convolution operation then generates a low-frequency feature map with half the number of channels; a convolution operation generates a high-frequency feature map with half the number of channels and unchanged length and width;
Step S22: the high-frequency feature map is subjected to an average pooling operation of size 2 × 2 and a convolution operation then generates a low-frequency feature map; for the existing low-frequency feature map, a convolution operation generates a low-frequency feature map of unchanged size; the two are added to generate the fused low-frequency feature map; for the high-frequency feature map, a convolution operation generates a high-frequency feature map of unchanged size; for the existing low-frequency feature map, a convolution operation is performed and an upsampling operation with an upsampling rate of 2 then generates a high-frequency feature map; the two are added to generate the fused high-frequency feature map;
Step S23: for the fused high-frequency feature map, a convolution operation generates a feature map whose number of output channels equals the number of channels of the input feature map; for the fused low-frequency feature map, a convolution operation generates a feature map whose number of output channels equals the number of channels of the input feature map, and an upsampling operation with an upsampling rate of 2 then generates a high-frequency feature map; the two are added to generate the Octave convolution structure result;
Step S24: steps S21 to S23 are repeated to generate the second and third Octave convolution results;
Step S25: similarly, steps S21 to S24 are repeated with the template feature map as the input feature map to generate the corresponding feature maps.
In the present embodiment, the specific operation of step S5 is similar to step S2.
Specifically, in this embodiment, step S4 specifically includes the following steps:
Step S41: the cross-correlation response maps R1 and R2 are added and the resulting feature map is denoted X; the local context weight map L(X) = f(δ(f(X))) and the global context weight map G(X) = f(δ(f(GPooling(X)))) of X are computed, where δ is the ReLU activation function, f denotes a point-wise convolution, and GPooling denotes the global average pooling operation; the attention weight map A(X) = L(X) + G(X) is obtained;
Step S42: the cross-correlation response maps R1 and R2 are fused, the fusion result being R4 = R1 * A(X) + R2 * (1 - A(X));
Step S43: similar to step S41, the cross-correlation response maps R3 and R4 are fused to obtain R'; convolution operations with five convolution kernels are applied to the feature map R' to obtain the target tracking position regression map Reg with output size [25 × 25 × 4].
the following shows a specific embodiment of the present invention.
The specific steps of the algorithm provided by the invention for tracking the parathyroid target are as follows:
1. Establishing a prior parathyroid recognition data set {q1, q2, …, qN}, and cropping each data-set picture into a target picture of size 255 × 255 and a search-area picture of size 511 × 511;
2. Transmitting the pair of data pictures obtained in the previous step into the network model for forward propagation, and outputting the frame regression result and the classification result;
3. Calculating the loss functions, where the regression branch loss function is L_reg = -Σ_i ln(IoU(preg, true)), the conventional classification branch loss function is L_class1 = -Σ [p1 log(p1) + (1 - p1) log(1 - p1)], and the symmetric classification branch loss function is L_class2 = -Σ [p2 log(p2) + (1 - p2) log(1 - p2)] (a code sketch of these losses is given after step 10);
4. Carrying out back-propagation with the SGD method and updating the network model parameters;
5. Repeating steps 2 to 4 several times to train the network model, and obtaining the network parameters after training is finished;
6. Inputting the initial-frame target into the base network to extract features, and storing the template feature map;
7. Cropping a search-area image in the subsequent frame according to the target position of the previous frame (for example, the second frame uses the object position of the first frame, and the third frame uses the predicted object position of the second frame), and inputting the search-area image into the base network for feature extraction;
8. Performing the cross-correlation operation on the template feature map and the search-area feature map to generate a regression cross-correlation response map and a classification cross-correlation response map;
9. Performing a convolution operation on the regression cross-correlation response map to generate a target position regression result map, performing a convolution operation on the classification cross-correlation response map to generate a conventional classification result map, and generating the symmetric classification result map from the target position regression result map;
10. Adding the conventional classification result map and the symmetric classification result map to obtain the final classification result map, and selecting the position regression values in the target position regression result map at the position of the maximum classification value to determine the target position.
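The following is a minimal sketch of the training losses of step 3 above, assuming PyTorch. The IoU is computed from per-position border distances; the use of ground-truth foreground labels in the classification terms and the unweighted sum of the three losses are assumptions on top of the translated formulas.

    import torch
    import torch.nn.functional as F

    def iou_from_distances(pred, gt):
        # pred, gt: [N, 4] (left, right, top, bottom) distances measured from the same positions.
        pw, ph = pred[:, 0] + pred[:, 1], pred[:, 2] + pred[:, 3]
        gw, gh = gt[:, 0] + gt[:, 1], gt[:, 2] + gt[:, 3]
        iw = torch.min(pred[:, 0], gt[:, 0]) + torch.min(pred[:, 1], gt[:, 1])
        ih = torch.min(pred[:, 2], gt[:, 2]) + torch.min(pred[:, 3], gt[:, 3])
        inter = iw.clamp(min=0) * ih.clamp(min=0)
        union = pw * ph + gw * gh - inter
        return inter / union.clamp(min=1e-6)

    def tracking_loss(reg_pred, reg_gt, cls1, cls2, labels):
        # reg_pred, reg_gt: [N, 4] distances at foreground positions;
        # cls1, cls2, labels: foreground probabilities / labels, all in [0, 1].
        l_reg = -torch.log(iou_from_distances(reg_pred, reg_gt).clamp(min=1e-6)).sum()  # L_reg
        l_cls1 = F.binary_cross_entropy(cls1, labels, reduction='sum')                  # L_class1
        l_cls2 = F.binary_cross_entropy(cls2, labels, reduction='sum')                  # L_class2
        return l_reg + l_cls1 + l_cls2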
FIG. 3 illustrates the tracking effect of the above target tracking algorithm on an example; the box in FIG. 3 shows the target localization result of the algorithm. This embodiment provides a twin Octave convolution feature representation method that exchanges information between the high-frequency and low-frequency components of the image, retaining the high-frequency information while suppressing the noise carried by the low-frequency component, and thereby strengthening the anti-noise capability of the network. It further combines a feature fusion method that aggregates global and local contexts, so that a dynamic soft selection is performed at the element level and the model gains stronger adaptive capability.
The present invention is not limited to the above preferred embodiments, and other various anti-noise twin network target tracking methods based on frequency separation can be derived by anyone based on the teaching of the present invention, and all equivalent changes and modifications made according to the claims of the present invention shall fall within the scope of the present invention.

Claims (6)

1. An anti-noise twin network target tracking method based on frequency separation, characterized in that: first, a convolutional neural network is used to extract features from the tracking target and from the search-area image of each subsequent frame; then an Octave convolution structure is used to further generate high-dimensional feature maps for the search-area feature map x and the template feature map z; after the cross-correlation operation is completed, the cross-correlation response maps are fused to obtain a target position regression map, an object-aware classification result map is obtained from the target position regression information, a conventional classification map is obtained in the same way, the final classification result map is obtained, and the determination of the target position is completed.
2. The anti-noise twin network target tracking method based on frequency separation according to claim 1, characterized by comprising the steps of:
Step 1: inputting the initial-frame target into the base convolutional neural network to extract features, and obtaining and storing the template feature map;
Step 2: cropping a search-area map in the subsequent frame according to the target position of the previous frame, and inputting it into the base convolutional neural network for feature extraction;
Step 3: performing the cross-correlation operation on the template feature map and the search-area feature map to generate a regression cross-correlation response map and a classification cross-correlation response map;
Step 4: performing a convolution operation on the regression cross-correlation response map to generate a target position regression result map;
Step 5: performing a convolution operation on the classification cross-correlation response map to generate a conventional classification result map;
Step 6: generating a symmetric classification result map from the target position regression result map;
Step 7: adding the conventional classification result map and the symmetric classification result map to obtain the final classification result map, and selecting the position regression values in the target position regression result map at the position of the maximum classification value to determine the target position.
3. An anti-noise twin network target tracking method based on frequency separation is characterized by comprising the following steps:
Step S1: an object to be tracked is specified in the first frame of the video image, and the specified target is cropped in the current frame to generate the target template picture; features are extracted from the target template picture with the base convolutional neural network model to obtain the template feature map z;
Step S2: the target search-area picture of the subsequent frame is cropped and its features are extracted with the base convolutional neural network model to obtain the search-area feature map x of the subsequent frame; further feature extraction is then performed on the search-area feature map x and the template feature map z with three independent Octave convolution structures to obtain the feature maps (x11, x12, x13) and (z11, z12, z13), where identical subscripts indicate that the maps were generated with the same Octave convolution structure;
Step S3: using z11 as the convolution kernel, a convolution operation is performed on x11 to obtain the cross-correlation response map R1; using z12 as the convolution kernel, a convolution operation is performed on x12 to obtain the cross-correlation response map R2; using z13 as the convolution kernel, a convolution operation is performed on x13 to obtain the cross-correlation response map R3;
Step S4: the feature fusion operation is performed on the cross-correlation response maps R1 and R2 to obtain the feature fusion result map R4; R3 and R4 are then fused to obtain the feature map R'; convolution operations with five convolution kernels are applied to the feature map R' to obtain the target tracking position regression map Reg with final output size [25 × 25 × 4]; Reg represents the straight-line distances from each pixel point in the search area to the borders of the predicted target;
Step S5: for the template feature map z and the search-area feature map x, further feature extraction is performed on the search-area feature map x and the template feature map z with three independent Octave convolution structures whose parameters differ from those used in step S2, obtaining the feature maps {x21, x22, x23} and {z21, z22, z23}, where identical subscripts indicate generation by the same Octave convolution structure;
Step S6: using z21 as the convolution kernel, a convolution operation is performed on x21 to obtain the cross-correlation response map C1; using z22 as the convolution kernel, a convolution operation is performed on x22 to obtain the cross-correlation response map C2; using z23 as the convolution kernel, a convolution operation is performed on x23 to obtain the cross-correlation response map C3;
Step S7: the fixed sampling positions of the convolution kernel are aligned to the predicted regression box; each position α = (dx, dy) on the classification map has a corresponding regression prediction frame (x1, x2, y1, y2) in the target tracking position regression map Reg, where (x1, x2, y1, y2) represents the distances from the position to the target frame; using (x1, x2, y1, y2), the candidate box M = (mx, my, mw, mh) is obtained, where (mx, my) represents the coordinates of the target center point and (mw, mh) represents the width and height of the candidate box; features are further sampled from the candidate box M to obtain the classification score of the predicted position α = (dx, dy), and the object-aware classification result map Class1 is obtained in this way;
Step S8: the feature fusion operation is performed on the cross-correlation response maps C1 and C2 to obtain the feature fusion result map C4; C3 and C4 are then fused to obtain the feature map C'1; convolution operations with five convolution kernels are applied to the feature map C'1 to obtain the conventional classification map Class2 with final output size [25 × 25 × 1]; soft selection is performed on Class1 and Class2 with the parameter ratio, using the selection equation Class = ratio × Class1 + (1 - ratio) × Class2, to obtain the final comprehensive target classification map Class, where any point α ∈ Class satisfies 0 ≤ α ≤ 1 and represents the probability that α is the target foreground;
Step S9: the position with the maximum target foreground probability value is selected in the target tracking foreground-background classification map Class, and the corresponding position in the target tracking position regression map Reg is determined to obtain the corresponding target frame information (x1, x2, y1, y2), where (x1, x2, y1, y2) represents the distances from the position to the target frame.
4. The anti-noise twin network target tracking method based on frequency separation according to claim 3, wherein the step S2 specifically comprises the following steps:
Step S21: the high-low frequency division operation is performed on the search-area feature map x; taking x as the input feature map, a preliminary low-frequency feature map Xlow1 with the length and width reduced by half is first generated with an average pooling operation of size 2 × 2, and a conventional convolution operation is then applied to Xlow1 to generate the low-frequency feature map Xl1 with half the number of channels; a convolution operation is applied to x to generate the high-frequency feature map Xh1 with half the number of channels and unchanged length and width;
Step S22: for the high-frequency feature map Xh1, an average pooling operation of size 2 × 2 is performed first and a convolution operation then generates the low-frequency feature map Xl2; for the low-frequency feature map Xl1, a convolution operation generates the low-frequency feature map Xl3 of unchanged size; Xl2 and Xl3 are added to generate the low-frequency feature map Xl4; for the high-frequency feature map Xh1, a convolution operation generates the high-frequency feature map Xh2 of unchanged size; for the low-frequency feature map Xl1, a convolution operation is performed first and an upsampling operation with an upsampling rate of 2 then generates the high-frequency feature map Xh3; Xh2 and Xh3 are added to generate the high-frequency feature map Xh4;
Step S23: for the high-frequency feature map Xh4, a convolution operation generates the feature map Xh5 whose number of output channels equals the number of channels of the input feature map; for the low-frequency feature map Xl4, a convolution operation generates the feature map Xl5 whose number of output channels equals the number of channels of the input feature map, and an upsampling operation with an upsampling rate of 2 then generates the high-frequency feature map Xh6; Xh5 and Xh6 are added to generate the Octave convolution structure result x11;
Step S24: steps S21 to S23 are repeated to generate the Octave convolution results x12 and x13, respectively;
Step S25: following steps S21 to S24, the feature maps (z11, z12, z13) are generated with the template feature map z as the input feature map.
5. The anti-noise twin network target tracking method based on frequency separation according to claim 3, wherein the step S4 specifically comprises the following steps:
Step S41: the cross-correlation response maps R1 and R2 are added and the resulting feature map is denoted X; the local context weight map L(X) = f(δ(f(X))) and the global context weight map G(X) = f(δ(f(GPooling(X)))) of X are computed, where δ is the ReLU activation function, f denotes a point-wise convolution, and GPooling denotes the global average pooling operation; the attention weight map A(X) = L(X) + G(X) is obtained;
Step S42: the cross-correlation response maps R1 and R2 are fused, the fusion result being R4 = R1 * A(X) + R2 * (1 - A(X));
Step S43: with reference to step S41, the cross-correlation response maps R3 and R4 are fused to obtain R'; convolution operations with five convolution kernels are applied to the feature map R' to obtain the target tracking position regression map Reg with output size [25 × 25 × 4].
6. The anti-noise twin network target tracking method based on frequency separation according to claim 3, characterized in that: in step S1, the base convolutional neural network model is obtained by training a convolutional neural network on a picture data set of the same type as the images to be tracked.
CN202110433521.6A 2021-04-22 2021-04-22 Anti-noise twin network target tracking method based on frequency separation Active CN113160247B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110433521.6A CN113160247B (en) 2021-04-22 2021-04-22 Anti-noise twin network target tracking method based on frequency separation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110433521.6A CN113160247B (en) 2021-04-22 2021-04-22 Anti-noise twin network target tracking method based on frequency separation

Publications (2)

Publication Number Publication Date
CN113160247A true CN113160247A (en) 2021-07-23
CN113160247B CN113160247B (en) 2022-07-05

Family

ID=76868549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110433521.6A Active CN113160247B (en) 2021-04-22 2021-04-22 Anti-noise twin network target tracking method based on frequency separation

Country Status (1)

Country Link
CN (1) CN113160247B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019091464A1 (en) * 2017-11-12 2019-05-16 北京市商汤科技开发有限公司 Target detection method and apparatus, training method, electronic device and medium
CN111191555A (en) * 2019-12-24 2020-05-22 重庆邮电大学 Target tracking method, medium and system combining high-low spatial frequency characteristics
CN111462175A (en) * 2020-03-11 2020-07-28 华南理工大学 Space-time convolution twin matching network target tracking method, device, medium and equipment
CN112069896A (en) * 2020-08-04 2020-12-11 河南科技大学 Video target tracking method based on twin network fusion multi-template features

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019091464A1 (en) * 2017-11-12 2019-05-16 北京市商汤科技开发有限公司 Target detection method and apparatus, training method, electronic device and medium
CN111191555A (en) * 2019-12-24 2020-05-22 重庆邮电大学 Target tracking method, medium and system combining high-low spatial frequency characteristics
CN111462175A (en) * 2020-03-11 2020-07-28 华南理工大学 Space-time convolution twin matching network target tracking method, device, medium and equipment
CN112069896A (en) * 2020-08-04 2020-12-11 河南科技大学 Video target tracking method based on twin network fusion multi-template features

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
C. Zhang et al.: "Deeper Siamese Network With Stronger Feature Representation for Visual Tracking", IEEE Access *
Tan Jianhao et al.: "DenseNet Siamese Network Target Tracking with a Global Context Feature Module", Journal of Electronics & Information Technology *

Also Published As

Publication number Publication date
CN113160247B (en) 2022-07-05

Similar Documents

Publication Publication Date Title
CN110210551B (en) Visual target tracking method based on adaptive subject sensitivity
CN108647585B (en) Traffic identifier detection method based on multi-scale circulation attention network
CN113033570B (en) Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion
CN110060286B (en) Monocular depth estimation method
CN110443849B (en) Target positioning method for double-current convolution neural network regression learning based on depth image
CN113706581B (en) Target tracking method based on residual channel attention and multi-level classification regression
CN111401436A (en) Streetscape image segmentation method fusing network and two-channel attention mechanism
CN112507920B (en) Examination abnormal behavior identification method based on time displacement and attention mechanism
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
CN117252904B (en) Target tracking method and system based on long-range space perception and channel enhancement
CN116185179A (en) Panoramic view visual saliency prediction method and system based on crowdsourcing eye movement data
CN116542991A (en) Network architecture for fracture image segmentation, training method and segmentation method thereof
Ukwuoma et al. Image inpainting and classification agent training based on reinforcement learning and generative models with attention mechanism
CN114358246A (en) Graph convolution neural network module of attention mechanism of three-dimensional point cloud scene
WO2022141718A1 (en) Method and system for assisting point cloud-based object detection
CN114119669A (en) Image matching target tracking method and system based on Shuffle attention
CN113313162A (en) Method and system for detecting multi-scale feature fusion target
CN113160247B (en) Anti-noise twin network target tracking method based on frequency separation
CN112085164A (en) Area recommendation network extraction method based on anchor-frame-free network
CN113379787B (en) Target tracking method based on 3D convolution twin neural network and template updating
CN113592021B (en) Stereo matching method based on deformable and depth separable convolution
Gkillas et al. Federated learning for lidar super resolution on automotive scenes
CN114708423A (en) Underwater target detection method based on improved Faster RCNN
CN109146886B (en) RGBD image semantic segmentation optimization method based on depth density
CN113112522A (en) Twin network target tracking method based on deformable convolution and template updating

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant