CN111191555B - Target tracking method, medium and system combining high-low spatial frequency characteristics - Google Patents

Target tracking method, medium and system combining high-low spatial frequency characteristics

Info

Publication number
CN111191555B
Authority
CN
China
Prior art keywords
characteristic
frequency
frame
convolution
low
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911349832.3A
Other languages
Chinese (zh)
Other versions
CN111191555A (en)
Inventor
李伟生
伍蔚帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN201911349832.3A priority Critical patent/CN111191555B/en
Publication of CN111191555A publication Critical patent/CN111191555A/en
Application granted granted Critical
Publication of CN111191555B publication Critical patent/CN111191555B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Abstract

The invention seeks to protect a target tracking method, medium and system combining high and low spatial frequency characteristics, which comprises the following steps: S1, inputting an original video image sequence, and extracting the high spatial frequency features and low spatial frequency features of the truth-value image and the search area image; S2, exchanging information between the extracted high-frequency and low-frequency features; S3, obtaining a correlation map of the target truth-value features and the search area features by correlation calculation; S4, obtaining a group of classification features and a group of regression features through the region proposal network, and locating the target by combining the classification and regression features. The method combines high-low frequency feature extraction with a twin-network target tracking method based on a region proposal network. The low spatial frequency features reduce redundant information in the image, lowering both the probability of tracking failure caused by redundant information generated while the target moves and the drift of the tracking frame, and reducing the computational cost of the tracking method; fine detail features and global features that facilitate tracking are exchanged between the two frequency branches.

Description

Target tracking method, medium and system combining high-low spatial frequency characteristics
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a target tracking method that extracts high and low spatial frequency features.
Background
Tracking is a fundamental task in the field of computer vision and has received a great deal of attention over the last decade. It aims to locate an object in the subsequent frames of a video sequence using only the position information of the object in the initial frame. Tracking has wide application in fields such as human-computer interaction, video tagging and automatic driving. Current trackers fall into two branches. One is the target tracking method based on correlation filters, such as COT and ECO. Correlation filtering methods can perform online training and offline weight updating, and their precision can be further improved once depth features are incorporated. At present, methods that combine deep learning with correlation filtering remain among the strongest performers on the OTB data set. The other type of tracker uses strong depth features to locate the target with high accuracy, such as MDNet, CFNet and SiamFC. Typically, the weights of the feature-extraction part of such trackers are not updated. Most recent methods that track with depth features aim to train robust feature network parameters offline and then acquire the target position through online tracking. However, most trackers use AlexNet as the feature extraction network, and because of its limited number of parameters the representation capability of such a shallow network is inferior to that of a deep network when representing target features. Therefore, many trackers do not take good advantage of deep networks for feature extraction.
The Siamese (twin) network method is currently an excellent family of methods that rely on depth features for tracking. However, it does not use a deeper network to extract more robust information for tracking. Therefore, an attempt is made to introduce ResNet as the feature extraction network in place of the AlexNet network in the tracking method, deepening the network to improve the richness of the extracted features.
However, for the tracking task the extracted features need to capture the detail information of the target's fast changes as well as the structural information of the target as a whole. ResNet is deep, the information it finally outputs is very rich, the extracted final features are not entirely what the tracking task requires, and a direct replacement easily leads to overfitting of the tracking model. Various trackers have shown that shallow information remains important for tracking. MDNet, for example, enhances its feature information by combining multi-layer features, which achieves a better effect. However, the ResNet network has many convolution layers, and how to select multi-layer features to enhance the feature information is a difficult problem. To solve this problem, the features of each frame of the tracking sequence are decomposed: a natural image can be divided into a high spatial frequency component, which represents rapidly changing detail features, and a low spatial frequency component, which describes the smooth overall structure of the image.
Therefore, we improve the convolution part of ResNet: first, the features extracted by convolution are divided into high-frequency sampled features and low-frequency sampled features; then, the high-frequency and low-frequency features output by each convolution layer exchange information. In this way the feature extraction network of the tracker is optimized. Although tracking methods are numerous, even those that use a deep convolutional network and obtain better results suffer from the fact that deep networks such as ResNet extract excessive detail features, which on the one hand increases the computational cost and on the other hand weakens the benefit of replacing a shallow convolutional network with a deep one, because the overall features are under-represented. Hence, optimizing the convolutional feature extraction by combining high-low frequency convolution with the twin-network tracking method, without changing the general structure and function of the overall feature network, has broad application prospects in the field of target tracking.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. A target tracking method, medium and system are provided that are based on a twin network which effectively uses deep convolutional network features, avoids the drop in tracking precision that occurs after features are extracted by a deeper convolutional network, and is combined with high-low spatial frequency convolution calculation. The technical scheme of the invention is as follows:
a target tracking method combining high and low spatial frequency characteristics comprises the following steps:
s1, firstly, inputting an original video image sequence, cutting out a target image truth value and a search area image sequence according to a target truth value diagram of a first frame provided by a data set, performing feature extraction on the image sequence by adopting high-low spatial frequency convolution to obtain a high-frequency feature component and a low-frequency feature component of each layer, and taking the extracted features as the features for judging a follow-up tracking target;
s2, secondly, exchanging information of the high-frequency characteristic component and the low-frequency characteristic component of each layer, and finally outputting a characteristic value of the tracking sequence after characteristic enhancement;
s3, obtaining a correlation diagram of the target truth value characteristic and the search area characteristic by utilizing correlation calculation according to the target image and the search image sequence characteristic extracted by the high and low spatial frequency characteristic network;
and S4, finally, reprocessing through the regional suggestion network to obtain classification features and regression features, and correcting the target position positioned by the classification features through regression feature parameters to obtain the target position to be tracked.
Further, the clipping of the input original video image sequence in step S1 is to clip the target position proposed by the first frame of the video sequence, and clip the image sequence of the search area in an area 2 times the target position predicted by the previous frame, so as to obtain the feature map of the target to be tracked and the feature map of the search area in the subsequent tracking process through the feature extraction network.
Further, the high and low spatial frequency feature extraction in step S1 is a method for processing the channel features of the clipped image, and specifically includes: firstly, the input image channels are expanded to c through a layer of ordinary convolution calculation, where c is the number of feature channels generated after the convolution operation of an ordinary feature extraction network, c1 is the number of channels processed by the high-frequency convolution, and c2 is the number of channels processed by the low-frequency convolution, satisfying c = c1 + c2. That is, the multi-channel features generated by an ordinary convolution network are partitioned: the high-frequency convolution is processed in the same way as ordinary feature extraction, while the low-frequency convolution down-samples the output of the first-layer convolution and then performs a normal convolution operation.
Further, in step S2, the information exchange of the high and low frequency features is performed as follows: the feature tensor X obtained by the first block convolution of the feature extraction network is divided, over the total number of channels c, into c1 channel features and c2 channel features; the c1 channel features are taken as the input feature component X^H of the high-frequency convolution, and the c2 channel features as the input feature component X^L of the low-frequency convolution. After the convolution calculation, the feature output is decomposed into high-frequency and low-frequency components Y = {Y^H, Y^L}, and through up/down sampling the spatial resolutions of the outputs of the high- and low-frequency convolution calculations are unified, so that the output features can be expressed as:

Y^H = Y^{H→H} + Y^{L→H},    Y^L = Y^{L→L} + Y^{H→L}    (1)

in formula (1), Y^{H→H} and Y^{L→L} indicate the transfer of feature information at the same spatial resolution after the high- and low-frequency convolution calculation, while Y^{H→L} and Y^{L→H} represent the high-low frequency information exchange.
Further, in order to calculate the output feature tensor of the new convolution, a convolution kernel W = {W^{H→H}, W^{L→H}, W^{L→L}, W^{H→L}} is provided, which is responsible for convolving the input features X^L and X^H, so the feature calculation can be refined as:

Y^H = f(X^H; W^{H→H}) + upsample(f(X^L; W^{L→H}), 2)
Y^L = f(X^L; W^{L→L}) + f(pool(X^H, 2); W^{H→L})    (2)

in formula (2), f(X; W) represents convolution with parameter W, and pool(X, 2) represents down-sampling by average pooling;

a) the output functions f(X^H; W^{H→H}) and f(X^L; W^{L→L}) are the concrete calculations of Y^{H→H} and Y^{L→L} in formula (1), and pass the high-frequency and low-frequency features on to the next layer of the convolutional network; b) upsample(f(X^L; W^{L→H}), 2) completes the up-sampling of the low-frequency features and passes the low-frequency information to the high-frequency convolution of the next layer, while f(pool(X^H, 2); W^{H→L}) completes the down-sampling of the high-frequency features and passes the high-frequency information to the low-frequency convolution of the next layer.
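For illustration only, a minimal sketch of the high-low frequency convolution of formula (2) is given below. PyTorch is used purely as an example framework; the class name HighLowFreqConv, the alpha channel-split ratio and the nearest-neighbour up-sampling are assumptions, not details fixed by the invention.

# Minimal sketch of the high-low spatial frequency convolution of formula (2).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighLowFreqConv(nn.Module):
    def __init__(self, in_ch, out_ch, alpha=0.5, kernel_size=3, padding=1):
        super().__init__()
        in_l, out_l = int(alpha * in_ch), int(alpha * out_ch)   # c2: low-frequency channels
        in_h, out_h = in_ch - in_l, out_ch - out_l               # c1: high-frequency channels
        # the four kernels W = {W_HH, W_LH, W_LL, W_HL} of formula (2)
        self.w_hh = nn.Conv2d(in_h, out_h, kernel_size, padding=padding)
        self.w_lh = nn.Conv2d(in_l, out_h, kernel_size, padding=padding)
        self.w_ll = nn.Conv2d(in_l, out_l, kernel_size, padding=padding)
        self.w_hl = nn.Conv2d(in_h, out_l, kernel_size, padding=padding)

    def forward(self, x_h, x_l):
        # Y_H = f(X_H; W_HH) + upsample(f(X_L; W_LH), 2)
        y_h = self.w_hh(x_h) + F.interpolate(self.w_lh(x_l), scale_factor=2, mode="nearest")
        # Y_L = f(X_L; W_LL) + f(pool(X_H, 2); W_HL)
        y_l = self.w_ll(x_l) + self.w_hl(F.avg_pool2d(x_h, 2))
        return y_h, y_l

# usage: the low-frequency map has half the spatial resolution of the high-frequency map
conv = HighLowFreqConv(in_ch=64, out_ch=64, alpha=0.5)
x_h, x_l = torch.randn(1, 32, 56, 56), torch.randn(1, 32, 28, 28)
y_h, y_l = conv(x_h, x_l)   # shapes: (1, 32, 56, 56) and (1, 32, 28, 28)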
Further, the step S3 of obtaining a correlation map of the target truth-value features and the search area features by correlation calculation specifically includes:

a function f_θ(z, x) is used to compare the similarity of the target sample image z and the search image x, and the correlation map is output by the following formula:

f_θ(z, x) = φ_θ(z) ⋆ φ_θ(x)    (3)

in formula (3), z represents the target image, x represents the search area in a subsequent frame of the video, ⋆ denotes the cross-correlation of the two feature maps, z is a w × h crop centered on the given target truth value, and x is a larger crop centered on the target position estimate. The two crops are input to the same convolutional network φ_θ for processing, which is equivalent to an exhaustive search with the template image z over the search image x. The larger the response value of the searched function f_θ(z, x), the greater the likelihood that the mapped location is the target.
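As an illustration of the correlation calculation of formula (3), the following minimal sketch slides the template embedding over the search-region embedding as a correlation kernel; the feature sizes and the use of F.conv2d are assumptions made only for this example.

# Minimal sketch of the correlation map of formula (3).
import torch
import torch.nn.functional as F

def correlation_map(phi_z, phi_x):
    """phi_z: template features (1, C, hz, wz); phi_x: search features (1, C, hx, wx)."""
    # cross-correlation: the template embedding acts as the convolution kernel
    return F.conv2d(phi_x, phi_z)          # response map (1, 1, hx-hz+1, wx-wz+1)

phi_z = torch.randn(1, 256, 6, 6)          # assumed template feature size
phi_x = torch.randn(1, 256, 22, 22)        # assumed search-region feature size
score = correlation_map(phi_z, phi_x)      # (1, 1, 17, 17) response map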
Further, the region proposal network in step S4 is a method for supervising the correlation map output after the correlation calculation. The supervision part has two branches, one for classifying the foreground object and the background, and one for proposal regression. The features φ(z) and φ(x), after passing through the successive convolutional layers of the supervision part, are each divided into two branches: one is the cls branch feature used for classification, i.e. [φ(z)]_cls and [φ(x)]_cls, and one is the reg branch feature used for regression, i.e. [φ(z)]_reg and [φ(x)]_reg. The classification branch and the regression branch are then calculated separately by the following formulas:

F_cls = [φ(x)]_cls ⋆ [φ(z)]_cls,    F_reg = [φ(x)]_reg ⋆ [φ(z)]_reg    (4)

in formula (4), the calculation of the two branches combines the correlation calculation of step S3 to output F_cls and F_reg, which are tuned through hyper-parameters to accurately refine the localization result of S3.
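The sketch below illustrates one possible form of the two-branch supervision of formula (4), written as an up-channel correlation in which the adjusted template features act as grouped kernels; the channel sizes, the anchor number k and this particular correlation layout are assumptions, not details fixed by the invention.

# Minimal sketch of the two-branch region proposal head of formula (4).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPNHead(nn.Module):
    def __init__(self, in_ch=256, k=5):
        super().__init__()
        self.k = k
        # template-side and search-side adjustment convolutions for each branch
        self.z_cls = nn.Conv2d(in_ch, in_ch * 2 * k, kernel_size=3)
        self.z_reg = nn.Conv2d(in_ch, in_ch * 4 * k, kernel_size=3)
        self.x_cls = nn.Conv2d(in_ch, in_ch, kernel_size=3)
        self.x_reg = nn.Conv2d(in_ch, in_ch, kernel_size=3)

    def forward(self, phi_z, phi_x):
        kz_cls = self.z_cls(phi_z)              # (1, in_ch*2k, hz, wz)
        kz_reg = self.z_reg(phi_z)              # (1, in_ch*4k, hz, wz)
        fx_cls = self.x_cls(phi_x)
        fx_reg = self.x_reg(phi_x)
        # correlate: the template features act as kernels, grouped over channels
        f_cls = F.conv2d(fx_cls, kz_cls.view(-1, fx_cls.size(1), kz_cls.size(2), kz_cls.size(3)))
        f_reg = F.conv2d(fx_reg, kz_reg.view(-1, fx_reg.size(1), kz_reg.size(2), kz_reg.size(3)))
        return f_cls, f_reg                     # 2k classification maps, 4k regression maps

head = RPNHead(in_ch=256, k=5)
phi_z, phi_x = torch.randn(1, 256, 6, 6), torch.randn(1, 256, 22, 22)
f_cls, f_reg = head(phi_z, phi_x)               # (1, 10, 17, 17) and (1, 20, 17, 17)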
A medium having stored therein a computer program which, when read by a processor, performs any of the methods described above.
A target tracking system incorporating high and low spatial frequency characteristics, comprising:
a feature extraction module: used for inputting an original video image sequence, cropping out the target image truth value and the search area image sequence according to the target truth value diagram of the first frame, performing feature extraction on the image sequence with high-low spatial frequency convolution to obtain the high-frequency feature component and the low-frequency feature component of each layer, and taking the extracted features as the features for judging the subsequent tracking target;
an information exchange module: used for exchanging information between the high-frequency feature component and the low-frequency feature component of each layer, and finally outputting the feature-enhanced feature value of the tracking sequence;
a correlation calculation module: used for obtaining a correlation map of the target truth-value features and the search area features by correlation calculation, according to the target image and search image sequence features extracted by the high-low spatial frequency feature network;
a feature correction module: used for reprocessing through the region proposal network to obtain the classification features and the regression features, and correcting the target position located by the classification features with the regression feature parameters to obtain the target position to be tracked.
The invention has the following advantages and beneficial effects:
The invention combines high and low spatial frequency feature extraction with a target tracking method, and optimizes the feature extraction of the tracked image frames with high-low frequency features. On the one hand, introducing low-frequency features addresses the problem that the features of a deep network lean too much towards detail and lack overall feature information, making the features more suitable for tracking; on the other hand, because the computational cost of the low-frequency features is lower than that of the high-frequency features, assigning part of the channel features to the low-frequency convolution calculation effectively reduces the computational cost of the deep network. Meanwhile, through the information exchange between the high-frequency and low-frequency features, multi-scale feature information is well combined, the interference of purely detailed features on objects in tracking motion is reduced, and the features extracted by the network structure better match the needs of tracking.
The innovation points of the invention are as follows: firstly, a low-frequency feature extraction convolution is introduced into the feature extraction module. Compared with the conventional convolution calculation, the low-frequency convolution performs a normal convolution on features that have first been down-sampled from the normal convolution output; the down-sampled features are smaller than those of the normal convolution calculation, and each feature point, when mapped back to the original image, covers more information than a single feature point computed by conventional convolution, so it carries more global features. In other words, compared with the conventional convolution method, the low-frequency convolution down-samples one part of the multi-channel features generated by the conventional convolution and then convolves them normally, while the remaining features undergo only the conventional convolution without down-sampling, which is called the high-frequency convolution. Because the length and width of the feature map generated by the low-frequency convolution, after down-sampling, are only half those of the conventional convolution, the low-frequency convolution requires little computation and the overall convolution cost is lower than that of conventional convolution.
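As a rough, illustrative estimate (the sizes and the 0.5 split ratio are assumptions, not results reported by the invention), the multiply-accumulate count below shows why routing half of the channels through the half-resolution low-frequency branch lowers the convolution cost.

# Back-of-the-envelope cost comparison: conventional convolution vs. high-low frequency convolution.
def conv_macs(h, w, c_in, c_out, k=3):
    """Multiply-accumulate count of a k x k convolution on an h x w feature map."""
    return h * w * c_in * c_out * k * k

h = w = 56
c = 256
alpha = 0.5                                   # fraction of channels given to the low-frequency branch
c_l = int(alpha * c)
c_h = c - c_l

conventional = conv_macs(h, w, c, c)
high_low = (conv_macs(h, w, c_h, c_h)             # H -> H at full resolution
            + conv_macs(h // 2, w // 2, c_h, c_l)  # H -> L at half resolution
            + conv_macs(h // 2, w // 2, c_l, c_l)  # L -> L at half resolution
            + conv_macs(h // 2, w // 2, c_l, c_h)) # L -> H at half resolution

print(conventional, high_low, high_low / conventional)   # ratio well below 1 (about 7/16 here)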
Secondly, an information exchange module is introduced. Conventional convolution calculation only has what corresponds to the high-frequency convolution features of the feature extraction module (high frequency here being relative to the low-frequency convolution features). In the invention, the features generated by each layer of low-frequency convolution are brought, by up-sampling, to the same size as those of the corresponding high-frequency convolution and then added, so that the features generated by each layer of high-frequency convolution obtain the global information of the low-frequency convolution features; likewise, the features generated by each layer of high-frequency convolution are brought, by down-sampling, to the same size as those of the corresponding low-frequency convolution and then added, so that the features generated by each layer of low-frequency convolution obtain the detail information of the high-frequency convolution features. Through this information exchange, the finally output features carry global information and detail information at the same time, which is a novel multi-scale feature method for tracking. A conventional multi-scale method for tracking generally selects several layers of conventional convolution features, usually 3 to 5 layers, and then aligns the feature sizes by up- or down-sampling before adding and fusing them. The present method instead adds multi-scale information to the high-frequency and low-frequency feature output of every layer; it can be used together with the conventional method and can further improve performance according to the requirements of the specific target tracking task.
Drawings
FIG. 1 is a flow chart of a preferred embodiment method provided by the present invention;
fig. 2 is an example of the modification of the convolution calculation in the feature extraction part of the tracking network according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the process flow of the present invention is shown in FIG. 1.
The modification of the backbone network block of the deep convolution calculation part according to the invention is shown in fig. 2.
The method comprises the following specific steps:
step S1: firstly, an input original video image sequence is cut, considering that a target position proposed by a first frame of the video sequence is cut, and a 2-time area of the target position predicted by the previous frame is used as a search area image sequence for cutting, so that a feature map of a target to be tracked and a feature map of a search area in a subsequent tracking process can be obtained through a feature extraction network. Specifically, the truth feature given by the first frame of the target is selected, and the center point of the selected truth feature is taken as the standard, so that the mean value is filled into the initial image input with the length and the width of m, and m is generally 127 as the default. Regarding the first frame information of the search area, taking the central point of the selected truth-value feature as a reference, filling the mean value of the first frame information into initial image input with the length and the width being n, and generally defaulting n to be 256; and filling the information of the subsequent frames of the search area by taking the central point predicted by the previous frame tracking algorithm as a reference.
In addition, for the high and low spatial frequency feature extraction, the input image channels are first expanded to c through a layer of ordinary convolution calculation. c is the number of feature channels generated after the convolution operation of an ordinary feature extraction network, c1 is the number of channels processed by the high-frequency convolution, and c2 is the number of channels processed by the low-frequency convolution, satisfying c = c1 + c2. That is, the multi-channel features generated by an ordinary convolution network are partitioned: the high-frequency convolution is processed in the same way as ordinary feature extraction, while the low-frequency convolution down-samples the output of the first-layer convolution and then performs a normal convolution operation.
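A minimal sketch of this channel partition is given below; the split point and tensor sizes are illustrative assumptions.

# Minimal sketch of splitting the c feature channels into c1 high-frequency and c2 low-frequency channels.
import torch
import torch.nn.functional as F

def split_high_low(features, c1):
    """features: (N, c, H, W) output of the first ordinary convolution."""
    x_h = features[:, :c1]                      # c1 high-frequency channels, full resolution
    x_l = F.avg_pool2d(features[:, c1:], 2)     # c2 = c - c1 low-frequency channels, half resolution
    return x_h, x_l

feat = torch.randn(1, 64, 112, 112)
x_h, x_l = split_high_low(feat, c1=32)          # shapes: (1, 32, 112, 112) and (1, 32, 56, 56)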
Step S2: the information exchange of the high and low frequency features is performed as follows. The feature tensor X obtained by the first block convolution of the feature extraction network is divided, over the total number of channels c, into c1 channel features, taken as the input feature component X^H of the high-frequency convolution, and c2 channel features, taken as the input feature component X^L of the low-frequency convolution. After the convolution calculation, the feature output is decomposed into high-frequency and low-frequency components Y = {Y^H, Y^L}. Through up/down sampling, the spatial resolutions of the outputs of the high- and low-frequency convolution calculations are unified, and the output features can be expressed as:

Y^H = Y^{H→H} + Y^{L→H},    Y^L = Y^{L→L} + Y^{H→L}    (1)

in formula (1), Y^{a→b} indicates that the features are mapped into group b, and the size of the output features equals the spatial resolution of group b; Y^{H→H} and Y^{L→L} indicate the transfer of feature information at the same spatial resolution after the high- and low-frequency convolution calculation, while Y^{H→L} and Y^{L→H} represent the high-low frequency information exchange. In order to calculate the output feature tensor of the new convolution, a convolution kernel W = {W^{H→H}, W^{L→H}, W^{L→L}, W^{H→L}} is provided, which is responsible for convolving the input features X^L and X^H. Thus the feature calculation can be refined as:

Y^H = f(X^H; W^{H→H}) + upsample(f(X^L; W^{L→H}), 2)
Y^L = f(X^L; W^{L→L}) + f(pool(X^H, 2); W^{H→L})    (2)

in formula (2), f(X; W) represents convolution with parameter W, and pool(X, 2) represents down-sampling with average pooling.

Formula (2) is a refinement of the convolution of formula (1). The spatial resolution of the high-frequency convolution feature output is 2 times that of the low-frequency convolution feature, so for the tracking method the receptive field of the low-frequency convolution feature is 2 times that of the high-frequency one. a) The output functions f(X^H; W^{H→H}) and f(X^L; W^{L→L}) are the concrete calculations of Y^{H→H} and Y^{L→L} in formula (1), and pass the high-frequency and low-frequency features on to the next layer of the convolutional network. b) upsample(f(X^L; W^{L→H}), 2) completes the up-sampling of the low-frequency features and passes the low-frequency information to the high-frequency convolution of the next layer, while f(pool(X^H, 2); W^{H→L}) completes the down-sampling of the high-frequency features and passes the high-frequency information to the low-frequency convolution of the next layer. Through this exchange, each convolutional layer obtains contextual information from more distant positions, which improves the performance of the network features.
Specifically, as shown in fig. 2 of the specification, the backbone block of a general deep convolutional network is modified. The left part of fig. 2 shows a general feature extraction convolution: conv denotes convolution calculation, conv1x1 denotes a convolution of size 1x1, and conv3x3 a convolution of size 3x3. A typical deeper network consists of such convolution structures, where s1 denotes a convolution stride of 1. In the right part, low-frequency convolution calculation is added on the basis of the general feature convolution, and the crossed connecting lines represent the operation of formula (2). c1 is the number of high-frequency convolution channels, whose convolution calculation belongs to the high-frequency convolution; c2 is the number of low-frequency convolution channels, whose convolution calculation belongs to the low-frequency convolution, where c = c1 + c2.
Step S3: a function f_θ(z, x) is used to compare the similarity of the target sample image z and the search image x. The correlation map is output by the following formula:

f_θ(z, x) = φ_θ(z) ⋆ φ_θ(x)    (3)

in formula (3), z represents the target image and x represents the search area in a subsequent frame of the video. Specifically, z is a w × h crop centered on the given target truth value, and x is a larger crop centered on the target position estimation result. These two crops are input to the same convolutional network φ_θ for processing; here the convolutional network φ_θ is the high-low frequency convolutional network described above. This corresponds to an exhaustive search with the template image z over the search image x, and the response values of the searched function f_θ(z, x) are used for tracking.
Step S4: a region proposal network is introduced to supervise the correlation map output after the correlation calculation. The supervision part has two branches, one used for classifying foreground objects and background, and one used for proposal regression. The features φ(z) and φ(x), after passing through the successive convolutional layers of the supervision part, are each divided into two branches: one is the cls branch feature used for classification, i.e. [φ(z)]_cls and [φ(x)]_cls, and one is the reg branch feature used for regression, i.e. [φ(z)]_reg and [φ(x)]_reg. The classification branch and the regression branch are then calculated separately by the following formulas:

F_cls = [φ(x)]_cls ⋆ [φ(z)]_cls,    F_reg = [φ(x)]_reg ⋆ [φ(z)]_reg    (4)
in formula (4), the calculation of the two branches combines the correlation calculation of step S3 to output F_cls and F_reg; the two outputs are weighted by hyper-parameters, generally 0.5 and 0.5.
In addition, since the deep network needs to pre-train its parameters, during training the classification branch uses the cross-entropy loss and the regression branch uses the smooth L1 loss.
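For illustration, a minimal sketch of these training losses is given below; the label layout and the equal weighting of the two branches are assumptions, not values fixed by the invention.

# Minimal sketch of the losses: cross-entropy on the classification branch, smooth L1 on the regression branch.
import torch
import torch.nn.functional as F

def rpn_loss(f_cls, f_reg, cls_labels, reg_targets):
    """f_cls: (N, 2k, H, W) scores; f_reg: (N, 4k, H, W) box offsets."""
    n, two_k, h, w = f_cls.shape
    # reshape so each anchor position becomes a 2-way (foreground/background) decision
    logits = f_cls.view(n, 2, two_k // 2, h, w).permute(0, 2, 3, 4, 1).reshape(-1, 2)
    cls_loss = F.cross_entropy(logits, cls_labels.reshape(-1))
    reg_loss = F.smooth_l1_loss(f_reg, reg_targets)
    return cls_loss + reg_loss        # assumed 1:1 weighting

k = 5
f_cls, f_reg = torch.randn(1, 2 * k, 17, 17), torch.randn(1, 4 * k, 17, 17)
cls_labels = torch.randint(0, 2, (1, k, 17, 17))
reg_targets = torch.randn(1, 4 * k, 17, 17)
loss = rpn_loss(f_cls, f_reg, cls_labels, reg_targets)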
In order to evaluate the performance of the algorithm, the invention adopts the tracking data sets OTB-2013 and OTB-2015, commonly used for target tracking, for a preliminary evaluation. A comparison of the two evaluation indexes, AUC and Pre (precision), is given.
In the experiment, the SiamRPN tracking method, a SiamRPN-Res method in which the SiamRPN feature extraction part is replaced by a deeper feature network, and the twin-network tracking method combining high and low frequency features are selected for quantitative evaluation. The comparison of the experimental results is shown in Table 1.
TABLE 1
[Table 1: comparison of AUC and Pre on OTB-2013 and OTB-2015 for SiamRPN, SiamRPN-Res and the proposed method; the numerical values are provided as an image in the original publication.]
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (9)

1. A target tracking method combining high and low spatial frequency characteristics is characterized by comprising the following steps:
s1, firstly, inputting an original video image sequence, cutting out a target image truth value and a search area image sequence according to a target truth value diagram of a first frame provided by the original image sequence, extracting the characteristics of the image sequence by adopting high-low spatial frequency convolution to obtain high-frequency characteristic components and low-frequency characteristic components of each layer, and taking the extracted characteristics as the characteristics for judging a follow-up tracking target;
s2, secondly, exchanging information of the high-frequency characteristic component and the low-frequency characteristic component of each layer, and finally outputting a characteristic value of the tracking sequence after characteristic enhancement;
s3, obtaining a correlation diagram of the target truth value characteristic and the search area characteristic by utilizing correlation calculation according to the target image and the search image sequence characteristic extracted by the high and low spatial frequency characteristic network;
and S4, finally, reprocessing through the regional suggestion network to obtain classification features and regression features, and correcting the target position positioned by the classification features through regression feature parameters to obtain the target position to be tracked.
2. The method for tracking the target according to claim 1, wherein the step S1 cuts the input original video image sequence by cutting the target position proposed by the first frame of the video sequence and cutting the image sequence of the search area from an area 2 times larger than the predicted target position of the previous frame, so as to obtain the feature map of the target to be tracked and the feature map of the search area in the subsequent tracking process through the feature extraction network.
3. The method for tracking the target by combining the high and low spatial frequency features according to claim 1 or 2, wherein the high and low spatial frequency feature extraction in step S1 is a method for processing the channel features of the clipped image, and specifically includes: firstly, the input image channels are expanded to c through a layer of ordinary convolution calculation, where c is the number of feature channels generated after the convolution operation of an ordinary feature extraction network, c1 is the number of channels processed by the high-frequency convolution, and c2 is the number of channels processed by the low-frequency convolution, satisfying c = c1 + c2; that is, the multi-channel features generated by an ordinary convolution network are partitioned, the high-frequency convolution being processed in the same way as ordinary feature extraction, while the low-frequency convolution down-samples the output of the first-layer convolution and then performs a normal convolution operation.
4. The method for tracking the target by combining the high-low spatial frequency features as claimed in claim 3, wherein in step S2 the information exchange of the high and low frequency features is performed by dividing, over the total number of channels c, the feature tensor X obtained by the first block convolution of the feature extraction network into c1 channel features and c2 channel features; the c1 channel features are taken as the input feature component X^H of the high-frequency convolution and the c2 channel features as the input feature component X^L of the low-frequency convolution; after the convolution calculation, the feature output is decomposed into high-frequency and low-frequency components Y = {Y^H, Y^L}, and through up/down sampling the spatial resolutions of the outputs of the high- and low-frequency convolution calculations are unified, so that the output features can be expressed as:

Y^H = Y^{H→H} + Y^{L→H},    Y^L = Y^{L→L} + Y^{H→L}    (1)

in formula (1), Y^{H→H} and Y^{L→L} indicate the transfer of feature information at the same spatial resolution after the high- and low-frequency convolution calculation, while Y^{H→L} and Y^{L→H} represent the high-low frequency information exchange.
5. The method as claimed in claim 4, wherein in order to calculate the output feature tensor of the new convolution, a convolution kernel W = {W^{H→H}, W^{L→H}, W^{L→L}, W^{H→L}} is provided, which is responsible for convolving the input features X^L and X^H, so the feature calculation can be refined as:

Y^H = f(X^H; W^{H→H}) + upsample(f(X^L; W^{L→H}), 2)
Y^L = f(X^L; W^{L→L}) + f(pool(X^H, 2); W^{H→L})    (2)

in formula (2), f(X; W) represents convolution with parameter W, and pool(X, 2) represents down-sampling by average pooling;

a) the output functions f(X^H; W^{H→H}) and f(X^L; W^{L→L}) are the concrete calculations of Y^{H→H} and Y^{L→L} in formula (1), and pass the high-frequency and low-frequency features on to the next layer of the convolutional network; b) upsample(f(X^L; W^{L→H}), 2) completes the up-sampling of the low-frequency features and passes the low-frequency information to the high-frequency convolution of the next layer, while f(pool(X^H, 2); W^{H→L}) completes the down-sampling of the high-frequency features and passes the high-frequency information to the low-frequency convolution of the next layer.
6. The method as claimed in claim 5, wherein the step S3 of obtaining the correlation map of the target truth-value features and the search area features by correlation calculation specifically includes:

a function f_θ(z, x) is used to compare the similarity of the target sample image z and the search image x, and the correlation map is output by the following formula:

f_θ(z, x) = φ_θ(z) ⋆ φ_θ(x)    (3)

in formula (3), z represents the target image, x represents the search area in a subsequent frame of the video, ⋆ denotes the cross-correlation of the two feature maps, z is a w × h crop centered on the given target truth value, and x is a larger crop centered on the target position estimate; the two crops are input to the same convolutional network φ_θ for processing, which is equivalent to an exhaustive search with the template image z over the search image x; the larger the response value of the searched function f_θ(z, x), the greater the likelihood that the mapped location is the target.
7. The method as claimed in claim 5, wherein the region proposal network in step S4 is a method for supervising the correlation map output after the correlation calculation, the supervision part having two branches, one used for classifying the foreground object and the background and one used for proposal regression; the features φ(z) and φ(x), after passing through the successive convolutional layers of the supervision part, are each divided into two branches: one is the cls branch feature used for classification, i.e. [φ(z)]_cls and [φ(x)]_cls, and one is the reg branch feature used for regression, i.e. [φ(z)]_reg and [φ(x)]_reg; the classification branch and the regression branch are then calculated separately by the following formulas:

F_cls = [φ(x)]_cls ⋆ [φ(z)]_cls,    F_reg = [φ(x)]_reg ⋆ [φ(z)]_reg    (4)

in formula (4), the calculation of the two branches combines the correlation calculation of step S3 to output F_cls and F_reg, which are tuned through hyper-parameters to accurately refine the localization result of S3.
8. A medium having a computer program stored therein, wherein the computer program, when read by a processor, performs the method of any of the preceding claims 1 to 7.
9. An object tracking system combining high and low spatial frequency characteristics, comprising:
a feature extraction module: used for inputting an original video image sequence, cropping out the target image truth value and the search area image sequence according to the target truth value diagram of the first frame, performing feature extraction on the image sequence with high-low spatial frequency convolution to obtain the high-frequency feature component and the low-frequency feature component of each layer, and taking the extracted features as the features for judging the subsequent tracking target;
an information exchange module: used for exchanging information between the high-frequency feature component and the low-frequency feature component of each layer, and finally outputting the feature-enhanced feature value of the tracking sequence;
a correlation calculation module: used for obtaining a correlation map of the target truth-value features and the search area features by correlation calculation, according to the target image and search image sequence features extracted by the high-low spatial frequency feature network;
a feature correction module: used for reprocessing through the region proposal network to obtain the classification features and the regression features, and correcting the target position located by the classification features with the regression feature parameters to obtain the target position to be tracked.
CN201911349832.3A 2019-12-24 2019-12-24 Target tracking method, medium and system combining high-low spatial frequency characteristics Active CN111191555B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911349832.3A CN111191555B (en) 2019-12-24 2019-12-24 Target tracking method, medium and system combining high-low spatial frequency characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911349832.3A CN111191555B (en) 2019-12-24 2019-12-24 Target tracking method, medium and system combining high-low spatial frequency characteristics

Publications (2)

Publication Number Publication Date
CN111191555A CN111191555A (en) 2020-05-22
CN111191555B true CN111191555B (en) 2022-05-03

Family

ID=70707585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911349832.3A Active CN111191555B (en) 2019-12-24 2019-12-24 Target tracking method, medium and system combining high-low spatial frequency characteristics

Country Status (1)

Country Link
CN (1) CN111191555B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112561956B (en) * 2020-11-25 2023-04-28 中移(杭州)信息技术有限公司 Video target tracking method and device, electronic equipment and storage medium
CN113160247B (en) * 2021-04-22 2022-07-05 福州大学 Anti-noise twin network target tracking method based on frequency separation

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103456030A (en) * 2013-09-08 2013-12-18 西安电子科技大学 Target tracking method based on scattering descriptor
CN105844669A (en) * 2016-03-28 2016-08-10 华中科技大学 Video target real-time tracking method based on partial Hash features
CN107545582A (en) * 2017-07-04 2018-01-05 深圳大学 Video multi-target tracking and device based on fuzzy logic
CN109271865A (en) * 2018-08-17 2019-01-25 西安电子科技大学 Motion target tracking method based on scattering transformation multilayer correlation filtering
CN110084773A (en) * 2019-03-25 2019-08-02 西北工业大学 A kind of image interfusion method based on depth convolution autoencoder network
CN110210551A (en) * 2019-05-28 2019-09-06 北京工业大学 A kind of visual target tracking method based on adaptive main body sensitivity
CN110570458A (en) * 2019-08-12 2019-12-13 武汉大学 Target tracking method based on internal cutting and multi-layer characteristic information fusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10902243B2 (en) * 2016-10-25 2021-01-26 Deep North, Inc. Vision based target tracking that distinguishes facial feature targets

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103456030A (en) * 2013-09-08 2013-12-18 西安电子科技大学 Target tracking method based on scattering descriptor
CN105844669A (en) * 2016-03-28 2016-08-10 华中科技大学 Video target real-time tracking method based on partial Hash features
CN107545582A (en) * 2017-07-04 2018-01-05 深圳大学 Video multi-target tracking and device based on fuzzy logic
CN109271865A (en) * 2018-08-17 2019-01-25 西安电子科技大学 Motion target tracking method based on scattering transformation multilayer correlation filtering
CN110084773A (en) * 2019-03-25 2019-08-02 西北工业大学 A kind of image interfusion method based on depth convolution autoencoder network
CN110210551A (en) * 2019-05-28 2019-09-06 北京工业大学 A kind of visual target tracking method based on adaptive main body sensitivity
CN110570458A (en) * 2019-08-12 2019-12-13 武汉大学 Target tracking method based on internal cutting and multi-layer characteristic information fusion

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Multi-Task Hierarchical Feature Learning for Real-Time Visual Tracking; Yangliu Kuai et al.; IEEE Sensors Journal; 20181127; entire document *
Target Tracking Algorithm for Specific Person Based on Multi-loss of Siamese Network; Xinhua Liu et al.; 2019 11th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA); 20191007; entire document *
Trajectory Factory: Tracklet Cleaving and Re-Connection by Deep Siamese Bi-GRU for Multiple Object Tracking; Cong Ma et al.; 2018 IEEE International Conference on Multimedia and Expo (ICME); 20181011; entire document *
Long-term target tracking algorithm using efficient convolution operators (采用高效卷积算子的长期目标追踪算法); Li Guoyou et al.; Journal of Chinese Computer Systems (小型微型计算机系统); 20190915; entire document *

Also Published As

Publication number Publication date
CN111191555A (en) 2020-05-22

Similar Documents

Publication Publication Date Title
CN109685831B (en) Target tracking method and system based on residual layered attention and correlation filter
CN107689052B (en) Visual target tracking method based on multi-model fusion and structured depth features
CN111160407B (en) Deep learning target detection method and system
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN111191555B (en) Target tracking method, medium and system combining high-low spatial frequency characteristics
CN111583300B (en) Target tracking method based on enrichment target morphological change update template
CN110992401A (en) Target tracking method and device, computer equipment and storage medium
CN114691973A (en) Recommendation method, recommendation network and related equipment
CN112215079B (en) Global multistage target tracking method
CN112258558A (en) Target tracking method based on multi-scale twin network, electronic device and medium
CN110889864A (en) Target tracking method based on double-layer depth feature perception
CN110349176B (en) Target tracking method and system based on triple convolutional network and perceptual interference learning
CN109685830A (en) Method for tracking target, device and equipment and computer storage medium
Xie et al. Feature-guided spatial attention upsampling for real-time stereo matching network
CN109584267B (en) Scale adaptive correlation filtering tracking method combined with background information
CN113436224B (en) Intelligent image clipping method and device based on explicit composition rule modeling
CN111008992B (en) Target tracking method, device and system and storage medium
CN108460383A (en) Saliency refined method based on neural network and image segmentation
WO2022142084A1 (en) Match screening method and apparatus, and electronic device, storage medium and computer program
CN114066935A (en) Long-term target tracking method based on correlation filtering
CN113298850A (en) Target tracking method and system based on attention mechanism and feature fusion
CN113705325A (en) Deformable single-target tracking method and device based on dynamic compact memory embedding
Shen et al. Anti-distractors: two-branch siamese tracker with both static and dynamic filters for object tracking
CN112529081A (en) Real-time semantic segmentation method based on efficient attention calibration
CN106792510B (en) A kind of prediction type fingerprint image searching method in fingerprint location

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant