CN111191555B - Target tracking method, medium and system combining high-low spatial frequency characteristics - Google Patents

Target tracking method, medium and system combining high-low spatial frequency characteristics

Info

Publication number
CN111191555B
Authority
CN
China
Prior art keywords
characteristic
frequency
frame
convolution
low
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911349832.3A
Other languages
Chinese (zh)
Other versions
CN111191555A (en)
Inventor
李伟生
伍蔚帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN201911349832.3A priority Critical patent/CN111191555B/en
Publication of CN111191555A publication Critical patent/CN111191555A/en
Application granted granted Critical
Publication of CN111191555B publication Critical patent/CN111191555B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Abstract

The invention seeks to protect a target tracking method, medium and system combining high and low spatial frequency characteristics, which comprises the following steps: S1, inputting an original video image sequence, and extracting the high spatial frequency features and low spatial frequency features of the truth-value image and the search area image; S2, exchanging information between the extracted high-frequency and low-frequency features; S3, obtaining a correlation map of the target truth-value features and the search area features by correlation calculation; S4, obtaining a group of classification features and a group of regression features through the region proposal network, and locating the target by combining the classification and regression features. The method combines high-low frequency feature extraction with a twin-network target tracking method based on a region proposal network. The low spatial frequency features reduce redundant information in the image, lowering both the probability of tracking failure caused by redundant information generated while the target moves and the drift of the tracking frame, and reducing the computational cost of the tracking method; fine detail features and global features that facilitate tracking are exchanged between the two frequency branches.

Description

Target tracking method, medium and system combining high-low spatial frequency characteristics
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a target tracking method that extracts high and low spatial frequency features.
Background
Tracking is a fundamental task in the field of computer vision and has received a great deal of attention over the last decade. It aims to locate an object in the subsequent frames of a video sequence using only the position information of the object in the initial frame. Tracking has wide application in fields such as human-computer interaction, video tagging and automatic driving. Current trackers fall into two branches. One is the target tracking method based on correlation filters, such as COT and ECO. Correlation filtering methods can perform online training and offline weight updating, and their precision can be further improved once depth features are incorporated. At present, methods that combine deep learning with correlation filtering remain among the strongest performers on the OTB data set. The other type of tracker uses strong depth features to locate the target with high accuracy, such as MDNet, CFNet and SiamFC. Typically, the weights of the feature-extraction part of such trackers are not updated. Most recent methods that track with depth features aim to train robust feature network parameters offline and then acquire the target position through online tracking. However, most trackers use AlexNet as the feature extraction network, and because of its limited number of parameters the representation capability of such a shallow network is inferior to that of a deep network when representing target features. Therefore, many trackers do not take good advantage of deep networks for feature extraction.
The Siamese (twin) network method is currently an excellent family of methods that rely on depth features for tracking. However, it does not use a deeper network to extract more robust information for tracking. Therefore, an attempt is made to introduce ResNet as the feature extraction network in place of the AlexNet network in the tracking method, deepening the network to improve the richness of the extracted features.
However, for the tracking task the extracted features need to capture the detail information of the target's fast changes as well as the structural information of the target as a whole. ResNet is deep, the information it finally outputs is very rich, the extracted final features are not entirely what the tracking task requires, and a direct replacement easily leads to overfitting of the tracking model. Various trackers have shown that shallow information remains important for tracking. MDNet, for example, enhances its feature information by combining multi-layer features, which achieves a better effect. However, the ResNet network has many convolution layers, and how to select multi-layer features to enhance the feature information is a difficult problem. To solve this problem, the features of each frame of the tracking sequence are decomposed: a natural image can be divided into a high spatial frequency component, which represents rapidly changing detail features, and a low spatial frequency component, which describes the smooth overall structure of the image.
Therefore, we improve the convolution part of ResNet: first, the features extracted by convolution are divided into high-frequency sampled features and low-frequency sampled features; then, the high-frequency and low-frequency features output by each convolution layer exchange information. In this way the feature extraction network of the tracker is optimized. Although tracking methods are numerous, even those that use a deep convolutional network and obtain better results suffer from the fact that deep networks such as ResNet extract excessive detail features, which on the one hand increases the computational cost and on the other hand weakens the benefit of replacing a shallow convolutional network with a deep one, because the overall features are under-represented. Hence, optimizing the convolutional feature extraction by combining high-low frequency convolution with the twin-network tracking method, without changing the general structure and function of the overall feature network, has broad application prospects in the field of target tracking.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. A target tracking method, medium and system are provided that are based on a twin network which effectively uses deep convolutional network features, avoids the drop in tracking precision that occurs after features are extracted by a deeper convolutional network, and is combined with high-low spatial frequency convolution calculation. The technical scheme of the invention is as follows:
a target tracking method combining high and low spatial frequency characteristics comprises the following steps:
s1, firstly, inputting an original video image sequence, cutting out a target image truth value and a search area image sequence according to a target truth value diagram of a first frame provided by a data set, performing feature extraction on the image sequence by adopting high-low spatial frequency convolution to obtain a high-frequency feature component and a low-frequency feature component of each layer, and taking the extracted features as the features for judging a follow-up tracking target;
s2, secondly, exchanging information of the high-frequency characteristic component and the low-frequency characteristic component of each layer, and finally outputting a characteristic value of the tracking sequence after characteristic enhancement;
s3, obtaining a correlation diagram of the target truth value characteristic and the search area characteristic by utilizing correlation calculation according to the target image and the search image sequence characteristic extracted by the high and low spatial frequency characteristic network;
and S4, finally, reprocessing through the regional suggestion network to obtain classification features and regression features, and correcting the target position positioned by the classification features through regression feature parameters to obtain the target position to be tracked.
Further, the clipping of the input original video image sequence in step S1 is to clip the target position proposed by the first frame of the video sequence, and clip the image sequence of the search area in an area 2 times the target position predicted by the previous frame, so as to obtain the feature map of the target to be tracked and the feature map of the search area in the subsequent tracking process through the feature extraction network.
Further, the high and low spatial frequency feature extraction in step S1 is a method for processing the channel features of the clipped image, and specifically includes: firstly, the input image channels are expanded to c through a layer of ordinary convolution calculation, where c is the number of feature channels generated after the convolution operation of an ordinary feature extraction network, c1 is the number of channels processed by the high-frequency convolution, and c2 is the number of channels processed by the low-frequency convolution, satisfying c = c1 + c2. That is, the multi-channel features generated by an ordinary convolution network are partitioned: the high-frequency convolution is processed in the same way as ordinary feature extraction, while the low-frequency convolution down-samples the output of the first-layer convolution and then performs a normal convolution operation.
Further, in step S2, the information exchange of the high and low frequency features is performed as follows: the feature tensor X obtained by the first block convolution of the feature extraction network is divided, over the total number of channels c, into c1 channel features and c2 channel features; the c1 channel features are taken as the input feature component X^H of the high-frequency convolution, and the c2 channel features as the input feature component X^L of the low-frequency convolution. After the convolution calculation, the feature output is decomposed into high-frequency and low-frequency components Y = {Y^H, Y^L}, and through up/down sampling the spatial resolutions of the outputs of the high- and low-frequency convolution calculations are unified, so that the output features can be expressed as:

Y^H = Y^{H→H} + Y^{L→H},    Y^L = Y^{L→L} + Y^{H→L}    (1)

in formula (1), Y^{H→H} and Y^{L→L} indicate the transfer of feature information at the same spatial resolution after the high- and low-frequency convolution calculation, while Y^{H→L} and Y^{L→H} represent the high-low frequency information exchange.
Further, in order to calculate the output feature tensor of the new convolution, a convolution kernel W = {W^{H→H}, W^{L→H}, W^{L→L}, W^{H→L}} is provided, which is responsible for convolving the input features X^L and X^H, so the feature calculation can be refined as:

Y^H = f(X^H; W^{H→H}) + upsample(f(X^L; W^{L→H}), 2)
Y^L = f(X^L; W^{L→L}) + f(pool(X^H, 2); W^{H→L})    (2)

in formula (2), f(X; W) represents convolution with parameter W, and pool(X, 2) represents down-sampling by average pooling;

a) the output functions f(X^H; W^{H→H}) and f(X^L; W^{L→L}) are the concrete calculations of Y^{H→H} and Y^{L→L} in formula (1), and pass the high-frequency and low-frequency features on to the next layer of the convolutional network; b) upsample(f(X^L; W^{L→H}), 2) completes the up-sampling of the low-frequency features and passes the low-frequency information to the high-frequency convolution of the next layer, while f(pool(X^H, 2); W^{H→L}) completes the down-sampling of the high-frequency features and passes the high-frequency information to the low-frequency convolution of the next layer.
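For illustration only, a minimal sketch of the high-low frequency convolution of formula (2) is given below. PyTorch is used purely as an example framework; the class name HighLowFreqConv, the alpha channel-split ratio and the nearest-neighbour up-sampling are assumptions, not details fixed by the invention.

# Minimal sketch of the high-low spatial frequency convolution of formula (2).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighLowFreqConv(nn.Module):
    def __init__(self, in_ch, out_ch, alpha=0.5, kernel_size=3, padding=1):
        super().__init__()
        in_l, out_l = int(alpha * in_ch), int(alpha * out_ch)   # c2: low-frequency channels
        in_h, out_h = in_ch - in_l, out_ch - out_l               # c1: high-frequency channels
        # the four kernels W = {W_HH, W_LH, W_LL, W_HL} of formula (2)
        self.w_hh = nn.Conv2d(in_h, out_h, kernel_size, padding=padding)
        self.w_lh = nn.Conv2d(in_l, out_h, kernel_size, padding=padding)
        self.w_ll = nn.Conv2d(in_l, out_l, kernel_size, padding=padding)
        self.w_hl = nn.Conv2d(in_h, out_l, kernel_size, padding=padding)

    def forward(self, x_h, x_l):
        # Y_H = f(X_H; W_HH) + upsample(f(X_L; W_LH), 2)
        y_h = self.w_hh(x_h) + F.interpolate(self.w_lh(x_l), scale_factor=2, mode="nearest")
        # Y_L = f(X_L; W_LL) + f(pool(X_H, 2); W_HL)
        y_l = self.w_ll(x_l) + self.w_hl(F.avg_pool2d(x_h, 2))
        return y_h, y_l

# usage: the low-frequency map has half the spatial resolution of the high-frequency map
conv = HighLowFreqConv(in_ch=64, out_ch=64, alpha=0.5)
x_h, x_l = torch.randn(1, 32, 56, 56), torch.randn(1, 32, 28, 28)
y_h, y_l = conv(x_h, x_l)   # shapes: (1, 32, 56, 56) and (1, 32, 28, 28)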
Further, the step S3 of obtaining a correlation map of the target truth-value features and the search area features by correlation calculation specifically includes:

a function f_θ(z, x) is used to compare the similarity of the target sample image z and the search image x, and the correlation map is output by the following formula:

f_θ(z, x) = φ_θ(z) ⋆ φ_θ(x)    (3)

in formula (3), z represents the target image, x represents the search area in a subsequent frame of the video, ⋆ denotes the cross-correlation of the two feature maps, z is a w × h crop centered on the given target truth value, and x is a larger crop centered on the target position estimate. The two crops are input to the same convolutional network φ_θ for processing, which is equivalent to an exhaustive search with the template image z over the search image x. The larger the response value of the searched function f_θ(z, x), the greater the likelihood that the mapped location is the target.
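As an illustration of the correlation calculation of formula (3), the following minimal sketch slides the template embedding over the search-region embedding as a correlation kernel; the feature sizes and the use of F.conv2d are assumptions made only for this example.

# Minimal sketch of the correlation map of formula (3).
import torch
import torch.nn.functional as F

def correlation_map(phi_z, phi_x):
    """phi_z: template features (1, C, hz, wz); phi_x: search features (1, C, hx, wx)."""
    # cross-correlation: the template embedding acts as the convolution kernel
    return F.conv2d(phi_x, phi_z)          # response map (1, 1, hx-hz+1, wx-wz+1)

phi_z = torch.randn(1, 256, 6, 6)          # assumed template feature size
phi_x = torch.randn(1, 256, 22, 22)        # assumed search-region feature size
score = correlation_map(phi_z, phi_x)      # (1, 1, 17, 17) response map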
Further, the region proposal network in step S4 is a method for supervising the correlation map output after the correlation calculation. The supervision part has two branches, one for classifying the foreground object and the background, and one for proposal regression. The features φ(z) and φ(x), after passing through the successive convolutional layers of the supervision part, are each divided into two branches: one is the cls branch feature used for classification, i.e. [φ(z)]_cls and [φ(x)]_cls, and one is the reg branch feature used for regression, i.e. [φ(z)]_reg and [φ(x)]_reg. The classification branch and the regression branch are then calculated separately by the following formulas:

F_cls = [φ(x)]_cls ⋆ [φ(z)]_cls,    F_reg = [φ(x)]_reg ⋆ [φ(z)]_reg    (4)

in formula (4), the calculation of the two branches combines the correlation calculation of step S3 to output F_cls and F_reg, which are tuned through hyper-parameters to accurately refine the localization result of S3.
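The sketch below illustrates one possible form of the two-branch supervision of formula (4), written as an up-channel correlation in which the adjusted template features act as grouped kernels; the channel sizes, the anchor number k and this particular correlation layout are assumptions, not details fixed by the invention.

# Minimal sketch of the two-branch region proposal head of formula (4).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPNHead(nn.Module):
    def __init__(self, in_ch=256, k=5):
        super().__init__()
        self.k = k
        # template-side and search-side adjustment convolutions for each branch
        self.z_cls = nn.Conv2d(in_ch, in_ch * 2 * k, kernel_size=3)
        self.z_reg = nn.Conv2d(in_ch, in_ch * 4 * k, kernel_size=3)
        self.x_cls = nn.Conv2d(in_ch, in_ch, kernel_size=3)
        self.x_reg = nn.Conv2d(in_ch, in_ch, kernel_size=3)

    def forward(self, phi_z, phi_x):
        kz_cls = self.z_cls(phi_z)              # (1, in_ch*2k, hz, wz)
        kz_reg = self.z_reg(phi_z)              # (1, in_ch*4k, hz, wz)
        fx_cls = self.x_cls(phi_x)
        fx_reg = self.x_reg(phi_x)
        # correlate: the template features act as kernels, grouped over channels
        f_cls = F.conv2d(fx_cls, kz_cls.view(-1, fx_cls.size(1), kz_cls.size(2), kz_cls.size(3)))
        f_reg = F.conv2d(fx_reg, kz_reg.view(-1, fx_reg.size(1), kz_reg.size(2), kz_reg.size(3)))
        return f_cls, f_reg                     # 2k classification maps, 4k regression maps

head = RPNHead(in_ch=256, k=5)
phi_z, phi_x = torch.randn(1, 256, 6, 6), torch.randn(1, 256, 22, 22)
f_cls, f_reg = head(phi_z, phi_x)               # (1, 10, 17, 17) and (1, 20, 17, 17)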
A medium having stored therein a computer program which, when read by a processor, performs any of the methods described above.
A target tracking system incorporating high and low spatial frequency characteristics, comprising:
a feature extraction module: used for inputting an original video image sequence, cropping out the target image truth value and the search area image sequence according to the target truth value diagram of the first frame, performing feature extraction on the image sequence with high-low spatial frequency convolution to obtain the high-frequency feature component and the low-frequency feature component of each layer, and taking the extracted features as the features for judging the subsequent tracking target;
an information exchange module: used for exchanging information between the high-frequency feature component and the low-frequency feature component of each layer, and finally outputting the feature-enhanced feature value of the tracking sequence;
a correlation calculation module: used for obtaining a correlation map of the target truth-value features and the search area features by correlation calculation, according to the target image and search image sequence features extracted by the high-low spatial frequency feature network;
a feature correction module: used for reprocessing through the region proposal network to obtain the classification features and the regression features, and correcting the target position located by the classification features with the regression feature parameters to obtain the target position to be tracked.
The invention has the following advantages and beneficial effects:
The invention combines high and low spatial frequency feature extraction with a target tracking method, and optimizes the feature extraction of the tracked image frames with high-low frequency features. On the one hand, introducing low-frequency features addresses the problem that the features of a deep network lean too much towards detail and lack overall feature information, making the features more suitable for tracking; on the other hand, because the computational cost of the low-frequency features is lower than that of the high-frequency features, assigning part of the channel features to the low-frequency convolution calculation effectively reduces the computational cost of the deep network. Meanwhile, through the information exchange between the high-frequency and low-frequency features, multi-scale feature information is well combined, the interference of purely detailed features on objects in tracking motion is reduced, and the features extracted by the network structure better match the needs of tracking.
The innovation points of the invention are as follows: firstly, a low-frequency feature extraction convolution is introduced into the feature extraction module. Compared with the conventional convolution calculation, the low-frequency convolution performs a normal convolution on features that have first been down-sampled from the normal convolution output; the down-sampled features are smaller than those of the normal convolution calculation, and each feature point, when mapped back to the original image, covers more information than a single feature point computed by conventional convolution, so it carries more global features. In other words, compared with the conventional convolution method, the low-frequency convolution down-samples one part of the multi-channel features generated by the conventional convolution and then convolves them normally, while the remaining features undergo only the conventional convolution without down-sampling, which is called the high-frequency convolution. Because the length and width of the feature map generated by the low-frequency convolution, after down-sampling, are only half those of the conventional convolution, the low-frequency convolution requires little computation and the overall convolution cost is lower than that of conventional convolution.
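As a rough, illustrative estimate (the sizes and the 0.5 split ratio are assumptions, not results reported by the invention), the multiply-accumulate count below shows why routing half of the channels through the half-resolution low-frequency branch lowers the convolution cost.

# Back-of-the-envelope cost comparison: conventional convolution vs. high-low frequency convolution.
def conv_macs(h, w, c_in, c_out, k=3):
    """Multiply-accumulate count of a k x k convolution on an h x w feature map."""
    return h * w * c_in * c_out * k * k

h = w = 56
c = 256
alpha = 0.5                                   # fraction of channels given to the low-frequency branch
c_l = int(alpha * c)
c_h = c - c_l

conventional = conv_macs(h, w, c, c)
high_low = (conv_macs(h, w, c_h, c_h)             # H -> H at full resolution
            + conv_macs(h // 2, w // 2, c_h, c_l)  # H -> L at half resolution
            + conv_macs(h // 2, w // 2, c_l, c_l)  # L -> L at half resolution
            + conv_macs(h // 2, w // 2, c_l, c_h)) # L -> H at half resolution

print(conventional, high_low, high_low / conventional)   # ratio well below 1 (about 7/16 here)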
Secondly, an information exchange module is introduced. Conventional convolution calculation only has what corresponds to the high-frequency convolution features of the feature extraction module (high frequency here being relative to the low-frequency convolution features). In the invention, the features generated by each layer of low-frequency convolution are brought, by up-sampling, to the same size as those of the corresponding high-frequency convolution and then added, so that the features generated by each layer of high-frequency convolution obtain the global information of the low-frequency convolution features; likewise, the features generated by each layer of high-frequency convolution are brought, by down-sampling, to the same size as those of the corresponding low-frequency convolution and then added, so that the features generated by each layer of low-frequency convolution obtain the detail information of the high-frequency convolution features. Through this information exchange, the finally output features carry global information and detail information at the same time, which is a novel multi-scale feature method for tracking. A conventional multi-scale method for tracking generally selects several layers of conventional convolution features, usually 3 to 5 layers, and then aligns the feature sizes by up- or down-sampling before adding and fusing them. The present method instead adds multi-scale information to the high-frequency and low-frequency feature output of every layer; it can be used together with the conventional method and can further improve performance according to the requirements of the specific target tracking task.
Drawings
FIG. 1 is a flow chart of a preferred embodiment method provided by the present invention;
fig. 2 is an example of the modification of the convolution calculation in the feature extraction part of the tracking network according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the process flow of the present invention is shown in FIG. 1.
The modification of the backbone network block of the deep convolution calculation part according to the invention is shown in fig. 2.
The method comprises the following specific steps:
step S1: firstly, an input original video image sequence is cut, considering that a target position proposed by a first frame of the video sequence is cut, and a 2-time area of the target position predicted by the previous frame is used as a search area image sequence for cutting, so that a feature map of a target to be tracked and a feature map of a search area in a subsequent tracking process can be obtained through a feature extraction network. Specifically, the truth feature given by the first frame of the target is selected, and the center point of the selected truth feature is taken as the standard, so that the mean value is filled into the initial image input with the length and the width of m, and m is generally 127 as the default. Regarding the first frame information of the search area, taking the central point of the selected truth-value feature as a reference, filling the mean value of the first frame information into initial image input with the length and the width being n, and generally defaulting n to be 256; and filling the information of the subsequent frames of the search area by taking the central point predicted by the previous frame tracking algorithm as a reference.
In addition, for the high and low spatial frequency feature extraction, the input image channels are first expanded to c through a layer of ordinary convolution calculation. c is the number of feature channels generated after the convolution operation of an ordinary feature extraction network, c1 is the number of channels processed by the high-frequency convolution, and c2 is the number of channels processed by the low-frequency convolution, satisfying c = c1 + c2. That is, the multi-channel features generated by an ordinary convolution network are partitioned: the high-frequency convolution is processed in the same way as ordinary feature extraction, while the low-frequency convolution down-samples the output of the first-layer convolution and then performs a normal convolution operation.
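A minimal sketch of this channel partition is given below; the split point and tensor sizes are illustrative assumptions.

# Minimal sketch of splitting the c feature channels into c1 high-frequency and c2 low-frequency channels.
import torch
import torch.nn.functional as F

def split_high_low(features, c1):
    """features: (N, c, H, W) output of the first ordinary convolution."""
    x_h = features[:, :c1]                      # c1 high-frequency channels, full resolution
    x_l = F.avg_pool2d(features[:, c1:], 2)     # c2 = c - c1 low-frequency channels, half resolution
    return x_h, x_l

feat = torch.randn(1, 64, 112, 112)
x_h, x_l = split_high_low(feat, c1=32)          # shapes: (1, 32, 112, 112) and (1, 32, 56, 56)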
Step S2: the information exchange of the high and low frequency features is performed as follows. The feature tensor X obtained by the first block convolution of the feature extraction network is divided, over the total number of channels c, into c1 channel features, taken as the input feature component X^H of the high-frequency convolution, and c2 channel features, taken as the input feature component X^L of the low-frequency convolution. After the convolution calculation, the feature output is decomposed into high-frequency and low-frequency components Y = {Y^H, Y^L}. Through up/down sampling, the spatial resolutions of the outputs of the high- and low-frequency convolution calculations are unified, and the output features can be expressed as:

Y^H = Y^{H→H} + Y^{L→H},    Y^L = Y^{L→L} + Y^{H→L}    (1)

in formula (1), Y^{a→b} indicates that the features are mapped into group b, and the size of the output features equals the spatial resolution of group b; Y^{H→H} and Y^{L→L} indicate the transfer of feature information at the same spatial resolution after the high- and low-frequency convolution calculation, while Y^{H→L} and Y^{L→H} represent the high-low frequency information exchange. In order to calculate the output feature tensor of the new convolution, a convolution kernel W = {W^{H→H}, W^{L→H}, W^{L→L}, W^{H→L}} is provided, which is responsible for convolving the input features X^L and X^H. Thus the feature calculation can be refined as:

Y^H = f(X^H; W^{H→H}) + upsample(f(X^L; W^{L→H}), 2)
Y^L = f(X^L; W^{L→L}) + f(pool(X^H, 2); W^{H→L})    (2)

in formula (2), f(X; W) represents convolution with parameter W, and pool(X, 2) represents down-sampling with average pooling.

Formula (2) is a refinement of the convolution of formula (1). The spatial resolution of the high-frequency convolution feature output is 2 times that of the low-frequency convolution feature, so for the tracking method the receptive field of the low-frequency convolution feature is 2 times that of the high-frequency one. a) The output functions f(X^H; W^{H→H}) and f(X^L; W^{L→L}) are the concrete calculations of Y^{H→H} and Y^{L→L} in formula (1), and pass the high-frequency and low-frequency features on to the next layer of the convolutional network. b) upsample(f(X^L; W^{L→H}), 2) completes the up-sampling of the low-frequency features and passes the low-frequency information to the high-frequency convolution of the next layer, while f(pool(X^H, 2); W^{H→L}) completes the down-sampling of the high-frequency features and passes the high-frequency information to the low-frequency convolution of the next layer. Through this exchange, each convolutional layer obtains contextual information from more distant positions, which improves the performance of the network features.
Specifically, as shown in fig. 2 of the specification, the backbone block of a general deep convolutional network is modified. The left part of fig. 2 shows a general feature extraction convolution: conv denotes convolution calculation, conv1x1 denotes a convolution of size 1x1, and conv3x3 a convolution of size 3x3. A typical deeper network consists of such convolution structures, where s1 denotes a convolution stride of 1. In the right part, low-frequency convolution calculation is added on the basis of the general feature convolution, and the crossed connecting lines represent the operation of formula (2). c1 is the number of high-frequency convolution channels, whose convolution calculation belongs to the high-frequency convolution; c2 is the number of low-frequency convolution channels, whose convolution calculation belongs to the low-frequency convolution, where c = c1 + c2.
Step S3: a function f_θ(z, x) is used to compare the similarity of the target sample image z and the search image x. The correlation map is output by the following formula:

f_θ(z, x) = φ_θ(z) ⋆ φ_θ(x)    (3)

in formula (3), z represents the target image and x represents the search area in a subsequent frame of the video. Specifically, z is a w × h crop centered on the given target truth value, and x is a larger crop centered on the target position estimation result. These two crops are input to the same convolutional network φ_θ for processing; here the convolutional network φ_θ is the high-low frequency convolutional network described above. This corresponds to an exhaustive search with the template image z over the search image x, and the response values of the searched function f_θ(z, x) are used for tracking.
Step S4: a region proposal network is introduced to supervise the correlation map output after the correlation calculation. The supervision part has two branches, one used for classifying foreground objects and background, and one used for proposal regression. The features φ(z) and φ(x), after passing through the successive convolutional layers of the supervision part, are each divided into two branches: one is the cls branch feature used for classification, i.e. [φ(z)]_cls and [φ(x)]_cls, and one is the reg branch feature used for regression, i.e. [φ(z)]_reg and [φ(x)]_reg. The classification branch and the regression branch are then calculated separately by the following formulas:

F_cls = [φ(x)]_cls ⋆ [φ(z)]_cls,    F_reg = [φ(x)]_reg ⋆ [φ(z)]_reg    (4)
in formula (4), the calculation of the two branches combines the correlation calculation of step S3 to output F_cls and F_reg; the two outputs are weighted by hyper-parameters, generally 0.5 and 0.5.
In addition, since the deep network needs to pre-train its parameters, during training the classification branch uses the cross-entropy loss and the regression branch uses the smooth L1 loss.
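For illustration, a minimal sketch of these training losses is given below; the label layout and the equal weighting of the two branches are assumptions, not values fixed by the invention.

# Minimal sketch of the losses: cross-entropy on the classification branch, smooth L1 on the regression branch.
import torch
import torch.nn.functional as F

def rpn_loss(f_cls, f_reg, cls_labels, reg_targets):
    """f_cls: (N, 2k, H, W) scores; f_reg: (N, 4k, H, W) box offsets."""
    n, two_k, h, w = f_cls.shape
    # reshape so each anchor position becomes a 2-way (foreground/background) decision
    logits = f_cls.view(n, 2, two_k // 2, h, w).permute(0, 2, 3, 4, 1).reshape(-1, 2)
    cls_loss = F.cross_entropy(logits, cls_labels.reshape(-1))
    reg_loss = F.smooth_l1_loss(f_reg, reg_targets)
    return cls_loss + reg_loss        # assumed 1:1 weighting

k = 5
f_cls, f_reg = torch.randn(1, 2 * k, 17, 17), torch.randn(1, 4 * k, 17, 17)
cls_labels = torch.randint(0, 2, (1, k, 17, 17))
reg_targets = torch.randn(1, 4 * k, 17, 17)
loss = rpn_loss(f_cls, f_reg, cls_labels, reg_targets)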
In order to evaluate the performance of the algorithm, the invention adopts the tracking data sets OTB-2013 and OTB-2015, commonly used for target tracking, for a preliminary evaluation. A comparison of the two evaluation indexes, AUC and Pre (precision), is given.
In the experiment, the SiamRPN tracking method, a SiamRPN-Res method in which the SiamRPN feature extraction part is replaced by a deeper feature network, and the twin-network tracking method combining high and low frequency features are selected for quantitative evaluation. The comparison of the experimental results is shown in Table 1.
TABLE 1
[Table 1: comparison of AUC and Pre on OTB-2013 and OTB-2015 for SiamRPN, SiamRPN-Res and the proposed method; the numerical values are provided as an image in the original publication.]
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (9)

1. A target tracking method combining high and low spatial frequency characteristics is characterized by comprising the following steps:
s1, firstly, inputting an original video image sequence, cutting out a target image truth value and a search area image sequence according to a target truth value diagram of a first frame provided by the original image sequence, extracting the characteristics of the image sequence by adopting high-low spatial frequency convolution to obtain high-frequency characteristic components and low-frequency characteristic components of each layer, and taking the extracted characteristics as the characteristics for judging a follow-up tracking target;
s2, secondly, exchanging information of the high-frequency characteristic component and the low-frequency characteristic component of each layer, and finally outputting a characteristic value of the tracking sequence after characteristic enhancement;
s3, obtaining a correlation diagram of the target truth value characteristic and the search area characteristic by utilizing correlation calculation according to the target image and the search image sequence characteristic extracted by the high and low spatial frequency characteristic network;
and S4, finally, reprocessing through the regional suggestion network to obtain classification features and regression features, and correcting the target position positioned by the classification features through regression feature parameters to obtain the target position to be tracked.
2. The method for tracking the target according to claim 1, wherein the step S1 cuts the input original video image sequence by cutting the target position proposed by the first frame of the video sequence and cutting the image sequence of the search area from an area 2 times larger than the predicted target position of the previous frame, so as to obtain the feature map of the target to be tracked and the feature map of the search area in the subsequent tracking process through the feature extraction network.
3. The method for tracking the target by combining the high and low spatial frequency features according to claim 1 or 2, wherein the high and low spatial frequency feature extraction in step S1 is a method for processing the channel features of the clipped image, and specifically includes: firstly, the input image channels are expanded to c through a layer of ordinary convolution calculation, where c is the number of feature channels generated after the convolution operation of an ordinary feature extraction network, c1 is the number of channels processed by the high-frequency convolution, and c2 is the number of channels processed by the low-frequency convolution, satisfying c = c1 + c2; that is, the multi-channel features generated by an ordinary convolution network are partitioned, the high-frequency convolution being processed in the same way as ordinary feature extraction, while the low-frequency convolution down-samples the output of the first-layer convolution and then performs a normal convolution operation.
4. The method for tracking the target by combining the high-low spatial frequency features as claimed in claim 3, wherein in step S2 the information exchange of the high and low frequency features is performed by dividing, over the total number of channels c, the feature tensor X obtained by the first block convolution of the feature extraction network into c1 channel features and c2 channel features; the c1 channel features are taken as the input feature component X^H of the high-frequency convolution and the c2 channel features as the input feature component X^L of the low-frequency convolution; after the convolution calculation, the feature output is decomposed into high-frequency and low-frequency components Y = {Y^H, Y^L}, and through up/down sampling the spatial resolutions of the outputs of the high- and low-frequency convolution calculations are unified, so that the output features can be expressed as:

Y^H = Y^{H→H} + Y^{L→H},    Y^L = Y^{L→L} + Y^{H→L}    (1)

in formula (1), Y^{H→H} and Y^{L→L} indicate the transfer of feature information at the same spatial resolution after the high- and low-frequency convolution calculation, while Y^{H→L} and Y^{L→H} represent the high-low frequency information exchange.
5. The method as claimed in claim 4, wherein in order to calculate the output feature tensor of the new convolution, a convolution kernel W = {W^{H→H}, W^{L→H}, W^{L→L}, W^{H→L}} is provided, which is responsible for convolving the input features X^L and X^H, so the feature calculation can be refined as:

Y^H = f(X^H; W^{H→H}) + upsample(f(X^L; W^{L→H}), 2)
Y^L = f(X^L; W^{L→L}) + f(pool(X^H, 2); W^{H→L})    (2)

in formula (2), f(X; W) represents convolution with parameter W, and pool(X, 2) represents down-sampling by average pooling;

a) the output functions f(X^H; W^{H→H}) and f(X^L; W^{L→L}) are the concrete calculations of Y^{H→H} and Y^{L→L} in formula (1), and pass the high-frequency and low-frequency features on to the next layer of the convolutional network; b) upsample(f(X^L; W^{L→H}), 2) completes the up-sampling of the low-frequency features and passes the low-frequency information to the high-frequency convolution of the next layer, while f(pool(X^H, 2); W^{H→L}) completes the down-sampling of the high-frequency features and passes the high-frequency information to the low-frequency convolution of the next layer.
6. The method as claimed in claim 5, wherein the step S3 of obtaining the correlation map of the target truth-value features and the search area features by correlation calculation specifically includes:

a function f_θ(z, x) is used to compare the similarity of the target sample image z and the search image x, and the correlation map is output by the following formula:

f_θ(z, x) = φ_θ(z) ⋆ φ_θ(x)    (3)

in formula (3), z represents the target image, x represents the search area in a subsequent frame of the video, ⋆ denotes the cross-correlation of the two feature maps, z is a w × h crop centered on the given target truth value, and x is a larger crop centered on the target position estimate; the two crops are input to the same convolutional network φ_θ for processing, which is equivalent to an exhaustive search with the template image z over the search image x; the larger the response value of the searched function f_θ(z, x), the greater the likelihood that the mapped location is the target.
7. The method as claimed in claim 5, wherein the region proposal network in step S4 is a method for supervising the correlation map output after the correlation calculation, the supervision part having two branches, one used for classifying the foreground object and the background and one used for proposal regression; the features φ(z) and φ(x), after passing through the successive convolutional layers of the supervision part, are each divided into two branches: one is the cls branch feature used for classification, i.e. [φ(z)]_cls and [φ(x)]_cls, and one is the reg branch feature used for regression, i.e. [φ(z)]_reg and [φ(x)]_reg; the classification branch and the regression branch are then calculated separately by the following formulas:

F_cls = [φ(x)]_cls ⋆ [φ(z)]_cls,    F_reg = [φ(x)]_reg ⋆ [φ(z)]_reg    (4)

in formula (4), the calculation of the two branches combines the correlation calculation of step S3 to output F_cls and F_reg, which are tuned through hyper-parameters to accurately refine the localization result of S3.
8. A medium having a computer program stored therein, wherein the computer program, when read by a processor, performs the method of any of the preceding claims 1 to 7.
9. An object tracking system combining high and low spatial frequency characteristics, comprising:
a feature extraction module: used for inputting an original video image sequence, cropping out the target image truth value and the search area image sequence according to the target truth value diagram of the first frame, performing feature extraction on the image sequence with high-low spatial frequency convolution to obtain the high-frequency feature component and the low-frequency feature component of each layer, and taking the extracted features as the features for judging the subsequent tracking target;
an information exchange module: used for exchanging information between the high-frequency feature component and the low-frequency feature component of each layer, and finally outputting the feature-enhanced feature value of the tracking sequence;
a correlation calculation module: used for obtaining a correlation map of the target truth-value features and the search area features by correlation calculation, according to the target image and search image sequence features extracted by the high-low spatial frequency feature network;
a feature correction module: used for reprocessing through the region proposal network to obtain the classification features and the regression features, and correcting the target position located by the classification features with the regression feature parameters to obtain the target position to be tracked.
CN201911349832.3A 2019-12-24 2019-12-24 Target tracking method, medium and system combining high-low spatial frequency characteristics Active CN111191555B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911349832.3A CN111191555B (en) 2019-12-24 2019-12-24 Target tracking method, medium and system combining high-low spatial frequency characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911349832.3A CN111191555B (en) 2019-12-24 2019-12-24 Target tracking method, medium and system combining high-low spatial frequency characteristics

Publications (2)

Publication Number Publication Date
CN111191555A CN111191555A (en) 2020-05-22
CN111191555B true CN111191555B (en) 2022-05-03

Family

ID=70707585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911349832.3A Active CN111191555B (en) 2019-12-24 2019-12-24 Target tracking method, medium and system combining high-low spatial frequency characteristics

Country Status (1)

Country Link
CN (1) CN111191555B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112561956B (en) * 2020-11-25 2023-04-28 中移(杭州)信息技术有限公司 Video target tracking method and device, electronic equipment and storage medium
CN113160247B (en) * 2021-04-22 2022-07-05 福州大学 Anti-noise twin network target tracking method based on frequency separation

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103456030A (en) * 2013-09-08 2013-12-18 西安电子科技大学 Target tracking method based on scattering descriptor
CN105844669A (en) * 2016-03-28 2016-08-10 华中科技大学 Video target real-time tracking method based on partial Hash features
CN107545582A (en) * 2017-07-04 2018-01-05 深圳大学 Video multi-target tracking and device based on fuzzy logic
CN109271865A (en) * 2018-08-17 2019-01-25 西安电子科技大学 Motion target tracking method based on scattering transformation multilayer correlation filtering
CN110084773A (en) * 2019-03-25 2019-08-02 西北工业大学 A kind of image interfusion method based on depth convolution autoencoder network
CN110210551A (en) * 2019-05-28 2019-09-06 北京工业大学 A kind of visual target tracking method based on adaptive main body sensitivity
CN110570458A (en) * 2019-08-12 2019-12-13 武汉大学 Target tracking method based on internal cutting and multi-layer characteristic information fusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10902243B2 (en) * 2016-10-25 2021-01-26 Deep North, Inc. Vision based target tracking that distinguishes facial feature targets

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103456030A (en) * 2013-09-08 2013-12-18 西安电子科技大学 Target tracking method based on scattering descriptor
CN105844669A (en) * 2016-03-28 2016-08-10 华中科技大学 Video target real-time tracking method based on partial Hash features
CN107545582A (en) * 2017-07-04 2018-01-05 深圳大学 Video multi-target tracking and device based on fuzzy logic
CN109271865A (en) * 2018-08-17 2019-01-25 西安电子科技大学 Motion target tracking method based on scattering transformation multilayer correlation filtering
CN110084773A (en) * 2019-03-25 2019-08-02 西北工业大学 A kind of image interfusion method based on depth convolution autoencoder network
CN110210551A (en) * 2019-05-28 2019-09-06 北京工业大学 A kind of visual target tracking method based on adaptive main body sensitivity
CN110570458A (en) * 2019-08-12 2019-12-13 武汉大学 Target tracking method based on internal cutting and multi-layer characteristic information fusion

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Multi-Task Hierarchical Feature Learning for Real-Time Visual Tracking; Yangliu Kuai et al.; IEEE Sensors Journal; 20181127; entire document *
Target Tracking Algorithm for Specific Person Based on Multi-loss of Siamese Network; Xinhua Liu et al.; 2019 11th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA); 20191007; entire document *
Trajectory Factory: Tracklet Cleaving and Re-Connection by Deep Siamese Bi-GRU for Multiple Object Tracking; Cong Ma et al.; 2018 IEEE International Conference on Multimedia and Expo (ICME); 20181011; entire document *
Long-term target tracking algorithm using efficient convolution operators (采用高效卷积算子的长期目标追踪算法); Li Guoyou et al.; Journal of Chinese Computer Systems (小型微型计算机系统); 20190915; entire document *

Also Published As

Publication number Publication date
CN111191555A (en) 2020-05-22

Similar Documents

Publication Publication Date Title
CN109685831B (en) Target tracking method and system based on residual layered attention and correlation filter
CN107689052B (en) Visual target tracking method based on multi-model fusion and structured depth features
CN111160407B (en) Deep learning target detection method and system
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN111191555B (en) Target tracking method, medium and system combining high-low spatial frequency characteristics
CN111583300B (en) Target tracking method based on enrichment target morphological change update template
CN110992401A (en) Target tracking method and device, computer equipment and storage medium
CN114691973A (en) Recommendation method, recommendation network and related equipment
CN112215079B (en) Global multistage target tracking method
CN112258558A (en) Target tracking method based on multi-scale twin network, electronic device and medium
CN110889864A (en) Target tracking method based on double-layer depth feature perception
CN110349176B (en) Target tracking method and system based on triple convolutional network and perceptual interference learning
CN109685830A (en) Method for tracking target, device and equipment and computer storage medium
Xie et al. Feature-guided spatial attention upsampling for real-time stereo matching network
CN109584267B (en) Scale adaptive correlation filtering tracking method combined with background information
CN113436224B (en) Intelligent image clipping method and device based on explicit composition rule modeling
CN111008992B (en) Target tracking method, device and system and storage medium
CN108460383A (en) Saliency refined method based on neural network and image segmentation
WO2022142084A1 (en) Match screening method and apparatus, and electronic device, storage medium and computer program
CN114066935A (en) Long-term target tracking method based on correlation filtering
CN113298850A (en) Target tracking method and system based on attention mechanism and feature fusion
CN113705325A (en) Deformable single-target tracking method and device based on dynamic compact memory embedding
Shen et al. Anti-distractors: two-branch siamese tracker with both static and dynamic filters for object tracking
CN112529081A (en) Real-time semantic segmentation method based on efficient attention calibration
CN106792510B (en) A kind of prediction type fingerprint image searching method in fingerprint location

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant