CN108665481B

CN108665481B - Self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion

Info

Publication number: CN108665481B
Application number: CN201810259132.4A
Authority: CN
Inventors: 秦翰林; 王婉婷; 王春妹; 延翔; 程文雄; 彭昕; 胡壮壮; 周慧鑫
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2018-03-27
Filing date: 2018-03-27
Publication date: 2022-05-31
Anticipated expiration: 2038-03-27
Also published as: CN108665481A

Abstract

The invention discloses a self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion, comprising the following steps. First, a series of multi-layer depth feature maps with the same size and different levels is obtained. Then, the multi-layer depth feature maps are converted from the time domain to the frequency domain according to correlation filtering, filter training and response map calculation are performed with the fast Fourier transform, the multi-layer depth feature maps are merged and reduced in dimension by intra-layer feature weighted fusion, feature response maps of different levels are constructed, and the maximum correlation response value is found, which gives the estimated target position. Finally, dense features of the target are extracted, their maximum response value is obtained by correlation filtering, and this yields the response confidence of the target center position estimated from the deep convolutional features. When the response confidence of the target center position is less than the re-detection threshold T₀, the estimated target position is evaluated by online target re-detection and the position of the target is adaptively updated according to the evaluation result.

Description

Self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion

Technical Field

The invention belongs to the technical field of video processing, and particularly relates to a self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion.

Background

Visual tracking is one of the most active research topics in computer vision and is widely applied in civil fields such as video surveillance and intelligent transportation. In recent years, with the rapid development of computer vision, the overall performance of tracking algorithms has improved remarkably. Meanwhile, an infrared imaging system detects using the energy emitted by the target and identifies the target from this energy information, so it is capable of passive, round-the-clock detection and is widely used in target sensing equipment. Tracking a target of interest is a main task of an infrared detection system, and infrared target tracking is therefore an active research problem.

Current tracking algorithms can be divided into classical target tracking methods and target tracking methods based on deep learning. Classical methods fall into two categories, generative and discriminative. Deep-learning-based methods can be divided according to their training strategy into: (1) pre-training a model on auxiliary image data and fine-tuning it during online tracking; (2) extracting deep features with a pre-trained CNN classification network.

The generative methods among the classical tracking methods describe the appearance of the target with a generative model and then search the candidate targets for the one that minimizes the reconstruction error. Representative algorithms include sparse coding, online density estimation and principal component analysis. Generative methods focus on describing the target itself and ignore background information, so they drift easily when the target changes drastically or is occluded.

In contrast, discriminative methods distinguish the target from the background by training a classifier; this approach is also often called tracking-by-detection. In recent years, various machine learning algorithms have been applied to discriminative methods, representative examples being multiple-instance learning and structured support vector machines. Because discriminative methods explicitly separate background and foreground information, they are more robust and have gradually become the mainstream in target tracking. Most current deep learning target tracking methods also belong to the discriminative framework.

When training data for the tracked target are very limited, tracking algorithms based on deep learning pre-train on auxiliary non-tracking data to obtain a general representation of object features. During actual tracking, the pre-trained model is fine-tuned with the limited sample information of the current target, so that the model classifies the current target more strongly. This transfer learning idea greatly reduces the need for training samples of the tracked target and also improves tracking performance. As the first tracking approach to apply a deep network to single-target tracking, it introduced the idea of 'offline pre-training + online fine-tuning', which largely alleviates the shortage of training samples in tracking; however, directly training a large-scale convolutional neural network still suffers from insufficient samples.

The idea of another deep learning tracking method is to directly use a CNN network trained on a large-scale classification database such as ImageNet to obtain deep characteristic representation of a target, and then classify the target by using an observation model to obtain a tracking result. The method not only avoids the over-fitting problem caused by insufficient training samples, but also fully utilizes the strong characterization capability of the depth feature.

In recent years, tracking methods based on correlation filtering have attracted many researchers because they are fast and effective. A correlation filter trains a filter bank by regressing the input features to a target Gaussian distribution and, in subsequent tracking, locates the target by finding the response peak in the predicted distribution. Using the fast Fourier transform in this computation yields a large speed-up, and many extensions of correlation filtering now exist, including the kernelized correlation filter and correlation filters with scale estimation. More recently, tracking methods that combine deep features with correlation filters have appeared; their main idea is to extract deep features of the region of interest and determine the final target position with a correlation filter, which addresses the real-time problem that existing deep learning target tracking methods find difficult to solve.

Current target tracking methods based on deep learning mainly train a network model that distinguishes the target from background information and strongly suppresses similar non-target objects in the background. As a result, when the target is occluded for a long time in a complex scene, the tracked target is lost, the stability of tracking decreases, and the tracker is not robust in re-acquiring the target after it reappears.

Disclosure of Invention

In view of the above, the main object of the present invention is to provide a self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion.

In order to achieve the purpose, the technical scheme of the invention is realized as follows:

the embodiment of the invention provides a self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion, which comprises the following steps: first, a multilayer depth feature map of a video image target candidate region is obtained through VGG-Net, and the multilayer depth feature map is then upsampled to obtain a series of multilayer depth feature maps with the same size and different levels; then, the multilayer depth feature map is converted from the time domain to the frequency domain according to correlation filtering, filter training and response map calculation are performed with the fast Fourier transform, the multilayer depth feature map is merged and reduced in dimension by intra-layer feature weighted fusion, feature response maps of different levels are constructed, and the maximum correlation response value is found, which gives the estimated target position; finally, the dense features of the target are extracted, their maximum response value is obtained by correlation filtering, and this yields the response confidence of the target center position estimated from the deep convolutional features; when the response confidence of the target center position is less than the re-detection threshold T₀, the obtained target estimated position is evaluated through online target re-detection, and the position of the target is adaptively updated according to the evaluation result.

In the above scheme, obtaining the multilayer depth feature map of the video image target candidate region through VGG-Net specifically includes: taking the VGG-Net-19 deep convolutional network as the core network and directly taking the multidimensional image as network input, where "19" represents the number of weight layers to be learned in the network; the convolutional groups Conv1 to Conv5 contain 2, 2, 4, 4 and 4 convolutions respectively, all convolutional layers use the same 3 × 3 convolution kernel, and, with VGG-Net-19 trained on the ImageNet dataset, each convolutional layer yields a depth feature map of the video image target candidate region.

In the foregoing solution, obtaining a series of multilayer depth feature maps with the same size and different levels by upsampling the multilayer depth feature map specifically includes: the output of each convolutional layer is a set of multi-channel feature maps f ∈ R^{M×N×D}, where M, N and D respectively represent the width, height and number of channels of the feature map; the feature maps of different levels are upsampled by bilinear interpolation so that the feature maps of all convolutional layers have the same size.

In the above scheme, the feature map f is upsampled, and the feature vector at position i is expressed as formula (1):

x_i＝Σ_k α_ik f_k (1)

wherein f is the feature map, x is the upsampled feature map, and α_ik is the interpolation weight, whose value depends on position i and the k neighboring feature vectors.

In the foregoing solution, converting the multilayer depth feature map from the time domain to the frequency domain according to the correlation filtering, and performing filter training and response map calculation according to the fast Fourier transform, specifically includes: given the upsampled multi-dimensional feature input x of the tracked target, a tracking algorithm based on correlation filtering learns an optimal correlation filter w from the training data and uses this filter to find the maximum correlation response in the candidate region, which gives the estimated target position.

In the foregoing solution, converting the multilayer depth feature map from the time domain to the frequency domain according to the correlation filtering, and performing filter training and response map calculation according to the fast Fourier transform, further specifically includes: in the t-th frame image, the multi-dimensional convolutional feature input of the target is x ∈ R^{M×N×D}.

All cyclic shifts of x in the vertical and horizontal directions are used as samples for training the correlation filter, and each sample can be represented as x_{m,n}, (m, n) ∈ {0, 1, ..., M−1} × {0, 1, ..., N−1}; given the desired output y(m, n) of each sample, the optimal correlation filter in the t-th frame image is obtained by minimizing the output error, see equation (2):

where λ is a regularization parameter and λ is not less than zero, y (m, n) is a two-dimensional gaussian kernel function with a peak at the center position, and its expression can be represented by formula (3):

wherein (m, n) ∈ {0, 1, ..., M−1} × {0, 1, ..., N−1} and σ is the width of the Gaussian kernel; by Parseval's theorem, the frequency-domain representation of the above formula is obtained as formula (4):

wherein X, Y and W are the discrete Fourier transforms of x, y and w, respectively, X̄ is the complex conjugate of X, and ⊙ denotes the element-wise product; the optimal filter on each feature channel d can be expressed by equation (5):

given the multi-dimensional convolutional feature map z of the target candidate region in frame t+1, with discrete Fourier transform Z, the correlation response map H is obtained with the filter learned in frame t and can be expressed by formula (6):

wherein F^-1 denotes the inverse discrete Fourier transform, and the position of the maximum response value in H is the estimated position of the target in frame t+1;

multiplying each pixel by a raised cosine window brings the pixel value near the edge close to zero, which can be expressed by equation (7):

in the above scheme, the merging and dimensionality reduction processing is performed on the multilayer depth feature map according to the intra-layer feature weighted fusion, feature response maps of different levels are constructed, and the maximum correlation response value is obtained and is the target estimated position, specifically:

convolutional features of the three layers Conv3-4, Conv4-4 and Conv5-4 are extracted through VGG-Net, and the correlation response map of each layer, H₃, H₄, H₅, is obtained; these are fused by weighting to obtain the correlation response map H after multilayer feature fusion:

H＝β₁H₃+β₂H₄+β₃H₅

wherein β₁, β₂, β₃ are the fusion weights corresponding to the different convolutional layers; the maximum response value is searched for on the fused correlation response map H, and its position is the estimated center position p of the target, p＝arg max H(m,n), where (m, n) is a pixel position in the candidate region.

In the above scheme, the extracting dense features of the target, obtaining a maximum response value of the features according to the correlation filtering, and obtaining a response confidence of the target center position estimated by the depth convolution features specifically includes:

firstly, on a t +1 frame image, sampling by taking the estimated target central position as a central coordinate to obtain a search area block with the size of a multiplied by b, using the search area block as a dense feature extraction range, calculating by using formulas (10) and (11) to obtain gradient components of each pixel in the horizontal and vertical directions, and obtaining the length and the angle of each pixel gradient vector by using formulas (12) and (13);

G₁＝pixel_{(pos_x+1,pos_y)}-pixel_{(pos_x-1,pos_y)} (10)

G₂＝pixel_{(pos_x,pos_y+1)}-pixel_{(pos_x,pos_y-1)} (11)

S＝√(G₁²+G₂²) (12)

θ＝arctan(G₁/G₂) (13)

wherein pixel_{(pos_x+1,pos_y)}, pixel_{(pos_x-1,pos_y)}, pixel_{(pos_x,pos_y+1)} and pixel_{(pos_x,pos_y-1)} denote the 4 neighboring pixels, pos_x and pos_y are the coordinates of the estimated target position, G₁ and G₂ denote the differences between the 2 pixels in the horizontal and vertical directions respectively, and S and θ denote the length and the angle of the gradient vector;

then, dividing the search area into cells with the same size, respectively calculating gradient information of pixels in each cell patch, wherein the gradient information comprises the size and the direction, the gradient size of each pixel contributes different weights to the direction of each pixel, and then the weights are accumulated to all gradient directions;

the search area is divided into blocks; each cell is a/4 × b/4 pixels in size, so each block is divided into 4 × 4 cells; the gradient information of each cell is counted in 9 directions, so that a 9-dimensional vector represents its image information, and the feature maps of all cells are collected to obtain the dense features

Finally, Gaussian correlation filtering is carried out on the extracted multilayer dense features, and the maximum response value is obtained, so that the response Confidence factor Confidence of the target center position estimated through the deep convolution features can be obtained, the value reflects the reliability degree of each tracking result,

Confidence＝max(F^-1(E)) (15)

zf and xf are dense feature sets extracted from the current frame and the previous frame respectively, and F is Fourier transform.

In the above solution, when the response confidence of the target center position is smaller than the re-detection threshold T₀, evaluating the obtained target estimated position through online target re-detection is specifically: the core of the re-detection module is a linear binary classifier, and binary classification of the obtained target estimated position is performed by formula (16):

f(p)＝<s_w,p>+s_b (16)

wherein the angle brackets denote the vector inner product, s_w is the weight vector, which is the normal direction of the hyperplane, and s_b is the offset; the values of s_w and s_b are obtained by training, and the problem is solved with the Lagrangian method, giving the Lagrangian function of formula (17):

wherein α_l ≥ 0 is the Lagrange multiplier of each sample, (p₁, q₁), ..., (p_l, q_l), ..., (p_k, q_k) are the histogram-equalized samples, k is the number of samples, q_l equals 1 or −1, and p_l is a d-dimensional vector;

the optimal classification function obtained after the solution can be represented by equation (18):

f(p)＝sgn[(s_w^*·p)+s_b^*] (18)

wherein s_w^* and s_b^* are the corresponding optimal solutions.

In the above scheme, the adaptively updating the position of the target according to the evaluation result specifically includes: evaluating the obtained target estimated position through on-line target redetection to obtain a series of score values scorcs of the sample, and taking the maximum value of the score values, wherein the position of the sample corresponding to the score values is the re-estimated position tpos of the target after redetection; processing the sample by the formulas (14) and (15) to obtain a detected tracking Confidence Confidence₂When the value meets the formula (19), pos is replaced by tpos to obtain the re-detected target position, and if the value does not meet the formula (19), the tracking position is not changed;

Confidence₂>1.1Confidence&&max(scorcs)>0 (19)

compared with the prior art, the invention has the following beneficial effects:

The method not only tracks the target stably when the target deforms, but also handles long-term occlusion, and is therefore more robust.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is the 1st frame of the experimental image sequence; the image contains 1 target, marked by a white box;

FIG. 3 is the image sequence in which the target is occluded (frames 62 to 88), with the target center marked by a red dot to highlight the change in position;

FIG. 4 shows the tracking results of the fDSST method on the experimental image sequence, where FIGS. 4(a), 4(b) and 4(c) are the results for the 70th, 90th and 180th frame images, respectively;

FIG. 5 shows the tracking results of the HCF method on the experimental image sequence, where FIGS. 5(a), 5(b) and 5(c) are the results for the 70th, 90th and 180th frame images, respectively;

FIG. 6 shows the tracking results of the method of the present invention on the experimental image sequence, where FIGS. 6(a), 6(b) and 6(c) are the results for the 70th, 90th and 180th frame images, respectively.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The embodiment of the invention provides a self-adaptive anti-occlusion target tracking method based on multi-layer depth feature fusion, which is specifically realized by the following steps as shown in figure 1:

Step 101: Acquire the multi-layer depth feature expression of the video image target candidate region.

The VGG-Net-19 deep convolutional network is used as a core network, and the multidimensional image is directly used as network input, so that the complex processes of feature extraction and data reconstruction are avoided.

VGG-Net-19 consists of 5 groups of convolutional layers (16 layers in total), 2 fully-connected feature layers and 1 fully-connected classification layer. The convolutional groups Conv1 to Conv5 contain 2, 2, 4, 4 and 4 convolutions respectively, and all convolutional layers use the same 3 × 3 convolution kernel. After training on the ImageNet dataset, each convolutional layer in VGG-Net-19 provides a different level of feature expression for the target.
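The layer bookkeeping described above can be checked with a few lines of Python. This is only an illustrative sketch: the dictionary and function names are hypothetical, and the per-group counts are those of the standard VGG-19 architecture.

```python
# Convolutions per group in the standard VGG-19 layout (Conv1..Conv5).
# Names below are illustrative, not taken from the described method.
CONVS_PER_GROUP = {"Conv1": 2, "Conv2": 2, "Conv3": 4, "Conv4": 4, "Conv5": 4}
FULLY_CONNECTED = 3  # 2 fully-connected feature layers + 1 classification layer


def weight_layer_count():
    """Total number of layers with learnable weights."""
    return sum(CONVS_PER_GROUP.values()) + FULLY_CONNECTED


def last_conv_names(groups=("Conv3", "Conv4", "Conv5")):
    """Name of the last convolution in each group, e.g. 'Conv3-4'."""
    return [f"{g}-{CONVS_PER_GROUP[g]}" for g in groups]
```

The 16 convolutions plus 3 fully-connected layers account for the "19" in the network name, and the last convolution of groups 3 to 5 gives the layer names Conv3-4, Conv4-4 and Conv5-4 used later for feature extraction.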

Step 102: Perform upsampling to obtain a series of feature maps of the same size that carry semantic information and detail information at different levels.

The output of each convolutional layer is a set of multi-channel feature maps f ∈ R^{M×N×D}, where M, N and D respectively represent the width, height and number of channels of the feature map. However, because of the pooling operations in the VGG family of convolutional networks, the feature maps differ in size between levels, and the deeper the level, the smaller the resulting feature map. Therefore, in order to better fuse the convolutional feature maps of different levels, the feature maps of different levels are upsampled so that the feature maps of all convolutional layers have the same size.

The feature map f is upsampled, and the feature vector at position i is expressed as formula (1):

x_i＝Σ_k α_ik f_k (1)

wherein f is the feature map, x is the upsampled feature map, and α_ik is the interpolation weight, whose value depends on position i and the k neighboring feature vectors.
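As an illustration of the upsampling step, the following NumPy sketch resizes a multi-channel feature map by bilinear interpolation, so that each output vector is a weighted sum of its neighboring input vectors. The function name, the (rows, columns, channels) array layout and the corner-aligned sampling grid are assumptions made for this example, not details given in the description.

```python
import numpy as np


def bilinear_upsample(f, out_h, out_w):
    """Resize an (h, w, D) feature map to (out_h, out_w, D) by bilinear interpolation.

    Assumes a corner-aligned sampling grid (an illustrative choice).
    """
    h, w, _ = f.shape
    # Output sample positions expressed in input coordinates.
    rows = np.linspace(0, h - 1, out_h)
    cols = np.linspace(0, w - 1, out_w)
    r0 = np.floor(rows).astype(int)
    c0 = np.floor(cols).astype(int)
    r1 = np.minimum(r0 + 1, h - 1)
    c1 = np.minimum(c0 + 1, w - 1)
    # Interpolation weights along rows and columns.
    wr = (rows - r0)[:, None, None]
    wc = (cols - c0)[None, :, None]
    top = (1 - wc) * f[r0][:, c0] + wc * f[r0][:, c1]
    bottom = (1 - wc) * f[r1][:, c0] + wc * f[r1][:, c1]
    return (1 - wr) * top + wr * bottom
```

Applying the same target size to the outputs of all selected layers gives feature maps that can be combined position by position.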

Step 103: filter training and response map calculation are performed in the fourier domain using correlation filtering.

Given the upsampled multi-dimensional feature input x of the tracked target, a tracking algorithm based on correlation filtering learns an optimal correlation filter w from the training data and uses this filter to find the maximum correlation response in the candidate region, which gives the estimated target position.

In the t-th frame image, the multi-dimensional convolutional feature input of the target is x ∈ R^{M×N×D}.

All cyclic shifts of x in the vertical and horizontal directions are used as samples for training the correlation filter, and each sample can be represented as x_{m,n}, (m, n) ∈ {0, 1, ..., M−1} × {0, 1, ..., N−1}. Given the desired output y(m, n) of each sample, the optimal correlation filter in the t-th frame image is obtained by minimizing the output error, see equation (2):

where (m, n) ∈ {0, 1, ..., M−1} × {0, 1, ..., N−1} and σ is the width of the Gaussian kernel. By Parseval's theorem, the frequency-domain form of the above formula is obtained as formula (4):

wherein X, Y and W are the discrete Fourier transforms of x, y and w, respectively, X̄ is the complex conjugate of X, and ⊙ denotes the element-wise product. The optimal filter on each feature channel d is given by equation (5):

Given the multi-dimensional convolutional feature map z of the target candidate region in frame t+1, with discrete Fourier transform Z, the correlation response map H is obtained with the filter learned in frame t, as expressed by formula (6):

wherein F^-1 denotes the inverse discrete Fourier transform. The position of the maximum response value in H is the estimated position of the target in frame t+1.

To account for the boundary effect, multiplying each pixel by a raised cosine window brings the pixel value near the edge close to zero, which can be expressed by equation (7).
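The training and detection steps of Step 103 can be sketched in NumPy as follows. This is a minimal example under stated assumptions rather than the exact formulas of the embodiment: it uses the common closed-form multi-channel ridge-regression filter in the Fourier domain, a Hann window as the raised cosine window, and a conjugation convention chosen so that the response to the training patch reproduces the desired Gaussian output. All function names and the default value of the regularization parameter are hypothetical.

```python
import numpy as np


def gaussian_label(h, w, sigma):
    """Desired output y: a 2-D Gaussian with its peak at the patch center."""
    rows = np.arange(h)[:, None] - h // 2
    cols = np.arange(w)[None, :] - w // 2
    return np.exp(-(rows ** 2 + cols ** 2) / (2.0 * sigma ** 2))


def cosine_window(h, w):
    """Raised cosine (Hann) window; values near the patch edges approach zero."""
    return np.outer(np.hanning(h), np.hanning(w))


def train_filter(x, y, lam=1e-4):
    """Closed-form filter per channel for features x of shape (h, w, D)."""
    X = np.fft.fft2(x, axes=(0, 1))
    Y = np.fft.fft2(y)
    # Spectral energy summed over channels, plus the regularization term.
    energy = np.sum(X * np.conj(X), axis=2).real + lam
    return Y[:, :, None] * np.conj(X) / energy[:, :, None]


def response_map(W, z):
    """Correlation response of the learned filter W on candidate features z."""
    Z = np.fft.fft2(z, axes=(0, 1))
    return np.real(np.fft.ifft2(np.sum(W * Z, axis=2)))
```

In use, the windowed features of the current frame train the filter, and the peak of `response_map` on the next frame's windowed features gives the displacement of the target relative to the patch center.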

Step 104: Merge and reduce the dimension of the multi-layer features by intra-layer feature weighted fusion to construct the feature response map.

Convolutional features of the three layers Conv3-4, Conv4-4 and Conv5-4 are extracted with VGG-Net, and the correlation response map of each layer, H₃, H₄, H₅, is obtained by the method above. These are fused by weighting to obtain the multi-layer fused correlation response map H, expressed by formula (8):

H＝β₁H₃+β₂H₄+β₃H₅ (8)

wherein β₁, β₂, β₃ are the fusion weights corresponding to the different convolutional layers.

Step 105: Find the maximum correlation response value to obtain the estimated target position.

The maximum response value is searched for on the fused correlation response map H; its position is the estimated center position p of the target, as given by formula (9):

p＝arg max H(m,n) (9)

where (m, n) is the pixel point location in the candidate region.
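Formulas (8) and (9) amount to a weighted sum of the per-layer response maps followed by an arg max. A minimal NumPy sketch is given below; the function name is hypothetical, and the weight values used in any call are illustrative because the description does not state them.

```python
import numpy as np


def fuse_and_locate(responses, weights):
    """Weighted sum of per-layer response maps, then the (row, col) of the peak."""
    fused = sum(b * h for b, h in zip(weights, responses))
    peak = np.unravel_index(np.argmax(fused), fused.shape)
    return fused, (int(peak[0]), int(peak[1]))
```

For example, `fuse_and_locate([H3, H4, H5], (0.25, 0.5, 1.0))` would favor the deepest layer; these particular weights are an assumption for illustration only.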

Step 106: Extract the dense features of the target and obtain their maximum response value by correlation filtering, which gives the response confidence Confidence of the target center position estimated from the deep convolutional features.

On the (t+1)-th frame image, a search area block of size a × b is sampled with the estimated target center position as the center coordinate and serves as the range of dense feature extraction. The gradient components of each pixel in the horizontal and vertical directions are calculated by formulas (10) and (11), and the length and angle of each pixel's gradient vector are obtained by formulas (12) and (13).

G₁＝pixel_{(pos_x+1,pos_y)}-pixel_{(pos_x-1,pos_y)} (10)

G₂＝pixel_{(pos_x,pos_y+1)}-pixel_{(pos_x,pos_y-1)} (11)

S＝√(G₁²+G₂²) (12)

θ＝arctan(G₁/G₂) (13)

wherein pixel_{(pos_x+1,pos_y)}, pixel_{(pos_x-1,pos_y)}, pixel_{(pos_x,pos_y+1)} and pixel_{(pos_x,pos_y-1)} denote the 4 neighboring pixels, pos_x and pos_y are the coordinates of the estimated target position, G₁ and G₂ denote the differences between the 2 pixels in the horizontal and vertical directions respectively, and S and θ denote the length and the angle of the gradient vector.
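The per-pixel gradient computation can be sketched in NumPy as below. The central differences follow formulas (10) and (11) and the magnitude is the Euclidean length of the gradient vector. Two points are assumptions of this example: border pixels are given zero gradient, and the angle is computed with the conventional `arctan2(vertical, horizontal)` folded into [0°, 180°), which is a common unsigned-orientation convention and not necessarily the exact ratio written in formula (13).

```python
import numpy as np


def pixel_gradients(img):
    """Central-difference gradients, magnitude and unsigned orientation (degrees)."""
    img = np.asarray(img, dtype=float)
    g1 = np.zeros_like(img)  # horizontal component, formula (10)
    g2 = np.zeros_like(img)  # vertical component, formula (11)
    g1[:, 1:-1] = img[:, 2:] - img[:, :-2]
    g2[1:-1, :] = img[2:, :] - img[:-2, :]
    s = np.hypot(g1, g2)                              # gradient vector length
    theta = np.degrees(np.arctan2(g2, g1)) % 180.0    # assumed orientation convention
    return g1, g2, s, theta
```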

Then, the search area is divided into cells of the same size, and the gradient information (magnitude and direction) of the pixels in each cell is calculated. The gradient magnitude of each pixel contributes a weight to its direction, and these weights are accumulated over all gradient directions. Increasing the number of gradient directions improves detection performance; detection is most effective when the gradient is divided into 9 directions that are counted separately (0-20 degrees, 21-40 degrees, ..., 161-180 degrees), and beyond 9 directions the re-detection effect does not improve noticeably.

The search area is divided into blocks; each cell is a/4 × b/4 pixels in size, so each block is divided into 4 × 4 cells. The gradient information of each cell is counted in 9 directions, so that a 9-dimensional vector represents its image information, and the feature maps of all cells are collected to obtain the dense feature map
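The cell statistics described above can be sketched as magnitude-weighted orientation histograms over a 4 × 4 grid of cells with 9 bins of 20 degrees each. This example assumes hard assignment of each pixel to a single bin and no block normalization; both are simplifications chosen for illustration, and the function name is hypothetical.

```python
import numpy as np


def cell_histograms(s, theta, cells=4, bins=9):
    """Magnitude-weighted orientation histograms on a cells x cells grid.

    s: gradient magnitudes, theta: orientations in degrees within [0, 180).
    Returns an array of shape (cells, cells, bins).
    """
    h, w = s.shape
    ch, cw = h // cells, w // cells
    # Hard-assign each pixel to one of the orientation bins (assumption).
    bin_idx = np.minimum((theta / (180.0 / bins)).astype(int), bins - 1)
    feat = np.zeros((cells, cells, bins))
    for i in range(cells):
        for j in range(cells):
            sl = (slice(i * ch, (i + 1) * ch), slice(j * cw, (j + 1) * cw))
            feat[i, j] = np.bincount(
                bin_idx[sl].ravel(), weights=s[sl].ravel(), minlength=bins
            )
    return feat
```

Each cell is thus described by a 9-dimensional vector, and stacking the vectors of all cells gives the dense feature map of the search area.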

Finally, performing gaussian correlation filtering on the extracted multilayer dense features and solving for a maximum response value, so as to obtain a response Confidence factor of the target center position estimated by the deep convolution features, where the response Confidence factor reflects the reliability of each tracking result and can be represented by formulas (14) and (15).

Confidence＝max(F^-1(E)) (15)

Step 107: Evaluate the obtained target estimated position through online target re-detection.

A re-detection threshold T₀ is set to evaluate the estimated target position; when the tracking confidence is smaller than this threshold, the re-detection module is started and the estimated target position pos is re-located.

The core of the re-detection module is a linear binary classifier, whose aim is to construct a classification decision function that classifies positive and negative samples as correctly as possible. Linear classification seeks one hyperplane, or a group of hyperplanes, that completely separates the positive and negative samples around the target, and binary classification is performed by formula (16):

f(p)＝<s_w,p>+s_b (16)

where the angle brackets denote the vector inner product, s_w is the weight vector, which is the normal direction of the hyperplane, and s_b is the offset; the values of s_w and s_b are obtained by training. This is a constrained optimization problem that can be solved with the Lagrangian method, giving the Lagrangian function of formula (17):

wherein α_l ≥ 0 is the Lagrange multiplier of each sample, (p₁, q₁), ..., (p_l, q_l), ..., (p_k, q_k) are the histogram-equalized samples, k is the number of samples, q_l equals 1 or −1, and p_l is a d-dimensional vector. The optimal classification function obtained after solving is given by formula (18):

f(p)＝sgn[(s_w^*·p)+s_b^*] (18)

wherein s_w^* and s_b^* are the corresponding optimal solutions.
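Once s_w and s_b have been trained, formulas (16) and (18) reduce to an inner product plus offset followed by a sign. The sketch below assumes the parameters are already available (training is not shown) and, as an illustrative choice, maps a decision value of exactly zero to the positive class; the function names are hypothetical.

```python
import numpy as np


def decision_values(samples, s_w, s_b):
    """f(p) = <s_w, p> + s_b for each row p of samples, as in formula (16)."""
    return np.asarray(samples, dtype=float) @ np.asarray(s_w, dtype=float) + s_b


def classify(samples, s_w, s_b):
    """Sign of the decision value, as in formula (18); zero maps to +1 (assumption)."""
    return np.where(decision_values(samples, s_w, s_b) >= 0, 1, -1)
```

The raw decision values can also serve as per-sample scores when candidate positions are ranked during re-detection.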

Step 108: Adaptively update the position of the target according to the evaluation result.

The above processing yields a series of score values scorcs for the samples; the maximum score is taken, and the position of the corresponding sample is the re-estimated position tpos of the target after re-detection. The sample is processed by formulas (14) and (15) to obtain the tracking confidence after re-detection, Confidence₂. When this value satisfies formula (19), pos is replaced by tpos, which gives the re-detected target position; otherwise the tracking position is not changed.

Confidence₂>1.1Confidence&&max(scorcs)>0 (19)
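The decision logic of Steps 107 and 108 can be summarized in a short Python function. It is a sketch under stated assumptions: the confidences and scores are passed in as already computed numbers, the function and parameter names (other than pos, tpos and scorcs, which follow the text) are hypothetical, and an empty candidate list simply keeps the current position.

```python
def update_position(pos, confidence, candidates, scorcs, confidence2, t0):
    """Return the target position to keep after the optional re-detection step.

    pos: position estimated by the tracker; confidence: its response confidence.
    candidates/scorcs: re-detection sample positions and their classifier scores.
    confidence2: confidence recomputed at the best re-detected sample.
    t0: re-detection threshold T0.
    """
    if confidence >= t0 or not scorcs:
        return pos  # tracking result is trusted, re-detection is not started
    best = max(range(len(scorcs)), key=lambda i: scorcs[i])
    tpos = candidates[best]
    # Condition of formula (19): clearly higher confidence and a positive score.
    if confidence2 > 1.1 * confidence and scorcs[best] > 0:
        return tpos
    return pos
```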

The beneficial effects of the invention are specifically explained by simulation experiments:

1. conditions of the experiment

The CPU used in the experiment was Intel Core (TM) i 7-41702.50 GHz memory 8GB, and the programming platform was MATLAB R2015 b. The experiment used a real image sequence containing the target, data collected from DARPA VIVID, thermal infrared data for a series of vehicles, some trees occluded and passed through shadows. The size of the image is 640 x 480 as shown in fig. 2.

In order to effectively illustrate the superiority of the invention, the invention is compared with the fDSST and HCF tracking methods which are superior in the last two years. Experiments show that the subjective vision and objective evaluation index of the method are superior to those of a comparison method. Fig. 4 to 6 show the tracking results of the target of fig. 2 by the method of the present invention and two comparative tracking methods. The HCF tracking method based on deep learning is effective for tracking the slow change of the target form of the front frame image and the rear frame image of the video, but when the target is shielded for a long time, the method is easy to lose the target. Although the real-time performance of the fDSST tracking is high, the problem of long-time tracking of the target cannot be solved. The method adopts a self-adaptive anti-occlusion tracking method of multi-layer depth feature fusion, utilizes the multi-layer depth convolution features of the target, combines the semantic information and the detail information of the target, simultaneously adds confidence coefficient evaluation in the tracking process, starts a target re-detection module when the confidence coefficient does not meet the condition, re-determines the position of the target, and can still accurately track the target by correcting the central position of the target when the target is occluded for a long time.

Table 1 gives the objective evaluation indices of the tracking results. CLE is the center location error: the average Euclidean distance, in pixels, between the target center estimated by the tracking method and the true target center; the smaller the value, the better the tracking. OP is the bounding-box overlap ratio: the average overlap between the area of the target box predicted by the tracking method and that of the actual target box; the larger the value, the better the tracking. DP is the distance precision: the ratio of the number of frames whose center location error is below a given threshold to the total number of frames in the video; the larger the value, the better, and the threshold was set to 20 pixels in the experiments. fps is the frame rate; the larger the value, the faster the tracking. As Table 1 shows, the method of the invention has a clear advantage over the two comparison methods in center location error, overlap ratio and distance precision. Its speed is still well below that of fDSST (2.2 fps against 25 fps), but it is comparable to that of the deep-learning-based HCF method (1.6 fps).
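The three accuracy indices can be computed per frame and then averaged, as in the sketch below. It rests on assumptions not stated in the text: boxes are given as (x, y, width, height), and the overlap is measured as intersection over union, which is one common definition of the overlap ratio. The function names and the sample data are illustrative.

```python
import numpy as np


def center_location_error(pred_centers, gt_centers):
    """Per-frame Euclidean distance (pixels) between predicted and true centers."""
    pred = np.asarray(pred_centers, dtype=float)
    gt = np.asarray(gt_centers, dtype=float)
    return np.linalg.norm(pred - gt, axis=1)


def overlap_ratio(pred_boxes, gt_boxes):
    """Per-frame overlap of (x, y, w, h) boxes, assumed here to be IoU."""
    p = np.asarray(pred_boxes, dtype=float)
    g = np.asarray(gt_boxes, dtype=float)
    left = np.maximum(p[:, 0], g[:, 0])
    top = np.maximum(p[:, 1], g[:, 1])
    right = np.minimum(p[:, 0] + p[:, 2], g[:, 0] + g[:, 2])
    bottom = np.minimum(p[:, 1] + p[:, 3], g[:, 1] + g[:, 3])
    # Negative extents mean the boxes do not intersect.
    inter = np.clip(right - left, 0, None) * np.clip(bottom - top, 0, None)
    union = p[:, 2] * p[:, 3] + g[:, 2] * g[:, 3] - inter
    return inter / union


def distance_precision(cle, threshold=20.0):
    """Fraction of frames whose center location error is below the threshold."""
    return float(np.mean(np.asarray(cle) < threshold))


# Illustrative three-frame sequence (made-up numbers).
cle = center_location_error([(0, 0), (3, 4), (30, 40)], [(0, 0), (0, 0), (0, 0)])
iou = overlap_ratio([(0, 0, 10, 10), (5, 0, 10, 10), (50, 50, 10, 10)],
                    [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)])
dp = distance_precision(cle, threshold=20.0)
```

Averaging `cle` and `iou` over all frames gives values comparable to the "Average CLE" and "Average OP" columns, and `dp` corresponds to DP at the 20-pixel threshold used in the experiments.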

Table 1. Objective evaluation indices of the tracking results

Method	Average CLE (pixels)	Average OP (%)	Average DP (%)	Average speed (fps)
fDSST	58.9	24	25.7	25
HCF	7.58	90.7	88	1.6
The method of the invention	4.5	96.3	93	2.2

The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims

1. A self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion, characterized by comprising the following steps: first, acquiring multi-layer depth feature maps of a target candidate region of a video image through VGG-Net, and then up-sampling the multi-layer depth feature maps to obtain a series of multi-layer depth feature maps of the same size and of different levels; then, converting the multi-layer depth feature maps from the time domain to the frequency domain according to correlation filtering, performing filter training and response map calculation by fast Fourier transform, merging and reducing the dimension of the multi-layer depth feature maps according to intra-layer feature weighted fusion, constructing feature response maps of different levels, and finding the maximum correlation response value, whose position is the estimated target position; finally, extracting dense features of the target, obtaining the maximum feature response value according to the correlation filtering, and thereby obtaining the response confidence of the target center position estimated from the deep convolutional features; when the response confidence of the target center position is less than the re-detection threshold T₀, evaluating the obtained estimated target position through online target re-detection, and adaptively updating the position of the target according to the evaluation result;

the extracting of the dense features of the target, obtaining the maximum feature response value according to the correlation filtering, and obtaining the response confidence of the target center position estimated from the deep convolutional features specifically comprises:

G₁＝pixel_{(pos_x+1,pos_y)}-pixel_{(pos_x-1,pos_y)} (10)

G₂＝pixel_{(pos_x,pos_y+1)}-pixel_{(pos_x,pos_y-1)} (11)

θ＝arctan(G₁/G₂) (13)

then, dividing the search area into cells of the same size and calculating the gradient information, comprising magnitude and direction, of the pixels in each cell, wherein the gradient magnitude of each pixel serves as the weight of its contribution to its gradient direction, and the weights are accumulated over all gradient directions;

the search area is divided into cells, each a/4 × b/4 pixels in size, so that the area is divided into 4 × 4 cells; the gradient information of each cell is counted in 9 directions, so that the image information of the cell is represented by a 9-dimensional vector, and the feature maps of all cells are collected to obtain the dense features

Confidence＝max(F^-1(E)) (15)

wherein zf and xf are dense feature sets extracted from a current frame and a previous frame respectively, and F is Fourier transform;

when the response confidence of the target center position is less than the re-detection threshold T₀, evaluating the obtained estimated target position through online target re-detection specifically comprises: the core of the re-detection module is a linear binary classifier, and the obtained estimated target position is classified into two classes by formula (16):

f(p)＝<s_w,p>+s_b (16)

wherein α_l ≥ 0 is the Lagrange multiplier of each sample, (p₁,q₁),...,(p_l,q_l),...,(p_k,q_k) are the histogram-equalized samples, k is the number of samples, q_l equals 1 or −1, and p_l is a d-dimensional vector;

f(p)＝sgn[(s_w^*·p)+s_b^*] (18)

wherein s_w^* and s_b^* are the corresponding optimal solutions;

the self-adaptive updating of the position of the target according to the evaluation result specifically comprises: evaluating the obtained estimated target position through online target re-detection to obtain a series of score values scorcs of the samples, and taking the maximum score, the position of the corresponding sample being the re-estimated position tpos of the re-detected target; processing that sample by formulas (14) and (15) to obtain the tracking confidence after re-detection, Confidence₂; when this value satisfies formula (19), replacing pos by tpos to obtain the re-detected target position, and when it does not satisfy formula (19), leaving the tracking position unchanged;

Confidence₂>1.1Confidence&&max(scorcs)>0 (19)。

2. The multi-layer depth feature fused adaptive anti-occlusion infrared target tracking method according to claim 1, characterized in that acquiring the multi-layer depth feature maps of the target candidate region of the video image through VGG-Net specifically comprises: a VGG-Net-19 deep convolutional network is used as the core network, and the multi-dimensional image is used directly as the network input, where "19" denotes the number of weight layers to be learned in the network; the convolutional groups Conv1 to Conv5 contain 2, 2, 4, 4 and 4 convolutional layers, respectively, all of which use convolution kernels of the same 3 × 3 size; and the convolutional layers of VGG-Net-19, trained on the ImageNet data set, yield the multi-layer depth feature maps of the target candidate region of the video image.

3. The multi-layer depth feature fused adaptive anti-occlusion infrared target tracking method according to claim 1 or 2, characterized in that up-sampling the multi-layer depth feature maps to obtain a series of multi-layer depth feature maps of the same size and of different levels specifically comprises: the output of each convolutional layer is a set of multi-channel feature maps

4. The multi-layer depth feature fusion adaptive anti-occlusion infrared target tracking method according to claim 3, wherein a feature map f is up-sampled, and a feature vector at a position i is represented as formula (1):

wherein f is the feature map, x is the up-sampled feature map, and α_ik is the interpolation weight, whose value depends on position i and on the k neighboring feature vectors.

5. The multi-layer depth feature fused adaptive anti-occlusion infrared target tracking method according to claim 4, characterized in that converting the multi-layer depth feature maps from the time domain to the frequency domain according to the correlation filtering, and performing filter training and response map calculation by the fast Fourier transform, specifically comprises: given the up-sampled multi-dimensional feature input x of the tracked target, the tracking algorithm based on correlation filtering obtains an optimal correlation filter w by learning from training data, and estimates the position of the target from the maximum correlation response value within the candidate region searched by the filter.

6. The multi-layer depth feature fused adaptive anti-occlusion infrared target tracking method according to claim 5, characterized in that converting the multi-layer depth feature maps from the time domain to the frequency domain according to the correlation filtering, and performing filter training and response map calculation by the fast Fourier transform, further specifically comprises: in the t-th frame image, the multi-dimensional convolutional feature of the target is input as

wherein (m, n) ∈ {0, 1, ..., M−1} × {0, 1, ..., N−1}, and σ is the width of the Gaussian kernel; by Parseval's theorem, the above formula can be expressed in the frequency domain as formula (4):

wherein X, Y and W are the discrete Fourier transforms of x, y and w, respectively, the bar over X denotes its complex conjugate, and ⊙ denotes the element-wise product; the optimal filter on each feature channel d is given by formula (5):

7. The multi-layer depth feature fused self-adaptive anti-blocking infrared target tracking method according to claim 6, characterized in that merging and reducing the dimension of the multi-layer depth feature maps according to intra-layer feature weighted fusion, constructing feature response maps of different levels, and finding the maximum correlation response value, whose position is the estimated target position, specifically comprises:

H = β₁H₃ + β₂H₄ + β₃H₅, wherein β₁, β₂ and β₃ are the fusion weights corresponding to the different convolutional layers; the maximum response value is searched for on the fused correlation response map H, and its position is the estimated center position p of the target, p = arg max_{(m,n)} H(m, n), where (m, n) is a pixel position in the candidate region.
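The weighted fusion H = β₁H₃ + β₂H₄ + β₃H₅ and the search for its maximum can be sketched as follows. This is an illustration under stated assumptions only: the three response maps are taken to have the same size already, the function name `fuse_and_locate` is hypothetical, and the weight values (0.25, 0.5, 1.0) are placeholders, since the text does not specify β₁, β₂ and β₃.

```python
import numpy as np


def fuse_and_locate(H3, H4, H5, betas=(0.25, 0.5, 1.0)):
    """Fuse three same-sized response maps and return (H, p).

    betas holds placeholder fusion weights (beta1, beta2, beta3);
    the actual values are not given in the text.
    """
    b1, b2, b3 = betas
    H = b1 * np.asarray(H3, float) + b2 * np.asarray(H4, float) + b3 * np.asarray(H5, float)
    # p = arg max H(m, n): row and column index of the largest fused response.
    m, n = np.unravel_index(np.argmax(H), H.shape)
    return H, (int(m), int(n))


# Illustrative 5 x 5 response maps with a single peak each (made-up data).
H3 = np.zeros((5, 5)); H3[1, 1] = 1.0
H4 = np.zeros((5, 5)); H4[2, 3] = 1.0
H5 = np.zeros((5, 5)); H5[2, 3] = 1.0
H, p = fuse_and_locate(H3, H4, H5)
```

In this example the peak of H₃ disagrees with those of H₄ and H₅, and the fused map places the estimated center at the position supported by the two more heavily weighted layers.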