CN108665481B - Self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion - Google Patents
- Publication number
- CN108665481B CN201810259132.4A
- Authority
- CN
- China
- Prior art keywords
- target
- pos
- response
- feature
- pixel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V20/42—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10048—Infrared image
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion. First, a series of multi-layer depth feature maps of the same size but different levels is obtained. Then, the multi-layer depth feature maps are converted from the time domain to the frequency domain according to correlation filtering, filter training and response map calculation are performed using the fast Fourier transform, the multi-layer depth feature maps are merged and reduced in dimension by intra-layer feature weighted fusion, feature response maps of different levels are constructed, and the maximum correlation response value is found; its location is the estimated target position. Finally, dense features of the target are extracted and their maximum response value is obtained by correlation filtering, which gives the response confidence of the target center position estimated from the depth convolution features. When the response confidence of the target center position is less than the re-detection threshold T0, the estimated target position is evaluated by online target re-detection, and the target position is adaptively updated according to the evaluation result.
Description
Technical Field
The invention belongs to the technical field of video processing, and particularly relates to a self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion.
Background
Visual tracking is one of the hot spots of research in the field of computer vision, and is widely applied to civil fields such as video monitoring and intelligent transportation. In recent years, with the rapid development of the computer vision field, the comprehensive performance of the tracking algorithm is remarkably improved. Meanwhile, the infrared imaging system detects by using energy generated by the target and identifies the target by acquiring energy information of the target, so that the infrared imaging system has the capabilities of passive detection and all-time detection and is widely applied to target sensing equipment; tracking an object of interest is a main task of an infrared detection system, and therefore, tracking an infrared object is a research hotspot problem nowadays.
Current tracking algorithms can be divided into classical target tracking methods and deep-learning-based target tracking methods. Classical methods fall into two categories, generative and discriminative. Deep-learning-based methods can be divided by training strategy into: (1) pre-training a model on auxiliary image data and fine-tuning it during online tracking; and (2) extracting deep features with a pre-trained CNN classification network.
The generative approach among the classical tracking methods describes the appearance of the target with a generative model and then searches the candidate targets for the one that minimizes the reconstruction error. Representative algorithms include sparse coding, online density estimation and principal component analysis. Generative methods focus on describing the target itself and ignore background information, so they drift easily when the target changes drastically or is occluded.
In contrast, discriminative methods distinguish the target from the background by training a classifier; this approach is often called tracking-by-detection. In recent years, various machine learning algorithms have been applied to discriminative methods; representative examples include multiple-instance learning and structured support vector machines. Because discriminative methods explicitly separate background from foreground information, they are more robust and have gradually become the mainstream in target tracking. Most current deep-learning target tracking methods also fall within the discriminative framework.
Because training data for the tracked target are very limited, deep-learning-based tracking algorithms pre-train on auxiliary non-tracking data to obtain a general representation of object features. During actual tracking, the pre-trained model is fine-tuned with the limited sample information of the current target, so that the model classifies that target more strongly. This transfer-learning idea greatly reduces the need for training samples of the tracked target and also improves tracking performance. As the first tracking algorithm to apply a deep network to single-target tracking, this approach introduced the idea of "offline pre-training + online fine-tuning" and largely solved the problem of insufficient training samples in tracking; however, there are still too few samples to train a large-scale convolutional neural network directly.
The idea of another deep learning tracking method is to directly use a CNN network trained on a large-scale classification database such as ImageNet to obtain deep characteristic representation of a target, and then classify the target by using an observation model to obtain a tracking result. The method not only avoids the over-fitting problem caused by insufficient training samples, but also fully utilizes the strong characterization capability of the depth feature.
In recent years, tracking methods based on correlation filtering have attracted many researchers because they are fast and effective. A correlation filter is trained by regressing the input features onto a target Gaussian distribution, and in subsequent tracking the target is located by finding the response peak in the predicted distribution. Applying the fast Fourier transform in this process yields a large speed-up. Many extensions of correlation filtering now exist, including kernelized correlation filters and correlation filters with scale estimation. More recently, tracking methods that combine depth features with correlation filters have appeared; the main idea is to extract deep features of the region of interest and determine the final target position with a correlation filter, which addresses the real-time problem that existing deep-learning target tracking methods find hard to solve.
Current deep-learning-based target tracking methods mainly train a network model that can distinguish the target from background information and strongly suppress similar non-target objects in the background. Consequently, when the target is occluded for a long time in a complex scene, the tracked target is lost, tracking stability drops, and the tracker cannot robustly re-acquire the target after it reappears.
Disclosure of Invention
In view of the above, the main object of the present invention is to provide a self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
An embodiment of the invention provides a self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion, which comprises the following steps. First, multi-layer depth feature maps of the target candidate region of a video image are obtained through VGG-Net, and these maps are up-sampled to obtain a series of multi-layer depth feature maps of the same size but different levels. Then, the multi-layer depth feature maps are converted from the time domain to the frequency domain according to correlation filtering, filter training and response map calculation are performed using the fast Fourier transform, the multi-layer depth feature maps are merged and reduced in dimension by intra-layer feature weighted fusion, feature response maps of different levels are constructed, and the maximum correlation response value is found; its location is the estimated target position. Finally, dense features of the target are extracted and their maximum response value is obtained by correlation filtering, which gives the response confidence of the target center position estimated from the depth convolution features. When the response confidence of the target center position is less than the re-detection threshold T0, the estimated target position is evaluated by online target re-detection, and the target position is adaptively updated according to the evaluation result.
In the above scheme, obtaining the multi-layer depth feature maps of the target candidate region of the video image through VGG-Net specifically comprises: taking the VGG-Net-19 deep convolutional network as the core network and feeding the multi-dimensional image directly to the network as input, where "19" is the number of weight layers to be learned in the network. The convolutional groups Conv1 to Conv5 contain 2, 2, 4, 4 and 4 convolutional layers respectively, and all convolutional layers use the same 3 × 3 convolution kernel. With VGG-Net-19 trained on the ImageNet dataset, each convolutional layer yields a depth feature map of the target candidate region at a different level.
In the foregoing solution, obtaining a series of multi-layer depth feature maps of the same size but different levels by up-sampling the multi-layer depth feature maps specifically comprises: the output of each convolutional layer is a set of multi-channel feature maps of size M × N × D, where M, N and D are the width, height and number of channels of the feature map respectively; the feature maps of the different levels are up-sampled by bilinear interpolation so that the feature maps of all convolutional layers have the same size.
In the above scheme, when the feature map f is up-sampled, the feature vector at position i is given by formula (1):
$x_i = \sum_k \alpha_{ik} f_k$ (1)
where f is the feature map, x is the up-sampled feature map, and $\alpha_{ik}$ is the interpolation weight, whose value depends on position i and the neighboring feature vectors k.
In the foregoing solution, converting the multi-layer depth feature maps from the time domain to the frequency domain according to correlation filtering, and performing filter training and response map calculation using the fast Fourier transform, specifically comprises: given the up-sampled multi-dimensional feature input x of the tracked target, the correlation-filter-based tracking algorithm learns an optimal correlation filter w from the training data and uses the filter to search the candidate region for the maximum correlation response value, whose location is the estimated target position.
In the foregoing solution, converting the multi-layer depth feature maps from the time domain to the frequency domain according to correlation filtering, and performing filter training and response map calculation using the fast Fourier transform, further specifically comprises: in the t-th frame image, the multi-dimensional convolutional feature input of the target is $x \in \mathbb{R}^{M \times N \times D}$. All cyclic shifts of x in the vertical and horizontal directions are used as samples for training the correlation filter, and each sample is denoted $x_{m,n}$, $(m,n) \in \{0,1,\dots,M-1\} \times \{0,1,\dots,N-1\}$. Given the desired output y(m, n) of each sample, the optimal correlation filter in the t-th frame image is obtained by minimizing the output error, see formula (2):
$w^* = \arg\min_{w} \sum_{m,n} \left\| w \cdot x_{m,n} - y(m,n) \right\|^2 + \lambda \|w\|_2^2$ (2)
where λ ≥ 0 is a regularization parameter and y(m, n) is a two-dimensional Gaussian kernel function with its peak at the center position, given by formula (3):
$y(m,n) = \exp\left(-\frac{(m - M/2)^2 + (n - N/2)^2}{2\sigma^2}\right)$ (3)
where $(m,n) \in \{0,1,\dots,M-1\} \times \{0,1,\dots,N-1\}$ and σ is the width of the Gaussian kernel. By Parseval's theorem, the frequency-domain form of formula (2) is formula (4):
$W^* = \arg\min_{W} \left\| \sum_{d=1}^{D} W^d \odot \bar{X}^d - Y \right\|_F^2 + \lambda \sum_{d=1}^{D} \left\| W^d \right\|_F^2$ (4)
where X, Y and W are the discrete Fourier transforms of x, y and w respectively, $\bar{X}$ is the complex conjugate of X, and ⊙ denotes the element-wise product. The optimal filter on each feature channel d is given by formula (5):
$W^d = \frac{Y \odot X^d}{\sum_{i=1}^{D} X^i \odot \bar{X}^i + \lambda}$ (5)
Given the multi-dimensional convolutional feature map z of the target candidate region in frame t+1, whose discrete Fourier transform is Z, the correlation response map H is obtained from the filter of frame t by formula (6):
$H = F^{-1}\left( \sum_{d=1}^{D} W^d \odot \bar{Z}^d \right)$ (6)
where $F^{-1}$ denotes the inverse discrete Fourier transform; the location of the maximum response value in H is the estimated position of the target in frame t+1;
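multiplying each pixel by a raised cosine window brings the pixel values near the edge close to zero, which can be expressed by formula (7):
$\tilde{x}(m,n) = x(m,n) \cdot \frac{1}{4}\left(1 - \cos\frac{2\pi m}{M-1}\right)\left(1 - \cos\frac{2\pi n}{N-1}\right)$ (7)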
In the above scheme, merging and reducing the dimension of the multi-layer depth feature maps by intra-layer feature weighted fusion, constructing feature response maps of different levels, and finding the maximum correlation response value, whose location is the estimated target position, specifically comprises:
extracting the convolutional features of the three layers Conv3-4, Conv4-4 and Conv5-4 through VGG-Net and computing the correlation response maps $H_3$, $H_4$, $H_5$ of the respective layers; their weighted fusion gives the multi-layer fused correlation response map H:
$H = \beta_1 H_3 + \beta_2 H_4 + \beta_3 H_5$ (8)
where $\beta_1$, $\beta_2$, $\beta_3$ are the fusion weights of the respective convolutional layers. The maximum response value is searched for on the fused correlation response map H; its location is the estimated center position p of the target:
$p = \arg\max_{(m,n)} H(m,n)$ (9)
where (m, n) is a pixel location in the candidate region.
In the above scheme, the extracting dense features of the target, obtaining a maximum response value of the features according to the correlation filtering, and obtaining a response confidence of the target center position estimated by the depth convolution features specifically includes:
first, on the (t+1)-th frame image, a search area block of size a × b is sampled with the estimated target center position as the center coordinate and used as the range for dense feature extraction; the gradient components of each pixel in the horizontal and vertical directions are calculated by formulas (10) and (11), and the length and angle of each pixel's gradient vector are obtained by formulas (12) and (13);
$G_1 = \text{pixel}(\text{pos\_x}+1,\ \text{pos\_y}) - \text{pixel}(\text{pos\_x}-1,\ \text{pos\_y})$ (10)
$G_2 = \text{pixel}(\text{pos\_x},\ \text{pos\_y}+1) - \text{pixel}(\text{pos\_x},\ \text{pos\_y}-1)$ (11)
$S = \sqrt{G_1^2 + G_2^2}$ (12)
$\theta = \arctan(G_1 / G_2)$ (13)
where pixel(pos_x+1, pos_y), pixel(pos_x-1, pos_y), pixel(pos_x, pos_y+1) and pixel(pos_x, pos_y-1) denote the pixels at the four neighboring positions, pos_x and pos_y are the coordinates of the estimated target position, $G_1$ and $G_2$ are the differences between the two neighboring pixels in the horizontal and vertical directions respectively, and S and θ are the length and the angle of the gradient vector;
then, the search area is divided into cells of the same size, and the gradient information of the pixels in each cell, comprising magnitude and direction, is calculated; the gradient magnitude of each pixel contributes a weight to that pixel's direction, and the weights are accumulated over all gradient directions;
the search area is divided into 4 × 4 cells, each a/4 × b/4 pixels in size, and the gradient information of each cell is accumulated in 9 directions, so that the image information of each cell is represented by a 9-dimensional vector; collecting the feature maps of all cells gives the dense features.
Finally, Gaussian correlation filtering is applied to the extracted multi-layer dense features and the maximum response value is found, which gives the response confidence, Confidence, of the target center position estimated from the deep convolution features; this value reflects the reliability of each tracking result,
$\text{Confidence} = \max\left(F^{-1}(E)\right)$ (15)
where zf and xf are the dense feature sets extracted from the current frame and the previous frame respectively, and F is the Fourier transform.
In the above solution, when the response confidence of the target center position is smaller than the re-detection threshold T0, evaluating the estimated target position by online target re-detection specifically comprises: the core of the re-detection module is a linear binary classifier, which classifies the estimated target position into one of two classes by formula (16):
$f(p) = \langle s\_w, p \rangle + s\_b$ (16)
where ⟨·,·⟩ denotes the vector inner product, s_w is the weight vector, which is the normal direction of the hyperplane, and s_b is the offset. The values of s_w and s_b are obtained by training; the problem is solved by the Lagrangian method, with the Lagrangian function of formula (17):
$L(s\_w, s\_b, \alpha) = \frac{1}{2}\|s\_w\|^2 - \sum_{l=1}^{k} \alpha_l \left[ q_l \left( \langle s\_w, p_l \rangle + s\_b \right) - 1 \right]$ (17)
where $\alpha_l \ge 0$ is the Lagrange multiplier of each sample, $(p_1, q_1), \dots, (p_l, q_l), \dots, (p_k, q_k)$ are the histogram-equalized samples, k is the number of samples, $q_l$ equals 1 or -1, and $p_l$ is a d-dimensional vector;
the optimal classification function obtained after solving is given by formula (18):
$f(p) = \mathrm{sgn}\left[ (s\_w^* \cdot p) + s\_b^* \right]$ (18)
where $s\_w^*$ and $s\_b^*$ are the corresponding optimal solutions.
In the above scheme, adaptively updating the position of the target according to the evaluation result specifically comprises: evaluating the estimated target position by online target re-detection yields a series of score values scorcs for the samples; the position of the sample with the maximum score is the re-estimated target position tpos after re-detection. That sample is processed by formulas (14) and (15) to obtain the tracking confidence after re-detection, Confidence2. When formula (19) is satisfied, pos is replaced by tpos, which gives the re-detected target position; if formula (19) is not satisfied, the tracking position is left unchanged;
Confidence2 > 1.1 × Confidence && max(scorcs) > 0 (19).
compared with the prior art, the invention has the following beneficial effects:
the method can not only stably track the target under the condition of target deformation, but also solve the problem of long-time shielding, and has better robustness.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a 1 st frame of an experimental image sequence, the image having 1 object, marked by a white frame;
FIG. 3 is the image sequence (frames 62 to 88) in which the target is occluded; to highlight the change in position, the target center is shown as a red dot;
FIG. 4 shows the tracking results of the fdst method on the experimental image sequence, where FIGS. 4(a), 4(b) and 4(c) are the results for the 70th, 90th and 180th frames respectively;
FIG. 5 shows the tracking results of the HCF method on the experimental image sequence, where FIGS. 5(a), 5(b) and 5(c) are the results for the 70th, 90th and 180th frames respectively;
FIG. 6 shows the tracking results of the method of the present invention on the experimental image sequence, where FIGS. 6(a), 6(b) and 6(c) are the results for the 70th, 90th and 180th frames respectively.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the invention provides a self-adaptive anti-occlusion target tracking method based on multi-layer depth feature fusion, which is specifically realized by the following steps as shown in figure 1:
Step 101: Obtain the multi-layer depth feature representation of the target candidate region of the video image.
The VGG-Net-19 deep convolutional network is used as the core network and the multi-dimensional image is fed directly to the network as input, which avoids complex feature extraction and data reconstruction procedures.
VGG-Net-19 consists of 5 groups of convolutional layers (16 layers in total), 2 fully-connected feature layers and 1 fully-connected classification layer. The groups Conv1 to Conv5 contain 2, 2, 4, 4 and 4 convolutional layers respectively, and all convolutional layers use the same 3 × 3 convolution kernel. With the network trained on the ImageNet dataset, each convolutional layer in VGG-Net-19 provides a feature representation of the target at a different level.
Step 102: Up-sample the feature maps to obtain a series of same-sized maps carrying semantic and detail information at different levels.
The output of each convolutional layer is a set of multi-channel feature maps of size M × N × D, where M, N and D are the width, height and number of channels of the feature map respectively. Because of the pooling operations in VGG-series convolutional networks, feature maps from different levels differ in size, and the deeper the level, the smaller the feature map. Therefore, to better fuse the convolutional feature maps of different levels, the feature maps are up-sampled by bilinear interpolation so that the feature maps of all convolutional layers have the same size.
When the feature map f is up-sampled, the feature vector at position i is given by formula (1):
$x_i = \sum_k \alpha_{ik} f_k$ (1)
where f is the feature map, x is the up-sampled feature map, and $\alpha_{ik}$ is the interpolation weight, whose value depends on position i and the neighboring feature vectors k.
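As an illustration of the up-sampling in formula (1), the following is a minimal NumPy sketch, not the patented implementation. It assumes the feature map is stored as a (rows, columns, channels) array and uses a corner-aligned bilinear scheme; the function name `bilinear_upsample` and the alignment convention are assumptions made for this example.

```python
import numpy as np

def bilinear_upsample(f, out_h, out_w):
    """Resize an (H, W, D) feature map to (out_h, out_w, D) by bilinear interpolation."""
    h, w, _ = f.shape
    # Output sample positions expressed in input coordinates (corner-aligned, an assumption).
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    # Fractional offsets play the role of the interpolation weights alpha_ik.
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = (1 - wx) * f[y0][:, x0] + wx * f[y0][:, x1]
    bottom = (1 - wx) * f[y1][:, x0] + wx * f[y1][:, x1]
    return (1 - wy) * top + wy * bottom
```

Each output vector is a weighted sum of at most four neighboring input vectors, so feature maps from deeper layers can be brought to the size of the shallowest layer before fusion.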
Step 103: Perform filter training and response map calculation in the Fourier domain using correlation filtering.
Given the up-sampled multi-dimensional feature input x of the tracked target, the correlation-filter-based tracking algorithm learns an optimal correlation filter w from the training data and uses the filter to search the candidate region for the maximum correlation response value, whose location is the estimated target position.
In the t-th frame image, the multi-dimensional convolutional feature input of the target is $x \in \mathbb{R}^{M \times N \times D}$. All cyclic shifts of x in the vertical and horizontal directions are used as samples for training the correlation filter, and each sample is denoted $x_{m,n}$, $(m,n) \in \{0,1,\dots,M-1\} \times \{0,1,\dots,N-1\}$. Given the desired output y(m, n) of each sample, the optimal correlation filter in the t-th frame image is obtained by minimizing the output error, see formula (2):
$w^* = \arg\min_{w} \sum_{m,n} \left\| w \cdot x_{m,n} - y(m,n) \right\|^2 + \lambda \|w\|_2^2$ (2)
where λ ≥ 0 is a regularization parameter and y(m, n) is a two-dimensional Gaussian kernel function with its peak at the center position, given by formula (3):
$y(m,n) = \exp\left(-\frac{(m - M/2)^2 + (n - N/2)^2}{2\sigma^2}\right)$ (3)
where $(m,n) \in \{0,1,\dots,M-1\} \times \{0,1,\dots,N-1\}$ and σ is the width of the Gaussian kernel. By Parseval's theorem, the frequency-domain form of formula (2) is formula (4):
$W^* = \arg\min_{W} \left\| \sum_{d=1}^{D} W^d \odot \bar{X}^d - Y \right\|_F^2 + \lambda \sum_{d=1}^{D} \left\| W^d \right\|_F^2$ (4)
where X, Y and W are the discrete Fourier transforms of x, y and w respectively, $\bar{X}$ is the complex conjugate of X, and ⊙ denotes the element-wise product. The optimal filter on each feature channel d is given by formula (5):
$W^d = \frac{Y \odot X^d}{\sum_{i=1}^{D} X^i \odot \bar{X}^i + \lambda}$ (5)
Given the multi-dimensional convolutional feature map z of the target candidate region in frame t+1, whose discrete Fourier transform is Z, the correlation response map H is obtained from the filter of frame t by formula (6):
$H = F^{-1}\left( \sum_{d=1}^{D} W^d \odot \bar{Z}^d \right)$ (6)
where $F^{-1}$ denotes the inverse discrete Fourier transform. The location of the maximum response value in H is the estimated position of the target in frame t+1.
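To reduce the boundary effect, each pixel is multiplied by a raised cosine window, which brings the pixel values near the edge close to zero, as expressed by formula (7):
$\tilde{x}(m,n) = x(m,n) \cdot \frac{1}{4}\left(1 - \cos\frac{2\pi m}{M-1}\right)\left(1 - \cos\frac{2\pi n}{N-1}\right)$ (7)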
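The training and detection steps of Step 103 can be sketched in NumPy as follows. This is an illustrative single-frame sketch under stated assumptions, not the patented implementation: it uses the closed-form per-channel solution of a single-sample ridge regression, the placement of the complex conjugates follows a correlation (rather than convolution) formulation, and the names `gaussian_label`, `cosine_window`, `train_filter` and `response_map`, as well as the values of `sigma` and `lam`, are hypothetical.

```python
import numpy as np

def gaussian_label(M, N, sigma):
    """2-D Gaussian desired output y(m, n) peaking at the centre (M//2, N//2)."""
    m = np.arange(M)[:, None] - M // 2
    n = np.arange(N)[None, :] - N // 2
    return np.exp(-(m ** 2 + n ** 2) / (2.0 * sigma ** 2))

def cosine_window(M, N):
    """Raised-cosine (Hann) window that tapers values near the border to zero."""
    return np.hanning(M)[:, None] * np.hanning(N)[None, :]

def train_filter(x, y, lam=1e-4):
    """Per-channel filter W^d in the Fourier domain for features x of shape (M, N, D)."""
    X = np.fft.fft2(x, axes=(0, 1))
    Y = np.fft.fft2(y)
    # Channel-summed spectral energy plus the regularization term lambda.
    energy = np.sum(X * np.conj(X), axis=2).real + lam
    return Y[:, :, None] * X / energy[:, :, None]

def response_map(W, z):
    """Correlation response H for candidate features z of shape (M, N, D)."""
    Z = np.fft.fft2(z, axes=(0, 1))
    return np.real(np.fft.ifft2(np.sum(W * np.conj(Z), axis=2)))

# Usage on synthetic features (random data stands in for real CNN features).
rng = np.random.default_rng(0)
M, N, D = 32, 32, 4
x = rng.standard_normal((M, N, D)) * cosine_window(M, N)[:, :, None]
W = train_filter(x, gaussian_label(M, N, sigma=2.0))
H = response_map(W, x)  # evaluate the filter on the training patch itself
peak = np.unravel_index(np.argmax(H), H.shape)
```

Evaluated on the training patch, the response peaks at the center of the Gaussian label. With this conjugation convention, a cyclic displacement of the patch by s moves the peak by -s relative to the center, so the sign must be accounted for when converting the peak location into a target displacement.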
Step 104: Merge the multi-layer features and reduce their dimension by intra-layer feature weighted fusion to construct the feature response map.
The convolutional features of the three layers Conv3-4, Conv4-4 and Conv5-4 are extracted with VGG-Net, and the correlation response maps $H_3$, $H_4$, $H_5$ of the respective layers are computed as described above. Their weighted fusion gives the multi-layer fused correlation response map H of formula (8):
$H = \beta_1 H_3 + \beta_2 H_4 + \beta_3 H_5$ (8)
where $\beta_1$, $\beta_2$, $\beta_3$ are the fusion weights of the respective convolutional layers.
Step 105: Find the maximum correlation response value to obtain the estimated position of the target.
The maximum response value is searched for on the fused correlation response map H; its location is the estimated center position p of the target, as in formula (9):
$p = \arg\max_{(m,n)} H(m,n)$ (9)
where (m, n) is a pixel location in the candidate region.
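Formulas (8) and (9) amount to a weighted sum of response maps followed by an arg-max, which the following NumPy sketch illustrates. The function name `fuse_and_locate` and the weight values used in the example are hypothetical; the patent does not state the values of the fusion weights.

```python
import numpy as np

def fuse_and_locate(responses, betas):
    """Weighted fusion of same-sized response maps and location of the peak."""
    H = sum(b * np.asarray(h, dtype=float) for b, h in zip(betas, responses))
    m, n = np.unravel_index(np.argmax(H), H.shape)
    return H, (int(m), int(n))

# Toy response maps standing in for H3, H4 and H5.
H3 = np.zeros((5, 5)); H3[1, 1] = 1.0
H4 = np.zeros((5, 5)); H4[2, 2] = 1.0
H5 = np.zeros((5, 5)); H5[3, 3] = 1.0
H, p = fuse_and_locate([H3, H4, H5], betas=(0.25, 0.5, 1.0))  # assumed weights
```

In this toy case the deepest layer carries the largest weight, so the fused peak follows H5.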
Step 106: Extract the dense features of the target and obtain their maximum response value by correlation filtering, which gives the response confidence, Confidence, of the target center position estimated from the deep convolution features.
On the (t+1)-th frame image, a search area block of size a × b is sampled with the estimated target center position as the center coordinate and used as the range for dense feature extraction. The gradient components of each pixel in the horizontal and vertical directions are calculated by formulas (10) and (11), and the length and angle of each pixel's gradient vector are obtained by formulas (12) and (13).
$G_1 = \text{pixel}(\text{pos\_x}+1,\ \text{pos\_y}) - \text{pixel}(\text{pos\_x}-1,\ \text{pos\_y})$ (10)
$G_2 = \text{pixel}(\text{pos\_x},\ \text{pos\_y}+1) - \text{pixel}(\text{pos\_x},\ \text{pos\_y}-1)$ (11)
$S = \sqrt{G_1^2 + G_2^2}$ (12)
$\theta = \arctan(G_1 / G_2)$ (13)
Here pixel(pos_x+1, pos_y), pixel(pos_x-1, pos_y), pixel(pos_x, pos_y+1) and pixel(pos_x, pos_y-1) denote the pixels at the four neighboring positions, pos_x and pos_y are the coordinates of the estimated target position, $G_1$ and $G_2$ are the differences between the two neighboring pixels in the horizontal and vertical directions respectively, and S and θ are the length and the angle of the gradient vector.
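The per-pixel gradient computation of formulas (10) to (13) can be sketched in NumPy as follows. This is a minimal sketch under assumptions: the image is indexed as `img[y, x]`, border pixels are given zero gradient, and `np.arctan2` is substituted for the plain arctangent of formula (13) to avoid division by zero, with the angle folded into [0, 180) degrees. The function name `pixel_gradients` is hypothetical.

```python
import numpy as np

def pixel_gradients(img):
    """Central-difference gradients, gradient length and folded angle in degrees."""
    img = np.asarray(img, dtype=float)
    g1 = np.zeros_like(img)
    g2 = np.zeros_like(img)
    g1[:, 1:-1] = img[:, 2:] - img[:, :-2]  # horizontal difference, formula (10)
    g2[1:-1, :] = img[2:, :] - img[:-2, :]  # vertical difference, formula (11)
    s = np.hypot(g1, g2)                    # gradient length
    # arctan2(g1, g2) has tangent g1/g2 as in formula (13); folding modulo 180
    # degrees is an assumption made so the angles fit unsigned orientation bins.
    theta = np.degrees(np.arctan2(g1, g2)) % 180.0
    return g1, g2, s, theta

ramp = np.tile(np.arange(5.0), (5, 1))  # intensity grows by 1 per column
g1, g2, s, theta = pixel_gradients(ramp)
```

For this horizontal ramp, interior pixels have a horizontal difference of 2 and a vertical difference of 0.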
Then, the search area is divided into cells of the same size, and the gradient information of the pixels in each cell, comprising magnitude and direction, is calculated. The gradient magnitude of each pixel contributes a weight to that pixel's direction, and the weights are accumulated over all gradient directions. Increasing the number of gradient directions improves detection performance; detection is most effective when the gradient is divided into 9 directions that are accumulated separately (0-20 degrees, 21-40 degrees, ..., 161-180 degrees), and using more than 9 directions does not noticeably improve the re-detection result.
The search area is divided into 4 × 4 cells, each a/4 × b/4 pixels in size. The gradient information of each cell is accumulated in 9 directions, so that the image information of each cell is represented by a 9-dimensional vector; collecting the feature maps of all cells gives the dense feature map.
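The cell-wise accumulation described above can be sketched as a magnitude-weighted orientation histogram. The sketch below assumes hard assignment of each pixel to one of 9 bins of 20 degrees (no interpolation between bins), angles already folded into [0, 180) degrees, and a search block whose sides are divisible by 4; the function name `cell_histograms` is hypothetical.

```python
import numpy as np

def cell_histograms(s, theta, cells=4, bins=9):
    """Return a (cells, cells, bins) array of magnitude-weighted orientation histograms."""
    h, w = s.shape
    ch, cw = h // cells, w // cells
    # Map each angle to a bin index 0..bins-1 (20-degree bins for bins=9).
    bin_idx = np.minimum((theta // (180.0 / bins)).astype(int), bins - 1)
    feat = np.zeros((cells, cells, bins))
    for i in range(cells):
        for j in range(cells):
            rows = slice(i * ch, (i + 1) * ch)
            cols = slice(j * cw, (j + 1) * cw)
            # Each pixel votes for its direction with a weight equal to its gradient length.
            feat[i, j] = np.bincount(bin_idx[rows, cols].ravel(),
                                     weights=s[rows, cols].ravel(),
                                     minlength=bins)
    return feat

s = np.ones((8, 8))
theta = np.full((8, 8), 90.0)
dense = cell_histograms(s, theta)
```

Flattening `dense` gives one 4 × 4 × 9 = 144-dimensional descriptor per search block under these assumptions.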
Finally, Gaussian correlation filtering is applied to the extracted multi-layer dense features and the maximum response value is found. This gives the response confidence, Confidence, of the target center position estimated from the deep convolution features; it reflects the reliability of each tracking result and is given by formulas (14) and (15).
$\text{Confidence} = \max\left(F^{-1}(E)\right)$ (15)
where zf and xf are the dense feature sets extracted from the current frame and the previous frame respectively, and F is the Fourier transform.
Step 107: and evaluating the obtained target estimated position through online target redetection.
Setting a redetection threshold T0And evaluating the estimated target position, repositioning the estimated target position pos when the tracking confidence coefficient is smaller than the threshold value, and starting a re-detection module at the moment.
The core of the re-detection module is a linear two-classifier, the aim is to construct a classification decision function to classify positive and negative samples as correctly as possible, the aim of linear classification is to find one or a group of hyperplanes to completely separate the positive and negative samples around the target, and the binomial classification is carried out through a calculation formula (16):
f(p)=<s_w,p>+s_b (16)
where < , > denotes the vector inner product, s_w is the weight vector, which is the normal direction of the hyperplane, and s_b is the offset; the values of s_w and s_b are obtained by training. This is a constrained optimization problem that can be solved by the Lagrangian method, with the Lagrangian function given by formula (17):
where α_l ≥ 0 is the Lagrange multiplier of each sample, (p_1, q_1), ..., (p_l, q_l), ..., (p_k, q_k) are the histogram-equalized samples, k is the number of samples, q_l is equal to 1 or -1, and p_l is a d-dimensional vector.
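Formula (17) is not reproduced in this text. For orientation only, and assuming the standard hard-margin linear support vector machine formulation (an assumption; the exact expression in the patent may differ), the constrained problem and its Lagrangian can be written with the symbols defined above as:

```latex
\min_{s_w,\, s_b} \; \tfrac{1}{2}\lVert s_w \rVert^2
\quad \text{s.t.} \quad
q_l \left( \langle s_w, p_l \rangle + s_b \right) \ge 1, \qquad l = 1, \dots, k

L(s_w, s_b, \alpha) = \tfrac{1}{2}\lVert s_w \rVert^2
 - \sum_{l=1}^{k} \alpha_l \left[ q_l \left( \langle s_w, p_l \rangle + s_b \right) - 1 \right],
\qquad \alpha_l \ge 0
```

Under this assumption, setting the derivative of L with respect to s_w to zero gives the optimal weight vector as a weighted sum of the training samples, s_w^* = \sum_l \alpha_l q_l p_l.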
The optimal classification function obtained after the solution can be represented by equation (18):
f(p)=sgn[(s_w*·p)+s_b*] (18)
where s_w* and s_b* are the corresponding optimal solutions.
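Formulas (16) and (18) can be sketched in a few lines of Python. The parameter values and sample vectors below are hypothetical and chosen only to show the arithmetic; mapping a decision value of exactly zero to the positive class is also an assumption of the sketch.

```python
def decision_value(s_w, s_b, p):
    """Formula (16): f(p) = <s_w, p> + s_b."""
    return sum(w * v for w, v in zip(s_w, p)) + s_b

def classify(s_w, s_b, p):
    """Formula (18): sign of the decision value (+1 target, -1 background)."""
    return 1 if decision_value(s_w, s_b, p) >= 0 else -1

# Hypothetical trained parameters for a 2-dimensional feature space.
s_w_opt, s_b_opt = [2.0, -1.0], -0.5
samples = [[1.0, 0.5], [0.0, 1.0]]
scores = [decision_value(s_w_opt, s_b_opt, p) for p in samples]
labels = [classify(s_w_opt, s_b_opt, p) for p in samples]
```

The first sample lies on the positive side of the hyperplane and the second on the negative side. The unthresholded decision values play the role of the score values used in Step 108.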
Step 108: Adaptively update the position of the target according to the evaluation result.
The above processing yields a series of score values scorcs for the samples. The maximum score is taken, and the position of the corresponding sample is the re-estimated position tpos of the re-detected target. This sample is processed with formulas (14) and (15) to obtain the tracking confidence after re-detection, Confidence2. When this value satisfies formula (19), pos is replaced by tpos, which gives the target position after re-detection; otherwise the tracking position is left unchanged.
Confidence2 > 1.1 × Confidence && max(scorcs) > 0 (19)
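The update rule of formula (19) amounts to a two-part test, sketched below in Python. The function name `update_position`, the `gain` parameter (defaulting to the factor 1.1 of formula (19)), and the example numbers are assumptions made for illustration.

```python
def update_position(pos, tpos, confidence, confidence2, scorcs, gain=1.1):
    """Return tpos when formula (19) holds, otherwise keep pos.

    confidence  : response confidence of the tracked position pos
    confidence2 : response confidence of the re-detected position tpos
    scorcs      : classifier scores of the re-detection samples
    """
    if confidence2 > gain * confidence and max(scorcs) > 0:
        return tpos
    return pos

# Re-detection is clearly more confident and has a positive score: accept.
accepted = update_position((10, 10), (12, 14), 0.5, 0.6, [0.3, -0.2])
# Confidence gain below 10 percent: keep the tracked position.
rejected_gain = update_position((10, 10), (12, 14), 0.5, 0.52, [0.3, -0.2])
# No sample classified as target: keep the tracked position.
rejected_score = update_position((10, 10), (12, 14), 0.5, 0.9, [-0.1, -0.4])
```

Requiring both conditions means that a re-detected position replaces the tracked one only when the classifier finds a positive sample and the correlation response at that sample is at least 10 percent stronger than the current one.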
The beneficial effects of the invention are illustrated by the following simulation experiments:
1. Experimental conditions
The experiments were run on an Intel Core(TM) i7-4170 2.50 GHz CPU with 8 GB of memory, and the programming platform was MATLAB R2015b. The experiments used a real image sequence containing the target, taken from the DARPA VIVID data: thermal infrared data of a series of vehicles that are occluded by trees and pass through shadows. The image size is 640 x 480, as shown in fig. 2.
To illustrate the advantages of the invention, it is compared with the fDSST and HCF tracking methods, which have performed well over the last two years. The experiments show that the method is superior to the comparison methods in both subjective visual quality and objective evaluation indices. Fig. 4 to 6 show the tracking results of the target of fig. 2 by the method of the present invention and the two comparison methods. The deep-learning-based HCF tracking method is effective when the target shape changes slowly between consecutive video frames, but it easily loses the target when the target is occluded for a long time. fDSST tracking has high real-time performance, but it cannot solve the problem of long-term target tracking. The method of the invention adopts self-adaptive anti-occlusion tracking with multi-layer depth feature fusion: it uses the multi-layer depth convolution features of the target, combining the semantic information and the detail information of the target, and adds confidence evaluation to the tracking process. When the confidence does not satisfy the condition, the target re-detection module is started and the target position is re-determined, so that by correcting the target center position the method can still track the target accurately when it is occluded for a long time.
The objective evaluation indices of the tracking results are given in table 1. CLE is the center location error, the average Euclidean distance in pixels between the target center position estimated by the tracking method and the real target center position; the smaller the value, the better the tracking. OP is the bounding-box overlap ratio, the average overlap between the area of the target box predicted by the tracking method and that of the actual target box; the larger the value, the better the tracking. DP is the distance precision, the ratio of the number of frames whose center location error is below a threshold to the total number of frames in the video; the larger the value, the better, and the threshold was set to 20 pixels in the experiments. fps is the frame rate; the larger the value, the better the real-time performance. As can be seen from table 1, the method of the invention has a clear advantage over the two widely used tracking methods in center location error, overlap ratio and distance precision. Its real-time performance is still lower than that of fDSST, but it is basically equivalent to that of the deep-learning-based tracking method.
Table 1. Objective indices of the tracking results
Method | Average CLE (pixels) | Average OP (%) | Average DP (%) | Average speed (fps) |
---|---|---|---|---|
fDSST | 58.9 | 24 | 25.7 | 25 |
HCF | 7.58 | 90.7 | 88 | 1.6 |
The method of the invention | 4.5 | 96.3 | 93 | 2.2 |
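The evaluation indices defined above can be computed as in the following Python sketch. The function names, the (x, y, w, h) box layout, and the use of intersection-over-union as the overlap measure are assumptions; the text only states that OP measures the average overlap of the predicted and actual boxes. The sample coordinates are made up and are not the data behind table 1.

```python
import math

def center_errors(pred_centers, true_centers):
    """Per-frame Euclidean distance between predicted and true centers."""
    return [math.hypot(p[0] - t[0], p[1] - t[1])
            for p, t in zip(pred_centers, true_centers)]

def mean_cle(pred_centers, true_centers):
    """Average center location error in pixels."""
    errors = center_errors(pred_centers, true_centers)
    return sum(errors) / len(errors)

def distance_precision(pred_centers, true_centers, threshold=20.0):
    """Fraction of frames whose center error is below the threshold."""
    errors = center_errors(pred_centers, true_centers)
    return sum(e < threshold for e in errors) / len(errors)

def overlap(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

# Made-up three-frame example: errors of 0, 5 and 50 pixels.
pred = [(0.0, 0.0), (3.0, 4.0), (30.0, 40.0)]
truth = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
cle = mean_cle(pred, truth)
dp = distance_precision(pred, truth)
iou = overlap((0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 2.0, 2.0))
```

With the 20-pixel threshold used in the experiments, two of the three example frames count as precise, so the distance precision is 2/3.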
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.
Claims (7)
1. A self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion, characterized by comprising the following steps: firstly, acquiring a multi-layer depth feature map of a target candidate region of a video image through VGG-Net, and then up-sampling the multi-layer depth feature map to obtain a series of multi-layer depth feature maps of the same size and different levels; then, converting the multi-layer depth feature map from the time domain to the frequency domain according to correlation filtering, performing filter training and response map calculation according to the fast Fourier transform, merging and reducing the dimension of the multi-layer depth feature map according to intra-layer feature weighted fusion, constructing feature response maps of different levels, and finding the maximum correlation response value, which gives the target estimated position; finally, extracting the dense features of the target, obtaining the maximum response value of the features according to correlation filtering, and obtaining the response confidence of the target center position estimated from the depth convolution features; when the response confidence of the target center position is less than the re-detection threshold T0, evaluating the obtained target estimated position through online target re-detection, and adaptively updating the position of the target according to the evaluation result;
the extracting of the dense features of the target, obtaining the maximum response value of the features according to correlation filtering, and obtaining the response confidence of the target center position estimated from the depth convolution features specifically comprises:
firstly, on the (t+1)-th frame image, sampling with the estimated target center position as the center coordinate to obtain a search area block of size a x b, which is used as the dense feature extraction range; calculating the gradient components of each pixel in the horizontal and vertical directions with formulas (10) and (11), and obtaining the length and the angle of each pixel gradient vector with formulas (12) and (13);
G1 = pixel(pos_x+1, pos_y) - pixel(pos_x-1, pos_y) (10)
G2 = pixel(pos_x, pos_y+1) - pixel(pos_x, pos_y-1) (11)
S = sqrt(G1^2 + G2^2) (12)
θ = arctan(G1/G2) (13)
wherein pixel(pos_x+1, pos_y), pixel(pos_x-1, pos_y), pixel(pos_x, pos_y+1) and pixel(pos_x, pos_y-1) respectively denote the pixel values at the 4 neighboring positions, pos_x and pos_y are the coordinates of the estimated target position, G1 and G2 respectively denote the differences between the 2 pixel points in the horizontal direction and in the vertical direction, and S and θ denote the length and the angle of the gradient vector;
then, dividing the search area into cells of equal size and calculating the gradient information, including magnitude and direction, of the pixels in each cell, wherein each pixel votes for its gradient direction with a weight given by its gradient magnitude, and the votes are accumulated over all gradient directions;
dividing the search area into blocks, namely into 4 x 4 cells of a/4 x b/4 pixels each, counting the gradient information of each cell in 9 directions so that the image information is represented by a 9-dimensional vector, and collecting the feature maps of all cells to obtain the dense features;
finally, performing Gaussian correlation filtering on the extracted multi-layer dense features and finding the maximum response value, thereby obtaining the response confidence Confidence of the target center position estimated from the deep convolution features, which reflects the reliability of each tracking result;
Confidence = max(F^{-1}(E)) (15)
wherein zf and xf are the dense feature sets extracted from the current frame and the previous frame respectively, and F is the Fourier transform;
when the response confidence of the target center position is less than the re-detection threshold T0, evaluating the obtained target estimated position through online target re-detection, specifically: the core of the re-detection module is a linear binary classifier, and the obtained target estimated position is classified into two classes with formula (16):
f(p)=<s_w,p>+s_b (16)
wherein < , > denotes the vector inner product, s_w is the weight vector, which is the normal direction of the hyperplane, s_b is the offset, and the values of s_w and s_b are obtained by training; the problem is solved by the Lagrangian method, with the Lagrangian function given by formula (17):
wherein α_l ≥ 0 is the Lagrange multiplier of each sample, (p_1, q_1), ..., (p_l, q_l), ..., (p_k, q_k) are the histogram-equalized samples, k is the number of samples, q_l is equal to 1 or -1, and p_l is a d-dimensional vector;
the optimal classification function obtained after the solution can be represented by equation (18):
f(p)=sgn[(s_w*·p)+s_b*] (18)
wherein s_w* and s_b* are the corresponding optimal solutions;
the self-adaptive updating of the position of the target according to the evaluation result specifically comprises: evaluating the obtained target estimated position through online target re-detection to obtain a series of score values scorcs of the samples, and taking the maximum score, the position of the corresponding sample being the re-estimated position tpos of the re-detected target; processing this sample with formulas (14) and (15) to obtain the tracking confidence after re-detection, Confidence2; when this value satisfies formula (19), replacing pos with tpos to obtain the re-detected target position, and when it does not satisfy formula (19), leaving the tracking position unchanged;
Confidence2 > 1.1 × Confidence && max(scorcs) > 0 (19).
2. The self-adaptive anti-occlusion infrared target tracking method based on multi-layer depth feature fusion according to claim 1, characterized in that the obtaining of the multi-layer depth feature map of the target candidate region of the video image through VGG-Net specifically comprises: using a VGG-Net-19 deep convolutional network as the core network, with the multi-dimensional image directly used as the network input; where "19" is the number of weight layers to be learned in the network; the convolutional groups Conv1 to Conv5 contain 2, 2, 4, 4 and 4 convolutional layers respectively, all of which use the same 3 x 3 convolution kernel; and each convolutional layer in VGG-Net-19, trained on the ImageNet data set, yields a multi-layer depth feature map of the target candidate region of the video image.
3. The self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion according to claim 1 or 2, characterized in that the up-sampling of the multi-layer depth feature map to obtain a series of multi-layer depth feature maps of the same size and different levels specifically comprises: the output of each convolutional layer is a set of multi-channel feature maps, where M, N and D respectively denote the width, height and number of channels of the feature map; the feature maps of different levels are up-sampled by bilinear interpolation so that the feature maps of all convolutional layers have the same size.
4. The self-adaptive anti-occlusion infrared target tracking method based on multi-layer depth feature fusion according to claim 3, characterized in that a feature map f is up-sampled, and the feature vector at position i is expressed as formula (1):
wherein f is the feature map, x is the up-sampled feature map, and α_ik is the interpolation weight, whose value depends on the position i and the k neighborhood feature vectors.
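Formula (1) is not reproduced in this text. A form consistent with the symbols defined here, assuming the usual expression of interpolation as a weighted sum over neighboring feature vectors (an assumption, not the confirmed wording of the patent), would be:

```latex
x_i = \sum_{k} \alpha_{ik} \, f_k
```

Here the sum runs over the feature vectors f_k in the neighborhood of position i, and for bilinear interpolation the weights α_ik are non-negative and sum to one.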
5. The self-adaptive anti-occlusion infrared target tracking method based on multi-layer depth feature fusion according to claim 4, characterized in that the converting of the multi-layer depth feature map from the time domain to the frequency domain according to correlation filtering, and the performing of filter training and response map calculation according to the fast Fourier transform, specifically comprises: given the multi-dimensional feature input x obtained after the tracked target is up-sampled, a tracking algorithm based on correlation filtering learns an optimal correlation filter w from the training data, and the position of the target is estimated from the maximum correlation response value found by the filter in the candidate region.
6. The self-adaptive anti-occlusion infrared target tracking method based on multi-layer depth feature fusion according to claim 5, characterized in that the converting of the multi-layer depth feature map from the time domain to the frequency domain according to correlation filtering, and the performing of filter training and response map calculation according to the fast Fourier transform, further specifically comprises: in the t-th frame image, the multi-dimensional convolution feature of the target is the input x; all cyclic shifts of x in the vertical and horizontal directions are used as samples for training the correlation filter, and each sample can be denoted x_{m,n}, (m, n) ∈ {0, 1, ..., M-1} × {0, 1, ..., N-1}; given the desired output y(m, n) of each sample, the optimal correlation filter in the t-th frame image is obtained by minimizing the output error, see formula (2):
where λ is a regularization parameter with λ ≥ 0, and y(m, n) is a two-dimensional Gaussian kernel function with its peak at the center position, given by formula (3):
wherein (m, n) ∈ {0, 1, ..., M-1} × {0, 1, ..., N-1}, and σ is the width of the Gaussian kernel; by Parseval's theorem, the above formula is expressed in the frequency domain as formula (4):
wherein X, Y and W are the discrete Fourier transforms of x, y and w respectively, and the remaining symbols denote the complex conjugate of X and the element-wise product; the optimal filter on each feature channel d is given by formula (5):
when a multi-dimensional convolution feature map Z of the target candidate region in the (t+1)-th frame is given and its discrete Fourier transform is taken, a correlation response map H of the t-th frame can be obtained, which can be expressed by formula (6):
wherein F^{-1} denotes the inverse discrete Fourier transform, and the position of the maximum response value in H is the estimated position of the target in the (t+1)-th frame;
multiplying each pixel by a raised cosine window brings the pixel values near the edge close to zero, which can be expressed by formula (7):
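Formulas (2) to (7) are not reproduced in this text. The Python sketch below shows one common closed-form way to train a multi-channel correlation filter in the frequency domain and to read the target displacement from the response map; it is a MOSSE-style formulation chosen for illustration and may differ from the patent's formulas in its conjugation convention. The function names, the value of `lam`, the Gaussian width and the random features are all assumptions.

```python
import numpy as np

def gaussian_label(M, N, sigma=2.0):
    """Desired output y(m, n): 2-D Gaussian with its peak at the patch center."""
    m = np.arange(M)[:, None] - M // 2
    n = np.arange(N)[None, :] - N // 2
    return np.exp(-(m ** 2 + n ** 2) / (2.0 * sigma ** 2))

def cosine_window(M, N):
    """Raised cosine (Hann) window that pulls edge values towards zero."""
    return np.outer(np.hanning(M), np.hanning(N))

def train_filter(x, y, lam=1e-4):
    """Closed-form filter per channel; x has shape (M, N, D)."""
    X = np.fft.fft2(x, axes=(0, 1))
    Y = np.fft.fft2(y)
    energy = np.sum(X * np.conj(X), axis=2).real + lam
    return Y[:, :, None] * np.conj(X) / energy[:, :, None]

def response_map(W, z):
    """Correlation response of features z (M, N, D) under filter W."""
    Z = np.fft.fft2(z, axes=(0, 1))
    return np.real(np.fft.ifft2(np.sum(W * Z, axis=2)))

M, N, D = 32, 32, 3
rng = np.random.default_rng(0)
x = rng.standard_normal((M, N, D)) * cosine_window(M, N)[:, :, None]
W = train_filter(x, gaussian_label(M, N))

z = np.roll(x, shift=(2, -3), axis=(0, 1))   # target moved by (+2, -3)
H = response_map(W, z)
peak = np.unravel_index(np.argmax(H), H.shape)
shift = (int(peak[0]) - M // 2, int(peak[1]) - N // 2)
```

Because the Gaussian label peaks at the patch center, the offset of the response peak from the center gives the estimated displacement of the target between the two frames.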
7. The self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion according to claim 6, characterized in that the merging and dimension reduction of the multi-layer depth feature map according to intra-layer feature weighted fusion, the constructing of feature response maps of different levels and the finding of the maximum correlation response value, which gives the target estimated position, specifically comprises:
extracting the different convolution features of the 3 layers Conv3-4, Conv4-4 and Conv5-4 through VGG-Net and obtaining the maximum response value H3, H4, H5 of each layer; and performing weighted fusion on the response data to obtain the correlation response map H after multi-layer feature fusion:
H = β1·H3 + β2·H4 + β3·H5, wherein β1, β2 and β3 are the fusion weights of the respective convolutional layers; the maximum response value is searched for on the fused correlation response map H, and its position is the estimated center position p of the target, p = arg max H(m, n), where (m, n) is a pixel position in the candidate region.
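The weighted fusion and peak search of claim 7 can be sketched as follows in Python. The function name `fuse_and_locate`, the weight values and the toy response maps are hypothetical; the text does not state the values of β1, β2 and β3.

```python
import numpy as np

def fuse_and_locate(H3, H4, H5, betas=(0.25, 0.5, 1.0)):
    """Fuse three response maps and return the fused map and its peak.

    betas are the fusion weights (beta1, beta2, beta3); the defaults are
    placeholders, not values taken from the patent.
    """
    H = betas[0] * H3 + betas[1] * H4 + betas[2] * H5
    m, n = np.unravel_index(np.argmax(H), H.shape)
    return H, (int(m), int(n))

# Toy maps: the shallow layer peaks at (1, 1), the deeper layers at (2, 3).
H3 = np.zeros((5, 5)); H3[1, 1] = 1.0
H4 = np.zeros((5, 5)); H4[2, 3] = 0.8
H5 = np.zeros((5, 5)); H5[2, 3] = 0.6
H, p = fuse_and_locate(H3, H4, H5)
```

In this toy case the two deeper layers agree on the peak position and outweigh the shallow layer, so the fused estimate follows them.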
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810259132.4A CN108665481B (en) | 2018-03-27 | 2018-03-27 | Self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108665481A CN108665481A (en) | 2018-10-16 |
CN108665481B true CN108665481B (en) | 2022-05-31 |
Family
ID=63782540
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810259132.4A Active CN108665481B (en) | 2018-03-27 | 2018-03-27 | Self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108665481B (en) |
Families Citing this family (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109615640B (en) * | 2018-11-19 | 2021-04-30 | 北京陌上花科技有限公司 | Related filtering target tracking method and device |
US11055574B2 (en) | 2018-11-20 | 2021-07-06 | Xidian University | Feature fusion and dense connection-based method for infrared plane object detection |
CN109741366B (en) * | 2018-11-27 | 2022-10-18 | 昆明理工大学 | Related filtering target tracking method fusing multilayer convolution characteristics |
CN109816689B (en) * | 2018-12-18 | 2022-07-19 | 昆明理工大学 | Moving target tracking method based on adaptive fusion of multilayer convolution characteristics |
CN110427839B (en) * | 2018-12-26 | 2022-05-06 | 厦门瞳景物联科技股份有限公司 | Video target detection method based on multi-layer feature fusion |
CN111291745B (en) * | 2019-01-15 | 2022-06-14 | 展讯通信(上海)有限公司 | Target position estimation method and device, storage medium and terminal |
CN110033472B (en) * | 2019-03-15 | 2021-05-11 | 电子科技大学 | Stable target tracking method in complex infrared ground environment |
CN111738036B (en) * | 2019-03-25 | 2023-09-29 | 北京四维图新科技股份有限公司 | Image processing method, device, equipment and storage medium |
CN109980781B (en) * | 2019-03-26 | 2023-03-03 | 惠州学院 | Intelligent monitoring system of transformer substation |
CN110084834B (en) * | 2019-04-28 | 2021-04-06 | 东华大学 | Target tracking method based on rapid tensor singular value decomposition feature dimension reduction |
CN110189365B (en) * | 2019-05-24 | 2023-04-07 | 上海交通大学 | Anti-occlusion correlation filtering tracking method |
CN110276785B (en) * | 2019-06-24 | 2023-03-31 | 电子科技大学 | Anti-shielding infrared target tracking method |
CN110599519B (en) * | 2019-08-27 | 2022-11-08 | 上海交通大学 | Anti-occlusion related filtering tracking method based on domain search strategy |
CN110782479B (en) * | 2019-10-08 | 2022-07-19 | 中国科学院光电技术研究所 | Visual target tracking method based on Gaussian center alignment |
CN111260689B (en) * | 2020-01-16 | 2022-10-11 | 东华大学 | Confidence enhancement-based correlation filtering visual tracking method |
CN113658217B (en) * | 2021-07-14 | 2024-02-23 | 南京邮电大学 | Self-adaptive target tracking method, device and storage medium |
CN113537241B (en) * | 2021-07-16 | 2022-11-08 | 重庆邮电大学 | Long-term correlation filtering target tracking method based on adaptive feature fusion |
CN113971216B (en) * | 2021-10-22 | 2023-02-03 | 北京百度网讯科技有限公司 | Data processing method and device, electronic equipment and memory |
CN114845137B (en) * | 2022-03-21 | 2023-03-10 | 南京大学 | Video light path reconstruction method and device based on image registration |
CN116563348B (en) * | 2023-07-06 | 2023-11-14 | 中国科学院国家空间科学中心 | Infrared weak small target multi-mode tracking method and system based on dual-feature template |
CN117011196B (en) * | 2023-08-10 | 2024-04-19 | 哈尔滨工业大学 | Infrared small target detection method and system based on combined filtering optimization |
CN117893574A (en) * | 2024-03-14 | 2024-04-16 | 大连理工大学 | Infrared unmanned aerial vehicle target tracking method based on correlation filtering convolutional neural network |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103400129A (en) * | 2013-07-22 | 2013-11-20 | 中国科学院光电技术研究所 | Target tracking method based on frequency domain saliency |
CN106447711A (en) * | 2016-05-26 | 2017-02-22 | 武汉轻工大学 | Multiscale basic geometrical shape feature extraction method |
CN107154024A (en) * | 2017-05-19 | 2017-09-12 | 南京理工大学 | Dimension self-adaption method for tracking target based on depth characteristic core correlation filter |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104240542B (en) * | 2014-09-03 | 2016-08-24 | 南京航空航天大学 | A kind of airdrome scene maneuvering target recognition methods based on geomagnetic sensor network |
CN104730537B (en) * | 2015-02-13 | 2017-04-26 | 西安电子科技大学 | Infrared/laser radar data fusion target tracking method based on multi-scale model |
US9443320B1 (en) * | 2015-05-18 | 2016-09-13 | Xerox Corporation | Multi-object tracking with generic object proposals |
CN106327526B (en) * | 2016-08-22 | 2020-07-07 | 杭州保新科技有限公司 | Image target tracking method and system |
CN107644430A (en) * | 2017-07-27 | 2018-01-30 | 孙战里 | Target following based on self-adaptive features fusion |
Non-Patent Citations (2)
Title |
---|
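Hainan Zhao et al. Robust Object Tracking Using Adaptive Multi-Features Fusion Based on Local Kernel Learning. 2014 Tenth International Conference on Intelligent Information Hiding and Multimedia Signal Processing. 2014, * |
闫俊强 (Yan Junqiang). Research on image-based aerial target tracking algorithms. 《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》 (China Excellent Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology). 2016, No. 08 (2016), * |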
Also Published As
Publication number | Publication date |
---|---|
CN108665481A (en) | 2018-10-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108665481B (en) | Self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion | |
CN104574445B (en) | A kind of method for tracking target | |
CN108062574B (en) | Weak supervision target detection method based on specific category space constraint | |
CN112184752A (en) | Video target tracking method based on pyramid convolution | |
CN107633226B (en) | Human body motion tracking feature processing method | |
CN112733822B (en) | End-to-end text detection and identification method | |
CN110175649B (en) | Rapid multi-scale estimation target tracking method for re-detection | |
CN111311647B (en) | Global-local and Kalman filtering-based target tracking method and device | |
CN109146911B (en) | Target tracking method and device | |
CN109461172A (en) | Manually with the united correlation filtering video adaptive tracking method of depth characteristic | |
CN106295564B (en) | A kind of action identification method of neighborhood Gaussian structures and video features fusion | |
CN107918772B (en) | Target tracking method based on compressed sensing theory and gcForest | |
CN107767416B (en) | Method for identifying pedestrian orientation in low-resolution image | |
CN107944354B (en) | Vehicle detection method based on deep learning | |
CN107369158A (en) | The estimation of indoor scene layout and target area extracting method based on RGB D images | |
CN104484890A (en) | Video target tracking method based on compound sparse model | |
CN110555870A (en) | DCF tracking confidence evaluation and classifier updating method based on neural network | |
CN111027586A (en) | Target tracking method based on novel response map fusion | |
CN110111369A (en) | A kind of dimension self-adaption sea-surface target tracking based on edge detection | |
CN111931722A (en) | Correlated filtering tracking method combining color ratio characteristics | |
CN109448024B (en) | Visual tracking method and system for constructing constraint correlation filter by using depth data | |
CN110751670B (en) | Target tracking method based on fusion | |
CN110827327B (en) | Fusion-based long-term target tracking method | |
Ren et al. | Research on infrared small target segmentation algorithm based on improved mask R-CNN | |
Ju et al. | A novel fully convolutional network based on marker-controlled watershed segmentation algorithm for industrial soot robot target segmentation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||