CN108665481A - Adaptive anti-occlusion infrared target tracking method based on multilayer deep feature fusion - Google Patents
Adaptive anti-occlusion infrared target tracking method based on multilayer deep feature fusion
- Publication number
- CN108665481A CN108665481A CN201810259132.4A CN201810259132A CN108665481A CN 108665481 A CN108665481 A CN 108665481A CN 201810259132 A CN201810259132 A CN 201810259132A CN 108665481 A CN108665481 A CN 108665481A
- Authority
- CN
- China
- Prior art keywords
- target
- characteristic
- response
- pos
- fusion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V20/42—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10048—Infrared image
Abstract
The invention discloses an adaptive anti-occlusion infrared target tracking method based on multilayer deep feature fusion. First, a series of multilayer deep feature maps of identical size but different levels is obtained. Then, according to correlation filtering, the multilayer deep feature maps are transformed from the time domain to the frequency domain, and filter training and response-map computation are performed with the fast Fourier transform (FFT); the multilayer deep feature maps are fused and dimension-reduced by intra-layer weighted feature fusion, the feature response maps of the different levels are constructed, and the maximum correlation response is taken as the estimated target position. Finally, dense features of the target are extracted, the maximum feature response is obtained by correlation filtering, and the response confidence of the target center position estimated from the deep convolutional features is obtained. When the response confidence of the target center position is lower than a re-detection threshold T0, the estimated target position is assessed by online target re-detection and the target position is adaptively updated according to the assessment result.
Description
Technical field
The invention belongs to the technical field of video processing, and in particular relates to an adaptive anti-occlusion infrared target tracking method based on multilayer deep feature fusion.
Background technology
Visual tracking is one of the hot topics of computer vision research and is widely used in civil fields such as video surveillance and intelligent transportation. In recent years, with the rapid development of computer vision, the overall performance of tracking algorithms has improved significantly. Meanwhile, since infrared imaging systems detect the energy emitted by a target and identify the target from the obtained energy information, they offer passive, all-weather detection and are widely used in target-perception equipment. Tracking a target of interest is the main task of an infrared detection system; infrared target tracking is therefore a current research hotspot.
Current tracking algorithms can be divided into classical target tracking methods and deep-learning-based target tracking methods. Classical methods fall into two categories, generative and discriminative, while deep-learning-based methods are further divided by training strategy into: (1) pre-training a model on auxiliary image data and fine-tuning it online during tracking; (2) extracting deep features with a pre-trained CNN classification network.
Generative approaches in classical tracking build a model that describes the target's appearance and then search candidate targets to minimize the reconstruction error. Representative algorithms include sparse coding, online density estimation and principal component analysis. Generative methods focus on describing the target itself and ignore background information, so they are prone to drift when the target changes drastically or is occluded.
Discriminative methods, by contrast, distinguish the target from the background by training a classifier, and are often referred to as tracking-by-detection. In recent years various machine learning algorithms have been used in discriminative methods, among which multiple-instance learning and structured support vector machines are representative. Because discriminative methods explicitly separate foreground from background information, their performance is more robust, and they have gradually come to dominate the target tracking field. Most current deep-learning target tracking methods can also be attributed to the discriminative framework.
Deep-learning-based tracking algorithms, facing the very limited training data available for target tracking, use auxiliary non-tracking training data for pre-training to obtain a generic representation of object features. During actual tracking, the pre-trained model is fine-tuned with the limited sample information of the current target, giving the model stronger classification performance on that target. This transfer-learning idea greatly reduces the demand for tracking training samples and also improves tracking performance. A typical method is the deep learning tracker proposed by Dr. Wang Naiyan of the Hong Kong University of Science and Technology and its improved versions. As the first algorithm to apply a deep network to single-object tracking, it first proposed the idea of "offline pre-training + online fine-tuning", which significantly alleviates the shortage of training samples in tracking; however, the predicament of insufficient samples for directly training a large-scale convolutional neural network remains.
Another line of deep-learning tracking directly uses a CNN trained on a large-scale classification database such as ImageNet to obtain a deep feature representation of the target, and then classifies it with an observation model to obtain the tracking result. This approach not only avoids the over-fitting caused by the lack of training samples, but also makes full use of the powerful representation ability of deep features.
In recent years, correlation-filter-based tracking has attracted the attention of many researchers because of its speed and effectiveness. A correlation filter is trained by regressing the input features to a Gaussian target distribution; in subsequent tracking, the peak of the predicted response distribution locates the target. Using the fast Fourier transform, these methods obtain a substantial speed-up, and many extensions of correlation filtering now exist, including kernelized correlation filters and correlation filters with scale estimation. Trackers combining deep features with correlation filters have gradually appeared in recent years; their main idea is to extract deep features of the region of interest and then use a correlation filter to determine the final target position, which nicely solves the real-time problem that current deep-learning tracking methods find intractable.
Current deep-learning target tracking methods focus on training a network model that distinguishes target from background and clearly suppresses non-target similar objects in the background. Consequently, when the target is occluded by a complex scene for a long time, the tracked target is lost, the stability of tracking decreases, and re-tracking the target after it reappears is not robust.
Summary of the invention
In view of this, the main purpose of the present invention is to provide an adaptive anti-occlusion infrared target tracking method based on multilayer deep feature fusion.
To achieve the above objective, the technical solution of the invention is realized as follows:
An embodiment of the present invention provides an adaptive anti-occlusion infrared target tracking method based on multilayer deep feature fusion. The method is as follows. First, multilayer deep feature maps of the target candidate region of a video image are obtained with VGG-Net, and then up-sampled to obtain a series of multilayer deep feature maps of identical size but different levels. Then, according to correlation filtering, the multilayer deep feature maps are transformed from the time domain to the frequency domain, and filter training and response-map computation are performed with the fast Fourier transform (FFT); the multilayer deep feature maps are fused and dimension-reduced by intra-layer weighted feature fusion, the feature response maps of the different levels are constructed, and the maximum correlation response is taken as the estimated target position. Finally, dense features of the target are extracted, the maximum feature response is obtained by correlation filtering, and the response confidence of the target center position estimated from the deep convolutional features is obtained. When the response confidence of the target center position is lower than a re-detection threshold T0, the estimated target position is assessed by online target re-detection and the target position is adaptively updated according to the assessment result.
In said program, the multilayer deep feature maps of the target candidate region of the video image are obtained with VGG-Net, specifically as follows. The VGG-Net-19 deep convolutional network is used as the backbone, with the multi-dimensional image fed directly as network input; "19" denotes the number of layers in the network whose weights are learned. From Conv1 to Conv5 the convolutional groups contain 2, 2, 4, 4 and 4 convolutional layers respectively, and all convolutional layers use the same 3 × 3 convolution kernels. After training on the ImageNet data set, each convolutional layer of VGG-Net-19 yields the multilayer deep feature maps of the target candidate region of the video image.
In said program, the multilayer deep feature maps are up-sampled to obtain a series of multilayer deep feature maps of identical size but different levels, specifically as follows. The output of each convolutional layer is a group of multi-channel feature maps f ∈ R^(M×N×D), where M, N and D denote the width, height and number of channels of the feature maps respectively; the feature maps of the different levels are up-sampled by bilinear interpolation so that the feature maps of all convolutional layers have identical size.

In said program, the feature map f is up-sampled; the feature vector at position i is expressed by formula (1):

x_i = Σ_k α_ik f_k  (1)

where f is the feature map, x is the up-sampled feature map, and α_ik is the interpolation weight, whose value depends on position i and the neighboring feature vectors k.
In said program, the multilayer deep feature maps are transformed from the time domain to the frequency domain according to correlation filtering, and filter training and response-map computation are performed with the fast Fourier transform, specifically as follows: given the up-sampled multi-dimensional feature input x of the tracked target, the correlation-filter-based tracking algorithm learns an optimal correlation filter w* from the training data; the maximum correlation response in the candidate region is found with this filter and the target position is estimated.
In said program, the multilayer deep feature maps are transformed from the time domain to the frequency domain according to correlation filtering, and filter training and response-map computation are performed with the fast Fourier transform, more specifically as follows. In frame t, the multi-dimensional convolutional feature input of the target is x ∈ R^(M×N×D). All cyclic shifts of x in the vertical and horizontal directions serve as training samples for the correlation filter; each sample is denoted x(m,n), (m, n) ∈ {0, 1, ..., M−1} × {0, 1, ..., N−1}. Given the desired output y(m, n) of each sample, the optimal correlation filter in frame t is obtained by minimizing the output error, see formula (2):

w* = argmin_w Σ(m,n) |w·x(m,n) − y(m, n)|² + λ|w|²  (2)

where λ is a regularization parameter not less than zero, and y(m, n) is a two-dimensional Gaussian kernel function with its peak at the center, expressed by formula (3):

y(m, n) = exp(−((m − M/2)² + (n − N/2)²)/(2σ²))  (3)

where (m, n) ∈ {0, 1, ..., M−1} × {0, 1, ..., N−1} and σ is the width of the Gaussian kernel. By Parseval's theorem, the frequency-domain form of the above is formula (4):

W* = argmin_W |W̄ ⊙ X − Y|² + λ|W|²  (4)

where X, Y and W are the discrete Fourier transforms of x, y and w respectively, X̄ is the complex conjugate of X, and ⊙ denotes element-wise multiplication. The optimal filter on each feature channel d is expressed by formula (5):

W^d = (Y ⊙ X̄^d)/(Σ(i=1..D) X^i ⊙ X̄^i + λ)  (5)

Given the multi-dimensional convolutional feature map z of the target candidate region in frame t+1, with discrete Fourier transform Z, the correlation response map H with respect to frame t is obtained by formula (6):

H = F^(−1)(Σ(d=1..D) W^d ⊙ Z^d)  (6)

where F^(−1) denotes the inverse discrete Fourier transform; the position of the maximum response in H is the estimated target position in frame t+1.

To suppress the boundary effect, each pixel is multiplied by a raised cosine window so that pixel values near the edge approach zero, see formula (7):

win(m, n) = (1 − cos(2πm/(M−1)))(1 − cos(2πn/(N−1)))/4  (7)
In said program, the multilayer deep feature maps are fused and dimension-reduced by intra-layer weighted feature fusion, the feature response maps of the different levels are constructed, and the maximum correlation response is taken as the estimated target position, specifically as follows:
The different convolutional features of the three layers Conv3-4, Conv4-4 and Conv5-4 are extracted with VGG-Net and the maximum response maps H3, H4, H5 of each layer are obtained; weighted fusion yields the fused correlation response map H: H = β1·H3 + β2·H4 + β3·H5, where β1, β2, β3 are the fusion weights of the corresponding convolutional layers. The maximum response is searched on the fused correlation response map H; its position is the estimated target center position p, p = argmax H(m, n), where (m, n) is a pixel position in the candidate region.
In said program, the dense features of the target are extracted, the maximum feature response is obtained according to correlation filtering, and the response confidence of the target center position estimated from the deep convolutional features is obtained, specifically as follows:

First, on frame t+1, a search region block of a × b pixels is sampled around the estimated target center position as the range of dense feature extraction. The horizontal and vertical gradient components of each pixel are computed with formulas (10) and (11), and the magnitude and angle of each pixel's gradient vector are obtained by formulas (12) and (13):

G1 = pixel(pos_x+1, pos_y) − pixel(pos_x−1, pos_y)  (10)
G2 = pixel(pos_x, pos_y+1) − pixel(pos_x, pos_y−1)  (11)
S = √(G1² + G2²)  (12)
θ = arctan(G1/G2)  (13)

where pixel(pos_x+1, pos_y), pixel(pos_x−1, pos_y), pixel(pos_x, pos_y+1) and pixel(pos_x, pos_y−1) denote the values at the four neighboring pixel positions, (pos_x, pos_y) is the estimated target position, G1 and G2 are the gray-level differences in the horizontal and vertical directions respectively, and S and θ are the magnitude and angle of the gradient vector;

Then the search region is divided into cells of identical size, and the gradient information (magnitude and direction) of the pixels in each cell is computed; the gradient magnitude of each pixel contributes a different weight to its direction, and these weights are accumulated over all gradient directions;

The search region is divided into cells of a/4 × b/4 pixels, i.e. 4 × 4 cells; the gradient information of each cell is accumulated into 9 direction bins, so that its image information is represented by a 9-dimensional vector. Collecting the feature maps of all cells yields the dense features;

Finally, Gaussian correlation filtering is applied to the extracted multilayer dense features and the maximum response is computed, giving the response confidence Confidence of the target center position estimated from the deep convolutional features; this value reflects the reliability of each tracking result:

Confidence = max(F^(−1)(E))  (15)

where zf and xf are the dense feature sets extracted from the current frame and the previous frame respectively, and F denotes the Fourier transform.
In said program, when the response confidence of the target center position is lower than the re-detection threshold T0, the estimated target position is assessed by online target re-detection, specifically as follows: the core of the re-detection module is a linear binary classifier; binomial classification of the estimated target position is performed by formula (16):

f(p) = <s_w, p> + s_b  (16)

where <·,·> is the vector inner product, s_w is the weight vector (the normal of the hyperplane), s_b is the offset, and the values of s_w and s_b obtained by training are solved by the Lagrangian method, with the Lagrangian function given by formula (17):

L(s_w, s_b, α) = |s_w|²/2 − Σ(l=1..k) α_l [q_l(<s_w, p_l> + s_b) − 1]  (17)

where α_l ≥ 0 is the Lagrange multiplier of each sample, (p1, q1), ..., (pl, ql), ..., (pk, qk) are the samples after histogram equalization, k is the number of samples, q_l equals 1 or −1, and p_l is a d-dimensional vector;

The optimal classification function obtained after solving is expressed by formula (18):

f(p) = sgn[(s_w*·p) + s_b*]  (18)

where s_w* and s_b* are the corresponding optimal solutions.
In said program, the target position is adaptively updated according to the assessment result, specifically as follows: assessing the estimated target position by online target re-detection yields a series of sample scores; the maximum of these is taken, and the sample position corresponding to that value is the re-estimated target position tpos after re-detection. This sample is processed with formulas (14) and (15) to obtain the post-detection tracking confidence Confidence2; when this value satisfies formula (19), pos is replaced with tpos to obtain the target position after re-detection, and if it is not satisfied the tracking position remains unchanged:

Confidence2 > 1.1·Confidence && max(scores) > 0  (19).
Compared with the prior art, the beneficial effects of the present invention are:
The present invention can not only track a target stably under target deformation, but can also solve the problem of long-term occlusion, with good robustness.
Description of the drawings
Fig. 1 is the flow chart of the present invention;
Fig. 2 is the 1st frame of the experimental image sequence; the image contains one target, marked with a white box;
Fig. 3 is the image sequence after target occlusion (frames 62 to 88); to highlight the position change, the target center position is indicated with a red dot;
Fig. 4 shows tracking results of different methods on the experimental image sequence; Figs. 4(a), 4(b) and 4(c) are the tracking results of the fDSST method on frames 70, 90 and 180 respectively;
Fig. 5 shows tracking results of different methods on the experimental image sequence; Figs. 5(a), 5(b) and 5(c) are the tracking results of the HCF method on frames 70, 90 and 180 respectively;
Fig. 6 shows tracking results of different methods on the experimental image sequence; Figs. 6(a), 6(b) and 6(c) are the tracking results of the method of the present invention on frames 70, 90 and 180 respectively.
Detailed description of the embodiments
In order to make the purpose, technical solution and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the present invention and are not intended to limit it.
An embodiment of the present invention provides an adaptive anti-occlusion target tracking method based on multilayer deep feature fusion which, as shown in Fig. 1, is realized by the following steps:
Step 101: Obtain the multilayer deep feature representation of the target candidate region of the video image.
The VGG-Net-19 deep convolutional network is used as the backbone, with the multi-dimensional image fed directly as network input, avoiding complex feature extraction and data reconstruction.
VGG-Net-19 consists mainly of 5 groups of convolutional layers (16 layers in total), 2 fully connected feature layers and 1 fully connected classification layer. From Conv1 to Conv5 the groups contain 2, 2, 4, 4 and 4 convolutional layers respectively, and all convolutional layers use the same 3 × 3 convolution kernels. After training on the ImageNet data set, each convolutional layer of VGG-Net-19 provides a feature representation of the target at a different level.
Step 102: Up-sample to obtain semantic and detail information of identical size at different levels.
The output of each convolutional layer is a group of multi-channel feature maps f ∈ R^(M×N×D), where M, N and D denote the width, height and number of channels of the feature maps respectively. Owing to the pooling operations peculiar to the VGG family of convolutional networks, the feature maps obtained at different levels differ in size, and the deeper the level, the smaller the feature map. Therefore, to better fuse the convolutional feature maps across levels, the feature maps of the different levels are up-sampled so that the feature maps of all convolutional layers have identical size.

The feature map f is up-sampled; the feature vector at position i is expressed by formula (1):

x_i = Σ_k α_ik f_k  (1)

where f is the feature map, x is the up-sampled feature map, and α_ik is the interpolation weight, whose value depends on position i and the neighboring feature vectors k.
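As a sketch of formula (1), the bilinear up-sampling can be written in a few lines of NumPy (Python is used here for illustration rather than the MATLAB platform of the experiments; the function name and the toy values are ours). The weights α_ik are the standard bilinear interpolation weights determined by the output position i and its four nearest source positions k.

```python
import numpy as np

def bilinear_upsample(f, out_h, out_w):
    """Up-sample an (H, W, D) feature map to (out_h, out_w, D).

    Each output vector x_i is a weighted sum sum_k alpha_ik * f_k over
    the four nearest source positions k, as in formula (1)."""
    h, w, d = f.shape
    # Map each output coordinate back to a (fractional) source coordinate.
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :, None]   # horizontal interpolation weights
    top = f[y0][:, x0] * (1 - wx) + f[y0][:, x1] * wx
    bot = f[y1][:, x0] * (1 - wx) + f[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

# A 2x2 single-channel map up-sampled to 3x3: corners are preserved and
# the new centre is the mean of the four source values.
f = np.array([[[0.0], [1.0]], [[2.0], [3.0]]])
x = bilinear_upsample(f, 3, 3)
```

Applied to each convolutional layer's output with a common target size, this makes the Conv3-4, Conv4-4 and Conv5-4 feature maps directly comparable for the fusion in the later steps.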
Step 103: Train the filter by correlation filtering and compute the response map in the Fourier domain.
Given the up-sampled multi-dimensional feature input x of the tracked target, the correlation-filter-based tracking algorithm learns an optimal correlation filter w* from the training data; the maximum correlation response in the candidate region found with this filter gives the target position estimate.

In frame t, the multi-dimensional convolutional feature input of the target is x ∈ R^(M×N×D). All cyclic shifts of x in the vertical and horizontal directions serve as training samples for the correlation filter; each sample is denoted x(m,n), (m, n) ∈ {0, 1, ..., M−1} × {0, 1, ..., N−1}. Given the desired output y(m, n) of each sample, the optimal correlation filter in frame t is obtained by minimizing the output error, see formula (2):

w* = argmin_w Σ(m,n) |w·x(m,n) − y(m, n)|² + λ|w|²  (2)

where λ is a regularization parameter not less than zero, and y(m, n) is a two-dimensional Gaussian kernel function with its peak at the center, expressed by formula (3):

y(m, n) = exp(−((m − M/2)² + (n − N/2)²)/(2σ²))  (3)

where (m, n) ∈ {0, 1, ..., M−1} × {0, 1, ..., N−1} and σ is the width of the Gaussian kernel. By Parseval's theorem, the frequency-domain form of the above is formula (4):

W* = argmin_W |W̄ ⊙ X − Y|² + λ|W|²  (4)

where X, Y and W are the discrete Fourier transforms of x, y and w respectively, X̄ is the complex conjugate of X, and ⊙ denotes element-wise multiplication. The optimal filter on each feature channel d is expressed by formula (5):

W^d = (Y ⊙ X̄^d)/(Σ(i=1..D) X^i ⊙ X̄^i + λ)  (5)

Given the multi-dimensional convolutional feature map z of the target candidate region in frame t+1, with discrete Fourier transform Z, the correlation response map H with respect to frame t is obtained by formula (6):

H = F^(−1)(Σ(d=1..D) W^d ⊙ Z^d)  (6)

where F^(−1) denotes the inverse discrete Fourier transform. The position of the maximum response in H is the estimated target position in frame t+1.

To suppress the boundary effect, each pixel is multiplied by a raised cosine window so that pixel values near the edge approach zero, see formula (7):

win(m, n) = (1 − cos(2πm/(M−1)))(1 − cos(2πn/(N−1)))/4  (7)
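Under the conventions of formulas (2)-(7), the training and detection steps can be sketched for a single feature channel as follows (an illustrative NumPy sketch under our own naming, not the patent's code; the patent applies the same computation to each of the D deep-feature channels and sums over channels).

```python
import numpy as np

def gaussian_label(M, N, sigma=2.0):
    # Formula (3): 2-D Gaussian with its peak at the centre of the patch.
    m, n = np.meshgrid(np.arange(M), np.arange(N), indexing="ij")
    return np.exp(-(((m - M // 2) ** 2 + (n - N // 2) ** 2) / (2 * sigma ** 2)))

def hann_window(M, N):
    # Formula (7): raised cosine window pushing edge pixels toward zero.
    return np.outer(np.hanning(M), np.hanning(N))

def train_filter(x, y, lam=1e-4):
    # Formula (5), one channel: W = (Y * conj(X)) / (X * conj(X) + lam).
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return (Y * np.conj(X)) / (X * np.conj(X) + lam)

def response(W, z):
    # Formula (6): correlation response map of a candidate patch z.
    return np.real(np.fft.ifft2(W * np.fft.fft2(z)))

M, N = 32, 32
rng = np.random.default_rng(0)
x = rng.standard_normal((M, N)) * hann_window(M, N)  # windowed training patch
y = gaussian_label(M, N)
W = train_filter(x, y)
H = response(W, x)  # respond to the training patch itself
peak = np.unravel_index(np.argmax(H), H.shape)
# The peak lands where the Gaussian label peaks (the patch centre).
```

Because the closed form in formula (5) is evaluated once per frequency bin, training and detection are both O(MN log MN), which is the source of the speed advantage noted in the background section.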
Step 104: Fuse the multilayer features by intra-layer weighted fusion for dimension reduction, and construct the feature response map.
The different convolutional features of the three layers Conv3-4, Conv4-4 and Conv5-4 are extracted with VGG-Net, and the maximum response maps H3, H4, H5 of each layer are obtained by the method above. The fused correlation response map H obtained by weighted fusion is expressed by formula (8):

H = β1·H3 + β2·H4 + β3·H5  (8)

where β1, β2, β3 are the fusion weights of the corresponding convolutional layers.

Step 105: Take the position of the maximum correlation response as the estimated position.
The maximum response is searched on the fused correlation response map H; its position is the estimated target center position p, as shown in formula (9):

p = argmax H(m, n)  (9)

where (m, n) is a pixel position in the candidate region.
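Formulas (8) and (9) reduce to a weighted sum of the per-layer response maps followed by an argmax; a minimal sketch (the β values here are chosen purely for illustration, since this passage does not fix them):

```python
import numpy as np

def fuse_and_locate(H3, H4, H5, betas=(0.25, 0.5, 1.0)):
    """Formulas (8)-(9): weighted fusion of per-layer response maps and
    extraction of the peak position p = argmax H(m, n)."""
    b1, b2, b3 = betas
    H = b1 * H3 + b2 * H4 + b3 * H5
    p = np.unravel_index(np.argmax(H), H.shape)
    return H, p

# Three toy 5x5 response maps; the shallower layers agree on position (2, 3).
H3 = np.zeros((5, 5)); H3[2, 3] = 1.0
H4 = np.zeros((5, 5)); H4[2, 3] = 0.8
H5 = np.zeros((5, 5)); H5[1, 1] = 0.3  # a distractor peak in the deepest layer
H, p = fuse_and_locate(H3, H4, H5)
# p == (2, 3): the layers that agree outweigh the single distractor.
```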
Step 106: Extract the dense features of the target, obtain the maximum feature response by correlation filtering, and obtain the response confidence Confidence of the target center position estimated from the deep convolutional features.

On frame t+1, a search region block of a × b pixels is sampled around the estimated target center position as the range of dense feature extraction. The horizontal and vertical gradient components of each pixel are computed with formulas (10) and (11), and the magnitude and angle of each pixel's gradient vector are obtained by formulas (12) and (13):

G1 = pixel(pos_x+1, pos_y) − pixel(pos_x−1, pos_y)  (10)
G2 = pixel(pos_x, pos_y+1) − pixel(pos_x, pos_y−1)  (11)
S = √(G1² + G2²)  (12)
θ = arctan(G1/G2)  (13)

where pixel(pos_x+1, pos_y), pixel(pos_x−1, pos_y), pixel(pos_x, pos_y+1) and pixel(pos_x, pos_y−1) denote the values at the four neighboring pixel positions, (pos_x, pos_y) is the estimated target position, G1 and G2 are the gray-level differences in the horizontal and vertical directions respectively, and S and θ are the magnitude and angle of the gradient vector.

The search region is then divided into cells of identical size and the gradient information (magnitude and direction) of the pixels in each cell is computed; the gradient magnitude of each pixel contributes a different weight to its direction, and these weights are accumulated over all gradient directions. Increasing the number of gradient directions improves detection performance only up to a point: statistics over 9 direction ranges (0°–20°, 21°–40°, ..., 161°–180°) are most effective, and more than 9 directions bring no significant improvement in re-detection.

The search region is divided into cells of a/4 × b/4 pixels, i.e. 4 × 4 cells; the gradient information of each cell is accumulated into the 9 direction bins, so that its image information is represented by a 9-dimensional vector. Collecting the feature maps of all cells yields the dense features.
Finally, Gaussian correlation filtering is applied to the extracted multilayer dense features and the maximum response is computed, giving the response confidence Confidence of the target center position estimated from the deep convolutional features; this value reflects the reliability of each tracking result and is expressed by formulas (14) and (15):

Confidence = max(F^(−1)(E))  (15)

where zf and xf are the dense feature sets extracted from the current frame and the previous frame respectively, and F denotes the Fourier transform.
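The text does not reproduce formula (14), so the sketch below substitutes the Gaussian kernel correlation familiar from kernelized correlation filter trackers as one plausible choice of E (an assumption, not necessarily the patent's exact definition). With this choice, Confidence = max(F⁻¹(E)) equals 1 when the dense features of the current and previous frames match exactly and decreases as they diverge.

```python
import numpy as np

def gauss_correlation_confidence(zf, xf, sigma=0.5):
    """Confidence = max of a Gaussian kernel correlation between the current
    (zf) and previous (xf) dense feature maps -- our stand-in for formulas
    (14)-(15), since the exact E is not given in the text."""
    ZF, XF = np.fft.fft2(zf), np.fft.fft2(xf)
    # Cross-correlation of zf and xf for every cyclic shift, via the FFT.
    cross = np.real(np.fft.ifft2(ZF * np.conj(XF)))
    d2 = (zf ** 2).sum() + (xf ** 2).sum() - 2 * cross  # squared distances
    k = np.exp(-np.maximum(d2, 0.0) / (sigma ** 2 * zf.size))
    return float(k.max())

rng = np.random.default_rng(1)
xf = rng.standard_normal((8, 8))
conf_same = gauss_correlation_confidence(xf, xf)  # identical features: 1.0
conf_diff = gauss_correlation_confidence(rng.standard_normal((8, 8)), xf)
# conf_diff is clearly lower, which is what triggers re-detection below T0.
```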
Step 107:By realizing assessment to the target state estimator position of acquisition in line target re-detection.
Re-detection threshold value T is set0, target state estimator position is assessed, when tracking creditability is less than the threshold value, to estimating
It counts obtained target location pos to be repositioned, starts re-detection module at this time.
The core of re-detection module is linear two grader, and target is that one categorised decision function of construction uses up positive negative sample
May correctly it classify, the purpose of linear classification is exactly to find one or one group of hyperplane is complete the positive negative sample around target
It is complete to separate, binomial classification is carried out by calculation formula (16):
F (p)=<s_w,p>+s_b (16)
Wherein,<>It is inner product of vectors symbol, s_w is weight vector, it is the normal direction of hyperplane, and s_b is offset, instruction
The value of s_w and s_b are obtained after white silk.This is the optimization problem of a Prescribed Properties, can be solved with Lagrangian method, is enabled
Lagrangian is formula (17):
Wherein, αl>=0 is the Lagrange multiplier of each sample, (p1,q1),...,(pl,ql),(pk,qk) it is histogram
Sample after equilibrium, k are number of samples, qlEqual to 1 or -1, plFor d dimensional vectors.
The optimal classification function obtained after solving can be expressed by formula (18):
F(p) = sgn[(s_w*·p) + s_b*] (18)
Wherein, s_w* and s_b* are the corresponding optimal solutions.
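The decision function of formulas (16)–(18) can be illustrated with a small sketch. Rather than the dual Lagrangian solution of formula (17), this sketch trains s_w and s_b in the primal with a hinge-loss subgradient step; the learning rate, regularization weight and epoch count are illustrative assumptions:

```python
import numpy as np

def train_linear_classifier(P, q, lam=0.01, lr=0.1, epochs=200):
    """Learn s_w and s_b of f(p) = <s_w, p> + s_b, formula (16), from
    samples p_i with labels q_i in {+1, -1} (hinge-loss subgradient sketch)."""
    rng = np.random.default_rng(0)
    s_w = np.zeros(P.shape[1])
    s_b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(P)):
            if q[i] * (P[i] @ s_w + s_b) < 1:     # margin violated: push toward sample
                s_w = (1 - lr * lam) * s_w + lr * q[i] * P[i]
                s_b += lr * q[i]
            else:                                 # only shrink by the regularizer
                s_w = (1 - lr * lam) * s_w
    return s_w, s_b

def classify(p, s_w, s_b):
    """Formula (18): f(p) = sgn(<s_w*, p> + s_b*)."""
    return int(np.sign(p @ s_w + s_b))
```

On a linearly separable toy set the learned hyperplane separates the two classes exactly as the claim requires.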
Step 108: The target position is adaptively updated according to the evaluation result.
Through the above processing, a series of sample scores (scores) is obtained; the maximum value is taken, and the sample position corresponding to that value is the re-estimated position tpos of the target after re-detection. Applying formulas (14) and (15) to this sample gives the post-detection tracking confidence Confidence2. When this value satisfies formula (19), tpos replaces pos, giving the target position after re-detection; if it is not satisfied, the tracking position remains unchanged.
Confidence2 > 1.1·Confidence && max(scores) > 0 (19)
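The update rule of formula (19) amounts to a simple guarded assignment. A minimal sketch; the factor 1.1 comes from formula (19), while the argument names are illustrative:

```python
def update_position(pos, tpos, confidence, confidence2, scores, ratio=1.1):
    """Accept the re-detected position tpos only when the post-detection
    confidence clearly exceeds the tracking confidence AND the best
    re-detection score is positive; otherwise keep the tracked position."""
    if confidence2 > ratio * confidence and max(scores) > 0:
        return tpos
    return pos
```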
The beneficial effects of the present invention are illustrated by simulation experiments:
1. Experimental conditions
The CPU used in the experiments is an Intel Core(TM) i7-4170 at 2.50 GHz with 8 GB of memory; the programming platform is MATLAB R2015b. The experiments use real image sequences containing targets, collected from the DARPA VIVID dataset: thermal infrared data of a series of vehicles, some of which are occluded by trees and pass through shadows. The image size is 640 × 480, as shown in Figure 2.
To demonstrate the superiority of the present invention effectively, it is compared, both subjectively and objectively, with fDSST and HCF, two outstanding tracking methods of the past two years. The experiments show that the method of the invention is superior to the comparison methods in both subjective vision and objective evaluation indices. Figures 4–6 give the results of the present method and the two comparison tracking methods on the target of Figure 2. The HCF tracking method, based on deep learning, is effective when the target's form changes slowly between adjacent video frames, but easily loses the target when it is occluded for a long time. Although fDSST offers higher real-time performance, it likewise cannot solve the long-term tracking problem. The present method, an adaptive anti-occlusion tracking method with multi-layer deep feature fusion, exploits the target's multi-layer deep convolutional features, combining the target's semantic information and detail information, and adds confidence estimation during tracking: when the confidence does not satisfy the condition, the target re-detection module is started and the target position is re-determined, so that by correcting the target center position the target can still be tracked accurately under long-term occlusion.
Table 1 gives the objective evaluation indices of the tracking results. CLE is the center location error, computed as the average Euclidean distance, in pixels, between the estimated target center and the real target center; a smaller value indicates a better tracking effect. OP is the bounding-box overlap ratio, computed as the average overlap between the predicted target box and the actual target box; a larger value indicates a better tracking effect. DP is the distance precision, computed as the ratio of the number of frames whose center location error is below a certain threshold to the total number of video frames; the larger the better, and in the experiments the threshold is set to 20 pixels. fps denotes the frame rate; larger is better. As can be seen from Table 1, the present method has a clear advantage over the two widely used tracking methods in center location error, tracking success rate and distance precision; although its real-time performance is still slightly worse than fDSST, it is basically on par with the deep-learning-based target tracking method.
Table 1 Objective indices of the tracking results
Method | Average CLE (pixel) | Average OP (%) | Average DP (%) | Average speed (fps) |
fDSST | 58.9 | 24 | 25.7 | 25 |
HCF | 7.58 | 90.7 | 88 | 1.6 |
The method of the present invention | 4.5 | 96.3 | 93 | 2.2 |
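The three quality indices of Table 1 can be computed as follows. A sketch: bounding boxes are assumed to be (x, y, w, h) tuples and centers (x, y) pairs.

```python
import numpy as np

def cle(pred_centers, gt_centers):
    """Center location error: mean Euclidean distance in pixels (smaller is better)."""
    d = np.linalg.norm(np.asarray(pred_centers) - np.asarray(gt_centers), axis=1)
    return float(d.mean())

def overlap(box_a, box_b):
    """Overlap (intersection over union) of two (x, y, w, h) boxes, for the OP index."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    return inter / (aw * ah + bw * bh - inter)

def dp(pred_centers, gt_centers, thresh=20.0):
    """Distance precision: fraction of frames with center error below thresh."""
    d = np.linalg.norm(np.asarray(pred_centers) - np.asarray(gt_centers), axis=1)
    return float((d < thresh).mean())
```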
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention.
Claims (10)
1. A multi-layer deep feature fusion adaptive anti-occlusion infrared target tracking method, characterized in that the method is: first, multi-layer deep feature maps of a target candidate region of a video image are obtained through VGG-Net, and the multi-layer deep feature maps are then up-sampled to obtain a series of feature maps of identical size from different levels; then, according to correlation filtering, the multi-layer deep feature maps are transformed from the time domain to the frequency domain, filter training and response-map computation are performed according to the fast Fourier transform (FFT), the multi-layer deep feature maps are fused and dimension-reduced by intra-layer feature weighted fusion, feature response maps of different levels are constructed, and the maximum correlation response is found as the target estimated position; finally, dense features of the target are extracted, the feature maximum response is obtained according to correlation filtering, and the response confidence of the target center position estimated by the deep convolutional features is obtained; when the response confidence of the target center position is less than a re-detection threshold T0, the obtained target estimated position is evaluated by online target re-detection and the target position is adaptively updated according to the evaluation result.
2. The multi-layer deep feature fusion adaptive anti-occlusion infrared target tracking method according to claim 1, characterized in that the multi-layer deep feature maps of the target candidate region of the video image are obtained through VGG-Net, specifically: the VGG-Net-19 deep convolutional network is taken as the core network, with the multi-dimensional image used directly as the network input; the "19" indicates the number of layers in the network whose weights need to be learned; the convolutional groups Conv1 to Conv5 respectively contain 2, 2, 4, 4 and 4 convolutional layers, all of which use the same 3 × 3 convolution kernel; after training on the ImageNet dataset, each convolutional layer of VGG-Net-19 yields a multi-layer deep feature map of the target candidate region of the video image.
3. The multi-layer deep feature fusion adaptive anti-occlusion infrared target tracking method according to claim 1 or 2, characterized in that the multi-layer deep feature maps are up-sampled to obtain a series of feature maps of identical size from different levels, specifically: the output of each convolutional layer is a group of multi-channel feature maps, where M, N and D respectively denote the width, height and number of channels of the feature map; the feature maps of different levels are up-sampled according to bilinear interpolation so that the feature maps of all convolutional layers have the same size.
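The bilinear up-sampling of claim 3 can be sketched in NumPy for an M × N × D feature map. The boundary handling and the output-grid spacing are implementation assumptions:

```python
import numpy as np

def bilinear_upsample(f, out_h, out_w):
    """Bilinear interpolation of an M x N x D feature map to out_h x out_w;
    each output vector is a weighted sum of its neighbouring feature vectors,
    as in formula (1)."""
    M, N, D = f.shape
    ys = np.linspace(0, M - 1, out_h)              # sample rows in the source grid
    xs = np.linspace(0, N - 1, out_w)              # sample columns in the source grid
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, M - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, N - 1)
    wy = (ys - y0)[:, None, None]                  # fractional row weights
    wx = (xs - x0)[None, :, None]                  # fractional column weights
    top = (1 - wx) * f[y0][:, x0] + wx * f[y0][:, x1]
    bot = (1 - wx) * f[y1][:, x0] + wx * f[y1][:, x1]
    return (1 - wy) * top + wy * bot
```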
4. The multi-layer deep feature fusion adaptive anti-occlusion infrared target tracking method according to claim 3, characterized in that the feature map f is up-sampled, and the feature vector at position i is expressed by formula (1):
Wherein, f is the feature map, x is the up-sampled feature map, and αik is the interpolation weight, whose value depends on position i and the neighbouring feature vectors at k.
5. The multi-layer deep feature fusion adaptive anti-occlusion infrared target tracking method according to claim 4, characterized in that the multi-layer deep feature maps are transformed from the time domain to the frequency domain according to correlation filtering, and filter training and response-map computation are performed according to the fast Fourier transform, specifically: given the up-sampled multi-dimensional feature input x of the tracked target, the correlation-filtering-based tracking algorithm obtains an optimal correlation filter w* by learning from training data, and the maximum correlation response in the candidate region is sought according to this filter to estimate the target position.
6. The multi-layer deep feature fusion adaptive anti-occlusion infrared target tracking method according to claim 5, characterized in that the multi-layer deep feature maps are transformed from the time domain to the frequency domain according to correlation filtering, and filter training and response-map computation are performed according to the fast Fourier transform, further, specifically: in frame t, the multi-dimensional convolutional feature input of the target is x; all cyclic shifts of x in the vertical and horizontal directions serve as the training samples of the correlation filter, each sample being denoted xm,n, (m, n) ∈ {0, 1, ..., M-1} × {0, 1, ..., N-1}; each sample is assigned a desired output y(m, n), and by minimizing the output error the optimal correlation filter in frame t is obtained, see formula (2):
Wherein, λ is a regularization parameter with λ ≥ 0, and y(m, n) is a two-dimensional Gaussian kernel function peaked at the center, whose expression is given by formula (3):
Wherein, (m, n) ∈ {0, 1, ..., M-1} × {0, 1, ..., N-1} and σ is the width of the Gaussian kernel; according to Parseval's theorem, the frequency-domain representation of the above formula is formula (4):
Wherein, X, Y and W are respectively the discrete Fourier transforms of x, y and w, X̄ is the complex conjugate of X, and ⊙ is the element-wise product; the optimal filter on each feature channel d can be expressed by formula (5):
Given the multi-dimensional convolutional feature map z of the target candidate region in frame t+1, its discrete Fourier transform Z can be obtained, and the correlation response map H of frame t can be expressed by formula (6):
Wherein, F⁻¹ denotes the inverse discrete Fourier transform; the maximum response found in H is the estimated position of the target in frame t+1;
each pixel is multiplied by a raised-cosine window so that pixel values near the edge approach zero, which can be expressed by formula (7):
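The filter training and response computation of formulas (2)–(6) can be sketched in NumPy. A multi-channel MOSSE-style sketch: the regularization λ and Gaussian width σ are illustrative, and the per-channel filters share the denominator as in formula (5):

```python
import numpy as np

def gaussian_label(M, N, sigma=2.0):
    """Desired output y(m, n): 2-D Gaussian peaked at the patch centre, formula (3)."""
    m = np.arange(M) - M // 2
    n = np.arange(N) - N // 2
    g = np.exp(-(m[:, None]**2 + n[None, :]**2) / (2 * sigma**2))
    return np.fft.ifftshift(g)          # move the peak to (0, 0) as the FFT expects

def train_filter(x, y, lam=1e-2):
    """Per-channel optimal filter in the frequency domain, formulas (4)-(5):
    W_d = (conj(Y) * X_d) / (sum_d X_d * conj(X_d) + lam)."""
    X = np.fft.fft2(x, axes=(0, 1))
    Y = np.fft.fft2(y)
    denom = (X * np.conj(X)).sum(axis=2) + lam
    return np.conj(Y)[:, :, None] * X / denom[:, :, None]

def response(W, z):
    """Formula (6): H = F^{-1}(sum_d conj(W_d) * Z_d); argmax H is the new position."""
    Z = np.fft.fft2(z, axes=(0, 1))
    return np.real(np.fft.ifft2((np.conj(W) * Z).sum(axis=2)))
```

Because the training samples are all cyclic shifts of x, circularly shifting the candidate patch shifts the response peak by exactly the same amount.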
7. The multi-layer deep feature fusion adaptive anti-occlusion infrared target tracking method according to claim 6, characterized in that the multi-layer deep feature maps are fused and dimension-reduced by intra-layer feature weighted fusion, feature response maps of different levels are constructed, and the maximum correlation response is found as the target estimated position, specifically:
the different convolutional features of the three layers Conv3-4, Conv4-4 and Conv5-4 are extracted through VGG-Net, and the maximum response of each layer, H3, H4 and H5, is obtained; a weighted fusion is applied to obtain the correlation response map H after multi-layer feature fusion:
H = β1H3 + β2H4 + β3H5, wherein β1, β2 and β3 are the weighting values corresponding to the different convolutional layers; the maximum response is sought on the fused correlation response map H, and this point is the estimated center of the target, p = arg max H(m, n); wherein (m, n) is a pixel position in the candidate region.
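The weighted fusion and argmax step can be sketched directly. The β values below are placeholders; the claim leaves their concrete values to the embodiment:

```python
import numpy as np

def fuse_and_locate(H3, H4, H5, betas=(0.25, 0.5, 1.0)):
    """H = b1*H3 + b2*H4 + b3*H5; p = argmax H(m, n) is the estimated center."""
    H = betas[0] * H3 + betas[1] * H4 + betas[2] * H5
    p = np.unravel_index(np.argmax(H), H.shape)
    return H, p
```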
8. The multi-layer deep feature fusion adaptive anti-occlusion infrared target tracking method according to claim 7, characterized in that dense features of the target are extracted, the feature maximum response is obtained according to correlation filtering, and the response confidence of the target center position estimated by the deep convolutional features is obtained, specifically:
first, on the frame t+1 image, a search region block of size a × b is sampled centered on the estimated target center position as the range of dense feature extraction; the horizontal and vertical gradient components of each pixel are calculated using formulas (10) and (11), and the length and angle of each pixel's gradient vector are obtained from formulas (12) and (13);
G1 = pixel(pos_x+1, pos_y) - pixel(pos_x-1, pos_y) (10)
G2 = pixel(pos_x, pos_y+1) - pixel(pos_x, pos_y-1) (11)
θ = arctan(G1/G2) (13)
Wherein, pixel(pos_x+1, pos_y), pixel(pos_x-1, pos_y), pixel(pos_x, pos_y+1) and pixel(pos_x, pos_y-1) respectively denote the values at the four pixel positions, pos_x and pos_y are the estimated target position, G1 and G2 respectively denote the pixel differences in the horizontal and vertical directions, and S and θ denote the length and angle of the gradient vector;
then, the search region is divided into cells of identical size, and the gradient information (magnitude and direction) of the pixels in each cell patch is computed separately; the gradient magnitude of each pixel contributes a different weight to its direction, and this weight is then added to the corresponding gradient-direction bins;
the search region is divided into cells of a/4 × b/4 pixels each, i.e. 4 × 4 cells, and the gradient information of each cell is accumulated over 9 direction bins so that its image information is represented by a 9-dimensional vector; collecting the feature maps of all cells yields the dense features;
finally, Gaussian correlation filtering is applied to the extracted multi-layer dense features to find the maximum response, which yields the response confidence Confidence of the target center position estimated by the above deep convolutional features; this value reflects the reliability of each tracking result,
Confidence = max(F⁻¹(E)) (15)
Wherein, zf and xf are respectively the dense feature sets extracted from the current frame and the previous frame, and F denotes the Fourier transform.
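The per-pixel gradient computation of formulas (10)–(13) can be sketched as follows. Formula (12)'s body is omitted in the text, so the usual magnitude sqrt(G1² + G2²) is assumed; pos_x is taken as the column index, and arctan2 replaces arctan(G1/G2) for numerical robustness at G2 = 0 (all assumptions):

```python
import numpy as np

def pixel_gradients(img):
    """Central differences of formulas (10)-(11) for every pixel, plus the
    gradient length S and angle theta of formulas (12)-(13)."""
    img = np.asarray(img, dtype=float)
    G1 = np.zeros_like(img)
    G2 = np.zeros_like(img)
    G1[:, 1:-1] = img[:, 2:] - img[:, :-2]   # pixel(x+1, y) - pixel(x-1, y)
    G2[1:-1, :] = img[2:, :] - img[:-2, :]   # pixel(x, y+1) - pixel(x, y-1)
    S = np.hypot(G1, G2)                     # assumed form of formula (12)
    theta = np.arctan2(G1, G2)               # arctan(G1/G2), quadrant-aware
    return G1, G2, S, theta
```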
9. The multi-layer deep feature fusion adaptive anti-occlusion infrared target tracking method according to claim 8, characterized in that when the response confidence of the target center position is less than the re-detection threshold T0, the obtained target estimated position is evaluated by online target re-detection, specifically: the core of the re-detection module is a linear two-class classifier, and binary classification of the obtained target estimated position is carried out by formula (16):
F(p) = <s_w, p> + s_b (16)
Wherein, < , > is the vector inner product symbol, s_w is the weight vector (the normal of the hyperplane), and s_b is the offset; the values of s_w and s_b are obtained after training and are solved by the Lagrangian method, the Lagrangian function being formula (17):
Wherein, αl ≥ 0 is the Lagrange multiplier of each sample, (p1, q1), ..., (pl, ql), ..., (pk, qk) are the samples after histogram equalization, k is the number of samples, ql equals 1 or -1, and pl is a d-dimensional vector;
the optimal classification function obtained after solving can be expressed by formula (18):
F(p) = sgn[(s_w*·p) + s_b*] (18)
Wherein, s_w* and s_b* are the corresponding optimal solutions.
10. The multi-layer deep feature fusion adaptive anti-occlusion infrared target tracking method according to claim 9, characterized in that the target position is adaptively updated according to the evaluation result, specifically: evaluating the obtained target estimated position by online target re-detection yields a series of sample scores (scores); the maximum value is taken, and the sample position corresponding to that value is the re-estimated position tpos of the target after re-detection; applying formulas (14) and (15) to this sample gives the post-detection tracking confidence Confidence2; when this value satisfies formula (19), tpos replaces pos, giving the target position after re-detection, and if it is not satisfied the tracking position remains unchanged;
Confidence2 > 1.1·Confidence && max(scores) > 0 (19)。
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810259132.4A CN108665481B (en) | 2018-03-27 | 2018-03-27 | Self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108665481A true CN108665481A (en) | 2018-10-16 |
CN108665481B CN108665481B (en) | 2022-05-31 |
Family
ID=63782540
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810259132.4A Active CN108665481B (en) | 2018-03-27 | 2018-03-27 | Self-adaptive anti-blocking infrared target tracking method based on multi-layer depth feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108665481B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103400129A (en) * | 2013-07-22 | 2013-11-20 | 中国科学院光电技术研究所 | Target tracking method based on frequency domain saliency |
CN104240542A (en) * | 2014-09-03 | 2014-12-24 | 南京航空航天大学 | Airport surface maneuvering target identifying method based on geomagnetic sensor network |
CN104730537A (en) * | 2015-02-13 | 2015-06-24 | 西安电子科技大学 | Infrared/laser radar data fusion target tracking method based on multi-scale model |
EP3096292A1 (en) * | 2015-05-18 | 2016-11-23 | Xerox Corporation | Multi-object tracking with generic object proposals |
CN106447711A (en) * | 2016-05-26 | 2017-02-22 | 武汉轻工大学 | Multiscale basic geometrical shape feature extraction method |
CN107154024A (en) * | 2017-05-19 | 2017-09-12 | 南京理工大学 | Dimension self-adaption method for tracking target based on depth characteristic core correlation filter |
CN107644430A (en) * | 2017-07-27 | 2018-01-30 | 孙战里 | Target following based on self-adaptive features fusion |
US20180053318A1 (en) * | 2016-08-22 | 2018-02-22 | Ulsee Inc | Image Target Tracking Method and System Thereof |
Non-Patent Citations (6)
Title |
---|
HAINAN ZHAO et al.: "Robust Object Tracking Using Adaptive Multi-Features Fusion Based on Local Kernel Learning", 2014 Tenth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, 29 December 2014, pages 1-10 * |
BAO Fu: "Research on Target Detection and Tracking Technology Based on Local Feature Extraction", China Master's Theses Full-text Database, Information Science and Technology, no. 2018, 15 January 2018, pages 138-1073 * |
WANG Xin et al.: "Target scale-adaptive robust tracking based on multi-layer convolutional feature fusion", Acta Optica Sinica (光学学报), vol. 37, no. 11, 26 July 2017, pages 232-243 * |
YAN Junqiang: "Research on Image-Based Aerial Target Tracking Algorithms", China Master's Theses Full-text Database, Information Science and Technology, no. 2016, 15 August 2016, pages 138-1123 * |
Cited By (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109615640A (en) * | 2018-11-19 | 2019-04-12 | 北京陌上花科技有限公司 | Correlation filtering method for tracking target and device |
US11055574B2 (en) | 2018-11-20 | 2021-07-06 | Xidian University | Feature fusion and dense connection-based method for infrared plane object detection |
WO2020102988A1 (en) * | 2018-11-20 | 2020-05-28 | 西安电子科技大学 | Feature fusion and dense connection based infrared plane target detection method |
CN109741366A (en) * | 2018-11-27 | 2019-05-10 | 昆明理工大学 | A kind of correlation filtering method for tracking target merging multilayer convolution feature |
CN109816689A (en) * | 2018-12-18 | 2019-05-28 | 昆明理工大学 | A kind of motion target tracking method that multilayer convolution feature adaptively merges |
CN110427839A (en) * | 2018-12-26 | 2019-11-08 | 西安电子科技大学 | Video object detection method based on multilayer feature fusion |
CN110427839B (en) * | 2018-12-26 | 2022-05-06 | 厦门瞳景物联科技股份有限公司 | Video target detection method based on multi-layer feature fusion |
CN111291745A (en) * | 2019-01-15 | 2020-06-16 | 展讯通信(上海)有限公司 | Target position estimation method and device, storage medium and terminal |
CN111291745B (en) * | 2019-01-15 | 2022-06-14 | 展讯通信(上海)有限公司 | Target position estimation method and device, storage medium and terminal |
CN110033472A (en) * | 2019-03-15 | 2019-07-19 | 电子科技大学 | A kind of stable objects tracking under the infrared ground environment of complexity |
CN111738036A (en) * | 2019-03-25 | 2020-10-02 | 北京四维图新科技股份有限公司 | Image processing method, device, equipment and storage medium |
CN111738036B (en) * | 2019-03-25 | 2023-09-29 | 北京四维图新科技股份有限公司 | Image processing method, device, equipment and storage medium |
CN109980781A (en) * | 2019-03-26 | 2019-07-05 | 惠州学院 | A kind of transformer substation intelligent monitoring system |
CN109980781B (en) * | 2019-03-26 | 2023-03-03 | 惠州学院 | Intelligent monitoring system of transformer substation |
CN110084834A (en) * | 2019-04-28 | 2019-08-02 | 东华大学 | A kind of method for tracking target based on quick tensor singular value decomposition Feature Dimension Reduction |
CN110189365A (en) * | 2019-05-24 | 2019-08-30 | 上海交通大学 | It is anti-to block correlation filtering tracking |
CN110189365B (en) * | 2019-05-24 | 2023-04-07 | 上海交通大学 | Anti-occlusion correlation filtering tracking method |
CN110276785A (en) * | 2019-06-24 | 2019-09-24 | 电子科技大学 | One kind is anti-to block infrared object tracking method |
CN110276785B (en) * | 2019-06-24 | 2023-03-31 | 电子科技大学 | Anti-shielding infrared target tracking method |
CN110599519A (en) * | 2019-08-27 | 2019-12-20 | 上海交通大学 | Anti-occlusion related filtering tracking method based on domain search strategy |
CN110599519B (en) * | 2019-08-27 | 2022-11-08 | 上海交通大学 | Anti-occlusion related filtering tracking method based on domain search strategy |
CN110782479A (en) * | 2019-10-08 | 2020-02-11 | 中国科学院光电技术研究所 | Visual target tracking method based on Gaussian center alignment |
CN110782479B (en) * | 2019-10-08 | 2022-07-19 | 中国科学院光电技术研究所 | Visual target tracking method based on Gaussian center alignment |
CN111260689A (en) * | 2020-01-16 | 2020-06-09 | 东华大学 | Effective confidence enhancement correlation filtering visual tracking algorithm |
CN111260689B (en) * | 2020-01-16 | 2022-10-11 | 东华大学 | Confidence enhancement-based correlation filtering visual tracking method |
CN113658217B (en) * | 2021-07-14 | 2024-02-23 | 南京邮电大学 | Self-adaptive target tracking method, device and storage medium |
CN113537241A (en) * | 2021-07-16 | 2021-10-22 | 重庆邮电大学 | Long-term correlation filtering target tracking method based on adaptive feature fusion |
CN113537241B (en) * | 2021-07-16 | 2022-11-08 | 重庆邮电大学 | Long-term correlation filtering target tracking method based on adaptive feature fusion |
CN113971216A (en) * | 2021-10-22 | 2022-01-25 | 北京百度网讯科技有限公司 | Data processing method and device, electronic equipment and memory |
CN114845137A (en) * | 2022-03-21 | 2022-08-02 | 南京大学 | Video light path reconstruction method and device based on image registration |
CN116563348A (en) * | 2023-07-06 | 2023-08-08 | 中国科学院国家空间科学中心 | Infrared weak small target multi-mode tracking method and system based on dual-feature template |
CN116563348B (en) * | 2023-07-06 | 2023-11-14 | 中国科学院国家空间科学中心 | Infrared weak small target multi-mode tracking method and system based on dual-feature template |
CN117011196A (en) * | 2023-08-10 | 2023-11-07 | 哈尔滨工业大学 | Infrared small target detection method and system based on combined filtering optimization |
CN117011196B (en) * | 2023-08-10 | 2024-04-19 | 哈尔滨工业大学 | Infrared small target detection method and system based on combined filtering optimization |
Also Published As
Publication number | Publication date |
---|---|
CN108665481B (en) | 2022-05-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||