CN109801310A - Target tracking method based on an orientation- and scale-discriminating deep network - Google Patents

Target tracking method based on an orientation- and scale-discriminating deep network

Info

Publication number
CN109801310A
Authority
CN
China
Prior art keywords: network, target, sample, tracking, scale
Prior art date: 2018-11-23
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811403020.8A
Other languages
Chinese (zh)
Inventor
胡昭华 (Hu Zhaohua)
侍孝义 (Shi Xiaoyi)
陈慧 (Chen Hui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2018-11-23
Filing date: 2018-11-23
Publication date: 2019-05-24
Application filed by Nanjing University of Information Science and Technology
Priority to CN201811403020.8A
Publication of CN109801310A
Legal status: Pending (current)

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a target tracking method based on an orientation- and scale-discriminating deep network, comprising the following steps: step (1), pre-training the network; step (2), orientation-information classification; step (3), sliding-window operation; step (4), online fine-tuning; step (5), re-detection after loss. Under a deep-learning network framework, the invention adds orientation-information classification and positive/negative sample classification, so that the computational load of the deep network is low and the speed is high. At the same time, by fine-tuning the network parameters online, the network adapts to the target's deformation at different times when deformation occurs, and the tracking task can still be completed well. In addition, a re-detection strategy is introduced during tracking, so that the case where the target is lost during tracking can also be handled well.

Description

Target tracking method based on an orientation- and scale-discriminating deep network
Technical field
The present invention relates to the fields of image processing and computer vision in human-computer interaction and video surveillance, and in particular to a target tracking method based on an orientation- and scale-discriminating deep network, which realizes target tracking by having a deep network learn the target's orientation information.
Background art
In order to allow machines to understand the real world, computer vision has always been at the research frontier, and target tracking has always been a core topic of computer vision, so target tracking technology has been a research hotspot in recent years. Its purpose is, given the size and location of the target marked in the first frame, to learn the target's features with an algorithm and predict the size and location of the target in subsequent frames. In the field of single-target tracking, recent tracking algorithms are broadly divided into two general directions: correlation filtering and deep learning.
Deep learning has been widely applied in the tracking field in recent years. Naiyan Wang et al. (Wang N, Yeung D Y. Learning a deep compact image representation for visual tracking[C]//Advances in neural information processing systems. 2013: 809-817.) proposed the DLT algorithm, which introduced deep learning into the target tracking domain for the first time. DLT first performs unsupervised offline pre-training with a stacked denoising autoencoder on a large-scale natural image data set such as the Tiny Images dataset, so as to obtain a general object representation ability. At the beginning of the tracking phase, the network has not yet acquired a specific representation of the object currently being tracked; at this time positive and negative samples are obtained from the first frame, and the classification network is fine-tuned so that it becomes more specific to the current tracking target and background. During tracking, a batch of candidate samples is extracted from the current frame by particle filtering; these samples are fed into the classification network, and the one with the highest confidence becomes the final predicted target. The data set used for DLT's offline pre-training contains only 32*32 images, whose resolution is obviously low, so it is difficult to learn a sufficiently strong feature representation. The fully connected network structure of DLT is also not outstanding at describing target features; although a 4-layer deep model is used, its performance is still below that of some traditional trackers using hand-crafted features.
With the wide application of deep learning, CNN networks, which are good at handling images, were introduced into the tracking field, and a large batch of outstanding algorithms was born. Nam et al. (Nam H, Han B. Learning multi-domain convolutional neural networks for visual tracking[C]//Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on. IEEE, 2016: 4293-4302.) proposed the MDNet algorithm. Nam et al. realized that there is a huge difference between the image classification task and tracking, and MDNet therefore proposed to pre-train the CNN directly on tracking videos to obtain a general target representation ability. MDNet treats each training sequence as an individual domain; each domain has its own binary classification layer for distinguishing the foreground and background of the current sequence, while all preceding layers of the network are shared across sequences. In this way the shared layers achieve the goal of learning a feature representation of the target from the tracking sequences. In the tracking phase, the data of the first frame are used to train a bounding-box regression model for the sequence, and positive and negative samples are extracted from the first frame to update the network weights. Afterwards, 256 candidate samples are generated per frame, the one with the highest confidence is selected, and bounding-box regression is then applied to obtain the final result. Although MDNet performs well, it still has some problems: hundreds of samples have to be forwarded through the network, so although the network structure is small, the speed is still relatively slow; and the bounding-box regression also requires separate training.
Summary of the invention
Goal of the invention: in order to overcome the shortcomings of the above techniques and enable the tracker to still track the target accurately under complex situations such as illumination change, scale change, occlusion, deformation, motion blur, fast motion and background clutter, the present invention proposes a target tracking algorithm in which a deep network learns the target's orientation information; it adds sliding-window movement and single-sample tracking, accelerates the deep-network computation, and is a simple and robust tracking method.
Technical solution: in order to solve the problem of target tracking failure caused by target deformation, occlusion, illumination, rotation and similar situations, as well as the problem of slow deep-network speed, the invention proposes a single-sample target tracking method in which a deep network learns the target's orientation information; it maintains the stability of target tracking under complex scenes, improves the precision of the tracker, and increases the speed of the deep network.
The specific steps of the target tracking method of the present invention, in which a deep network learns the target's orientation information, are as follows:
(1) Step 1: pre-training the network. The present invention uses a network framework of three convolutional layers followed by three fully connected layers, where the third fully connected layer is a two-branch classification network: the orientation-information classification layer judges the target's position information, and the positive/negative classification layer distinguishes target and background. The first three convolutional layers of the network are initialized from VGG-M; a convolutional network trained on a large-scale data set has good generalization and transfer-learning ability. Using the weights of a trained network as the initialization of the training network greatly improves the network's ability to learn for tracking.
The training output of each classification branch uses a softmax cross-entropy loss function, and the network weights are updated by stochastic gradient descent with back-propagation, with 100 training cycles. The training data set consists of 58 video sequences from VOT2013, VOT2014 and VOT2015. Unlike past pre-training on large-scale classification data, pre-training on video sequences makes the network more specific to tracking and better suited for tracking training.
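As a concrete illustration of this training setup, the following is a minimal PyTorch sketch of one update step. It assumes a hypothetical `net` that returns the two classification outputs described above (an architecture sketch is given in the specific embodiment below); the mini-batching and any optimizer settings beyond plain stochastic gradient descent are illustrative assumptions, not taken from the patent.

```python
# Minimal sketch of one training update, assuming a two-head `net` as described above.
# The softmax cross-entropy loss and SGD with back-propagation follow the text;
# everything else is an illustrative assumption.
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()  # softmax cross-entropy, used for both heads

def train_step(net, optimizer, patches, dir_labels, pn_labels):
    """One stochastic-gradient-descent update with back-propagation."""
    optimizer.zero_grad()
    dir_scores, pn_scores = net(patches)          # orientation head, positive/negative head
    loss = criterion(dir_scores, dir_labels) + criterion(pn_scores, pn_labels)
    loss.backward()                               # back-propagation
    optimizer.step()                              # weight update
    return loss.item()

# optimizer = torch.optim.SGD(net.parameters(), lr=1e-3)
# (the per-layer rates beta1/beta2 mentioned below would be set via SGD parameter groups)
```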
Samples in the training stage are selected by computing the overlap ratio. In order to make the network learn more target orientation information, the overlap-ratio threshold dividing positive and negative samples in the training stage is denoted by α. As in formula (1), l denotes the positive/negative label of a sample and IoU denotes the overlap ratio; if the overlap ratio of gs and gt is greater than α the sample is judged to be positive, otherwise negative. The convolutional-layer learning rate is denoted β1 and the fully-connected-layer learning rate β2.
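The body of formula (1) is not reproduced in the text, so the sketch below simply follows the standard intersection-over-union definition together with the threshold rule just described. Boxes are taken as [x, y, w, h] with (x, y) the box centre, which is an assumption; the default threshold 0.6 comes from the specific embodiment below.

```python
def iou(box_a, box_b):
    """Overlap ratio (IoU) of two [x, y, w, h] boxes, with (x, y) the box centre."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw / 2, bx + bw / 2) - max(ax - aw / 2, bx - bw / 2))
    ih = max(0.0, min(ay + ah / 2, by + bh / 2) - max(ay - ah / 2, by - bh / 2))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def pos_neg_label(g_s, g_t, alpha=0.6):
    """Formula (1), sketched: label l is positive when IoU(g_s, g_t) > alpha, else negative."""
    return 1 if iou(g_s, g_t) > alpha else 0
```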
(2) Step 2: orientation-information classification. The present invention has the deep network learn eleven orientation classes for positive samples, mainly divided into upper-left (D1), left (D2), lower-left (D3), lower (D4), lower-right (D5), right (D6), upper-right (D7), upper (D8), small scale (S1), large scale (S2) and true sample (T).
[xl, xr, yu, yd] respectively denote the left, right, upper and lower boundaries of the constraint frame used for orientation division. The boundaries are determined by the original target size of the sample, as shown in formulas (2) and (3). Let gt = [xt, yt, wt, ht] be the ground truth of the current frame, where xt is the x-coordinate of the target position, yt is the y-coordinate of the target position, wt is the target width along the x-axis and ht is the target height along the y-axis.
In formulas (2) and (3), α is the scaling factor of the boundary constraint. The class of a sample is determined by the boundaries defined in formulas (2) and (3).
C=ρ (gs,[xl,xr,yu,yd],[wt,ht]) (4)
Formula (4) describes the decision mechanism for classifying a sample. gs = [xs, ys, ws, hs] is the position information of the sample, and c is the output class of the sample. ρ denotes the decision rule of the invention: the constraint frame [xl, xr, yu, yd] and the sample position information gs are used to judge the sample's position class, and the scale decision mechanism together with the width and height [wt, ht] of the ground truth is used to judge the scale-change class.
The class of a sample is judged from the region into which its x- and y-coordinates fall, as described by formula (5). If the sample falls into the central region of the constraint frame, its scale change is judged instead, by comparing the sample's ws and hs with the original ground-truth wt and ht. Unlike the multi-scale sample input adopted by most traditional algorithms to judge the target's scale change, the present invention fuses the scale change into the sample classification and places the scale decision directly in the classification layer of the deep network output, making full use of the powerful learning ability of the deep convolutional neural network. The precondition for judging a sample's scale change is that the sample point falls in the middle region of the constraint frame, i.e. xl < xs < xr and yu < ys < yd. The scale of a sample is judged mainly by formula (6), where γ and λ denote the scale-change factors. A sample point that falls inside the constraint frame and has no size change is judged to be of class T.
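Formulas (2), (3), (5) and (6) are referenced but not reproduced in the text, so the following sketch of the decision rule ρ is only an interpretation of the description: the constraint-frame construction, the region layout (image coordinates, y increasing downward, (xt, yt) treated as the target centre) and the values of α, γ and λ are assumptions.

```python
def direction_scale_class(g_s, g_t, alpha=0.2, gamma=0.9, lam=1.1):
    """Hedged sketch of c = rho(g_s, [xl, xr, yu, yd], [wt, ht]) (formula (4))."""
    x_s, y_s, w_s, h_s = g_s
    x_t, y_t, w_t, h_t = g_t
    # Constraint frame around the ground-truth centre, scaled by alpha
    # (assumed form of formulas (2) and (3)).
    x_l, x_r = x_t - alpha * w_t, x_t + alpha * w_t
    y_u, y_d = y_t - alpha * h_t, y_t + alpha * h_t
    left, right = x_s < x_l, x_s > x_r
    up, down = y_s < y_u, y_s > y_d          # y grows downward (image coordinates)
    if left and up:     return "D1"          # upper-left
    if left and down:   return "D3"          # lower-left
    if right and down:  return "D5"          # lower-right
    if right and up:    return "D7"          # upper-right
    if left:            return "D2"          # left
    if down:            return "D4"          # lower
    if right:           return "D6"          # right
    if up:              return "D8"          # upper
    # Inside the constraint frame: decide the scale change (assumed form of formula (6)).
    if w_s < gamma * w_t and h_s < gamma * h_t:
        return "S1"                          # sample smaller than the target -> small scale
    if w_s > lam * w_t and h_s > lam * h_t:
        return "S2"                          # sample larger than the target -> large scale
    return "T"                               # true sample: on the target, no size change
```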
(3) Step 3: sliding-window operation. The class of a target sample obtained from the classification network is not yet the result that target tracking requires; a processing mechanism is needed that gradually approaches the true position of the target while obtaining sample information. The present invention gradually approaches the true position of the target using a sliding-window method.
The sample at the current location is fed into the orientation-information classification network to obtain the specific orientation of the current location relative to the target, and the current sample window is slid toward the target accordingly. With the sliding-window operation the network receives only one sample at a time, realizing single-sample network computation and improving the computation speed.
Each orientation class has a corresponding sliding-window strategy; the specific implementation is shown in formula (7).
In the formula, θ denotes the number of pixels moved by the sliding-window operation, and gs+1 denotes the next sample position to be taken. When the sample gs+1 fed into the network yields class T as output, i.e. it is closest to the target, the sliding-window operation stops and the tracking result of the current frame is obtained.
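Formula (7) itself is not reproduced, so the sketch below shows one plausible form of the update it describes. The mapping from each orientation class to a pixel offset, the treatment of S1/S2 as window resizing, and the default step θ are assumptions consistent with the class definitions above (image coordinates, y increasing downward).

```python
# Assumed per-class offsets (dx, dy): the window moves toward the target,
# i.e. opposite to where the sample lies relative to it.
STEP = {
    "D1": (+1, +1), "D2": (+1, 0), "D3": (+1, -1), "D4": (0, -1),
    "D5": (-1, -1), "D6": (-1, 0), "D7": (-1, +1), "D8": (0, +1),
}

def slide(g_s, cls, theta=2):
    """Formula (7), sketched: next sample g_{s+1} from the current sample and its class."""
    x, y, w, h = g_s
    if cls in STEP:
        dx, dy = STEP[cls]
        return [x + theta * dx, y + theta * dy, w, h]
    if cls == "S1":                  # sample smaller than the target: enlarge the window
        return [x, y, w + theta, h + theta]
    if cls == "S2":                  # sample larger than the target: shrink the window
        return [x, y, w - theta, h - theta]
    return list(g_s)                 # class T: already on the target, no move
```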
(4) Step 4: online fine-tuning. The pre-trained network uses a large number of video sequences and has good generalization ability, but applied directly to a specific video sequence without adaptation it still cannot achieve a good tracking effect. Therefore, in the first frame the network weights are fine-tuned with the same strategy as in pre-training, so that the network becomes specific to the video sequence.
The biggest challenge in target tracking is the deformation of the target: after the target has moved for a period of time, deformation more or less always appears, and at that point the information from the first frame can no longer fully describe the target itself. Therefore updating the target online, i.e. fine-tuning the network online, is vital.
(5) Step 5: re-detection after loss. The dual classification network is designed precisely to handle the case where the target is lost during tracking. If the positive-sample score of the positive/negative classification layer is less than a certain threshold, the sample is judged to be lost and the tracking result is entirely background. At this time a particle-sampling strategy is adopted: assuming that the target only moves on a small scale between consecutive frames, the current frame is sampled according to a Gaussian distribution centered at the previous target position; the samples are fed into the network, and the sample with the highest score in the positive/negative classification layer is taken as the target position of the current frame.
Working principle: the present invention feeds only a single sample into the network at a time for target tracking, which reduces the computational burden of the network in the tracking phase and improves the tracking speed. The algorithm classifies the target information and trains the network to judge the target's orientation information, thereby obtaining the true position of the target. Using a deep network, the target can be described better and the robustness of tracking is improved. Scale change is placed in the classification layer, so there is no need to train a separate regression model, which speeds up the algorithm.
Advantages: under a deep-learning network framework, the present invention adds orientation-information classification and positive/negative sample classification, so that the computational load of the deep network is low and the speed is high. At the same time, through the strategy of fine-tuning the network parameters online, the network adapts to the target's deformation at different times when deformation occurs, and the invention can still complete the tracking task well. Finally, a re-detection strategy is introduced during tracking, so the case where the target is lost during tracking can also be handled well. The main innovations of the present invention are: (1) the proposed orientation-information learning network adapts well to target tracking; (2) single-sample tracking with the sliding-window strategy reduces the computational burden of the network and increases the tracking speed of the deep network; (3) scale change is handled as one of the classification outputs of the network, so that sequences with scale change are also tracked robustly; (4) the positive/negative sample re-detection strategy guarantees that the target is not lost.
Description of the drawings
Fig. 1 is the system flow chart of the deep-learning orientation-information target tracking of the present invention;
Fig. 2 is the schematic diagram of the deep network architecture of the present invention;
Fig. 3 is the schematic diagram of the orientation-information division of the present invention;
Fig. 4 is the sample classification effect diagram of the present invention;
Fig. 5 is the schematic diagram of the orientation-information sliding-window tracking of the present invention;
Fig. 6 shows sample frames of the tracking results of the present invention on 6 test videos;
Fig. 7 compares the overall tracking performance of the present invention with that of 8 trackers under the OPE evaluation mode;
Fig. 8 compares the overall tracking performance of the present invention with that of 8 trackers under the OPE evaluation mode for three challenge factors.
Specific embodiment
The deep-learning orientation-information target tracking method provided by the invention, whose flow chart is shown in Fig. 1, specifically includes the following operating steps:
(1) step 1: pre-training network.The present invention is such as schemed using three-layer coil product three layers of fully-connected network frame of neural network Shown in 2.But third layer fully-connected network uses two-way sorter network.Directional information classification layer FC6, carries out target position letter The judgement of breath;Positive and negative classification layer FC7 carries out the judgement of target and background.The initialization of three-layer coil lamination uses before present networks It is VGG-M, the convolutional network using large-scale dataset training has good Generalization Capability and transfer learning ability.The present invention adopts Trained network weight is used to improve a lot as the initialization of training network to the tracking learning ability of network.
The end of the network framework uses a dual classification network: one classification branch learns the orientation information and the other learns the positive/negative information of the sample. As shown in Fig. 2, FC6 is the eleven-class orientation-information classification branch and FC7 is the positive/negative classification branch, which classifies target and background information. The training output of each classification branch uses a softmax cross-entropy loss function, and the network weights are updated by stochastic gradient descent with back-propagation, with 100 training cycles. The training data set consists of 58 video sequences from VOT2013, VOT2014 and VOT2015. Unlike past pre-training on large-scale classification data, pre-training on video sequences makes the network more specific to tracking and better suited for tracking training.
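A minimal PyTorch sketch of the network of Fig. 2 follows. Only the overall shape comes from the text: three convolutional layers initialized from VGG-M, fully connected layers, an 11-way orientation head FC6 and a 2-way positive/negative head FC7. The filter sizes, the 107x107 input crop and the fc4/fc5 widths are MDNet/VGG-M-style assumptions, and copying the VGG-M weights into conv1-3 is not shown.

```python
import torch
import torch.nn as nn

class DirectionScaleNet(nn.Module):
    """Sketch of the two-head network: conv1-3 and fc4-5 shared, FC6/FC7 heads."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(                    # conv1-3, to be initialised from VGG-M
            nn.Conv2d(3, 96, 7, stride=2), nn.ReLU(), nn.LocalResponseNorm(5), nn.MaxPool2d(3, 2),
            nn.Conv2d(96, 256, 5, stride=2), nn.ReLU(), nn.LocalResponseNorm(5), nn.MaxPool2d(3, 2),
            nn.Conv2d(256, 512, 3), nn.ReLU(),
        )
        self.fc = nn.Sequential(                      # fc4, fc5 (shared by both heads)
            nn.Flatten(), nn.Linear(512 * 3 * 3, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.fc6 = nn.Linear(512, 11)                 # orientation head: D1-D8, S1, S2, T
        self.fc7 = nn.Linear(512, 2)                  # positive/negative head

    def forward(self, x):                             # x: [N, 3, 107, 107] crops (assumed size)
        feat = self.fc(self.conv(x))
        return self.fc6(feat), self.fc7(feat)

# Example: dir_scores, pn_scores = DirectionScaleNet()(torch.zeros(1, 3, 107, 107))
```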
Samples in the training stage are selected by computing the overlap ratio. In order to make the network learn more target orientation information, the overlap-ratio threshold α dividing positive and negative samples in the training stage uses the relatively small value 0.6. As in formula (1), l denotes the positive/negative label of a sample and IoU denotes the overlap ratio; if the overlap ratio of gs and gt is greater than 0.6 the sample is judged to be positive, otherwise negative. The number of samples per frame is set to 200 positive samples and 50 negative samples. The convolutional-layer learning rate β1 is set to 0.0001 and the fully-connected-layer learning rate β2 is set to 0.001.
(2) step 2: azimuth information classification.The present invention proposes ten one kind orientation such as Fig. 3 of depth e-learning positive sample, It is broadly divided into upper left (D1), a left side (D2), lower-left (D3), under (D4), bottom right (D5), the right side (D6), upper right (D7), upper (D8), small scale (S1), large scale (S2), true sample (T).
In Fig. 3, [xl, xr, yu, yd] respectively denote the left, right, upper and lower boundaries of the constraint frame used for orientation division. The boundaries are determined by the original target size of the sample, as shown in formulas (2) and (3). Let gt = [xt, yt, wt, ht] be the ground truth of the current frame, where xt is the x-coordinate of the target position, yt is the y-coordinate of the target position, wt is the target width along the x-axis and ht is the target height along the y-axis.
In formula (2), (3), α is the scaling factor of boundary limitation.The affiliated class of sample such as Fig. 3 will by formula (2), (3) restricted boundary determines.
C=ρ (gs,[xl,xr,yu,yd],[wt,ht]) (4)
Formula (4) describes the decision mechanism for classifying a sample. gs = [xs, ys, ws, hs] is the position information of the sample, and c is the output class of the sample. ρ denotes the decision rule of the invention: the constraint frame [xl, xr, yu, yd] and the sample position information gs are used to judge the sample's position class, as shown in formula (5); the scale decision mechanism together with the width and height [wt, ht] of the ground truth is used to judge the scale-change class, as shown in formula (6).
The class of a sample is judged from the region into which its x- and y-coordinates fall, as described by formula (5). If the sample falls into the central region of the constraint frame, its scale change is judged instead, by comparing the sample's ws and hs with the original ground-truth wt and ht. Unlike the multi-scale sample input adopted by most traditional algorithms to judge the target's scale change, the present invention fuses the scale change into the sample classification and places the scale decision directly in the classification layer of the deep network output, making full use of the powerful learning ability of the deep convolutional neural network. The precondition for judging a sample's scale change is that the sample point falls in the middle region of the constraint frame, i.e. xl < xs < xr and yu < ys < yd. The scale of a sample is judged mainly by formula (6), where γ and λ denote the scale-change factors. A sample point that falls inside the constraint frame and has no size change is judged to be of class T.
The final classification effect on samples is shown in Fig. 4, where two samples of each class are shown. The original frame is shown as Frame1; the samples come from the birds2 video sequence of the VOT2015 data set. The true sample of the target is shown as Truth, and the class of each sample is indicated by its subscript in Fig. 4. It can be seen that the positive samples are divided into 11 classes, the final effect is limited by the constraint factor of the constraint frame, and the gap between the samples of each class is very small. For example, in S1-1 and S1-2 it is clearly visible that the birds2 target is larger than the sample, so a size-increasing sliding-window operation is needed on the sample, while S2-1 and S2-2 show that the target is smaller than the sample, so a size-reducing sliding-window operation is needed. The differences between the samples of each class are almost at the pixel level, but each sample carries the orientation tendency of its class. The final results show that a powerful neural network can fully learn such small changes and make reasonable judgements; the final, excellent tracking effect is obtained precisely through the small pixel-level change of each sliding-window step.
(3) Step 3: sliding-window operation. The class of a target sample obtained from the classification network is not yet the result that target tracking requires; a processing mechanism is needed that gradually approaches the true position of the target while obtaining sample information. The present invention gradually approaches the true position of the target using a sliding-window method.
The sample at the current location is fed into the orientation-information classification network to obtain the specific orientation of the current location relative to the target, and the current sample window is slid toward the target accordingly. With the sliding-window operation the network receives only one sample at a time, realizing single-sample network computation and improving the computation speed.
Each orientation class has a corresponding sliding-window strategy; the specific implementation is shown in formula (7).
In the formula, θ denotes the number of pixels moved by the sliding-window operation, and gs+1 denotes the next sample position to be taken. When the sample gs+1 fed into the network yields class T as output, i.e. it is closest to the target, the sliding-window operation stops and the tracking result of the current frame is obtained.
The sliding-window strategy is executed at most 15 times per frame; if the target is not found within 15 slides, the loss mechanism is used. Experiments show that for most frames the target can be found within about 10 slides. It is precisely because the sliding-window mechanism is added that a large number of samples does not have to be taken for each tracking step, which reduces the computational burden of the deep network and speeds up tracking.
Fig. 5 is the specific schematic diagram of the sliding window; the video sequence is Bird2 from OTB100, and the first, second, third and fourth rows respectively show the tracking of frames 3, 25, 37 and 73. The subscript of each picture label indicates the orientation class of the current sample. Taking the picture series in the first row of Fig. 5 as an example: the first picture of the row shows the current frame; the second is the sample taken at the result of the previous frame, which, fed into the network of the invention, is classified as D1; sliding according to D1 gives the second sample, which the network classifies as D2; sliding according to D2 gives the third sample, and so on, until the output of the sixth sample is judged to be class T, which becomes the tracking result of this frame. The sliding-window sampling of the other rows proceeds in the same way. As can be seen from the figure, the algorithm proposed herein generally obtains an accurate tracking result within about 10 slides.
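The per-frame procedure just described can be summarized in the sketch below. `crop` (which would cut and resize the image patch at a box) is a hypothetical helper, `slide` is the assumed update sketched earlier, and the class index order is an assumption; the 15-slide limit and the stop on class T come from the text.

```python
CLASSES = ["D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8", "S1", "S2", "T"]

def track_frame(net, frame, g_prev, max_slides=15):
    """Single-sample tracking of one frame: slide until class T or the limit is hit."""
    g = list(g_prev)                                  # start from the previous frame's result
    for _ in range(max_slides):
        dir_scores, _ = net(crop(frame, g))           # one sample per forward pass
        cls = CLASSES[int(dir_scores.argmax())]
        if cls == "T":                                # closest to the target: stop sliding
            return g, True
        g = slide(g, cls)                             # move/rescale the window toward the target
    return g, False                                   # not found within 15 slides: go to re-detection
```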
(4) step 4: on-line fine.The network of pre-training utilizes a large amount of video sequence, there is good Generalization Capability, but It is that still cannot obtain tracking effect well directly to specific video sequence using no specific aim.So in first frame Network weight is finely adjusted using the strategy as pre-training, so that network is targeted to the video sequence, the present invention 500 positive samples are chosen in first frame, 200 negative samples initialize network, and Duplication is set as 0.7.
The biggest challenge in target tracking is the deformation of the target: after the target has moved for a period of time, deformation more or less always appears, and at that point the information from the first frame can no longer fully describe the target itself, so updating the target online, i.e. fine-tuning the network online, is vital. In the online tracking phase of the present invention, whenever 20 frames have been continuously tracked, the target is resampled, taking 200 positive samples and 50 negative samples with the overlap ratio unchanged, and the network is fine-tuned; in this way the network's ability to handle various deformations is strengthened.
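A sketch of this update schedule is given below. `sample_around` (drawing labelled samples near the current result) is a hypothetical helper, and the number of fine-tuning iterations is an assumption; only the 20-frame interval, the 200/50 sample counts and the unchanged overlap threshold come from the text. `train_step` is the update sketched earlier.

```python
def maybe_finetune(net, optimizer, frame, g, frame_idx, interval=20, iters=10):
    """Every `interval` tracked frames, resample the target and fine-tune the network."""
    if frame_idx == 0 or frame_idx % interval != 0:
        return
    patches, dir_labels, pn_labels = sample_around(   # hypothetical sampler
        frame, g, n_pos=200, n_neg=50, iou_thr=0.7)   # overlap threshold as set in the first frame
    for _ in range(iters):                            # number of iterations is an assumption
        train_step(net, optimizer, patches, dir_labels, pn_labels)
```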
(5) step 5: loss detects again.Double sorter networks are exactly that the case where losing is tracked for processing target, positive and negative Sample classification layer positive sample score is less than certain threshold value and then judges that the positive sample has been lost, and tracking result is entirely background, this Shi Caiyong particle sampler strategy, default objects are all small-scale movements in consecutive frame, then utilize height in former frame target position The rule of this distribution samples present frame, and taking sample number is 300, and input network chooses positive sample classification layer score most High sample is the target position of present frame.
Evaluation criteria: the present invention measures the performance of the tracker by the OPE (one-pass evaluation) criterion, in which the tracker evaluates each video sequence in a single pass and the traditional one-pass accuracy and success rate are assessed. 58 video sequences with different attributes are chosen to test the target tracking method of the invention, and it is compared with 8 other trackers (CNT, CSK, HCF, KCF, RPT, SAMF, SRDCF and Staple) under different challenge factors such as out-of-plane rotation, scale change, illumination change, fast motion and occlusion. Fig. 6 shows sample frames of the tracking results of the present invention and the 8 trackers on the 6 test videos (a) Bird2, (b) Bolt, (c) Couple, (d) Football1, (e) Jogging and (f) MotorRolling. Fig. 7 gives the performance comparison between the present invention and the 8 other trackers in terms of precision and success rate. Fig. 8 compares the overall tracking performance of the present invention with the 8 trackers under OPE for the three challenge factors out-of-plane rotation, scale change and illumination change; from both precision and success rate it can be seen that the proposed algorithm performs very well. It can be seen that, compared with existing algorithms, the target tracking method provided by the invention significantly improves accuracy and gives a more stable tracking result.
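For reference, the OPE quantities plotted in Figs. 7 and 8 follow the standard OTB-style definitions, which the patent does not spell out: precision is the fraction of frames whose centre-location error is within a pixel threshold, and success is the fraction of frames whose overlap with the ground truth exceeds an overlap threshold (the success plot sweeps that threshold). The sketch below assumes the same [x, y, w, h] centre-based boxes and the `iou` helper from the earlier sketch.

```python
def precision(pred_boxes, gt_boxes, thr=20.0):
    """Fraction of frames with centre-location error <= thr pixels."""
    hits = sum(((p[0] - g[0]) ** 2 + (p[1] - g[1]) ** 2) ** 0.5 <= thr
               for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)

def success(pred_boxes, gt_boxes, thr=0.5):
    """Fraction of frames with IoU overlap > thr."""
    return sum(iou(p, g) > thr for p, g in zip(pred_boxes, gt_boxes)) / len(gt_boxes)
```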

Claims (7)

1. A target tracking method based on an orientation- and scale-discriminating deep network, characterized in that it comprises the following steps:
Step (1), pre-training the network;
Step (2), orientation-information classification;
Step (3), sliding-window operation;
Step (4), online fine-tuning;
Step (5), re-detection after loss.
2. The target tracking method based on an orientation- and scale-discriminating deep network according to claim 1, characterized in that: in step (1), a network framework of three convolutional layers and three fully connected layers is used, and the third fully connected layer uses a two-branch classification network.
3. The target tracking method based on an orientation- and scale-discriminating deep network according to claim 2, characterized in that: the training output of the classification network uses a softmax cross-entropy loss function, and the network weights are updated by stochastic gradient descent and back-propagation.
4. The target tracking method based on an orientation- and scale-discriminating deep network according to claim 3, characterized in that: the training samples are selected by computing the overlap ratio.
5. The target tracking method based on an orientation- and scale-discriminating deep network according to claim 1, characterized in that: in step (2), the class of a sample is judged according to its orientation relative to the target and its scale.
6. The target tracking method based on an orientation- and scale-discriminating deep network according to claim 1, characterized in that: in step (3), the true position of the target is gradually approached by a sliding window.
7. The target tracking method based on an orientation- and scale-discriminating deep network according to claim 1, characterized in that: in step (4), after a number of frames have been continuously tracked, the target is resampled and the network is fine-tuned.
CN201811403020.8A 2018-11-23 2018-11-23 Target tracking method based on an orientation- and scale-discriminating deep network Pending CN109801310A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811403020.8A CN109801310A (en) 2018-11-23 2018-11-23 Target tracking method based on an orientation- and scale-discriminating deep network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811403020.8A CN109801310A (en) 2018-11-23 2018-11-23 Target tracking method based on an orientation- and scale-discriminating deep network

Publications (1)

Publication Number Publication Date
CN109801310A true CN109801310A (en) 2019-05-24

Family

ID=66556371

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811403020.8A Pending CN109801310A (en) 2018-11-23 2018-11-23 Target tracking method based on an orientation- and scale-discriminating deep network

Country Status (1)

Country Link
CN (1) CN109801310A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111209813A (en) * 2019-12-27 2020-05-29 南京航空航天大学 Remote sensing image semantic segmentation method based on transfer learning
CN111274917A (en) * 2020-01-17 2020-06-12 江南大学 Long-term target tracking method based on depth detection
CN112150510A (en) * 2020-09-29 2020-12-29 中国人民解放军63875部队 Stepping target tracking method based on double-depth enhanced network
CN114613004A (en) * 2022-02-28 2022-06-10 电子科技大学 Lightweight online detection method for human body actions

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103295242A (en) * 2013-06-18 2013-09-11 南京信息工程大学 Multi-feature united sparse represented target tracking method
CN105069488A (en) * 2015-09-25 2015-11-18 南京信息工程大学 Tracking method based on template on-line clustering
CN106651921A (en) * 2016-11-23 2017-05-10 中国科学院自动化研究所 Motion detection method and moving target avoiding and tracking method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103295242A (en) * 2013-06-18 2013-09-11 南京信息工程大学 Multi-feature united sparse represented target tracking method
CN105069488A (en) * 2015-09-25 2015-11-18 南京信息工程大学 Tracking method based on template on-line clustering
CN106651921A (en) * 2016-11-23 2017-05-10 中国科学院自动化研究所 Motion detection method and moving target avoiding and tracking method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHAOHUA HU ET AL.: "Deep Directional Network for Object Tracking", 《ALGORITHMS》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111209813A (en) * 2019-12-27 2020-05-29 南京航空航天大学 Remote sensing image semantic segmentation method based on transfer learning
CN111274917A (en) * 2020-01-17 2020-06-12 江南大学 Long-term target tracking method based on depth detection
CN112150510A (en) * 2020-09-29 2020-12-29 中国人民解放军63875部队 Stepping target tracking method based on double-depth enhanced network
CN112150510B (en) * 2020-09-29 2024-03-26 中国人民解放军63875部队 Stepping target tracking method based on dual-depth enhancement network
CN114613004A (en) * 2022-02-28 2022-06-10 电子科技大学 Lightweight online detection method for human body actions

Similar Documents

Publication Publication Date Title
CN109543606B (en) Human face recognition method with attention mechanism
WO2020108362A1 (en) Body posture detection method, apparatus and device, and storage medium
CN109801310A (en) Target tracking method based on an orientation- and scale-discriminating deep network
CN109816689A (en) A kind of motion target tracking method that multilayer convolution feature adaptively merges
CN107481264A (en) A kind of video target tracking method of adaptive scale
Xiao et al. Robust multipose face detection in images
CN110458059B (en) Gesture recognition method and device based on computer vision
CN108171112A (en) Vehicle identification and tracking based on convolutional neural networks
CN109858406A (en) A kind of extraction method of key frame based on artis information
CN107767405A (en) A kind of nuclear phase for merging convolutional neural networks closes filtered target tracking
CN106960206A (en) Character identifying method and character recognition system
CN108510521A (en) A kind of dimension self-adaption method for tracking target of multiple features fusion
CN109859241B (en) Adaptive feature selection and time consistency robust correlation filtering visual tracking method
CN107103326A (en) The collaboration conspicuousness detection method clustered based on super-pixel
CN108647694A (en) Correlation filtering method for tracking target based on context-aware and automated response
CN108647654A (en) The gesture video image identification system and method for view-based access control model
CN110263712A (en) A kind of coarse-fine pedestrian detection method based on region candidate
CN113763424B (en) Real-time intelligent target detection method and system based on embedded platform
Nguyen et al. Yolo based real-time human detection for smart video surveillance at the edge
CN109087337B (en) Long-time target tracking method and system based on hierarchical convolution characteristics
CN108921011A (en) A kind of dynamic hand gesture recognition system and method based on hidden Markov model
CN108564598A (en) A kind of improved online Boosting method for tracking target
Putro et al. High performance and efficient real-time face detector on central processing unit based on convolutional neural network
Sheng et al. Robust visual tracking via an improved background aware correlation filter
Engoor et al. Occlusion-aware dynamic human emotion recognition using landmark detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 210044 No. 219 Ningliu Road, Jiangbei New District, Nanjing City, Jiangsu Province
Applicant after: Nanjing University of Information Science and Technology
Address before: 211500 Yuting Square, 59 Wangqiao Road, Liuhe District, Nanjing City, Jiangsu Province
Applicant before: Nanjing University of Information Science and Technology
RJ01 Rejection of invention patent application after publication
Application publication date: 20190524