CN108038435A - Feature extraction and target tracking method based on a convolutional neural network - Google Patents

Feature extraction and target tracking method based on a convolutional neural network

Info

Publication number
CN108038435A
Authority
CN
China
Prior art keywords
network
target
training
feature extraction
tracking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711262806.8A
Other languages
Chinese (zh)
Other versions
CN108038435B (en)
Inventor
纪庆革
李凝
马天俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
National Sun Yat Sen University
Original Assignee
National Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Sun Yat Sen University filed Critical National Sun Yat Sen University
Priority to CN201711262806.8A priority Critical patent/CN108038435B/en
Publication of CN108038435A publication Critical patent/CN108038435A/en
Application granted granted Critical
Publication of CN108038435B publication Critical patent/CN108038435B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/29 Graphical models, e.g. Bayesian networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present invention discloses a feature extraction and target tracking method based on a convolutional neural network. First, offline pre-training improves the network's performance in feature extraction and foreground segmentation; the calibrated first frame of the video sequence is then fed into the network for online training, fine-tuning the parameters of the network model and thereby improving the convolutional neural network's ability to handle the specific problem. Adding a random two-dimensional mask over multiple iterations both improves the prediction accuracy of the network and avoids overfitting. Multi-scale selectivity further allows the method to adaptively choose the scale of the target during tracking. For network parameter updating, a threshold is set and the model is updated in real time, improving the precision and robustness of the target tracking method.

Description

Feature extraction and target tracking method based on a convolutional neural network
Technical field
The present invention relates to the field of computer vision, and more particularly to a feature extraction and target tracking method based on a convolutional neural network.
Background technology
Target tracking is an important component of the field of computer vision. Its basic task is: given a video sequence with the initial state of the target (e.g. position and size) calibrated in the first frame, automatically estimate the state of the target in the subsequent frames through a series of algorithms. Current tracking algorithms can be divided into generative (generative model) and discriminative (discriminative model) methods. Generative methods describe the appearance of the target with a generative model and search for the candidate target that minimizes the reconstruction error; representative algorithms include sparse coding, online density estimation, and principal component analysis (PCA). Discriminative methods distinguish target from background with a trained classifier and track the target in real time; common algorithms include multiple instance learning and structured SVM. Compared with generative methods, discriminative methods separate the foreground and background information in the image more clearly and exhibit more robust features, so they have become the current mainstream; most deep learning methods also adopt the discriminative framework.
When extracting image features, conventional methods often use hand-designed features, such as SIFT (Scale-Invariant Feature Transform), HOG (Histogram of Oriented Gradients), and LBP (Local Binary Patterns). Hand-crafted feature extraction relies heavily on the prior knowledge of the designer, makes it difficult to exploit the advantages of big data, and is limited by many factors in the quality of the extracted features. The biggest difference between deep learning methods and conventional methods is that, through training on massive data, a neural network can learn features automatically. When the training data is sufficiently large, the features obtained by deep learning are far superior to hand-crafted features.
In recent years, with the continuous development of deep learning, deep learning methods have shown outstanding performance in key areas of computer vision such as object detection and recognition, but only mediocre performance in target tracking. The main reason is that the training set provided in target tracking is too small, usually consisting only of the initial state calibrated in the first frame of the video sequence, which makes it hard to exploit the advantages of deep learning; deep learning methods therefore still have large room for improvement in tracking efficiency and accuracy.
Summary of the invention
To overcome at least one of the above defects of the prior art, the present invention provides a feature extraction and target tracking method based on a convolutional neural network. First, offline pre-training improves the network's performance in feature extraction and foreground segmentation; the first frame of the video sequence, calibrated with groundtruth, is then fed into the network for online training, fine-tuning the parameters of the network model and thereby improving the convolutional neural network's ability to handle the specific problem. Adding a random two-dimensional mask over multiple iterations both improves the prediction accuracy of the network and avoids overfitting. Multi-scale selectivity further allows the method to adaptively choose the scale of the target during tracking. For network parameter updating, a threshold is set and the model is updated in real time, improving the precision and robustness of the target tracking method.
In order to solve the above technical problems, the technical solution of the present invention is as follows:
A feature extraction and target tracking method based on a convolutional neural network, comprising the following steps:
S1: Build and pre-train the network model;
S2: Train the network model online on the video sequence;
S3: Input the video sequence and compute the tracking result;
S4: Evaluate the tracking result of the previous frame of the video sequence, and select positive-sample results to feed into the network for iteration to update the network parameters. When updating the network, a threshold θ is set: when L_s is below the threshold, the result is fed into the network for learning and the network parameters are updated; when L_s exceeds the threshold, the tracking result is retained and the network is not updated.
In a preferred scheme, step S1 can be divided into the following three steps:
S11: Obtain the dataset for training the foreground segmentation network and the video sequences used for target tracking;
S12: Build the network model required for foreground segmentation and initialize the network model parameters;
S13: Train the network model on the data in the foreground segmentation dataset until the result converges.
In a preferred scheme, in step S11 the foreground segmentation module of the convolutional network is pre-trained offline on a training set; the dataset used to train the foreground segmentation network is the ILSVRC 2014 Object Detection Dataset, available from ImageNet. This dataset is a standard dataset for object recognition, containing tens of thousands of photos of more than 200 different categories of objects and covering interference factors such as occlusion, rotation, and deformation. The dataset used for testing is OTB50, a commonly used target tracking benchmark containing 50 video sequences of different lengths. These video sequences are shot in different scenes and cover tracking under fast motion, occlusion, rotation, illumination change, deformation, and other conditions; the video sequences used for target tracking are Object Tracking Benchmark 50.
In a preferred scheme, the network model required for foreground segmentation built in step S12 is a feature extraction network comprising 11 convolutional layers; its parameters are shown in Table 1.
In a preferred scheme, images from the ILSVRC 2014 Object Detection Dataset, with their bounding boxes as labels, are fed into the feature extraction network for training until the result converges. The loss function of the network is defined over p_xy, the predicted value at (x, y), and t_xy, the true value at (x, y). The method derives the training label from the bounding box location of the target in the image: t_xy is set to 1 inside the bounding box and 0 elsewhere.
Table 1: Parameters of the feature extraction network with 11 convolutional layers
In a preferred scheme, step S2 can be divided into the following two steps:
S21: Feed the first frame of the video sequence into the network and fine-tune the network model using the tracking target position provided by the groundtruth;
S22: Extract the feature extraction part and its parameters from the trained network, reconstruct the target localization network, feed the first frame of the video sequence into the reconstructed network, and iterate repeatedly to train the network model.
Based on the training result of step S1, step S2 retains Conv4 and the eight convolutional layers preceding it, together with their model parameters, from the feature extraction network, and reconstructs the target localization network. The output of Conv4 serves as the foreground extraction result for the original image and is fed into a network comprising two convolutional layers to obtain the center of the tracked target. Its parameters are shown in Table 2:
Table 2: Parameters of the network comprising two convolutional layers
The first frame of an OTB50 video sequence serves as the training sample, with the two-dimensional Gaussian distribution over the groundtruth calibration range as the label, and is fed into the present network for training. During training, a random binary mask is applied to the sample over multiple iterations, improving the robustness of the network. The loss function of the network is defined over G_0, the standard two-dimensional Gaussian distribution within the groundtruth calibration range; ε, the number of channels; X_i, the distribution output of the i-th channel; and γ_i, the multi-channel parameters, which comprise the convolution kernel parameters, bias terms, and other parameters. After multiple iterations, the loss function of the network slowly converges.
In a preferred scheme, step S3 can be divided into the following two steps:
S31: Select the frames of the video sequence in order; taking the target position in the previous frame as reference, determine the approximate range containing the tracked target and feed it into the target tracking network; from the probability distribution matrix obtained after processing, determine the center position of the currently tracked target;
S32: Centered on the currently determined target center position, compute the confidence value for each scale in the multi-scale space, choose the scale with the maximum response as the scale of the tracked target, and update the multi-scale space.
In a preferred scheme, in step S3 the range containing the target is chosen relative to (x_0, y_0), the top-left coordinate of the groundtruth calibration position, w, the width of the bounding box, and h, the height of the bounding box.
In a preferred scheme, when determining the target scale in step S3, the confidence value is computed over S, the range calibrated by the scale. When the current scale cannot obtain a higher value, it is updated.
Compared with the prior art, the beneficial effects of the technical solution of the present invention are as follows. In the feature extraction and target tracking method based on a convolutional neural network, offline pre-training first improves the network's performance in feature extraction and foreground segmentation; the first frame of the video sequence, calibrated with groundtruth, is then fed into the network for online training, fine-tuning the parameters of the network model and improving the convolutional neural network's ability to handle the specific problem. Extracting image features with a convolutional neural network makes the extracted features more robust while also characterizing target detail better; performing convolution on the image with multiple layers of small convolution kernels improves the completeness of the extracted features; adding a random binary mask during online training and iterating repeatedly over the images in the training set both improves the robustness of the network and effectively avoids overfitting; and presetting a threshold during tracking and updating the network model in real time improves the tracking accuracy.
Brief description of the drawings
Fig. 1 is a flow diagram of the steps of Embodiment 1 of the present invention.
Fig. 2 is an overall flow diagram of Embodiment 1 of the present invention.
Detailed description of the embodiments
The accompanying drawings are for illustration only and shall not be construed as limiting this patent;
In order to better illustrate this embodiment, some components in the drawings are omitted, enlarged, or reduced, and do not represent the size of the actual product;
Those skilled in the art will understand that some known structures and their descriptions may be omitted from the drawings.
The technical solution of the present invention is further described below with reference to the accompanying drawings and embodiments.
Embodiment 1
As shown in Fig. 1, a feature extraction and target tracking method based on a convolutional neural network comprises the following steps:
S1: Build and pre-train the network model;
S2: Train the network model online on the video sequence;
S3: Input the video sequence and compute the tracking result;
S4: Evaluate the tracking result of the previous frame of the video sequence, and select positive-sample results to feed into the network for iteration to update the network parameters.
In a specific implementation, step S1 can be divided into the following three steps:
S11: Obtain the dataset for training the foreground segmentation network and the video sequences used for target tracking;
S12: Build the network model required for foreground segmentation and initialize the network model parameters;
S13: Train the network model on the data in the foreground segmentation dataset until the result converges.
In a specific implementation, step S2 can be divided into the following two steps:
S21: Feed the first frame of the video sequence into the network and fine-tune the network model using the tracking target position provided by the groundtruth;
S22: Extract the feature extraction part and its parameters from the trained network, reconstruct the target localization network, feed the first frame of the video sequence into the reconstructed network, and iterate repeatedly to train the network model.
In a specific implementation, step S3 can be divided into the following two steps:
S31: Select the frames of the video sequence in order; taking the target position in the previous frame as reference, determine the approximate range containing the tracked target and feed it into the target tracking network; from the probability distribution matrix obtained after processing, determine the center position of the currently tracked target;
S32: Centered on the currently determined target center position, compute the confidence value for each scale in the multi-scale space, choose the scale with the maximum response as the scale of the tracked target, and update the multi-scale space.
In a specific implementation, S11: obtain the training set and test set. The training set used is the ILSVRC 2014 Object Detection Dataset, a standard dataset for object recognition containing tens of thousands of photos of more than 200 different categories of objects and covering interference factors such as occlusion, rotation, and deformation. The dataset used for testing is OTB50, a commonly used target tracking benchmark containing 50 video sequences of different lengths, shot in different scenes and covering tracking under fast motion, occlusion, rotation, illumination change, deformation, and other conditions.
In a specific implementation, S12: build the foreground segmentation network model. To improve the performance of the convolutional neural network in feature extraction (foreground segmentation), this method constructs an 11-layer foreground segmentation network and randomly initializes the network model; its specific parameters are shown in Table 1. The convolutional layers all use small 3x3 convolution kernels, which improves the network's ability to extract detailed features while also greatly reducing the number of model parameters and speeding up the convolution operations.
Table 1: Parameters of the feature extraction network with 11 convolutional layers
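Table 1 is reproduced in the original only as an image, so the exact layer widths are not recoverable here. The following PyTorch sketch shows one plausible layout consistent with the description: 11 convolutional layers, all with 3x3 kernels, mapping a 280x280 input to a 35x35 single-channel output (overall stride 8, matching the 8x8-pixels-per-cell correspondence stated below). The channel counts and the placement of the stride-2 layers are assumptions.

```python
import torch
import torch.nn as nn

def conv3x3(c_in, c_out, stride=1):
    # 3x3 kernel, padding 1: spatial size is kept when stride == 1
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1),
        nn.ReLU(inplace=True),
    )

class ForegroundSegNet(nn.Module):
    # Hypothetical 11-conv-layer foreground segmentation network: a 280x280
    # input is reduced to a 35x35 single-channel score map (overall stride 8,
    # one output cell per 8x8 pixel block of the input).
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv3x3(3, 32), conv3x3(32, 32), conv3x3(32, 64, stride=2),   # 140x140
            conv3x3(64, 64), conv3x3(64, 128, stride=2),                  # 70x70
            conv3x3(128, 128), conv3x3(128, 256, stride=2),               # 35x35
            conv3x3(256, 256), conv3x3(256, 256),                         # "Conv4" block
            conv3x3(256, 128),
        )
        self.head = nn.Conv2d(128, 1, kernel_size=3, padding=1)           # 11th layer

    def forward(self, x):
        return torch.sigmoid(self.head(self.features(x)))

print(ForegroundSegNet()(torch.randn(1, 3, 280, 280)).shape)  # [1, 1, 35, 35]
```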
In a specific implementation, S13: train the network model. Offline training uses the tens of thousands of pictures in the training set, covering 200 classes in total, to improve the performance of the network model. To prevent the overfitting that a large amount of training data can bring, some negative samples, e.g. pure background images, are also added during training, improving the robustness of the network model.
During training, according to the groundtruth information provided in the dataset, a pixel block covering the range to be detected, centered on the center of the groundtruth calibration, is extracted and fed into the foreground segmentation network. The size of the block to be detected is determined by (x_0, y_0), the top-left coordinate of the groundtruth calibration position, w, the width of the bounding box, and h, the height of the bounding box; all blocks to be detected are at most 280x280 pixels. After the block passes through the convolutional layers, a 35x35 feature matrix is output, in which each value represents the extraction result for an 8x8 pixel region of the original image. The labels used in training are set manually: taking the region outlined by the bounding box as the target region, a 0-1 matrix of size 35x35 is constructed, the values corresponding to the bounding box region are set to 1, and all other regions are set to 0. The loss function of the foreground segmentation network is defined over p_xy, the predicted value at (x, y), and t_xy, the true value at (x, y). After extensive training, the network model gradually converges.
In a specific implementation, S21: fine-tune the network model. The foreground segmentation network is fine-tuned by online training: the first frame, with its groundtruth, is fed into the foreground segmentation network as a training sample and iterated 50 times, fine-tuning the network model and improving the foreground segmentation performance in the specific scene.
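A minimal sketch of this fine-tuning step, reusing the hypothetical network and loss sketched above; the 50 iterations on the single calibrated first frame are from the text, while the optimizer and learning rate are assumptions.

```python
import torch

def finetune_first_frame(net, frame, label, loss_fn, iters=50, lr=1e-4):
    # Fine-tune the pre-trained foreground segmentation network on the single
    # groundtruth-calibrated first frame; 50 iterations per the text.
    opt = torch.optim.SGD(net.parameters(), lr=lr, momentum=0.9)
    for _ in range(iters):
        opt.zero_grad()
        loss = loss_fn(net(frame)[0, 0], label)   # e.g. the per-pixel loss above
        loss.backward()
        opt.step()
    return net
```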
In a specific implementation, S22: reconstruct the target localization network and train the network model parameters. Conv4 of the foreground segmentation network and the network structure preceding it are taken over with their trained model parameters retained, and the last two layers of the network model are randomly initialized; their parameters are shown in Table 2:
Table 2: Parameters of the network comprising two convolutional layers
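Table 2 is likewise reproduced only as an image. The sketch below assumes a simple two-convolutional-layer head on the retained Conv4 feature map; the channel counts are assumptions.

```python
import torch.nn as nn

class LocalizationHead(nn.Module):
    # Hypothetical two-conv-layer target localization head: takes the
    # retained Conv4 feature map and produces a single-channel response
    # map whose peak marks the target center.
    def __init__(self, c_in=256):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(c_in, 64, 3, padding=1),
                                   nn.ReLU(inplace=True))
        self.conv2 = nn.Conv2d(64, 1, 3, padding=1)

    def forward(self, feat):
        return self.conv2(self.conv1(feat))
```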
The input of the target localization network is the same as that of the foreground segmentation network: a block to be detected of 280x280 pixels. During training, the first frame of the video sequence is selected as the training set, with the standard two-dimensional Gaussian distribution over the groundtruth calibration region as the label. The probability density function of the label is

G_0(x, y) = (1 / (2π σ_1 σ_2)) exp(−((x − μ_1)² / (2σ_1²) + (y − μ_2)² / (2σ_2²)))

where (μ_1, μ_2) is the center point coordinate of the groundtruth calibration and σ_1 = σ_2 = 1.
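The density is fixed by the text (a standard two-dimensional Gaussian centered on (μ_1, μ_2) with σ_1 = σ_2 = 1), so a label matrix can be generated as follows; the constant normalization factor is omitted since only the shape of the distribution matters for a label.

```python
import torch

def gaussian_label(grid=35, center=(17.0, 17.0), sigma=1.0):
    # Two-dimensional Gaussian label centered on the groundtruth center
    # (mu1, mu2) with sigma1 = sigma2 = 1; the 1/(2*pi*sigma1*sigma2)
    # normalization constant is dropped.
    mu1, mu2 = center
    xs = torch.arange(grid, dtype=torch.float32).view(1, -1)   # columns
    ys = torch.arange(grid, dtype=torch.float32).view(-1, 1)   # rows
    d2 = (xs - mu1) ** 2 + (ys - mu2) ** 2
    return torch.exp(-d2 / (2.0 * sigma ** 2))

G0 = gaussian_label()   # peak value 1.0 at the calibrated center cell
```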
The loss function of the target localization network defined during training involves G_0, the standard two-dimensional Gaussian distribution within the groundtruth calibration range; ε, the number of channels; X_i, the distribution output of the i-th channel; and γ_i, the multi-channel parameters, which comprise the convolution kernel parameters, bias terms, and other parameters.
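The loss formula is reproduced only as an image; one plausible reading of the definitions above is an average over the ε channels of the squared error between each channel output X_i and the Gaussian label G_0, sketched below purely as an assumption.

```python
def localization_loss(X, G0):
    # Assumed form: average over the eps channels of the squared error
    # between each channel output X_i and the Gaussian label G0; the
    # patent's exact formula is not reproduced in the source.
    eps = X.shape[0]   # number of channels
    return sum(((X[i] - G0) ** 2).mean() for i in range(eps)) / eps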
To address the extreme scarcity of training samples, a random binary mask operation is added to the convolution operation. Here F_c is the convolutional layer output, i.e. a feature image of K channels; ω_c is the convolution kernel; X_k is the output of the previous convolutional layer; and b_c is the bias term. Assuming the input image has c channels, M_c is a c-channel binary mask whose parameters are randomly initialized. After the random binary mask operation is added, the effective training set for online training of the target localization network grows greatly, improving the robustness of the network. Experiments show that in online training, after about 150 iterations the loss function of the network gradually converges and the target localization accuracy peaks.
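The masked-convolution formula is also reproduced only as an image. One plausible reading, sketched below as an assumption, draws a fresh random binary mask M_c each iteration and applies it elementwise to the previous layer's output before convolving, so that every pass over the single training frame sees a differently occluded view; the keep probability p_keep is an assumed parameter.

```python
import torch
import torch.nn.functional as F

def masked_conv(x, weight, bias, p_keep=0.8):
    # x: previous-layer output X_k of shape (N, C, H, W). A fresh random
    # binary mask is drawn on every call, so repeated iterations over the
    # same training frame each see a differently occluded view.
    mask = (torch.rand_like(x) < p_keep).float()   # random binary mask M_c
    return F.conv2d(x * mask, weight, bias, padding=weight.shape[-1] // 2)
```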
In a specific implementation, S31: read in the video sequence and determine the tracked target's position in the current frame. Each image of the video sequence is fed into the target localization network; after a series of convolution operations, the coordinate of the maximum probability in the returned map gives the corresponding target position.
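A minimal sketch of this step under the assumptions above: crop a detection block around the previous center (the exact range formula is not reproduced; the text elsewhere caps blocks at 280x280), run the localization network, and map the peak of the 35x35 response back to pixel coordinates.

```python
import torch

def locate(net, frame, prev_center, block=280, cell=8):
    # frame: (3, H, W) float tensor; prev_center: integer (cx, cy) in pixels.
    cx, cy = prev_center
    x0 = max(0, min(frame.shape[2] - block, cx - block // 2))
    y0 = max(0, min(frame.shape[1] - block, cy - block // 2))
    crop = frame[:, y0:y0 + block, x0:x0 + block].unsqueeze(0)
    resp = net(crop)[0, 0]                        # 35x35 probability map
    ry, rx = divmod(torch.argmax(resp).item(), resp.shape[1])
    # map the peak cell back to pixels: each cell covers an 8x8 region
    return x0 + rx * cell + cell // 2, y0 + ry * cell + cell // 2
```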
In a specific implementation, S32: adaptive scale selection. To improve tracking accuracy under multi-scale changes, this method proposes an adaptive scale method that determines the scale of the tracked target in the current frame. A scale-change vector with 7 degrees of freedom is preset, with the following parameters:
Scale_step = 1.02
Scale_sigma = 1.4361
Taking the scale of the previous frame as the baseline, a scale-change vector is built with Scale_step as the step size, and the scale with the maximum corresponding confidence value is taken, where S is the range calibrated by the scale. The threshold here is set to 0.6: when the maximum confidence value in the scale-change vector is below the threshold, the scale-change vector is updated with the Scale_sigma step size.
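A sketch of the scale-selection step. The 7-degree-of-freedom vector, Scale_step = 1.02, Scale_sigma = 1.4361, and the 0.6 threshold are from the text; the confidence function over the calibrated range S is not reproduced, so it is passed in here as a callable, and rebuilding the vector with the Scale_sigma step when the best confidence falls below the threshold is one plausible reading of the update rule.

```python
def select_scale(confidence, prev_scale, step=1.02, sigma=1.4361, thr=0.6):
    # confidence: callable mapping a scale factor to its confidence value
    # over the calibrated range S (formula not reproduced in the source).
    scales = [prev_scale * step ** k for k in range(-3, 4)]    # 7 dof
    best = max(scales, key=confidence)
    if confidence(best) < thr:
        # best confidence below 0.6: rebuild the vector with Scale_sigma
        scales = [prev_scale * sigma ** k for k in range(-3, 4)]
        best = max(scales, key=confidence)
    return best, confidence(best)
```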
In a specific implementation, S4: evaluate the tracking result and update the network model. To improve the target localization accuracy, this method updates the network model parameters in real time during target tracking. The threshold is chosen as θ = L_θ = 0.2: when the loss function of the current frame is below the threshold, the random binary mask is added and the model is updated over multiple iterations, with the specific steps the same as in S22; otherwise the tracking result is retained and the network model is not updated.
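The update gate of this step reduces to a few lines; θ = 0.2 is from the text, and loss_fn/update_fn stand in for the loss evaluation and the mask-based iteration of S22.

```python
THETA = 0.2   # L_theta from the text

def maybe_update(net, frame, result, loss_fn, update_fn):
    # Update gate of step S4: learn from the frame only when its loss marks
    # the result as a reliable positive sample; otherwise keep the result
    # but leave the model untouched.
    if loss_fn(net, frame, result) < THETA:
        update_fn(net, frame, result)   # mask-based iteration, as in S22
    return result
```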
As shown in Fig. 2, the network model is first trained offline on a large training set, and the first frame of the video sequence is fed into the network for iterative fine-tuning, improving the foreground segmentation performance of the neural network; the network is then recombined with two convolutional layers to localize the target position, and finally, after adding the adaptive scale change, the state of the tracked target in the current frame is obtained.
The same or similar reference labels correspond to the same or similar components;
The positional relationships described in the drawings are for illustration only and shall not be construed as limiting this patent;
Obviously, the above embodiment of the present invention is merely an example given to clearly illustrate the present invention and is not a limitation on the embodiments of the present invention. For those of ordinary skill in the art, other variations or changes in different forms can be made on the basis of the above description. It is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (7)

1. A feature extraction and target tracking method based on a convolutional neural network, characterized in that it comprises the following steps:
S1: Build and pre-train the network model;
S2: Train the network model online on the video sequence;
S3: Input the video sequence and compute the tracking result;
S4: Evaluate the tracking result of the previous frame of the video sequence, and select positive-sample results to feed into the network for iteration to update the network parameters.
2. The feature extraction and target tracking method based on a convolutional neural network according to claim 1, characterized in that step S1 can be divided into the following three steps:
S11: Obtain the dataset for training the foreground segmentation network and the video sequences used for target tracking;
S12: Build the network model required for foreground segmentation and initialize the network model parameters;
S13: Train the network model on the data in the foreground segmentation dataset until the result converges.
3. The feature extraction and target tracking method based on a convolutional neural network according to claim 2, characterized in that in step S11 the foreground segmentation module of the convolutional network is pre-trained offline on a training set; the dataset used to train the foreground segmentation network is the ILSVRC 2014 Object Detection Dataset; the video sequences used for target tracking are Object Tracking Benchmark 50.
4. The feature extraction and target tracking method based on a convolutional neural network according to claim 2, characterized in that the network model required for foreground segmentation built in step S12 is a feature extraction network comprising 11 convolutional layers.
5. The feature extraction and target tracking method based on a convolutional neural network according to claim 3, characterized in that images from the ILSVRC 2014 Object Detection Dataset, with their bounding boxes as labels, are fed into the feature extraction network for training until the result converges.
6. The feature extraction and target tracking method based on a convolutional neural network according to claim 3, characterized in that step S2 can be divided into the following two steps:
S21: Feed the first frame of the video sequence into the network and fine-tune the network model using the tracking target position provided by the groundtruth;
S22: Extract the feature extraction part and its parameters from the trained network, reconstruct the target localization network, feed the first frame of the video sequence into the reconstructed network, and iterate repeatedly to train the network model.
7. The feature extraction and target tracking method based on a convolutional neural network according to claim 6, characterized in that step S3 can be divided into the following two steps:
S31: Select the frames of the video sequence in order; taking the target position in the previous frame as reference, determine the approximate range containing the tracked target and feed it into the target tracking network; from the probability distribution matrix obtained after processing, determine the center position of the currently tracked target;
S32: Centered on the currently determined target center position, compute the confidence value for each scale in the multi-scale space, choose the scale with the maximum response as the scale of the tracked target, and update the multi-scale space.
CN201711262806.8A 2017-12-04 2017-12-04 Feature extraction and target tracking method based on convolutional neural network Active CN108038435B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711262806.8A CN108038435B (en) 2017-12-04 2017-12-04 Feature extraction and target tracking method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711262806.8A CN108038435B (en) 2017-12-04 2017-12-04 Feature extraction and target tracking method based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN108038435A true CN108038435A (en) 2018-05-15
CN108038435B CN108038435B (en) 2022-01-04

Family

ID=62095161

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711262806.8A Active CN108038435B (en) 2017-12-04 2017-12-04 Feature extraction and target tracking method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN108038435B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109064493A (en) * 2018-08-01 2018-12-21 北京飞搜科技有限公司 A kind of method for tracking target and device based on meta learning
CN109446804A (en) * 2018-09-27 2019-03-08 桂林电子科技大学 A kind of intrusion detection method based on Analysis On Multi-scale Features connection convolutional neural networks
CN109671102A (en) * 2018-12-03 2019-04-23 华中科技大学 A kind of composite type method for tracking target based on depth characteristic fusion convolutional neural networks
CN109829936A (en) * 2019-01-29 2019-05-31 青岛海信网络科技股份有限公司 A kind of method and apparatus of target tracking
CN109840477A (en) * 2019-01-04 2019-06-04 苏州飞搜科技有限公司 Face identification method and device are blocked based on eigentransformation
CN110827292A (en) * 2019-10-23 2020-02-21 中科智云科技有限公司 Video instance segmentation method and device based on convolutional neural network
CN111325338A (en) * 2020-02-12 2020-06-23 暗物智能科技(广州)有限公司 Neural network structure evaluation model construction and neural network structure search method
CN112037173A (en) * 2020-08-04 2020-12-04 湖南自兴智慧医疗科技有限公司 Chromosome detection method and device and electronic equipment
CN112489056A (en) * 2020-12-01 2021-03-12 叠境数字科技(上海)有限公司 Real-time human body matting method suitable for mobile terminal
CN116468984A (en) * 2023-03-10 2023-07-21 衡阳师范学院 Construction method of movable object detection model, detection model and detection method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106204642A (en) * 2016-06-29 2016-12-07 四川大学 A kind of cell tracker method based on deep neural network
CN106709936A (en) * 2016-12-14 2017-05-24 北京工业大学 Single target tracking method based on convolution neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106204642A (en) * 2016-06-29 2016-12-07 四川大学 A kind of cell tracker method based on deep neural network
CN106709936A (en) * 2016-12-14 2017-05-24 北京工业大学 Single target tracking method based on convolution neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIJUN WANG ET AL: "STCT: Sequentially Training Convolutional Networks for Visual Tracking", 《2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *
NAIYAN WANG ET AL: "Transferring Rich Feature Hierarchies for Robust Visual Tracking", 《ARXIV:1501.04587V2 [CS.CV]》 *
袁大龙 et al.: "Multi-object tracking algorithm with collaborative motion state estimation", 《计算机科学》 (Computer Science) *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109064493A (en) * 2018-08-01 2018-12-21 北京飞搜科技有限公司 A kind of method for tracking target and device based on meta learning
CN109446804A (en) * 2018-09-27 2019-03-08 桂林电子科技大学 A kind of intrusion detection method based on Analysis On Multi-scale Features connection convolutional neural networks
CN109446804B (en) * 2018-09-27 2022-02-01 桂林电子科技大学 Intrusion detection method based on multi-scale feature connection convolutional neural network
CN109671102A (en) * 2018-12-03 2019-04-23 华中科技大学 A kind of composite type method for tracking target based on depth characteristic fusion convolutional neural networks
CN109671102B (en) * 2018-12-03 2021-02-05 华中科技大学 Comprehensive target tracking method based on depth feature fusion convolutional neural network
CN109840477A (en) * 2019-01-04 2019-06-04 苏州飞搜科技有限公司 Face identification method and device are blocked based on eigentransformation
CN109829936B (en) * 2019-01-29 2021-12-24 青岛海信网络科技股份有限公司 Target tracking method and device
CN109829936A (en) * 2019-01-29 2019-05-31 青岛海信网络科技股份有限公司 A kind of method and apparatus of target tracking
CN110827292A (en) * 2019-10-23 2020-02-21 中科智云科技有限公司 Video instance segmentation method and device based on convolutional neural network
CN111325338A (en) * 2020-02-12 2020-06-23 暗物智能科技(广州)有限公司 Neural network structure evaluation model construction and neural network structure search method
CN112037173A (en) * 2020-08-04 2020-12-04 湖南自兴智慧医疗科技有限公司 Chromosome detection method and device and electronic equipment
CN112037173B (en) * 2020-08-04 2024-04-05 湖南自兴智慧医疗科技有限公司 Chromosome detection method and device and electronic equipment
CN112489056A (en) * 2020-12-01 2021-03-12 叠境数字科技(上海)有限公司 Real-time human body matting method suitable for mobile terminal
CN116468984A (en) * 2023-03-10 2023-07-21 衡阳师范学院 Construction method of movable object detection model, detection model and detection method
CN116468984B (en) * 2023-03-10 2023-10-27 衡阳师范学院 Construction method of movable object detection model, detection model and detection method

Also Published As

Publication number Publication date
CN108038435B (en) 2022-01-04

Similar Documents

Publication Publication Date Title
CN108038435A (en) A kind of feature extraction and method for tracking target based on convolutional neural networks
CN110738207B (en) Character detection method for fusing character area edge information in character image
Lee et al. Deep saliency with encoded low level distance map and high level features
CN110176027A (en) Video target tracking method, device, equipment and storage medium
CN107103326B (en) Collaborative significance detection method based on super-pixel clustering
CN105389550B (en) It is a kind of based on sparse guide and the remote sensing target detection method that significantly drives
CN107833213B (en) Weak supervision object detection method based on false-true value self-adaptive method
CN110084249A (en) The image significance detection method paid attention to based on pyramid feature
CN110569738B (en) Natural scene text detection method, equipment and medium based on densely connected network
CN107633226B (en) Human body motion tracking feature processing method
CN110334762B (en) Feature matching method based on quad tree combined with ORB and SIFT
CN110175613A (en) Street view image semantic segmentation method based on Analysis On Multi-scale Features and codec models
CN106570874B (en) Image marking method combining image local constraint and object global constraint
CN111008618B (en) Self-attention deep learning end-to-end pedestrian re-identification method
CN109101981B (en) Loop detection method based on global image stripe code in streetscape scene
CN106778852A (en) A kind of picture material recognition methods for correcting erroneous judgement
CN105868706A (en) Method for identifying 3D model based on sparse coding
CN113223042B (en) Intelligent acquisition method and equipment for remote sensing image deep learning sample
CN105046714A (en) Unsupervised image segmentation method based on super pixels and target discovering mechanism
CN110827312A (en) Learning method based on cooperative visual attention neural network
CN110147841A (en) The fine grit classification method for being detected and being divided based on Weakly supervised and unsupervised component
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN114092487A (en) Target fruit instance segmentation method and system
CN104282025A (en) Biomedical image feature extraction method
Yang et al. An improved algorithm for the detection of fastening targets based on machine vision

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant