CN108460790A - Visual tracking method based on a conformal predictor model - Google Patents

Visual tracking method based on a conformal predictor model

Info

Publication number
CN108460790A
Authority
CN
China
Prior art keywords
target
sample
consistency
value
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810270188.XA
Other languages
Chinese (zh)
Inventor
Gao Lin (高琳)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest University of Science and Technology
Original Assignee
Southwest University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest University of Science and Technology
Priority to CN201810270188.XA
Publication of CN108460790A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30241 Trajectory

Abstract

The invention belongs to the technical field of data processing and discloses a visual tracking method based on a conformal predictor model. The method comprises: first building a dual-input convolutional neural network (CNN) model that extracts high-level features from the sampled regions of a video frame and from the target template simultaneously, and distinguishes the target from the background using logistic regression; then embedding the CNN into a conformal predictor framework, where an algorithmic randomness test assesses the reliability of the classification results and, at a specified risk level, outputs classification results with confidence in the form of prediction regions; and finally selecting high-confidence regions as candidate target regions and obtaining the target trajectory by optimizing a spatio-temporal global energy function. The present invention adapts to complex situations such as target occlusion, appearance variation and background interference, and is more robust and accurate than a variety of currently popular tracking algorithms.

Description

Visual tracking method based on a conformal predictor model
Technical field
The invention belongs to the technical field of data processing, and more particularly to a visual tracking method based on a conformal predictor model.
Background technology
Visual target tracking is a fundamental problem in the field of computer vision; its task is to determine the state of a target in a video, including its position, speed and motion trajectory. Although visual tracking technology has made great progress in recent years, realizing robust tracking under complex conditions such as target occlusion, pose variation and cluttered backgrounds remains a huge challenge.
In visual tracking, the feature representation of the target is one of the important factors affecting tracking performance. The features used to represent the target should adapt to changes in target appearance while discriminating well against the background. A large number of feature extraction methods, such as HAAR and HOG, have been applied to visual tracking; most of these are hand-designed low-level features, which are rather task-specific and not robust to target variation. In recent years, the convolutional neural network (CNN) in deep learning has been widely used in target detection, image classification, semantic segmentation and so on. Compared with traditional hand-crafted features, features learned automatically by a CNN can capture high-level semantic information about the target and are more robust to appearance changes, so they have gradually been introduced into the solution of target tracking problems. However, when depth features are used for tracking, a common problem is that a large number of samples are needed to train and update the CNN parameters, whereas for a visual tracking task it is usually difficult to obtain a large number of training samples of the tracked target in advance. Therefore, effective training and updating of the CNN parameters is the main problem faced when applying a CNN to tracking.
On the other hand, in CNN-based target tracking methods, after the target features are extracted with the CNN, tracking is typically realized with a discriminative method [7-8]. The basic idea is to treat target tracking as a binary classification problem over image regions: a classifier divides image regions into target and background, and the final trajectory is obtained from the classification results of each frame. The reliability of the classification results is the key to the success or failure of tracking; however, most current classification algorithms lack a reliability analysis of the output, that is, a quantified confidence that evaluates to what extent the result is correct. If the classification result at each moment could be evaluated effectively, providing a reliable basis for target state estimation and for updating the parameters of the feature model, the accuracy and robustness of tracking would be greatly improved.
In summary, the problems existing in the prior art are as follows:
Existing visual tracking methods are not robust and give poor tracking results on video sequences; they cannot adapt to complex situations such as target occlusion, appearance variation and background interference, and the accuracy of many existing tracking algorithms is poor.
Invention content
In view of the problems of the existing technology, the present invention provides a visual tracking method based on a conformal predictor model.
The invention is realized in this way: a visual tracking method based on a conformal predictor model includes:
first building a dual-input convolutional neural network model that extracts high-level features from the sampled regions of a video frame and from the target template simultaneously, and distinguishes the target from the background using logistic regression;
then embedding the convolutional neural network into a conformal predictor framework, assessing the reliability of the classification results with an algorithmic randomness test and, at a specified risk level, outputting classification results with confidence in the form of prediction regions;
finally selecting high-confidence regions as candidate target regions and obtaining the target trajectory by optimizing a spatio-temporal global energy function.
Further, the visual tracking method based on the conformal predictor model specifically includes:
Input: target initial state x0, the pre-trained CNN, and a sequence of N images;
Output: target trajectory T;
Initialization phase, including:
taking the image region corresponding to x0 as the input template of the CNN;
collecting positive and negative samples around x0, establishing the training set T, and dividing it into a proper training set Ta and a calibration set Tb;
using Ta to train and adjust the fully connected layer and the output layer of the CNN;
Tracking phase, including:
dividing the image sequence into K segments and processing the segments k = 1, ..., K in turn;
estimating the target trajectory of the k-th segment;
updating the training set T: selecting tracking results with high confidence according to their p-values to update the training set, and mining hard negative samples to add to T;
concatenating the target trajectory, T ← T ∪ Tk; if the last segment has been processed, outputting the trajectory T; otherwise setting k = k + 1 and returning to the step of estimating the target trajectory of the k-th segment (a sketch of this loop is given after this list).
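As an illustration only, the following is a minimal Python sketch of this half-offline, segment-wise loop. The names track_sequence, estimate_segment, update_model and segment_len are hypothetical and not part of the original disclosure; the per-segment estimation and the model update are passed in as callables and correspond to the steps detailed in the following paragraphs.

```python
# Minimal sketch of the half-offline tracking loop: the sequence is split into K
# segments, each segment's trajectory T_k is estimated, concatenated into T, and the
# model is updated from confident results before the next segment is processed.
def track_sequence(n_frames, x0, estimate_segment, update_model, segment_len=20):
    trajectory = [x0]                                    # T, initialised with x0
    for k, start in enumerate(range(1, n_frames, segment_len), start=1):
        end = min(start + segment_len, n_frames)
        seg_traj = estimate_segment(k, start, end, trajectory[-1])   # T_k
        trajectory.extend(seg_traj)                      # T <- T ∪ T_k
        update_model(seg_traj)    # confident results + hard negatives -> retrain
    return trajectory

# Toy usage with dummy callables that keep the target static.
if __name__ == "__main__":
    traj = track_sequence(
        n_frames=100,
        x0=(50.0, 50.0, 1.0),                            # (cx, cy, scale)
        estimate_segment=lambda k, s, e, x: [x] * (e - s),
        update_model=lambda seg: None,
    )
    print(len(traj))                                     # 100
```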
Further, in estimating the target trajectory of the k-th segment, the processing procedure includes:
establishing the candidate target sets of all frames;
(1) letting the current time be t and O_t = φ; centred on the target state with the highest p-value in the image at time t-1, performing Gaussian random sampling over position and scale to obtain M samples x_t^(j), j = 1, ..., M, where the Gaussian covariance is the diagonal matrix Diag(0.1r², 0.1r², 0.2) and r is the average of the length and width of the previous target state;
(2) using the CNN to compute the regression values of the samples;
(3) according to the calibration set Tb, computing the confidence of each sample with formula (3);
(4) according to the risk threshold ε, obtaining the region prediction result of each sample with formula (4); choosing the samples whose output result is {C+} or {C+, C-} and whose confidence p(C+) ranks in the top Nc, and adding them to the candidate target set O_t;
(5) letting t = t + 1; if t > nl the processing ends, otherwise going to (1). A sketch of one iteration of this candidate-generation step follows this list.
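A minimal sketch of one frame of the candidate-generation step above, under stated assumptions: cnn_regress stands in for the regression value R(C+ | x) produced by the network and conformal_p_value for the p-value of formula (3); both are hypothetical placeholders, the third sampling dimension is treated as a log-scale perturbation, and only the covariance Diag(0.1r², 0.1r², 0.2) and the top-Nc selection follow the text.

```python
import numpy as np

def generate_candidates(prev_state, cnn_regress, conformal_p_value,
                        M=256, Nc=20, eps=0.4, rng=np.random.default_rng(0)):
    """prev_state: (cx, cy, w, h) of the highest-p-value state at time t-1.
    Returns up to Nc candidate states O_t ranked by confidence p(C+)."""
    cx, cy, w, h = prev_state
    r = 0.5 * (w + h)                                   # average of length and width
    cov = np.diag([0.1 * r**2, 0.1 * r**2, 0.2])
    # Gaussian random sampling over position and (log-)scale.
    draws = rng.multivariate_normal([cx, cy, 0.0], cov, size=M)
    candidates = []
    for sx, sy, ds in draws:
        state = (sx, sy, w * np.exp(ds), h * np.exp(ds))
        reg = cnn_regress(state)                        # regression value for class C+
        p_pos = conformal_p_value(reg, label=1)         # p(C+), formula (3)
        p_neg = conformal_p_value(reg, label=0)         # p(C-)
        region = {lab for lab, p in ((1, p_pos), (0, p_neg)) if p > eps}  # formula (4)
        if 1 in region:                                 # keep {C+} or {C+, C-} outputs
            candidates.append((p_pos, state))
    candidates.sort(key=lambda t: t[0], reverse=True)
    return [s for _, s in candidates[:Nc]]

# Toy usage with stand-in callables.
cands = generate_candidates((50.0, 50.0, 20.0, 40.0),
                            cnn_regress=lambda s: 0.8,
                            conformal_p_value=lambda r, label: r if label else 1 - r)
print(len(cands))   # 20
```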
Further, in estimating the target trajectory of the k-th segment, the processing procedure further includes:
obtaining the target trajectory of the k-th segment by optimizing the energy function E_Track.
Further, the dual-input convolutional neural network model includes a CNN network structure: the target template and the image to be recognized are fed into the network as two inputs at the same time; after features are extracted by the convolutional layers, they are merged in the fully connected layer to form discriminative features, and finally logistic regression is carried out in the output layer to realize classification. The target template is obtained manually from the first frame of the image sequence, while the image to be recognized is a local region sampled from the sequence images. The CNN structure contains two independent sets of convolutional layers that share the same structure and parameters; the two inputs are mapped to high-level features by the convolutional layers, then merged in the fully connected layer and further mapped to features that discriminate between target and background. The output layer is a logistic regression classifier that predicts, via logistic regression, whether the input sample belongs to the target class or the background class.
Further, the dual-input convolutional neural network model further includes network parameter training:
the convolutional layers of the CNN are trained offline on a data set in advance so that general target features can be extracted; during pre-training the CNN is a single-input structure, and the trained parameters are shared by the two sets of convolutional layers;
the output layer of the CNN is set to 10 units for pre-training and replaced with 1 unit afterwards, corresponding to the binary classification of the tracking task; the pre-trained CNN is fine-tuned according to the actual tracking task; during tracking, the pre-trained convolutional layer parameters are fixed and only the fully connected layer and output layer parameters are updated online, to adapt to changes of the target and background;
to establish the training set, the target region in the first frame is chosen manually in the tracking initialization phase, positive and negative training samples are sampled according to the target region, and the positive or negative attribute of a sample is judged by its coverage rate with the target region, with the coverage threshold set to 0.5;
data augmentation is realized by applying random scaling and rotation transformations to the samples; in subsequent tracking, through the risk assessment of the classification results, training samples are drawn centred on tracking results that satisfy the confidence condition; the training set is denoted T = {(x^(1), y^(1)), ..., (x^(n), y^(n))}, where y^(i) ∈ {C- = 0, C+ = 1}, the class label C- is the background, C+ is the target, and x^(i) ∈ Z^d is the target state vector, including position and scale; the probability that a sample belongs to the target or the background is computed with logistic regression in the output layer:
R(y | x; θ) = h_θ(x)^y (1 − h_θ(x))^(1−y)    (1);
where h_θ(x) is the logistic (sigmoid) output of the network and θ denotes the network model parameters; the model is trained with the training set T so that the log-likelihood loss function L(θ) is minimized:
L(θ) = −Σ_{i=1}^{n} [ y^(i) ln h_θ(x^(i)) + (1 − y^(i)) ln(1 − h_θ(x^(i))) ]    (2);
the network weights and biases are adjusted along the negative gradient direction of L(θ) using stochastic gradient descent, and the back-propagation method iteratively updates the parameters of each layer above the convolutional layers.
Further, the conformal predictor uses an improved CP algorithm to predict the sample class. In the improved CP algorithm it is first assumed that the samples in the training set are independent and identically distributed, and the training set T = {(x^(1), y^(1)), ..., (x^(n), y^(n))} is divided into two parts: the first m samples form the proper training set Ta = {(x^(1), y^(1)), ..., (x^(m), y^(m))}, and the last q samples form the calibration set Tb = {(x^(m+1), y^(m+1)), ..., (x^(m+q), y^(m+q))}, with n = m + q. The proper training set Ta is used to update the CNN parameters, while the calibration set Tb, together with the sample to be recognized, constitutes a checking sequence, and the algorithmic randomness test of this sequence is used to determine the sample class.
The algorithmic randomness test of the sequence is as follows: first a mapping function A: Z^(q-1) × Z → R is defined, and each sample in the calibration set Tb is mapped one by one to the nonconformity-score space, giving the nonconformity score sequence α_{m+1}, ..., α_{m+q}.
Let the target state of the sample to be recognized be x_s; x_s is assigned the class labels C- and C+ respectively, forming two test samples (x_s, y_i), i = 0, 1. After the nonconformity score α_s^(y_i) of each test sample is computed, it forms, together with the nonconformity scores of the calibration set Tb, two checking sequences. The p-value test statistic then gives the algorithmic randomness level of the sequence:
p_s(y_i) = #{ α in the checking sequence : α ≥ α_s^(y_i) } / (q + 1)    (3);
where p_s(y_i) denotes the p-value when the target state x_s is labelled y_i, i.e. the confidence that x_s belongs to class y_i. Given the algorithm risk-level threshold ε, the hypotheses whose p-values exceed ε form the output of the ICP:
Γ_s^ε = { y_i : p_s(y_i) > ε }    (4);
When the true class y_s of x_s is not in the prediction set, a prediction error is considered to have occurred; according to the validity theorem of conformal predictors, the error rate is no greater than the algorithm risk level ε, i.e.:
P{p_s(y_s) ≤ ε} ≤ ε    (5);
In the algorithmic randomness test of the sequence, a nonconformity-score mapping function is first defined to measure the degree to which the sample to be tested conforms to the overall sample distribution. Conformity is analysed from the regression values output by the CNN: the larger the regression value of the sample features for the true class, the stronger the conformity of the sample with the calibration-set sequence. The nonconformity-score function is defined in formula (6),
where R_y(x^(i)) is the regression value of x^(i) for class y computed by formula (1), and the parameter γ adjusts the sensitivity of the nonconformity score α_i to changes in the regression value: the smaller γ is, the more sensitive α_i is to changes in R_y(x^(i)).
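A minimal, self-contained sketch of the inductive conformal step under stated assumptions: since the exact nonconformity function of formula (6) is not reproduced in this text, an assumed form α = exp(−R_y(x)/γ) is used, which decreases with the regression value of the true class and becomes more sensitive as γ shrinks, consistent with the description above; the p-value follows the standard ICP definition of formula (3).

```python
import numpy as np

def nonconformity(reg_value_true_class, gamma=0.5):
    # Assumed form of formula (6): decreasing in the regression value of the true
    # class; smaller gamma -> more sensitive to changes in the regression value.
    return np.exp(-reg_value_true_class / gamma)

def icp_predict(reg_pos, calib_scores, eps=0.4, gamma=0.5):
    """reg_pos: CNN regression value R(C+ | x) of the sample to be recognised.
    calib_scores: nonconformity scores of the calibration set T_b.
    Returns (prediction_set, p_values) at risk level eps."""
    q = len(calib_scores)
    p_values, prediction_set = {}, set()
    for label in (1, 0):                                   # C+ = 1, C- = 0
        reg_true = reg_pos if label == 1 else 1.0 - reg_pos
        alpha_s = nonconformity(reg_true, gamma)
        # Formula (3): fraction of the checking sequence with score >= alpha_s.
        p = (np.sum(np.asarray(calib_scores) >= alpha_s) + 1) / (q + 1)
        p_values[label] = p
        if p > eps:                                        # formula (4)
            prediction_set.add(label)
    return prediction_set, p_values

# Toy usage: calibration scores from regression values of known-target samples.
calib = [nonconformity(r) for r in np.random.default_rng(1).uniform(0.6, 0.95, 30)]
print(icp_predict(reg_pos=0.9, calib_scores=calib))
```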
Further, the result output by the improved CP algorithm may include multiple classes. For the binary classification of the sample to be recognized, the improved CP algorithm outputs one of four possible results: φ, {C-}, {C+}, {C+, C-}. Each output result is accompanied, in addition to the class information, by a confidence p-value. According to the region prediction results of all samples, the samples with high confidence are selected from each frame as candidate targets.
Specifically, for the image frame at time t, the samples whose output is {C+} or {C+, C-} are sorted by the confidence value p(C+), and the largest Nc samples are chosen to establish the candidate target set O_t, where |O_t| ≤ Nc.
The candidate target set O_t contains several possible states of the target at time t; the target will transit from some state in O_t to some state in the candidate target set O_{t+1} at the next moment.
To obtain the optimal path for target tracking, a spatio-temporal energy function E_Track is defined to characterize the target trajectory, and the target trajectory is obtained by optimizing this energy function (formula (7)).
E_Track consists of two parts, a local cost term E_Local and a pairwise cost term E_Pairwise.
E_Local is defined as the sum, over all moments, of the CNN output values of the target state x_t for the background class.
Since partial occlusion of the target can reduce the reliability of the local cost term, a robust estimator is introduced to reduce the influence of outliers on the function optimization; E_Local is defined in formula (8),
where R_{C-}(x) is the regression value of the target state x for the background class, and ρ(·) is the Huber operator (formula (9)), which enhances the reliability of the local cost term.
E_Pairwise describes the degree of change of the target state. When target occlusion, cluttered background or target pose changes occur in the sequence, the target state may change abruptly because of large estimation errors. Assuming that the motion of the target is coherent, the role of E_Pairwise is to penalize abrupt change points in the trajectory when the energy function is optimized, so that the trajectory has a certain smoothness; E_Pairwise is defined in formula (10).
The energy function in formula (7) is optimized with a dynamic programming method to obtain the optimal motion trajectory.
Further, training sample updating includes:
during tracking, the CNN model parameters are updated with the tracking results of the previous sequence segment before the next sequence segment is processed; the tracking result at moment t is selected according to its confidence p-value: if p is greater than the set threshold α, positive and negative training samples are sampled around it; otherwise the judgement is deferred to the next moment.
Another object of the present invention is to provide a robust visual tracking system applying the visual tracking method based on a conformal predictor model.
The present invention uses a convolutional neural network to extract high-level image features for representing the target, overcoming the sensitivity of low-level features to changes in target appearance. In order to adapt to different types of tracking targets, a dual-input network structure is designed which, combined with the target template, distinguishes the target from the background using logistic regression. To further improve tracking robustness, a conformal predictor is introduced to perform a reliability analysis of the classification results, and the classification results that satisfy the confidence condition are selected as candidate target regions; the final target trajectory is obtained by optimizing a spatio-temporal global energy function. Comparative experiments with a variety of currently popular tracking algorithms on public data sets show that the present invention adapts to complex situations such as target occlusion, appearance variation and background interference, and that the algorithm of the invention has better tracking robustness and accuracy.
Description of the drawings
Fig. 1 is a flowchart of the visual tracking method based on a conformal predictor model provided by an embodiment of the present invention.
Fig. 2 is a schematic diagram of the visual tracking method based on a conformal predictor model provided by an embodiment of the present invention.
Fig. 3 is a schematic diagram of the dual-input CNN network structure provided by an embodiment of the present invention.
Fig. 4 shows target tracking results provided by an embodiment of the present invention;
in the figure: (a) FaceOcc1; (b) Bolt; (c) Football; (d) CarDark.
Fig. 5 shows the center location errors of the tracking results provided by an embodiment of the present invention;
in the figure: (a) FaceOcc1; (b) Bolt; (c) Football; (d) CarDark.
Fig. 6 shows the coverage rates of the tracking results provided by an embodiment of the present invention;
in the figure: (a) FaceOcc1; (b) Bolt; (c) Football; (d) CarDark.
Fig. 7 shows the one-pass evaluation results of the algorithms on all test sequences provided by an embodiment of the present invention;
in the figure: (a) position precision plot; (b) coverage success-rate plot.
Specific embodiments
In order to make the purpose, technical scheme and advantages of the present invention clearer, the present invention is further elaborated below with reference to the embodiments. It should be appreciated that the specific embodiments described here are only used to explain the present invention and are not intended to limit it.
At present, the robustness of target tracking on video sequences is poor; existing methods cannot adapt to complex situations such as target occlusion, appearance variation and background interference, and the accuracy of many existing tracking algorithms is poor.
The application principle of the present invention is described in detail below in conjunction with the accompanying drawings.
As shown in Fig. 1, the visual tracking method based on a conformal predictor model provided by an embodiment of the present invention includes:
S101: first building a dual-input convolutional neural network model that extracts high-level features from the sampled regions of a video frame and from the target template simultaneously, and distinguishes the target from the background using logistic regression.
S102: then embedding the convolutional neural network into a conformal predictor framework, assessing the reliability of the classification results with an algorithmic randomness test and, at a specified risk level, outputting classification results with confidence in the form of prediction regions.
S103: finally selecting high-confidence regions as candidate target regions and obtaining the target trajectory by optimizing a spatio-temporal global energy function. Experimental results show that the algorithm adapts to complex situations such as target occlusion, appearance variation and background interference, and is more robust and accurate than a variety of currently popular tracking algorithms.
The application principle of the present invention is further described below with reference to specific embodiments.
A schematic diagram of the visual tracking method based on a conformal predictor model provided by an embodiment of the present invention is shown in Fig. 2.
The method is broadly divided into two stages. The first stage is the initialization phase: a dual-input CNN is built, in which the convolutional layer parameters are trained in advance on a conventional image data set, while the other layers are trained with samples collected manually in the first frame to obtain the initial parameters of the model.
The second stage is the tracking phase: regions are sampled frame by frame from the image sequence, the high-level features of the samples are extracted with the CNN, the regression value of each sample belonging to the target or the background is computed by logistic regression, and the sample class at the specified risk level is then obtained with the CP; target samples with high confidence are selected to establish the candidate target set, and the final target trajectory is obtained by optimizing a spatio-temporal energy function defined on the candidate target sets. The tracking of long sequences is handled in a half-offline manner: the whole video sequence is segmented, the tracking of each sequence segment is processed in turn, the trajectories of the segments are concatenated, and the model parameters of the CNN are updated online segment by segment during tracking.
1) CNN target feature extraction and classification:
A CNN is a multilayer neural network specialized for processing grid-structured data; it extracts local image features implicitly through convolution kernels and has good invariance to translation, scaling and other types of deformation. For the tracking problem, the network structure and the parameter-training scheme are the key factors affecting CNN performance, and the design of these two parts in the algorithm of the invention is explained separately below.
1.1) CNN network structure:
In target recognition applications, a CNN usually needs to be trained on massive data before it can express target features accurately, while for a specific tracking task it is often difficult to obtain sufficient training data in advance; therefore a CNN built for target recognition is difficult to apply directly to target tracking and needs to be adjusted and improved.
Unlike target recognition, tracking does not need to pay attention to the specific type of the target, as long as it can be distinguished from the background. A dual-input CNN structure is therefore adopted (as shown in Fig. 3): the target template and the image to be recognized are fed into the network as two inputs at the same time; after features are extracted by the convolutional layers, they are merged in the fully connected layer to form discriminative features, and finally logistic regression is carried out in the output layer to realize classification. The target template can be obtained manually from the first frame of the image sequence, while the image to be recognized is a local region sampled from the sequence images. The network contains two independent sets of convolutional layers; to simplify the model, these two sets share the same structure and parameters. The two inputs are mapped to high-level features by the convolutional layers, then merged in the fully connected layer and further mapped to features that discriminate between target and background. The output layer is a logistic regression classifier that predicts, via logistic regression, the class of the input sample, i.e. target or background.
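For illustration, a minimal PyTorch sketch of such a dual-input network under stated assumptions: the layer sizes, kernel sizes and the 32x32 input resolution are illustrative choices not specified by the patent; only the weight sharing between the two convolutional branches, the fusion in the fully connected layer and the single logistic (sigmoid) output follow the description above.

```python
import torch
import torch.nn as nn

class DualInputCNN(nn.Module):
    """Two inputs (template, search region) pass through the same convolutional
    branch (shared weights), are fused in the fully connected layer, and a
    logistic output predicts target (1) vs. background (0)."""
    def __init__(self):
        super().__init__()
        # Shared convolutional branch: one module used for both inputs.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Linear(2 * 64 * 8 * 8, 256), nn.ReLU(),   # fusion of both branches
            nn.Linear(256, 1),                           # single output unit
        )

    def forward(self, template, search):
        f_t = self.conv(template).flatten(1)   # high-level features of the template
        f_s = self.conv(search).flatten(1)     # high-level features of the region
        fused = torch.cat([f_t, f_s], dim=1)   # merge in the fully connected layer
        return torch.sigmoid(self.fc(fused))   # regression value R(C+ | x)

# Toy usage with 32x32 RGB crops.
net = DualInputCNN()
r = net(torch.rand(4, 3, 32, 32), torch.rand(4, 3, 32, 32))
print(r.shape)   # torch.Size([4, 1])
```

Freezing self.conv and optimizing only self.fc would correspond to the online update scheme described in section 1.2) below.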
1.2) Network parameter training:
The CNN convolutional layers of the present invention are trained offline on the CIFAR-10 data set in advance so that general target features can be extracted. During pre-training the CNN is reduced to a single-input structure, and the trained parameters are then shared by the two sets of convolutional layers.
In addition, for the 10-class classification problem of the CIFAR-10 data, the output layer of the CNN is set to 10 units; after pre-training, the output layer is replaced with 1 unit, corresponding to the binary classification of the tracking task. The pre-trained CNN is fine-tuned according to the actual tracking task. During tracking, in order to improve the efficiency of parameter adjustment, the pre-trained convolutional layer parameters are fixed and only the fully connected layer and output layer parameters are updated online, to adapt to changes of the target and background.
To establish the training set, the target region in the first frame is chosen manually in the tracking initialization phase, positive and negative training samples are sampled according to the target region, and the positive or negative attribute of a sample is judged by its coverage rate with the target region (threshold set to 0.5). To increase the number of training samples, data augmentation is realized by applying random scaling and rotation transformations to the samples. In subsequent tracking, through the risk assessment of the classification results, training samples are drawn centred on tracking results that satisfy the confidence condition. Let the training set be T = {(x^(1), y^(1)), ..., (x^(n), y^(n))}, where y^(i) ∈ {C- = 0, C+ = 1}, the class label C- is the background, C+ is the target, and x^(i) ∈ Z^d is the target state vector, including position and scale. The probability that a sample belongs to the target or the background is computed with logistic regression in the output layer:
R(y | x; θ) = h_θ(x)^y (1 − h_θ(x))^(1−y)    (1);
where h_θ(x) is the logistic (sigmoid) output of the network and θ denotes the network model parameters. The model is trained with the training set T so that the log-likelihood loss function L(θ) is minimized:
L(θ) = −Σ_{i=1}^{n} [ y^(i) ln h_θ(x^(i)) + (1 − y^(i)) ln(1 − h_θ(x^(i))) ]    (2);
the network weights and biases are adjusted along the negative gradient direction of L(θ) using stochastic gradient descent, and the back-propagation method iteratively updates the parameters of each layer above the convolutional layers.
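A minimal numpy sketch of this training step under stated assumptions: the frozen convolutional branches are replaced by fixed random features, full-batch gradient descent stands in for stochastic gradient descent for brevity, and the learning rate, feature dimension and sample counts are illustrative; formula (2) is taken as the negative log-likelihood of formula (1).

```python
import numpy as np

rng = np.random.default_rng(0)

def h_theta(theta, feats):
    return 1.0 / (1.0 + np.exp(-feats @ theta))          # logistic output, formula (1)

def train_output_layer(feats, labels, lr=0.1, epochs=200):
    """feats: (n, d) fused high-level features; labels: (n,) with C+ = 1, C- = 0.
    Minimises the log-likelihood loss of formula (2) by gradient descent."""
    n, d = feats.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        p = h_theta(theta, feats)
        grad = feats.T @ (p - labels) / n                 # dL/dtheta
        theta -= lr * grad                                # step along -gradient
    p = h_theta(theta, feats)
    loss = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return theta, loss

# Toy usage: separable random "features" for 60 positive and 60 negative samples.
pos = rng.normal(+1.0, 1.0, size=(60, 16))
neg = rng.normal(-1.0, 1.0, size=(60, 16))
feats = np.vstack([pos, neg])
labels = np.concatenate([np.ones(60), np.zeros(60)])
theta, loss = train_output_layer(feats, labels)
print(round(loss, 3))
```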
2) Candidate target selection based on the conformal predictor:
The logistic regression value provides a basis for predicting the sample class, but the regression value itself cannot assess the risk of prediction error in theory. In order to realize a reliability analysis of the prediction results, the CNN model is embedded into the CP framework, the class confidence of a sample is computed according to its algorithmic randomness level, and the candidate targets are then selected.
2.1) Conformal predictor:
Most current machine learning algorithms lack an effective reliability analysis of their prediction results, that is, a quantified confidence that evaluates to what extent the prediction is correct, together with an adjustable standard for measuring that confidence. CP is a machine learning paradigm that can output confidence effectively; it makes predictions with a hypothesis-testing method and provides a reliability assessment of the prediction results.
The computational cost of the traditional CP algorithm is very high. To improve efficiency, the present invention uses an improved CP algorithm, the inductive conformal predictor (ICP), to predict the sample class. In the ICP algorithm it is first assumed that the samples in the training set are independent and identically distributed, and the training set T = {(x^(1), y^(1)), ..., (x^(n), y^(n))} is divided into two parts: the first m samples form the proper training set Ta = {(x^(1), y^(1)), ..., (x^(m), y^(m))}, and the last q samples form the calibration set Tb = {(x^(m+1), y^(m+1)), ..., (x^(m+q), y^(m+q))}, with n = m + q. The proper training set Ta is used to update the CNN parameters, while the calibration set Tb, together with the sample to be recognized, constitutes a checking sequence, and the algorithmic randomness test is used to determine the sample class.
The algorithmic randomness test is as follows: first a mapping function A: Z^(q-1) × Z → R is defined, and each sample in the calibration set Tb is mapped one by one to the nonconformity-score space, giving the nonconformity score sequence α_{m+1}, ..., α_{m+q}. The nonconformity score reflects the inconsistency of a sample with the overall sample distribution. Let the target state of the sample to be recognized be x_s; x_s is assigned the class labels C- and C+ respectively, forming two test samples (x_s, y_i), i = 0, 1. After the nonconformity score of each test sample is computed, it forms, together with the nonconformity scores of the calibration set Tb, two checking sequences. The p-value test statistic, computed as in formula (3), gives the algorithmic randomness level of each sequence,
where p_s(y_i) denotes the p-value when the target state x_s is labelled y_i, i.e. the confidence that x_s belongs to class y_i. Given the algorithm risk-level threshold ε, the hypotheses whose p-values exceed ε form the output of the ICP (formula (4)).
When the true class y_s of x_s is not in the prediction set, a prediction error is considered to have occurred; according to the validity theorem of conformal predictors [10], the error rate is no greater than the algorithm risk level ε, i.e.:
P{p_s(y_s) ≤ ε} ≤ ε    (5);
Therefore, the prediction region of the ICP is adjustable.
2.2) Sample nonconformity function:
The algorithmic randomness test of the sequence requires a nonconformity-score mapping function to be defined first, which measures the degree to which the sample to be tested conforms to the overall sample distribution. Conformity is analysed from the regression values output by the CNN: the larger the regression value of the sample features for the true class, the stronger the conformity of the sample with the calibration-set sequence. The nonconformity-score function is defined in formula (6),
where R_y(x^(i)) is the regression value of x^(i) for class y computed by formula (1), and the parameter γ adjusts the sensitivity of the nonconformity score α_i to changes in the regression value: the smaller γ is, the more sensitive α_i is to changes in R_y(x^(i)).
2.3) Candidate target selection:
The output of the ICP is a set that may contain multiple classes. For the binary classification of the sample to be recognized, the ICP output has four possibilities: φ, {C-}, {C+}, {C+, C-}. Each output result is accompanied, in addition to the class information, by a confidence p-value. According to the region prediction results of all samples, the samples with high confidence are selected from each frame as candidate targets. Specifically, for the image frame at time t, the samples whose output is {C+} or {C+, C-} are sorted by the confidence value p(C+), and the largest Nc samples are chosen to establish the candidate target set O_t, so that |O_t| ≤ Nc.
3) Target tracking algorithm:
3.1) Spatio-temporal energy function:
The candidate target set O_t contains several possible states of the target at time t, and the target will transit from some state in O_t to some state in the candidate target set O_{t+1} at the next moment; target tracking can therefore be regarded as the problem of finding an optimal path. To obtain the optimal path, a spatio-temporal energy function E_Track is defined to characterize the target trajectory, and the target trajectory is obtained by optimizing this energy function (formula (7)).
E_Track consists of two parts, a local cost term E_Local and a pairwise cost term E_Pairwise.
E_Local is defined as the sum, over all moments, of the CNN output values of the target state x_t for the background class. Since partial occlusion of the target can reduce the reliability of the local cost term, a robust estimator is introduced to reduce the influence of outliers on the function optimization; E_Local is defined in formula (8),
where R_{C-}(x) is the regression value of the target state x for the background class, and ρ(·) is the Huber operator (formula (9)), used to enhance the reliability of the local cost term.
E_Pairwise describes the degree of change of the target state. When target occlusion, cluttered background or target pose changes occur in the sequence, the target state may change abruptly because of large estimation errors. Assuming that the motion of the target is coherent, the role of E_Pairwise is to penalize abrupt change points in the trajectory when the energy function is optimized, so that the trajectory has a certain smoothness; E_Pairwise is defined in formula (10).
The energy function in formula (7) is optimized with a dynamic programming method to obtain the optimal motion trajectory.
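A minimal sketch of this dynamic programming step under stated assumptions: since formulas (7)-(10) are not reproduced in this text, the local cost is taken as a Huber-robustified background regression value and the pairwise cost as the squared distance between consecutive candidate states weighted by an illustrative factor lam; the Viterbi-style recursion itself is standard.

```python
import numpy as np

def huber(e, delta=0.4):
    # Robust operator: quadratic near zero, linear in the tails (assumed form of (9)).
    e = np.abs(e)
    return np.where(e <= delta, e**2, 2 * delta * e - delta**2)

def best_trajectory(candidates, bg_scores, lam=1.0, delta=0.4):
    """candidates[t]: (Nc, 2) candidate positions in frame t of the segment.
    bg_scores[t]: (Nc,) regression values R(C- | x) of those candidates.
    Minimises sum_t huber(bg) + lam * ||x_t - x_{t-1}||^2 by dynamic programming."""
    T = len(candidates)
    cost = huber(bg_scores[0], delta)            # accumulated cost per candidate
    back = []
    for t in range(1, T):
        # pairwise term between every state at t-1 and every state at t
        diff = candidates[t - 1][:, None, :] - candidates[t][None, :, :]
        pair = lam * np.sum(diff**2, axis=2)     # (Nc_prev, Nc_cur)
        total = cost[:, None] + pair
        back.append(np.argmin(total, axis=0))    # best predecessor for each state
        cost = total[back[-1], np.arange(total.shape[1])] + huber(bg_scores[t], delta)
    # Backtrack the optimal path.
    idx = int(np.argmin(cost))
    path = [idx]
    for b in reversed(back):
        idx = int(b[idx])
        path.append(idx)
    path.reverse()
    return [candidates[t][i] for t, i in enumerate(path)]

# Toy usage: 5 frames, 4 candidates each.
rng = np.random.default_rng(2)
cands = [rng.uniform(0, 10, size=(4, 2)) for _ in range(5)]
bgs = [rng.uniform(0, 1, size=4) for _ in range(5)]
print(len(best_trajectory(cands, bgs)))   # 5
```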
3.2) Training sample update:
During tracking, the CNN model parameters are updated with the tracking results of the previous sequence segment before the next sequence segment is processed. To avoid model drift, training samples are only collected around tracking results with high reliability. The tracking result at moment t is selected according to its confidence p-value: if p is greater than the set threshold α, positive and negative training samples are sampled around it; otherwise the judgement is deferred to the next moment.
Negative samples in the training set commonly suffer from redundancy; redundant negative samples contribute very little to model training and waste computing resources. For this reason, the training set is optimized and training efficiency improved by mining hard negative samples. It is observed that samples whose region prediction result is {C+, C-} usually appear when objects in the background are easily confused with the target, so hard negative samples can be selected from this kind of sample. A simple selection rule is to judge whether such a sample has region overlap with the current tracking result and, if it overlaps, to add it to the training set as a negative sample.
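For illustration, a minimal sketch of this update rule under stated assumptions: box overlap is measured with a simple IoU helper, the confidence threshold alpha follows the text, and a hard negative is taken to be an ambiguous sample that overlaps the tracking result but stays below the 0.5 positive-coverage threshold used earlier; this overlap range is an interpretation, and the names update_samples and iou are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x, y, w, h)."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def update_samples(result_box, p_value, ambiguous_boxes, alpha=0.6):
    """result_box: tracking result at time t; p_value: its confidence p(C+).
    ambiguous_boxes: samples whose prediction set was {C+, C-}.
    Returns (positives, negatives) to add to the training set, or ([], []) if the
    result is not confident enough (decision deferred to the next moment)."""
    if p_value <= alpha:
        return [], []
    positives = [result_box]                         # sample positives around the result
    negatives = [b for b in ambiguous_boxes          # hard negatives: ambiguous samples
                 if 0.0 < iou(b, result_box) < 0.5]  # overlapping below the 0.5 threshold
    return positives, negatives

# Toy usage.
pos, neg = update_samples((10, 10, 20, 20), 0.8,
                          [(25, 25, 20, 20), (80, 80, 20, 20)])
print(len(pos), len(neg))   # 1 1
```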
3.3) Tracking algorithm steps:
The proposed visual tracking algorithm is summarized as follows:
4) Experimental results and analysis:
To verify the validity of the algorithm, simulation experiments were carried out in Matlab on a hardware platform with a 3.4 GHz Intel i7-6700 CPU and 8 GB of memory. The parameters of the algorithm were set as follows: proper training set Ta size m = 300, calibration set Tb size q = 30, algorithm risk level ε = 0.4, sample nonconformity parameter γ = 0.5, candidate target set size limit Nc = 20, robust function parameter δ = 0.4, and training-sample update parameter α = 0.6. The algorithm parameters remained unchanged throughout the experiments, and the average processing speed of the algorithm was about 8 frames per second.
Video sequences from the public data set TOP100 were selected as experimental subjects, and the results were compared with a variety of current mainstream tracking algorithms, including VTS, LOT, STRUCK, MIL and KCF. In order to verify the validity of the CP, a simplified version of the algorithm of the invention was also tested, in which the CP is not introduced and the Nc samples with the largest regression values output directly by the CNN are selected as candidate regions. Two criteria, coverage rate and center location error, were used to compare the performance of the algorithms. The coverage rate is defined as Cr = (Rs ∩ Rt) / (Rs ∪ Rt), where Rs and Rt are the tracking-result region and the real target region respectively; the center location error is the Euclidean distance between the center point of the tracking result and the ground-truth center point. Part of the test results are shown; the selected video sequences contain some typical complex situations, such as target occlusion, appearance variation, illumination change and complex background.
The characteristic of the FaceOcc1 video sequence is that the face is occluded repeatedly, with varying positions and degrees of occlusion. It can be seen that under occlusion the tracking result of LOT is limited to the part that is not occluded, with a large scale error, while MIL drifts considerably when the target is heavily occluded, as in frame 834. The center error and coverage data show that the algorithm of the invention (Ours), Ours (no CP) and KCF can always locate the target accurately in this sequence and are more robust to occlusion.
The video sequence Bolt is a sprint-race scene, and the task is to track one of the athletes. The challenge of this sequence is that the pose of the target changes constantly and, as the camera rotates, the athlete gradually turns from the front to the back in the image, so the appearance variation of the target is very large.
The algorithm of the invention uses high-level features and is little affected by the appearance variation of the target, and it avoids drift by updating the model according to the reliability analysis; the error analysis shows that, among the compared algorithms, the tracking error of the algorithm of the invention is the smallest.
The Football video sequence is a rugby match scene, and the tracking target is the head of one athlete; part of the tracking results of this sequence is shown. The difficulty of the sequence is that there are many athletes in the background with very similar appearance, and their frequent movements back and forth interfere with target tracking. VTS, MIL, KCF, STRUCK and Ours (no CP) drift considerably several times, and at frame 360 VTS, MIL and KCF completely track other athletes instead. The algorithm of the invention (Ours) ensures a smooth trajectory through spatio-temporal trajectory optimization and reduces the influence of interference from similar targets. The data show that Ours maintains the smallest tracking error on this sequence.
The CarDark video sequence tracks the rear of a car; its characteristics are drastic illumination changes, a cluttered background and low image resolution. In the displayed tracking results, LOT is interfered with by the light appearing on the left of the target at frame 58, deviates substantially from the target and has a large scale error, while MIL and VTS also drift to some degree. With continuous interference from the light on the left, MIL and LOT lose the target at frame 208, and at frame 315 the bright reflection on the road surface also causes VTS to lose the target. STRUCK, KCF, Ours (no CP) and the algorithm of the invention (Ours) remain stable while tracking the rear of the car, but Ours (no CP) generally has a larger scale error. The tracking error analysis of this sequence shows that the tracking accuracy of KCF and STRUCK is slightly lower than that of Ours.
In order to compare the overall performance of the 7 algorithms, the present invention gives their one-pass evaluation (OPE) results on all test sequences; the performance of the algorithms can be ranked by the area under the curve (AUC). It can be seen that the algorithm of the invention (Ours) is higher than the other algorithms in both position precision and coverage success rate, with KCF and Ours being closest in performance, while the performance of Ours (no CP), without the CP, declines, especially in coverage success rate.
Table 1 reports the mean center location error and the average coverage rate of the 7 algorithms. The performance indicators of the algorithm of the invention are better than those of the other algorithms, showing that the depth features extracted by the CNN network in the algorithm of the invention can distinguish the target from the background well, and that using the ICP to evaluate the confidence of the classification results effectively ensures the reliability of tracking; good performance is shown on video sequences with a variety of typical complex situations.
The present invention proposes a target tracking algorithm based on a convolutional neural network and a conformal predictor. The algorithm uses a convolutional neural network to extract high-level image features for representing the target, overcoming the sensitivity of low-level features to changes in target appearance. In order to adapt to different types of tracking targets, a dual-input network structure is designed which, combined with the target template, distinguishes the target from the background using logistic regression. To further improve tracking robustness, a conformal predictor is introduced to perform a reliability analysis of the classification results, and the classification results that satisfy the confidence condition are selected as candidate target regions; the final target trajectory is obtained by optimizing a spatio-temporal global energy function. Comparative experiments with a variety of currently popular tracking algorithms on public data sets show that the algorithm of the invention has better tracking robustness and accuracy.
The application effect of the present invention is explained in detail below with reference to the experiments.
To verify the validity of the algorithm, simulation experiments were carried out in Matlab on a hardware platform with a 3.4 GHz Intel i7-6700 CPU and 8 GB of memory. The parameters of the algorithm were set as follows: proper training set Ta size m = 300, calibration set Tb size q = 30, algorithm risk level ε = 0.4, sample nonconformity parameter γ = 0.5, candidate target set size limit Nc = 20, robust function parameter δ = 0.4, and training-sample update parameter α = 0.6. The algorithm parameters remained unchanged throughout the experiments, and the average processing speed of the algorithm was about 8 frames per second.
Video sequences from the public data set TOP100 [14] were selected as experimental subjects, and the results were compared with a variety of current mainstream tracking algorithms, including VTS [15], LOT [16], STRUCK [1], MIL [17] and KCF [2]. In order to verify the validity of the CP, a simplified version of the algorithm of the invention was also tested, in which the CP is not introduced and the Nc samples with the largest regression values output directly by the CNN are selected as candidate regions. Two criteria, coverage rate and center location error [18], were used to compare the performance of the algorithms. The coverage rate is defined as Cr = (Rs ∩ Rt) / (Rs ∪ Rt), where Rs and Rt are the tracking-result region and the real target region respectively; the center location error is the Euclidean distance between the center point of the tracking result and the ground-truth center point. Fig. 4 shows part of the test results; the selected video sequences contain some typical complex situations, such as target occlusion, appearance variation, illumination change and complex background.
Part of the tracking results of the FaceOcc1 video sequence is shown in Fig. 4(a); the tracking target is the face of a woman. The characteristic of this sequence is that the face is repeatedly occluded by a book, with varying positions and degrees of occlusion. It can be seen from the figure that under occlusion the tracking result of LOT is limited to the part that is not occluded, with a large scale error, while MIL drifts considerably when the target is heavily occluded, as in frame 834. The center errors in Fig. 5(a) and the coverage rates in Fig. 6(a) show that the algorithm of the invention (Ours), Ours (no CP) and KCF can always locate the target accurately in this sequence and are more robust to occlusion.
The video sequence Bolt is a sprint-race scene, and the task is to track one of the athletes. The challenge of this sequence is that the pose of the target changes constantly and, as the camera rotates, the athlete gradually turns from the front to the back in the image, so the appearance variation of the target is very large. In the results shown in Fig. 4(b), VTS, STRUCK and MIL drift shortly after the sequence starts and have all left the target by frame 48; KCF, LOT, Ours and Ours (no CP) can keep up with the target, but LOT and Ours (no CP) show large scale errors when the target deforms, as in frame 222. The algorithm of the invention uses high-level features and is little affected by the appearance variation of the target, and it avoids drift by updating the model according to the reliability analysis; the error analysis in Fig. 5(b) and Fig. 6(b) shows that, among the compared algorithms, the tracking error of the algorithm of the invention is the smallest.
The Football video sequence is a rugby match scene, and the tracking target is the head of one athlete; Fig. 4(c) shows part of the tracking results of this sequence. The difficulty of the sequence is that there are many athletes in the background with very similar appearance, and their frequent movements back and forth interfere with target tracking. VTS, MIL, KCF, STRUCK and Ours (no CP) drift considerably several times, and at frame 360 VTS, MIL and KCF completely track other athletes instead. The algorithm of the invention (Ours) ensures a smooth trajectory through spatio-temporal trajectory optimization and reduces the influence of interference from similar targets. The data in Fig. 5(c) and Fig. 6(c) show that Ours maintains the smallest tracking error on this sequence.
The CarDark video sequence tracks the rear of a car; its characteristics are drastic illumination changes, a cluttered background and low image resolution. In the tracking results shown in Fig. 4(d), LOT is interfered with by the light appearing on the left of the target at frame 58, deviates substantially from the target and has a large scale error, while MIL and VTS also drift to some degree. With continuous interference from the light on the left, MIL and LOT lose the target at frame 208, and at frame 315 the bright reflection on the road surface also causes VTS to lose the target. STRUCK, KCF, Ours (no CP) and the algorithm of the invention (Ours) remain stable while tracking the rear of the car, but Ours (no CP) generally has a larger scale error. The tracking error analysis of this sequence in Fig. 5(d) and Fig. 6(d) shows that the tracking accuracy of KCF and STRUCK is slightly lower than that of Ours.
In order to compare the overall performance of the 7 algorithms, Fig. 7 gives their one-pass evaluation (OPE) results [14] on all test sequences, including the position precision plot (Fig. 7(a)) and the coverage success-rate plot (Fig. 7(b)). The performance of the algorithms can be ranked by the area under the curve (AUC); it can be seen that the algorithm of the invention (Ours) is higher than the other algorithms in both position precision and coverage success rate, with KCF and Ours being closest in performance, while the performance of Ours (no CP), without the CP, declines, especially in coverage success rate (see Fig. 7(b)).
Table 1 reports the mean center location error and the average coverage rate of the 7 algorithms. The performance indicators of the algorithm of the invention are better than those of the other algorithms, showing that the depth features extracted by the CNN network in the algorithm of the invention can distinguish the target from the background well, and that using the ICP to evaluate the confidence of the classification results effectively ensures the reliability of tracking; good performance is shown on video sequences with a variety of typical complex situations.
Table 1. Mean center location error and average coverage rate
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention; any modification, equivalent replacement and improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A visual tracking method based on a conformal predictor model, characterized in that the visual tracking method based on the conformal predictor model comprises:
first building a dual-input convolutional neural network model that extracts high-level features from the sampled regions of a video frame and from the target template simultaneously, and distinguishes the target from the background using logistic regression;
then embedding the convolutional neural network into a conformal predictor framework, assessing the reliability of the classification results with an algorithmic randomness test and, at a specified risk level, outputting classification results with confidence in the form of prediction regions;
finally selecting high-confidence regions as candidate target regions and obtaining the target trajectory by optimizing a spatio-temporal global energy function.
2. The visual tracking method based on a conformal predictor model according to claim 1, characterized in that the visual tracking method based on the conformal predictor model specifically comprises:
Input: target initial state x0, the pre-trained CNN, and a sequence of N images;
Output: target trajectory T;
an initialization phase, including:
taking the image region corresponding to x0 as the input template of the CNN;
collecting positive and negative samples around x0, establishing the training set T, and dividing it into a proper training set Ta and a calibration set Tb;
using Ta to train and adjust the fully connected layer and the output layer of the CNN;
a tracking phase, including:
dividing the image sequence into K segments and processing the segments k = 1, ..., K in turn;
estimating the target trajectory of the k-th segment;
updating the training set T: selecting tracking results with high confidence according to their p-values to update the training set, and mining hard negative samples to add to T;
concatenating the target trajectory, T ← T ∪ Tk; if the last segment has been processed, outputting the trajectory T; otherwise setting k = k + 1 and returning to the step of estimating the target trajectory of the k-th segment.
3. The visual tracking method based on a conformal predictor model according to claim 2, characterized in that, in estimating the target trajectory of the k-th segment, the processing procedure includes:
establishing the candidate target sets of all frames;
(1) letting the current time be t and O_t = φ; centred on the target state with the highest p-value in the image at time t-1, performing Gaussian random sampling over position and scale to obtain M samples x_t^(j), j = 1, ..., M, where the Gaussian covariance is the diagonal matrix Diag(0.1r², 0.1r², 0.2) and r is the average of the length and width of the previous target state;
(2) using the CNN to compute the regression values of the samples, j = 1, ..., M;
(3) according to the calibration set Tb, computing the confidence of each sample with formula (3), j = 1, ..., M;
(4) according to the risk threshold ε, obtaining the region prediction result of each sample with formula (4), j = 1, ..., M; choosing the samples whose output result is {C+} or {C+, C-} and whose confidence p(C+) ranks in the top Nc, and adding them to the candidate target set O_t;
(5) letting t = t + 1; if t > nl the processing ends, otherwise going to (1).
4. The visual tracking method based on a conformal predictor model according to claim 2, characterized in that, in estimating the target trajectory of the k-th segment, the processing procedure further includes:
obtaining the target trajectory of the k-th segment by optimizing the energy function E_Track.
5. The visual tracking method based on a conformal predictor model according to claim 1, characterized in that the dual-input convolutional neural network model includes a CNN network structure: the target template and the image to be recognized are fed into the network as two inputs at the same time; after features are extracted by the convolutional layers, they are merged in the fully connected layer to form discriminative features, and finally logistic regression is carried out in the output layer to realize classification; the target template is obtained manually from the first frame of the image sequence, while the image to be recognized is a local region sampled from the sequence images; the CNN structure contains two independent sets of convolutional layers that share the same structure and parameters; the two inputs are mapped to high-level features by the convolutional layers, then merged in the fully connected layer and further mapped to features that discriminate between target and background; the output layer is a logistic regression classifier that predicts, via logistic regression, whether the input sample belongs to the target class or the background class.
6. The visual tracking method based on the conformal predictor model as claimed in claim 5, characterized in that the dual-input convolutional neural network model further comprises network parameter training:
The convolutional layers of the CNN are pre-trained offline on a dataset so that they can extract generic target features; during pre-training the CNN has a single-input structure, and the trained parameters are shared by the two sets of convolutional layers;
During pre-training the output layer of the CNN has 10 units, which are replaced by a single unit afterwards so that the output layer corresponds to the binary classification of the tracking task; after pre-training, the CNN is fine-tuned for the actual tracking task; during tracking, the pre-trained convolutional layer parameters are kept fixed and only the fully connected layer and output layer parameters are updated online, to adapt to changes of the target and the background;
To build the training set, the target region is selected by hand in the first frame during the tracking initialization phase; positive and negative training samples are drawn around the target region, their positive/negative labels are determined by the overlap rate between a sample and the target region, and the overlap threshold is set to 0.5;
Data augmentation is realized by applying random scale and rotation transformations to the samples; in subsequent tracking, training samples are drawn centered on tracking results that satisfy the confidence condition given by the risk assessment of the classification results; let the training set be T = {(x^(1), y^(1)), ..., (x^(n), y^(n))}, where y^(i) ∈ {C− = 0, C+ = 1}, the class label C− denotes background and C+ denotes target, and x^(i) ∈ Z^d is the target state vector including position and scale; the output layer uses logistic regression to compute the probability that a sample belongs to the target or the background:
$$R(y \mid x; \theta) = h_\theta(x)^{y}\,\bigl(1 - h_\theta(x)\bigr)^{1-y} \qquad (1)$$
where h_θ(x) denotes the logistic (sigmoid) output of the network for input x and θ denotes the network model parameters; the model is trained on the training set T so that the log-likelihood loss function L(θ), of the standard negative log-likelihood form, is minimized:

$$L(\theta) = -\sum_{i=1}^{n} \Bigl[\, y^{(i)} \log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \,\Bigr] \qquad (2)$$
The network weights and biases are adjusted along the negative gradient direction of L(θ) by stochastic gradient descent, and the parameters of all layers above the convolutional layers are updated iteratively by backpropagation.
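Continuing the DualInputCNN sketch above, the following snippet illustrates the online update described in this claim: the convolutional parameters are frozen and only the fully connected and output layers are trained with the logistic (binary cross-entropy) loss by stochastic gradient descent; the batch construction is a toy assumption.

```python
# Online update sketch: freeze the pre-trained convolutional layers, train FC + output only.
import torch
import torch.nn as nn

model = DualInputCNN()
for p in model.conv.parameters():           # keep the pre-trained convolutional layers fixed
    p.requires_grad = False

trainable = list(model.fc.parameters()) + list(model.out.parameters())
optimizer = torch.optim.SGD(trainable, lr=1e-3)
bce = nn.BCELoss()                          # binary cross-entropy = negative log-likelihood L(theta)

# toy online step: templates, sampled regions, and 0/1 labels (background / target)
templates = torch.randn(8, 3, 64, 64)
regions = torch.randn(8, 3, 64, 64)
labels = torch.randint(0, 2, (8, 1)).float()

optimizer.zero_grad()
loss = bce(model(templates, regions), labels)
loss.backward()                             # backpropagation through FC and output layers only
optimizer.step()
print(loss.item())
```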
7. The visual tracking method based on the conformal predictor model as described in claim 1, characterized in that:
The conformal predictor uses an improved CP algorithm to predict the class of a sample; in the improved CP algorithm, the samples in the training set are first assumed to be independent and identically distributed, and the training set T = {(x^(1), y^(1)), ..., (x^(n), y^(n))} is divided into two parts: the first m samples form the proper training set Ta = {(x^(1), y^(1)), ..., (x^(m), y^(m))}, and the remaining q samples form the calibration set Tb = {(x^(m+1), y^(m+1)), ..., (x^(m+q), y^(m+q))}, with n = m + q; the proper training set Ta is used to update the CNN parameters, while the calibration set Tb, together with the sample to be identified, forms a test sequence whose algorithmic randomness is examined to determine the sample class;
The algorithmic randomness test of the sequence is as follows: first, a mapping function A: Z^(q−1) × Z → R is defined, and each sample in the calibration set Tb is mapped one by one into the singular value (strangeness/nonconformity score) space, yielding the singular value sequence α_(m+1), ..., α_(m+q);
Let the target state of the sample to be identified be x_s; assign x_s the class labels C− and C+ respectively to form two test samples (x_s, y_i), i = 0, 1; after the singular values of the test samples are computed, they are combined with the corresponding singular values of the calibration set Tb to form two test sequences, i = 0, 1; the algorithmic randomness level of each sequence is obtained by computing the test statistic, the p-value:

$$p_s(y_i) = \frac{\bigl|\{\, j = m+1, \dots, m+q : \alpha_j \ge \alpha_s^{(i)} \,\}\bigr| + 1}{q + 1} \qquad (3)$$
where p_s(y_i) denotes the p-value when the target state x_s is labeled y_i, i.e., the confidence that x_s belongs to class y_i; given the algorithm risk level threshold ε, the hypotheses whose p-values exceed ε are output as the prediction set of the ICP:

$$\Gamma_s^{\varepsilon} = \{\, y_i : p_s(y_i) > \varepsilon,\ i = 0, 1 \,\} \qquad (4)$$
When the true class y_s of x_s is not contained in Γ_s^ε, a prediction error is considered to have occurred; according to the validity theorem of the conformal predictor, the error rate is not greater than the algorithm risk level ε, i.e.:
$$P\{\, p_s(y_s) \le \varepsilon \,\} \le \varepsilon \qquad (5)$$
In the algorithmic randomness test of the sequence, the singular value mapping function is defined first; it measures the degree to which the sample under test conforms to the overall sample distribution; the conformity is analyzed according to the regression value output by the CNN: the larger the regression value of the sample's features for its true class, the stronger the conformity between the sample and the calibration sequence; the singular value function is defined as:
where R_y(x^(i)) is the regression value of x^(i) for class y computed by formula (1), and the parameter γ adjusts the sensitivity of the singular value α_i to changes in the regression value; the smaller γ is, the more sensitive α_i is to changes in R_y(x^(i)).
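The inductive conformal prediction step can be sketched as follows; the exponential score exp(−R/γ), chosen so that a smaller γ makes the score more sensitive to the regression value, is an assumption consistent with the description above rather than the patent's exact formula, and the function names are placeholders.

```python
# ICP sketch: nonconformity scores from CNN regression values, p-values against the
# calibration scores, and labels with p-value > epsilon kept as the region prediction.
import numpy as np

def nonconformity(regression_value, gamma=0.5):
    # assumed score form: larger regression value for the class -> smaller (more conforming) score
    return np.exp(-np.asarray(regression_value) / gamma)

def icp_predict(r_pos, r_neg, calib_scores, epsilon=0.1, gamma=0.5):
    """r_pos / r_neg: CNN regression values of the test sample for classes C+ / C-."""
    p_values = {}
    for label, r in (("C+", r_pos), ("C-", r_neg)):
        alpha_s = nonconformity(r, gamma)
        # fraction of the test sequence (calibration scores plus the test score)
        # that is at least as nonconforming as the test sample
        p_values[label] = (np.sum(calib_scores >= alpha_s) + 1) / (len(calib_scores) + 1)
    region = {label for label, p in p_values.items() if p > epsilon}
    return region, p_values

calib = nonconformity(np.random.default_rng(0).uniform(0.2, 0.9, size=200))
print(icp_predict(r_pos=0.85, r_neg=0.30, calib_scores=calib))
```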
8. The visual tracking method based on the conformal predictor model as claimed in claim 7, characterized in that the result output by the improved CP algorithm may contain multiple classes; for the binary classification of a sample to be identified, the improved CP algorithm outputs one of the four results ∅, {C−}, {C+}, {C+, C−}; besides the class information, each output result also carries a confidence p-value; according to the region prediction results of all samples, high-confidence samples are selected from them as the candidate targets of each frame;
Specifically: for the image frame at time t, the samples in the frame whose output is {C+} or {C+, C−} are sorted by their confidence p(C+), and the Nc samples with the largest values are chosen to build the candidate target set Ot, where |Ot| ≤ Nc;
The candidate target set Ot contains several possible states of the target at time t; the target moves from some state in Ot to some state in the candidate target set Ot+1 at the next time instant;
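A small sketch of building the candidate set Ot from the per-sample region predictions is given below; the data layout of the samples is an assumption made for illustration.

```python
# Build O_t: keep samples whose region prediction contains C+, rank by p(C+), take at most N_c.
def build_candidate_set(samples, n_c=5):
    """samples: list of (state, region, p_values) tuples produced for one frame."""
    kept = [(state, p["C+"]) for state, region, p in samples if "C+" in region]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:n_c]                      # |O_t| <= N_c

frame_samples = [((10, 20, 1.0), {"C+"}, {"C+": 0.80, "C-": 0.05}),
                 ((12, 19, 1.0), {"C+", "C-"}, {"C+": 0.60, "C-": 0.30}),
                 ((40, 70, 1.0), {"C-"}, {"C+": 0.02, "C-": 0.70})]
print(build_candidate_set(frame_samples, n_c=2))
```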
Target tracking is thereby cast as finding the optimal path: a spatio-temporal energy function E_Track is defined to characterize the target trajectory, and the target trajectory of the segment is obtained by optimizing this energy function (formula (7));
where E_Track consists of two parts, a local cost term E_Local and a pairwise cost term E_Pairwise;
E_Local is defined as the sum, over all time instants, of the CNN output values of the target states x_t for the background class;
The local cost term is designed to cope with partial occlusion of the target: a robust estimator is introduced to reduce the influence of outliers on the function optimization, and E_Local is defined as:
where R_(C−)(x_t) denotes the regression value of the target state x_t for the background class, and ρ(·) is the Huber operator, which enhances the reliability of the local cost term and is defined as:
E_Pairwise describes the degree of variation of the target state; when target occlusion, cluttered background, or target pose changes occur in the sequence, the target state may change abruptly because the estimation error becomes large; assuming that the motion of the target is coherent, the role of E_Pairwise is to penalize abrupt points in the trajectory during the optimization of the energy function, so that the trajectory has a certain smoothness; E_Pairwise is defined as:
The energy function in formula (7) is optimized by dynamic programming to obtain the optimal motion trajectory.
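The dynamic programming optimization can be illustrated with the following Viterbi-style sketch over the per-frame candidate sets; the concrete local and pairwise cost functions passed in the toy example are placeholders, not the patent's E_Local and E_Pairwise.

```python
# Viterbi-style dynamic programming over candidate sets O_t for an energy of the form
# sum_t local_cost(state_t) + sum_t pairwise_cost(state_t, state_{t+1}).
import numpy as np

def optimize_trajectory(candidate_sets, local_cost, pairwise_cost):
    n = len(candidate_sets)
    best = [np.array([local_cost(s) for s in candidate_sets[0]])]
    back = []
    for t in range(1, n):
        prev, cur = candidate_sets[t - 1], candidate_sets[t]
        trans = np.array([[pairwise_cost(p, c) for p in prev] for c in cur])
        total = best[-1][None, :] + trans            # shape (|O_t|, |O_{t-1}|)
        back.append(total.argmin(axis=1))            # best predecessor for each current state
        best.append(total.min(axis=1) + np.array([local_cost(c) for c in cur]))
    # backtrack the minimum-energy path
    idx = int(best[-1].argmin())
    path = [candidate_sets[-1][idx]]
    for t in range(n - 2, -1, -1):
        idx = int(back[t][idx])
        path.append(candidate_sets[t][idx])
    return path[::-1]

sets = [[np.array([0.0, 0.0]), np.array([5.0, 5.0])],
        [np.array([0.5, 0.2]), np.array([6.0, 6.0])],
        [np.array([1.0, 0.5])]]
traj = optimize_trajectory(sets,
                           local_cost=lambda s: float(s[0]) * 0.01,            # stand-in cost
                           pairwise_cost=lambda a, b: float(np.sum((a - b) ** 2)))
print(traj)
```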
9. The visual tracking method based on the conformal predictor model as claimed in claim 2, characterized in that the training sample update comprises:
during tracking, the CNN model parameters are updated with the tracking results of the previous segment, and the next segment is then processed; the tracking result at time t is selected according to its confidence p-value: if p is greater than the set threshold α, positive and negative training samples are drawn around it; otherwise the decision is deferred to the next time instant.
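The p-value-based update rule can be sketched as follows; the data layout and the function name select_update_frames are assumptions for illustration.

```python
# Keep only tracking results whose p-value exceeds alpha for resampling training data.
def select_update_frames(segment_results, alpha=0.5):
    """segment_results: list of (time, state, p_value) for one processed segment."""
    return [(t, state) for t, state, p in segment_results if p > alpha]

results = [(0, (10, 20, 1.0), 0.82), (1, (11, 21, 1.0), 0.35), (2, (12, 22, 1.0), 0.91)]
print(select_update_frames(results))   # frames 0 and 2 are used to draw new training samples
```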
10. A robust visual tracking system based on a convolutional neural network and a conformal predictor, applying the visual tracking method based on the conformal predictor model as described in claim 1.
CN201810270188.XA 2018-03-29 2018-03-29 A kind of visual tracking method based on consistency fallout predictor model Pending CN108460790A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810270188.XA CN108460790A (en) 2018-03-29 2018-03-29 A kind of visual tracking method based on consistency fallout predictor model

Publications (1)

Publication Number Publication Date
CN108460790A true CN108460790A (en) 2018-08-28

Family

ID=63237253

Country Status (1)

Country Link
CN (1) CN108460790A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180828