CN110033473A - Moving target tracking method based on template matching and deep classification network - Google Patents

Moving target tracking method based on template matching and deep classification network

Info

Publication number
CN110033473A
Authority
CN
China
Prior art keywords
template
target
image
network
residual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910297980.9A
Other languages
Chinese (zh)
Other versions
CN110033473B (en)
Inventor
田小林
李芳
李帅
李娇娇
荀亮
贾楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN201910297980.9A
Publication of CN110033473A
Application granted
Publication of CN110033473B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a moving target tracking method based on template matching and a deep classification network, which mainly addresses the problems of the prior art that target detection is slow and that tracking becomes inaccurate when the target undergoes appearance deformation or occlusion. The implementation is: 1) build a dual-residual deep classification network and train it; 2) extract a template network and a detection network from the dual-residual deep classification network; 3) extract template features with the template network; 4) extract detection features with the detection network; 5) match the template features against the detection features to obtain a template matching map; 6) determine the target position from the template matching map; 7) update the template features according to the target position; 8) judge whether the current frame is the last frame; if so, end the tracking; otherwise, take the updated template features as the template features of the next frame and return to 4). The invention tracks quickly and with high accuracy, and is suitable for video target tracking under drastic appearance deformation and illumination variation.

Description

Moving target tracking method based on template matching and deep classification network
Technical field
The invention belongs to the technical field of image processing, and further relates to a moving target tracking method that can be used for video target tracking under conditions such as drastic appearance deformation, camera shake, scale variation, and illumination variation.
Background art
The main task of moving target tracking is to learn a tracker given only the information about the target in the initial frame, so that the tracker can accurately predict the position of the target to be tracked in the next frame of the video sequence. As the understanding of computer vision has deepened, moving target tracking has been widely applied and developed, and with the continuing application of deep learning to image classification and image segmentation, deep learning methods have gradually been applied to target tracking as well. Compared with the hand-crafted feature extraction of traditional tracking methods, which relies heavily on the designer's prior knowledge, deep learning methods can exploit the advantages of big data: through training on massive data, a neural network can learn features automatically, and a large number of tracking algorithms now realize moving target tracking this way. However, owing to objective factors such as occlusion, background clutter, appearance deformation, illumination variation, and viewpoint change, accurately tracking a target still poses a great challenge.
Nanjing University of Aeronautics and Astronautics, in its patent application "Occlusion-resistant target tracking method" (application number 201610818828.7, publication number CN106408591A), discloses a target tracking method based on detection, tracking, and learning. The concrete steps of this method are: first, determine the target region from the initial image frame, from which the tracker forms the initial target template; second, initialize the cascade detector parameters; then, add an occlusion detection mechanism and update the threshold in real time; next, compute the tracking confidence of the tracker and the detection confidence of the detector separately; finally, integrate the tracking result according to the confidences, reinitialize with the detection result when the tracker fails, and pass the tracking result through a learning module to update the detector parameters. The shortcoming of this method is that it uses the weighted results of the target template and the background template as the confidence value, which fails to reflect the fluctuation of the response of the target to be tracked; the recognition capability of the trained classifier is therefore not strong enough, and accurate long-term tracking cannot be achieved when the target undergoes strong illumination variation or moves quickly.
South China Agricultural University, in its patent application "Target tracking method based on local feature learning" (application number 201610024953.0, publication number CN108038435A), discloses a moving target tracking method using local feature learning. The concrete steps of this method are: (1) decompose the target region and the background region into a large number of local units and build an appearance model by deep learning training; (2) compute the confidence that each local region of the next frame image belongs to the target, obtaining a confidence map for target localization; (3) set thresholds T_pos and T_neg, add local regions with confidence greater than T_pos to the target sample set and local regions with confidence less than T_neg to the background sample set, and update the appearance model. The shortcoming of this method is that the sample type of each local region of the image must be judged by the set thresholds; when the target to be tracked is occluded to a large degree, target samples and background samples are easily misclassified, so the updated model can no longer track the target accurately.
Summary of the invention
The purpose of the present invention, in view of the above shortcomings of the prior art, is to propose a moving target tracking method based on template matching and a deep classification network, so as to track the target accurately and effectively when the target undergoes deformation, scale variation, or occlusion.
The technical solution for realizing the object of the invention is as follows: first, to address the lack of training samples, an offline training mechanism is chosen; second, a template network and a detection network are constructed from ResNet50, the features of the template image are extracted with the template network, and the features of the image to be detected are extracted with the detection network; finally, the extracted template features are matched against the extracted detection features to determine the target position. The specific steps include the following:
(1) Build the dual-residual deep classification network model:
(1a) Take two deep residual neural networks ResNet50 as the front-end networks of the dual-residual deep classification network model; the input-layer parameters of the two deep residual neural networks differ, while the parameters of the other layers are identical;
(1b) Build two 3-layer fully connected networks as the back-end networks of the dual-residual deep classification network model; the first layer of each fully connected network is the input layer, the second layer is a hidden layer, and the third layer is the output layer; the first-layer parameters of the two fully connected networks differ, while the parameters of the second and third layers are identical;
(2) Input the ImageNet classification data set into the dual-residual deep classification network model, and use stochastic gradient descent to update the weight of each node in the model, obtaining the trained dual-residual deep classification network model;
(3) In the trained dual-residual deep classification network model, delete all layers after the penultimate hidden layer of each deep residual network ResNet50, obtaining the template network model and the detection network model;
(4) Extract the template feature maps with the template network:
(4a) Input the first frame image of the video image sequence containing the target to be tracked, and determine a rectangle centered at the initial position of the target to be tracked with the same length and width as the target;
(4b) Crop the target image from the rectangle and resize it to 224 × 224 × 3 pixels, obtaining the template image;
(4c) Input the template image into the template network, extract the image features, and form feature maps from all the features; the last layer of the template network outputs 2048 template feature maps of size 7 × 7;
(5) Extract the detection feature maps with the detection network:
(5a) Input an image to be detected containing the target to be tracked, and determine a rectangle centered at the initial position of the target to be tracked with twice the length and width of the target;
(5b) Crop the target image from the rectangle and resize it to 448 × 448 × 3 pixels, obtaining the detection image;
(5c) Input the detection image into the detection network, extract the image features, and form feature maps from all the features; the last layer of the detection network outputs 2048 detection feature maps of size 14 × 14;
(6) Template matching:
(6a) Put the 2048 template feature maps and the 2048 detection feature maps into one-to-one correspondence, forming 2048 template-detection feature pairs;
(6b) In each template-detection feature pair, convolve the 7 × 7 template feature map over the 14 × 14 detection feature map in sliding-window fashion, obtaining 2048 template matching maps of size 14 × 14;
(6c) Put the 14 × 14 pixels of the 2048 template matching maps into one-to-one correspondence and sum the matching values at the corresponding points, obtaining one 14 × 14 feature response map;
(7) Determine the target position:
(7a) Sort the responses in the 14 × 14 feature response map from largest to smallest, choose the normalized coordinates corresponding to the top 10 responses, and take their average normalized coordinate value;
(7b) According to the average normalized coordinate value, compute the position of the tracked target in the video frame image by the following formulas:
x′ = x × m + a − w, y′ = y × n + b − h
where x′ denotes the abscissa of the upper-left pixel of the target image in the video frame, x denotes the average normalized abscissa, a denotes the abscissa of the initial position of the target to be tracked, w denotes the width of the template image, m denotes the width of the detection image, y′ denotes the ordinate of the upper-left pixel of the target image in the video frame, y denotes the average normalized ordinate, b denotes the ordinate of the initial position of the target to be tracked, h denotes the height of the template image, and n denotes the height of the detection image;
(8) Extract the tracked-target feature map according to the position of the tracked target in the video frame image, and update the template feature map according to the tracked-target feature map: Z = ηZ_1 + (1 − η)Z_2, where Z denotes the updated template feature map, Z_1 denotes the template feature map in the previous frame image, η denotes the learning rate of the template update with |η| ≤ 1, and Z_2 denotes the tracked-target feature map in the current video frame;
(9) judge current frame video image whether be sequence of video images to be tracked last frame video image, if so, Then terminate the tracking to target to be tracked is moved, otherwise, using updated template characteristic figure as next frame target to be tracked Template characteristic figure returns (5), completes target following.
Compared with the prior art, the present invention has the following advantages:
First, because the present invention trains on a classification data set offline, it overcomes the problem in the prior art of iterating on the first frame image when training the network, which easily overfits and causes inaccurate tracking when the target to be tracked deforms to a large degree; the invention can therefore track the target more accurately under large deformation.
Second, because the present invention constructs a dual-residual deep classification network model, matches the image features extracted by the template network and the detection network, and judges the position of the target to be tracked from the responses, it overcomes the problem in the prior art that, when the target is occluded to a large degree, positive and negative samples are easily misclassified and the updated model can no longer track the target accurately; the invention can therefore track the target more accurately under heavy occlusion.
Detailed description of the invention
Fig. 1 is the implementation flowchart of the present invention;
Fig. 2 shows simulation results of the present invention.
Specific embodiment
The embodiments and effects of the present invention are further described below with reference to the accompanying drawings.
Referring to Fig. 1, the specific steps of the present invention are as follows.
Step 1. Build the dual-residual deep classification network model.
1.1) Set the front-end networks:
Adjust the input-layer parameters of two existing deep residual neural networks ResNet50: the number of input-layer neurons of the first network is set to 224 × 224 × 3, and the number of input-layer neurons of the second network is set to 448 × 448 × 3; all other layer parameters remain unchanged. Take these two deep residual neural networks as the front-end networks of the dual-residual deep classification network model;
1.2) Set the back-end networks:
Build two three-layer fully connected networks as the back-end networks of the dual-residual deep classification network model; the first layer of each fully connected network is the input layer, the second layer is a hidden layer, and the third layer is the output layer; the first-layer parameters of the two fully connected networks differ, while the parameters of the second and third layers are identical. The parameters of each layer of the two fully connected networks are as follows:
The number of first-layer neurons of the first network is set to 1 × 1 × 2048, and the number of first-layer neurons of the second network is set to 2 × 2 × 2048;
The number of second-layer neurons of both networks is set to 1024, with the activation function set to the rectified linear unit (ReLU) function;
The number of third-layer neurons of both networks is set to 1000, with the activation function set to the Softmax function.
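For illustration, the architecture of step 1 can be sketched in PyTorch as follows. This is a minimal sketch, not the patent's code: torchvision's ResNet50 stands in for the two residual front ends, the adaptive pooling sizes are an assumption chosen to produce the 1 × 1 × 2048 and 2 × 2 × 2048 first fully connected layers, and all names (make_branch, DualResidualClassifier) are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

def make_branch(pool_size: int) -> nn.Module:
    """One ResNet50 front end plus the 3-layer fully connected back end of step 1.2."""
    backbone = resnet50()                                # randomly initialised ResNet50
    backbone.avgpool = nn.AdaptiveAvgPool2d(pool_size)   # 1 -> 1x1x2048, 2 -> 2x2x2048
    backbone.fc = nn.Identity()                          # drop the stock classification head
    in_features = 2048 * pool_size * pool_size           # size of the back end's input layer
    head = nn.Sequential(
        nn.Linear(in_features, 1024), nn.ReLU(),         # second layer: 1024 neurons, ReLU
        nn.Linear(1024, 1000), nn.Softmax(dim=1),        # third layer: 1000 neurons, Softmax
    )
    return nn.Sequential(backbone, head)

class DualResidualClassifier(nn.Module):
    """Dual-residual deep classification network: a 224x224x3 and a 448x448x3 branch."""
    def __init__(self):
        super().__init__()
        self.branch224 = make_branch(pool_size=1)        # template-side branch
        self.branch448 = make_branch(pool_size=2)        # detection-side branch

    def forward(self, x224, x448):
        return self.branch224(x224), self.branch448(x448)

model = DualResidualClassifier()
p224, p448 = model(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 448, 448))
print(p224.shape, p448.shape)   # torch.Size([1, 1000]) torch.Size([1, 1000])
```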
Step 2. Train the dual-residual deep classification network model.
Input the ImageNet classification data set into the dual-residual deep classification network model built in step 1, and use stochastic gradient descent to update the weight of each node in the model, obtaining the trained dual-residual deep classification network model:
(2a) Select a number at random in the range (0, 0.1) and use it as the initial weight of every node in the dual-residual deep classification network model;
(2b) Take the initial weight of each node as the current weight of each node of the dual-residual deep classification network model in the first iteration;
(2c) Randomly select 2^n sample images (3 ≤ n ≤ 7) from the ImageNet classification data set and forward-propagate them through the dual-residual deep classification network model; the output layer of the model outputs the classification results of the 2^n sample images;
(2d) According to the classification results of the sample images, calculate the average log loss of the classification results by the following formula:
L = −(1/N) Σ_i [y_i ln p_i + (1 − y_i) ln(1 − p_i)]
where L denotes the average log loss of the classification results, N denotes the total number of randomly selected sample images, i denotes the serial number of an input sample image, y_i denotes the class of the i-th input sample image (y_i takes the value 1 for a positive-class sample and 0 for a negative-class sample), and p_i denotes the output value of the dual-residual deep classification network model for the i-th sample image in the classification results;
(2e) Take the partial derivative of the average log loss with respect to the current weight of each node in the dual-residual deep classification network, obtaining the gradient value Δw_k of the current weight of each node in the dual-residual deep classification network model;
(2f) According to the gradient value of each node's current weight, calculate the updated weight of each node in the dual-residual deep classification network model:
w′_k = w_k − α × Δw_k
where w′_k denotes the weight of the k-th node of the dual-residual deep classification network model after updating, w_k denotes the current weight of the k-th node of the dual-residual deep classification network model, and α denotes the learning rate, with value range (0, 1);
(2g) Judge whether all sample images in the training data set have been selected; if so, the trained dual-residual deep classification network model is obtained; otherwise, take the weights after each node update as the current weights and return to (2c).
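A hedged sketch of this training step, assuming the loss of (2d) is the standard average log (cross-entropy) loss and reusing the DualResidualClassifier sketch above; the batch tensors and one-hot labels are synthetic stand-ins for ImageNet data, and the value of α is a placeholder.

```python
import random
import torch

alpha = 0.01                        # learning rate alpha, chosen in (0, 1)
n = random.randint(3, 7)            # 3 <= n <= 7, so the batch holds 2**n images
batch_size = 2 ** n

def average_log_loss(p, y):
    """(2d): L = -(1/N) * sum_i [ y_i*ln(p_i) + (1 - y_i)*ln(1 - p_i) ]."""
    p = p.clamp(1e-7, 1 - 1e-7)     # numerical safety for the logarithms
    return -(y * p.log() + (1 - y) * (1 - p).log()).mean()

# One illustrative update on synthetic tensors standing in for an ImageNet batch.
x224 = torch.randn(batch_size, 3, 224, 224)
x448 = torch.randn(batch_size, 3, 448, 448)
y = torch.zeros(batch_size, 1000)
y[torch.arange(batch_size), torch.randint(0, 1000, (batch_size,))] = 1.0  # one-hot labels

optimizer = torch.optim.SGD(model.parameters(), lr=alpha)
p224, p448 = model(x224, x448)
loss = average_log_loss(p224, y) + average_log_loss(p448, y)
loss.backward()                     # (2e): gradient of L w.r.t. every node weight
optimizer.step()                    # (2f): w_k <- w_k - alpha * grad_k
optimizer.zero_grad()
```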
Step 3. Extract the template network model and the detection network model.
In the trained dual-residual deep classification network model obtained in step 2, delete all layers after the 49th network layer of each of the two deep networks; the remaining layers form two new networks.
From these remaining networks, extract the template network model and the detection network model according to the input-layer parameters: the remaining network whose input-layer parameter is 224 × 224 × 3 serves as the template network model, and the remaining network whose input-layer parameter is 448 × 448 × 3 serves as the detection network model.
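In the terms of the sketch above, this truncation keeps everything up to the last convolutional stage of each branch; a minimal sketch, assuming the DualResidualClassifier defined earlier:

```python
import torch
import torch.nn as nn

def to_feature_extractor(branch: nn.Sequential) -> nn.Module:
    backbone = branch[0]            # the ResNet50 front end of the branch
    # keep conv1 ... layer4, i.e. everything before the pooling and FC layers
    return nn.Sequential(*list(backbone.children())[:-2])

template_net = to_feature_extractor(model.branch224)   # 224x224x3 -> 2048x7x7
detect_net = to_feature_extractor(model.branch448)     # 448x448x3 -> 2048x14x14

feat_t = template_net(torch.randn(1, 3, 224, 224))
feat_d = detect_net(torch.randn(1, 3, 448, 448))
print(feat_t.shape, feat_d.shape)   # (1, 2048, 7, 7) and (1, 2048, 14, 14)
```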
Step 4. Extract the template feature maps with the template network.
(4a) Input the first frame image of the video image sequence containing the target to be tracked, and determine a rectangle centered at the initial position of the target to be tracked with the same length and width as the target;
(4b) Crop the target image from the rectangle and resize it to 224 × 224 × 3 pixels, obtaining the template image;
(4c) Input the template image into the template network obtained in step 3 to extract features from the template image, and form feature maps from the extracted features; the last layer of the template network outputs 2048 feature maps of size 7 × 7, which are taken as the template feature maps.
Step 5. Extract the detection feature maps with the detection network.
(5a) Input an image to be detected containing the target to be tracked, and determine a rectangle centered at the initial position of the target to be tracked with twice the length and width of the target;
(5b) Crop the target image from the rectangle and resize it to 448 × 448 × 3 pixels, obtaining the detection image;
(5c) Input the detection image into the detection network obtained in step 3 to extract features from the detection image, and form feature maps from the extracted features; the last layer of the detection network outputs 2048 feature maps of size 14 × 14, which are taken as the detection feature maps.
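Steps 4 and 5 differ only in the crop scale, the resize target, and the network used; the following hedged sketch covers both, using OpenCV for cropping and resizing, with the (cx, cy, w, h) box format, the /255 normalization, and the helper name crop_and_extract as illustrative assumptions:

```python
import cv2
import numpy as np
import torch

def crop_and_extract(frame, box, scale, out_size, net):
    """frame: HxWx3 uint8 image; box: (cx, cy, w, h) target box in pixels;
    scale: 1 for the template region, 2 for the detection region."""
    cx, cy, w, h = box
    rw, rh = w * scale, h * scale                    # region size (1x or 2x the target)
    x0 = max(int(cx - rw / 2), 0)
    y0 = max(int(cy - rh / 2), 0)
    patch = frame[y0:y0 + int(rh), x0:x0 + int(rw)]  # crop the rectangle
    patch = cv2.resize(patch, (out_size, out_size))  # resize to 224 or 448
    tensor = torch.from_numpy(patch).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        return net(tensor)

frame = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)   # stand-in frame
box = (320, 240, 60, 100)                                          # stand-in target box
template_feat = crop_and_extract(frame, box, 1, 224, template_net)  # 1x2048x7x7
detect_feat = crop_and_extract(frame, box, 2, 448, detect_net)      # 1x2048x14x14
```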
Step 6. Template matching.
(6a) Put the 2048 template feature maps obtained in step 4 and the 2048 detection feature maps obtained in step 5 into one-to-one correspondence, forming 2048 template-detection feature pairs;
(6b) In each template-detection feature pair, starting from the upper-left corner of the detection feature map with a step of 1 pixel, successively slide the corresponding template feature map to the upper-right corner, the lower-right corner, and the lower-left corner of the detection feature map, finally returning to the upper-left corner, performing a convolution operation at each position; this yields 2048 template matching maps of size 14 × 14;
(6c) Put the 14 × 14 pixels of the 2048 template matching maps into one-to-one correspondence and sum the matching values at the corresponding points, obtaining one 14 × 14 feature response map.
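The per-pair sliding convolution of (6b) followed by the point-wise sum of (6c) amounts to a depthwise cross-correlation; a sketch with F.conv2d, reusing the feature tensors from the sketch above, where padding=3 is an assumption added so that sliding a 7 × 7 template over a 14 × 14 map produces the 14 × 14 matching maps the step describes:

```python
import torch.nn.functional as F

def template_match(template_feat, detect_feat):
    """template_feat: 1x2048x7x7, detect_feat: 1x2048x14x14."""
    # (6a)+(6b): pair channel k of the template with channel k of the detection
    # map and convolve in sliding-window fashion -> 2048 matching maps
    kernels = template_feat.permute(1, 0, 2, 3)         # 2048 x 1 x 7 x 7
    match_maps = F.conv2d(detect_feat, kernels, padding=3, groups=2048)
    # (6c): sum the 2048 matching maps point by point -> one 14x14 response map
    return match_maps.sum(dim=1, keepdim=True)          # 1 x 1 x 14 x 14

response = template_match(template_feat, detect_feat)
print(response.shape)   # torch.Size([1, 1, 14, 14])
```

An equivalent single conv2d call with one 2048-channel kernel would fuse (6b) and (6c); the grouped form is kept here to mirror the two sub-steps.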
Step 7. Determine the target position.
(7a) Sort the responses in the 14 × 14 feature response map obtained in step 6 from largest to smallest, choose the normalized coordinates corresponding to the top 10 responses, and average these 10 normalized coordinate values, obtaining the average normalized coordinate value (x, y);
(7b) According to the average normalized coordinate value, compute the position of the tracked target in the video frame image by the following formulas:
x′ = x × m + a − w,
y′ = y × n + b − h,
where x′ denotes the abscissa of the upper-left pixel of the target image in the video frame, a denotes the abscissa of the initial position of the target to be tracked, w denotes the width of the template image, m denotes the width of the detection image, y′ denotes the ordinate of the upper-left pixel of the target image in the video frame, b denotes the ordinate of the initial position of the target to be tracked, h denotes the height of the template image, and n denotes the height of the detection image.
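A short sketch of this localization step under the same assumptions as above; the variable names follow the patent's formulas, and the numeric arguments in the example call are placeholders:

```python
import torch

def locate_target(response, a, b, w, h, m=448, n=448):
    """response: 1x1x14x14 map; (a, b): initial target position; (w, h):
    template image width/height; (m, n): detection image width/height."""
    r = response.view(-1)                             # flatten 14x14 -> 196 values
    top = torch.topk(r, k=10).indices                 # (7a): top 10 responses
    rows = (top // 14).float() / 14.0                 # normalized row coordinates
    cols = (top % 14).float() / 14.0                  # normalized column coordinates
    x, y = cols.mean().item(), rows.mean().item()     # average normalized coordinates
    x_prime = x * m + a - w                           # (7b): abscissa of the target's
    y_prime = y * n + b - h                           #       upper-left pixel in the frame
    return x_prime, y_prime

x_p, y_p = locate_target(response, a=320, b=240, w=224, h=224)
```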
Step 8. Template update.
In the detection feature map, centered on the position of the tracked target in the video frame image found in step 7, crop a patch at the initial size of the tracked target, obtaining the tracked-target feature map; according to the tracked-target feature map, update the template feature map:
Z = ηZ_1 + (1 − η)Z_2,
where Z denotes the updated template feature map, Z_1 denotes the template feature map in the previous frame image, η denotes the learning rate of the template update with |η| ≤ 1, and Z_2 denotes the tracked-target feature map in the current video frame.
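This update is a running average over template features; a one-step sketch, where the value of η and the crop location are assumptions:

```python
eta = 0.6                                   # assumed template-update learning rate, |eta| <= 1
Z2 = detect_feat[:, :, 4:11, 4:11]          # stand-in 7x7 crop around the new target position
template_feat = eta * template_feat + (1 - eta) * Z2   # Z = eta*Z1 + (1 - eta)*Z2
```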
Step 9. Judge whether the current frame video image is the last frame of the video image sequence to be tracked; if so, end the tracking of the moving target to be tracked; otherwise, take the updated template feature map of step 8 as the template feature map of the next frame target to be tracked and return to step 5, completing the target tracking.
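Putting steps 4 through 9 together, the per-frame loop might look like the following sketch, reusing the helpers defined above on a stand-in video sequence:

```python
import numpy as np

frames = [np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
          for _ in range(5)]                     # stand-in video sequence
box = (320, 240, 60, 100)                        # initial target box (cx, cy, w, h)

Z = crop_and_extract(frames[0], box, 1, 224, template_net)       # step 4: template features
for frame in frames[1:]:                                         # steps 5-9 per frame
    Xd = crop_and_extract(frame, box, 2, 448, detect_net)        # step 5: detection features
    resp = template_match(Z, Xd)                                 # step 6: response map
    x_p, y_p = locate_target(resp, a=box[0], b=box[1], w=224, h=224)  # step 7
    box = (x_p + box[2] / 2, y_p + box[3] / 2, box[2], box[3])   # new target centre
    Z2 = Xd[:, :, 4:11, 4:11]                    # stand-in crop at the new position
    Z = eta * Z + (1 - eta) * Z2                 # step 8: template update
```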
The effect of the invention is described further below with reference to a simulation experiment.
1. Simulation experiment conditions:
The hardware platform of the simulation experiment of the present invention is: CPU Intel Core i5-6500 with a main frequency of 3.2 GHz, memory 8 GB, GPU NVIDIA TITAN Xp. The software platform is: Ubuntu 16.04 LTS 64-bit operating system, Python 3.6.5.
2. Simulation contents and results:
A moving target tracking simulation experiment was carried out with the method of the present invention on a video image sequence, taken from the Object Tracking Benchmark 2015 database, of a man walking along a road; the sequence contains 252 frames of video images. The results of the simulation experiment are shown in Fig. 2, in which:
Fig. 2(a) is the 1st frame image of the video image sequence used in the simulation experiment; the white rectangle in Fig. 2(a) indicates the initial position of the target to be tracked.
Fig. 2(b) is the tracking result, in the simulation experiment of the present invention, for a frame of the acquired video image sequence in which the tracked target undergoes appearance deformation and occlusion; the gray rectangle marks the predicted position of the target to be tracked, and the white rectangle marks the actual position of the target to be tracked. It can be seen from the figure that, compared with the target in Fig. 2(a), the target has undergone appearance deformation and occlusion.
Fig. 2(c) is the tracking result, in the simulation experiment of the present invention, for a frame of the acquired video image sequence in which the tracked target undergoes appearance deformation and illumination variation; the gray rectangle marks the predicted position of the target to be tracked, and the white rectangle marks the position of the target to be tracked. It can be seen from the figure that, compared with the target in Fig. 2(a), the target has undergone appearance deformation and illumination enhancement.
As can be seen from Fig. 2(b) and Fig. 2(c), the target outlined by the gray rectangle coincides with the target outlined by the white rectangle, showing that the present invention can track the target accurately and efficiently when the target undergoes deformation, illumination variation, or occlusion in the video images.

Claims (5)

1. A moving target tracking method based on template matching and a deep classification network, characterized by comprising the following:
(1) Build a dual-residual deep classification network model:
(1a) Take two deep residual neural networks ResNet50 as the front-end networks of the dual-residual deep classification network model; the input-layer parameters of the two deep residual neural networks differ, while the parameters of the other layers are identical;
(1b) Build two 3-layer fully connected networks as the back-end networks of the dual-residual deep classification network model; the first layer of each fully connected network is the input layer, the second layer is a hidden layer, and the third layer is the output layer; the first-layer parameters of the two fully connected networks differ, while the parameters of the second and third layers are identical;
(2) Input the ImageNet classification data set into the dual-residual deep classification network model, and use stochastic gradient descent to update the weight of each node in the dual-residual deep classification network model, obtaining the trained dual-residual deep classification network model;
(3) In the trained dual-residual deep classification network model, delete all layers after the penultimate hidden layer of each deep residual network ResNet50, obtaining the template network model and the detection network model;
(4) Extract the template feature maps with the template network:
(4a) Input the first frame image of the video image sequence containing the target to be tracked, and determine a rectangle centered at the initial position of the target to be tracked with the same length and width as the target;
(4b) Crop the target image from the rectangle and resize it to 224 × 224 × 3 pixels, obtaining the template image;
(4c) Input the template image into the template network, extract the image features, and form feature maps from all the features; the last layer of the template network outputs 2048 template feature maps of size 7 × 7;
(5) Extract the detection feature maps with the detection network:
(5a) Input an image to be detected containing the target to be tracked, and determine a rectangle centered at the initial position of the target to be tracked with twice the length and width of the target;
(5b) Crop the target image from the rectangle and resize it to 448 × 448 × 3 pixels, obtaining the detection image;
(5c) Input the detection image into the detection network, extract the image features, and form feature maps from all the features; the last layer of the detection network outputs 2048 detection feature maps of size 14 × 14;
(6) Template matching:
(6a) Put the 2048 template feature maps and the 2048 detection feature maps into one-to-one correspondence, forming 2048 template-detection feature pairs;
(6b) In each template-detection feature pair, convolve the 7 × 7 template feature map over the 14 × 14 detection feature map in sliding-window fashion, obtaining 2048 template matching maps of size 14 × 14;
(6c) Put the 14 × 14 pixels of the 2048 template matching maps into one-to-one correspondence and sum the matching values at the corresponding points, obtaining one 14 × 14 feature response map;
(7) Determine the target position:
(7a) Sort the responses in the 14 × 14 feature response map from largest to smallest, choose the normalized coordinates corresponding to the top 10 responses, and take their average normalized coordinate value;
(7b) According to the average normalized coordinate value, compute the position of the tracked target in the video frame image by the following formulas:
x′ = x × m + a − w, y′ = y × n + b − h,
where x′ denotes the abscissa of the upper-left pixel of the target image in the video frame, x denotes the average normalized abscissa, a denotes the abscissa of the initial position of the target to be tracked, w denotes the width of the template image, m denotes the width of the detection image, y′ denotes the ordinate of the upper-left pixel of the target image in the video frame, y denotes the average normalized ordinate, b denotes the ordinate of the initial position of the target to be tracked, h denotes the height of the template image, and n denotes the height of the detection image;
(8) Extract the tracked-target feature map according to the position of the tracked target in the video frame image, and update the template feature map according to the tracked-target feature map: Z = ηZ_1 + (1 − η)Z_2, where Z denotes the updated template feature map, Z_1 denotes the template feature map in the previous frame image, η denotes the learning rate of the template update with |η| ≤ 1, and Z_2 denotes the tracked-target feature map in the current video frame;
(9) Judge whether the current frame video image is the last frame of the video image sequence to be tracked; if so, end the tracking of the moving target to be tracked; otherwise, take the updated template feature map as the template feature map of the next frame target to be tracked and return to (5), completing the target tracking.
2. The method according to claim 1, wherein the numbers of input-layer neurons of the two deep residual neural networks ResNet50 in (1a) are set to 224 × 224 × 3 and 448 × 448 × 3, respectively.
3. The method according to claim 1, wherein the parameters of each layer of the two fully connected networks in (1b) are set as follows:
the numbers of first-layer neurons are 1 × 1 × 2048 and 2 × 2 × 2048, respectively;
the number of second-layer neurons is 1024, with the activation function set to the rectified linear unit (ReLU) function;
the number of third-layer neurons is 1000, with the activation function set to the Softmax function.
4. The method according to claim 1, wherein using stochastic gradient descent in (2) to update the weight of each node in the dual-residual deep classification network model comprises the following specific steps:
(2a) Select a number at random in the range (0, 0.1) and use it as the initial weight of every node in the dual-residual deep classification network model;
(2b) Take the initial weight of each node as the current weight of each node of the dual-residual deep classification network model in the first iteration;
(2c) Randomly select 2^n sample images (3 ≤ n ≤ 7) from the ImageNet classification data set and forward-propagate them through the dual-residual deep classification network model; the output layer of the model outputs the classification results of the 2^n sample images;
(2d) According to the classification results of the sample images, calculate the average log loss of the classification results by the following formula:
L = −(1/N) Σ_i [y_i ln p_i + (1 − y_i) ln(1 − p_i)]
where L denotes the average log loss of the classification results, N denotes the total number of randomly selected sample images, i denotes the serial number of an input sample image, y_i denotes the class of the i-th input sample image (y_i takes the value 1 for a positive-class sample and 0 for a negative-class sample), and p_i denotes the output value of the dual-residual deep classification network model for the i-th sample image in the classification results;
(2e) Take the partial derivative of the average log loss with respect to the current weight of each node in the dual-residual deep classification network, obtaining the gradient value Δw_k of the current weight of each node in the dual-residual deep classification network model;
(2f) According to the gradient value of each node's current weight, calculate the updated weight of each node in the dual-residual deep classification network model:
w′_k = w_k − α × Δw_k
where w′_k denotes the weight of the k-th node of the dual-residual deep classification network model after updating, w_k denotes the current weight of the k-th node of the dual-residual deep classification network model, and α denotes the learning rate, with value range (0, 1);
(2g) Judge whether all sample images in the training data set have been selected; if so, the trained dual-residual deep classification network model is obtained; otherwise, take the weights after each node update as the current weights and return to (2c).
5. The method according to claim 1, wherein the sliding-window convolution in (6b) starts from the upper-left corner of the detection feature map of each template-detection feature pair and, with a step of 1 pixel, successively slides the corresponding template feature map to the upper-right corner, the lower-right corner, and the lower-left corner of the detection feature map, finally returning to the upper-left corner, performing a convolution operation at each position.
CN201910297980.9A 2019-04-15 2019-04-15 Moving target tracking method based on template matching and deep classification network Active CN110033473B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910297980.9A CN110033473B (en) Moving target tracking method based on template matching and deep classification network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910297980.9A CN110033473B (en) Moving target tracking method based on template matching and deep classification network

Publications (2)

Publication Number Publication Date
CN110033473A true CN110033473A (en) 2019-07-19
CN110033473B CN110033473B (en) 2021-04-20

Family

ID=67238315

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910297980.9A Active CN110033473B (en) Moving target tracking method based on template matching and deep classification network

Country Status (1)

Country Link
CN (1) CN110033473B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110647836A (en) * 2019-09-18 2020-01-03 中国科学院光电技术研究所 Robust single-target tracking method based on deep learning
CN110705479A (en) * 2019-09-30 2020-01-17 北京猎户星空科技有限公司 Model training method, target recognition method, device, equipment and medium
CN110766725A (en) * 2019-10-31 2020-02-07 北京市商汤科技开发有限公司 Template image updating method and device, target tracking method and device, electronic equipment and medium
CN110766724A (en) * 2019-10-31 2020-02-07 北京市商汤科技开发有限公司 Target tracking network training and tracking method and device, electronic equipment and medium
CN110930428A (en) * 2020-02-19 2020-03-27 成都纵横大鹏无人机科技有限公司 Target tracking method and device, electronic equipment and storage medium
CN111145215A (en) * 2019-12-25 2020-05-12 北京迈格威科技有限公司 Target tracking method and device
CN111461010A (en) * 2020-04-01 2020-07-28 贵州电网有限责任公司 Power equipment identification efficiency optimization method based on template tracking
CN111640136A (en) * 2020-05-23 2020-09-08 西北工业大学 Depth target tracking method in complex environment
CN112287906A (en) * 2020-12-18 2021-01-29 中汽创智科技有限公司 Template matching tracking method and system based on depth feature fusion
CN112818801A (en) * 2021-01-26 2021-05-18 每步科技(上海)有限公司 Motion counting method, recognition device, recognition system and storage medium
CN115100441A (en) * 2022-08-23 2022-09-23 浙江大华技术股份有限公司 Object detection method, electronic device, and storage medium
CN116596958A (en) * 2023-07-18 2023-08-15 四川迪晟新达类脑智能技术有限公司 Target tracking method and device based on online sample augmentation

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020154820A1 (en) * 2001-03-06 2002-10-24 Toshimitsu Kaneko Template matching method and image processing device
US6621929B1 (en) * 1999-06-22 2003-09-16 Siemens Corporate Research, Inc. Method for matching images using spatially-varying illumination change models
US20070250898A1 (en) * 2006-03-28 2007-10-25 Object Video, Inc. Automatic extraction of secondary video streams
CN101867699A (en) * 2010-05-25 2010-10-20 中国科学技术大学 Real-time tracking method of nonspecific target based on partitioning
CN103150572A (en) * 2012-12-11 2013-06-12 中国科学院深圳先进技术研究院 On-line type visual tracking method
CN105719292A (en) * 2016-01-20 2016-06-29 华东师范大学 Method of realizing video target tracking by adopting two-layer cascading Boosting classification algorithm
CN105787963A (en) * 2016-02-26 2016-07-20 浪潮软件股份有限公司 Video target tracking method and device
CN107689052A (en) * 2017-07-11 2018-02-13 西安电子科技大学 Visual target tracking method based on multi-model fusion and structuring depth characteristic
CN108694723A (en) * 2018-05-11 2018-10-23 西安天和防务技术股份有限公司 A kind of target in complex environment tenacious tracking method
CN109345559A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Expand the motion target tracking method with depth sorting network based on sample

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6621929B1 (en) * 1999-06-22 2003-09-16 Siemens Corporate Research, Inc. Method for matching images using spatially-varying illumination change models
US20020154820A1 (en) * 2001-03-06 2002-10-24 Toshimitsu Kaneko Template matching method and image processing device
US20070250898A1 (en) * 2006-03-28 2007-10-25 Object Video, Inc. Automatic extraction of secondary video streams
WO2007126780A2 (en) * 2006-03-28 2007-11-08 Object Video, Inc. Automatic extraction of secondary video streams
CN101867699A (en) * 2010-05-25 2010-10-20 中国科学技术大学 Real-time tracking method of nonspecific target based on partitioning
CN103150572A (en) * 2012-12-11 2013-06-12 中国科学院深圳先进技术研究院 On-line type visual tracking method
CN105719292A (en) * 2016-01-20 2016-06-29 华东师范大学 Method of realizing video target tracking by adopting two-layer cascading Boosting classification algorithm
CN105787963A (en) * 2016-02-26 2016-07-20 浪潮软件股份有限公司 Video target tracking method and device
CN107689052A (en) * 2017-07-11 2018-02-13 西安电子科技大学 Visual target tracking method based on multi-model fusion and structuring depth characteristic
CN108694723A (en) * 2018-05-11 2018-10-23 西安天和防务技术股份有限公司 A kind of target in complex environment tenacious tracking method
CN109345559A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Expand the motion target tracking method with depth sorting network based on sample

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TIAN, XIAOLIN et al.: "Tracking with spatial constrained coding", IET Computer Vision *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110647836A (en) * 2019-09-18 2020-01-03 中国科学院光电技术研究所 Robust single-target tracking method based on deep learning
CN110705479A (en) * 2019-09-30 2020-01-17 北京猎户星空科技有限公司 Model training method, target recognition method, device, equipment and medium
CN110766725B (en) * 2019-10-31 2022-10-04 北京市商汤科技开发有限公司 Template image updating method and device, target tracking method and device, electronic equipment and medium
CN110766725A (en) * 2019-10-31 2020-02-07 北京市商汤科技开发有限公司 Template image updating method and device, target tracking method and device, electronic equipment and medium
CN110766724A (en) * 2019-10-31 2020-02-07 北京市商汤科技开发有限公司 Target tracking network training and tracking method and device, electronic equipment and medium
CN110766724B (en) * 2019-10-31 2023-01-24 北京市商汤科技开发有限公司 Target tracking network training and tracking method and device, electronic equipment and medium
CN111145215A (en) * 2019-12-25 2020-05-12 北京迈格威科技有限公司 Target tracking method and device
CN111145215B (en) * 2019-12-25 2023-09-05 北京迈格威科技有限公司 Target tracking method and device
CN110930428A (en) * 2020-02-19 2020-03-27 成都纵横大鹏无人机科技有限公司 Target tracking method and device, electronic equipment and storage medium
CN111461010B (en) * 2020-04-01 2022-08-12 贵州电网有限责任公司 Power equipment identification efficiency optimization method based on template tracking
CN111461010A (en) * 2020-04-01 2020-07-28 贵州电网有限责任公司 Power equipment identification efficiency optimization method based on template tracking
CN111640136B (en) * 2020-05-23 2022-02-25 西北工业大学 Depth target tracking method in complex environment
CN111640136A (en) * 2020-05-23 2020-09-08 西北工业大学 Depth target tracking method in complex environment
CN112287906A (en) * 2020-12-18 2021-01-29 中汽创智科技有限公司 Template matching tracking method and system based on depth feature fusion
CN112818801A (en) * 2021-01-26 2021-05-18 每步科技(上海)有限公司 Motion counting method, recognition device, recognition system and storage medium
CN112818801B (en) * 2021-01-26 2024-04-26 每步科技(上海)有限公司 Motion counting method, recognition device, recognition system and storage medium
CN115100441A (en) * 2022-08-23 2022-09-23 浙江大华技术股份有限公司 Object detection method, electronic device, and storage medium
CN115100441B (en) * 2022-08-23 2022-11-18 浙江大华技术股份有限公司 Object detection method, electronic device, and storage medium
CN116596958A (en) * 2023-07-18 2023-08-15 四川迪晟新达类脑智能技术有限公司 Target tracking method and device based on online sample augmentation
CN116596958B (en) * 2023-07-18 2023-10-10 四川迪晟新达类脑智能技术有限公司 Target tracking method and device based on online sample augmentation

Also Published As

Publication number Publication date
CN110033473B (en) 2021-04-20

Similar Documents

Publication Publication Date Title
CN110033473A (en) Motion target tracking method based on template matching and depth sorting network
Yang et al. Real-time face detection based on YOLO
CN111259930B (en) General target detection method of self-adaptive attention guidance mechanism
CN108010067B (en) A kind of visual target tracking method based on combination determination strategy
CN107145908B (en) A kind of small target detecting method based on R-FCN
CN106709936A (en) Single target tracking method based on convolution neural network
CN110119728A (en) Remote sensing images cloud detection method of optic based on Multiscale Fusion semantic segmentation network
CN109102547A (en) Robot based on object identification deep learning model grabs position and orientation estimation method
CN104850865B (en) A kind of Real Time Compression tracking of multiple features transfer learning
CN108960086A (en) Based on the multi-pose human body target tracking method for generating confrontation network positive sample enhancing
CN109903312A (en) A kind of football sportsman based on video multi-target tracking runs distance statistics method
CN109613006A (en) A kind of fabric defect detection method based on end-to-end neural network
CN106529499A (en) Fourier descriptor and gait energy image fusion feature-based gait identification method
CN106204638A (en) A kind of based on dimension self-adaption with the method for tracking target of taking photo by plane blocking process
CN108304761A (en) Method for text detection, device, storage medium and computer equipment
CN110276269A (en) A kind of Remote Sensing Target detection method based on attention mechanism
CN109934115A (en) Construction method, face identification method and the electronic equipment of human face recognition model
CN106780546B (en) The personal identification method of motion blur encoded point based on convolutional neural networks
CN110084836A (en) Method for tracking target based on the response fusion of depth convolution Dividing Characteristics
CN107529650A (en) The structure and closed loop detection method of network model, related device and computer equipment
CN107507170A (en) A kind of airfield runway crack detection method based on multi-scale image information fusion
CN109509187A (en) A kind of efficient check algorithm for the nibs in big resolution ratio cloth image
CN108460790A (en) A kind of visual tracking method based on consistency fallout predictor model
CN110321811A (en) Depth is against the object detection method in the unmanned plane video of intensified learning
CN108648211A (en) A kind of small target detecting method, device, equipment and medium based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant