CN110033473B - Moving target tracking method based on template matching and depth classification network - Google Patents

Moving target tracking method based on template matching and depth classification network

Info

Publication number
CN110033473B
Authority
CN
China
Prior art keywords
template
target
image
residual
double
Prior art date
Legal status
Active
Application number
CN201910297980.9A
Other languages
Chinese (zh)
Other versions
CN110033473A (en)
Inventor
田小林
李芳
李帅
李娇娇
荀亮
贾楠
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201910297980.9A priority Critical patent/CN110033473B/en
Publication of CN110033473A publication Critical patent/CN110033473A/en
Application granted granted Critical
Publication of CN110033473B publication Critical patent/CN110033473B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a moving target tracking method based on template matching and a depth classification network, which mainly solves the prior-art problems of low target detection speed and inaccurate tracking when the target is deformed or occluded. The implementation scheme is as follows: 1) build a double-residual depth classification network and train it; 2) extract a template network and a detection network from the double-residual depth classification network; 3) extract template features with the template network; 4) extract detection features with the detection network; 5) perform template matching of the template features over the detection features to obtain a template matching map; 6) determine the target position from the template matching map; 7) update the template features according to the target position; 8) judge whether the current frame is the last frame: if so, end target tracking; otherwise, take the updated template features as the template features for the next frame and return to step 4). The method is fast and accurate in tracking and is used for tracking video targets undergoing severe deformation and illumination change.

Description

Moving target tracking method based on template matching and depth classification network
Technical Field
The invention belongs to the technical field of image processing, and further relates to a moving target tracking method which can be used for tracking video targets with severe deformation, lens jitter, scale change, illumination change and the like.
Background
The main task of moving target tracking is to learn a tracker given only the initial-frame information of the target to be tracked, so that the tracker can accurately predict the target's position in subsequent frames of a video sequence. As understanding of computer vision has deepened, moving target tracking has been widely applied and developed in this field, and with the continued success of deep learning in image classification and image segmentation, deep learning methods have gradually been applied to target tracking as well. Compared with the hand-crafted feature extraction of traditional tracking methods, which relies heavily on the designer's prior knowledge, deep learning can exploit big data: a neural network automatically learns features from large amounts of training data, and a large number of tracking algorithms now use it to track moving targets. However, owing to objective factors such as occlusion, background clutter, appearance deformation, illumination change and viewpoint change, accurately tracking a target remains very challenging.
The patent document "an anti-occlusion target tracking method" (patent application No. 201610818828.7, publication No. 106408591a) applied by Nanjing aerospace university discloses a target tracking method based on detection, tracking and learning. Firstly, determining a target area according to an initial image frame, and forming an initial target template through the target area by a tracker; secondly, initializing parameters of a cascade detector; then, adding a shielding detection mechanism and updating a threshold value in real time; then, respectively calculating the tracking confidence and the detection confidence of the tracker and the detector to the target; and finally, integrating the tracking result according to the confidence coefficient, if the tracker fails to track, initializing by using the detection result, and updating the corresponding parameters of the detector by the tracking result through a learning module. The method has the disadvantages that the weighted result of the target template and the background template is used as a confidence value, the fluctuation condition of the response of the target to be tracked cannot be reflected, the identification capability of the classifier obtained by training is not strong enough, and the target cannot be accurately tracked for a long time when the target is in intense illumination change and moves rapidly.
The patent document "Target tracking method based on local feature learning" (patent application No. 201610024953.0, publication No. CN108038435A), filed by South China Agricultural University, discloses a method for tracking a moving target using local feature learning. The specific steps are: (1) decompose the target area and the background area into a large number of local units, and train an appearance model by deep learning; (2) compute the confidence that each local area of the next frame belongs to the target, obtaining a confidence map for locating the target; (3) set thresholds T_pos and T_neg: local areas whose confidence is greater than T_pos are added to the target sample set, local areas whose confidence is less than T_neg are added to the background sample set, and the appearance model is updated. The disadvantage of this method is that the sample type of each local area must be judged by thresholds; when the target to be tracked is occluded to a large extent, target samples or background samples may be misclassified, so the updated model cannot keep tracking the target accurately.
Disclosure of Invention
In view of the deficiencies of the prior art, the invention aims to provide a moving target tracking method based on template matching and a depth classification network, so as to track a target accurately and effectively when the target is deformed, changes in scale or is occluded.
The technical idea of the invention is: first, an offline training mechanism on a classification data set is selected to address the problem of insufficient training samples; second, a template network and a detection network are built from ResNet50, the template network is used to extract features of the template image, and the detection network is used to extract features of the image to be detected; finally, the extracted template features are matched against the features extracted from the detection image to determine the position of the target. The specific steps include the following:
(1) building a double residual depth classification network model:
(1a) two depth residual error neural networks ResNet50 are used as front-end networks of a double-residual error depth classification network model, the parameters of input layers of the two depth residual error neural networks are different, and the parameters of other layers are the same.
(1b) Two 3-layer fully-connected networks are set up to serve as a back-end network of a double-residual-error depth classification network model, the first layer of each fully-connected network is an input layer, the second layer is a hidden layer, the third layer is an output layer, the parameters of the first layer of each fully-connected network are different, and the parameters of the second layer and the third layer of each fully-connected network are the same;
(2) inputting the ImageNet classification data set into the double-residual depth classification network model, and updating the weight of each node in the model by the stochastic gradient descent method to obtain the trained double-residual depth classification network model;
(3) deleting all layers behind the penultimate hidden layer of the depth residual error network ResNet50 in the trained double-residual error depth classification network model to obtain a template network model and a detection network model;
(4) extracting a template feature map by using a template network:
(4a) inputting the first frame image of a video image sequence containing the target to be tracked, and taking a rectangular frame, centered at the initial position of the target to be tracked, whose size equals the length and width of the target;
(4b) cutting the target image out of the rectangular frame and resizing it to 224×224×3 pixels to obtain a template image;
(4c) inputting the template image into the template network, extracting features of the image, forming an image feature map from all the features, and outputting 2048 template feature maps of 7×7 at the last layer of the template network;
(5) extracting a detection feature map by using a detection network:
(5a) inputting an image to be detected containing the target to be tracked, and taking a rectangular frame, centered at the initial position of the target to be tracked, whose size is twice the length and width of the target;
(5b) cutting the target image out of the rectangular frame and resizing it to 448×448×3 pixels to obtain a detection image;
(5c) inputting the detection image into the detection network, extracting features of the image, forming an image feature map from all the features, and outputting 2048 detection feature maps of 14×14 at the last layer of the detection network;
(6) template matching:
(6a) 2048 template feature maps and 2048 detection feature maps are in one-to-one correspondence to form 2048 template detection feature pairs;
(6b) in each pair of template detection feature pairs, carrying out sliding frame type convolution on a 7 × 7 template feature map on a 14 × 14 detection feature map to obtain 2048 14 × 14 template matching maps;
(6c) putting the 14 × 14 pixel points of the 2048 template matching maps into one-to-one correspondence, and summing the matching values at the corresponding points to obtain a 14 × 14 feature response map;
(7) Determining the position of the target:
(7a) sorting the response values in the 14 × 14 feature response map in descending order, selecting the normalized coordinates corresponding to the first 10 response values, and computing the average normalized coordinate value;
(7b) calculating the position of the tracking target in the video frame image according to the average normalized coordinate value through the following formula;
x′=x×m+a-w,y′=y×n+b-h
wherein x 'represents the abscissa value of the first pixel at the upper left corner of the target image in the video frame, x represents the average normalized abscissa, a represents the abscissa value of the initial position of the target to be tracked, w represents the width of the template image, m represents the width of the detection image, y' represents the ordinate value of the first pixel at the upper left corner of the target image in the video frame, y represents the average normalized ordinate, b represents the ordinate value of the initial position of the target to be tracked, h represents the height of the template image, and n represents the height of the detection image.
(8) Extracting a tracking target feature map according to the position of the tracking target in the video frame image, and updating the template feature map according to the tracking target feature map: Z = ηZ1 + (1-η)Z2, where Z represents the updated template feature map, Z1 represents the template feature map in the previous frame of image, η represents the learning rate of template updating with η ≤ 1, and Z2 represents the tracking target feature map in the current video frame;
(9) judging whether the current frame video image is the last frame of the video image sequence to be tracked: if so, ending the tracking of the moving target to be tracked; otherwise, taking the updated template feature map as the template feature map of the target in the next frame and returning to step (5). Target tracking is thereby completed.
Compared with the prior art, the invention has the following advantages:
First, the invention uses a mechanism of offline training on a classification data set, which overcomes the prior-art problems that repeatedly iterating on the first frame image to train a network easily causes overfitting and leads to inaccurate tracking when the target to be tracked deforms to a large extent; the invention can therefore track the target more accurately under large deformation.
Second, the invention builds a double-residual depth classification network model, matches the image features extracted by the template network and the detection network, and uses the response values to determine the position of the target to be tracked. This overcomes the prior-art problem that, when the target to be tracked is occluded to a large extent, positive and negative samples are easily misclassified so that the updated model can no longer track the target accurately; the invention can therefore track the target more accurately under heavy occlusion.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a diagram of simulation results of the present invention.
Detailed Description
Embodiments and effects of the present invention will be further described below with reference to the accompanying drawings.
Referring to FIG. 1, the specific steps of the present invention are as follows.
Step 1, building a double residual error depth classification network model.
1.1) setting a front-end network:
adjusting the input layer parameters of two existing deep residual neural networks ResNet50: the number of neurons of the first network's input layer is set to 224×224×3 and the number of neurons of the second network's input layer is set to 448×448×3, while the parameters of the other layers are kept unchanged; the two deep residual neural networks are used as the front-end networks of the double-residual depth classification network model;
1.2) setting a back-end network:
constructing two three-layer fully-connected networks as the back-end networks of the double-residual depth classification network model, where the first layer of each fully-connected network is an input layer, the second layer is a hidden layer and the third layer is an output layer; the parameters of the first layers of the two fully-connected networks are different, the parameters of the second and third layers are the same, and the parameters of each layer in the two fully-connected networks are as follows:
the number of the first layer neurons of the first network is set to 1 × 1 × 2048, and the number of the first layer neurons of the second network is set to 2 × 2 × 2048;
the number of the neurons of the second layer of the two networks is simultaneously set to 1024, and the activation function is simultaneously set to a modified linear unit ReLU function;
the number of neurons in the third layer of the two networks was set to 1000 at the same time, and the activation function was set to the Softmax function at the same time.
Step 2, training the double-residual depth classification network model.
Inputting the ImageNet classification data set into the double-residual depth classification network model built in step 1, and updating the weight of each node in the model by the stochastic gradient descent method to obtain the trained double-residual depth classification network model:
(2a) randomly selecting a number in the range of (0,0.1), and using the number as an initial weight of each node in the double-residual depth classification network model;
(2b) taking the initial weight of each node as the current weight of each node in the double-residual depth classification network model in the first iteration process;
(2c) randomly selecting 2^n sample images from the ImageNet classification data set, where 3 ≤ n ≤ 7, and propagating them forward through the double-residual depth classification network model, whose output layer outputs the classification results of the 2^n sample images;
(2d) calculating the average logarithmic loss value of the classification result according to the classification result of the sample image and the following formula:
L = -(1/N) × Σ_{i=1}^{N} [ y_i × log(p_i) + (1 - y_i) × log(1 - p_i) ],
wherein L represents the average log loss value of the classification result, N represents the total number of randomly selected sample images, i represents the serial number of the input sample image, y_i represents the class of the ith input sample image (y_i takes the value 1 for a positive-class sample and 0 for a negative-class sample), and p_i represents the output value of the double-residual depth classification network model for the ith sample image in the classification result;
(2e) calculating the partial derivative of the average log loss value with respect to the current weight of each node in the double-residual depth classification network, obtaining the gradient value Δw_k of the current weight of each node in the model;
(2f) Calculating the updated weight of each node in the double-residual depth classification network model according to the gradient value of the current weight of the node:
w_k' = w_k - α × Δw_k,
wherein w_k' represents the updated weight of the kth node of the double-residual depth classification network model, w_k represents the current weight of the kth node, and α represents the learning rate, with value range (0, 1);
(2g) judging whether all the sample images in the training data set have been selected: if so, the trained double-residual depth classification network model is obtained; otherwise, the updated weight of each node is taken as its current weight and (2c) is executed again.
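As a concrete illustration of steps 2d and 2f, the short NumPy sketch below evaluates the average log loss in the form written above and applies one plain gradient-descent update; the function names and the toy numbers are illustrative only.

```python
import numpy as np

def average_log_loss(y, p, eps=1e-12):
    """Step 2d: L = -(1/N) * sum_i [ y_i*log(p_i) + (1-y_i)*log(1-p_i) ],
    with y_i = 1 for positive-class samples and y_i = 0 for negative-class samples."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def sgd_step(w, grad_w, alpha=0.01):
    """Step 2f: w_k' = w_k - alpha * dL/dw_k, with learning rate alpha in (0, 1)."""
    return w - alpha * grad_w

# toy mini-batch of 2**3 = 8 samples (the method draws 2^n samples with 3 <= n <= 7)
y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.3, 0.1, 0.8, 0.4])
print(average_log_loss(y, p))   # approximately 0.30
```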
Step 3, extracting the template network model and the detection network model.
In the trained double-residual depth classification network model obtained in step 2, the network layers after the 49th layer of each of the two deep networks are deleted, and the remaining parts become new networks.
The template network model and the detection network model are then taken from the remaining networks according to their input layer parameters: the remaining network whose input layer is 224×224×3 is used as the template network model, and the remaining network whose input layer is 448×448×3 is used as the detection network model.
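The truncation of step 3 amounts to keeping only the convolutional part of each ResNet50. A minimal PyTorch sketch follows; `weights=None` is used purely for illustration, whereas in the method the weights would be those obtained from the training in step 2.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

def truncate_resnet50():
    """Drop every layer after the last residual stage (the global pooling and the
    fully connected back-end), keeping only the convolutional feature extractor."""
    backbone = resnet50(weights=None)   # illustration only; load the step-2 weights in practice
    return nn.Sequential(*list(backbone.children())[:-2]).eval()

template_net = truncate_resnet50()   # 224x224x3 input -> 2048 feature maps of 7x7
detect_net = truncate_resnet50()     # 448x448x3 input -> 2048 feature maps of 14x14

with torch.no_grad():
    z = template_net(torch.zeros(1, 3, 224, 224))   # shape (1, 2048, 7, 7)
    x = detect_net(torch.zeros(1, 3, 448, 448))     # shape (1, 2048, 14, 14)
```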
Step 4, extracting the template feature map with the template network.
(4a) Inputting the first frame image of a video image sequence containing the target to be tracked, and taking a rectangular frame, centered at the initial position of the target to be tracked, whose size equals the length and width of the target;
(4b) cutting the target image out of the rectangular frame and resizing it to 224×224×3 pixels to obtain the template image;
(4c) inputting the template image into the template network obtained in the step 3 to perform feature extraction on the template image, forming feature maps by using the extracted features, outputting 2048 feature maps of 7 × 7 in the last layer of the template network, and taking the 2048 feature maps of 7 × 7 as the template feature maps.
Step 5, extracting the detection feature map with the detection network.
(5a) Inputting an image to be detected containing the target to be tracked, and taking a rectangular frame, centered at the initial position of the target to be tracked, whose size is twice the length and width of the target.
(5b) Cutting the target image out of the rectangular frame and resizing it to 448×448×3 pixels to obtain the detection image;
(5c) inputting the detection image into the detection network obtained in the step 3 to perform feature extraction on the detection image, forming a feature map by using the extracted features, outputting 2048 feature maps of 14 × 14 in the last layer of the detection network, and taking the 2048 feature maps of 14 × 14 as the detection feature map.
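Steps 4a-4b and 5a-5b differ only in the scale of the cropped rectangle (1× versus 2× the target size) and the output resolution (224 versus 448). A possible OpenCV helper is sketched below; clamping the rectangle at the image border is an assumption, since the method does not specify how boundary cases are handled.

```python
import cv2

def crop_and_resize(frame, center, target_w, target_h, scale, out_size):
    """Cut a (scale*target_w) x (scale*target_h) rectangle centred on `center`
    out of `frame` and resize it to out_size x out_size (3 channels)."""
    cx, cy = center
    w, h = scale * target_w, scale * target_h
    x1 = int(max(cx - w / 2, 0))
    y1 = int(max(cy - h / 2, 0))
    x2 = int(min(cx + w / 2, frame.shape[1]))
    y2 = int(min(cy + h / 2, frame.shape[0]))
    return cv2.resize(frame[y1:y2, x1:x2], (out_size, out_size))

# template image of step 4b and detection image of step 5b:
# template_img = crop_and_resize(first_frame, (cx, cy), w0, h0, scale=1, out_size=224)
# detect_img   = crop_and_resize(current_frame, (cx, cy), w0, h0, scale=2, out_size=448)
```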
Step 6, template matching.
(6a) The 2048 template feature maps obtained in the step 4 and 2048 detection feature maps obtained in the step 5 are in one-to-one correspondence to form 2048 template detection feature pairs;
(6b) in each pair of template detection feature pairs, taking the upper left corner of a detection feature map as a starting point and 1 pixel as a step length, sequentially translating the corresponding template feature maps to the upper right corner, the lower right corner and the lower left corner of the detection feature map, and finally translating the template feature maps back to the upper left corner for convolution operation to obtain 2048 template matching maps of 14 multiplied by 14;
(6c) putting the 14 × 14 pixel points of the 2048 template matching maps into one-to-one correspondence, and summing the matching values at the corresponding points to obtain a 14 × 14 feature response map.
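Step 6 is a channel-wise (depthwise) sliding-window correlation followed by a sum over channels. A possible PyTorch sketch is given below; `padding=3` is an assumption made so that the output keeps the 14 × 14 size stated above, and the tensors are assumed to come from the step-3 feature-extraction sketch.

```python
import torch
import torch.nn.functional as F

def match_template(template_feats, detect_feats):
    """template_feats: (2048, 7, 7) template feature maps,
    detect_feats:   (2048, 14, 14) detection feature maps.
    Each 7x7 template map is slid with stride 1 over its corresponding 14x14
    detection map, and the 2048 per-channel matching maps are then summed
    point by point into a single 14x14 feature response map."""
    c = template_feats.shape[0]
    maps = F.conv2d(detect_feats.unsqueeze(0),    # (1, 2048, 14, 14)
                    template_feats.unsqueeze(1),  # (2048, 1, 7, 7), one kernel per channel
                    groups=c, padding=3)          # (1, 2048, 14, 14) template matching maps
    return maps.sum(dim=1).squeeze(0)             # (14, 14) feature response map

# response = match_template(z.squeeze(0), x.squeeze(0))   # z, x from the step-3 sketch
```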
Step 7, determining the position of the target.
(7a) Sorting the response values in the 14 × 14 feature response map obtained in step 6 in descending order, selecting the normalized coordinates corresponding to the first 10 response values, and averaging these 10 normalized coordinates to obtain the average normalized coordinate value (x, y);
(7b) calculating the position of the tracking target in the video frame image according to the average normalized coordinate value by the following formula:
x′=x×m+a-w,
y′=y×n+b-h,
wherein x 'represents the abscissa value of the first pixel at the upper left corner of the target image in the video frame, a represents the abscissa value of the initial position of the target to be tracked, w represents the width of the template image, m represents the width of the detection image, y' represents the ordinate value of the first pixel at the upper left corner of the target image in the video frame, b represents the ordinate value of the initial position of the target to be tracked, h represents the height of the template image, and n represents the height of the detection image.
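The localisation of step 7 can then be written as follows; interpreting the "normalized coordinate" of a response value as its column/row index divided by the width/height of the response map is an assumption of this sketch.

```python
import torch

def locate_target(response, a, b, w, h, m, n, topk=10):
    """Step 7: average the normalised coordinates of the 10 largest responses,
    then map them to frame coordinates with
        x' = x*m + a - w,   y' = y*n + b - h,
    where (a, b) is the initial target position, (w, h) the template size and
    (m, n) the detection image size."""
    H, W = response.shape
    _, idx = torch.topk(response.flatten(), topk)                 # indices of the 10 largest values
    rows = torch.div(idx, W, rounding_mode='floor').float() / H   # normalised ordinates y
    cols = (idx % W).float() / W                                  # normalised abscissas x
    x, y = cols.mean().item(), rows.mean().item()
    return x * m + a - w, y * n + b - h
```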
Step 8, updating the template.
In the detection feature map, taking the position of the tracking target in the video frame image obtained in the step 7 as the center, and taking the initial size of the tracking target as the size to perform a cutting operation, so as to obtain a tracking target feature map, and according to the tracking target feature map, updating the template feature map:
Z = ηZ1 + (1-η)Z2,
wherein Z represents the updated template feature map, Z1 represents the template feature map in the previous frame of image, η represents the learning rate of template updating with η ≤ 1, and Z2 represents the feature map of the tracked target in the current video frame.
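A possible sketch of the cutting and blending of step 8 is shown below. Taking the tracking-target window as a 7 × 7 crop of the 14 × 14 detection feature map (the target occupies roughly half of the detection region), clamping the window at the border, and η = 0.9 are all assumptions of this sketch.

```python
def update_template(Z1, X, x_norm, y_norm, eta=0.9, size=7):
    """Z1: previous template feature map (2048, 7, 7); X: detection feature map (2048, 14, 14).
    Cut a size x size window out of X, centred on the matched position given by the
    normalised coordinates, and blend it into the template: Z = eta*Z1 + (1-eta)*Z2."""
    C, H, W = X.shape
    r = min(max(int(round(y_norm * H)) - size // 2, 0), H - size)
    c = min(max(int(round(x_norm * W)) - size // 2, 0), W - size)
    Z2 = X[:, r:r + size, c:c + size]     # tracking target feature map of step 8
    return eta * Z1 + (1 - eta) * Z2      # updated template feature map Z
```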
Step 9, judging whether the current frame video image is the last frame of the video image sequence to be tracked: if so, ending the tracking of the moving target to be tracked; otherwise, taking the template feature map updated in step 8 as the template feature map of the target in the next frame and returning to step 5. Target tracking is thereby completed.
The effect of the present invention will be further explained with the simulation experiment.
1. Simulation experiment conditions are as follows:
the hardware test platform of the simulation experiment of the invention is as follows: the CPU is intel Core i5-6500, the main frequency is 3.2GHz, the memory is 8GB, and the GPU is NVIDIA TITAN Xp; the software platform is as follows: ubuntu 16.04 LTS, 64-bit operating system, python 3.6.5.
2. Simulation content and results:
A moving target tracking simulation experiment was carried out with the method of the invention on a video image sequence collected from the Object Tracking Benchmark 2015 database; the sequence shows a man walking along a road and contains 252 frames of video images in total. The results of the simulation experiment are shown in FIG. 2, wherein:
fig. 2(a) is a 1 st frame image of a video image sequence acquired by a simulation experiment, and a white-line rectangular frame in fig. 2(a) represents an initial position of a target to be tracked.
Fig. 2(b) is a tracking result of a video image when appearance deformation and target occlusion occur in a frame of target to be tracked for target tracking of an acquired video image sequence in a simulation experiment of the present invention, wherein a gray line rectangular frame marks a predicted position of the target to be tracked, a white line rectangular frame marks a real position of the target to be tracked, and as can be seen from the diagram, the target to be tracked has appearance deformation and target occlusion compared with the target to be tracked in fig. 2 (a).
Fig. 2(c) is a tracking result of a frame of target to be tracked when appearance deformation and illumination change occur to a target to be tracked which performs target tracking on an acquired video image sequence in a simulation experiment of the present invention, wherein a gray line rectangular frame marks a predicted position of the target to be tracked, and a white line rectangular frame marks a position of the target to be tracked. As can be seen from the figure, the target to be tracked has appearance deformation and illumination enhancement compared with the target to be tracked in fig. 2 (a).
As can be seen from fig. 2(b) and 2(c), the target framed by the rectangular frame with gray lines in the figure is consistent with the target framed by the rectangular frame with white lines, which shows that the present invention can accurately and efficiently track the target when the target to be tracked in the video image is deformed, changed in illumination, and shielded.

Claims (5)

1. A moving target tracking method based on template matching and a depth classification network is characterized by comprising the following steps:
(1) building a double residual depth classification network model:
(1a) two depth residual error neural networks ResNet50 are used as front-end networks of a double-residual error depth classification network model, the parameters of input layers of the two depth residual error neural networks are different, and the parameters of other layers are the same;
(1b) two 3-layer fully-connected networks are set up to serve as a back-end network of a double-residual-error depth classification network model, the first layer of each fully-connected network is an input layer, the second layer is a hidden layer, the third layer is an output layer, the parameters of the first layer of each fully-connected network are different, and the parameters of the second layer and the third layer of each fully-connected network are the same;
(2) inputting the ImageNet classification data set into the double-residual depth classification network model, and updating the weight of each node in the model by the stochastic gradient descent method to obtain the trained double-residual depth classification network model;
(3) deleting all layers behind the penultimate hidden layer of the depth residual error network ResNet50 in the trained double-residual error depth classification network model to obtain a template network model and a detection network model;
(4) extracting a template feature map by using a template network:
(4a) inputting the first frame image of a video image sequence containing the target to be tracked, and taking a rectangular frame, centered at the initial position of the target to be tracked, whose size equals the length and width of the target;
(4b) cutting the target image out of the rectangular frame and resizing it to 224×224×3 pixels to obtain a template image;
(4c) inputting a template image into a template network, extracting the features of the image, forming an image feature map by all the features, and outputting 2048 template feature maps of 7 multiplied by 7 on the last layer of the template network;
(5) extracting a detection feature map by using a detection network:
(5a) inputting an image to be detected containing the target to be tracked, and taking a rectangular frame, centered at the initial position of the target to be tracked, whose size is twice the length and width of the target;
(5b) cutting the target image out of the rectangular frame and resizing it to 448×448×3 pixels to obtain a detection image;
(5c) inputting a detection image into a detection network, extracting the characteristics of the image, forming an image characteristic diagram by all the characteristics, and outputting 2048 detection characteristic diagrams of 14 multiplied by 14 at the last layer of the detection network;
(6) template matching:
(6a) 2048 template feature maps and 2048 detection feature maps are in one-to-one correspondence to form 2048 template detection feature pairs;
(6b) in each pair of template detection feature pairs, carrying out sliding frame type convolution on a 7 × 7 template feature map on a 14 × 14 detection feature map to obtain 2048 14 × 14 template matching maps;
(6c) putting the 14 × 14 pixel points of the 2048 template matching maps into one-to-one correspondence, and summing the matching values at the corresponding points to obtain a 14 × 14 feature response map;
(7) determining the position of the target:
(7a) sorting the response values in the 14 × 14 feature response map in descending order, selecting the normalized coordinates corresponding to the first 10 response values, and computing the average normalized coordinate value;
(7b) calculating the position of the tracking target in the video frame image according to the average normalized coordinate value through the following formula;
x′=x×m+a-w,y′=y×n+b-h,
wherein x 'represents the abscissa value of the first pixel at the upper left corner of a target image in a video frame, x represents the average normalized abscissa, a represents the abscissa value of the initial position of the target to be tracked, w represents the width of a template image, m represents the width of a detection image, y' represents the ordinate value of the first pixel at the upper left corner of the target image in the video frame, y represents the average normalized ordinate, b represents the ordinate value of the initial position of the target to be tracked, h represents the height of the template image, and n represents the height of the detection image;
(8) extracting a tracking target feature map according to the position of the tracking target in the video frame image, and updating the template feature map according to the tracking target feature map: Z = ηZ1 + (1-η)Z2, where Z represents the updated template feature map, Z1 represents the template feature map in the previous frame of image, η represents the learning rate of template updating with η ≤ 1, and Z2 represents the tracking target feature map in the current video frame;
(9) judging whether the current frame video image is the last frame of the video image sequence to be tracked: if so, ending the tracking of the moving target to be tracked; otherwise, taking the updated template feature map as the template feature map of the target in the next frame and returning to step (5). Target tracking is thereby completed.
2. The method of claim 1, wherein the number of neurons in the input layers of the two deep residual neural networks ResNet50 in (1a) is set to 224 × 224 × 3 and 448 × 448 × 3, respectively.
3. The method of claim 1, wherein the parameters of each of the two fully-connected networks in (1b) are set as follows:
the number of the first layer neurons is 1 × 1 × 2048 and 2 × 2 × 2048, respectively;
the number of the neurons of the second layer is 1024, and the activation function of the neurons is set as a modified linear unit ReLU function;
the number of neurons in the third layer is 1000, and the activation function of the neurons is set to be a Softmax function.
4. The method according to claim 1, wherein in (2) the stochastic gradient descent method is used to update the weight of each node in the double-residual depth classification network model, with the following specific steps:
(2a) randomly selecting a number in the range of (0,0.1), and using the number as an initial weight of each node in the double-residual depth classification network model;
(2b) taking the initial weight of each node as the current weight of each node in the double-residual depth classification network model in the first iteration process;
(2c) randomly selecting 2^n sample images from the ImageNet classification data set, where 3 ≤ n ≤ 7, and propagating them forward through the double-residual depth classification network model, whose output layer outputs the classification results of the 2^n sample images;
(2d) calculating the average logarithmic loss value of the classification result according to the classification result of the sample image and the following formula:
L = -(1/N) × Σ_{i=1}^{N} [ y_i × log(p_i) + (1 - y_i) × log(1 - p_i) ],
wherein L represents the average log loss value of the classification result, N represents the total number of randomly selected sample images, i represents the serial number of the input sample image, y_i represents the class of the ith input sample image (y_i takes the value 1 for a positive-class sample and 0 for a negative-class sample), and p_i represents the output value of the double-residual depth classification network model for the ith sample image in the classification result;
(2e) calculating the partial derivative of the average log loss value with respect to the current weight of each node in the double-residual depth classification network, obtaining the gradient value Δw_k of the current weight of each node in the model;
(2f) Calculating the updated weight of each node in the double-residual depth classification network model according to the gradient value of the current weight of the node:
w_k' = w_k - α × Δw_k,
wherein w_k' represents the updated weight of the kth node of the double-residual depth classification network model, w_k represents the current weight of the kth node, and α represents the learning rate, with value range (0, 1);
(2g) judging whether all the sample images in the training data set have been selected: if so, the trained double-residual depth classification network model is obtained; otherwise, the updated weight of each node is taken as its current weight and (2c) is executed again.
5. The method of claim 1, wherein the sliding frame convolution in (6b) is performed by using the top left corner of the detected feature map in each pair of template detected feature pairs as a starting point, using 1 pixel as a step size, sequentially translating the corresponding template feature map to the top right corner, the bottom right corner and the bottom left corner of the detected feature map, and finally translating back to the top left corner.
CN201910297980.9A 2019-04-15 2019-04-15 Moving target tracking method based on template matching and depth classification network Active CN110033473B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910297980.9A CN110033473B (en) 2019-04-15 2019-04-15 Moving target tracking method based on template matching and depth classification network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910297980.9A CN110033473B (en) 2019-04-15 2019-04-15 Moving target tracking method based on template matching and depth classification network

Publications (2)

Publication Number Publication Date
CN110033473A CN110033473A (en) 2019-07-19
CN110033473B true CN110033473B (en) 2021-04-20

Family

ID=67238315

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910297980.9A Active CN110033473B (en) 2019-04-15 2019-04-15 Moving target tracking method based on template matching and depth classification network

Country Status (1)

Country Link
CN (1) CN110033473B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110647836B (en) * 2019-09-18 2022-09-20 中国科学院光电技术研究所 Robust single-target tracking method based on deep learning
CN110705479A (en) * 2019-09-30 2020-01-17 北京猎户星空科技有限公司 Model training method, target recognition method, device, equipment and medium
CN110766725B (en) * 2019-10-31 2022-10-04 北京市商汤科技开发有限公司 Template image updating method and device, target tracking method and device, electronic equipment and medium
CN110766724B (en) * 2019-10-31 2023-01-24 北京市商汤科技开发有限公司 Target tracking network training and tracking method and device, electronic equipment and medium
CN111145215B (en) * 2019-12-25 2023-09-05 北京迈格威科技有限公司 Target tracking method and device
CN110930428B (en) * 2020-02-19 2020-08-14 成都纵横大鹏无人机科技有限公司 Target tracking method and device, electronic equipment and storage medium
CN111461010B (en) * 2020-04-01 2022-08-12 贵州电网有限责任公司 Power equipment identification efficiency optimization method based on template tracking
CN111640136B (en) * 2020-05-23 2022-02-25 西北工业大学 Depth target tracking method in complex environment
CN111815677A (en) * 2020-07-10 2020-10-23 中山大学新华学院 Target tracking method and device, terminal equipment and readable storage medium
CN112287906B (en) * 2020-12-18 2021-04-09 中汽创智科技有限公司 Template matching tracking method and system based on depth feature fusion
CN112818801B (en) * 2021-01-26 2024-04-26 每步科技(上海)有限公司 Motion counting method, recognition device, recognition system and storage medium
CN115100441B (en) * 2022-08-23 2022-11-18 浙江大华技术股份有限公司 Object detection method, electronic device, and storage medium
CN116596958B (en) * 2023-07-18 2023-10-10 四川迪晟新达类脑智能技术有限公司 Target tracking method and device based on online sample augmentation


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3576987B2 (en) * 2001-03-06 2004-10-13 株式会社東芝 Image template matching method and image processing apparatus

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6621929B1 (en) * 1999-06-22 2003-09-16 Siemens Corporate Research, Inc. Method for matching images using spatially-varying illumination change models
WO2007126780A2 (en) * 2006-03-28 2007-11-08 Object Video, Inc. Automatic extraction of secondary video streams
CN101867699A (en) * 2010-05-25 2010-10-20 中国科学技术大学 Real-time tracking method of nonspecific target based on partitioning
CN103150572A (en) * 2012-12-11 2013-06-12 中国科学院深圳先进技术研究院 On-line type visual tracking method
CN105719292A (en) * 2016-01-20 2016-06-29 华东师范大学 Method of realizing video target tracking by adopting two-layer cascading Boosting classification algorithm
CN105787963A (en) * 2016-02-26 2016-07-20 浪潮软件股份有限公司 Video target tracking method and device
CN107689052A (en) * 2017-07-11 2018-02-13 西安电子科技大学 Visual target tracking method based on multi-model fusion and structuring depth characteristic
CN108694723A (en) * 2018-05-11 2018-10-23 西安天和防务技术股份有限公司 A kind of target in complex environment tenacious tracking method
CN109345559A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Expand the motion target tracking method with depth sorting network based on sample

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Tracking with spatial constrained coding; Tian, Xiaolin et al.; IET Computer Vision; 2015-02-28; Vol. 9, No. 1; pp. 63-74 *

Also Published As

Publication number Publication date
CN110033473A (en) 2019-07-19

Similar Documents

Publication Publication Date Title
CN110033473B (en) Moving target tracking method based on template matching and depth classification network
CN110210463B (en) Precise ROI-fast R-CNN-based radar target image detection method
CN111259930B (en) General target detection method of self-adaptive attention guidance mechanism
Yang et al. Real-time face detection based on YOLO
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN108154118B (en) A kind of target detection system and method based on adaptive combined filter and multistage detection
CN110084292B (en) Target detection method based on DenseNet and multi-scale feature fusion
CN110119728A (en) Remote sensing images cloud detection method of optic based on Multiscale Fusion semantic segmentation network
CN107977683B (en) Joint SAR target recognition method based on convolution feature extraction and machine learning
CN111862119A (en) Semantic information extraction method based on Mask-RCNN
CN109377511B (en) Moving target tracking method based on sample combination and depth detection network
CN111833322B (en) Garbage multi-target detection method based on improved YOLOv3
CN112613350A (en) High-resolution optical remote sensing image airplane target detection method based on deep neural network
CN109345559B (en) Moving target tracking method based on sample expansion and depth classification network
CN110728694A (en) Long-term visual target tracking method based on continuous learning
CN110008899B (en) Method for extracting and classifying candidate targets of visible light remote sensing image
CN104599291B (en) Infrared motion target detection method based on structural similarity and significance analysis
CN110503090B (en) Character detection network training method based on limited attention model, character detection method and character detector
CN115393631A (en) Hyperspectral image classification method based on Bayesian layer graph convolution neural network
CN117315556A (en) Improved Vision Transformer insect fine grain identification method
CN114743045B (en) Small sample target detection method based on double-branch area suggestion network
CN113627245B (en) CRTS target detection method
CN113450321B (en) Single-stage target detection method based on edge detection
CN115223080A (en) Target segmentation method based on non-local feature aggregation neural network
CN111914751B (en) Image crowd density identification detection method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant