CN110458864A

CN110458864A - Based on the method for tracking target and target tracker for integrating semantic knowledge and example aspects

Info

Publication number: CN110458864A
Application number: CN201910590225.XA
Authority: CN
Inventors: 张索非; 冯烨
Original assignee: Nanjing Post and Telecommunication University
Current assignee: Nanjing Post and Telecommunication University
Priority date: 2019-07-02
Filing date: 2019-07-02
Publication date: 2019-11-15

Abstract

The present invention provides a kind of based on the method for tracking target and target tracker of integrating semantic knowledge and example aspects.Described method includes following steps: extracting the picture of the 1st, t-1, t frame；Cut in step 1 the 1st, the picture of t-1, t frame, take input of the picture after cutting as convolutional neural networks；The neural network model based on Darknet-19 is constructed, and carries out trickle amendment on its backbone network；The entire tracker convolutional neural networks of training；Finally, the model performance of assessment training.Regression problem and directly the prediction target location coordinate that is passed to frame are modeled as the present invention is based on proposing a kind of new network architecture model on Darknet-19, while by Target Tracking Problem.The model of present invention training is for specific object type, it is achieved that state-of-the-art performance and speed is very fast.

Description

Target tracking method and target tracker based on integrated semantic knowledge and instance features

Technical Field

The invention belongs to the technical field of image processing, and particularly relates to a target tracking method and a target tracker based on integrated semantic knowledge and example features.

Background

As an important component of a large number of computer vision systems, target tracking technology has attracted research interest to a number of researchers. In the last decade, deep learning based methods have shown great power in the field of target tracking. Typical deep network structures, such as Convolutional Neural Networks (CNNs), can extract representative visual features in end-to-end training. Unlike traditional representations of hand-made features, this description of image data can save rich knowledge into the model to track drastic changes in the target. Therefore, the best target trackers, such as Visual Object Tracking (VOT), object tracking fiducial (OTB), etc., are all methods based on deep learning.

Unlike object detection or recognition, current research in object tracking focuses primarily on example features of the object rather than semantic knowledge. However, the human eye acts as a high performance tracker that captures both low levels of visual features and high levels of semantic knowledge. When the human eye attempts to track a car, the features that are seen are always considered part of the average car. When detailed instance features (e.g., jitter, occlusion, or perspective change, etc.) are not present, these a priori knowledge plays a key role in challenging conditions.

When a series of targets including pedestrians, vehicles and the like are processed, although the positions of the targets can be directly predicted by adopting a Region Proposal Network (RPN) structure, the targets only realize regression of different anchor points without any semantic assumption.

In view of the above, there is a need to design a target tracking method based on integrating semantic knowledge and instance features to solve the above problems.

Disclosure of Invention

The invention provides a method for tracking a target based on integrated semantic knowledge and example characteristics, which aims to solve the problem that a general target tracker only focuses on example characteristics of the target and ignores semantic prior knowledge. The method provides a new network architecture model based on Darknet-19, models a target tracking problem as a regression problem and directly predicts the target position coordinates of an incoming frame.

To achieve the above object, the present invention provides a method comprising the steps of:

step 1: extracting pictures of the 1 st frame, the t-1 th frame and the t th frame;

step 2: cutting the pictures of the 1 st, t-1 st and t frames in the step 1, and taking the cut pictures as the input of the convolutional neural network;

and step 3: constructing a neural network model based on Darknet-19, and slightly modifying the backbone network of the neural network model;

and 4, step 4: training the whole tracker convolutional neural network;

and 5: the trained model performance is evaluated.

The invention further improves the method, and before the step 4, the method also comprises a step 3.1, wherein the step 3.1 is to design a network output, and the network output comprises a classification branch and a regression branch; before step 5, a step 4.1 is also included, wherein the step 4.1 is to design a network loss function.

In step 1, the 1 st frame of picture is selected as a standard template containing the target, and the t th frame of picture is selected as a candidate area where the target may appear.

A further refinement of the invention consists in that in step 2 a standard template comprising the object is extracted for initialization.

A further improvement of the present invention is that, assuming that the size of the real bounding box is (w, h), the input 1 st frame of picture is cropped by the size S around the center of the target to obtain an example image, which is always used as a standard template throughout the tracking process, when the margin information satisfies the following relationship:

s²＝(3w)*(3h) (1)。

a further improvement of the invention is that in step 3, based on the original structure of Darknet-19, three convolutional layers and two fully-connected layers are used instead of global pooling for classification and localization, respectively.

In a further development of the invention, in the t-th frame, the network outputs a fractional vector w from the fully-connected layer^t∈R^KAs a result of the classification of the target, K is the number of classes, this vector reflecting the likelihood of the corresponding object appearing in the line of sight; at the same time, the network outputsAs a deformation prediction for each class target, assume that the t-1 frame output bounding box is p^t-1＝(x^t-1,y^t-1,w^t-1,h^t-1) Where x, y are the center coordinates of the box, w and h are the width and height of the box, and the regression of the deformation for class kConsists of four coordinates:

wherein,representing the deformation of the target under different semantic assumptions. Last result p of t-th frame^tCan be calculated from:

the invention is further improved in that a cross-entropy loss function is adopted for the classification loss w, and an L1 loss function is adopted for the bounding box regression loss d:

wherein,being the true deformation of the second input to the third input, the L1 penalty is higher for slight errors in the predicted bounding box and the true bounding box. The trained model thus has a more stable bounding box.

A further development of the invention is that step 4 comprises a first stage: pre-training 10 epochs of a backbone network on an ImageNet classification dataset, and taking an original image as a first input and a standard random contrast enhanced image and color change as a second input and a third input; and a second stage: and training the whole target tracking network to obtain a training model.

In order to achieve the purpose of the invention, the invention also provides a target tracker for realizing the method.

The invention has the following beneficial effects: the invention integrates semantic knowledge and example characteristics to track the target, provides a new network architecture model based on Darknet-19, models the target tracking problem as a regression problem and directly predicts the target position coordinates of an incoming frame, and realizes high accuracy and execution efficiency of the tracker in daily tracking tasks.

Drawings

FIG. 1 shows the 1 st, t-1 st and t frames of the extracted picture.

FIG. 2 is a diagram of the three pictures in FIG. 1 being cropped to include a target.

Fig. 3 is a convolutional neural network model.

Fig. 4 shows two output branches of a convolutional neural network.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.

It should be emphasized that in describing the present invention, various formulas and constraints are identified with consistent labels, but the use of different labels to identify the same formula and/or constraint is not precluded and is provided for the purpose of more clearly illustrating the features of the present invention.

The CNN target tracking model provided by the invention is trained on a mixed data set (ImageNet VID and ALOV300+ +). The ImageNet VID contains 30 different classes of targets, and we pick the commonly used 8 classes including: airplanes, bicycles, birds, buses, cars, cats, horses, and motorcycles. Since there are no pedestrians in this dataset, the pedestrians were selected from the ALOV300+ +, and finally a mixed dataset containing 9 classes was formed.

As shown in fig. 1, the present invention first extracts three pictures of a video frame as input of a network, and then obtains three target pictures input to a CNN network by cropping. As shown in fig. 2, the three target images are subjected to a CNN convolutional neural network model (as shown in fig. 3) to extract features, and two branches are finally output as shown in fig. 4, one branch is a classification branch and is used for distinguishing the category of the target, and the other branch is a regression branch and is used for regression of the bounding box.

Table 1 shows detailed parameters of the CNN network structure designed by the present invention.

As shown in table 1, the present invention fine-tunes the Darknet-19 network model, uses three convolutional layers and two fully-connected layers instead of global pooling for classification and localization, respectively, and fine-tunes the above-mentioned hybrid video data set. The first and t-1 frames are extracted every 100 frames in a video sequence. For data enhancement, a transformation to the real bounding box is performed at the t-th frame using a Gaussian distribution. This model iterates over 50 times on a 4-block NVIDIA Tesla P40 GPU, with 800 batches (512 samples) per iteration.

Specifically, the method for tracking the target based on the integrated semantic knowledge and the example characteristics comprises the following steps:

step 1, extracting pictures of the 1 st, t-1 st and t frames as input:

the first picture selects the first frame as a standard template of the target, the second picture is selected from the t-1 th frame, and the last picture is selected from a candidate area in which the target may appear in the current frame.

Step 2: and (3) cutting the three pictures input in the step (1) to enable the three pictures to comprise targets:

in the first frame, a standard template of the extraction target is initialized. Assuming that the size of the real bounding box is (w, h), the picture is cropped around the center of the target by the size S, the square region provides an example image and the contextual edge distance information satisfies the following relationship:

s²＝(3w)*(3h) (1)

this example image is the first input to the CNN network and is 288 × 288 in size. This example image is used as a template throughout the tracking process. The hyperparameter 3 in equation (1) is retained from the video statistics in the VID dataset. This configuration contains the motion of almost all objects in adjacent frames while ensuring an acceptable resolution value after scaling.

Assume that the tracking result of the t-1 th frame is p^t-1In the t-1 th frame and the t-th frame with (x)^t-1,y^t-1) Cutting at the center to obtain a cut size of (3 w)^t-1,3h^t-1) The picture after cropping is also 288 × 288 and serves as both the second and third inputs to the CNN network. Note that the ratio of objects in the first frame is to be preserved because it encodes the features of the template object. Conversely, the target of the t-1 frame is expanded to a size of 96 pixels to facilitate the CNN network to learn the bounding box regression more efficiently by normalizing the deformation between frames.

And step 3: constructing a neural network model based on Darknet-19, and slightly modifying the backbone network of the neural network model:

to balance model capacity and efficiency, Darknet-19 was developed as a backbone network. Darknet-19 has been shown to enable high performance in relevant target detection tasks. The model consists of convolution filters of 3 x 3 and 1 x 1, connected between different scales using maximal pooling, doubling the number of channels per scale. The model performs very well in tasks such as object classification and localization and uses relatively few parameters. Based on the original structure of Darknet-19, the present invention uses three convolutional layers and two fully-connected layers instead of global pooling for classification and localization, respectively. Table 1 lists the detailed network architecture.

And 4, designing network output, including classification and regression branches:

in the t-th frame, the network outputs a fractional vector w from the fully-connected layer^t∈R^KK is the number of classes as a result of the classification of the target, this vector reflecting the likelihood that the corresponding object appears in the line of sight. At the same time, the network outputsAs a deformation prediction for each class target, assume that the t-1 frame output bounding box is p^t-1＝(x^t-1,y^t-1,w^t-1,h^t-1) Where x, y are the center coordinates of the box, w and h are the width and height of the box, and the regression of the deformation for class kConsists of four coordinates:

representing the deformation of the target under different semantic assumptions. Last result p of t-th frame^tCan be calculated from:

and 5: designing a network loss function:

the cross entropy loss function is used for the classification loss w, and the L1 loss function is used for the bounding box regression loss d:

Step 6: training the convolutional neural network model of the tracker:

the first stage is as follows: the backbone network was pre-trained on ImageNet classification datasets for 10 epochs, using the original image as the first input, and standard random contrast enhanced images and color variations as the second and third inputs. The network achieves 72.5% top-1 accuracy and 91.0% top-2 accuracy in ImageNet.

And a second stage: and training the whole target tracking network to obtain a training model.

And 7: evaluating the performance of the training model:

the trained model is evaluated on a sub data set of the VOT 2016, which has 15 video sequences.

The invention integrates semantic knowledge and example characteristics to track the target, provides a new network architecture model based on Darknet-19, models the target tracking problem as a regression problem and directly predicts the target position coordinates of an incoming frame, and realizes high accuracy and execution efficiency of the tracker in daily tracking tasks.

Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the present invention.

Claims

1. A target tracking method based on integrated semantic knowledge and instance features is characterized by comprising the following steps:

and 4, step 4: training the whole tracker convolutional neural network;

and 5: the trained model performance is evaluated.

2. The integrated semantic knowledge and instance feature based target tracking method of claim 1, characterized in that: before step 4, a step 3.1 is also included, wherein the step 3.1 is to design a network output, and the network output comprises a classification branch and a regression branch; before step 5, a step 4.1 is also included, wherein the step 4.1 is to design a network loss function.

3. The integrated semantic knowledge and instance feature based target tracking method of claim 1, characterized in that: in step 1, the 1 st frame of picture is selected as a standard template containing a target, and the t-th frame of picture is selected as a candidate area where the target may appear.

4. The integrated semantic knowledge and instance feature based target tracking method of claim 3, wherein: in step 2, a standard template including the target is extracted for initialization.

5. The integrated semantic knowledge and instance feature based target tracking method of claim 4, wherein: assuming that the size of the real bounding box is (w, h), the input 1 st frame of picture is cropped by the size S around the center of the target to obtain an example image, which is always used as a standard template in the whole tracking process, and the edge distance information satisfies the following relation:

s²＝(3w)*(3h) (1)。

6. the integrated semantic knowledge and instance feature based target tracking method of claim 1, characterized in that: in step 3, three convolutional layers and two fully-connected layers are used instead of global pooling for classification and localization, respectively, based on the original structure of Darknet-19.

7. The integrated semantic knowledge and instance feature based target tracking method of claim 2, wherein: in the t-th frame, the network outputs a fractional vector w from the fully-connected layer^t∈R^KTo serve as the purposeThe target classification result, K is the number of classes, this vector reflects the probability that the corresponding object appears in the line of sight; at the same time, the network outputsAs a deformation prediction for each class target, assume that the t-1 frame output bounding box is p^t-1＝(x^t-1,y^t-1,w^t-1,h^t-1) Where x, y are the center coordinates of the box, w and h are the width and height of the box, and the regression of the deformation for class kConsists of four coordinates:

8. the integrated semantic knowledge and instance feature based target tracking method of claim 7, wherein: the cross entropy loss function is used for the classification loss w, and the L1 loss function is used for the bounding box regression loss d:

wherein,is the second input toThe true deformation of the three inputs, the L1 loss, penalizes the predicted bounding box and the true bounding box slightly more. The trained model thus has a more stable bounding box.

9. The integrated semantic knowledge and instance feature based object tracking method of claim 8, wherein: step 4 comprises two stages: the first stage is to pre-train 10 epochs of the backbone network on the ImageNet classification dataset, and adopt the original image as the first input and adopt the standard random contrast enhanced image and color variation as the second and third inputs; the second stage is to train the whole target tracking network to obtain a training model.

10. An object tracker implementing the method of any one of claims 1-9.