AU2019101133A4 - Fast vehicle detection using augmented dataset based on RetinaNet - Google Patents

Fast vehicle detection using augmented dataset based on RetinaNet

Info

Publication number
AU2019101133A4
Authority
AU
Australia
Prior art keywords
retinanet
detection
layer
network
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2019101133A
Inventor
Yaxin Bo
Ziwei Liu
Buwei WU
Tianjian Yang
Fanghong Zhu
Huayang ZhuGe
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bo Yaxin Miss
Zhu Fanghong Miss
Original Assignee
Bo Yaxin Miss
Zhu Fanghong Miss
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bo Yaxin Miss, Zhu Fanghong Miss filed Critical Bo Yaxin Miss
Priority to AU2019101133A priority Critical patent/AU2019101133A4/en
Application granted granted Critical
Publication of AU2019101133A4 publication Critical patent/AU2019101133A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/30Noise filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/32Normalisation of the pattern dimensions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08Detecting or categorising vehicles

Abstract

Abstract This invention lies in the field of computer vision and artificial intelligence. It is a video-based detection system for various kinds of objects, especially cars, built on RetinaNet. The invention consists of the following steps. Initially, we employ datasets to train convolutional networks. Then, we feed the training data into the convolutional neural network in batches and continually adjust the parameters of the network, such as the base learning rate, weights, padding, stride, and input, using back-propagation to drive the model toward optimal performance. Finally, the test data are put into the trained neural network and different kinds of objects are recognized accurately. (Figure 3 shows the overall procedure: picture acquisition from the VOC dataset, initialization of the neural network, model training based on ResNet and MobileNet, parameter adjustment until the requirement is reached, and prediction on the testing set.)

Description

TITLE
Fast vehicle detection using augmented dataset based on RetinaNet
FIELD OF THE INVENTION
This invention is in the field of computer vision and artificial intelligence and provides vehicle detection using an augmented dataset based on RetinaNet, powered by deep learning.
BACKGROUND
With the development of intelligent monitoring and transportation systems, vehicle target detection has become an important component of intelligent transportation, and it is widely used in fields such as vehicle identity information fusion, vehicle inspection, detection of illegal vehicle behavior, and vehicle tracking. Before the era of deep learning, computer vision research usually relied on traditional target detection models to accomplish this task. For example, the traditional HOG algorithm can capture local shape information well and has good invariance to geometric and photometric changes. However, it has difficulty dealing with occlusion, large variations in posture, and changes in object orientation.
With the development of deep learning theory and applications, it was found that the Convolutional Neural Network (CNN) can autonomously extract image features and greatly reduce losses due to angle, illumination, deformation, and other factors, making it more adaptable to complex scenes. R-CNN selects many candidate boxes by the selective search method, then performs a convolutional network operation on each candidate box separately, extracts features, and finally feeds the convolved features into an SVM classifier and a bounding-box regressor. However, because it trains different components in different stages, the testing process is cumbersome, slow, and requires a lot of memory.
Therefore, this invention chooses the RetinaNet model as the basic structure, which can detect the types of objects photographed in most environments. In the training phase, we use the Pascal VOC 2007 and COCO datasets, carry out data augmentation, and adjust parameters reasonably to accomplish target detection in complex scenes.
The RetinaNet model is designed to take advantage of an efficient feature pyramid network and uses anchor boxes to overcome the class imbalance problem of the original one-stage detectors. At the same time, the cross-entropy loss used in the original training task is replaced by the focal loss, so its detection accuracy becomes higher and target objects can be quickly identified.
SUMMARY
To address the situation that current technology cannot deliver both precision and speed when detecting objects (the extreme foreground-background class imbalance encountered when training dense detectors), and to deal with problems in robot perception and avoid the errors that arise as a network's ability to describe pictures strengthens with ever more convolutional layers, we propose an invention that provides an object detection method based on deep learning. We conduct experiments with two types of networks, ResNet and MobileNet, giving full play to the superiority of both; the networks extract the image's local semantic features to produce a precise description of image features. Considering the advantages of both models, this invention significantly improves the training process and overcomes obstacles such as overfitting. Not only do we apply the new networks, but we also compare the two models and analyze their performance.
The framework of our deep learning object detection method for vehicles comprises: collecting images of people and automobiles, training convolutional networks, optimizing parameters, and testing object detection.
To build the image database for our detection, we collect image data from the Pascal VOC dataset (the dataset consists of a series of images; each image has a corresponding annotation file that provides the bounding box and class label of each object). We also delete useless and unrelated image data to keep the quantity of image data balanced.
Our convolutional neural network is a sequence of layers. Figure 1 displays the architecture of our network, which has 5 convolutional layers followed by one fully connected layer.
The input layer performs preprocessing operations on the data, such as subtracting the average value, normalizing, and reducing dimensions.
The convolutional layer perceives each local feature of the image by computing the output of neurons that are connected to local regions in the input.
The activation layer, a ReLU layer, applies a nonlinear mapping to the output of the convolutional layers.
The pooling layer performs a down-sampling operation along the spatial dimensions (width, height). Max pooling compresses the data and the number of parameters. Meanwhile, max pooling also controls overfitting and efficiently increases the fault tolerance of the model.
The fully connected layer connects every node in it with all the nodes in the previous layer to gather all the features extracted by the previous layers. As a result, the activation can be computed with a matrix multiplication.
The softmax layer is used in the process of multi-classification, mapping the outputs of multiple neurons to the interval (0, 1); these values can be understood as the class probabilities used to conduct multi-classification.
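For illustration, a minimal NumPy sketch of this mapping (the max-subtraction is a standard numerical-stability trick, not part of the original description):

```python
import numpy as np

def softmax(y):
    """Map raw scores to probabilities in (0, 1) that sum to 1."""
    e = np.exp(y - np.max(y))   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # [0.659 0.242 0.099]
```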
To optimize the parameters, we first feed the dataset into the network in batches for training to reduce the loss function. Then we optimize the model by introducing three nets: the Feature Pyramid Net, the Classification Subnet, and the Box Regression Subnet. With this addition, we strengthen the use of the features generated by ResNet to obtain a more expressive feature map that contains multi-scale target region information. Additionally, we adopt gradient descent as the optimization algorithm and the focal loss to counter class imbalance.
Lastly, the trained classifier and locator are capable of identifying images; the results are presented through locating and classifying.
DESCRIPTION OF DRAWINGS
Figure 1 is the Feature Pyramid Network.
Figure 2 shows Layer1 to Layer5 of ResNet 50.
Figure 3 shows the procedure of the project.
Figure 4 - Figure 7 show the results of training.
DESCRIPTION OF PREFERRED EMBODIMENT
Network design
Table 1 shows the structure of our convolutional neural network. Our network architecture is inspired by the RetinaNet model. The network is composed of a backbone network (a feature pyramid network (FPN) based on ResNet) and sub-networks (a Classification Subnet and a Box Regression Subnet).
There are some parameters related to the calculation of the convolution layer:
1) Input: the input image that needs to be convolved
2) Filter: the convolution kernel in a CNN; in this invention we mainly use 3×3 and 1×1 convolution kernels.
3) Stride: step size of window sliding during convolution
4) Zero-padding: zero-padding has two modes. “Valid” means no padding. “Same” means the output image is the same size as the input image (for stride 1). In this program we use the “Same” mode, as sketched below.
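As a sketch of how kernel size, stride, and padding determine the output size (the helper below is illustrative, following the usual convolution arithmetic rather than any code in the original):

```python
import math

def conv_output_size(n, k, s, padding):
    """Spatial output size for input n, kernel k, stride s."""
    if padding == "same":     # pad so the output is ceil(n / s)
        return math.ceil(n / s)
    if padding == "valid":    # no padding at all
        return math.floor((n - k) / s) + 1
    raise ValueError(padding)

print(conv_output_size(224, 3, 1, "same"))   # 224: size preserved
print(conv_output_size(224, 3, 1, "valid"))  # 222: shrinks by k - 1
```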
Table 1: specific structure of ResNet at different depths

Layer name | Output size | 18-layer | 34-layer | 50-layer | 101-layer | 152-layer
Conv1 | 112×112 | 7×7, 64, stride 2 (all depths)
Conv2_x | 56×56 | 3×3 max pool, stride 2 (all depths), then: [3×3, 64; 3×3, 64] ×2 | [3×3, 64; 3×3, 64] ×3 | [1×1, 64; 3×3, 64; 1×1, 256] ×3 | ×3 | ×3
Conv3_x | 28×28 | [3×3, 128; 3×3, 128] ×2 | [3×3, 128; 3×3, 128] ×4 | [1×1, 128; 3×3, 128; 1×1, 512] ×4 | ×4 | ×8
Conv4_x | 14×14 | [3×3, 256; 3×3, 256] ×2 | [3×3, 256; 3×3, 256] ×6 | [1×1, 256; 3×3, 256; 1×1, 1024] ×6 | ×23 | ×36
Conv5_x | 7×7 | [3×3, 512; 3×3, 512] ×2 | [3×3, 512; 3×3, 512] ×3 | [1×1, 512; 3×3, 512; 1×1, 2048] ×3 | ×3 | ×3
(output) | 1×1 | average pool, 1000-d fully connected, softmax (all depths)

(In the 101- and 152-layer columns, “×N” repeats the same bottleneck block shown in the 50-layer column.)
1. Backbone Net
In this invention, we use ResNet, one of the most widely used CNN feature-extraction networks, as the backbone net of the model. Based on the bottleneck block, ResNet 50, 101, and 152 are constructed in the same way. A mainstream neural network is composed of an input layer, hidden layers, and an output layer. As Table 1 suggests, each layer of the network (like Conv2_x, Conv3_x, etc.) is composed of several blocks, and each block is composed of 2 or 3 sub-layers. “×2”, “×3”, etc. refer to the number of blocks a layer contains, and “[3×3, 64]” means the sub-layer has 64 convolutional kernels of size 3×3.
Here we take ResNet 50 as an example: it is constructed of five layers, fifty sub-layers in all. Due to its size, we only illustrate the input layer, the first hidden layer, and the output layer; the structure of the other hidden layers is similar to the first.
(1) Convolutional Layer
Firstly, the input layer:
The input data of the input layer is the original [224×224×1] image, which is convolved with [7×7×1] convolution kernels; each convolution of the original image generates a new pixel. The convolution kernel moves in both the x-axis and y-axis directions of the original image with a step size of 2 pixels, so the convolution of the original image generates [112×112] pixel layers. There are 64 convolution kernels, so the depth is 64. We choose ReLU as the nonlinear activation function in the convolutional layers. The ReLU function is

ReLU(x) = max(0, x) = { x, x ≥ 0; 0, x < 0 }    (2)

ReLU can alleviate the vanishing gradient problem and reduce training time, which greatly speeds up the convergence of the model. The convolved pixel layers are processed by the ReLU unit, and the size of the data is still [64×112×112].
Then we use a max pooling layer with [3×3] filters applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding about 75% of the activations; each operation takes the max over the 9 numbers in a 3×3 region. The depth dimension remains unchanged. Thus the input volume of size [64×112×112] is pooled into an output volume of size [64×56×56].
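The stem arithmetic above can be checked with the standard output-size formula; the padding values in this sketch follow the common ResNet implementation, which the text does not spell out:

```python
def out_size(n, k, s, p):
    """floor((n + 2p - k) / s) + 1 for one spatial dimension."""
    return (n + 2 * p - k) // s + 1

h = out_size(224, k=7, s=2, p=3)  # 7x7 conv, stride 2, padding 3 -> 112
h = out_size(h, k=3, s=2, p=1)    # 3x3 max pool, stride 2, padding 1 -> 56
print(h)  # 56
```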
Secondly, the hidden layers:
The first hidden layer, Conv2_x, has 3 blocks, 9 sub-layers in all, each block being

[1×1, 64; 3×3, 64; 1×1, 256] ×3

The input pixels of 64×56×56 are convolved by 64 convolution kernels of 1×1 and processed by the ReLU function, then by 64 kernels of 3×3 and 256 kernels of 1×1, each followed by ReLU. This process is repeated three times. After it, the network produces 256×56×56 output pixels.
Since the 1×1 convolution kernel has a size of only 1×1, the relationship between a pixel and its surrounding pixels does not need to be considered. It is mainly used to adjust the number of channels: pixel values on different channels are combined linearly and then passed through a nonlinear operation, which realizes the ascending and descending of the channel dimension. These pixel layers are then processed by the ReLU unit, and the size is still [256×56×56].
The other layers, Conv3_x, Conv4_x, and Conv5_x, have a structure similar to Conv2_x, but the first block of each layer has a 3×3 convolution kernel with stride 2, which decreases the number of pixels by 75%; this differs from the other blocks.
The difference between the residual network and an ordinary network is the introduction of the jump (skip) connection, which enables the information of the previous residual block to flow into the next residual block without obstruction, improves the information flow, and avoids the vanishing gradient and degradation problems caused by overly deep networks. Instead of directly fitting the expected feature mapping with multiple stacked layers, we explicitly use them to fit a residual mapping.
(Figure: the residual block.)
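A minimal PyTorch sketch of the bottleneck residual block described above (the patent gives no code, so details such as batch normalization placement follow the standard ResNet design and are assumptions here):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 -> 3x3 -> 1x1 bottleneck with a jump (skip) connection."""
    def __init__(self, in_ch=64, mid_ch=64, out_ch=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),              # 1x1 reduces channels
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False),  # 3x3 convolution
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),             # 1x1 restores channels
            nn.BatchNorm2d(out_ch))
        # project the shortcut when channel counts differ
        self.skip = (nn.Conv2d(in_ch, out_ch, 1, bias=False)
                     if in_ch != out_ch else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # the skip path lets information flow past the stacked layers, so the
        # body only needs to fit the residual F(x) = H(x) - x
        return self.relu(self.body(x) + self.skip(x))

x = torch.randn(1, 64, 56, 56)
print(Bottleneck()(x).shape)  # torch.Size([1, 256, 56, 56])
```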
(2) Output Layer
After finishing convolution, the model uses the fully connected layer to reshape the image matrix. The output data of the last hidden layer is fed into the fully connected layer, and the data is reshaped from [2048×7×7] to [1000×1×1].
The fully connected layer has 2048 nodes, each of which has full connections to all activations in the input.
The final output is the high-level feature of the input image, which corresponds to the probability of each label for the input image through the Softmax function:

Softmax(y_i) = exp(y_i) / Σ_j exp(y_j)    (3)

where y_i means the value of the i-th element. The Softmax classification model is used as the last layer, after the fully connected layer, and outputs the probability of each category of objects, valued between 0 and 1.
The fully connected layer is equivalent to an inner product between neural nodes and mainly involves forward propagation and back propagation. Forward propagation corresponds to formula (4), which calculates the output value of a neural node; back propagation corresponds to formula (5), which calculates the error term of each neural node:

y = wᵀx + b    (4)

∂L/∂x = w (∂L/∂y),  ∂L/∂w = x (∂L/∂y)ᵀ    (5)

Here y ∈ R^{m×1} represents the output of the neural nodes, x ∈ R^{n×1} represents the input, w ∈ R^{n×m} represents the weights, b represents the bias, and L denotes the loss at this layer of neural nodes.
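A NumPy sketch of formulas (4) and (5) for a single fully connected layer, with the shapes defined above (the random values are purely illustrative):

```python
import numpy as np

n, m = 4, 3
x = np.random.randn(n, 1)       # input,  x in R^{n x 1}
w = np.random.randn(n, m)       # weight, w in R^{n x m}
b = np.random.randn(m, 1)       # bias

y = w.T @ x + b                 # forward pass, formula (4)

dL_dy = np.random.randn(m, 1)   # error term arriving from the next layer
dL_dx = w @ dL_dy               # error propagated to the input, formula (5)
dL_dw = x @ dL_dy.T             # gradient of the weights,       formula (5)
print(y.shape, dL_dx.shape, dL_dw.shape)  # (3, 1) (4, 1) (4, 3)
```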
2. Feature Pyramid Net (FPN)
Figure 1 shows the structure of the FPN. The FPN naturally utilizes the hierarchical features of a CNN to generate a pyramid with strong semantic information at all scales. Low-level features carry little semantic information but accurate target locations, while high-level features carry rich semantic information but rough target locations. The FPN therefore integrates feature maps of different layers through a bottom-up pathway, a top-down pathway, and lateral connections, making it easy to identify small targets.
In this way, starting from a single-scale input image, a feature pyramid with strong semantic information at all scales is constructed rapidly and without significant cost.
1. Bottom-up pathway.
The feed-forward calculation of the CNN is the bottom-up pathway. After each convolution the feature map usually becomes smaller and smaller; the feature layers whose outputs share the same size are said to belong to the same network stage.
2. Top-down pathway and lateral connections.
To combine with the high-resolution low-level features, the more abstract, semantically stronger high-level feature map is upsampled and then merged, through a lateral connection, with the feature map of the previous stage, so that the high-level features are enhanced. The two feature maps joined by a lateral connection have the same spatial dimensions, which allows the underlying location details to be exploited.
The FPN (Feature Pyramid Network) algorithm uses the high resolution of low-level features and the rich semantic information of high-level features at the same time, and achieves its prediction effect by fusing the features of these different layers. Moreover, prediction is made separately on each feature layer after fusion, which differs from the conventional feature-fusion approach.
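A condensed PyTorch sketch of the top-down pathway with lateral connections (the channel counts and nearest-neighbour upsampling are assumptions in line with the FPN paper, not details stated here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniFPN(nn.Module):
    """Fuse backbone maps C3-C5 into pyramid maps P3-P5, 256 channels each."""
    def __init__(self, in_channels=(512, 1024, 2048), out_ch=256):
        super().__init__()
        # 1x1 lateral convs bring every level to a common channel count
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_channels])
        # 3x3 convs smooth each merged map
        self.smooth = nn.ModuleList([nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):  # feats = [C3, C4, C5], fine to coarse
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        # top-down pathway: upsample the coarser map, add it to the level below
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [sm(l) for sm, l in zip(self.smooth, laterals)]

feats = [torch.randn(1, 512, 28, 28), torch.randn(1, 1024, 14, 14),
         torch.randn(1, 2048, 7, 7)]
for p in MiniFPN()(feats):
    print(p.shape)  # 256-channel maps at 28x28, 14x14, and 7x7
```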
3. Subnet
The main network part of RetinaNet uses the FPN structure, with two sub-networks for different tasks: one is the class subnet and the other is the box subnet.
The parameters of the classification sub-network and the regression sub-network are separate, but their structures are similar. Both are small FCN networks that take the pyramid levels as input and then apply four 3×3 convolutional layers whose filter count equals the number of channels in the pyramid layer (256 in the paper), with a ReLU activation after each convolutional layer. This is followed by a 3×3 convolutional layer with K×A filters (K is the number of target classes, A is the number of anchors per location), and the activation function is sigmoid.
The reason for using binary classifications is that the implementation of the loss layer combines the sigmoid operation for computing “p” with the loss computation, resulting in greater numerical stability.
RetinaNet uses a special initialization for the final layer of the classification subnet, so that the training output starts close to π = 0.01, which is closer to the real situation of an overwhelming background. The authors demonstrate that this initialization strategy is important here and in later experiments. Such an initialization strategy relies on multiple sigmoid classifiers: if softmax were used, it would be impossible to make the output of all categories equal π = 0.01 for an anchor.
RetinaNet separates the classification subnet from the bounding-box regression subnet, and the initialization method described above is applied largely to the classification part.
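A sketch of the classification subnet head as described: four 3×3, 256-channel convolutions with ReLU, then a 3×3 convolution with K·A filters under a sigmoid. The bias initialization to the prior π = 0.01 follows the RetinaNet paper; the class and anchor counts below are illustrative defaults:

```python
import math
import torch
import torch.nn as nn

def class_subnet(num_classes=20, num_anchors=9, ch=256, prior=0.01):
    """Classification head applied to every pyramid level (weights shared)."""
    layers = []
    for _ in range(4):                    # four 3x3, 256-filter convs + ReLU
        layers += [nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True)]
    out = nn.Conv2d(ch, num_classes * num_anchors, 3, padding=1)
    # bias chosen so every sigmoid output starts near pi = 0.01,
    # matching the overwhelming-background prior described above
    nn.init.constant_(out.bias, -math.log((1 - prior) / prior))
    return nn.Sequential(*layers, out)

head = class_subnet()
p3 = torch.randn(1, 256, 28, 28)
scores = torch.sigmoid(head(p3))          # per-anchor, per-class scores
print(scores.shape)  # torch.Size([1, 180, 28, 28]), 180 = K * A
```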
Procedure
Step 1: Data Acquisition
In the data collection process of this project, we use the existing VOC dataset, related pictures collected from the Internet, and manual photographs. After collecting the data, we filtered the collected images, eliminated noise and images that did not match the types in the project, and resized the remaining images to [224×224] pixels. The aim of this project is to recognize animals (birds, cats, cows, dogs, horses, sheep), vehicles (aircraft, bicycles, boats, buses, cars, motorcycles, trains), and indoor items (bottles, chairs, dining tables, potted plants, sofas, TVs). Each kind of picture needs almost 5000 images. When we cannot collect sufficient data, we rotate images to generate new data and complete the acquisition, as sketched below. In addition, each type of picture requires 4000 images to maintain the balance of the image data.
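A sketch of the rotation-based augmentation mentioned above (PIL-based; the directory layout and angle set are assumptions for illustration):

```python
from pathlib import Path
from PIL import Image

def augment_by_rotation(src_dir, dst_dir, angles=(90, 180, 270)):
    """Save rotated copies of every image to enlarge a scarce class."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*.jpg"):
        img = Image.open(path)
        for angle in angles:
            rotated = img.rotate(angle, expand=True)
            rotated.save(dst / f"{path.stem}_rot{angle}.jpg")

# augment_by_rotation("data/cars", "data/cars_augmented")
```

Note that any bounding-box annotations would need to be rotated consistently with the images.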
Step 2: Data Preprocessing
(1) Convert the image data format from ‘jpg’ to ‘mat’.
(2) Transform the form of the data: to facilitate later processing, the corresponding data of the original pictures must be obtained in a way that preserves the sequence of the pictures.
(3) Dimension reduction: the three dimensions are merged into one dimension to achieve normalization.
(4) Divide the image data: we split the images, using four-fifths of them for training; a sketch follows below.
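A sketch of steps (1)-(4), assuming the four-fifths portion is used for training (the exact split procedure is not spelled out above):

```python
import numpy as np
from PIL import Image

def preprocess(paths, size=(224, 224)):
    """Resize, normalize, flatten, then split 4/5 train and 1/5 test."""
    data = []
    for p in paths:
        img = Image.open(p).resize(size)
        arr = np.asarray(img, dtype=np.float32) / 255.0  # normalize
        arr -= arr.mean()              # remove the average value
        data.append(arr.reshape(-1))   # merge all dimensions into one
    data = np.stack(data)
    split = int(len(data) * 4 / 5)     # four-fifths for training
    return data[:split], data[split:]
```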
Step 3: Training and Optimization
(1) Gradient Descent
In deep learning, the entire neural network can be viewed as a complex nonlinear function that serves as a fitted model of the training samples. Using gradient descent is the process of finding the minimum of the objective function. Gradient descent comes in three forms, depending on how much data each update uses: batch gradient descent (BGD), mini-batch gradient descent (MBGD), and stochastic gradient descent (SGD). This project uses the SGD algorithm for regression. The algorithm also supports linear classifiers under convex loss functions, such as support vector machines and logistic regression. The algorithm logic is as follows:
For a single sample (x, y) with prediction h_θ(x), the objective function is

J(θ) = (1/2) (h_θ(x) − y)²

Taking the partial derivative of the objective function with respect to θ_j gives

∂J/∂θ_j = (h_θ(x) − y) x_j

and the parameter update is

θ_j ← θ_j − η (h_θ(x) − y) x_j

where η is the learning rate.
This optimization algorithm works by making a prediction each time it sees a training instance and repeating the iteration several times. This process can be used to find the coefficients of the model that produce the smallest error on the training data. Since each iteration optimizes the loss on a single randomly chosen training sample rather than on all training data, the update speed of each round of parameters is greatly accelerated. The algorithm has been successfully applied to large-scale and sparse machine learning problems, which are often encountered in text classification and natural language processing.
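A NumPy sketch of the per-sample update loop just described, using a linear model with squared error for illustration:

```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=50):
    """Stochastic gradient descent: one update per training sample."""
    rng = np.random.default_rng(0)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):   # random sample order each epoch
            err = X[i] @ theta - y[i]       # error on this single sample
            theta -= lr * err * X[i]        # step against the gradient
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # bias column + feature
print(sgd(X, np.array([2.0, 3.0, 4.0])))  # converges toward [1. 1.]
```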
(2) Focal Loss function
The focal loss is modified from the standard cross-entropy loss. It makes the model focus more on hard-to-classify samples during training by down-weighting easily classified samples.
Object detection algorithms can be divided into two main categories: two-stage detectors and one-stage detectors. The former refers to detection algorithms that require a region proposal stage, like Faster RCNN and RFCN; such algorithms achieve high accuracy but run at a slower speed. The latter refers to detection algorithms similar to YOLO and SSD that do not require region proposals and regress directly; such algorithms are fast, but their accuracy is not as good as the former. The accuracy of one-stage detectors is lower than that of two-stage detectors mainly because of the imbalance of sample classes. Therefore, for the class imbalance problem, the research scientist Kaiming He proposed a new loss function: focal loss.
FL(p_t) = −α_t (1 − p_t)^γ log(p_t)

(In the experiments, the workable range of α is very wide; generally, when γ is increased, α needs to be reduced a little.)
Focal loss has two important properties:
1. When a sample is misclassified and p_t is small, the modulation factor (1 − p_t)^γ tends to 1, meaning the loss barely changes compared with the original cross entropy. When p_t tends to 1 (the classification is correct and the sample is easy to classify), the modulation factor tends to 0, so the sample's contribution to the total loss is small.
2. When γ = 0, focal loss reduces to the traditional cross-entropy loss. When γ is increased, the effect of the modulation factor also increases.
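A NumPy sketch of the focal loss for binary classification (α = 0.25 and γ = 2 are the defaults reported for RetinaNet; they are not specified above):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), binary case."""
    p_t = np.where(y == 1, p, 1 - p)         # probability of the true class
    a_t = np.where(y == 1, alpha, 1 - alpha)
    return -a_t * (1 - p_t) ** gamma * np.log(p_t + eps)

# an easy, well-classified example is strongly down-weighted ...
print(focal_loss(np.array([0.95]), np.array([1])))  # ~3e-05
# ... while a hard, misclassified one keeps most of its loss
print(focal_loss(np.array([0.05]), np.array([1])))  # ~0.68
```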
The results of training can be seen in Figure 4 - Figure 7.
Step 4: Testing
We adjust the parameters of the network constantly in order to reach optimal performance. Then we put the test set into the network and obtain the mAP as the result.
Besides, some parameters are fixed; the test batch size is 4391. The table below shows the results. We reach an optimal mAP of 0.654.
Threshold \ Epoch | 01 | 05 | 09 | 13 | 17
0.4 | 0.5282 | 0.5722 | 0.5765 | 0.5861 | 0.5871
0.5 | 0.4776 | 0.5396 | 0.5510 | 0.5668 | 0.5684
0.6 | 0.4128 | 0.4979 | 0.5208 | 0.5420 | 0.5468
CLAIMS

Claims (3)

1. A fast vehicle detection method using an augmented dataset based on RetinaNet, wherein, in the training stage, augmented datasets are used for the deep learning: we use a large number of sample patterns and carry out reasonable adjustments of parameters; consequently, the results can be highly accurate.
2. The fast vehicle detection method using an augmented dataset based on RetinaNet of claim 1, wherein a class of efficient models called MobileNets is introduced, which applies depthwise separable convolutions that not only reduce the computational complexity of the model but also greatly reduce its size.
3. The fast vehicle detection method using an augmented dataset based on RetinaNet of claim 1, wherein the state-of-the-art FPN-based one-stage detector RetinaNet, with the focal loss function, is implemented; the model ensures detection speed and improves detection accuracy in the case of class imbalance, which is also conducive to small-target detection; the model can be used in areas such as vehicle-mounted detection and spam detection.
AU2019101133A 2019-09-30 2019-09-30 Fast vehicle detection using augmented dataset based on RetinaNet Ceased AU2019101133A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2019101133A AU2019101133A4 (en) 2019-09-30 2019-09-30 Fast vehicle detection using augmented dataset based on RetinaNet

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2019101133A AU2019101133A4 (en) 2019-09-30 2019-09-30 Fast vehicle detection using augmented dataset based on RetinaNet

Publications (1)

Publication Number Publication Date
AU2019101133A4 (en) 2019-10-31

Family

ID=68342021

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2019101133A Ceased AU2019101133A4 (en) 2019-09-30 2019-09-30 Fast vehicle detection using augmented dataset based on RetinaNet

Country Status (1)

Country Link
AU (1) AU2019101133A4 (en)


Cited By (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079543A (en) * 2019-11-20 2020-04-28 浙江工业大学 Efficient vehicle color identification method based on deep learning
CN111079543B (en) * 2019-11-20 2022-02-15 浙江工业大学 Efficient vehicle color identification method based on deep learning
CN110986949A (en) * 2019-12-04 2020-04-10 日照职业技术学院 Path identification method based on artificial intelligence platform
CN110988839B (en) * 2019-12-25 2023-10-10 中南大学 Quick identification method for wall health condition based on one-dimensional convolutional neural network
CN110988839A (en) * 2019-12-25 2020-04-10 中南大学 Method for quickly identifying health condition of wall based on one-dimensional convolutional neural network
CN110988872A (en) * 2019-12-25 2020-04-10 中南大学 Method for rapidly identifying health state of wall body detected by unmanned aerial vehicle-mounted through-wall radar
CN110988872B (en) * 2019-12-25 2023-10-03 中南大学 Rapid identification method for detecting wall health state by unmanned aerial vehicle through-wall radar
CN111242122A (en) * 2020-01-07 2020-06-05 浙江大学 Lightweight deep neural network rotating target detection method and system
CN111242122B (en) * 2020-01-07 2023-09-08 浙江大学 Lightweight deep neural network rotating target detection method and system
CN111415338A (en) * 2020-03-16 2020-07-14 城云科技(中国)有限公司 Method and system for constructing target detection model
CN113536829A (en) * 2020-04-13 2021-10-22 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Goods static identification method of unmanned retail container
CN113536824A (en) * 2020-04-13 2021-10-22 南京行者易智能交通科技有限公司 Improvement method of passenger detection model based on YOLOv3 and model training method
CN113536824B (en) * 2020-04-13 2024-01-12 南京行者易智能交通科技有限公司 Improved method of passenger detection model based on YOLOv3 and model training method
CN111738056B (en) * 2020-04-27 2023-11-03 浙江万里学院 Heavy truck blind area target detection method based on improved YOLO v3
CN111738056A (en) * 2020-04-27 2020-10-02 浙江万里学院 Heavy truck blind area target detection method based on improved YOLO v3
CN111612722B (en) * 2020-05-26 2023-04-18 星际(重庆)智能装备技术研究院有限公司 Low-illumination image processing method based on simplified Unet full-convolution neural network
CN111612722A (en) * 2020-05-26 2020-09-01 星际(重庆)智能装备技术研究院有限公司 Low-illumination image processing method based on simplified Unet full-convolution neural network
CN111814604A (en) * 2020-06-23 2020-10-23 浙江理工大学 Pedestrian tracking method based on twin neural network
CN111814863A (en) * 2020-07-03 2020-10-23 南京信息工程大学 Detection method for light-weight vehicles and pedestrians
CN112528862A (en) * 2020-12-10 2021-03-19 西安电子科技大学 Remote sensing image target detection method based on improved cross entropy loss function
CN112528862B (en) * 2020-12-10 2023-02-10 西安电子科技大学 Remote sensing image target detection method based on improved cross entropy loss function
CN113033604A (en) * 2021-02-03 2021-06-25 淮阴工学院 Vehicle detection method, system and storage medium based on SF-YOLOv4 network model
CN113033604B (en) * 2021-02-03 2022-11-15 淮阴工学院 Vehicle detection method, system and storage medium based on SF-YOLOv4 network model
CN113112489B (en) * 2021-04-22 2022-11-15 池州学院 Insulator string-dropping fault detection method based on cascade detection model
CN113112489A (en) * 2021-04-22 2021-07-13 池州学院 Insulator string-dropping fault detection method based on cascade detection model
CN113192646B (en) * 2021-04-25 2024-03-22 北京易华录信息技术股份有限公司 Target detection model construction method and device for monitoring distance between different targets
CN113192646A (en) * 2021-04-25 2021-07-30 北京易华录信息技术股份有限公司 Target detection model construction method and different target distance monitoring method and device
WO2023280082A1 (en) * 2021-07-07 2023-01-12 (美国)动力艾克斯尔公司 Handle inside-out visual six-degree-of-freedom positioning method and system
CN113421252B (en) * 2021-07-07 2024-04-19 南京思飞捷软件科技有限公司 Improved convolutional neural network-based vehicle body welding defect detection method
CN113421252A (en) * 2021-07-07 2021-09-21 南京思飞捷软件科技有限公司 Actual detection method for vehicle body welding defects based on improved convolutional neural network
CN113537375A (en) * 2021-07-26 2021-10-22 深圳大学 Diabetic retinopathy grading method based on multi-scale cascade
CN113553977B (en) * 2021-07-30 2023-02-10 国电汉川发电有限公司 Improved YOLO V5-based safety helmet detection method and system
CN113553977A (en) * 2021-07-30 2021-10-26 国电汉川发电有限公司 Improved YOLO V5-based safety helmet detection method and system
CN113723278B (en) * 2021-08-27 2023-11-03 上海云从汇临人工智能科技有限公司 Training method and device for form information extraction model
CN113723278A (en) * 2021-08-27 2021-11-30 上海云从汇临人工智能科技有限公司 Training method and device of form information extraction model
CN113869766A (en) * 2021-10-11 2021-12-31 吉林大学 Intelligent detection modeling method for alloy plate blanking quality
CN113869766B (en) * 2021-10-11 2024-04-09 吉林大学 Intelligent detection modeling method for blanking quality of alloy plate
CN113989265A (en) * 2021-11-11 2022-01-28 哈尔滨市科佳通用机电股份有限公司 Speed sensor bolt loss fault identification method based on deep learning
CN116310850B (en) * 2023-05-25 2023-08-15 南京信息工程大学 Remote sensing image target detection method based on improved RetinaNet
CN116310850A (en) * 2023-05-25 2023-06-23 南京信息工程大学 Remote sensing image target detection method based on improved RetinaNet
CN116778176A (en) * 2023-06-30 2023-09-19 哈尔滨工程大学 SAR image ship trail detection method based on frequency domain attention
CN116778176B (en) * 2023-06-30 2024-02-09 哈尔滨工程大学 SAR image ship trail detection method based on frequency domain attention

Similar Documents

Publication Publication Date Title
AU2019101133A4 (en) Fast vehicle detection using augmented dataset based on RetinaNet
JP7289918B2 (en) Object recognition method and device
Hong et al. Multimodal GANs: Toward crossmodal hyperspectral–multispectral image segmentation
CN110378381B (en) Object detection method, device and computer storage medium
Wang et al. Regional parallel structure based CNN for thermal infrared face identification
US10896342B2 (en) Spatio-temporal action and actor localization
CN110263786B (en) Road multi-target identification system and method based on feature dimension fusion
WO2018089158A1 (en) Natural language object tracking
Cadena et al. Pedestrian graph: Pedestrian crossing prediction based on 2d pose estimation and graph convolutional networks
CN111368972B (en) Convolutional layer quantization method and device
CN111723829B (en) Full-convolution target detection method based on attention mask fusion
CN111401517B (en) Method and device for searching perceived network structure
CN107767416B (en) Method for identifying pedestrian orientation in low-resolution image
US20220148291A1 (en) Image classification method and apparatus, and image classification model training method and apparatus
CN110222718B (en) Image processing method and device
CN111052151A (en) Video motion localization based on attention suggestions
CN112614119A (en) Medical image region-of-interest visualization method, device, storage medium and equipment
Haider et al. Human detection in aerial thermal imaging using a fully convolutional regression network
CN110263731B (en) Single step human face detection system
Raparthi et al. Machine Learning Based Deep Cloud Model to Enhance Robustness and Noise Interference
CN111461221A (en) Multi-source sensor fusion target detection method and system for automatic driving
Fang et al. Multi-channel feature fusion networks with hard coordinate attention mechanism for maize disease identification under complex backgrounds
Zhu et al. Indoor scene segmentation algorithm based on full convolutional neural network
CN111275732B (en) Foreground object image segmentation method based on depth convolution neural network
Kiran et al. Edge preserving noise robust deep learning networks for vehicle classification

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry