AU2019100967A4 - An environment perception system for unmanned driving vehicles based on deep learning - Google Patents

An environment perception system for unmanned driving vehicles based on deep learning

Info

Publication number
AU2019100967A4
Authority
AU
Australia
Prior art keywords
deep learning
perception system
training
environment perception
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2019100967A
Inventor
Fuming Jiang
Huifeng JIN
Shiwen Zhou
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to AU2019100967A priority Critical patent/AU2019100967A4/en
Application granted granted Critical
Publication of AU2019100967A4 publication Critical patent/AU2019100967A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

This application lies in the field of digital image processing; it is an environment perception system for unmanned driving vehicles based on deep learning. First, the image data to be identified, which includes cars, trucks and motorcycles, is preprocessed, and the processed images are divided into a training set and a test set. The training set is used to train the parameters of the neural network, which are then saved; during testing, these optimized parameters are loaded for identification. The invention has the following advantage: it needs no human participation to achieve environmental sensing, providing a reliable, high-performance environment perception system based on deep learning.

Description

FIELD OF THE INVENTION
This invention is in the field of digital image processing and performs classification of different types of vehicles using deep learning.
BACKGROUND OF THE INVENTION
The environmental perception of unmanned driving is the fundamental premise of vehicle navigation and positioning, road planning and motion control. It senses the surrounding environment of the vehicle, constructs local maps based on the road lane information, vehicle position and status information, and obstacle information obtained by the sensing system, plans local routes, and controls the steering and speed of the vehicle in real time, so that the vehicle can drive safely and reliably on the road. It therefore plays an important part in the safety and stability of unmanned driving.
The data processing of environment perception is mainly realized through deep learning. The concept of deep learning is derived from research on artificial neural networks and is a relatively new field in machine learning. Its motivation is to establish neural networks that simulate the human brain for analytical learning, combining low-level features to form more abstract high-level representations of attribute categories or features, in order to discover distributed feature representations of data. It mimics the mechanisms of the human brain to interpret data such as images, sounds and text [1]. Early deep learning was mainly centered on the deep belief network (also translated as "deep confidence network"), which consists of multiple restricted Boltzmann machines. Its deep abstract feature extraction method is a probability density distribution function learned from the data [2]; the probability of each category of the classification object is obtained by evaluating this probability distribution function.
The deep belief network was gradually replaced by the stacked autoencoder network. An autoencoder has a multi-layer, artificially constructed neuron structure [3]. As the name suggests, applying it involves an encoding process followed by a decoding process, and a feature vector can be obtained by decoding. The feature vectors are thus produced by these two structural stages within the multi-layer structure, whose basic constituent units are stacked self-encoding networks with dimensionality reduction.
In 2012, Alex Krizhevsky et al. [4] applied a convolutional neural network to the image classification task of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and reduced the image classification error rate from 26% to 15%. The network, named AlexNet, is a good proof of the effectiveness of CNNs on complex models, and it attracted great attention to CNNs. Since then, CNNs have been widely used in the fields of image recognition and image segmentation and have also begun to be applied to other areas such as speech recognition and natural language processing, greatly promoting the development of deep learning.
Convolutional neural networks are more powerful than the two networks above in extracting data characteristics. Features can be extracted effectively mainly because of three mechanisms. The first is local perception, a special visual mode that selects only a small region of interest at a time. The second is weight sharing, which means that neurons of the same kind share the same parameters. The third is down-sampling, which sharply reduces the amount of data. By incorporating these three mechanisms into the network structure, the performance of the network is greatly improved: it not only overcomes the influence of displacement, but also exhibits superior performance when the size or shape of the input image changes [5].
Sermanet et al. [6] applied deep learning to the recognition of traffic signs, using a Convolutional Neural Network (CNN) to learn the characteristics of traffic signs. The efficiency of the algorithm was verified on the German Traffic Sign Recognition Benchmark (GTSRB) and the German Traffic Sign Detection Benchmark (GTSDB). Yang et al. [7] proposed a fast traffic sign recognition algorithm in which image features were preprocessed with traditional machine learning algorithms and the images were then further classified by a CNN, which reduced the computational complexity of the algorithm. Sun et al. [8] proposed a traffic sign recognition algorithm based on the extreme learning machine, which greatly reduced the complexity of the algorithm; its effectiveness was likewise verified on GTSRB.
Zhang et al. [9] applied deep convolutional neural networks to instance segmentation under monocular vision, realizing the segmentation of different objects in actual scenes with good results. John et al. [10] applied semantic segmentation based on deep learning to actual driving scenes, realizing the segmentation of different objects at the same time. Audebert et al. [11] applied semantic segmentation and object recognition based on deep learning simultaneously to vehicle detection, which improves the robustness of the whole system.
In this invention, we use TensorFlow, a tool widely used for machine learning applications, as the deep learning framework to implement the model. The environment perception system we designed is unique in two respects. On the one hand, the data we use was collected and screened by ourselves, and we selected the best processing method after extensive experiments. On the other hand, our architecture is also designed and optimized by ourselves.
SUMMARY OF THE INVENTION
The invention utilizes multi-layer Convolutional Neural Networks together with fully-connected neural networks to analyze and classify images based on TensorFlow. This method exploits the advantages of automatic feature extraction to the full to produce a precise description of the features in an image. The invention improves training precision and speed while mitigating technical difficulties such as over-fitting.
The whole process includes five steps, as shown in Figure 1.
Data Collection
A total of M kinds of data related to automated vehicles are collected from websites, including trucks driven on the road, cars, etc.; the total number of pictures is N.
Data Preprocessing
The procedure of preprocessing is shown in Figure 2.
First, analyze and filter the collected data, deleting data that is weakly correlated with its category. Next, all pictures are restricted to a fixed size of m*m pixels. The whole dataset is then proportionally divided into a training set and a testing set. After that, the pictures are shuffled together with their labels and stored into a matrix-format file through MATLAB. Before training, the labels of both the training and testing sets are converted to one-hot encoding, and the image matrices have their color channels transformed and their values normalized to a fixed range to simplify calculations during the training and testing procedure.
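As a minimal sketch of this step, assuming the images are already loaded as a NumPy array with integer class labels (the array names, 80/20 split ratio and [-1, 1] normalization range are illustrative assumptions, not values fixed by the invention):

```python
# Illustrative sketch of the preprocessing step: shuffle, one-hot encode,
# normalize, and split proportionally into training and testing sets.
import numpy as np

def preprocess(images, labels, num_classes, train_ratio=0.8):
    # Shuffle the pictures together with their labels
    order = np.random.permutation(len(images))
    images, labels = images[order], labels[order]

    # Convert integer labels to one-hot encoding
    one_hot = np.eye(num_classes)[labels]

    # Normalize pixel values from [0, 255] into [-1, 1]
    images = images.astype(np.float32) / 127.5 - 1.0

    # Proportional split into training and testing sets
    split = int(train_ratio * len(images))
    return (images[:split], one_hot[:split]), (images[split:], one_hot[split:])
```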
Network Architecture
This part constructs the network used for training and testing. The whole Convolutional Neural Network architecture contains the layers shown in Figure 3.
Convolutional layers are utilized to extract features from input images. In these layers, filters slide over the image, dividing it into several small regions. The filter matrices act as weights and, together with biases, are used as parameters to compute a pattern value for each region matrix through a dot product.
Following each convolutional layer and fully-connected layer, there is a ReLU acting as the activation function, processing the input from the preceding layer.
The aim of pooling layers is to reduce the large spatial patterns received from convolutional layers to smaller ones through down-sampling, which helps to avoid over-fitting.
Fully-connected layers combine all local features into a global feature used to perform classification.
When an image matrix enters the network, it first passes through W convolutional layers and ReLUs, where lower-level features such as dots and lines are captured, and is then sent to a max-pooling layer to reduce the dimension of these features. This procedure is repeated J times. Then, through S fully-connected layers and their following ReLUs, a global feature is gathered to classify the input image into the corresponding class.
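This flow can be sketched in TensorFlow/Keras as below, with W, J and S left as parameters; the filter count and fully-connected width are illustrative assumptions (the preferred embodiment later fixes its own sizes):

```python
# Hedged sketch of the abstract architecture: (W conv+ReLU layers followed
# by max pooling) repeated J times, then S fully-connected layers.
import tensorflow as tf

def build_network(input_shape, num_classes, W=2, J=2, S=2):
    model = tf.keras.Sequential([tf.keras.Input(shape=input_shape)])
    for _ in range(J):                      # repeat the conv/pool stage J times
        for _ in range(W):                  # W convolutional layers with ReLU
            model.add(tf.keras.layers.Conv2D(32, 3, padding="same",
                                             activation="relu"))
        model.add(tf.keras.layers.MaxPooling2D(2))   # down-sample the features
    model.add(tf.keras.layers.Flatten())
    for _ in range(S - 1):                  # S fully-connected layers in total
        model.add(tf.keras.layers.Dense(128, activation="relu"))
    model.add(tf.keras.layers.Dense(num_classes, activation="softmax"))
    return model
```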
Structure Optimization
In this section, four methods, namely regularization, dropout, learning rate optimization and the Adam algorithm, are employed to avoid over-fitting and accelerate training.
Regularization is added based on the L2 loss, as in the following formula:

loss = \sum_{i=0}^{n} \left(y_i - h(x_i)\right)^2 + \lambda \|w\|_2^2 + \lambda \|b\|_2^2

In the formula, y_i is the actual value of the i-th input x_i, h(x_i) is the predicted value for the i-th input, w is the weight value and b is the bias value of the current layer, and \lambda is a fixed parameter.
The summation part is the L2 loss function; compared with the L1 loss, its solution is unique. The following terms are the weights and biases multiplied by a parameter, acting as regularization, whose goal is to avoid over-fitting.
Dropout is a method used between the pooling layer and the fully-connected layer to randomly drop nodes in the network according to a fixed proportion during training, in order to avoid over-fitting. At test time, all the dropped nodes are restored to the network and contribute to the prediction.
Learning rate optimization performs a fixed decay on a given learning rate according to the training step. Its aim is to reach the converged state of training as quickly and precisely as possible.
The Adam algorithm is used to update the parameters of the deep learning network model. Compared with the gradient descent algorithm and momentum update, the Adam algorithm achieves a better convergence time under the same learning rate.
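These four techniques can be expressed in TensorFlow/Keras roughly as follows; the regularization factor, dropout rate and decay schedule here are placeholder values, not the tuned settings reported later:

```python
# Hedged sketch of the four optimization methods: L2 regularization,
# dropout, learning-rate decay and the Adam optimizer. All numeric
# values are illustrative assumptions.
import tensorflow as tf

# L2 regularization on the weights and biases of a layer
dense = tf.keras.layers.Dense(
    128, activation="relu",
    kernel_regularizer=tf.keras.regularizers.l2(1e-4),
    bias_regularizer=tf.keras.regularizers.l2(1e-4))

# Dropout: randomly drops nodes during training only
drop = tf.keras.layers.Dropout(rate=0.5)

# Fixed decay on a given initial learning rate, driven by the training step
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=100, decay_rate=0.99)

# Adam: adaptive per-parameter updates for faster convergence
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```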
Train & Test
After the structure optimization step, the training dataset is sent to the network for training, and the test dataset is used to check whether the generated model satisfies the requirement. If not, parameters including batch size, initial learning rate, decay rate, etc., are adjusted manually until a satisfactory model is formed.
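In code, this train-then-check cycle might look like the sketch below, reusing build_network and optimizer from the sketches above; the data arrays, batch size, epoch count and accuracy target are all assumed for illustration:

```python
# Hedged sketch of the train & test cycle: train, evaluate, and adjust
# hyperparameters manually until the accuracy requirement is met.
# Assumes preprocessed arrays train_images/train_labels/test_images/
# test_labels are already loaded; the 0.90 target is a placeholder.
model = build_network(input_shape=(32, 32, 1), num_classes=3)
model.compile(optimizer=optimizer,
              loss="categorical_crossentropy", metrics=["accuracy"])

model.fit(train_images, train_labels, batch_size=100, epochs=10)
_, test_acc = model.evaluate(test_images, test_labels)
if test_acc < 0.90:
    print("model not satisfactory; adjust batch size, learning rate, "
          "decay rate, etc., and retrain")
```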
DESCRIPTION OF DRAWINGS
Figure 1 illustrates the procedure of the invention;
Figure 2 illustrates the procedure of data preprocessing;
Figure 3 illustrates the abstract architecture of the Convolutional Neural Network;
Figure 4 illustrates the detailed CNN architecture;
Table 1 illustrates the results of training and testing in the network.
DESCRIPTION OF PREFERRED EMBODIMENT
The total procedure of completing the model includes five steps, as shown in Figure 1, and each is described in detail in the following sections.
Data Collection
We use images of three types of vehicles in the MIOvision Traffic Camera Dataset (MIO-TCD): cars, trucks and motorcycles. MIO-TCD is a dataset consisting of more than half a million images acquired at different times of day and different periods of the year by 8,000 traffic cameras deployed all over Canada and the United States. These images have been selected to cover a wide range of localization challenges and are representative of typical visual data captured today in urban traffic scenarios.
In order to get a better data distribution, we use data augmentation, including rotation and zooming, to obtain 5,000 images of each type of vehicle: 4,000 for training and 1,000 for testing. In total, the training set includes 12,000 images and the testing set includes 3,000 images.
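A rotation-and-zoom augmentation pass of this kind could be sketched with OpenCV as follows; the angle and zoom ranges are assumptions, since the text does not specify them:

```python
# Hedged sketch of the rotation + zoom augmentation described above.
# The angle and zoom ranges are illustrative assumptions.
import cv2
import numpy as np

def augment(image, max_angle=15.0, zoom_range=(0.9, 1.1)):
    h, w = image.shape[:2]
    angle = np.random.uniform(-max_angle, max_angle)
    zoom = np.random.uniform(*zoom_range)
    # Rotation about the image center, combined with scaling (zoom)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, zoom)
    return cv2.warpAffine(image, M, (w, h))
```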
Data Preprocessing
Firstly, for every image, we resize it to 32*32 pixels by bilinear interpolation with OpenCV. Then, with the aim of increasing calculation speed, we reduce each image from three channels to a single channel: specifically, we take the average of the values of the three channels and convert it into the interval between -1 and 1. Finally, we tag each image with a one-hot encoded label and store the data and labels into files through MATLAB.
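The per-image pipeline can be written concretely as below; the 32*32 bilinear resize, channel averaging and [-1, 1] scaling follow the text, while the function name and the assumption of a three-channel input are illustrative:

```python
# Hedged sketch of the per-image preprocessing: bilinear resize to 32x32,
# average the three channels into one, then scale into [-1, 1].
import cv2
import numpy as np

def preprocess_image(color_image):
    resized = cv2.resize(color_image, (32, 32), interpolation=cv2.INTER_LINEAR)
    gray = resized.astype(np.float32).mean(axis=2)  # average of the 3 channels
    return gray / 127.5 - 1.0                       # map [0, 255] into [-1, 1]
```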
Network Architecture
As shown in Figure 4, the network consists of four convolution layers followed by two fully connected layers. The input is a single-channel image whose height and width are both 32 pixels. For every convolution layer, the kernel size is 3*3 and it moves with a stride of 1 in the vertical and horizontal directions. Meanwhile, we pad the matrices with zeros so that the output images keep the same size as the input images. We initialize the weights from a normal distribution with standard deviation 0.1 and set the biases to 0.1.
Each convolution layer outputs 32 feature maps, and the results go through the nonlinear activation function ReLU, which is a pixel-wise operation. As shown in formula (1), ReLU is used to alleviate the vanishing-gradient problem and speed up the convergence rate.
\mathrm{ReLU}(x) = \max(0, x) = \begin{cases} x, & x > 0 \\ 0, & x \le 0 \end{cases} \quad (1)

There is a max-pooling layer behind Convolutional Layer 2 and Convolutional Layer 4 respectively. Pooling is used to reduce the size of the feature maps and decrease the number of parameters in the network. In our work, we use filters of size 2*2 to down-sample, with a stride of 2. In max-pooling, the maximum among the four values in the filter window is kept and the others are discarded. Thus, the first max-pooling layer changes the image size from 32*32 (height*width) to 16*16 and the second one changes the size from 16*16 to 8*8.
After the operation of Convolutional Layer 4, we reshape the image matrix from 8*8*32 to 2048*1. For the first fully connected layer, the number of input nodes is 2048 and the number of output nodes is 128, and the results again go through the activation function ReLU. After the second fully connected layer, the output is the high-level feature of the input image, and it is classified by the softmax function of formula (2), where y_i denotes the value of the i-th element. The output is the probability of each category, and the probabilities sum to 1.

\mathrm{Softmax}(y_i) = \frac{\exp(y_i)}{\sum_{j} \exp(y_j)} \quad (2)
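Under the dimensions just given, one plausible TensorFlow/Keras rendering of the Figure 4 architecture is the sketch below; the layer sizes and initializers follow the text, while the Keras-specific plumbing is an assumption:

```python
# Hedged sketch of the Figure 4 architecture: four 3x3 conv layers with
# 32 feature maps each, max pooling after conv2 and conv4, then
# 2048 -> 128 -> 3 fully connected layers with a softmax output.
import tensorflow as tf

init_w = tf.keras.initializers.RandomNormal(stddev=0.1)  # normal, stddev 0.1
init_b = tf.keras.initializers.Constant(0.1)             # biases set to 0.1

def conv():
    # 3x3 kernel, stride 1, zero padding to preserve the image size
    return tf.keras.layers.Conv2D(32, 3, strides=1, padding="same",
                                  activation="relu",
                                  kernel_initializer=init_w,
                                  bias_initializer=init_b)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 1)),
    conv(), conv(),                                  # Convolutional Layers 1-2
    tf.keras.layers.MaxPooling2D(2, strides=2),      # 32x32 -> 16x16
    conv(), conv(),                                  # Convolutional Layers 3-4
    tf.keras.layers.MaxPooling2D(2, strides=2),      # 16x16 -> 8x8
    tf.keras.layers.Flatten(),                       # 8*8*32 = 2048 features
    tf.keras.layers.Dense(128, activation="relu",
                          kernel_initializer=init_w, bias_initializer=init_b),
    tf.keras.layers.Dense(3, activation="softmax"),  # car / truck / motorcycle
])
```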
Structure Optimization
In the optimization part, we adopt the Adam optimization algorithm for our invention, which calculates independent adaptive learning rates for different parameters from first-order and second-order moment estimates of the gradient. Meanwhile, we adopt the L2 loss function, shown in formula (3), for our model because it is very sensitive to outliers in the dataset.
\mathrm{loss} = \sum_{i=0}^{n} \left(y_i - h(x_i)\right)^2 \quad (3)

In the formula, y_i is the actual value of the i-th input x_i and h(x_i) is the predicted value for the i-th input.
To avoid over-fitting, we apply the dropout method to the fully connected layer in the training stage. Dropout discards some nodes randomly with a given probability and prevents the model from over-fitting. We set the dropout rate to 0.99, which gives the best result. At the same time, we set the initial learning rate to 0.001, and it decreases with a decay rate of 0.99.
Train & Test
Note that the batch size refers to the training period, while the testing batch size is fixed at 500; the accuracy is the testing accuracy, and the learning rate is the initial learning rate pre-defined at the beginning of the training period.
Batch size is the number of images processed in parallel to improve the training speed. The initial learning rate is a manually defined value which varies automatically during the training period to achieve better accuracy.
In the loop of training and testing, the number of iteration steps was increased from 300 to 2,000, the initial learning rate was increased from 0.001 to 0.01, and the dropout rate was decreased from 0.99 to 0.8. By changing these parameters of the network and comparing the test accuracies, the best combination of parameters is discovered, and the model trained with these parameters is the one required.
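This manual search amounts to a small grid sweep over the three parameters; a hedged sketch, where train_and_test is a hypothetical helper standing in for the full pipeline above and the grids are drawn from the ranges just quoted:

```python
# Hedged sketch of the parameter sweep: try combinations of iteration
# steps, initial learning rate and dropout rate, and keep the best.
import itertools

def train_and_test(steps, learning_rate, dropout):
    """Hypothetical helper: train with these settings and return the test
    accuracy. Placeholder body; a real run would invoke the pipeline above."""
    return 0.0

grid = itertools.product([300, 1000, 2000],   # iteration steps
                         [0.001, 0.01],       # initial learning rate
                         [0.99, 0.8])         # dropout rate
best = max(grid, key=lambda p: train_and_test(*p))
print("best (steps, learning rate, dropout):", best)
```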
From Table 1, it can be found that the best recognition model is obtained when the number of iteration steps is 1,000 and the initial learning rate is 0.001, giving the highest accuracy of 93.267%.
REFERENCES
[1] Sun Haotian. Application of Deep Learning in Unmanned Vehicles [J]. Computer Knowledge and Technology, 2015, 11(24): 121-123.
[2] Fan Jialue, Xu Wei, Wu Ying, et al. Human tracking using convolutional neural networks [J]. IEEE Transactions on Neural Networks, 2010, 21(10).
[3] Xiang Zhan. Research on LeNet-5 Convolutional Neural Network Optimization Based on Particle Swarm Optimization [D]. Huazhong University of Science and Technology, 2016.
[4] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks [C]//International Conference on Neural Information Processing Systems. Curran Associates Inc., 2012: 1097-1105.
[5] Cong Bowen. Research on binocular perception method of vehicle driving three-dimensional environment based on deep learning [D]. Xi'an University of Science and Technology, 2018.
[6] Sermanet P, LeCun Y. Traffic sign recognition with multi-scale convolutional networks [C]//IEEE International Joint Conference on Neural Networks, 2011: 2809-2813.
[7] Yang Y, Luo H, Xu H, et al. Towards real-time traffic sign detection and classification [J]. IEEE Transactions on Intelligent Transportation Systems, 2015: 1-10.
[8] Sun Z L, Wang H, Lau W S, et al. Application of BW-ELM model on traffic sign recognition [J]. Neurocomputing, 2014, 128(1): 153-159.
[9] Zhang Z, Schwing A G, Fidler S, et al. Monocular object instance segmentation and depth ordering with CNNs [C]//IEEE International Conference on Computer Vision. IEEE, 2015: 2614-2622.
[10] John V, Kidono K, Guo C, et al. Fast road scene segmentation using deep learning and scene-based models [C]//2016 23rd International Conference on Pattern Recognition (ICPR). IEEE, 2016: 3763-3768.
[11] Audebert N, Le Saux B, Lefevre S. Segment-before-detect: Vehicle detection and classification through semantic segmentation of aerial images [J]. Remote Sensing, 2017, 9(4): 368.

Claims (1)

1. An environment perception system for unmanned driving vehicles based on deep learning, using small-size image detection, characterized in that:
the method can achieve 93.267% accuracy in 1,000 iterations when detecting 32*32-pixel images, wherein, on an i7-6700HQ CPU, the total running time of training on 12,000 images together with testing on 3,000 images is within 10 minutes.
AU2019100967A 2019-08-29 2019-08-29 An environment perception system for unmanned driving vehicles based on deep learning Ceased AU2019100967A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2019100967A AU2019100967A4 (en) 2019-08-29 2019-08-29 An environment perception system for unmanned driving vehicles based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2019100967A AU2019100967A4 (en) 2019-08-29 2019-08-29 An environment perception system for unmanned driving vehicles based on deep learning

Publications (1)

Publication Number Publication Date
AU2019100967A4 true AU2019100967A4 (en) 2019-10-03

Family

ID=68063090

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2019100967A Ceased AU2019100967A4 (en) 2019-08-29 2019-08-29 An environment perception system for unmanned driving vehicles based on deep learning

Country Status (1)

Country Link
AU (1) AU2019100967A4 (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112509017A (en) * 2020-11-18 2021-03-16 西北工业大学 Remote sensing image change detection method based on learnable difference algorithm
CN112509017B (en) * 2020-11-18 2024-06-28 西北工业大学 Remote sensing image change detection method based on learnable differential algorithm
CN114782907A (en) * 2022-03-29 2022-07-22 智道网联科技(北京)有限公司 Unmanned vehicle driving environment recognition method, device, equipment and computer readable storage medium
CN114782907B (en) * 2022-03-29 2024-07-26 智道网联科技(北京)有限公司 Unmanned vehicle driving environment recognition method, device, equipment and computer readable storage medium

Similar Documents

Publication Publication Date Title
JP7289918B2 (en) Object recognition method and device
AU2019101133A4 (en) Fast vehicle detection using augmented dataset based on RetinaNet
CN110163187B (en) F-RCNN-based remote traffic sign detection and identification method
US20220215227A1 (en) Neural Architecture Search Method, Image Processing Method And Apparatus, And Storage Medium
WO2021147325A1 (en) Object detection method and apparatus, and storage medium
CN111401517B (en) Method and device for searching perceived network structure
CN111460919B (en) Monocular vision road target detection and distance estimation method based on improved YOLOv3
Haider et al. Human detection in aerial thermal imaging using a fully convolutional regression network
Nguyen et al. Hybrid deep learning-Gaussian process network for pedestrian lane detection in unstructured scenes
US20230070439A1 (en) Managing occlusion in siamese tracking using structured dropouts
CN112417973A (en) Unmanned system based on car networking
Behera et al. Superpixel-based multiscale CNN approach toward multiclass object segmentation from UAV-captured aerial images
AU2019100967A4 (en) An environment perception system for unmanned driving vehicles based on deep learning
CN117157679A (en) Perception network, training method of perception network, object recognition method and device
CN112084897A (en) Rapid traffic large-scene vehicle target detection method of GS-SSD
Rahman et al. Real-Time Object Detection using Machine Learning
CN116740516A (en) Target detection method and system based on multi-scale fusion feature extraction
CN116863260A (en) Data processing method and device
Chen et al. Research on object detection algorithm based on multilayer information fusion
CN112805723B (en) Image processing system and method and automatic driving vehicle comprising system
He et al. Real-time pedestrian warning system on highway using deep learning methods
CN113313091B (en) Density estimation method based on multiple attention and topological constraints under warehouse logistics
Jabeen et al. Weather classification on roads for drivers assistance using deep transferred features
Wang et al. An Improved Deeplabv3+ Model for Semantic Segmentation of Urban Environments Targeting Autonomous Driving.
CN115115016A (en) Method and device for training neural network

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry