AU2020100048A4 - Method of object detection for vehicle on-board video based on RetinaNet

Method of object detection for vehicle on-board video based on RetinaNet

Info

Publication number
AU2020100048A4
AU2020100048A4
Authority
AU
Australia
Prior art keywords
training
network
retinanet
data set
subnet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2020100048A
Inventor
Mengfang Ding
Heyang Huang
Zhixu Liu
Yufei Song
Yihe Wang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to AU2020100048A priority Critical patent/AU2020100048A4/en
Application granted granted Critical
Publication of AU2020100048A4 publication Critical patent/AU2020100048A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/582Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of traffic signs

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

This invention lies in the field of digital signal processing. It is an image recognition system, based on deep learning, for obstacles and road signs around automated driving vehicles. The invention consists of the following steps. Firstly, we collected images from cameras mounted on several cars. Secondly, after selecting and preprocessing the images, we divided them into two data sets: one for training, the other for testing. We then put the training data set into the convolutional neural network. In order to reach the best performance, we adjusted some parameters of the network, and finally, we put the test data set into the network and evaluated the accuracy of recognition. In conclusion, this system can recognise different types of obstacles and road signs with high accuracy without human intervention.
[Figure 1 (flowchart): download the RetinaNet training model and the Pascal VOC 2007 data set; divide the Pascal VOC 2007 set into training and testing data sets; initialize the neural network; adjust parameters to train the network; test using the testing data set. Figure 2 (network structure): (a) ResNet, (b) feature pyramid net, (c) class subnet (top), (d) box subnet (bottom).]

Description

Method of object detection for vehicle on-board video based on RetinaNet
Field of Invention
This invention is in the field of digital signal processing. It aims to recognise different kinds of obstacles and road signs around automated driving vehicles, in order to reduce the risk of traffic accidents and allow the vehicle to adjust itself in time under complicated traffic conditions.
Background
In the 20th century, soon after the car was invented, some scientists were already thinking about automated driving cars. In an automated driving system, the vehicle should recognise road signs (e.g. zebra crossings, temporary construction signs) and obstacles (e.g. other vehicles, people). But at that time, computers were neither small enough to be placed inside a car nor powerful enough to handle such complex data.
Around the beginning of this century, basic automated driving systems started to appear. In that period, however, researchers had to work out the distinguishing features of these objects themselves and enter those features into the computer, which would then compare them with the image. This caused researchers great difficulty, since there are too many features to extract by hand. The accuracy of recognition was also very low, so traffic accidents happened frequently during road tests. Nowadays, with the development of technology, a computer can learn how to differentiate images by itself; this is called deep learning. With this technique, the computer automatically learns the features of different objects from a large number of images and uses these features to classify a new image, which means that no human intervention is needed anymore. It reduces the difficulty for researchers while increasing the efficiency and accuracy of recognition. Hence, we use this technique to create an image recognition system for automated driving vehicles.
In this invention, we used RetinaNet, a one-stage object detector, as our deep learning framework. [1] It has a faster detection speed than two-stage detectors such as Faster R-CNN. Also, thanks to Focal Loss, we address the imbalance between positive and negative examples in one-stage detectors, so the efficiency and accuracy of recognition are greatly improved.
Summary
This invention aims to recognise different kinds of obstacles and road signs around automated driving vehicles, in order to reduce the risk of traffic accidents and allow the vehicle to adjust itself in time under complicated traffic conditions. Using RetinaNet, a one-stage detector with a faster detection speed, together with Focal Loss, which keeps a balance between foreground and background classes, yields a significant improvement in both the efficiency and the accuracy of recognition.
The framework of our image recognition method includes: the Pascal VOC 2007 data set, a convolutional neural network based on RetinaNet, parameter adjustment, and the application of recognition.
In order to make the training process effective, the data set should be large, diverse and reliable. Therefore, we chose Pascal VOC 2007 as our training and testing data set.
Our convolutional neural network is based on RetinaNet; Figure 2 shows its general structure. In the network, we used a Feature Pyramid Network (FPN) backbone on top of a feedforward ResNet architecture (a) to generate a rich, multi-scale convolutional feature pyramid (b). To this backbone, RetinaNet attaches two subnetworks, one for classifying anchor boxes (c) and one for regressing from anchor boxes to ground-truth object boxes (d). The network we used was downloaded from GitHub.
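As an illustration of the two subnetworks, the following is a minimal PyTorch sketch of the head design, assuming 256 FPN channels, A = 9 anchors per location, and K object classes as in the RetinaNet paper. It is a sketch only, not the actual network downloaded from GitHub.

```python
import torch
import torch.nn as nn

def make_head(out_channels, in_channels=256):
    """Four 3x3 conv + ReLU layers, then a final 3x3 conv, as in the paper's heads."""
    layers = []
    for _ in range(4):
        layers += [nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU()]
    layers.append(nn.Conv2d(in_channels, out_channels, 3, padding=1))
    return nn.Sequential(*layers)

K, A = 20, 9                          # e.g. 20 object classes, 9 anchors per location
class_subnet = make_head(K * A)       # classification subnet: K scores per anchor
box_subnet = make_head(4 * A)         # box subnet: 4 regression offsets per anchor

p = torch.randn(1, 256, 64, 64)       # a dummy FPN feature map, e.g. level P3
print(class_subnet(p).shape)          # torch.Size([1, 180, 64, 64])
print(box_subnet(p).shape)            # torch.Size([1, 36, 64, 64])
```

In the paper's design, the same two heads are applied to every pyramid level, with parameters shared across levels but not between the two heads.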
Description of Drawing
Figure 1 shows the procedure of our invention.
Figure 2 shows the general structure of our convolutional neural network based on RetinaNet.
Description of Preferred Embodiment
Network Design
The network model we used is RetinaNet, a one-stage detector created by researchers at Facebook AI Research (FAIR). As its authors explain, the highest-accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but had trailed the accuracy of two-stage detectors. They found that the extreme foreground-background class imbalance encountered during the training of dense detectors is the central cause, and proposed to address it by reshaping the standard cross-entropy loss so that it down-weights the loss assigned to well-classified examples. This novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. Their results show that, trained with the focal loss, RetinaNet matches the speed of previous one-stage detectors while surpassing the accuracy of existing state-of-the-art two-stage detectors.
In order to verify the validity of the focal loss, a one-stage target detector, RetinaNet, was designed. Its design uses an efficient feature pyramid network and adopts anchor boxes. The best-performing RetinaNet configuration uses a ResNet-101-FPN backbone, which reaches 39.1 AP on the COCO test set at a speed of 5 fps.
Focal Loss starts from the balanced cross entropy CE(p_t) = -α_t log(p_t): the weight α can balance the importance of positive and negative samples, but not of easy and hard samples. The CE loss is therefore reshaped to reduce the weight of easy samples and put more attention on hard negatives during training. The standard focal loss is FL(p_t) = -(1 - p_t)^γ log(p_t); the experiments also add the balance parameter, giving FL(p_t) = -α_t (1 - p_t)^γ log(p_t).
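A minimal NumPy sketch of the focal loss formula above, using the paper's default values α = 0.25 and γ = 2 (the function name and defaults are ours):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) for binary labels y in {0, 1}.

    p is the predicted probability of the positive class. For a well-classified
    easy example p_t is close to 1, so (1 - p_t)^gamma down-weights its loss.
    """
    p = np.clip(p, 1e-12, 1.0 - 1e-12)           # numerical stability
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# An easy negative (p = 0.1) contributes far less than a hard negative (p = 0.9):
print(focal_loss(np.array([0.1, 0.9]), np.array([0, 0])))
```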
RetinaNet essentially consists of a ResNet backbone, an FPN, and two FCN subnetworks. The design idea is that the backbone is an effective feature extraction network such as VGG or ResNet; the paper mainly tries ResNet-50 and ResNet-101. The FPN strengthens the multi-scale features formed in the ResNet, producing feature maps with more expressive multi-scale information about target regions. Finally, two FCN subnetworks with the same structure but no shared parameters are applied to the FPN feature maps to perform the object classification and box position regression tasks.
Anchor information: anchor areas range from 32^2 to 512^2 across the feature pyramid levels P3 to P7. At each level there are three aspect ratios {1:2, 1:1, 2:1}; for denser scale coverage, three size multipliers {2^0, 2^(1/3), 2^(2/3)} are added to the anchor set at each level. Each anchor is assigned a classification vector of length K and a box regression vector of length 4.
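Under these conventions, the anchor widths and heights per pyramid level can be derived as in the sketch below (the function name is ours; aspect ratio r is taken as height/width):

```python
import numpy as np

def anchors_for_level(level):
    """Return the nine (width, height) anchor pairs for pyramid level P3..P7."""
    base = 2.0 ** (level + 2)                    # P3 -> 32, P4 -> 64, ..., P7 -> 512
    scales = [2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3)]
    ratios = [0.5, 1.0, 2.0]                     # aspect ratios 1:2, 1:1, 2:1
    anchors = []
    for s in scales:
        area = (base * s) ** 2                   # areas run from 32^2 up to 512^2
        for r in ratios:
            w = np.sqrt(area / r)                # w * h = area and h / w = r
            anchors.append((w, w * r))
    return anchors

print(anchors_for_level(3))                      # the nine P3 anchors
```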
During inference, the trained model decodes at most the top 1,000 boxes with the highest predicted object probability at each FPN level; the boxes from all levels are then merged and filtered with non-maximum suppression (NMS) at a 0.5 IoU threshold to obtain the final box positions. The training loss consists of an L1 loss on the box position information and the focal loss on the category information. For model initialization, considering the extreme imbalance between positive and negative samples, the bias of the final classification convolution is initialized accordingly (in the original RetinaNet paper, to b = -log((1 - π)/π) with π = 0.01).
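The NMS filtering step can be sketched as follows in NumPy; this is a minimal greedy version for illustration, and the implementation in the downloaded network may differ:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression. boxes: (N, 4) as [x1, y1, x2, y2]; scores: (N,)."""
    order = np.argsort(scores)[::-1]             # highest-scoring box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of box i with every remaining box
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]          # drop boxes overlapping box i too much
    return keep
```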
In the network, we used a Feature Pyramid Network (FPN) backbone on top of a feedforward ResNet architecture to generate a rich, multi-scale convolutional feature pyramid. To this backbone, RetinaNet attaches two subnetworks, one for classifying anchor boxes and one for regressing from anchor boxes to ground-truth object boxes. [1]
Focal Loss is designed to solve the extreme imbalance between foreground and background classes in one-stage detectors during training.
Procedure
The procedure of this invention is implemented as follows:
1. Preparing: We downloaded the RetinaNet network and the Pascal VOC 2007 data set from the internet.
2. Training and testing data set splitting: We divided the Pascal VOC 2007 set into a training data set and a testing data set (see the sketch after this list).
3. We put the training data set into the convolutional neural network for training.
4. Finally, we put the testing data set into the convolutional neural network and evaluated the accuracy of recognition.
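A minimal sketch of step 2 under an assumed directory layout. Note that Pascal VOC 2007 ships with its own split lists under ImageSets/Main; a random split like the one below is only one possible choice, and the path is hypothetical.

```python
import random
from pathlib import Path

def split_voc(image_dir, train_ratio=0.8, seed=0):
    """Randomly split the VOC 2007 images into training and testing lists."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    rng = random.Random(seed)                    # fixed seed for a reproducible split
    rng.shuffle(images)
    n_train = int(len(images) * train_ratio)
    return images[:n_train], images[n_train:]

# Hypothetical path; adjust to wherever the data set was unpacked.
train_set, test_set = split_voc("VOCdevkit/VOC2007/JPEGImages")
```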

Claims (1)

1. Method of object detection for vehicle on-board video based on RetinaNet, wherein said RetinaNet essentially consists of a ResNet backbone, an FPN, and two FCN subnetworks; the design idea is that the backbone is an effective feature extraction network such as VGG or ResNet, mainly ResNet-50 and ResNet-101.

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2020100048A AU2020100048A4 (en) 2020-01-10 2020-01-10 Method of object detection for vehicle on-board video based on RetinaNet

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2020100048A AU2020100048A4 (en) 2020-01-10 2020-01-10 Method of object detection for vehicle on-board video based on RetinaNet

Publications (1)

Publication Number Publication Date
AU2020100048A4 true AU2020100048A4 (en) 2020-02-13

Family

ID=69412711

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2020100048A Ceased AU2020100048A4 (en) 2020-01-10 2020-01-10 Method of object detection for vehicle on-board video based on RetinaNet

Country Status (1)

Country Link
AU (1) AU2020100048A4 (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626200A (en) * 2020-05-26 2020-09-04 北京联合大学 Multi-scale target detection network and traffic identification detection method based on Libra R-CNN
CN111598061A (en) * 2020-07-21 2020-08-28 成都中轨轨道设备有限公司 System and method for autonomously identifying and positioning contents of track signboard
CN112529095A (en) * 2020-12-22 2021-03-19 合肥市正茂科技有限公司 Single-stage target detection method based on convolution region re-registration
CN112529095B (en) * 2020-12-22 2023-04-07 合肥市正茂科技有限公司 Single-stage target detection method based on convolution region re-registration
CN113392757A (en) * 2021-06-11 2021-09-14 恒睿(重庆)人工智能技术研究院有限公司 Method, device and medium for training human body detection model by using unbalanced data
CN113392757B (en) * 2021-06-11 2023-08-15 恒睿(重庆)人工智能技术研究院有限公司 Method, device and medium for training human body detection model by using unbalanced data
CN113610070A (en) * 2021-10-11 2021-11-05 中国地质环境监测院(自然资源部地质灾害技术指导中心) Landslide disaster identification method based on multi-source data fusion

Similar Documents

Publication Publication Date Title
AU2020100048A4 (en) Method of object detection for vehicle on-board video based on RetinaNet
CN110363201B (en) Weak supervision semantic segmentation method and system based on collaborative learning
CN106022300B (en) Traffic sign recognition method and system based on cascade deep study
Khalil Car plate recognition using the template matching method
Zhang et al. Vehicle detection in the aerial infrared images via an improved yolov3 network
CN108734189A (en) Vehicle License Plate Recognition System based on atmospherical scattering model and deep learning under thick fog weather
Aggarwal et al. A robust method to authenticate car license plates using segmentation and ROI based approach
CN102163278B (en) Illegal vehicle intruding detection method for bus lane
CN112990282B (en) Classification method and device for fine-granularity small sample images
CN110009058A (en) A kind of parking lot Vehicle License Plate Recognition System and method
CN105956610B (en) A kind of remote sensing images classification of landform method based on multi-layer coding structure
Le et al. Vehicle count system based on time interval image capture method and deep learning mask R-CNN
Su et al. A new local-main-gradient-orientation HOG and contour differences based algorithm for object classification
Naimi et al. Multi-nation and multi-norm license plates detection in real traffic surveillance environment using deep learning
Prabhu et al. Recognition of Indian license plate number from live stream videos
Onim et al. Traffic surveillance using vehicle license plate detection and recognition in bangladesh
Latha et al. Image understanding: semantic segmentation of graphics and text using faster-RCNN
Shomee et al. License plate detection and recognition system for all types of bangladeshi vehicles using multi-step deep learning model
Vu et al. Traffic incident recognition using empirical deep convolutional neural networks model
Zhang et al. Learning with free object segments for long-tailed instance segmentation
Ismail License plate Recognition for moving vehicles case: At night and under rain condition
CN104517127A (en) Self-learning pedestrian counting method and apparatus based on Bag-of-features model
CN116052206A (en) Bird identification method and system integrating visual saliency
CN115456941A (en) Novel electric power insulator defect detection and identification method and system
Noaeen et al. Social media analysis for traffic management

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry