AU2019101224A4 - Method of Human detection research and implement based on deep learning - Google Patents

Method of Human detection research and implement based on deep learning

Info

Publication number
AU2019101224A4
AU2019101224A4
Authority
AU
Australia
Prior art keywords
convolutional
residual
data
training
photographs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2019101224A
Inventor
Zikai Shu
Zeyuan Wu
Tianyu Xin
Fengkun YANG
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yang Fengkun Miss
Original Assignee
Yang Fengkun Miss
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yang Fengkun Miss filed Critical Yang Fengkun Miss
Priority to AU2019101224A priority Critical patent/AU2019101224A4/en
Application granted granted Critical
Publication of AU2019101224A4 publication Critical patent/AU2019101224A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)

Abstract

This invention lies in the field of deep learning, building on an improved YoloV3. Pedestrians are detected in pictures through the following steps. First, we acquire a large number of photographs from websites, so that the model has plenty of learning opportunities and superior data. Then, after selection and pre-processing, we divide the photographs into training and testing sets, using the training photographs to give the model practice in distinguishing pedestrians. Next, we download YoloV3 from GitHub and use it to train on our data; by adjusting the parameters of the algorithm and improving parts of its structure, the accuracy of human detection is improved considerably. Finally, we feed the test data to the model, which detects whether pedestrians are present in the photographs without human involvement.
[Figure 1: structure diagram of YoloV3 (graphic not reproducible in text)]
[Figure 2: Darknet-53 layer table listing type (convolutional, residual, avgpool, connected, softmax), filter counts, kernel size/stride and output resolutions from 256 x 256 down to 8 x 8 (graphic not reproducible in text)]

Description

TITLE
Method of Human detection research and implement based on deep learning
FIELD OF THE INVENTION
This invention lies in the field of deep learning and is based on YoloV3.
BACKGROUND
Recently, with the rapid development of artificial intelligence, the requirements of machine automation have gradually risen, and human detection has entered our daily life: our smartphones recognise our faces to unlock. Public security systems also use it to identify suspects. With such wide application, the accuracy of face recognition also needs to be raised; this matters not only for operational efficiency but also for a stronger guarantee of safety. Traditional face recognition has a larger error in accuracy. Based on an improved Yolo_V3, we control the parameters of the artificial neural network, which increases accuracy without human involvement. This will be conducive to its application in future life. Face recognition can replace conventional identity authentication and play a greater role in many places, such as airports: it reduces the workload of staff, facilitates management, and improves the efficiency of access. Therefore, this face recognition technology will undoubtedly have more extensive scope for use in the future.
Our invention is based on the Yolo_V3 algorithm and improves its deep convolutional neural network. It improves performance to a certain extent and raises accuracy by continuously optimizing the value of each parameter during training.
SUMMARY
In order to improve the accuracy and efficiency of image recognition, and to reduce the error of existing neural network algorithms to some extent, we use a modified Yolo_V3 algorithm for human detection. By adjusting the parameters, the accuracy of image recognition is lifted, so that classification becomes more precise. Fully exploiting the advantage of automatic feature extraction in deep learning, we can judge whether there is a face in an image and extract its features, which gives the method wide application prospects. To build the image database, we label and convert images from the Internet, divide them into training sets and test sets, and put them into the Yolo_V3 convolutional neural network shown in the figure, whose structure can be seen intuitively. The processing pipeline of Yolo_V3 can be seen as a series of combinations of three basic components. First, CBL, composed of convolution, batch normalisation (BN) and Leaky ReLU, is the smallest component of Yolo_V3. Second, resN, where N stands for a number, draws on the residual structure of ResNet to make the network deeper; its basic component is also CBL. Third, concat, a tensor splicing component, splices the upsampled output of a middle layer with some of the later layers; it expands the channel dimension of the tensors rather than adding them directly. In the whole structure of Yolo_V3 there is no pooling layer and no fully connected layer. In forward propagation, the spatial size of a tensor is changed by setting the stride of the convolutions: each such transformation halves the side length, and the network performs five of them, so the output is reduced to 1/32 of the input. To guarantee a valid output feature map, the side length of the input image must therefore be adjusted to a multiple of 32, usually 416 x 416. To improve training efficiency, we put the data sets into the network in batches to reduce the loss function. Yolo_V3 is a multi-scale training network structure, so we can choose between speed and accuracy according to our needs, which is also a manifestation of its flexibility. To improve accuracy, we make certain adjustments to its selection, tuning its inputs and parameters so that it discards unnecessary samples and gives results in line with our expectations.
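The downsampling arithmetic above can be sketched in a few lines of Python (an illustrative helper, not part of the patented method):

```python
# Five stride-2 convolutions each halve the side length, so the network
# downsamples the input by 2**5 = 32 overall; hence the side length must
# be a multiple of 32 (e.g. 416) for the feature maps to stay integral.
def feature_map_sizes(side, num_stride2_convs=5):
    if side % 2 ** num_stride2_convs != 0:
        raise ValueError("input side should be a multiple of 32")
    sizes = []
    for _ in range(num_stride2_convs):
        side //= 2
        sizes.append(side)
    return sizes

# A 416 x 416 input passes through 208, 104, 52, 26 and finally 13;
# the last three sizes are exactly the three detection scales discussed later.
print(feature_map_sizes(416))  # [208, 104, 52, 26, 13]
```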
DESCRIPTION OF DRAWING
Figure 1 shows the structure of Yolo_V3 and its basic components.
Figure 2 shows changing the stride of the convolutions.
Figure 3 shows prediction of the target boundary frame.
DESCRIPTION OF PREFERRED EMBODIMENT
Network Design
Firstly, after the convolution calculation, batch normalisation is applied in the BN layer to determine the dominant training direction of the pictures and discard data far from the regression line, so that the trained pictures have more uniform features. The data is then activated by the Leaky ReLU layer. The deep neural network is divided into several parts, and the shallow sub-networks are trained with shortcut connections to control the propagation of the gradient and prevent gradient vanishing or even gradient explosion. The other two basic components, Res_unit and Resblock_body, are also composed of several CBLs; a Resblock_body containing X res_units is abbreviated resX. The zero-padding step in resX extends the edges of the gradually shrinking feature map with zeros, preserving the characterisation of the image.
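A minimal sketch of two ideas above, the Leaky ReLU activation and the residual shortcut, in plain Python (illustrative only; a real CBL block would use a deep learning framework's convolution and batch-norm layers):

```python
def leaky_relu(x, alpha=0.1):
    # Leaky ReLU passes positives unchanged and scales negatives by alpha,
    # so the gradient never dies completely on the negative side.
    return x if x > 0 else alpha * x

def res_unit(x, transform):
    # Residual shortcut: output = input + F(input). Because the identity
    # path is always present, gradients can flow straight through, which
    # helps prevent gradient vanishing or explosion in deep stacks.
    return [xi + ti for xi, ti in zip(x, transform(x))]

print([leaky_relu(v) for v in [-2.0, 0.5]])  # [-0.2, 0.5]
print(res_unit([1.0, 2.0], lambda xs: [leaky_relu(v) for v in xs]))  # [2.0, 4.0]
```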
Output of YOLO_V3----predictions across scales
What are predictions across scales?
It uses feature pyramid networks for reference, with multiple scales to detect targets of different sizes: the finer the grid cell, the finer the objects that can be detected.
The output depth of Yolo_V3 is 255, and the ratio of the edge lengths is 13:26:52. COCO has 80 categories, so each box should output a probability for each category.
Yolo_V3 sets three boxes for each grid cell, so each box needs five basic parameters (x, y, w, h, confidence) plus the probabilities of the 80 categories, giving 3 x (5 + 80) = 255. That is how the 255 comes about.
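The arithmetic behind the output depth can be checked directly:

```python
# Each grid cell predicts 3 boxes; each box carries 5 basic parameters
# (x, y, w, h, confidence) plus one probability per COCO class.
boxes_per_cell = 3
box_params = 5
num_classes = 80
depth = boxes_per_cell * (box_params + num_classes)
print(depth)  # 255
```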
Yolo_V3 implements this multi-scale feature map by up-sampling. It can be seen from the structure chart above that the two tensors joined by each concat have the same scale (the two joins are at the 26 x 26 and 52 x 52 scales respectively); the tensor scales at the concat joins are made equal by (2, 2) up-sampling.
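The up-sample-then-concat step can be illustrated with simple shape bookkeeping (shapes here are (channels, height, width); the specific channel counts are illustrative assumptions, not values from the patent):

```python
def upsample2(shape):
    # (2, 2) up-sampling doubles height and width, leaving channels alone.
    c, h, w = shape
    return (c, 2 * h, 2 * w)

def concat_channels(a, b):
    # concat joins along the channel axis, so spatial sizes must agree;
    # this expands the tensor dimensions rather than adding values.
    assert a[1:] == b[1:], "spatial sizes must match before concat"
    return (a[0] + b[0], a[1], a[2])

deep = (256, 13, 13)              # coarse 13 x 13 map
up = upsample2(deep)              # -> (256, 26, 26)
skip = (512, 26, 26)              # earlier 26 x 26 map
print(concat_channels(up, skip))  # (768, 26, 26)
```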
Each anchor prior (named anchor prior, though this is not the anchor mechanism) consists of two numbers, one representing height and the other representing width.
Yolo_V3 uses logistic regression to predict the bounding box, an operation much like the linear-regression box adjustment in RPN. Each time Yolo_V3 predicts a bounding box, the output is (tx, ty, tw, th, to); the absolute (x, y, w, h, c) is then calculated from these by the corresponding formulas.
Logistic regression is used to give an objectness score to the region surrounded by an anchor, that is, how likely the location is to be the target. This step is done before prediction; it removes unnecessary anchors and reduces the amount of calculation.
If a template box is not optimal, it will not be predicted even if it exceeds the threshold we set. Unlike Faster R-CNN, Yolo_V3 operates on only one prior, the best one: logistic regression is used to find the prior with the highest objectness score among the nine anchor priors, modelling the mapping between a prior and its objectness score with a sigmoid curve.
Prediction of Target Boundary Frame
The Yolo_V3 network makes convolution predictions through (4 + 1 + C) x K convolution kernels of size 1 x 1 on three feature maps, where K is the number of bounding-box priors (K defaults to 3) and C is the number of predicted target categories. Of these, 4K parameters predict the offsets of the target boundary box, K parameters predict the probability that the box contains a target, and C x K parameters predict the probabilities that the K preset boxes correspond to the C target categories. The dotted rectangle in the figure is the preset boundary box; the solid rectangle is the predicted boundary box computed from the offsets predicted by the network. From the centre coordinates and the width and height of the preset boundary box on the feature map, together with the centre offsets and the width/height ratios predicted by the network, the final predicted target boundary box is calculated; the transformation from preset box to final predicted box is shown in the formula on the right side of the figure. The sigmoid function reduces the predicted offset to between 0 and 1, so that the centre of the boundary box stays fixed within one cell; the author states that this accelerates the convergence of the network.
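The preset-box-to-predicted-box transformation described above can be sketched as follows (variable names follow the usual YOLOv3 write-ups: (cx, cy) is the cell's top-left corner and (pw, ph) the prior's size; this is an illustration, not the patent's own code):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    # The sigmoid confines the predicted centre offset to (0, 1), so the
    # box centre stays inside cell (cx, cy); width and height rescale the
    # prior exponentially, which keeps them positive.
    bx = cx + sigmoid(tx)
    by = cy + sigmoid(ty)
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh

# Zero offsets put the centre in the middle of the cell and keep the prior's size.
print(decode_box(0, 0, 0, 0, 5, 5, 2, 3))  # (5.5, 5.5, 2.0, 3.0)
```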
The following figure shows the size of the feature map of three prediction layers and the size of the preset boundary box on each feature map.
Calculation of Loss Function (formula 1)
The loss function of Yolo_V3 is mainly divided into three parts: the loss of target location offset, the loss of target confidence and the loss of target classification, weighted by balance coefficients λ1, λ2, λ3:

L(O, o, C, c, l, g) = λ1 L_conf(o, c) + λ2 L_cla(O, C) + λ3 L_loc(l, g)   (1)
Target Confidence Loss (formula 2)
Target confidence can be understood as the probability that a target exists in the predicted rectangular box. Binary cross entropy is used for the target confidence loss, where o_i ∈ {0, 1} denotes whether a target really exists in predicted target boundary box i (0 denotes non-existence, 1 denotes existence) and ĉ_i is the sigmoid probability that the network predicts a target in rectangular box i:

L_conf(o, c) = −Σ_i (o_i ln ĉ_i + (1 − o_i) ln(1 − ĉ_i)),   ĉ_i = Sigmoid(c_i)   (2)
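Formula 2 is ordinary binary cross entropy; a direct transcription (illustrative, operating on pre-computed sigmoid scores):

```python
import math

def confidence_loss(o, c_hat):
    # o[i] is 1 if box i really contains a target, else 0;
    # c_hat[i] is the sigmoid confidence the network predicts for box i.
    return -sum(oi * math.log(ci) + (1 - oi) * math.log(1 - ci)
                for oi, ci in zip(o, c_hat))

# Confident, correct predictions give a small loss.
loss = confidence_loss([1, 0], [0.9, 0.1])
print(round(loss, 4))  # 0.2107
```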
Target Category Loss (formula 3)
The target category loss also uses the binary cross-entropy loss, where O_ij ∈ {0, 1} indicates whether a class-j target really exists in predicted target boundary box i (0 means non-existence, 1 means existence) and Ĉ_ij is the sigmoid probability that the network predicts box i to contain a class-j target:

L_cla(O, C) = −Σ_{i∈pos} Σ_{j∈cla} (O_ij ln Ĉ_ij + (1 − O_ij) ln(1 − Ĉ_ij)),   Ĉ_ij = Sigmoid(C_ij)   (3)

Target Location Loss (formula 4)
The sum of squares of the differences between the true offsets and the predicted offsets is used, where l̂ denotes the predicted rectangular-box coordinate offsets and ĝ the coordinate offsets between the matched ground-truth box (GT box) and the default box; these are computed from the predicted target box parameters, the default box parameters and the matched real target box parameters, all mapped on the prediction feature map:

L_loc(l, g) = Σ_{i∈pos} Σ_{m∈{x,y,w,h}} (l̂_i^m − ĝ_i^m)²   (4)
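Formula 4 is a plain sum of squared offset differences over the positive boxes; a minimal transcription (boxes given as (x, y, w, h) offset tuples, illustrative only):

```python
def location_loss(pred_offsets, true_offsets):
    # Sum of squared differences between predicted offsets (l-hat) and
    # ground-truth offsets (g-hat), over every positive box and each of
    # the four coordinates x, y, w, h.
    return sum((p - g) ** 2
               for box_p, box_g in zip(pred_offsets, true_offsets)
               for p, g in zip(box_p, box_g))

print(location_loss([(0.5, 0.5, 1.0, 1.0)], [(0.5, 0.5, 1.0, 2.0)]))  # 1.0
```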
Procedure
Step 1: Data Acquisition
In this project, we used MS COCO, a public data set in the field of target detection. Built by Microsoft, it covers detection, segmentation, key-points and other tasks, and mainly addresses three problems: detecting non-iconic views of objects (detection in the usual sense), contextual reasoning between objects, and the precise localization of 2D objects (the usual segmentation problem). On average, a COCO image contains 3.5 categories and 7.7 instance targets, so the data set is large not only in volume but also in the number of types and instances.
Step 2: Data Pre-processing
Then, we pre-process the data set. We extract the features of the data and label them, and convert the XML-format labels of the data set to TXT labels to suit the Yolo_V3 network. This project is pedestrian detection, so we define the picture category as human. Finally, we divide the data set into INRIA_train and INRIA_test for training and testing.
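The XML-to-TXT conversion boils down to re-expressing each box in YOLO's normalised centre/size form. A sketch of the per-box arithmetic (the function and its fields are assumptions for illustration; the patent does not give its conversion code):

```python
def voc_box_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h, cls=0):
    # YOLO TXT labels store: class-id, centre-x, centre-y, width, height,
    # all normalised to [0, 1] by the image dimensions. With a single
    # "human" category, cls is always 0 in this project.
    cx = (xmin + xmax) / 2 / img_w
    cy = (ymin + ymax) / 2 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return f"{cls} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

print(voc_box_to_yolo(0, 0, 100, 200, 200, 400))
# 0 0.250000 0.250000 0.500000 0.500000
```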
Step 3: Training and optimization
In optimizing the parameter set, we put the data into the network for training in batches, in order to reduce the computation and improve the training effect.
Besides, we use transfer learning to improve our Yolo_V3 structure. Transfer learning uses a pre-trained model as a checkpoint from which to train a neural network model for a new task. It transfers common characteristic data and information, avoiding re-learning this knowledge and achieving fast learning.
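The idea can be caricatured with a toy parameter dictionary (a deliberately simplified illustration, not real checkpoint loading): copy the pretrained backbone weights, reinitialise only the task-specific head, and continue training from there.

```python
def transfer(pretrained, head_layers, init=0.0):
    # Start from the checkpoint, keep the shared backbone weights as-is,
    # and reinitialise only the layers named in head_layers for the new task.
    model = dict(pretrained)
    for name in head_layers:
        model[name] = init
    reused = sorted(n for n in model if n not in head_layers)
    return model, reused

ckpt = {"conv1": 0.42, "conv2": -1.3, "head": 7.0}
model, reused = transfer(ckpt, {"head"})
print(reused)         # ['conv1', 'conv2']
print(model["head"])  # 0.0
```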
Step 4: Testing
We adjust the parameters of the network constantly in order to reach optimal performance. We set the number of classes to 1 to make the recognition rate as accurate as possible. Then we put the test set into the network and obtain the recognition accuracy.
The data and recognition rates are as follows:
Table 1: The data before Transfer Learning:
Class Images Targets Recognition rate
all 288 597 0.892
all 288 597 0.909
all 288 597 0.917
Table 2: The data after Transfer Learning:
Class Images Targets Recognition rate
all 288 597 0.925
all 288 597 0.928
all 288 597 0.93
We can observe clearly that the recognition rate is about 90 percent before transfer learning, and after transfer learning it improves to about 93 percent. This shows that transfer learning is a good method for training convolutional neural networks.

Claims (3)

  1. Method of Human detection research and implement based on deep learning, wherein said method fully trains the data, with repeated random selection of training samples from the original data and random selection of the final test results, so that the results are true and reliable.
  2. The method according to claim 1, wherein, to avoid over-fitting, every DBL element in the yolov3 network structure has a BN layer to regularize the data, which ensures the robustness of the network structure; the accuracy of recognition is also improved by parameter adjustment and transfer learning.
  3. The method according to claim 1, wherein, in the process of training, the repeated data are trained repeatedly to prevent accidental situations, so that the results are reliable.
AU2019101224A 2019-10-05 2019-10-05 Method of Human detection research and implement based on deep learning Ceased AU2019101224A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2019101224A AU2019101224A4 (en) 2019-10-05 2019-10-05 Method of Human detection research and implement based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2019101224A AU2019101224A4 (en) 2019-10-05 2019-10-05 Method of Human detection research and implement based on deep learning

Publications (1)

Publication Number Publication Date
AU2019101224A4 true AU2019101224A4 (en) 2020-01-16

Family

ID=69146725

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2019101224A Ceased AU2019101224A4 (en) 2019-10-05 2019-10-05 Method of Human detection research and implement based on deep learning

Country Status (1)

Country Link
AU (1) AU2019101224A4 (en)


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408321A (en) * 2020-03-16 2021-09-17 中国人民解放军战略支援部队信息工程大学 Real-time target detection method and device for lightweight image and video data
CN113408321B (en) * 2020-03-16 2023-08-22 中国人民解放军战略支援部队信息工程大学 Real-time target detection method and device for lightweight image and video data
CN113469321A (en) * 2020-03-30 2021-10-01 聚晶半导体股份有限公司 Object detection device and object detection method based on neural network
US11495015B2 (en) 2020-03-30 2022-11-08 Altek Semiconductor Corp. Object detection device and object detection method based on neural network
CN113077406A (en) * 2020-11-25 2021-07-06 无锡乐骐科技有限公司 Convolution filling method based on optimization
CN113077406B (en) * 2020-11-25 2022-06-14 无锡乐骐科技股份有限公司 Image convolution filling method based on optimization
CN112633174A (en) * 2020-12-23 2021-04-09 电子科技大学 Improved YOLOv4 high-dome-based fire detection method and storage medium
CN112633174B (en) * 2020-12-23 2022-08-02 电子科技大学 Improved YOLOv4 high-dome-based fire detection method and storage medium
CN113052184A (en) * 2021-03-12 2021-06-29 电子科技大学 Target detection method based on two-stage local feature alignment
CN117036234A (en) * 2023-05-09 2023-11-10 中国铁路广州局集团有限公司 Mixed steel rail ultrasonic B-display map damage identification method, system and storage medium

Similar Documents

Publication Publication Date Title
AU2019101224A4 (en) Method of Human detection research and implement based on deep learning
US10586103B2 (en) Topographic data machine learning method and system
Nie et al. Pavement distress detection based on transfer learning
NL2025689B1 (en) Crop pest detection method based on f-ssd-iv3
CN108830188A (en) Vehicle checking method based on deep learning
Aditya et al. Batik classification using neural network with gray level co-occurence matrix and statistical color feature extraction
US11468266B2 (en) Target identification in large image data
CN110543906B (en) Automatic skin recognition method based on Mask R-CNN model
CN104462494A (en) Remote sensing image retrieval method and system based on non-supervision characteristic learning
CN110334594A (en) A kind of object detection method based on batch again YOLO algorithm of standardization processing
CN108831530A (en) Vegetable nutrient calculation method based on convolutional neural networks
Fan et al. A novel sonar target detection and classification algorithm
Moate et al. Vehicle detection in infrared imagery using neural networks with synthetic training data
Shi et al. Underwater dense targets detection and classification based on YOLOv3
Li et al. An outstanding adaptive multi-feature fusion YOLOv3 algorithm for the small target detection in remote sensing images
CN116882486B (en) Method, device and equipment for constructing migration learning weight
CN109271833A (en) Target identification method, device and electronic equipment based on the sparse self-encoding encoder of stack
Hazra et al. Handwritten English character recognition using logistic regression and neural network
Rishita et al. Dog breed classifier using convolutional neural networks
CN106951888B (en) Relative coordinate constraint method and positioning method of human face characteristic point
Nacir et al. YOLO V5 for traffic sign recognition and detection using transfer learning
US20210303757A1 (en) Semiconductor fabrication process parameter determination using a generative adversarial network
Shishkin et al. Implementation of yolov5 for detection and classification of microplastics and microorganisms in marine environment
CN114821098A (en) High-speed pavement damage detection algorithm based on gray gradient fusion characteristics and CNN
Mahara et al. Integrating location information as geohash codes in convolutional neural network-based satellite image classification

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry