CN113673273A - Design method of quantifiable front-end vehicle detection network structure


Info

Publication number
CN113673273A
Authority
CN
China
Prior art keywords
layer
excitation function
input data
result
output
Prior art date
Legal status
Granted
Application number
CN202010400217.7A
Other languages
Chinese (zh)
Other versions
CN113673273B (en)
Inventor
田凤彬
于晓静
Current Assignee
Beijing Ingenic Semiconductor Co Ltd
Original Assignee
Beijing Ingenic Semiconductor Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Ingenic Semiconductor Co Ltd
Priority to CN202010400217.7A
Publication of CN113673273A
Application granted
Publication of CN113673273B
Legal status: Active
Anticipated expiration

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches, based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00 Aspects of pattern recognition specially adapted for signal processing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems


Abstract

The invention provides a design method of a quantifiable front-end vehicle detection network structure, which comprises the following steps. S1, a two-level network is adopted, where the first-level input is a 47×47 gray-scale image and the second-level network input is a 49×49 gray-scale image; a custom excitation function is defined, the result of each layer is quantized to 4 bits, and the boundary data is quantized. S2, the two-level network structure is designed. First-level network: the first three layers use a step size of 2 and the fourth layer uses a step size of 1; the fourth layer uses the custom excitation function and a 3×3 convolution kernel; the fifth layer uses the sigmoid excitation function; the sixth layer is not processed by an excitation function; each layer performs convolution in a non-aligned processing mode. Second-level network: the first layer uses a step size of 1, a 3×3 convolution and the custom excitation function; the second to fifth layers use a step size of 2, 3×3 convolutions and the custom excitation function; the sixth layer uses no excitation function; the seventh layer uses full connection with the sigmoid excitation function; the eighth layer uses full connection without an excitation function.

Description

Design method of quantifiable front-end vehicle detection network structure
Technical Field
The invention relates to the technical field of neural networks, in particular to a design method of a quantifiable front-end vehicle detection network structure.
Background
In today's society, neural network technology in the field of artificial intelligence is developing rapidly, and MTCNN is one of the more popular technologies of recent years. MTCNN (Multi-task Cascaded Convolutional Neural Network) performs face region detection and face keypoint detection together, and can generally be divided into three network stages: P-Net, R-Net and O-Net. This multi-task neural network model for the face detection task mainly uses three cascaded networks, with the idea of adding a classifier to candidate boxes, to perform fast and efficient face detection. The three cascaded networks are P-Net, which quickly generates candidate windows; R-Net, which filters and selects high-precision candidate windows; and O-Net, which generates the final bounding box and the face keypoints.
However, MTCNN cascade detection has the following drawbacks:
1. It uses a three-level cascade network, so the running time for processing a picture of fixed size is long, and missed detections are frequent.
2. The excitation function used is not suitable for the quantization process.
3. The three-level network is used for face detection and is not suitable for vehicle detection.
4. It uses a pooling process that cannot be quantized on some chips, for example those produced by Beijing Ingenic Semiconductor Co., Ltd., so on such chips (e.g., Ingenic chips) only floating-point operation can be performed and the running time is long.
In addition, the following commonly used technical terms appear in the prior art:
1. Cascade: a mode in which several detectors detect in series is called a cascade.
2. Convolution kernel: a convolution kernel is the parameter used to perform an operation against the original image during image processing. The convolution kernel is typically a matrix with a given number of rows and columns (e.g., a 3×3 matrix), with a weight value for each cell of the region. Typical matrix shapes are 1×1, 3×3, 5×5, 7×7, 1×3, 3×1, 2×2, 1×5, 5×1, and so on.
3. Convolution: the center of the convolution kernel is placed on the pixel to be calculated, the product of each element in the kernel and the image pixel value it covers is computed, and the products are summed; the resulting value is the new pixel value at that location. This process is called convolution.
4. Excitation function: a function that processes the result of a convolution.
5. Feature map: the result of the convolution calculation on input data is called a feature map; the result of fully connecting the data is also called a feature map. Feature map size is typically expressed as length × width × depth, or 1 × depth.
6. Step size: the distance, in coordinates, by which the center position of the convolution kernel is shifted at each move.
7. Non-aligned processing at both ends: when an image or data is processed with a convolution kernel of size 3×3 and the data at the edges is insufficient to cover one full kernel, the data on both sides (or on one side) is discarded; this is called non-aligned processing at both ends, as illustrated in the sketch below.
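The output-size arithmetic implied by this non-aligned ("valid") processing can be sketched in Python; this is an illustration added for clarity, not part of the patent text, and the sizes traced in the comments come from the network description below:

    def valid_conv_size(n, k=3, s=1):
        # Output size of a convolution with no padding ("valid"): rows or
        # columns that cannot be covered by a full kernel are discarded.
        return (n - k) // s + 1

    size = 47                        # first-level network input is 47x47
    for _ in range(3):               # the first three layers use step size 2
        size = valid_conv_size(size, k=3, s=2)
        print(size)                  # prints 23, 11, 5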
Disclosure of Invention
In order to solve the problems of the prior art, the invention aims to:
1. Use a two-level cascade whose detection target size is suitable for vehicle detection. The method has a short running time when processing fixed-size pictures, few false detections and high accuracy, which meets actual use requirements, and the detection distance is increased.
2. Eliminate the pooling process by using an excitation function designed to be quantifiable.
Specifically, the invention provides a design method of a quantifiable front-end vehicle detection network structure, which comprises the following steps:
s1, designing a network to adopt a two-stage network, wherein an excitation function is a self-defined function, the first-stage input size is a gray scale image of 47X47, and the second-stage network input size is a gray scale image of 49X 49; defining an excitation function by self, quantizing the calculated results of each layer to 4 bits, and quantizing the boundary data;
s2, designing a secondary network structure:
s2.1, a first-level network:
the first three layers use a step size of 2 and the fourth layer uses a step size of 1; these four layers use the custom excitation function, and each layer uses a 3×3 convolution kernel;
the last two layers, i.e. the fifth and sixth layers, are parallel layers, and both use the calculation result of the fourth layer; the calculation result of the fifth layer is a value used to judge whether the target is a vehicle, and the excitation function used is sigmoid; the calculation result of the sixth layer is used to fine-tune the values of the vehicle detection box and is not processed by any excitation function;
each layer performs convolution in the non-aligned processing mode;
s2.2, a second-level network:
the first layer uses a step size of 1 and a 3×3 convolution, processed with the custom excitation function;
the second to fifth layers, i.e. the second, third, fourth and fifth layers, use a step size of 2 and 3×3 convolutions, processed with the custom excitation function, and from the second layer onward the input data of each layer is the output result of the previous layer;
the input data of the sixth layer is the output result of the fifth layer; full connection is used, without any excitation function;
the last two layers, i.e. the seventh and eighth layers, are parallel layers, and both use the calculation result of the sixth layer; the seventh layer uses full connection with the sigmoid excitation function and judges whether the result is the target; the eighth layer uses full connection without an excitation function and fine-tunes the coordinate values.
In S1, the custom excitation function is defined as follows: let x be the value input to the excitation function and y the value output by the function; the formula is:
[The formula is given in the original as an image (BDA0002489100390000041) and is not reproduced in this text record.]
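From the constraints the description does state (each layer's result is quantized to 4 bits, the output is fixed between 0 and 4, and y = 0 when x < 0.1; see the explanation further below), one plausible reconstruction, offered here only as an assumption, is a clipped-linear function:

    y =
    \begin{cases}
    0 & \text{if } x < 0.1 \\
    x & \text{if } 0.1 \le x \le 4 \\
    4 & \text{if } x > 4
    \end{cases}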
Thus, the present application has the advantages that the method is simple, the recall rate and accuracy of vehicle detection are improved at a small cost in time, and the network can be quantized.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention.
FIG. 1 is a schematic flow diagram of the process of the present invention.
Fig. 2 is a schematic diagram of a first level network model of the network architecture in the method of the present invention.
Fig. 3 is a schematic diagram of a second level network model of the network architecture in the method of the present invention.
Detailed Description
In order that the technical contents and advantages of the present invention can be more clearly understood, the present invention will now be described in further detail with reference to the accompanying drawings.
As shown in Fig. 1, the present invention relates to a design method of a quantifiable front-end vehicle detection network structure, which comprises the following steps:
s1, designing a network to adopt a two-stage network, wherein an excitation function is a self-defined function, the first-stage input size is a gray scale image of 47X47, and the second-stage network input size is a gray scale image of 49X 49; defining an excitation function by self, quantizing the calculated results of each layer to 4 bits, and quantizing the boundary data;
s2, designing a secondary network structure:
s2.1, a first-level network:
the first three layers use a step size of 2 and the fourth layer uses a step size of 1; these four layers use the custom excitation function, and each layer uses a 3×3 convolution kernel;
the last two layers, i.e. the fifth and sixth layers, are parallel layers, and both use the calculation result of the fourth layer; the calculation result of the fifth layer is a value used to judge whether the target is a vehicle, and the excitation function used is sigmoid; the calculation result of the sixth layer is used to fine-tune the values of the vehicle detection box and is not processed by any excitation function;
each layer performs convolution in the non-aligned processing mode;
s2.2, a second-level network:
the first layer uses a step size of 1 and a 3×3 convolution, processed with the custom excitation function;
the second to fifth layers, i.e. the second, third, fourth and fifth layers, use a step size of 2 and 3×3 convolutions, processed with the custom excitation function, and from the second layer onward the input data of each layer is the output result of the previous layer;
the input data of the sixth layer is the output result of the fifth layer; full connection is used, without any excitation function;
the last two layers, i.e. the seventh and eighth layers, are parallel layers, and both use the calculation result of the sixth layer; the seventh layer uses full connection with the sigmoid excitation function and judges whether the result is the target; the eighth layer uses full connection without an excitation function and fine-tunes the coordinate values.
In S1, the custom excitation function is defined as follows: let x be the value input to the excitation function and y the value output by the function; the formula is:
[The formula is given in the original as an image (BDA0002489100390000051) and is not reproduced in this text record; it is the same custom excitation function as in S1 above.]
in said S2.1
The input data of the first layer is a gray scale image 47 multiplied by 1, the depth is 1, the user-defined excitation function is used for processing, and the depth of an output feature image is 16;
the depth of input data is 16, the user-defined excitation function is used for processing, and the depth of an output feature map is 16;
the fourth layer input data depth is 16, processed using the custom excitation function, and the output feature map depth is 32.
In S2.1, the fifth layer has an input data depth of 32 and an output data depth of 1.
In S2.1, the sixth layer has an input data depth of 32 and an output data depth of 4.
In S2.2:
the input data of the first layer is a 49×49×1 gray-scale image, and the output result is 47×47×16;
the input data of the second layer is 47×47×16, and the output feature map result is 23×23×32;
the input data of the third layer is 23×23×32, and the output feature map result is 11×11×48;
the input data of the fourth layer is 11×11×48, and the output feature map result is 5×5×64;
the input data of the fifth layer is 5×5×64, and the output feature map result is 2×2×64.
In S2.2, the input data of the sixth layer is 2×2×64 and the output feature map result is 1×128; the input data of the seventh layer is 1×128 and the output feature map is 1×1; the input data of the eighth layer is 1×128 and the output feature map is 1×4.
The eighth layer in S2.2 is used to fine-tune the four coordinate values.
The technical solution of the present invention can be further explained as follows:
1. The network adopts a two-level structure. The excitation function is a custom function; the first-level input size is a 47×47 gray-scale image, and the second-level network input size is a 49×49 gray-scale image. The custom excitation function is defined as follows: let x be the value input to the excitation function and y the value output by the function; the formula is:
[The formula is given in the original as an image (BDA0002489100390000061) and is not reproduced in this text record.]
Because the calculated result of each layer needs to be quantized to 4 bits, theoretical analysis and extensive experimental testing show that fixing the excitation function output between 0 and 4 yields better accuracy. Because the boundary data is quantized, y = 0 when x < 0.1.
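As a minimal sketch of this behaviour, assuming the clipped-linear reconstruction given above and a uniform 16-level (4-bit) grid over [0, 4], neither of which is confirmed by the reproduced text:

    import numpy as np

    def custom_excitation(x):
        # Assumed reconstruction: output confined to [0, 4],
        # with y = 0 below the stated boundary x < 0.1.
        y = np.clip(x, 0.0, 4.0)
        return np.where(x < 0.1, 0.0, y)

    def quantize_4bit(y):
        # Assumed uniform 4-bit quantization of [0, 4]: 16 levels, step 4/15.
        step = 4.0 / 15.0
        return np.round(y / step) * step

    x = np.array([-1.0, 0.05, 0.5, 2.0, 5.0])
    print(quantize_4bit(custom_excitation(x)))  # values on the 16-level grid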
2. A two-level network architecture.
1) First-level network. The first three layers use a step size of 2 and the fourth layer uses a step size of 1; these four layers use the custom excitation function, and each layer uses a 3×3 convolution kernel.
The input data of the first layer is a 47×47×1 gray-scale image with a depth of 1; it is processed with the custom excitation function, and the output feature map depth is 16.
The input data depth of the second and third layers is 16; they are processed with the custom excitation function, and the output feature map depth is 16. The input data depth of the fourth layer is 16; it is processed with the custom excitation function, and the output feature map depth is 32.
The last two layers, i.e. the fifth and sixth layers, are parallel layers and use the calculation result of the fourth layer. The calculation result of the fifth layer is a value used to judge whether the target is a vehicle, and the excitation function used is sigmoid; this is dictated by the particular nature of the classification, and any other choice would cause a large loss of precision. The input data depth of this layer is 32 and the output data depth is 1.
The calculation result of the sixth layer fine-tunes the values of the vehicle detection box, without any excitation function processing. The input data depth is 32 and the output data depth is 4.
Each layer performs convolution in the non-aligned processing mode; through this non-alignment and the design of the network, all calculation results can be used effectively, so no calculation results are wasted.
The whole network structure must be considered in terms of total computation as well as practical application: it must meet the required accuracy while keeping the amount of computation small. The first-level network is therefore designed so that its computation is small without reducing accuracy. The network flow is shown in Fig. 2.
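A minimal sketch of this first-level topology in Python with PyTorch is given below. The framework, the 3×3 kernel size of the two parallel heads, and the clipped-linear excitation function are assumptions added for illustration; they are not confirmed by the patent text:

    import torch
    import torch.nn as nn

    class FirstLevelNet(nn.Module):
        # First-level network: 47x47x1 gray-scale input, "valid" (non-aligned)
        # 3x3 convolutions with strides 2, 2, 2, 1, then two parallel heads.
        def __init__(self):
            super().__init__()
            self.conv1 = nn.Conv2d(1, 16, 3, stride=2)   # 47 -> 23
            self.conv2 = nn.Conv2d(16, 16, 3, stride=2)  # 23 -> 11
            self.conv3 = nn.Conv2d(16, 16, 3, stride=2)  # 11 -> 5
            self.conv4 = nn.Conv2d(16, 32, 3, stride=1)  # 5  -> 3
            self.cls = nn.Conv2d(32, 1, 3)               # vehicle score, 3 -> 1
            self.box = nn.Conv2d(32, 4, 3)               # box fine-tuning, 3 -> 1

        def act(self, x):
            # Assumed custom excitation: clamp to [0, 4], zero below 0.1.
            x = torch.clamp(x, 0.0, 4.0)
            return torch.where(x < 0.1, torch.zeros_like(x), x)

        def forward(self, x):
            x = self.act(self.conv1(x))
            x = self.act(self.conv2(x))
            x = self.act(self.conv3(x))
            x = self.act(self.conv4(x))
            return torch.sigmoid(self.cls(x)), self.box(x)

    net = FirstLevelNet()
    score, box = net(torch.randn(1, 1, 47, 47))
    print(score.shape, box.shape)  # torch.Size([1, 1, 1, 1]) torch.Size([1, 4, 1, 1])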
2) Second-level network.
The input data of the first layer is a 49×49×1 gray-scale image; the step size is 1, the convolution size is 3×3, the custom excitation function is used, and the output result is 47×47×16.
The input data of the second layer is 47×47×16; the step size is 2, the convolution size is 3×3, the custom excitation function is used, and the output feature map result is 23×23×32.
The input data of the third layer is 23×23×32; the step size is 2, the convolution size is 3×3, the custom excitation function is used, and the output feature map result is 11×11×48.
The input data of the fourth layer is 11×11×48; the step size is 2, the convolution size is 3×3, the custom excitation function is used, and the output feature map result is 5×5×64.
The input data of the fifth layer is 5×5×64; the step size is 2, the convolution size is 3×3, the custom excitation function is used, and the output feature map result is 2×2×64.
The input data of the sixth layer is 2×2×64; full connection is used, without any excitation function, and the output feature map result is 1×128. The seventh and eighth layers are parallel layers.
In the seventh layer, the input data is 1×128, full connection is used with the sigmoid excitation function, and the output feature map is 1×1, the result used to judge whether the target is present.
In the eighth layer, the input data is 1×128, full connection is used without an excitation function, and the output feature map is 1×4, i.e. the four values for fine-tuning the coordinates. The network flow is shown in Fig. 3.
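A matching sketch of the second-level topology, under the same assumptions as the first-level sketch above (PyTorch and the reconstructed excitation function are illustrative choices, not stated in the patent):

    import torch
    import torch.nn as nn

    class SecondLevelNet(nn.Module):
        # Second-level network: 49x49x1 input, "valid" 3x3 convolutions,
        # then a fully connected trunk with two parallel heads.
        def __init__(self):
            super().__init__()
            self.conv1 = nn.Conv2d(1, 16, 3, stride=1)   # 49 -> 47
            self.conv2 = nn.Conv2d(16, 32, 3, stride=2)  # 47 -> 23
            self.conv3 = nn.Conv2d(32, 48, 3, stride=2)  # 23 -> 11
            self.conv4 = nn.Conv2d(48, 64, 3, stride=2)  # 11 -> 5
            self.conv5 = nn.Conv2d(64, 64, 3, stride=2)  # 5  -> 2
            self.fc6 = nn.Linear(2 * 2 * 64, 128)        # no excitation function
            self.fc7 = nn.Linear(128, 1)                 # target score (sigmoid)
            self.fc8 = nn.Linear(128, 4)                 # coordinate fine-tuning

        def act(self, x):
            # Assumed custom excitation, as in the first-level sketch.
            x = torch.clamp(x, 0.0, 4.0)
            return torch.where(x < 0.1, torch.zeros_like(x), x)

        def forward(self, x):
            for conv in (self.conv1, self.conv2, self.conv3, self.conv4, self.conv5):
                x = self.act(conv(x))
            x = self.fc6(x.flatten(1))
            return torch.sigmoid(self.fc7(x)), self.fc8(x)

    net = SecondLevelNet()
    score, coords = net(torch.randn(1, 1, 49, 49))
    print(score.shape, coords.shape)  # torch.Size([1, 1]) torch.Size([1, 4])

In use, the two sketches would cascade in the manner the description implies: windows that the first-level network scores as vehicles are resized to 49×49 and passed to the second-level network for confirmation and coordinate refinement.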
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention; those skilled in the art may make various modifications and changes to the embodiments of the present invention. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.

Claims (8)

1. A method for designing a quantifiable front-end vehicle detection network structure, the method comprising the steps of:
s1, designing a network to adopt a two-stage network, wherein an excitation function is a self-defined function, the first-stage input size is a gray scale image of 47X47, and the second-stage network input size is a gray scale image of 49X 49; defining an excitation function by self, quantizing the calculated results of each layer to 4 bits, and quantizing the boundary data;
s2, designing a secondary network structure:
s2.1, a first-level network:
the first three layers use a step size of 2 and the fourth layer uses a step size of 1; these four layers use the custom excitation function, and each layer uses a 3×3 convolution kernel;
the last two layers, i.e. the fifth and sixth layers, are parallel layers, and both use the calculation result of the fourth layer; the calculation result of the fifth layer is a value used to judge whether the target is a vehicle, and the excitation function used is sigmoid; the calculation result of the sixth layer is used to fine-tune the values of the vehicle detection box and is not processed by any excitation function;
each layer performs convolution in the non-aligned processing mode;
s2.2, a second-level network:
the first layer uses a step size of 1 and a 3×3 convolution, processed with the custom excitation function;
the second to fifth layers, i.e. the second, third, fourth and fifth layers, use a step size of 2 and 3×3 convolutions, processed with the custom excitation function, and from the second layer onward the input data of each layer is the output result of the previous layer;
the input data of the sixth layer is the output result of the fifth layer; full connection is used, without any excitation function;
the last two layers, i.e. the seventh and eighth layers, are parallel layers, and both use the calculation result of the sixth layer; the seventh layer uses full connection with the sigmoid excitation function and judges whether the result is the target; the eighth layer uses full connection without an excitation function and fine-tunes the coordinate values.
2. The method of claim 1, wherein in S1 the custom excitation function is defined as follows: let x be the value input to the excitation function and y the value output by the function; the formula is:
[The formula is given in the original as an image (FDA0002489100380000021) and is not reproduced in this text record.]
3. The method of claim 1, wherein in S2.1:
the input data of the first layer is a 47×47×1 gray-scale image with a depth of 1; it is processed with the custom excitation function, and the output feature map depth is 16;
the input data depth of the second and third layers is 16; they are processed with the custom excitation function, and the output feature map depth is 16;
the input data depth of the fourth layer is 16; it is processed with the custom excitation function, and the output feature map depth is 32.
4. The method of claim 1, wherein in S2.1 the fifth layer has an input data depth of 32 and an output data depth of 1.
5. The method of claim 1, wherein in S2.1 the sixth layer has an input data depth of 32 and an output data depth of 4.
6. The method of claim 1, wherein in S2.2:
the input data of the first layer is a 49×49×1 gray-scale image, and the output result is 47×47×16;
the input data of the second layer is 47×47×16, and the output feature map result is 23×23×32;
the input data of the third layer is 23×23×32, and the output feature map result is 11×11×48;
the input data of the fourth layer is 11×11×48, and the output feature map result is 5×5×64;
the input data of the fifth layer is 5×5×64, and the output feature map result is 2×2×64.
7. The method of claim 1, wherein in S2.2 the input data of the sixth layer is 2×2×64 and the output feature map result is 1×128; the input data of the seventh layer is 1×128 and the output feature map is 1×1; and the input data of the eighth layer is 1×128 and the output feature map is 1×4.
8. The method of claim 1, wherein the eighth layer in S2.2 is used to fine-tune the four coordinate values.

Priority Applications (1)

Application Number: CN202010400217.7A (CN113673273B) - Priority Date: 2020-05-13 - Filing Date: 2020-05-13 - Title: Design method of quantifiable front-end vehicle detection network structure


Publications (2)

Publication Number - Publication Date
CN113673273A - 2021-11-19
CN113673273B - 2023-05-12

Family

Family ID: 78536808

Family Applications (1)

Application Number: CN202010400217.7A - Title: Design method of quantifiable front-end vehicle detection network structure - Priority/Filing Date: 2020-05-13 - Status: Active (granted as CN113673273B)

Country Status (1)

Country: CN - Publication: CN113673273B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018003212A1 (en) * 2016-06-30 2018-01-04 Clarion Co., Ltd. Object detection device and object detection method
US20180157938A1 (en) * 2016-12-07 2018-06-07 Samsung Electronics Co., Ltd. Target detection method and apparatus
US20190377930A1 (en) * 2018-06-11 2019-12-12 Zkteco Usa, Llc Method and System for Face Recognition Via Deep Learning
CN110569971A (en) * 2019-09-09 2019-12-13 吉林大学 convolutional neural network single-target identification method based on LeakyRelu activation function
CN110619319A (en) * 2019-09-27 2019-12-27 北京紫睛科技有限公司 Improved MTCNN model-based face detection method and system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BAO Xiao'an; HU Lingling; ZHANG Na; WU Biao; GUI Jiangsheng: "Fast face detection algorithm based on cascaded networks", Journal of Zhejiang Sci-Tech University (Natural Sciences Edition) *
BAI Bo; XIE Gang; XU Xinying: "Research on a taillight detection algorithm based on multi-scale cascaded R-FCN", Computer Engineering and Applications *

Also Published As

Publication number - Publication date
CN113673273B (en) - 2023-05-12


Legal Events

PB01 - Publication
SE01 - Entry into force of request for substantive examination
GR01 - Patent grant