CN109447239B - Embedded convolutional neural network acceleration method based on ARM - Google Patents


Info

Publication number
CN109447239B
CN109447239B CN201811121051.4A
Authority
CN
China
Prior art keywords
convolution
neural network
convolutional neural
neon
characteristic diagram
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811121051.4A
Other languages
Chinese (zh)
Other versions
CN109447239A (en)
Inventor
毕盛 (Bi Sheng)
张英杰 (Zhang Yingjie)
董敏 (Dong Min)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201811121051.4A priority Critical patent/CN109447239B/en
Publication of CN109447239A publication Critical patent/CN109447239A/en
Application granted granted Critical
Publication of CN109447239B publication Critical patent/CN109447239B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an ARM-based embedded convolutional neural network acceleration method, which addresses the limited hardware resources of embedded devices and the high computational complexity of convolutional neural networks. The time-consuming 1 × 1 convolution and 3 × 3 depth separable convolution, which are commonly used in lightweight convolutional neural networks, are optimized using the ARM NEON technology. Specifically, the 1 × 1 convolution first undergoes memory rearrangement and is then vectorized with ARM NEON, while the 3 × 3 depth separable convolution is vectorized with ARM NEON directly. This accelerates the calculation of the convolutional neural network and makes full use of the hardware computing resources of the embedded device, so that a convolutional neural network deployed on an embedded terminal runs faster and becomes more practical.

Description

Embedded convolutional neural network acceleration method based on ARM
Technical Field
The invention relates to the technical field of embedded convolutional neural network acceleration, in particular to an embedded convolutional neural network acceleration method based on ARM.
Background
Deep learning algorithms based on convolutional neural networks have enjoyed great success in many fields of computer vision. However, as the performance of deep convolutional neural networks has improved, their parameter counts and computational costs have grown. Because deep convolutional neural networks demand so much computing power, deploying them on devices with limited computing resources, such as embedded devices, has become a challenge.
At present, a viable approach is to design a lightweight convolutional neural network structure and deploy it onto embedded devices for commercial applications. Although most object detection networks are designed for PC-class devices, many feature extraction networks are designed specifically for embedded devices.
In the paper "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", Forrest N. Iandola et al. designed the network around bottleneck-shaped modules, achieving the same level of accuracy on the ImageNet dataset with fewer parameters and a smaller network structure. MobileNetV1, which uses depth separable convolution, was proposed in the paper "MobileNets: Efficient convolutional neural networks for mobile vision applications"; it has two hyper-parameters that can be used to trade off accuracy against computational complexity. In the paper "ShuffleNet: An extremely efficient convolutional neural network for mobile devices", ShuffleNet uses channel shuffle operations and group convolutions to reduce computational complexity, achieving higher accuracy than MobileNetV1.
Although many lightweight convolutional neural networks exist, deploying them directly onto embedded devices is not efficient.
Disclosure of Invention
The invention aims to overcome the shortage of hardware computing resources on ARM embedded devices, and provides an ARM-based embedded convolutional neural network acceleration method that deploys a deep convolutional neural network on a terminal while making full use of the embedded device's resources.
To achieve this purpose, the technical scheme provided by the invention is as follows. An ARM-based embedded convolutional neural network acceleration method comprises the following steps:
1) training a lightweight convolutional neural network by using a deep learning framework;
2) exporting the trained convolutional neural network structure and weight to a file;
3) a program imports the weight file from step 2) and implements the forward calculation of the neural network according to the trained network structure from step 2);
4) the time-consuming 1 × 1 convolution and 3 × 3 depth separable convolution of the neural network are optimized using the NEON technology;
5) the corresponding operations in step 3) are replaced with the optimized 1 × 1 convolution and 3 × 3 depth separable convolution from step 4), accelerating the operation of the convolutional neural network.
Further, in step 2), regarding the storage method for the convolutional neural network structure and weights: the network structure, weights and other structured data are serialized using the Protocol Buffers tool.
Further, in step 4), the optimization method for the 1 × 1 convolution is as follows: the input feature map and the convolution kernels of the 1 × 1 convolution first undergo memory rearrangement so that they conform to the principle of memory locality during the 1 × 1 convolution calculation, and the calculation process is then optimized with the NEON single instruction multiple data (SIMD) technology to reduce the 1 × 1 convolution operation time.
Further, in step 4), the optimization method for the 3 × 3 depth separable convolution is as follows: during the optimization, NEON registers are used to store four adjacent elements of an output channel, and the vectorized calculation of one NEON register proceeds as follows:
Let the element value of a specific channel of the input feature map be denoted I_{m,n}, where m and n are the horizontal and vertical coordinates; a register R_{m,n} in the input feature map is:
R_{m,n} = (I_{m,n}, I_{m,n+1}, I_{m,n+2}, I_{m,n+3}), n % 4 ≠ 0
Let the element value of a specific channel of the output feature map be denoted O_{x,y}, where x and y are the horizontal and vertical coordinates; the register OutR_{x,y} in the output feature map is:
OutR_{x,y} = (O_{x,y}, O_{x,y+1}, O_{x,y+2}, O_{x,y+3}), y % 4 = 1
According to the operation rule of convolution:
O_{x,y} = Σ_{i=1}^{3} Σ_{j=1}^{3} k_{i,j} · I_{i+x-1, j+y-1}
where k_{i,j} denotes the corresponding convolution kernel element. Then
OutR_{x,y} = Σ_{i=1}^{3} Σ_{j=1}^{3} k_{i,j} · (I_{i+x-1, j+y-1}, I_{i+x-1, j+y}, I_{i+x-1, j+y+1}, I_{i+x-1, j+y+2})
which gives
OutR_{x,y} = Σ_{i=1}^{3} Σ_{j=1}^{3} k_{i,j} · R_{i+x-1, j+y-1}
where OutR_{x,y} and R_{i+x-1, j+y-1} are both registers, i.e. the optimized result of the 3 × 3 depth separable convolution is obtained through vector operations.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention overcomes the shortage of computing resources on ARM embedded devices and achieves efficient deployment of convolutional neural networks on embedded platforms.
2. The invention rearranges the memory of the 1 × 1 convolution's input and convolution kernels, greatly improving the computational efficiency of the 1 × 1 convolution without increasing memory usage.
3. For the 3 × 3 depth separable convolution, the invention avoids memory rearrangement and its extra memory usage, optimizing the calculation directly through vectorization.
4. The invention is broadly applicable in the embedded deployment of neural networks; vectorized optimization of the computation maximizes utilization of the embedded device's hardware resources.
Drawings
FIG. 1 is a schematic diagram of the 1 × 1 convolution optimization method of the present invention.
FIG. 2 is a schematic diagram of the 3 × 3 depth separable convolution optimization method of the present invention.
FIG. 3 is a flow chart of the present invention.
Detailed Description
The optimization method of the present invention is described in further detail below with reference to the drawings and MobileNetV1, but the invention is equally applicable to other neural networks that use 1 × 1 convolution and 3 × 3 depth separable convolution.
As shown in fig. 3, the embedded convolutional neural network acceleration method based on ARM provided by the present invention includes the following steps:
step 1, training a lightweight convolutional neural network MobileNet V1 by using Caffe or other deep learning frameworks.
Step 2: export the trained network structure and weights of MobileNetV1 to a file.
Step 3: a program imports the weight file and implements the forward calculation of the neural network according to the trained network structure. Each layer of the neural network can be represented by a function whose parameters include the layer's specification, the input feature map and the layer's weights, and whose return value is the output feature map. The layer functions are then chained together to produce the output of the neural network.
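As a rough sketch of the layer-as-function idea described above, the Python below chains per-layer functions into a forward pass. All names, shapes and values are invented for illustration and are not taken from the patent.

```python
def conv1x1(weights, fmap):
    # fmap: list of input channels, each a 2-D list; weights[o][i] is the
    # scalar 1x1 kernel connecting input channel i to output channel o.
    h, w = len(fmap[0]), len(fmap[0][0])
    out = []
    for wo in weights:
        ch = [[sum(k * fm[y][x] for k, fm in zip(wo, fmap)) for x in range(w)]
              for y in range(h)]
        out.append(ch)
    return out

def relu(fmap):
    # Element-wise activation, one example of a non-convolution layer.
    return [[[max(0.0, v) for v in row] for row in ch] for ch in fmap]

def forward(layers, x):
    # Chain the per-layer functions to obtain the network output.
    for fn in layers:
        x = fn(x)
    return x

# Tiny example: one 1x1 conv (2 channels in -> 1 out) followed by ReLU.
fmap = [[[1.0, -2.0]], [[3.0, 4.0]]]           # 2 channels, each 1x2
layers = [lambda x: conv1x1([[1.0, 1.0]], x), relu]
print(forward(layers, fmap))                    # [[[4.0, 2.0]]]
```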
Step 4a: optimize the time-consuming 1 × 1 convolution of the neural network using the NEON technology.
In network structures such as MobileNetV1, depth separable convolutions and 1 × 1 convolutions replace ordinary convolution operations to reduce the computational complexity of convolution. The 1 × 1 convolution can also be used to raise or lower the channel dimensionality of a feature map: for example, the network's computational cost can be reduced by lowering the dimensionality of the original feature map, convolving in the reduced space, and then restoring the dimensionality with another 1 × 1 convolution.
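The saving from this bottleneck pattern can be illustrated with a rough multiply-accumulate count; the layer sizes below are made up for illustration and do not come from the patent.

```python
def conv_macs(h, w, cin, cout, k):
    # Multiply-accumulate count of a k x k convolution over an h x w map.
    return h * w * cin * cout * k * k

H = W = 56  # assumed feature-map size
direct = conv_macs(H, W, 256, 256, 3)            # plain 3x3, 256 -> 256
bottleneck = (conv_macs(H, W, 256, 64, 1)        # 1x1 reduce to 64 channels
              + conv_macs(H, W, 64, 64, 3)       # 3x3 in the reduced space
              + conv_macs(H, W, 64, 256, 1))     # 1x1 restore to 256
print(direct, bottleneck, direct / bottleneck)   # roughly an 8x reduction
```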
However, the computational complexity of the 1 × 1 convolution is also very large, so that optimizing the 1 × 1 convolution can significantly increase the operating speed of the convolutional neural network.
As shown in fig. 1, the 1 × 1 convolution multiplies the 1 × 1 convolution kernel with the elements at a given position of the input feature map and sums the products to obtain the element value at that position of the output feature map.
In the implementation of the convolution operation, both the feature map and the convolution kernels are stored as one-dimensional arrays, and each channel of the feature map is aligned to a chosen byte boundary, here 16 bytes: one NEON register holds 4 float values, so data are read 16 bytes at a time in register-sized units, and such reads are more efficient when the data are stored 16-byte aligned. The memory distribution section of fig. 1 shows the layout of the feature map and convolution kernels in memory.
The optimization of the 1 × 1 convolution mainly rearranges the memory distribution of the input feature graph and the convolution kernel, so that the memory locality principle, namely the temporal locality and spatial locality principle, is satisfied as much as possible when the output feature is circularly calculated.
The specific optimization process is as follows. During each 1 × 1 convolution operation, the output feature maps are divided into groups of 8 and distributed evenly across the CPU cores of the embedded device using OpenMP, to make full use of hardware resources. Within every group of 8 output feature maps, the 8 maps are computed simultaneously in units of 1 × 8 blocks. Computing the first 1 × 8 block requires accessing a column of the input feature map, so the input channels are grouped in units of 4 and the corresponding four columns are rearranged into contiguous memory, making the accesses more efficient. Similarly, rearranging the 1 × 1 convolution kernels in access order greatly improves access efficiency.
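The rearrangement idea can be sketched in scalar Python; the sizes, layout and grouping below are simplified illustrations of the scheme above, not the patent's actual code, and the toy assumes the channel count equals the group size of 4.

```python
C, H, W = 4, 1, 2          # channels, height, width (toy sizes)
# Raw layout: channel planes stored one after another (poor locality when a
# single output pixel needs a "column" of values across all channels).
flat = [c * 10.0 + p for c in range(C) for p in range(H * W)]

def rearrange(flat, C, HW, block=4):
    # For each group of `block` channels, store the `block` channel values of
    # every spatial position contiguously (position-major inside the group).
    out = []
    for g in range(0, C, block):
        for p in range(HW):
            for c in range(g, min(g + block, C)):
                out.append(flat[c * HW + p])
    return out

re = rearrange(flat, C, H * W)
kernel = [1.0, 2.0, 3.0, 4.0]   # one 1x1 kernel over the 4 input channels

# After rearrangement, output pixel p is a dot product over a contiguous
# slice of `re` - a single sequential read, which the NEON loads exploit.
out = [sum(k * v for k, v in zip(kernel, re[p * C:(p + 1) * C]))
       for p in range(H * W)]

# Reference: the same dot product gathered with strided reads from the raw
# layout; the values must match.
ref = [sum(kernel[c] * flat[c * (H * W) + p] for c in range(C))
       for p in range(H * W)]
assert out == ref
print(out)   # [200.0, 210.0]
```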
Thus, in the calculation, the 8 output feature maps (one 1 × 8 block each) occupy 16 NEON registers, the 4 input feature maps (one 1 × 8 block each) occupy 8 NEON registers, and the 8 1 × 1 convolution kernels occupy the remaining 8, making exactly 32, so the 32 NEON registers of AArch64 are fully utilized in the calculation.
Step 4b: optimize the 3 × 3 depth separable convolution of the neural network using the NEON technology.
Optimizing the 3 × 3 depth separable convolution differs from the 1 × 1 case. In the 3 × 3 depth separable convolution, each output pixel depends on the surrounding 3 × 3 input pixels, so rearranging memory as in the 1 × 1 optimization would incur extra memory overhead, because the 3 × 3 blocks largely overlap. Instead, the optimization vectorizes the computation of the 3 × 3 depth separable convolution directly.
As shown in fig. 2, in the 3 × 3 depth separable convolution the number of channels of the input and output feature maps stays the same and the channels correspond one to one: the first convolution kernel is convolved with the first channel of the input feature map to obtain the first channel of the output feature map.
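A minimal scalar reference of the per-channel computation described above; the sizes are illustrative, and stride 1 with no padding is assumed (the patent text does not specify these).

```python
def depthwise3x3(fmap, kernels):
    # fmap: [C][H][W]; kernels: [C][3][3]; returns [C][H-2][W-2].
    # Output channel c depends only on input channel c and kernel c.
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    out = []
    for c in range(C):
        ch = [[sum(kernels[c][i][j] * fmap[c][y + i][x + j]
                   for i in range(3) for j in range(3))
               for x in range(W - 2)]
              for y in range(H - 2)]
        out.append(ch)
    return out

fmap = [[[float(y * 3 + x) for x in range(3)] for y in range(3)]]  # 1 ch, 3x3
ident = [[[0.0] * 3, [0.0, 1.0, 0.0], [0.0] * 3]]  # picks the centre pixel
print(depthwise3x3(fmap, ident))   # [[[4.0]]] - the centre element
```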
During the optimization of the 3 × 3 depth separable convolution, NEON registers are used to store the final computed results of the output feature map. As shown in fig. 2, four NEON registers store a 2 × 8 block of the output feature map, and the contents of those four NEON registers are then computed by vectorized NEON operations.
The vectorization calculation process for one NEON register is as follows:
the element value of a specific channel of the input characteristic diagram is recorded as Ix,yAnd x and y are horizontal and vertical coordinates respectively. The registers in the input signature graph are:
Rm,n=(Im,n,Im,n+1,Im,n+2,Im,n+3),n%4≠0
recording the element value of a specific channel of the output characteristic diagram as Ox,yAnd x and y are horizontal and vertical coordinates respectively. Register OutR in output signature graphx,yComprises the following steps:
OutRx,y=(Ox,y,Ox,y+1,Ox,y+2,Ox,y+3),x%4=1
the operation rule according to convolution includes:
Figure BDA0001811280480000071
wherein k isi,jRepresenting the corresponding convolution kernel. Then
Figure BDA0001811280480000072
Can obtain
Figure BDA0001811280480000073
Wherein OutRx,yAnd Ri+x-1,j+y-1Are all hostThe memory, we have obtained the optimized result of 3 x 3 depth separable convolution by vector operation.
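The register identity OutR_{x,y} = Σ_{i,j} k_{i,j} · R_{i+x-1, j+y-1} can be checked numerically by modelling a 4-lane register as a Python tuple and doing lane-wise multiply-accumulate, as a NEON multiply-accumulate would; all values below are invented for illustration.

```python
def load_r(img, m, n):
    # R_{m,n} = (I_{m,n}, I_{m,n+1}, I_{m,n+2}, I_{m,n+3})
    return tuple(img[m][n + t] for t in range(4))

def scalar_conv(img, k, x, y):
    # Plain 3x3 convolution at output position (x, y), 1-indexed kernel.
    return sum(k[i][j] * img[i + x - 1][j + y - 1]
               for i in range(1, 4) for j in range(1, 4))

img = [[float((m * 7 + n * 3) % 11) for n in range(8)] for m in range(8)]
# k[i][j] for i, j in 1..3; index 0 is unused padding to keep 1-based indices.
k = [[0.0] * 4] + [[0.0] + [float(i * j) for j in range(1, 4)]
                   for i in range(1, 4)]

x, y = 2, 2                     # start of the 4-wide output block
out_r = (0.0, 0.0, 0.0, 0.0)
for i in range(1, 4):
    for j in range(1, 4):       # lane-wise multiply-accumulate per (i, j)
        r = load_r(img, i + x - 1, j + y - 1)
        out_r = tuple(acc + k[i][j] * lane for acc, lane in zip(out_r, r))

# The four lanes must equal the scalar convolution at the four adjacent
# output positions (x, y), (x, y+1), (x, y+2), (x, y+3).
assert out_r == tuple(scalar_conv(img, k, x, y + t) for t in range(4))
print(out_r)
```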
During the 3 × 3 depth separable convolution calculation, the 3 × 3 convolution kernel is stored in 3 NEON registers to facilitate the calculation.
Finally, OpenMP distributes the output feature map channels evenly across the CPU cores to accelerate the calculation and improve hardware resource utilization.
Step 5: replace the corresponding operations in the convolutional neural network with the optimized 1 × 1 convolution and 3 × 3 depth separable convolution, accelerating the operation of the convolutional neural network.
The effects of the present invention will be further described below with reference to experiments.
The hardware used in this experiment is the Firefly-RK3399. The board adopts the Rockchip RK3399 six-core chip solution, with two Cortex-A72 big cores and four Cortex-A53 little cores at a main frequency of 2.0 GHz. The software system is Ubuntu 16.04, and the GCC version is 5.4.0.
In the experiment, OpenCV first reads the camera video stream and extracts image frames. The loaded images are then preprocessed, mainly scaling and mean subtraction, to improve the accuracy of the deep learning model. Protocol Buffers then reads the MobileNetV1 model trained on Caffe into memory, and a directed acyclic graph is built for computing the neural network. The preprocessed image is taken as the input of the neural network, and the forward calculation yields the network output, i.e. the probability of each object appearing in the image. Finally, the results with high confidence are selected from the network output and displayed.
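A sketch of the mean-and-scale preprocessing mentioned above; the mean and scale constants are assumed typical MobileNet-style values, not taken from the patent, and resizing is omitted.

```python
MEAN = (103.94, 116.78, 123.68)   # assumed per-channel (BGR) means
SCALE = 0.017                     # assumed scale factor

def preprocess(pixels):
    # pixels: [H][W][3] BGR values in 0..255 -> normalised floats.
    return [[[(v - MEAN[c]) * SCALE for c, v in enumerate(px)]
             for px in row]
            for row in pixels]

img = [[[103.94, 116.78, 123.68]]]      # a single "mean-valued" pixel
print(preprocess(img))                  # [[[0.0, 0.0, 0.0]]]
```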
After deployment, MobileNetV1 reaches 12 FPS on the Firefly-RK3399 embedded platform, 6 times the speed of running directly with the Caffe framework (about 2 FPS), showing that the NEON optimization of the 1 × 1 convolution and the 3 × 3 depth separable convolution clearly improves the running performance of the convolutional neural network.
The embodiments above are merely preferred embodiments of the present invention; the scope of the invention is not limited thereto, and changes to the shapes and principles of the present invention shall fall within its protection scope.

Claims (2)

1. An embedded convolutional neural network acceleration method based on ARM is characterized by comprising the following steps:
1) training a lightweight convolutional neural network by using a deep learning framework;
2) exporting the trained convolutional neural network structure and weight to a file;
3) a program imports the weight file from step 2) and implements the forward calculation of the neural network according to the trained network structure from step 2);
4) the time-consuming 1 × 1 convolution and 3 × 3 depth separable convolution of the neural network are optimized using the NEON technology;
the optimization method for the 1 × 1 convolution is as follows: the input feature map and the convolution kernels of the 1 × 1 convolution first undergo memory rearrangement so that they conform to the principle of memory locality during the 1 × 1 convolution calculation, and the calculation process is then optimized with the NEON single instruction multiple data technology to reduce the 1 × 1 convolution operation time;
in the optimization process of the 3 × 3 depth separable convolution, NEON registers are used to store four adjacent elements of an output channel, and the vectorized calculation of one NEON register is as follows:
let the element value of a specific channel of the input feature map be denoted I_{m,n}, where m and n are the horizontal and vertical coordinates; a register R_{m,n} in the input feature map is:
R_{m,n} = (I_{m,n}, I_{m,n+1}, I_{m,n+2}, I_{m,n+3}), n % 4 ≠ 0
let the element value of a specific channel of the output feature map be denoted O_{x,y}, where x and y are the horizontal and vertical coordinates; the register OutR_{x,y} in the output feature map is:
OutR_{x,y} = (O_{x,y}, O_{x,y+1}, O_{x,y+2}, O_{x,y+3}), y % 4 = 1
according to the operation rule of convolution:
O_{x,y} = Σ_{i=1}^{3} Σ_{j=1}^{3} k_{i,j} · I_{i+x-1, j+y-1}
where k_{i,j} denotes the corresponding convolution kernel element; then
OutR_{x,y} = Σ_{i=1}^{3} Σ_{j=1}^{3} k_{i,j} · (I_{i+x-1, j+y-1}, I_{i+x-1, j+y}, I_{i+x-1, j+y+1}, I_{i+x-1, j+y+2})
so that
OutR_{x,y} = Σ_{i=1}^{3} Σ_{j=1}^{3} k_{i,j} · R_{i+x-1, j+y-1}
where OutR_{x,y} and R_{i+x-1, j+y-1} are both registers, i.e. the optimized result of the 3 × 3 depth separable convolution is obtained through vector operations;
5) the corresponding operations in step 3) are replaced with the optimized 1 × 1 convolution and 3 × 3 depth separable convolution from step 4), accelerating the operation of the convolutional neural network.
2. The ARM-based embedded convolutional neural network acceleration method of claim 1, characterized in that in step 2), the storage method for the convolutional neural network structure and weights is: the network structure, weights and other structured data are serialized using the Protocol Buffers tool.
CN201811121051.4A 2018-09-26 2018-09-26 Embedded convolutional neural network acceleration method based on ARM Active CN109447239B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811121051.4A CN109447239B (en) 2018-09-26 2018-09-26 Embedded convolutional neural network acceleration method based on ARM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811121051.4A CN109447239B (en) 2018-09-26 2018-09-26 Embedded convolutional neural network acceleration method based on ARM

Publications (2)

Publication Number Publication Date
CN109447239A CN109447239A (en) 2019-03-08
CN109447239B true CN109447239B (en) 2022-03-25

Family

ID=65544337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811121051.4A Active CN109447239B (en) 2018-09-26 2018-09-26 Embedded convolutional neural network acceleration method based on ARM

Country Status (1)

Country Link
CN (1) CN109447239B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516796A (en) * 2019-08-28 2019-11-29 西北工业大学 A kind of grouping convolution process optimization method of Embedded platform
CN111008629A (en) * 2019-12-07 2020-04-14 怀化学院 Cortex-M3-based method for identifying number of tip
CN114742211B (en) * 2022-06-10 2022-09-23 南京邮电大学 Convolutional neural network deployment and optimization method facing microcontroller

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015083199A1 (en) * 2013-12-04 2015-06-11 J Tech Solutions, Inc. Computer device and method executed by the computer device
CN107704921A (en) * 2017-10-19 2018-02-16 北京智芯原动科技有限公司 The algorithm optimization method and device of convolutional neural networks based on Neon instructions

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015083199A1 (en) * 2013-12-04 2015-06-11 J Tech Solutions, Inc. Computer device and method executed by the computer device
CN107704921A (en) * 2017-10-19 2018-02-16 北京智芯原动科技有限公司 The algorithm optimization method and device of convolutional neural networks based on Neon instructions

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
3x3 convolution optimized speed using (NEON SIMD) or (NEON SIMD and OpenMP) on S7/Note7; jaeho; Google; 2016-09-27; pages 1-2 *
Acceleration and compression of convolutional neural networks; Chen Weijie; China Masters' Theses Full-text Database, Information Science and Technology; 2018-07-15; pages 20-50 *
Tencent Youtu open-sources the deep learning framework ncnn; CRI Online; web page; 2017-07-25; pages 1-5 *

Also Published As

Publication number Publication date
CN109447239A (en) 2019-03-08

Similar Documents

Publication Publication Date Title
CN108765247B (en) Image processing method, device, storage medium and equipment
JP7431913B2 (en) Efficient data layout for convolutional neural networks
US10394929B2 (en) Adaptive execution engine for convolution computing systems
US11392822B2 (en) Image processing method, image processing apparatus, and computer-readable storage medium
Cho et al. MEC: Memory-efficient convolution for deep neural network
US10706348B2 (en) Superpixel methods for convolutional neural networks
US11609968B2 (en) Image recognition method, apparatus, electronic device and storage medium
JP7007488B2 (en) Hardware-based pooling system and method
CN109447239B (en) Embedded convolutional neural network acceleration method based on ARM
US20190340510A1 (en) Sparsifying neural network models
CN106846235B (en) Convolution optimization method and system accelerated by NVIDIA Kepler GPU assembly instruction
CN108388537B (en) Convolutional neural network acceleration device and method
CN109522902B (en) Extraction of space-time feature representations
CN110796236B (en) Vectorization implementation method for pooling of multi-sample multi-channel convolutional neural network
US20190318461A1 (en) Histogram Statistics Circuit and Multimedia Processing System
CN114461978B (en) Data processing method and device, electronic equipment and readable storage medium
CN112990157B (en) Image target identification acceleration system based on FPGA
CN113705803A (en) Image hardware identification system based on convolutional neural network and deployment method
CN114003201A (en) Matrix transformation method and device and convolutional neural network accelerator
US11481994B2 (en) Method and apparatus for extracting image data in parallel from multiple convolution windows, device, and computer-readable storage medium
CN116524180A (en) Dramatic stage scene segmentation method based on lightweight backbone structure
CN117897708A (en) Parallel depth-by-depth processing architecture for neural networks
CN112927125B (en) Data processing method, device, computer equipment and storage medium
CN110930290B (en) Data processing method and device
Jinguji et al. Weight sparseness for a feature-map-split-cnn toward low-cost embedded fpgas

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant