CN109447239B - Embedded convolutional neural network acceleration method based on ARM - Google Patents


Info

Publication number
CN109447239B
CN109447239B CN201811121051.4A
Authority
CN
China
Prior art keywords
convolution
neural network
convolutional neural
neon
characteristic diagram
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811121051.4A
Other languages
Chinese (zh)
Other versions
CN109447239A (en)
Inventor
毕盛 (Bi Sheng)
张英杰 (Zhang Yingjie)
董敏 (Dong Min)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201811121051.4A priority Critical patent/CN109447239B/en
Publication of CN109447239A publication Critical patent/CN109447239A/en
Application granted granted Critical
Publication of CN109447239B publication Critical patent/CN109447239B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an ARM-based embedded convolutional neural network acceleration method, which addresses the limited hardware resources of embedded devices and the high computational complexity of convolutional neural networks. The time-consuming 1 × 1 convolution and 3 × 3 depth separable convolution, which are commonly used in lightweight convolutional neural networks, are optimized using the ARM NEON technology. Specifically, the 1 × 1 convolution first undergoes memory rearrangement and is then vectorized with ARM NEON, while the 3 × 3 depth separable convolution is vectorized with ARM NEON directly. This accelerates the calculation of the convolutional neural network and makes full use of the hardware computing resources of the embedded device, so that a convolutional neural network deployed on an embedded terminal runs faster and becomes more practical.

Description

Embedded convolutional neural network acceleration method based on ARM
Technical Field
The invention relates to the technical field of embedded convolutional neural network acceleration, in particular to an embedded convolutional neural network acceleration method based on ARM.
Background
Deep learning algorithms based on convolutional neural networks have enjoyed great success in many fields of computer vision. However, as the performance of deep convolutional neural networks has improved, their parameter counts and computational costs have grown. Because deep convolutional neural networks demand so much computing power, deploying them on devices with limited computing resources, such as embedded devices, has become a challenge.
At present, a viable approach is to design a lightweight convolutional neural network structure and deploy it onto embedded devices for commercial applications. Although most object detection networks are designed for PC-class devices, many feature extraction networks are designed specifically for embedded devices.
In the paper "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", Forrest N. Iandola et al. designed the network around bottleneck-shaped modules, achieving the same level of accuracy on the ImageNet dataset with fewer parameters and a smaller network structure. MobileNetV1, which uses depth separable convolution, was proposed in the paper "MobileNets: Efficient convolutional neural networks for mobile vision applications"; it has two hyper-parameters that can be used to trade off accuracy against computational complexity. In the paper "ShuffleNet: An extremely efficient convolutional neural network for mobile devices", ShuffleNet uses channel shuffle operations and group convolutions to reduce computational complexity, achieving higher accuracy than MobileNetV1.
Although many lightweight convolutional neural networks exist, deploying them directly onto embedded devices is not efficient.
Disclosure of Invention
The invention aims to overcome the shortage of hardware computing resources on ARM embedded devices, and provides an ARM-based embedded convolutional neural network acceleration method that deploys a deep convolutional neural network on a terminal while making full use of the embedded device's resources.
To achieve this purpose, the technical scheme provided by the invention is as follows. An ARM-based embedded convolutional neural network acceleration method comprises the following steps:
1) training a lightweight convolutional neural network by using a deep learning framework;
2) exporting the trained convolutional neural network structure and weight to a file;
3) a program imports the weight file from step 2) and implements the forward calculation of the neural network according to the trained network structure from step 2);
4) the time-consuming 1 × 1 convolution and 3 × 3 depth separable convolution of the neural network are optimized using the NEON technology;
5) the corresponding operations in step 3) are replaced with the optimized 1 × 1 convolution and 3 × 3 depth separable convolution from step 4), accelerating the operation of the convolutional neural network.
Further, in step 2), regarding the storage method for the convolutional neural network structure and weights: the network structure, weights and other structured data are serialized using the Protocol Buffers tool.
Further, in step 4), the optimization method for the 1 × 1 convolution is as follows: the input feature map and the convolution kernels of the 1 × 1 convolution first undergo memory rearrangement so that they conform to the principle of memory locality during the 1 × 1 convolution calculation, and the calculation process is then optimized with the NEON single instruction multiple data (SIMD) technology to reduce the 1 × 1 convolution operation time.
Further, in step 4), the optimization method for the 3 × 3 depth separable convolution is as follows: during the optimization, NEON registers are used to store four adjacent elements of an output channel, and the vectorized calculation of one NEON register proceeds as follows:
Let the element value of a specific channel of the input feature map be denoted I_{m,n}, where m and n are the horizontal and vertical coordinates; a register R_{m,n} in the input feature map is:
R_{m,n} = (I_{m,n}, I_{m,n+1}, I_{m,n+2}, I_{m,n+3}), n % 4 ≠ 0
Let the element value of a specific channel of the output feature map be denoted O_{x,y}, where x and y are the horizontal and vertical coordinates; the register OutR_{x,y} in the output feature map is:
OutR_{x,y} = (O_{x,y}, O_{x,y+1}, O_{x,y+2}, O_{x,y+3}), y % 4 = 1
According to the operation rule of convolution:
O_{x,y} = Σ_{i=1}^{3} Σ_{j=1}^{3} k_{i,j} · I_{i+x-1, j+y-1}
where k_{i,j} denotes the corresponding convolution kernel element. Then
OutR_{x,y} = Σ_{i=1}^{3} Σ_{j=1}^{3} k_{i,j} · (I_{i+x-1, j+y-1}, I_{i+x-1, j+y}, I_{i+x-1, j+y+1}, I_{i+x-1, j+y+2})
which gives
OutR_{x,y} = Σ_{i=1}^{3} Σ_{j=1}^{3} k_{i,j} · R_{i+x-1, j+y-1}
where OutR_{x,y} and R_{i+x-1, j+y-1} are both registers, i.e. the optimized result of the 3 × 3 depth separable convolution is obtained through vector operations.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention overcomes the shortage of computing resources on ARM embedded devices and achieves efficient deployment of convolutional neural networks on embedded platforms.
2. The invention rearranges the memory of the 1 × 1 convolution's input and convolution kernels, greatly improving the computational efficiency of the 1 × 1 convolution without increasing memory usage.
3. For the 3 × 3 depth separable convolution, the invention avoids memory rearrangement and its extra memory usage, optimizing the calculation directly through vectorization.
4. The invention is broadly applicable in the embedded deployment of neural networks; vectorized optimization of the computation maximizes utilization of the embedded device's hardware resources.
Drawings
FIG. 1 is a schematic diagram of the 1 × 1 convolution optimization method of the present invention.
FIG. 2 is a schematic diagram of the 3 × 3 depth separable convolution optimization method of the present invention.
FIG. 3 is a flow chart of the present invention.
Detailed Description
The optimization method of the present invention is described in further detail below with reference to the drawings and MobileNetV1, but the invention is equally applicable to other neural networks that use 1 × 1 convolution and 3 × 3 depth separable convolution.
As shown in fig. 3, the embedded convolutional neural network acceleration method based on ARM provided by the present invention includes the following steps:
step 1, training a lightweight convolutional neural network MobileNet V1 by using Caffe or other deep learning frameworks.
Step 2: export the trained network structure and weights of MobileNetV1 to a file.
Step 3: a program imports the weight file and implements the forward calculation of the neural network according to the trained network structure. Each layer of the neural network can be represented by a function whose parameters include the layer's specification, the input feature map and the layer's weights, and whose return value is the output feature map. The layer functions are then chained together to produce the output of the neural network.
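As a rough sketch of the layer-as-function idea described above, the Python below chains per-layer functions into a forward pass. All names, shapes and values are invented for illustration and are not taken from the patent.

```python
def conv1x1(weights, fmap):
    # fmap: list of input channels, each a 2-D list; weights[o][i] is the
    # scalar 1x1 kernel connecting input channel i to output channel o.
    h, w = len(fmap[0]), len(fmap[0][0])
    out = []
    for wo in weights:
        ch = [[sum(k * fm[y][x] for k, fm in zip(wo, fmap)) for x in range(w)]
              for y in range(h)]
        out.append(ch)
    return out

def relu(fmap):
    # Element-wise activation, one example of a non-convolution layer.
    return [[[max(0.0, v) for v in row] for row in ch] for ch in fmap]

def forward(layers, x):
    # Chain the per-layer functions to obtain the network output.
    for fn in layers:
        x = fn(x)
    return x

# Tiny example: one 1x1 conv (2 channels in -> 1 out) followed by ReLU.
fmap = [[[1.0, -2.0]], [[3.0, 4.0]]]           # 2 channels, each 1x2
layers = [lambda x: conv1x1([[1.0, 1.0]], x), relu]
print(forward(layers, fmap))                    # [[[4.0, 2.0]]]
```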
Step 4a: optimize the time-consuming 1 × 1 convolution of the neural network using the NEON technology.
In network structures such as MobileNetV1, depth separable convolutions and 1 × 1 convolutions replace ordinary convolution operations to reduce the computational complexity of convolution. The 1 × 1 convolution can also be used to raise or lower the channel dimensionality of a feature map: for example, the network's computational cost can be reduced by lowering the dimensionality of the original feature map, convolving in the reduced space, and then restoring the dimensionality with another 1 × 1 convolution.
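The saving from this bottleneck pattern can be illustrated with a rough multiply-accumulate count; the layer sizes below are made up for illustration and do not come from the patent.

```python
def conv_macs(h, w, cin, cout, k):
    # Multiply-accumulate count of a k x k convolution over an h x w map.
    return h * w * cin * cout * k * k

H = W = 56  # assumed feature-map size
direct = conv_macs(H, W, 256, 256, 3)            # plain 3x3, 256 -> 256
bottleneck = (conv_macs(H, W, 256, 64, 1)        # 1x1 reduce to 64 channels
              + conv_macs(H, W, 64, 64, 3)       # 3x3 in the reduced space
              + conv_macs(H, W, 64, 256, 1))     # 1x1 restore to 256
print(direct, bottleneck, direct / bottleneck)   # roughly an 8x reduction
```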
However, the computational complexity of the 1 × 1 convolution is also very large, so that optimizing the 1 × 1 convolution can significantly increase the operating speed of the convolutional neural network.
As shown in fig. 1, the 1 × 1 convolution multiplies the 1 × 1 convolution kernel with the elements at a given position of the input feature map and sums the products to obtain the element value at that position of the output feature map.
In the implementation of the convolution operation, both the feature map and the convolution kernels are stored as one-dimensional arrays, and each channel of the feature map is aligned to a chosen byte boundary, here 16 bytes: one NEON register holds 4 float values, so data are read 16 bytes at a time in register-sized units, and such reads are more efficient when the data are stored 16-byte aligned. The memory distribution section of fig. 1 shows the layout of the feature map and convolution kernels in memory.
The optimization of the 1 × 1 convolution mainly rearranges the memory distribution of the input feature graph and the convolution kernel, so that the memory locality principle, namely the temporal locality and spatial locality principle, is satisfied as much as possible when the output feature is circularly calculated.
The specific optimization process is as follows. During each 1 × 1 convolution operation, the output feature maps are divided into groups of 8 and distributed evenly across the CPU cores of the embedded device using OpenMP, to make full use of hardware resources. Within every group of 8 output feature maps, the 8 maps are computed simultaneously in units of 1 × 8 blocks. Computing the first 1 × 8 block requires accessing a column of the input feature map, so the input channels are grouped in units of 4 and the corresponding four columns are rearranged into contiguous memory, making the accesses more efficient. Similarly, rearranging the 1 × 1 convolution kernels in access order greatly improves access efficiency.
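The rearrangement idea can be sketched in scalar Python; the sizes, layout and grouping below are simplified illustrations of the scheme above, not the patent's actual code, and the toy assumes the channel count equals the group size of 4.

```python
C, H, W = 4, 1, 2          # channels, height, width (toy sizes)
# Raw layout: channel planes stored one after another (poor locality when a
# single output pixel needs a "column" of values across all channels).
flat = [c * 10.0 + p for c in range(C) for p in range(H * W)]

def rearrange(flat, C, HW, block=4):
    # For each group of `block` channels, store the `block` channel values of
    # every spatial position contiguously (position-major inside the group).
    out = []
    for g in range(0, C, block):
        for p in range(HW):
            for c in range(g, min(g + block, C)):
                out.append(flat[c * HW + p])
    return out

re = rearrange(flat, C, H * W)
kernel = [1.0, 2.0, 3.0, 4.0]   # one 1x1 kernel over the 4 input channels

# After rearrangement, output pixel p is a dot product over a contiguous
# slice of `re` - a single sequential read, which the NEON loads exploit.
out = [sum(k * v for k, v in zip(kernel, re[p * C:(p + 1) * C]))
       for p in range(H * W)]

# Reference: the same dot product gathered with strided reads from the raw
# layout; the values must match.
ref = [sum(kernel[c] * flat[c * (H * W) + p] for c in range(C))
       for p in range(H * W)]
assert out == ref
print(out)   # [200.0, 210.0]
```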
Thus, in the calculation, the 8 output feature maps (one 1 × 8 block each) occupy 16 NEON registers, the 4 input feature maps (one 1 × 8 block each) occupy 8 NEON registers, and the 8 1 × 1 convolution kernels occupy the remaining 8, making exactly 32, so the 32 NEON registers of AArch64 are fully utilized in the calculation.
Step 4b: optimize the 3 × 3 depth separable convolution of the neural network using the NEON technology.
Optimizing the 3 × 3 depth separable convolution differs from the 1 × 1 case. In the 3 × 3 depth separable convolution, each output pixel depends on the surrounding 3 × 3 input pixels, so rearranging memory as in the 1 × 1 optimization would incur extra memory overhead, because the 3 × 3 blocks largely overlap. Instead, the optimization vectorizes the computation of the 3 × 3 depth separable convolution directly.
As shown in fig. 2, in the 3 × 3 depth separable convolution the number of channels of the input and output feature maps stays the same and the channels correspond one to one: the first convolution kernel is convolved with the first channel of the input feature map to obtain the first channel of the output feature map.
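A minimal scalar reference of the per-channel computation described above; the sizes are illustrative, and stride 1 with no padding is assumed (the patent text does not specify these).

```python
def depthwise3x3(fmap, kernels):
    # fmap: [C][H][W]; kernels: [C][3][3]; returns [C][H-2][W-2].
    # Output channel c depends only on input channel c and kernel c.
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    out = []
    for c in range(C):
        ch = [[sum(kernels[c][i][j] * fmap[c][y + i][x + j]
                   for i in range(3) for j in range(3))
               for x in range(W - 2)]
              for y in range(H - 2)]
        out.append(ch)
    return out

fmap = [[[float(y * 3 + x) for x in range(3)] for y in range(3)]]  # 1 ch, 3x3
ident = [[[0.0] * 3, [0.0, 1.0, 0.0], [0.0] * 3]]  # picks the centre pixel
print(depthwise3x3(fmap, ident))   # [[[4.0]]] - the centre element
```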
During the optimization of the 3 × 3 depth separable convolution, NEON registers are used to store the final computed results of the output feature map. As shown in fig. 2, four NEON registers store a 2 × 8 block of the output feature map, and the contents of those four NEON registers are then computed by vectorized NEON operations.
The vectorization calculation process for one NEON register is as follows:
the element value of a specific channel of the input characteristic diagram is recorded as Ix,yAnd x and y are horizontal and vertical coordinates respectively. The registers in the input signature graph are:
Rm,n=(Im,n,Im,n+1,Im,n+2,Im,n+3),n%4≠0
recording the element value of a specific channel of the output characteristic diagram as Ox,yAnd x and y are horizontal and vertical coordinates respectively. Register OutR in output signature graphx,yComprises the following steps:
OutRx,y=(Ox,y,Ox,y+1,Ox,y+2,Ox,y+3),x%4=1
the operation rule according to convolution includes:
Figure BDA0001811280480000071
wherein k isi,jRepresenting the corresponding convolution kernel. Then
Figure BDA0001811280480000072
Can obtain
Figure BDA0001811280480000073
Wherein OutRx,yAnd Ri+x-1,j+y-1Are all hostThe memory, we have obtained the optimized result of 3 x 3 depth separable convolution by vector operation.
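The register identity OutR_{x,y} = Σ_{i,j} k_{i,j} · R_{i+x-1, j+y-1} can be checked numerically by modelling a 4-lane register as a Python tuple and doing lane-wise multiply-accumulate, as a NEON multiply-accumulate would; all values below are invented for illustration.

```python
def load_r(img, m, n):
    # R_{m,n} = (I_{m,n}, I_{m,n+1}, I_{m,n+2}, I_{m,n+3})
    return tuple(img[m][n + t] for t in range(4))

def scalar_conv(img, k, x, y):
    # Plain 3x3 convolution at output position (x, y), 1-indexed kernel.
    return sum(k[i][j] * img[i + x - 1][j + y - 1]
               for i in range(1, 4) for j in range(1, 4))

img = [[float((m * 7 + n * 3) % 11) for n in range(8)] for m in range(8)]
# k[i][j] for i, j in 1..3; index 0 is unused padding to keep 1-based indices.
k = [[0.0] * 4] + [[0.0] + [float(i * j) for j in range(1, 4)]
                   for i in range(1, 4)]

x, y = 2, 2                     # start of the 4-wide output block
out_r = (0.0, 0.0, 0.0, 0.0)
for i in range(1, 4):
    for j in range(1, 4):       # lane-wise multiply-accumulate per (i, j)
        r = load_r(img, i + x - 1, j + y - 1)
        out_r = tuple(acc + k[i][j] * lane for acc, lane in zip(out_r, r))

# The four lanes must equal the scalar convolution at the four adjacent
# output positions (x, y), (x, y+1), (x, y+2), (x, y+3).
assert out_r == tuple(scalar_conv(img, k, x, y + t) for t in range(4))
print(out_r)
```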
During the 3 × 3 depth separable convolution calculation, the 3 × 3 convolution kernel is stored in 3 NEON registers to facilitate the calculation.
Finally, OpenMP distributes the output feature map channels evenly across the CPU cores to accelerate the calculation and improve hardware resource utilization.
Step 5: replace the corresponding operations in the convolutional neural network with the optimized 1 × 1 convolution and 3 × 3 depth separable convolution, accelerating the operation of the convolutional neural network.
The effects of the present invention will be further described below with reference to experiments.
The hardware used in this experiment is the Firefly-RK3399. The board adopts the Rockchip RK3399 six-core chip solution, with two Cortex-A72 big cores and four Cortex-A53 little cores at a main frequency of 2.0 GHz. The software system is Ubuntu 16.04, and the GCC version is 5.4.0.
In the experiment, OpenCV first reads the camera video stream and extracts image frames. The loaded images are then preprocessed, mainly scaling and mean subtraction, to improve the accuracy of the deep learning model. Protocol Buffers then reads the MobileNetV1 model trained on Caffe into memory, and a directed acyclic graph is built for computing the neural network. The preprocessed image is taken as the input of the neural network, and the forward calculation yields the network output, i.e. the probability of each object appearing in the image. Finally, the results with high confidence are selected from the network output and displayed.
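A sketch of the mean-and-scale preprocessing mentioned above; the mean and scale constants are assumed typical MobileNet-style values, not taken from the patent, and resizing is omitted.

```python
MEAN = (103.94, 116.78, 123.68)   # assumed per-channel (BGR) means
SCALE = 0.017                     # assumed scale factor

def preprocess(pixels):
    # pixels: [H][W][3] BGR values in 0..255 -> normalised floats.
    return [[[(v - MEAN[c]) * SCALE for c, v in enumerate(px)]
             for px in row]
            for row in pixels]

img = [[[103.94, 116.78, 123.68]]]      # a single "mean-valued" pixel
print(preprocess(img))                  # [[[0.0, 0.0, 0.0]]]
```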
After deployment, MobileNetV1 reaches 12 FPS on the Firefly-RK3399 embedded platform, 6 times the speed of running directly with the Caffe framework (about 2 FPS), showing that the NEON optimization of the 1 × 1 convolution and the 3 × 3 depth separable convolution clearly improves the running performance of the convolutional neural network.
The embodiments above are merely preferred embodiments of the present invention; the scope of the invention is not limited thereto, and changes to the shapes and principles of the present invention shall fall within its protection scope.

Claims (2)

1. An embedded convolutional neural network acceleration method based on ARM is characterized by comprising the following steps:
1) training a lightweight convolutional neural network by using a deep learning framework;
2) exporting the trained convolutional neural network structure and weight to a file;
3) a program imports the weight file from step 2) and implements the forward calculation of the neural network according to the trained network structure from step 2);
4) the time-consuming 1 × 1 convolution and 3 × 3 depth separable convolution of the neural network are optimized using the NEON technology;
the optimization method for the 1 × 1 convolution is as follows: the input feature map and the convolution kernels of the 1 × 1 convolution first undergo memory rearrangement so that they conform to the principle of memory locality during the 1 × 1 convolution calculation, and the calculation process is then optimized with the NEON single instruction multiple data technology to reduce the 1 × 1 convolution operation time;
in the optimization process of the 3 × 3 depth separable convolution, NEON registers are used to store four adjacent elements of an output channel, and the vectorized calculation of one NEON register is as follows:
let the element value of a specific channel of the input feature map be denoted I_{m,n}, where m and n are the horizontal and vertical coordinates; a register R_{m,n} in the input feature map is:
R_{m,n} = (I_{m,n}, I_{m,n+1}, I_{m,n+2}, I_{m,n+3}), n % 4 ≠ 0
let the element value of a specific channel of the output feature map be denoted O_{x,y}, where x and y are the horizontal and vertical coordinates; the register OutR_{x,y} in the output feature map is:
OutR_{x,y} = (O_{x,y}, O_{x,y+1}, O_{x,y+2}, O_{x,y+3}), y % 4 = 1
according to the operation rule of convolution:
O_{x,y} = Σ_{i=1}^{3} Σ_{j=1}^{3} k_{i,j} · I_{i+x-1, j+y-1}
where k_{i,j} denotes the corresponding convolution kernel element; then
OutR_{x,y} = Σ_{i=1}^{3} Σ_{j=1}^{3} k_{i,j} · (I_{i+x-1, j+y-1}, I_{i+x-1, j+y}, I_{i+x-1, j+y+1}, I_{i+x-1, j+y+2})
so that
OutR_{x,y} = Σ_{i=1}^{3} Σ_{j=1}^{3} k_{i,j} · R_{i+x-1, j+y-1}
where OutR_{x,y} and R_{i+x-1, j+y-1} are both registers, i.e. the optimized result of the 3 × 3 depth separable convolution is obtained through vector operations;
5) the corresponding operations in step 3) are replaced with the optimized 1 × 1 convolution and 3 × 3 depth separable convolution from step 4), accelerating the operation of the convolutional neural network.
2. The ARM-based embedded convolutional neural network acceleration method of claim 1, characterized in that in step 2), the storage method for the convolutional neural network structure and weights is: the network structure, weights and other structured data are serialized using the Protocol Buffers tool.
CN201811121051.4A 2018-09-26 2018-09-26 Embedded convolutional neural network acceleration method based on ARM Active CN109447239B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811121051.4A CN109447239B (en) 2018-09-26 2018-09-26 Embedded convolutional neural network acceleration method based on ARM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811121051.4A CN109447239B (en) 2018-09-26 2018-09-26 Embedded convolutional neural network acceleration method based on ARM

Publications (2)

Publication Number Publication Date
CN109447239A CN109447239A (en) 2019-03-08
CN109447239B true CN109447239B (en) 2022-03-25

Family

ID=65544337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811121051.4A Active CN109447239B (en) 2018-09-26 2018-09-26 Embedded convolutional neural network acceleration method based on ARM

Country Status (1)

Country Link
CN (1) CN109447239B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516796A (en) * 2019-08-28 2019-11-29 西北工业大学 A kind of grouping convolution process optimization method of Embedded platform
CN111008629A (en) * 2019-12-07 2020-04-14 怀化学院 Cortex-M3-based method for identifying number of tip
CN114742211B (en) * 2022-06-10 2022-09-23 南京邮电大学 Convolutional neural network deployment and optimization method facing microcontroller

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015083199A1 (en) * 2013-12-04 2015-06-11 J Tech Solutions, Inc. Computer device and method executed by the computer device
CN107704921A (en) * 2017-10-19 2018-02-16 北京智芯原动科技有限公司 The algorithm optimization method and device of convolutional neural networks based on Neon instructions

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015083199A1 (en) * 2013-12-04 2015-06-11 J Tech Solutions, Inc. Computer device and method executed by the computer device
CN107704921A (en) * 2017-10-19 2018-02-16 北京智芯原动科技有限公司 The algorithm optimization method and device of convolutional neural networks based on Neon instructions

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
3x3 convolution optimized speed using (NEON SIMD) or (NEON SIMD and OpenMP) on S7/Note7; jaeho; Google; 2016-09-27; pages 1-2 *
Acceleration and compression of convolutional neural networks; Chen Weijie; China Masters' Theses Full-text Database, Information Science and Technology; 2018-07-15; pages 20-50 *
Tencent Youtu open-sources the deep learning framework ncnn; CRI Online; web page; 2017-07-25; pages 1-5 *

Also Published As

Publication number Publication date
CN109447239A (en) 2019-03-08

Similar Documents

Publication Publication Date Title
CN108765247B (en) Image processing method, device, storage medium and equipment
JP7431913B2 (en) Efficient data layout for convolutional neural networks
US10394929B2 (en) Adaptive execution engine for convolution computing systems
US11392822B2 (en) Image processing method, image processing apparatus, and computer-readable storage medium
Cho et al. MEC: Memory-efficient convolution for deep neural network
US10706348B2 (en) Superpixel methods for convolutional neural networks
US11609968B2 (en) Image recognition method, apparatus, electronic device and storage medium
JP7007488B2 (en) Hardware-based pooling system and method
CN109447239B (en) Embedded convolutional neural network acceleration method based on ARM
US20190340510A1 (en) Sparsifying neural network models
CN106846235B (en) Convolution optimization method and system accelerated by NVIDIA Kepler GPU assembly instruction
CN108388537B (en) Convolutional neural network acceleration device and method
CN109522902B (en) Extraction of space-time feature representations
CN110796236B (en) Vectorization implementation method for pooling of multi-sample multi-channel convolutional neural network
US20190318461A1 (en) Histogram Statistics Circuit and Multimedia Processing System
CN114461978B (en) Data processing method and device, electronic equipment and readable storage medium
CN112990157B (en) Image target identification acceleration system based on FPGA
CN113705803A (en) Image hardware identification system based on convolutional neural network and deployment method
CN114003201A (en) Matrix transformation method and device and convolutional neural network accelerator
US11481994B2 (en) Method and apparatus for extracting image data in parallel from multiple convolution windows, device, and computer-readable storage medium
CN116524180A (en) Dramatic stage scene segmentation method based on lightweight backbone structure
CN117897708A (en) Parallel depth-by-depth processing architecture for neural networks
CN112927125B (en) Data processing method, device, computer equipment and storage medium
CN110930290B (en) Data processing method and device
Jinguji et al. Weight sparseness for a feature-map-split-cnn toward low-cost embedded fpgas

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant