WO2023236319A1 - 一种面向微控制器的卷积神经网络部署和优化方法 - Google Patents

一种面向微控制器的卷积神经网络部署和优化方法

Info

Publication number
WO2023236319A1
WO2023236319A1 · PCT/CN2022/106634 · CN2022106634W
Authority
WO
WIPO (PCT)
Prior art keywords
data
layer
convolution
neural network
convolutional neural
Prior art date
Application number
PCT/CN2022/106634
Other languages
English (en)
French (fr)
Inventor
孙雁飞
王子牛
亓晋
Original Assignee
南京邮电大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京邮电大学 filed Critical 南京邮电大学
Publication of WO2023236319A1 publication Critical patent/WO2023236319A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the invention relates to the field of microcontroller design, and in particular to a convolutional neural network deployment and optimization method for microcontrollers.
  • Microcontrollers usually have only tens to hundreds of KB of memory and storage space, and operate at clock frequencies from a few MHz to a few hundred MHz.
  • The parameters of mainstream convolutional neural network models range from a few MB to hundreds of MB, which makes it difficult to satisfy the storage constraints of microcontrollers.
  • In response to the demand for lightweight convolutional neural network models, academia and industry have proposed methods for designing lightweight neural networks. Although these methods effectively reduce the number of parameters and the amount of computation, they are still insufficient for microcontrollers. Taking the lightweight model MobileNet V3 as an example, it has 2.9M parameters; even after weight quantization it cannot be stored on a microcontroller, and its computation cost also makes real-time detection difficult.
  • The academic community mainly focuses on the accuracy, computation and parameter count of convolutional neural networks but ignores their memory consumption during inference, and this memory consumption also determines whether a convolutional neural network can run on a microcontroller.
  • As a result, in practice the microcontroller is mainly responsible for collecting data and transmitting sensor readings to a server, where the convolutional neural network runs and makes decisions; this imposes certain limitations on the application scenarios of convolutional neural networks.
  • The "Micro fungal crop disease detection method" (CN11351664A) discloses a method of running a deep learning algorithm on a microcontroller.
  • That patent only provides methods for training the deep learning algorithm, quantizing the model and deploying it on the microcontroller; the model relies on manual design/selection, and no methods for model design, model compression, memory optimization or computation acceleration targeted at microcontrollers are given.
  • The purpose of this invention is to provide a convolutional neural network deployment and optimization method for microcontrollers.
  • To address the problem that microcontrollers have low computing power and limited storage space and therefore struggle to run mainstream convolutional neural networks, a method based on neural network architecture search is proposed.
  • During the search, the accuracy, calculation time and parameter constraints of the convolutional neural network are taken into account so as to find a convolutional neural network model with small computation and parameter counts that is suitable for microcontrollers; for the problem of limited microcontroller memory, an optimization method for the memory occupied by convolution calculations is proposed.
  • The microcontroller-oriented convolutional neural network deployment and optimization method includes three parts: the design of the convolutional neural network model, the optimization of convolution calculation memory, and the deployment of the convolutional neural network.
  • Figure 1 is the neural network architecture search flow chart.
  • the search space is a series of optional operations.
  • the modules in the search space form a super network.
  • The calculation time consumption and memory consumption on the microcontroller are added to the loss function of the super network and, together with accuracy, serve as the optimization objectives.
  • After the search, the module with the highest probability in each layer of the super network is selected as the module retained for that layer; the other modules are removed, and the retained modules of all layers together form the searched target network.
  • Model compression can use the automatic model compression algorithm based on AutoML.
  • the model searched in the previous step is used as the baseline model.
  • The agent part uses the deep deterministic policy gradient to receive the embedding of the l-th layer and output a sparsity ratio, compresses layer l according to that ratio, and then moves to layer l+1 in the environment part to repeat the operation.
  • After all layers have been processed, the accuracy of the entire network is evaluated (the evaluation is the same as for a conventional network: the test set is fed into the network model, and accuracy = number of correct predictions ÷ total number of test samples).
  • rewards including accuracy, parameter volume and actual calculation time are fed back to the agent part.
  • The following reward algorithm is designed for the application scenario of the microcontroller:
  • Reward_lat = -Error × log(Lat)
  • Reward_mem = -Error × log(Mem)
  • where Reward is the reward obtained, Lat is the model calculation time, Mem is the memory consumption of the model, and Error is a coefficient.
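  • As an illustration, the two reward terms above translate directly into code; the following minimal C sketch (the function names are illustrative, not from the patent) computes them from a measured latency and memory footprint:

```c
#include <math.h>

/* Reward terms of the AutoML compression agent, as defined above:
 * Reward_lat = -Error * log(Lat), Reward_mem = -Error * log(Mem). */
static double reward_lat(double error, double lat_s)  { return -error * log(lat_s); }
static double reward_mem(double error, double mem_kb) { return -error * log(mem_kb); }
```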
  • The convolutions commonly used in convolutional neural networks are standard convolution, depth (depthwise) convolution and point (pointwise) convolution.
  • For these three types of convolution, the present invention proposes a memory-optimized convolution calculation method that uses memory reuse to reduce memory consumption, as follows.
  • Notation: C in, W in, H in are the channel number, width and height of the convolution input layer; C out, W out, H out are the channel number, width and height of the convolution output layer; W k, H k are the convolution kernel width and height; h is the height of the allocated memory space.
  • Case 1: C out × W out × H out ≤ C in × W in × H in, i.e., the convolution output layer is no larger than the input layer (the input layer space can hold all output data).
  • Step 1 Allocate memory space m with size C out × W out × h (h ≥ H k /2).
  • Step 2 Convolve part of the input layer data and the convolution kernel to fill the memory space m.
  • Step 3 Copy the lower layer data in the memory space m at this time to the appropriate location of the convolution input layer, overwriting the original input data.
  • Step 4 Copy the upper data in memory space m to the lower data in memory space m, overwriting the original data.
  • Steps 2, 3 and 4 temporarily store the output in m. Because the convolution involves adjacent rows and columns, a computed result cannot be written directly into the input layer; the data in m may be copied to the corresponding position of the input layer only after the input data at that position is no longer needed by the convolution of adjacent rows and columns.
  • Step 5 Calculate part of the data of the convolution input layer and the convolution kernel operation in order to fill the upper layer data in the memory space m.
  • Step 6 Copy the lower layer data in the memory space m at this time to the appropriate location of the convolution input layer, overwriting the original input data.
  • Step 7. Repeat steps 4 to 6 until all data of the convolution input layer are calculated.
  • Step 8 Reshape the data stored in the input layer after calculation so that it matches the number of channels, width and height of the output layer.
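  • To make the data flow of Case 1 concrete, the following is a minimal C sketch of the in-place standard convolution described by Steps 1-8. It is only an illustration under assumptions the patent text does not fix: float data, HWC layout (channels innermost), stride 1, "same" zero padding with odd kernel sizes, and C out ≤ C in so that every flushed output row fits over input rows that are no longer needed. The ring buffer m holds (H k + 1)/2 output rows, matching the h ≥ H k /2 requirement above; function and variable names are illustrative.

```c
#include <stdlib.h>
#include <string.h>

/* In-place standard convolution, Case 1 (output no larger than input).
 * buf: input feature map in HWC layout, overwritten with the output.
 * w:   kernel weights laid out as C_out x H_k x W_k x C_in.              */
void standard_conv_inplace(float *buf, const float *w,
                           int C_in, int W, int H, int C_out, int H_k, int W_k)
{
    int pad_h = (H_k - 1) / 2, pad_w = (W_k - 1) / 2;
    int h = pad_h + 1;                 /* rows held back, satisfies h >= H_k/2  */
    int orow = W * C_out;              /* size of one output row (W_out = W)    */
    int irow = W * C_in;               /* size of one input row                 */
    float *m = malloc(sizeof(float) * (size_t)h * orow);   /* Step 1            */

    for (int y = 0; y < H; ++y) {
        float *mrow = m + (y % h) * orow;     /* ring-buffer slot for row y     */
        for (int x = 0; x < W; ++x)
            for (int co = 0; co < C_out; ++co) {           /* Step 2            */
                float acc = 0.f;
                for (int ky = 0; ky < H_k; ++ky)
                    for (int kx = 0; kx < W_k; ++kx) {
                        int iy = y + ky - pad_h, ix = x + kx - pad_w;
                        if (iy < 0 || iy >= H || ix < 0 || ix >= W) continue;
                        for (int ci = 0; ci < C_in; ++ci)
                            acc += buf[iy * irow + ix * C_in + ci]
                                 * w[((co * H_k + ky) * W_k + kx) * C_in + ci];
                    }
                mrow[x * C_out + co] = acc;
            }
        /* Steps 3-4: input row y - pad_h is no longer needed by any later
         * output row, so output row y - pad_h can leave m and overwrite the
         * already-consumed front of the input buffer.                          */
        if (y >= pad_h)
            memcpy(buf + (size_t)(y - pad_h) * orow,
                   m + ((y - pad_h) % h) * orow, sizeof(float) * orow);
    }
    int tail = (H > pad_h) ? H - pad_h : 0;
    for (int y = tail; y < H; ++y)            /* Steps 5-7: flush the last rows */
        memcpy(buf + (size_t)y * orow, m + (y % h) * orow, sizeof(float) * orow);
    free(m);                                  /* Step 8: buf is now H x W x C_out */
}
```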
  • Case 2: C out × W out × H out > C in × W in × H in, i.e., the convolution output layer is larger than the input layer (the input layer space cannot hold all output data and an additional memory space M is needed). Step 1 Allocate memory space m with size C out × W out × h (h ≥ H k /2), and allocate memory space M with size C out × W out × H out − C in × W in × H in.
  • Step 2 Convolve part of the data of the input layer and the convolution kernel to fill the memory space M.
  • Step 3 Calculate the convolution input layer part and the convolution kernel operation according to the calculation sequence to fill the memory space m.
  • Step 4 Copy the lower layer data in the memory space m at this time to the appropriate location of the convolution input layer, overwriting the original input data.
  • Step 5 Copy the upper-level data in memory space m to the lower-level data in memory space m, overwriting the original data.
  • Steps 3, 4 and 5 temporarily store the output in m. Because the convolution involves adjacent rows and columns, a computed result cannot be written directly into the input layer; the data in m may be copied to the corresponding position of the input layer only after the input data at that position is no longer needed by the convolution of adjacent rows and columns.
  • Step 6 Calculate part of the data of the convolution input layer and the convolution kernel operation in order to fill the upper layer data in the memory space m.
  • Step 7 Copy the lower layer data in the memory space m at this time to the appropriate location of the convolution input layer, overwriting the original input data.
  • Step 8 Repeat steps 5 to 7 until all data of the convolution input layer is calculated.
  • Step 9 Connect the calculated data stored in the input layer with the data in M, and perform a reshape operation to make it conform to the number of channels, width and height of the output layer.
  • FIG. 5 is the depth convolution calculation flow chart. The specific steps are as follows:
  • Step 1 Allocate memory space m, whose size is 1 × W out × H out , that is, the memory space occupied by a single output channel.
  • Step 2 Perform depth convolution on the first input channel and the first convolution kernel and store the output in the memory space m.
  • Step 3 Perform depth convolution on the nth (n>1) channel of the input layer and the corresponding nth convolution kernel and store the result in the n-1th channel.
  • Step 4 Copy the data stored in memory space m to the last channel.
  • Step 5 Release the memory space m.
  • Step 6 Reshape the data stored in the input layer after calculation so that it matches the number of channels, width and height of the output layer.
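  • A minimal C sketch of this depthwise scheme (Steps 1-6) follows. It assumes float data, a [channel][row][col] layout, stride 1 and "valid" padding, so a single output channel is no larger than a single input channel; names are illustrative and not taken from the patent.

```c
#include <stdlib.h>
#include <string.h>

static void dwconv_one_channel(const float *in, const float *k, float *out,
                               int W_in, int H_in, int W_k, int H_k)
{
    int W_out = W_in - W_k + 1, H_out = H_in - H_k + 1;
    for (int y = 0; y < H_out; ++y)
        for (int x = 0; x < W_out; ++x) {
            float acc = 0.f;
            for (int ky = 0; ky < H_k; ++ky)
                for (int kx = 0; kx < W_k; ++kx)
                    acc += in[(y + ky) * W_in + (x + kx)] * k[ky * W_k + kx];
            out[y * W_out + x] = acc;
        }
}

/* buf holds the input and, on return, the output (read as C x H_out x W_out). */
void depthwise_conv_inplace(float *buf, const float *kernels,
                            int C, int W_in, int H_in, int W_k, int H_k)
{
    int W_out = W_in - W_k + 1, H_out = H_in - H_k + 1;
    float *m = malloc(sizeof(float) * W_out * H_out);     /* Step 1 */

    /* Step 2: channel 0 -> temporary buffer m */
    dwconv_one_channel(buf, kernels, m, W_in, H_in, W_k, H_k);

    /* Step 3: channel n (n > 0) -> slot of channel n-1, which is free by now */
    for (int n = 1; n < C; ++n)
        dwconv_one_channel(buf + n * W_in * H_in, kernels + n * W_k * H_k,
                           buf + (n - 1) * W_out * H_out, W_in, H_in, W_k, H_k);

    /* Step 4: copy m into the last channel slot; Step 5: release m */
    memcpy(buf + (C - 1) * W_out * H_out, m, sizeof(float) * W_out * H_out);
    free(m);
    /* Step 6 (reshape) is implicit: buf is now read as C x H_out x W_out. */
}
```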
  • Point convolution can be regarded as a standard convolution with a convolution kernel size of 1 ⁇ 1, so the standard convolution calculation method of the present invention can be used.
  • the present invention also provides a calculation method for point convolution memory optimization, which compresses the m memory space allocated in the standard convolution memory optimization into C out ⁇ 1 ⁇ 1 size to achieve lower memory consumption.
  • the point convolution flow chart is shown in Figure 7. The steps are as follows:
  • Case 1: C out ≤ C in , that is, the number of output channels is not greater than the number of input channels (in this case, the input layer space can store all output layer data).
  • Figure 8 is a calculation diagram for this situation.
  • Step 1 Allocate memory space m, whose size is C out × 1 × 1, that is, one position per output channel, to temporarily store point convolution results.
  • Step 2 For each position (i, j) of the input layer (i ∈ [1, W in ], j ∈ [1, H in ]), compute the point convolution over all input channels and store the result in memory space m.
  • Step 3 Copy the data in memory space m to the corresponding channel positions (i, j) of the input layer, overwriting the original data.
  • Step 4. Repeat steps 2 and 3 until all input data are calculated.
  • Step 5 Release the memory space m.
  • Step 6 Reshape the data stored in the input layer after calculation so that it matches the number of channels, width and height of the output layer.
  • Case 2 C out > C in , that is, the number of output channels is greater than the number of input channels (in this case, the input layer space cannot store all the output layer data, and additional memory space M is required).
  • Figure 9 is a calculation diagram for this situation.
  • Step 1 Allocate memory space m, whose size is C out × 1 × 1, that is, one position per output channel, to temporarily store point convolution results. Allocate memory space M, whose size is (C out − C in ) × W out × H out .
  • Step 2 For each position (i, j) of the input layer (i ∈ [1, W in ], j ∈ [1, H in ]), compute the point convolution over all input channels and store the result in memory space m.
  • Step 3 Copy the first C in values in memory space m to the corresponding channel positions (i, j) of the input layer, overwriting the original data.
  • The remaining C out − C in values in memory space m are copied to the corresponding channel position (i, j) of memory space M.
  • Step 4. Repeat steps 2 and 3 until all input data are calculated.
  • Step 5 Release the memory space m.
  • Step 6 Connect the calculated data stored in the input layer with the data in M, and perform a reshape operation to make it conform to the number of channels, width and height of the output layer.
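  • The point convolution case is the simplest to express in code, because each output position depends only on the same input position. The following minimal C sketch covers Case 2 (C out > C in ); with C out ≤ C in the extra buffer M simply disappears. Assumptions not fixed by the patent: float data, CHW layout, stride 1 (so W out = W in and H out = H in ), weights of shape C out × C in , and M pre-allocated by the caller with (C out − C in ) × W × H elements; names are illustrative.

```c
#include <stdlib.h>

/* In-place pointwise (1x1) convolution for C_out > C_in. On return the first
 * C_in output channels live in 'buf' (the old input) and the remaining
 * C_out - C_in channels live in the caller-provided buffer M. */
void pointwise_conv_inplace(float *buf, const float *weights,
                            int C_in, int C_out, int W, int H, float *M)
{
    float *m = malloc(sizeof(float) * C_out);             /* Step 1 */

    for (int j = 0; j < H; ++j) {
        for (int i = 0; i < W; ++i) {
            int pos = j * W + i;
            /* Step 2: all output channels for position (i, j) into m */
            for (int co = 0; co < C_out; ++co) {
                float acc = 0.f;
                for (int ci = 0; ci < C_in; ++ci)
                    acc += weights[co * C_in + ci] * buf[ci * W * H + pos];
                m[co] = acc;
            }
            /* Step 3: first C_in values overwrite the input at (i, j),
             * the remaining C_out - C_in values go to the extra buffer M */
            for (int co = 0; co < C_in; ++co)
                buf[co * W * H + pos] = m[co];
            for (int co = C_in; co < C_out; ++co)
                M[(co - C_in) * W * H + pos] = m[co];
        }
    }
    free(m);                                               /* Step 5 */
    /* Step 6 (concatenate buf and M, reshape) is left to the caller. */
}
```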
  • The microcontroller-oriented convolutional neural network deployment method includes three parts: convolutional neural network model design (i.e., the design of the convolutional neural network model described above), convolutional neural network model verification, and convolutional neural network model deployment, as shown in Figure 10.
  • Model design including the steps of data set acquisition, data preprocessing, model search and training, and model compression.
  • (1) Data set acquisition Taking image data as an example, the data set uses image data collected by a microcontroller.
  • The image data collected by the microcontroller is stored in a storage unit such as a memory card or FLASH. After collection, the data set is transferred to a computer and labeled to form the training set and validation set.
  • Data preprocessing includes image augmentation (cropping, rotation and color adjustment of the collected images) to expand the number of samples; resizing to the input size used for training the convolutional neural network model; and normalization, i.e., processing the mean and standard deviation of the collected images to accelerate training of the convolutional neural network model.
  • Convolutional neural network model search and training: neural network architecture search is used to find a suitable network structure in the defined search space under the three indicators of accuracy, calculation time and memory consumption; the searched model is then compressed with the AutoML automatic model compression algorithm to obtain the target convolutional neural network model, which is trained on the computer with the preprocessed image data to obtain the trained convolutional neural network model.
  • Model verification includes two steps: computer-side model verification and microcontroller-side model verification.
  • Microcontroller-side model verification: verify that the inference results produced by the TensorFlow Lite for Micro deep learning inference framework on the microcontroller are consistent with the results of the deep learning framework used to train the deep learning model.
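  • One simple way to perform such a consistency check, sketched below in C (the function name and tolerance are illustrative, not part of the patent), is to run the same preprocessed input through both frameworks and compare the output vectors element by element within a tolerance:

```c
#include <math.h>
#include <stdio.h>

/* Compare the reference output from the training framework with the output
 * produced on the microcontroller; returns 1 when every element agrees
 * within the given tolerance. */
int outputs_consistent(const float *ref, const float *mcu, int n, float tol)
{
    for (int i = 0; i < n; ++i)
        if (fabsf(ref[i] - mcu[i]) > tol) {
            printf("mismatch at %d: %f vs %f\n", i, ref[i], mcu[i]);
            return 0;
        }
    return 1;
}
```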
  • Model deployment including data collection, data preprocessing and convolutional neural network detection steps.
  • Data collection For example, a camera is used as a data collection device.
  • the microcontroller controls the camera to collect data.
  • the collected image data is sent to the data preprocessing step and saved to an external storage unit.
  • Data preprocessing: the image data to be detected is cropped and normalized, and its mean and standard deviation are processed.
  • Convolutional neural network detection inputs preprocessed data into the model inference framework to obtain detection results. The detection results are handed over to the application part for subsequent processing and corresponding actions are performed. This step includes: convolutional neural network application layer, model layer, model inference framework layer, CMSIS-NN hardware acceleration layer, ARM Cortex-M layer and storage layer.
  • the convolutional neural network detection deployment block diagram is shown in Figure 11.
  • Convolutional neural network application layer used to adopt different detection strategies according to actual application scenarios. Strategies such as a single detection model or multiple cascade models can be used to detect the data to be detected.
  • Model layer: the convolutional neural network model used to detect the data to be detected; this is the model obtained from the model design part above.
  • Model reasoning framework layer used to parse and perform model reasoning.
  • the framework uses TensorFlow Lite for Micro to perform reasoning calculations on the microcontroller.
  • CMSIS-NN computing layer used to accelerate model inference speed.
  • This layer provides hardware acceleration for the upper-layer inference framework by wrapping the digital signal processor (DSP) in the ARM core; compared with running inference on a general-purpose CPU, using the DSP can increase inference speed by 5-10 times. The layer is optional: for microcontrollers without a DSP it can be removed and the CPU used directly for inference.
  • ARM Cortex-M layer used to perform actual operations of model inference, and is also responsible for performing functions of other modules, including functions such as data acquisition, data preprocessing, and execution of actions.
  • the storage layer includes RAM and FLASH parts.
  • RAM is used to store temporary data of the middle layer during the model inference process
  • FLASH is used to store the weight file of the model.
  • the storage layer is also used to store programs of other modules.
  • the present invention proposes a method based on neural network architecture search for microcontrollers running convolutional neural networks.
  • It searches for convolutional neural network models with the small computation, parameter count and memory requirements that microcontrollers demand.
  • the present invention proposes a convolution calculation method that optimizes memory usage.
  • The standard convolution, depth convolution and point convolution commonly used in convolutional neural networks are each optimized to reduce memory usage during inference, so that convolutional neural networks can run on more memory-limited microcontrollers.
  • the present invention designs a method from construction to application of a convolutional neural network running on a microcontroller, which improves the ease of use and practicality of the convolutional neural network model running on the microcontroller.
  • Figure 1 is a flow chart of the neural network architecture search in the present invention.
  • Figure 2 is a standard convolution calculation flow chart in the present invention.
  • Figure 3 is a schematic diagram of standard convolution calculation in the present invention.
  • Figure 4 is a schematic diagram 2 of the standard convolution calculation in the present invention.
  • Figure 5 is a depth convolution calculation flow chart in the present invention.
  • Figure 6 is a schematic diagram of depth convolution calculation in the present invention.
  • Figure 7 is a flow chart of point convolution in the present invention.
  • Figure 8 is a schematic diagram of point convolution calculation in the present invention.
  • Figure 9 is the second schematic diagram of point convolution calculation in the present invention.
  • Figure 10 is a block diagram of the workpiece surface detection method based on deep learning technology in the present invention.
  • Figure 11 is a block diagram of the convolutional neural network detection deployment in the present invention.
  • Figure 12 is a schematic diagram of the neural network architecture search module in the embodiment of the present invention.
  • Figure 13 is a schematic diagram of neural network architecture search in an embodiment of the present invention.
  • Figure 14 is a histogram comparison diagram of memory overhead of convolution algorithms in the embodiment of the present invention.
  • a microcontroller-oriented convolutional neural network deployment and optimization method includes three parts: the design of the convolutional neural network model, the optimization of the convolutional computing memory, and the deployment of the convolutional neural network.
  • n modules as candidate solutions for neural network structure search.
  • Each module can be composed of several operators, such as convolution operators, etc., as shown in Figure 12.
  • n is the number of training set samples
  • Loss is the loss function
  • the cross-entropy loss function is used here
  • y i is the actual label value
  • p(x i ; W, Θ) is the value predicted by the network from the input x i and the parameters W and Θ; the cross-entropy loss is used as the loss between the predicted and actual values.
  • mem(Θ) denotes the memory occupied by the network, where w, h and c respectively denote the width, height and number of channels of the output feature of the i-th operator of the j-th module in the l-th layer.
  • β and γ are the loss weights of the calculation time and memory consumption terms; the larger β and γ are, the smaller the calculation time and memory consumption of the searched network.
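  • Putting the pieces above together, the search loss can be written as follows; this is one consistent reading of the formulas, in which the exp() terms are taken (as an assumption) to form a softmax weighting of the candidate modules:

```latex
\mathcal{L}(W,\Theta)=\frac{1}{n}\sum_{i=1}^{n}\mathrm{Loss}\!\left(y_i,\,p(x_i;W,\Theta)\right)
  +\beta\,\mathrm{lat}(\Theta)+\gamma\,\mathrm{mem}(\Theta)

\mathrm{lat}(\Theta)=\sum_{l=1}^{L}\sum_{j=1}^{n}
  \frac{\exp(\alpha_j^{l})}{\sum_{k=1}^{n}\exp(\alpha_k^{l})}\,\mathrm{lat}_j^{l}
\qquad
\mathrm{mem}(\Theta)=\sum_{l=1}^{L}\sum_{j=1}^{n}
  \frac{\exp(\alpha_j^{l})}{\sum_{k=1}^{n}\exp(\alpha_k^{l})}
  \sum_{i} w_i^{l,j}\,h_i^{l,j}\,c_i^{l,j}
```

  • Here lat_j^l is the latency constant of the j-th module in layer l measured on the target microcontroller, α_j^l is that module's architecture scalar, and w, h, c are the dimensions of its operators' output features.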
  • Model compression uses the automatic model compression algorithm based on AutoML, using the model searched in the previous step as the baseline model.
  • The agent part uses the deep deterministic policy gradient to receive the embedding of the l-th layer and output a sparsity ratio, compresses layer l according to that ratio, and then moves to layer l+1 in the environment part to repeat the operation. After all layers have been processed, the accuracy of the entire network is evaluated. Finally, a reward that combines accuracy, parameter count and actual calculation time is fed back to the agent part.
  • The following reward algorithm is designed for the application scenario of the microcontroller:
  • Reward_lat = -Error × log(Lat)
  • Reward_mem = -Error × log(Mem)
  • where Reward is the reward obtained, Lat is the model calculation time, Mem is the memory consumption of the model, and Error is a coefficient.
  • Step 2 Convolve part of the input layer data and the convolution kernel to fill the memory space m.
  • Step 3 Copy the lower layer data in the memory space m at this time to the appropriate location of the convolution input layer, overwriting the original input data.
  • Step 4 Copy the upper data in memory space m to the lower data in memory space m, overwriting the original data.
  • Step 5 Calculate part of the data of the convolution input layer and the convolution kernel operation in order to fill the upper layer data in the memory space m.
  • Step 6 Copy the lower layer data in the memory space m at this time to the appropriate location of the convolution input layer, overwriting the original input data.
  • Step 7. Repeat steps 4 to 6 until all data of the convolution input layer are calculated.
  • Step 8 Reshape the data stored in the input layer after calculation so that it matches the number of channels, width and height of the output layer.
  • Step 2 Convolve part of the data of the input layer and the convolution kernel to fill the memory space M.
  • Step 3 Calculate the convolution input layer part and the convolution kernel operation according to the calculation sequence to fill the memory space m.
  • Step 4 Copy the lower layer data in the memory space m at this time to the appropriate location of the convolution input layer, overwriting the original input data.
  • Step 5 Copy the upper-level data in memory space m to the lower-level data in memory space m, overwriting the original data.
  • Step 6 Calculate part of the data of the convolution input layer and the convolution kernel operation in order to fill the upper layer data in the memory space m.
  • Step 7 Copy the lower layer data in the memory space m at this time to the appropriate location of the convolution input layer, overwriting the original input data.
  • Step 8 Repeat steps 5 to 7 until all data of the convolution input layer is calculated.
  • Step 9 Connect the calculated data stored in the input layer with the data in M, and perform a reshape operation to make it conform to the number of channels, width and height of the output layer.
  • Step 1 Allocate memory space m, whose size is 1 × W out × H out , that is, the memory space occupied by a single output channel.
  • Step 2 Perform depth convolution on the first input channel and the first convolution kernel and store the output in the memory space m.
  • Step 3 Perform depth convolution on the nth (n>1) channel of the input layer and the corresponding nth convolution kernel and store the result in the n-1th channel.
  • Step 4 Copy the data stored in memory space m to the last channel.
  • Step 5 Release the memory space m.
  • Step 6 Reshape the data stored in the input layer after calculation to make it conform to the number of channels, width and height of the output layer.
  • Case 1: C out ≤ C in , that is, the number of output channels is not greater than the number of input channels (in this case, the input layer space can store all output layer data).
  • the calculation process is shown in Figure 8.
  • Step 1 Allocate memory space m, whose size is C out × 1 × 1, that is, one position per output channel, to temporarily store point convolution results.
  • Step 2 For each position (i, j) of the input layer (i ∈ [1, W in ], j ∈ [1, H in ]), compute the point convolution over all input channels and store the result in memory space m.
  • Step 3 Copy the data in memory space m to the corresponding channel positions (i, j) of the input layer, overwriting the original data.
  • Step 4. Repeat steps 2 and 3 until all input data are calculated.
  • Step 5 Release the memory space m.
  • Step 6 Reshape the data stored in the input layer after calculation so that it matches the number of channels, width and height of the output layer.
  • Case 2 C out > C in , that is, the number of output channels is greater than the number of input channels (in this case, the input layer space cannot store all the output layer data, and additional memory space M is required).
  • the calculation process is shown in Figure 9.
  • Step 1 Allocate memory space m, whose size is C out × 1 × 1, that is, one position per output channel, to temporarily store point convolution results. Allocate memory space M, whose size is (C out − C in ) × W out × H out .
  • Step 2 For each position (i, j) of the input layer (i ∈ [1, W in ], j ∈ [1, H in ]), compute the point convolution over all input channels and store the result in memory space m.
  • Step 3 Copy the first C in values in memory space m to the corresponding channel positions (i, j) of the input layer, overwriting the original data.
  • The remaining C out − C in values in memory space m are copied to the corresponding channel position (i, j) of memory space M.
  • Step 4. Repeat steps 2 and 3 until all input data are calculated.
  • Step 5 Release the memory space m.
  • Step 6 Connect the calculated data stored in the input layer with the data in M, and perform a reshape operation to make it conform to the number of channels, width and height of the output layer.
  • the sample data is collected and stored in a storage unit such as FLASH or memory card inside the microcontroller.
  • Model compression can significantly reduce memory usage and calculation time.
  • the compressed model can be saved in tflite, onnx, h5 and other formats.
  • Deploy the TensorFlow Lite for Micro inference framework and the CMSIS-NN neural network hardware acceleration component on the microcontroller.
  • The TensorFlow Lite for Micro deep learning inference framework and the CMSIS-NN computing layer are combined by writing intermediate-layer code.
  • TensorFlow Lite for Micro is responsible for parsing and executing the deep learning model and for calling the CMSIS-NN computing layer to perform the computation.
  • The CMSIS-NN computing layer is responsible for calling the DSP to perform the actual calculations during model inference. For microcontrollers whose core does not contain a DSP, CMSIS-NN can be omitted and the CPU performs the actual calculations during inference.
  • the microcontroller sends the collected image data to the inference framework.
  • the inference framework returns the inference results after executing the inference.
  • the microcontroller performs corresponding actions based on the inference results and actual needs.
  • Table 1 shows the experimental test data.
  • Table 2 shows the experimental test results.
  • the memory usage includes the additional memory usage during the convolution calculation process and the memory usage of the output matrix. It does not include the memory usage of the input matrix and convolution kernel.
  • M im2col , M MEC , M direct conv and M ours denote the memory usage of im2col+GEMM, MEC, direct convolution and the proposed method, respectively.
  • Figure 14 is a histogram comparison of the data in Table 2. It can be seen that this method significantly reduces the usage overhead of computing memory.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

A convolutional neural network deployment and optimization method for microcontrollers comprises three parts: the design of the convolutional neural network model, the optimization of convolution calculation memory, and the deployment of the convolutional neural network. The model design is based on neural network architecture search and finds convolutional neural network models with the small computation, parameter and memory requirements suitable for microcontrollers. The standard convolution, depthwise convolution and pointwise convolution commonly used in convolutional neural networks are each optimized to reduce memory usage during inference, so that convolutional neural networks can run on more memory-limited microcontrollers. A complete workflow, from construction to application, is provided for convolutional neural networks running on microcontrollers, improving the ease of use and practicality of running such models on microcontrollers.

Description

一种面向微控制器的卷积神经网络部署和优化方法 技术领域
本发明涉及微控制器设计领域,具体涉及一种面向微控制器的卷积神经网络部署和优化方法。
背景技术
微控制器通常只有几十到几百KB的内存空间和存储空间,运行频率从几MHz到几百MHz,而主流卷积神经网络模型参数量从几M到几百M不等,难以满足微控制器的存储空间约束。针对轻量化卷积神经网络模型的需求,学术界和工业界提出一些设计轻量化神经网络的方法,尽管有效降低了模型的参数量和计算量,但对于微控制器来说仍有不足。以轻量化卷积神经网络模型MobileNet V3为例,参数量有2.9M,即使在权重量化后也无法存储到微控制器上,较大的计算量也使得在微控制器上难以实现实时检测。此外学术界主要关注卷积神经网络的准确率、计算量和参数量,而忽视卷积神经网络在推理过程中的内存消耗,而内存消耗大小也决定该卷积神经网络能否运行在微控制器上。
目前卷积神经网络计算过程中需要大量内存,难以在微控制器上运行,使得微控制器在卷积神经网络实际应用中主要负责采集数据,并将传感器读数传送到服务器,在服务器上运行卷积神经网络进行决策,这种方式对卷积神经网络的应用场景造成一定限制。
现有技术中,“一种基于嵌入式GPU和卷积计算的图像处理方法和装置”(CN110246078B)公开了一种减少运算时内存开销的方法,该专利是与im2col卷积计算方法相比,减少内存开销。im2col卷积计算通过使用额外的内存空间优化数据布局,从而减少调用通用矩阵乘的次数加速卷积计算速度。相较于普通卷积计算,im2col和该专利公布的卷积计算方法都消耗更多的内存空间。“一种视觉图像的卷积计算优化方法”(CN108564524A)公开了一种卷积计算优化方法,该专利优化内存传输,提高卷积计算效率,但并未减少内存使用量。“微型真菌类作物病害检测方法”(CN11351664A)公开了一种在微控制器上运行深度学习算法的方法,该专利仅提供训练深度学习算法、量化模型并部署微控制器上的方法,模型依靠人工设计/选取,没有针对微控制器进行模型设计、模型压缩、内存优化和计算加速等方法。
发明内容
本发明的目的是提供一种面向微控制器的卷积神经网络部署和优化方法,针对微控制器算力低、存储空间有限,难以运行主流卷积神经网络的问题,提出基于神经网络架构搜索的方法。在搜索过程中考虑卷积神经网络准确率、计算时间和参数量约束,从而搜索到适用于微控制器计算量和参数量小的卷积神经网络模型;针对微控制器内存空间有限的问题,提出卷积计算占用内存的优化方法,对卷积神经网络中常用到的标准卷积、深度卷积和点卷积分别进行优化,通过就地计算等方法减少卷积神经网络推理过程中的内存占用;针对卷积神经网络在微控制器上应用问题,设计运行在微控制器上的卷积神经网络从构建到应用的方法,包括数据获取、网络设计、训练、部署、加速等流程。
一种面向微控制器的卷积神经网络部署和优化方法,包含卷积神经网络模型的设计、卷积计算内存的优化以及卷积神经网络的部署三部分。其中,
卷积神经网络模型的设计:
使用神经网络架构搜索技术在设定的搜索空间中针对准确率、计算时间、内存消耗三个指标搜索最优的网络结构,图1为神经网络架构搜索流程图。
搜索空间是一系列的可选操作,由搜索空间中的模块组成超级网络,在超级网络的损失函数中加入微控制器端的计算时间消耗和内存空间消耗,和准确率一起作为优化的目标。在搜索结束后,选取超级网络每层中概率最大的模块作为该层保留的模块,去除其它模块,与其它层保留的模块共同组成搜索到的目标网络。
将搜索到的目标模型进行压缩,模型压缩可以使用基于AutoML的自动模型压缩算法,将上一步搜索到的模型作为基准模型,代理部分使用深度确定性策略梯度从第l层中接受嵌入,输出稀疏比率并根据稀疏比率对l层进行模型压缩,接着在环境部分移动到第l+1层进行操作,在完成对所有层的操作之后,评估整个网络的准确率(评估过程与常规网络相同,即将测试集数据输入网络模型,计算网络模型预测正确的数量÷测试集总数量)。最后将包含准确率、参数量和实际计算时间的奖励反馈给代理部分,根据微控制器的应用场景设计了以下奖励算法:
Reward lat=-Error×log(Lat)
Reward mem=-Error×log(Mem)
式中Reward为获得的奖励,Lat表示模型计算时间,Mem表示模型的内存消耗,Error为系数。
卷积计算内存的优化:
卷积神经网络中常用的卷积有标准卷积、深度卷积和点卷积,针对这三种常用卷积,本发明提出一种内存优化的卷积计算方法,采用内存复用方法,实现减少内存消耗,具体如下。
符号约定:C in、W in、H in卷积输入层通道数、宽度、高度;C out、W out、H out卷积输出层通道数、宽度、高度;W k、H k卷积核宽度、高度;h分配内存空间的高度。
标准卷积计算:
标准卷积计算流程图如图2所示。
情况一:C out×W out×H out≤C in×W in×H in,即卷积输出层大小不大于卷积输入层大小(此时输入层空间可存储全部输出层数据),计算过程如图3所示。
步骤1、分配内存空间m,大小为C out×W out×h(h≥H k/2)。
步骤2、卷积输入层部分数据与卷积核运算后填满内存空间m。
步骤3、将此时内存空间m中下层数据复制到卷积输入层适当位置,覆盖原有输入数据。
步骤4、将内存空间m中上层数据复制到内存空间m中下层数据,覆盖原有数据。
步骤2、3、4,将输出暂时存放在m中,由于卷积计算涉及到相邻行和列,不可直接将计算后的结果存放在输入层中,需要等到输入层中该位置输入数据后续不再被相邻行、列卷积计算用到时才可将m中的数据复制到输入层数据对应位置。
步骤5、按照顺序计算卷积输入层部分数据与卷积核运算后填充内存空间m中上层数据。
步骤6、将此时内存空间m中下层数据复制到卷积输入层适当位置,覆盖原有输入数据。
步骤7、重复步骤4~6,直至计算完卷积输入层所有数据。
步骤8、对计算后存放在输入层的数据做reshape操作,使其符合输出层的通道数、宽度和高度。
优化前内存消耗:C out×W out×H out(输出层空间全部分配);
优化后内存消耗:C out×W out×h(输出层复用输入层空间,h<<H out)。
情况二:C out×W out×H out>C in×W in×H in,即卷积输出层大小大于卷积输入层大小(此时输入层空间不可存储全部输出层数据,需要额外内存空间M),计算过程如图4所示。
步骤1、分配内存空间m,大小为C out×W out×h(h≥H k/2)。分配内存空间M,大小为C out×W out×H out-C in×W in×H in
步骤2、卷积输入层部分数据与卷积核运算后填满内存空间M。
步骤3、按照计算顺序计算卷积输入层部分与卷积核运算后填满内存空间m。
步骤4、将此时内存空间m中下层数据复制到卷积输入层适当位置,覆盖原有输入数据。
步骤5、将内存空间m中上层数据复制到内存空间m中下层数据,覆盖原有数据。
步骤3、4、5,将输出暂时存放在m中,由于卷积计算涉及到相邻行和列,不可直接将计算后的结果存放在输入层中,需要等到输入层中该位置输入数据后续不再被相邻行、列卷积计算用到时才可将m中的数据复制到输入层数据对应位置。
步骤6、按照顺序计算卷积输入层部分数据与卷积核运算后填充内存空间m中上层数据。
步骤7、将此时内存空间m中下层数据复制到卷积输入层适当位置,覆盖原有输入数据。
步骤8、重复步骤5~7,直至计算完卷积输入层所有数据。
步骤9、对计算后存放在输入层的数据和M中的数据连接起来,并做reshape操作,使其符合输出层的通道数、宽度和高度。
优化前内存消耗:C out×W out×H out(输出层空间全部分配);
优化后内存消耗:C out×W out×h+C out×W out×H out-C in×W in×H in(输出层部分复用输入层空间,h<<H out)。
深度卷积计算:
图5为深度卷积计算流程图,具体步骤如下:
步骤1、分配内存空间m,其大小为1×W out×H out,即分配输出单个通道占用的内存空间。
步骤2、将输入第1个通道与第1个卷积核做深度卷积后输出存放在内存空间m中。
步骤3、将输入层第n(n>1)个通道与对应第n个卷积核做深度卷积后结果存放在第n-1个通道中。
步骤4、将内存空间m中存放的数据复制到最后一个通道中。
步骤5、释放内存空间m。
步骤6、对计算后存放在输入层的数据做reshape操作,使其符合输出层的通道数、宽度和高度。
深度卷积计算示意图如图6所示。
优化前内存消耗:C out×W out×H out(输出层空间全部分配);
优化后内存消耗:1×W out×H out(输出层复用输入层空间)。
点卷积计算:
点卷积可看成卷积核大小为1×1的标准卷积,因此可以采用本发明所述标准卷积的计算方法。此外,针对点卷积计算过程中不涉及相邻位置值的特点,本发明还提供一种针对点卷积内存优化的计算方法,将标准卷积内存优化中分配的m内存空间压缩为C out×1×1大小,实现更低的内存消耗,点卷积流程图如图7所示步骤如下:
情况一:C out≤C in,即输出通道数不大于输入通道数(此时输入层空间可存储全部输出层数据),图8为该情况的计算示意图。
步骤1、分配内存空间m,其大小为C out×1×1,即每个输出通道分配一个位置大小,临时存储点卷积计算数据。
步骤2、将输入层各通道位置(i,j)(i∈[1,W in],j∈[1,H in])与点卷积计算,计算结果存放在内存空间m中。
步骤3、将内存中间m中的数据复制到输入层对应通道位置(i,j),覆盖原有数据。
步骤4、重复步骤2和步骤3,直至计算完全部输入数据。
步骤5、释放内存空间m。
步骤6、对计算后存放在输入层的数据做reshape操作,使其符合输出层的通道数、宽度和高度。
优化前内存消耗:C out×W out×H out(输出层空间全部分配);
优化后内存消耗:C out×1×1(输出层复用输入层空间)。
情况二:C out>C in,即输出通道数大于输入通道数(此时输入层空间不可存储全部输出层数据,需要额外内存空间M),图9为该情况的计算示意图。
步骤1、分配内存空间m,其大小为C out×1×1,即每个输出通道分配一个位置大小,临时存储点卷积计算数据。分配内存空间M,其大小为(C out-C in)× W out×H out
步骤2、将输入层各通道位置(i,j)(i∈[1,W in],j∈[1,H in])与点卷积计算,计算结果存放在内存空间m中。
步骤3、将内存中间m中前C in个数据复制到输入层对应通道位置(i,j),覆盖原有数据。内存中间m中剩余C out-C in个数据复制到内存空间M对应通道位置(i,j)。
步骤4、重复步骤2和步骤3,直至计算完全部输入数据。
步骤5、释放内存空间m。
步骤6、对计算后存放在输入层的数据和M中的数据连接起来,并做reshape操作,使其符合输出层的通道数、宽度和高度。
优化前内存消耗:C out×W out×H out(输出层空间全部分配);
优化后内存消耗:C out×1×1+(C out-C in)×W out×H out(输出层部分复用输入层空间)。
卷积神经网络的部署:
所述面向微控制器的卷积神经网络部署方法包括卷积神经网络模型设计(即上文中卷积神经网络模型的设计)、卷积神经网络模型验证及卷积神经网络模型部署三部分,如图10所示。
针对以上几个组成部分,具体技术方案如下:
1、模型设计:包括数据集获取、数据预处理、模型搜索和训练和模型压缩几步骤。
(1)数据集获取:以图像数据为例,数据集使用微控制器采集的图像数据,微控制器采集的图像数据存储在存储单元中,如内存卡或者FLASH中,采集完成后将数据集传输到计算机中打上对应标签作为训练集和验证集。
(2)数据预处理:数据预处理包括图片增强,对采集到的图像数据进行裁切、旋转以及色彩调整等处理,用于扩充数据集样本数量;调整大小,调整到适合卷积神经网络模型训练的尺寸;归一化,对采集到的图像数据均值和标准差进行处理,加速卷积神经网络模型训练过程。
(3)卷积神经网络模型搜索和训练:使用神经网络架构搜索技术在设定的搜索空间中针对准确率、计算时间、内存消耗三个指标搜索合适的网络结构,再 通过AutoML自动模型压缩算法对搜索到的模型进行压缩,得到目标卷积神经网络模型,并在计算机上使用预处理后的图像数据进行训练,得到训练后的卷积神经网络模型。
2、模型验证:模型验证包括计算机端模型验证和微控制器端模型验证两个步骤。
(1)计算机端模型验证:首先在计算机端使用TensorFlow Lite for Micro深度学习推理框架验证训练后的模型文件中用到的卷积算子、池化算子、激活函数算子等是否支持,若不支持则替换受支持的算子。其次验证TensorFlow Lite for Micro深度学习推理框架推理结果和训练深度学习模型的深度学习框架结果一致性。
(2)微控制器端模型验证:首先验证微控制器端使用TensorFlow Lite for Micro深度学习推理框架和训练深度学习模型的深度学习框架结果一致性。
3、模型部署:包括数据采集、数据预处理和卷积神经网络检测几步骤。
(1)数据采集:例如采用摄像头作为数据采集设备,由微控制器控制摄像头采集数据,将采集到的图像数据送入数据预处理步骤,并保存到外部存储单元中。
(2)数据预处理:数据预处理对待检测的图像数据进行裁切和归一化,对图像数据均值和标准差进行处理。
(3)卷积神经网络检测:卷积神经网络检测将预处理后的数据输入模型推理框架,得到检测结果,检测结果交由应用部分后续处理,执行相应的动作。该步骤包括:卷积神经网络应用层、模型层、模型推理框架层、CMSIS-NN硬件加速层、ARM Cortex-M层和存储层,卷积神经网络检测部署框图如图11所示。
卷积神经网络应用层:用于根据实际应用场景采取不同检测策略,可采用单个检测模型或多个级联模型等策略对待检测数据进行检测。
模型层:用于检测待检测数据的卷积神经网络模型,该模型为第一部分模型设计得到的模型。
模型推理框架层:用于解析和执行模型推理,该框架采用TensorFlow Lite for Micro在微控制器上执行推理计算。
CMSIS-NN计算层:用于加速模型推理速度,该层通过封装ARM内核中数字 信号处理器(DSP)为上层推理框架提供硬件加速,相较于使用通用CPU进行推理运算,使用DSP进行推理运算可将推理速度提升5-10倍。此外该层是可选的,对于没有DSP的微控制器可以去掉该层,直接使用CPU进行推理。
ARM Cortex-M层:用于执行模型推理的实际运算,同时也负责执行其他模块的功能,包括用于数据采集、数据预处理、执行动作等功能。
存储层:存储层包括RAM和FLASH部分,RAM用于存放模型推理过程中中间层的临时数据,FLASH用于存储模型的权重文件。此外存储层还用于存储其他模块的程序。
本发明达到的有益效果为:
(1)本发明针对微控制器运行卷积神经网络提出一种基于神经网络架构搜索的方法,搜索适用于微控制器计算量、参数量和内存需求小的卷积神经网络模型。
(2)本发明提出一种优化内存占用的卷积计算方法。对卷积神经网络中常用到的标准卷积、深度卷积和点卷积分别进行优化,减少卷积神经网络推理过程中的内存占用,使卷积神经网络可以运行在更多内存受限的微控制器上。
(3)本发明设计一种运行在微控制器上的卷积神经网络从构建到应用的方法,提高了微控制器运行卷积神经网络模型的易用性和实用性。
附图说明
图1是本发明中的神经网络架构搜索流程图。
图2是本发明中的标准卷积计算流程图。
图3是本发明中的标准卷积计算示意图一。
图4是本发明中的标准卷积计算示意图二。
图5是本发明中的深度卷积计算流程图。
图6是本发明中的深度卷积计算示意图。
图7是本发明中的点卷积流程图。
图8是本发明中的点卷积计算示意图一。
图9是本发明中的点卷积计算示意图二。
图10是本发明中的基于深度学习技术的工件表面检测方法框图。
图11是本发明中的卷积神经网络检测部署框图。
图12是本发明实施例中的神经网络架构搜索模块示意图。
图13是本发明实施例中的神经网络架构搜索示意图。
图14是本发明实施例中的卷积算法的内存开销柱状图比较图。
具体实施方式
下面结合说明书附图对本发明的技术方案做进一步的详细说明。
一种面向微控制器的卷积神经网络部署和优化方法,包含卷积神经网络模型的设计、卷积计算内存的优化以及卷积神经网络的部署三部分。
(1)卷积神经网络模型的设计:
1)定义n个模块作为神经网络结构搜索的候选方案,每个模块可由几个算子组成,如卷积算子等,如图12所示。
2)指定神经网络中包含的模块层数L。
3)定义一个超级网络,该网络中包含L层,每层中包含n个模块,同一层的n个模块输出维度相同。
4)将每一层n个模块的输出乘以对应标量α_j^l后相加作为该层的输出,α_j^l表示第l层的第j个模块对应的标量。
5)定义损失函数:
L(W,Θ)=(1/n)∑_i Loss(y_i, p(x_i;W,Θ)) + β·lat(Θ) + γ·mem(Θ)
其中,n为训练集样本数量,Loss为损失函数,本处使用交叉熵损失函数,y_i为实际标签值,p(x_i;W,Θ)为网络根据输入x_i和参数W、Θ预测的值,使用交叉熵损失函数作为预测与实际值的损失。
lat(Θ)表示网络的计算时间,lat(Θ)=∑_l∑_j [exp(α_j^l)/∑_k exp(α_k^l)]·lat_j^l,其中lat_j^l为常数,根据运行该网络模型的微控制器测量得到;α_j^l表示第l层的第j个模块对应的标量;exp()表示以自然常数e为底的指数函数。
mem(Θ)表示网络占用的内存大小,mem(Θ)=∑_l∑_j [exp(α_j^l)/∑_k exp(α_k^l)]·∑_i (w_i^{l,j}×h_i^{l,j}×c_i^{l,j}),其中w_i^{l,j}、h_i^{l,j}、c_i^{l,j}分别表示第l层第j个模块的第i个算子输出特征的宽、高和通道数。
β、γ表示计算时间和内存消耗的损失权重,β、γ越大搜索到的网络计算时间和内存消耗越小。
6)训练超级网络,学习参数W和Θ。
7)对超级网络的每层计算exp(α_j^l)/∑_k exp(α_k^l),取该值最大的模块保留下来,得到搜索到最优的网络模型。如图13所示,深色的模块保留下来组成搜索到的网络,其它模块丢弃,减少网络大小。
8)模型压缩使用基于AutoML的自动模型压缩算法,将上一步搜索到的模型作为基准模型,代理部分使用深度确定性策略梯度从第l层中接受嵌入,输出稀疏比率并根据稀疏比率对l层进行模型压缩,接着在环境部分移动到第l+1层进行操作,在完成对所有层的操作之后,评估整个网络的准确率。最后将包含准确率、参数量和实际计算时间的奖励反馈给代理部分,根据微控制器的应用场景设计了以下奖励算法:
Reward lat=-Error×log(Lat)
Reward mem=-Error×log(Mem)
式中Reward为获得的奖励,Lat表示模型计算时间,Mem表示模型的内存消耗,Error为系数。
(2)卷积计算内存的优化:
1)标准卷积计算:
计算当前算子的输入层参数C in、W in、H in和输出层参数C out、W out、H out
情况一:C out×W out×H out≤C in×W in×H in,即卷积输出层大小不大于卷积输入层大小(此时输入层空间可存储全部输出层数据),计算过程如图3所示。
步骤1、分配内存空间m,大小为C out×W out×h(h≥H k/2),在本实施例中H k=3,取h=2。
步骤2、卷积输入层部分数据与卷积核运算后填满内存空间m。
步骤3、将此时内存空间m中下层数据复制到卷积输入层适当位置,覆盖原有输入数据。
步骤4、将内存空间m中上层数据复制到内存空间m中下层数据,覆盖原有数据。
步骤5、按照顺序计算卷积输入层部分数据与卷积核运算后填充内存空间m中上层数据。
步骤6、将此时内存空间m中下层数据复制到卷积输入层适当位置,覆盖原有输入数据。
步骤7、重复步骤4~6,直至计算完卷积输入层所有数据。
步骤8、对计算后存放在输入层的数据做reshape操作,使其符合输出层的通道数、宽度和高度。
情况二:C out×W out×H out>C in×W in×H in,即卷积输出层大小大于卷积输入层大小(此时输入层空间不可存储全部输出层数据,需要额外内存空间M),计算过程如图4所示。
步骤1、分配内存空间m,大小为C out×W out×h(h≥H k/2),在本实施例中H k=3,取h=2。分配内存空间M,大小为C out×W out×H out-C in×W in×H in
步骤2、卷积输入层部分数据与卷积核运算后填满内存空间M。
步骤3、按照计算顺序计算卷积输入层部分与卷积核运算后填满内存空间m。
步骤4、将此时内存空间m中下层数据复制到卷积输入层适当位置,覆盖原有输入数据。
步骤5、将内存空间m中上层数据复制到内存空间m中下层数据,覆盖原有数据。
步骤6、按照顺序计算卷积输入层部分数据与卷积核运算后填充内存空间m中上层数据。
步骤7、将此时内存空间m中下层数据复制到卷积输入层适当位置,覆盖原有输入数据。
步骤8、重复步骤5~7,直至计算完卷积输入层所有数据。
步骤9、对计算后存放在输入层的数据和M中的数据连接起来,并做reshape操作,使其符合输出层的通道数、宽度和高度。
2)深度卷积计算:
深度卷积计算示意图如图6所示,具体步骤如下:
步骤1、分配内存空间m,其大小为1×W out×H out,即分配输出单个通道占用的内存空间。
步骤2、将输入第1个通道与第1个卷积核做深度卷积后输出存放在内存空间m中。
步骤3、将输入层第n(n>1)个通道与对应第n个卷积核做深度卷积后结果存放在第n-1个通道中。
步骤4、将内存空间m中存放的数据复制到最后一个通道中。
步骤5、释放内存空间m。
步骤6、对计算后存放在输入层的数据做reshape操作,使其符合输出层的 通道数、宽度和高度。
3)点卷积计算:
计算当前算子的输入层参数C in、W in、H in和输出层参数C out、W out、H out
情况一:C out≤C in,即输出通道数不大于输入通道数(此时输入层空间可存储全部输出层数据),计算过程如图8所示。
步骤1、分配内存空间m,其大小为C out×1×1,即每个输出通道分配一个位置大小,临时存储点卷积计算数据。
步骤2、将输入层各通道位置(i,j)(i∈[1,W in],j∈[1,H in])与点卷积计算,计算结果存放在内存空间m中。
步骤3、将内存中间m中的数据复制到输入层对应通道位置(i,j),覆盖原有数据。
步骤4、重复步骤2和步骤3,直至计算完全部输入数据。
步骤5、释放内存空间m。
步骤6、对计算后存放在输入层的数据做reshape操作,使其符合输出层的通道数、宽度和高度。
情况二:C out>C in,即输出通道数大于输入通道数(此时输入层空间不可存储全部输出层数据,需要额外内存空间M),计算过程如图9所示。
步骤1、分配内存空间m,其大小为C out×1×1,即每个输出通道分配一个位置大小,临时存储点卷积计算数据。分配内存空间M,其大小为(C out-C in)×W out×H out
步骤2、将输入层各通道位置(i,j)(i∈[1,W in],j∈[1,H in])与点卷积计算,计算结果存放在内存空间m中。
步骤3、将内存中间m中前C in个数据复制到输入层对应通道位置(i,j),覆盖原有数据。内存中间m中剩余C out-C in个数据复制到内存空间M对应通道位置(i,j)。
步骤4、重复步骤2和步骤3,直至计算完全部输入数据。
步骤5、释放内存空间m。
步骤6、对计算后存放在输入层的数据和M中的数据连接起来,并做reshape操作,使其符合输出层的通道数、宽度和高度。
(3)卷积神经网络的部署:
1、通过数据采集将样本数据采集存储在微控制器内部FLASH或内存卡等存储单元中。
2、将采集到的数据导入到计算机中,并根据缺陷类型打上标签信息,供深度学习算法使用。
3、采用神经网络结构搜索方法在计算时间和内存消耗约束的搜索空间中搜索到最优的网络模型。
4、在计算机上搭建深度学习环境,可以采用TensorFlow、Pytorch、Caffe等框架,可以采用GPU加速的方式提高深度神经网络训练的速度,如采用NVIDIA显卡并对其进行GPU配置。
5、在计算机端采用根据上述算法生成的深度学习模型和配置后的深度学习框架对工件表面缺陷样本数据进行训练。根据训练结果调整深度学习模型结构和超参数等配置,使之达到目标要求。
6、将训练后的深度学习模型做模型压缩处理,模型压缩可以大幅减少内存占用和计算时间,压缩后的模型可以保存为tflite、onnx、h5等格式。
7、将深度学习模型文件数据存储到微控制器上。
8、在微控制器上部署TensorFlow Lite for Micro推理框架和CMSIS-NN神经网络硬件加速组件。通过编写中间层代码将TensorFlow Lite for Micro深度学习推理框架和CMSIS-NN计算层组合起来,由TensorFlow Lite for Micro负责解析、执行深度学习模型,并调用CMSIS-NN计算层执行计算操作,CMSIS-NN计算层负责调用DSP执行模型推理过程中实际的计算。对于内核不包含DSP的微控制器可以不使用CMSIS-NN,由CPU执行推理过程中实际的计算。
9、在计算机端使用TensorFlow Lite for Micro深度学习推理框架验证训练后的模型文件中用到的深度学习算子是否支持,若不支持则替换受支持的算子。验证在计算机端使用TensorFlow Lite for Micro深度学习推理框架推理结果和训练深度学习模型的深度学习框架推理结果以及在微控制器端使用TensorFlow Lite for Micro深度学习推理框架推理结果的一致性。
10、微控制器将采集到的图像数据送入推理框架中,推理框架执行完推理后返回推理结果,微控制器根据推理结果和实际需要执行相应动作。
接着对本方法和几种算法进行比较实验,具体如下:
表1 测试集信息
表2 几种卷积算法的内存开销对比
表1为实验测试数据。表2为实验测试结果,内存使用量包括卷积计算过程中的额外内存使用量和输出矩阵的内存使用量,不包含输入矩阵和卷积核的内存使用量,M im2 col、M MEC、M direct conv和M ours分别表示im2col+GEMM、MEC、直接卷积和本方法内存使用量大小。图14为表2数据的直方图对比。可以看出,本方法显著降低了运算内存的使用开销。
以上所述仅为本发明的较佳实施方式,本发明的保护范围并不以上述实施方式为限,但凡本领域普通技术人员根据本发明所揭示内容所作的等效修饰或变化,皆应纳入权利要求书中记载的保护范围内。

Claims (9)

  1. 一种面向微控制器的卷积神经网络部署和优化方法,其特征在于:所述方法包含卷积神经网络模型的设计、卷积计算内存的优化以及卷积神经网络的部署三部分;
    卷积神经网络模型的设计中,首先搜索得到指标最优的网络结构,组成超级网络,结合微控制器的指标要求得到目标网络;然后对目标网络进行压缩,并评估压缩后的网络准确率以及相应的设计奖励函数;
    卷积计算内存的优化中,分别对于标准卷积、深度卷积和点卷积三种计算方式的内容使用进行优化,基于内存复用实现减少内存消耗;
    卷积神经网络的部署中,基于卷积神经网络模型的设计,还包括卷积神经网络模型验证和卷积神经网络模型部署;
    其中,模型验证包括计算机端模型验证和微控制器端模型验证;模型部署包括数据采集、数据预处理和卷积神经网络检测。
  2. 根据权利要求1所述的一种面向微控制器的卷积神经网络部署和优化方法,其特征在于:使用神经网络架构搜索技术在设定的搜索空间中针对准确率、计算时间、内存消耗三个指标搜索最优的网络结构,由搜索空间中的模块组成超级网络,在超级网络的损失函数中加入微控制器端的计算时间消耗和内存空间消耗,和准确率一起作为优化的目标;在搜索结束后,选取超级网络每层中概论最大的模块作为该层保留的模块,去除其它模块,与其它层保留的模块共同组成搜索到的目标网络。
  3. 根据权利要求2所述的一种面向微控制器的卷积神经网络部署和优化方法,其特征在于:模型压缩中,将上一步搜索到的模型作为基准模型,代理部分使用深度确定性策略梯度从第l层中接受嵌入,输出稀疏比率并根据稀疏比率对l层进行模型压缩,接着在环境部分移动到第l+1层进行操作,在完成对所有层的操作之后,评估整个网络的准确率;最后将包含准确率、参数量和实际计算时间的奖励反馈给代理部分,根据微控制器的应用场景设计了以下奖励算法:
    Reward lat=-Error×log(Lat)
    Reward mem=-Error×log(Mem)
    式中Reward为获得的奖励,Lat表示模型计算时间,Mem表示模型的内存消耗,Error为系数。
  4. 根据权利要求1所述的一种面向微控制器的卷积神经网络部署和优化方法,其特征在于:对于标准卷积,根据卷积输出层大小和卷积输入层大小之前的关系进行分类处理;
    当卷积输出层大小不大于卷积输入层大小时,分配内存空间m;卷积输入层部分数据与卷积核运算后填满内存空间m;将此时内存空间m中下层数据复制到卷积输入层适当位置,覆盖原有输入数据;将内存空间m中上层数据复制到内存空间m中下层数据,覆盖原有数据;按照顺序计算卷积输入层部分数据与卷积核运算后填充内存空间m中上层数据;将此时内存空间m中下层数据复制到卷积输入层适当位置,覆盖原有输入数据;重复上述流程,直至计算完卷积输入层所有数据;对计算后存放在输入层的数据做reshape操作,使其符合输出层的通道数、宽度和高度;
    当卷积输出层大小大于卷积输入层大小时,分配内存空间m和内存空间M;卷积输入层部分数据与卷积核运算后填满内存空间M;按照计算顺序计算卷积输入层部分与卷积核运算后填满内存空间m;将此时内存空间m中下层数据复制到卷积输入层适当位置,覆盖原有输入数据;将内存空间m中上层数据复制到内存空间m中下层数据,覆盖原有数据;按照顺序计算卷积输入层部分数据与卷积核运算后填充内存空间m中上层数据;将此时内存空间m中下层数据复制到卷积输入层适当位置,覆盖原有输入数据;重复步骤上述流程,直至计算完卷积输入层所有数据;对计算后存放在输入层的数据和M中的数据连接起来,并做reshape操作,使其符合输出层的通道数、宽度和高度。
  5. 根据权利要求1所述的一种面向微控制器的卷积神经网络部署和优化方法,其特征在于:对于深度卷积计算,分配内存空间m,即分配输出单个通道占用的内存空间;将输入第1个通道与第1个卷积核做深度卷积后输出存放在内存空间m中;将输入层第n(n>1)个通道与对应第n个卷积核做深度卷积后结果存放在第n-1个通道中;将内存空间m中存放的数据复制到最后一个通道中;释放内存空间m;对计算后存放在输入层的数据做reshape操作,使其符合输出层的通道数、宽度和高度。
  6. 根据权利要求1所述的一种面向微控制器的卷积神经网络部署和优化方法,其特征在于:对于点卷积,根据输出通道数和输入通道数分类处理:
    当输出通道数不大于输入通道数时,分配内存空间m,每个输出通道分配一个位置大小,临时存储点卷积计算数据;将输入层各通道位置与点卷积计算,计算结果存放在内存空间m中;将内存中间m中的数据复制到输入层对应通道位置,覆盖原有数据;重复上述流程,直至计算完全部输入数据;释放内存空间m;对计算后存放在输入层的数据做reshape操作,使其符合输出层的通道数、宽度和高度;
    当输出通道数大于输入通道数时,分配内存空间m,每个输出通道分配一个位置大小,临时存储点卷积计算数据,分配内存空间M;将输入层各通道位置与点卷积计算,计算结果存放在内存空间m中;将内存中间m中对应于卷积输入层通道数的前数个数据复制到输入层对应通道位置,覆盖原有数据,内存中间m中剩余数据复制到内存空间M对应通道位置;重复上述步骤,直至计算完全部输入数据;释放内存空间m;对计算后存放在输入层的数据和M中的数据连接起来,并做reshape操作,使其符合输出层的通道数、宽度和高度。
  7. 根据权利要求1所述的一种面向微控制器的卷积神经网络部署和优化方法,其特征在于:所述模型验证具体包括如下分步骤:
    计算机端模型验证:首先在计算机端使用深度学习推理框架验证训练后的模型文件中用到的卷积算子、池化算子、激活函数算子是否支持,若不支持则替换受支持的算子;其次验证深度学习推理框架推理结果和训练深度学习模型的深度学习框架结果一致性;
    微控制器端模型验证:验证微控制器端使用深度学习推理框架和训练深度学习模型的深度学习框架结果一致性。
  8. 根据权利要求1所述的一种面向微控制器的卷积神经网络部署和优化方法,其特征在于:所述模型部署具体包括如下分步骤:
    数据采集:由微控制器控制外部设备采集数据,将采集到的数据送入数据预处理步骤,并保存到外部存储单元中;
    数据预处理:数据预处理对采集到数据进行裁切、归一化、均值和标准差进行处理;
    卷积神经网络检测:卷积神经网络检测将预处理后的数据输入模型推理框架,得到检测结果;部署的卷积神经网络包括应用层、模型层、模型推理框架层、 CMSIS-NN硬件加速层、ARM Cortex-M层和存储层。
  9. 根据权利要求1所述的一种面向微控制器的卷积神经网络部署和优化方法,其特征在于:所述卷积神经网络中,
    卷积神经网络应用层用于根据实际情况采取不同检测策略;
    模型层中根据实际需要替换不同的检测模型;
    模型推理框架层用于解析和执行模型推理;
    CMSIS-NN计算层用于加速模型推理速度,该层通过封装ARM内核中数字信号处理器DSP为上层推理框架提供硬件加速;
    ARM Cortex-M层用于执行模型推理的实际运算,同时也负责执行其他模块的功能,包括用于数据采集、数据预处理、执行动作的功能;
    存储层包括RAM和FLASH部分,RAM用于存放模型推理过程中中间层的临时数据,FLASH用于存储模型的权重文件。
PCT/CN2022/106634 2022-06-10 2022-07-20 一种面向微控制器的卷积神经网络部署和优化方法 WO2023236319A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210653260.3A CN114742211B (zh) 2022-06-10 2022-06-10 一种面向微控制器的卷积神经网络部署和优化方法
CN202210653260.3 2022-06-10

Publications (1)

Publication Number Publication Date
WO2023236319A1 true WO2023236319A1 (zh) 2023-12-14

Family

ID=82287414

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/106634 WO2023236319A1 (zh) 2022-06-10 2022-07-20 一种面向微控制器的卷积神经网络部署和优化方法

Country Status (2)

Country Link
CN (1) CN114742211B (zh)
WO (1) WO2023236319A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114742211B (zh) * 2022-06-10 2022-09-23 南京邮电大学 一种面向微控制器的卷积神经网络部署和优化方法
CN115630578B (zh) * 2022-10-30 2023-04-25 四川通信科研规划设计有限责任公司 一种算力体系预测布局优化方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109447239A (zh) * 2018-09-26 2019-03-08 华南理工大学 一种基于arm的嵌入式卷积神经网络加速方法
US20190188537A1 (en) * 2017-12-14 2019-06-20 Robert Bosch Gmbh Effective building block design for deep convolutional neural networks using search
CN112766467A (zh) * 2021-04-06 2021-05-07 深圳市一心视觉科技有限公司 基于卷积神经网络模型的图像识别方法
CN114742211A (zh) * 2022-06-10 2022-07-12 南京邮电大学 一种面向微控制器的卷积神经网络部署和优化方法

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111768458A (zh) * 2020-06-28 2020-10-13 苏州鸿鹄骐骥电子科技有限公司 一种基于卷积神经网络的稀疏图像处理方法
CN113011570B (zh) * 2021-04-30 2023-04-07 电子科技大学 一种采用神经网络压缩系统的人脸表情识别方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190188537A1 (en) * 2017-12-14 2019-06-20 Robert Bosch Gmbh Effective building block design for deep convolutional neural networks using search
CN109447239A (zh) * 2018-09-26 2019-03-08 华南理工大学 一种基于arm的嵌入式卷积神经网络加速方法
CN112766467A (zh) * 2021-04-06 2021-05-07 深圳市一心视觉科技有限公司 基于卷积神经网络模型的图像识别方法
CN114742211A (zh) * 2022-06-10 2022-07-12 南京邮电大学 一种面向微控制器的卷积神经网络部署和优化方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG, YINGJIE: "Design and Implementation of Deep Convolutional Neural Network Embedded Inference Framework", CHINESE MASTER’S THESES FULL-TEXT DATABASE, 15 February 2021 (2021-02-15) *

Also Published As

Publication number Publication date
CN114742211A (zh) 2022-07-12
CN114742211B (zh) 2022-09-23

Similar Documents

Publication Publication Date Title
WO2023236319A1 (zh) 一种面向微控制器的卷积神经网络部署和优化方法
US11907760B2 (en) Systems and methods of memory allocation for neural networks
WO2018099084A1 (zh) 一种神经网络模型训练方法、装置、芯片和系统
CN108563739B (zh) 天气数据获取方法及装置、计算机装置及可读存储介质
US20230297846A1 (en) Neural network compression method, apparatus and device, and storage medium
EP4080416A1 (en) Adaptive search method and apparatus for neural network
CN110175628A (zh) 一种基于自动搜索与知识蒸馏的神经网络剪枝的压缩算法
CN110660478A (zh) 一种基于迁移学习的癌症图像预测判别方法和系统
WO2021051987A1 (zh) 神经网络模型训练的方法和装置
CN113449859A (zh) 一种数据处理方法及其装置
CN112085157B (zh) 基于神经网络和树模型的疾病预测方法及其装置
CN112163601A (zh) 图像分类方法、系统、计算机设备及存储介质
CN116051574A (zh) 一种半监督分割模型构建与图像分析方法、设备及系统
CN110838364A (zh) 一种基于深度学习混合模型的克罗恩病预测方法及装置
CN113688787A (zh) 花生叶片病害识别方法
CN114417986A (zh) 基于人工智能的药物特征信息确定方法及装置
CN113222149A (zh) 模型训练方法、装置、设备和存储介质
CN112764893A (zh) 数据处理方法和数据处理系统
CN113674862A (zh) 一种基于机器学习的急性肾功能损伤发病预测方法
CN112308825A (zh) 一种基于SqueezeNet的农作物叶片病害识别方法
Wang et al. Towards efficient convolutional neural networks through low-error filter saliency estimation
CN113822434A (zh) 用于知识蒸馏的模型选择学习
KR102430796B1 (ko) 딥러닝 모델의 분산 훈련을 위한 전체 슬라이드 이미지 분배 방법 및 이를 수행하는 컴퓨팅 시스템
CN116151323A (zh) 模型生成方法、装置、电子设备及存储介质
CN111931913B (zh) 基于Caffe的卷积神经网络在FPGA上的部署方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22945453

Country of ref document: EP

Kind code of ref document: A1