CN114742211B - Convolutional neural network deployment and optimization method facing microcontroller - Google Patents


Info

Publication number: CN114742211B
Application number: CN202210653260.3A
Authority: CN (China)
Prior art keywords: data, layer, convolution, neural network, model
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Other versions: CN114742211A
Inventors: 孙雁飞, 王子牛, 亓晋, 许斌
Current assignee: Nanjing University of Posts and Telecommunications
Original assignee: Nanjing University of Posts and Telecommunications
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202210653260.3A
Publication of CN114742211A
Priority to PCT/CN2022/106634 (WO2023236319A1)
Application granted
Publication of CN114742211B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation using electronic means
    • G06N3/08: Learning methods
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES (ICT)
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

A microcontroller-oriented convolutional neural network deployment and optimization method comprises three parts: design of the convolutional neural network model, optimization of convolution calculation memory, and deployment of the convolutional neural network. The model design is based on neural network architecture search and finds a convolutional neural network model suited to a microcontroller, with small computation cost, parameter count and memory requirement. The standard convolution, depthwise convolution and pointwise convolution commonly used in convolutional neural networks are each optimized, reducing memory occupation during inference so that convolutional neural networks can run on more memory-limited microcontrollers. A complete procedure, from construction to application, for running a convolutional neural network on a microcontroller is provided, improving the usability and practicality of running convolutional neural network models on microcontrollers.

Description

Convolutional neural network deployment and optimization method facing microcontroller
Technical Field
The invention relates to the field of microcontroller design, and in particular to a microcontroller-oriented convolutional neural network deployment and optimization method.
Background
A microcontroller typically has only tens to hundreds of KB of memory and storage, with operating frequencies from a few MHz to a few hundred MHz, whereas mainstream convolutional neural network models have from several million to several hundred million parameters, making it difficult to satisfy the storage constraints of a microcontroller. In response to the need for lightweight convolutional neural network models, academia and industry have proposed several lightweight network design methods; although these effectively reduce model parameters and computation, they remain insufficient for microcontrollers. Taking the lightweight model MobileNetV3 as an example, its 2.9M parameters cannot be stored on a microcontroller even after weight quantization, and its large computation cost makes real-time detection on a microcontroller difficult. In addition, the academic community mainly focuses on the accuracy, computation cost and parameter count of convolutional neural networks while neglecting memory consumption during inference, which also determines whether a convolutional neural network can run on a microcontroller.
At present, convolutional neural network computation requires a large amount of memory and is difficult to run on a microcontroller. In practical applications, the microcontroller is therefore mainly responsible for acquiring data: sensor readings are transmitted to a server, and the convolutional neural network runs on the server for decision making. This mode limits the application scenarios of convolutional neural networks.
In the prior art, "An image processing method and apparatus based on embedded GPU and convolution calculation" (CN110246078B) discloses a method that reduces runtime memory overhead relative to the im2col convolution method. im2col accelerates convolution by using additional memory to rearrange the data layout, thereby reducing the number of calls to general matrix multiplication. Compared with plain convolution calculation, both im2col and the method disclosed in that patent consume more memory. "A method for optimizing convolution calculation of visual images" (CN108564524A) discloses a convolution optimization method that improves memory-transfer efficiency but does not reduce memory usage. A further prior-art patent only provides a method for training a deep learning algorithm, quantizing the model and deploying it on a microcontroller; the model depends on manual design or selection, and no microcontroller-specific model design, model compression, memory optimization or computation acceleration is performed.
Disclosure of Invention
The invention aims to provide a microcontroller-oriented convolutional neural network deployment and optimization method. Aiming at the problems that a microcontroller has low computing power and limited storage space and can hardly run mainstream convolutional neural networks, a method based on neural network architecture search is provided; the search process considers constraints on the accuracy, computation time and parameter count of the convolutional neural network, so that a model suited to the microcontroller, with small computation cost and parameter count, is found. Aiming at the limited memory of the microcontroller, an optimization method for the memory occupied by convolution calculation is provided: the standard convolution, depthwise convolution and pointwise convolution commonly used in convolutional neural networks are each optimized, and memory occupation during inference is reduced by methods such as partial, in-place computation. Aiming at applying convolutional neural networks on microcontrollers, a method covering the whole process from construction to application is designed, including data acquisition, network design, training, deployment and acceleration.
A microcontroller-oriented convolutional neural network deployment and optimization method comprises three parts: design of the convolutional neural network model, optimization of convolution calculation memory, and deployment of the convolutional neural network. Specifically:
designing a convolutional neural network model:
Using the neural network architecture search technique, an optimal network structure is searched in a set search space against three indexes: accuracy, computation time and memory consumption. Fig. 1 is a flow chart of the neural network architecture search.
The search space is a series of optional operations; a supernet is formed from the modules in the search space. The computation time and memory consumption on the microcontroller are added to the supernet's loss function and, together with accuracy, serve as the optimization target. After the search finishes, the module with the highest probability in each layer of the supernet is kept as that layer's module, the other modules are removed, and the retained modules of all layers form the searched target network.
The searched target model is then compressed. Model compression can use an AutoML-based automatic model compression algorithm, with the model searched in the previous step as the reference model. The agent uses a deep deterministic policy gradient: it receives the embedding of layer l, outputs a sparsity ratio, and compresses layer l accordingly; the environment then moves on to layer l+1. After all layers have been processed, the accuracy of the whole network is evaluated (the evaluation is the same as for a conventional network: test-set data is fed into the model and the number of correct predictions is divided by the total number of test samples). Finally a reward covering accuracy, parameter count and actual computation time is fed back to the agent. The following reward function is designed for the microcontroller application scenario (the original formula images are not reproduced here); in the formula, Reward is the reward, Lat denotes the model's computation latency, Mem denotes the model's memory consumption, Error denotes the model's error, and the remaining terms are coefficients.
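As the exact formula is only available as unreproduced images, the reward shaping can only be illustrated under assumptions. The sketch below uses an AMC-style form that jointly penalizes error, latency and memory; the function name, the logarithmic form and the coefficients a and b are all assumptions, not the patent's formula.

```python
import math

def compression_reward(error, lat, mem, a=1.0, b=1.0):
    """Illustrative reward for the compression agent: higher accuracy
    (lower error) and lower latency/memory both increase the reward.
    The log keeps the latency/memory penalty from dominating."""
    return -error * math.log(a * lat + b * mem)
```

With this shape, halving the error or reducing latency at equal accuracy both raise the reward, which matches the qualitative description in the text.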
And (3) optimizing convolution calculation memory:
the convolution common in the convolutional neural network includes standard convolution, deep convolution and point convolution, and aiming at the three types of common convolution, the invention provides a convolution calculation method with optimized memory, and a memory multiplexing method is adopted to reduce the memory consumption.
Symbol conventions:
C_in, W_in, H_in: channel number, width and height of the convolution input layer;
C_out, W_out, H_out: channel number, width and height of the convolution output layer;
K_w, K_h: convolution kernel width and height;
h: height of the allocated memory space.
Standard convolution calculation:
a standard convolution calculation flow chart is shown in fig. 2.
Case one: C_out × W_out × H_out ≤ C_in × W_in × H_in,
i.e., the convolution output layer is not larger than the convolution input layer (in this case the input layer space can store all of the output layer data); the calculation process is shown in fig. 3.
Step 1: allocate a memory space m of size C_out × W_out × h (h ≪ H_out).
Step 2: compute part of the convolution input layer with the convolution kernel and fill the memory space m.
Step 3: copy the lower rows of data in m to the appropriate position of the convolution input layer, overwriting the original input data.
Step 4: copy the upper rows of data in m onto the lower rows of m, overwriting the original data.
In steps 2, 3 and 4 the output is buffered in m because convolution involves adjacent rows and columns, so a computed result cannot be stored directly in the input layer; data in m may be copied to the corresponding position of the input layer only when the input data at that position will no longer be used by the convolution of adjacent rows and columns.
Step 5: continue, in order, computing part of the input layer with the convolution kernel and filling the upper and middle rows of m.
Step 6: copy the lower rows of data in m to the appropriate position of the convolution input layer, overwriting the original input data.
Step 7: repeat steps 4 to 6 until all data of the convolution input layer have been processed.
Step 8: perform a reshape operation on the data stored in the input layer so that it matches the channel number, width and height of the output layer.
Memory consumption before optimization: C_out × W_out × H_out (the output layer is fully allocated);
memory consumption after optimization: C_out × W_out × h (the output layer reuses the input layer space, h ≪ H_out).
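The row-reuse scheme of case one can be sketched for a single channel. This is an illustrative simplification, not the patent's implementation: it assumes stride 1, valid padding, a buffer height of h = 1 (one output row), and the function name is hypothetical.

```python
def conv2d_inplace(x, k):
    """Valid 2-D convolution (stride 1) that reuses the input buffer x
    for the output, needing only a one-row scratch buffer m.
    x: H x W list of lists (overwritten); k: Kh x Kw kernel."""
    H, W = len(x), len(x[0])
    Kh, Kw = len(k), len(k[0])
    Ho, Wo = H - Kh + 1, W - Kw + 1
    for r in range(Ho):
        # compute one output row into the scratch buffer m (step 2)
        m = [sum(x[r + a][c + b] * k[a][b]
                 for a in range(Kh) for b in range(Kw))
             for c in range(Wo)]
        # input row r is no longer needed by later output rows (which
        # read rows r+1 and below), so the buffered row may overwrite it
        x[r][:Wo] = m
    # the reshape of step 8 is a no-op in this single-channel sketch
    return [row[:Wo] for row in x[:Ho]]
```

The key invariant is the one stated in the text: a result is copied back into the input only once the data at that position can no longer be read by a later convolution window.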
Case two: C_out × W_out × H_out > C_in × W_in × H_in,
i.e., the convolution output layer is larger than the convolution input layer (in this case the input layer space cannot store all of the output layer data and an additional memory space M is needed); the calculation process is shown in fig. 4.
Step 1: allocate a memory space m of size C_out × W_out × h (h ≪ H_out), and allocate a memory space M of size C_out × W_out × H_out - C_in × W_in × H_in.
Step 2: compute part of the convolution input layer with the convolution kernel and fill the memory space M.
Step 3: continuing in calculation order, compute part of the input layer with the convolution kernel and fill the memory space m.
Step 4: copy the lower rows of data in m to the appropriate position of the convolution input layer, overwriting the original input data.
Step 5: copy the upper rows of data in m onto the lower rows of m, overwriting the original data.
In steps 3, 4 and 5 the output is buffered in m because convolution involves adjacent rows and columns, so a computed result cannot be stored directly in the input layer; data in m may be copied to the corresponding position of the input layer only when the input data at that position will no longer be used by the convolution of adjacent rows and columns.
Step 6: continue, in order, computing part of the input layer with the convolution kernel and filling the upper and middle rows of m.
Step 7: copy the lower rows of data in m to the appropriate position of the convolution input layer, overwriting the original input data.
Step 8: repeat steps 5 to 7 until all data of the convolution input layer have been processed.
Step 9: concatenate the computed data stored in the input layer with the data in M, and perform a reshape operation so that the data matches the channel number, width and height of the output layer.
Memory consumption before optimization: C_out × W_out × H_out (the output layer is fully allocated);
memory consumption after optimization: C_out × W_out × h + (C_out × W_out × H_out - C_in × W_in × H_in) (part of the output layer reuses the input layer space, h ≪ H_out).
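The before/after accounting of the two standard-convolution cases can be captured in a small helper. This is a sketch under the assumption that the optimized footprint is the buffer C_out × W_out × h plus, in case two, the overflow space M; the function name is hypothetical.

```python
def standard_conv_memory(C_in, W_in, H_in, C_out, W_out, H_out, h):
    """Return (baseline, optimized) extra memory in elements for the
    standard convolution: baseline allocates the whole output layer,
    the optimized scheme allocates the buffer m (plus M in case two)."""
    before = C_out * W_out * H_out
    if before <= C_in * W_in * H_in:      # case one: output fits in input
        after = C_out * W_out * h
    else:                                  # case two: overflow space M needed
        after = C_out * W_out * h + (before - C_in * W_in * H_in)
    return before, after
```

For example, a 3-channel 10x10 input producing a 3-channel 8x8 output with h = 1 needs 24 buffered elements instead of the 192-element output layer.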
Depthwise convolution calculation:
Fig. 5 is a flowchart of the depthwise convolution calculation; the specific steps are as follows:
Step 1: allocate a memory space m of size 1 × W_out × H_out, i.e., the memory occupied by a single output channel.
Step 2: perform the depthwise convolution of the 1st input channel with the 1st convolution kernel, and store the output in the memory space m.
Step 3: store the result of the depthwise convolution of the nth input channel (n > 1) with the corresponding nth convolution kernel in channel n - 1.
Step 4: copy the data stored in the memory space m to the last channel.
Step 5: release the memory space m.
Step 6: perform a reshape operation on the data stored in the input layer so that it matches the channel number, width and height of the output layer.
A schematic diagram of the depth convolution calculation is shown in fig. 6.
Memory consumption before optimization: C_out × W_out × H_out (the output layer is fully allocated);
memory consumption after optimization: 1 × W_out × H_out (the output layer reuses the input layer space).
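The channel-shifting idea of the depthwise steps can be sketched as follows. This is an illustrative single-stride, valid-padding simplification with a hypothetical function name; note that storing channel n's result in slot n - 1 leaves the stored order rotated by one, which the sketch undoes when building the returned result.

```python
def depthwise_conv_inplace(x, k):
    """Depthwise valid convolution that writes channel n's result into
    the already-consumed channel n-1, buffering only channel 0's output.
    x: C x H x W nested lists (overwritten); k: C kernels of Kh x Kw."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    Kh, Kw = len(k[0]), len(k[0][0])
    Ho, Wo = H - Kh + 1, W - Kw + 1
    # steps 1-2: channel 0's output goes to the scratch buffer m
    m = [[sum(x[0][r + a][c + b] * k[0][a][b]
              for a in range(Kh) for b in range(Kw))
          for c in range(Wo)] for r in range(Ho)]
    # step 3: channel n reads x[n] and writes x[n-1], which is free
    for n in range(1, C):
        for r in range(Ho):
            for c in range(Wo):
                x[n - 1][r][c] = sum(x[n][r + a][c + b] * k[n][a][b]
                                     for a in range(Kh) for b in range(Kw))
    # step 4: the buffered channel-0 result goes into the last slot
    for r in range(Ho):
        x[C - 1][r][:Wo] = m[r]
    slots = [[row[:Wo] for row in ch[:Ho]] for ch in x]
    # stored order is rotated by one channel; restore it in the result
    return [slots[-1]] + slots[:-1]
```

Only one channel-sized buffer is ever live, matching the 1 × W_out × H_out figure above.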
Pointwise convolution calculation:
Pointwise convolution can be regarded as standard convolution with a 1 × 1 kernel, so the standard convolution method described above can be used. In addition, exploiting the fact that pointwise convolution involves no adjacent-position values, the invention further provides a memory-optimized pointwise convolution method that compresses the memory space m allocated in the standard convolution optimization down to C_out × 1 × 1, achieving lower memory consumption. The pointwise convolution flowchart is shown in fig. 7, and the steps are as follows:
the first condition is as follows:
Figure 422655DEST_PATH_IMAGE031
i.e. the number of output channels is not greater than the number of input channels (at this time, the input layer space can store all the output layer data), fig. 8 is a calculation diagram of this case.
Step 1, allocating a memory space m with the size of
Figure 484152DEST_PATH_IMAGE029
×
Figure 768503DEST_PATH_IMAGE030
×
Figure 998672DEST_PATH_IMAGE030
I.e. each output channel is allocated a position size, and the point convolution calculation data is temporarily stored.
Step 2: perform the pointwise convolution at position (i, j) of every input channel (i ∈ [1, W_in], j ∈ [1, H_in]) and store the result in the memory space m.
Step 3: copy the data in m to position (i, j) of the corresponding input layer channels, overwriting the original data.
Step 4: repeat steps 2 and 3 until all input data have been processed.
Step 5: release the memory space m.
Step 6: perform a reshape operation on the data stored in the input layer so that it matches the channel number, width and height of the output layer.
Memory consumption before optimization: C_out × W_out × H_out (the output layer is fully allocated);
memory consumption after optimization: C_out × 1 × 1 (the output layer reuses the input layer space).
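Case one of the pointwise optimization can be sketched directly: one pixel position is transformed at a time, and the C_out-element buffer is immediately written back over the input. The function name is hypothetical; the scheme itself follows the steps above.

```python
def pointwise_conv_inplace(x, w):
    """1x1 convolution reusing the input tensor, with a buffer m of
    C_out scalars (one output pixel across all channels).
    x: C_in x H x W nested lists; w: C_out x C_in weights."""
    C_in, H, W = len(x), len(x[0]), len(x[0][0])
    C_out = len(w)
    assert C_out <= C_in              # case one precondition
    for i in range(H):
        for j in range(W):
            # step 2: pointwise convolution at (i, j) into buffer m
            m = [sum(w[o][c] * x[c][i][j] for c in range(C_in))
                 for o in range(C_out)]
            # step 3: overwrite the same position of the input layer;
            # safe because no other position ever reads (i, j) again
            for o in range(C_out):
                x[o][i][j] = m[o]
    return x[:C_out]
```

Because a 1x1 kernel never reads neighboring positions, no row buffering is needed, which is exactly why m shrinks to C_out × 1 × 1.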
Case two: C_out > C_in,
i.e., the number of output channels is greater than the number of input channels (in this case the input layer space cannot store all of the output layer data and an additional memory space M is needed); fig. 9 is the calculation diagram of this case.
Step 1: allocate a memory space m of size C_out × 1 × 1, i.e., one position per output channel, to buffer the pointwise convolution results; allocate a memory space M of size (C_out - C_in) × W_out × H_out.
Step 2: perform the pointwise convolution at position (i, j) of every input channel (i ∈ [1, W_in], j ∈ [1, H_in]) and store the result in the memory space m.
Step 3: copy the first C_in values in m to position (i, j) of the corresponding input layer channels, overwriting the original data; copy the remaining C_out - C_in values in m to position (i, j) of the corresponding channels of the memory space M.
Step 4: repeat steps 2 and 3 until all input data have been processed.
Step 5: release the memory space m.
Step 6: concatenate the computed data stored in the input layer with the data in M, and perform a reshape operation so that the data matches the channel number, width and height of the output layer.
Memory consumption before optimization: C_out × W_out × H_out (the output layer is fully allocated);
memory consumption after optimization: C_out × 1 × 1 + (C_out - C_in) × W_out × H_out (part of the output layer reuses the input layer space).
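Case two differs only in where the buffered pixel is written back: the first C_in channels reuse the input layer and the overflow channels go into the extra space M. A sketch under the same assumptions (hypothetical function name, nested-list tensors):

```python
def pointwise_conv_expand(x, w):
    """Case two of the pointwise optimization: C_out > C_in, so the
    first C_in output channels overwrite the input in place and the
    remaining C_out - C_in channels go into an extra buffer M."""
    C_in, H, W = len(x), len(x[0]), len(x[0][0])
    C_out = len(w)
    assert C_out > C_in               # case two precondition
    M = [[[0.0] * W for _ in range(H)] for _ in range(C_out - C_in)]
    for i in range(H):
        for j in range(W):
            m = [sum(w[o][c] * x[c][i][j] for c in range(C_in))
                 for o in range(C_out)]
            for o in range(C_in):             # reuse the input layer space
                x[o][i][j] = m[o]
            for o in range(C_in, C_out):      # overflow channels into M
                M[o - C_in][i][j] = m[o]
    return x + M                              # step 6: concatenate
```

The live footprint is the C_out-element buffer plus M, matching the optimized consumption stated above.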
Deployment of convolutional neural networks:
the convolutional neural network deployment method facing the microcontroller comprises three parts of convolutional neural network model design (namely the design of the convolutional neural network model in the above), convolutional neural network model verification and convolutional neural network model deployment, and is shown in fig. 10.
For these components, the specific technical scheme is as follows:
1. Model design: comprises data set acquisition, data preprocessing, model search and training, and model compression.
(1) Data set acquisition: taking image data as an example, the data set uses images acquired by the microcontroller. The acquired image data is stored in a storage unit such as a memory card or FLASH; after acquisition, the data set is transferred to a computer and labeled to form the training and validation sets.
(2) Data preprocessing: includes image augmentation, in which acquired images are cropped, rotated and color-adjusted to enlarge the number of data set samples; resizing to a size suitable for convolutional neural network model training; and normalization, in which the acquired image data is processed with the mean and standard deviation to accelerate model training.
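The normalization step above (scale, then subtract the mean and divide by the standard deviation) can be sketched for a flat list of 8-bit pixel values; the function name and the example mean/std values are illustrative, not the patent's.

```python
def preprocess(pixels, mean, std):
    """Scale 8-bit pixel values to [0, 1], then normalize with the
    dataset mean and standard deviation."""
    return [((p / 255.0) - mean) / std for p in pixels]
```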
(3) Convolutional neural network model search and training: using the neural network architecture search technique, a suitable network structure is searched in the set search space against the three indexes of accuracy, computation time and memory consumption; the searched model is compressed with an AutoML (automated machine learning) based automatic model compression algorithm to obtain the target convolutional neural network model, which is then trained on a computer with the preprocessed image data to obtain the trained model.
2. Model verification: comprises two steps, computer-side model verification and microcontroller-side model verification.
(1) Computer-side model verification: first, verify on the computer with the TensorFlow Lite for Micro deep learning inference framework whether the operators used in the trained model file (convolution, pooling, activation functions, etc.) are supported; if an operator is not supported, replace it with a supported one. Second, verify that the inference results of the TensorFlow Lite for Micro framework are consistent with those of the deep learning framework used to train the model.
(2) Microcontroller-side model verification: verify that the results of the TensorFlow Lite for Micro inference framework running on the microcontroller are consistent with those of the deep learning framework used to train the model.
3. Model deployment: comprises data acquisition, data preprocessing and convolutional neural network detection.
(1) Data acquisition: taking a camera as the acquisition device, the microcontroller controls the camera to acquire data; the acquired image data is passed to the data preprocessing step and stored in an external storage unit.
(2) Data preprocessing: preprocess the image data to be detected by cropping and normalization, processing the image with the mean and standard deviation.
(3) Convolutional neural network detection: the preprocessed data is fed into the model inference framework to obtain the detection result, which is handed to the application part for subsequent processing and execution of corresponding actions. The deployment comprises the following layers; the detection deployment block diagram is shown in fig. 11.
Convolutional neural network application layer: adopts different detection strategies according to the actual application scenario, such as a single detection model or several cascaded models for the data to be detected.
Model layer: the convolutional neural network model used to detect the data, obtained from the model design in the first part.
Model inference framework layer: parses the model and executes inference; TensorFlow Lite for Micro is adopted to execute inference computation on the microcontroller.
CMSIS-NN computation layer: used to accelerate model inference. By wrapping the digital signal processor (DSP) in the ARM core, it provides hardware acceleration to the upper inference framework; compared with inference on a general-purpose CPU, DSP-based inference can improve speed by a factor of 5 to 10. This layer is optional: for a microcontroller without a DSP it can be removed and the CPU used directly for inference.
ARM Cortex-M layer: executes the actual operations of model inference and is also responsible for the other modules' functions, including data acquisition, data preprocessing and action execution.
Storage layer: comprises RAM and FLASH. The RAM stores the temporary data of intermediate layers during model inference; the FLASH stores the model's weight file. The storage layer also stores the programs of the other modules.
The invention achieves the following beneficial effects:
(1) A neural network architecture search based method is provided for convolutional neural networks running on microcontrollers, searching for a model suited to the microcontroller with small computation cost, parameter count and memory requirement.
(2) A convolution calculation method that optimizes memory occupation is provided. The standard convolution, depthwise convolution and pointwise convolution commonly used in convolutional neural networks are each optimized, reducing memory occupation during inference so that convolutional neural networks can run on more memory-limited microcontrollers.
(3) A method covering construction to application of a convolutional neural network running on a microcontroller is designed, improving the usability and practicality of running convolutional neural network models on microcontrollers.
Drawings
Fig. 1 is a flow chart of neural network architecture search in the present invention.
Fig. 2 is a flow chart of a standard convolution calculation in the present invention.
FIG. 3 is a diagram of the standard convolution calculation in case one in the present invention.
FIG. 4 is a diagram of the standard convolution calculation in case two in the present invention.
Fig. 5 is a flowchart of the depth convolution calculation in the present invention.
Fig. 6 is a schematic diagram of the depth convolution calculation in the present invention.
Fig. 7 is a flow chart of point convolution in the present invention.
FIG. 8 is a diagram illustrating the first case of the point convolution calculation in the present invention.
FIG. 9 is a diagram illustrating the second case of the point convolution calculation in the present invention.
Fig. 10 is a block diagram of a workpiece surface detection method based on the deep learning technique in the present invention.
FIG. 11 is a block diagram of a convolutional neural network detection deployment in the present invention.
FIG. 12 is a block diagram of a neural network architecture search module according to an embodiment of the present invention.
Fig. 13 is a schematic diagram of a neural network architecture search in an embodiment of the present invention.
FIG. 14 is a comparison graph of memory overhead histograms for the convolution algorithm in an embodiment of the invention.
Detailed Description
The technical solution of the invention is explained in further detail below with reference to the accompanying drawings.
A convolutional neural network deployment and optimization method facing a microcontroller comprises three parts, namely design of a convolutional neural network model, optimization of convolutional calculation memory and deployment of a convolutional neural network.
(1) Designing a convolutional neural network model:
1) Define n modules as candidates for the neural network structure search; each module may be composed of several operators, such as convolution operators, as shown in fig. 12.
2) The number of module layers L contained in the neural network is specified.
3) Defining a super network, wherein the network comprises L layers, each layer comprises n modules, and the output dimensions of the n modules in the same layer are the same.
4) The output of each of the n modules in a layer is multiplied by a corresponding scalar α_ij and the results are summed to form the output of that layer, where α_ij denotes the scalar corresponding to the j-th module of the i-th layer.
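As a brief illustration of step 4), the weighted combination of one super-network layer's module outputs can be sketched in Python with NumPy; the function and argument names (`layer_output`, `module_outs`, `alpha_l`) are this sketch's own, not the patent's:

```python
import numpy as np

def layer_output(module_outs, alpha_l):
    """Weighted sum of the n module outputs of one super-network layer.

    module_outs: array of shape (n, ...) -- the n modules' outputs, which
                 the patent requires to have identical dimensions.
    alpha_l:     array of shape (n,) -- the learnable scalars of this layer.
    """
    # sum_j alpha_l[j] * module_outs[j], contracting over the module axis
    return np.tensordot(alpha_l, module_outs, axes=1)
```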
5) Defining a loss function:

L_total = (1/N) · Σ_{i=1}^{N} Loss(y_i, f(x_i; w, α)) + λ_lat · lat + λ_mem · mem

wherein N is the number of samples in the training set; Loss is the loss function (a cross-entropy loss is used here); y_i is the actual label value; f(x_i; w, α) is the value predicted by the network from input x_i and parameters w, α; the cross-entropy loss measures the gap between the predicted and actual values.

lat represents the computation time of the network:

lat = Σ_{i=1}^{L} Σ_{j=1}^{n} ( exp(α_ij) / Σ_{k=1}^{n} exp(α_ik) ) · lat_ij

where lat_ij is a constant obtained by measurement on the microcontroller that runs the network model; α_ij denotes the scalar corresponding to the j-th module of the i-th layer; exp denotes the exponential function with base e, exp(x) = e^x.

mem indicates the size of the memory occupied by the network:

mem = Σ_{i=1}^{L} Σ_{j=1}^{n} ( exp(α_ij) / Σ_{k=1}^{n} exp(α_ik) ) · Σ_k w_ijk · h_ijk · c_ijk

where w_ijk, h_ijk and c_ijk respectively denote the width, height and number of channels of the feature output by the k-th operator of the j-th module of the i-th layer.

λ_lat and λ_mem are the loss weights of computation time and memory consumption; the larger λ_lat and λ_mem are, the smaller the computation time and memory consumption of the searched network.
6) Train the super network, learning the parameters w and α.

7) Compute p_ij = exp(α_ij) / Σ_{k=1}^{n} exp(α_ik) for each layer of the super network, and keep in each layer the module with the maximum p_ij, obtaining the searched optimal network model. As shown in fig. 13, the dark modules are retained to form the searched network, and the other modules are discarded to reduce the size of the network.
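The per-layer module selection of steps 6)-7) amounts to a softmax over each layer's scalars followed by an argmax. A small NumPy sketch (illustrative, not the patent's code):

```python
import numpy as np

def select_modules(alpha):
    """alpha: (L, n) matrix of learned scalars, one row per super-network layer.

    Returns, for each layer, the index of the module with maximal
    p_ij = exp(alpha_ij) / sum_k exp(alpha_ik).
    """
    # subtract the row max for numerical stability; it cancels in the ratio
    e = np.exp(alpha - alpha.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)
    return p.argmax(axis=1)
```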
8) Model compression uses an AutoML-based automatic model compression algorithm, with the model searched in the previous step as the reference model. The agent part uses a deep deterministic policy gradient: it receives an embedding from layer l, outputs a sparsity ratio, and compresses layer l of the model according to that ratio; the environment part then moves on to layer l+1, and after all layers have been processed, the accuracy of the whole network is evaluated. Finally, a reward comprising the accuracy, the parameter count and the actual computation time is fed back to the agent part. The following reward algorithm is designed for the microcontroller application scenario:

Reward_lat = -Error × log(Lat)

Reward_mem = -Error × log(Mem)

in the formula, Reward is the acquired reward, Lat represents the model computation time, Mem represents the memory consumption of the model, and Error is a coefficient.
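The two reward terms can be written directly as functions; `error`, `lat` and `mem` correspond to the Error, Lat and Mem of the formulas (a minimal sketch, not production code):

```python
import math

def reward_lat(error, lat):
    """Reward_lat = -Error x log(Lat): lower latency yields a higher reward."""
    return -error * math.log(lat)

def reward_mem(error, mem):
    """Reward_mem = -Error x log(Mem): lower memory use yields a higher reward."""
    return -error * math.log(mem)
```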
(2) Optimization of convolution calculation memory:
1) standard convolution calculation:
Calculate the input layer parameters c_in, w_in, h_in and the output layer parameters c_out, w_out, h_out of the current operator.

Case one: c_out × w_out × h_out ≤ c_in × w_in × h_in, i.e., the convolution output layer is not larger than the convolution input layer (in this case the input layer space can store all the output layer data); the calculation process is shown in fig. 3.
Step 1, allocate a memory space m large enough to hold a few rows of output data; its size is determined by the convolution kernel height. In the present embodiment the kernel height is 3 and m holds 2 rows.
Step 2, operate on part of the convolution input layer data with the convolution kernel and fill memory space m with the result.
Step 3, copy the lower-row data in memory space m to the appropriate position of the convolution input layer, covering the original input data.
Step 4, copy the upper-row data in memory space m onto the lower-row data in memory space m, covering the original data.
Step 5, continuing in order, operate on the next part of the convolution input layer data with the convolution kernel and fill the upper rows of memory space m.
Step 6, copy the lower-row data in memory space m to the appropriate position of the convolution input layer, covering the original input data.
Step 7, repeat steps 4-6 until all data of the convolution input layer have been processed.
Step 8, perform a reshape operation on the computed data stored in the input layer so that it matches the channel number, width and height of the output layer.
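A single-channel NumPy sketch of the case-one scheme, assuming a 3×3 kernel, stride 1 and zero ('same') padding so the output exactly fits the input buffer; the two-row scratch buffer plays the role of memory space m (the function names and the exact buffer size are this sketch's assumptions, not taken verbatim from the patent):

```python
import numpy as np

def out_row(x, k, r):
    """One output row of a 3x3 'same' convolution (zero padding),
    reading only input rows r-1, r, r+1 of x."""
    H, W = x.shape
    acc = np.zeros(W)
    for dr in (-1, 0, 1):
        rr = r + dr
        if 0 <= rr < H:
            p = np.concatenate(([0.0], x[rr], [0.0]))  # zero-pad the columns
            for dc in (-1, 0, 1):
                acc += k[dr + 1, dc + 1] * p[1 + dc : 1 + dc + W]
    return acc

def conv_same_inplace(x, k):
    """Overwrite x with its own 3x3 'same' convolution, using only a
    two-row scratch buffer m: input row r-2 is overwritten only once
    no later output row needs it."""
    H, W = x.shape
    m = np.empty((2, W))                  # "memory space m": two output rows
    m[0] = out_row(x, k, 0)
    if H > 1:
        m[1] = out_row(x, k, 1)
    for r in range(2, H):
        x[r - 2] = m[0]                   # copy the lower buffer row into the freed input row
        m[0] = m[1]                       # shift the upper buffer row down
        m[1] = out_row(x, k, r)           # compute the next output row
    x[H - 2] = m[0]                       # flush the remaining buffered rows
    if H > 1:
        x[H - 1] = m[1]
    return x
```

Because computing output row r reads only input rows r-1..r+1, overwriting input row r-2 beforehand never corrupts data that is still needed, which is exactly the invariant the patent's steps maintain.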
Case two: c_out × w_out × h_out > c_in × w_in × h_in, i.e., the convolution output layer is larger than the convolution input layer (in this case the input layer space cannot store all the output layer data, and an additional memory space M is needed); the calculation process is shown in fig. 4.

Step 1, allocate a memory space m as in case one (in the present embodiment the kernel height is 3 and m holds 2 rows), and allocate a memory space M large enough to hold the output data that will not fit in the input layer space.
Step 2, operate on part of the convolution input layer data with the convolution kernel and fill memory space M with the result.
Step 3, continuing in the calculation order, operate on the next part of the convolution input layer with the convolution kernel and fill memory space m.
Step 4, copy the lower-row data in memory space m to the appropriate position of the convolution input layer, covering the original input data.
Step 5, copy the upper-row data in memory space m onto the lower-row data in memory space m, covering the original data.
Step 6, continuing in order, operate on the next part of the convolution input layer data with the convolution kernel and fill the upper rows of memory space m.
Step 7, copy the lower-row data in memory space m to the appropriate position of the convolution input layer, covering the original input data.
Step 8, repeat steps 5-7 until all data of the convolution input layer have been processed.
Step 9, connect the computed data stored in the input layer with the data in M, and perform a reshape operation so that it matches the channel number, width and height of the output layer.
2) Depth convolution calculation:
a schematic diagram of the depth convolution calculation is shown in fig. 6, and the specific steps are as follows:
Step 1, allocate a memory space m of size 1 × w_out × h_out, i.e. the memory space occupied by a single output channel.
Step 2, perform the depth convolution of the 1st input channel with the 1st convolution kernel and store the output in memory space m.
Step 3, perform the depth convolution of the n-th (n > 1) input channel with the corresponding n-th convolution kernel and store the result in the (n-1)-th channel.
Step 4, copy the data stored in memory space m to the last channel.
Step 5, release memory space m.
Step 6, perform a reshape operation on the computed data stored in the input layer so that it matches the channel number, width and height of the output layer.
3) Point convolution calculation:
Calculate the input layer parameters (including the channel number c_in) and the output layer parameters (including the channel number c_out) of the current operator.

Case one: c_out ≤ c_in, i.e., the number of output channels is not greater than the number of input channels (in this case the input layer space can store all the output layer data); the calculation process is shown in fig. 8.
Step 1, allocate a memory space m of size c_out × 1 × 1, i.e. one position per output channel, to temporarily store the point convolution results.
Step 2, take the data at position (i, j) of every channel of the input layer (i ∈ [1, h_in], j ∈ [1, w_in]), perform the point convolution calculation with it, and store the result in memory space m.
Step 3, copy the data in memory space m to position (i, j) of the corresponding input layer channels, covering the original data.
Step 4, repeat steps 2 and 3 until all input data have been processed.
Step 5, release memory space m.
Step 6, perform a reshape operation on the computed data stored in the input layer so that it matches the channel number, width and height of the output layer.
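When c_out ≤ c_in, each spatial position can be processed independently: compute all output channels at (i, j) into the small buffer m, then write them back over the first c_out input channels at the same position. A NumPy sketch (the names are illustrative):

```python
import numpy as np

def pointwise_inplace(x, w):
    """In-place 1x1 (point) convolution with C_out <= C_in.

    x: (C_in, H, W) input, overwritten with the output; w: (C_out, C_in)."""
    c_in, H, W = x.shape
    c_out = w.shape[0]
    assert c_out <= c_in
    for i in range(H):
        for j in range(W):
            m = w @ x[:, i, j]       # buffer m: one value per output channel
            x[:c_out, i, j] = m      # overwrite the input at (i, j)
    return x[:c_out]                 # "reshape": keep the first C_out channels
```

Overwriting position (i, j) is safe because every spatial position is read exactly once, just before it is written.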
Case two: c_out > c_in, i.e., the number of output channels is greater than the number of input channels (in this case the input layer space cannot store all the output layer data, and an additional memory space M is needed); the calculation process is shown in fig. 9.
Step 1, allocate a memory space m of size c_out × 1 × 1, i.e. one position per output channel, to temporarily store the point convolution results; also allocate a memory space M large enough to hold the extra c_out − c_in output channels.
Step 2, take the data at position (i, j) of every channel of the input layer (i ∈ [1, h_in], j ∈ [1, w_in]), perform the point convolution calculation with it, and store the result in memory space m.
Step 3, copy the first c_in values in memory space m to position (i, j) of the corresponding input layer channels, covering the original data; copy the remaining c_out − c_in values in memory space m to position (i, j) of the corresponding channels of memory space M.
And 4, repeating the step 2 and the step 3 until all input data are calculated.
And 5, releasing the memory space m.
And 6, connecting the calculated data stored in the input layer with the data in the M, and performing reshape operation to make the data meet the number, width and height of channels of the output layer.
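For c_out > c_in the same per-position loop applies, except the overflow channels go into the extra buffer M and the two are connected at the end (a sketch with hypothetical names):

```python
import numpy as np

def pointwise_expand(x, w):
    """1x1 convolution with C_out > C_in: the first C_in output channels
    reuse the input buffer, the remaining C_out - C_in go into M.

    x: (C_in, H, W); w: (C_out, C_in)."""
    c_in, H, W = x.shape
    c_out = w.shape[0]
    assert c_out > c_in
    M = np.empty((c_out - c_in, H, W))        # extra memory space M
    for i in range(H):
        for j in range(W):
            m = w @ x[:, i, j]                # buffer m: all output channels at (i, j)
            x[:, i, j] = m[:c_in]             # first C_in values back into the input
            M[:, i, j] = m[c_in:]             # remainder into M
    return np.concatenate([x, M], axis=0)     # connect the input-layer data with M
```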
(3) Deployment of convolutional neural networks:
1. Collect sample data through data acquisition and store it in a storage unit in the microcontroller, such as FLASH or a memory card.
2. Import the acquired data into a computer and label it with defect-type information for use by the deep learning algorithm.
3. Use the neural network structure search method to search, within a search space constrained by computation time and memory consumption, for the optimal network model.
4. Build a deep learning environment on a computer; frameworks such as TensorFlow, PyTorch and Caffe can be used, and deep neural network training can be accelerated with a GPU, for example an NVIDIA graphics card configured for GPU computation.
5. Train on the workpiece surface defect sample data at the computer side, using the deep learning model generated by the algorithm and the configured deep learning framework. Adjust the deep learning model structure, hyper-parameters and other configurations according to the training results until the target requirements are met.
6. Perform model compression on the trained deep learning model; model compression greatly reduces memory occupation and computation time, and the compressed model can be stored in formats such as tflite, onnx or h5.
7. Storing the deep learning model file data on the microcontroller.
8. Deploy the TensorFlow Lite for Micro inference framework and the CMSIS-NN neural network hardware acceleration component on the microcontroller. The two are combined through an intermediate layer of glue code: TensorFlow Lite for Micro parses and executes the deep learning model and calls the CMSIS-NN computing layer to carry out the computation, while the CMSIS-NN computing layer invokes the DSP to perform the actual calculations of model inference. For a microcontroller whose core does not include a DSP, CMSIS-NN may be omitted and the CPU performs the actual calculations of the inference process.
9. At the computer side, use the TensorFlow Lite for Micro inference framework to verify whether every deep learning operator used in the trained model file is supported, and replace any unsupported operator with a supported one. Then verify that three inference results are consistent: that of TensorFlow Lite for Micro on the computer side, that of the deep learning framework used to train the model, and that of TensorFlow Lite for Micro on the microcontroller side.
10. The microcontroller sends the acquired image data to the inference framework; the inference framework returns the inference result after inference, and the microcontroller executes the corresponding action according to the inference result and the actual needs.
The method is then compared experimentally with several other algorithms, as follows:
TABLE 1 Test set information (table image not reproduced)

TABLE 2 Memory overhead comparison of several convolution algorithms (table image not reproduced)
Table 1 shows the experimental test data. Table 2 shows the experimental results, where the memory usage includes the extra memory used during the convolution calculation and the memory of the output matrix, and excludes the memory of the input matrix and of the convolution kernel; M1, M2, M3 and M4 respectively denote the memory usage of im2col + GEMM, MEC, direct convolution and the present method. Fig. 14 compares the data of Table 2 as a histogram. It can be seen that the method significantly reduces the runtime memory overhead.
The above description is only a preferred embodiment of the present invention, and the scope of the present invention is not limited to the above embodiment, but equivalent modifications or changes made by those skilled in the art according to the present disclosure should be included in the scope of the present invention as set forth in the appended claims.

Claims (6)

1. A convolutional neural network deployment and optimization method facing to a microcontroller is characterized in that: the method comprises three parts, namely design of a convolutional neural network model, optimization of convolutional calculation memory and deployment of a convolutional neural network;
in the design of the convolutional neural network model, firstly, a network structure with the optimal index is obtained through searching to form a super network, and a target network is obtained by combining the index requirement of a microcontroller; then compressing the target network, and evaluating the accuracy rate of the compressed network and a corresponding design reward function;
in the optimization of convolution calculation memory, memory use of three calculation modes of standard convolution, deep convolution and point convolution is optimized respectively, and memory consumption is reduced based on memory multiplexing; for standard convolution, carrying out classification processing according to the relation between the size of a convolution output layer and the size of a convolution input layer; for point convolution, classifying processing is carried out according to the number of output channels and the number of input channels;
for a standard convolution: when the size of the convolution output layer is not larger than that of the convolution input layer, allocating a memory space m; after the convolution input layer part data and the convolution kernel are operated, filling the memory space m; copying the lower layer data in the memory space m to a proper position of a convolution input layer at the moment, and covering the original input data; copying the upper layer data in the memory space m to the lower layer data in the memory space m to cover the original data; calculating partial data of the convolution input layer and upper layer data in the memory space m after convolution kernel operation according to the sequence; copying the lower layer data in the memory space m to a proper position of a convolution input layer at the moment, and covering the original input data; repeating the above process until all data of the convolution input layer are calculated; performing reshape operation on the calculated data stored in the input layer to enable the calculated data to accord with the number, width and height of channels of the output layer;
when the size of the convolution output layer is larger than that of the convolution input layer, allocating a memory space M and a memory space M; after the convolution input layer part data and the convolution kernel are operated, filling the memory space M; calculating the convolution input layer part and the convolution kernel part according to the calculation sequence, and filling the memory space m after the operation; copying the lower layer data in the memory space m to a proper position of a convolution input layer at the moment, and covering the original input data; copying the upper layer data in the memory space m to the lower layer data in the memory space m to cover the original data; calculating partial data of the convolution input layer and upper-layer data in the memory space m after convolution kernel operation according to the sequence; copying the lower layer data in the memory space m to a proper position of a convolution input layer at the moment, and covering the original input data; repeating the above steps until all data of the convolution input layer are calculated; connecting the calculated data stored in the input layer with the data in the M, and performing reshape operation to make the data meet the number, width and height of channels of the output layer;
for the depth convolution calculation, allocating a memory space m, namely allocating and outputting the memory space occupied by a single channel; performing deep convolution on the input 1 st channel and the 1 st convolution kernel, and outputting and storing in a memory space m; the nth channel of the input layer and the corresponding nth convolution kernel are subjected to deep convolution, and the result is stored in the (n-1) th channel, wherein n is greater than 1; copying data stored in the memory space m to the last channel; releasing the memory space m; performing reshape operation on the calculated data stored in the input layer to enable the calculated data to accord with the number, width and height of channels of the output layer;
for point convolution: when the number of output channels is not more than the number of input channels, allocating a memory space m, allocating a position size to each output channel, and temporarily storing point convolution calculation data; performing convolution calculation on the positions of all channels of an input layer and points, and storing the calculation result in a memory space m; copying data in the middle m of the memory to a position of a corresponding channel of an input layer, and covering original data; repeating the above process until all input data are calculated; releasing the memory space m; performing reshape operation on the calculated data stored in the input layer to enable the calculated data to accord with the number, width and height of channels of the output layer;
when the number of output channels is larger than that of input channels, allocating a memory space M, allocating a position size to each output channel, temporarily storing point convolution calculation data, and allocating a memory space M; performing convolution calculation on the positions of all channels of the input layer and the points, and storing the calculation result in a memory space m; copying the previous data corresponding to the number of the channels of the convolution input layer in the middle M of the memory to the position of the corresponding channel of the input layer, covering the original data, and copying the rest data in the middle M of the memory to the position of the corresponding channel of the memory space M; repeating the steps until all input data are calculated; releasing the memory space m; connecting the calculated data stored in the input layer with the data in the M, and performing reshape operation to make the data meet the number, width and height of channels of the output layer;
in the deployment of the convolutional neural network, based on the design of a convolutional neural network model, the convolutional neural network model verification and the deployment of the convolutional neural network model are further included;
the model verification comprises computer-side model verification and microcontroller-side model verification; the model deployment comprises data acquisition, data preprocessing and convolutional neural network detection.
2. The microcontroller-oriented convolutional neural network deployment and optimization method of claim 1, wherein: searching an optimal network structure in a set search space according to three indexes of accuracy, calculation time and memory consumption by using a neural network architecture search technology, forming a super network by using modules in the search space, adding calculation time consumption and memory space consumption of a micro controller end into a loss function of the super network, and taking the calculation time consumption and the memory space consumption together with the accuracy as an optimization target; and after the search is finished, selecting the module with the maximum probability in each layer of the super network as the module reserved in the layer, removing other modules, and forming the searched target network together with the modules reserved in other layers.
3. The microcontroller-oriented convolutional neural network deployment and optimization method of claim 2, wherein: in model compression, the model searched in the previous step is used as the reference model; the agent part uses a deep deterministic policy gradient to receive an embedding from layer l, outputs a sparsity ratio, and compresses layer l of the model according to that ratio; the environment part then moves on to layer l+1, and after all layers have been processed, the accuracy of the whole network is evaluated; finally, a reward comprising the accuracy, the parameter count and the actual computation time is fed back to the agent part, and the following reward algorithm is designed for the microcontroller application scenario:
Reward_lat = -Error × log(Lat)
Reward_mem = -Error × log(Mem)
in the formula, Reward is the acquired Reward, Lat represents the calculation time of the model, Mem represents the memory consumption of the model, and Error is a coefficient.
4. The microcontroller-oriented convolutional neural network deployment and optimization method of claim 1, wherein: the model verification specifically comprises the following steps:
computer-side model verification: firstly, a deep learning inference frame is used at a computer end to verify whether a convolution operator, a pooling operator and an activation function operator used in a trained model file are supported, and if not, the supported operator is replaced; secondly, verifying the consistency of the deep learning reasoning frame reasoning result and the deep learning frame result of the training deep learning model;
and (3) verifying the microcontroller end model: and verifying the result consistency of the deep learning frame of the microcontroller end using the deep learning inference frame and the training deep learning model.
5. The microcontroller-oriented convolutional neural network deployment and optimization method of claim 1, wherein: the model deployment specifically comprises the following sub-steps:
data acquisition: the microcontroller controls external equipment to collect data, the collected data is sent to a data preprocessing step, and the data is stored in an external storage unit;
data preprocessing: the data preprocessing performs cropping, normalization, and mean and standard deviation processing on the acquired data;
detecting a convolutional neural network: the convolutional neural network detection inputs the preprocessed data into a model reasoning framework to obtain a detection result; the deployed convolutional neural network comprises an application layer, a model inference framework layer, a CMSIS-NN hardware acceleration layer, an ARM Cortex-M layer and a storage layer.
6. The microcontroller-oriented convolutional neural network deployment and optimization method of claim 1, wherein: in the convolutional neural network,
the convolutional neural network application layer is used for adopting different detection strategies according to actual conditions;
different detection models are replaced in the model layer according to actual needs;
the model reasoning framework layer is used for analyzing and executing model reasoning;
the CMSIS-NN computing layer is used for accelerating the model reasoning speed and provides hardware acceleration for an upper layer reasoning framework by packaging a Digital Signal Processor (DSP) in an ARM core;
the ARM Cortex-M layer is used for executing actual operation of model reasoning and is also responsible for executing functions of other modules, including functions for data acquisition, data preprocessing and action execution;
the storage layer comprises an RAM and a FLASH part, the RAM is used for storing temporary data of the middle layer in the model reasoning process, and the FLASH is used for storing weight files of the model.
CN202210653260.3A 2022-06-10 2022-06-10 Convolutional neural network deployment and optimization method facing microcontroller Active CN114742211B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210653260.3A CN114742211B (en) 2022-06-10 2022-06-10 Convolutional neural network deployment and optimization method facing microcontroller
PCT/CN2022/106634 WO2023236319A1 (en) 2022-06-10 2022-07-20 Convolutional neural network deployment and optimization method for microcontroller

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210653260.3A CN114742211B (en) 2022-06-10 2022-06-10 Convolutional neural network deployment and optimization method facing microcontroller

Publications (2)

Publication Number Publication Date
CN114742211A CN114742211A (en) 2022-07-12
CN114742211B true CN114742211B (en) 2022-09-23

Family

ID=82287414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210653260.3A Active CN114742211B (en) 2022-06-10 2022-06-10 Convolutional neural network deployment and optimization method facing microcontroller

Country Status (2)

Country Link
CN (1) CN114742211B (en)
WO (1) WO2023236319A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114742211B (en) * 2022-06-10 2022-09-23 南京邮电大学 Convolutional neural network deployment and optimization method facing microcontroller
CN115630578B (en) * 2022-10-30 2023-04-25 四川通信科研规划设计有限责任公司 Calculation power system prediction layout optimization method

Citations (4)

Publication number Priority date Publication date Assignee Title
CN109447239A (en) * 2018-09-26 2019-03-08 华南理工大学 A kind of embedded convolutional neural networks accelerated method based on ARM
CN111768458A (en) * 2020-06-28 2020-10-13 苏州鸿鹄骐骥电子科技有限公司 Sparse image processing method based on convolutional neural network
CN112766467A (en) * 2021-04-06 2021-05-07 深圳市一心视觉科技有限公司 Image identification method based on convolution neural network model
CN113011570A (en) * 2021-04-30 2021-06-22 电子科技大学 Adaptive high-precision compression method and system of convolutional neural network model

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US10776668B2 (en) * 2017-12-14 2020-09-15 Robert Bosch Gmbh Effective building block design for deep convolutional neural networks using search
CN114742211B (en) * 2022-06-10 2022-09-23 南京邮电大学 Convolutional neural network deployment and optimization method facing microcontroller

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN109447239A (en) * 2018-09-26 2019-03-08 华南理工大学 A kind of embedded convolutional neural networks accelerated method based on ARM
CN111768458A (en) * 2020-06-28 2020-10-13 苏州鸿鹄骐骥电子科技有限公司 Sparse image processing method based on convolutional neural network
CN112766467A (en) * 2021-04-06 2021-05-07 深圳市一心视觉科技有限公司 Image identification method based on convolution neural network model
CN113011570A (en) * 2021-04-30 2021-06-22 电子科技大学 Adaptive high-precision compression method and system of convolutional neural network model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of an Embedded Inference Framework for Deep Convolutional Neural Networks; Zhang Yingjie; China Masters' Theses Full-Text Database; 2021-02-15; pp. 19-44 of main text *

Also Published As

Publication number Publication date
CN114742211A (en) 2022-07-12
WO2023236319A1 (en) 2023-12-14

Similar Documents

Publication Publication Date Title
CN114742211B (en) Convolutional neural network deployment and optimization method facing microcontroller
CN109587713B (en) Network index prediction method and device based on ARIMA model and storage medium
CN108764471B (en) Neural network cross-layer pruning method based on feature redundancy analysis
US20230297846A1 (en) Neural network compression method, apparatus and device, and storage medium
CN111696101A (en) Light-weight solanaceae disease identification method based on SE-Inception
CN113420651B (en) Lightweight method and system for deep convolutional neural networks, and target detection method
CN115222950A (en) Lightweight target detection method for embedded platform
CN113674862A (en) Acute renal function injury onset prediction method based on machine learning
CN117131449A (en) Data management-oriented anomaly identification method and system with propagation learning capability
CN114882497A (en) Method for realizing fruit classification and identification based on deep learning algorithm
CN114974306A (en) Transformer abnormal voiceprint detection and identification method and device based on deep learning
CN113762503A (en) Data processing method, device, equipment and computer readable storage medium
CN112685374B (en) Log classification method and device and electronic equipment
CN115357718B (en) Method, system, device and storage medium for discovering repeated materials of theme integration service
CN117058079A (en) Thyroid imaging image automatic diagnosis method based on improved ResNet model
CN111860601A (en) Method and device for predicting large fungus species
CN116741159A (en) Audio classification and model training method and device, electronic equipment and storage medium
CN116740808A (en) Animal behavior recognition method based on deep learning target detection and image classification
CN113160987B (en) Health state prediction method, apparatus, computer device and storage medium
CN110879934B (en) Text prediction method based on Wide & Deep learning model
CN115374687A (en) Numerical-shape combined intelligent diagnosis method for working conditions of oil well
CN112818164A (en) Music type identification method, device, equipment and storage medium
CN111813975A (en) Image retrieval method and device and electronic equipment
CN116431355B (en) Computing load prediction method and system based on power field super computing platform
CN116959489B (en) Quantization method and device for voice model, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant