CN117197576A - Image classification method suitable for MCU deployment based on nonlinear pooling and depthwise separable convolution


Info

Publication number: CN117197576A
Application number: CN202311208995.6A
Authority: CN (China)
Prior art keywords: mcu, convolution, block, model, layer
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 陈彦明, 武钢, 张以文
Current assignee / original assignee: Anhui University
Application filed by Anhui University (CN202311208995.6A); publication of CN117197576A

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an image classification method suitable for MCU deployment based on nonlinear pooling and depthwise separable convolution, and relates to the field of machine learning. The method specifically comprises the following steps. Step 1, obtain MCU-BLOCK-A: improve the depthwise separable convolution of the lightweight neural network MobileNet by using one depthwise convolution and one pointwise convolution, adding a BN layer and an efficient channel attention mechanism between them, appending a final depthwise convolution layer, and connecting the input and output of the last depthwise convolution with a residual connection. Step 2, obtain MCU-BLOCK-B. Step 3, construct the model in combination with a nonlinear pooling layer. The proposed model has a low parameter count and a small peak memory footprint, can meet the resource constraints of most MCUs, and achieves good classification performance. Running the machine learning model on the MCU avoids uploading data to the cloud, which greatly protects data privacy, accelerates real-time processing and response, and greatly reduces energy consumption.

Description

Image classification method suitable for MCU deployment based on nonlinear pooling and depthwise separable convolution
Technical Field
The invention relates to the field of machine learning, in particular to an image classification method suitable for MCU deployment based on nonlinear pooling and depthwise separable convolution.
Background
In recent years, as machine learning (ML) continues to advance, new opportunities have emerged for applying machine learning to resource-constrained Internet of Things (IoT) nodes. At present, machine learning algorithms are widely applied in industries such as smart home, precision agriculture, and consumer electronics. Although running a machine learning model (e.g., for image classification) on a microcontroller (MCU) can avoid uploading data to the cloud, accelerate real-time processing and response, greatly protect data privacy, and greatly reduce energy consumption, deploying intelligent algorithms on MCUs still faces a number of problems.
1) Model size: the flash memory (FLASH) of the MCU stores the model parameters and mostly ranges from 0 to 2 MB, whereas typical lightweight neural network models exceed 10 MB; for example, MobileNetV2 is 13.6 MB, so its parameter efficiency is too low.
2) Peak memory: the static random access memory (SRAM) of the MCU stores the temporary intermediate data of neural network inference, including the input and output activation matrices, and is typically 0-512 KB. The peak memory of MobileNetV2 and EfficientNet-B0 reaches 2.29 MB, which is unsuitable for most existing MCUs.
3) Existing methods that compress models with techniques such as pruning and quantization focus only on reducing parameters and computation and do not solve the memory bottleneck; in addition, neural architecture search (NAS) methods require a great deal of hardware resources, and the manually designed networks MobileNetV2-0.35 and EtinyNet do not balance the trade-off among peak memory, computation, and accuracy.
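For context, the parameter saving that motivates depthwise separable convolution can be made concrete (an illustrative calculation, not taken from the patent). A standard convolution layer with a k×k kernel, C_in input channels, and C_out output channels has

    params_std = k^2 · C_in · C_out

while a depthwise separable convolution factorizes it into a depthwise and a pointwise part:

    params_dsc = k^2 · C_in + C_in · C_out

For k = 3, C_in = 64, C_out = 128 this gives 9·64·128 = 73,728 parameters versus 576 + 8,192 = 8,768, roughly an 8.4× reduction, which is why such factorized blocks fit a 0-2 MB flash budget far more easily.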
Disclosure of Invention
(I) Technical problem to be solved
Aiming at the deficiencies of the prior art, the invention provides an image classification method suitable for MCU deployment based on nonlinear pooling and depthwise separable convolution.
(II) technical scheme
In order to achieve the above purpose, the invention is realized by the following technical scheme: an image classification method suitable for MCU deployment based on nonlinear pooling and depthwise separable convolution, which specifically comprises the following steps:
step 1 acquisition of MCU-BLOCK-A
The method comprises the steps of improving the depth separable convolution of A lightweight neural network MobileNet to obtain MCU-BLOCK-A, utilizing A depth convolution (DWConv) and A point-by-point convolution (PWConv), adding A BN layer and A high-efficiency channel attention mechanism (Efficient Channel Attention, ECA for short) between the two, adding A layer of depth convolution at last, and carrying out residual connection on input and output of the last layer of depth convolution;
step 2 acquisition of MCU-BLOCK-B
Based on the MCU-BLOCK-A obtained in the step 1, removing residual connection between the input and the final layer of depth convolution output on the basis of the MCU-BLOCK-A, adding residual connection between the input and the first layer of point-by-point convolution output, and carrying out residual connection between the connected output and the final layer of depth convolution output;
Step 3: obtain the nonlinear pooling layer
Based on a nonlinear pooling module, nonlinear pooling is introduced after the first convolution layer to rapidly downsample the picture and bypass the large intermediate activation layers, completing image aggregation while attenuating computation;
Step 4: model construction
The model is built by combining convolution, nonlinear pooling, and the MCU-BLOCK-A and MCU-BLOCK-B modules, balancing peak memory, model size, computation, and accuracy, and comprises the following stages:
1) The first stage, extracting local features by convolution with a stride of 2;
2) In the second stage, feature information is extracted along the row direction and the column direction in the picture by utilizing nonlinear pooling, then the picture size is reduced rapidly, and the peak memory of the CNN model is ensured not to be higher than the static random access memory of the MCU;
3) The third stage, extracting features through A plurality of MCU-BLOCK-A modules;
4) A fourth stage, adopting a constructed MCU-BLOCK-B module;
5) In the fifth stage, dimensionality reduction is performed with global pooling, and the classification result is finally obtained through a fully connected layer;
Step 5: model training and deployment
The model is trained and tested with the ImageNet dataset and the Visual Wake Words dataset, and the model trained on the VWW dataset is deployed on a microcontroller to test its performance.
Preferably, in the MCU-BLOCK-A of step 1 and the MCU-BLOCK-B of step 2, no nonlinear activation function is used after the first depthwise convolution, and ReLU activation functions are adopted between the pointwise convolution and the depthwise convolution and after the last depthwise convolution.
Preferably, the efficient channel attention mechanism in step 1 specifically includes:
firstly, global pooling is performed on the input original feature map, and then the weight of each channel is calculated with a learnable 1D convolution operation, according to the formula:

w_i = σ(C1D_k(y))

wherein C1D_k is a fast 1D convolution, k denotes how many neighboring channels participate in the attention prediction for a channel, and σ is the Sigmoid activation function; the resulting weights are then multiplied element-wise with the original feature map.
Preferably, the specific operation of the nonlinear pooling module in step 3 comprises: a receptive-field patch (r×c×k) in a specified format is extracted from the input picture, where r is the receptive field row size, c is the receptive field column size, and k is the number of receptive field channels. Features are then extracted along the rows with a fast gated recurrent neural network (FastGRNN1), yielding r feature blocks of length h_1, where h_1 is the hidden-layer size of FastGRNN1; a bidirectional FastGRNN2 is then run over these r feature blocks of length h_1 to obtain two feature blocks of length h_2. Analogously to the row-level feature extraction, c feature blocks of length h_1 are obtained at the column level and passed through the bidirectional FastGRNN2 to obtain two more feature blocks of length h_2. Finally, the four feature blocks of length h_2 are concatenated to obtain the feature vector produced by the nonlinear pooling operation on a single receptive field.
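The following is a minimal PyTorch sketch of this row/column aggregation over a single receptive field; it is an illustration under stated assumptions, not the patented implementation: nn.GRU stands in for FastGRNN, and the module and variable names (NonlinearPoolSketch, h1, h2) are ours.

    import torch
    import torch.nn as nn

    class NonlinearPoolSketch(nn.Module):
        # Aggregate one r x c x k receptive field into a single feature vector:
        # a first RNN scans each row and each column, a bidirectional RNN then
        # summarizes the r row features and the c column features, and the four
        # resulting blocks of length h2 are concatenated.
        def __init__(self, k=8, h1=8, h2=8):
            super().__init__()
            self.rnn1 = nn.GRU(k, h1, batch_first=True)   # stand-in for FastGRNN1
            self.rnn2 = nn.GRU(h1, h2, batch_first=True, bidirectional=True)  # stand-in for FastGRNN2

        def forward(self, field):                               # field: (r, c, k)
            rows, _ = self.rnn1(field)                          # scan along each row -> (r, c, h1)
            cols, _ = self.rnn1(field.transpose(0, 1).contiguous())  # scan along each column -> (c, r, h1)
            _, hr = self.rnn2(rows[:, -1, :].unsqueeze(0))      # bidirectional pass over r row features
            _, hc = self.rnn2(cols[:, -1, :].unsqueeze(0))      # bidirectional pass over c column features
            return torch.cat([hr.flatten(), hc.flatten()])      # 4 blocks of length h2

    field = torch.randn(6, 6, 8)                 # r = c = 6, k = 8 as in the embodiment
    print(NonlinearPoolSketch()(field).shape)    # torch.Size([32]) = 4 * h2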
Preferably, deploying the CNN on the MCU specifically comprises the following steps:
based on step 4, the model is constructed, the image classification dataset is loaded, training is performed, and the model weight file with the highest accuracy is saved; the weight file is converted into the Open Neural Network Exchange (ONNX) format and 8-bit asymmetric quantization is applied, with the quantization formula:

val_fp32 = scale * val_quantized
and the model is parsed with the STM32Cube.AI toolkit, the corresponding C-language base code is generated, and an upper-layer application is developed to run the image classification algorithm on the MCU.
Preferably, a device for image classification is deployed based on the MCU, the device comprising an MCU microprocessor and a camera, wherein:
an STM32F746G-DISCO board serves as the MCU; the flash memory in the MCU stores the neural network model weights, the basic code framework, and the OS; the SRAM in the MCU stores the intermediate activation values and other buffers produced during CNN inference; and an Arducam camera acquires the images.
(III) beneficial effects
The invention provides an image classification method suitable for MCU deployment based on nonlinear pooling and depthwise separable convolution. The beneficial effects are as follows:
1. The invention provides an image classification method suitable for MCU deployment based on nonlinear pooling and depthwise separable convolution. Constrained by the memory and computing power of the MCU, the conventional SE attention mechanism has many parameters and a large computation cost and is unsuitable for deployment on an MCU. Adding ECA, which introduces almost no additional parameters compared with conventional attention, lets the model pay more attention to important feature channels while avoiding the parameter count and computation cost of the fully connected layers in common attention mechanisms, thereby improving computational efficiency. The invention achieves good classification performance with fewer than 1M parameters.
2. The invention provides an image classification method suitable for MCU deployment based on nonlinear pooling and depthwise separable convolution that has a small peak memory: the first stage uses a stride-2 convolution with 8 output channels, and nonlinear pooling then downsamples the picture to 1/4 of its original size, so compared with other models the peak memory is smaller and the computation is further reduced, making the method suitable for many MCUs while obtaining good classification performance. Running the machine learning model on the MCU avoids uploading data to the cloud, which greatly protects data privacy, accelerates real-time processing and response, and greatly reduces energy consumption. The invention can be widely applied in industries such as smart home, precision agriculture, and consumer electronics.
Drawings
FIG. 1 is a block diagram of a depthwise separable convolution of the present invention;
FIG. 2 is A BLOCK diagram of an improved module MCU-BLOCK-A of the present invention;
FIG. 3 is a BLOCK diagram of an improved module MCU-BLOCK-B of the present invention;
FIG. 4 is a graph of intermediate activation value changes for a network of the present invention;
FIG. 5 is a graph comparing accuracy and peak memory with other models in the Visual Wake Words experiment of the present invention;
FIG. 6 is a graph comparing accuracy and parameter count with other models in the Visual Wake Words experiment;
FIG. 7 is a flow chart of a model training deployment of the present invention;
FIG. 8 is a block diagram of a model deployment hardware device of the present invention;
fig. 9 is a flow chart of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples:
As shown in FIGS. 1-9, an embodiment of the present invention provides an image classification method suitable for MCU deployment based on nonlinear pooling and depthwise separable convolution, which specifically comprises the following steps:
Step 1: obtaining MCU-BLOCK-A
The depthwise separable convolution of the lightweight neural network MobileNet is improved to obtain MCU-BLOCK-A. First, DWConv and PWConv are used, with a BN layer and an ECA layer added between them, realizing cross-channel interaction with almost no increase in parameter count or computation and increasing the model's classification accuracy. Finally, a layer of DWConv is appended, and the input and output of this last DWConv are connected by a residual connection, which alleviates gradient vanishing and improves the flow of information.
FIG. 1 shows the structure of MobileNet's depthwise separable convolution. FIG. 2 shows our improved MCU-BLOCK-A: we first use a DWConv to extract information channel by channel, then use ECA to learn the importance of each channel. Here global pooling is applied first, followed by a learnable 1D convolution operation that computes the weight of each channel, according to the formula:

w_i = σ(C1D_k(y))

where C1D_k is a fast 1D convolution, k denotes how many neighboring channels participate in the attention prediction for a channel, the size of k is determined adaptively by the number of channels, and σ is the Sigmoid activation function. The kernel size is given by:

k = ψ(C) = | log2(C)/γ + b/γ |_odd

where C is the number of channels, |t|_odd denotes the odd number nearest to t, γ is 2, and b is 1. The weights are then multiplied element-wise with the original feature map, and PWConv is used to interact and combine the features between channels to obtain a richer feature representation. For the last layer we use DWConv again, because DWConv has fewer parameters and a lower computation cost and is therefore better suited to resource-constrained scenarios; we accordingly increase the proportion of DWConv in the model. At the same time, a residual connection is introduced between the input and the output of the last layer, which allows a deeper network to be built and improves the representation capability and performance of the model.
Step 2: obtaining MCU-BLOCK-B
As shown in FIG. 3, starting from the MCU-BLOCK-A obtained in step 1, the residual connection between the input and the output of the last depthwise convolution is removed, a residual connection is added between the input and the output of the first pointwise convolution, and the connected output is joined to the output of the last depthwise convolution by a further residual connection. As shown in FIG. 4, the picture size is large in the early stages of a CNN, so if too many residual connections are used there, the peak memory of the model will exceed the SRAM of the MCU. In the later stages the picture size is small, so more residual connections can be used to increase the equivalent width of the model without exceeding the SRAM limit of the MCU. Introducing several residual connections therefore increases the equivalent width while keeping the network width and parameter count small, preserving accuracy.
ReLU activation functions are adopted between the first PWConv and the last DWConv in MCU-BLOCK-A and MCU-BLOCK-B and after the last DWConv, because ReLU is quantization-friendly and effectively increases the nonlinearity of the network at low parameter and computation cost. The first DWConv does not use a ReLU activation function, since ReLU would block the flow of information in low-dimensional data and thereby weaken the capacity and expressive power of the model.
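A minimal PyTorch sketch of the two blocks as just described, reusing the ECA sketch above; the channel counts and names are illustrative assumptions, and MCU-BLOCK-B keeps equal input/output widths so its residual additions are shape-compatible.

    import torch.nn as nn

    def dwconv(c):                                 # 3x3 depthwise convolution
        return nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False)

    class MCUBlockA(nn.Module):
        # DWConv (no ReLU) -> BN -> ECA -> PWConv -> ReLU -> DWConv -> ReLU,
        # with a residual connection around the last DWConv.
        def __init__(self, cin, cout):
            super().__init__()
            self.dw1, self.bn, self.eca = dwconv(cin), nn.BatchNorm2d(cin), ECA(cin)
            self.pw = nn.Conv2d(cin, cout, 1, bias=False)
            self.dw2, self.relu = dwconv(cout), nn.ReLU()

        def forward(self, x):
            y = self.relu(self.pw(self.eca(self.bn(self.dw1(x)))))
            return self.relu(self.dw2(y)) + y      # residual: input and output of last DWConv

    class MCUBlockB(MCUBlockA):
        # Same layers, but the residual from the input now targets the first
        # PWConv output, and that sum is residually connected to the last
        # DWConv output.
        def __init__(self, c):
            super().__init__(c, c)                 # equal widths keep the sums valid

        def forward(self, x):
            y = self.relu(self.pw(self.eca(self.bn(self.dw1(x)))) + x)
            return self.relu(self.dw2(y)) + y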
Step 3: obtaining the nonlinear pooling layer
After the first convolution layer, nonlinear pooling is introduced to rapidly downsample the picture and bypass the large intermediate activation layers, completing image aggregation while attenuating computation. The nonlinear pooling lets the model perform finer aggregation over a large receptive field, extracting more picture information than ordinary pooling while reducing the memory footprint and computation at run time, without losing much precision.
As shown in FIG. 4, a receptive-field patch (r×c×k) in a specified format is extracted from the input picture, where r is the receptive field row size, c is the receptive field column size, and k is the number of receptive field channels. Features are then extracted along the rows with a fast gated recurrent neural network (FastGRNN1), yielding r feature blocks of length h_1, where h_1 is the hidden-layer size of FastGRNN1; a bidirectional FastGRNN2 is then run over these r feature blocks of length h_1 to obtain two feature blocks of length h_2. Analogously to the row-level feature extraction, c feature blocks of length h_1 are obtained at the column level and passed through the bidirectional FastGRNN2 to obtain two more feature blocks of length h_2. Finally, the four feature blocks of length h_2 are concatenated to obtain the feature vector produced by the nonlinear pooling operation on a single receptive field. In the present embodiment h_1 = h_2 = 8, c = r = 6, and k = 8. After the nonlinear pooling operation is performed over the different receptive fields of the picture, this finer aggregation extracts more picture information than ordinary pooling while reducing the memory and computation used at run time. The FastGRNN formulas are as follows:
z_t = σ(W p_t + U h_{t-1} + b_z)
h̃_t = tanh(W p_t + U h_{t-1} + b_h)
h_t = (ζ(1 − z_t) + ν) ⊙ h̃_t + z_t ⊙ h_{t-1}

where p_t is the t-th input feature vector, h_{t-1} is the output of the previous step, and W, U, b_z, b_h are all parameters. z_t is the update gate, which indicates how much information from the previous state can be carried into the current state; h̃_t is the candidate hidden state, generated by fusing the current input with the previous hidden state; h_t is the resulting hidden state. Compared with GRU and LSTM, FastGRNN requires less computation and trains faster, and can effectively capture the edges and directions of a picture, making it better suited to resource-constrained devices.
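A minimal PyTorch sketch of a FastGRNN cell implementing these three formulas; ζ and ν are the trainable scalars of FastGRNN, and the class and parameter names are ours.

    import torch
    import torch.nn as nn

    class FastGRNNCell(nn.Module):
        def __init__(self, input_size, hidden_size):
            super().__init__()
            self.W = nn.Linear(input_size, hidden_size, bias=False)
            self.U = nn.Linear(hidden_size, hidden_size, bias=False)
            self.b_z = nn.Parameter(torch.zeros(hidden_size))
            self.b_h = nn.Parameter(torch.zeros(hidden_size))
            self.zeta = nn.Parameter(torch.ones(1))    # zeta: scales (1 - z_t)
            self.nu = nn.Parameter(torch.zeros(1))     # nu: additive floor on the gate

        def forward(self, p_t, h_prev):
            pre = self.W(p_t) + self.U(h_prev)         # shared term W p_t + U h_{t-1}
            z_t = torch.sigmoid(pre + self.b_z)        # update gate
            h_tilde = torch.tanh(pre + self.b_h)       # candidate hidden state
            return (self.zeta * (1 - z_t) + self.nu) * h_tilde + z_t * h_prev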
Table 1
Step 4: model construction
Conventional convolution, nonlinear pooling, the MCU-BLOCK-A and MCU-BLOCK-B modules, global pooling, and a fully connected layer are combined to construct the model. The network structure is shown in Table 1, where Input is the input picture size and channel count, Operator is the specific operation, c is the number of output channels, n is the number of repetitions, s is the stride, and ECA indicates whether the ECA module is used in the operation.
Balancing peak memory, model size, computation, and accuracy, the construction specifically comprises the following stages (a code sketch follows the list):
1) The first stage, extracting local features by convolution with a stride of 2;
2) In the second stage, feature information is extracted along the row direction and the column direction in the picture by utilizing nonlinear pooling, then the picture size is reduced rapidly, and the peak memory of the CNN model is ensured not to be higher than the static random access memory of the MCU;
3) The third stage, extracting features through A plurality of MCU-BLOCK-A modules;
4) A fourth stage, adopting a constructed MCU-BLOCK-B module;
5) In the fifth stage, dimensionality reduction is performed with global pooling, and the classification result is finally obtained through the fully connected layer.
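A minimal PyTorch sketch of this five-stage assembly, reusing the block sketches above. An ordinary stride-4 pooling stands in for the nonlinear pooling layer to keep the sketch short, and the stage widths and repeat counts are placeholders, not the values in Table 1.

    import torch
    import torch.nn as nn

    class MCUNetSketch(nn.Module):
        def __init__(self, num_classes=2):
            super().__init__()
            self.stage1 = nn.Conv2d(3, 8, 3, stride=2, padding=1)  # stage 1: stride-2 conv
            self.stage2 = nn.MaxPool2d(4)                          # stage 2: stand-in for nonlinear pooling (1/4 size)
            self.stage3 = nn.Sequential(MCUBlockA(8, 32), MCUBlockA(32, 64))   # stage 3
            self.stage4 = nn.Sequential(MCUBlockB(64), MCUBlockB(64))          # stage 4
            self.fc = nn.Linear(64, num_classes)                   # stage 5: global pool + FC

        def forward(self, x):
            x = self.stage4(self.stage3(self.stage2(self.stage1(x))))
            return self.fc(x.mean(dim=(2, 3)))                     # global average pooling

    print(MCUNetSketch()(torch.randn(1, 3, 80, 80)).shape)         # torch.Size([1, 2])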
Step 5: model training and deployment
The model's performance is tested by training and testing on the ImageNet dataset and the Visual Wake Words (VWW) dataset and deploying the model trained on the VWW dataset on an STM32F746 microcontroller. The ImageNet dataset is the most convincing benchmark, with 1,281,167 images in the training set and 50,000 images in the validation set. The images are first preprocessed: all training images are resized to 224 × 224, then randomly flipped horizontally and normalized using the mean and standard deviation. The model is trained with a stochastic gradient descent (SGD) optimizer with momentum, with a weight decay of 4 × 10^-5 and a momentum of 0.9. A cosine learning-rate decay strategy is adopted with an initial learning rate of 0.1; training runs for 400 epochs, with the learning rate finally decaying to 0.00001 and the batch size set to 1024. Training uses the PyTorch framework and 4 NVIDIA A100 GPUs. The experimental results are shown in Table 2. Compared with other lightweight models, the proposed model has fewer parameters and a smaller peak memory, so it can be deployed on MCUs with fewer memory resources, and it achieves higher accuracy.
Table 2
EtinyNet-1.0 in Table 2 denotes the result of reducing the output channels of the model's first layer to 8 and retraining while keeping the same peak RAM as our model;
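A minimal PyTorch sketch of the ImageNet optimizer and schedule configuration described above (data loading and the training step are omitted; the variable names are ours):

    import torch
    from torch.optim import SGD
    from torch.optim.lr_scheduler import CosineAnnealingLR

    model = MCUNetSketch()  # from the sketch above
    optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=4e-5)
    scheduler = CosineAnnealingLR(optimizer, T_max=400, eta_min=1e-5)  # decay 0.1 -> 0.00001

    for epoch in range(400):
        # ... one epoch over ImageNet with batch size 1024 goes here ...
        scheduler.step()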
visual wake Words (VWW for short) are the benchmark for evaluating the performance of a miniature vision model on a microcontroller. The task includes determining the presence of a person in the image, the data set being a realistic representation of a generic use case of the microcontroller-based vision application and providing a standardized basis for evaluating the accuracy and effectiveness of the small vision model. The training dataset contained 115,000 images and the validation dataset contained 8,000 images.
Specifically, this example uses the same data augmentation strategy as the ImageNet experiments, with an initial learning rate of 0.05, a cosine learning-rate decay strategy, and SGD with momentum, with a weight decay of 3e-4. Training runs for 300 epochs with a batch size of 256. This example tests the effectiveness and efficiency of the model on images with resolutions of 80 × 80, 144 × 144, and 244 × 244 pixels. The multiplier of the model is set to 0.75, i.e., all output channel counts of the model are multiplied by 0.75, so the first-layer output channels become 6, the nonlinear pooling output channels become 24, and so on. The standard conditions of the visual wake-word task are thus met, namely a peak memory not exceeding 256 KB, MMACs not exceeding 60M, and a parameter count not exceeding 300K, ensuring the model is suitable for deployment on resource-constrained MCU devices. The experimental comparison results, shown in FIGS. 5 and 6, demonstrate that better classification is achieved with a smaller peak memory and parameter count (265K). The actual deployment flow is shown in FIG. 7: first, the model file with the highest classification accuracy obtained by training on the dataset at 80 × 80 resolution is converted to the ONNX format, and then 8-bit static asymmetric quantization is applied, with the quantization formula:
val_fp32 = scale * val_quantized

where val_fp32 is the original value of a model weight, val_quantized is the quantized value, and the scale is

scale = (val_max_fp32 − val_min_fp32) / (val_max_quantized − val_min_quantized)

where val_max_fp32 and val_min_fp32 are the maximum and minimum of the original model weights, and, for the 8-bit quantization used here, val_max_quantized = 255 and val_min_quantized = 0.
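A minimal NumPy sketch of 8-bit asymmetric weight quantization under this scheme. The zero point is the standard companion of an asymmetric scale and is implied by the [0, 255] range above; the function names are ours.

    import numpy as np

    def quantize_8bit_asymmetric(w):
        # scale = (max - min) / (255 - 0); zero_point shifts the range onto [0, 255].
        w_min, w_max = float(w.min()), float(w.max())
        scale = (w_max - w_min) / 255.0
        zero_point = int(round(-w_min / scale))
        q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
        return q, scale, zero_point

    def dequantize(q, scale, zero_point):
        return scale * (q.astype(np.float32) - zero_point)

    w = np.random.randn(64, 32).astype(np.float32)
    q, s, z = quantize_8bit_asymmetric(w)
    print(np.abs(dequantize(q, s, z) - w).max())   # reconstruction error is at most ~scale/2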
The model is parsed with the STM32Cube.AI toolkit, the corresponding C-language base code is generated, and the quantized model weights are initialized as static arrays. The hardware configuration is shown in FIG. 8. The invention adopts an STM32F746G-DISCO board as the main controller and an Arducam camera to acquire images: the STM32F746G-DISCO serves as the MCU, the flash memory in the MCU stores the neural network model weights, the basic code framework, and the OS, and the SRAM in the MCU stores the intermediate activation values and other buffers produced during CNN inference. The method and device thus realize real-time detection of whether a person is present in the video.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (6)

1. An image classification method suitable for MCU deployment based on nonlinear pooling and depthwise separable convolution, characterized by comprising the following steps:
Step 1: obtain MCU-BLOCK-A
The depthwise separable convolution of the lightweight neural network MobileNet is improved to obtain MCU-BLOCK-A: one depthwise convolution and one pointwise convolution are used, a BN layer and an efficient channel attention mechanism are added between the two, a final layer of depthwise convolution is appended, and the input and output of the last depthwise convolution layer are connected by a residual connection;
Step 2: obtain MCU-BLOCK-B
Starting from the MCU-BLOCK-A obtained in step 1, the residual connection between the input and the output of the last depthwise convolution is removed, a residual connection is added between the input and the output of the pointwise convolution, and the connected output is joined to the output of the last depthwise convolution by a further residual connection;
Step 3: obtain the nonlinear pooling layer
Based on the nonlinear pooling module, nonlinear pooling is applied after the initial convolution layer to rapidly downsample the picture and bypass the large intermediate activation layers, completing image aggregation while attenuating computation;
Step 4: model construction
The model is built by combining convolution, nonlinear pooling, and the MCU-BLOCK-A and MCU-BLOCK-B modules, balancing peak memory, model size, computation, and accuracy, and comprises the following stages:
1) The first stage, extracting local features by convolution with a stride of 2;
2) In the second stage, feature information is extracted along the row direction and the column direction in the picture by utilizing nonlinear pooling, then the picture size is reduced rapidly, and the peak memory of the CNN model is ensured not to be higher than the capacity of the static random access memory of the MCU;
3) The third stage, extracting features through A plurality of MCU-BLOCK-A modules;
4) A fourth stage, adopting a constructed MCU-BLOCK-B module;
5) In the fifth stage, dimensionality reduction is performed with global pooling, and the classification result is finally obtained through a fully connected layer;
Step 5: model training and deployment
The model is trained and tested with the ImageNet dataset and the Visual Wake Words dataset, and the model trained on the VWW dataset is deployed on a microcontroller to test its performance.
2. The image classification method suitable for MCU deployment based on nonlinear pooling and depthwise separable convolution according to claim 1, wherein: in the MCU-BLOCK-A of step 1 and the MCU-BLOCK-B of step 2, no nonlinear activation function is used after the first depthwise convolution, and ReLU activation functions are adopted between the pointwise convolution and the depthwise convolution and after the last depthwise convolution.
3. The image classification method suitable for MCU deployment based on nonlinear pooling and depthwise separable convolution according to claim 1, wherein the efficient channel attention mechanism in step 1 specifically comprises:
firstly, global pooling is performed on the input original feature map, and then the weight of each channel is calculated with a learnable 1D convolution operation, according to the formula:
w_i = σ(C1D_k(y))
wherein C1D_k is a fast 1D convolution, k denotes how many neighboring channels participate in the attention prediction for a channel, and σ is the Sigmoid activation function; the resulting weights are then multiplied element-wise with the original feature map.
4. The image classification method suitable for MCU deployment based on nonlinear pooling and depthwise separable convolution according to claim 1, wherein the specific operation of the nonlinear pooling module in step 3 comprises: a receptive-field patch (r×c×k) in a specified format is extracted from the input picture, where r is the receptive field row size, c is the receptive field column size, and k is the number of receptive field channels; features are then extracted along the rows with a fast gated recurrent neural network (FastGRNN1), yielding r feature blocks of length h_1, where h_1 is the hidden-layer size of FastGRNN1, and a bidirectional FastGRNN2 is run over these r feature blocks of length h_1 to obtain two feature blocks of length h_2; analogously to the row-level feature extraction, c feature blocks of length h_1 are obtained at the column level and passed through the bidirectional FastGRNN2 to obtain two more feature blocks of length h_2; finally, the four feature blocks of length h_2 are concatenated to obtain the feature vector produced by the nonlinear pooling operation on a single receptive field.
5. The image classification method suitable for MCU deployment based on nonlinear pooling and depthwise separable convolution according to claim 1, comprising deploying the CNN on the MCU, specifically comprising the following steps:
based on step 4, the model is constructed, the image classification dataset is loaded, training is performed, and the model weight file with the highest accuracy is saved; the weight file is converted into the Open Neural Network Exchange format and 8-bit asymmetric quantization is applied, with the quantization formula:
val_fp32 = scale * val_quantized
and the model is parsed with the STM32Cube.AI toolkit, the corresponding C-language base code is generated, and an upper-layer application is developed to run the image classification algorithm on the MCU.
6. The image classification method suitable for MCU deployment based on nonlinear pooling and depthwise separable convolution according to claim 1, comprising a device for image classification deployed based on the MCU, the device comprising an MCU microprocessor and a camera, wherein:
an STM32F746G-DISCO board serves as the MCU; the flash memory in the MCU stores the neural network model weights, the basic code framework, and the OS; the SRAM in the MCU stores the intermediate activation values and other buffers produced during CNN inference; and an Arducam camera acquires the images.
CN202311208995.6A 2023-09-19 2023-09-19 Image classification method suitable for MCU deployment based on nonlinear pooling and depthwise separable convolution Pending CN117197576A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311208995.6A CN117197576A (en) 2023-09-19 2023-09-19 Image classification method suitable for MCU deployment based on nonlinear pooling and depthwise separable convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311208995.6A CN117197576A (en) 2023-09-19 2023-09-19 Image classification method suitable for MCU deployment based on nonlinear pooling and depthwise separable convolution

Publications (1)

Publication Number Publication Date
CN117197576A (en) 2023-12-08

Family

ID=88988523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311208995.6A CN117197576A (en) 2023-09-19 2023-09-19 Image classification method suitable for MCU deployment based on nonlinear pooling and depthwise separable convolution

Country Status (1)

Country Link
CN (1) CN117197576A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination