CN113222148A - Neural network reasoning acceleration method for material identification - Google Patents

Neural network reasoning acceleration method for material identification

Info

Publication number
CN113222148A
Authority
CN
China
Prior art keywords: network, quantization, layer, reasoning, delay
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110549464.8A
Other languages
Chinese (zh)
Other versions
CN113222148B (en)
Inventor
孟文超
朱建新
徐金明
董超
陈军
陈雪超
林学忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202110549464.8A priority Critical patent/CN113222148B/en
Publication of CN113222148A publication Critical patent/CN113222148A/en
Application granted granted Critical
Publication of CN113222148B publication Critical patent/CN113222148B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention discloses a neural network reasoning acceleration method for material identification, which addresses the problem of severe reasoning delay of networks at the edge end. First, the invention uses the cosine distance to judge the difference in network feature distribution before and after quantization, thereby determining the sensitivity of different layers to low-bit quantization; quantization-sensitive layers in the network are then quantized to higher bit widths using mixed-precision quantization. Second, to address the problem that the material identification network division method is strongly affected by network bandwidth, a regularization term is added to the traditional model division algorithm, and lower-bit quantization layers are preferentially selected as division points, reducing the influence of network bandwidth on reasoning. Finally, to address the large search space of mixed precisions and division points in the co-design of mixed-precision quantization and network division, the invention first determines the mixed-precision quantization scheme according to the network accuracy requirement based on a greedy algorithm, and then selects the division point with the smallest total delay.

Description

Neural network reasoning acceleration method for material identification
Technical Field
The invention belongs to the field of deep neural network model optimization and reasoning acceleration, and particularly relates to a neural network reasoning acceleration method for material identification.
Background
With the continuous development of machine learning and deep learning, Deep Neural Networks (DNNs) are widely used in various aspects such as computer vision, natural language processing, and data processing. Taking biomass power plant material identification as an example, early processing methods extracted material features for classification by using image processing and identification methods. With the continuous development of deep learning, a Convolutional Neural Network (CNN) is widely applied to the field of image classification and plays an important role in the material identification task processing of a biomass power plant. The application of the neural network comprises both Training (Training) and reasoning (Inference). The neural network training enables the network to learn the characteristics in the data by inputting supervised data into the network, mainly comprises two stages of forward reasoning and backward propagation, wherein the characteristics are learned in the forward reasoning, and the network parameters are updated by calculating gradients through the backward propagation, so that the accuracy of the network is ensured. Therefore, the neural network training process is a computationally intensive task, has high requirements on hardware devices, is usually performed on a cloud server, and consumes long time and resources. The neural network reasoning is the application of the trained neural network, and is different from the training process, the neural network reasoning only comprises a forward reasoning process, data is input to the network, and reasoning is carried out according to the parameters of the trained network, so that a final result is obtained. Therefore, the inference process involves a low amount of computation and can be performed on a device with limited computing power.
Early neural network training and reasoning were both carried out on cloud servers as cloud computing developed rapidly: the terminal device transmits input data to the cloud for computation, and the computation result is returned to the terminal device after reasoning is finished. However, the cloud-based approach puts pressure on the network bandwidth; when the data volume is large, the computation delay increases, and the privacy of terminal data cannot be protected. Edge computing solves this problem well: input data is processed at the edge end close to the source, which reduces bandwidth pressure while protecting data privacy. The present method is based on edge computing and uses a convolutional neural network to perform material identification reasoning at the edge end.
However, most of the existing neural networks belong to computation-intensive tasks, and meanwhile, the computation and storage capabilities of the edge-end equipment are usually limited, so that the material identification inference speed on the edge-end equipment is slow, and the computation delay is large. The existing method for reducing the neural network inference delay comprises network quantification, model division and the like. Network quantization reduces the model size by quantizing network weights from 32-bit floating point numbers to low-bit fixed point numbers, thereby reducing the bandwidth delay in accessing data from memory during computation. The model division is a cloud-edge collaborative reasoning method, and the neural network reasoning is accelerated by transmitting a layer with intensive computation in the network to cloud equipment with stronger computation performance.
Although network quantization can reduce computation delay and accelerate edge-end reasoning, low bit quantization can reduce network reasoning accuracy and reduce material identification reasoning accuracy. Meanwhile, the model division depends on the transmission of data, so that the method is greatly influenced by network bandwidth, and when the network bandwidth is blocked, the inference cannot be accelerated, and the inference delay of the network is longer. The invention deeply analyzes the defects of the two methods and provides a corresponding method for optimizing the problems in the two technologies.
Disclosure of Invention
The invention aims to provide a neural network reasoning acceleration method facing material identification aiming at the defects of the prior art, and solves the problem of serious reasoning delay of an edge end network.
Firstly, the cosine distance is creatively used for judging the characteristic distribution difference of the material identification network before and after quantization, so that the sensitivity of different layers in the network to quantization to low bits is determined; and then, the layer sensitive to quantization in the network is quantized to a higher bit by using mixed precision quantization, and the layer insensitive to quantization keeps low bit quantization, so that the accuracy of material identification is greatly improved on the premise of reasoning acceleration.
Secondly, aiming at the problem that the material identification network division needs to transmit intermediate output characteristics and is greatly influenced by network bandwidth, the method improves the traditional model division algorithm, adds a regularization item containing a division point quantization flag bit in the traditional model division algorithm, and preferentially selects a lower bit quantization layer as a division point.
Finally, the search space for mixed-precision quantization is exponential in the number of network layers, while the search space for the network division point equals the number of layers; that is, if the network has N layers and each layer has M candidate quantization precisions, the combined search space of division points and quantization precisions contains M^N × N solutions, which is too large to search exhaustively. Therefore, for the problem of the large search space of mixed precisions and division points in the co-design of mixed-precision quantization and network division, the invention provides a solution method using a greedy algorithm: each step of the search selects the currently best (lowest-delay) decision to determine the network mixed precision and the division point. The network reasoning accuracy is related only to the quantization scheme, while the reasoning delay is related to both the quantization scheme and the division point selection; therefore, the mixed-precision quantization scheme with the lowest delay that satisfies the accuracy requirement is first determined according to the minimum network accuracy requirement, and then the division point with the smallest total delay is selected according to the determined network quantization structure, so that the accuracy requirement of material identification is met and the identification time is minimized.
The purpose of the invention is realized by the following technical scheme: a neural network reasoning acceleration method for material identification comprises the following steps:
1) training a full-precision material identification network based on a convolutional neural network architecture by utilizing a training set in a material identification data set, recording the accuracy of the full-precision network, and simultaneously storing the trained network weight;
2) randomly extracting a part of pictures from the training set as a quantitative calibration data set;
3) the trained full-precision network is quantized after being trained by using a calibration data set, the quantization precision adopts 16-bit quantization to obtain the accuracy of the network after the full 16-bit quantization, and the output characteristic of the last full connection layer is recorded;
4) based on the network quantized by full 16 bits, sequentially adjusting the quantization precision of each layer to 8 bits from the first layer, keeping the other layers unchanged at 16 bits, and recording the output characteristics of the last full connection layer of the network after each layer is adjusted to 8 bits;
5) calculating cosine distances between the full 16-bit quantized output features obtained in the step 3) and the output features corresponding to the 8-bit quantization adjusted by each layer obtained in the step 4), and arranging all the cosine distances from small to large as the sequence of the sensitivity of different layers of the network to the 8-bit quantization;
6) acquiring the lowest accuracy requirement on a material identification network during actual reasoning;
7) quantizing the network to full 8 bits by using a post-training quantization method, sequencing the quantization sensitivity of different layers obtained in the step 5) according to the lowest accuracy requirement, and sequentially adjusting the layer which is most sensitive to quantization in the full 8-bit quantization network to 16-bit quantization until the network accuracy reaches the lowest requirement, thereby obtaining a mixed precision network;
8) distributing the mixed precision network and the material identification data set obtained in the step 7) to cloud equipment and edge end equipment;
9) the cloud end and the edge end respectively use the material identification data sets to carry out mixed precision network reasoning, calculate the reasoning delay of each layer at the cloud end and the edge end, and record the output data volume of each layer;
10) the edge end records the real-time network bandwidth condition;
11) determining a network division point by the edge terminal according to the cloud edge reasoning delay and the output data amount of each layer obtained in the step 9) and the current bandwidth of the network obtained in the step 10) by taking the minimum overall network reasoning delay as an optimization target according to a network division algorithm, and uploading the division point result to a cloud end;
12) the edge terminal carries out a reasoning process from a first layer to a division point according to the input material identification data set, and uploads a reasoning result to a cloud terminal;
13) the cloud end performs network reasoning from the division point to the last layer according to the reasoning result before the division point sent by the edge end as input;
14) and the cloud end sends the finally obtained reasoning result, namely the material identification result to the edge end, and the reasoning is finished.
Furthermore, the convolutional neural network architecture adopts a directed acyclic graph model, and the accuracy of the trained full-precision model must be greater than the minimum accuracy required by actual reasoning; the deep learning framework is PyTorch, and the reasoning process of the network on the cloud and the edge end is implemented based on PyTorch.
Further, in the step 3), the quantization method after training is implemented as follows:
the quantization parameter of the network weight is determined according to the value distribution of the trained network parameter, and the quantization method adopts symmetric quantization;
the network activated quantization parameter is determined according to a specific range of a quantization calibration data set in the forward reasoning process, and the activated quantization method adopts asymmetric quantization; in order to reduce the activated quantization error, the activated quantization calibration process adopts a pseudo quantization operation of quantizing to 8bit fixed point numbers and then quantizing to 32 bit floating point numbers in an inverse manner, and the quantization and inverse quantization formulas are as follows:
Quantization: q = round(r / scale) + zero_point
Inverse quantization: r = scale × (q - zero_point)
Where r represents a 32-bit floating point number before quantization, q represents a fixed point number after quantization, scale represents a scaling factor, zero _ point represents a zero offset, and round () represents an integer function.
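By way of illustration, the following is a minimal PyTorch-style sketch of the post-training quantization and dequantization operations described above; the function names and the calibration helper are illustrative assumptions rather than the exact implementation of the invention.

import torch

def calibrate_asymmetric(calib_activations, num_bits=8):
    # Asymmetric quantization parameters estimated from the observed activation range.
    # Activations are non-negative after ReLU, so r_min is typically 0 and zero_point is near q_min.
    q_min, q_max = 0, 2 ** num_bits - 1
    r_min = min(float(a.min()) for a in calib_activations)
    r_max = max(float(a.max()) for a in calib_activations)
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = int(round(q_min - r_min / scale))
    return scale, zero_point, q_min, q_max

def quantize(r, scale, zero_point, q_min, q_max):
    # q = round(r / scale) + zero_point, clamped to the representable fixed-point range
    return torch.clamp(torch.round(r / scale) + zero_point, q_min, q_max)

def dequantize(q, scale, zero_point):
    # r = scale * (q - zero_point)
    return scale * (q - zero_point)

def fake_quantize(r, scale, zero_point, q_min, q_max):
    # Pseudo-quantization used during calibration: quantize to fixed point,
    # then dequantize back to 32-bit floating point to emulate the quantization error.
    return dequantize(quantize(r, scale, zero_point, q_min, q_max), scale, zero_point)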
Further, the cosine distance in step 5) is calculated as follows:
sequentially quantizing the convolution layer and the full-connection layer in the network to 8 bits, keeping 16 bits of quantization of other layers unchanged, and inputting a quantization calibration data set for quantization;
inputting the test set in the material identification data set into the quantized network to obtain the output feature of the last fully connected layer of the quantized network, and calculating the cosine distance between this output feature and the corresponding fully connected layer output feature under full 16-bit quantization;
and repeating the step for multiple times of measurement, taking an average value as the cosine distance of the output characteristics before and after quantization, and arranging the values from small to large as the sequence of different layers of the network to the quantization sensitivity.
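As an illustrative sketch of this sensitivity ranking, the following assumes that the fully-connected-layer output features have already been collected for the fully 16-bit quantized network and for each per-layer 8-bit variant; the helper only computes and sorts the cosine similarities (what the text calls cosine distances), and its name is an assumption.

import torch
import torch.nn.functional as F

def rank_layers_by_sensitivity(full16_features, per_layer_features):
    """full16_features: FC output of the fully 16-bit quantized network, shape [batch, classes].
    per_layer_features: dict {layer_name: FC output with only that layer quantized to 8 bits}.
    Returns layer names ordered from most to least sensitive to 8-bit quantization."""
    scores = {}
    for name, feats in per_layer_features.items():
        # Average cosine similarity over the test batch; values close to 1 mean the output
        # distribution barely changed, i.e. the layer is insensitive to 8-bit quantization.
        scores[name] = F.cosine_similarity(full16_features, feats, dim=1).mean().item()
    # Smaller value (further from 1) = larger change = more sensitive, so sort ascending.
    return sorted(scores, key=scores.get)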
Further, the method for determining the mixed precision network in step 7) is as follows:
according to the quantization sensitivity sequence of different network layers obtained in the step 5), selecting the most sensitive layer in the rest layers as a 16-bit quantization layer in each step until the final accuracy is greater than the required lowest accuracy;
the hybrid precision quantization method is as follows:
scale = (r_max - r_min) / (q_max - q_min)
zero_point = round(q_min - r_min / scale)
q = round(r / scale) + zero_point
performing channel-by-channel symmetric quantization on the network weights and layer-by-layer asymmetric quantization on the network activations; when a layer is quantized to 16 bits, the weight quantization uses q_max = 2^15 - 1 and q_min = -2^15, and the activation quantization uses q_max = 2^16 - 1 and q_min = 0; when a layer is quantized to 8 bits, the weight quantization uses q_max = 2^7 - 1 and q_min = -2^7, and the activation quantization uses q_max = 2^8 - 1 and q_min = 0.
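The greedy selection described above can be sketched as follows; build_mixed_network() and evaluate_accuracy() are hypothetical helpers standing in for the post-training quantization and test-set evaluation steps, not part of the patented implementation.

def select_mixed_precision(sensitivity_order, acc_min, build_mixed_network, evaluate_accuracy):
    """sensitivity_order: layer names from most to least sensitive to 8-bit quantization.
    acc_min: minimum accuracy required by the material identification task.
    Returns the mixed-precision network and the list of layers promoted to 16 bits."""
    layers_16bit = []                               # start from the fully 8-bit quantized network
    net = build_mixed_network(layers_16bit)
    while evaluate_accuracy(net) < acc_min and len(layers_16bit) < len(sensitivity_order):
        # Promote the most quantization-sensitive remaining layer to 16-bit quantization.
        layers_16bit.append(sensitivity_order[len(layers_16bit)])
        net = build_mixed_network(layers_16bit)
    return net, layers_16bit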
Further, in step 9), the output data volume is calculated as follows: since the output feature of each network layer is expressed as a four-dimensional tensor [N, C, H, W], where N is the batch size, C is the number of channels, H is the feature-map height, and W is the feature-map width, the output data volume is N × C × H × W × b in bits, where b is the actual quantization bit width; the calculated output data volume of each network layer is stored in the edge-end memory.
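A small sketch of this computation, checked against the conv1_1 figure given in the embodiment (the function name is illustrative):

def output_data_volume_bits(N, C, H, W, b):
    # Output feature volume in bits for an [N, C, H, W] tensor quantized to b bits.
    return N * C * H * W * b

# Example: the conv1_1 output [32, 64, 32, 32] at 8-bit quantization from the embodiment
bits = output_data_volume_bits(32, 64, 32, 32, 8)
print(bits / 8 / 1024 / 1024)   # 2.0 (MB), matching the 2 MB listed for conv1_1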
Further, the implementation of the network partitioning algorithm in step 11) is as follows:
recording the inference delay of each layer of the mixed-precision network on the edge end, T_e = [T_e1, T_e2, ..., T_eL], where L is the total number of network layers; recording the inference delay of each layer of the mixed-precision network on the cloud, T_c = [T_c1, T_c2, ..., T_cL]; recording the output data volume of each layer of the network, S = [S_1, S_2, ..., S_{L-1}]; and recording the current network bandwidth B;
adding a regularization term λ × Q, where λ is the scale factor of the regularization term; the larger λ is, the larger the effect of the regularization term in the optimization formula and the stronger the bias toward selecting an 8-bit quantization layer as the division point; Q is a quantization flag: Q = 1 if the layer is a 16-bit quantization layer, and Q = 0 if the layer is an 8-bit quantization layer;
the network overall reasoning delay comprises four parts: edge-end reasoning delay, data transmission delay, cloud reasoning delay, and regularization; by selecting a proper network division point p, the total reasoning delay T_total is minimized, and the corresponding optimization problem is expressed as follows:
T_total(p) = Σ_{i=1}^{p} T_ei + S_p / B + Σ_{i=p+1}^{L} T_ci + λ × Q_p, minimized over the division point p,
where Σ_{i=1}^{p} T_ei is the edge-end reasoning delay, Σ_{i=p+1}^{L} T_ci is the cloud reasoning delay, and S_p / B is the transmission delay of the intermediate output features.
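A minimal sketch of the division-point search implementing the optimization above; the list indexing and unit handling (ms, MB, MB/s) are illustrative assumptions.

def choose_division_point(T_e, T_c, S, B, Q, lam):
    """T_e, T_c: per-layer inference delays on the edge and the cloud (ms), length L;
    S: output data volume of layers 1..L-1 (MB); B: current bandwidth (MB/s);
    Q: quantization flags of layers 1..L-1 (1 = 16-bit layer, 0 = 8-bit layer);
    lam: regularization scale factor. Returns (division point p, total delay in ms)."""
    L = len(T_e)
    best_p, best_total = None, float("inf")
    for p in range(1, L):                       # divide after layer p (layers are 1-indexed)
        edge = sum(T_e[:p])                     # layers 1..p run on the edge device
        transfer = S[p - 1] / B * 1000.0        # intermediate feature transmission delay, in ms
        cloud = sum(T_c[p:])                    # layers p+1..L run on the cloud device
        total = edge + transfer + cloud + lam * Q[p - 1]
        if total < best_total:
            best_p, best_total = p, total
    return best_p, best_total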
Further, the edge end obtains a division point with the minimum total delay according to the inference delay and the data transmission delay of each layer of the mixed precision network at the cloud side, and uploads the division point result to the cloud end; and the edge terminal executes a layer from the first layer of the network to the division point layer, transmits the output characteristics after the division point layer to the cloud terminal to execute the inference of the rear layer, and finally transmits the inference result to the edge terminal equipment.
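For illustration, a sketch of the collaborative execution just described, with hypothetical send_to_cloud()/recv_from_cloud() transport helpers; this is a sketch under those assumptions, not the patented implementation itself.

import torch

def edge_side_inference(layers, x, p, send_to_cloud, recv_from_cloud):
    """Run layers[0:p] on the edge device, offload the remaining layers to the cloud."""
    with torch.no_grad():
        for layer in layers[:p]:
            x = layer(x)
    send_to_cloud(p, x)          # upload the division point index and the intermediate features
    return recv_from_cloud()     # final material identification result returned by the cloud

def cloud_side_inference(layers, p, x):
    """Continue from the received intermediate features and run layers[p:] on the cloud."""
    with torch.no_grad():
        for layer in layers[p:]:
            x = layer(x)
        return x.argmax(dim=1)   # predicted class index, i.e. the material identification result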
Furthermore, when the material identification task is executed, the result after neural network inference is only a number, so that the delay of the inference result transmitted from the cloud end to the edge end is ignored.
Furthermore, the edge-end device adopts a Raspberry Pi, a mobile phone, or a computer, and the cloud device adopts a server.
The invention has the beneficial effects that: the invention provides a neural network reasoning acceleration framework facing a material identification task, which reduces the reasoning delay of a neural network at an edge end by combining network quantization and network division, and simultaneously determines a network structure and division points by adopting a greedy algorithm in order to solve the problem of large quantization precision and division point selection search space in mixed precision quantization, thereby solving the problem of long reasoning delay at the edge end.
Drawings
FIG. 1 is a flow chart of a neural network reasoning acceleration method for material identification according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the VGG16 network structure provided by an embodiment of the present invention (only the convolutional layers and fully connected layers are shown);
fig. 3 is a schematic diagram of a network after quantization of mixed precision according to an embodiment of the present invention, where a dashed box indicates that the layer is quantized with 8 bits, and a solid box indicates that the layer is quantized with 16 bits;
fig. 4 is a schematic diagram of a network division result of the VGG16 network according to the embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
The invention provides a neural network reasoning acceleration method facing material identification, which comprises the steps of firstly, aiming at the problem of low reasoning accuracy rate caused by low-precision quantification, determining the sensitivity of different layers of a network to quantification by adopting cosine distance so as to carry out mixed precision quantification; secondly, the influence of network bandwidth on network division is reduced by adding a regularization term to a division algorithm; and finally, combining network quantization and network division by a greedy algorithm to further reduce network inference delay.
The search space for the joint selection of the network mixed precision and the division point is M^N × N (N is the number of network layers and M is the number of quantization precision options per layer). Taking VGG16 as an example, each layer has 8-bit and 16-bit quantization schemes, so the search space contains 2^16 × 16 = 1,048,576 schemes. Therefore, the optimal solution cannot be found by measuring the actual inference accuracy and delay of every candidate. The invention solves this search problem based on a greedy algorithm, i.e., each search step selects the scheme with the currently smallest delay; the inference accuracy of the network is related only to the mixed-precision quantization scheme and not to the division point selection, while the inference delay is related to both. Therefore, the method is divided into two stages: determining the mixed-precision network structure and determining the division point. A low-delay network that meets the accuracy requirement is determined by searching over the mixed precision, and the division point is then determined according to the mixed-precision network structure. Finally, the network layers before the division point are inferred at the edge end, the layers after the division point are inferred on the cloud, and the inference result is returned to the edge end.
Fig. 1 is a neural network reasoning acceleration method for material identification according to an embodiment of the present invention, which specifically includes the following steps:
1) because material picture data is scarce, the invention uses a public classification data set (in this embodiment, the Cifar10 data set is selected; it contains 10 object classes, the training set contains 50000 pictures, the test set contains 10000 pictures, and each class has 6000 pictures) in place of material picture data to train the material recognition network (a convolutional neural network; VGG16 is selected in this embodiment), records the accuracy of the full-precision network, and saves the trained network weights;
2) preparing a quantized calibration data set: randomly selecting 300 pictures from a training set, wherein each class of object picture is 30 pictures;
3) carrying out post-training quantization (PTQ) on the trained VGG16 network by using a calibration data set, wherein the quantization precision adopts 16-bit quantization to obtain the accuracy of the fully 16-bit quantized network, and recording the output characteristic T0 of the last full-connection layer FC 3;
4) based on the network quantized by full 16 bits, sequentially adjusting the quantization precision of each layer to 8 bits from the first layer, keeping the other layers unchanged at 16 bits, and recording the output characteristics of the last full connection layer of the network after each layer is adjusted to 8 bits;
5) calculating cosine distances between the full 16-bit quantized output features obtained in the step 3) and the layers obtained in the step 4) adjusted to 8-bit quantized output features, and arranging all the cosine distances from small to large as a ranking of 8-bit quantized sensitivity of different layers of the VGG16 network (the cosine distance values are between [ -1,1], the closer to 1, the more similar the two output features are, the less sensitive the layer is to quantization);
6) obtaining the minimum accuracy requirement ACC_min for the material identification network during actual reasoning;
7) Quantizing the network to full 8 bits by using a post-training quantization (PTQ) method, sequencing the quantized sensitivity of different layers obtained in the step 5) according to the lowest accuracy requirement, and sequentially adjusting the layer which is most sensitive to quantization in the full 8-bit quantized network to 16-bit quantization until the network accuracy reaches the lowest requirement, thereby obtaining a mixed precision network;
8) distributing the mixed-precision network obtained in step 7) and the Cifar10 data set to the cloud device and the edge-end device; the edge-end device is generally a device with limited computing power, such as a Raspberry Pi, a mobile phone, or a computer, and the cloud device is generally a device with strong computing power, typically a server;
9) the cloud end and the edge end respectively use a Cifar10 data set to carry out mixed precision network reasoning, calculate the reasoning delay of each layer at the cloud end and the edge end, and record the output data volume of each layer;
10) the edge end records the real-time network bandwidth condition;
11) the edge end determines a network division point according to the cloud edge reasoning delay and the output data amount of each layer obtained in the step 9) and the current bandwidth of the network obtained in the step 10) by taking the minimum network overall reasoning delay as an optimization target according to a network division algorithm; uploading the division point result to a cloud end;
12) the edge terminal carries out a reasoning process from a first layer to a division point according to an input Cifar10 data set, and uploads a reasoning result to a cloud;
13) the cloud end performs network reasoning from the division point to the last layer according to the reasoning result before the division point sent by the edge end as input;
14) and the cloud end sends the finally obtained reasoning result, namely the material identification result to the edge end, and the reasoning is finished.
Furthermore, the method provided by the invention is not limited to chain models such as VGG16 and is also applicable to general Directed Acyclic Graph (DAG) models. In addition, the reasoning task is not limited to material identification; the method is also applicable to other picture recognition tasks, or to target detection and image segmentation tasks, such as SSD, YOLO-series, and RCNN-series models. In addition, the accuracy of the trained full-precision model must be greater than the minimum accuracy ACC_min required by actual reasoning.
Furthermore, the deep learning framework adopted by the method is PyTorch, and the reasoning process of the network on the cloud and the edge end is implemented based on PyTorch.
Further, in the step 3), in order to accelerate the quantization speed, the invention quantizes the model by using a post-training quantization (PTQ) method which is simple in operation and does not need retraining. The post-training quantization method is implemented as follows: the quantization parameters (scaling factor scale and zero-point offset) of the network weight are directly determined according to the value distribution of the trained network parameters, and the quantization method adopts symmetric quantization; the network activated quantization parameter is determined according to the specific range of the quantization calibration data set in the forward reasoning process, and because the activation function adopts the ReLU function and the activation values are all non-negative values, the activated quantization method adopts asymmetric quantization; meanwhile, in order to reduce the activated quantization error, the activated quantization calibration process adopts a pseudo quantization operation of firstly quantizing to 8-bit fixed point numbers and then inversely quantizing to 32-bit floating point numbers. The specific quantization and dequantization formulas are as follows:
Quantization: q = round(r / scale) + zero_point
Inverse quantization: r = scale × (q - zero_point)
Where r represents a 32-bit floating point number before quantization, q represents a fixed point number after quantization, scale represents a scaling factor, zero _ point represents a zero offset, and round () represents an integer function.
Because the output feature space of a neural network diverges outward from the origin, whether two samples belong to the same class can be judged from the included angle between the vectors corresponding to the two samples in the output space; this angle corresponds to the cosine distance, and a smaller angle means the two samples are more similar. Therefore, the invention proposes to use the cosine distance to measure the distance between the FC3 output features before and after each layer of the network is quantized to 8 bits. The distance value ranges over [-1, 1]; the closer the value is to 1, the more similar the output feature distributions before and after quantization are, i.e., quantizing that layer to 8 bits has less influence on the network output, and the layer is less sensitive to quantization.
Further, the cosine distance in step 5) is calculated as follows: sequentially quantizing the convolution layer and the full-connection layer in the network to 8 bits, keeping 16 bits of quantization of other layers unchanged, and inputting a quantization calibration data set for quantization; and inputting the testing set of the Cifar10 into the quantized network to obtain the output characteristic of the last full connection layer of the quantized network, and calculating the cosine distance between the output characteristic and the full 16-bit quantized output characteristic. And repeating the step for multiple times of measurement, taking an average value as the cosine distance of the output characteristics before and after quantization, and arranging the values from small to large as the sequencing of different layers of the VGG16 network on the quantization sensitivity.
Further, the method for determining the mixed precision network in step 7) is as follows: according to the quantization sensitivity sequence of different network layers obtained in the step 5), in order to further reduce delay, the most sensitive layer in the remaining layers is selected as a 16-bit quantization layer in each step until the final accuracy is greater than the required lowest accuracy.
The specific mixing precision quantification method is as follows:
scale = (r_max - r_min) / (q_max - q_min)
zero_point = round(q_min - r_min / scale)
q = round(r / scale) + zero_point
where r is the 32-bit floating-point number before quantization, r_max is the maximum floating-point value before quantization, r_min is the minimum floating-point value before quantization, q is the fixed-point number after quantization, q_max is the maximum fixed-point value after quantization, q_min is the minimum fixed-point value after quantization, scale is the scaling factor, zero_point is the zero offset, and round() is the rounding function. The invention adopts channel-by-channel symmetric quantization for the network weights and layer-by-layer asymmetric quantization for the network activations. When a layer is quantized to 16 bits, the weight quantization uses q_max = 2^15 - 1 and q_min = -2^15, and the activation quantization uses q_max = 2^16 - 1 and q_min = 0. When a layer is quantized to 8 bits, the weight quantization uses q_max = 2^7 - 1 and q_min = -2^7, and the activation quantization uses q_max = 2^8 - 1 and q_min = 0.
Further, the output data volume in step 9) is calculated as follows: since the output feature of each network layer is expressed as a four-dimensional tensor [N, C, H, W], where N is the batch size, C is the number of channels, H is the feature-map height, and W is the feature-map width, the output data volume is N × C × H × W × b in bits, where b is the actual quantization bit width; the calculated output data volume of each network layer is stored in the edge-end memory.
Because the computing power of the cloud device is stronger than that of the edge device, the inference delay of the same quantization layer on the edge is far higher than that on the cloud. It is because of this difference that the total inference delay can be minimized by choosing a suitable partition point.
Further, the network division algorithm in step 11) is implemented as follows: record the inference delay of each layer of the mixed-precision network on the edge end, T_e = [T_e1, T_e2, ..., T_eL], where L is the total number of network layers; record the inference delay of each layer of the mixed-precision network on the cloud, T_c = [T_c1, T_c2, ..., T_cL]; record the output data volume of each layer of the network, S = [S_1, S_2, ..., S_{L-1}]; and record the current network bandwidth B. To reduce the influence of the network bandwidth on the division, a regularization term λ × Q is added, where λ is the scale factor of the regularization term and Q is the quantization flag: Q = 1 if the layer is a 16-bit quantization layer, and Q = 0 if it is an 8-bit quantization layer. When the network bandwidth is poor, the transmission delay becomes the main factor limiting the inference delay; since the output of a 16-bit quantization layer is twice the output data volume of an 8-bit quantization layer, adding the regularization term makes 8-bit quantization layers preferentially selected as division points in order to reduce the inference delay. The larger the value of λ, the larger the effect of the regularization term in the optimization formula and the stronger the bias toward choosing an 8-bit quantization layer as the division point. Therefore, the overall network inference delay comprises four parts: edge-end inference delay, data transmission delay, cloud inference delay, and regularization. By selecting a proper network division point p, the total inference delay T_total is minimized; the corresponding optimization problem is expressed as follows:
T_total(p) = Σ_{i=1}^{p} T_ei + S_p / B + Σ_{i=p+1}^{L} T_ci + λ × Q_p, minimized over the division point p,
where Σ_{i=1}^{p} T_ei is the edge-end inference delay, Σ_{i=p+1}^{L} T_ci is the cloud inference delay, and S_p / B is the transmission delay of the intermediate output features.
After the inference delay and the data transmission delay of each layer of the mixed precision network at the cloud side are obtained at the edge end, the edge end calculates the optimization problem, so that the division point with the minimum total delay is obtained, and the division point result is uploaded to the cloud end. And the edge terminal executes a layer from the first layer of the network to the division point layer, transmits the output characteristics after the division point layer to the cloud terminal to execute the inference of the rear layer, and finally transmits the inference result to the edge terminal equipment.
Furthermore, when the material identification task is executed, the result after neural network inference is only a number, so that the delay of the inference result transmitted from the cloud end to the edge end is ignored for the convenience of calculation.
Examples
The embodiment of the invention adopts a VGG16 network to carry out reasoning on a Cifar10 data set to simulate the forward reasoning process of material identification, and the core concept is as follows: according to the lowest reasoning accuracy requirement of the material identification task, a greedy algorithm is utilized to determine a mixed precision quantization network meeting the lowest delay of the accuracy, the reasoning delay and the network bandwidth of different layers of the quantization network on the cloud side equipment are analyzed, and finally a division point is determined so as to accelerate the network reasoning of the material identification.
The VGG16 network model is shown in FIG. 2, and comprises 13 convolutional layers and 3 fully-connected layers;
the accuracy of the full-precision VGG16 on the Cifar10 test set is 90%;
after 8-bit training, the accuracy of the quantized network on a Cifar10 test set is 70.6%;
according to a greedy algorithm, the sensitivity of the different layers of the network to quantization is first determined: the output features T0 of the last fully connected layer (FC3) of the fully 16-bit quantized network are obtained first; then the 16 convolutional and fully connected layers are quantized to 8 bits one at a time, with the remaining network layers kept at 16-bit quantization, and the cosine distances between the corresponding FC3 output features and T0 are obtained as follows:
conv1_1 is quantized to 8 bits, the remaining layers are kept at 16 bits, and the cosine distance is 0.945;
conv1_2 is quantized to 8 bits, the remaining layers are kept at 16 bits, and the cosine distance is 0.827;
conv2_1 is quantized to 8 bits, the remaining layers are kept at 16 bits, and the cosine distance is 0.973;
conv2_2 is quantized to 8 bits, the remaining layers are kept at 16 bits, and the cosine distance is 0.873;
conv3_1 is quantized to 8 bits, the remaining layers are kept at 16 bits, and the cosine distance is 0.925;
conv3_2 is quantized to 8 bits, the remaining layers are kept at 16 bits, and the cosine distance is 0.951;
conv3_3 is quantized to 8 bits, the remaining layers are kept at 16 bits, and the cosine distance is 0.968;
conv4_1 is quantized to 8 bits, the remaining layers are kept at 16 bits, and the cosine distance is 0.959;
conv4_2 is quantized to 8 bits, the remaining layers are kept at 16 bits, and the cosine distance is 0.842;
conv4_3 is quantized to 8 bits, the remaining layers are kept at 16 bits, and the cosine distance is 0.942;
conv5_1 is quantized to 8 bits, the remaining layers are kept at 16 bits, and the cosine distance is 0.962;
conv5_2 is quantized to 8 bits, the remaining layers are kept at 16 bits, and the cosine distance is 0.913;
conv5_3 is quantized to 8 bits, the remaining layers are kept at 16 bits, and the cosine distance is 0.885;
FC1 (fully connected layer) is quantized to 8 bits, the remaining layers are kept at 16 bits, and the cosine distance is 0.992;
FC2 (fully connected layer) is quantized to 8 bits, the remaining layers are kept at 16 bits, and the cosine distance is 0.851;
FC3 (fully connected layer) is quantized to 8 bits, the remaining layers are kept at 16 bits, and the cosine distance is 0.892;
According to the cosine distances obtained above, the sensitivity ordering of the different layers to quantization (from most to least sensitive) is determined as: [Conv1_2, Conv4_2, FC2, Conv2_2, Conv5_3, FC3, Conv5_2, Conv3_1, Conv4_3, Conv1_1, Conv3_2, Conv4_1, Conv5_1, Conv3_3, Conv2_1, FC1]
The minimum accuracy requirement of the actual reasoning task is set to 80%; the accuracy under full 8-bit quantization is 70.6%, which does not meet this requirement. Therefore, according to the sensitivity ordering obtained above, the corresponding layers are quantized to 16 bits in sequence until the minimum accuracy requirement is met.
The accuracy rate after Conv1_2 is quantized to 16 bits is 74.2%, and the requirement of the lowest accuracy rate is not met;
the accuracy rate after 16 bits are quantized by Conv1_2 and Conv4_2 layers is 76.2%, and the requirement of the lowest accuracy rate is not met;
the accuracy rate of the FC2 layer quantized to 16 bits is 77.0 percent after Conv1_2 and Conv4_2 are quantized, and the requirement of the minimum accuracy rate is not met;
the accuracy rate of the Conv1_2 and Conv4_2, FC2 and Conv2_2 layers quantized to 16 bits is 77.5 percent, and the minimum accuracy rate requirement is not met;
the accuracy rate after the Conv1_2 and Conv4_2, FC2, Conv2_2 and Conv5_3 layers are quantized to 16 bits is 78.2 percent, and the requirement of the minimum accuracy rate is not met;
the accuracy rate of the layers Conv1_2, Conv4_2, FC2, Conv2_2, Conv5_3 and FC3 quantized to 16 bits is 78.6 percent, and the minimum accuracy rate requirement is not met;
the accuracy rate of 16 bits quantized Conv1_2, Conv4_2, FC2, Conv2_2, Conv5_3, FC3 and Conv5_2 layers is 79.4%, and the minimum accuracy rate requirement is not met;
the accuracy of 16 bits quantized Conv1_2 and Conv4_2, FC2, Conv2_2, Conv5_3, FC3, Conv5_2 and Conv3_1 layers is 80.3%, and the lowest accuracy requirement is met, so the accuracy of the mixed-precision network is configured as 16-bit quantization layers [ Conv1_2, Conv4_2, FC2, Conv2_2, Conv5_3, FC3, Conv5_2 and Conv3_1], and the rest layers are 8-bit quantized;
after the specific network structure is obtained, the data size of the output feature of each layer and the inference time of each layer on the edge device and the cloud device are calculated, with batch_size taken as 32, i.e., the inference times described below are the inference delays for 32 test pictures;
conv1_1 (8-bit quantization) has output characteristics of [32,64,32,32], the data size is 2MB, and the inference time of the layer at the edge device and the cloud device is 1.92ms and 0.88ms respectively;
conv1_2(16bit quantization) has output characteristics of [32,64,32,32], a data size of 4MB, and inference times of the layer at the edge device and the cloud device are 8.17ms and 1.09ms respectively;
conv2_1 (8-bit quantized) has output characteristics of [32,128,16,16], the data size is 1MB, and the inference time of the layer at the edge device and the cloud device is 2.18ms and 0.90ms respectively;
conv2_2 (16-bit quantization) has output characteristics of [32,128,16,16], a data size of 2MB, and inference times of the layer at the edge device and the cloud device are 6.86 ms and 1.42 ms respectively;
conv3_1(16bit quantization) has output characteristics of [32,256,8,8], the data size is 1MB, and the inference time of the layer at the edge device and the cloud device is 3.59ms and 0.91ms respectively;
conv3_2 (8-bit quantization) has output characteristics of [32,256,8,8], a data size of 0.5MB, and inference times of the layer at the edge device and the cloud device are 2.79 ms and 0.73 ms respectively;
conv3_3(8bit quantization) has output characteristics of [32,256,8,8], a data size of 0.5MB, and the inference time of the layer at the edge device and the cloud device is 3.21ms and 0.47ms respectively;
conv4_1 (8-bit quantized) has output characteristics of [32,512,4,4], the data size is 0.25MB, and the inference time of the layer at the edge device and the cloud device is 1.67ms and 0.45ms respectively;
conv4_2(16bit quantization) has output characteristics of [32,512,4,4], data size of 0.5MB, and inference time of the layer at the edge device and the cloud device is 7.63ms and 1.62ms respectively;
conv4_3 (8-bit quantized) has output characteristics of [32,512,4,4], the data size is 0.25MB, and the inference time of the layer at the edge device and the cloud device is 3.21ms and 0.54ms respectively;
conv5_1 (8-bit quantized) has output characteristics of [32,512,2,2], a data size of 0.0625MB, and inference times of the layer at the edge device and the cloud device are 1.15ms and 0.34ms respectively;
conv5_2(16bit quantization) has output characteristics of [32,512,2,2], data size of 0.125MB, and inference time of the layer at the edge device and the cloud device is 4.11ms and 1.24ms respectively;
conv5_3(16bit quantization) has output characteristics of [32,512,2,2], data size of 0.125MB, and inference time of the layer at the edge device and the cloud device is 3.95ms and 1.11ms respectively;
the output characteristic of FC1 (8-bit quantization) is [32,256,1,1], the data size is 8KB, and the inference time of the layer at the edge device and the cloud device is 0.30 ms and 0.23 ms respectively;
the output characteristic of FC2(16bit quantization) is [32,256,1,1], the data size is 16KB, and the inference time of the layer at the edge end device and the cloud end device is 0.04ms and 0.05ms respectively;
the output characteristic of FC3 (16-bit quantization) is [32,10,1,1], the data size is 0.625KB, and the inference time of the layer at the edge end device and the cloud end device is 0.12ms and 0.03ms respectively;
the total delay of inference at the edge end of all layers of the mixed precision network is 50.9 ms; the inference delay of the full-precision network on the edge end equipment is 71.9 ms; the inference delay of the full 8-bit quantization network on the edge equipment is 34.05 ms;
the method comprises the steps that edge-end equipment records the current network bandwidth situation in real time, the current 5G network is taken as an example, the 5G network can achieve the transmission speed of 140MB/s, the current network bandwidth is assumed to be 100MB/s and can be calculated by a model division algorithm, Conv2_1 is selected as a division point and is influenced by a regularization item, the layer is an 8-bit quantization layer, namely all layers in the front of the layer (including Conv2_1) carry out reasoning (delaying for 9.14ms) on the edge-end equipment, then the output of the Conv2_1 is transmitted to a cloud (delaying for 10ms), all layers behind the layer carry out reasoning (delaying for 12.27ms) on the cloud-end equipment, the total reasoning delay can be minimized, and the delay reasoning is 31.41ms at the moment.
From the above experimental results it can be seen that: the inference framework provided by the invention can eliminate the defect that the inference accuracy rate of low bit quantization is reduced by combining mixed precision quantization and network division, and meanwhile, the inference total delay after network division can be lower than the delay of full 8bit quantization, the inference speed is higher, and the effectiveness of the method is verified.
The method optimizes against the shortcomings of the existing model quantization and model division techniques and further accelerates edge-end material identification reasoning by combining the two; at the same time, to address the large search space, heavy computation, and other difficulties that arise when the two methods are combined, an implementation that determines the mixed precision and the division point based on a greedy algorithm is provided.
Hereinbefore, specific embodiments of the present invention are described with reference to the drawings. However, those skilled in the art will appreciate that various modifications and substitutions can be made to the specific embodiments of the present invention without departing from the spirit and scope of the invention. Such modifications and substitutions are intended to be included within the scope of the present invention as defined by the appended claims.

Claims (10)

1. A neural network reasoning acceleration method for material identification is characterized by comprising the following steps:
1) training a full-precision material identification network based on a convolutional neural network architecture by utilizing a training set in a material identification data set, recording the accuracy of the full-precision network, and simultaneously storing the trained network weight;
2) randomly extracting a part of pictures from the training set as a quantitative calibration data set;
3) the trained full-precision network is quantized after being trained by using a calibration data set, the quantization precision adopts 16-bit quantization to obtain the accuracy of the network after the full 16-bit quantization, and the output characteristic of the last full connection layer is recorded;
4) based on the network quantized by full 16 bits, sequentially adjusting the quantization precision of each layer to 8 bits from the first layer, keeping the other layers unchanged at 16 bits, and recording the output characteristics of the last full connection layer of the network after each layer is adjusted to 8 bits;
5) calculating cosine distances between the full 16-bit quantized output features obtained in the step 3) and the output features corresponding to the 8-bit quantization adjusted by each layer obtained in the step 4), and arranging all the cosine distances from small to large as the sequence of the sensitivity of different layers of the network to the 8-bit quantization;
6) acquiring the lowest accuracy requirement on a material identification network during actual reasoning;
7) quantizing the network to full 8 bits by using a post-training quantization method, sequencing the quantization sensitivity of different layers obtained in the step 5) according to the lowest accuracy requirement, and sequentially adjusting the layer which is most sensitive to quantization in the full 8-bit quantization network to 16-bit quantization until the network accuracy reaches the lowest requirement, thereby obtaining a mixed precision network;
8) distributing the mixed precision network and the material identification data set obtained in the step 7) to cloud equipment and edge end equipment;
9) the cloud end and the edge end respectively use the material identification data sets to carry out mixed precision network reasoning, calculate the reasoning delay of each layer at the cloud end and the edge end, and record the output data volume of each layer;
10) the edge end records the real-time network bandwidth condition;
11) determining a network division point by the edge terminal according to the cloud edge reasoning delay and the output data amount of each layer obtained in the step 9) and the current bandwidth of the network obtained in the step 10) by taking the minimum overall network reasoning delay as an optimization target according to a network division algorithm, and uploading the division point result to a cloud end;
12) the edge terminal carries out a reasoning process from a first layer to a division point according to the input material identification data set, and uploads a reasoning result to a cloud terminal;
13) the cloud end performs network reasoning from the division point to the last layer according to the reasoning result before the division point sent by the edge end as input;
14) and the cloud end sends the finally obtained reasoning result, namely the material identification result to the edge end, and the reasoning is finished.
2. The neural network reasoning acceleration method for material recognition according to claim 1, characterized in that the convolutional neural network architecture adopts a directed acyclic graph model, and the accuracy of the trained full-precision model must be higher than the minimum accuracy required by actual reasoning; the deep learning framework is PyTorch, and the reasoning process of the network on the cloud and the edge end is implemented based on PyTorch.
3. The neural network reasoning acceleration method for material recognition according to claim 1, wherein in the step 3), the post-training quantification method is implemented as follows:
the quantization parameter of the network weight is determined according to the value distribution of the trained network parameter, and the quantization method adopts symmetric quantization;
the network activated quantization parameter is determined according to a specific range of a quantization calibration data set in the forward reasoning process, and the activated quantization method adopts asymmetric quantization; in order to reduce the activated quantization error, the activated quantization calibration process adopts a pseudo quantization operation of quantizing to 8bit fixed point numbers and then quantizing to 32 bit floating point numbers in an inverse manner, and the quantization and inverse quantization formulas are as follows:
Quantization: q = round(r / scale) + zero_point
Inverse quantization: r = scale × (q - zero_point)
Where r represents a 32-bit floating point number before quantization, q represents a fixed point number after quantization, scale represents a scaling factor, zero _ point represents a zero offset, and round () represents an integer function.
4. The neural network reasoning acceleration method for material identification according to claim 1, wherein the cosine distance in the step 5) is calculated as follows:
sequentially quantizing the convolution layer and the full-connection layer in the network to 8 bits, keeping 16 bits of quantization of other layers unchanged, and inputting a quantization calibration data set for quantization;
inputting the test set in the material identification data set into the quantized network to obtain the output feature of the last fully connected layer of the quantized network, and calculating the cosine distance between this output feature and the corresponding fully connected layer output feature under full 16-bit quantization;
and repeating the step for multiple times of measurement, taking an average value as the cosine distance of the output characteristics before and after quantization, and arranging the values from small to large as the sequence of different layers of the network to the quantization sensitivity.
5. The neural network reasoning acceleration method for material identification according to claim 1, wherein the determination method of the mixed precision network in the step 7) is as follows:
according to the quantization sensitivity ranking of the network layers obtained in step 5), the most sensitive of the remaining layers is selected as a 16-bit quantization layer at each step, until the final accuracy exceeds the required minimum accuracy;
the hybrid precision quantization method is as follows:
scale = (r_max − r_min) / (q_max − q_min)

zero_point = round(q_min − r_min / scale)

q = clamp(round(r / scale) + zero_point, q_min, q_max)
channel-by-channel symmetric quantization is performed on the network weights, and layer-by-layer asymmetric quantization is performed on the network activations; when a layer is quantized to 16 bits, q_max for weight quantization is 2^15 − 1 and q_min is −2^15, while q_max for activation quantization is 2^16 − 1 and q_min is 0; when a layer is quantized to 8 bits, q_max for weight quantization is 2^7 − 1 and q_min is −2^7, while q_max for activation quantization is 2^8 − 1 and q_min is 0.
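The per-channel symmetric weight quantization with these bit-width-dependent integer ranges can be sketched as follows. Deriving each channel's scale from its maximum magnitude is a common choice assumed here and is not stated in the claim; the greedy 16-bit layer selection is only indicated in a comment.

```python
import torch

def weight_qrange(n_bits):
    """Signed symmetric integer range: q_min = -2^(n-1), q_max = 2^(n-1) - 1."""
    return -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1

def quantize_weight_per_channel(w, n_bits):
    """Channel-by-channel symmetric quantization of a convolution or linear weight tensor."""
    q_min, q_max = weight_qrange(n_bits)
    # One scale per output channel, taken from the channel's maximum magnitude (zero_point = 0).
    scale = w.reshape(w.shape[0], -1).abs().max(dim=1).values / q_max
    scale = scale.clamp(min=1e-12).view(-1, *([1] * (w.dim() - 1)))
    q = torch.clamp(torch.round(w / scale), q_min, q_max)
    return q, scale

# 8-bit example on a convolution weight; a quantization-sensitive layer chosen by the
# greedy procedure of this claim would be quantized with n_bits=16 instead.
w = torch.randn(16, 3, 3, 3)
q_w, scales = quantize_weight_per_channel(w, n_bits=8)
```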
6. The neural network reasoning acceleration method for material identification as claimed in claim 1, wherein in step 9), the output data amount is calculated as follows: the output features of each layer of the network are expressed as four-dimensional tensors [N, C, H, W], where N is the batch size, C is the number of channels, H is the height of the feature map, and W is the width of the feature map, so the output data volume is N × C × H × W × b in bits, where b is the actual quantization precision of the layer; the calculated output data volumes of the network layers are stored in the edge-end memory.
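A minimal example of this data-volume calculation; the batch size, channel count, feature-map size, and bit width below are illustrative values, not taken from the patent.

```python
def output_volume_bits(n, c, h, w, bits):
    """Output data volume of one layer: N * C * H * W * b, in bits."""
    return n * c * h * w * bits

# Example: a batch of 1 with 64 channels of 56x56 feature maps at 8-bit precision.
volume = output_volume_bits(1, 64, 56, 56, 8)
print(volume)               # 1605632 bits
print(volume / 8 / 1024)    # 196.0 KB to be transmitted if this layer is the division point
```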
7. The neural network reasoning acceleration method for material identification according to claim 1, characterized in that the network partitioning algorithm in step 11) is implemented as follows:
recording the inference delay of each layer of the mixed-precision network on the edge end, T_e = [T_e1, T_e2, ..., T_eL], where L represents the total number of network layers; recording the inference delay of each layer of the mixed-precision network on the cloud, T_c = [T_c1, T_c2, ..., T_cL]; recording the output data volume of each layer of the network, S = [S_1, S_2, ..., S_(L-1)]; and recording the current network bandwidth B;
adding a regularization term λ × Q, where λ is the scale factor of the regularization term; the larger the value of λ, the larger the role of the regularization term in the optimization formula and the more strongly the algorithm is biased toward selecting an 8-bit quantization layer as the division point; Q is a quantization flag: Q = 1 if the layer is a 16-bit quantization layer, and Q = 0 if it is an 8-bit quantization layer;
the overall network reasoning delay comprises four parts: edge-end reasoning delay, data transmission delay, cloud reasoning delay, and the regularization term; a suitable network division point is selected so that the total reasoning delay T_total is minimized, and the corresponding optimization problem is expressed as follows:
T_total = min over p ∈ {1, ..., L} of [ Σ_{i=1}^{p} T_ei + S_p / B + Σ_{i=p+1}^{L} T_ci + λ × Q_p ]

where Σ_{i=1}^{p} T_ei represents the edge-end reasoning delay, Σ_{i=p+1}^{L} T_ci represents the cloud reasoning delay, and S_p / B represents the transmission delay of the intermediate output features at division point p.
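A sketch of this division-point search follows; the per-layer delays, output volumes, bandwidth, and λ value in the usage example are illustrative, and candidate partition points are assumed to range over layers 1 to L−1 so that S_p is defined.

```python
def best_partition(t_edge, t_cloud, sizes, bandwidth, q_flags, lam):
    """Return the division point p (1-based) minimizing the regularized total reasoning delay.

    t_edge, t_cloud : per-layer delays (seconds) on the edge end and on the cloud
    sizes           : output data volumes of layers 1..L-1 (bits)
    bandwidth       : current network bandwidth (bits per second)
    q_flags         : 1 if a layer is quantized to 16 bits, 0 if quantized to 8 bits
    lam             : scale factor lambda of the regularization term
    """
    n_layers = len(t_edge)
    best_p, best_total = None, float("inf")
    for p in range(1, n_layers):                  # candidate division points 1..L-1
        total = (sum(t_edge[:p])                  # edge-end reasoning delay
                 + sizes[p - 1] / bandwidth       # transmission delay of layer p's output features
                 + sum(t_cloud[p:])               # cloud reasoning delay
                 + lam * q_flags[p - 1])          # regularization: penalize 16-bit division layers
        if total < best_total:
            best_p, best_total = p, total
    return best_p, best_total

# Illustrative values for a 5-layer network and a 10 Mbit/s link.
t_e = [0.020, 0.030, 0.030, 0.040, 0.010]
t_c = [0.002, 0.003, 0.003, 0.004, 0.001]
s = [4e6, 2e6, 1e6, 5e5]                          # output volumes of layers 1..4 in bits
q = [1, 0, 0, 1, 0]
print(best_partition(t_e, t_c, s, bandwidth=10e6, q_flags=q, lam=0.05))
```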
8. The neural network reasoning acceleration method for material identification as claimed in claim 1, characterized in that the edge end obtains the division point with the minimum total delay according to the cloud and edge reasoning delays and the data transmission delay of each layer of the mixed-precision network, and uploads the division point result to the cloud; the edge end executes the layers from the first layer of the network to the division-point layer and transmits the output features of the division-point layer to the cloud, which executes the reasoning of the remaining layers, and the final reasoning result is transmitted back to the edge-end device.
9. The neural network reasoning acceleration method for material recognition according to claim 1, wherein, when the material recognition task is executed, the delay of transmitting the reasoning result back from the cloud to the edge end is ignored, since the neural network reasoning result is only a single number.
10. The neural network reasoning acceleration method for material identification as claimed in claim 1, wherein the edge device is a Raspberry Pi, a mobile phone or a computer, and the cloud device is a server.
CN202110549464.8A 2021-05-20 2021-05-20 Neural network reasoning acceleration method for material identification Active CN113222148B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110549464.8A CN113222148B (en) 2021-05-20 2021-05-20 Neural network reasoning acceleration method for material identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110549464.8A CN113222148B (en) 2021-05-20 2021-05-20 Neural network reasoning acceleration method for material identification

Publications (2)

Publication Number Publication Date
CN113222148A 2021-08-06
CN113222148B (en) 2022-01-11

Family

ID=77093452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110549464.8A Active CN113222148B (en) 2021-05-20 2021-05-20 Neural network reasoning acceleration method for material identification

Country Status (1)

Country Link
CN (1) CN113222148B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3242254A1 (en) * 2016-05-03 2017-11-08 Imagination Technologies Limited Convolutional neural network hardware configuration
CN108510067A (en) * 2018-04-11 2018-09-07 西安电子科技大学 The convolutional neural networks quantization method realized based on engineering
CN109905880A (en) * 2019-03-22 2019-06-18 苏州浪潮智能科技有限公司 A kind of network partitioning method, system and electronic equipment and storage medium
CN111260022A (en) * 2019-11-22 2020-06-09 中国电子科技集团公司第五十二研究所 Method for fixed-point quantization of complete INT8 of convolutional neural network
CN111160523A (en) * 2019-12-16 2020-05-15 上海交通大学 Dynamic quantization method, system and medium based on characteristic value region
CN112183742A (en) * 2020-09-03 2021-01-05 南强智视(厦门)科技有限公司 Neural network hybrid quantization method based on progressive quantization and Hessian information
CN112101532A (en) * 2020-11-18 2020-12-18 天津开发区精诺瀚海数据科技有限公司 Self-adaptive multi-model driving equipment fault diagnosis method based on edge cloud cooperation

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SAJID ANWAR et al.: "FIXED POINT OPTIMIZATION OF DEEP CONVOLUTIONAL NEURAL NETWORKS FOR OBJECT RECOGNITION", IEEE *
FAN Qi et al.: "Inference latency optimization of branchy neural network models based on edge computing", Journal of Computer Applications *
CHEN Peirui: "Research on collaborative schemes for edge computing resources based on deep learning", China Masters' Theses Full-text Database, Engineering Science and Technology *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114861886A (en) * 2022-05-30 2022-08-05 阿波罗智能技术(北京)有限公司 Quantification method and device of neural network model
CN114861886B (en) * 2022-05-30 2023-03-10 阿波罗智能技术(北京)有限公司 Quantification method and device of neural network model
WO2024046183A1 (en) * 2022-08-30 2024-03-07 华为技术有限公司 Model compression method and apparatus, and related device

Also Published As

Publication number Publication date
CN113222148B (en) 2022-01-11

Similar Documents

Publication Publication Date Title
CN111709522B (en) Deep learning target detection system based on server-embedded cooperation
CN113222148B (en) Neural network reasoning acceleration method for material identification
WO2020207214A1 (en) Data processing method and apparatus, electronic device and storage medium
CN112150821B (en) Lightweight vehicle detection model construction method, system and device
WO2021143883A1 (en) Adaptive search method and apparatus for neural network
US20220254146A1 (en) Method for filtering image feature points and terminal
CN114662780A (en) Carbon emission prediction method, carbon emission prediction device, electronic apparatus, and storage medium
CN112541532B (en) Target detection method based on dense connection structure
WO2019169594A1 (en) Methods and apparatus to generate three-dimensional (3d) model for 3d scene reconstruction
CN115577746A (en) Network intrusion detection method based on meta-learning
CN112686376A (en) Node representation method based on timing diagram neural network and incremental learning method
CN114520743A (en) Method and system for detecting network abnormal flow and storable medium
CN113935475A (en) Simulation and training method of pulse neural network with pulse time offset
CN112214342A (en) Efficient error data detection method in federated learning scene
CN114528987A (en) Neural network edge-cloud collaborative computing segmentation deployment method
CN112651500B (en) Method for generating quantization model and terminal
CN112906883A (en) Hybrid precision quantization strategy determination method and system for deep neural network
CN117146954A (en) Weighing compensation method and device based on improved WOA-BP neural network
CN113779287B (en) Cross-domain multi-view target retrieval method and device based on multi-stage classifier network
CN114444654A (en) NAS-oriented training-free neural network performance evaluation method, device and equipment
CN114510871A (en) Cloud server performance degradation prediction method based on thought evolution and LSTM
CN110728292A (en) Self-adaptive feature selection algorithm under multi-task joint optimization
CN113361625A (en) Error data detection method with privacy protection in federated learning scene
CN116798103B (en) Artificial intelligence-based face image processing method and system
CN115880486B (en) Target detection network distillation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant