Disclosure of Invention
Embodiments of the invention aim to provide a compression method and a compression device for a deep neural network model, which solve the problem of low precision of model compression methods in the prior art by performing adaptive model channel cutting and parameter quantization simultaneously within a model training period, thereby ensuring the precision of the compressed model.
In order to achieve the above object, an embodiment of the present invention provides a method for compressing a deep neural network model, where the method includes: in the current model training period, calculating a norm corresponding to a model channel in the deep neural network model to be compressed; cutting the model channel according to the norm corresponding to the model channel and the initialization weight threshold corresponding to each layer of neural network to obtain a cut deep neural network model; judging whether the difference value between the model precision and the expected precision of the cut deep neural network model is larger than zero or not; when the difference between the model precision and the expected precision is larger than zero, obtaining a self-adaptive weight threshold value corresponding to each layer of neural network when a model channel is cut in the next model training period according to the difference and the initialized weight threshold value corresponding to each layer of neural network; determining a quantized deep neural network model according to the influence degree of the quantized result of each parameter in the cut deep neural network model on the loss function of the deep neural network model; and taking the quantized deep neural network model as a compressed model of the current model training period, and cutting and quantizing the compressed model by utilizing the self-adaptive weight threshold in the next model training period.
Further, the cutting the model channel according to the norm corresponding to the model channel and the initialization weight threshold corresponding to each layer of neural network to obtain a cut deep neural network model includes: sorting the norms of the model channels from small to large, and taking the model channels corresponding to the norms arranged in the previous set percentage as a channel set to be cut; extracting a model channel to be cut from the channel set to be cut, and extracting an initialization weight threshold corresponding to a neural network of a layer where the model channel to be cut is located; judging whether the norm of the model channel to be cut is less than or equal to an initialization weight threshold corresponding to a neural network of a layer where the model channel to be cut is located; and when the norm of the model channel to be cut is less than or equal to the initialization weight threshold, cutting the model channel to be cut to obtain a cut deep neural network model.
Further, after the determining whether the norm of the model channel to be clipped is less than or equal to the initialization weight threshold corresponding to the neural network of the layer where the model channel is located, the method further includes: and when the norm of the model channel to be cut is greater than the initialization weight threshold, not cutting the model channel to be cut, and extracting the next model channel to be cut from the channel set to be cut for cutting.
Further, after the determining whether the difference between the model precision and the expected precision of the cropped deep neural network model is greater than zero, the method further includes: and when the difference between the model precision and the expected precision is less than or equal to zero, restoring the cut deep neural network model to the deep neural network model before the cutting, and ending the compression of the deep neural network model to be compressed.
Further, the obtaining, according to the difference and the initialization weight threshold corresponding to each layer of neural network, the adaptive weight threshold corresponding to each layer of neural network for cutting the model channel in the next model training period includes: obtaining, according to W_A = η_ω × T_r × W, the adaptive weight threshold corresponding to each layer of neural network when the model channel is cut in the next model training period, where W is a threshold vector composed of the initialization weight thresholds corresponding to each layer of neural network, T_r is the difference, η_ω is a coefficient, and W_A is a vector composed of the adaptive weight thresholds corresponding to each layer of neural network.
Further, the determining the quantized deep neural network model according to the influence degree of the quantized result of each parameter in the cut deep neural network model on the loss function of the deep neural network model includes: counting parameters to be quantized in the cut deep neural network model; initially quantizing the parameters to be quantized to an initial bit number, and judging whether a loss function value of the initially quantized deep neural network model is less than or equal to a set threshold; and when the loss function value of the initially quantized deep neural network model is less than or equal to the set threshold, quantizing the initially quantized parameters to be quantized one by one from a specified low bit number toward a specified high bit number until the loss function value of the quantized deep neural network model is less than or equal to the set threshold, thereby determining the quantized deep neural network model.
Further, after the determining whether the loss function value of the initially quantized deep neural network model is less than or equal to the set threshold, the method further includes: when the loss function value of the initially quantized deep neural network model is greater than the set threshold, restoring the initially quantized deep neural network model to the deep neural network model before the current cutting, and ending the compression of the deep neural network model to be compressed.
Further, the quantizing the initially quantized parameters to be quantized one by one from a specified low bit number to a specified high bit number until the loss function value of the quantized deep neural network model is less than or equal to the set threshold, and determining the quantized deep neural network model includes: extracting the parameters to be quantized one by one from the initially quantized parameters to be quantized, quantizing the extracted parameter to the specified low bit number, and calculating whether the loss function value of the deep neural network model after parameter quantization is less than or equal to the set threshold; when the loss function value is less than or equal to the set threshold, determining the specified low bit number as the quantization bit number of the parameter to be quantized; when the loss function value is greater than the set threshold, quantizing the parameter to be quantized to a specified high bit number higher than the specified low bit number until the loss function value of the deep neural network model after parameter quantization is less than or equal to the set threshold; and when all the initially quantized parameters to be quantized have been quantized, determining the quantized deep neural network model.
Further, after the ending of the compression of the deep neural network model to be compressed, the method further comprises: and performing Huffman coding on the parameters in the deep neural network model before the cutting to obtain a compressed model of the current model training period.
Correspondingly, the embodiment of the invention also provides a compression device of the deep neural network model, and the device comprises: the norm calculation unit is used for calculating the norm corresponding to the model channel in the deep neural network model to be compressed in the current model training period; the cutting unit is used for cutting the model channel according to the norm corresponding to the model channel and the initialization weight threshold corresponding to each layer of neural network to obtain a cut deep neural network model; the judging unit is used for judging whether the difference value between the model precision and the expected precision of the cut deep neural network model is larger than zero or not; the threshold value determining unit is used for obtaining a self-adaptive weight threshold value corresponding to each layer of neural network when a model channel is cut in the next model training period according to the difference value and the initialized weight threshold value corresponding to each layer of neural network when the difference value between the model precision and the expected precision is larger than zero; the quantization unit is used for determining the quantized deep neural network model according to the influence degree of the quantized result of each parameter in the cut deep neural network model on the loss function of the deep neural network model; and the processing unit is used for taking the quantized deep neural network model as a compressed model of the current model training period, and cutting and quantizing the compressed model by utilizing the self-adaptive weight threshold in the next model training period.
Further, the clipping unit is further configured to sort the norms of the model channels from small to large, and use the model channels corresponding to the norms arranged in the previous set percentage as a to-be-clipped channel set; extracting a model channel to be cut from the channel set to be cut, and extracting an initialization weight threshold corresponding to a neural network of a layer where the model channel to be cut is located; judging whether the norm of the model channel to be cut is less than or equal to an initialization weight threshold corresponding to a neural network of a layer where the model channel to be cut is located; and when the norm of the model channel to be cut is less than or equal to the initialization weight threshold, cutting the model channel to be cut to obtain a cut deep neural network model.
Further, the clipping unit is further configured to, when the norm of the model channel to be clipped is greater than the initialization weight threshold, not clip the model channel to be clipped, and extract a next model channel to be clipped from the set of channels to be clipped to perform clipping.
Further, the processing unit is further configured to, when the difference between the model precision and the expected precision is less than or equal to zero, restore the clipped deep neural network model to the deep neural network model before the current clipping, and end the compression of the deep neural network model to be compressed.
Further, the threshold determining unit is further configured to obtain, according to W_A = η_ω × T_r × W, the adaptive weight threshold corresponding to each layer of neural network when the model channel is cut in the next model training period, where W is a threshold vector composed of the initialization weight thresholds corresponding to each layer of neural network, T_r is the difference, η_ω is a coefficient, and W_A is a vector composed of the adaptive weight thresholds corresponding to each layer of neural network.
Furthermore, the quantization unit is further configured to count parameters to be quantized in the clipped deep neural network model; initially quantize the parameters to be quantized to an initial bit number, and judge whether a loss function value of the initially quantized deep neural network model is less than or equal to a set threshold; and when the loss function value of the initially quantized deep neural network model is less than or equal to the set threshold, quantize the initially quantized parameters to be quantized one by one from a specified low bit number toward a specified high bit number until the loss function value of the quantized deep neural network model is less than or equal to the set threshold, thereby determining the quantized deep neural network model.
Further, the processing unit is further configured to, when the loss function value of the initially quantized deep neural network model is greater than the set threshold, restore the initially quantized deep neural network model to the deep neural network model before the current clipping, and end the compression of the deep neural network model to be compressed.
Further, the quantization unit is further configured to extract the parameters to be quantized one by one from the initially quantized parameters to be quantized, quantize the extracted parameter to the specified low bit number, and calculate whether the loss function value of the deep neural network model after parameter quantization is less than or equal to the set threshold; when the loss function value is less than or equal to the set threshold, determine the specified low bit number as the quantization bit number of the parameter to be quantized; when the loss function value is greater than the set threshold, quantize the parameter to be quantized to a specified high bit number higher than the specified low bit number until the loss function value of the deep neural network model after parameter quantization is less than or equal to the set threshold; and when all the initially quantized parameters to be quantized have been quantized, determine the quantized deep neural network model.
Further, the apparatus further comprises: and the coding unit is used for carrying out Huffman coding on the parameters in the deep neural network model before the cutting to obtain a compressed model of the current model training period.
Accordingly, the embodiment of the present invention also provides a machine-readable storage medium, which stores instructions for causing a machine to execute the compression method of the deep neural network model as described above.
Through the above technical scheme, in a model training period, a norm corresponding to each model channel in the deep neural network model to be compressed is calculated, and the model channels are cut according to their norms and the initialization weight threshold corresponding to each layer of neural network to obtain a cut deep neural network model. It is then judged whether the difference between the model precision of the cut deep neural network model and the expected precision is greater than zero; when the difference is greater than zero, the adaptive weight threshold corresponding to each layer of neural network for cutting the model channel in the next model training period is obtained according to the difference and the initialization weight threshold corresponding to each layer of neural network. The quantized deep neural network model is then determined according to the degree to which the quantized result of each parameter in the cut deep neural network model influences the loss function of the deep neural network model, the quantized deep neural network model is taken as the compressed model of the current model training period, and in the next model training period the compressed model is cut and quantized using the adaptive weight threshold. The embodiment of the invention thus solves the problem of low precision of the compressed model caused by manually setting model compression parameters in the prior art, and, by performing adaptive model channel cutting and parameter quantization simultaneously in one model training period and continuing to do so in subsequent model training periods, compresses the space occupied by the model to the maximum extent while ensuring the precision of the compressed model.
Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating embodiments of the invention, are given by way of illustration and explanation only, not limitation.
Fig. 1 is a schematic flowchart of a compression method of a deep neural network model according to an embodiment of the present invention. As shown in fig. 1, the method comprises the steps of:
step 101, in the current model training period, calculating a norm corresponding to a model channel in a deep neural network model to be compressed;
step 102, cutting the model channel according to the norm corresponding to the model channel and the initialization weight threshold corresponding to each layer of neural network to obtain a cut deep neural network model;
step 103, judging whether the difference value between the model precision and the expected precision of the cut deep neural network model is greater than zero;
step 104, when the difference between the model precision and the expected precision is greater than zero, obtaining a self-adaptive weight threshold corresponding to each layer of neural network when a model channel is cut in the next model training period according to the difference and the initialized weight threshold corresponding to each layer of neural network;
step 105, determining a quantized deep neural network model according to the influence degree of the quantized result of each parameter in the cut deep neural network model on the loss function of the deep neural network model;
and step 106, taking the quantized deep neural network model as a compressed model of the current model training period, and cutting and quantizing the compressed model by utilizing the self-adaptive weight threshold in the next model training period.
In order to ensure that the model retains good performance, the embodiment of the invention compresses in advance a large-scale deep neural network model that uses a large-scale data set and has a complex structure. Moreover, two parts of model compression are completed within one model training period: one part is model channel clipping, and the other part is parameter quantization.
Before model channel clipping is performed, in step 101, in the current model training period, the norm corresponding to each model channel in the deep neural network model to be compressed is first calculated, so that the model channels can be divided by their norms into an important part and an unimportant part. Specifically, in step 102, the norms of the model channels are sorted from small to large, and the model channels whose norms fall within the first set percentage are taken as the set of channels to be clipped; that is, the model channels corresponding to the smallest norms within the set percentage are regarded as unimportant model channels and may be clipped, as sketched below.
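To make the candidate-selection step concrete, the following is a minimal sketch in Python/PyTorch of computing one norm per model channel and keeping the channels whose norms fall within the first set percentage. The function name, the choice of the L2 norm (the embodiment only speaks of "a norm"), and the set_percentage argument are illustrative assumptions rather than the patented implementation.

```python
import torch

def candidate_channels(conv_weight: torch.Tensor, set_percentage: float):
    """Return indices of the channels whose norms fall in the smallest
    `set_percentage` fraction, i.e. the set of channels to be clipped.

    conv_weight: tensor of shape (out_channels, in_channels, kH, kW),
                 one output channel per model channel.
    """
    # One L2 norm per output channel (model channel).
    norms = conv_weight.detach().flatten(1).norm(p=2, dim=1)
    # Sort the norms from small to large and keep the first set percentage.
    order = torch.argsort(norms)
    num_candidates = int(set_percentage * norms.numel())
    return order[:num_candidates], norms
```

With set_percentage = 0.3, for instance, the 30% of channels with the smallest norms would form the set of channels to be clipped.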
Model channel clipping may then begin, which includes two parts, 1) first clipping of the model channel, and 2) determination of the model accuracy after the first clipping.
For 1) clipping of model channels: a first model channel to be clipped is extracted from the set of channels to be clipped. The extraction manner is not limited; it may start from the model channel with the smallest corresponding norm, or any model channel may be extracted from the set. After the model channel to be clipped with the smallest corresponding norm is extracted, the initialization weight threshold corresponding to the neural network of the layer where this model channel is located is extracted accordingly. A corresponding initialization weight threshold is preset for each layer of neural network, e.g., W = [W_γ1, W_γ2, ..., W_γK], where W represents a threshold vector composed of the initialization weight thresholds of the K layers of neural network in the deep neural network model to be compressed. Model channels exist in each layer of neural network, so after the first model channel to be clipped is extracted from the set of channels to be clipped, the initialization weight threshold corresponding to the neural network of the layer where it is located is extracted from the K per-layer initialization weight thresholds. The norm of the first model channel to be clipped is then compared with this initialization weight threshold, and it is judged whether the norm is less than or equal to the initialization weight threshold corresponding to the neural network of the layer where the channel is located. When the norm of the first model channel to be clipped is less than or equal to the corresponding initialization weight threshold, the first model channel to be clipped is clipped, and the deep neural network model after the first clipping is obtained. When the norm of the first model channel to be clipped is greater than the initialization weight threshold, the first model channel to be clipped is not clipped, and a second model channel to be clipped is extracted from the set of channels to be clipped for clipping. Similarly, it is judged whether the norm of the second model channel to be clipped is less than or equal to the initialization weight threshold corresponding to the neural network of the layer where it is located; if so, the second model channel to be clipped is clipped, and if not, a third model channel to be clipped is extracted, and so on, until one model channel is clipped in this first clipping step, thereby obtaining the deep neural network model after the first clipping.
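The per-channel clipping decision described above can be sketched schematically as follows. The helper assumes the candidate indices are already sorted by ascending norm and that a per-layer threshold vector W and a channel-to-layer mapping are available; all names are hypothetical placeholders, not the embodiment's concrete data structures.

```python
def clip_one_channel(candidates, norms, channel_layer, layer_thresholds, keep_mask):
    """Clip the first candidate channel whose norm is <= the initialization
    weight threshold of the layer it belongs to.

    candidates:       channel indices sorted by ascending norm
    norms:            norm value per channel index
    channel_layer:    mapping channel index -> layer index
    layer_thresholds: W = [W_g1, ..., W_gK], one threshold per layer
    keep_mask:        mapping channel index -> bool, updated in place
    """
    for ch in candidates:
        threshold = layer_thresholds[channel_layer[ch]]
        if norms[ch] <= threshold:
            keep_mask[ch] = False   # clip this channel
            return ch               # one channel is clipped per clipping step
        # norm above the threshold: keep this channel, try the next candidate
    return None                     # no candidate satisfied its layer threshold
```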
For 2) judgment of model precision after the first clipping: it is judged whether the difference between the model precision of the deep neural network model after the first clipping and the expected precision is greater than zero. The model precision can be obtained by referring to the calculation of model precision for deep neural network models in the prior art; since this part is not the focus of the embodiment of the present invention, it is not described here. The expected precision can be preset, or the difference between the model precision before model clipping and a precision error tolerance value can be determined as the expected precision. When it is judged that the difference between the model precision of the deep neural network model after the first clipping and the expected precision is greater than zero, the adaptive weight threshold corresponding to each layer of neural network for clipping the model channel in the next model training period is obtained according to the difference and the initialization weight threshold corresponding to each layer of neural network. Specifically, the adaptive weight threshold corresponding to each layer of neural network for clipping the model channel in the next model training period is obtained according to W_A = η_ω × T_r × W, where W is a threshold vector composed of the initialization weight thresholds corresponding to each layer of neural network, T_r is the difference between the model precision of the deep neural network model after the first clipping and the expected precision, η_ω is a coefficient, and W_A is a vector composed of the adaptive weight thresholds corresponding to each layer of neural network.
That is, when model channel clipping is performed in the next model training period, taking model channel clipping in the second model training period as an example: when the first model channel to be clipped of the second model training period is extracted from the set of channels to be clipped, the adaptive weight threshold corresponding to the neural network of the layer where this channel is located is extracted from the adaptive weight threshold vector W_A, the norm of the channel is compared with this corresponding adaptive weight threshold, and when the norm is less than or equal to the corresponding adaptive weight threshold, the first model channel to be clipped is clipped, thereby obtaining the deep neural network model after the second clipping. It is then judged whether the difference between the model precision of the deep neural network model after the second clipping and the expected precision of the second clipping is greater than zero. The expected precision of the second clipping can be preset, or the difference between the model precision before the second clipping and the precision error tolerance value can be determined as the expected precision of the second clipping. When it is judged that the difference between the model precision of the deep neural network model after the second clipping and the expected precision of the second clipping is greater than zero, the adaptive weight threshold corresponding to each layer of neural network for clipping the model channel in the next model training period, i.e., the third model training period, is likewise obtained according to W_A2 = η_ω × T_r × W_A1, where W_A1 is the threshold vector composed of the adaptive weight thresholds corresponding to each layer of neural network obtained after the first channel clipping, T_r is the difference, η_ω is a coefficient, and W_A2 is the vector composed of the adaptive weight thresholds corresponding to each layer of neural network when the model channel is clipped in the next model training period, i.e., in the third model training period.
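Since T_r is a scalar and W a per-layer vector, the adaptive threshold update is simply an elementwise scaling of the previous threshold vector by η_ω × T_r. A minimal sketch, assuming NumPy arrays and a hypothetical coefficient eta_omega:

```python
import numpy as np

def update_thresholds(prev_thresholds: np.ndarray,
                      model_accuracy: float,
                      expected_accuracy: float,
                      eta_omega: float) -> np.ndarray:
    """Adaptive per-layer thresholds for the next training period:
    W_A = eta_omega * T_r * W, with T_r = model_accuracy - expected_accuracy.

    prev_thresholds: threshold vector used in the current period
                     (the initialization thresholds in the first period).
    The caller is expected to check T_r > 0 before calling; otherwise
    clipping stops and the model is restored, as described above.
    """
    t_r = model_accuracy - expected_accuracy
    return eta_omega * t_r * prev_thresholds
```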
It can be seen from the above model channel clipping process that, except for the first model channel clipping, each subsequent clipping uses an adaptive weight threshold, obtained according to the difference between the model precision after the previous clipping and the expected precision, to judge whether a model channel to be clipped should be clipped, so that adaptive control of model channel clipping is realized and the precision during the model channel clipping process is ensured.
As for the judgment of model precision, after any clipping of a model channel, it is judged whether the difference between the model precision of the clipped deep neural network model and the expected precision is greater than zero; when the difference is less than or equal to zero, the clipped deep neural network model is restored to the deep neural network model before the current clipping, and the compression of the deep neural network model to be compressed is ended.
After step 103, when the difference between the model precision and the expected precision is greater than zero, the adaptive weight threshold corresponding to each layer of neural network for clipping the model channel in the next model training period is obtained, and then, in step 105, the parameters in the clipped deep neural network model are quantized. Parameter quantization generally has several cases, namely quantization to 32 bits, 8 bits, 3 bits, 2 bits, or 1 bit. Because different parameters contribute to the model to different degrees, in the process of quantizing the parameters of the model the quantization bit number of each parameter can be determined adaptively according to the degree to which the quantized result of that parameter influences the loss function of the deep neural network model, and the finally quantized deep neural network model is thereby determined.
Specifically, since the deep neural network model to be compressed has undergone one round of model channel clipping, in which a part of the parameters were clipped away, the parameters to be quantized in the clipped deep neural network model are first counted. All parameters to be quantized are initially quantized to an initial bit number, for example 32 bits, and it is judged whether the loss function value of the initially quantized deep neural network model is less than or equal to a set threshold. For the calculation of the loss function value of the model, reference may be made to the calculation of loss functions in the prior art; since this part is not the focus of this application, details are not repeated. The set threshold may be a preset value, serving as a constant upper bound on the loss function that constrains the accuracy loss. When the loss function value of the initially quantized deep neural network model is less than or equal to the set threshold, the initially quantized parameters to be quantized are quantized one by one from a specified low bit number toward a specified high bit number until the loss function value of the quantized deep neural network model is less than or equal to the set threshold, and the quantized deep neural network model is determined.
Specifically, the parameters to be quantized are extracted one by one from the initially quantized parameters, each extracted parameter is quantized to the specified low bit number, and it is calculated whether the loss function value of the deep neural network model after this parameter quantization is less than or equal to the set threshold. When the loss function value is less than or equal to the set threshold, the specified low bit number is determined as the quantization bit number of the parameter; when the loss function value is greater than the set threshold, the parameter is quantized to a specified high bit number higher than the specified low bit number, until the loss function value of the deep neural network model after parameter quantization is less than or equal to the set threshold. For example, a parameter to be quantized is quantized to 1 bit, and it is calculated whether the loss function value of the deep neural network model after this quantization is less than or equal to the set threshold; when it is, 1 bit is determined as the quantization bit number of this parameter, and a second parameter to be quantized is extracted for quantization. Similarly, the second parameter is quantized to 1 bit and the loss function value is checked against the set threshold; if it satisfies the bound, a third parameter is extracted for quantization, and so on. Whenever the loss function value of the deep neural network model after quantizing a parameter is greater than the set threshold, that parameter is quantized to a specified high bit number higher than the specified low bit number, for example 2 bits, and it is again calculated whether the loss function value after parameter quantization is less than or equal to the set threshold. If so, the next parameter to be quantized is extracted; if not, the parameter is quantized to a still higher specified bit number, for example 3 bits, and the loss function value is calculated again. If the loss function value is still greater than the set threshold, the quantization bit number of the parameter is increased again, for example to 8 bits, and so on, until the quantization bit number is restored to the initial bit number. Finally, when all the initially quantized parameters have been quantized, the quantized deep neural network model is determined. This quantization process fully embodies adaptivity: for key parameters, quantization has a large influence on the loss function, so their quantization bit numbers are high; for non-key parameters, quantization has little or no influence on the loss function, so their quantization bit numbers can be low.
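The per-parameter bit-width search therefore amounts to walking an increasing ladder of bit numbers until the loss function stays within the set threshold. The sketch below assumes hypothetical helpers quantize(value, bits) and loss_with(param_id, quantized_value); they stand in for whatever quantizer and loss evaluation a concrete implementation would use.

```python
# Specified low bit number up to the initial bit number (32 bits).
BIT_LADDER = [1, 2, 3, 8, 32]

def choose_bits(param_id, value, quantize, loss_with, loss_bound):
    """Return (bits, quantized_value) for one parameter to be quantized.

    quantize(value, bits)   -> quantized value at the given bit number
    loss_with(param_id, q)  -> loss function value of the model when this
                               parameter is replaced by q
    loss_bound              -> the set threshold (upper bound on the loss)
    """
    for bits in BIT_LADDER:
        q = quantize(value, bits)
        if loss_with(param_id, q) <= loss_bound:
            return bits, q  # lowest bit number that keeps the loss bounded
    # 32 bits is the initial bit number, which already satisfied the bound
    # during initial quantization, so the loop normally returns before here.
    return 32, quantize(value, 32)
```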
In addition, when the loss function value of the initially quantized deep neural network model is greater than the set threshold, the initially quantized deep neural network model is restored to the deep neural network model before the current clipping, and the compression of the deep neural network model to be compressed is ended.
After the quantized deep neural network model is determined, the quantized deep neural network model is used as a compressed model of the current model training period, and in the next model training period, the compressed model is cut and quantized by utilizing the self-adaptive weight threshold. That is, model channel clipping and parameter quantization in the next model training period are started.
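Putting the pieces together, the per-period control flow (clip, check precision, update thresholds, quantize, repeat) can be outlined as a small driver loop. This is only a schematic mirror of the steps described above, with assumed helper functions snapshot, restore, clip_channels, model_accuracy, quantize_model and huffman_encode; none of these names come from the embodiment itself.

```python
def compress(model, init_thresholds, expected_accuracy, eta_omega, loss_bound):
    """Schematic driver for the per-period compression loop."""
    thresholds = init_thresholds
    while True:
        backup = snapshot(model)                      # model before this clipping
        clip_channels(model, thresholds)              # norm-based channel clipping
        t_r = model_accuracy(model) - expected_accuracy
        if t_r <= 0:                                  # precision check failed
            restore(model, backup)
            return huffman_encode(model)              # compressed model of this period
        thresholds = eta_omega * t_r * thresholds     # W_A = eta_omega * T_r * W
        if not quantize_model(model, loss_bound):     # adaptive parameter quantization
            restore(model, backup)                    # initial quantization failed
            return huffman_encode(model)
        # The quantized model is the compressed model of the current period;
        # the next iteration clips and quantizes it with the adaptive thresholds.
```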
For further understanding of the embodiment of the present invention, fig. 2 is a flowchart illustrating a method for compressing a deep neural network model according to the embodiment of the present invention. As shown in fig. 2, the method comprises the steps of:
step 201, obtaining a deep neural network model to be compressed;
step 202, in the training period of the model, calculating a norm corresponding to a model channel in the deep neural network model to be compressed;
step 203, sequencing the norms of the model channels from small to large, and taking the model channels corresponding to the norms arranged in the previous set percentage as a channel set to be cut;
step 204, extracting a first model channel to be cut from the set of channels to be cut, and extracting a weight threshold corresponding to the neural network of the layer where the first model channel to be cut is located, for example, extracting a model channel corresponding to a first norm as a first model channel to be cut, and extracting a weight threshold corresponding to the neural network of the layer where the first model channel to be cut is located;
step 205, determining whether the norm of the model channel to be cut is less than or equal to the corresponding weight threshold, if so, executing step 206, and if not, executing step 207;
step 206, cutting the model channel to be cut to obtain a cut deep neural network model, and executing the following step of judging the model precision;
step 207, not cutting the to-be-cut model channel, and extracting a next to-be-cut model channel and a corresponding initialization weight threshold from the to-be-cut channel set, for example, extracting a model channel corresponding to a second norm as a second to-be-cut model channel, and extracting an initialization weight threshold corresponding to a neural network of a layer where the second to-be-cut model channel is located, and then returning to execute step 205;
step 208, judging whether the difference between the model precision and the expected precision of the cut deep neural network model is greater than zero, if so, executing step 209, and if not, executing step 210;
step 209, obtaining the self-adaptive weight threshold corresponding to each layer of neural network when the model channel is cut in the next model training period according to the difference and the weight threshold corresponding to each layer of neural network.
That is, the adaptive weight threshold corresponding to each layer of neural network when the model channel is cut in the next model training period is obtained according to W_A = η_ω × T_r × W, where W is a threshold vector composed of the initialization weight thresholds corresponding to each layer of neural network, T_r is the difference, η_ω is a coefficient, and W_A is a vector composed of the adaptive weight thresholds corresponding to each layer of neural network.
For example, in the first model training period, after the model channel is clipped, the adaptive weight threshold corresponding to each layer of neural network for clipping the model channel in the second model training period may be obtained from the difference and the initialization weight threshold corresponding to each layer of neural network. In the second model training period, after the model channel is clipped, the adaptive weight threshold corresponding to each layer of neural network for clipping the model channel in the third model training period may be obtained from the difference between the model precision of the deep neural network model after this clipping and the expected precision, together with the adaptive weight threshold corresponding to each layer of neural network obtained in the first model training period, and so on. That is, in the Nth model training period, after the model channel is clipped, the adaptive weight threshold corresponding to each layer of neural network for the (N+1)th model training period is obtained from the difference between the model precision of the deep neural network model after the Nth clipping and the expected precision, together with the adaptive weight threshold corresponding to each layer of neural network obtained in the (N-1)th model training period, where N is a natural number greater than or equal to 1, and when N is 1, the adaptive weight threshold corresponding to each layer of neural network obtained in the (N-1)th model training period is the initialization weight threshold.
And step 210, restoring the trimmed deep neural network model into a deep neural network model before the trimming, and executing step 220.
For any clipping of a model channel, if the difference between the model precision of the deep neural network model after the current clipping and the expected precision is less than or equal to zero, it indicates that the current deep neural network model is not suitable for further clipping; the compression of the model is ended directly, and the model is restored to the model before the current clipping.
step 211, counting the parameters to be quantized in the deep neural network model after the cutting;
step 212, initially quantizing the parameters to be quantized to an initial bit number, for example 32 bits, and determining whether a loss function value of the initially quantized deep neural network model is less than or equal to a set threshold, if so, performing step 213, and if not, performing step 220;
step 213, extracting any one of the parameters to be quantized one by one from the initially quantized parameters to be quantized;
step 214, quantizing the parameter to be quantized into the specified low-bit number;
step 215, determining whether the loss function value of the deep neural network model after parameter quantization is less than or equal to the set threshold, if so, executing step 216, and if not, executing step 218;
step 216, determining the specified low bit number as the quantization bit number of the parameter to be quantized, and determining whether extraction of the parameters to be quantized is finished, if so, executing step 219, and if not, executing step 217, where the specified low bit number is, for example, 1 bit;
step 217, extracting the next parameter to be quantized from the parameters to be quantized, and returning to execute step 214;
step 218, quantizing the parameter to be quantized to a specified high bit number higher than the specified low bit number, and returning to execute step 215, where the higher specified high bit number is, for example, 2 bits, 3 bits, 8 bits, or 32 bits;
step 219, determining a quantized deep neural network model, returning to step 202, and performing model channel clipping and parameter quantization on the compressed model in the next training cycle, that is, when returning to step 202, replacing the deep neural network model to be compressed with the compressed model, and performing model channel clipping and parameter quantization in the next training cycle;
step 220, restoring the initially quantized deep neural network model to the deep neural network model before the current cutting;
and step 221, performing Huffman coding on the parameters in the deep neural network model before the current cutting to obtain a compressed model in the current model training period, and ending the compression of the deep neural network model to be compressed.
In the above model compression process, whether in the process of model channel clipping or in the process of parameter quantization, as long as it is determined in step 208 that the difference between the model accuracy of the clipped deep neural network model and the expected accuracy is less than or equal to zero, or it is determined in step 212 that the loss function value of the initially quantized deep neural network model is greater than the set threshold, the model is restored to the deep neural network model before clipping in the current model training cycle. For example, in a third model training period, if it is determined in step 208 that the difference between the model accuracy of the clipped deep neural network model and the expected accuracy is less than or equal to zero, the model is restored to the deep neural network model before being clipped when the third model training period starts, and huffman coding is performed to obtain a compressed model of the current model training period, and the compression of the deep neural network model to be compressed is ended. For another example, in the fourth model training period, when it is determined in step 212 that the loss function value of the initially quantized deep neural network model is greater than the set threshold, the model is restored to the deep neural network model before being clipped at the beginning of the fourth model training period, and huffman coding is performed to obtain a compressed model of the current model training period, and the compression of the deep neural network model to be compressed is ended.
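Huffman coding of the remaining parameters is ordinary entropy coding over the discrete quantized values. A minimal sketch using Python's heapq, assuming the quantized parameter values have already been gathered into a flat list; it builds a value-to-bit-string code table and is illustrative only, not the embodiment's concrete encoder.

```python
import heapq
from collections import Counter

def huffman_code_table(values):
    """Build a Huffman code table (quantized value -> bit string)."""
    freq = Counter(values)
    if len(freq) == 1:                       # degenerate case: a single distinct value
        return {v: "0" for v in freq}
    # Each heap entry: (frequency, tie_breaker, {value: partial code}).
    heap = [(f, i, {v: ""}) for i, (v, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {v: "0" + code for v, code in left.items()}
        merged.update({v: "1" + code for v, code in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]
```

Frequent quantized values receive short codes, which is what yields the additional storage reduction on top of clipping and quantization.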
By the embodiment of the invention, the problem of low precision of the compressed model caused by artificially setting the model compression parameters in the prior art is solved, and the space occupied by the model is compressed to the maximum extent while the precision after the model compression is ensured by simultaneously carrying out adaptive model channel cutting and parameter quantization in one model training period and continuously carrying out adaptive model channel cutting and parameter quantization in the subsequent model training period.
Correspondingly, fig. 3 is a schematic structural diagram of a compression apparatus of a deep neural network model according to an embodiment of the present invention. As shown in fig. 3, the apparatus includes: the norm calculation unit 31 is configured to calculate a norm corresponding to a model channel in the deep neural network model to be compressed in a current model training period; the cutting unit 32 is configured to cut the model channel according to the norm corresponding to the model channel and the initialization weight threshold corresponding to each layer of neural network, so as to obtain a cut deep neural network model; a judging unit 33, configured to judge whether a difference between the model precision and the expected precision of the cut deep neural network model is greater than zero; a threshold determining unit 34, configured to, when a difference between the model accuracy and the expected accuracy is greater than zero, obtain, according to the difference and an initialization weight threshold corresponding to each layer of neural network, a self-adaptive weight threshold corresponding to each layer of neural network when a model channel is cut in a next model training period; a quantization unit 35, configured to determine a quantized deep neural network model according to an influence degree of a result of quantization of each parameter in the cut deep neural network model on a loss function of the deep neural network model; and the processing unit 36 is configured to use the quantized deep neural network model as a compressed model in a current model training period, and perform clipping and quantization on the compressed model by using the adaptive weight threshold in a next model training period.
Further, the clipping unit is further configured to sort the norms of the model channels from small to large, and use the model channels corresponding to the norms arranged in the previous set percentage as a to-be-clipped channel set; extracting a model channel to be cut from the channel set to be cut, and extracting an initialization weight threshold corresponding to a neural network of a layer where the model channel to be cut is located; judging whether the norm of the model channel to be cut is less than or equal to an initialization weight threshold corresponding to a neural network of a layer where the model channel to be cut is located; and when the norm of the model channel to be cut is less than or equal to the initialization weight threshold, cutting the model channel to be cut to obtain a cut deep neural network model.
Further, the clipping unit is further configured to, when the norm of the model channel to be clipped is greater than the initialization weight threshold, not clip the model channel to be clipped, and extract a next model channel to be clipped from the set of channels to be clipped to perform clipping.
Further, the processing unit is further configured to, when the difference between the model precision and the expected precision is less than or equal to zero, restore the clipped deep neural network model to the deep neural network model before the current clipping, and end the compression of the deep neural network model to be compressed.
Further, the threshold determining unit is further configured to obtain, according to W_A = η_ω × T_r × W, the adaptive weight threshold corresponding to each layer of neural network when the model channel is cut in the next model training period, where W is a threshold vector composed of the initialization weight thresholds corresponding to each layer of neural network, T_r is the difference, η_ω is a coefficient, and W_A is a vector composed of the adaptive weight thresholds corresponding to each layer of neural network.
Furthermore, the quantization unit is further configured to count parameters to be quantized in the clipped deep neural network model; initially quantize the parameters to be quantized to an initial bit number, and judge whether a loss function value of the initially quantized deep neural network model is less than or equal to a set threshold; and when the loss function value of the initially quantized deep neural network model is less than or equal to the set threshold, quantize the initially quantized parameters to be quantized one by one from a specified low bit number toward a specified high bit number until the loss function value of the quantized deep neural network model is less than or equal to the set threshold, thereby determining the quantized deep neural network model.
Further, the processing unit is further configured to, when the loss function value of the initially quantized deep neural network model is greater than the set threshold, restore the initially quantized deep neural network model to the deep neural network model before the current clipping, and end the compression of the deep neural network model to be compressed.
Further, the quantization unit is further configured to extract the parameters to be quantized one by one from the initially quantized parameters to be quantized, quantize the extracted parameter to the specified low bit number, and calculate whether the loss function value of the deep neural network model after parameter quantization is less than or equal to the set threshold; when the loss function value is less than or equal to the set threshold, determine the specified low bit number as the quantization bit number of the parameter to be quantized; when the loss function value is greater than the set threshold, quantize the parameter to be quantized to a specified high bit number higher than the specified low bit number until the loss function value of the deep neural network model after parameter quantization is less than or equal to the set threshold; and when all the initially quantized parameters to be quantized have been quantized, determine the quantized deep neural network model.
Further, as shown in fig. 4, the apparatus further includes: and the encoding unit 37 is configured to perform huffman encoding on the parameters in the deep neural network model before the current clipping to obtain a compressed model of the current model training period.
The specific implementation process of the compression device of the deep neural network model is referred to as the processing process of the compression method of the deep neural network model.
Accordingly, the embodiment of the present invention also provides a machine-readable storage medium, which stores instructions for causing a machine to execute the compression method of the deep neural network model as described above.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.