CN115983349A - Method and device for quantizing convolutional neural network, electronic device and storage medium - Google Patents

Method and device for quantizing convolutional neural network, electronic device and storage medium

Info

Publication number
CN115983349A
Authority
CN
China
Prior art keywords
quantization
network
network layer
layer
bit width
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310103527.6A
Other languages
Chinese (zh)
Inventor
李雪晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zeku Technology Shanghai Corp Ltd
Original Assignee
Zeku Technology Shanghai Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zeku Technology Shanghai Corp Ltd filed Critical Zeku Technology Shanghai Corp Ltd
Priority to CN202310103527.6A priority Critical patent/CN115983349A/en
Publication of CN115983349A publication Critical patent/CN115983349A/en
Pending legal-status Critical Current

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The disclosure provides a quantization method and apparatus for a convolutional neural network, an electronic device, and a computer-readable storage medium, and relates to the field of computer technology. The method comprises the following steps: processing a standard data set with a network to be quantized, and acquiring, layer by layer, the model parameter quantization error corresponding to each network layer in the network to be quantized; dividing the network layers into level groups corresponding to different target quantization bit widths according to their model parameter quantization errors; and quantizing the model parameters of the network layers in each level group according to the corresponding target quantization bit width. The method and apparatus can reduce the quantization complexity of a convolutional neural network and improve its quantization efficiency.

Description

Method and device for quantizing convolutional neural network, electronic device and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a quantization method for a convolutional neural network, a quantization apparatus for a convolutional neural network, an electronic device, and a computer-readable storage medium.
Background
With the rapid development of computer technology, deep learning based on convolutional neural networks has advanced rapidly and achieved considerable success in fields such as vision and language. However, these achievements rely on complex network structures with huge numbers of parameters, which place higher demands on the computing power and memory of the computer, thereby limiting the deployment of convolutional neural networks on edge devices with limited computing power and memory.
In the related art, model compression is achieved through quantization algorithms, but current quantization algorithms require a large amount of training data and have low iteration efficiency, which to some extent restricts the wide application of convolutional neural networks.
Disclosure of Invention
The present disclosure is directed to a method for quantizing a convolutional neural network, a device for quantizing a convolutional neural network, an electronic device, and a computer-readable storage medium, thereby improving iteration efficiency in quantizing a convolutional neural network at least to some extent.
According to a first aspect of the present disclosure, there is provided a quantization method of a convolutional neural network, comprising: processing the standard data set by using a network to be quantized, and acquiring model parameter quantization errors corresponding to each network layer in the network to be quantized layer by layer; dividing each network layer into level groups corresponding to different target quantization bit widths according to the model parameter quantization error of each network layer; and quantizing the model parameters of the network layer in each level group according to the corresponding target quantization bit width.
According to a second aspect of the present disclosure, there is provided a quantization apparatus of a convolutional neural network, comprising: a quantization error acquisition module, configured to process a standard data set with a network to be quantized and acquire, layer by layer, the model parameter quantization error corresponding to each network layer in the network to be quantized; a quantization bit width allocation module, configured to divide the network layers into level groups corresponding to different target quantization bit widths according to the model parameter quantization errors of the network layers; and a quantization processing module, configured to quantize the model parameters of the network layers in each level group according to the corresponding target quantization bit width.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: one or more processors; and a memory storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the above-described method.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the above-described method.
In the quantization method of a convolutional neural network provided by the embodiments of the present disclosure, inference is performed on a standard data set with the network to be quantized, the model parameter quantization error corresponding to each network layer in the network to be quantized is obtained layer by layer, the network layers are divided into level groups corresponding to different target quantization bit widths according to their model parameter quantization errors, and the model parameters of the network layers in each level group are then quantized according to the corresponding target quantization bit widths. On the one hand, this quantization-error-based mixed-precision quantization scheme requires no model iteration or training: a single forward inference pass over the standard data set through the network to be quantized yields the model parameter quantization error of each network layer, and the quantization bit widths are allocated based on these errors, which greatly improves the quantization efficiency of the convolutional neural network and noticeably reduces the quantization time for networks with many or complex layers. On the other hand, the quantization bit width allocation strategy is obtained from a single inference pass over a small amount of training data, so the method generalizes to model quantization scenarios where a large amount of training data is unavailable. In addition, the quantization-error-based bit width allocation scheme reduces the number of iterations of the convolutional neural network and improves the quantization efficiency, which is of practical significance for the deployment and application of convolutional neural networks on edge devices.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
fig. 1 is a schematic diagram illustrating an exemplary application environment related to a quantization method and apparatus for convolutional neural network to which the embodiments of the present disclosure may be applied;
FIG. 2 schematically illustrates a flow chart of a method of quantizing a convolutional neural network in an exemplary embodiment of the present disclosure;
FIG. 3 is a flow chart schematically illustrating a method for obtaining model parameter quantization errors for model weights corresponding to any network layer in an exemplary embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart for determining model parameter quantization error for activation values in an exemplary embodiment of the present disclosure;
FIG. 5 schematically illustrates a schematic structural diagram of a convolutional neural network in an exemplary embodiment of the present disclosure;
FIG. 6 schematically illustrates a flowchart for grouping network layers into classes according to model parameter quantization errors of the network layers according to an exemplary embodiment of the present disclosure;
FIG. 7 is a schematic diagram illustrating an assignment level group in an exemplary embodiment of the present disclosure;
FIG. 8 schematically illustrates a quantization flow diagram for another convolutional neural network in an exemplary embodiment of the present disclosure;
fig. 9 schematically illustrates a composition diagram of a quantization apparatus of a convolutional neural network in an exemplary embodiment of the present disclosure;
fig. 10 shows a schematic diagram of an electronic device to which embodiments of the present disclosure may be applied.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
Fig. 1 is a schematic diagram illustrating an exemplary application environment related to a quantization method and apparatus for a convolutional neural network to which an embodiment of the present disclosure may be applied.
As shown in fig. 1, taking as an example the case where the quantization method of the convolutional neural network is applied to an electronic device embodied as a desktop computer, the electronic device may obtain a network to be quantized and a standard data set for training, process the standard data set with the network to be quantized, acquire layer by layer the model parameter quantization error corresponding to each network layer in the network to be quantized, divide the network layers into level groups corresponding to different target quantization bit widths according to their model parameter quantization errors, and then quantize the model parameters of the network layers in each level group according to the corresponding target quantization bit widths.
The electronic device may be an intelligent device with a model quantization processing function, for example, the electronic device may be an intelligent device such as a smart phone, a computer, a tablet computer, and the like, and the electronic device may also be referred to as a mobile terminal, a mobile device, a terminal device, and the like.
It should be noted that the quantization method of the convolutional neural network provided by the embodiments of the present disclosure may be performed by an electronic device. It may also be executed by a server. The server may be a background system providing services related to quantization of convolutional neural networks in the embodiments of the present disclosure, and may comprise one electronic device with computing capability, such as a portable computer, a desktop computer, or a smartphone, or a cluster formed by multiple such electronic devices. In the embodiments of the present disclosure, the description takes the case where an electronic device performs the quantization method of the convolutional neural network as an example.
Deep learning based on convolutional neural networks has achieved good results in many fields, but it depends to a large extent on complex network structures with huge numbers of parameters, which places higher demands on the computing power and memory of the computer. For some edge devices, such as mobile phones and cameras, computing power and memory are limited, which restricts the deployment and application of deep convolutional neural networks on these devices.
In the related art, model compression is performed by methods such as quantization, pruning, and distillation. Among them, the quantization algorithm is a crucial link in the deployment of deep convolutional neural networks. Parameters of a deep convolutional neural network are stored as 32-bit floating-point numbers (Float32, full precision). Quantizing them into integers such as 8-bit integers (Int8), i.e., a quantization bit width of 8 bits, or into lower-precision integers such as 4-bit integers (Int4), i.e., a quantization bit width of 4 bits, can reduce the memory footprint of the model and increase its operation speed.
However, reducing the bit width of the parameter representation inevitably causes information loss, so the quantized model suffers a quantization loss compared with the original model, resulting in decreased model accuracy. Therefore, to address the accuracy loss caused by quantization, the convolutional neural network may be quantized in a mixed-precision quantization manner. Mixed-precision quantization refers to quantizing different parameters in the model with different quantization bit widths, thereby reducing the quantization loss and improving the accuracy of the quantized model.
Existing mixed-precision quantization algorithms include metric-based and search-based algorithms, but both require a large amount of training data and many iterations before the algorithm converges or the model accuracy requirement is met. Especially for networks with many layers or complex structures, the operation time is long, which greatly affects the quantization efficiency of the model and limits the wide deployment and application of convolutional neural networks to a certain extent.
Based on one or more of the problems described above, exemplary embodiments of the present disclosure provide a quantization method of a convolutional neural network. Referring to fig. 2, the quantization method of the convolutional neural network may include the following steps S210 to S230:
in step S210, the standard data set is processed by using the network to be quantized, and a model parameter quantization error corresponding to each network layer in the network to be quantized is obtained layer by layer.
In an exemplary embodiment of the present disclosure, the network to be quantized is a convolutional neural network comprising a plurality of network layers. The model parameters may include the model weights and the activation values output by the model through the activation layers; the model weights may be quantized, the activation values may be quantized, or both may be quantized. The model parameter quantization error refers to the difference between the model parameters before and after quantization; for example, the model parameter quantization error of the model weights of a certain network layer is the difference between those model weights before and after quantization.
The standard data set is input to the network to be quantized, and the model parameter quantization error corresponding to each network layer is obtained. The model parameter quantization error is an accumulated quantization error during a forward inference process of the network to be quantized, which will be described in detail later.
It should be noted that the model parameter quantization errors of the model weights and of the activation values of a network layer are calculated separately.
In step S220, each network layer is divided into level groups corresponding to different target quantization bit widths according to the model parameter quantization error of each network layer.
In the exemplary embodiments of the present disclosure, each level group corresponds to a different target quantization bit width. The target quantization bit width of a level group may be set according to the actual model quantization requirements, for example one level group with a target quantization bit width of 8 bits and another with a target quantization bit width of 4 bits; this is not particularly limited here. Of course, in the embodiments of the present disclosure, the target quantization bit width of each level group may also be set by the user.
Each network layer is assigned to a level group corresponding to a target quantization bit width according to its model parameter quantization error. Network layers in the same level group use the same target quantization bit width, while network layers in different level groups use different target quantization bit widths, so that different model parameters in the network to be quantized are quantized with different bit widths, which reduces the quantization loss and improves the accuracy of the quantized model.
In step S230, the model parameters of the network layers in each level group are quantized according to the corresponding target quantization bit width.
In the exemplary embodiment of the present disclosure, the target quantization bit width corresponding to the level group to which the network layer belongs is used as the target quantization bit width of the network layer, and quantization processing is performed to obtain a quantized network model.
According to the technical scheme of the embodiment of the disclosure, on one hand, a mixed precision quantization scheme based on quantization errors does not need model iteration and a training process, and a forward reasoning processing process is performed through a network to be quantized based on a standard data set, so that the model parameter quantization errors corresponding to each network layer in the network to be quantized can be obtained, the bit width is allocated based on the model parameter quantization errors, the quantization efficiency of the convolutional neural network is greatly improved, and the quantization operation time can be obviously reduced for the quantization process of the convolutional neural network with more network layers or complex network layers; on the other hand, a quantization bit width distribution strategy can be obtained by performing reasoning operation once by using less training data, and the method has universality for quantization scenes without a large amount of training data.
In an exemplary embodiment, the model parameters of a network layer include model weights, and the corresponding model parameter quantization error of the network layer includes a weight quantization error. Fig. 3 shows a flowchart for obtaining the model parameter quantization error corresponding to any network layer according to an example of the present disclosure. As shown in fig. 3, determining the model parameter quantization error of the model weights may include steps S310 to S340:
step S310, obtaining original output data of an upstream network layer of the network layer, and determining first output data according to the original output data and a first weight of the network layer, wherein the original output data is an output result processed by an inactive layer.
The original output data is the result output by the upstream network layer without being processed by the active layer.
Step S320: and quantizing the first weight, and performing inverse quantization operation on a quantization processing result to obtain a second weight.
The quantization process on the first weight is a process of converting floating point data into integer data, and conversely, the dequantization process on the result of the quantization process is a process of converting integer data into estimated floating point data, i.e., the second weight.
The quantization and dequantization of the model weights are explained below. For convenience of description, let $W$ be the floating-point weight and $W_{int}$ the quantized integer weight; then formula (1) is:

$$W_{int} = \mathrm{clamp}\!\left(\left\lfloor \frac{W}{s} \right\rceil + z,\ q_{min},\ q_{max}\right) \tag{1}$$

where $\lfloor\cdot\rceil$ is the nearest-rounding operator, and $q_{min}$ and $q_{max}$ are respectively the minimum and maximum values of the integer domain, which can be determined from the quantization bit width $b$ used for quantization: $q_{min} = -2^{b-1}$, $q_{max} = 2^{b-1} - 1$. The cutoff function clamp is defined as (2):

$$\mathrm{clamp}(x;\,q_{min},\,q_{max}) = \begin{cases} q_{min}, & x < q_{min} \\ x, & q_{min} \le x \le q_{max} \\ q_{max}, & x > q_{max} \end{cases} \tag{2}$$

Here $z$ is the zero point of the quantization and is an integer; $s$ is the scaling factor, a floating-point number representing the quantization step size, which is obtained by formula (3):

$$s = \frac{f_{max} - f_{min}}{q_{max} - q_{min}} \tag{3}$$

where $f_{max}$ and $f_{min}$ are the maximum and minimum values in the floating-point domain, which may be obtained through statistics on the obtained feature map or in other existing ways; this is not particularly limited in the embodiments of the present disclosure.

Formulas (1) to (3) realize the quantization of the model weights, mapping the floating-point number $W$ to the integer number $W_{int}$. Conversely, the dequantization can be performed by formula (4), mapping the integer number $W_{int}$ back to an estimated floating-point number $\hat{W}$:

$$\hat{W} = s \cdot (W_{int} - z) \tag{4}$$
It should be noted that other quantization and dequantization manners may also be adopted in the embodiments of the present disclosure, and no particular limitation is imposed on this.
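For illustration, the following Python/NumPy sketch implements formulas (1) to (4) as uniform quantization and dequantization. The function names, the symmetric integer range, and the choice of a zero point of 0 in the example are assumptions made here for clarity and are not prescribed by the disclosure.

```python
import numpy as np

def quant_params(f_min, f_max, bit_width):
    """Integer range and scaling factor (quantization step size) per formula (3)."""
    q_min = -2 ** (bit_width - 1)
    q_max = 2 ** (bit_width - 1) - 1
    s = (f_max - f_min) / (q_max - q_min)
    return s, q_min, q_max

def quantize(w, s, z, q_min, q_max):
    """Formula (1): W_int = clamp(round(W / s) + z, q_min, q_max); np.clip plays the role of formula (2)."""
    return np.clip(np.round(w / s) + z, q_min, q_max)

def dequantize(w_int, s, z):
    """Formula (4): W_hat = s * (W_int - z), the estimated floating-point value."""
    return s * (w_int - z)

# Example: simulate 4-bit quantization of a float32 weight tensor (zero point z assumed to be 0).
w = np.random.randn(64, 128).astype(np.float32)
s, q_min, q_max = quant_params(float(w.min()), float(w.max()), bit_width=4)
w_int = quantize(w, s, z=0, q_min=q_min, q_max=q_max)
w_hat = dequantize(w_int, s, z=0)  # the "second weight" used later for the quantization error
```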
Step S330: and obtaining second output data of the quantized upstream network layer, and determining third output data according to the second output data and the second weight.
Before the second output data of the quantized upstream network layer is obtained, the upstream network layer is first quantized with a preset quantization precision, for example any one of 4-bit, 8-bit, and 16-bit. The output of the quantized upstream network layer is taken as the second output data, and the third output data is determined from the second output data and the second weight obtained after the quantization and dequantization operations. Based on this, the difference between the first output data and the third output data can be used as a quantization error factor for measuring the model weights.
Step S340: and determining a weight quantization error according to the first output data and the third output data.
After the first output data and the third output data are obtained, the weight quantization error can be calculated based on the F-norm. Because combining the first output data and the third output data accumulates the quantization errors of the upstream layers when the weight quantization error is calculated, the contribution of the network layer to the actual quantization error can be accurately reflected.
The embodiments of the present disclosure may obtain the model parameter quantization error of the model weights by the following formula (5):

$$Err_w = \left\| Wx - \hat{W}\hat{x} \right\|_F \tag{5}$$

where $Err_w$ is the model parameter quantization error of the model weights, $\|\cdot\|_F$ is the F-norm, $Wx$ is the product of the first weight and the original output data $x$ of the upstream network layer, i.e. the first output data, and $\hat{W}\hat{x}$ is the product of the second weight $\hat{W}$ and the quantized second output data $\hat{x}$ of the upstream network layer, i.e. the third output data.
As can be seen from formula (5), the quantization errors are accumulated in the model weight quantization error corresponding to each network layer, so that the contribution of the network layer to the actual quantization error can be accurately reflected.
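A minimal sketch of formula (5) for a single fully connected layer may look as follows; the helper fake_quant stands in for the quantize-dequantize pipeline of formulas (1) to (4), and all names, shapes, and the use of random data are illustrative assumptions rather than part of the disclosed method.

```python
import numpy as np

def fake_quant(t, bit_width):
    """Quantize then dequantize a tensor (formulas (1)-(4)), simulating the quantization loss."""
    q_min, q_max = -2 ** (bit_width - 1), 2 ** (bit_width - 1) - 1
    s = (float(t.max()) - float(t.min())) / (q_max - q_min)
    return s * np.clip(np.round(t / s), q_min, q_max)

def weight_quant_error(w, x_raw, x_quant, bit_width):
    """Formula (5): Err_w = || W x - W_hat x_hat ||_F, accumulating upstream quantization error."""
    first_output = x_raw @ w.T                 # first output data: raw upstream output times the first weight
    w_hat = fake_quant(w, bit_width)           # second weight (quantized then dequantized)
    third_output = x_quant @ w_hat.T           # third output data: quantized upstream output times the second weight
    return np.linalg.norm(first_output - third_output)  # Frobenius norm of the difference

# x_raw: un-quantized output of the upstream layers; x_quant: the same output after quantization.
w = np.random.randn(32, 64).astype(np.float32)
x_raw = np.random.randn(8, 64).astype(np.float32)
x_quant = fake_quant(x_raw, bit_width=8)
err_w = weight_quant_error(w, x_raw, x_quant, bit_width=4)
```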
In an exemplary embodiment, the model parameter of the network layer includes an activation value output by the network layer, and the quantization error of the model parameter corresponding to the corresponding network layer includes an activation value quantization error. Fig. 4 shows another exemplary flowchart for obtaining a model parameter quantization error corresponding to any network layer according to the present disclosure, and as shown in fig. 4, determining the model parameter quantization error of the activation value may include steps S410 to S430:
step S410: and processing the first output data by an activation layer to obtain a first activation value.
Step S420: and quantizing the result obtained by processing the third output data through the active layer, and performing inverse quantization on the quantized result to obtain a second active value.
Step S430: an activation value quantization error is determined based on the first activation value and the second activation value.
Based on this, the difference between the first activation value and the second activation value may serve as a quantization error factor for scaling the activation values.
Note that the quantization and dequantization operations in step S420 may be implemented with reference to step S320; of course, the quantization and dequantization methods are not particularly limited in the embodiments of the present disclosure.
The embodiments of the present disclosure may obtain the model parameter quantization error of the activation value output by a network layer through the following formula (6):

$$Err_a = \left\| f_a(Wx) - \widehat{f_a(\hat{W}\hat{x})} \right\|_F \tag{6}$$

where $Err_a$ is the model parameter quantization error of the activation value, $\|\cdot\|_F$ is the F-norm, $f_a(Wx)$ is the first activation value obtained by processing the first output data with the activation layer, $\widehat{f_a(\hat{W}\hat{x})}$ is the second activation value obtained by quantizing the result of processing the third output data with the activation layer and then dequantizing the quantization result, and $f_a(\cdot)$ is the activation function, for example a Rectified Linear Unit (ReLU); the embodiments of the present disclosure do not specifically limit the type of the activation function.
As can be seen from the formula (6), quantization error accumulation is performed on the activation value quantization error corresponding to each network layer, so that the contribution condition of the network layer in the actual quantization error can be accurately reflected.
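Formula (6) can be sketched analogously, here with ReLU assumed as the activation function f_a; as before, the quantize-dequantize helper and all names are illustrative assumptions.

```python
import numpy as np

def fake_quant(t, bit_width):
    """Quantize then dequantize a tensor (formulas (1)-(4))."""
    q_min, q_max = -2 ** (bit_width - 1), 2 ** (bit_width - 1) - 1
    s = (float(t.max()) - float(t.min())) / (q_max - q_min)
    return s * np.clip(np.round(t / s), q_min, q_max)

def activation_quant_error(first_output, third_output, bit_width):
    """Formula (6): Err_a = || f_a(Wx) - fake_quant(f_a(W_hat x_hat)) ||_F, with f_a = ReLU."""
    relu = lambda t: np.maximum(t, 0.0)
    first_activation = relu(first_output)                           # first activation value
    second_activation = fake_quant(relu(third_output), bit_width)   # second activation value
    return np.linalg.norm(first_activation - second_activation)     # Frobenius norm

# first_output / third_output correspond to Wx and W_hat x_hat from the weight-error computation.
first_output = np.random.randn(8, 32).astype(np.float32)
third_output = np.random.randn(8, 32).astype(np.float32)
err_a = activation_quant_error(first_output, third_output, bit_width=4)
```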
Fig. 5 shows a schematic structural diagram of a convolutional neural network according to an exemplary embodiment of the present disclosure, and the following description is given with reference to fig. 5 by taking an example of a weight quantization error for obtaining model weights.
For the second network layer, the original output data of the first network layer is first obtained, and the first output data is determined from this original output data and the first weight of the second network layer. The first weight is then quantized and dequantized to obtain the second weight, the quantized second output data of the upstream network layer (the first network layer) is obtained, and the third output data is determined from the second output data and the second weight. Finally, the weight quantization error is determined from the first output data and the third output data. In this way, the quantization error of the upstream network layer is accumulated in the process of obtaining the weight quantization error of the second network layer.
For the third network layer, the first two network layers can be regarded as a sub-network as a whole. The original output data of this sub-network (i.e., the original output data of the second network layer) is first obtained, and the first output data is determined from this original output data and the first weight of the third network layer. The first weight is then quantized and the quantization result is dequantized to obtain a second weight, and the third output data is determined from the quantized second output data of the sub-network and the second weight. Finally, the weight quantization error is determined from the first output data and the third output data. In this way, the quantization errors of the first two network layers are accumulated in the model weight quantization error corresponding to the third network layer.
By analogy, for the N-th network layer, the model parameter quantization error of its model weights is determined from the un-quantized output of the first N-1 network layers and the quantized output of the first N-1 network layers, so that the quantization errors generated by the first N-1 network layers are accumulated.
It should be noted that the weight quantization error of each network layer can be calculated as shown in formula (5), which is not repeated here.
In an exemplary embodiment, an implementation manner of determining a target quantization bit width of a network layer is further provided, and as shown in fig. 6, dividing each network layer into level groups corresponding to different target quantization bit widths according to a model parameter quantization error of each network layer may include:
distributing each network layer to a level group corresponding to different target quantization bit widths based on the numerical value of the model parameter quantization error of each network layer; determining the target quantization bit width of the level group corresponding to each network layer as the target quantization bit width of each network layer;
and the model parameter quantization error of the network layer in the level group with the high target quantization bit width is larger than the model parameter quantization error of the network layer in the level group with the low target quantization bit width.
After the model parameter quantization errors of the network layers are sorted in descending order, the network layers with larger quantization errors are assigned to the level group with the higher target quantization bit width, and the network layers with smaller quantization errors are assigned to the level group with the lower target quantization bit width.
Referring to fig. 7, if only two level groups (a first level group and a second level group) exist, the target quantization bit width corresponding to the first level group is 8 bits, the target quantization bit width corresponding to the second level group is 4 bits, and the network to be quantized includes 50 network layers, then after the 50 network layers are sorted from large to small according to the model parameter quantization error, the first 30 network layers are obtained and allocated to the first level group, and the remaining network layers are allocated to the second level group. Wherein the number of network layers in each level group can be preset and allowed to be adjusted as required.
By assigning each network layer of the network to be quantized to the corresponding level group and determining its target quantization bit width according to the calculated model parameter quantization error, different parameters of the model are quantized with different quantization bit widths, so that the information loss caused by an overly low quantization bit width is avoided, the quantization loss is reduced, and the accuracy of the network is improved.
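The grouping described above can be sketched as follows: the layers are sorted by their model parameter quantization error and the largest-error layers are placed in the group with the higher target quantization bit width. The group sizes and bit widths mirror the fig. 7 example (30 layers at 8 bits, the rest at 4 bits); the function name and data layout are assumptions made for illustration.

```python
import random

def assign_level_groups(layer_errors, group_sizes, group_bit_widths):
    """Assign the layers with the largest quantization errors to the highest target bit width.

    group_sizes and group_bit_widths are aligned lists, ordered from the highest
    target quantization bit width to the lowest; group_sizes should sum to the
    number of network layers.
    """
    order = sorted(range(len(layer_errors)), key=lambda i: layer_errors[i], reverse=True)
    bit_width_per_layer = {}
    start = 0
    for size, bits in zip(group_sizes, group_bit_widths):
        for layer_idx in order[start:start + size]:
            bit_width_per_layer[layer_idx] = bits
        start += size
    return bit_width_per_layer

# Fig. 7 example: 50 layers, first level group (8 bit) holds 30 layers, second (4 bit) the remaining 20.
errors = [random.random() for _ in range(50)]  # placeholder per-layer model parameter quantization errors
bit_widths = assign_level_groups(errors, group_sizes=[30, 20], group_bit_widths=[8, 4])
```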
In an exemplary embodiment, before the model parameters of the network layers in each level group are quantized according to the corresponding target quantization bit width, the following process may further be performed in a loop until the predicted quantized model size corresponding to the network to be quantized meets a preset size threshold:

determining the predicted quantized model size based on the target quantization bit width corresponding to each network layer;

in response to the predicted quantized model size being larger than the preset size threshold, reducing the number of network layers in the level group with the higher target quantization bit width, increasing the number of network layers in the level group with the lower target quantization bit width, and re-determining the target quantization bit width of each network layer according to its level group.

If the predicted quantized model size is still too large, the number of network layers quantized with higher precision can be reduced, that is, the number of network layers in the level group with the higher target quantization bit width is reduced and the number of network layers in the level group with the lower target quantization bit width is correspondingly increased.
With continued reference to fig. 7, if the predicted quantized model size is still too large, the first 20 network layers may be allocated to the first level group and the remaining network layers to the second level group.
Based on this, the embodiments of the present disclosure may allocate the target quantization bit width of each network layer according to the model scale requirement of the user, so as to obtain a quantization result meeting the user's expectation.
In an exemplary embodiment, before the model parameters of the network layers in each level group are quantized according to the corresponding target quantization bit width, the following process may further be performed in a loop until the predicted quantized model accuracy corresponding to the network to be quantized reaches a preset model accuracy threshold:

determining the predicted quantized model accuracy based on the target quantization bit width corresponding to each network layer;

in response to the predicted quantized model accuracy being smaller than the preset model accuracy threshold, increasing the number of network layers in the level group with the higher target quantization bit width, reducing the number of network layers in the level group with the lower target quantization bit width, and re-determining the target quantization bit width of each network layer according to its level group.

If the predicted quantized model accuracy still cannot meet the accuracy requirement, the number of network layers quantized with higher precision can be increased, that is, the number of network layers in the level group with the higher target quantization bit width is increased and the number of network layers in the level group with the lower target quantization bit width is correspondingly decreased.
With continued reference to fig. 7, if the predicted quantized model accuracy is still below the requirement, the first 35 network layers may be allocated to the first level group and the remaining network layers to the second level group.
Based on this, the embodiments of the present disclosure may allocate the target quantization bit width of each network layer according to the model precision requirement of the user, so as to obtain a quantization result meeting the user's expectation.
In addition, the size of the model and the precision of the model can be balanced by adjusting the target quantization bit width corresponding to each level group and the number of network layers in each level group.
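The adjustment loop described above can be sketched as follows, assuming two level groups and hypothetical callbacks estimate_model_size and estimate_model_accuracy that predict the size and accuracy of the quantized model from a candidate bit-width assignment; neither helper is defined by the disclosure, so they are passed in as parameters. The loop shrinks the high-bit-width group while the predicted size exceeds the threshold and grows it while the predicted accuracy falls short, reflecting the size/accuracy trade-off noted above.

```python
def tune_group_split(layer_errors, high_bits, low_bits, n_high,
                     size_threshold, accuracy_threshold,
                     estimate_model_size, estimate_model_accuracy,
                     max_rounds=100):
    """Adjust how many layers sit in the high-bit-width level group until the
    predicted quantized model satisfies the size and accuracy thresholds (if it can)."""
    n_layers = len(layer_errors)
    # Layers sorted by quantization error, largest first: the first n_high get the high bit width.
    order = sorted(range(n_layers), key=lambda i: layer_errors[i], reverse=True)
    bit_widths = {}
    for _ in range(max_rounds):
        bit_widths = {i: (high_bits if rank < n_high else low_bits)
                      for rank, i in enumerate(order)}
        if estimate_model_size(bit_widths) > size_threshold and n_high > 0:
            n_high -= 1      # predicted model too large: shrink the high-bit-width group
        elif estimate_model_accuracy(bit_widths) < accuracy_threshold and n_high < n_layers:
            n_high += 1      # predicted accuracy too low: grow the high-bit-width group
        else:
            break            # both constraints met, or no further adjustment is possible
    return bit_widths
```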
In an exemplary embodiment, the target network layer number corresponding to each level group may be determined in response to a network layer number adjustment operation for each level group, and each network layer may be allocated to a level group corresponding to a different target quantization bit width based on a numerical value of a model parameter quantization error of each network layer according to the target network layer number corresponding to each level group.
Continuing with fig. 7, after 50 network layers are sorted from large to small according to the model parameter quantization error, the target network layer number corresponding to each level group is determined based on the adjustment operation of the network layer number, and then each network layer is allocated to the level group corresponding to different target quantization bit widths based on the numerical value of the model parameter quantization error of each network layer.
Of course, the target quantization bit width corresponding to each level group may also be determined in response to the operation of adjusting the target quantization bit width corresponding to each level group, so as to adjust the target quantization bit width of the corresponding network layer.
Fig. 8 shows a quantization flow chart of another convolutional neural network according to an exemplary embodiment of the present disclosure, and a quantization process of the convolutional neural network according to an embodiment of the present disclosure is described below by taking quantization of an activation value of a model as an example.
Step S810: the selectable mixed-precision target quantization bit widths are preset to 16 bits and 4 bits, and each target quantization bit width corresponds to a different level group.
Step S820: and selecting part of training data as a standard data set.
Step S830: the standard data set is used as the input of the network to be quantized (comprising 50 network layers), and inference is performed on the standard data with the network to be quantized.
In the inference process, the activation value quantization error of each network layer is obtained layer by layer, and the obtaining manner can be seen in steps S410 to S430.
Step S840: and distributing each network layer to a level group corresponding to different target quantization bit widths based on the numerical value of the quantization error of the activation value of each network layer.
The activation value quantization errors of the network layers are sorted in descending order; the network layers with larger activation value quantization errors are allocated to the level group with a target quantization bit width of 16 bits, and the remaining network layers are allocated to the level group with a target quantization bit width of 4 bits. The number of network layers in each level group may be determined according to a preset, for example 15 network layers in the level group with a target quantization bit width of 16 bits and 35 network layers in the level group with a target quantization bit width of 4 bits.
Step S850: quantization processing may be performed according to the target quantization bit width corresponding to each network layer to obtain the quantized network. In this way, a target quantization bit width can be allocated to every network layer, and quantization can be carried out, with a single forward inference pass.
In step S860, the predicted quantized model size and/or model accuracy may also be determined based on the target quantization bit width corresponding to each network layer, the number of network layers in each level group is adjusted according to the preset size threshold and/or the preset model accuracy threshold, and step S840 is repeated until a predicted quantization result whose model size and/or model accuracy meets the requirement is obtained.
If the scale of the model after the prediction quantization is larger than a preset scale threshold, reducing the number of network layers in the level group with the high target quantization bit width and increasing the number of network layers in the level group with the low target quantization bit width; and if the model precision after the prediction quantization is smaller than the preset model precision threshold value, increasing the number of network layers in the level group with the high target quantization bit width and reducing the number of network layers in the level group with the low target quantization bit width.
In step S870, when the predicted quantization result meets the actual requirement, the quantization processing is performed on the corresponding network layer according to the target quantization bit width corresponding to the level group to which each network layer belongs, so as to obtain a quantized network.
In summary, in the quantization method of the convolutional neural network provided by the embodiments of the present disclosure, inference is performed on a standard data set with the network to be quantized, the model parameter quantization error corresponding to each network layer in the network to be quantized is obtained layer by layer, the network layers are divided into level groups corresponding to different target quantization bit widths according to their model parameter quantization errors, and the model parameters of the network layers in each level group are then quantized according to the corresponding target quantization bit widths. On the one hand, this quantization-error-based mixed-precision scheme requires no model iteration or training: a single forward inference pass over the standard data set through the network to be quantized yields the model parameter quantization error of each network layer, and quantization bit widths are allocated based on these errors, which greatly improves the quantization efficiency of the convolutional neural network and noticeably reduces the quantization time for networks with many or complex layers. On the other hand, the quantization bit width allocation strategy is obtained from a single inference pass over a small amount of training data, so the method generalizes to quantization scenarios where a large amount of training data is unavailable. In addition, the quantization-error-based bit width allocation scheme reduces the number of iterations of the convolutional neural network and improves the quantization efficiency, which is of practical significance for the further deployment of convolutional neural networks on edge devices.
It is noted that the above-mentioned figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the disclosure and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed, for example, synchronously or asynchronously in multiple modules.
Further, referring to fig. 9, an exemplary embodiment of the present disclosure provides a quantization apparatus 900 of a convolutional neural network, which includes a quantization error acquisition module 910, a quantization bit width allocation module 920, and a quantization processing module 930. Wherein:
a quantization error obtaining module 910, configured to process the standard data set by using the network to be quantized, and obtain, layer by layer, a model parameter quantization error corresponding to each network layer in the network to be quantized;
a quantization bit width allocation module 920, configured to divide each network layer into level groups corresponding to different target quantization bit widths according to the model parameter quantization error of each network layer;
and a quantization processing module 930, configured to perform quantization processing on the model parameters of the network layers in each level group according to the corresponding target quantization bit width.
In an exemplary embodiment, the model parameter quantization error corresponding to the network layer comprises a weight quantization error; the quantization error acquisition module 910 may include:
the first acquisition unit is used for acquiring original output data of an upstream network layer of the network layer and determining first output data according to the original output data and first weight of the network layer, wherein the original output data is an output result processed by an inactivated layer; the first quantization processing unit is used for performing quantization processing on the first weight and performing inverse quantization operation on a quantization processing result to obtain a second weight; the first obtaining unit is further configured to obtain quantized second output data of the upstream network layer, and determine third output data according to the second output data and the second weight; a first quantization error obtaining unit configured to determine the weighted quantization error according to the first output data and the third output data.
In an exemplary embodiment, the quantization errors of the model parameters corresponding to the network layer further include activation value quantization errors; the quantization error acquisition module 910 may include:
the second acquisition unit is used for processing the first output data through an activation layer to obtain a first activation value; the second quantization processing unit is used for performing quantization processing on a result obtained by processing the third output data through the active layer and performing inverse quantization processing on the result of the quantization processing to obtain a second active value; a second quantization error acquisition unit for determining the activation value quantization error from the first activation value and the second activation value.
In an exemplary embodiment, the quantization apparatus 900 of the convolutional neural network further includes:
and the network quantization module is used for performing quantization processing on the upstream network layer based on preset quantization precision to obtain the quantized upstream network layer.
In an exemplary embodiment, the quantized bit width allocation module 920 includes:
the distribution unit is used for distributing each network layer to a level group corresponding to different target quantization bit widths based on the numerical value of the model parameter quantization error of each network layer; a second determining unit, configured to determine a target quantization bit width of the level group corresponding to each network layer as the target quantization bit width of each network layer; and the model parameter quantization error of the network layer in the level group with the high target quantization bit width is larger than the model parameter quantization error of the network layer in the level group with the low target quantization bit width.
In an exemplary embodiment, the quantization apparatus 900 of the convolutional neural network further includes:
the first adjusting module is used for circularly executing the following processes until the scale of the pre-quantized model corresponding to the network to be quantized meets a preset scale threshold value: determining the scale of the model after the prediction quantization based on the target quantization bit width corresponding to each network layer; and in response to the model scale after the prediction quantization being larger than a preset scale threshold, reducing the number of network layers in the level group with the high target quantization bit width, increasing the number of network layers in the level group with the low target quantization bit width, and determining the target quantization bit width of each network layer according to the corresponding level group.
In an exemplary embodiment, the quantization apparatus 900 of the convolutional neural network further includes:
the second adjusting module is used for circularly executing the following processes until the precision of the pre-quantized model corresponding to the network to be quantized reaches a set model precision threshold value: determining the model precision after the prediction quantization based on the target quantization bit width corresponding to each network layer; and in response to the model precision after the prediction quantization being smaller than a preset model precision threshold, increasing the number of network layers in the level group with the high target quantization bit width, decreasing the number of network layers in the level group with the low target quantization bit width, and determining the target quantization bit width of each network layer according to the corresponding level group.
In an exemplary embodiment, the distribution unit is further configured to determine, in response to a network layer number adjustment operation for each of the level groups, a target network layer number corresponding to each of the level groups; and to allocate each network layer to a level group corresponding to a different target quantization bit width based on the numerical value of the model parameter quantization error of each network layer and according to the target network layer number corresponding to each level group.
The specific details of each module in the above apparatus have been described in detail in the method section, and details that are not disclosed may refer to the method section, and thus are not described again.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."
An electronic device for performing the above method is also provided in an exemplary embodiment of the present disclosure; it may be the electronic device or the server described above. Generally, the electronic device comprises at least a processor and a memory for storing executable instructions of the processor, the processor being configured to perform the above-mentioned method via execution of the executable instructions.
The following takes the mobile terminal 1000 in fig. 10 as an example, and exemplifies the configuration of the electronic device in the embodiment of the present disclosure. It will be appreciated by those skilled in the art that the configuration of fig. 10 can also be applied to fixed type devices, in addition to components specifically intended for mobile purposes. In other embodiments, mobile terminal 1000 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware. The interfacing relationship between the various components is shown schematically and is not meant to limit the structure of the mobile terminal 1000. In other embodiments, the mobile terminal may also interface differently from fig. 10, or a combination of multiple interfaces.
As shown in fig. 10, the mobile terminal 1000 may specifically include: processor 1001, memory 1002, bus 1003, mobile communication module 1004, antenna 1, wireless communication module 1005, antenna 2, display 1006, camera module 1007, audio module 1008, power module 1009, and sensor module 1010.
Processor 1001 may include one or more processing units, such as: the Processor 1001 may include an AP (Application Processor), a modem Processor, a GPU (Graphics Processing Unit), an ISP (Image Signal Processor), a controller, an encoder, a decoder, a DSP (Digital Signal Processor), a baseband Processor, and/or an NPU (Neural-Network Processing Unit), etc.
An encoder may encode (i.e., compress) an image or video to reduce the data size for storage or transmission. The decoder may decode (i.e., decompress) the encoded data of the image or video to recover the image or video data. The mobile terminal 1000 may support one or more encoders and decoders, for example for image formats such as JPEG (Joint Photographic Experts Group), PNG (Portable Network Graphics), and BMP (Bitmap), and for video formats such as MPEG (Moving Picture Experts Group) 1, MPEG2, H.263, H.264, and HEVC (High Efficiency Video Coding).
The processor 1001 may be connected to the memory 1002 or other components through the bus 1003.
The memory 1002 may be used to store computer-executable program code, which includes instructions. Processor 1001 executes various functional applications and data processing of mobile terminal 1000 by executing instructions stored in memory 1002. The memory 1002 may also store application data, such as files for storing images, videos, and the like.
The communication function of the mobile terminal 1000 may be implemented by the mobile communication module 1004, the antenna 1, the wireless communication module 1005, the antenna 2, a modem processor, a baseband processor, and the like. The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. The mobile communication module 1004 may provide a mobile communication solution of 3G, 4G, 5G, etc. applied to the mobile terminal 1000. The wireless communication module 1005 may provide a wireless communication solution for wireless lan, bluetooth, near field communication, etc. applied to the mobile terminal 1000.
The display screen 1006 is used for implementing display functions, such as displaying a user interface, images, videos, and the like, and displaying exception notification information. The camera module 1007 is used to implement a shooting function, such as shooting images, videos, etc., so as to capture images of a scene. The audio module 1008 is used to implement audio functions, such as playing audio, collecting voice, and the like. The power module 1009 is used to implement power management functions, such as charging a battery, supplying power to a device, monitoring a battery status, and the like. The sensor module 1010 may include one or more sensors for implementing corresponding inductive sensing functions.
Furthermore, the exemplary embodiments of the present disclosure also provide a computer-readable storage medium on which a program product capable of implementing the above-described method of the present specification is stored. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure as described in the above-mentioned "exemplary methods" section of this specification, when the program product is run on the terminal device.
It should be noted that the computer readable media shown in the present disclosure may be computer readable signal media or computer readable storage media or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Furthermore, program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (11)

1. A method for quantizing a convolutional neural network, comprising:
processing a standard data set by using a network to be quantized, and acquiring, layer by layer, a model parameter quantization error corresponding to each network layer in the network to be quantized;
dividing each network layer into level groups corresponding to different target quantization bit widths according to the model parameter quantization error of each network layer;
and quantizing the model parameters of the network layer in each level group according to the corresponding target quantization bit width.
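To make the claimed flow concrete, the following Python sketch (not part of the claims) walks through the three steps of claim 1 under assumed names: compute_layer_error, assign_bit_widths, and quantize_layer are hypothetical helpers standing in for the error measurement, level-group division, and per-group quantization described above.

```python
# Illustrative sketch only; names and helpers are hypothetical, not the
# patented implementation.

def quantize_network(network, calibration_set, compute_layer_error,
                     assign_bit_widths, quantize_layer):
    # Step 1: run the calibration ("standard") data set through the float
    # network and collect a quantization error per network layer.
    layer_errors = {
        layer.name: compute_layer_error(layer, calibration_set)
        for layer in network.layers
    }

    # Step 2: divide the layers into level groups, each group mapped to a
    # target quantization bit width (see the grouping sketch after claim 5).
    bit_widths = assign_bit_widths(layer_errors)

    # Step 3: quantize the parameters of every layer with the bit width of
    # its group.
    for layer in network.layers:
        quantize_layer(layer, bit_widths[layer.name])
    return network
```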
2. The method of claim 1, wherein the model parameter quantization error corresponding to the network layer comprises a weight quantization error, and obtaining the model parameter quantization error corresponding to any one of the network layers comprises:
acquiring original output data of an upstream network layer of the network layer, and determining first output data according to the original output data and a first weight of the network layer, wherein the original output data is an output result that has not been processed by an activation layer;
quantizing the first weight, and performing an inverse quantization operation on the quantization result to obtain a second weight;
obtaining second output data of the upstream network layer after quantization, and determining third output data according to the second output data and the second weight;
determining the weight quantization error based on the first output data and the third output data.
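A minimal NumPy sketch of the weight quantization error in claim 2 follows; it is illustrative only. Symmetric uniform quantization, a matrix product in place of the layer's convolution, and mean squared error as the error measure are all assumptions, since the claim fixes neither the quantization scheme nor the error metric. Inputs are assumed to be NumPy arrays.

```python
import numpy as np

def fake_quant(x, bits=8):
    """Quantize and then inverse-quantize x with symmetric uniform
    quantization (an assumed scheme; the claim only requires a quantization
    step followed by an inverse quantization step)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

def weight_quant_error(raw_upstream_out, quant_upstream_out, first_weight, bits=8):
    """Weight quantization error of one layer, mean squared error assumed.

    raw_upstream_out:   original output of the upstream layer, not passed
                        through its activation layer ("original output data")
    quant_upstream_out: output of the upstream layer after that layer has
                        been quantized ("second output data")
    first_weight:       the layer's unquantized weight ("first weight")
    """
    first_output = raw_upstream_out @ first_weight        # float path
    second_weight = fake_quant(first_weight, bits)         # quantize + dequantize
    third_output = quant_upstream_out @ second_weight      # quantized path
    return float(np.mean((first_output - third_output) ** 2))
```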
3. The method of claim 2, wherein the model parameter quantization error corresponding to the network layer comprises an activation value quantization error, and obtaining the model parameter quantization error corresponding to any one of the network layers comprises:
processing the first output data through an activation layer to obtain a first activation value;
quantizing a result obtained by processing the third output data through the activation layer, and performing inverse quantization on the quantized result to obtain a second activation value;
determining the activation value quantization error from the first activation value and the second activation value.
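Continuing the claim-2 sketch, the activation value quantization error of claim 3 can be illustrated as follows; ReLU is assumed for the activation layer, mean squared error is again assumed as the error measure, and fake_quant is the hypothetical helper defined in the previous sketch.

```python
import numpy as np

def relu(x):
    # ReLU stands in for "the activation layer"; the claims do not name a
    # specific activation function.
    return np.maximum(x, 0.0)

def activation_quant_error(first_output, third_output, bits=8):
    """Activation value quantization error for one layer.

    first_output and third_output are the tensors produced in the claim-2
    sketch above; fake_quant is the helper defined there.
    """
    first_activation = relu(first_output)                      # claim 3, first activation value
    second_activation = fake_quant(relu(third_output), bits)   # quantize + inverse-quantize
    return float(np.mean((first_activation - second_activation) ** 2))
```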
4. The method of claim 2, wherein before the obtaining second output data of the upstream network layer after quantization and determining third output data according to the second output data and the second weight, the method further comprises:
performing quantization processing on the upstream network layer based on a preset quantization precision to obtain the quantized upstream network layer.
5. The method of claim 1, wherein the dividing each of the network layers into level groups corresponding to different target quantization bit widths according to the model parameter quantization error of each of the network layers comprises:
distributing each network layer to a level group corresponding to a different target quantization bit width based on the magnitude of the model parameter quantization error of each network layer;
determining the target quantization bit width of the level group corresponding to each network layer as the target quantization bit width of that network layer;
wherein the model parameter quantization error of a network layer in the level group with a higher target quantization bit width is larger than the model parameter quantization error of a network layer in the level group with a lower target quantization bit width.
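One possible reading of claim 5, again purely illustrative: sort the layers by their measured quantization error and give the larger-error layers the higher bit width, so that every error in the high-bit-width group exceeds every error in the low-bit-width group. The bit widths (8 and 4) and the default half-and-half split are arbitrary choices made for this sketch.

```python
def assign_bit_widths(layer_errors, high_bits=8, low_bits=4, num_high=None):
    """Assign each layer to a level group and return its target bit width.

    Layers with larger quantization error end up in the high-bit-width group,
    so the ordering property recited in claim 5 holds.
    """
    ordered = sorted(layer_errors, key=layer_errors.get, reverse=True)
    if num_high is None:
        num_high = len(ordered) // 2  # arbitrary default split for the sketch
    return {
        name: (high_bits if rank < num_high else low_bits)
        for rank, name in enumerate(ordered)
    }


# Hypothetical per-layer errors, e.g. as measured with the claim-2 sketch.
example = assign_bit_widths({"conv1": 0.12, "conv2": 0.03, "conv3": 0.45, "fc": 0.07})
# -> {'conv3': 8, 'conv1': 8, 'fc': 4, 'conv2': 4}
```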
6. The method of claim 5, wherein before quantizing the model parameters of the network layers in each of the level groups according to the corresponding target quantization bit width, the method further comprises:
cyclically performing the following process until the predicted post-quantization model scale corresponding to the network to be quantized meets a preset scale threshold:
determining the predicted post-quantization model scale based on the target quantization bit width corresponding to each network layer;
and in response to the predicted post-quantization model scale being larger than the preset scale threshold, reducing the number of network layers in the level group with the higher target quantization bit width, increasing the number of network layers in the level group with the lower target quantization bit width, and determining the target quantization bit width of each network layer according to the corresponding level group.
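The loop of claim 6 can be sketched as below. Measuring the predicted model scale as parameter count times bit width, and demoting the smallest-error layer from the high-bit-width group on each iteration, are assumptions; the claim only requires shrinking the high-bit-width group and growing the low-bit-width group until the preset scale threshold is met.

```python
def estimate_model_bits(param_counts, bit_widths):
    """Predicted post-quantization model scale, measured here in bits.

    param_counts: layer name -> number of parameters (assumed known);
    bit_widths:   layer name -> target quantization bit width.
    """
    return sum(param_counts[name] * bit_widths[name] for name in param_counts)


def shrink_until_within_budget(layer_errors, param_counts, bit_widths,
                               size_limit_bits, high_bits=8, low_bits=4):
    """Claim-6 style loop: while the predicted scale exceeds the threshold,
    move one layer from the high-bit-width group to the low-bit-width group."""
    while estimate_model_bits(param_counts, bit_widths) > size_limit_bits:
        high_group = [name for name, b in bit_widths.items() if b == high_bits]
        if not high_group:
            break  # nothing left to demote; the budget cannot be met this way
        # Demoting the smallest-error layer first is an assumption.
        victim = min(high_group, key=layer_errors.get)
        bit_widths[victim] = low_bits
    return bit_widths
```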
7. The method of claim 5, wherein before quantizing the model parameters of the network layers in each of the level groups according to the corresponding target quantization bit width, the method further comprises:
cyclically performing the following process until the predicted post-quantization model precision corresponding to the network to be quantized reaches a preset model precision threshold:
determining the predicted post-quantization model precision based on the target quantization bit width corresponding to each network layer;
and in response to the predicted post-quantization model precision being smaller than the preset model precision threshold, increasing the number of network layers in the level group with the higher target quantization bit width, reducing the number of network layers in the level group with the lower target quantization bit width, and determining the target quantization bit width of each network layer according to the corresponding level group.
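Claim 7 mirrors claim 6 in the opposite direction; a sketch under the same assumptions is given below, where evaluate_accuracy is a hypothetical callback that quantizes the network with the current per-layer bit widths and measures its precision on the calibration data set.

```python
def grow_until_accurate(layer_errors, bit_widths, evaluate_accuracy,
                        accuracy_threshold, high_bits=8, low_bits=4):
    """Claim-7 style loop, the mirror image of the claim-6 sketch.

    evaluate_accuracy(bit_widths) stands in for "determining the predicted
    post-quantization model precision"; it is assumed to return a float.
    """
    while evaluate_accuracy(bit_widths) < accuracy_threshold:
        low_group = [name for name, b in bit_widths.items() if b == low_bits]
        if not low_group:
            break  # every layer already uses the high bit width
        # Promoting the largest-error layer first is an assumption.
        victim = max(low_group, key=layer_errors.get)
        bit_widths[victim] = high_bits
    return bit_widths
```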
8. The method of claim 5, wherein the distributing each network layer to a level group corresponding to a different target quantization bit width based on the magnitude of the model parameter quantization error of each network layer comprises:
in response to a network-layer-number adjustment operation for each of the level groups, determining a target number of network layers corresponding to each of the level groups;
and distributing each network layer to the level groups corresponding to different target quantization bit widths according to the target number of network layers corresponding to each level group and based on the magnitude of the model parameter quantization error of each network layer.
9. An apparatus for quantizing a convolutional neural network, comprising:
the system comprises a quantization error acquisition module, a model parameter quantization error acquisition module and a quantization error analysis module, wherein the quantization error acquisition module is used for processing a standard data set by using a network to be quantized and acquiring the model parameter quantization errors corresponding to each network layer in the network to be quantized layer by layer;
the quantization bit width distribution module is used for dividing each network layer into level groups corresponding to different target quantization bit widths according to the model parameter quantization error of each network layer;
and the quantization processing module is used for performing quantization processing on the model parameters of the network layers in each level group according to the corresponding target quantization bit width.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1-8 via execution of the executable instructions.
11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 8.
CN202310103527.6A 2023-02-10 2023-02-10 Method and device for quantizing convolutional neural network, electronic device and storage medium Pending CN115983349A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310103527.6A CN115983349A (en) 2023-02-10 2023-02-10 Method and device for quantizing convolutional neural network, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310103527.6A CN115983349A (en) 2023-02-10 2023-02-10 Method and device for quantizing convolutional neural network, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN115983349A true CN115983349A (en) 2023-04-18

Family

ID=85970402

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310103527.6A Pending CN115983349A (en) 2023-02-10 2023-02-10 Method and device for quantizing convolutional neural network, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN115983349A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116167431A (en) * 2023-04-25 2023-05-26 之江实验室 Service processing method and device based on hybrid precision model acceleration

Similar Documents

Publication Publication Date Title
CN111986278B (en) Image encoding device, probability model generating device, and image compression system
CN110413812B (en) Neural network model training method and device, electronic equipment and storage medium
CN111641832A (en) Encoding method, decoding method, device, electronic device and storage medium
CN113436620B (en) Training method of voice recognition model, voice recognition method, device, medium and equipment
US20220329807A1 (en) Image compression method and apparatus thereof
CN111641826B (en) Method, device and system for encoding and decoding data
CN113327599B (en) Voice recognition method, device, medium and electronic equipment
CN114418121A (en) Model training method, object processing method and device, electronic device and medium
CN115983349A (en) Method and device for quantizing convolutional neural network, electronic device and storage medium
CN113225554B (en) Image coding and decoding method and device based on neural network, storage medium and terminal
CN109359727B (en) Method, device and equipment for determining structure of neural network and readable medium
CN114710667A (en) Rapid prediction method and device for CU partition in H.266/VVC screen content frame
CN116090543A (en) Model compression method and device, computer readable medium and electronic equipment
CN115936092A (en) Neural network model quantization method and device, storage medium and electronic device
CN112686365A (en) Method and device for operating neural network model and computer equipment
WO2021057926A1 (en) Method and apparatus for training neural network model
CN114139703A (en) Knowledge distillation method and device, storage medium and electronic equipment
CN111885378B (en) Multimedia data encoding method, apparatus, device and medium
CN115396672B (en) Bit stream storage method, device, electronic equipment and computer readable medium
CN114501031B (en) Compression coding and decompression method and device
US11863755B2 (en) Methods and apparatus to encode video with region of motion detection
CN117459727B (en) Image processing method, device and system, electronic equipment and storage medium
CN115329952B (en) Model compression method and device and readable storage medium
US20240087176A1 (en) Point cloud decoding method and apparatus, point cloud encoding method and apparatus, computer device, computer-readable storage medium
CN109803147B (en) Transformation processing method and device based on video texture features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination