CN114676825A - Neural network model quantization method, system, device and medium

Neural network model quantization method, system, device and medium

Info

Publication number
CN114676825A
Authority
CN
China
Prior art keywords
neural network
network model
precision
quantized
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210396647.5A
Other languages
Chinese (zh)
Inventor
王曦
蹇易
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yuncong Enterprise Development Co., Ltd.
Original Assignee
Shanghai Yuncong Enterprise Development Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yuncong Enterprise Development Co., Ltd.
Priority to CN202210396647.5A
Publication of CN114676825A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/08 — Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Facsimile Image Signal Circuits (AREA)

Abstract

The invention provides a neural network model quantization method, system, device and medium. The method first obtains an original neural network model to be quantized and a preset precision loss range; it then quantizes all network layers in the original neural network model and records the result as the quantized neural network model; a target picture is input into the original neural network model and the quantized neural network model respectively for recognition, and the precision loss of the quantized neural network model relative to the original neural network model is obtained; finally, the precision loss is compared with the preset precision loss range, and the quantized neural network model is either output or iteratively quantized according to the comparison result. The invention thus designs a set of quantization standards that automatically optimize and adjust the quantized model according to the preset precision loss range, and can return an optimal mixed-precision quantized model without manual intervention.

Description

Neural network model quantization method, system, device and medium
Technical Field
The invention relates to the field of computer technology, and in particular to a neural network model quantization method, system, device and medium.
Background
Neural network model quantization is an acceleration technique widely used at the deployment stage. Quantization represents a high-precision neural network model with a low-precision one, or a high-bit model with a low-bit one, so that the model occupies less memory and computes faster.
However, because the representation range of low-precision or low-bit formats is limited, quantization is usually accompanied by a performance loss, and in a deep neural network this quantization loss accumulates layer by layer, so that the recognition performance of the final model degrades sharply, sometimes to an unacceptable degree. The prior art has therefore begun to quantize with mixed precision: because some layers of the network can tolerate lower precision than others, quantizing only part of the network yields mixed-precision inference that improves speed while preserving accuracy. However, when quantizing with mixed precision, there is currently no uniform standard for which network layers should be quantized and which should remain in high-precision mode; moreover, current quantization mainly relies on manual tuning and testing, which is not only inconvenient but often fails to reach the desired precision.
Disclosure of Invention
In view of the above shortcomings of the prior art, it is an object of the present invention to provide a neural network model quantization method, system, device and medium that solve the problems the prior art encounters when quantizing a neural network model.
To achieve the above and other related objects, the present invention provides a neural network model quantization method, comprising:
acquiring an original neural network model to be quantized and a preset precision loss range;
quantizing all network layers in the original neural network model, and recording the result as the quantized neural network model;
respectively inputting a target picture into the original neural network model and the quantized neural network model for recognition, and acquiring the precision loss of the quantized neural network model relative to the original neural network model;
and comparing the precision loss with the preset precision loss range, and, according to the comparison result, either outputting the quantized neural network model or iteratively quantizing the quantized neural network model.
Optionally, the process of quantizing all network layers in the original neural network model includes:
acquiring the precision of each network layer in the original neural network model, and reducing the precision of each network layer;
or acquiring the bit width of each network layer's precision in the original neural network model, and reducing that bit width.
Optionally, reducing the precision of each network layer comprises at least one of: reducing single-precision floating-point precision to half-precision floating-point precision, reducing single-precision floating-point precision to eight-bit integer precision, and reducing half-precision floating-point precision to eight-bit integer precision.
Optionally, if the precision loss is within the preset precision loss range, outputting the quantized neural network model;
if the precision loss is not within the preset precision loss range, performing iterative quantization on the quantized neural network model until the precision loss corresponding to the new quantized neural network model is within the preset precision loss range, terminating the iterative quantization, and outputting the corresponding new quantized neural network model as a final quantized neural network model.
Optionally, the process of iteratively quantizing the quantized neural network model includes:
respectively inputting the target picture into the original neural network model and the quantized neural network model for recognition, and acquiring, for each network layer, the absolute value of the inference acceleration time difference between the quantized layer and the corresponding layer in the original neural network model, as well as the information loss of each quantized layer relative to its original counterpart;
calculating the information loss ratio of each network layer in the quantized neural network model from that layer's information loss and the absolute value of its inference acceleration time difference;
and iteratively quantizing the quantized neural network model according to the information loss ratio of each network layer.
Optionally, the process of iteratively quantizing the quantized neural network model according to the information loss ratio of each network layer includes:
sorting the information loss ratios of all network layers in the quantized neural network model;
acquiring the precision of the network layer with the largest sorted information loss ratio, and recording it as the precision to be quantized;
raising the precision to be quantized to obtain a new quantized neural network model;
respectively inputting the target picture into the original neural network model and the new quantized neural network model for recognition, acquiring the precision loss of the new quantized neural network model relative to the original neural network model, and recording it as the intermediate precision loss;
comparing the intermediate precision loss with the preset precision loss range;
if the intermediate precision loss is within the preset precision loss range, taking the current new quantized neural network model as the final quantized neural network model;
and if the intermediate precision loss is not within the preset precision loss range, successively raising the precision of the corresponding network layers according to the sorted information loss ratios and iterating the quantization, until the precision loss of the final quantized neural network model is within the preset precision loss range, at which point the iterative quantization terminates.
Optionally, the step of inputting the target picture into the original neural network model and the quantized neural network model respectively for recognition, and obtaining information loss of each network layer in the quantized neural network model relative to a corresponding network layer in the original neural network model includes:
inputting the target picture into each network layer of the original neural network model for recognition, and recording the resulting probability distribution as the first recognition result;
inputting the target picture into each network layer of the quantized neural network model for recognition, and recording the resulting probability distribution as the second recognition result;
calculating the distance between the first recognition result and the second recognition result, and determining the information loss of each network layer in the quantized neural network model relative to the corresponding layer in the original neural network model from the distance calculation result; wherein the distance comprises at least one of: cosine distance, Euclidean distance.
Optionally, the original neural network model to be quantized comprises at least one of: a neural network model with single-precision floating-point precision and a neural network model with half-precision floating-point precision.
The invention also provides a neural network model quantization system, comprising:
the acquisition module is used for acquiring an original neural network model to be quantized and a preset precision loss range;
the initial quantization module is used for quantizing all network layers in the original neural network model and recording the result as the quantized neural network model;
the precision loss module is used for respectively inputting the target picture into the original neural network model and the quantized neural network model for recognition, and acquiring the precision loss of the quantized neural network model relative to the original neural network model;
the comparison module is used for comparing the precision loss with the preset precision loss range;
the iterative quantization module is used for iteratively quantizing the quantized neural network model when the precision loss is not within the preset precision loss range;
and the output module is used for outputting the corresponding quantized neural network model when the precision loss is within the preset precision loss range.
The present invention also provides a computer apparatus comprising:
one or more processors; and
a computer-readable medium having instructions stored thereon, which when executed by the one or more processors, cause the apparatus to perform the method as in any one of the above.
The invention also provides a computer readable medium having stored thereon instructions which, when executed by one or more processors, cause an apparatus to perform a method as described in any one of the above.
As described above, the present invention provides a neural network model quantization method, system, device and medium with the following beneficial effects: an original neural network model to be quantized and a preset precision loss range are first obtained; all network layers in the original neural network model are then quantized, and the result is recorded as the quantized neural network model; the target picture is input into the original neural network model and the quantized neural network model respectively for recognition, and the precision loss of the quantized neural network model relative to the original neural network model is obtained; finally, the precision loss is compared with the preset precision loss range, and the quantized neural network model is either output or iteratively quantized according to the comparison result. The invention thus designs a set of quantization standards that automatically optimize and adjust the quantized model according to the preset precision loss range, and can return an optimal mixed-precision quantized model without manual intervention. The invention can automatically perform iterative quantization according to the set quantization parameters and adaptively search for a mixed quantization model that satisfies the conditions.
Drawings
FIG. 1 is a schematic flow chart illustrating a neural network model quantization method according to an embodiment;
FIG. 2 is a schematic flow chart illustrating a neural network model quantization method according to another embodiment;
FIG. 3 is a diagram illustrating a hardware structure of a neural network model quantization system according to an embodiment;
FIG. 4 is a schematic diagram of a hardware structure of a terminal device according to an embodiment;
FIG. 5 is a schematic diagram of a hardware structure of a terminal device according to another embodiment.
Description of the element reference numerals
1100 input device
1101 first processor
1102 output device
1103 first memory
1104 communication bus
1200 processing component
1201 second processor
1202 second memory
1203 communication component
1204 power supply component
1205 multimedia component
1206 audio component
1207 input/output interface
1208 sensor component
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments merely illustrate the basic idea of the present invention in a schematic way. The drawings show only the components related to the present invention, rather than the number, shape and size of components in an actual implementation; in practice the type, quantity and proportion of each component may vary freely, and the component layout may be more complicated.
Referring to fig. 1, the present embodiment provides a neural network model quantization method, including the following steps:
s100, obtaining an original neural network model to be quantized and a preset precision loss range. In this embodiment, the original neural network model to be quantized includes but is not limited to: the neural network model with single-precision floating point number precision and the neural network model with half-precision floating point number precision. As an example, the original neural network model to be quantified in the present embodiment may be a neural network model of fp32 type, for example.
S200, quantizing all network layers in the original neural network model to obtain a quantized neural network model, and marking as a quantized neural network model;
S300, respectively inputting the target picture into the original neural network model and the quantized neural network model for recognition, and acquiring the precision loss of the quantized neural network model relative to the original neural network model.
S400, comparing the precision loss with the preset precision loss range, and, according to the comparison result, either outputting the quantized neural network model or iteratively quantizing it. Specifically, if the precision loss is within the preset precision loss range, the quantized neural network model is output; if not, the quantized neural network model is iteratively quantized until the precision loss of the new quantized neural network model falls within the preset precision loss range, at which point iterative quantization terminates and the corresponding new quantized neural network model is output as the final quantized neural network model. As an example, the preset precision loss range may be 3% to 5%; if the recognition accuracy obtained when the target picture is input into the original neural network model is 89%, and the recognition accuracy obtained when it is input into the quantized neural network model is 83%, the precision loss of the quantized neural network model relative to the original one is 6%, which is not within the preset precision loss range.
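To make the flow concrete, here is a minimal Python sketch of steps S100 to S400 under stated assumptions: quantize_all_layers, evaluate_accuracy and iterative_quantize are hypothetical helper names standing in for the operations described above (they are not from the patent), and the preset precision loss range is simplified to a single upper bound max_loss.

```python
# Minimal sketch of the top-level flow (S100-S400); all helper names are
# hypothetical, standing in for the operations described in the text.
def quantize_model(original_model, val_images, val_labels, max_loss=0.05):
    quantized = quantize_all_layers(original_model)        # S200: quantize every layer
    acc_orig = evaluate_accuracy(original_model, val_images, val_labels)
    acc_quant = evaluate_accuracy(quantized, val_images, val_labels)
    loss = acc_orig - acc_quant                            # S300: precision loss
    if loss <= max_loss:                                   # S400: within the preset range?
        return quantized                                   # accept the quantized model
    return iterative_quantize(original_model, quantized,   # otherwise iterate (see below)
                              val_images, val_labels, max_loss)
```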
This embodiment thus designs a set of quantization standards that automatically optimize and adjust the quantized model according to the preset precision loss range; an optimal mixed-precision quantized model can be returned without manual intervention. In other words, this embodiment can automatically perform iterative quantization according to the set quantization parameters and adaptively search for a mixed quantization model that satisfies the conditions. In this embodiment, after the neural network model is quantized, a forward inference test is performed on the whole model, and its accuracy is compared with the accuracy before quantization to determine the precision loss. The verification picture is therefore input as the target picture into the current quantized neural network model and the original neural network model respectively, and the precision loss of the current quantized model is determined. If the precision loss satisfies the preset precision loss range, the current quantized neural network model is directly adopted as the final quantized neural network model; if not, the current quantized neural network model is iteratively quantized until a final quantized neural network model is obtained.
In accordance with the above, in an exemplary embodiment, the process of quantizing all network layers in the original neural network model comprises: acquiring the precision of each network layer in the original neural network model and reducing it; or acquiring the bit width of each network layer's precision and reducing that bit width. Ways of reducing the precision of each network layer include, but are not limited to: reducing single-precision floating-point precision to half-precision floating-point precision, reducing single-precision floating-point precision to eight-bit integer precision, and reducing half-precision floating-point precision to eight-bit integer precision. As an example, when quantizing all network layers in the original neural network model, this embodiment may first pre-quantize all network layers, where pre-quantization includes, but is not limited to, representing the previous single-precision floating-point numbers with half-precision floating-point numbers, 8-bit integers, or even fewer bits; that is, floating-point numbers previously represented as fp32 are represented as fp16, int8, or even lower-bit formats. Pre-quantizing all network layers in the original neural network model makes it convenient to calculate the speed gain before and after quantization, as well as the information loss ratio of each network layer before and after quantization.
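As an illustration of the precision-reduction step, the sketch below casts per-layer weights from fp32 to fp16, and to int8 with symmetric linear quantization. This is only a sketch assuming weights are held as NumPy arrays; a real deployment would rely on the quantization tooling of its inference framework.

```python
import numpy as np

def to_fp16(w: np.ndarray) -> np.ndarray:
    # fp32 -> fp16: halve the bit width of every weight
    return w.astype(np.float16)

def to_int8(w: np.ndarray):
    # fp32 -> int8 via symmetric linear quantization: map the largest
    # absolute weight to 127 and round the rest onto the int8 grid
    scale = float(np.abs(w).max()) / 127.0
    if scale == 0.0:
        scale = 1.0                 # all-zero layer: any scale works
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale                 # dequantize with q.astype(np.float32) * scale
```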
In accordance with the above, in an exemplary embodiment, the process of iteratively quantizing the quantized neural network model comprises: respectively inputting the target picture into the original neural network model and the quantized neural network model for recognition, and acquiring, for each network layer, the absolute value of the inference acceleration time difference between the quantized layer and the corresponding layer in the original model, as well as the information loss of each quantized layer relative to its original counterpart; calculating the information loss ratio of each network layer in the quantized neural network model from that layer's information loss and the absolute value of its inference acceleration time difference; and iteratively quantizing the quantized neural network model according to the information loss ratio of each network layer. In this embodiment, the information loss ratio of a network layer is its information loss divided by the absolute value of its inference acceleration time difference; its physical meaning is the degree of information lost per unit of time. The smaller the information loss ratio, the more cost-effective it is to quantize the corresponding network layer, and the more suitable that layer is for preferential quantization.
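The loss ratio just described reduces to a one-line helper. A sketch, assuming the per-layer KL divergence and per-layer inference times have been measured elsewhere (the names are illustrative):

```python
def information_loss_ratio(kl_loss: float, t_original: float, t_quantized: float) -> float:
    # information lost per unit of inference time saved; a smaller ratio
    # means quantizing this layer is more cost-effective
    accel = abs(t_original - t_quantized)    # |inference acceleration time difference|
    return kl_loss / max(accel, 1e-9)        # guard against a zero time difference
```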
Based on the above, in an exemplary embodiment, the specific process of inputting the target picture into the original neural network model and the quantized neural network model respectively for recognition, and acquiring the information loss of each network layer of the quantized model relative to the corresponding layer of the original model, includes: inputting the target picture into each network layer of the original neural network model for recognition, and recording the resulting probability distribution as the first recognition result; inputting the target picture into each network layer of the quantized neural network model for recognition, and recording the resulting probability distribution as the second recognition result; and calculating the distance between the first recognition result and the second recognition result, and determining the information loss of each network layer of the quantized model relative to the corresponding original layer from the distance calculation result; wherein the distance comprises at least one of: cosine distance, Euclidean distance. In this embodiment, a verification picture is input as the target picture, and the inference acceleration time and the information loss before and after quantization of each layer of the neural network model are collected in turn; the distance between the probability distributions of each network layer before and after quantization is then measured with the KL divergence, where a closer distance means a smaller information loss before and after quantization. Here the information loss is the relative entropy, i.e., the KL divergence. The KL divergence, or relative entropy, measures the distance between two probability distributions; if the two distributions are identical, the relative entropy is 0 and no information is lost.
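A minimal sketch of the per-layer information loss, assuming p and q are the distributions obtained from a layer's outputs before and after quantization (for example, normalized activation histograms; the helper name is illustrative):

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    # relative entropy D_KL(p || q); 0 when the two distributions are identical
    p = p / p.sum()                 # normalize to valid probability distributions
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```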
In an exemplary embodiment, the process of iteratively quantizing the quantized neural network model according to the information loss ratio of each network layer includes: sorting the information loss ratios of all network layers in the quantized neural network model; acquiring the precision of the network layer with the largest sorted information loss ratio and recording it as the precision to be quantized; raising that precision to obtain a new quantized neural network model; respectively inputting the target picture into the original neural network model and the new quantized neural network model for recognition, acquiring the precision loss of the new quantized model relative to the original model, and recording it as the intermediate precision loss; and comparing the intermediate precision loss with the preset precision loss range. If the intermediate precision loss is within the preset precision loss range, the current new quantized neural network model is taken as the final quantized neural network model; if not, the precision of the corresponding network layers is raised successively according to the sorted information loss ratios and quantization is iterated, until the precision loss of the final quantized neural network model is within the preset precision loss range and the iteration terminates. As noted in the embodiments above, the smaller a layer's information loss ratio, the more cost-effective it is to quantize and the more suitable it is for preferential quantization. In this embodiment, therefore, the information loss ratios are sorted from large to small and the network layers are quantized accordingly; if the precision loss of the quantized neural network model satisfies the preset precision loss range, the process exits directly. If the requirement is not met, the precision of the corresponding layers is raised in order of information loss ratio from large to small, i.e., the corresponding layers are successively restored from int8 to fp32, followed by another round of quantization and an inference test. If the precision loss after one round of iterative quantization still fails to meet the preset precision loss range, a second round follows, and so on, until the precision loss of the iteratively quantized model satisfies the preset range and the iteration terminates. In effect, part of the model's speed gain is traded back for precision until the final accuracy meets the preset requirement.
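The search loop itself can be sketched as follows, under stated assumptions: restore_layer, layer_loss_ratios and evaluate_accuracy are hypothetical helpers, and for brevity the ratios are computed once up front, whereas the embodiment above recomputes them after each restoration.

```python
def iterative_quantize(original_model, quantized, images, labels, max_loss):
    acc_orig = evaluate_accuracy(original_model, images, labels)
    ratios = layer_loss_ratios(original_model, quantized, images)    # {layer: ratio}
    # restore the least quantization-friendly layers (largest ratio) first
    for layer in sorted(ratios, key=ratios.get, reverse=True):
        quantized = restore_layer(quantized, original_model, layer)  # e.g. int8 -> fp32
        loss = acc_orig - evaluate_accuracy(quantized, images, labels)
        if loss <= max_loss:
            return quantized        # mixed-precision model within the preset range
    return quantized                # every layer restored to original precision
```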
As shown in FIG. 2, an embodiment provides a neural network model quantization method comprising the following steps:
step 1, preparation phase. And preparing a model to be quantized, and inputting acceptable loss precision, so that the subsequent process automation is facilitated.
And 2, pre-quantizing the model. In order to calculate the speed increase and the information loss ratio before and after quantization, all layers of the network need to be pre-quantized once, but the pre-quantized neural network model does not represent the final quantized neural network model.
And 3, reasoning and testing precision. After the quantization is completed, the embodiment needs to integrally perform a forward reasoning test, then compares the precision loss before quantization with the precision loss before quantization, and if the precision loss is acceptable, directly adopts the quantization model. If the loss of accuracy is not acceptable, step 4 is entered.
And 4, calculating the information loss ratio. And if the precision loss is not acceptable, entering a mixed precision quantization mode, and calculating the information loss ratio of each network layer in the quantized neural network model, wherein the network layer with the larger information loss ratio is less suitable for quantitative deployment. Therefore, the present embodiment chooses to restore the network layer corresponding to the maximum loss ratio to the original precision (e.g., to fp 32); and then judging whether the precision of the neural network model meets the requirement, if so, directly exiting iteration, and outputting the corresponding neural network model as a final quantitative neural network model. And if the precision does not meet the requirement, returning to the step 3 for iteration, if the precision of the neural network model after one iteration cannot meet the requirement, calculating the information loss ratio again, and restoring a network layer with the largest loss ratio to the original precision until the precision during the last iteration can meet the requirement, and stopping the iteration. The mixed precision quantization mode refers to a method for carrying out different quantitative reasoning on different network layers in model reasoning, namely only quantifying part of the network layers, and reasoning other network layers with original precision.
In summary, the present invention provides a neural network model quantization method that first acquires an original neural network model to be quantized and a preset precision loss range; then quantizes all network layers in the original neural network model and records the result as the quantized neural network model; respectively inputs the target picture into the original and quantized models for recognition and acquires the precision loss of the quantized model relative to the original; and finally compares the precision loss with the preset precision loss range, either outputting the quantized neural network model or iteratively quantizing it according to the comparison result. The method designs a set of quantization standards that automatically optimize and adjust the quantized model according to the preset precision loss range, and can return an optimal mixed-precision quantized model without manual intervention; it can automatically perform iterative quantization according to the set quantization parameters and adaptively search for a mixed quantization model that satisfies the conditions. Addressing the existing problems, the method designs a set of standards for adaptive model quantization based on information loss: the information loss before and after quantization is the difference in information entropy, which can be reflected by the relative entropy (KL divergence) and thus characterizes the difference between two probability distributions. The method also introduces the concept of the loss ratio, which integrates each layer's quantization loss with its acceleration time before and after quantization, giving a more complete analysis of the influence of quantizing a single layer; with the loss ratio as the measurement standard, the method implements a scheme that automatically optimizes and adjusts the quantized model according to the set precision, returning an optimal mixed-precision quantized model without manual intervention. The method measures information loss from the probability distributions themselves; since mainstream quantization algorithms, such as uniform quantization and exponential quantization, are designed on the basis of probability distributions, measuring the distance before and after quantization with the KL divergence is well founded. The loss ratio combines information loss with acceleration time when deciding whether a single layer should be quantized, favoring network structures whose acceleration effect is especially pronounced. Meanwhile, the method can iteratively compute the loss ratio automatically according to the set parameters and adaptively search out a mixed quantization model that satisfies the conditions. In addition, by measuring the performance difference before and after quantization with the more reasonable information loss ratio, the method ensures that mixed-precision quantization delivers inference acceleration while keeping the precision loss as small as possible.
The method is adaptive: it automatically searches out a mixed quantization scheme that satisfies the conditions based on the information loss ratio, saving labor cost. The information loss ratio approach synthesizes current mainstream quantization algorithms by measuring the distance between the probability distributions before and after quantization in terms of feature-distribution consistency, which is currently the most reasonable way to measure it, while also accounting for the gain brought by the quantization speed. In addition, the mixed-precision quantization algorithm adopted in the method is entirely offline and requires only a trained model. Online training would instead require inserting a corresponding pseudo-quantization algorithm into training and a deep understanding of how the model is trained, consuming far more time and effort.
As shown in FIG. 3, the present invention further provides a neural network model quantization system, comprising:
the acquisition module M10 is used for acquiring an original neural network model to be quantized and a preset precision loss range; in this embodiment, the original neural network model to be quantized includes but is not limited to: the neural network model with single-precision floating point number precision and the neural network model with half-precision floating point number precision. As an example, the original neural network model to be quantized in the present embodiment may be a fp32 type neural network model, for example.
An initial quantization module M20, configured to quantize all network layers in the original neural network model and record the result as the quantized neural network model;
the precision loss module M30 is used for respectively inputting the target picture into the original neural network model and the quantized neural network model for identification, and obtaining the precision loss of the quantized neural network model relative to the original neural network model;
a comparison module M40, configured to compare the precision loss with the preset precision loss range;
the iterative quantization module M50 is configured to, when the precision loss is not within the preset precision loss range, perform iterative quantization on the quantized neural network model; if the precision loss in this embodiment is not within the preset precision loss range, performing iterative quantization on the quantized neural network model until the precision loss corresponding to the new quantized neural network model is within the preset precision loss range, terminating iterative quantization, and outputting the corresponding new quantized neural network model as a final quantized neural network model. As an example, the precision loss range preset by the embodiment may be 3% to 5%; if the recognition accuracy obtained when the target picture is input into the original neural network model is 89%, and the recognition accuracy obtained when the target picture is input into the quantized neural network model is 83%, the accuracy loss of the quantized neural network model relative to the original neural network model is 6%, and at this time, the corresponding accuracy loss is not within the preset accuracy loss range.
And the output module M60 is used for outputting the corresponding quantized neural network model when the precision loss is within the preset precision loss range. And if the precision loss in the embodiment is within the preset precision loss range, outputting the quantized neural network model. As an example, the precision loss range preset by the embodiment may be 3% to 5%; if the recognition accuracy obtained when the target picture is input into the original neural network model is 91%, and the recognition accuracy obtained when the target picture is input into the quantized neural network model is 87%, the accuracy loss of the quantized neural network model relative to the original neural network model is 4%, and the corresponding accuracy loss is within the preset accuracy loss range.
This embodiment thus designs a set of quantization standards that automatically optimize and adjust the quantized model according to the preset precision loss range; an optimal mixed-precision quantized model can be returned without manual intervention. In other words, this embodiment can automatically perform iterative quantization according to the set quantization parameters and adaptively search for a mixed quantization model that satisfies the conditions. In this embodiment, after the neural network model is quantized, a forward inference test is performed on the whole model, and its accuracy is compared with the accuracy before quantization to determine the precision loss. The verification picture is therefore input as the target picture into the current quantized neural network model and the original neural network model respectively, and the precision loss of the current quantized model is determined. If the precision loss satisfies the preset precision loss range, the current quantized neural network model is directly adopted as the final quantized neural network model; if not, the current quantized neural network model is iteratively quantized until a final quantized neural network model is obtained.
In accordance with the above, in an exemplary embodiment, the process of quantizing all network layers in the original neural network model comprises: acquiring the precision of each network layer in the original neural network model and reducing it; or acquiring the bit width of each network layer's precision and reducing that bit width. Ways of reducing the precision of each network layer include, but are not limited to: reducing single-precision floating-point precision to half-precision floating-point precision, reducing single-precision floating-point precision to eight-bit integer precision, and reducing half-precision floating-point precision to eight-bit integer precision. As an example, when quantizing all network layers in the original neural network model, this embodiment may first pre-quantize all network layers, where pre-quantization includes, but is not limited to, representing the previous single-precision floating-point numbers with half-precision floating-point numbers, 8-bit integers, or even fewer bits; that is, floating-point numbers previously represented as fp32 are represented as fp16, int8, or even lower-bit formats. Pre-quantizing all network layers in the original neural network model makes it convenient to calculate the speed gain before and after quantization, as well as the information loss ratio of each network layer before and after quantization.
In accordance with the above, in an exemplary embodiment, the process of iteratively quantizing the quantized neural network model comprises: respectively inputting the target picture into the original neural network model and the quantized neural network model for recognition, and acquiring, for each network layer, the absolute value of the inference acceleration time difference between the quantized layer and the corresponding layer in the original model, as well as the information loss of each quantized layer relative to its original counterpart; calculating the information loss ratio of each network layer in the quantized neural network model from that layer's information loss and the absolute value of its inference acceleration time difference; and iteratively quantizing the quantized neural network model according to the information loss ratio of each network layer. In this embodiment, the information loss ratio of a network layer is its information loss divided by the absolute value of its inference acceleration time difference; its physical meaning is the degree of information lost per unit of time. The smaller the information loss ratio, the more cost-effective it is to quantize the corresponding network layer, and the more suitable that layer is for preferential quantization.
Based on the above, in an exemplary embodiment, the specific process of inputting the target picture into the original neural network model and the quantized neural network model respectively for recognition, and acquiring the information loss of each network layer of the quantized model relative to the corresponding layer of the original model, includes: inputting the target picture into each network layer of the original neural network model for recognition, and recording the resulting probability distribution as the first recognition result; inputting the target picture into each network layer of the quantized neural network model for recognition, and recording the resulting probability distribution as the second recognition result; and calculating the distance between the first recognition result and the second recognition result, and determining the information loss of each network layer of the quantized model relative to the corresponding original layer from the distance calculation result; wherein the distance comprises at least one of: cosine distance, Euclidean distance. In this embodiment, a verification picture is input as the target picture, and the inference acceleration time and the information loss before and after quantization of each layer of the neural network model are collected in turn; the distance between the probability distributions of each network layer before and after quantization is then measured with the KL divergence, where a closer distance means a smaller information loss before and after quantization. Here the information loss is the relative entropy, i.e., the KL divergence. The KL divergence, or relative entropy, measures the distance between two probability distributions; if the two distributions are identical, the relative entropy is 0 and no information is lost.
In an exemplary embodiment, the process of iteratively quantizing the quantized neural network model according to the information loss ratio of each network layer includes: sorting the information loss ratios of all network layers in the quantized neural network model; acquiring the precision of the network layer with the largest sorted information loss ratio and recording it as the precision to be quantized; raising that precision to obtain a new quantized neural network model; respectively inputting the target picture into the original neural network model and the new quantized neural network model for recognition, acquiring the precision loss of the new quantized model relative to the original model, and recording it as the intermediate precision loss; and comparing the intermediate precision loss with the preset precision loss range. If the intermediate precision loss is within the preset precision loss range, the current new quantized neural network model is taken as the final quantized neural network model; if not, the precision of the corresponding network layers is raised successively according to the sorted information loss ratios and quantization is iterated, until the precision loss of the final quantized neural network model is within the preset precision loss range and the iteration terminates. As noted above, the smaller a layer's information loss ratio, the more cost-effective it is to quantize and the more suitable it is for preferential quantization. In this embodiment, therefore, the information loss ratios are sorted from large to small and the network layers are quantized accordingly; if the precision loss of the quantized neural network model satisfies the preset precision loss range, the process exits directly. If the requirement is not met, the precision of the corresponding layers is raised in order of information loss ratio from large to small, i.e., the corresponding layers are successively restored from int8 to fp32, followed by another round of quantization and an inference test. If the precision loss after one round of iterative quantization still fails to meet the preset precision loss range, a second round follows, and so on, until the precision loss of the iteratively quantized model satisfies the preset range and the iteration terminates. In effect, part of the model's speed gain is traded back for precision until the final accuracy meets the preset requirement.
In an embodiment, a neural network model quantization system is provided, configured to perform the following steps:
Step 1: preparation. Prepare the model to be quantized and input the acceptable precision loss, which lets the subsequent process run automatically.
Step 2: pre-quantize the model. To calculate the speed gain and the information loss ratio before and after quantization, all layers of the network are pre-quantized once; the pre-quantized neural network model is not necessarily the final quantized neural network model.
Step 3: inference and precision testing. After quantization is completed, a forward inference test is performed on the whole model, and the precision after quantization is compared with the precision before quantization; if the precision loss is acceptable, the quantized model is adopted directly. If the precision loss is not acceptable, proceed to step 4.
Step 4: calculate the information loss ratio. If the precision loss is not acceptable, enter the mixed-precision quantization mode and calculate the information loss ratio of each network layer in the quantized neural network model; a layer with a larger information loss ratio is less suitable for quantized deployment. This embodiment therefore restores the network layer with the largest loss ratio to its original precision (e.g., to fp32), and then checks whether the precision of the neural network model meets the requirement; if so, the iteration exits directly and the corresponding neural network model is output as the final quantized neural network model. If the precision does not meet the requirement, return to step 3 and iterate: the information loss ratio is calculated again and the layer with the largest loss ratio is restored to original precision, until the precision at the last iteration meets the requirement and the iteration stops. The mixed-precision quantization mode refers to applying different quantized inference to different network layers during model inference, i.e., quantizing only part of the network layers while running the remaining layers at original precision.
In another embodiment, a neural network model quantization system is further provided, comprising:
A loss ratio module, used to measure the information loss effect of a single network layer, where a lower information loss ratio is better. The ratio is calculated as the information loss before and after quantization divided by the acceleration time before and after quantization, where the information loss is the relative entropy, i.e., the KL divergence, which measures the distance between the probability distributions before and after quantization; the physical meaning of the information loss ratio is the degree of information lost per unit time.
An adaptive quantization module, which automatically calculates the information loss ratio of each network layer according to the set minimum acceptable precision loss and the quantized model, successively sets the layers with larger loss ratios back to fp32 precision mode, runs a forward inference precision test after each setting, and then completes mixed-precision quantization automatically according to whether the resulting precision meets the precision requirement.
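Written out in notation that is assumed here rather than taken from the original text, the loss ratio computed by the loss ratio module for layer $i$ is

$$ r_i = \frac{D_{\mathrm{KL}}\left(P_i \,\middle\|\, Q_i\right)}{\left| t_i^{\mathrm{orig}} - t_i^{\mathrm{quant}} \right|}, \qquad D_{\mathrm{KL}}(P \,\|\, Q) = \sum_k P(k) \log \frac{P(k)}{Q(k)}, $$

where $P_i$ and $Q_i$ are the probability distributions of layer $i$'s outputs before and after quantization, and $t_i^{\mathrm{orig}}$, $t_i^{\mathrm{quant}}$ are its inference times before and after quantization. The ratio is 0 when the two distributions coincide, i.e., no information is lost per unit of time saved.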
In summary, the present invention provides a neural network model quantization system that first acquires an original neural network model to be quantized and a preset precision loss range; then quantizes all network layers in the original neural network model and records the result as the quantized neural network model; respectively inputs the target picture into the original and quantized models for recognition and acquires the precision loss of the quantized model relative to the original; and finally compares the precision loss with the preset precision loss range, either outputting the quantized neural network model or iteratively quantizing it according to the comparison result. The system designs a set of quantization standards that automatically optimize and adjust the quantized model according to the preset precision loss range, and can return an optimal mixed-precision quantized model without manual intervention; it can automatically perform iterative quantization according to the set quantization parameters and adaptively search for a mixed quantization model that satisfies the conditions. Addressing the existing problems, the system designs a set of standards for adaptive model quantization based on information loss: the information loss before and after quantization is the difference in information entropy, which can be reflected by the relative entropy (KL divergence) and thus characterizes the difference between two probability distributions. The system also introduces the concept of the loss ratio, which integrates each layer's quantization loss with its acceleration time before and after quantization, giving a more complete analysis of the influence of quantizing a single layer; with the loss ratio as the measurement standard, the system implements a scheme that automatically optimizes and adjusts the quantized model according to the set precision, returning an optimal mixed-precision quantized model without manual intervention. The system measures information loss from the probability distributions themselves; since mainstream quantization algorithms, such as uniform quantization and exponential quantization, are designed on the basis of probability distributions, measuring the distance before and after quantization with the KL divergence is well founded. The loss ratio combines information loss with acceleration time when deciding whether a single layer should be quantized, favoring network structures whose acceleration effect is especially pronounced. Meanwhile, the system can iteratively compute the loss ratio automatically according to the set parameters and adaptively search out a mixed quantization model that satisfies the conditions. In addition, by measuring the performance difference before and after quantization with the more reasonable information loss ratio, the system ensures that mixed-precision quantization delivers inference acceleration while keeping the precision loss as small as possible.
The system adopts an adaptive method that automatically searches out a hybrid quantization scheme meeting the conditions based on the information loss ratio, thereby saving labor cost. The information loss ratio is also a measure that unifies the current mainstream quantization algorithms: it measures the distance between the probability distributions before and after quantization in terms of feature-distribution consistency, which is currently the most reasonable measurement approach, while also accounting for the gain brought by the quantization speedup. In addition, the mixed-precision quantization algorithm adopted in the system works entirely offline and requires only a trained model. If online training were adopted instead, the corresponding pseudo-quantization operations would have to be inserted into training, and the training procedure of the model would have to be understood in depth, consuming considerably more time and effort.
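A compact sketch of how such an offline, per-layer information loss might be computed from captured layer outputs follows; the histogram-based setup and function names are illustrative assumptions, not the disclosed procedure:

```python
# Sketch: per-layer KL-divergence information loss computed offline from
# layer outputs captured as numpy arrays (illustrative assumption).
import numpy as np


def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Relative entropy D_KL(P || Q) between two normalized histograms."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))


def information_loss(fp32_out: np.ndarray, quant_out: np.ndarray,
                     bins: int = 128) -> float:
    """Histogram both layer outputs over a shared range, then compare."""
    lo = float(min(fp32_out.min(), quant_out.min()))
    hi = float(max(fp32_out.max(), quant_out.max()))
    p, _ = np.histogram(fp32_out, bins=bins, range=(lo, hi))
    q, _ = np.histogram(quant_out, bins=bins, range=(lo, hi))
    return kl_divergence(p.astype(np.float64), q.astype(np.float64))
```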
An embodiment of the present application further provides a computer device, where the computer device may include: one or more processors; and one or more machine readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method of fig. 1. In practical applications, the device may be used as a terminal device, and may also be used as a server, where examples of the terminal device may include: the mobile terminal includes a smart phone, a tablet computer, an electronic book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, a vehicle-mounted computer, a desktop computer, a set-top box, an intelligent television, a wearable device, and the like.
The present embodiment also provides a non-volatile readable storage medium, in which one or more modules (programs) are stored; when the one or more modules are applied to a device, the device can be caused to execute the instructions included in the data processing method of fig. 1 according to the present embodiment.
Fig. 4 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present application. As shown, the terminal device may include: an input device 1100, a first processor 1101, an output device 1102, a first memory 1103, and at least one communication bus 1104. The communication bus 1104 is used to implement communication connections between the elements. The first memory 1103 may include a high-speed RAM and may also include a non-volatile memory (NVM), such as at least one disk memory; the first memory 1103 may store various programs for performing various processing functions and implementing the method steps of the present embodiment.
Optionally, the first processor 1101 may be, for example, a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or another electronic component, and the first processor 1101 is coupled to the input device 1100 and the output device 1102 through a wired or wireless connection.
Optionally, the input device 1100 may include a variety of input devices, such as at least one of a user-oriented user interface, a device-oriented device interface, a software-programmable interface, a camera, and a sensor. Optionally, the device-oriented device interface may be a wired interface for data transmission between devices, or a hardware plug-in interface (e.g., a USB interface, a serial port, etc.) for data transmission between devices. Optionally, the user-oriented user interface may be, for example, user-oriented control keys, a voice input device for receiving voice input, or a touch sensing device (e.g., a touch screen with a touch sensing function, a touch pad, etc.) for receiving user touch input. Optionally, the programmable interface of the software may be, for example, an entry for a user to edit or modify a program, such as an input pin interface or an input interface of a chip. The output device 1102 may include output devices such as a display and a speaker.
In this embodiment, the processor of the terminal device includes functions for executing each module of the neural network model quantification system in each device; specific functions and technical effects may refer to the above embodiments and are not described herein again.
Fig. 5 is a schematic hardware structure diagram of a terminal device according to another embodiment of the present application. Fig. 5 is a specific embodiment of the implementation process of fig. 4. As shown, the terminal device of the present embodiment may include a second processor 1201 and a second memory 1202.
The second processor 1201 executes the computer program code stored in the second memory 1202 to implement the method described in fig. 1 in the above embodiment.
The second memory 1202 is configured to store various types of data to support operation of the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, such as messages, pictures, and videos. The second memory 1202 may include a Random Access Memory (RAM) and may also include a non-volatile memory, such as at least one disk memory.
Optionally, a second processor 1201 is provided in the processing assembly 1200. The terminal device may further include: communication components 1203, power components 1204, multimedia components 1205, audio components 1206, input/output interfaces 1207, and/or sensor components 1208. The specific components included in the terminal device are set according to actual requirements, which is not limited in this embodiment.
The processing component 1200 generally controls the overall operation of the terminal device. The processing assembly 1200 may include one or more second processors 1201 to execute instructions to perform all or part of the steps of the method illustrated in fig. 1 described above. Further, the processing component 1200 can include one or more modules that facilitate interaction between the processing component 1200 and other components. For example, the processing component 1200 can include a multimedia module to facilitate interaction between the multimedia component 1205 and the processing component 1200.
The power supply component 1204 provides power to the various components of the terminal device. The power supply component 1204 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal device.
The multimedia components 1205 include a display screen that provides an output interface between the terminal device and the user. In some embodiments, the display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The audio component 1206 is configured to output and/or input speech signals. For example, the audio component 1206 includes a Microphone (MIC) configured to receive external voice signals when the terminal device is in an operational mode, such as a voice recognition mode. The received speech signal may further be stored in the second memory 1202 or transmitted via the communication component 1203. In some embodiments, audio component 1206 also includes a speaker for outputting voice signals.
The input/output interface 1207 provides an interface between the processing component 1200 and peripheral interface modules, which may be click wheels, buttons, etc. These buttons may include, but are not limited to: a volume button, a start button, and a lock button.
The sensor component 1208 includes one or more sensors for providing various aspects of status assessment for the terminal device. For example, the sensor component 1208 may detect an open/closed state of the terminal device, relative positioning of the components, presence or absence of user contact with the terminal device. The sensor assembly 1208 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor assembly 1208 may also include a camera or the like.
The communication component 1203 is configured to facilitate communications between the terminal device and other devices in a wired or wireless manner. The terminal device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot therein for inserting a SIM card therein, so that the terminal device may log onto a GPRS network to establish communication with the server via the internet.
As can be seen from the above, the communication component 1203, the audio component 1206, the input/output interface 1207 and the sensor component 1208 in the embodiment of fig. 5 may be implemented as the input device in the embodiment of fig. 4.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical concepts disclosed by the present invention shall still be covered by the claims of the present invention.
It should be understood that although the terms first, second, third, etc. may be used to describe the preset accuracy loss ranges, etc. in the embodiments of the present invention, these preset accuracy loss ranges should not be limited to these terms. These terms are only used to distinguish preset loss of precision ranges from each other. For example, the first preset precision loss range may also be referred to as a second preset precision loss range, and similarly, the second preset precision loss range may also be referred to as a first preset precision loss range, without departing from the scope of embodiments of the present invention.

Claims (11)

1. A method for quantifying a neural network model, the method comprising the steps of:
acquiring an original neural network model to be quantized and a preset precision loss range;
quantizing all network layers in the original neural network model to obtain a quantized model, and recording the quantized model as a quantized neural network model;
respectively inputting a target picture into the original neural network model and the quantized neural network model for recognition, and acquiring the precision loss of the quantized neural network model relative to the original neural network model;
and comparing the precision loss with the preset precision loss range, and outputting the quantized neural network model according to a comparison result, or performing iterative quantization on the quantized neural network model according to the comparison result.
2. The neural network model quantization method of claim 1, wherein the process of quantizing all network layers in the original neural network model comprises:
acquiring the precision of each network layer in the original neural network model, and reducing the precision of each network layer;
or obtaining the bit of the precision of each network layer in the original neural network model, and reducing the bit of the precision of each network layer.
3. The neural network model quantification method of claim 2, wherein the manner of reducing the accuracy of each network layer comprises at least one of: reducing single-precision floating point precision to half-precision floating point precision, reducing single-precision floating point precision to eight-bit integer precision, and reducing half-precision floating point precision to eight-bit integer precision.
4. The neural network model quantization method of claim 1, wherein if the precision loss is within the preset precision loss range, outputting the quantized neural network model;
and if the precision loss is not within the preset precision loss range, carrying out iterative quantization on the quantized neural network model until the precision loss corresponding to the new quantized neural network model is within the preset precision loss range, terminating the iterative quantization, and outputting the corresponding new quantized neural network model as a final quantized neural network model.
5. The neural network model quantization method of any one of claims 1 to 4, wherein the process of iteratively quantizing the quantized neural network model comprises:
respectively inputting the target picture into the original neural network model and the quantized neural network model for recognition, and acquiring the absolute value of the inference acceleration time difference between each network layer in the quantized neural network model and the corresponding network layer in the original neural network model, and the information loss of each network layer in the quantized neural network model relative to the corresponding network layer in the original neural network model;
calculating the information loss ratio of each network layer in the quantized neural network model according to the absolute value of the inference acceleration time difference and the information loss of each network layer in the quantized neural network model;
and carrying out iterative quantization on the quantized neural network model according to the information loss ratio of each network layer.
6. The neural network model quantization method of claim 5, wherein the process of iteratively quantizing the quantized neural network model according to the information loss ratio of each network layer comprises:
sorting the information loss ratios of all network layers in the quantized neural network model;
obtaining the precision of the network layer corresponding to the maximum information loss ratio after sorting, and recording it as the precision to be quantized;
increasing the precision to be quantized to obtain a new quantized neural network model;
respectively inputting the target picture into the original neural network model and the new quantized neural network model for recognition, acquiring the precision loss of the new quantized neural network model relative to the original neural network model, and recording it as an intermediate precision loss;
comparing the intermediate precision loss with the preset precision loss range;
if the intermediate precision loss is within the preset precision loss range, taking the current new quantized neural network model as the final quantized neural network model;
and if the intermediate precision loss is not within the preset precision loss range, successively increasing the precision of the corresponding network layers according to the sorted information loss ratios and performing iterative quantization, until the precision loss corresponding to the final quantized neural network model is within the preset precision loss range, at which point the iterative quantization is terminated.
7. The neural network model quantization method of claim 5, wherein the process of inputting a target picture into the original neural network model and the quantized neural network model for recognition, and obtaining information loss of each network layer in the quantized neural network model relative to a corresponding network layer in the original neural network model comprises:
inputting the target picture into each network layer in the original neural network model for identification, and recording an obtained probability distribution result as a first identification result;
inputting the target picture into each network layer in the quantized neural network model for identification, and recording the obtained probability distribution result as a second identification result;
calculating the distance between the first identification result and the second identification result, and determining the information loss of each network layer in the quantized neural network model relative to the corresponding network layer in the original neural network model according to the distance calculation result; wherein the distance comprises at least one of: cosine distance, Euclidean distance.
8. The neural network model quantization method of claim 1, wherein the original neural network model to be quantized comprises at least one of: the neural network model with single-precision floating point number precision and the neural network model with half-precision floating point number precision.
9. A neural network model quantification system, comprising:
the acquisition module is used for acquiring an original neural network model to be quantized and a preset precision loss range;
the initial quantization module is used for quantizing all network layers in the original neural network model to obtain a quantized model, which is recorded as a quantized neural network model;
the precision loss module is used for respectively inputting the target picture into the original neural network model and the quantized neural network model for identification, and obtaining the precision loss of the quantized neural network model relative to the original neural network model;
the comparison module is used for comparing the precision loss with the preset precision loss range;
the iterative quantization module is used for performing iterative quantization on the quantized neural network model when the precision loss is not in the preset precision loss range;
and the output module is used for outputting the corresponding quantized neural network model when the precision loss is within the preset precision loss range.
10. A computer device, comprising:
one or more processors; and
a computer-readable medium having instructions stored thereon, which when executed by the one or more processors, cause the apparatus to perform the method of any of claims 1-8.
11. A computer-readable medium having stored thereon instructions that, when executed by one or more processors, cause an apparatus to perform the method of any one of claims 1-8.
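For illustration only, the distance measures recited in claim 7 could be realized as follows; this is a sketch under the assumption that the first and second identification results are available as numeric vectors, and no function here is part of the claims:

```python
# Illustrative sketch of the claim-7 distance measures; the vector inputs
# and function names are assumptions, not part of the claimed system.
import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity between two probability-distribution results."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance between the two results."""
    return float(np.linalg.norm(a - b))


def layer_information_loss(first_result, second_result, metric: str = "cosine") -> float:
    """Distance between the original-model and quantized-model outputs of
    one network layer, used here as that layer's information loss."""
    a, b = np.asarray(first_result, float), np.asarray(second_result, float)
    return cosine_distance(a, b) if metric == "cosine" else euclidean_distance(a, b)
```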
CN202210396647.5A 2022-04-15 2022-04-15 Neural network model quantification method, system, device and medium Pending CN114676825A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210396647.5A CN114676825A (en) 2022-04-15 2022-04-15 Neural network model quantification method, system, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210396647.5A CN114676825A (en) 2022-04-15 2022-04-15 Neural network model quantification method, system, device and medium

Publications (1)

Publication Number Publication Date
CN114676825A true CN114676825A (en) 2022-06-28

Family

ID=82078350

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210396647.5A Pending CN114676825A (en) 2022-04-15 2022-04-15 Neural network model quantification method, system, device and medium

Country Status (1)

Country Link
CN (1) CN114676825A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116911350A (en) * 2023-09-12 2023-10-20 苏州浪潮智能科技有限公司 Quantification method based on graph neural network model, task processing method and task processing device
CN116911350B (en) * 2023-09-12 2024-01-09 苏州浪潮智能科技有限公司 Quantification method based on graph neural network model, task processing method and task processing device
CN117114075A (en) * 2023-10-19 2023-11-24 湖南苏科智能科技有限公司 Neural network model quantization method, device, equipment and medium
CN117114075B (en) * 2023-10-19 2024-01-26 湖南苏科智能科技有限公司 Neural network model quantization method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN109840589B (en) Method and device for operating convolutional neural network on FPGA
CN114676825A (en) Neural network model quantification method, system, device and medium
CN111144511B (en) Image processing method, system, medium and electronic terminal based on neural network
CN109147826B (en) Music emotion recognition method and device, computer equipment and computer storage medium
WO2021135715A1 (en) Image compression method and apparatus
CN109766476B (en) Video content emotion analysis method and device, computer equipment and storage medium
CN113705136A (en) Integrated circuit automation logic synthesis system, method, device and medium
US20190180734A1 (en) Keyword confirmation method and apparatus
CN110542474A (en) Method, system, medium, and apparatus for detecting vibration signal of device
CN115456169A (en) Model compression method, system, terminal and storage medium
CN112672405B (en) Power consumption calculation method, device, storage medium, electronic equipment and server
CN117422182A (en) Data prediction method, device and storage medium
CN112906348A (en) Method, system, device and medium for automatically adding punctuation marks to text
CN112287950A (en) Feature extraction module compression method, image processing method, device and medium
CN111798263A (en) Transaction trend prediction method and device
CN111311393A (en) Credit risk assessment method, device, server and storage medium
CN115457975A (en) Method and device for detecting baby crying and coughing, storage medium and terminal equipment
CN113051425B (en) Method for acquiring audio characterization extraction model and method for recommending audio
CN115455891A (en) Method and system for predicting parasitic parameters in simulation process before integrated circuit simulation
CN110929767B (en) Font processing method, system, device and medium
CN111710011B (en) Cartoon generation method and system, electronic device and medium
CN112908307A (en) Audio feature extraction method, system, device and medium
CN114970357A (en) Energy-saving effect evaluation method, system, device and storage medium
CN115249058A (en) Quantification method and device of neural network model, terminal and storage medium
CN114757348A (en) Model quantitative training method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination