CN112580805A

CN112580805A - Method and device for quantizing neural network model

Info

Publication number: CN112580805A
Application number: CN202011564315.0A
Authority: CN
Inventors: 张真; 庞嘉丽; 孙刚; 陈琳
Original assignee: Samsung China Semiconductor Co Ltd; Samsung Electronics Co Ltd
Current assignee: Samsung China Semiconductor Co Ltd; Samsung Electronics Co Ltd
Priority date: 2020-12-25
Filing date: 2020-12-25
Publication date: 2021-03-30
Also published as: KR20220092776A

Abstract

A method and apparatus for quantizing a neural network model are disclosed. The quantization method comprises the following steps: acquiring a neural network model; calculating a quantization parameter corresponding to an operator to be quantized of the neural network model based on a binary approximation method; and quantizing the operator to be quantized of the neural network model based on the quantization parameter to obtain the neural network model with the quantized operator to be quantized.

Description

Method and device for quantizing neural network model

Technical Field

The present invention relates to the field of artificial intelligence, and more particularly, to a method and an apparatus for quantizing a neural network model.

Background

With the wide application of neural network models, the complexity of the original neural network model is higher and higher, and the original neural network model cannot be operated on a plurality of devices with limited memory capacity, so that the research on the deep learning quantization method has become a current research hotspot. For a high-precision neural network model, the proportion of parameters occupied by an operator in the model in the total quantity of parameters of the original model is higher, and the use frequency of the operator is higher. Therefore, all the high-frequency operators are quantized into integer types from the original floating-point number types and then are subjected to subsequent operation, the memory occupancy rate and the operation rate of the original neural network model are greatly improved on the premise of less precision loss, and the size of the original model is compressed.

The research goals of neural network model quantification are: a method is found, the memory space occupied by the original deep learning model can be greatly compressed under the condition that the loss of the prediction accuracy of the original model is minimum, and the operation rate of the original model is remarkably improved. The input of the method in the field is an original high-precision floating-point type depth model, and the output is a quantized model quantized into a low-precision integer. The method has very important application prospect in practical application: by using the quantized neural network model, corresponding prediction tasks can be efficiently completed on a plurality of small-storage terminals through the original neural network model.

However, the existing quantization method often causes a problem that the precision and the memory of the quantized neural network model cannot be obtained at the same time.

Disclosure of Invention

The invention aims to provide a method and a device for quantizing a neural network model.

According to an aspect of the present disclosure, there is provided a quantization method of a neural network model, the quantization method including: acquiring a neural network model; calculating a quantization parameter corresponding to an operator to be quantized of the neural network model based on a binary approximation method; and quantizing the operator to be quantized of the neural network model based on the quantization parameter to obtain the neural network model with the quantized operator to be quantized.

Optionally, the step of calculating a quantization parameter corresponding to an operator to be quantized of the neural network model includes: verifying the neural network model by using a verification data set to obtain input data of each operator to be quantized; and calculating a quantization parameter corresponding to the minimum mean square error of the input data of each operator to be quantized before and after quantization according to the input data of each operator to be quantized by using a binary approximation method.

Optionally, the step of calculating the quantization parameter corresponding to the minimum mean squared error comprises: reducing the dimension of input data of each operator to be quantized; dividing the input data of each dimensionality reduced operator to be quantized into a plurality of data distribution intervals based on the statistical characteristics of the input data of each dimensionality reduced operator to be quantized, and acquiring an interval supremum value array, wherein the interval supremum value array records the supremum value of each data distribution interval; and searching the quantization parameter corresponding to the minimum mean square error by approximating towards the middle binary between the starting point and the end point of each data distribution interval by using a binary approximation method.

Optionally, the quantization parameter comprises: at least one of a truncation parameter, a quantization factor parameter, and a truncation factor parameter of the data distribution interval.

Optionally, the step of searching for the quantization parameter comprises: for each data distribution interval, initializing the minimum mean square error to an initial mean square error of the data distribution interval when each acquisition interval upper bound value array, wherein the initial mean square error corresponds to a quantization parameter corresponding to a midpoint between a start point and an end point of the data distribution interval; calculating an approximation point mean square error of the data distribution interval by approximating towards the middle binary between the start point and the end point of the data distribution interval, wherein the approximation point mean square error corresponds to a quantization parameter corresponding to the approximation point of the data distribution interval; when the average square error of the approximation point is smaller than the minimum average square error, updating the minimum average square error by using the average square error of the approximation point; and outputting the quantization parameter corresponding to the minimum mean square error as the quantization parameter when traversing the data distribution interval.

Optionally, the operator to be quantized of the neural network model includes a quantifiable operator of the neural network model, wherein the operator is a quantifiable operator in response to a ratio of a parameter included in the operator of the neural network model to a total parameter in the neural network exceeding a threshold value or belonging to a computation-intensive operator.

Optionally, the quantization method further comprises: before computing a quantization parameter corresponding to a quantifiable operator, marking the quantifiable operator by inserting a quantization marker operator before the quantifiable operator, wherein marking the quantifiable operator comprises: determining whether input data of a quantifiable operator has weight data; when the input data of the quantifiable operator does not have weight data, inserting a quantification mark operator before the quantifiable operator; when input data of the quantifiable operator has weight data, a quantization flag operator is inserted before the quantifiable operator, and a quantization flag operator is inserted before the weight data to flag whether quantization is required for the weight data.

Optionally, the neural network model is a deep learning neural network model trained to perform one of image recognition, natural language processing, recommendation system processing.

According to an aspect of the present disclosure, there is provided an apparatus for quantizing a neural network model, the apparatus including: a data acquisition module configured to acquire a neural network model; the quantization parameter calculation module is configured to calculate a quantization parameter corresponding to an operator to be quantized of the neural network model based on a binary approximation method; and the quantization implementation module is configured to quantize an operator to be quantized of the neural network model based on the quantization parameter to obtain the neural network model with the quantized operator to be quantized.

Optionally, the quantization parameter calculation module is configured to: verifying the neural network model by using a verification data set to obtain input data of each operator to be quantized; and calculating a quantization parameter corresponding to the minimum mean square error of the input data of each operator to be quantized before and after quantization from the input data of each operator to be quantized by using a binary approximation method.

Optionally, the quantization parameter calculation module is configured to: reducing the dimension of input data of each operator to be quantized; dividing the input data of each dimensionality reduced operator to be quantized into a plurality of data distribution intervals based on the statistical characteristics of the input data of each dimensionality reduced operator to be quantized, and acquiring an interval supremum value array, wherein the interval supremum value array records the supremum value of each data distribution interval; and searching the quantization parameter corresponding to the minimum mean square error by approximating towards the middle binary between the starting point and the end point of each data distribution interval by using a binary approximation method.

Optionally, the quantization parameter calculation module is configured to: for each data distribution interval, initializing the minimum mean square error to an initial mean square error of the data distribution interval when each acquisition interval upper bound value array, wherein the initial mean square error corresponds to a quantization parameter corresponding to a midpoint between a start point and an end point of the data distribution interval; calculating an approximation point mean square error of the data distribution interval by approximating towards the middle binary between the start point and the end point of the data distribution interval, wherein the approximation point mean square error corresponds to a quantization parameter corresponding to the approximation point of the data distribution interval; when the average square error of the approximation point is smaller than the minimum average square error, updating the minimum average square error by using the average square error of the approximation point; and outputting the quantization parameter corresponding to the minimum mean square error as the quantization parameter when traversing the data distribution interval.

Optionally, the operator to be quantized of the neural network model includes a quantifiable operator of the neural network model, wherein the operator is the quantifiable operator when a ratio of a parameter included in the operator of the neural network model to a total parameter in the neural network exceeds a threshold value or belongs to a computation-intensive operator.

Optionally, the apparatus further comprises: a quantization marking module configured to: prior to computing a quantization parameter corresponding to a quantifiable operator, marking the quantifiable operator by inserting a quantization marking operator before the quantifiable operator, wherein the quantization marking module is configured to: determining whether input data of a quantifiable operator has weight data; when the input data of the quantifiable operator does not have weight data, inserting a quantification mark operator before the quantifiable operator; when input data of the quantifiable operator has weight data, a quantization flag operator is inserted before the quantifiable operator, and a quantization flag operator is inserted before the weight data to flag whether quantization is required for the weight data.

Alternatively, the neural network model may be a deep learning neural network model trained to perform one of image recognition, natural language processing, recommendation system processing.

According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by one or more computing devices, causes the one or more computing devices to implement any of the quantification methods described above.

According to an aspect of the present disclosure, there is provided a multi-turn human-machine dialog system comprising one or more computing devices and one or more storage devices having a computer program recorded thereon, which, when executed by the one or more computing devices, causes the one or more computing devices to implement any of the quantification methods described above.

The quantization method of the invention can effectively reduce the search space of the quantization parameter, find the optimal quantization parameter and obtain the quantized neural network model by calculating the quantization parameter corresponding to the operator to be quantized based on the binary approximation method. The quantized neural network model may occupy relatively less memory space than the pre-quantized neural network model, thereby improving memory utilization efficiency and saving the central processor, image processor and/or neural processor, etc., which may be relatively small overhead, without significant performance degradation.

In addition, the quantization method of the invention can record the subsequent weight quantization parameter information of the operator by adding a marking operator before the weighted operator to be quantized, so that the quantization of the operator with the weight can be completed without knowing the prior distribution of weight data in advance, thereby reducing the difficulty of information acquisition of the original model quantization method.

In addition, in the invention, the quantization error of each data in the distribution subinterval of the current quantized data (namely, the mean square error of the quantized data and the data before quantization) is calculated to replace the solving process based on the cross entropy theory in the existing method, so that the calculated quantization error can reflect the data distribution characteristics more than the cross entropy theory when the data distribution is calculated to be asymmetric, and the fitting degree with the model is higher, thereby overcoming the following problems: in the solving process based on the cross entropy theory in the existing method, the quantization parameter of the method cannot contain most positive data, so that a large amount of effective positive data is truncated.

In addition, the invention provides a method for individually calculating the quantization factor by combining the data distribution subinterval and the minimum mean square error, and the problem of uneven data distribution quantization can be effectively solved.

In addition, the quantization method provided by the invention combines the statistical distribution characteristics of different original input data, can calculate the optimal quantization factor parameter set suitable for the distribution of respective input data in a personalized manner, and selects the optimal quantization factor parameter suitable for the whole data through the minimum mean square error theory. Therefore, the optimal quantization factor parameter can be calculated by the method and the model quantization can be carried out no matter the input data are symmetrically distributed or asymmetrically distributed.

In addition, the invention provides a method for reducing the search space of the original quantization parameter to be traversed by adopting a binary approximation method according to the relaxation monotone change rule of the mean square error, thereby greatly improving the time for solving the quantization parameter.

In addition, the invention provides that the average square error theory and the data distribution characteristic are combined to calculate the average square error of the supremacy value of each interval according to the partitioned intervals to determine the truncation parameter, so that a large amount of positive half-axis data can be contained, the minimum error of the input data of the layer after quantization is ensured, and finally, the great reduction of the precision of the whole quantization model cannot be caused.

In addition, the quantization method can greatly improve the operation rate of the original model under the condition of hardly losing the prediction accuracy of the model, and greatly compress the size of the storage space occupied by the original model; the original model quantized by the method can run compatibly at the back ends of a CPU, a GPU and the like. The method enhances the applicability of the original depth model for many small devices with limited storage space, and has wide application prospects on more hardware terminals.

Drawings

The above and other objects and features of the present invention will become more apparent from the following description taken in conjunction with the accompanying drawings which illustrate, by way of example, an example in which:

FIG. 1 shows a profile of a common asymmetric or non-uniform input data;

FIG. 2 shows a flow diagram of a method of quantification of a neural network model, according to an example embodiment of the present disclosure;

FIG. 3 shows an example of MSE relaxing monotonic trend;

FIG. 4 shows a schematic diagram of a binary approximation according to an embodiment of the invention;

FIG. 5 illustrates an apparatus for quantifying a neural network model in accordance with an exemplary embodiment of the present invention;

FIG. 6 illustrates a block diagram of a method of quantifying a neural network model in accordance with an exemplary embodiment of the present invention;

fig. 7 illustrates a schematic diagram of a quantization factor parameter calculated by a quantization method of a neural network model according to an exemplary embodiment of the present invention and a quantization factor parameter calculated by an existing quantization method;

fig. 8 illustrates a detailed flowchart of a quantization method of a neural network model according to an exemplary embodiment of the present invention.

Detailed Description

The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of the embodiments of the disclosure as defined by the claims and their equivalents. Various specific details are included to aid understanding, but these are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.

In this case, the expression "at least one of the items" in the present disclosure means a case where three types of parallel expressions "any one of the items", "a combination of any plural ones of the items", and "the entirety of the items" are included. For example, "include at least one of a and B" includes the following three cases in parallel: (1) comprises A; (2) comprises B; (3) including a and B. For another example, "at least one of the first step and the second step is performed", which means that the following three cases are juxtaposed: (1) executing the step one; (2) executing the step two; (3) and executing the step one and the step two.

The existing quantization methods for the depth model mainly include a deterministic quantization method and a random quantization method. The existing quantization method based on the deterministic truncation function mainly uses the truncation function to convert a continuous value (namely, a high-precision floating point number) into a discrete value (namely, a low-precision integer) and complete model quantization. The existing method for setting the truncation function is to set a truncation function in advance according to the global data distribution, or use the maximum value as the truncation value, and in more cases, use a truncation parameter determined based on the cross entropy theory to carry out quantization. However, in the conventional method, the accuracy is greatly reduced by using the maximum value as the truncated value for quantization, and a post-quantization model must be trained in order to improve the accuracy. The truncation parameter determination based on the cross entropy theory has better precision only under the condition that the distribution is symmetrical and uniform, and the precision is obviously reduced after quantization under the condition that some data are not uniformly distributed and are asymmetrical, as shown in fig. 1. Fig. 1 shows a typical asymmetric or non-uniform input data profile. It can be seen from fig. 1 that a large number of negative numbers are distributed in a concentrated manner around 0, and positive numbers are distributed in a distributed interval from 0 to 4000 in the positive half axis.

The existing vectorization-based method mainly clusters the original high-precision operators into sub-groups, and then quantizes according to the sub-groups. The existing vectorization method mainly adopts a K-Means-based clustering method, and although the method has strong operability, the method can only finish quantification on a model which is trained in advance and defines a truncation function, so that the method has poor universality.

The existing random quantization methods are mainly classified into a quantization method based on random truncation and a quantization method based on probability distribution. The quantization method based on random truncation mainly provides a method for injecting noise in a training process, which can act as a regularizer and enable condition calculation, but the method needs the distribution characteristics of known noise data, so the method has poor practicability. While the quantization method based on probability distribution needs to be based on the assumption that the weight data is discretely distributed, and needs to know the prior knowledge of the weight data distribution, but the acquisition of the prior distribution knowledge of the weight is often difficult to realize in practical application, so the method has very limited universality.

The existing quantification method always causes the prediction performance of the original model to be obviously reduced due to the following problems:

1. the performance of the original depth model may be drastically degraded after each truncation operation in the existing maximum-value-based truncation quantization method. The quantized model needs to be trained again to improve the accuracy. Meanwhile, when the quantized model is calculated by adopting the truncated discrete value, the parameter space is smaller, so that the quantized model is difficult to converge in the training process. Furthermore, the truncation operation cannot utilize the structural information of the weights in the network. And the quantized model still needs the original precision true value for auxiliary training, which is time-consuming.

2. In the quantization method for determining the truncation parameters based on the cross entropy theory, the optimal quantization parameters cannot be calculated under the condition that the distribution characteristics of the original input data are rare or are in asymmetric distribution (as shown in fig. 1), so that the prediction performance of the original model is influenced when non-optimal quantization parameters are used for calculation.

3. The vector quantization method based on the K-means clustering has a large calculation amount. Compared with the truncated quantization method, the vector quantization method has difficulty in obtaining an integer weight. Vector quantization is usually used to quantize a pre-trained model, so if the task is to train a quantized depth model from scratch, a pre-set truncation function is required, and it is difficult to set the truncation function in advance in practical applications.

4. The random truncation quantization method adopts a truncation quantization method of random rounding, which needs to perform parameter estimation on a plurality of intermediate parameters, and the estimation often has large deviation. This deviation may lead to oscillations in the loss function during the training process, which in turn drastically affects the predictive performance of the model.

5. The probability-based quantization method needs to predefine a proper weight prior distribution, but it is often difficult to find a proper weight distribution in advance for the model. Meanwhile, many quantization methods need to traverse a relatively large solution space due to limited prior knowledge when solving the optimal parameters, and the calculation process is time-consuming.

To address one or more of the above-mentioned problems with the prior art, and to take into account accuracy and memory, the present disclosure provides a method for quantifying a neural network model using a binary approximation. Compared with the neural network model before quantization, the neural network model after quantization can occupy relatively less storage space, thereby improving the utilization efficiency of the memory. Furthermore, when the quantized neural network model is executed by a central processor, an image processor, and/or a neural processor, etc. in an electronic device (e.g., a mobile device) to perform a task such as recognition, the central processor, the image processor, and/or the neural processor, etc. can perform corresponding calculations based on the quantized neural network model with relatively little overhead, without affecting the accuracy of the task such as recognition. That is, in the present invention, the quantized neural network model obtained by the quantization method of the neural network model using the binary approximation may improve the hardware performance of the electronic device (e.g., improve the utilization rate of the memory and/or reduce the overhead of the central processing unit, the image processor, and/or the neural processor, etc.) compared to the neural network model before quantization.

Fig. 2 illustrates a flowchart of a quantization method of a neural network model according to an exemplary embodiment of the present disclosure.

Referring to fig. 2, in step S210, a neural network model is acquired.

Here, the neural network model may be a neural network model trained in advance. For example, the neural network model may be a deep learning model with raw precision floating point numbers. In one example, the neural network model may be obtained from a database, which may be a database of a server (e.g., a cloud server) or may be a database of a mobile device with limited memory. However, the present invention is not limited to obtaining the neural network model from a database, and obtaining the neural network model from any other hardware device is also feasible. In one embodiment, the neural network model may be a deep learning neural network model trained to perform one of image recognition, natural language processing, recommendation system processing.

In step S220, a quantization parameter corresponding to an operator to be quantized is calculated based on a binary approximation. In the present invention, preferably, the binary approximation is a bidirectional binary approximation.

In the invention, the quantization parameter corresponding to the operator to be quantized is calculated based on the binary approximation method, so that the search space of the quantization parameter can be effectively reduced, and the quantization parameter can be found at the same time.

For example, the operator to be quantized of the neural network model may comprise a quantifiable operator of the neural network model. In one example, an operator of the neural network model is a quantifiable operator when the proportional size of the parameters contained in the operator exceeds a threshold value or belongs to a computationally intensive operator. The compute intensive operators may be operators that include a large number of matrix multiplication operations (e.g., convolution-like operators, full-link operators, etc., by way of example only).

According to an example embodiment of the present invention, the quantifiable operator may be marked by inserting a quantization marking operator before the quantifiable operator is calculated. In other words, the operator needs to be quantized is marked by inserting a quantization marking operator. For example, it may be determined whether the input data of the quantifiable operator has weight data. When the input data of the quantifiable operator does not have weight data, a quantization flag operator may be inserted before the quantifiable operator. When the input data of the quantifiable operator has weight data, a quantization flag operator may be inserted before the quantifiable operator and a quantization flag operator may be inserted before the weight data to flag whether quantization is required for the weight data.

In the invention, a marking operator is added in front of the operator to be quantized with weight, so that the weight quantization parameter information of the operator is recorded, and the operator with weight can be quantized without knowing the prior distribution of weight data in advance. In other words, the optimal quantization parameter set suitable for the model is calculated in a personalized manner according to the original input data without the need of a pre-defined truncation function and weight prior distribution information, so that the information acquisition difficulty of the original model quantization method is reduced.

That is, if the data is a quantifiable operator, determining whether the input data carries weight data, and if the input data carries the weight data, inserting a quantization marking operator (2 in total) for marking whether the weight data needs to be quantized; if the weight data is not carried, the quantization marking is finished, and the next operator is processed. Thus far, for D_FPQ in (1)_NThe marking operation is completely finished, wherein the deep learning model with the original precision of floating point number can be used as D_FP＝(Q_N，O_M) Is represented by, wherein Q_NRepresenting an operator set needing quantization in the model, wherein N is the number of operators needing quantization; o is_MThe model represents the operator set which does not need quantization, and M is the number of operators which do not need quantization.

In one embodiment of the invention, the neural network model may be checked using a set of check data to obtain input data for each operator to be quantized. Then, a quantization parameter corresponding to the minimum mean square error of the input data of each operator to be quantized before and after quantization is calculated from the input data of each operator to be quantized using a binary approximation method. In the invention, the quantization error (namely, the mean square error of the quantized data and the data before quantization) of each data in the distribution subinterval of the current quantized data is calculated to replace the solving process based on the cross entropy theory in the existing method, so that the calculated quantization error can reflect the data distribution characteristics more than the cross entropy theory when the data distribution is calculated to be asymmetric, and the fitting degree with the model is higher. Because, in the solving process based on the cross entropy theory in the existing method, the quantization parameter of the method cannot contain most positive data, so that a large amount of effective positive data is truncated.

Here, in order to better process the input data of the operators to be quantized using the binary approximation method, the input data of each operator to be quantized may be subjected to dimensionality reduction. For example, the check data set X_RAfter the neural network model is injected, the input data of each operator to be quantized is extracted and subjected to dimensionality reduction operation, and the original input data is converted into a one-dimensional array X from a multidimensional matrix_iAnd i belongs to [0, N), wherein N is the number of operators to be quantized, and is convenient for dividing intervals and determining the supremum value of each interval. Wherein X_RIs a collection of small amounts of data arbitrarily extracted from a test set of predictive models.

In addition, based on the statistical characteristics of the input data of each operator to be quantized after the dimension reduction, the input data of each operator to be quantized after the dimension reduction can be divided into a plurality of data distribution intervals, and an interval supremum value array is obtained, wherein the interval supremum value array records the supremum value of each data distribution interval.

Specifically, the interval of the input data of each operator to be quantized is divided. The purpose is to record the value of the supremum for each interval. The process flow can be seen in algorithm 1 below:

referring to algorithm 1 above, first extract each operator input data set X to be quantized_iStatistical characteristic distribution information P (X) of_i) Consideration of the minimum value

And maximum value

Determining a quantization interval threshold thres according to the absolute value of the extreme value:

then, dividing the number of partitions n according to thres and known data distribution_binsCalculating the length inc of each interval:

wherein n is_bins8001 is a generalizable set value, and indicates the number of data distribution intervals in which the original input data of each operator to be quantized is quantized from the original floating point precision to the integer precision.

Finally, using the interval supremum value array

And recording the supremum value of each data distribution interval. The supremum value of the jth sub-interval is T_j:

T_j＝-thres+(inc×j)，j∈[0，n_bins]。

After the data distribution intervals are divided, a binary approximation method may be used to search for a quantization parameter corresponding to the minimum mean square error by binary approximation toward the middle between the start point and the end point of each data distribution interval. Generally, the existing quantization method cannot select the optimal quantization parameter suitable for the whole data under the condition that the original input data of an operator to be quantized is asymmetric or uneven, so that the invention provides a method for individually calculating the quantization factor by combining a data distribution subinterval and the minimum mean square error, and the problem of uneven data distribution quantization can be effectively solved.

The quantization method of the invention combines the statistical distribution characteristics of different original input data, can calculate the optimal quantization factor parameter set suitable for the respective input data distribution individually, and obtains the optimal quantization factor parameter suitable for the whole data by selecting through the minimum mean square error theory. Therefore, the optimal quantization factor parameter can be calculated by the quantization method of the invention and model quantization can be carried out no matter whether the input data are distributed symmetrically or asymmetrically.

In the present invention, the quantization parameter may include at least one of a truncation parameter, a quantization factor parameter, and a truncation factor parameter of the data distribution interval.

How to use the binary approximation is described in detail below by taking the truncation parameter α as an example.

In the invention, the subscript of the segment interval corresponding to the optimal truncation parameter corresponding to the minimum mean square error is quickly searched by using a binary approximation method. Since the trend of the mean squared error is relaxed monotonically decreasing, as shown in fig. 3, which shows an example of the MSE relaxed monotonic trend, its overall trend is a decreasing curve, but when the truncation parameter α is very large, its curve will have a slightly rising interval. Therefore, the invention utilizes a binary approximation method to continuously approximate the minimum MSE to the middle binary from two end points to finally obtain the optimal truncation parameter alpha.

FIG. 4 shows a schematic diagram of a binary approximation according to an embodiment of the invention.

In the invention, aiming at the problems of large redundant calculation amount, long quantization time and the like when the existing method traverses the search space of the truncation factor, the invention provides a method for reducing the search space of the original quantization parameter to be traversed by adopting a binary approximation method according to the relaxation monotone change rule of the mean square error, thereby greatly improving the time for solving the quantization parameter.

Calculation of the optimal truncation parameter alpha by binary approximation

And searching the space to find out the optimal truncation parameter alpha corresponding to the minimum mean square error before and after quantization, and preparing for a quantization implementation module. In one embodiment of the present invention, in order to calculate the minimum mean square error, the minimum mean square error may be initialized to an initial mean square error of the data distribution section for each data distribution section, each upon acquiring the group of on-section certainty values, wherein the initial mean square error corresponds to a quantization parameter corresponding to a midpoint between a start point and an end point of the data distribution section. Then, an approximation point mean square error of the data distribution section is calculated by approximating to the middle binary between the start point and the end point of the data distribution section, wherein the approximation point mean square error corresponds to the quantization parameter corresponding to the approximation point of the data distribution section. When the approximate point mean square error is less than the minimum mean square error, the minimum mean square error is updated using the approximate point mean square error. The specific process flow can be seen in algorithm 2 below.

Referring to algorithm 2, the step of calculating the optimal truncation parameter using a binary approximation may include:

1. the traversal start and end point indices are initialized. Firstly, calculating a traversal starting point p and an end point q of a starting interval:

q＝n_bins

where bit represents the number of quantized integer bits, typically an integer power of 2 (e.g., 8, 16, etc.).

The middle position index of the start index and the end index is calculated.

m＝(p+q)/2

2. Since the trend of the mean square error is relaxed and monotonically decreasing, the subscript of the optimal truncation variable must be somewhere in the middle (as shown in fig. 3). P is continuously approached to the middle subscript m in half steps until the calculated MSF_P≤MSE_mThe subscript p rolls back one step and jumps out of the loop.

Continuously approaching q to the middle subscript m by a half step size until MSE_q≤MSE_mThe subscript q rolls back one step and jumps out of the loop.

Wherein, p is (p + m)/2, and p is a backward step: p ═ p × 2) -m

And similarly, calculating a forward step length of q, wherein q is (q + m)/2, and a backward step length of q: q ═ q × 2) -m

3. Traversing the partitioned set of values of supremum between p and q

And calculating the quantization factor parameters alpha, scale and clip of the current subinterval according to the following formula_min，clip_maxAnd the minimum mean square error MSE. The specific process flow is shown in algorithm 3 below (mean square error MSE method).

α＝T_j，j∈[p，q]

clip_min＝-(2^bit-1)，clip_max＝2^bit-1

X′＝Q×scale

Where bit is the integer number of bits of the quantized model precision. Clip (.) represents a truncation function, and a specific calculation formula is as follows:

when the traversal is finished, the truncation parameter subscript corresponding to the minimum average square difference is solved, and the optimal truncation parameter alpha is returned to T_jIn which MSE_jIs the minimum value.

In the mean square error calculation method (algorithm 3), the search space is traversed and the mean square error value of the input data inverse quantized using the current truncation parameter and the original value is calculated, and whether this truncation parameter is optimal is determined according to whether the value is globally minimum, thereby determining whether the current quantization factor parameter is optimal.

In step S230, the operator to be quantized is quantized based on the quantization parameter to obtain the neural network model with the quantized operator to be quantized.

In one embodiment of the present invention, the optimal truncation parameter α, the optimal quantization factor scale, and the corresponding clip may be used_minAnd clip_maxAnd realizing the specific quantization operation and inverse quantization operation of the operator to be quantized.

Specifically, first, an optimal quantization factor and a truncation parameter set are calculated

clip_min＝-(2^bit-1)，clip_max＝2^bit-1,

And will beOptimal scale and clip_min，clip_maxAnd assigning values to quantization marking operators before each operator to be quantized.

Then, all quantization marking operators in the quantization model are converted into the operation mode (multiplication, rounding, truncation and conversion type) of the bottom layer through bottom layer hardware (such as GPU, NPU and the like), and the integral quantization of each operator to be quantized is realized by using the optimal quantization factor parameter scale. And implementing inverse quantization operation after each operator operation to be quantized.

Finally, outputting the quantized original depth model D with the precision of bit integer_INT。

Fig. 5 illustrates an apparatus for quantizing a neural network model according to an exemplary embodiment of the present invention.

Referring to fig. 5, the apparatus 500 for quantizing a neural network model may include a data acquisition module 510, a quantization parameter calculation module 520, and a quantization implementation module 540. Here, the data acquisition module 510 is configured to acquire a neural network model. The quantization parameter calculation module 520 is configured to calculate a quantization parameter corresponding to a quantifiable operator based on a binary approximation. The quantization realization module 530 is configured to quantize the operator to be quantized based on the quantization parameter to obtain the neural network model with the quantized operator to be quantized.

In other words, the data acquisition module 510 performs an operation of acquiring data, the quantization parameter calculation module 520 performs an operation of calculating a quantization parameter, and the quantization implementation module 540 performs a quantization operation. Since the operation of calculating the quantization parameter by the acquire data operation, the quantization operation, and the database update operation have been described in detail above with reference to fig. 2, the same description will not be repeated here. Further, optionally, the apparatus 500 for quantizing a neural network model may comprise a quantization marking module (not shown) configured to mark an operator to be quantized.

Fig. 6 is a structural diagram illustrating a quantization method of a neural network model according to an exemplary embodiment of the present invention.

Referring to fig. 6, the input may correspond to an acquired neural network model (i.e., a raw model). In the quantization marking module, a quantifiable original model is generated by inserting a quantization operator into the original model. Specifically, the quantization marking module identifies and marks each quantifiable operator in the original input model for computational use in quantization implementation. The module finally outputs an original floating point number precision model with a quantization marking operator. Traversing each operator of the neural network model, selecting all operators needing quantization, and inserting a quantized _ quantization operator (quantized _ quantization) in front of the operators for realizing the quantization operation in a subsequent quantization realization module.

In fig. 6, the quantization flag module may provide a quantifiable raw model to a verification calculation module (also referred to as a quantization parameter calculation module). The check calculation module aims to calculate the quantization parameter set of the original model, and then the quantization calculation of the original model is completed in the quantization realization module. In an original model of an existing quantization marking operator, firstly, acquiring statistical distribution characteristics of original input data through a check data set, and calculating quantization parameters by combining the statistical distribution characteristics and adopting a binary approximation method based on a minimum Mean Square Error (MSE) theory: quantization factor scale, truncation factor parameter clip_minAnd clip_maxAnd the output of the module is used for performing quantization calculation by a subsequent quantization implementation module by using the parameter set. In fig. 6, N indicates no and Y indicates yes.

Referring to fig. 6, the quantization implementation module mainly performs quantization and inverse quantization calculation functions and outputs a final quantization model. After the output of the last module is obtained, the quantization parameter group is assigned to a quantization marking operator, and the quantization marking operator realizes quantization operation (such as multiplication, rounding, truncation operation, data type conversion and the like) in bottom hardware. And finally, obtaining a quantized integer precision model corresponding to the input original depth model.

Fig. 7 illustrates a schematic diagram of a quantization factor parameter calculated by a quantization method of a neural network model according to an exemplary embodiment of the present invention and a quantization factor parameter calculated by an existing quantization method.

The existing quantization method determines a truncation parameter alpha according to a truncation function predefined by data distribution before and after quantization, which can cause a large amount of data of positive half shafts to be truncated, and finally causes the integral precision of a quantized model to be greatly reduced. However, if the truncation parameter α is determined by calculating the average square error worth of the supremum value of each interval according to the partitioned interval by using the average square error theory and the data distribution characteristic provided by the invention, a large amount of positive half-axis data can be contained, and the minimum error of the input data of the layer after quantization is ensured, so that the precision of the whole quantization model is not greatly reduced.

To test the effectiveness of the quantization method of the present invention, a number of test number sets were used to test the quantization method of the present invention and the existing methods. The test results are shown in table 1 below.

TABLE 1 test results

Wherein SQuAD1.1 is a test data set of a BERT model, and the model adopts F1 as an evaluation index; ML-1M is a test data set of the NCF model using HQ @10 as an evaluation index;

referring to table 1, in an example, taking a natural language processing model BERT as an example, the quantization method of the present invention implements quantization of an original model, and the prediction accuracy of a model quantized to 8-bit integer precision (i.e., bit ═ 8) is not significantly reduced compared to that of the original model (32-bit floating point type precision), and the prediction performance is greatly improved compared to the existing quantization effect based on the cross entropy theory. The quantization method of the invention compresses the size of the model into original 1/4, and the operation speed of the quantized model is 2.23 times of that of the original model.

In another example, taking the NCF model of the recommendation system as an example, the quantization method of the present invention is used to quantize the weight data, and the quantization can be implemented when the weight distribution information is unknown. The model precision is not obviously reduced after quantization, and the precision is slightly higher than the quantization precision of the quantization weight based on the cross entropy theory. The model size is compressed to 1/4 as the original, and the model operation speed is not slowed down after quantization. Therefore, when the weight distribution is unknown, the quantization method of the invention can well complete quantization on the model of the recommendation system.

In still another example, taking the image recognition model ResNet-50 as an example, with the quantization of the present invention, the model accuracy is not significantly reduced, and is slightly better than the quantization accuracy of the existing cross entropy-based method. The model size is compressed to 1/4 of the original model. The operation speed of the quantized model is 2.577 times of that of the original model.

As can be seen from table 1, the quantization method of the present invention can greatly increase the operation rate of the original model without losing the prediction accuracy of the model, and greatly reduce the size of the storage space occupied by the original model; and the original model quantized by the quantization method of the invention can be compatibly operated at the back ends of a CPU, a GPU and the like. The method enhances the applicability of the original depth model for many small devices with limited storage space, and has wide application prospects on more hardware terminals.

For example, when a neural network is used to perform an image recognition learning task, a ReNet-50 network is taken as an example: the model comprises a large number of Conv2d operators, so that the model is very suitable for being applied to related tasks of images. The method comprises the steps of firstly converting a picture of a picture data set into a dot matrix vector by pixels, entering the dot matrix vector into a pre-classifier (generally a basic classifier, such as a logistic regression classifier, a Bayesian classifier and the like) for training, and obtaining a trained model by minimizing an error value of a loss function in the training. However, in this case, all data types in the model calculation are floating point types, and a large space overhead is required for model storage, so that it is necessary to quantize convolution operators of the model to complete the compression storage of the model without affecting the prediction accuracy of the model. The quantification method of the invention can maintain the original accuracy of the ResNet-50 model and greatly reduce the size of the model. The truncation factor and the quantization parameter which are rapidly calculated according to the bidirectional binary approximation method can accurately map the input data of the high-precision floating point type into integer data in a low-precision integer interval, then the integer input data are transmitted to terminals such as a CPU (central processing unit), a GPU (graphics processing unit) or a multi-core NPU (non-uniform processing unit) and the like for data reading, the prediction time of a model for each picture data can be greatly shortened, and the prediction precision of the model is guaranteed to be hardly influenced.

For example, when a neural network is used to perform natural language processing tasks, the Bert-Large model is taken as an example: the original text data is firstly converted into word vectors through a word2vec toolkit and stored, and then the word vectors enter a Bert-Large model with a bidirectional transformer to be ready for training. The size of the model still consumes too much memory and therefore quantization needs to be done as well. In the model, the proportion of the DENSE operator is the highest, and the size of the operator is the largest, so that after the method is used for quantizing the DENSE operator, the size of the original model can be reduced to one fourth. Similarly, after the text data is converted into vectors through a word converter and subjected to quantization of the present invention, the generated integer model can be stored on a device such as a CPU for hardware acceleration, and then read out for tasks such as inference prediction.

For example, when a task of a recommendation system is constructed using a neural network, taking the NCF model as an example: and the original user-item metadata enters the NCF for collaborative filtering to obtain the item ranking of the preference of the current user, and then the item which is most interested by the user is selected according to the Top-K ranking and output as the recommended item. Taking movielens data set as an example, user-movie data enters an NCF (non-uniform correlation function) model to calculate similarity and user preference, the calculation process is mainly completed through a TAKE operator in the model, and the operator is large, so that a quantized model only one fourth of the size of an original model is obtained after quantization is performed through a quantization algorithm. Data enters a quantized model for inference, and the process can also call accelerators at the back ends of a CPU, a GPU and the like for calculating and accelerating speed. But the recommendation accuracy is not greatly affected and the recommendation calculation time can be reduced. It can be seen that the quantification method plays a crucial role in the actual task execution process of various types of neural network models.

Referring to fig. 8, the method for quantizing a neural network model of the present invention is directed toNeural network model D with original precision as floating point number_FPModel D quantized to integer precision_INTTherefore, under the condition of minimum overall prediction accuracy loss, the memory space occupied by the original model is compressed and the operation rate of the original model is improved.

In connection with FIG. 8, at input D_FPAnd then, identifying each operator needing quantization in the model by a quantization marking module, adding a quantization marking operator for finishing subsequent quantization calculation, not executing the operation by the operator needing no quantization, outputting a marked model with the marking operator after the module is finished, and then starting verification calculation.

Check computation module with mark operator mark model D_FPObtaining statistical distribution characteristic information of input data of each operator to be quantized through a check data set; according to the statistical distribution information, a quantization interval threshold thres can be determined, the value reflects quantized data distribution information, and the distribution of original data is guaranteed to be unchanged during quantization; then, the interval length inc of the quantized data distribution interval can be calculated by thres; dividing the data distribution interval into n number according to inc and the quantized data distribution interval_binsA value of supremum limit T for each subinterval divided may be determined_j(ii) a Then, a binary approximation method is utilized to reduce the search space, and the quantization factor truncation parameter alpha of the current jth interval is calculated_jScale of current subinterval_jAnd a truncation factor parameter clip_minAnd clip_maxUsing the currently calculated quantization parameter scale_jQuantizing the ith data X_iTo obtain quantized data Q_iThen Q is put in_iGeneration of X by inverse quantization_i', then by X_i' and X_iCalculating the mean square error MSE of quantization under the current jth subinterval_jAnd then the current minimum MSE_minComparing and updating are carried out, so that the optimal quantization parameter scale and clip are output after all subintervals are traversed_minAnd clip_max。

The quantization realization module assigns the optimal quantization parameter to the marking operator and then realizes quantization through bottom hardwareOperation (such as rounding, type conversion, multiplication, etc.), and finally outputting an original deep learning model D with integer precision_INT。

According to an example embodiment of the present disclosure, a computing device is provided that includes a processor and a memory. Wherein the memory stores a computer program which, when executed by the processor, implements a method of quantification of a neural network model according to an exemplary embodiment of the present disclosure.

According to an exemplary embodiment of the present disclosure, there is provided a computer-readable storage medium storing instructions that, when executed by at least one computing device, cause the at least one computing device to perform a method of quantifying a neural network model according to an exemplary embodiment of the present disclosure.

According to the quantization method and device of the neural network model, the quantization parameters corresponding to the quantifiable operators are calculated based on the binary approximation method, so that the search space of the quantization parameters can be effectively reduced, and meanwhile, the quantization parameters are found.

In addition, according to the quantization method and the device of the neural network model disclosed by the invention, under the condition of small precision loss, the truncation parameter value is set in a personalized manner, and the high-precision true value of the original model does not need to be stored, so that the memory overhead is reduced, and the original depth model is compressed to about 1/4 of the original size, so that the problem that the precision and the memory cannot be obtained by the existing truncation quantization method is solved.

In addition, according to the quantization method and device of the neural network model disclosed by the invention, the operation rate of the model is improved, and one more quantization marking operator is added when a weighted operator to be quantized is processed, so that the problem that the weight data distribution needs to be predefined in the existing probability-based quantization method is solved.

In addition, according to the quantization method and device of the neural network model disclosed by the invention, a binary approximation method is adopted to traverse the subintervals of the quantized data distribution according to the relaxation monotone change rule of the mean square error, and then the quantization parameters are obtained by solving. Compared with the process of traversing each subinterval by the existing method, the quantization method reduces the search space of the truncation factor to be traversed, and greatly improves the quantization rate.

In addition, according to the quantization method and the device of the neural network model disclosed by the invention, a method for determining different quantization parameters according to different data distribution characteristics is adopted, so that the obtained personalized parameters not only combine the local characteristics of the original model, but also combine the statistical characteristics of input data, and compared with the existing quantization method based on the entropy theory, the quantization method has better quantization performance when processing asymmetric data or non-uniform data.

The quantization method of the neural network model and the apparatus for quantizing the neural network model of the present disclosure according to the exemplary embodiments of the present disclosure have been described above with reference to fig. 1 to 8.

The various modules in the apparatus for quantifying neural network models of the present disclosure illustrated in fig. 5, 6, and 8 may be configured as software, hardware, firmware, or any combination thereof that performs a particular function. For example, each module may correspond to a dedicated integrated circuit, to pure software code, or to a combination of software and hardware. Furthermore, one or more functions implemented by the respective modules may also be uniformly executed by components in a physical entity device (e.g., a processor, a client, a server, or the like).

Further, the quantization method of the neural network model of the present disclosure described with reference to fig. 2, 6, and 8 may be implemented by a program (or instructions) recorded on a computer-readable storage medium. For example, according to an exemplary embodiment of the present disclosure, a computer-readable storage medium storing instructions may be provided, wherein the instructions, when executed by at least one computing device, cause the at least one computing device to perform a method of quantifying a neural network model according to the present disclosure of the present disclosure.

The computer program in the computer-readable storage medium may be executed in an environment deployed in a computer device such as a client, a host, a proxy device, a server, and the like, and it should be noted that the computer program may also be used to perform additional steps other than or more specifically processes when the steps are performed, and the content of the additional steps and the further processes are mentioned in the description of the related methods with reference to fig. 2, fig. 6, and/or fig. 8, and therefore will not be described in detail here to avoid repetition.

It should be noted that each module in the apparatus for quantizing a neural network model according to an exemplary embodiment of the present disclosure may completely depend on the execution of the computer program to realize the corresponding function, that is, each module corresponds to each step in the functional architecture of the computer program, so that the entire system is called by a special software package (e.g., a lib library) to realize the corresponding function.

Alternatively, the various modules shown in fig. 5, 6, and 8 may be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that a processor may perform the corresponding operations by reading and executing the corresponding program code or code segments.

For example, exemplary embodiments of the present disclosure may also be implemented as a computing device including a storage component having stored therein a set of computer-executable instructions that, when executed by a processor, perform a method of quantifying a neural network model according to exemplary embodiments of the present disclosure.

In particular, computing devices may be deployed in servers or clients, as well as on node devices in a distributed network environment. Further, the computing device may be a PC computer, tablet device, personal digital assistant, smart phone, web application, or other device capable of executing the set of instructions.

The computing device need not be a single computing device, but can be any device or collection of circuits capable of executing the instructions (or sets of instructions) described above, individually or in combination. The computing device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with local or remote (e.g., via wireless transmission).

In a computing device, a processor may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.

Some of the operations described in the quantization method of the neural network model according to the exemplary embodiments of the present disclosure may be implemented by software, some of the operations may be implemented by hardware, and further, the operations may be implemented by a combination of hardware and software.

The processor may execute instructions or code stored in one of the memory components, which may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.

The memory component may be integral to the processor, e.g., having RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, the storage component may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The storage component and the processor may be operatively coupled or may communicate with each other, such as through an I/O port, a network connection, etc., so that the processor can read files stored in the storage component.

In addition, the computing device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the computing device may be connected to each other via a bus and/or a network.

The quantization method of the neural network model according to the exemplary embodiment of the present disclosure may be described as various interconnected or coupled functional blocks or functional diagrams. However, these functional blocks or functional diagrams may be equally integrated into a single logic device or operated on by non-exact boundaries.

Thus, the method of quantifying a neural network model described with reference to fig. 2, 6 and/or 8 may be implemented by a system comprising at least one computing device and at least one storage device storing instructions.

According to an exemplary embodiment of the present disclosure, the at least one computing device is a computing device for performing a method of quantizing a neural network model according to an exemplary embodiment of the present disclosure, the storage device having stored therein a set of computer-executable instructions that, when executed by the at least one computing device, perform the method of quantizing a neural network model described with reference to fig. 2, 6 and/or 8.

While various exemplary embodiments of the present disclosure have been described above, it should be understood that the above description is exemplary only, and not exhaustive, and that the present disclosure is not limited to the disclosed exemplary embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. Therefore, the protection scope of the present disclosure should be subject to the scope of the claims.

Claims

1. A method of quantifying a neural network model, the method comprising:

acquiring a neural network model;

calculating a quantization parameter corresponding to an operator to be quantized of the neural network model based on a binary approximation method;

and quantizing the operator to be quantized of the neural network model based on the quantization parameter to obtain the neural network model with the quantized operator to be quantized.

2. The quantization method of claim 1, wherein the step of calculating a quantization parameter corresponding to an operator to be quantized of the neural network model comprises:

verifying the neural network model by using a verification data set to obtain input data of each operator to be quantized;

and calculating a quantization parameter corresponding to the minimum mean square error of the input data of each operator to be quantized before and after quantization according to the input data of each operator to be quantized by using a binary approximation method.

3. The quantization method of claim 2, wherein the step of calculating the quantization parameter corresponding to the least mean squared error comprises:

reducing the dimension of input data of each operator to be quantized;

dividing the input data of each dimensionality reduced operator to be quantized into a plurality of data distribution intervals based on the statistical characteristics of the input data of each dimensionality reduced operator to be quantized, and acquiring an interval supremum value array, wherein the interval supremum value array records the supremum value of each data distribution interval;

and searching the quantization parameter corresponding to the minimum mean square error by approximating towards the middle binary between the starting point and the end point of each data distribution interval by using a binary approximation method.

4. The quantization method of claim 3, wherein the quantization parameter comprises: at least one of a truncation parameter, a quantization factor parameter, and a truncation factor parameter of the data distribution interval.

5. The quantization method of claim 3, wherein the searching for the quantization parameter comprises:

for each data distribution interval, initializing the minimum mean square error to an initial mean square error of the data distribution interval when each acquisition interval upper bound value array, wherein the initial mean square error corresponds to a quantization parameter corresponding to a midpoint between a start point and an end point of the data distribution interval;

calculating an approximation point mean square error of the data distribution interval by approximating towards the middle binary between the start point and the end point of the data distribution interval, wherein the approximation point mean square error corresponds to a quantization parameter corresponding to the approximation point of the data distribution interval;

when the average square error of the approximation point is smaller than the minimum average square error, updating the minimum average square error by using the average square error of the approximation point;

and outputting the quantization parameter corresponding to the minimum mean square error as the quantization parameter when traversing the data distribution interval.

6. A quantization method as defined in claim 1, wherein the operator to be quantized of the neural network model comprises a quantifiable operator of the neural network model,

wherein, in response to the proportion of the parameters contained in the operators of the neural network model to the total parameters in the neural network exceeding a threshold value or belonging to computation-intensive operators, the operators are quantifiable operators.

7. The quantization method of claim 6, further comprising:

before computing the quantization parameter corresponding to the quantifiable operator, the quantifiable operator is marked by inserting a quantization marking operator before the quantifiable operator,

wherein the step of marking the quantifiable operator comprises:

determining whether input data of a quantifiable operator has weight data;

when the input data of the quantifiable operator does not have weight data, inserting a quantification mark operator before the quantifiable operator;

when input data of the quantifiable operator has weight data, a quantization flag operator is inserted before the quantifiable operator, and a quantization flag operator is inserted before the weight data to flag whether quantization is required for the weight data.

8. The quantification method of claim 1, the neural network model being a deep learning neural network model trained to perform one of image recognition, natural language processing, recommendation system processing.

9. An apparatus for quantizing a neural network model, the apparatus comprising:

a data acquisition module configured to acquire a neural network model;

the quantization parameter calculation module is configured to calculate a quantization parameter corresponding to an operator to be quantized of the neural network model based on a binary approximation method;

and the quantization implementation module is configured to quantize an operator to be quantized of the neural network model based on the quantization parameter to obtain the neural network model with the quantized operator to be quantized.

10. A computer-readable storage medium having stored thereon a computer program that, when executed by one or more computing devices, causes the one or more computing devices to implement the quantization method of any of claims 1-8.

11. A multi-turn human-machine dialog system comprising one or more computing devices and one or more storage devices having a computer program recorded thereon, which, when executed by the one or more computing devices, causes the one or more computing devices to implement the quantification method of any one of claims 1-8.