CN110929862B - Fixed-point neural network model quantization device and method - Google Patents

Fixed-point neural network model quantization device and method

Info

Publication number
CN110929862B
Authority
CN
China
Prior art keywords
operator
model
processor
input
quantization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911174616.XA
Other languages
Chinese (zh)
Other versions
CN110929862A (en)
Inventor
陈子祺
田甲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201911174616.XA
Publication of CN110929862A
Application granted
Publication of CN110929862B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The application relates to a fixed-point neural network model quantization method and device. The method comprises the following stages. Checking stage: verify that the graph model is a directed acyclic graph and convert a multi-input graph model into a single-input model. Preparation stage: apply equivalent transformations to the graph model to facilitate subsequent quantization. Calibration stage: feed all samples through the floating-point model, collect statistics on the output of every operator in the model, and predict the likely output threshold of each operator over all samples from the characteristics of the output data. Quantization stage: convert the operators to fixed point in the topological order of the model. The method and device can effectively reduce the storage and computation cost of the model, eliminate the uncertainty caused by rounding errors in floating-point arithmetic, and improve the efficiency, transparency, and security of deep neural network models.

Description

Fixed-point neural network model quantization device and method
Technical Field
The application relates to a fixed-point neural network model quantization device and method, and belongs to the technical field of artificial intelligence.
Background
Deep neural network models are widely applied to machine vision tasks such as image classification and object detection, and have achieved great success. However, storing and running neural network models on embedded chips and purpose-built neural network chips remains a significant challenge because of memory and power constraints. Moreover, existing neural network models are designed with only accuracy in mind, not reproducibility and consistency of computation, so results may differ across architectures or even within the same computing environment. This greatly limits the application of neural network algorithms in fields with high security requirements such as finance, trusted computing, blockchain, and smart contracts.
Model fixed-pointing, i.e., converting the floating-point operations of a deep neural network into integer operations, addresses these problems in two respects. First, as one of the most widely adopted model compression methods in deep learning, it reduces the storage and computation cost of the model. Second, fixed-point integer arithmetic avoids the rounding errors of floating-point operations and eliminates uncertainty in the computation.
Existing mainstream quantization methods map the parameter domain onto a discrete integer domain, for example mapping convolution kernel parameters onto the INT8 integer domain. They can be divided into symmetric and asymmetric quantization according to whether the discrete integer domain is symmetric, and may additionally quantize per channel or add a zero-point offset to improve quantization capability. However, on the one hand, existing model quantization techniques are not mature enough to preserve model accuracy effectively while improving performance. On the other hand, existing model quantization devices only accelerate certain specific operators (such as convolution and matrix multiplication), a large number of floating-point intermediate values remain in the computation, and such semi-integer quantization still cannot completely avoid rounding errors during model execution.
Disclosure of Invention
The purpose of the application is to provide a fixed-point neural network model quantization device and method that can effectively reduce the storage and computation cost of the model, eliminate the uncertainty caused by rounding errors in floating-point operations, and improve the efficiency, transparency, and security of deep neural network models.
The application relates to a fixed-point neural network model quantization device, comprising:
a model memory: configured to store at least one model;
a data memory: configured to store data;
an operator fixed-point processor: configured to execute at least one program to convert the operators in the neural network to fixed point;
and a central processor: configured to read the model and the data from the model memory and the data memory, invoke the corresponding operators in the operator fixed-point processor, and collect statistics on the output of each operator while actually executing the sample data.
Preferably, the central processor comprises the following program units:
a reading program unit, which reads the model from the model memory and the sample data from the data memory;
a checking program unit, which topologically sorts the operators of the model and, in that order, invokes the checking configuration of the corresponding operators in the processor;
a preparation program unit, which topologically sorts the operators of the model and, in that order, invokes the preparation configuration of the corresponding operators in the processor;
a calibration program unit, which, based on the read sample data, collects statistics on the output of each internal operator while actually executing the sample data;
and a quantization program unit, which topologically sorts the operators of the model and, in that order, invokes the quantization configuration of the corresponding operators in the operator fixed-point processor.
Preferably, the device further comprises a re-quantization device configured to perform an integer data precision re-quantization procedure; each operator is configured as a pluggable operator execution processor; the tanh, sigmoid, and exp operator processors configure the quantization stage to map the original floating-point numbers corresponding to the input integer data onto the discrete domain INT16 by a table look-up method and to build an index table, and the fixed-point processor transforms the operator into an index instruction plus the index table; more preferably, the tanh, sigmoid, and exp operator processors configure the maximum input precision to be 16.
Preferably, the softmax operator processor configures the quantization stage to combine the table look-up method with integer arithmetic, i.e., quantization is first performed by the operator's table look-up method and is then followed by integer addition and division; since the floating-point division in the original expression is expected to become round-to-nearest after fixed-pointing, half of the denominator must be added to the numerator when converting to integer division; more preferably, the softmax operator processor configures the maximum precision to be 16.
Preferably, the convolution and matrix multiplication processors configure the checking stage to support only the 2D-NCHW input format; the convolution and matrix multiplication processors configure the preparation stage so that, when the result of a matrix multiplication would overflow, a matrix decomposition operation is performed and the large matrix multiplication operator is converted into several small matrix operators whose results are merged by addition; the convolution and matrix multiplication processors configure the quantization stage so that the original floating-point convolution and matrix multiplication operations are equivalently converted into integer equivalent operators; more preferably, the convolution and matrix multiplication processors configure the maximum precision to be 8.
Preferably, the normalization operator processor configures the preparation stage so that the normalization operation is equivalently converted into matrix multiplication and addition; or the normalization operator processor configures the preparation stage so that, if the input data is the result of a convolution operation, the normalization operation is merged into the convolution operation.
Preferably, the relu operator processor does not limit the input precision; in the preparation stage, if the child node is a transpose operation, the node and the child node are swapped in order; other stages use the default operation.
Preferably, the auto-broadcast matrix multiplication operator processor configures the input precision to be 16, and other stages use the default operation; or
the dimension addition (sum over axis) operator processor configures the input precision to be 8, and other stages use the default operation; or
the matrix addition and subtraction operator processors configure the input precision to be 16, and other stages use the default operation; or
the auto-broadcast matrix addition and subtraction operator processors configure the input precision to be 16, and other stages use the default operation; or
the matrix concatenation operator processor does not limit the input precision, and other stages use the default operation; or
the embedding operator processor does not limit the input precision, and other stages use the default operation; or
the maximum-value pooling operator processor does not limit the input precision, and other stages use the default operation; or
the average-value pooling operator processor configures the checking stage so that, when the pooling kernel window slides, the computed average always includes the peripheral padding cells; or
the matrix slice and clip operator processors do not limit the input precision, and other stages use the default operation; or
the matrix negation, dimension repetition, and tiling operator processors do not limit the input precision, and other stages use the default operation; or
the dimension expansion, dimension squeeze, and reshape operator processors do not limit the input precision, and other stages use the default operation; or
the transpose operator processor does not limit the input precision; in the preparation stage, if the input data is the result of a transpose operation, the two transposes can be merged; other stages use the default operation; or
the flatten, maximum, minimum, and upsampling operator processors do not limit the input precision, and other stages use the default operation.
The application also relates to a fixed-point neural network model quantization method that uses the above neural network model quantization device and comprises the following stages:
checking stage: verify that the graph model is a directed acyclic graph and convert a multi-input graph model into a single-input model;
preparation stage: apply equivalent transformations to the graph model to facilitate subsequent quantization;
calibration stage: feed all samples through the floating-point model, collect statistics on the output of every operator in the model, and predict the likely output threshold of each operator over all samples from the characteristics of the output data;
quantization stage: convert the operators to fixed point in the topological order of the model.
Preferably, the method further comprises a re-quantization stage in which a maximum input-data precision is set for each operator, and the input data is reduced in precision when the precision of the operator's input data is greater than the set maximum precision.
The application also relates to a computer-readable medium storing instructions that cause a computer to:
(1) verify that the graph model is a directed acyclic graph and convert a multi-input graph model into a single-input model;
(2) apply equivalent transformations to the graph model to facilitate subsequent quantization;
(3) feed all samples through the floating-point model, collect statistics on the output of every operator in the model, and predict the likely output threshold of each operator over all samples from the characteristics of the output data;
(4) convert the operators to fixed point in the topological order of the model.
Preferably, the instructions further cause the computer to: set a maximum input-data precision for each operator, and reduce the precision of the input data when the precision of the operator's input data is greater than the set maximum precision.
The fixed-point neural network model quantization device and method have the following technical advantages:
(1) the various equivalent graph transformations effectively reduce the amount of graph computation and improve the execution efficiency and transparency of the deep neural network model, so that it can be better deployed on embedded chips and neural network inference chips;
(2) the device interfaces with the operator protocol of the fixed-point model and realizes the conversion from an ordinary floating-point model to a fixed-point model;
(3) fully integer quantization is achieved in the quantization process, no floating-point data exists during execution, and the rounding errors of floating-point operations are eliminated, so that model computation is deterministic;
(4) better precision is obtained during quantization, no further complex fine-tuning of the model is needed, and the method is convenient to use.
Drawings
Fig. 1 is a schematic diagram of a data discretization method in the present application.
Fig. 2 is a schematic diagram of the fixed-point neural network model quantization device of the present application.
Fig. 3 is a schematic diagram of a processing flow module of the central processing unit of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in detail hereinafter with reference to the accompanying drawings. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be arbitrarily combined with each other.
According to a first aspect of the present application, the process of a fixed-point neural network model quantization method includes:
checking stage: verify that the graph model is a directed acyclic graph and convert a multi-input graph model into a single-input model; also verify that the graph model contains no duplicate names and remove useless model parameters; if the graph model fails the check, error information is returned to the user; operator-specific checks may also be performed, for which refer to the configuration of the operator fixed-point processors described below;
preparation stage: apply equivalent transformations to the graph model to facilitate subsequent quantization; for operator-specific equivalent transformations refer to the configuration of the operator fixed-point processors described below;
calibration stage: feed all samples through the floating-point model, collect statistics on the output of every operator in the model, and predict the likely output threshold of each operator over all samples from characteristics of the output data such as the maximum, minimum, mean, and variance. Considering practical factors and time cost, a small subset of samples (16 samples were used in testing) is collected to approximate the prediction over all data; the resulting quantization is no worse than using all of the sample data;
quantization stage: convert the operators to fixed point in topological order. Fixed-pointing changes an operator from accepting floating-point data to accepting integer data with a fixed scaling factor m. Thus, when an operator is processed, it can be assumed that its input data X has already been fixed-pointed, i.e., mapped onto the integer domain INTp, where p is defined as the precision of the input X and m is its fixed scaling factor; the input that the operator would have received in the original floating-point model is x = X / m. An obvious example is an operator whose logical abstraction is a homogeneous operation, i.e., one satisfying m * f(x) = f(m * x); such an operator can be fed the integer data directly, and its output data carries the same scaling factor m;
re-quantization stage: after an operator manipulates the data, the numeric range grows; an obvious example is matrix multiplication, whose result theoretically has twice the precision of the input data. The memory of the quantization device uses INT32 storage by default, and the precision of the model's data grows step by step as it passes through successive operators until it exceeds INT32 and overflows, which makes the result non-deterministic. Therefore a maximum input precision q can be set for each operator to prevent data overflow during its execution: when the precision p of the operator's input data is greater than q, the input must be reduced in precision. This step is defined as the re-quantization operation and is a preferred operation.
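To make the scaling-factor convention and the precision-reduction step concrete, the following minimal Python sketch is offered (the function names, the symmetric max-absolute-value threshold, and the round-to-nearest shift are assumptions made for illustration, not the patented procedure):

    import numpy as np

    def quantize(x, precision):
        """Map floating-point data x onto the integer domain INT(precision).
        Returns the integer data X and the scaling factor m, so that x ~ X / m."""
        bound = 2 ** (precision - 1) - 1           # e.g. 127 for INT8
        threshold = np.max(np.abs(x))              # calibrated output threshold
        m = bound / threshold                      # fixed scaling factor
        X = np.clip(np.round(x * m), -bound, bound).astype(np.int64)
        return X, m

    def requantize(X, p, q):
        """Reduce integer data X from precision p to precision q by a right shift.
        The scaling factor shrinks by the same power of two."""
        shift = p - q
        if shift <= 0:
            return X, 0
        # round-to-nearest right shift: add half of the discarded range first
        Xq = (X + (1 << (shift - 1))) >> shift
        return Xq, shift

    # usage: quantize a small tensor to INT8, then reduce it to 6-bit precision
    x = np.array([0.5, -1.25, 3.0, -0.75])
    X, m = quantize(x, 8)
    Xq, shift = requantize(X, 8, 6)
    print(X, m, Xq, shift)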
Part of the idea of mainstream quantization methods is borrowed here: the input and output data are mapped onto a discrete integer domain. The discretization of the data is illustrated in Fig. 1, where the slanted line is the actual data and the horizontal steps are the quantized data distribution; in strict mathematical terms this is floating-point rounding.
According to a second aspect of the present application, there is provided a fixed-point neural network model quantization device, as shown in Fig. 2, comprising:
a model memory: configured to store at least one model;
a data memory: configured to store data, including sample data, intermediate results, final results, and the like;
an operator fixed-point processor: configured to execute at least one program to convert the operators in the neural network to fixed point; for each operator present in the model, the operator execution processor is configured to be pluggable;
and a re-quantization device: configured to perform a reshaped data precision re-quantization procedure, calculating simulated parameters for floating point operations using reshaping; given the shaping domain M to which the floating point input maps: INT (p), with scaling factor sp, expects the addition of data to map to the shaping domain N: on INT (q), the scaling factor is sq, and the expression n=m (sq/sp), i.e. M times the scaling factor (sq/sp), is given. After the re-quantization device executes the floating point data division, the floating point result hardware binary representation is directly extracted to have sq/sp=a/2^b according to the IEEE floating point data representation standard, and an operation instruction (M.smata) > > b is returned.
and a central processor: reads the model and the data from the model memory and the data memory, invokes the corresponding operators in the operator fixed-point processor, collects statistics on the output of each operator while actually executing the sample data, and re-quantizes the data; the specific configuration is shown in Fig. 3.
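As an illustration of the a / 2^b decomposition used by the re-quantization device, here is a minimal Python sketch; the frexp-based extraction and the fixed number of fractional bits are assumptions made for illustration rather than the patented IEEE-754 bit-extraction procedure:

    import math

    def requant_params(sp, sq, frac_bits=24):
        """Approximate sq/sp as a / 2**b with integers a and b.
        Assumes sq/sp is positive and small enough that b stays non-negative."""
        ratio = sq / sp
        mant, exp = math.frexp(ratio)       # ratio = mant * 2**exp, 0.5 <= mant < 1
        a = int(round(mant * (1 << frac_bits)))
        b = frac_bits - exp
        return a, b

    def requant(M, a, b):
        """Integer-only re-quantization: N = M * (sq/sp) ~ (M * a) >> b."""
        return (M * a) >> b

    # usage: map integer data M with scaling factor sp onto a domain with factor sq
    sp, sq = 42.0, 10.5
    a, b = requant_params(sp, sq)
    print(requant(168, a, b), 168 * sq / sp)   # 42 both ways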
Fig. 3 shows the processing flow modules of the central processor, which coordinates the other components to process the neural network and carry out the fixed-point conversion of the model; internally it consists of a series of program units.
The reading program unit 31 reads the model from the model memory and the sample data from the data memory.
The checking program unit 32 implements the checking stage: it topologically sorts the operators of the model and, in that order, invokes the checking configuration of the corresponding operators in the processor.
The preparation program unit 33 implements the preparation stage: it topologically sorts the operators of the model and, in that order, invokes the preparation configuration of the corresponding operators in the processor.
The calibration program unit 34 implements the calibration stage: it collects statistics on the output of each internal operator while actually executing the sample data and summarizes characteristics such as the maximum, minimum, mean, and variance, so that scaling factors can be computed when re-quantization operations are executed later.
The quantization program unit 35 implements the quantization stage: it topologically sorts the operators of the model and, in that order, invokes the quantization configuration of the corresponding operators in the operator fixed-point processor. As described above, the maximum input-data precision is first obtained from the operator processor; if the precision of the input data is found to be greater than this maximum, the re-quantization device is invoked to reduce the data precision, and the operator processor's quantization configuration is then invoked to process the reduced-precision data.
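The cooperation of program units 31 to 35 can be summarized with a heavily simplified, runnable Python sketch; the dict-based operator representation, the hook names, and the max-absolute-value threshold rule are assumptions made only to show the order of the stages, not the actual device:

    import numpy as np

    def quantize_model(ops, processors, samples, max_precision=8):
        for op in ops:                                  # checking stage (unit 32)
            processors[op['kind']]['check'](op)
        for op in ops:                                  # preparation stage (unit 33)
            processors[op['kind']]['prepare'](op)
        stats = {}                                      # calibration stage (unit 34)
        for sample in samples:
            out = sample
            for op in ops:
                out = op['fn'](out)
                stats.setdefault(op['name'], []).append(float(np.max(np.abs(out))))
        for op in ops:                                  # quantization stage (unit 35)
            threshold = max(stats[op['name']])          # predicted output threshold
            op['scale'] = (2 ** (max_precision - 1) - 1) / threshold
        return ops

    # usage: a one-operator "model" calibrated on 16 random samples
    relu_processor = {'check': lambda op: None, 'prepare': lambda op: None}
    ops = [{'name': 'relu0', 'kind': 'relu', 'fn': lambda x: np.maximum(x, 0)}]
    rng = np.random.default_rng(0)
    samples = [rng.standard_normal(8) for _ in range(16)]
    print(quantize_model(ops, {'relu': relu_processor}, samples)[0]['scale'])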
Referring to Fig. 2, the neural network fixed-pointing device quantizes the model into a fully integer network through the interaction of the central processor with the model memory and the data memory, executing the software methods configured in the central processor together with the pluggable operator fixed-point processors. The operator fixed-point processors are configured to be pluggable for two reasons. First, the design is concise: all operators expose the same interface, which improves the extensibility of the device, allows more operators to be configured, and allows newer models to be supported continuously. Second, hot-pluggable operator processors are easy to install and deploy, and different fixed-point conversion devices can be configured for different application scenarios.
Every operator needs to be configured with the three-stage software methods described above, namely the checking stage, the preparation stage, and the quantization stage. Some concepts involved in the program methods configured in the existing operator fixed-point processors are described first:
constant cancellation: in the preparation stage, three types of nodes exist in the neural network, namely input, parameters and operators, wherein the parameters are known variables, and the operators are logic abstractions of data operation. Assuming that the inputs of an operator are parameters or that there is no input, the operator can be calculated ahead of time on the processor as a result, i.e. the operator can be constant eliminated.
Transpose elimination: a transpose transforms a matrix of dimensions (M1, M2, M3, …, Mk) into one of dimensions (N1, N2, N3, …, Nk), where the N dimensions are a rearrangement of the M dimensions. For some specific neural networks a large number of transpose operations can be eliminated; for example, two consecutive transposes can be merged into a single transpose.
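A short check of merging two consecutive transposes into one, assuming each transpose is represented by an axis permutation as in numpy:

    import numpy as np

    def merge_transposes(perm1, perm2):
        """Composing transpose(perm1) followed by transpose(perm2) equals a single
        transpose whose i-th axis is perm1[perm2[i]]."""
        return [perm1[p] for p in perm2]

    x = np.arange(24).reshape(2, 3, 4)
    p1, p2 = (1, 2, 0), (2, 0, 1)
    merged = merge_transposes(p1, p2)
    assert np.array_equal(x.transpose(p1).transpose(p2), x.transpose(merged))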
Constant elimination and transpose elimination are common concepts in neural network fixed-pointing methods. The configuration methods in the operator fixed-point processors of the fixed-pointing device are described in detail below. The default operation assumes that the operator is the homogeneous operation described above: the output scaling factor equals the input scaling factor, and the operator's logic is left unchanged. Note that the operator settings in this application can be freely combined and selected according to the objects being processed and the work at hand; they do not all need to be configured at the same time.
The relu operator processor does not limit the input precision; in the preparation stage, if the child node is a transpose operation, the node and the child node are swapped in order; other stages use the default operation.
The tanh, sigmoid, and exp operator processors configure the maximum input precision to be 16.
The tanh, sigmoid, and exp operator processors configure the quantization stage. These operators are nonlinear functions whose floating-point computation cannot be simulated with ordinary integer operators, so a table look-up method is used: the original floating-point numbers corresponding to the input integer data are mapped onto the discrete domain INT16, an index table TABLE is built, and the fixed-point processor transforms the operator into an index instruction plus the index table; other stages use the default operation. For example, suppose the input integer data lies in the INT8 range. Each input INT8 value corresponds to an original floating-point input, the original floating-point operator computes the corresponding floating-point output, and that output is by default mapped onto the INT16 discrete domain to give the output integer value. In short, a lookup table can be built that maps each input INT8 value directly to its integer result; this is called the table look-up method.
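A hedged sketch of the table look-up method for an INT8 input domain; the input scaling factor, the INT16 output bound, and the helper names are assumptions made for illustration:

    import numpy as np

    def build_table(float_op, in_scale, out_bits=16):
        """For every INT8 input value, evaluate the original floating-point operator
        and map the result onto the INT(out_bits) discrete domain."""
        out_bound = 2 ** (out_bits - 1) - 1                 # 32767 for INT16
        ints = np.arange(-127, 128)                         # INT8 input domain
        floats = float_op(ints / in_scale)                  # original floating-point outputs
        out_scale = out_bound / np.max(np.abs(floats))
        table = np.clip(np.round(floats * out_scale), -out_bound, out_bound).astype(np.int16)
        return table, out_scale

    # usage: the quantized operator becomes a single index instruction into TABLE
    TABLE, out_scale = build_table(np.tanh, in_scale=32.0)
    x_int8 = 40                                             # represents 40 / 32.0 = 1.25
    y_int16 = TABLE[x_int8 + 127]                           # index instruction
    print(y_int16 / out_scale, np.tanh(1.25))               # ~0.848 both ways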
The softmax operator processor configures the maximum precision to be 16.
The softmax operator processor configures the quantization stage to combine the table look-up method with integer arithmetic. The logical abstraction of the operator is Y[i] = exp(X[i]) / sum(exp(X[j]), j in X); the exp term is first quantized by the exp operator's table look-up method, after which integer addition and division are applied. Because the floating-point division in the original expression is expected to become round-to-nearest after fixed-pointing, half of the denominator must be added to the numerator when converting to integer division, so the quantized expression is Y[i] = round(TABLE(X[i]) / TOTAL) = (TABLE(X[i]) + TOTAL/2) / TOTAL, where TOTAL = sum(TABLE(X[j]), j in X); other stages use the default operation.
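The add-half-the-denominator trick above turns integer division into round-to-nearest division. A short self-contained check follows; the three-entry table and the extra x1000 output scale are illustrative assumptions, not values from the patent:

    # integer round-to-nearest division as used in the quantized softmax:
    # round(n / d) == (n + d // 2) // d   for non-negative integers n and d > 0
    def div_round_nearest(n, d):
        return (n + d // 2) // d

    # usage with a toy exp lookup table
    table = [12, 90, 665]                 # TABLE(X[i]) for three inputs
    total = sum(table)                    # TOTAL = 767
    probs = [div_round_nearest(t * 1000, total) for t in table]
    print(probs)                          # [16, 117, 867] ~ softmax probabilities * 1000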
The convolution (Convolution) and matrix multiplication operator processors configure the maximum precision to be 8.
The convolution and matrix multiplication processors configure the checking stage to support only the 2D-NCHW input format.
The convolution and matrix multiplication processors configure the preparation stage and perform the matrix decomposition method: vector multiplication produces a large number of multiply-accumulate operations; the product of two INT8 inputs is INT16, and under a 32-bit storage representation, accumulating more than K > 65536 such products can theoretically overflow. When the K dimension of a matrix product A x B meets this condition, a matrix decomposition operation is required to convert the large matrix multiplication operator into several small matrix operators whose results are merged by addition.
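A hedged sketch of splitting the reduction dimension K so that every partial product fits in an INT32 accumulator; the chunk size, helper names, and the use of numpy to model integer matrix products are assumptions made for illustration:

    import numpy as np

    def split_matmul(A, B, max_k=65536):
        """Compute A @ B with INT8 inputs by splitting the K dimension so that each
        partial product stays inside INT32; the partial results are then merged by
        addition (in a wider type here, or after re-quantization in the device)."""
        K = A.shape[1]
        parts = []
        for s in range(0, K, max_k):
            part = A[:, s:s + max_k].astype(np.int32) @ B[s:s + max_k, :].astype(np.int32)
            parts.append(part.astype(np.int64))
        return sum(parts)

    # usage: INT8 matrices whose inner dimension exceeds 65536
    rng = np.random.default_rng(0)
    A = rng.integers(-127, 128, size=(4, 100000), dtype=np.int8)
    B = rng.integers(-127, 128, size=(100000, 3), dtype=np.int8)
    ref = A.astype(np.int64) @ B.astype(np.int64)
    print(np.array_equal(split_matmul(A, B), ref))   # True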
The convolution and matrix multiplication processors configure the quantization stage. The original mathematical expression is Y = X x W + B. Suppose X = Xi x sx and W = Wi x sw, where Xi and Wi are integer data and sx and sw are scaling factors. The original expression is then equivalent to Y = (Xi x Wi) x (sx x sw) + B; letting the scaling factor of the bias B be sx x sw, i.e., B = Bi x (sx x sw), the first term is an integer convolution and sx x sw is the scaling factor carried by the convolution output. The original floating-point convolution and matrix multiplication operations can thus be equivalently converted into integer equivalent operators, where the scaling factors of the inputs X, W, and B are sx, sw, and sx x sw respectively, and the output scaling factor is sx x sw.
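A numerical check of the equivalence Y = (Xi x Wi) x (sx x sw) + Bi x (sx x sw); the example scaling factors and random integer data are assumptions made for illustration:

    import numpy as np

    rng = np.random.default_rng(1)
    sx, sw = 25.0, 50.0                         # assumed scaling factors of X and W
    Xi = rng.integers(-127, 128, size=(2, 3))   # integer data
    Wi = rng.integers(-127, 128, size=(3, 4))
    Bi = rng.integers(-1000, 1000, size=(4,))   # bias quantized with factor sx*sw
    X, W, B = Xi / sx, Wi / sw, Bi / (sx * sw)

    Y_float = X @ W + B                         # original floating-point operator
    Y_int = (Xi @ Wi + Bi) / (sx * sw)          # integer operator plus output scaling factor
    print(np.allclose(Y_float, Y_int))          # True: the two forms are identical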
The normalization (BatchNorm) operator processor configures the preparation stage. The logical abstraction of the operator is Y = (X - mean) / var x gamma + beta = X x (gamma / var) + (beta - mean x gamma / var) = X x A + B, i.e., the normalization operation can be equivalently converted into matrix multiplication and addition; other stages need no configuration.
The normalization (BatchNorm) operator processor configures the preparation stage. If the input data is the result of a convolution operation, the expression above can be written as Y = X x A + B = (D x W + b) x A + B = D x (W x A) + (b x A + B), which is a convolution with weight W x A and bias b x A + B; that is, the normalization operation can be merged into the convolution operation.
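A numerical check of folding the normalization into the preceding convolution; a dense (1x1) form is assumed for brevity, and the variance term is used exactly as written above, without the epsilon or square root that concrete frameworks may add:

    import numpy as np

    rng = np.random.default_rng(2)
    D = rng.standard_normal((5, 3))           # input to the convolution/dense layer
    W = rng.standard_normal((3, 4))           # convolution weight
    b = rng.standard_normal(4)                # convolution bias
    gamma, beta = rng.standard_normal(4), rng.standard_normal(4)
    mean, var = rng.standard_normal(4), rng.random(4) + 0.5

    # unfused: convolution followed by normalization Y = (X - mean)/var * gamma + beta
    X = D @ W + b
    Y_ref = (X - mean) / var * gamma + beta

    # fused: a single convolution with weight W*A and bias b*A + B
    A = gamma / var
    B = beta - mean * A
    Y_fused = D @ (W * A) + (b * A + B)
    print(np.allclose(Y_ref, Y_fused))        # True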
The auto-broadcast matrix multiplication (broadcast multiply) operator processor configures the input precision to be 16; other stages use the default operation.
The dimension addition (sum over axis) operator processor configures the input precision to be 8; other stages use the default operation.
The matrix addition and subtraction operator processors configure the input precision to be 16; other stages use the default operation.
The auto-broadcast matrix addition and subtraction (broadcast_add, broadcast_sub) operator processors configure the input precision to be 16; other stages use the default operation.
The matrix concatenation (concatenate) operator processor does not limit the input precision; other stages use the default operation.
The embedding (Embedding) operator processor does not limit the input precision; other stages use the default operation.
The maximum-value pooling (max pooling) operator processor does not limit the input precision; other stages use the default operation.
The average-value pooling (average pooling) operator processor configures the checking stage so that, when the pooling kernel window slides, the computed average always includes the peripheral padding cells.
The average-value pooling (average pooling) operator processor configures the preparation stage. The logical abstraction of the operator is Y = sum{kernel(X)} / size_of_kernel, which is equivalently converted into a convolution whose kernel has the same size and whose values are all 1 / size_of_kernel; other stages need no configuration.
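A numerical check that average pooling equals a convolution whose kernel values are all 1 / size_of_kernel; a plain 2-D case with stride equal to the kernel size and no padding is assumed for brevity:

    import numpy as np

    def avg_pool2d(x, k):
        h, w = x.shape[0] // k, x.shape[1] // k
        return x[:h * k, :w * k].reshape(h, k, w, k).mean(axis=(1, 3))

    def conv_as_avg_pool(x, k):
        kernel = np.full((k, k), 1.0 / (k * k))      # all values are 1 / size_of_kernel
        h, w = x.shape[0] // k, x.shape[1] // k
        out = np.empty((h, w))
        for i in range(h):
            for j in range(w):
                out[i, j] = np.sum(x[i * k:(i + 1) * k, j * k:(j + 1) * k] * kernel)
        return out

    x = np.arange(36, dtype=float).reshape(6, 6)
    print(np.allclose(avg_pool2d(x, 2), conv_as_avg_pool(x, 2)))   # True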
The matrix slicing (slice) and clipping (clip) operator processors do not limit the input precision; other stages use the default operation.
The matrix negation (negative), dimension repetition (repeat), and tiling (tile) operator processors do not limit the input precision; other stages use the default operation.
The dimension expansion (expand dims), squeeze (squeeze), and reshape (reshape) operator processors do not limit the input precision; other stages use the default operation.
The transpose (transpose) operator processor does not limit the input precision; in the preparation stage, if the input data is the result of a transpose operation, the two transposes can be merged; other stages use the default operation.
The flatten (flatten), maximum (max), minimum (min), and upsampling (upsampling) operator processors do not limit the input precision; other stages use the default operation.
Although embodiments of the present application are described above, the description is intended only to aid understanding of the present application and is not intended to limit it. Any person skilled in the art to which this application pertains may make modifications and variations in the form and details of implementation without departing from the spirit and scope of the disclosure, but the scope of protection of this application shall remain subject to the scope of the appended claims.

Claims (12)

1. A fixed-point neural network model quantization apparatus, comprising:
a model memory: configured to store at least one model;
a data memory: configured to store data;
an operator fixed-point processor: configured to execute at least one program to convert operators in the neural network to fixed point;
and a central processor: configured to read the model and the data from the model memory and the data memory, invoke the corresponding operators in the operator fixed-point processor, and collect statistics on the output of each operator while actually executing the sample data; wherein the central processor comprises the following program units:
a reading program unit, which reads the model from the model memory and the sample data from the data memory;
a checking program unit, which topologically sorts the operators of the model and, in that order, invokes the checking configuration of the corresponding operators in the processor;
a preparation program unit, which topologically sorts the operators of the model and, in that order, invokes the preparation configuration of the corresponding operators in the processor;
a calibration program unit, which, based on the read sample data, collects statistics on the output of each internal operator while actually executing the sample data;
and a quantization program unit, which topologically sorts the operators of the model and, in that order, invokes the quantization configuration of the corresponding operators in the operator fixed-point processor.
2. The neural network model quantization device of claim 1, further comprising a re-quantization device configured to perform an integer data precision re-quantization procedure.
3. The neural network model quantization apparatus of claim 1 or 2, wherein each operator is configured as a pluggable operator execution processor.
4. The neural network model quantization device according to claim 1 or 2, wherein the tanh, sigmoid, and exp operator processors configure the quantization stage to map the original floating-point numbers corresponding to the input integer data onto the discrete domain INT16 by a table look-up method and to build an index table, and the fixed-point processor transforms the operator into an index instruction plus the index table.
5. The neural network model quantization device according to claim 1 or 2, wherein the softmax operator processor configures the quantization stage to combine the table look-up method with integer arithmetic, i.e., quantization is first performed by the operator's table look-up method and is then followed by integer addition and division; since the floating-point division in the original expression is expected to become round-to-nearest after fixed-pointing, half of the denominator must be added to the numerator when converting to integer division.
6. The neural network model quantization device of claim 1 or 2, wherein the convolution and matrix multiplication processors configure the checking stage to support only the 2D-NCHW input format; the convolution and matrix multiplication processors configure the preparation stage so that, when the result of a matrix multiplication would overflow, a matrix decomposition operation is performed and the large matrix multiplication operator is converted into several small matrix operators whose results are merged by addition; and the convolution and matrix multiplication processors configure the quantization stage so that the original floating-point convolution and matrix multiplication operations are equivalently converted into integer equivalent operators.
7. The neural network model quantization apparatus of claim 1 or 2, wherein the normalization operator processor configures the preparation stage so that the normalization operation is equivalently converted into matrix multiplication and addition; or
the normalization operator processor configures the preparation stage so that, if the input data is the result of a convolution operation, the normalization operation is merged into the convolution operation.
8. The neural network model quantization device of claim 1 or 2, wherein the relu operator processor does not limit the input precision; in the preparation stage, if the child node is a transpose operation, the node and the child node are swapped in order; other stages use the default operation.
9. The neural network model quantization apparatus of claim 1 or 2, wherein the auto-broadcast matrix multiplication operator processor configures the input precision to be 16, and other stages use the default operation; or
the tanh, sigmoid, and exp operator processors configure the maximum input precision to be 16; or
the softmax operator processor configures the maximum precision to be 16; or
the convolution and matrix multiplication processors configure the maximum precision to be 8; or
the dimension addition operator processor configures the input precision to be 8, and other stages use the default operation; or
the matrix addition and subtraction operator processors configure the input precision to be 16, and other stages use the default operation; or
the auto-broadcast matrix addition and subtraction operator processors configure the input precision to be 16, and other stages use the default operation; or
the matrix concatenation operator processor does not limit the input precision, and other stages use the default operation; or
the embedding operator processor does not limit the input precision, and other stages use the default operation; or
the maximum-value pooling operator processor does not limit the input precision, and other stages use the default operation; or
the average-value pooling operator processor configures the checking stage so that, when the pooling kernel window slides, the computed average always includes the peripheral padding cells; or
the matrix slice and clip operator processors do not limit the input precision, and other stages use the default operation; or
the matrix negation, dimension repetition, and tiling operator processors do not limit the input precision, and other stages use the default operation; or
the dimension expansion, dimension squeeze, and reshape operator processors do not limit the input precision, and other stages use the default operation; or
the transpose operator processor does not limit the input precision; in the preparation stage, if the input data is the result of a transpose operation, the two transposes can be merged; other stages use the default operation; or
the flatten, maximum, minimum, and upsampling operator processors do not limit the input precision, and other stages use the default operation.
10. A fixed-point neural network model quantization method using the neural network model quantization apparatus according to any one of claims 1 to 9, comprising the following stages:
checking stage: verify that the graph model is a directed acyclic graph and convert a multi-input graph model into a single-input model;
preparation stage: apply equivalent transformations to the graph model to facilitate subsequent quantization;
calibration stage: feed all samples through the floating-point model, collect statistics on the output of every operator in the model, and predict the likely output threshold of each operator over all samples from the characteristics of the output data;
quantization stage: convert the operators to fixed point in the topological order of the model.
11. The neural network model quantization method of claim 10, further comprising a re-quantization stage in which a maximum input-data precision is set for each operator, and the input data is reduced in precision when the precision of the operator's input data is greater than the set maximum precision.
12. A computer-readable medium storing instructions for causing a computer to perform the neural network model quantization method of claim 10 or 11.
CN201911174616.XA 2019-11-26 2019-11-26 Fixed-point neural network model quantification device and method Active CN110929862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911174616.XA CN110929862B (en) 2019-11-26 2019-11-26 Fixed-point neural network model quantification device and method


Publications (2)

Publication Number Publication Date
CN110929862A CN110929862A (en) 2020-03-27
CN110929862B (en) 2023-08-01

Family

ID=69852012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911174616.XA Active CN110929862B (en) 2019-11-26 2019-11-26 Fixed-point neural network model quantification device and method

Country Status (1)

Country Link
CN (1) CN110929862B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468935B (en) * 2020-05-08 2024-04-02 上海齐感电子信息科技有限公司 Face recognition method
CN112200296B (en) * 2020-07-31 2024-04-05 星宸科技股份有限公司 Network model quantization method and device, storage medium and electronic equipment
CN114492778A (en) * 2022-02-16 2022-05-13 安谋科技(中国)有限公司 Operation method of neural network model, readable medium and electronic device
CN115019150B (en) * 2022-08-03 2022-11-04 深圳比特微电子科技有限公司 Target detection fixed point model establishing method and device and readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105760933A (en) * 2016-02-18 2016-07-13 清华大学 Method and apparatus for fixed-pointing layer-wise variable precision in convolutional neural network
CN109409514A (en) * 2018-11-02 2019-03-01 广州市百果园信息技术有限公司 Fixed-point calculation method, apparatus, equipment and the storage medium of convolutional neural networks
CN109697083A (en) * 2018-12-27 2019-04-30 深圳云天励飞技术有限公司 Fixed point accelerated method, device, electronic equipment and the storage medium of data
CN109902745A (en) * 2019-03-01 2019-06-18 成都康乔电子有限责任公司 A kind of low precision training based on CNN and 8 integers quantization inference methods
CN110084739A (en) * 2019-03-28 2019-08-02 东南大学 A kind of parallel acceleration system of FPGA of the picture quality enhancement algorithm based on CNN
CN110135580A (en) * 2019-04-26 2019-08-16 华中科技大学 A kind of full integer quantization method and its application method of convolutional network
CN110163359A (en) * 2018-02-13 2019-08-23 上海寒武纪信息科技有限公司 A kind of computing device and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11562208B2 (en) * 2018-05-17 2023-01-24 Qualcomm Incorporated Continuous relaxation of quantization for discretized deep neural networks


Also Published As

Publication number Publication date
CN110929862A (en) 2020-03-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant