CN114372553A - Neural network quantization method and device - Google Patents

Neural network quantization method and device

Info

Publication number
CN114372553A
CN114372553A (application CN202111418068.8A)
Authority
CN
China
Prior art keywords
quantization
neural network
input data
network model
layer
Prior art date
Legal status
Pending
Application number
CN202111418068.8A
Other languages
Chinese (zh)
Inventor
张书瑞 (Zhang Shurui)
欧阳鹏 (Ouyang Peng)
Current Assignee
Beijing Qingwei Intelligent Information Technology Co., Ltd.
Original Assignee
Beijing Qingwei Intelligent Information Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Beijing Qingwei Intelligent Information Technology Co., Ltd.
Priority to CN202111418068.8A
Publication of CN114372553A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a neural network quantization method and device. The method comprises the following steps: quantizing the weight parameters of each layer of a neural network model to obtain quantization coefficients of the weight parameters; quantizing the input data and the output data of each layer of the neural network model to obtain a quantization coefficient of the input data and a quantization coefficient of the output data; and, for each layer of the network, determining the fixed-point value corresponding to each quantization coefficient, completing the quantization of the neural network model. The invention adopts a direct quantization method: the weight parameters, the input data and the output data of each network layer are quantized directly, and the quantized network model is obtained without retraining or other operations during quantization, which effectively improves quantization efficiency.

Description

Neural network quantization method and device
Technical Field
The invention relates to the technical field of deep learning, and in particular to a neural network quantization method and device.
Background
The rapid development of neural network technology has brought great convenience to daily life, and many applications built on neural networks are now part of everyday use. However, the ever-growing size of neural network models limits the scenarios in which they can be deployed. Quantizing a pre-trained neural network model is therefore of great significance: it reduces the scale of the model, improves the forward inference speed of the neural network, and widens its range of application scenarios. However, existing quantization methods suffer from problems such as insufficient model precision and low quantization efficiency.
Disclosure of Invention
In order to solve the problems of insufficient precision and low quantization efficiency in existing quantization methods, embodiments of the invention provide a neural network quantization method and device. The technical scheme is as follows:
in a first aspect, a quantization method of a neural network is provided, the method including:
quantizing the weight parameters of each layer of the neural network model to obtain quantization coefficients of the weight parameters;
quantizing the input data and the output data of each layer of the neural network model to obtain a quantization coefficient of the input data and a quantization coefficient of the output data;
and for each layer of network, determining a fixed point value corresponding to each quantization coefficient, and completing the quantization of the neural network model.
Optionally, before the step of quantizing the input data and the output data of each layer of the neural network model, the method includes:
and selecting partial data from the training set or the test set as the input data for quantizing the neural network model.
Optionally, the step of quantizing the weight parameter of each layer of the neural network model includes:
and quantizing the weight parameters of each layer of the neural network model by adopting a linear quantization mode.
Optionally, the step of quantizing input data of each layer of the neural network model includes:
running the network forward and counting the maximum absolute value of the input data;
determining a quantization interval of the input data based on the maximum value;
constructing a histogram of length n1 from the floating-point values of the input data, based on the quantization interval;
cyclically traversing candidate quantization thresholds, reducing the histogram of length n1 to a histogram of length n2 for each threshold, calculating a divergence value, and taking the threshold with the smallest divergence value as the optimal threshold, where n1 is larger than n2;
and determining the quantization coefficient of the input data based on the optimal threshold and the quantization interval.
Optionally, when the neural network model includes a saturating activation function, the input data of the saturating activation function is quantized using a saturation-truncation quantization method.
Optionally, when the neural network model includes a classification function, the input data of the classification function is quantized using a translation-based quantization method.
In a second aspect, an apparatus for quantizing a neural network is provided, the apparatus comprising:
the quantization module is used for quantizing the weight parameters of each layer of the neural network model to obtain quantization coefficients of the weight parameters;
the quantization module is further configured to quantize input data and output data of each layer of the neural network model to obtain a quantization coefficient of the input data and a quantization coefficient of the output data;
and the fixed point module is used for determining a fixed point value corresponding to each quantization coefficient for each layer of network to finish the quantization of the neural network model.
Optionally, the quantization module is further configured to select a part of data from a training set or a test set as input data for quantizing the neural network model.
Optionally, the quantization module is specifically configured to quantize a weight parameter of each layer of the neural network model in a linear quantization manner.
Optionally, the quantization module is specifically configured to:
running the network forward and counting the maximum absolute value of the input data;
determining a quantization interval of the input data based on the maximum value;
constructing a histogram of length n1 from the floating-point values of the input data, based on the quantization interval;
cyclically traversing candidate quantization thresholds, reducing the histogram of length n1 to a histogram of length n2 for each threshold, calculating a divergence value, and taking the threshold with the smallest divergence value as the optimal threshold, where n1 is larger than n2;
and determining the quantization coefficient of the input data based on the optimal threshold and the quantization interval.
Optionally, the quantization module is specifically configured to, when the neural network model includes a saturating activation function, quantize the input data of the saturating activation function using a saturation-truncation quantization method.
Optionally, the quantization module is specifically configured to, when the neural network model includes a classification function, quantize the input data of the classification function using a translation-based quantization method.
In a third aspect, an electronic device is provided, which includes a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the method for quantizing a neural network according to the first aspect when executing a program stored in the memory.
The embodiment of the invention adopts a direct quantization method: the weight parameters, the input data and the output data of each layer of the neural network model are quantized directly, and the quantized network model is obtained without retraining or other operations during quantization, which effectively improves quantization efficiency. In addition, a saturation-truncation quantization method is adopted for activation functions with a saturation threshold, which reduces the amount of computation while also improving quantization precision.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flowchart of a neural network quantization method according to an embodiment of the present invention;
fig. 2 is a block diagram of a quantization apparatus of a neural network according to an embodiment of the present invention;
fig. 3 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Referring to fig. 1, the quantization method of a neural network according to an embodiment of the present invention may specifically include the following steps.
Step 101, quantizing the weight parameters of each layer of the neural network model to obtain quantization coefficients of the weight parameters.
An unquantized neural network model is usually of floating-point type; weight quantization quantizes the weight parameters of the floating-point neural network model to obtain quantization coefficients of the weight parameters. Since the distribution range of the weight parameters in a floating-point neural network model is relatively fixed, the weight parameters can be quantized in a linear quantization manner.
The model quantization calculation process is illustrated below using a convolutional layer as an example. The convolutional layer before quantization is represented as:

O_f = ∑{W_f * I_f} + Bias^(c)    (1)

where O_f represents the output data, W_f the weight parameters, I_f the input data, and Bias^(c) the bias parameter.

After weight quantization, the convolutional layer can be represented as:

O_f = S^(c) * ∑{W_int8 * I_f} + Bias^(c)    (2)

where S^(c) represents the quantization coefficient of the weight parameters and W_int8 the quantized weight parameters.
The step of quantizing the weight parameters may include: first, calculating the maximum absolute value of the weight parameters; second, mapping that maximum to a preset value, where the preset value is the maximum of the quantized data range (for example, 127 when quantizing to 8 bits); and then performing linear quantization to calculate the quantization coefficient of the weight parameters.
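As an illustration, the following is a minimal Python/NumPy sketch of this linear weight quantization; the function and variable names are ours, not the patent's:

```python
import numpy as np

def quantize_weights(w_float, num_bits=8):
    """Minimal sketch of linear (symmetric) weight quantization: the
    maximum absolute weight is mapped onto the preset value (127 for
    8 bits), and the resulting scale is the quantization coefficient,
    so that w_float is approximately scale * w_q."""
    q_max = 2 ** (num_bits - 1) - 1                   # preset value, e.g. 127
    max_abs = float(np.max(np.abs(w_float)))          # step 1: max |w|
    scale = max_abs / q_max if max_abs > 0 else 1.0   # quantization coefficient S
    w_q = np.clip(np.round(w_float / scale), -q_max, q_max).astype(np.int32)
    return w_q, scale

# usage on a hypothetical convolution kernel
w = np.random.randn(64, 3, 3, 3).astype(np.float32)
w_int8, s_c = quantize_weights(w)   # s_c plays the role of S^(c) above
```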
The quantization method in the embodiment of the invention supports quantization to any bit width. For the case where the input and output values of the neural network model may be positive or negative, the quantized data range is [-64, 63] at 4 bits, [-128, 127] at 8 bits, and [-256, 255] at 16 bits; for the case where the inputs and outputs are all positive, the quantized data range is [0, 127] at 4 bits, [0, 255] at 8 bits, and [0, 511] at 16 bits. This flexible bit-width configuration allows an appropriate quantization bit width to be selected according to the storage capacity of the device and the precision requirements of the usage scenario, giving good adaptability and flexibility.
Step 102, quantizing the input data and the output data of each layer of the neural network model to obtain a quantization coefficient of the input data and a quantization coefficient of the output data.
In practice, a small amount of data may be selected from the training set or the test set as the input data for quantizing the neural network model. For example, if the training set contains 100,000 pictures, 1,000 of them can be selected and fed into the model, and a forward run over this data is used for quantization. The embodiment of the invention places no strict requirement on the size of the data set: selecting only part of the data is enough to complete quantization, which reduces data dependence and greatly improves the computational efficiency of the quantization process.
The model quantization calculation process continues with the convolutional layer example. When the input data I_f are not all positive, int8 quantization gives:

I_f = S_pre * I_8bit    (3)

where I_8bit represents the quantized input data and S_pre the quantization coefficient of the input data.
Substituting the quantized input data, formula (3), into the weight-quantized convolutional layer, formula (2), yields formula (4):

O_f = S^(c) * S_pre * ∑{W_int8 * I_8bit} + Bias^(c)    (4)
for output data
Figure BDA0003375847480000053
Carrying out quantization processing to obtain formula (5):
Figure BDA0003375847480000054
wherein, O8bitRepresenting the output data after quantization, ScurRepresenting the quantized coefficients of the output data.
Formula (5) is further simplified by absorbing the input and output coefficients into the layer coefficients, i.e., letting S^(c) ← S^(c) * S_pre / S_cur and Bias^(c) ← Bias^(c) / S_cur, which gives the quantized convolutional layer:

O_8bit = S^(c) * ∑{W_int8 * I_8bit} + Bias^(c)    (6)
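As a toy numerical check of the derivation from formula (1) to formula (6), the following sketch may help; all values are synthetic, and choosing S_cur from a single output value is purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
w_f = rng.standard_normal(16).astype(np.float32)   # W_f: float weights
i_f = rng.standard_normal(16).astype(np.float32)   # I_f: float inputs
bias = 0.3                                         # Bias^(c)

s_c = np.abs(w_f).max() / 127                      # S^(c), weight coefficient
s_pre = np.abs(i_f).max() / 127                    # S_pre, input coefficient
w_int8 = np.round(w_f / s_c)                       # W_int8
i_8bit = np.round(i_f / s_pre)                     # I_8bit

o_f = float(np.dot(w_f, i_f) + bias)               # formula (1)
s_cur = abs(o_f) / 127                             # S_cur (toy choice, one value)

# formula (6): absorb S_pre and S_cur into the layer coefficients
s_fold = s_c * s_pre / s_cur
bias_fold = bias / s_cur
o_8bit = s_fold * np.dot(w_int8, i_8bit) + bias_fold

print(o_8bit, o_f / s_cur)   # nearly equal, up to rounding error
```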
in an implementation, the step of quantizing the input data of each layer of the neural network model may include: calculating before running, and counting the maximum value of the absolute value of input data; determining a quantization interval of the input data based on the maximum value; constructing a histogram of length n1 based on the quantization intervals for the floating point values of the input data; circularly traversing the quantization threshold values, changing the histogram with the length of n1 into the histogram with the length of n2 for each threshold value, calculating a divergence value, and taking the threshold value with the smallest divergence value as an optimal threshold value, wherein n1 is larger than n 2; and determining a quantization coefficient of the input data based on the optimal threshold and the quantization interval.
In implementation, the Kullback-Leibler (KL) divergence can be used to determine the quantization coefficients of the input data and the output data; accordingly, the divergence value calculated in the input-quantization step above is the KL divergence.
For example, suppose the input data of the convolutional layer is quantized to 8 bits. The step of quantizing the input data of the convolutional layer then comprises: running the network forward, for example over the 1,000 selected pictures, and counting the maximum absolute value max of the input data; determining the quantization interval dist_scale of the input data based on max, where dist_scale equals max divided by 2048; constructing a histogram of length 2048 from the floating-point values of the input data, based on the quantization interval; cyclically traversing the quantization threshold th over the interval [128, 2048], reducing the length-2048 histogram to a length-128 histogram for each th, calculating the divergence value, and taking the threshold with the smallest divergence as the optimal threshold target_th; and determining the quantization coefficient scale of the input data based on target_th and dist_scale, where

scale = (target_th + 0.5) * (dist_scale / 127)
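For reference, here is a simplified Python sketch of this divergence-based threshold search. All names are illustrative, and the re-binning spreads each collapsed bin back uniformly rather than performing the zero-bin bookkeeping of a full KL calibration:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) between histogram counts; both are normalized first."""
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + 1e-12))))

def kl_threshold_search(hist, n2=128):
    """For each candidate threshold th: saturate the tail into bin th-1,
    collapse the first th bins down to n2 bins, spread them back
    uniformly, and score against the clipped reference with the KL
    divergence; the threshold with the smallest divergence wins."""
    n1 = len(hist)
    best_th, best_kl = n1, float("inf")
    for th in range(n2, n1 + 1):
        p = hist[:th].astype(np.float64)
        p[-1] += hist[th:].sum()                 # saturate the clipped tail
        edges = np.linspace(0, th, n2 + 1).astype(int)
        q = np.zeros(th)
        for i in range(n2):
            lo, hi = edges[i], edges[i + 1]
            q[lo:hi] = hist[lo:hi].sum() / max(hi - lo, 1)
        if q.sum() == 0:
            continue
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best_th = kl, th
    return best_th

# usage, following the 8-bit example above (stand-in activations)
samples = np.abs(np.random.randn(100_000)).astype(np.float32)
max_abs = samples.max()
hist, _ = np.histogram(samples, bins=2048, range=(0, max_abs))
target_th = kl_threshold_search(hist)
scale = (target_th + 0.5) * (max_abs / 2048) / 127
```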
The quantization method provided by the embodiment of the invention can be applied to a neural network voice wake-up model. A voice wake-up network model mainly consists of a recurrent neural network and a classification function, where the recurrent network is typically one of the three common variants RNN, GRU, or LSTM. If the activation function in the recurrent neural network is a saturating activation function, such as tanh or sigmoid, its calculation result has a saturation threshold; therefore, when the neural network model includes a saturating activation function, the input data of that function can be quantized using a saturation-truncation quantization method.
For a saturating activation function, when the input data exceeds the right convergence value, the function saturates to the right, i.e., its derivative is zero or tends to zero; when the input data falls below the left convergence value, the function saturates to the left in the same sense. Saturation-truncation quantization of such a function may proceed as follows: determine the left and right convergence values of the activation function; determine the maximum absolute value of the input data within the interval between them; map this maximum to a preset value, for example 127; and compute the quantization coefficient of the input data. Saturation truncation does not need to quantize all of the input data, and for the same quantization interval, quantizing a narrower range of data improves its expressive capacity; the method therefore both reduces the amount of computation and maintains higher model precision.
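A minimal sketch of saturation truncation for the input of a tanh layer follows; the convergence bounds of ±8 are an assumption for illustration (tanh is numerically saturated well inside |x| = 8):

```python
import numpy as np

def saturating_input_scale(x, left=-8.0, right=8.0, q_max=127):
    """Saturation-truncation sketch: only the part of the input that
    falls inside the activation's convergence interval matters, so the
    quantization coefficient is computed from the truncated data."""
    clipped = np.clip(x, left, right)        # truncate at the convergence values
    max_abs = float(np.abs(clipped).max())   # max |x| inside the interval
    return max_abs / q_max                   # map it onto the preset value

# usage: inputs of a tanh layer
x = (np.random.randn(4096) * 6.0).astype(np.float32)
scale = saturating_input_scale(x)
x_int8 = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
```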
For a common classification function such as softmax, the absolute values of the calculation results are not meaningful in themselves; what matters is their relative values. Therefore, when the neural network model includes a classification function, the translation invariance of the classification function can be exploited: the input data of the classification function is quantized using a translation-based quantization method, which preserves the quantization precision of the classification function.
The process of quantizing the input data of the classification function with the translation-based method may include: counting the maximum absolute value of each row of the input data matrix, mapping each row's maximum absolute value to a preset value such as 127, and calculating the quantization coefficient of each row; counting the maximum value of each row of input data; setting the offset of each row to the preset value minus that row's maximum; and translating each row, i.e., replacing each row of input data with the current data plus its corresponding offset.
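A sketch of this translation-based quantization follows; the direction of the offset (preset value minus row maximum, so that each quantized row tops out at 127) is our reading of the steps above, and all names are illustrative:

```python
import numpy as np

def quantize_softmax_input(x, q_max=127):
    """Translation-based quantization sketch for a softmax input matrix:
    each row gets its own linear quantization coefficient, then is
    shifted so that its quantized row maximum lands on q_max."""
    scales = np.abs(x).max(axis=1, keepdims=True) / q_max   # per-row coefficient
    x_q = np.round(x / np.maximum(scales, 1e-12))           # per-row quantization
    offsets = q_max - x_q.max(axis=1, keepdims=True)        # preset minus row max
    return (x_q + offsets).astype(np.int32), scales

# shift invariance: softmax(x + c) == softmax(x) row-wise, so the offsets
# leave the classification result unchanged (up to quantization error)
logits = np.random.randn(4, 10).astype(np.float32)
q_logits, s = quantize_softmax_input(logits)
```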
The process of quantizing the output data of each layer of network is similar to the process of quantizing the input data, and is not described herein again.
Step 103, for each layer of the network, determining the fixed-point value corresponding to each quantization coefficient, completing the quantization of the neural network model.
In the quantized convolutional layer of formula (6), S^(c) and Bias^(c) are floating-point values and must be converted to fixed-point values for computation on a chip; for example, quantization is completed by converting the quantization coefficients to 8-bit or 16-bit fixed point.
Taking 16-bit fixed-point conversion of the quantization coefficients in formula (6) as an example, the fixed-point procedure is as follows.
The result of ∑{W_int8 * I_8bit} is stored with saturation as Sum_int32. A 16-bit fixed-point conversion is applied to S^(c), recording the fixed-point shift as q_scale, so that the fixed-point value is

S_int16 = round(S^(c) * 2^q_scale)

In hardware, the 16-bit by 32-bit product S_int16 * Sum_int32 is too expensive, so Sum_int32 can be further reduced to Sum_int16 by recording a shift q_sum, after which the product S_int16 * Sum_int16 is computed instead.
A 16-bit fixed-point conversion is applied to (S^(c) * ∑{W_int8 * I_8bit} + Bias^(c)) and to Bias^(c), recording the shift as q_rst.
Running the network forward, the maximum absolute value of ∑{W_int8 * I_8bit} is counted and recorded as Max_sum; the maximum absolute value of (S^(c) * ∑{W_int8 * I_8bit} + Bias^(c)) is recorded as Max_rst; the maximum of S^(c) is recorded as Max_scale; and the maximum of Bias^(c) is recorded as Max_bias.

A 16-bit left-shift fixed point is applied to Max_scale to obtain q_scale; a 16-bit left-shift fixed point is applied to max(Max_rst, Max_bias) to obtain q_rst; and a 16-bit right-shift fixed point is applied to Max_sum to obtain q_sum. If Max_sum already lies within the int16 range, q_sum is 0.
Parameter preprocessing is then carried out:

S_int16 = round(S^(c) * 2^q_scale)
Bias_int16 = round(Bias^(c) * 2^q_rst)
q1 = q_sum;  q2 = q_scale - q_sum - q_rst;  q3 = q_rst
Fixed-point forward calculation:

Sum_int32 = Clip_int32(∑{W_int8 * I_8bit})
Sum_int16 = Clip_int16(Sum_int32 >> q1)
Rst_int32 = S_int16 * Sum_int16
O_8bit = Clip_int8(((Rst_int32 >> q2) + Bias_int16) >> q3)

or, equivalently,

O_8bit = Clip_int8((Rst_int32 + (Bias_int16 << q2)) >> (q2 + q3))
After the coefficients are converted to fixed point, the quantized fixed-point neural network model is obtained.
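A hedged Python/NumPy sketch of this fixed-point forward computation follows; the fixed-point parameters in the usage lines are illustrative assumptions, not values from the patent:

```python
import numpy as np

def clip_int(x, bits):
    """Saturate to a signed `bits`-wide integer range."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return np.clip(x, lo, hi).astype(np.int64)

def fixed_point_forward(w_int8, i_int8, s_int16, bias_int16, q1, q2, q3):
    """One output value of the fixed-point forward pass above, assuming
    s_int16 ~ S^(c) * 2^q_scale, bias_int16 ~ Bias^(c) * 2^q_rst, and
    non-negative shifts q1 = q_sum, q2 = q_scale - q_sum - q_rst, q3 = q_rst."""
    acc = w_int8.astype(np.int64) @ i_int8.astype(np.int64)  # sum{W_int8 * I_8bit}
    sum32 = clip_int(acc, 32)                                # Sum_int32
    sum16 = clip_int(sum32 >> q1, 16)                        # Sum_int16
    rst = (s_int16 * sum16) >> q2    # product brought to the 2^q_rst scale
    out = (rst + bias_int16) >> q3   # remove the remaining 2^q_rst factor
    return clip_int(out, 8)                                  # O_8bit

# illustrative parameters (not taken from the patent)
w = np.random.randint(-128, 128, size=16).astype(np.int8)
x = np.random.randint(-128, 128, size=16).astype(np.int8)
o = fixed_point_forward(w, x, s_int16=23170, bias_int16=512, q1=0, q2=7, q3=8)
```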
The embodiment of the invention adopts a direct quantization method: the weight parameters, the input data and the output data of each layer of the neural network model are quantized directly, and the quantized network model is obtained without retraining or other operations during quantization, which effectively improves quantization efficiency. In addition, a saturation-truncation quantization method is adopted for activation functions with a saturation threshold, which reduces the amount of computation while also improving quantization precision.
Referring to fig. 2, a block diagram of a neural network quantization apparatus according to an embodiment of the present invention is shown, where the apparatus includes:
a quantization module 201, configured to quantize a weight parameter of each layer in the neural network model to obtain a quantization coefficient of the weight parameter;
the quantization module 201 is further configured to quantize input data and output data of each layer of the neural network model to obtain a quantization coefficient of the input data and a quantization coefficient of the output data;
the fixed point module 202 is configured to determine, for each layer of the network, a fixed point value corresponding to each quantized coefficient, and complete quantization of the neural network model.
Preferably, the quantization module 201 is further configured to select a part of the data from a training set or a test set as the input data for quantizing the neural network model.
Preferably, the quantization module 201 is specifically configured to quantize the weight parameter of each layer in the neural network model in a linear quantization manner.
Preferably, the quantization module 201 is specifically configured to:
running the network forward and counting the maximum absolute value of the input data;
determining a quantization interval of the input data based on the maximum value;
constructing a histogram of length n1 from the floating-point values of the input data, based on the quantization interval;
cyclically traversing candidate quantization thresholds, reducing the histogram of length n1 to a histogram of length n2 for each threshold, calculating a divergence value, and taking the threshold with the smallest divergence value as the optimal threshold, where n1 is larger than n2;
and determining the quantization coefficient of the input data based on the optimal threshold and the quantization interval.
Preferably, the quantization module 201 is specifically configured to, when the neural network model includes a saturating activation function, quantize the input data of the saturating activation function using a saturation-truncation quantization method.
Preferably, the quantization module 201 is specifically configured to, when the neural network model includes a classification function, quantize the input data of the classification function using a translation-based quantization method.
The embodiment of the invention adopts a direct quantization method: the weight parameters, the input data and the output data of each layer of the neural network model are quantized directly, and the quantized network model is obtained without retraining or other operations during quantization, which effectively improves quantization efficiency. In addition, a saturation-truncation quantization method is adopted for activation functions with a saturation threshold, which reduces the amount of computation while also improving quantization precision.
An embodiment of the present invention further provides an electronic device, as shown in fig. 3, including a processor 001, a communication interface 002, a memory 003 and a communication bus 004, where the processor 001, the communication interface 002 and the memory 003 complete mutual communication through the communication bus 004,
a memory 003 for storing a computer program;
the processor 001 is configured to implement the method for quantizing a neural network when executing the program stored in the memory 003, and the method includes:
quantizing the weight parameters of each layer of the neural network model to obtain quantization coefficients of the weight parameters;
quantizing the input data and the output data of each layer of the neural network model to obtain a quantization coefficient of the input data and a quantization coefficient of the output data;
and for each layer of network, determining a fixed point value corresponding to each quantization coefficient, and completing the quantization of the neural network model.
The embodiment of the invention adopts a direct quantization method: the weight parameters, the input data and the output data of each layer of the neural network model are quantized directly, and the quantized network model is obtained without retraining or other operations during quantization, which effectively improves quantization efficiency. In addition, a saturation-truncation quantization method is adopted for activation functions with a saturation threshold, which reduces the amount of computation while also improving quantization precision.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, such as a Central Processing Unit (CPU) or a Network Processor (NP); it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions; when the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)).
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the embodiments of the apparatus and the electronic device, since they are substantially similar to the embodiments of the method, the description is simple, and the relevant points can be referred to only in the partial description of the embodiments of the method.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A quantization method of a neural network, the method comprising:
quantizing the weight parameters of each layer of the neural network model to obtain quantization coefficients of the weight parameters;
quantizing the input data and the output data of each layer of the neural network model to obtain a quantization coefficient of the input data and a quantization coefficient of the output data;
and for each layer of network, determining a fixed point value corresponding to each quantization coefficient, and completing the quantization of the neural network model.
2. The method of claim 1, wherein before the step of quantizing the input data and the output data of each layer of the neural network model, the method further comprises:
and selecting partial data from the training set or the test set as input data for quantifying the neural network model.
3. The method of claim 1, wherein the step of quantizing the weight parameters of each layer of the neural network model comprises:
and quantizing the weight parameters of each layer of the neural network model by adopting a linear quantization mode.
4. The method of claim 1, wherein the step of quantizing the input data of each layer of the neural network model comprises:
running the network forward and counting the maximum absolute value of the input data;
determining a quantization interval of the input data based on the maximum value;
constructing a histogram of length n1 from the floating-point values of the input data, based on the quantization interval;
cyclically traversing candidate quantization thresholds, reducing the histogram of length n1 to a histogram of length n2 for each threshold, calculating a divergence value, and taking the threshold with the smallest divergence value as the optimal threshold, where n1 is larger than n2;
and determining the quantization coefficient of the input data based on the optimal threshold and the quantization interval.
5. The method of claim 1, wherein when the neural network model includes a saturating activation function, the input data of the saturating activation function is quantized using a saturation-truncation quantization method.
6. The method of claim 1, wherein when the neural network model includes a classification function, the input data of the classification function is quantized using a translation-based quantization method.
7. An apparatus for quantization of a neural network, the apparatus comprising:
the quantization module is used for quantizing the weight parameters of each layer of the neural network model to obtain quantization coefficients of the weight parameters;
the quantization module is further configured to quantize input data and output data of each layer of the neural network model to obtain a quantization coefficient of the input data and a quantization coefficient of the output data;
and the fixed point module is used for determining a fixed point value corresponding to each quantization coefficient for each layer of network to finish the quantization of the neural network model.
8. The apparatus of claim 7, wherein the quantization module is further configured to select a portion of data from a training set or a test set as input data for quantizing the neural network model.
9. The apparatus according to claim 7, wherein the quantization module is specifically configured to quantize the weight parameter of each layer in the neural network model by using linear quantization.
10. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1-6 when executing a program stored in the memory.
CN202111418068.8A (priority date 2021-11-25; filing date 2021-11-25) Neural network quantization method and device; published as CN114372553A. Status: Pending.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111418068.8A CN114372553A (en) 2021-11-25 2021-11-25 Neural network quantization method and device


Publications (1)

Publication Number Publication Date
CN114372553A 2022-04-19

Family

ID=81139075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111418068.8A (pending) 2021-11-25 2021-11-25 Neural network quantization method and device

Country Status (1)

Country Link
CN (1) CN114372553A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116108896A (en) * 2023-04-11 2023-05-12 上海登临科技有限公司 Model quantization method, device, medium and electronic equipment
CN116108896B (en) * 2023-04-11 2023-07-07 上海登临科技有限公司 Model quantization method, device, medium and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination