CN112381205A - Neural network low bit quantization method - Google Patents
- Publication number
- CN112381205A CN112381205A CN202011057930.2A CN202011057930A CN112381205A CN 112381205 A CN112381205 A CN 112381205A CN 202011057930 A CN202011057930 A CN 202011057930A CN 112381205 A CN112381205 A CN 112381205A
- Authority
- CN
- China
- Prior art keywords
- quantization
- input
- neural network
- value
- scaling factor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000013139 quantization Methods 0.000 title claims abstract description 134
- 238000013528 artificial neural network Methods 0.000 title claims abstract description 79
- 238000000034 method Methods 0.000 title claims abstract description 28
- 238000004364 calculation method Methods 0.000 claims description 13
- 230000004913 activation Effects 0.000 claims description 8
- 238000010606 normalization Methods 0.000 claims description 6
- 238000003062 neural network model Methods 0.000 abstract description 8
- 230000006870 function Effects 0.000 description 4
- 238000001514 detection method Methods 0.000 description 2
- 238000009826 distribution Methods 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 230000005284 excitation Effects 0.000 description 2
- 238000003058 natural language processing Methods 0.000 description 2
- 238000007906 compression Methods 0.000 description 1
- 230000006835 compression Effects 0.000 description 1
- 238000013144 data compression Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Neurology (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
The invention relates to a neural network low bit quantization method, which quantizes the weight values of each channel of the neural network to low-bit fixed-point weights. The input quantization coefficient of the current convolutional layer is obtained from the target quantization threshold and the floating point quantization interval, and the input quantization coefficient of the next convolutional layer serves as the output quantization coefficient of the current convolutional layer. Input floating point data are quantized to obtain input fixed point data, and output floating point data are quantized to obtain output fixed point data. The scaling factor and the bias are converted into a scaling factor fixed-point value and a bias fixed-point value, respectively. The quantized neural network is obtained from the low-bit fixed-point weights, the input fixed point data, the output fixed point data, the scaling factor fixed-point value and the bias fixed-point value; the quantized neural network model can be applied to embedded devices.
Description
Technical Field
The invention relates to the field of data compression, in particular to a low bit quantization method of a neural network.
Background
Neural network technology performs well on tasks such as image classification, target detection and natural language processing. To improve recognition accuracy, the scale and complexity of neural network models keep increasing.
This places ever higher demands on the computational performance of the device, and the growing network size and power consumption have gradually become the main obstacles limiting the application of neural networks. Ever-larger neural networks require more and more memory to run, and growing network models also demand larger bandwidth, which greatly limits the application of neural networks in embedded devices.
Disclosure of Invention
The invention aims to provide a low-bit quantization method of a neural network, so that a neural network model can be applied to an embedded device.
In order to realize the purpose, the technical scheme is as follows: a neural network low bit quantization method comprises the following steps.
S101: obtaining an initial neural network, and obtaining weight values and offsets of C channels in the initial neural network, wherein each channel comprises a convolutional layer.
S102: and counting the maximum value of the input of the C channels in the initial neural network.
S103: and obtaining the scaling factor according to the input maximum value in the C channels.
S104: quantizing the weight value of each channel to a low-bit fixed-point weight according to the scaling factor and the weight value of each channel.
S105: and inputting set data to the initial neural network for forward calculation.
S106: and counting the maximum value of the input absolute value of the current convolution layer in the initial neural network, and acquiring a floating point quantization interval according to the maximum value of the input absolute value of the current convolution layer.
S107: repeating S105, and constructing a histogram with a first set length by using the input floating point value of the current convolutional layer based on the floating point quantization interval.
S108: and circularly traversing the set quantization threshold values in the floating point quantization interval, converting the histogram with the first set length into the histogram with the second set length for each quantization threshold value, calculating KL divergence, and taking the quantization threshold value with the minimum KL divergence as a target quantization threshold value.
S109: and acquiring the input quantization coefficient of the current convolutional layer according to the target quantization threshold and the floating point quantization interval.
S110: taking the next convolutional layer connected after the current convolutional layer as the current convolutional layer, repeating S105 to S109 to obtain the input quantization coefficient of the next convolutional layer, and taking the input quantization coefficient of the next convolutional layer as the output quantization coefficient of the current convolutional layer.
S111: and quantizing the input floating point data to obtain input fixed point data according to the input quantization coefficient of the current convolution layer and the input floating point data of the current convolution layer.
S112: and quantizing the output floating point data to obtain output fixed point data according to the scaling factor, the input quantization coefficient of the current convolutional layer, the output quantization coefficient of the current convolutional layer and the output floating point data of the current convolutional layer.
S113: the scaling factor and the bias are converted into a scaling factor fixed point value and a bias fixed point value, respectively.
S114: obtaining the quantized neural network according to the low-bit fixed-point weights, the input fixed point data, the output fixed point data, the scaling factor fixed-point value and the bias fixed-point value.
Compared with the prior art, the invention has the following technical effects: the method is highly practical and greatly improves quantization efficiency and quantization precision, and the quantized neural network structure built with this quantization method can be directly deployed on dedicated-chip FPGA/CGRA hardware platforms.
Drawings
Fig. 1 is a flow chart illustrating a neural network low bit quantization method according to the present invention.
FIG. 2 is a schematic flow chart of the present invention for converting the scaling factor and the bias into the scaling factor fixed-point value and the bias fixed-point value, respectively.
Detailed Description
The following describes embodiments of the present invention with reference to the drawings.
As shown in fig. 1, an embodiment of the invention is a neural network low bit quantization method, which includes S101 to S114.
S101: obtaining an initial neural network, and obtaining weight values and offsets of C channels in the initial neural network, wherein each channel comprises a convolutional layer.
The initial neural network can be applied to tasks such as image classification, target detection and natural language processing, and has already been trained. The initial neural network stores and operates on floating-point values, i.e., each weight is originally represented using float32; in other words, the initial neural network performs floating-point operations.
Quantizing the initial neural network means converting floating-point operations into integer storage and operations, thereby realizing compression of the initial neural network model. In short, the initial neural network needs to be quantized and then represented using int8.
The initial neural network comprises convolution layers, batch normalization layers and activation function layers; a common neural network model mainly consists of these three structures. The following steps of the method can also quantize eltwise layers on the same principle, which makes the method more general. The invention is applicable to all neural network models.
The step S101 comprises the following steps: eliminating the batch normalization layer to enable the network structure of the initial neural network to be a convolution layer and an activation function layer; and acquiring the weight values and the offsets of the C channels in the initial neural network after the batch normalization layer is eliminated.
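The batch normalization elimination in S101 can be sketched as folding each BN layer's per-channel parameters into the preceding convolution's weights and bias. The patent does not give the folding formulas, so the sketch below (NumPy, illustrative function name) assumes the standard BN-folding identities W' = γ·W/√(σ²+ε) and b' = γ·(b−μ)/√(σ²+ε)+β:

```python
import numpy as np

def fold_batch_norm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a batch-normalization layer into the preceding convolution.

    w: conv weights with output channels on axis 0; b: per-channel bias.
    gamma, beta, mean, var: per-channel BN parameters.
    Returns folded weights and bias, so the network reduces to
    convolution + activation function layers only.
    """
    scale = gamma / np.sqrt(var + eps)                      # per-channel rescale
    w_folded = w * scale.reshape(-1, *([1] * (w.ndim - 1)))  # broadcast over channel axis
    b_folded = (b - mean) * scale + beta
    return w_folded, b_folded
```

After folding, conv(x)·BN parameters and the folded conv produce identical outputs, so the C channel weight values and offsets of S101 can be read directly from the folded layer.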
S102: and counting the maximum value of the input of the C channels in the initial neural network.
S103: and obtaining the scaling factor according to the input maximum value in the C channels.
Obtaining a vector with the length of C according to the maximum value input in the C channels; the scaling factor is calculated by equation (1).
S^(C) = th_C / 127.0    Equation (1)
Wherein S^(C) represents the scaling factor and th_C represents the vector of length C.
S104: quantizing the weight value of each channel to a lower specific point weight according to the scaling factor and the weight value of each channel.
The weight value of each channel is quantized to a low-bit fixed-point weight by equation (2):
W_int8 = round(W^(C) / S^(C))    Equation (2)
Wherein c ∈ [0, C]; W_int8 represents the int8 weight value of the channel after low-bit quantization; round represents the rounding operation; W^(C) represents the weight value of the channel; S^(C) represents the scaling factor.
S101 to S104 complete the quantization of each channel's weight values in the initial neural network; that is, only the weights of the initial neural network need this quantization operation. Since the weights of the initial neural network are fixed once training finishes, they can be quantized in advance.
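Assuming th_C is the vector of per-channel maximum absolute weight values, S102–S104 can be sketched as follows (NumPy; illustrative names, with the result clipped to the symmetric int8 range):

```python
import numpy as np

def quantize_weights_per_channel(weights):
    """Per-channel int8 weight quantization (sketch of S102-S104).

    weights: list of C per-channel weight arrays. For each channel c,
    th_c is the maximum absolute weight, the scaling factor is
    S^(C) = th_c / 127.0 (equation (1)), and the fixed-point weight is
    W_int8 = round(W^(C) / S^(C)) (equation (2)).
    """
    th = np.array([np.abs(w).max() for w in weights])  # vector of length C
    scale = th / 127.0                                 # equation (1)
    w_int8 = [np.clip(np.round(w / s), -127, 127).astype(np.int8)
              for w, s in zip(weights, scale)]         # equation (2)
    return w_int8, scale
```

The per-channel scaling factors S^(C) are kept alongside the int8 weights, since they re-enter in the output quantization of S112.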
S105 to S112 below perform activation quantization of the neural network, that is, the input floating point data and output floating point data of the neural network are quantized.
S105: and inputting setting data to the initial neural network for forward calculation.
For example, the data is set to 1000 images, and the 1000 images are input into an initial neural network for forward calculation.
S106: and counting the maximum value of the input absolute value of the current convolution layer in the initial neural network, and acquiring a floating point quantization interval according to the maximum value of the input absolute value of the current convolution layer.
Count the maximum of the absolute value of the current convolutional layer's input in the initial neural network, denoted max, and calculate the floating point quantization interval by the formula dist_scale = max / 2048.0, where dist_scale is the floating point quantization interval. The quantization thresholds are then searched over the interval [128, 2048].
S107: repeating the step S105, and constructing a histogram with a first set length by using the input floating point value of the current convolutional layer based on the floating point quantization interval.
And inputting 1000 images into the initial neural network again for forward calculation, and constructing a histogram with a first set length of 2048 by using the input floating point value of the current convolutional layer based on the floating point quantization interval.
S108: circularly traversing the set quantization thresholds in the floating point quantization interval; for each quantization threshold, converting the histogram of the first set length into a histogram of the second set length, calculating the KL divergence, and taking the quantization threshold with the minimum KL divergence as the target quantization threshold.
For each quantization threshold th in the cyclic traversal of the interval [128, 2048], the histogram of first set length 2048 is converted into a histogram of second set length 128 and the KL divergence is calculated; the quantization threshold with the smallest KL divergence is taken as the target quantization threshold. In other words, the quantization threshold at which the KL divergence is minimal is the optimal threshold target_th.
The KL divergence (Kullback-Leibler divergence) is an asymmetric measure of the difference between two probability distributions.
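A minimal sketch of the S107–S108 threshold search, assuming the common TensorRT-style calibration (2048-bin absolute-value histogram, 128-bin target, simplified re-binning without the usual zero-bin bookkeeping):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P||Q) over the bins where p > 0 (q floored to avoid log 0)."""
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0
    q = np.where(q > 0, q, 1e-12)
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def find_target_threshold(hist, target_bins=128):
    """Cyclically traverse thresholds th in [target_bins, len(hist)] (S108 sketch)."""
    best_th, best_kl = None, float("inf")
    for th in range(target_bins, len(hist) + 1):
        ref = hist[:th].astype(np.float64).copy()
        ref[-1] += hist[th:].sum()               # clip the tail into the last kept bin
        # Collapse th bins into target_bins bins, then expand back so both
        # distributions have length th for the KL comparison.
        idx = (np.arange(th) * target_bins) // th
        q_small = np.bincount(idx, weights=ref, minlength=target_bins)
        counts = np.bincount(idx, minlength=target_bins)
        q = q_small[idx] / counts[idx]           # quantized distribution, re-expanded
        kl = kl_divergence(ref, q)
        if kl < best_kl:
            best_kl, best_th = kl, th
    return best_th
```

Per S109 and equation (3), the winning threshold then yields the input quantization coefficient as scale = (target_th + 0.5) · dist_scale / 127.0.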
S109: and acquiring the input quantization coefficient of the current convolutional layer according to the target quantization threshold and the floating point quantization interval.
The input quantization coefficient is calculated by formula (3).
scale = (target_th + 0.5) · dist_scale / 127.0    Equation (3)
Wherein scale, denoted S_pre, represents the input quantization coefficient; target_th represents the target quantization threshold; dist_scale represents the floating point quantization interval.
Through the above-mentioned S105 to S109, the input quantized coefficients of the current convolutional layer are obtained.
Further, when the input and output of an activation function layer in the initial neural network are all positive, the quantization range of symmetric 8-bit quantization still includes negative numbers. This wastes quantization space and adds extra overhead for storing parameters.
Therefore, the activation function layer output is optimized to uint8 asymmetric quantization; that is, in S108 the floating point quantization interval becomes [256, 2048] and histograms of length 256 are generated for the different set quantization thresholds, and the input quantization coefficient is finally divided by 255.0 when it is calculated. This reduces the waste of quantization space and the extra parameter storage overhead, and yields a further precision improvement.
S110: and taking the next convolutional layer connected after the current convolutional layer as the current convolutional layer, repeating the steps from S105 to S109 to obtain the input quantization coefficient of the next convolutional layer, and taking the input quantization coefficient of the next convolutional layer as the output quantization coefficient of the current convolutional layer.
Wherein the input quantization coefficient of the next convolutional layer is the output quantization coefficient of the current convolutional layer, denoted S_cur.
S111: and quantizing the input floating point data to obtain input fixed point data according to the input quantization coefficient of the current convolutional layer and the input floating point data of the current convolutional layer.
When the input I_f is not all positive, int8 quantization is performed; the input floating point data are quantized by equation (4):
I_8bit = round(I_f / S_pre)    Equation (4)
Wherein I_8bit represents the input fixed point data; I_f represents the input floating point data; S_pre represents the input quantization coefficient.
I_f is quantized to 8 bits based on S_pre; when the input I_f is a ReLU output, uint8 quantization is performed instead by the corresponding uint8 formula.
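Assuming equation (4) takes the form I_8bit = round(I_f / S_pre), the input quantization of S111 can be sketched as follows (illustrative; the uint8 branch reflects the asymmetric optimization described for ReLU outputs):

```python
import numpy as np

def quantize_input(i_f, s_pre, relu_output=False):
    """Sketch of equation (4): I_8bit = round(I_f / S_pre).

    When the input is a ReLU output (all non-negative), asymmetric uint8
    quantization is used instead of symmetric int8 -- an assumption based
    on the uint8 optimization described for the activation function layer.
    """
    q = np.round(i_f / s_pre)
    if relu_output:
        return np.clip(q, 0, 255).astype(np.uint8)   # asymmetric uint8 range
    return np.clip(q, -127, 127).astype(np.int8)     # symmetric int8 range
```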
S112: and quantizing the output floating point data to obtain output fixed point data according to the scaling factor, the input quantization coefficient of the current convolutional layer, the output quantization coefficient of the current convolutional layer and the output floating point data of the current convolutional layer.
The output floating point data are quantized by equation (5):
O_8bit = round((S^(C) · Σ{W_int8 · I_8bit} + Bias^(C)) / S_cur)    Equation (5)
Wherein O_8bit represents the output fixed point data; S^(C) represents the scaling factor; W_int8 represents the 8-bit weight after quantization of the channel's weight value; I_8bit represents the input fixed point data; Bias^(C) represents the bias; S_cur represents the output quantization coefficient.
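The formula image for equation (5) is not reproduced in the text; from the surrounding variable list, a plausible reconstruction is O_8bit = round((S^(C) · Σ{W_int8 · I_8bit} + Bias^(C)) / S_cur), sketched below for a single output element. (Some int8 pipelines additionally fold in the input coefficient S_pre; that variant is omitted here to match the variable list.)

```python
import numpy as np

def quantize_output(w_int8, i_8bit, scale_c, bias_c, s_cur):
    """Sketch of equation (5): requantize one accumulated output element.

    w_int8, i_8bit: int8 weight and input vectors for one output element;
    scale_c: per-channel scaling factor S^(C); bias_c: Bias^(C);
    s_cur: output quantization coefficient S_cur.
    """
    # Accumulate in int32, as fixed-point hardware would.
    acc = int(np.sum(w_int8.astype(np.int32) * i_8bit.astype(np.int32)))
    o_f = scale_c * acc + bias_c                  # dequantized floating output
    return int(np.clip(np.round(o_f / s_cur), -127, 127))
```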
S113 described below is a process of coefficient fixing.
S113: scaling the scaling factor SCAnd said BiasCRespectively into a scaling factor fixed-point value and a bias fixed-point value.
Step S113 includes steps S201 to S206 described below.
S201: calculating the target scaling factor from the scaling factor, the input quantization coefficient and the output quantization coefficient, i.e., S_target = S^(C) · S_pre / S_cur.
S202: calculating the target bias from the bias and the output quantization coefficient, i.e., Bias_target = Bias^(C) / S_cur.
The target scaling factor and the target bias are floating point values; when deployed on a chip for computation, they need to be converted into fixed-point values. The following steps convert the target scaling factor and the target bias into 16-bit fixed-point values, as detailed below.
S203: and inputting setting data to the initial neural network for forward calculation.
S204: statistics (S)(c)*∑{Wint8*I8bit}+Bias(c)) Maximum absolute value of (D) is recorded as Maxrst。
S205: statistical target scaling factor SCMax ofscale(ii) a For MaxscaleMaking a left shift fixed point of 16 bits to obtain a fixed point value q of a scaling factorscale。
S206: statistical target BiasCMaximum value, denoted Maxbias(ii) a For Max (Max)rst,Maxbias) Making a 16-bit left shift fixed point to obtain a bias fixed point value qrst。
Fixed point and floating point are both numerical representations: a fixed-point number stores a fixed number of integer and fraction digits, while a floating-point number stores a significand and an exponent. Fixed point differs from floating point in that its integer and fractional parts are separated at a fixed position.
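The 16-bit left-shift fixed-pointing of S205–S206 can be illustrated as storing a coefficient x as the integer round(x · 2^16); this is a sketch, since the patent does not specify the exact rounding or range handling:

```python
def to_fixed_point_q16(x, frac_bits=16):
    """Represent a small floating-point coefficient as an integer holding
    x * 2^frac_bits, so the chip can multiply and add in pure integer
    arithmetic and shift right by frac_bits to recover the scaled result."""
    return int(round(x * (1 << frac_bits)))

def from_fixed_point_q16(q, frac_bits=16):
    """Recover the floating-point value from its fixed-point encoding."""
    return q / (1 << frac_bits)
```

With 16 fractional bits the round-trip error is bounded by 2^-17, which is why coefficients such as q_scale and q_rst retain enough precision after fixed-pointing.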
After weight quantization, activation quantization and coefficient fixed-pointing are performed, the quantization of the neural network is complete.
S114: and acquiring a quantized neural network according to the low specific point weight, the input fixed point data, the output fixed point data, the scaling factor fixed point value and the bias fixed point value.
The quantized neural network structure built with this quantization method can be directly deployed on dedicated-chip hardware platforms such as a field programmable gate array (FPGA) or a coarse-grained reconfigurable architecture (CGRA).
Table 1 below shows the quantization accuracy comparison of the neural network before and after quantization.
TABLE 1
Model | Before quantization | After quantization
---|---|---
GRU-L | 95.09% | 94.87%
LSTM-L | 91.37% | 90.35%
Wherein the precision of GRU-L is 95.09% before quantization and 94.87% after quantization, and the precision of LSTM-L is 91.37% before quantization and 90.35% after quantization. The precision loss caused by this quantization method is therefore small, and the quantized neural network retains high precision and accuracy.
The invention allows the quantized neural network to be deployed directly on a chip. First, for any given neural network floating point model, the quantization method of the invention can quantize the model directly; the quantized network model requires no retraining or other extra operations, so quantization efficiency is high.
Second, the quantization method of the invention quantizes the whole neural network model layer by layer through its common layers. It has been verified to achieve high recognition accuracy on neural network structures such as GRU, LSTM and RNN, can be further extended to other neural network structures, and thus has strong generality.
Finally, the neural network low bit quantization method of the invention is designed so that the quantized neural network model can be deployed directly on a chip, and it has already been applied in an actual speech recognition product. The quantization result maintains high precision and accuracy, meets the recognition rate requirements of real scenarios, and is very friendly to hardware platform development and deployment, so the method is of practical significance for broadening the application of neural network models in embedded devices.
Claims (8)
1. A neural network low bit quantization method, comprising:
s101: acquiring an initial neural network, and acquiring weight values and offsets of C channels in the initial neural network, wherein each channel comprises a convolutional layer;
s102: counting the maximum value of the input of C channels in the initial neural network;
s103: obtaining a scaling factor according to the input maximum value in the C channels;
s104: quantizing the weight value of each channel to a low-bit fixed-point weight according to the scaling factor and the weight value of each channel;
s105: inputting set data to the initial neural network for forward calculation;
s106: counting the maximum value of the input absolute value of the current convolution layer in the initial neural network, and acquiring a floating point quantization interval according to the maximum value of the input absolute value of the current convolution layer;
s107: repeating the step S105, and constructing a histogram with a first set length by using the input floating point value of the current convolutional layer based on the floating point quantization interval;
s108: circularly traversing the set quantization threshold values in the floating point quantization interval, converting the histogram with a first set length into a histogram with a second set length for each quantization threshold value, calculating KL divergence, and taking the quantization threshold value with the smallest KL divergence as a target quantization threshold value;
s109: acquiring an input quantization coefficient of the current convolution layer according to the target quantization threshold and the floating point quantization interval;
s110: taking the next convolutional layer connected after the current convolutional layer as the current convolutional layer, repeating the steps from S105 to S109 to obtain the input quantization coefficient of the next convolutional layer, and taking the input quantization coefficient of the next convolutional layer as the output quantization coefficient of the current convolutional layer;
s111: quantizing the input floating point data to obtain input fixed point data according to the input quantization coefficient of the current convolution layer and the input floating point data of the current convolution layer;
s112: quantizing the output floating point data to obtain output fixed point data according to the scaling factor, the input quantization coefficient of the current convolutional layer, the output quantization coefficient of the current convolutional layer and the output floating point data of the current convolutional layer;
s113: converting the scaling factor and the bias into a scaling factor fixed-point value and a bias fixed-point value, respectively;
s114: acquiring the quantized neural network according to the low-bit fixed-point weights, the input fixed point data, the output fixed point data, the scaling factor fixed-point value and the bias fixed-point value.
2. The neural network low bit quantization method of claim 1, wherein the initial neural network comprises a convolutional layer, a bulk normalization layer, and an activation function layer;
the S101 includes:
eliminating the batch normalization layer to enable the network structure of the initial neural network to be a convolution layer and an activation function layer; and acquiring the weight values and the offsets of the C channels in the initial neural network after the batch normalization layer is eliminated.
3. The neural network low bit quantization method of claim 2, wherein obtaining the scaling factor according to the maximum value input in the C channels in S103 comprises:
obtaining a vector with the length of C according to the maximum value input in the C channels; calculating the scaling factor by equation (1);
S^(C) = th_C / 127.0    Equation (1)
Wherein S^(C) represents the scaling factor and th_C represents the vector of length C.
4. The neural network low bit quantization method of claim 3, wherein said S104 comprises:
quantizing the weight value of each channel to a low-bit fixed-point weight by equation (2);
W_int8 = round(W^(C) / S^(C))    Equation (2)
Wherein c ∈ [0, C]; W_int8 represents the int8 weight value of the channel after low-bit quantization; round represents the rounding operation; W^(C) represents the weight value of the channel; S^(C) represents the scaling factor.
5. The neural network low bit quantization method of claim 4, wherein said S109 comprises:
calculating the input quantization coefficient by formula (3);
scale = (target_th + 0.5) · dist_scale / 127.0    Equation (3)
Wherein scale, denoted S_pre, represents the input quantization coefficient; target_th represents the target quantization threshold; dist_scale represents the floating point quantization interval.
6. The neural network low bit quantization method of claim 5, wherein the input floating point data quantization in S111 is calculated by equation (4);
I_8bit = round(I_f / S_pre)    Equation (4)
Wherein I_8bit represents the input fixed point data; I_f represents the input floating point data; S_pre represents the input quantization coefficient;
the output floating point data quantization in S112 is calculated by equation (5);
O_8bit = round((S^(C) · Σ{W_int8 · I_8bit} + Bias^(C)) / S_cur)    Equation (5)
Wherein O_8bit represents the output fixed point data; S^(C) represents the scaling factor; W_int8 represents the 8-bit weight after quantization of the channel's weight value; I_8bit represents the input fixed point data; Bias^(C) represents the bias; S_cur represents the output quantization coefficient.
7. The neural network low bit quantization method of claim 6, wherein said converting said scaling factor and said bias into a scaling factor fixed point value and a bias fixed point value in said S113 comprises:
calculating to obtain a target scaling factor according to the scaling factor, the input quantization coefficient and the output quantization coefficient;
calculating to obtain target bias according to the bias and the output quantization coefficient;
inputting set data to the initial neural network for forward calculation;
counting the maximum absolute value of (S^(C) · Σ{W_int8 · I_8bit} + Bias^(C)), denoted Max_rst;
counting the maximum of the target scaling factor, denoted Max_scale; applying a 16-bit left-shift fixed-pointing to Max_scale to obtain the scaling factor fixed-point value q_scale;
counting the maximum of the target bias, denoted Max_bias; applying a 16-bit left-shift fixed-pointing to max(Max_rst, Max_bias) to obtain the bias fixed-point value q_rst.
8. The neural network low bit quantization method of claim 5, wherein when the input and output of the activation function layer in the initial neural network are all positive, the activation function layer output is optimized to uint8 asymmetric quantization; the floating point quantization interval is changed to [256, 2048], histograms of length 256 are generated for the different set quantization thresholds, and the input quantization coefficient is finally divided by 255.0 when it is calculated.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011057930.2A CN112381205A (en) | 2020-09-29 | 2020-09-29 | Neural network low bit quantization method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011057930.2A CN112381205A (en) | 2020-09-29 | 2020-09-29 | Neural network low bit quantization method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112381205A true CN112381205A (en) | 2021-02-19 |
Family
ID=74580892
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011057930.2A Pending CN112381205A (en) | 2020-09-29 | 2020-09-29 | Neural network low bit quantization method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112381205A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106575379A (en) * | 2014-09-09 | 2017-04-19 | 英特尔公司 | Improved fixed point integer implementations for neural networks |
JP2019160319A (en) * | 2018-03-09 | 2019-09-19 | キヤノン株式会社 | Method and device for optimizing and applying multi-layer neural network model, and storage medium |
CN110363281A (en) * | 2019-06-06 | 2019-10-22 | 上海交通大学 | A kind of convolutional neural networks quantization method, device, computer and storage medium |
CN110610237A (en) * | 2019-09-17 | 2019-12-24 | 普联技术有限公司 | Quantitative training method and device of model and storage medium |
CN111260022A (en) * | 2019-11-22 | 2020-06-09 | 中国电子科技集团公司第五十二研究所 | Method for fixed-point quantization of complete INT8 of convolutional neural network |
CN111461302A (en) * | 2020-03-30 | 2020-07-28 | 杭州嘉楠耘智信息科技有限公司 | Data processing method, device and storage medium based on convolutional neural network |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113011571A (en) * | 2021-03-03 | 2021-06-22 | 华南理工大学 | INT8 offline quantization and integer inference method based on Transformer model |
CN113554149A (en) * | 2021-06-18 | 2021-10-26 | 北京百度网讯科技有限公司 | Neural network processing unit NPU, neural network processing method and device |
CN113487014A (en) * | 2021-07-05 | 2021-10-08 | 上海西井信息科技有限公司 | Method and equipment for quantizing any bit based on semantic segmentation neural network model |
CN113255901B (en) * | 2021-07-06 | 2021-10-08 | 上海齐感电子信息科技有限公司 | Real-time quantization method and real-time quantization system |
CN113780523A (en) * | 2021-08-27 | 2021-12-10 | 深圳云天励飞技术股份有限公司 | Image processing method, image processing device, terminal equipment and storage medium |
CN113780523B (en) * | 2021-08-27 | 2024-03-29 | 深圳云天励飞技术股份有限公司 | Image processing method, device, terminal equipment and storage medium |
CN113747155A (en) * | 2021-09-06 | 2021-12-03 | 中国电信股份有限公司 | Feature quantization method and device, encoder and communication system |
CN113747155B (en) * | 2021-09-06 | 2022-08-19 | 中国电信股份有限公司 | Characteristic quantization method and device, encoder and communication system |
CN114781604A (en) * | 2022-04-13 | 2022-07-22 | 广州安凯微电子股份有限公司 | Coding method of neural network weight parameter, coder and neural network processor |
CN114781604B (en) * | 2022-04-13 | 2024-02-20 | 广州安凯微电子股份有限公司 | Coding method of neural network weight parameters, coder and neural network processor |
CN116992965A (en) * | 2023-09-27 | 2023-11-03 | 之江实验室 | Inference method and apparatus for a large Transformer model, computer device, and storage medium |
CN116992965B (en) * | 2023-09-27 | 2024-01-09 | 之江实验室 | Inference method and apparatus for a large Transformer model, computer device, and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112381205A (en) | Neural network low bit quantization method | |
Lee et al. | LogNet: Energy-efficient neural networks using logarithmic computation | |
EP3276540B1 (en) | Neural network method and apparatus | |
US10096134B2 (en) | Data compaction and memory bandwidth reduction for sparse neural networks | |
CN109102064B (en) | High-precision neural network quantization compression method | |
CN111612147A (en) | Quantization method of deep convolutional network | |
EP3651069A1 (en) | Data processing device, data processing method, and compressed data | |
US10491239B1 (en) | Large-scale computations using an adaptive numerical format | |
CN112329922A (en) | Neural network model compression method and system based on mass spectrum data set | |
WO2021135715A1 (en) | Image compression method and apparatus | |
US20230300354A1 (en) | Method and System for Image Compressing and Coding with Deep Learning | |
Wang et al. | QGAN: Quantized generative adversarial networks | |
CN111240746B (en) | Floating point data inverse quantization and quantization method and equipment | |
CN110874625A (en) | Deep neural network quantification method and device | |
CN109978144B (en) | Model compression method and system | |
CN112766484A (en) | Floating point neural network model quantization system and method | |
US11531884B2 (en) | Separate quantization method of forming combination of 4-bit and 8-bit data of neural network | |
CN111178427A (en) | Depth self-coding embedded clustering method based on Sliced-Wasserstein distance | |
CN112652299B (en) | Quantification method and device of time series speech recognition deep learning model | |
Ullah et al. | L2L: A highly accurate Log_2_Lead quantization of pre-trained neural networks | |
CN101467459A (en) | Restrained vector quantization | |
CN112613604A (en) | Neural network quantification method and device | |
Enderich et al. | Learning multimodal fixed-point weights using gradient descent | |
US20210132866A1 (en) | Data processing device, method of operating the same, and program | |
JPWO2020049681A1 (en) | Information processing equipment, methods and programs |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB03 | Change of inventor or designer information |
Inventor after: Zhang Shurui
Inventor after: OuYang Peng
Inventor before: Zhang Shurui
Inventor before: OuYang Peng
Inventor before: Yin Shouyi