CN112381205A - Neural network low bit quantization method - Google Patents

Neural network low bit quantization method

Info

Publication number
CN112381205A
CN112381205A (Application CN202011057930.2A)
Authority
CN
China
Prior art keywords
quantization
input
neural network
value
scaling factor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011057930.2A
Other languages
Chinese (zh)
Inventor
Zhang Shurui
OuYang Peng
Yin Shouyi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qingwei Intelligent Technology Co ltd
Original Assignee
Beijing Qingwei Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qingwei Intelligent Technology Co ltd filed Critical Beijing Qingwei Intelligent Technology Co ltd
Priority to CN202011057930.2A priority Critical patent/CN112381205A/en
Publication of CN112381205A publication Critical patent/CN112381205A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a neural network low-bit quantization method, which quantizes the weight values of each channel of the neural network to low-bit fixed-point weights. The input quantization coefficient of the current convolutional layer is obtained from the target quantization threshold and the floating-point quantization interval. The input quantization coefficient of the next convolutional layer is taken as the output quantization coefficient of the current convolutional layer. Input floating-point data is quantized to obtain input fixed-point data, and output floating-point data is quantized to obtain output fixed-point data. The scaling factor and the bias are converted into a scaling-factor fixed-point value and a bias fixed-point value, respectively. The quantized neural network is obtained from the low-bit fixed-point weights, the input fixed-point data, the output fixed-point data, the scaling-factor fixed-point value and the bias fixed-point value, and the quantized neural network model can be deployed on embedded devices.

Description

Neural network low bit quantization method
Technical Field
The invention relates to the field of data compression, in particular to a low bit quantization method of a neural network.
Background
Neural network technology performs well on tasks including image classification, object detection and natural language processing. To improve recognition accuracy, neural network models keep growing in scale and complexity.
This places ever higher demands on the computing performance of devices, and growing network size and power consumption have gradually become the main obstacles to applying neural networks. Ever-larger networks require more and more memory to run, and larger models also demand greater bandwidth, which greatly limits the application of neural networks in embedded devices.
Disclosure of Invention
The invention aims to provide a neural network low-bit quantization method so that neural network models can be applied in embedded devices.
To achieve this purpose, the technical solution is as follows: a neural network low-bit quantization method comprises the following steps.
S101: obtaining an initial neural network, and obtaining weight values and offsets of C channels in the initial neural network, wherein each channel comprises a convolutional layer.
S102: and counting the maximum value of the input of the C channels in the initial neural network.
S103: and obtaining the scaling factor according to the input maximum value in the C channels.
S104: and quantizing the weight value of each channel to be lower than the specific point weight according to the scaling factor and the weight value of each channel.
S105: and inputting set data to the initial neural network for forward calculation.
S106: and counting the maximum value of the input absolute value of the current convolution layer in the initial neural network, and acquiring a floating point quantization interval according to the maximum value of the input absolute value of the current convolution layer.
S107: repeating S105, and constructing a histogram with a first set length by using the input floating point value of the current convolutional layer based on the floating point quantization interval.
S108: and circularly traversing the set quantization threshold values in the floating point quantization interval, converting the histogram with the first set length into the histogram with the second set length for each quantization threshold value, calculating KL divergence, and taking the quantization threshold value with the minimum KL divergence as a target quantization threshold value.
S109: and acquiring the input quantization coefficient of the current convolutional layer according to the target quantization threshold and the floating point quantization interval.
S110: and taking the next convolutional layer connected after the current convolutional layer as the current convolutional layer, repeating S105 to 109 to obtain the input quantized coefficient of the next convolutional layer, and taking the input quantized coefficient of the next convolutional layer as the output quantized coefficient of the current convolutional layer.
S111: and quantizing the input floating point data to obtain input fixed point data according to the input quantization coefficient of the current convolution layer and the input floating point data of the current convolution layer.
S112: and quantizing the output floating point data to obtain output fixed point data according to the scaling factor, the input quantization coefficient of the current convolutional layer, the output quantization coefficient of the current convolutional layer and the output floating point data of the current convolutional layer.
S113: the scaling factor and the bias are converted into a scaling factor fixed point value and a bias fixed point value, respectively.
S114: and obtaining the quantized neural network according to the low specific point weight, the input fixed point data, the output fixed point data, the scaling factor fixed point value and the bias fixed point value.
Compared with the prior art, the technical effect of the invention is as follows: the method is highly practical and greatly improves both quantization efficiency and quantization precision, and a quantized neural network built with this quantization method can be directly deployed on dedicated-chip FPGA/CGRA hardware platforms.
Drawings
Fig. 1 is a flow chart illustrating a neural network low bit quantization method according to the present invention.
FIG. 2 is a schematic flow chart of the present invention for converting the scaling factor and the bias into the scaling factor fixed-point value and the bias fixed-point value, respectively.
Detailed Description
The following describes embodiments of the present invention with reference to the drawings.
As shown in fig. 1, an embodiment of the invention is a neural network low bit quantization method, which includes S101 to S114.
S101: obtaining an initial neural network, and obtaining weight values and offsets of C channels in the initial neural network, wherein each channel comprises a convolutional layer.
The initial neural network can be applied to tasks such as image classification, object detection and natural language processing, and has already been trained. The initial neural network stores and computes in floating point, i.e., each weight is originally represented in float32.
Quantizing the initial neural network means converting floating-point computation into integer storage and computation, which compresses the model. In short, the initial neural network is quantized so that it can be represented using int8.
The initial neural network comprises convolutional layers, batch normalization layers and activation function layers; common neural network models consist mainly of these three structures. The following steps can likewise quantize eltwise layers on the same principle, which makes the method more general. The invention is applicable to all neural network models.
The step S101 comprises the following: eliminating the batch normalization layers so that the network structure of the initial neural network consists of convolutional layers and activation function layers; and acquiring the weight values and biases of the C channels in the initial neural network after the batch normalization layers are eliminated.
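The patent does not detail how the batch normalization layer is eliminated. A minimal sketch of the standard approach, folding the BN parameters into the preceding convolution so that only convolution and activation layers remain, is given below; the function name and parameter layout are illustrative assumptions, not the patent's own notation:

```python
import numpy as np

def fold_batch_norm(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a batch-normalization layer into the preceding convolution.

    W: conv weights of shape (C, C_in, kH, kW); b: conv bias of shape (C,).
    gamma, beta, mean, var: per-channel BN parameters, each of shape (C,).
    Returns weights and bias equivalent to convolution followed by BN.
    """
    s = gamma / np.sqrt(var + eps)         # per-channel BN scale
    W_folded = W * s[:, None, None, None]  # scale each output channel's filters
    b_folded = (b - mean) * s + beta       # absorb the BN shift into the bias
    return W_folded, b_folded
```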
S102: and counting the maximum value of the input of the C channels in the initial neural network.
S103: and obtaining the scaling factor according to the input maximum value in the C channels.
A vector of length C is obtained from the input maxima of the C channels, and the scaling factor is calculated by equation (1):

S^(C) = th_C / 127.0    equation (1)

wherein S^(C) represents the scaling factor and th_C represents the vector of length C.
S104: quantizing the weight value of each channel to a lower specific point weight according to the scaling factor and the weight value of each channel.
The weight value of each channel is quantized to a low specific point weight by formula (2).
Figure BDA0002711387940000041
Wherein C is [0, C ]];Wint8Int8 represents the weight value of the channel after lower quantization; roundtrip represents rounding the calculation; w(C)A weight value representing a channel; s(C)Representing a scaling factor.
S101 to S104 complete the quantization of each channel's weight values in the initial neural network; only the weights need this quantization step. Since the weights of a trained network are fixed, they can be quantized in advance.
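A worked sketch of S102 to S104 follows, assuming th_C of S102 is the per-channel maximum absolute value feeding equation (1), and with the clipping to [-127, 127] added as a safety assumption not stated in the patent:

```python
import numpy as np

def quantize_weights_per_channel(W):
    """Per-channel int8 weight quantization (S102-S104).

    W: floating-point weights of shape (C, C_in, kH, kW).
    Returns the int8 fixed-point weights and per-channel scaling factors S^(C).
    """
    C = W.shape[0]
    th = np.abs(W.reshape(C, -1)).max(axis=1)  # length-C vector of per-channel maxima
    th = np.maximum(th, 1e-12)                 # guard against all-zero channels
    S = th / 127.0                             # equation (1): S^(C) = th_C / 127.0
    W_int8 = np.round(W / S[:, None, None, None])        # equation (2): round(W^(C) / S^(C))
    W_int8 = np.clip(W_int8, -127, 127).astype(np.int8)  # keep within the int8 range
    return W_int8, S
```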
S105 to S112 below perform the excitation (activation) quantization of the neural network, that is, the input floating-point data and output floating-point data of the network are quantized.
S105: and inputting setting data to the initial neural network for forward calculation.
For example, the set data is 1000 images, which are input into the initial neural network for forward calculation.
S106: and counting the maximum value of the input absolute value of the current convolution layer in the initial neural network, and acquiring a floating point quantization interval according to the maximum value of the input absolute value of the current convolution layer.
The maximum absolute value of the input of the current convolutional layer in the initial neural network is counted and recorded as max, and the floating-point quantization interval is calculated by the formula dist_scale = max / 2048.0, where dist_scale is the floating-point quantization interval. The quantization thresholds are subsequently searched over the interval [128, 2048].
S107: repeating the step S105, and constructing a histogram with a first set length by using the input floating point value of the current convolutional layer based on the floating point quantization interval.
The 1000 images are input into the initial neural network again for forward calculation, and a histogram with a first set length of 2048 is constructed from the input floating-point values of the current convolutional layer based on the floating-point quantization interval.
S108: and circularly traversing the set quantization threshold values in the floating point quantization interval, converting the histogram with the first set length into a histogram for each quantization threshold value, calculating KL divergence, and taking the quantization threshold value when the KL divergence is minimum as a target quantization threshold value.
In the quantization threshold th set for the round-robin vector in the floating-point quantization interval [128,2048], the histogram with the first set length of 2048 is changed to the histogram with the second set length of 128 for each quantization threshold th, KL divergence is calculated, and the quantization threshold with the smallest KL divergence is set as the target quantization threshold. In other words, the quantization threshold value when the KL divergence is the smallest is taken as the optimal threshold value target _ th.
The KL divergence (Kullback-Leibler divergence) is an asymmetric measure of the difference between two probability distributions.
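For reference (this formula is standard and not spelled out in the patent): for two distributions P and Q over the same bins, D_KL(P‖Q) = Σ_i P(i) · log(P(i) / Q(i)); here P is the reference histogram distribution and Q its quantized approximation.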
S109: and acquiring the input quantization coefficient of the current convolutional layer according to the target quantization threshold and the floating point quantization interval.
The input quantization coefficient is calculated by equation (3):

scale = (target_th + 0.5) × dist_scale / 127.0    equation (3)

wherein scale, denoted S_pre, represents the input quantization coefficient; target_th represents the target quantization threshold; dist_scale represents the floating-point quantization interval.
Through the above-mentioned S105 to S109, the input quantized coefficients of the current convolutional layer are obtained.
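A simplified sketch of S106 to S109 is given below, in the spirit of TensorRT-style entropy calibration; the exact redistribution of histogram bins is an assumption, since the patent only names the steps, and `hist` is assumed to have been populated by the repeated forward passes of S105 and S107:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes D_KL(p || q)

def find_input_quant_coeff(hist, dist_scale, target_bins=128):
    """Search the target quantization threshold by KL divergence (S107-S109).

    hist: length-2048 histogram of the layer's input floating-point values,
    built with bin width dist_scale. Returns the input quantization
    coefficient S_pre of equation (3).
    """
    best_th, best_kl = target_bins, float("inf")
    for th in range(target_bins, len(hist) + 1):  # traverse thresholds in [128, 2048]
        p = hist[:th].astype(np.float64).copy()
        p[-1] += hist[th:].sum()                  # fold the clipped tail into the last bin
        # Requantize th bins down to target_bins bins, then expand back to length th
        idx = np.arange(th) * target_bins // th
        coarse = np.bincount(idx, weights=p, minlength=target_bins)
        counts = np.bincount(idx, minlength=target_bins)
        q = coarse[idx] / counts[idx]             # spread each coarse bin uniformly
        kl = entropy(p, q)                        # D_KL(P || Q); entropy() normalizes both
        if kl < best_kl:
            best_kl, best_th = kl, th
    return (best_th + 0.5) * dist_scale / 127.0   # equation (3)
```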
Further, when the inputs and outputs of an activation function layer in the initial neural network are all positive, the signed 8-bit quantization range still includes negative numbers. This wastes quantization space and adds overhead for storing parameters.
Therefore, the activation function layer output is optimized to uint8 asymmetric quantization; that is, in S108 the floating-point quantization threshold interval becomes [256, 2048], histograms of length 256 are generated for the different set quantization thresholds, and the input quantization coefficient is finally divided by 255.0 instead. This avoids wasted quantization space and extra parameter storage overhead, and yields a further precision improvement.
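Adapting the calibration sketch after S109 to this variant would amount to searching thresholds in [256, 2048], building 256-bin candidate histograms, and dividing by 255.0 instead of 127.0 in equation (3); this is an inferred reading, as the patent gives no further detail on the asymmetric case.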
S110: and taking the next convolutional layer connected after the current convolutional layer as the current convolutional layer, repeating the steps from S105 to S109 to obtain the input quantization coefficient of the next convolutional layer, and taking the input quantization coefficient of the next convolutional layer as the output quantization coefficient of the current convolutional layer.
Wherein the input quantization coefficient of the next convolutional layer is the output quantization coefficient of the current convolutional layer, recorded as S_cur.
S111: and quantizing the input floating point data to obtain input fixed point data according to the input quantization coefficient of the current convolutional layer and the input floating point data of the current convolutional layer.
When the input I_f is not entirely non-negative, int8 quantization is carried out, and the input floating-point data is quantized by equation (4):

I_8bit = round(I_f / S_pre)    equation (4)

wherein I_8bit represents the input fixed-point data; I_f represents the input floating-point data; S_pre represents the input quantization coefficient.
That is, I_f is quantized to 8 bits based on S_pre. When the input I_f is a ReLU output, uint8 quantization is carried out instead, i.e., the same rounding formula is applied with the result represented as an unsigned 8-bit value in [0, 255].
S112: and quantizing the output floating point data to obtain output fixed point data according to the scaling factor, the input quantization coefficient of the current convolutional layer, the output quantization coefficient of the current convolutional layer and the output floating point data of the current convolutional layer.
The output floating-point data is quantized by equation (5):

O_8bit = round((S^(C) · S_pre · Σ{W_int8 · I_8bit} + Bias^(C)) / S_cur)    equation (5)

wherein O_8bit represents the output fixed-point data; S^(C) represents the scaling factor; W_int8 represents the 8-bit weight after quantization of the channel weight value; I_8bit represents the input fixed-point data; Bias^(C) represents the bias; S_pre represents the input quantization coefficient; S_cur represents the output quantization coefficient.
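A minimal sketch of equations (4) and (5) follows. The clipping to [-127, 127] and the int32 accumulator are assumptions beyond what the patent states, and the plain dot product stands in for the per-channel convolution accumulation Σ{W_int8 · I_8bit}:

```python
import numpy as np

def quantize_input(I_f, S_pre):
    """Equation (4): I_8bit = round(I_f / S_pre), stored as int8."""
    return np.clip(np.round(I_f / S_pre), -127, 127).astype(np.int8)

def quantize_output(W_int8, I_8bit, S_c, S_pre, bias_c, S_cur):
    """Equation (5) for one output element of one channel:
    O_8bit = round((S^(C) * S_pre * sum(W_int8 * I_8bit) + Bias^(C)) / S_cur).
    """
    acc = np.sum(W_int8.astype(np.int32) * I_8bit.astype(np.int32))  # integer accumulation
    O_f = S_c * S_pre * acc + bias_c        # recover the floating-point output
    return np.int8(np.clip(np.round(O_f / S_cur), -127, 127))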
S113 described below is the coefficient fixed-pointing process.
S113: converting the scaling factor S^(C) and the bias Bias^(C) into a scaling-factor fixed-point value and a bias fixed-point value, respectively.
Step S113 includes steps S201 to S206 described below.
S201: calculating the target scaling factor from the scaling factor, the input quantization coefficient and the output quantization coefficient; i.e., the target scaling factor is

S_target^(C) = S^(C) · S_pre / S_cur

S202: calculating the target bias from the bias and the output quantization coefficient; i.e., the target bias is

Bias_target^(C) = Bias^(C) / S_cur
The target scaling factor and the target bias are floating-point values; when they are deployed on a chip for calculation, they must be converted to fixed-point values. Below, the target scaling factor and the target bias are converted to 16-bit fixed-point, as follows.
S203: inputting the set data to the initial neural network for forward calculation.
S204: statistics (S)(c)*∑{Wint8*I8bit}+Bias(c)) Maximum absolute value of (D) is recorded as Maxrst
S205: statistical target scaling factor SCMax ofscale(ii) a For MaxscaleMaking a left shift fixed point of 16 bits to obtain a fixed point value q of a scaling factorscale
S206: statistical target BiasCMaximum value, denoted Maxbias(ii) a For Max (Max)rst,Maxbias) Making a 16-bit left shift fixed point to obtain a bias fixed point value qrst
Fixed-point and floating-point are both numeric representations: a fixed-point value stores a fixed number of integer and fractional digits, whereas a floating-point value stores a significand and an exponent. Fixed-point differs from floating-point in that the integer and fractional parts are separated at a fixed position.
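The patent names a "16-bit left-shift fixed-pointing" without defining it. The sketch below assumes the common reading: choose the largest left shift under which the governing maximum still fits in a signed 16-bit value, then round. The function name and return convention are assumptions:

```python
import numpy as np

def left_shift_fixed_point_16(x, max_abs):
    """Assumed form of the 16-bit left-shift fixed-pointing of S205/S206.

    max_abs: the statistic bounding |x| (Max_scale for q_scale,
    max(Max_rst, Max_bias) for q_rst). Returns the fixed-point integer
    and the shift needed to undo it on the hardware side.
    """
    shift = 15 - int(np.ceil(np.log2(max_abs)))  # headroom so max_abs * 2^shift fits in 16 bits
    q = int(np.round(x * (2.0 ** shift)))        # 2.0**shift also handles negative shifts
    return q, shift
```

Under these assumptions, q_scale would be obtained as left_shift_fixed_point_16(S_target, Max_scale)[0] and q_rst as left_shift_fixed_point_16(Bias_target, max(Max_rst, Max_bias))[0].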
After the weight quantization, excitation quantization and coefficient fixed-pointing above, the quantization of the neural network is complete.
S114: and acquiring a quantized neural network according to the low specific point weight, the input fixed point data, the output fixed point data, the scaling factor fixed point value and the bias fixed point value.
A quantized neural network structure built with this quantization method can be directly deployed on dedicated-chip hardware platforms such as field-programmable gate arrays (FPGA) or coarse-grained reconfigurable architectures (CGRA).
Table 1 below shows the quantization accuracy comparison of the neural network before and after quantization.
TABLE 1

Model     Accuracy before quantization    Accuracy after quantization
GRU-L     95.09%                          94.87%
LSTM-L    91.37%                          90.35%
The accuracy of GRU-L is 95.09% before quantization and 94.87% after quantization; the accuracy of LSTM-L is 91.37% before quantization and 90.35% after quantization. The quantization method therefore incurs only a small precision loss, ensuring that the quantized neural network retains high precision and accuracy.
The invention can deploy the quantized neural network directly on a chip. First, for any given floating-point neural network model, the quantization method of the invention can quantize the model directly; the quantized network model requires no retraining or similar operations, so the quantization efficiency is high.
Second, the quantization method quantizes the whole neural network model layer by layer over its common layer types. It has been verified that network structures such as GRU, LSTM and RNN retain high recognition accuracy, and the method can be extended to other network structures, giving it strong generality.
Finally, the neural network low-bit quantization method of the invention is intended for deploying the quantized model directly on a chip and has already been applied in a practical speech recognition product. The quantization results maintain high precision and accuracy, meet the recognition-rate requirements of real scenarios, and are very friendly to hardware platform development and deployment, so the method is of practical significance for broadening the application of neural network models in embedded devices.

Claims (8)

1. A neural network low bit quantization method, comprising:
s101: acquiring an initial neural network, and acquiring weight values and offsets of C channels in the initial neural network, wherein each channel comprises a convolutional layer;
s102: counting the maximum value of the input of C channels in the initial neural network;
s103: obtaining a scaling factor according to the input maximum value in the C channels;
s104: quantizing the weight value of each channel to a low-bit fixed-point weight according to the scaling factor and the weight value of each channel;
s105: inputting set data to the initial neural network for forward calculation;
s106: counting the maximum value of the input absolute value of the current convolution layer in the initial neural network, and acquiring a floating point quantization interval according to the maximum value of the input absolute value of the current convolution layer;
s107: repeating the step S105, and constructing a histogram with a first set length by using the input floating point value of the current convolutional layer based on the floating point quantization interval;
s108: circularly traversing the set quantization threshold values in the floating point quantization interval, converting the histogram with a first set length into a histogram with a second set length for each quantization threshold value, calculating KL divergence, and taking the quantization threshold value with the smallest KL divergence as a target quantization threshold value;
s109: acquiring an input quantization coefficient of the current convolution layer according to the target quantization threshold and the floating point quantization interval;
s110: taking the next convolutional layer connected after the current convolutional layer as the current convolutional layer, repeating the steps from S105 to S109 to obtain the input quantization coefficient of the next convolutional layer, and taking the input quantization coefficient of the next convolutional layer as the output quantization coefficient of the current convolutional layer;
s111: quantizing the input floating point data to obtain input fixed point data according to the input quantization coefficient of the current convolution layer and the input floating point data of the current convolution layer;
s112: quantizing the output floating point data to obtain output fixed point data according to the scaling factor, the input quantization coefficient of the current convolutional layer, the output quantization coefficient of the current convolutional layer and the output floating point data of the current convolutional layer;
s113: converting the scaling factor and the bias into a scaling factor fixed-point value and a bias fixed-point value, respectively;
s114: and acquiring a quantized neural network according to the low-bit fixed-point weights, the input fixed-point data, the output fixed-point data, the scaling-factor fixed-point value and the bias fixed-point value.
2. The neural network low bit quantization method of claim 1, wherein the initial neural network comprises a convolutional layer, a batch normalization layer, and an activation function layer;
the S101 includes:
eliminating the batch normalization layer to enable the network structure of the initial neural network to be a convolution layer and an activation function layer; and acquiring the weight values and the offsets of the C channels in the initial neural network after the batch normalization layer is eliminated.
3. The neural network low bit quantization method of claim 2, wherein obtaining the scaling factor according to the maximum value input in the C channels in S103 comprises:
obtaining a vector of length C from the input maxima of the C channels; calculating the scaling factor by equation (1);

S^(C) = th_C / 127.0    equation (1)

wherein S^(C) represents the scaling factor and th_C represents the vector of length C.
4. The neural network low bit quantization method of claim 3, wherein said S104 comprises:
quantizing the weight value of each channel to a low-bit fixed-point weight by equation (2);

W_int8 = round(W^(C) / S^(C))    equation (2)

wherein c ∈ [0, C); W_int8 represents the int8 weight value of the channel after quantization; round represents the rounding operation; W^(C) represents the weight value of the channel; S^(C) represents the scaling factor.
5. The neural network low bit quantization method of claim 4, wherein said S109 comprises:
calculating the input quantization coefficient by equation (3);

scale = (target_th + 0.5) × dist_scale / 127.0    equation (3)

wherein scale, denoted S_pre, represents the input quantization coefficient; target_th represents the target quantization threshold; dist_scale represents the floating-point quantization interval.
6. The neural network low bit quantization method of claim 5, wherein the input floating-point data in S111 is quantized by equation (4);

I_8bit = round(I_f / S_pre)    equation (4)

wherein I_8bit represents the input fixed-point data; I_f represents the input floating-point data; S_pre represents the input quantization coefficient;
and the output floating-point data in S112 is quantized by equation (5);

O_8bit = round((S^(C) · S_pre · Σ{W_int8 · I_8bit} + Bias^(C)) / S_cur)    equation (5)

wherein O_8bit represents the output fixed-point data; S^(C) represents the scaling factor; W_int8 represents the 8-bit weight after quantization of the channel weight value; I_8bit represents the input fixed-point data; Bias^(C) represents the bias; S_cur represents the output quantization coefficient.
7. The neural network low bit quantization method of claim 6, wherein said converting said scaling factor and said bias into a scaling factor fixed point value and a bias fixed point value in said S113 comprises:
calculating to obtain a target scaling factor according to the scaling factor, the input quantization coefficient and the output quantization coefficient;
calculating to obtain target bias according to the bias and the output quantization coefficient;
inputting set data to the initial neural network for forward calculation;
counting the maximum absolute value of (S^(C) · Σ{W_int8 · I_8bit} + Bias^(C)), recorded as Max_rst;
counting the maximum value of the target scaling factor, Max_scale; applying a 16-bit left-shift fixed-pointing to Max_scale to obtain the scaling-factor fixed-point value q_scale;
counting the maximum value of the target bias, recorded as Max_bias; applying a 16-bit left-shift fixed-pointing to max(Max_rst, Max_bias) to obtain the bias fixed-point value q_rst.
8. The neural network low bit quantization method of claim 5, wherein, in the case that the inputs and outputs of the activation function layer in the initial neural network are all positive values, the activation function layer output is optimized to uint8 asymmetric quantization; the floating-point quantization threshold interval is changed to [256, 2048], histograms of length 256 are generated for the different set quantization thresholds, and the input quantization coefficient is finally divided by 255.0.
CN202011057930.2A 2020-09-29 2020-09-29 Neural network low bit quantization method Pending CN112381205A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011057930.2A CN112381205A (en) 2020-09-29 2020-09-29 Neural network low bit quantization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011057930.2A CN112381205A (en) 2020-09-29 2020-09-29 Neural network low bit quantization method

Publications (1)

Publication Number Publication Date
CN112381205A (en) 2021-02-19

Family

ID=74580892

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011057930.2A Pending CN112381205A (en) 2020-09-29 2020-09-29 Neural network low bit quantization method

Country Status (1)

Country Link
CN (1) CN112381205A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011571A (en) * 2021-03-03 2021-06-22 华南理工大学 INT8 offline quantization and integer inference method based on Transformer model
CN113487014A (en) * 2021-07-05 2021-10-08 上海西井信息科技有限公司 Method and equipment for quantizing any bit based on semantic segmentation neural network model
CN113255901B (en) * 2021-07-06 2021-10-08 上海齐感电子信息科技有限公司 Real-time quantization method and real-time quantization system
CN113554149A (en) * 2021-06-18 2021-10-26 北京百度网讯科技有限公司 Neural network processing unit NPU, neural network processing method and device
CN113747155A (en) * 2021-09-06 2021-12-03 中国电信股份有限公司 Feature quantization method and device, encoder and communication system
CN113780523A (en) * 2021-08-27 2021-12-10 深圳云天励飞技术股份有限公司 Image processing method, image processing device, terminal equipment and storage medium
CN114781604A (en) * 2022-04-13 2022-07-22 广州安凯微电子股份有限公司 Coding method of neural network weight parameter, coder and neural network processor
CN116992965A (en) * 2023-09-27 2023-11-03 之江实验室 Reasoning method, device, computer equipment and storage medium of transducer large model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106575379A (en) * 2014-09-09 2017-04-19 英特尔公司 Improved fixed point integer implementations for neural networks
JP2019160319A (en) * 2018-03-09 2019-09-19 キヤノン株式会社 Method and device for optimizing and applying multi-layer neural network model, and storage medium
CN110363281A (en) * 2019-06-06 2019-10-22 上海交通大学 A kind of convolutional neural networks quantization method, device, computer and storage medium
CN110610237A (en) * 2019-09-17 2019-12-24 普联技术有限公司 Quantitative training method and device of model and storage medium
CN111260022A (en) * 2019-11-22 2020-06-09 中国电子科技集团公司第五十二研究所 Method for fixed-point quantization of complete INT8 of convolutional neural network
CN111461302A (en) * 2020-03-30 2020-07-28 杭州嘉楠耘智信息科技有限公司 Data processing method, device and storage medium based on convolutional neural network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106575379A (en) * 2014-09-09 2017-04-19 英特尔公司 Improved fixed point integer implementations for neural networks
JP2019160319A (en) * 2018-03-09 2019-09-19 キヤノン株式会社 Method and device for optimizing and applying multi-layer neural network model, and storage medium
CN110363281A (en) * 2019-06-06 2019-10-22 上海交通大学 A kind of convolutional neural networks quantization method, device, computer and storage medium
CN110610237A (en) * 2019-09-17 2019-12-24 普联技术有限公司 Quantitative training method and device of model and storage medium
CN111260022A (en) * 2019-11-22 2020-06-09 中国电子科技集团公司第五十二研究所 Method for fixed-point quantization of complete INT8 of convolutional neural network
CN111461302A (en) * 2020-03-30 2020-07-28 杭州嘉楠耘智信息科技有限公司 Data processing method, device and storage medium based on convolutional neural network

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011571A (en) * 2021-03-03 2021-06-22 华南理工大学 INT8 offline quantization and integer inference method based on Transformer model
CN113554149A (en) * 2021-06-18 2021-10-26 北京百度网讯科技有限公司 Neural network processing unit NPU, neural network processing method and device
CN113487014A (en) * 2021-07-05 2021-10-08 上海西井信息科技有限公司 Method and equipment for quantizing any bit based on semantic segmentation neural network model
CN113255901B (en) * 2021-07-06 2021-10-08 上海齐感电子信息科技有限公司 Real-time quantization method and real-time quantization system
CN113780523A (en) * 2021-08-27 2021-12-10 深圳云天励飞技术股份有限公司 Image processing method, image processing device, terminal equipment and storage medium
CN113780523B (en) * 2021-08-27 2024-03-29 深圳云天励飞技术股份有限公司 Image processing method, device, terminal equipment and storage medium
CN113747155A (en) * 2021-09-06 2021-12-03 中国电信股份有限公司 Feature quantization method and device, encoder and communication system
CN113747155B (en) * 2021-09-06 2022-08-19 中国电信股份有限公司 Characteristic quantization method and device, encoder and communication system
CN114781604A (en) * 2022-04-13 2022-07-22 广州安凯微电子股份有限公司 Coding method of neural network weight parameter, coder and neural network processor
CN114781604B (en) * 2022-04-13 2024-02-20 广州安凯微电子股份有限公司 Coding method of neural network weight parameters, coder and neural network processor
CN116992965A (en) * 2023-09-27 2023-11-03 之江实验室 Reasoning method, device, computer equipment and storage medium of transducer large model
CN116992965B (en) * 2023-09-27 2024-01-09 之江实验室 Reasoning method, device, computer equipment and storage medium of transducer large model

Similar Documents

Publication Publication Date Title
CN112381205A (en) Neural network low bit quantization method
Lee et al. Lognet: Energy-efficient neural networks using logarithmic computation
EP3276540B1 (en) Neural network method and apparatus
US10096134B2 (en) Data compaction and memory bandwidth reduction for sparse neural networks
CN109102064B (en) High-precision neural network quantization compression method
CN111612147A (en) Quantization method of deep convolutional network
EP3651069A1 (en) Data processing device, data processing method, and compressed data
US10491239B1 (en) Large-scale computations using an adaptive numerical format
CN112329922A (en) Neural network model compression method and system based on mass spectrum data set
WO2021135715A1 (en) Image compression method and apparatus
US20230300354A1 (en) Method and System for Image Compressing and Coding with Deep Learning
Wang et al. QGAN: Quantized generative adversarial networks
CN111240746B (en) Floating point data inverse quantization and quantization method and equipment
CN110874625A (en) Deep neural network quantification method and device
CN109978144B (en) Model compression method and system
CN112766484A (en) Floating point neural network model quantization system and method
US11531884B2 (en) Separate quantization method of forming combination of 4-bit and 8-bit data of neural network
CN111178427A (en) Depth self-coding embedded clustering method based on Sliced-Wasserstein distance
CN112652299B (en) Quantification method and device of time series speech recognition deep learning model
Ullah et al. L2L: A highly accurate Log_2_Lead quantization of pre-trained neural networks
CN101467459A (en) Restrained vector quantization
CN112613604A (en) Neural network quantification method and device
Enderich et al. Learning multimodal fixed-point weights using gradient descent
US20210132866A1 (en) Data processing device, method of operating the same, and program
JPWO2020049681A1 (en) Information processing equipment, methods and programs

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Zhang Shurui

Inventor after: OuYang Peng

Inventor before: Zhang Shurui

Inventor before: OuYang Peng

Inventor before: Yin Shouyi