CN111768002B - Deep neural network quantization method based on elastic significance - Google Patents

Deep neural network quantization method based on elastic significance

Info

Publication number
CN111768002B
CN111768002B CN202010661226.1A CN202010661226A
Authority
CN
China
Prior art keywords
quantization
elastic
significance
neural network
deep neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010661226.1A
Other languages
Chinese (zh)
Other versions
CN111768002A (en)
Inventor
龚成
卢冶
李涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nankai University
Original Assignee
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nankai University filed Critical Nankai University
Priority to CN202010661226.1A
Publication of CN111768002A
Application granted
Publication of CN111768002B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The invention provides a deep neural network quantization method based on elastic significant bits, which quantizes fixed-point or floating-point numbers into quantized values with an elastic number of significant bits, discards the redundant mantissa, and quantitatively evaluates the distribution difference between the quantized values and the original data through a tractable solving procedure. Because the quantized values have an elastic number of significant bits, their distribution can cover a family of bell-shaped distributions ranging from long-tailed to uniform, adapting to the weight/activation distributions of DNNs and thereby keeping the precision loss low. Multiplication can be realized in hardware as a multi-stage shift-and-add, which improves the overall efficiency of the quantized model. The distribution difference function quantitatively estimates the quantization loss caused by different quantization schemes, so the optimal scheme can be selected under different conditions, achieving lower quantization loss and improving the accuracy of the quantized model.

Description

Deep neural network quantization method based on elastic significance
Technical Field
The invention belongs to the technical field of deep neural network compression, and particularly relates to a deep neural network quantization method based on elastic significant bits.
Background
Deep neural network (DNN) quantization is an effective method for compressing DNNs and can markedly improve their computational efficiency, so that networks can be deployed on resource-constrained edge computing platforms. One of the most common ways to quantize DNNs is to project high-precision floating-point values onto low-precision quantized values. For example, a DNN with 32-bit floating-point weights can achieve 32x model-size compression by replacing each weight with a single bit, and complex multiplications can even be reduced to simple bit operations in hardware. Therefore, with fewer bits, DNN quantization can significantly reduce the computation scale or memory footprint and thereby improve computational efficiency.
However, quantization also causes a substantial loss of accuracy, mainly because replacing values with low-precision ones destroys the original distribution of the weights/activations in DNNs. For example, binarization converts values following an arbitrary distribution into a Bernoulli distribution, causing severe accuracy degradation. It therefore remains a challenge to make the distributions before and after quantization as close as possible while using as few bits as possible. Low-bit schemes such as binary/ternary quantization can achieve high computational efficiency, but their model accuracy is too low. Hence, researchers in recent years have focused more on multi-bit quantization, which can be roughly divided into two categories: linear quantization and nonlinear quantization.
Linear quantization: linear quantization maps data onto a set of uniformly spaced fixed-point values. DoReFa-Net first finds a scaling factor to scale the data into the target range and then applies a rounding operation to implement the projection. TSQ quantizes activations and weights in two steps: it first sets activations below a threshold to 0 and quantizes the remaining activations linearly, and then quantizes the weights to three values with a kernel-wise quantization. QIL introduces a trainable linear quantization process that is re-parameterized by optimizing the task loss. PACT attributes the difficulty of activation quantization to the lack of trainable parameters; it re-parameterizes the clipping parameter used for activation quantization to limit the value range, over which linear quantization is then applied. BCGD uses a simple step function with a single scaling factor to linearly quantize all weights and activations. DSQ proposes a soft, differentiable linear quantization process that addresses the non-differentiability of quantization by using a tanh function to progressively approximate the step function. The more recent μL2Q introduces scaling and shifting factors for weight transformation and then truncates and rounds the transformed data to discrete values. VecQ scales the weights with only a scaling factor into the target range and then truncates and rounds them to fixed-point values.
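The scale-then-round pattern shared by these linear methods can be illustrated with a short sketch (a generic Python illustration written for this description, not the implementation of any particular cited method; the clipping range and bit width are assumed parameters):

import numpy as np

def linear_quantize(x, bits=4, clip=1.0):
    # Generic linear quantization: scale into the target range, round to the
    # uniform grid, and clip to the representable signed fixed-point levels.
    levels = 2 ** (bits - 1) - 1        # largest signed level, e.g. 7 for 4 bits
    step = clip / levels                # uniform step size of the grid
    q = np.clip(np.round(x / step), -levels, levels)
    return q * step                     # dequantized, uniformly spaced values

For example, linear_quantize(np.array([0.03, -0.61, 0.95]), bits=4) maps each value to the nearest of the 15 uniformly spaced levels in [-1, 1].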
Nonlinear quantization: nonlinear quantization projects floating-point numbers onto low-precision values with non-uniform spacing. TTQ attempts to correct the accuracy degradation caused by ternarization and quantizes the weights to three values that are not centrally symmetric; to achieve this, it introduces two trainable scaling factors. To exploit the bit operations of binary quantization while preventing severe accuracy degradation, LQ-Nets and ABC-Nets quantize weights and activations as the sum of multiple binary quantization results, achieving a new trade-off between accuracy and efficiency. Logarithmic quantization attempts to map floating-point numbers to exponential values with different bases, for example 1.2 or 2.1. Power-of-two (PoT) quantization was found to be computationally efficient but performs poorly. INQ addresses the fast convergence of quantized networks by quantizing the weights into PoT values incrementally. ENN introduces the Alternating Direction Method of Multipliers (ADMM) into quantization to handle gradient propagation through the non-differentiable quantization; it also quantizes the weights to PoT values.
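The power-of-two idea behind these logarithmic schemes can be sketched as follows (a generic Python illustration, not any cited method; the exponent bounds are assumed parameters):

import numpy as np

def pot_quantize(x, min_exp=-6, max_exp=0):
    # Map each value to the nearest signed power of two within an assumed exponent range,
    # by rounding the exponent in the log2 domain.
    sign = np.sign(x)
    mag = np.clip(np.abs(x), 2.0 ** min_exp, 2.0 ** max_exp)
    exp = np.clip(np.round(np.log2(mag)), min_exp, max_exp)
    return sign * 2.0 ** exp

Such values make multiplication cheap (a single shift per operand), which is the efficiency advantage noted above, but the resulting long-tailed value distribution is also the source of the accuracy problems discussed next.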
The recent APoT improves accuracy by adding more complex quantization values to PoT, but the resulting quantization values actually follow a multi-peaked distribution, which still causes some accuracy degradation. Moreover, its complex quantization intervals introduce a complicated projection operation that becomes a new performance bottleneck, leading to very low computational efficiency. APoT quantizes a floating-point number as the sum of several PoT terms. For example, given b quantization bits, APoT first forms n groups; after 1 bit is reserved as the sign bit, (b-1) must be divisible by n, and each group contains 2^((b-1)/n) PoT values. The values in the groups are then combined to generate the final 2^b - 1 quantization values (the value 0 recurs). Fig. 1 shows the quantized values of different groupings for APoT with b = 5. APoT is equivalent to PoT quantization when n = 1 and to linear quantization when n = 4; when n = 2, the quantized values of APoT are multimodal. None of these distributions fits a bell-shaped distribution, yet in most cases DNN weights/activations are bell-shaped, which indicates that APoT cannot adapt to the vast majority of DNN quantization tasks and suffers a large accuracy drop. As shown in Fig. 2, the complex step function that APoT uses to project values for quantization is difficult to implement with simple scaling and rounding operations, so its time and space complexity reach O(n), which greatly reduces the computational efficiency of the quantized model.
Disclosure of Invention
To address the above problems in the prior art, the invention provides a deep neural network quantization method based on elastic significant bits. By varying the number of significant bits, the distribution of the quantization values can cover a family of bell-shaped distributions ranging from long-tailed to uniform, adapting to the weight/activation distributions of DNNs, ensuring low precision loss and improving the overall efficiency of the quantized model.
The technical solution adopted by the invention is as follows: a deep neural network quantization method based on elastic significant bits quantizes fixed-point or floating-point numbers into quantized values with an elastic number of significant bits and discards the redundant mantissa.
The elastic significant bits are a finite number of significant bits retained starting from the most significant bit; which bits are retained is determined by the fixed-point or floating-point number itself.
For a given value v whose most significant bit is at position n, k+1 significant bits are retained starting from that most significant bit, as follows:
v = a_n·2^n + a_(n-1)·2^(n-1) + … + a_(n-k)·2^(n-k) + a_(n-k-1)·2^(n-k-1) + …, where a_n = 1 and a_i ∈ {0, 1}
where bit positions (n-k) to n form the retained significant part, and positions from 0 (or -∞ for floating-point numbers) to (n-k-1) form the mantissa part that must be rounded off. A fixed-point or floating-point number is then quantized as:
P(v) = R(v >> (n-k)) << (n-k)
where >> and << are shift operations and R(·) is a rounding operation.
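As a minimal Python sketch of this projection (the helper name and the round-half-up convention are illustrative assumptions, not the patented implementation), the shift, round, and shift-back steps for a non-negative integer can be written as:

def esb_project(v, k):
    # Keep k+1 significant bits of a non-negative integer v and round off the mantissa,
    # i.e. P(v) = R(v >> (n-k)) << (n-k), with n the position of the most significant bit.
    if v == 0:
        return 0
    n = v.bit_length() - 1                       # position of the most significant bit
    shift = n - k                                # width of the mantissa to be rounded off
    if shift <= 0:
        return v                                 # v already has at most k+1 significant bits
    rounded = (v + (1 << (shift - 1))) >> shift  # R(v >> (n-k)) with round-half-up
    return rounded << shift                      # shift back to the original magnitude

With k = 3 (four retained significant bits), esb_project(91, 3) returns 88 and esb_project(92, 3) returns 96, matching the embodiments described below.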
Multiplication of values with elastic significant bits is implemented as a multi-stage shift-and-accumulate over the significant bits.
The number of significant bits determines the number of shift-and-add stages.
The distribution difference between the quantized values and the original data is evaluated quantitatively through a tractable solving procedure:
Let the quantized weights be W, sampled from a random variable t ~ p(t), and let Q be the set of all quantization values. The distribution difference function is defined as follows:
J(Q) = Σ_{q∈Q} ∫_S p(t)·(t - q)^2 dt, s.t. S = (q - q_l, q + q_u]
where (q - q_l, q + q_u] is the range of continuous data that is projected onto the value q; the range is centered at q, and q_l and q_u indicate its lower and upper extents.
The distribution difference is used to evaluate the optimal quantization under different numbers of elastic significant bits.
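Under this definition, the distribution difference of a candidate quantization set can be evaluated numerically, as in the following sketch (p(t) is taken as the standard normal density and the projection boundaries are placed midway between adjacent quantization values; both are illustrative assumptions):

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def distribution_difference(q_values):
    # J(Q) = sum over q of the integral of p(t) * (t - q)^2 over the interval projected onto q.
    q = np.sort(np.asarray(q_values, dtype=float))
    # Assumed projection boundaries: midpoints between neighbouring quantization values.
    edges = np.concatenate(([-np.inf], (q[:-1] + q[1:]) / 2.0, [np.inf]))
    total = 0.0
    for qi, lo, hi in zip(q, edges[:-1], edges[1:]):
        total += quad(lambda t, qi=qi: norm.pdf(t) * (t - qi) ** 2, lo, hi)[0]
    return total

A smaller value indicates that the candidate set better matches the original bell-shaped distribution, which is exactly the criterion used in the embodiments below to pick the number of significant bits.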
The working principle is as follows: the elastic significant bits produce quantization values with different distributions so as to improve model accuracy. What actually affects accuracy is how many of the given bits can be used to control the distribution of the quantization values so that it approximates the original distribution. In the present invention these bits are called significant bits, and the number of significant bits, i.e. the number of bits between the most significant bit and the least significant bit, determines the distribution of the quantization values. Thanks to the elastic design of the number of significant bits, the quantization values of this method can cover all possible bell-shaped data distributions from long-tailed to uniform, adapting to the bell-shaped weight/activation distributions of DNNs and thus effectively guaranteeing quantization accuracy. In addition, the invention designs a distribution difference function to estimate the quantization error, so as to find the most suitable quantization scheme, i.e. the most suitable number of significant bits.
An efficient quantization value set ensures the computational efficiency of the quantized model: because the quantization values have a limited number of significant bits, they contain fewer significant bits than contiguous fixed-point values with the same number of quantization bits, so multiplication can be realized with fewer shift-and-add stages, improving the computational efficiency of multiplications in the quantized model.
Fast projection function: compared with a complex step function, the method realizes fast projection from a floating-point value to a quantized value using only simple shift and rounding operations; the algorithm needs only O(1) time and space complexity, and a hardware implementation needs only one CPU clock cycle, so the projection efficiency is greatly improved.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention uses quantization values with elastic significant bits, whose distribution can cover a family of bell-shaped distributions from long-tailed to uniform by varying the significant bits, adapting to the weight/activation distributions of DNNs and ensuring low precision loss. Experiments show that when ResNet18 is quantized to 2 bits on ImageNet, the accuracy of the proposed elastic significant bit (ESB) quantization reaches 68.13%, an improvement of 1.03% over the 67.1% of APoT;
2. Multiplication of elastic-significant-bit values can be realized in hardware as multi-stage shift-and-add, where the number of stages depends on the number of significant bits;
3. The shift and rounding operations can be completed within one CPU clock cycle, so the projection algorithm achieves highly efficient projection;
4. The distribution difference function quantitatively estimates the quantization loss caused by different quantization schemes and can select the optimal scheme under different conditions, achieving lower quantization loss and improving the accuracy of the quantized model.
Drawings
FIG. 1 shows the multimodal distribution formed by prior-art APoT quantization;
FIG. 2 shows the complex step function caused by prior-art APoT;
FIG. 3 is a schematic diagram of a shift and rounding operation according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating multiplication by multi-stage shift-accumulate operations according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a shift and rounding operation according to another embodiment of the present invention;
FIG. 6 is a diagram illustrating multiplication by multi-stage shift-accumulate operations according to another embodiment of the present invention;
FIG. 7 is a schematic diagram illustrating a distribution variance solution according to an embodiment of the present invention;
FIG. 8 shows the quantization process of quantizing ResNet18 according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
Example one
The embodiment of the invention provides a deep neural network quantization method based on elastic significant bits, which quantizes fixed-point or floating-point numbers into quantized values with an elastic number of significant bits and discards the redundant mantissa. The elastic significant bits are a finite number of significant bits retained starting from the most significant bit.
For a given value v whose most significant bit is at position n, k+1 significant bits are retained starting from that most significant bit, as follows:
v = a_n·2^n + a_(n-1)·2^(n-1) + … + a_(n-k)·2^(n-k) + a_(n-k-1)·2^(n-k-1) + …, where a_n = 1 and a_i ∈ {0, 1}
where bit positions (n-k) to n form the retained significant part, and positions from 0 (or -∞ for floating-point numbers) to (n-k-1) form the mantissa part that must be rounded off. A fixed-point or floating-point number is then quantized as:
P(v) = R(v >> (n-k)) << (n-k)
where >> and << are shift operations and R(·) is a rounding operation.
As shown in FIG. 3, in this embodiment the value 91 is quantized with the elastic significant bits set to retain 4 significant bits; after two shift operations and one rounding operation, 91 is quantized to 88.
For the multiplication of 91 by 91, the operands become 88 and 88 after the shift and rounding operations, and the product is then computed by multi-stage shift accumulation; as shown in FIG. 4, the operation reduces to a shift-accumulation over four terms, which greatly improves computational efficiency. The embodiment can further improve efficiency by skipping the terms whose significant bit is 0, as illustrated in the sketch below.
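A minimal Python sketch of the shift-accumulate multiplication (illustrative only; it operates on the already quantized integers and skips zero bits, as the embodiment describes):

def shift_add_multiply(a, b):
    # Multiply two quantized integers using only shifts and additions.
    # Each nonzero bit of b contributes one shift-and-add stage.
    acc = 0
    for i in range(b.bit_length()):
        if (b >> i) & 1:          # skip positions whose bit is 0
            acc += a << i         # accumulate a shifted copy of a
    return acc

For the quantized operands of this embodiment, shift_add_multiply(88, 88) returns 7744 using three shift-and-add stages, since 88 = 0b1011000 has three nonzero bits among its four retained significant-bit positions.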
Example two
The embodiment of the invention provides a deep neural network quantization method based on elastic significant bits, which quantizes fixed-point or floating-point numbers into quantized values with an elastic number of significant bits and discards the redundant mantissa. The elastic significant bits are a finite number of significant bits retained starting from the most significant bit.
For a given value v whose most significant bit is at position n, k+1 significant bits are retained starting from that most significant bit, as follows:
v = a_n·2^n + a_(n-1)·2^(n-1) + … + a_(n-k)·2^(n-k) + a_(n-k-1)·2^(n-k-1) + …, where a_n = 1 and a_i ∈ {0, 1}
where bit positions (n-k) to n form the retained significant part, and positions from 0 (or -∞ for floating-point numbers) to (n-k-1) form the mantissa part that must be rounded off. A fixed-point or floating-point number is then quantized as:
P(v) = R(v >> (n-k)) << (n-k)
where >> and << are shift operations and R(·) is a rounding operation.
As shown in FIG. 5, in this embodiment the value 92 is quantized with the elastic significant bits set to retain 4 significant bits; after two shift operations and one rounding operation, 92 is quantized to 96.
For the multiplication of 92 by 92, the operands become 96 and 96 after the shift and rounding operations, and the product is then computed by multi-stage shift accumulation; as shown in FIG. 6, only two shift-accumulation terms are needed, which greatly improves computational efficiency.
Example three
The distribution difference between the quantized values and the original data is evaluated quantitatively through a tractable solving procedure:
Let the quantized weights be W, sampled from a random variable t ~ p(t), and let Q be the set of all quantization values. The distribution difference function is defined as follows:
J(Q) = Σ_{q∈Q} ∫_S p(t)·(t - q)^2 dt, s.t. S = (q - q_l, q + q_u]
where (q - q_l, q + q_u] is the range of continuous data that is projected onto the value q; the range is centered at q, and q_l and q_u indicate its lower and upper extents. A schematic of the solving process is shown in FIG. 7. The distribution difference can be used to evaluate the optimal quantization under different numbers of elastic significant bits.
Input: a DNN weight w_f sampled from a standard normal distribution N(0, 1), to be quantized to a low-precision 4-bit value;
Output: the optimal number of significant bits and the quantized weight w_q.
A. For a 4-bit quantized value, the possible numbers of significant bits are 1, 2 and 3; the corresponding quantization value sets are denoted Q_i, i ∈ {1, 2, 3}.
B. To align the distribution of the quantization values and thereby preserve accuracy, a scaling parameter λ is introduced. Randomly initialize λ_i ∈ [0, 3) and scale the weight w_f with λ_i into different distribution ranges so that it aligns with the distribution of the quantization values Q_i.
C. For each quantization value set Q_i, quantize the scaled weight with the fast quantization technique into a low-precision quantized weight w_qi represented only by values in Q_i.
D. Using the distribution difference function, compute the distribution difference J_i between the quantized weight based on Q_i and the original weight.
E. Compute the gradient of J_i with respect to λ_i, i.e. ∂J_i/∂λ_i, and search for the optimal solution λ_i* and the corresponding J_i* with a gradient descent algorithm; alternatively, solve ∂J_i/∂λ_i = 0 directly for the extremum to obtain a locally optimal solution λ_i* and the corresponding J_i*.
F. Obtain the optimal number of significant bits: i* = argmin_i J_i*.
G. Based on the quantization value set Q_i*, quantize the weight with the elastic significant bit quantization technique into the final low-precision quantized weight w_q.
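A compact Python sketch of steps A-G follows (the construction of the candidate sets Q_i, the grid search over λ in place of gradient descent, and the sampled form of the distribution difference are all simplifying assumptions made for illustration):

import numpy as np

def keep_significant_bits(v, sig_bits):
    # Same shift-and-round projection as above: keep at most sig_bits significant bits.
    if v == 0:
        return 0
    shift = v.bit_length() - sig_bits
    if shift <= 0:
        return v
    return ((v + (1 << (shift - 1))) >> shift) << shift

def candidate_set(sig_bits, total_bits=4):
    # Assumed construction of Q_i: signed magnitudes on a total_bits grid with at most
    # sig_bits significant bits, normalized into [-1, 1].
    mags = {keep_significant_bits(v, sig_bits) for v in range(1 << (total_bits - 1))}
    vals = sorted({s * m for m in mags for s in (1, -1)})
    top = max(abs(v) for v in vals)
    return np.array(vals, dtype=float) / top

def distribution_difference(w, q_set, lam):
    # Sampled stand-in for J: mean squared gap between the scaled weights and their
    # nearest quantization value, rescaled back to the original range.
    scaled = w / lam
    nearest = q_set[np.abs(scaled[:, None] - q_set[None, :]).argmin(axis=1)]
    return np.mean((scaled - nearest) ** 2) * lam ** 2

def best_significance(w, total_bits=4, lams=np.linspace(0.1, 3.0, 30)):
    # Steps A-F: evaluate each candidate number of significant bits i with its best lambda_i.
    best = None
    for i in range(1, total_bits):                     # possible significant bits: 1, 2, 3
        q_set = candidate_set(i, total_bits)
        j_i, lam_i = min((distribution_difference(w, q_set, l), l) for l in lams)
        if best is None or j_i < best[0]:
            best = (j_i, i, lam_i)
    return best[1], best[2]                            # optimal significance i* and its scale

# Step G (usage sketch): quantize weights drawn from N(0, 1) with the selected set.
w_f = np.random.randn(10000)
i_star, lam_star = best_significance(w_f)
q_set = candidate_set(i_star)
w_q = lam_star * q_set[np.abs(w_f[:, None] / lam_star - q_set[None, :]).argmin(axis=1)]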
Example four
Taking ResNet18 as an example, this embodiment quantizes all convolutional layers and fully connected layers of ResNet18 except the first and last layers; the quantization process is shown in FIG. 8. In the forward pass, for each layer to be quantized, the layer's weights and the activations from the previous layer are first quantized, through the efficient Elastic Significant Bit (ESB) projection process, into quantized values with a finite number of significant bits. The quantized values are then fed into a multiply-accumulator, and the multiplications over the finite significant bits are converted into shift-and-add operations based on those significant bits to improve the computational efficiency of the neural network. Finally, the layer's output is obtained by the accumulation operation. In the backward pass, this embodiment uses the straight-through estimator (STE) to propagate gradients through the quantization: in brief, the gradient of the network loss with respect to the quantized weights or activations is taken directly as the gradient with respect to the full-precision weights or activations before quantization, so that the network can be updated. Finally, experiments show that with the weights and activations of ResNet18 quantized to 2 bits on ImageNet, the accuracy reaches 68.13%.
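A minimal PyTorch sketch of this forward/backward scheme (the projection here is a simplified re-implementation following the formula above, the layer wiring is only indicated in comments, and the parameter names are assumptions rather than the patented implementation):

import torch

class ESBQuantSTE(torch.autograd.Function):
    # Forward: project each value onto a quantized value with a limited number of
    # significant bits. Backward: straight-through estimator (STE) - pass the
    # incoming gradient through unchanged, as described for the backward pass.
    @staticmethod
    def forward(ctx, x, sig_bits, frac_bits):
        mag = torch.round(x.abs() * (1 << frac_bits))               # fixed-point magnitude
        n = torch.floor(torch.log2(mag.clamp(min=1.0)))             # most-significant-bit position
        step = torch.pow(2.0, (n - (sig_bits - 1)).clamp(min=0.0))  # mantissa width to round off
        q = torch.round(mag / step) * step                          # keep sig_bits significant bits
        return torch.sign(x) * q / (1 << frac_bits)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None, None                              # STE: identity gradient for x

# Usage sketch for one layer of the forward pass:
# w_q = ESBQuantSTE.apply(layer.weight, 4, 8)
# a_q = ESBQuantSTE.apply(activation, 4, 8)
# out = torch.nn.functional.conv2d(a_q, w_q)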
The present invention has been described in detail with reference to the embodiments, but the description is only illustrative and should not be construed as limiting the scope of the invention, which is defined by the claims. Equivalent changes and modifications made by those skilled in the art based on the teaching of the technical solutions of the present invention, as well as equivalent technical solutions designed to achieve the above technical effects, also fall within the scope of the present invention.

Claims (6)

1. A deep neural network quantization method based on elastic significant bits, characterized in that: fixed-point or floating-point numbers are quantized into quantized values with an elastic number of significant bits, and the redundant mantissa is discarded;
the quantized values are input into a multiply-accumulator, the multiplications over the finite significant bits are converted into shift-and-add operations based on those significant bits, and the output of the layer is finally obtained by an accumulation operation;
and the distribution difference between the quantized values and the original data is evaluated quantitatively through a tractable solving procedure:
with the quantized weights W sampled from a random variable t ~ p(t) and Q the set of all quantization values, the distribution difference function is defined as follows:
J(Q) = Σ_{q∈Q} ∫_S p(t)·(t - q)^2 dt, s.t. S = (q - q_l, q + q_u]
where (q - q_l, q + q_u] is the range of continuous data projected onto the value q; the range is centered at q, and q_l and q_u indicate its lower and upper extents.
2. The elastic significance-based deep neural network quantization method of claim 1, wherein: the elastic significant bits are a finite number of significant bits retained starting from the most significant bit.
3. The elastic significance-based deep neural network quantization method of claim 1 or 2, wherein: for a given value v whose most significant bit is at position n, k+1 significant bits are retained starting from that most significant bit, as follows:
v = a_n·2^n + a_(n-1)·2^(n-1) + … + a_(n-k)·2^(n-k) + a_(n-k-1)·2^(n-k-1) + …, where a_n = 1 and a_i ∈ {0, 1}
where bit positions (n-k) to n form the retained significant part, and positions from 0 (or -∞ for floating-point numbers) to (n-k-1) form the mantissa part that must be rounded off; the fixed-point or floating-point number is quantized as:
P(v) = R(v >> (n-k)) << (n-k)
where >> and << are shift operations and R(·) is a rounding operation.
4. The elastic significance-based deep neural network quantization method of claim 3, wherein: multiplication of values with elastic significant bits is implemented as a multi-stage shift-and-accumulate over the significant bits.
5. The elastic significance-based deep neural network quantization method of claim 4, wherein: the number of significant bits determines the number of shift-and-add stages.
6. The elastic significance-based deep neural network quantization method of claim 3, wherein: the distribution difference is used to evaluate the optimal quantization under different numbers of elastic significant bits.
CN202010661226.1A 2020-07-10 2020-07-10 Deep neural network quantization method based on elastic significance Active CN111768002B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010661226.1A CN111768002B (en) 2020-07-10 2020-07-10 Deep neural network quantization method based on elastic significance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010661226.1A CN111768002B (en) 2020-07-10 2020-07-10 Deep neural network quantization method based on elastic significance

Publications (2)

Publication Number Publication Date
CN111768002A CN111768002A (en) 2020-10-13
CN111768002B true CN111768002B (en) 2021-06-22

Family

ID=72726684

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010661226.1A Active CN111768002B (en) 2020-07-10 2020-07-10 Deep neural network quantization method based on elastic significance

Country Status (1)

Country Link
CN (1) CN111768002B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018176000A1 (en) 2017-03-23 2018-09-27 DeepScale, Inc. Data synthesis for autonomous control systems
US11157441B2 (en) 2017-07-24 2021-10-26 Tesla, Inc. Computational array microprocessor system using non-consecutive data formatting
US11409692B2 (en) 2017-07-24 2022-08-09 Tesla, Inc. Vector computational unit
US10671349B2 (en) 2017-07-24 2020-06-02 Tesla, Inc. Accelerated mathematical engine
US11893393B2 (en) 2017-07-24 2024-02-06 Tesla, Inc. Computational array microprocessor system with hardware arbiter managing memory requests
US11561791B2 (en) 2018-02-01 2023-01-24 Tesla, Inc. Vector computational unit receiving data elements in parallel from a last row of a computational array
US11215999B2 (en) 2018-06-20 2022-01-04 Tesla, Inc. Data pipeline and deep learning system for autonomous driving
US11361457B2 (en) 2018-07-20 2022-06-14 Tesla, Inc. Annotation cross-labeling for autonomous control systems
US11636333B2 (en) 2018-07-26 2023-04-25 Tesla, Inc. Optimizing neural network structures for embedded systems
US11562231B2 (en) 2018-09-03 2023-01-24 Tesla, Inc. Neural networks for embedded devices
IL305330A (en) 2018-10-11 2023-10-01 Tesla Inc Systems and methods for training machine models with augmented data
US11196678B2 (en) 2018-10-25 2021-12-07 Tesla, Inc. QOS manager for system on a chip communications
US11816585B2 (en) 2018-12-03 2023-11-14 Tesla, Inc. Machine learning models operating at different frequencies for autonomous vehicles
US11537811B2 (en) 2018-12-04 2022-12-27 Tesla, Inc. Enhanced object detection for autonomous vehicles based on field view
US11610117B2 (en) 2018-12-27 2023-03-21 Tesla, Inc. System and method for adapting a neural network model on a hardware platform
US10997461B2 (en) 2019-02-01 2021-05-04 Tesla, Inc. Generating ground truth for machine learning from time series elements
US11567514B2 (en) 2019-02-11 2023-01-31 Tesla, Inc. Autonomous and user controlled vehicle summon to a target
US10956755B2 (en) 2019-02-19 2021-03-23 Tesla, Inc. Estimating object properties using visual image data

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9224089B2 (en) * 2012-08-07 2015-12-29 Qualcomm Incorporated Method and apparatus for adaptive bit-allocation in neural systems
CN108616927B (en) * 2018-04-04 2021-08-27 北京锐安科技有限公司 Data sending and receiving method and device
CN109711542B (en) * 2018-12-29 2020-08-18 西安交通大学 DNN accelerator supporting dynamic precision and implementation method thereof
CN110222821B (en) * 2019-05-30 2022-03-25 浙江大学 Weight distribution-based convolutional neural network low bit width quantization method
CN110363281A (en) * 2019-06-06 2019-10-22 上海交通大学 A kind of convolutional neural networks quantization method, device, computer and storage medium

Also Published As

Publication number Publication date
CN111768002A (en) 2020-10-13

Similar Documents

Publication Publication Date Title
CN111768002B (en) Deep neural network quantization method based on elastic significance
CN109472353B (en) Convolutional neural network quantization circuit and method
US11403528B2 (en) Self-tuning incremental model compression solution in deep neural network with guaranteed accuracy performance
EP3451164A1 (en) Neural network operation device and method supporting few-bit fixed-point number
KR20190051755A (en) Method and apparatus for learning low-precision neural network
KR20180043172A (en) Method and apparatus for neural network quantization
Zhao et al. Focused quantization for sparse CNNs
CN111985523A (en) Knowledge distillation training-based 2-exponential power deep neural network quantification method
CN110309904B (en) Neural network compression method
CN111310890B (en) Optimization method and device of deep learning model and terminal equipment
CN111178516A (en) Softmax function calculation method based on segmented lookup table and hardware system
CN109165006B (en) Design optimization and hardware implementation method and system of Softmax function
CN111382860A (en) Compression acceleration method of LSTM network and FPGA accelerator
US20210294874A1 (en) Quantization method based on hardware of in-memory computing and system thereof
CN110545162B (en) Multivariate LDPC decoding method and device based on code element reliability dominance degree node subset partition criterion
CN113902109A (en) Compression method and device for regular bit serial computation of neural network
KR102092634B1 (en) Low density parity check code decoder and method for decoding ldpc code
CN112686384A (en) Bit-width-adaptive neural network quantization method and device
CN116934487A (en) Financial clearing data optimal storage method and system
CN110955405A (en) Input data processing and index value obtaining method and device and electronic equipment
Kalali et al. A power-efficient parameter quantization technique for CNN accelerators
CN110619392A (en) Deep neural network compression method for embedded mobile equipment
JP7016559B1 (en) Error-biased approximation multiplier for normalized floating-point numbers and how to implement it
CN113779861B (en) Photovoltaic Power Prediction Method and Terminal Equipment
JP2012060210A (en) Method, apparatus and program for adaptive quantization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant