CN111950715A - 8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift - Google Patents

8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift

Info

Publication number
CN111950715A
Authority
CN
China
Prior art keywords
floating point
shift
layer
quantization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010859153.7A
Other languages
Chinese (zh)
Inventor
谢远东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202010859153.7A priority Critical patent/CN111950715A/en
Publication of CN111950715A publication Critical patent/CN111950715A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/048 Activation functions
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides an 8-bit integer full-quantization inference method and device based on adaptive dynamic shift. The method comprises the following steps: acquiring a trained floating-point model; obtaining the weight scale of each channel in the floating-point model; calculating the activation value of each layer in the floating-point model through KLD; based on the activation values, determining conversion factors for the layer-skip and convolution channel-shuffle operations of the floating-point model, and pre-storing all fixed-point values and shift values; and obtaining the fixed-point weight scales of the floating-point model from the quantization table and outputting an integer result based on the weights. Full fixed-point quantization per channel greatly reduces the error of converting floating point to fixed point; the inference process involves no floating-point operations, only fixed-point shift operations; whether the result error after full quantization of the model meets the requirements of an artificial-intelligence chip can be verified; the shift is adapted dynamically, avoiding the overflow error caused by a fixed shift; and the intermediate values are optimized from int32 to int8, further reducing on-chip memory.

Description

8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift
Technical Field
One or more embodiments of the present disclosure relate to the field of Convolutional Neural Networks (CNNs), and in particular, to an 8-bit integer full-quantization inference method and apparatus based on adaptive dynamic shift.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
CNNs (convolutional neural networks) achieve superior results in fields such as image classification, object detection, and face recognition. However, because of the complexity of the network structure and its computation latency, realizing real-time forward inference of CNNs on embedded platforms with relatively limited storage and computing resources requires reducing the size of the neural network model and improving its computational efficiency while keeping the accuracy loss under control.
In the prior art, each layer of the neural network is uniformly quantized: the weights and activation values are quantized, and fixed-point multiply-add is then performed to achieve acceleration.
This technique has the following problems:
first, uniform quantization is applied to the activation values; although the computation cost is low, the quantization error is so large that the result is hardly usable;
second, the weights are quantized layer by layer, and for a multi-channel convolutional layer the resulting error is far larger than that of per-channel quantization;
third, the prior art still needs to perform floating-point computation during inference, so it cannot run on artificial-intelligence chips that support only fixed-point operations.
Disclosure of Invention
In view of this, one or more embodiments of the present disclosure describe an 8-bit integer full-quantization inference method based on adaptive dynamic shift, which can solve the fixed-point performance verification problem for AI chips and further reduce on-chip memory.
The technical scheme provided by one or more embodiments of the specification is as follows:
in order to solve the above problems, in a first aspect, the present invention provides an 8-bit integer full-quantization inference method based on adaptive dynamic shift, including:
acquiring a trained floating-point model;
calculating the weight scale of each layer in the floating-point model channel by channel;
calculating the activation value of each layer in the floating-point model;
based on the weight scales and the activation values, determining conversion factors for the layer-skip and convolution channel-shuffle operations of the floating-point model, and pre-storing all fixed-point values and shift values in a quantization table;
based on the quantization table, each layer accepts int8 quantized inputs and generates int8 quantized outputs.
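For orientation only, the five steps above can be sketched in Python as follows; the dictionary-based model representation, the max-based stand-in for the KLD activation calibration, and all names are assumptions for illustration, not the patented implementation:

```python
import numpy as np

def build_quant_table(weights, calib_activations):
    """Hypothetical driver for the five steps listed above (illustration only).

    `weights` maps layer name -> floating-point weight array (output channels first);
    `calib_activations` maps layer name -> activations collected on a calibration set.
    A simple max-based activation scale stands in here for the KLD search of step 3.
    """
    table = {}
    for name, w in weights.items():                        # step 1: trained float model
        # Step 2: per-channel weight scale, 127 / max|w| of each output channel.
        w_scale = 127.0 / np.abs(w.reshape(w.shape[0], -1)).max(axis=1)
        # Step 3 (stand-in): per-layer activation scale from the calibration data.
        a_max = float(np.abs(calib_activations[name]).max())
        a_scale = 127.0 / a_max
        # Step 4: adaptive shift for the activations (formula given further below),
        # pre-stored together with the fixed-point scales.
        shift = int(8 - np.ceil(np.log2(a_max)))
        table[name] = {"weight_scale": w_scale, "act_scale": a_scale, "shift": shift}
    # Step 5: inference then runs every layer on int8 inputs/outputs using this table.
    return table
```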
In one embodiment, the weight scale of each layer in the floating-point model is computed channel by channel by the following formula:
scale_weight = 127 / x_max
where x_max is the maximum value of the channel's weights.
In an embodiment, the calculating the activation value of each layer in the floating-point model specifically includes:
preparing a calibration data set;
initializing an activation value distribution from the calibration data set;
performing distribution normalization processing to obtain the threshold corresponding to the minimum KL divergence;
and solving the activation value of each layer in the floating point model based on the threshold.
In one embodiment, the normalization process specifically includes:
setting the number of blocks target_bins to 128, and performing the following processing for each threshold in [target_bins, 2048]:
summing the distribution over [threshold, 2048] into threshold_sum, adding threshold_sum to the original distribution at bin threshold, and calling the new distribution the P matrix;
setting the resampling interval to threshold/target_bins and resampling the original distribution, which gives a distribution with the same dimensionality as the P matrix, called the Q matrix;
according to the formula
D_KL(p||q) = Σ_x p(x) · log(p(x) / q(x)),
obtaining the threshold corresponding to the minimum KL divergence; where p is the target distribution, q is the quantized distribution to be matched, and D_KL(p||q) denotes the KL divergence, which measures the similarity of the two distributions p and q and is also referred to as the information loss of q relative to p.
In an embodiment, the obtaining the activation value of each layer in the floating-point model based on the threshold specifically includes:
calculating the activation value according to the formula 127/((threshold + 0.5) × interval), where interval is the sampling interval.
In one embodiment, based on the weights and activation values, conversion factors are determined for the layer-skip and convolution channel-shuffle operations of the floating-point model, and all fixed-point values and shift values are pre-stored in a quantization table, specifically:
converting the floating-point scales to fixed point;
converting the weights of the floating-point model to fixed point by shifting;
and pre-storing all fixed-point values and shift values in binary form.
In one embodiment, the fixed-point value and the shift value are obtained by the following formulas:
Shift = 8 - log2(A_max)
A_int8 = Int(A_float / pow(2, -Shift))
where Shift is the shift value; log2(A_max) is the base-2 logarithm of the maximum activation value; A_int8 is the 8-bit activation value; A_float is the floating-point activation value; and pow(2, -Shift) is 2 raised to the power of -Shift.
In one embodiment, based on the quantization table, each layer accepting an int8 quantized input and generating an int8 quantized output specifically includes:
converting the input activation values of each layer to fixed point by shifting;
multiply-accumulating the fixed-point inputs with the fixed-point weights;
and outputting an integer result.
In a second aspect, the present invention provides an 8-bit integer full-quantization inference device based on adaptive dynamic shift, the device comprising:
an obtaining module configured to obtain the trained floating point model;
a weight module configured to compute a weight for each layer in the floating-point model per channel;
an activation value module configured to calculate an activation value for each layer in the floating-point model;
a quantization table module configured to determine conversion factors for the layer-skip and convolution channel-shuffle operations of the floating-point model based on the weight scales and the activation values, and pre-store all fixed-point values and shift values in a quantization table;
an int8 quantization output module configured to accept an int8 type quantization input for each layer and generate an int8 quantization output based on the quantization table.
In a third aspect, the present invention provides a computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed in a computer, causes the computer to perform the method of the first aspect.
With the method provided by the embodiment of the invention, full fixed-point quantization per channel greatly reduces the error of converting floating point to fixed point; the inference process involves no floating-point operations, only fixed-point shift operations; whether the result error after full quantization of the model meets the requirements of an artificial-intelligence chip can be verified; the shift is adapted dynamically, avoiding the overflow error caused by a fixed shift; and the intermediate values are optimized from int32 to int8, further reducing on-chip memory.
Drawings
FIG. 1 is a schematic flow diagram of the overall invention;
fig. 2 is a schematic flow chart of an 8-bit integer full-quantization inference method based on adaptive dynamic shift according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a process for calculating activation values for each layer in a floating-point model;
FIG. 4 is a flow chart illustrating one process for determining a conversion factor;
FIG. 5 is a second flowchart illustrating a process for determining a conversion factor;
fig. 6 is a schematic structural diagram of an 8-bit integer full-quantization inference device based on adaptive dynamic shift according to an embodiment of the present invention.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be further noted that, for the convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
For the convenience of understanding the present invention, the following part of the specialized vocabulary is explained:
Split is a replication operation; Cube stands for a replicated data block;
Conv is an ordinary convolution;
DwConv is a depthwise separable convolution;
BinaryOp is an element-wise matrix/vector operation;
ShuffleChannel is the convolution channel shuffle operation;
right_int8 is the right-hand 8-bit input in FIG. 4;
left_int8 is the left-hand 8-bit input in FIG. 4;
Factor is a conversion factor (the name itself carries no particular meaning); Factor_float is the floating-point factor;
scale_top is the output scale (the assumed value);
scale_top_real is the real output scale;
top represents the output and bottom the input;
scale_layer_float is the layer's floating-point scale value; the meaning of a scale is: int8 × scale = the layer's floating-point value. scale_activate is the activation-value scale; scale_weight is the weight scale;
weight_int8 is an 8-bit weight;
GEMM_int8 is an 8-bit general matrix multiplication;
bottom_int8 is an 8-bit input;
bias_int32 is a 32-bit integer bias;
top_int8 is an 8-bit output;
Slice is the slicing operation, i.e., taking the first slice dimensions.
FIG. 1 is a schematic general flow chart of the present invention. As shown in FIG. 1, the method includes two parts: before inference and during inference. Before inference, the floating-point values are mainly converted to fixed point; during inference, the fixed-point values are shifted adaptively and dynamically and an integer result is output.
The following describes the specific implementation steps of the above method with reference to specific examples. FIG. 2 is a schematic flow chart of an 8-bit integer full-quantization inference method based on adaptive dynamic shift according to an embodiment of the present invention. Combining FIG. 1 and FIG. 2, steps 10 to 40 in FIG. 2 are the processing performed before inference, and step 50 obtains the output integer result during inference.
the method comprises the following steps:
Step 10: acquiring the trained floating-point model.
Step 20: calculating the weight scale of each layer in the floating-point model channel by channel.
Specifically, the weight scale of each layer in the floating-point model is calculated channel by channel with the following formula:
scale_weight = 127 / x_max
where x_max is the maximum value of the channel's weights. The word scale here has no special meaning beyond denoting this number; numerically, int8 × scale = float, i.e., the product of the layer's fixed-point value and the scale is the layer's original floating-point value, where float denotes the floating-point value.
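As an illustrative sketch of this per-channel computation (taking x_max as the maximum absolute weight of each output channel is an assumption consistent with symmetric quantization to the range [-127, 127]):

```python
import numpy as np

def per_channel_weight_scale(weight):
    """Weight scale per output channel: scale_weight = 127 / x_max."""
    flat = np.abs(weight.reshape(weight.shape[0], -1))      # one row per output channel
    x_max = flat.max(axis=1)                                # channel-wise maximum
    return 127.0 / x_max

# Example: quantize a conv weight tensor channel by channel to int8.
w = np.random.randn(16, 3, 3, 3).astype(np.float32)         # (out_ch, in_ch, kh, kw)
scale = per_channel_weight_scale(w)                         # shape (16,)
w_int8 = np.clip(np.round(w * scale[:, None, None, None]), -127, 127).astype(np.int8)
```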
Step 30: calculating the activation value of each layer in the floating-point model.
FIG. 3 is a schematic flowchart of the process of calculating the activation value of each layer in the floating-point model. The present invention uses the KLD (Kullback-Leibler divergence) algorithm to calculate the activation value scale of each layer in the neural network. Specifically, as shown in FIG. 3, the process includes the following steps:
step 301, prepare a calibration data set.
Step 302, initializing an activation value distribution from the calibration data set.
For each activation value, the number of bins is set to 2048 and the sampling interval is set to x_max/bins; the activation value distribution is then initialized from the calibration data set, i.e., the number of samples falling into each interval is collected, using symmetric quantization.
Step 303: performing distribution normalization processing to obtain the threshold corresponding to the minimum KL divergence.
Specifically, the number of blocks target_bins is set to 128, and the following processing is performed for each threshold in [target_bins, 2048]:
summing the distribution over [threshold, 2048] into threshold_sum, adding threshold_sum to the original distribution at bin threshold, and calling the new distribution the P matrix;
setting the resampling interval to threshold/target_bins and resampling the original distribution, which gives a distribution with the same dimensionality as the P matrix, called the Q matrix;
according to the formula
D_KL(p||q) = Σ_x p(x) · log(p(x) / q(x)),
obtaining the threshold corresponding to the minimum KL divergence; where p is the target distribution, q is the quantized distribution to be matched, and D_KL(p||q) denotes the KL divergence, which measures the similarity of the two distributions p and q and is also referred to as the information loss of q relative to p.
Step 304: solving the activation value of each layer in the floating-point model based on the threshold.
The activation value is calculated according to the formula 127/((threshold + 0.5) × interval), where interval is the sampling interval.
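A compact sketch of steps 301 to 304 follows. Folding the outlier counts into the last retained bin of P and spreading each resampled chunk of Q over its non-zero positions follow common KLD-calibration practice and are assumptions where the description above is ambiguous:

```python
import numpy as np

def kld_threshold(hist, target_bins=128):
    """Return the histogram threshold with minimum KL divergence (step 303)."""
    bins = len(hist)                                   # e.g. 2048 bins over [0, x_max]
    best_kl, best_t = float("inf"), target_bins
    for threshold in range(target_bins, bins):
        # P matrix: keep the first `threshold` bins, fold the tail into the last kept bin.
        p = hist[:threshold].astype(np.float64)
        p[-1] += hist[threshold:].sum()
        # Q matrix: resample the first `threshold` bins into target_bins chunks,
        # then expand each chunk back over its non-zero positions so P and Q align.
        q = np.zeros(threshold, dtype=np.float64)
        for chunk in np.array_split(np.arange(threshold), target_bins):
            nonzero = chunk[hist[chunk] > 0]
            if len(nonzero):
                q[nonzero] = hist[chunk].sum() / len(nonzero)
        p /= p.sum()
        q /= max(q.sum(), 1e-12)
        mask = p > 0
        kl = float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], 1e-12))))
        if kl < best_kl:
            best_kl, best_t = kl, threshold
    return best_t

def activation_scale(act_samples, bins=2048, target_bins=128):
    """Step 304: activation scale = 127 / ((threshold + 0.5) * interval)."""
    x_max = np.abs(act_samples).max()
    interval = x_max / bins                            # symmetric quantization (step 302)
    hist, _ = np.histogram(np.abs(act_samples), bins=bins, range=(0.0, x_max))
    threshold = kld_threshold(hist, target_bins)
    return 127.0 / ((threshold + 0.5) * interval)
```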
Step 40: based on the weight scales and activation values, determining conversion factors for the layer-skip and convolution channel-shuffle operations of the floating-point model, and pre-storing all fixed-point values and shift values in a quantization table.
Specifically, as shown in FIG. 1, this step includes: converting the floating-point scales to fixed point; converting the weights of the floating-point model to fixed point by shifting; and pre-storing all fixed-point values and shift values in binary form. This is described in detail below with reference to the drawings:
Specifically, the fixed-point value and the shift value are obtained through the following formulas:
Shift = 8 - log2(A_max), A_int8 = Int(A_float / pow(2, -Shift))    (1)
where Shift is the shift value; log2(A_max) is the base-2 logarithm of the maximum activation value; A_int8 is the 8-bit activation value; A_float is the floating-point activation value; and pow(2, -Shift) is 2 raised to the power of -Shift.
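A small sketch of formula (1) follows. How log2(A_max) is rounded and how values are clipped to the int8 range are not specified above, so the ceiling and the clipping below are assumptions:

```python
import numpy as np

def adaptive_shift(a_max):
    """Shift = 8 - log2(A_max); the shift adapts to the dynamic range of each tensor."""
    return int(8 - np.ceil(np.log2(a_max)))             # rounding choice is an assumption

def to_fixed_point(a_float, shift):
    """A_int8 = Int(A_float / pow(2, -Shift)), i.e. A_float * 2**Shift, clipped to int8."""
    a_int = np.round(a_float / (2.0 ** -shift))
    return np.clip(a_int, -128, 127).astype(np.int8)    # values near A_max are clipped

# Example: convert a floating-point activation tensor with its own dynamic shift.
a = np.random.uniform(-0.8, 0.8, size=(1, 8)).astype(np.float32)
s = adaptive_shift(np.abs(a).max())                     # larger range -> smaller shift
a_int8 = to_fixed_point(a, s)
```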
FIG. 4 is a schematic flow chart of one process for determining a conversion factor. As shown in FIG. 4, when the floating-point model contains a Split (replication) operation, the left and right paths of the replicated Split output are identical; after full int8 quantization, however, the left and right paths are int8 cubes with different scales. BinaryOp is the element-wise addition of the left and right cubes. To enable direct addition, S2 and S3 are made equal, and S2, S3 are chosen as the scale of the first conv/dwconv input after the BinaryOp. Since the scale of the cube input to the Split operation at the top of the figure is S1, the Split operation during fixed-point processing must compute, from one cube and its scale, the cube corresponding to the other scale. Therefore,
[equation defining Factor_float from S1 and S2, shown only as an image in the original publication]
At this time,
[second equation, shown only as an image in the original publication]
The fixed-point values and shifts of the factors are obtained by the formula (1) and stored in the split layer.
FIG. 5 is a second schematic flow chart of the process for determining a conversion factor. As shown in FIG. 5, when a ShuffleChannel occurs after the convolution operation, followed by a Slice operation, and such layers are chained consecutively, it is difficult to track scale_top because the ShuffleChannel rearranges the data by channel; therefore the maximum scale_top over multiple layers is used for the fitting,
[equation defining Factor_float for this case, shown only as an image in the original publication]
the fixed-point value and shift of the factor are obtained by equation (1) and stored in the Slice layer.
Apart from the above two cases, for the other conv/convdw layers the scale_activate of the first conv/convdw is used as scale_top, and then let
[equation defining Factor_float for this case, shown only as an image in the original publication]
The factor fixed-point value and the shift are obtained by the formula (1) and stored in the conv/convdw layer.
Step 50: based on the quantization table, each layer receives int8 quantized input and generates int8 quantized output.
During inference, the quantization table is read to obtain the fixed-point scales, and for each layer:
for conv/convdw/innerproduct layers, the fixed-point weight_int8 and scale_layer_int8 are read, and the multiply-accumulate is performed according to the formula GEMM_int8 = ((weight_int8 × bottom_int8) + bias_int32) >> Shift × scale_layer_int8, obtaining top_int8 as the bottom_int8 of the next layer;
for the Split layer in step 40, top_left_int8 = bottom_int8 and top_right_int8 = (bottom_int8 × Factor_int8) >> Shift, obtaining top_int8 as the bottom_int8 of the next layer;
for the Slice layer in step 40, top_left_int8 = bottom_int8[0:slice] and top_right_int8 = (bottom_int8[slice:] × Factor_int8) >> Shift, obtaining top_int8 as the bottom_int8 of the next layer.
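The three layer types above can be sketched at inference time as follows; the exact order in which the right shift and the fixed-point layer scale are applied, and the clipping back to int8, are assumptions about the intended evaluation order:

```python
import numpy as np

def _to_int8(x):
    return np.clip(x, -128, 127).astype(np.int8)

def conv_like_int8(bottom_int8, weight_int8, bias_int32, shift, scale_layer_int8):
    """conv/convdw/innerproduct: multiply-accumulate in int32, shift, rescale to int8."""
    acc = weight_int8.astype(np.int32) @ bottom_int8.astype(np.int32) + bias_int32
    return _to_int8((acc >> shift) * scale_layer_int8)

def split_int8(bottom_int8, factor_int8, shift):
    """Split: the left branch passes through; the right branch is rescaled by the
    pre-stored fixed-point factor and shift."""
    top_left = bottom_int8
    top_right = _to_int8((bottom_int8.astype(np.int32) * factor_int8) >> shift)
    return top_left, top_right

def slice_int8(bottom_int8, slice_point, factor_int8, shift):
    """Slice: the first slice_point entries pass through; the rest are rescaled."""
    top_left = bottom_int8[:slice_point]
    top_right = _to_int8((bottom_int8[slice_point:].astype(np.int32) * factor_int8) >> shift)
    return top_left, top_right
```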
According to the method provided by the embodiment of the invention, full fixed-point quantization per channel greatly reduces the error of converting floating point to fixed point; the inference process involves no floating-point operations, only fixed-point shift operations; whether the result error after full quantization of the model meets the requirements of an artificial-intelligence chip can be verified; and the shift is adapted dynamically, avoiding the overflow error caused by a fixed shift. Because both top and bottom are int8, i.e., the input and output of every layer are int8, the intermediate values are also int8; by applying the method provided by the invention, the intermediate values are optimized from int32 to int8, further reducing on-chip memory.
Corresponding to the above method, an embodiment of the present specification further provides an 8-bit integer full-quantization inference device based on adaptive dynamic shift, as shown in fig. 6, where the device includes: an acquisition module 601, a weight module 602, an activation value module 603, a quantization table module 604, and an int8 quantization output module 605.
An obtaining module 601 configured to obtain the trained floating point model;
a weight module 602 configured to compute a weight for each layer in the floating-point model per channel;
an activation value module 603 configured to calculate an activation value for each layer in the floating-point model;
a quantization table module 604 configured to determine conversion factors for the layer-skip and convolution channel-shuffle operations of the floating-point model based on the weight scales and the activation values, and pre-store all fixed-point values and shift values in a quantization table;
an int8 quantization output module 605 configured to accept int8 type quantization inputs for each layer and generate an int8 quantization output based on the quantization table.
The functions executed by each component in the apparatus provided in the embodiment of the present invention have been described in detail in the above-mentioned method, and therefore, redundant description is not repeated here.
Corresponding to the above embodiments, the present invention provides a system, which includes a memory and a processor, where the memory stores executable codes, and the processor executes the executable codes to implement the method described in the above embodiments.
Corresponding to the above embodiment, an embodiment of the present invention further provides a chip, where the chip is coupled to the memory in the system, so that the chip calls the program instructions stored in the memory when running, so as to implement the method described in the above embodiment.
In correspondence with the above-described embodiments, the present invention provides a computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the above-described method.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above embodiments are provided to further explain the objects, technical solutions and advantages of the present invention in detail, it should be understood that the above embodiments are merely exemplary embodiments of the present invention and are not intended to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. An 8-bit integer full-quantization inference method based on adaptive dynamic shift is characterized by comprising the following steps:
acquiring a trained floating point model;
calculating the weight scale of each layer in the floating-point model channel by channel;
calculating an activation value of each layer in the floating-point model;
based on the weight scales and the activation values, determining conversion factors for the layer-skip and convolution channel-shuffle operations of the floating-point model, and pre-storing all fixed-point values and shift values in a quantization table;
based on the quantization table, each layer accepts int8 quantized inputs and generates int8 quantized outputs.
2. The method of claim 1, wherein the weight scale of each layer in the floating-point model is computed channel by channel by the following formula:
scale_weight = 127 / x_max
where x_max is the maximum value of the channel's weights.
3. The method according to claim 1, wherein the calculating the activation value for each layer in the floating-point model is specifically:
preparing a calibration data set;
initializing an activation value distribution from the calibration data set;
performing distribution normalization processing to obtain the threshold corresponding to the minimum KL divergence;
and solving the activation value of each layer in the floating point model based on the threshold.
4. The method according to claim 3, wherein the normalization process is specifically:
setting the number of blocks target_bins to 128, and performing the following processing for each threshold in [target_bins, 2048]:
summing the distribution over [threshold, 2048] into threshold_sum, adding threshold_sum to the original distribution at bin threshold, and calling the new distribution the P matrix;
setting the resampling interval to threshold/target_bins and resampling the original distribution, which gives a distribution with the same dimensionality as the P matrix, called the Q matrix;
according to the formula
D_KL(p||q) = Σ_x p(x) · log(p(x) / q(x)),
obtaining the threshold corresponding to the minimum KL divergence; where p is the target distribution, q is the quantized distribution to be matched, and D_KL(p||q) denotes the KL divergence, which measures the similarity of the two distributions p and q and is also referred to as the information loss of q relative to p.
5. The method according to claim 3, wherein the activation value of each layer in the floating-point model is obtained based on the threshold, specifically:
calculating the activation value according to the formula 127/((threshold + 0.5) × interval), where interval is the sampling interval.
6. The method according to claim 1, wherein, based on the weight scales and the activation values, conversion factors are determined for the layer-skip and convolution channel-shuffle operations of the floating-point model, and all fixed-point values and shift values are pre-stored in a quantization table, specifically:
converting the floating-point scales to fixed point;
converting the weights of the floating-point model to fixed point by shifting;
and pre-storing all fixed-point values and shift values in binary form.
7. The method of claim 6, wherein the fixed-point value and the shift value are obtained by the following formulas:
Shift = 8 - log2(A_max)
A_int8 = Int(A_float / pow(2, -Shift))
where Shift is the shift value; log2(A_max) is the base-2 logarithm of the maximum activation value; A_int8 is the 8-bit activation value; A_float is the floating-point activation value; and pow(2, -Shift) is 2 raised to the power of -Shift.
8. The method according to claim 1, wherein, based on the quantization table, each layer accepting an int8 quantized input and generating an int8 quantized output specifically includes:
converting the input activation values of each layer to fixed point by shifting;
multiply-accumulating the fixed-point inputs with the fixed-point weights;
and outputting an integer result.
9. An 8-bit integer full-quantization inference device based on adaptive dynamic shift, the device comprising:
an obtaining module configured to obtain the trained floating point model;
a weight module configured to compute a weight for each layer in the floating-point model per channel;
an activation value module configured to calculate an activation value for each layer in the floating-point model;
a quantization table module configured to determine conversion factors for the layer-skip and convolution channel-shuffle operations of the floating-point model based on the weight scales and the activation values, and pre-store all fixed-point values and shift values in a quantization table;
an int8 quantization output module configured to accept an int8 type quantization input for each layer and generate an int8 quantization output based on the quantization table.
10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed in a computer, causes the computer to perform the method of any one or more of claims 1-8.
CN202010859153.7A 2020-08-24 2020-08-24 8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift Pending CN111950715A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010859153.7A CN111950715A (en) 2020-08-24 2020-08-24 8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010859153.7A CN111950715A (en) 2020-08-24 2020-08-24 8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift

Publications (1)

Publication Number Publication Date
CN111950715A true CN111950715A (en) 2020-11-17

Family

ID=73359841

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010859153.7A Pending CN111950715A (en) 2020-08-24 2020-08-24 8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift

Country Status (1)

Country Link
CN (1) CN111950715A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389219A (en) * 2017-08-04 2019-02-26 三星电子株式会社 The method and apparatus quantified for the parameter to neural network
US10579383B1 (en) * 2018-05-30 2020-03-03 Facebook, Inc. Systems and methods for efficient scaling of quantized integers
CN111260022A (en) * 2019-11-22 2020-06-09 中国电子科技集团公司第五十二研究所 Method for fixed-point quantization of complete INT8 of convolutional neural network

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112558887A (en) * 2020-12-25 2021-03-26 北京百度网讯科技有限公司 Vector quantization method, device and equipment for multimedia data processing
CN112558887B (en) * 2020-12-25 2023-09-22 北京百度网讯科技有限公司 Vector quantization method, device and equipment for multimedia data processing
CN113469324A (en) * 2021-03-23 2021-10-01 中科创达软件股份有限公司 Model dynamic quantization method and device, electronic equipment and computer readable medium
CN113469324B (en) * 2021-03-23 2024-03-22 中科创达软件股份有限公司 Model dynamic quantization method, device, electronic equipment and computer readable medium
WO2023060959A1 (en) * 2021-10-13 2023-04-20 山东浪潮科学研究院有限公司 Neural network model quantification method, system and device, and computer-readable medium
CN114821660A (en) * 2022-05-12 2022-07-29 山东浪潮科学研究院有限公司 Pedestrian detection inference method based on embedded equipment
WO2024031989A1 (en) * 2022-08-11 2024-02-15 山东浪潮科学研究院有限公司 Memory optimization method and system for deep learning reasoning of embedded device

Similar Documents

Publication Publication Date Title
CN111950715A (en) 8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift
CN110413255B (en) Artificial neural network adjusting method and device
CN110097172B (en) Convolutional neural network data processing method and device based on Winograd convolutional operation
CN110555450A (en) Face recognition neural network adjusting method and device
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
CN111783961A (en) Activation fixed point fitting-based convolutional neural network post-training quantization method and system
CN112686382B (en) Convolution model lightweight method and system
CN110265002B (en) Speech recognition method, speech recognition device, computer equipment and computer readable storage medium
CN110175641A (en) Image-recognizing method, device, equipment and storage medium
CN113610232A (en) Network model quantization method and device, computer equipment and storage medium
CN111695671A (en) Method and device for training neural network and electronic equipment
CN113011571A (en) INT8 offline quantization and integer inference method based on Transformer model
CN112733964A (en) Convolutional neural network quantification method for reinforcement learning automatic perception weight distribution
US11531884B2 (en) Separate quantization method of forming combination of 4-bit and 8-bit data of neural network
CN110837890A (en) Weight value fixed-point quantization method for lightweight convolutional neural network
CN112116061A (en) Weight and activation value quantification method for long-term and short-term memory network
US20240071070A1 (en) Algorithm and method for dynamically changing quantization precision of deep-learning network
CN114943335A (en) Layer-by-layer optimization method of ternary neural network
Jing et al. The optimisation of speech recognition based on convolutional neural network
Rajagopal et al. Accurate and efficient fixed point inference for deep neural networks
CN112613604A (en) Neural network quantification method and device
CN114444688A (en) Neural network quantization method, apparatus, device, storage medium, and program product
US20220207346A1 (en) Data processing method and device used in neural network
CN114386469A (en) Method and device for quantizing convolutional neural network model and electronic equipment
CN114298291A (en) Model quantization processing system and model quantization processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination