CN111950715A - 8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift - Google Patents

8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift

Info

Publication number
CN111950715A
Authority
CN
China
Prior art keywords
floating point
shift
layer
quantization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010859153.7A
Other languages
Chinese (zh)
Inventor
谢远东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202010859153.7A priority Critical patent/CN111950715A/en
Publication of CN111950715A publication Critical patent/CN111950715A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/048 Activation functions
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides an 8-bit integer full-quantization inference method and device based on adaptive dynamic shift. The method comprises the following steps: acquiring a trained floating-point model; obtaining the weight scale of each channel in the floating-point model; calculating the activation value of each layer in the floating-point model through KLD; based on the activation values, determining conversion factors for the layer-skip and convolution channel-shuffle operations of the floating-point model, and pre-storing all fixed-point values and shift values; and obtaining the fixed-point weight scales of the floating-point model from the quantization table and outputting an integer result based on the weights. Full fixed-point quantization per channel greatly reduces the error of converting floating point to fixed point; the inference process involves no floating-point operations, only fixed-point shift operations; whether the result error after full quantization of the model meets the requirements of an artificial-intelligence chip can be verified; the shift is adapted dynamically, avoiding the overflow error caused by a fixed shift; and the intermediate values are optimized from int32 to int8, further reducing on-chip memory.

Description

8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift
Technical Field
One or more embodiments of the present disclosure relate to the field of Convolutional Neural Networks (CNNs), and in particular, to an 8-bit integer full-quantization inference method and apparatus based on adaptive dynamic shift.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
CNNs (convolutional neural networks) achieve superior results in fields such as image classification, object detection, and face recognition. However, because of the complexity of the network structure and its computation latency, realizing real-time forward inference of CNNs on embedded platforms with relatively limited storage and computing resources requires reducing the size of the neural network model and improving its computational efficiency while keeping the accuracy loss under control.
In the prior art, each layer of the neural network is uniformly quantized: the weights and activation values are quantized, and fixed-point multiply-add is then performed to achieve acceleration.
This technique has the following problems:
first, uniform quantization is applied to the activation values; although the computation cost is low, the quantization error is so large that the result is hardly usable;
second, the weights are quantized layer by layer, and for a multi-channel convolutional layer the resulting error is far larger than that of per-channel quantization;
third, the prior art still needs to perform floating-point computation during inference, so it cannot run on artificial-intelligence chips that support only fixed-point operations.
Disclosure of Invention
In view of this, one or more embodiments of the present disclosure describe an 8-bit integer full-quantization inference method based on adaptive dynamic shift, which can solve the fixed-point performance verification problem for AI chips and further reduce on-chip memory.
The technical scheme provided by one or more embodiments of the specification is as follows:
in order to solve the above problems, in a first aspect, the present invention provides an 8-bit integer full-quantization inference method based on adaptive dynamic shift, including:
acquiring a trained floating-point model;
calculating the weight scale of each layer in the floating-point model channel by channel;
calculating the activation value of each layer in the floating-point model;
based on the weight scales and the activation values, determining conversion factors for the layer-skip and convolution channel-shuffle operations of the floating-point model, and pre-storing all fixed-point values and shift values in a quantization table;
based on the quantization table, each layer accepts int8 quantized inputs and generates int8 quantized outputs.
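For orientation only, the five steps above can be sketched in Python as follows; the dictionary-based model representation, the max-based stand-in for the KLD activation calibration, and all names are assumptions for illustration, not the patented implementation:

```python
import numpy as np

def build_quant_table(weights, calib_activations):
    """Hypothetical driver for the five steps listed above (illustration only).

    `weights` maps layer name -> floating-point weight array (output channels first);
    `calib_activations` maps layer name -> activations collected on a calibration set.
    A simple max-based activation scale stands in here for the KLD search of step 3.
    """
    table = {}
    for name, w in weights.items():                        # step 1: trained float model
        # Step 2: per-channel weight scale, 127 / max|w| of each output channel.
        w_scale = 127.0 / np.abs(w.reshape(w.shape[0], -1)).max(axis=1)
        # Step 3 (stand-in): per-layer activation scale from the calibration data.
        a_max = float(np.abs(calib_activations[name]).max())
        a_scale = 127.0 / a_max
        # Step 4: adaptive shift for the activations (formula given further below),
        # pre-stored together with the fixed-point scales.
        shift = int(8 - np.ceil(np.log2(a_max)))
        table[name] = {"weight_scale": w_scale, "act_scale": a_scale, "shift": shift}
    # Step 5: inference then runs every layer on int8 inputs/outputs using this table.
    return table
```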
In one embodiment, the weight scale of each layer in the floating-point model is computed channel by channel by the following formula:
scale_weight = 127 / x_max
where x_max is the maximum value of the channel's weights.
In an embodiment, the calculating the activation value of each layer in the floating-point model specifically includes:
preparing a calibration data set;
initializing an activation value distribution from the calibration data set;
performing distribution normalization processing to obtain the threshold corresponding to the minimum KL divergence;
and solving the activation value of each layer in the floating point model based on the threshold.
In one embodiment, the normalization process specifically includes:
setting the number of blocks target_bins to 128, and performing the following processing for each threshold in [target_bins, 2048]:
summing the distribution over [threshold, 2048] into threshold_sum, adding threshold_sum to the original distribution at bin threshold, and calling the new distribution the P matrix;
setting the resampling interval to threshold/target_bins and resampling the original distribution, which gives a distribution with the same dimensionality as the P matrix, called the Q matrix;
according to the formula
D_KL(p||q) = Σ_x p(x) · log(p(x) / q(x)),
obtaining the threshold corresponding to the minimum KL divergence; where p is the target distribution, q is the quantized distribution to be matched, and D_KL(p||q) denotes the KL divergence, which measures the similarity of the two distributions p and q and is also referred to as the information loss of q relative to p.
In an embodiment, the obtaining the activation value of each layer in the floating-point model based on the threshold specifically includes:
calculating the activation value according to the formula 127/((threshold + 0.5) × interval), where interval is the sampling interval.
In one embodiment, based on the weights and activation values, conversion factors are determined for the layer-skip and convolution channel-shuffle operations of the floating-point model, and all fixed-point values and shift values are pre-stored in a quantization table, specifically:
converting the floating-point scales to fixed point;
converting the weights of the floating-point model to fixed point by shifting;
and pre-storing all fixed-point values and shift values in binary form.
In one embodiment, the fixed-point value and the shift value are obtained by the following formulas:
Shift = 8 - log2(A_max)
A_int8 = Int(A_float / pow(2, -Shift))
where Shift is the shift value; log2(A_max) is the base-2 logarithm of the maximum activation value; A_int8 is the 8-bit activation value; A_float is the floating-point activation value; and pow(2, -Shift) is 2 raised to the power of -Shift.
In one embodiment, based on the quantization table, each layer accepting an int8 quantized input and generating an int8 quantized output specifically includes:
converting the input activation values of each layer to fixed point by shifting;
multiply-accumulating the fixed-point inputs with the fixed-point weights;
and outputting an integer result.
In a second aspect, the present invention provides an 8-bit integer full-quantization inference device based on adaptive dynamic shift, the device comprising:
an obtaining module configured to obtain the trained floating point model;
a weight module configured to compute a weight for each layer in the floating-point model per channel;
an activation value module configured to calculate an activation value for each layer in the floating-point model;
a quantization table module configured to determine conversion factors for the layer-skip and convolution channel-shuffle operations of the floating-point model based on the weight scales and the activation values, and pre-store all fixed-point values and shift values in a quantization table;
an int8 quantization output module configured to accept an int8 type quantization input for each layer and generate an int8 quantization output based on the quantization table.
In a third aspect, the present invention provides a computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed in a computer, causes the computer to perform the method of the first aspect.
With the method provided by the embodiment of the invention, full fixed-point quantization per channel greatly reduces the error of converting floating point to fixed point; the inference process involves no floating-point operations, only fixed-point shift operations; whether the result error after full quantization of the model meets the requirements of an artificial-intelligence chip can be verified; the shift is adapted dynamically, avoiding the overflow error caused by a fixed shift; and the intermediate values are optimized from int32 to int8, further reducing on-chip memory.
Drawings
FIG. 1 is a schematic flow diagram of the overall invention;
fig. 2 is a schematic flow chart of an 8-bit integer full-quantization inference method based on adaptive dynamic shift according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a process for calculating activation values for each layer in a floating-point model;
FIG. 4 is a flow chart illustrating one process for determining a conversion factor;
FIG. 5 is a second flowchart illustrating a process for determining a conversion factor;
fig. 6 is a schematic structural diagram of an 8-bit integer full-quantization inference device based on adaptive dynamic shift according to an embodiment of the present invention.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be further noted that, for the convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
For the convenience of understanding the present invention, the following part of the specialized vocabulary is explained:
Split is a replication operation; Cube stands for a replicated data block;
Conv is an ordinary convolution;
DwConv is a depthwise separable convolution;
BinaryOp is an element-wise matrix/vector operation;
ShuffleChannel is the convolution channel shuffle operation;
right_int8 is the right-hand 8-bit input in FIG. 4;
left_int8 is the left-hand 8-bit input in FIG. 4;
Factor is a conversion factor (the name itself carries no particular meaning); Factor_float is the floating-point factor;
scale_top is the output scale (the assumed value);
scale_top_real is the real output scale;
top represents the output and bottom the input;
scale_layer_float is the layer's floating-point scale value; the meaning of a scale is: int8 × scale = the layer's floating-point value. scale_activate is the activation-value scale; scale_weight is the weight scale;
weight_int8 is an 8-bit weight;
GEMM_int8 is an 8-bit general matrix multiplication;
bottom_int8 is an 8-bit input;
bias_int32 is a 32-bit integer bias;
top_int8 is an 8-bit output;
Slice is the slicing operation, i.e., taking the first slice dimensions.
FIG. 1 is a schematic general flow chart of the present invention. As shown in FIG. 1, the method includes two parts: before inference and during inference. Before inference, the floating-point values are mainly converted to fixed point; during inference, the fixed-point values are shifted adaptively and dynamically and an integer result is output.
The following describes the specific implementation steps of the above method with reference to specific examples. FIG. 2 is a schematic flow chart of an 8-bit integer full-quantization inference method based on adaptive dynamic shift according to an embodiment of the present invention. Combining FIG. 1 and FIG. 2, steps 10 to 40 in FIG. 2 are the processing performed before inference, and step 50 obtains the output integer result during inference.
the method comprises the following steps:
Step 10: acquiring the trained floating-point model.
Step 20: calculating the weight scale of each layer in the floating-point model channel by channel.
Specifically, the weight scale of each layer in the floating-point model is calculated channel by channel with the following formula:
scale_weight = 127 / x_max
where x_max is the maximum value of the channel's weights. The word scale here has no special meaning beyond denoting this number; numerically, int8 × scale = float, i.e., the product of the layer's fixed-point value and the scale is the layer's original floating-point value, where float denotes the floating-point value.
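As an illustrative sketch of this per-channel computation (taking x_max as the maximum absolute weight of each output channel is an assumption consistent with symmetric quantization to the range [-127, 127]):

```python
import numpy as np

def per_channel_weight_scale(weight):
    """Weight scale per output channel: scale_weight = 127 / x_max."""
    flat = np.abs(weight.reshape(weight.shape[0], -1))      # one row per output channel
    x_max = flat.max(axis=1)                                # channel-wise maximum
    return 127.0 / x_max

# Example: quantize a conv weight tensor channel by channel to int8.
w = np.random.randn(16, 3, 3, 3).astype(np.float32)         # (out_ch, in_ch, kh, kw)
scale = per_channel_weight_scale(w)                         # shape (16,)
w_int8 = np.clip(np.round(w * scale[:, None, None, None]), -127, 127).astype(np.int8)
```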
Step 30: calculating the activation value of each layer in the floating-point model.
FIG. 3 is a schematic flowchart of the process of calculating the activation value of each layer in the floating-point model. The present invention uses the KLD (Kullback-Leibler divergence) algorithm to calculate the activation value scale of each layer in the neural network. Specifically, as shown in FIG. 3, the process includes the following steps:
step 301, prepare a calibration data set.
Step 302, initializing an activation value distribution from the calibration data set.
For each activation value, the number of bins is set to 2048 and the sampling interval is set to x_max/bins; the activation value distribution is then initialized from the calibration data set, i.e., the number of samples falling into each interval is collected, using symmetric quantization.
Step 303: performing distribution normalization processing to obtain the threshold corresponding to the minimum KL divergence.
Specifically, the number of blocks target_bins is set to 128, and the following processing is performed for each threshold in [target_bins, 2048]:
summing the distribution over [threshold, 2048] into threshold_sum, adding threshold_sum to the original distribution at bin threshold, and calling the new distribution the P matrix;
setting the resampling interval to threshold/target_bins and resampling the original distribution, which gives a distribution with the same dimensionality as the P matrix, called the Q matrix;
according to the formula
D_KL(p||q) = Σ_x p(x) · log(p(x) / q(x)),
obtaining the threshold corresponding to the minimum KL divergence; where p is the target distribution, q is the quantized distribution to be matched, and D_KL(p||q) denotes the KL divergence, which measures the similarity of the two distributions p and q and is also referred to as the information loss of q relative to p.
Step 304: solving the activation value of each layer in the floating-point model based on the threshold.
The activation value is calculated according to the formula 127/((threshold + 0.5) × interval), where interval is the sampling interval.
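A compact sketch of steps 301 to 304 follows. Folding the outlier counts into the last retained bin of P and spreading each resampled chunk of Q over its non-zero positions follow common KLD-calibration practice and are assumptions where the description above is ambiguous:

```python
import numpy as np

def kld_threshold(hist, target_bins=128):
    """Return the histogram threshold with minimum KL divergence (step 303)."""
    bins = len(hist)                                   # e.g. 2048 bins over [0, x_max]
    best_kl, best_t = float("inf"), target_bins
    for threshold in range(target_bins, bins):
        # P matrix: keep the first `threshold` bins, fold the tail into the last kept bin.
        p = hist[:threshold].astype(np.float64)
        p[-1] += hist[threshold:].sum()
        # Q matrix: resample the first `threshold` bins into target_bins chunks,
        # then expand each chunk back over its non-zero positions so P and Q align.
        q = np.zeros(threshold, dtype=np.float64)
        for chunk in np.array_split(np.arange(threshold), target_bins):
            nonzero = chunk[hist[chunk] > 0]
            if len(nonzero):
                q[nonzero] = hist[chunk].sum() / len(nonzero)
        p /= p.sum()
        q /= max(q.sum(), 1e-12)
        mask = p > 0
        kl = float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], 1e-12))))
        if kl < best_kl:
            best_kl, best_t = kl, threshold
    return best_t

def activation_scale(act_samples, bins=2048, target_bins=128):
    """Step 304: activation scale = 127 / ((threshold + 0.5) * interval)."""
    x_max = np.abs(act_samples).max()
    interval = x_max / bins                            # symmetric quantization (step 302)
    hist, _ = np.histogram(np.abs(act_samples), bins=bins, range=(0.0, x_max))
    threshold = kld_threshold(hist, target_bins)
    return 127.0 / ((threshold + 0.5) * interval)
```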
Step 40: based on the weight scales and activation values, determining conversion factors for the layer-skip and convolution channel-shuffle operations of the floating-point model, and pre-storing all fixed-point values and shift values in a quantization table.
Specifically, as shown in FIG. 1, this step includes: converting the floating-point scales to fixed point; converting the weights of the floating-point model to fixed point by shifting; and pre-storing all fixed-point values and shift values in binary form. This is described in detail below with reference to the drawings:
Specifically, the fixed-point value and the shift value are obtained through the following formulas:
Shift = 8 - log2(A_max), A_int8 = Int(A_float / pow(2, -Shift))    (1)
where Shift is the shift value; log2(A_max) is the base-2 logarithm of the maximum activation value; A_int8 is the 8-bit activation value; A_float is the floating-point activation value; and pow(2, -Shift) is 2 raised to the power of -Shift.
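A small sketch of formula (1) follows. How log2(A_max) is rounded and how values are clipped to the int8 range are not specified above, so the ceiling and the clipping below are assumptions:

```python
import numpy as np

def adaptive_shift(a_max):
    """Shift = 8 - log2(A_max); the shift adapts to the dynamic range of each tensor."""
    return int(8 - np.ceil(np.log2(a_max)))             # rounding choice is an assumption

def to_fixed_point(a_float, shift):
    """A_int8 = Int(A_float / pow(2, -Shift)), i.e. A_float * 2**Shift, clipped to int8."""
    a_int = np.round(a_float / (2.0 ** -shift))
    return np.clip(a_int, -128, 127).astype(np.int8)    # values near A_max are clipped

# Example: convert a floating-point activation tensor with its own dynamic shift.
a = np.random.uniform(-0.8, 0.8, size=(1, 8)).astype(np.float32)
s = adaptive_shift(np.abs(a).max())                     # larger range -> smaller shift
a_int8 = to_fixed_point(a, s)
```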
FIG. 4 is a schematic flow chart of one process for determining a conversion factor. As shown in FIG. 4, when the floating-point model contains a Split (replication) operation, the left and right paths of the replicated Split output are identical; after full int8 quantization, however, the left and right paths are int8 cubes with different scales. BinaryOp is the element-wise addition of the left and right cubes. To enable direct addition, S2 and S3 are made equal, and S2, S3 are chosen as the scale of the first conv/dwconv input after the BinaryOp. Since the scale of the cube input to the Split operation at the top of the figure is S1, the Split operation during fixed-point processing must compute, from one cube and its scale, the cube corresponding to the other scale. Therefore,
[equation defining Factor_float from S1 and S2, shown only as an image in the original publication]
At this time,
[second equation, shown only as an image in the original publication]
The fixed-point values and shifts of the factors are obtained by the formula (1) and stored in the split layer.
FIG. 5 is a second schematic flow chart of the process for determining a conversion factor. As shown in FIG. 5, when a ShuffleChannel occurs after the convolution operation, followed by a Slice operation, and such layers are chained consecutively, it is difficult to track scale_top because the ShuffleChannel rearranges the data by channel; therefore the maximum scale_top over multiple layers is used for the fitting,
[equation defining Factor_float for this case, shown only as an image in the original publication]
the fixed-point value and shift of the factor are obtained by equation (1) and stored in the Slice layer.
Apart from the above two cases, for the other conv/convdw layers the scale_activate of the first conv/convdw is used as scale_top, and then let
[equation defining Factor_float for this case, shown only as an image in the original publication]
The factor fixed-point value and the shift are obtained by the formula (1) and stored in the conv/convdw layer.
Step 50: based on the quantization table, each layer receives int8 quantized input and generates int8 quantized output.
During inference, the quantization table is read to obtain the fixed-point scales, and for each layer:
for conv/convdw/innerproduct layers, the fixed-point weight_int8 and scale_layer_int8 are read, and the multiply-accumulate is performed according to the formula GEMM_int8 = ((weight_int8 × bottom_int8) + bias_int32) >> Shift × scale_layer_int8, obtaining top_int8 as the bottom_int8 of the next layer;
for the Split layer in step 40, top_left_int8 = bottom_int8 and top_right_int8 = (bottom_int8 × Factor_int8) >> Shift, obtaining top_int8 as the bottom_int8 of the next layer;
for the Slice layer in step 40, top_left_int8 = bottom_int8[0:slice] and top_right_int8 = (bottom_int8[slice:] × Factor_int8) >> Shift, obtaining top_int8 as the bottom_int8 of the next layer.
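The three layer types above can be sketched at inference time as follows; the exact order in which the right shift and the fixed-point layer scale are applied, and the clipping back to int8, are assumptions about the intended evaluation order:

```python
import numpy as np

def _to_int8(x):
    return np.clip(x, -128, 127).astype(np.int8)

def conv_like_int8(bottom_int8, weight_int8, bias_int32, shift, scale_layer_int8):
    """conv/convdw/innerproduct: multiply-accumulate in int32, shift, rescale to int8."""
    acc = weight_int8.astype(np.int32) @ bottom_int8.astype(np.int32) + bias_int32
    return _to_int8((acc >> shift) * scale_layer_int8)

def split_int8(bottom_int8, factor_int8, shift):
    """Split: the left branch passes through; the right branch is rescaled by the
    pre-stored fixed-point factor and shift."""
    top_left = bottom_int8
    top_right = _to_int8((bottom_int8.astype(np.int32) * factor_int8) >> shift)
    return top_left, top_right

def slice_int8(bottom_int8, slice_point, factor_int8, shift):
    """Slice: the first slice_point entries pass through; the rest are rescaled."""
    top_left = bottom_int8[:slice_point]
    top_right = _to_int8((bottom_int8[slice_point:].astype(np.int32) * factor_int8) >> shift)
    return top_left, top_right
```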
According to the method provided by the embodiment of the invention, full fixed-point quantization per channel greatly reduces the error of converting floating point to fixed point; the inference process involves no floating-point operations, only fixed-point shift operations; whether the result error after full quantization of the model meets the requirements of an artificial-intelligence chip can be verified; and the shift is adapted dynamically, avoiding the overflow error caused by a fixed shift. Because both top and bottom are int8, i.e., the input and output of every layer are int8, the intermediate values are also int8; by applying the method provided by the invention, the intermediate values are optimized from int32 to int8, further reducing on-chip memory.
Corresponding to the above method, an embodiment of the present specification further provides an 8-bit integer full-quantization inference device based on adaptive dynamic shift, as shown in fig. 6, where the device includes: an acquisition module 601, a weight module 602, an activation value module 603, a quantization table module 604, and an int8 quantization output module 605.
An obtaining module 601 configured to obtain the trained floating point model;
a weight module 602 configured to compute a weight for each layer in the floating-point model per channel;
an activation value module 603 configured to calculate an activation value for each layer in the floating-point model;
a quantization table module 604 configured to determine conversion factors for the layer-skip and convolution channel-shuffle operations of the floating-point model based on the weight scales and the activation values, and pre-store all fixed-point values and shift values in a quantization table;
an int8 quantization output module 605 configured to accept int8 type quantization inputs for each layer and generate an int8 quantization output based on the quantization table.
The functions executed by each component in the apparatus provided in the embodiment of the present invention have been described in detail in the above-mentioned method, and therefore, redundant description is not repeated here.
Corresponding to the above embodiments, the present invention provides a system, which includes a memory and a processor, where the memory stores executable codes, and the processor executes the executable codes to implement the method described in the above embodiments.
Corresponding to the above embodiment, an embodiment of the present invention further provides a chip, where the chip is coupled to the memory in the system, so that the chip calls the program instructions stored in the memory when running, so as to implement the method described in the above embodiment.
In correspondence with the above-described embodiments, the present invention provides a computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the above-described method.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above embodiments are provided to further explain the objects, technical solutions and advantages of the present invention in detail, it should be understood that the above embodiments are merely exemplary embodiments of the present invention and are not intended to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. An 8-bit integer full-quantization inference method based on adaptive dynamic shift is characterized by comprising the following steps:
acquiring a trained floating point model;
calculating the weight scale of each layer in the floating-point model channel by channel;
calculating an activation value of each layer in the floating-point model;
based on the weight scales and the activation values, determining conversion factors for the layer-skip and convolution channel-shuffle operations of the floating-point model, and pre-storing all fixed-point values and shift values in a quantization table;
based on the quantization table, each layer accepts int8 quantized inputs and generates int8 quantized outputs.
2. The method of claim 1, wherein the weight scale of each layer in the floating-point model is computed channel by channel by the following formula:
scale_weight = 127 / x_max
where x_max is the maximum value of the channel's weights.
3. The method according to claim 1, wherein the calculating the activation value for each layer in the floating-point model is specifically:
preparing a calibration data set;
initializing an activation value distribution from the calibration data set;
performing distribution normalization processing to obtain the threshold corresponding to the minimum KL divergence;
and solving the activation value of each layer in the floating point model based on the threshold.
4. The method according to claim 3, wherein the normalization process is specifically:
setting the number of blocks target_bins to 128, and performing the following processing for each threshold in [target_bins, 2048]:
summing the distribution over [threshold, 2048] into threshold_sum, adding threshold_sum to the original distribution at bin threshold, and calling the new distribution the P matrix;
setting the resampling interval to threshold/target_bins and resampling the original distribution, which gives a distribution with the same dimensionality as the P matrix, called the Q matrix;
according to the formula
D_KL(p||q) = Σ_x p(x) · log(p(x) / q(x)),
obtaining the threshold corresponding to the minimum KL divergence; where p is the target distribution, q is the quantized distribution to be matched, and D_KL(p||q) denotes the KL divergence, which measures the similarity of the two distributions p and q and is also referred to as the information loss of q relative to p.
5. The method according to claim 3, wherein the activation value of each layer in the floating-point model is obtained based on the threshold, specifically:
calculating the activation value according to the formula 127/((threshold + 0.5) × interval), where interval is the sampling interval.
6. The method according to claim 1, wherein, based on the weight scales and the activation values, conversion factors are determined for the layer-skip and convolution channel-shuffle operations of the floating-point model, and all fixed-point values and shift values are pre-stored in a quantization table, specifically:
converting the floating-point scales to fixed point;
converting the weights of the floating-point model to fixed point by shifting;
and pre-storing all fixed-point values and shift values in binary form.
7. The method of claim 6, wherein the fixed-point value and the shift value are obtained by the following formulas:
Shift = 8 - log2(A_max)
A_int8 = Int(A_float / pow(2, -Shift))
where Shift is the shift value; log2(A_max) is the base-2 logarithm of the maximum activation value; A_int8 is the 8-bit activation value; A_float is the floating-point activation value; and pow(2, -Shift) is 2 raised to the power of -Shift.
8. The method according to claim 1, wherein, based on the quantization table, each layer accepting an int8 quantized input and generating an int8 quantized output specifically includes:
converting the input activation values of each layer to fixed point by shifting;
multiply-accumulating the fixed-point inputs with the fixed-point weights;
and outputting an integer result.
9. An 8-bit integer full-quantization inference device based on adaptive dynamic shift, the device comprising:
an obtaining module configured to obtain the trained floating point model;
a weight module configured to compute a weight for each layer in the floating-point model per channel;
an activation value module configured to calculate an activation value for each layer in the floating-point model;
a quantization table module configured to determine conversion factors for the layer-skip and convolution channel-shuffle operations of the floating-point model based on the weight scales and the activation values, and pre-store all fixed-point values and shift values in a quantization table;
an int8 quantization output module configured to accept an int8 type quantization input for each layer and generate an int8 quantization output based on the quantization table.
10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed in a computer, causes the computer to perform the method of any one or more of claims 1-8.
CN202010859153.7A 2020-08-24 2020-08-24 8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift Pending CN111950715A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010859153.7A CN111950715A (en) 2020-08-24 2020-08-24 8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010859153.7A CN111950715A (en) 2020-08-24 2020-08-24 8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift

Publications (1)

Publication Number Publication Date
CN111950715A true CN111950715A (en) 2020-11-17

Family

ID=73359841

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010859153.7A Pending CN111950715A (en) 2020-08-24 2020-08-24 8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift

Country Status (1)

Country Link
CN (1) CN111950715A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389219A (en) * 2017-08-04 2019-02-26 三星电子株式会社 The method and apparatus quantified for the parameter to neural network
US10579383B1 (en) * 2018-05-30 2020-03-03 Facebook, Inc. Systems and methods for efficient scaling of quantized integers
CN111260022A (en) * 2019-11-22 2020-06-09 中国电子科技集团公司第五十二研究所 Method for fixed-point quantization of complete INT8 of convolutional neural network

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112558887A (en) * 2020-12-25 2021-03-26 北京百度网讯科技有限公司 Vector quantization method, device and equipment for multimedia data processing
CN112558887B (en) * 2020-12-25 2023-09-22 北京百度网讯科技有限公司 Vector quantization method, device and equipment for multimedia data processing
CN113469324A (en) * 2021-03-23 2021-10-01 中科创达软件股份有限公司 Model dynamic quantization method and device, electronic equipment and computer readable medium
CN113469324B (en) * 2021-03-23 2024-03-22 中科创达软件股份有限公司 Model dynamic quantization method, device, electronic equipment and computer readable medium
WO2023060959A1 (en) * 2021-10-13 2023-04-20 山东浪潮科学研究院有限公司 Neural network model quantification method, system and device, and computer-readable medium
CN114821660A (en) * 2022-05-12 2022-07-29 山东浪潮科学研究院有限公司 Pedestrian detection inference method based on embedded equipment
WO2024031989A1 (en) * 2022-08-11 2024-02-15 山东浪潮科学研究院有限公司 Memory optimization method and system for deep learning reasoning of embedded device

Similar Documents

Publication Publication Date Title
CN111950715A (en) 8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift
CN110413255B (en) Artificial neural network adjusting method and device
CN110097172B (en) Convolutional neural network data processing method and device based on Winograd convolutional operation
CN110555450A (en) Face recognition neural network adjusting method and device
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
CN111783961A (en) Activation fixed point fitting-based convolutional neural network post-training quantization method and system
CN112686382B (en) Convolution model lightweight method and system
CN110265002B (en) Speech recognition method, speech recognition device, computer equipment and computer readable storage medium
CN110175641A (en) Image-recognizing method, device, equipment and storage medium
CN113610232A (en) Network model quantization method and device, computer equipment and storage medium
CN111695671A (en) Method and device for training neural network and electronic equipment
CN113011571A (en) INT8 offline quantization and integer inference method based on Transformer model
CN112733964A (en) Convolutional neural network quantification method for reinforcement learning automatic perception weight distribution
US11531884B2 (en) Separate quantization method of forming combination of 4-bit and 8-bit data of neural network
CN110837890A (en) Weight value fixed-point quantization method for lightweight convolutional neural network
CN112116061A (en) Weight and activation value quantification method for long-term and short-term memory network
US20240071070A1 (en) Algorithm and method for dynamically changing quantization precision of deep-learning network
CN114943335A (en) Layer-by-layer optimization method of ternary neural network
Jing et al. The optimisation of speech recognition based on convolutional neural network
Rajagopal et al. Accurate and efficient fixed point inference for deep neural networks
CN112613604A (en) Neural network quantification method and device
CN114444688A (en) Neural network quantization method, apparatus, device, storage medium, and program product
US20220207346A1 (en) Data processing method and device used in neural network
CN114386469A (en) Method and device for quantizing convolutional neural network model and electronic equipment
CN114298291A (en) Model quantization processing system and model quantization processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination