CN112712164A - Non-uniform quantization method of neural network - Google Patents

Non-uniform quantization method of neural network

Info

Publication number
CN112712164A
Authority
CN
China
Prior art keywords
point number
fixed point
data
lookup table
uniform quantization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011616502.9A
Other languages
Chinese (zh)
Other versions
CN112712164B (en)
Inventor
黄宇扬
冯建豪
陈家麒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Thinkforce Electronic Technology Co ltd
Original Assignee
Thinkforce Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Thinkforce Electronic Technology Co ltd filed Critical Thinkforce Electronic Technology Co ltd
Priority to CN202011616502.9A
Publication of CN112712164A
Application granted
Publication of CN112712164B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06C DIGITAL COMPUTERS IN WHICH ALL THE COMPUTATION IS EFFECTED MECHANICALLY
    • G06C 3/00 Arrangements for table look-up, e.g. menstruation table
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Mathematics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a non-uniform quantization method for a neural network. Input data are first quantized into a first fixed point number using a piecewise function, and the first fixed point number is stored. A first lookup table is then searched to determine a second fixed point number corresponding to the first fixed point number, the bit width of the second fixed point number being higher than that of the first fixed point number. A convolution operation is performed using the second fixed point number to obtain a calculation result. Finally, a second lookup table is searched to convert the calculation result into a third fixed point number for storage, the data type of the third fixed point number being the same as that of the first fixed point number.

Description

Non-uniform quantization method of neural network
Technical Field
The invention relates to the technical field of neural networks, in particular to a non-uniform quantization method of a neural network.
Background
The application of artificial neural networks has made great progress in many areas: in fields such as pattern recognition, intelligent robotics, automatic control, prediction and estimation, biology, medicine and economics, they have successfully solved many practical problems that are difficult for modern computers and have shown good intelligent characteristics.
As model prediction becomes more accurate and networks become deeper, the computational and memory resources consumed by neural networks become a problem, especially on mobile devices. For example, deploying even a relatively small classification network such as ResNet-50 requires a memory bandwidth of about 3 GB/s, and memory, CPU and battery are consumed rapidly while the network runs, so making a device intelligent comes at a high cost. As neural networks evolve, large networks contain ever more layers and data, which poses a significant challenge to their deployment.
To address these issues, on the one hand, acceptable accuracy can be achieved with relatively small model sizes by designing more efficient network architectures; on the other hand, the network size can be reduced by compression, encoding, and the like. Quantization is one of the most widely used compression methods.
Neural network quantization can significantly improve the computational efficiency of neural networks, enabling them to be deployed on resource-limited chips or other computing platforms. Currently, the most common neural network quantization method projects high-precision floating point numbers onto low-precision quantized values; for example, 32-bit floating point numbers are converted into 8-bit fixed point numbers, and 8-bit data streams are used to store parameters such as input and output data and weights, so as to reduce bandwidth requirements.
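As an illustration of this common uniform (linear) scheme, the following minimal sketch, which is not part of the claimed method and whose ranges and names are chosen only for illustration, projects 32-bit floating point values onto 8-bit codes with a single scale and zero point:

```python
import numpy as np

def uniform_quantize(r, r_min, r_max, n_bits=8):
    """Affine mapping r = scale * (q - zero_point), solved for q."""
    q_max = 2 ** n_bits - 1
    scale = (r_max - r_min) / q_max
    zero_point = int(round(-r_min / scale))
    q = np.clip(np.round(r / scale) + zero_point, 0, q_max)
    return q.astype(np.uint8), scale, zero_point

def uniform_dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

# Example: 32-bit floats drawn from [-1, 1], stored as 8-bit codes.
r = np.random.uniform(-1.0, 1.0, size=16).astype(np.float32)
q, scale, zp = uniform_quantize(r, -1.0, 1.0)
print("max round-trip error:", np.abs(r - uniform_dequantize(q, scale, zp)).max())
```

With a single scale for the whole range, every value is represented with the same step size; this is exactly what a non-uniform scheme relaxes.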
However, using low-precision quantized values in the convolution operation destroys the original distribution of the weights and activations in the neural network, so computing directly with low-bit data may greatly reduce network accuracy. Non-uniform quantization algorithms have been studied and proposed to address this problem. Although non-uniform quantization preserves network accuracy better, the existing non-uniform quantization algorithms are complex and their computational load is difficult for a chip to bear, so in practice they neither effectively reduce the bandwidth requirement nor speed up inference.
Disclosure of Invention
Aiming at some or all of the problems in the prior art, the invention provides a non-uniform quantization method for a neural network, which comprises the following steps:
quantizing the input data into a first fixed point number by adopting a piecewise function, and storing the first fixed point number;
searching a first lookup table, and confirming a second fixed point number corresponding to the first fixed point number, wherein the bit number of the second fixed point number is higher than that of the first fixed point number;
performing convolution operation by adopting the second fixed point number to obtain a calculation result; and
searching a second lookup table, and converting the calculation result into a third fixed point number for storage, wherein the data type of the third fixed point number is the same as that of the first fixed point number.
Further, the data type of the first fixed point number is an 8-bit fixed point number.
Further, the first lookup table and the second lookup table are configured inside the chip.
Further, each segment of the piecewise function is a linear function with the same or a different slope, and the slope is determined according to the upper and lower bounds of each segment:

scale_i = (r_i2 - r_i1) / (q_i2 - q_i1)

where [r_i1, r_i2] is the value range of the input data in the i-th segment, and [q_i1, q_i2] is the corresponding value range of the first fixed point number.
Further, the piecewise function is:

q = round((r - r_i1) / scale_i) + q_i1, for r in the i-th of the segments [-f1, -k*f1], [-k*f1, k*f1] and [k*f1, f1],

where [-f1, f1] is the value range of the input data r, and k is an arbitrary number between 0 and 1.
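As a minimal sketch of one possible reading of this three-segment scheme (the segment boundaries and the 8-bit code ranges below follow the example given later in the detailed description and are assumptions for illustration, not the only form the piecewise function may take):

```python
import numpy as np

def piecewise_quantize(r, f1, k=0.125):
    """Three-segment non-uniform quantizer over [-f1, f1] (illustrative sketch).

    The middle segment [-k*f1, k*f1] is mapped to the wider code range
    [64, 191], so densely distributed mid-range values get more codes;
    the outer segments share the remaining codes [0, 63] and [192, 255].
    """
    segments = [(-f1, -k * f1, 0, 63),      # (r_lo, r_hi, q_lo, q_hi)
                (-k * f1, k * f1, 64, 191),
                (k * f1, f1, 192, 255)]
    r = np.clip(np.asarray(r, dtype=np.float32), -f1, f1)
    q = np.zeros(r.shape, dtype=np.uint8)
    for r_lo, r_hi, q_lo, q_hi in segments:
        scale = (r_hi - r_lo) / (q_hi - q_lo)   # per-segment slope
        mask = (r >= r_lo) & (r <= r_hi)
        q[mask] = np.clip(np.round((r[mask] - r_lo) / scale) + q_lo, q_lo, q_hi)
    return q
```

Compared with a uniform mapping, values near zero are resolved more finely at the cost of coarser steps near the ends of the range.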
Further, the forming of the first lookup table includes:
confirming a value range of input data, and quantizing the data in the value range according to the piecewise function to obtain first data;
dequantizing the first data to a floating point number according to the piecewise function;
quantizing the floating point number into second data; and
configuring the first data and the corresponding second data into the first lookup table.
Further, the forming of the second lookup table includes:
estimating the value range of the convolution calculation result;
establishing a mapping table from low-bit fixed point numbers to high-bit fixed point numbers in the value range; and
for each possible high-bit fixed-point value, finding the nearest mapped number in the mapping table to form a second lookup table.
In the embodiment of the present invention, the low-bit fixed point number refers to a data type used for storing input data and a calculation result, and the high-bit fixed point number refers to a data type with a bit number higher than that of the low-bit fixed point number, and is mainly used for convolution operation.
According to the non-uniform quantization method of the present invention, the input data of each layer are stored in a simple non-uniformly quantized form, and when the core convolution operation of the neural network is performed, a higher-bit value is looked up on chip through a pre-configured lookup table and the calculation is carried out at the higher bit width, so that accuracy is ensured without greatly increasing the bandwidth or the burden on computing resources. Because the lookup tables are configured in advance by off-chip calculation, all operations of high-precision neural network training or inference can be performed inside the neural network chip without floating-point calculation, which lowers the computing-power requirement on the host CPU and saves chip resources.
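A rough sketch of this per-layer flow is given below; the function and table names are assumptions for illustration, lut_up and lut_down stand for the first and second lookup tables configured off chip, and the step numbers refer to Fig. 1 described further on:

```python
import numpy as np

def quantized_layer(x_q8, w_q8, lut_up, lut_down,
                    scale_a, zp_a, scale_b, zp_b, scale_c, zp_c):
    """One dot-product layer in the proposed scheme (illustrative only).

    x_q8, w_q8 : stored 8-bit codes of the layer input and the weights.
    lut_up     : first lookup table, 8-bit code -> higher-bit fixed-point code.
    lut_down   : second lookup table, higher-bit result code -> 8-bit code.
    """
    a = lut_up[x_q8].astype(np.int64)    # step 103: look up higher-bit values on chip
    b = lut_up[w_q8].astype(np.int64)
    acc = np.sum((a - zp_a) * (b - zp_b))          # step 104: integer multiply-add
    # Rescale to the output grid; on chip this would itself be a fixed-point
    # multiply, floating point is used here only to keep the sketch short.
    q_c = int(round(zp_c + scale_a * scale_b / scale_c * float(acc)))
    q_c = int(np.clip(q_c, 0, len(lut_down) - 1))
    return lut_down[q_c]                 # step 105: store the result as an 8-bit code
```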
Drawings
To further clarify the above and other advantages and features of embodiments of the present invention, a more particular description of embodiments of the present invention will be rendered by reference to the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. In the drawings, the same or corresponding parts will be denoted by the same or similar reference numerals for clarity.
Fig. 1 is a flow chart illustrating a non-uniform quantization method of a neural network according to an embodiment of the present invention.
Detailed Description
In the following description, the present invention is described with reference to examples. One skilled in the relevant art will recognize, however, that the embodiments may be practiced without one or more of the specific details, or with other alternative and/or additional methods, materials, or components. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention. Similarly, for purposes of explanation, specific numbers, materials and configurations are set forth in order to provide a thorough understanding of the embodiments of the invention. However, the invention is not limited to these specific details. Further, it should be understood that the embodiments shown in the figures are illustrative representations and are not necessarily drawn to scale.
Reference in the specification to "one embodiment" or "the embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment.
It should be noted that the embodiment of the present invention describes the process steps in a specific order, however, this is only for the purpose of illustrating the specific embodiment, and does not limit the sequence of the steps. Rather, in various embodiments of the present invention, the order of the steps may be adjusted according to process adjustments.
In order to reduce bandwidth requirements, low bit widths, e.g., 8-bit data streams, are often used in neural networks to store data, weights, and the like. However, in the core convolution operation of a neural network, computing with low-bit numbers easily costs network accuracy, and the low-bit numbers usually need to be dequantized into floating point numbers before accurate calculation can be performed. In a convolution operation, the input is generally unfolded into a matrix according to the size of the convolution kernel and matrix multiplication is then performed; the core of this calculation is the dot product of rows and columns of the matrices. The dot product acts on two vectors of equal length and can be described by the following formula:
c = a_1*b_1 + a_2*b_2 + … + a_n*b_n
the length of a and b is n, and the length of C is the result of dot product, so it can be seen that if the floating point number is used for dot product operation, the resource consumption is more, and the occupied bandwidth is larger, which is easy to become the bottleneck of calculation. In view of this problem, the present invention provides a non-uniform quantization method for neural networks, and the scheme of the present invention is further described below with reference to the accompanying drawings of embodiments.
Fixed-point and floating-point are both numerical representations, which differ in where the point separating the integer part from the fractional part is placed. A fixed-point format stores an integer part and a fractional part with fixed numbers of bits, while a floating-point format stores a significand and an exponent. Taking the 8-bit fixed-point integer INT8 and the 32-bit floating-point format FP32 as examples, INT8 uses only 25% of the bits of FP32. However, the method used to convert values between INT8 and FP32 is very important, because it significantly affects the prediction accuracy.
Fig. 1 is a flow chart illustrating a non-uniform quantization method of a neural network according to an embodiment of the present invention. As shown in fig. 1, a non-uniform quantization method of a neural network includes:
first, in step 101, the lookup tables are configured. In order to reduce the amount of calculation inside the neural network chip, the correspondence between low-bit and high-bit fixed point numbers is calculated in advance off chip to form a first lookup table and a second lookup table, and the two tables are configured into the chip. The first lookup table is used for converting input data in low-bit fixed-point form into high-bit fixed point numbers, and the second lookup table is used for converting calculation results in high-bit fixed-point form into low-bit fixed point numbers. In one embodiment of the invention, the forming of the first lookup table comprises:
firstly, confirming a value range of the input data, and quantizing the data in this range according to a piecewise function to obtain first data; typically, the input data are in the form of 32-bit floating point numbers, and in one embodiment of the invention the data type of the first data is an 8-bit fixed point number; in a further embodiment of the present invention, the segments of the piecewise function are linear functions with the same or different slopes:

scale_i = (r_i2 - r_i1) / (q_i2 - q_i1)

where [r_i1, r_i2] is the value range of the input data in the i-th segment, and [q_i1, q_i2] is the corresponding value range of the first data; in an embodiment of the present invention, the piecewise function includes three segments:

q = round((r - r_i1) / scale_i) + q_i1, for r in the i-th of the segments [-f1, -k*f1], [-k*f1, k*f1] and [k*f1, f1],

where [-f1, f1] is the value range of the input data r, and k is any number between 0 and 1; in one embodiment of the present invention, k is 0.125, and the value range of q is [0, 2^n - 1], where n is the number of bits of the first data. Considering that the input data are distributed more densely in the middle section of the value range, scale_2 in the piecewise function is smaller than scale_1 and scale_3, that is, more quantization points are used in the second segment; taking 8 bits as an example, the q value ranges of the three segments are [0, 63], [64, 191] and [192, 255]. It should be understood that in other embodiments of the present invention, the segment intervals and/or the number of segments of the piecewise function may be set differently according to the distribution of the input data;
next, inverse quantizing the first data into floating point numbers using an inverse of the piecewise function;
then, quantizing the floating point numbers into second data, wherein the bit width of the second data is higher than that of the first data and is determined according to requirements after weighing the chip area against the quantization performance, for example 10 to 12-bit fixed point numbers; in one embodiment of the invention, the floating point numbers are quantized into the second data using a uniform quantization algorithm, in particular a linear function, and the original floating point number r and the quantized value q satisfy:

q = round(r / scale) + zeroPoint;

r = scale * (q - zeroPoint);

where scale and zeroPoint are obtained from the upper and lower bounds of the quantization interval; in one embodiment of the invention, assuming that the range of the floating point numbers to be quantized is [-f1, f1] and the range of the second data is [0, i1], then:

scale = 2 * f1 / i1;

zeroPoint = f1 / scale = i1 / 2;

finally, configuring the first data and the corresponding second data into the first lookup table. In one embodiment of the invention, the forming of the second lookup table comprises:
firstly, estimating the value range of a neural network convolution calculation result;
then, in the value range, establishing a mapping table from low-bit fixed point numbers to high-bit fixed point numbers, wherein the table is established in the same way as the first lookup table; the low-bit fixed point number refers to the data type used when storing input data and calculation results, i.e. the same type as the first data, and the high-bit fixed point number refers to a data type with a higher bit width, which is used for the convolution operation and is of the same type as the second data; and
finally, for each possible high-bit fixed point value, finding the nearest mapped number in the mapping table, and establishing a second lookup table;
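Under the assumptions already used above (an 8-bit first data type, the three-segment piecewise quantizer, and a uniform higher-bit grid over [-f1, f1]), the off-chip construction of the two tables might be sketched as follows; the 12-bit width, the choice f1 = 1.0 and the reuse of the input range for the output range are illustrative assumptions, not values fixed by the method:

```python
import numpy as np

def piecewise_dequantize(q, f1, k=0.125):
    """Inverse of the three-segment quantizer: 8-bit code -> floating point value
    (segment boundaries follow the 8-bit example above)."""
    segments = [(-f1, -k * f1, 0, 63),
                (-k * f1, k * f1, 64, 191),
                (k * f1, f1, 192, 255)]
    r = np.empty(q.shape, dtype=np.float32)
    for r_lo, r_hi, q_lo, q_hi in segments:
        scale = (r_hi - r_lo) / (q_hi - q_lo)
        mask = (q >= q_lo) & (q <= q_hi)
        r[mask] = r_lo + scale * (q[mask].astype(np.float32) - q_lo)
    return r

def build_first_lut(f1, k=0.125, hi_bits=12):
    """First lookup table: every 8-bit code -> uniformly quantized higher-bit code."""
    i1 = 2 ** hi_bits - 1
    scale = 2.0 * f1 / i1                     # uniform grid over [-f1, f1]
    zero_point = i1 // 2
    codes = np.arange(256, dtype=np.uint8)
    r = piecewise_dequantize(codes, f1, k)    # dequantize back to floating point
    return np.clip(np.round(r / scale) + zero_point, 0, i1).astype(np.int64)

def build_second_lut(first_lut):
    """Second lookup table: for each possible higher-bit value, the 8-bit code
    whose entry in the low-to-high mapping table is nearest."""
    i1 = int(first_lut.max())
    return np.array([int(np.argmin(np.abs(first_lut - v))) for v in range(i1 + 1)],
                    dtype=np.uint8)

# Off-chip configuration; both tables are then loaded into the chip.
lut_up = build_first_lut(f1=1.0)     # used at step 103
lut_down = build_second_lut(lut_up)  # used at step 105 (output range assumed = input range)
```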
next, at step 102, the input data are stored: the input data are quantized into a first fixed point number using a piecewise function, and the first fixed point number is stored; in one embodiment of the present invention, each segment of the piecewise function is a linear function with the same or a different slope, the slope being determined by the value range of the input data in that segment and by the data type of the first fixed point number; in yet another embodiment of the present invention, the data type of the first fixed point number is an 8-bit fixed point number;
next, at step 103, the high-bit fixed point number is looked up: the second fixed point number corresponding to the first fixed point number is determined through the first lookup table, the bit width of the second fixed point number being higher than that of the first fixed point number;
next, at step 104, a convolution operation is performed: the convolution is computed using the second fixed point numbers to obtain a calculation result; because the input data have already been converted into high-bit fixed point numbers at this point, the loss of neural network accuracy is reduced on the one hand, and on the other hand the dot product is converted into pure fixed-point multiply-add operations carried out entirely at the integer level, which greatly reduces the bandwidth requirement (a sketch of this integer-level computation is given after step 105 below):

q_c = zeroPoint_c + (scale_a * scale_b / scale_c) * sum_{i=1..n} (q_a[i] - zeroPoint_a) * (q_b[i] - zeroPoint_b)

where scale_a, scale_b, scale_c, zeroPoint_a, zeroPoint_b and zeroPoint_c are quantization parameters; and
finally, in step 105, the calculation result is stored: by searching the second lookup table, the convolution calculation result is converted into a third fixed point number for storage, wherein the data type of the third fixed point number is the same as that of the first fixed point number.
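A short sketch of the integer-level dot product given by the formula in step 104; the variable names and parameter values are illustrative assumptions only:

```python
import numpy as np

def quantized_dot(q_a, q_b, scale_a, zp_a, scale_b, zp_b, scale_c, zp_c):
    """Dot product on fixed-point codes: the accumulation is pure integer
    multiply-add; only the final rescale uses the quantization parameters."""
    acc = np.sum((q_a.astype(np.int64) - zp_a) * (q_b.astype(np.int64) - zp_b))
    return int(round(zp_c + scale_a * scale_b / scale_c * float(acc)))

# Example with two 4-element fixed-point vectors (parameters chosen arbitrarily).
q_a = np.array([10, 200, 50, 128], dtype=np.int64)
q_b = np.array([7, 90, 255, 0], dtype=np.int64)
q_c = quantized_dot(q_a, q_b, scale_a=0.01, zp_a=128,
                    scale_b=0.02, zp_b=128, scale_c=0.5, zp_c=2048)
print(q_c)
```

Dequantizing q_c with scale_c and zeroPoint_c recovers the floating-point dot product up to rounding error.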
According to the non-uniform quantization method of the present invention, the input data of each layer are stored in a simple non-uniformly quantized form, and when the core convolution operation of the neural network is performed, a higher-bit value is looked up on chip through the pre-configured lookup tables and the calculation is carried out at the higher bit width, so that accuracy is ensured without greatly increasing the bandwidth or the burden on computing resources.
Embodiments may be provided as a computer program product that may include one or more machine-readable media having stored thereon machine-executable instructions that, when executed by one or more machines such as a computer, network of computers, or other electronic devices, may result in the one or more machines performing operations in accordance with embodiments of the present invention. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs (compact disc read-only memories), and magneto-optical disks, ROMs (read-only memories), RAMs (random access memories), EPROMs (erasable programmable read-only memories), EEPROMs (electrically erasable programmable read-only memories), magnetic or optical cards, flash memory, or other type of media/machine-readable medium suitable for storing machine-executable instructions.
Moreover, embodiments may be downloaded as a computer program product, wherein the program may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of one or more data signals embodied in and/or modulated by a carrier wave or other propagation medium via a communication link (e.g., a modem and/or network connection). Accordingly, a machine-readable medium as used herein may include, but is not required to be, such a carrier wave.
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be apparent to persons skilled in the relevant art that various combinations, modifications, and changes can be made thereto without departing from the spirit and scope of the invention. Thus, the breadth and scope of the present invention disclosed herein should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims (10)

1. A method for non-uniform quantization of a neural network, comprising the steps of:
quantizing the input data into a first fixed point number by adopting a piecewise function, and storing the first fixed point number;
searching a first lookup table, and confirming a second fixed point number corresponding to the first fixed point number, wherein the bit number of the second fixed point number is higher than that of the first fixed point number;
performing convolution operation by adopting the second fixed point number to obtain a calculation result; and
searching a second lookup table, and converting the calculation result into a third fixed point number for storage, wherein the data type of the third fixed point number is the same as that of the first fixed point number.
2. The non-uniform quantization method of claim 1, wherein the data type of the first fixed-point number is an 8-bit fixed-point number.
3. The non-uniform quantization method of claim 1, wherein the first lookup table and the second lookup table are configured within a chip.
4. The non-uniform quantization method of claim 1, wherein each segment of the piecewise function is a linear function with the same or a different slope:

scale_i = (r_i2 - r_i1) / (q_i2 - q_i1)

where [r_i1, r_i2] is the value range of the input data in the i-th segment, and [q_i1, q_i2] is the corresponding value range of the first fixed point number.
5. The non-uniform quantization method of claim 4, wherein the piecewise function is:
q = round((r - r_i1) / scale_i) + q_i1, for r in the i-th of the segments [-f1, -k*f1], [-k*f1, k*f1] and [k*f1, f1],

where [-f1, f1] is the value range of the input data r, and k is an arbitrary number between 0 and 1.
6. The non-uniform quantization method of claim 1, wherein the forming of the first lookup table comprises the steps of:
confirming a value range of input data, and quantizing the data in the value range according to the piecewise function to obtain first data;
dequantizing the first data to a floating point number according to the piecewise function;
quantizing the floating point number into second data; and
configuring the first data and the corresponding second data into the first lookup table.
7. The non-uniform quantization method of claim 6, wherein the floating point number is quantized into the second data using a linear function, and the original floating point number r and the quantized second data q satisfy:

q = round(r / scale) + zeroPoint;

r = scale * (q - zeroPoint);

wherein scale and zeroPoint are calculated from the upper and lower bounds of the quantization interval.
8. The non-uniform quantization method of claim 1, wherein the forming of the second lookup table comprises the steps of:
estimating the value range of the convolution calculation result;
establishing a mapping table from low-bit fixed point numbers to high-bit fixed point numbers in the value range; and
for each possible high-bit fixed-point value, finding the nearest mapped number in the mapping table to form a second lookup table.
9. A computer-readable storage medium comprising instructions that, when executed, cause a system to perform the method of any of claims 1-8.
10. A non-uniform quantization system for neural networks, comprising:
a memory; and
a processor coupled to the memory and configured to perform the method of any of claims 1-8.
CN202011616502.9A 2020-12-30 2020-12-30 Non-uniform quantization method of neural network Active CN112712164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011616502.9A CN112712164B (en) 2020-12-30 2020-12-30 Non-uniform quantization method of neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011616502.9A CN112712164B (en) 2020-12-30 2020-12-30 Non-uniform quantization method of neural network

Publications (2)

Publication Number Publication Date
CN112712164A true CN112712164A (en) 2021-04-27
CN112712164B CN112712164B (en) 2022-08-26

Family

ID=75547390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011616502.9A Active CN112712164B (en) 2020-12-30 2020-12-30 Non-uniform quantization method of neural network

Country Status (1)

Country Link
CN (1) CN112712164B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022236588A1 (en) * 2021-05-10 2022-11-17 Huawei Technologies Co., Ltd. Methods and systems for generating integer neural network from a full-precision neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190034796A1 (en) * 2017-07-28 2019-01-31 Beijing Deephi Intelligence Technology Co., Ltd. Fixed-point training method for deep neural networks based on static fixed-point conversion scheme
CN110852434A (en) * 2019-09-30 2020-02-28 成都恒创新星科技有限公司 CNN quantization method, forward calculation method and device based on low-precision floating point number
CN110929865A (en) * 2018-09-19 2020-03-27 深圳云天励飞技术有限公司 Network quantification method, service processing method and related product
US20200151019A1 (en) * 2019-03-14 2020-05-14 Rednova Innovations,Inc. OPU-based CNN acceleration method and system
US20200242474A1 (en) * 2019-01-24 2020-07-30 Microsoft Technology Licensing, Llc Neural network activation compression with non-uniform mantissas

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190034796A1 (en) * 2017-07-28 2019-01-31 Beijing Deephi Intelligence Technology Co., Ltd. Fixed-point training method for deep neural networks based on static fixed-point conversion scheme
CN110929865A (en) * 2018-09-19 2020-03-27 深圳云天励飞技术有限公司 Network quantification method, service processing method and related product
US20200242474A1 (en) * 2019-01-24 2020-07-30 Microsoft Technology Licensing, Llc Neural network activation compression with non-uniform mantissas
US20200151019A1 (en) * 2019-03-14 2020-05-14 Rednova Innovations,Inc. OPU-based CNN acceleration method and system
CN110852434A (en) * 2019-09-30 2020-02-28 成都恒创新星科技有限公司 CNN quantization method, forward calculation method and device based on low-precision floating point number

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DARRYL D. LIN ET AL.: "Fixed Point Quantization of Deep Convolutional Networks", arXiv *
WEI Miao et al.: "FPGA-based real-time enhancement system for 1080P low-quality video", Computer Technology and Development *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022236588A1 (en) * 2021-05-10 2022-11-17 Huawei Technologies Co., Ltd. Methods and systems for generating integer neural network from a full-precision neural network

Also Published As

Publication number Publication date
CN112712164B (en) 2022-08-26

Similar Documents

Publication Publication Date Title
CN110363279B (en) Image processing method and device based on convolutional neural network model
CN107480770B (en) Neural network quantization and compression method and device capable of adjusting quantization bit width
CN109800865B (en) Neural network generation and image processing method and device, platform and electronic equipment
CN111416743B (en) Convolutional network accelerator, configuration method and computer readable storage medium
WO2019238029A1 (en) Convolutional neural network system, and method for quantifying convolutional neural network
CN110929865B (en) Network quantification method, service processing method and related product
CN110880038A (en) System for accelerating convolution calculation based on FPGA and convolution neural network
WO2023279964A1 (en) Data compression method and apparatus, and computing device and storage medium
CN115599757A (en) Data compression method and device, computing equipment and storage system
CN112712164B (en) Non-uniform quantization method of neural network
CN111240746A (en) Floating point data inverse quantization and quantization method and equipment
CN111754405A (en) Image resolution reduction and restoration method, equipment and readable storage medium
US20230131251A1 (en) System and method for memory compression for deep learning networks
CN114612996A (en) Method for operating neural network model, medium, program product, and electronic device
CN114529741A (en) Picture duplicate removal method and device and electronic equipment
CN116502691A (en) Deep convolutional neural network mixed precision quantization method applied to FPGA
CN112085175B (en) Data processing method and device based on neural network calculation
CN114943335A (en) Layer-by-layer optimization method of ternary neural network
CN111401546A (en) Training method of neural network model, medium thereof, and electronic device
CN111344719A (en) Data processing method and device based on deep neural network and mobile device
CN113962385A (en) Neural network training and data processing method and device, medium and computer equipment
Zhan et al. Field programmable gate array‐based all‐layer accelerator with quantization neural networks for sustainable cyber‐physical systems
CN114169513B (en) Neural network quantization method and device, storage medium and electronic equipment
CN113177627B (en) Optimization system, retraining system, method thereof, processor and readable medium
CN110276448B (en) Model compression method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant