WO2019080483A1 - Neural network computation acceleration method and system based on non-uniform quantization and look-up table - Google Patents

Neural network computation acceleration method and system based on non-uniform quantization and look-up table

Info

Publication number
WO2019080483A1
Authority
WO
WIPO (PCT)
Prior art keywords
layer
neural network
input
parameters
uniform quantization
Prior art date
Application number
PCT/CN2018/087117
Other languages
French (fr)
Chinese (zh)
Inventor
江帆
王瑜
盛骁
韩松
单羿
Original Assignee
北京深鉴智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京深鉴智能科技有限公司
Publication of WO2019080483A1 publication Critical patent/WO2019080483A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the present invention relates to artificial neural networks, and more particularly to methods and systems for accelerating neural network computation using non-uniform quantization and lookup tables.
  • ANN: Artificial Neural Networks
  • NN: neural network
  • mainstream technologies for network compression include pruning, quantization, distillation, and the like.
  • well-designed, low-computation networks such as SqueezeNet, MobileNets, and ShuffleNet have also achieved good results in some applications.
  • Deep Compression proposes non-uniform quantization of the parameters (such as weights) of each layer of the neural network alongside pruning, thereby reducing computational complexity and speeding up computation.
  • however, in the multiplication of parameters and inputs, only the parameters are quantized; this removes only part of the workload, and multiplication remains the main source of computational complexity.
  • Deep Compression only performs non-uniform quantization on the parameters (such as weights) of the neural network.
  • the contribution of the present invention is to adopt non-uniform quantization throughout: on top of the quantized parameters, a calibration method quantizes the input of each layer of the network, and a lookup table further replaces multiplication to accelerate the computation of the neural network.
  • a method of accelerating neural network computation using non-uniform quantization and lookup tables can include: performing non-uniform quantization on the parameters of each layer of the neural network; performing non-uniform quantization on the inputs of each layer of the neural network; constructing a lookup table for each layer by multiplying all quantized parameter values of the layer with all quantized input values of the layer; and, during the forward computation of the neural network, looking up the result of each parameter-input multiplication in the layer's lookup table, computing layer by layer until all layers are completed.
  • the non-uniform quantization of the input of each layer of the neural network can be accomplished by a calibration method.
  • a system for accelerating neural network computation using non-uniform quantization and lookup tables may include: a network parameter quantization unit for performing non-uniform quantization on the parameters of each layer of the neural network; an input quantization unit for performing non-uniform quantization on the inputs of each layer of the neural network; a lookup table for each layer, constructed by multiplying all quantized parameter values of the layer with all quantized input values of the layer; and a main processing unit for performing the forward computation of the neural network.
  • during the forward computation, the main processing unit looks up the result of each parameter-input multiplication in the layer's lookup table, computing layer by layer until all layers are completed.
  • the input quantization unit may be configured to perform the non-uniform quantization of the input of each layer of the neural network by a calibration method.
  • a computer readable medium for recording instructions executable by a processor which, when executed by the processor, cause the processor to perform a method of accelerating neural network computation using non-uniform quantization and lookup tables, including: performing non-uniform quantization on the parameters of each layer of the neural network; performing non-uniform quantization on the inputs of each layer; constructing a lookup table for each layer by multiplying all quantized parameter values of the layer with all quantized input values of the layer; and, during the forward computation of the neural network, looking up the result of each parameter-input multiplication in the layer's lookup table, computing layer by layer until all layers are completed.
  • the invention performs non-uniform quantization on the parameters and inputs of the neural network, and further uses a lookup table instead of multiplication, thereby accelerating the calculation of the neural network.
  • FIG. 1 illustrates the process of non-uniform quantization of the input of a neural network using a calibration method.
  • FIG. 2 is a block diagram illustrating a system for accelerating neural network calculations using non-uniform quantization and lookup tables in accordance with the present invention.
  • FIG. 3 is a flow chart illustrating a method of accelerating neural network computation using non-uniform quantization and lookup tables in accordance with the present invention.
  • Deep Compression only non-uniformly quantizes the parameters of the neural network, such as the weights.
  • the invention adopts non-uniform quantization throughout: on top of the quantized parameters, a calibration method quantizes the input of each layer of the network, and a lookup table further replaces multiplication to accelerate the computation of the neural network.
  • compared with the quantization of the parameters, the quantization of the input is much more complicated.
  • the fundamental reason is the uncertainty of the input.
  • the input to each layer in the neural network depends on the output of the previous layer and is ultimately determined by the input of the entire network. Therefore, the input of each layer has a certain randomness.
  • this document quantizes the input (activation) of the neural network by a calibration method.
  • the so-called calibration can also be understood as a kind of training.
  • from a certain number of calibration samples, the N quantized values of the input are computed, completing the non-uniform quantization of the input.
  • to distinguish it from the training of the neural network itself, the term calibration is used hereinafter.
  • N, i.e., the number of quantized values, is generally taken as a power of 2, N = 2^n, which is called quantization to n bits, or n-bit quantization.
  • FIG. 1 illustrates the process of non-uniform quantization of the input of a neural network using a calibration method.
  • Figure 1 shows the quantization process for any layer in a neural network.
  • depending on the complexity of the network structure, the number of calibration samples can range from tens to thousands.
  • K-means clustering is used to obtain N cluster centers for each batch, and a further K-means clustering over the cluster centers of all batches yields the final N quantized values.
  • the specific steps are as follows:
  • once the parameters and inputs are both quantized, the set of possible multiplications in the neural network is finite, and a lookup table becomes a suitable acceleration method.
  • for example, if the parameters are quantized to 4 bits (16 quantized values) and the input is also quantized to 4 bits, there are only 16 × 16 = 256 possible products.
  • FIG. 2 is a block diagram illustrating a system for accelerating neural network calculations using non-uniform quantization and lookup tables in accordance with the present invention.
  • a system 200 for accelerating neural network computation using non-uniform quantization and lookup tables in accordance with the present invention can include a network parameter quantization unit 201 for non-uniformly quantizing the parameters (e.g., weights) of each layer of the neural network.
  • an input quantization unit 202 for non-uniformly quantizing the input of each layer of the neural network (for the first layer, the original input; for subsequent layers, the output of the previous layer); a lookup table 203 for each layer, constructed by multiplying all quantized parameter values of the layer with all quantized input values of the layer; and a main processing unit 204 for performing the forward computation of the neural network.
  • during the forward computation, the main processing unit 204 looks up the result of each parameter-input multiplication in the lookup table 203 of the layer, computing layer by layer until all layers are completed.
  • the network parameter quantization unit 201 can be configured to non-uniformly quantize the parameters in accordance with the Deep Compression method.
  • the input quantization unit 202 can be configured to perform the non-uniform quantization of the input of each layer of the neural network by a calibration method.
  • the calibration finally yields the quantized values {c_1, c_2, ..., c_N} of the input.
  • FIG. 3 is a flow chart illustrating a method of accelerating neural network computation using non-uniform quantization and lookup tables in accordance with the present invention.
  • the method 300 for accelerating neural network computation using non-uniform quantization and lookup tables in accordance with the present invention begins in step S310, where non-uniform quantization of parameters of each layer of the neural network is performed.
  • a Deep Compression method may be adopted, but the use of other methods for non-uniform quantization of the parameters is not excluded.
  • for the Deep Compression method, see S. Han, H. Mao, W. J. Dally; Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding; arXiv:1510.00149, 2015, which is hereby incorporated by reference in its entirety.
  • in step S320, the input of each layer of the neural network is non-uniformly quantized.
  • in step S320, the non-uniform quantization of the input of each layer of the neural network can be accomplished by a calibration method, specifically by the calibration steps described above.
  • although the quantization of the input is accomplished by a calibration method, and more specifically by K-means clustering through the specific steps described above, the invention does not exclude the use of other methods to non-uniformly quantize the input.
  • in step S330, a lookup table for each layer is constructed by multiplying all quantized values of the parameters of the layer with all quantized values of the inputs of the layer.
  • in step S340, during the forward computation of the neural network, the result of each parameter-input multiplication is looked up in the layer's lookup table, computing layer by layer until all layers are completed.
  • method 300 can end.
  • Non-transitory computer readable media include various types of tangible storage media.
  • examples of non-transitory computer readable media include magnetic recording media (such as floppy disks, magnetic tapes, and hard disk drives), magneto-optical recording media (such as magneto-optical disks), CD-ROM (Compact Disc Read Only Memory), CD-R, CD-R/W, and semiconductor memory (such as ROM, PROM (programmable ROM), EPROM (erasable PROM), flash ROM, and RAM (random access memory)).
  • these programs can also be provided to a computer by using various types of transitory computer readable media.
  • examples of transitory computer readable media include electrical signals, optical signals, and electromagnetic waves. A transitory computer readable medium can provide a program to a computer via a wired communication path, such as a wire or optical fiber, or via a wireless communication path.
  • a computer program or a computer readable medium for recording instructions executable by a processor which, when executed by the processor, cause the processor to perform a method of accelerating neural network computation using non-uniform quantization and lookup tables, including the following operations: performing non-uniform quantization on the parameters of each layer of the neural network; performing non-uniform quantization on the inputs of each layer; constructing a lookup table for each layer by multiplying all quantized parameter values of the layer with all quantized input values of the layer; and, during the forward computation of the neural network, looking up the result of each parameter-input multiplication in the layer's lookup table, computing layer by layer until all layers are completed.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Compression Of Band Width Or Redundancy In Fax (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A neural network computation acceleration method and system based on non-uniform quantization and a look-up table. The method (300) comprises: performing non-uniform quantization on parameters of each layer of a neural network (S310); performing non-uniform quantization on inputs of each layer of the neural network (S320); constructing a look-up table for each layer by multiplying each of the quantization values of the parameters of said layer by each of the quantization values of the inputs of said layer (S330); and when forward computation of the neural network is to be performed, looking up a result of multiplication computation of the parameters and inputs of each layer in the look-up table of said layer, and performing computation layer by layer until all computation is done (S340). The method performs non-uniform quantization on all parameters and inputs of a neural network, and further adopts a look-up table to replace multiplication computation, thus accelerating computation of the neural network.

Description

Method and System for Accelerating Neural Network Computation Using Non-uniform Quantization and Lookup Tables

Technical Field

The present invention relates to artificial neural networks, and more particularly to methods and systems for accelerating neural network computation using non-uniform quantization and lookup tables.

Background
Artificial Neural Networks (ANN), also referred to simply as neural networks (NN), are mathematical computation models that mimic the behavioral characteristics of animal neural networks and perform distributed, parallel information processing. In recent years, neural networks have developed rapidly and are widely used in many fields, such as image recognition, speech recognition, natural language processing, weather forecasting, gene expression, and content recommendation.

As neural networks have grown deeper, the number of parameters and the demand for computing resources have increased accordingly. On mobile devices and in embedded application scenarios, however, computing power and power consumption are strictly limited, so network compression techniques have attracted increasing attention.

Mainstream network compression techniques include pruning, quantization, and distillation. In addition, carefully designed low-computation networks such as SqueezeNet, MobileNets, and ShuffleNet have achieved good results in some applications.

Researchers have also proposed the concept of Deep Compression, which performs non-uniform quantization of the parameters (e.g., weights) of each layer of the neural network alongside pruning, thereby reducing computational complexity and speeding up neural network computation. See S. Han, H. Mao, W. J. Dally; Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding; arXiv:1510.00149, 2015. However, in the multiplication of parameters and inputs, only the parameters are quantized; this removes only part of the workload, and multiplication remains the main source of computational complexity.

It is therefore desirable to have a method for accelerating neural network computation that quantizes not only the parameters but also the inputs, and converts the multiplication of two finite sets of quantized values into a lookup table, replacing multiplication with table lookups to further reduce computational complexity and increase computation speed.
Summary of the Invention

In light of the above, it is an object of the present invention to provide a method of accelerating neural network computation.

As described above, Deep Compression only non-uniformly quantizes the parameters (e.g., weights) of the neural network. The contribution of the present invention is to adopt non-uniform quantization throughout: on top of the quantized parameters, a calibration method quantizes the input of each layer of the network, and a lookup table further replaces multiplication to accelerate neural network computation.

According to a first aspect of the present invention, there is provided a method of accelerating neural network computation using non-uniform quantization and lookup tables. The method can include: performing non-uniform quantization on the parameters of each layer of the neural network; performing non-uniform quantization on the inputs of each layer of the neural network; constructing a lookup table for each layer by multiplying all quantized parameter values of the layer with all quantized input values of the layer; and, during the forward computation of the neural network, looking up the result of each parameter-input multiplication in the layer's lookup table, computing layer by layer until all layers are completed.

In the method according to the first aspect of the invention, the non-uniform quantization of the input of each layer of the neural network can be accomplished by a calibration method.

More specifically, the calibration can be performed by the following steps: dividing the calibration data into M batches; performing a forward computation of the network on each batch to obtain the input of the layer, {X_1, X_2, ..., X_M}; performing K-means clustering on all elements of the i-th batch X_i to obtain N = 2^n cluster centers and the sample count of each cluster, {(c_i1, cnt_i1), (c_i2, cnt_i2), ..., (c_iN, cnt_iN)}; and performing a further K-means clustering over the cluster centers of all M batches to finally obtain the N cluster centers that serve as the quantized values {c_1, c_2, ..., c_N} of the input.

According to a second aspect of the present invention, there is provided a system for accelerating neural network computation using non-uniform quantization and lookup tables. The system can include: a network parameter quantization unit for performing non-uniform quantization on the parameters of each layer of the neural network; an input quantization unit for performing non-uniform quantization on the inputs of each layer of the neural network; a lookup table for each layer, constructed by multiplying all quantized parameter values of the layer with all quantized input values of the layer; and a main processing unit for performing the forward computation of the neural network. During the forward computation, the main processing unit looks up the result of each parameter-input multiplication in the layer's lookup table, computing layer by layer until all layers are completed.

In the system according to the second aspect of the invention, the input quantization unit may be configured to accomplish the non-uniform quantization of the input of each layer of the neural network by a calibration method.

More specifically, the input quantization unit may be configured to perform the calibration by: dividing the calibration data into M batches; performing a forward computation of the network on each batch to obtain the input of the layer, {X_1, X_2, ..., X_M}; performing K-means clustering on all elements of the i-th batch X_i to obtain N = 2^n cluster centers and the sample count of each cluster, {(c_i1, cnt_i1), (c_i2, cnt_i2), ..., (c_iN, cnt_iN)}; and performing a further K-means clustering over the cluster centers of all M batches to finally obtain the N cluster centers that serve as the quantized values {c_1, c_2, ..., c_N} of the input.

According to a third aspect of the present invention, there is provided a computer readable medium for recording instructions executable by a processor which, when executed by the processor, cause the processor to perform a method of accelerating neural network computation using non-uniform quantization and lookup tables, including the following operations: performing non-uniform quantization on the parameters of each layer of the neural network; performing non-uniform quantization on the inputs of each layer of the neural network; constructing a lookup table for each layer by multiplying all quantized parameter values of the layer with all quantized input values of the layer; and, during the forward computation of the neural network, looking up the result of each parameter-input multiplication in the layer's lookup table, computing layer by layer until all layers are completed.

The invention performs non-uniform quantization on both the parameters and the inputs of the neural network, and further uses lookup tables instead of multiplication, thereby accelerating the computation of the neural network.
Brief Description of the Drawings

The invention is described below in connection with the embodiments with reference to the accompanying drawings. In the drawings:

FIG. 1 illustrates the process of non-uniform quantization of the input of a neural network using a calibration method.

FIG. 2 is a block diagram illustrating a system for accelerating neural network computation using non-uniform quantization and lookup tables in accordance with the present invention.

FIG. 3 is a flow chart illustrating a method of accelerating neural network computation using non-uniform quantization and lookup tables in accordance with the present invention.

Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the invention. The technical solution of the present invention is further described below with reference to the accompanying drawings and embodiments.

As noted above, Deep Compression only non-uniformly quantizes the parameters (e.g., weights) of the neural network. The present invention adopts non-uniform quantization throughout: on top of the quantized parameters, a calibration method quantizes the input of each layer of the network, and a lookup table further replaces multiplication to accelerate the computation of the neural network.

The innovative techniques of the present invention are described in two parts below.

Quantizing the input

At the prediction stage, the parameters of the network are already fixed, so Deep Compression can directly perform K-means clustering on the network parameters; the resulting N cluster centers are the N quantized values of the non-uniform quantization.

Compared with parameter quantization, the quantization of the input is much more complicated; the fundamental reason is the uncertainty of the input. The input of each layer in the neural network depends on the output of the previous layer and is ultimately determined by the input of the entire network, so the input of each layer has a certain randomness. This document quantizes the input (activation) of the neural network by a calibration method. Calibration can be understood as a kind of training: from a certain number of calibration samples (training samples), the N quantized values of the input are computed, completing the non-uniform quantization of the input. To distinguish it from the training of the neural network itself, the term calibration is used hereinafter. In practice, N (the number of quantized values) is generally taken as a power of 2, i.e., N = 2^n, which is called quantization to n bits, or n-bit quantization.
FIG. 1 illustrates the process of non-uniform quantization of the input of a neural network using a calibration method; it shows the quantization process for any one layer of the network. FIG. 1 depicts 3-bit quantization, in which calibration ultimately quantizes the input to 8 (2^3) quantized values. Depending on the complexity of the network structure, the number of calibration samples can range from tens to thousands. We divide the calibration data into M batches (denoted Batch 1, Batch 2, ..., Batch M in FIG. 1), use K-means clustering to obtain N cluster centers for each batch, and then perform a further K-means clustering over the cluster centers of all batches to obtain the final N quantized values. The specific steps are as follows (an illustrative sketch follows the list):

1. Perform a forward computation of the network on each batch to obtain the input of this layer, denoted {X_1, X_2, ..., X_M};

2. Perform K-means clustering on all elements of the i-th batch X_i to obtain N = 2^n cluster centers (centroids) and the sample count of each cluster, denoted {(c_i1, cnt_i1), (c_i2, cnt_i2), ..., (c_iN, cnt_iN)};

3. Perform a further K-means clustering over the cluster centers of all M batches to finally obtain the N cluster centers that serve as the quantized values {c_1, c_2, ..., c_N} of the input.
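To make the procedure concrete, the following is a minimal Python sketch of the two-stage K-means calibration using scikit-learn. The function name is an assumption, and the use of the per-cluster sample counts as weights for the second clustering is also an assumption: the patent records the counts but does not state how they are used.

```python
import numpy as np
from sklearn.cluster import KMeans

def calibrate_layer_inputs(batches, n_bits=3, seed=0):
    """Two-stage K-means calibration of one layer's inputs (illustrative sketch).

    batches: list of M numpy arrays, each holding the layer's input X_i for one
             calibration batch (obtained by a forward pass of the network).
    Returns the N = 2**n_bits quantized values {c_1, ..., c_N}.
    """
    n_clusters = 2 ** n_bits  # N = 2^n quantized values

    centers, counts = [], []
    for x in batches:
        # Stage 1: cluster all elements of batch X_i into N clusters.
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        labels = km.fit_predict(x.reshape(-1, 1))
        centers.append(km.cluster_centers_.ravel())                # (c_i1, ..., c_iN)
        counts.append(np.bincount(labels, minlength=n_clusters))   # (cnt_i1, ..., cnt_iN)

    all_centers = np.concatenate(centers).reshape(-1, 1)  # M*N batch centers
    all_counts = np.concatenate(counts).astype(float)

    # Stage 2: cluster the M*N batch centers into the final N centers.
    # Weighting by the per-cluster sample counts is an assumption here.
    km2 = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    km2.fit(all_centers, sample_weight=all_counts)
    return np.sort(km2.cluster_centers_.ravel())          # {c_1, ..., c_N}
```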
Accelerating the forward computation of the neural network using lookup tables

Once both the parameters and the inputs of the neural network have been non-uniformly quantized, the set of multiplications that can occur in the neural network is finite, and a lookup table becomes a suitable acceleration method.

Take any layer in the network as an example, and suppose the parameters are quantized to 4 bits (16 quantized values) and the input is also quantized to 4 bits. The multiplications in the neural network come from multiplying inputs by parameters, so there are only 16 × 16 = 256 possible multiplication combinations. We compute all possible products in advance and store them in a lookup table; during forward computation, multiplication can be skipped entirely, and the result of each product is obtained directly from the table.
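Under the same illustrative assumptions, the per-layer table can be precomputed as an outer product of the two sets of quantized values. The helper below is a sketch, not from the patent:

```python
import numpy as np

def build_lookup_table(weight_values, input_values):
    """Precompute all products of quantized weights and quantized inputs.

    weight_values: the layer's quantized parameter values (e.g., 16 for 4-bit).
    input_values:  the layer's quantized input values (e.g., 16 for 4-bit).
    Returns a table where table[i, j] = weight_values[i] * input_values[j].
    """
    return np.outer(weight_values, input_values)  # e.g., 16 x 16 = 256 entries

# Example usage: 4-bit weights and 4-bit inputs give a 256-entry table, and
# each product is then fetched as lut[w_idx, x_idx] instead of multiplied.
```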
Using non-uniform quantization and the lookup-table technique, the overall steps for accelerating the forward computation of the neural network are as follows (an illustrative sketch follows the list):

1. Non-uniformly quantize the parameters using the Deep Compression method;

2. Quantize the inputs, using the calibration data to compute the quantized values of the inputs;

3. Precompute a lookup table for each layer from the quantized values of the parameters and inputs;

4. During forward computation, for any input feature, quantize the input of each layer and then obtain the multiplication results by table lookup, computing layer by layer until all layers are completed.
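Putting the steps together, the following is a minimal sketch of the forward pass of one fully connected layer with multiplications replaced by lookups into the precomputed table. The nearest-center quantization step and the index-based weight representation are illustrative assumptions; bias terms, activation functions, and other layer types are omitted:

```python
import numpy as np

def quantize_to_indices(x, centers):
    """Map each element of x to the index of its nearest quantized value."""
    return np.abs(x.reshape(-1, 1) - centers.reshape(1, -1)).argmin(axis=1)

def fc_forward_with_lut(x, w_idx, x_centers, lut):
    """Forward pass of one fully connected layer using table lookups.

    x:         raw input vector of the layer, shape (in_dim,).
    w_idx:     quantized weight indices, shape (out_dim, in_dim).
    x_centers: the layer's N quantized input values {c_1, ..., c_N}.
    lut:       lut[i, j] = w_values[i] * x_values[j], precomputed per layer.
    """
    x_idx = quantize_to_indices(x, x_centers)   # quantize this layer's input
    products = lut[w_idx, x_idx[None, :]]       # look up every w*x product
    return products.sum(axis=1)                 # accumulate; no multiplications

# Layer by layer: y = fc_forward_with_lut(x, ...), then x = y for the next layer.
```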
From the above description, a system and a method for accelerating neural network computation using non-uniform quantization and lookup tables can be constructed.

FIG. 2 is a block diagram illustrating a system for accelerating neural network computation using non-uniform quantization and lookup tables in accordance with the present invention.

As shown in FIG. 2, a system 200 for accelerating neural network computation using non-uniform quantization and lookup tables in accordance with the present invention can include: a network parameter quantization unit 201 for performing non-uniform quantization on the parameters (e.g., weights) of each layer of the neural network; an input quantization unit 202 for performing non-uniform quantization on the input of each layer of the neural network (for the first layer, the original input; for subsequent layers, the output of the previous layer); a lookup table 203 for each layer, constructed by multiplying all quantized parameter values of the layer with all quantized input values of the layer; and a main processing unit 204 for performing the forward computation of the neural network. During the forward computation, the main processing unit 204 looks up the result of each parameter-input multiplication in the lookup table 203 of the layer, computing layer by layer until all layers are completed.

In the system 200, the network parameter quantization unit 201 can be configured to non-uniformly quantize the parameters in accordance with the Deep Compression method. The input quantization unit 202 can be configured to accomplish the non-uniform quantization of the input of each layer of the neural network by a calibration method. More specifically, the input quantization unit 202 can be configured to perform the calibration by: dividing the calibration data into M batches; performing a forward computation of the network on each batch to obtain the input of the layer, {X_1, X_2, ..., X_M}; performing K-means clustering on all elements of the i-th batch X_i to obtain N = 2^n cluster centers and the sample count of each cluster, {(c_i1, cnt_i1), (c_i2, cnt_i2), ..., (c_iN, cnt_iN)}; and performing a further K-means clustering over the cluster centers of all M batches to finally obtain the N cluster centers that serve as the quantized values {c_1, c_2, ..., c_N} of the input.
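For illustration only, the units 201 to 204 could be composed as follows, reusing the hypothetical helpers sketched earlier (calibrate_layer_inputs, build_lookup_table, fc_forward_with_lut); the class name and interfaces are assumptions, not part of the patent:

```python
class LutAcceleratedNetwork:
    """Illustrative sketch of system 200 (units 201-204 wired together)."""

    def __init__(self, quantized_layers, n_bits=4):
        # quantized_layers: per layer, a pair (weight_values, weight_indices)
        # as produced by the network parameter quantization unit (201).
        self.layers = quantized_layers
        self.n_bits = n_bits
        self.input_centers = []  # per-layer quantized input values (unit 202)
        self.luts = []           # per-layer lookup tables (203)

    def calibrate(self, batches_per_layer):
        # Unit 202: calibrate each layer's input from forward-computed batches,
        # then build each layer's lookup table (203) from the quantized values.
        for (w_vals, _), batches in zip(self.layers, batches_per_layer):
            centers = calibrate_layer_inputs(batches, n_bits=self.n_bits)
            self.input_centers.append(centers)
            self.luts.append(build_lookup_table(w_vals, centers))

    def forward(self, x):
        # Unit 204: forward computation, layer by layer, via table lookups.
        for (_, w_idx), centers, lut in zip(self.layers, self.input_centers, self.luts):
            x = fc_forward_with_lut(x, w_idx, centers, lut)
        return x
```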
FIG. 3 is a flow chart illustrating a method of accelerating neural network computation using non-uniform quantization and lookup tables in accordance with the present invention.

As shown in FIG. 3, the method 300 for accelerating neural network computation using non-uniform quantization and lookup tables in accordance with the present invention begins at step S310, in which the parameters of each layer of the neural network are non-uniformly quantized.

Those skilled in the art will understand that, as described above, the non-uniform quantization of the parameters of each layer of the neural network may adopt the Deep Compression method, although the use of other methods of non-uniform parameter quantization is not excluded. For the Deep Compression method, see S. Han, H. Mao, W. J. Dally; Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding; arXiv:1510.00149, 2015, which is hereby incorporated by reference in its entirety.

In step S320, the input of each layer of the neural network is non-uniformly quantized. This can be accomplished by a calibration method. Specifically, as described above, the calibration can be performed by the following steps:

1. Divide the calibration data into M batches;

2. Perform a forward computation of the network on each batch to obtain the input of this layer, {X_1, X_2, ..., X_M};

3. Perform K-means clustering on all elements of the i-th batch X_i to obtain N = 2^n cluster centers and the sample count of each cluster, {(c_i1, cnt_i1), (c_i2, cnt_i2), ..., (c_iN, cnt_iN)};

4. Perform a further K-means clustering over the cluster centers of all M batches to finally obtain the N cluster centers that serve as the quantized values {c_1, c_2, ..., c_N} of the input.
Those skilled in the art will understand that although, in the specific embodiment, the quantization of the input is accomplished by a calibration method, and more specifically by K-means clustering through the specific steps described above, the invention does not exclude the use of other methods to non-uniformly quantize the input.

Next, in step S330, a lookup table for each layer is constructed by multiplying all quantized values of the parameters of the layer with all quantized values of the inputs of the layer.

In step S340, during the forward computation of the neural network, the result of each parameter-input multiplication is looked up in the layer's lookup table, computing layer by layer until all layers are completed.

The method 300 can then end.

One of ordinary skill in the art will recognize that the method of the present invention can be implemented as a computer program. As described above in connection with FIG. 3, the method according to the above embodiments may be carried out by one or more programs including instructions that cause a computer or processor to perform the algorithms described in connection with the drawings. These programs can be stored on and provided to a computer or processor using various types of non-transitory computer readable media, which include various types of tangible storage media. Examples of non-transitory computer readable media include magnetic recording media (such as floppy disks, magnetic tapes, and hard disk drives), magneto-optical recording media (such as magneto-optical disks), CD-ROM (compact disc read-only memory), CD-R, CD-R/W, and semiconductor memory (such as ROM, PROM (programmable ROM), EPROM (erasable PROM), flash ROM, and RAM (random access memory)). Further, these programs can be provided to a computer using various types of transitory computer readable media. Examples of transitory computer readable media include electrical signals, optical signals, and electromagnetic waves. A transitory computer readable medium can provide a program to a computer via a wired communication path, such as a wire or optical fiber, or via a wireless communication path.

Therefore, according to the present invention, a computer program or a computer readable medium can also be provided for recording instructions executable by a processor which, when executed by the processor, cause the processor to perform a method of accelerating neural network computation using non-uniform quantization and lookup tables, including the following operations: performing non-uniform quantization on the parameters of each layer of the neural network; performing non-uniform quantization on the inputs of each layer of the neural network; constructing a lookup table for each layer by multiplying all quantized parameter values of the layer with all quantized input values of the layer; and, during the forward computation of the neural network, looking up the result of each parameter-input multiplication in the layer's lookup table, computing layer by layer until all layers are completed.

Various embodiments and implementations of the present invention have been described above. However, the spirit and scope of the present invention are not limited thereto; those skilled in the art will be able to make further applications in accordance with the teachings of the present invention, and such applications are within the scope of the present invention.

That is, the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit its embodiments. Other variations or modifications may be made by those of ordinary skill in the art on the basis of the above description; there is neither a need nor a way to exhaust all implementations here. Any modification, substitution, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (7)

  1. A method for accelerating neural network computation using non-uniform quantization and lookup tables, comprising:
    performing non-uniform quantization on the parameters of each layer of the neural network;
    performing non-uniform quantization on the inputs of each layer of the neural network;
    constructing a lookup table for each layer by multiplying all quantized values of the parameters of the layer with all quantized values of the inputs of the layer; and
    during the forward computation of the neural network, for the multiplication of the parameters of each layer with the inputs, looking up the result of the multiplication in the layer's lookup table, computing layer by layer until all layers are completed.
  2. The method according to claim 1, wherein the non-uniform quantization of the input of each layer of the neural network is accomplished by a calibration method.
  3. The method according to claim 2, wherein the calibration is performed by the following steps:
    dividing the calibration data into M batches;
    performing a forward computation of the network on each batch to obtain the input of the layer, {X_1, X_2, ..., X_M};
    performing K-means clustering on all elements of the i-th batch X_i to obtain N = 2^n cluster centers and the sample count of each cluster, {(c_i1, cnt_i1), (c_i2, cnt_i2), ..., (c_iN, cnt_iN)}; and
    performing a further K-means clustering over the cluster centers of all M batches to finally obtain the N cluster centers that serve as the quantized values {c_1, c_2, ..., c_N} of the input.
  4. A system for accelerating neural network computation using non-uniform quantization and lookup tables, comprising:
    a network parameter quantization unit for performing non-uniform quantization on the parameters of each layer of the neural network;
    an input quantization unit for performing non-uniform quantization on the inputs of each layer of the neural network;
    a lookup table for each layer, constructed by multiplying all quantized values of the parameters of the layer with all quantized values of the inputs of the layer; and
    a main processing unit for performing the forward computation of the neural network,
    wherein, during the forward computation of the neural network, the main processing unit, for the multiplication of the parameters of each layer with the inputs, looks up the result of the multiplication in the layer's lookup table, computing layer by layer until all layers are completed.
  5. The system according to claim 4, wherein the input quantization unit is configured to accomplish the non-uniform quantization of the input of each layer of the neural network by a calibration method.
  6. The system according to claim 5, wherein the input quantization unit is configured to perform the calibration by:
    dividing the calibration data into M batches;
    performing a forward computation of the network on each batch to obtain the input of the layer, {X_1, X_2, ..., X_M};
    performing K-means clustering on all elements of the i-th batch X_i to obtain N = 2^n cluster centers and the sample count of each cluster, {(c_i1, cnt_i1), (c_i2, cnt_i2), ..., (c_iN, cnt_iN)}; and
    performing a further K-means clustering over the cluster centers of all M batches to finally obtain the N cluster centers that serve as the quantized values {c_1, c_2, ..., c_N} of the input.
  7. A computer readable medium for recording instructions executable by a processor which, when executed by the processor, cause the processor to perform a method of accelerating neural network computation using non-uniform quantization and lookup tables, including the following operations:
    performing non-uniform quantization on the parameters of each layer of the neural network;
    performing non-uniform quantization on the inputs of each layer of the neural network;
    constructing a lookup table for each layer by multiplying all quantized values of the parameters of the layer with all quantized values of the inputs of the layer; and
    during the forward computation of the neural network, for the multiplication of the parameters of each layer with the inputs, looking up the result of the multiplication in the layer's lookup table, computing layer by layer until all layers are completed.
PCT/CN2018/087117 2017-10-23 2018-05-16 Neural network computation acceleration method and system based on non-uniform quantization and look-up table WO2019080483A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710990577.5A CN109697508A (en) 2017-10-23 2017-10-23 Utilize the method and system of non-uniform quantizing and look-up table accelerans network query function
CN201710990577.5 2017-10-23

Publications (1)

Publication Number Publication Date
WO2019080483A1 (en)

Family

ID=66226693

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/087117 WO2019080483A1 (en) 2017-10-23 2018-05-16 Neural network computation acceleration method and system based on non-uniform quantization and look-up table

Country Status (2)

Country Link
CN (1) CN109697508A (en)
WO (1) WO2019080483A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0901008A1 (en) * 1997-09-04 1999-03-10 Ford Global Technologies, Inc. Method of generating correction tables for misfire detection using neural networks
CN102780542A (en) * 2012-07-19 2012-11-14 南京邮电大学 Gain factor adjustment method for Hopfield neural network signal blind detection
CN105184362A (en) * 2015-08-21 2015-12-23 中国科学院自动化研究所 Depth convolution neural network acceleration and compression method based on parameter quantification
CN106909970A (en) * 2017-01-12 2017-06-30 南京大学 A kind of two-value weight convolutional neural networks hardware accelerator computing module based on approximate calculation
CN106919942A (en) * 2017-01-18 2017-07-04 华南理工大学 For the acceleration compression method of the depth convolutional neural networks of handwritten Kanji recognition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106485316B (en) * 2016-10-31 2019-04-02 北京百度网讯科技有限公司 Neural network model compression method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0901008A1 (en) * 1997-09-04 1999-03-10 Ford Global Technologies, Inc. Method of generating correction tables for misfire detection using neural networks
CN102780542A (en) * 2012-07-19 2012-11-14 南京邮电大学 Gain factor adjustment method for Hopfield neural network signal blind detection
CN105184362A (en) * 2015-08-21 2015-12-23 中国科学院自动化研究所 Depth convolution neural network acceleration and compression method based on parameter quantification
CN106909970A (en) * 2017-01-12 2017-06-30 南京大学 A kind of two-value weight convolutional neural networks hardware accelerator computing module based on approximate calculation
CN106919942A (en) * 2017-01-18 2017-07-04 华南理工大学 For the acceleration compression method of the depth convolutional neural networks of handwritten Kanji recognition

Also Published As

Publication number Publication date
CN109697508A (en) 2019-04-30

Similar Documents

Publication Publication Date Title
US10229356B1 (en) Error tolerant neural network model compression
TWI791610B (en) Method and apparatus for quantizing artificial neural network and floating-point neural network
KR102589303B1 (en) Method and apparatus for generating fixed point type neural network
US10152676B1 (en) Distributed training of models using stochastic gradient descent
US11271876B2 (en) Utilizing a graph neural network to identify supporting text phrases and generate digital query responses
US10984308B2 (en) Compression method for deep neural networks with load balance
US10032463B1 (en) Speech processing with learned representation of user interaction history
TWI722434B (en) Self-tuning incremental model compression method in deep neural network
CN110992935B (en) Computing system for training neural networks
Alvarez et al. On the efficient representation and execution of deep acoustic models
WO2019037700A1 (en) Speech emotion detection method and apparatus, computer device, and storage medium
CN110852421A (en) Model generation method and device
CN110852438A (en) Model generation method and device
JP7287397B2 (en) Information processing method, information processing apparatus, and information processing program
TW202139073A (en) Neural network weight encoding
CN108509422B (en) Incremental learning method and device for word vectors and electronic equipment
KR20200089588A (en) Electronic device and method for controlling the electronic device thereof
US10140581B1 (en) Conditional random field model compression
TWI738048B (en) Arithmetic framework system and method for operating floating-to-fixed arithmetic framework
Jiang et al. A low-latency LSTM accelerator using balanced sparsity based on FPGA
Huai et al. Latency-constrained DNN architecture learning for edge systems using zerorized batch normalization
CN110874635A (en) Deep neural network model compression method and device
JP2022042467A (en) Artificial neural network model learning method and system
JP2024043504A (en) Methods, devices, electronic devices and media for accelerating neural network model inference
WO2019080483A1 (en) Neural network computation acceleration method and system based on non-uniform quantization and look-up table

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18869565

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18869565

Country of ref document: EP

Kind code of ref document: A1