CN112200311A - 4-bit quantization inference method, device, equipment and readable medium - Google Patents
4-bit quantization inference method, device, equipment and readable medium
- Publication number
- CN112200311A CN112200311A CN202010980722.3A CN202010980722A CN112200311A CN 112200311 A CN112200311 A CN 112200311A CN 202010980722 A CN202010980722 A CN 202010980722A CN 112200311 A CN112200311 A CN 112200311A
- Authority
- CN
- China
- Prior art keywords
- quantization
- model
- inference
- reasoning
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Links
- 238000000034 method Methods 0.000 title claims abstract description 48
- 238000013139 quantization Methods 0.000 claims abstract description 116
- 238000004364 calculation method Methods 0.000 claims abstract description 36
- 238000012549 training Methods 0.000 claims abstract description 18
- 230000009466 transformation Effects 0.000 claims abstract description 14
- 238000004590 computer program Methods 0.000 claims description 9
- 238000007667 floating Methods 0.000 claims description 7
- 238000010606 normalization Methods 0.000 claims description 3
- 230000001131 transforming effect Effects 0.000 claims description 3
- 230000008569 process Effects 0.000 abstract description 11
- 238000010586 diagram Methods 0.000 description 8
- 230000006870 function Effects 0.000 description 5
- 238000006243 chemical reaction Methods 0.000 description 4
- 230000000694 effects Effects 0.000 description 4
- 230000003287 optical effect Effects 0.000 description 4
- 230000008901 benefit Effects 0.000 description 3
- 238000004422 calculation algorithm Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 230000008859 change Effects 0.000 description 2
- 238000013461 design Methods 0.000 description 2
- 239000000835 fiber Substances 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 230000008447 perception Effects 0.000 description 2
- 238000012545 processing Methods 0.000 description 2
- 238000004458 analytical method Methods 0.000 description 1
- 238000013528 artificial neural network Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 230000006835 compression Effects 0.000 description 1
- 238000007906 compression Methods 0.000 description 1
- 238000013527 convolutional neural network Methods 0.000 description 1
- 238000013500 data storage Methods 0.000 description 1
- 230000014509 gene expression Effects 0.000 description 1
- 239000011159 matrix material Substances 0.000 description 1
- 238000003062 neural network model Methods 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000007781 pre-processing Methods 0.000 description 1
- 238000011002 quantification Methods 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Neurology (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
The invention discloses a 4-bit quantization inference method, which comprises the following steps: training to generate a pseudo-quantization model, and merging the normalization layers in the pseudo-quantization model into the convolutional layers; performing equivalence transformation and requantization on the data types of the weight parameters to convert the pseudo-quantization model into a quantization model; folding and merging constants based on the quantization model to generate an inference model with an output data type of int4; and performing inference calculation based on the inference model. The invention also discloses a device for 4-bit quantization inference, a computer device and a readable storage medium. The invention realizes 4-bit inference on the GPU, avoids the problem that data cannot be stored because the CPU lacks data types such as int4, reduces the model volume to 1/8 of the original model, and reduces the memory occupied during inference to 1/8 of the original, thereby greatly accelerating the inference calculation process.
Description
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a readable medium for 4-bit quantization inference.
Background
Neural network models generally occupy a large amount of disk space; for example, the model file of AlexNet exceeds 200 MB. Such a model contains millions of parameters, and most of the space is used to store them. These parameters are floating-point numbers, and common compression algorithms have difficulty compressing them.
A typical model performs its internal computation with floating-point numbers, and floating-point computation consumes relatively large computing resources (memory and CPU/GPU time). If the internal computation of the model could instead use simpler numeric types without affecting the model's accuracy, the computation speed would be greatly improved and the computing resources consumed would be greatly reduced, which is particularly important for mobile devices.
Quantization techniques are therefore introduced. Quantization compresses the original network by reducing the number of bits required to represent each weight; a representative example in the prior art is the quantization-aware training provided in Google's TensorFlow. At present, most training and inference frameworks support int8 quantization: an 8-bit quantized model can be compressed to 1/4 of its original size, and the running speed of the network can be greatly improved. Compared with 8-bit quantization, 4-bit quantization can further halve the model size and improve the running speed by about 50%. However, since 4 bits can represent at most 16 values, the classification accuracy of the model is reduced, and most training and inference frameworks therefore do not currently support int4 quantization. Because mainstream inference frameworks do not support int4 inference, int4 cannot be used in model inference to accelerate the inference process of the network. In addition, the absence of the int4 data type in the CPU also causes practical operational difficulties. As a result, int4 quantization algorithms mostly stay at the theoretical level and are difficult to put to use.
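By way of illustration only, the following is a minimal sketch of the maximum-value (symmetric) quantization idea referred to above, written in Python/NumPy; the function names are illustrative and are not part of any framework or of the claimed method:

```python
import numpy as np

def quantize_symmetric(x, num_bits=4):
    """Symmetric (maximum-absolute-value) quantization of a float tensor
    to a signed integer grid of the given bit width."""
    qmax = 2 ** (num_bits - 1) - 1            # 7 for 4 bits
    scale = qmax / np.max(np.abs(x))          # quantization coefficient
    q = np.clip(np.round(x * scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    """Approximate recovery of the original float values."""
    return q.astype(np.float32) / scale

w = np.random.randn(64).astype(np.float32)
w_q, scale = quantize_symmetric(w)            # 4-bit grid: only 16 distinct values
w_hat = dequantize(w_q, scale)                # introduces quantization error
```

The smaller the bit width, the coarser the grid, which is why 4-bit quantization trades accuracy for model size and speed.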
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide a method, an apparatus, a device, and a readable medium for 4-bit quantization inference, which implement 4-bit inference on a GPU, avoid the problem that data cannot be stored because the CPU lacks the int4 data type, and reduce the model volume to 1/8 of the original model, so that the memory occupied during inference is reduced to 1/8 of the original, thereby greatly accelerating the inference calculation process.
Based on the above object, an aspect of the embodiments of the present invention provides a method for 4-bit quantization inference, including the following steps: training to generate a pseudo-quantization model, and merging the normalization layers in the pseudo-quantization model into the convolutional layers; performing equivalence transformation and requantization on the data types of the weight parameters to convert the pseudo-quantization model into a quantization model; folding and merging constants based on the quantization model to generate an inference model with an output data type of int4; and performing inference calculation based on the inference model.
In some embodiments, equivalently transforming and re-quantizing the data types of the weight parameters comprises: converting the data type of the weight parameter from uint4 to int4; and re-quantizing the weight parameters of the int4 data type.
In some embodiments, folding and merging the constants based on the quantization model comprises: combining the dequantization operation of the upper layer and the quantization operation of the lower layer between consecutive convolutions in the quantization model.
In some embodiments, folding and merging the constants based on the quantization model comprises: folding and merging the constants of the shortcut branches in the quantization model.
In some embodiments, performing inference calculations based on the inference model includes: performing a pad operation on the convolution input values, and converting the data type of the input values into int4; performing a convolution operation on the input values of int4 data type, combining 8 convolution results of int4 data type into a new convolution result of int32 data type, and storing the new convolution result; performing a dequantization operation on the stored convolution result; and converting the dequantized convolution result into a data output of int4 data type.
In some embodiments, dequantizing the stored convolution result includes: converting the convolution result into a floating-point number, multiplying it by the dequantization coefficient, and adding the convolution bias.
In some embodiments, converting the dequantized convolution result to a data output of data type int4 includes: clipping the convolution result to limit its range to -8 to 7, and converting it into a data output of data type int4.
In another aspect of the embodiments of the present invention, a device for 4-bit quantization inference is further provided, including: an initial module configured to train and generate a pseudo-quantization model and merge the normalization layers in the pseudo-quantization model into the convolutional layers; an equivalence transformation module configured to perform equivalence transformation and requantization on the data types of the weight parameters so as to convert the pseudo-quantization model into a quantization model; a constant folding module configured to fold and merge constants based on the quantization model to generate an inference model with an output data type of int4; and an inference calculation module configured to perform inference calculation based on the inference model.
In another aspect of the embodiments of the present invention, there is also provided a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of the method.
In a further aspect of the embodiments of the present invention, a computer-readable storage medium is also provided, which stores a computer program that, when executed by a processor, implements the above method steps.
The invention has the following beneficial technical effects: 4-bit inference on the GPU is realized; the problem that data cannot be stored because the CPU lacks data types such as int4 is avoided; the model volume is reduced to 1/8 of the original model; and the memory occupied during inference is reduced to 1/8 of the original, greatly accelerating the inference calculation process.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other embodiments can be obtained by using the drawings without creative efforts.
FIG. 1 is a diagram of an embodiment of a method for 4-bit quantization inference provided by the present invention;
FIG. 2 is a diagram of an embodiment of an apparatus for 4-bit quantization inference provided by the present invention;
FIG. 3 is a schematic diagram of an embodiment of a computer device provided by the present invention;
FIG. 4 is a schematic diagram of an embodiment of a computer-readable storage medium provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used for distinguishing two entities with the same name but different names or different parameters, and it should be noted that "first" and "second" are merely for convenience of description and should not be construed as limitations of the embodiments of the present invention, and they are not described in any more detail in the following embodiments.
In view of the above objects, a first aspect of an embodiment of the present invention proposes an embodiment of a method for 4-bit quantization inference. Fig. 1 is a schematic diagram illustrating an embodiment of a method for 4-bit quantization inference provided by the present invention. As shown in fig. 1, the embodiment of the present invention includes the following steps:
S01, training to generate a pseudo-quantization model, and merging normalization layers in the pseudo-quantization model into the convolutional layers;
S02, performing equivalence transformation and requantization on the data types of the weight parameters so as to convert the pseudo-quantization model into a quantization model;
S03, folding and merging constants based on the quantization model to generate an inference model with an output data type of int4; and
S04, performing inference calculation based on the inference model.
In this embodiment, the method is mainly divided into two stages: generating an inference engine, and performing inference calculation with the inference engine. In the inference-engine generation stage, the equivalence transformation of the model, the merging of constants, the pre-calculation of constants, and the preprocessing and storage of the weight parameters are carried out, and the calculation structure of the inference engine is stored in the form of a computation graph, so that the inference engine can later be loaded directly for the actual inference operation. In the actual inference calculation stage, the optimized inference engine is loaded directly onto the GPU, which omits the redundant calculation of loading the model and makes the calculation more efficient. Unlike offline quantization, training quantization needs to simulate the effect of quantization operations during training; through training, the model learns and adapts to the errors caused by quantization, thereby improving quantization accuracy. Training quantization is therefore also referred to as quantization-aware training (QAT), meaning that the training is aware that the model will be converted into a quantized model. Model quantization can address the problems of a conventional convolutional neural network such as a large number of parameters, a large amount of calculation, and large memory occupation, and has the advantages of compressing the parameters of the neural network, improving speed, and reducing memory occupation.
In this embodiment, the pseudo-quantization model generated by training may be a pseudo-quantization model produced by a current mainstream deep learning framework, which is loaded directly and then parsed and processed. For the training method that generates the pseudo-quantization model, reference may be made to Google's quantization-aware training method. Folding the batchnorm (normalization) layer and merging it into the convolutional layer saves computation; however, the folded convolution weight parameters change, so continuing to use the original quantization parameters would reduce network accuracy, and the new weight parameters need to be re-quantized.
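For illustration, a minimal sketch of the batchnorm-folding step described above, assuming the standard formulation in which the batchnorm immediately follows the convolution; the function and parameter names are illustrative:

```python
import numpy as np

def fold_batchnorm(conv_w, conv_b, gamma, beta, mean, var, eps=1e-5):
    """Fold a batchnorm layer that follows a convolution into the
    convolution's weights and bias.

    conv_w: (out_channels, in_channels, kh, kw) float weights
    conv_b: (out_channels,) float bias (zeros if the conv has none)
    gamma, beta, mean, var: (out_channels,) batchnorm parameters
    """
    std = np.sqrt(var + eps)
    factor = gamma / std                              # per-output-channel scale
    w_folded = conv_w * factor[:, None, None, None]   # scale each output channel
    b_folded = (conv_b - mean) * factor + beta
    return w_folded, b_folded
```

Because the folded weights w_folded differ from the original weights, the original quantization coefficients no longer fit, which is why the re-quantization step below is needed.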
In this embodiment, the re-quantization process performs maximum-value quantization on the weights per convolution output channel and calculates new weight quantization coefficients, while keeping the convolution input quantization coefficients unchanged, from which the dequantization coefficients of the convolution output are derived.
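A minimal sketch of this per-output-channel re-quantization, assuming the folded weights from the previous step and an unchanged input coefficient quant1; names are illustrative:

```python
import numpy as np

def requantize_weights(w_folded, quant1, num_bits=4):
    """Per-output-channel maximum-value quantization of the folded weights.

    quant1 (the input quantization coefficient) is kept unchanged and is only
    used to derive the dequantization coefficient of the convolution output.
    """
    qmax = 2 ** (num_bits - 1) - 1                              # 7 for int4
    max_abs = np.abs(w_folded).reshape(w_folded.shape[0], -1).max(axis=1)
    quant2 = qmax / max_abs                                     # new per-channel weight coefficients
    w_q = np.clip(np.round(w_folded * quant2[:, None, None, None]),
                  -qmax - 1, qmax).astype(np.int32)
    requant = quant1 * quant2                                   # conv output is divided by this
    return w_q, quant2, requant
```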
In some embodiments of the invention, equivalently transforming and re-quantizing the data types of the weight parameters comprises: converting the data type of the weight parameter from uint4 to int4; and re-quantizing the weight parameters of the int4 data type.
In this embodiment, the data type int4 is generally adopted for the weight parameters, while the data type uint4, rather than int4, is generally adopted for the convolution input: because a relu precedes the convolution input, no negative values occur, and using uint4 increases the representable range of the data, giving better accuracy. However, if int4 is used for the weights and uint4 for the input, neither CUDA wmma nor Tensor Cores support multiplication between int4 and uint4. Therefore, the transformation of the convolution input data type from uint4 to int4 needs to be completed in the inference-engine generation stage. From the convolution quantization computation process, the following equivalence transformation involving the data type conversion can be derived:
y = W*x + b
  = (quant2·W * quant1·x) / (quant1·quant2) + b
  = (quant2·W * quant1·(x-8) + quant2·W * quant1·8 + quant1·quant2·b) / (quant1·quant2)
  = (W' * quant1·(x-8) + b') / requant
Here, W' = quant2·W, b' = quant2·W * quant1·8 + quant1·quant2·b, and requant = quant1·quant2 are constants whose values can be calculated in advance in the inference-engine generation stage.
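For illustration, a minimal sketch of precomputing the constants of this equivalence transformation, under the assumption that the convolution of W' with a constant input reduces per output channel to the sum of W' (which holds in the interior and at borders padded with the zero point); names are illustrative:

```python
import numpy as np

def precompute_constants(w_q, bias, quant1, quant2):
    """Precompute b' and requant of the transformation
    y = (W' * quant1*(x - 8) + b') / requant.

    w_q:    already-quantized int weights W' = quant2*W, shape (out_ch, in_ch, kh, kw)
    bias:   float convolution bias b, shape (out_ch,)
    quant1: input quantization coefficient (scalar)
    quant2: per-output-channel weight quantization coefficients, shape (out_ch,)
    """
    # The term quant2*W * quant1*8 is a convolution with a constant input,
    # so per output channel it reduces to the sum of W' times quant1*8.
    w_sum = w_q.reshape(w_q.shape[0], -1).sum(axis=1)
    b_prime = w_sum * quant1 * 8 + quant1 * quant2 * bias
    requant = quant1 * quant2
    return b_prime, requant
```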
In this embodiment, first, in order to guarantee mathematical equivalence of the calculation, a pad operation needs to be performed on the convolution input. The weights are quantized: W' = quant2·W, and the weights W' are clipped to the range -8 to 7 to prevent data from overflowing the representable range of int4. Because the inference-engine generation stage runs entirely on the CPU, and the CPU has no int4 data type, W' is force-cast to int32; after the cast only the low 4 bits of each weight are significant and the high bits are all 0, so the weights are grouped 8 at a time and only their low 4 bits are taken to form one new int32 value, which reduces the space occupied by data storage and makes the data convenient for the GPU to operate on during inference. Since the bias also changes under the equivalence transformation, a new bias is calculated in advance according to the formula b' = quant2·W * quant1·8 + quant1·quant2·b. After the dequantization operation is moved after the convolution, the dequantization coefficient requant = quant1·quant2 is calculated in advance; in the subsequent steps, the folding and merging of constants continues, further saving computation.
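A minimal sketch of the weight-packing step just described (eight 4-bit weights packed into one int32 word), written as a host-side reference; the function name is illustrative:

```python
import numpy as np

def pack_int4_to_int32(values):
    """Pack groups of 8 signed 4-bit values (each held in an int32 whose low
    4 bits are significant) into single int32 words.

    `values` length must be a multiple of 8.
    """
    values = np.asarray(values, dtype=np.int32).reshape(-1, 8)
    packed = np.zeros(values.shape[0], dtype=np.uint32)
    for i in range(8):
        nibble = (values[:, i] & 0xF).astype(np.uint32)   # keep only the low 4 bits
        packed |= nibble << np.uint32(4 * i)
    return packed.view(np.int32)

packed = pack_int4_to_int32([-8, 7, 3, -1, 0, 5, -4, 2])  # one int32 word
```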
In some embodiments of the invention, folding and merging the constants based on the quantization model comprises: combining the dequantization operation of the upper layer and the quantization operation of the lower layer between consecutive convolutions in the quantization model.
In this embodiment, folding the constants has two benefits: one matrix multiplication operation between convolutions can be eliminated in the inference process; and after the convolution is calculated, its output can be converted from the original int32 to int4, greatly reducing the data bandwidth transmitted to the next layer.
In this embodiment, the dequantization operation after the convolution calculation may be merged with the quantization operation of the next layer; the newly generated dequantization coefficient after merging is denoted quant_fuse. In this way, the quantization operation on the next layer's convolution input can be omitted, and the input of the next layer's convolution also becomes int4. Note that this constant merging causes the convolution bias to change: bias' = quant·bias.
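A minimal sketch of this merging, assuming (as in the formulas above) that the dequantization is a division by requant and the next layer's quantization is a multiplication by its coefficient quant; the exact composition depends on that convention, and the names are illustrative:

```python
def fuse_dequant_with_next_quant(requant, quant_next, bias):
    """Fold this layer's dequantization (division by requant) and the next
    layer's input quantization (multiplication by quant_next) into a single
    coefficient, adjusting the convolution bias accordingly."""
    quant_fuse = quant_next / requant     # one multiply replaces dequantize + quantize
    bias_fuse = quant_next * bias         # bias' = quant * bias, as noted above
    return quant_fuse, bias_fuse
```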
In some embodiments of the invention, folding and merging the constants based on the quantization model comprises: folding and merging the constants of the shortcut branches in the quantization model.
In this embodiment, the quantization operation of the next-layer convolution is first moved ahead of the add operation, and the quantization and dequantization operations on each branch are then merged; the merged dequantization coefficients of the two branches are quant_fuse1 and quant_fuse2, respectively. In this way, the quantization operation on the next layer's convolution input can be omitted, and the output of the previous layer's convolution becomes int4. Note that this constant merging causes the convolution bias bias1 to change accordingly.
In some embodiments of the invention, performing inference calculations based on an inference model includes: performing a pad operation on the convolution input values, and converting the data type of the input values into int4; performing a convolution operation on the input values of int4 data type, combining 8 convolution results of int4 data type into a new convolution result of int32 data type, and storing the new convolution result; performing a dequantization operation on the stored convolution result; and converting the dequantized convolution result into a data output of int4 data type.
In this embodiment, the convolution inference calculation using the inference engine is implemented entirely on the GPU. First, a pad operation is performed on the convolution input x. The padded input is quantized: x' = quant1·(x-8); the input x' is then clipped to the range -8 to 7 to prevent data from overflowing the representable range of int4, and finally force-cast to int4. The weights, which have already been saved as int4, are loaded and the convolution operation is performed; this operation may use CUDA optimization techniques, and the result of the convolution calculation is stored as int32. The dequantization operation is then carried out: the convolution result is converted into a floating-point number and multiplied by the dequantization coefficient, and the convolution bias is added. Finally, the dequantized result is clipped to the range -8 to 7 to prevent data from overflowing the representable range of int4, and force-cast to int4.
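By way of illustration, a host-side reference sketch of the input quantization and post-convolution steps just described; NumPy has no 4-bit type, so int8 stands in for int4, and whether the zero-point shift of 8 is applied before or after scaling depends on how quant1 is defined (this sketch removes it after scaling). Names are illustrative:

```python
import numpy as np

def quantize_input(x, quant1):
    """Quantize the float input, remove the uint4 zero point of 8,
    clip to the int4 range and force-cast."""
    x_q = np.round(quant1 * x) - 8
    return np.clip(x_q, -8, 7).astype(np.int8)

def finalize_output(acc_int32, dequant_coeff, bias):
    """Dequantize the int32 accumulator, add the convolution bias,
    clip to [-8, 7] and cast back to a 4-bit-range integer."""
    y = acc_int32.astype(np.float32) * dequant_coeff + bias
    return np.clip(np.round(y), -8, 7).astype(np.int8)
```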
In some embodiments of the invention, dequantizing the stored convolution results comprises: converting the convolution result into a floating-point number, multiplying it by the dequantization coefficient, and adding the convolution bias.
In some embodiments of the present invention, converting the dequantized convolution result to a data output of data type int4 comprises: clipping the convolution result to limit its range to -8 to 7, and converting it into a data output of data type int4.
It should be particularly noted that the steps in the embodiments of the 4-bit quantization inference method described above may be interchanged, replaced, added, or deleted; therefore, reasonable permutations, combinations, and transformations of these steps shall also fall within the protection scope of the present invention, and the protection scope of the present invention shall not be limited to the embodiments.
In view of the above object, according to a second aspect of the embodiments of the present invention, an apparatus for 4-bit quantization inference is provided. Fig. 2 is a schematic diagram of an embodiment of an apparatus for 4-bit quantization inference provided by the present invention. As shown in fig. 2, the embodiment of the present invention includes the following modules: an initial module S11 configured to train and generate a pseudo-quantization model and merge the normalization layers in the pseudo-quantization model into the convolutional layers; an equivalence transformation module S12 configured to perform equivalence transformation and requantization on the data types of the weight parameters to convert the pseudo-quantization model into a quantization model; a constant folding module S13 configured to fold and merge constants based on the quantization model to generate an inference model with an output data type of int4; and an inference calculation module S14 configured to perform inference calculation based on the inference model.
In view of the above object, a third aspect of the embodiments of the present invention provides a computer device. Fig. 3 is a schematic diagram of an embodiment of a computer device provided by the present invention. As shown in fig. 3, an embodiment of the present invention includes the following means: at least one processor S21; and a memory S22, the memory S22 storing computer instructions S23 executable on the processor, the instructions when executed by the processor implementing the steps of the above method.
The invention also provides a computer-readable storage medium. FIG. 4 is a schematic diagram illustrating an embodiment of a computer-readable storage medium provided by the present invention. As shown in fig. 4, the computer-readable storage medium S31 stores a computer program S32 that, when executed by a processor, performs the method described above.
Finally, it should be noted that, as those of ordinary skill in the art will appreciate, all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing related hardware; the program of the 4-bit quantization inference method may be stored in a computer-readable storage medium, and when executed, may include the processes of the method embodiments described above. The storage medium of the program may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like. The embodiments of the computer program may achieve effects the same as or similar to those of any of the above-described method embodiments.
Furthermore, the methods disclosed according to embodiments of the present invention may also be implemented as a computer program executed by a processor, which may be stored in a computer-readable storage medium. Which when executed by a processor performs the above-described functions defined in the methods disclosed in embodiments of the invention.
Further, the above method steps and system elements may also be implemented using a controller and a computer readable storage medium for storing a computer program for causing the controller to implement the functions of the above steps or elements.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is merely exemplary and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples; within the spirit of the embodiments of the invention, technical features in the above embodiments or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like made without departing from the spirit and principles of the embodiments of the present invention shall be included within the protection scope of the embodiments of the present invention.
Claims (10)
1. A method for 4-bit quantization inference, comprising the steps of:
training to generate a pseudo-quantization model, and merging normalization layers in the pseudo-quantization model into a convolutional layer;
performing equivalence transformation and requantization on the data type of the weight parameter to convert the pseudo quantization model into a quantization model;
folding and merging the constants based on the quantization model to generate an inference model with an output data type of int4; and
performing inference calculation based on the inference model.
2. The method of 4-bit quantization inference of claim 1, wherein equivalently transforming and re-quantizing the data types of the weight parameters comprises:
converting the data type of the weight parameter from uint4 to int4; and
re-quantizing the weight parameters of the int4 data type.
3. The method of 4-bit quantization inference of claim 1, wherein folding a constant based on the quantization model comprises:
combining the dequantization operation of the upper layer and the quantization operation of the lower layer between consecutive convolutions in the quantization model.
4. The method of 4-bit quantization inference of claim 1, wherein folding a constant based on the quantization model comprises:
folding and merging the constants of the shortcut branches in the quantization model.
5. The method of 4-bit quantization inference according to claim 1, wherein performing inference calculation based on the inference model comprises:
performing a pad operation on the convolution input values, and converting the data type of the input values into int4;
performing a convolution operation on the input values of int4 data type, combining 8 convolution results of int4 data type into a new convolution result of int32 data type, and storing the new convolution result;
performing a dequantization operation on the stored convolution result; and
converting the dequantized convolution result into a data output with a data type of int4.
6. The method of 4-bit quantization inference of claim 5, wherein dequantizing the stored convolution results comprises:
converting the convolution result into a floating-point number, multiplying it by the dequantization coefficient, and adding the convolution bias.
7. The method of 4-bit quantization inference according to claim 5, wherein converting the convolution result after dequantization into a data output with data type int4 comprises:
clipping the convolution result, limiting its range to -8 to 7, and converting it into a data output with a data type of int4.
8. An apparatus for 4-bit quantization inference, comprising:
an initial module configured to train and generate a pseudo-quantization model and merge the normalization layers in the pseudo-quantization model into the convolutional layers;
an equivalence transformation module configured to perform equivalence transformation and requantization on the data types of the weight parameters so as to convert the pseudo-quantization model into a quantization model;
a constant folding module configured to fold and merge constants based on the quantization model to generate an inference model with an output data type of int4; and
an inference calculation module configured to perform inference calculation based on the inference model.
9. A computer device, comprising:
at least one processor; and
a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010980722.3A CN112200311A (en) | 2020-09-17 | 2020-09-17 | 4-bit quantization inference method, device, equipment and readable medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010980722.3A CN112200311A (en) | 2020-09-17 | 2020-09-17 | 4-bit quantization inference method, device, equipment and readable medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112200311A true CN112200311A (en) | 2021-01-08 |
Family
ID=74015359
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010980722.3A Withdrawn CN112200311A (en) | 2020-09-17 | 2020-09-17 | 4-bit quantitative reasoning method, device, equipment and readable medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112200311A (en) |
-
2020
- 2020-09-17 CN CN202010980722.3A patent/CN112200311A/en not_active Withdrawn
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113887706A (en) * | 2021-09-30 | 2022-01-04 | 苏州浪潮智能科技有限公司 | Method and device for low bit quantization aiming at one-stage target detection network |
CN113887706B (en) * | 2021-09-30 | 2024-02-06 | 苏州浪潮智能科技有限公司 | Method and device for low-bit quantization of one-stage target detection network |
WO2023164858A1 (en) * | 2022-03-03 | 2023-09-07 | Intel Corporation | Decimal-bit network quantization of convolutional neural network models |
CN114676760A (en) * | 2022-03-10 | 2022-06-28 | 北京智源人工智能研究院 | Pre-training model inference processing method and device, electronic equipment and storage medium |
CN118409866A (en) * | 2024-06-21 | 2024-07-30 | 北京壁仞科技开发有限公司 | Tensor precision processing method, tensor precision processing device, electronic device, medium, and program product |
CN118409866B (en) * | 2024-06-21 | 2024-09-27 | 北京壁仞科技开发有限公司 | Tensor precision processing method, tensor precision processing device, electronic device, medium, and program product |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112200311A (en) | 4-bit quantitative reasoning method, device, equipment and readable medium | |
CN113011581B (en) | Neural network model compression method and device, electronic equipment and readable storage medium | |
JP7091521B2 (en) | Information processing equipment, information processing methods and programs | |
US20200279153A1 (en) | Deriving a concordant software neural network layer from a quantized firmware neural network layer | |
CN115099399A (en) | Neural network model deployment method and device, electronic equipment and storage medium | |
CN111882058A (en) | 4-bit quantization method and system of neural network | |
CN110795235B (en) | Method and system for deep learning and cooperation of mobile web | |
CN112232509A (en) | Edge calculation model compression optimization method, device, equipment and readable medium | |
US20230252294A1 (en) | Data processing method, apparatus, and device, and computer-readable storage medium | |
CN117811586A (en) | Data encoding method and device, data processing system, device and medium | |
CN118036755A (en) | Quantification method, device, equipment and medium of large language model | |
CN114861907A (en) | Data calculation method, device, storage medium and equipment | |
CN113408704A (en) | Data processing method, device, equipment and computer readable storage medium | |
CN117273092A (en) | Model quantization method and device, electronic equipment and storage medium | |
CN117196000A (en) | Edge side model reasoning acceleration method for containerized deployment | |
CN115952847A (en) | Processing method and processing device of neural network model | |
CN114154631A (en) | Convolutional neural network quantization implementation method and device based on FPGA | |
CN112613614A (en) | Method, system, equipment and medium for compressing and accelerating bert model | |
CN113902928A (en) | Image feature extraction method and device and electronic equipment | |
CN112712176A (en) | Compression method and device for deep neural network | |
CN113361677A (en) | Quantification method and device of neural network model | |
CN118521669A (en) | Reasoning method and device based on diffusion converter model | |
CN116260969B (en) | Self-adaptive channel progressive coding and decoding method, device, terminal and medium | |
CN118502709A (en) | Compression method, terminal deployment method, system and electronic equipment of model | |
CN118571254B (en) | Training method of deep learning model and voice synthesis method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication |  |
SE01 | Entry into force of request for substantive examination |  |
WW01 | Invention patent application withdrawn after publication | Application publication date: 20210108 |