CN117273092A - Model quantization method and device, electronic equipment and storage medium

Model quantization method and device, electronic equipment and storage medium

Info

Publication number
CN117273092A
CN117273092A (Application CN202311234994.9A)
Authority
CN
China
Prior art keywords
quantization
quantized
data
layer
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311234994.9A
Other languages
Chinese (zh)
Inventor
郭敬明
张克俭
刘彦
华远志
柴菁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Suiyuan Intelligent Technology Co ltd
Original Assignee
Shanghai Suiyuan Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Suiyuan Intelligent Technology Co ltd filed Critical Shanghai Suiyuan Intelligent Technology Co ltd
Priority to CN202311234994.9A priority Critical patent/CN117273092A/en
Publication of CN117273092A publication Critical patent/CN117273092A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0495 Quantised networks; Sparse networks; Compressed networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The embodiment of the invention discloses a model quantization method and device, electronic equipment and a storage medium. The method comprises the following steps: determining a current target quantization layer of a current target quantization block in the target quantization model; performing inference on the target quantization model to obtain inference activation data of the current target quantization layer; quantizing the weight parameters and the activation data of the current target quantization layer according to the model quantization configuration file to obtain quantization data of the current target quantization layer; calculating the quantization error of the current target quantization layer according to the inference activation data and the quantized activation data; and, once every quantized block of the target quantization model is determined to have completed the quantization process, updating the quantization parameters of each quantized block according to the quantization errors of the quantized layers in each block. The technical scheme of the embodiment of the invention can improve the applicability, flexibility and efficiency of the model quantization method.

Description

Model quantization method and device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of computer application, in particular to a model quantization method, a model quantization device, electronic equipment and a storage medium.
Background
Model quantization is one of the most common methods of model compression, and many neural network quantization strategies and schemes have emerged in recent years. Among the numerous model quantization methods, INT8 (8-bit signed integer) quantization is currently the generally accepted choice, because it accelerates inference without significantly reducing accuracy. However, INT8 quantization compresses the model relatively weakly and still falls short in some applications. INT4 quantization, by comparison, can further reduce storage space and accelerate computation, but it is generally accompanied by serious degradation of model performance and even a risk of overflow.
However, existing INT4 quantization methods often require quantization-aware training of the neural network in advance, which is complex and incurs huge computational and time costs. Meanwhile, existing INT4 quantization methods are not suitable for quantizing certain types of neural network models such as LLMs (Large Language Models), and in particular cannot adaptively balance performance and precision during quantization so as to flexibly realize multi-precision calculation for each layer. It can be seen that existing INT4 quantization methods have low applicability and flexibility.
Disclosure of Invention
The embodiment of the invention provides a model quantization method, a device, electronic equipment and a storage medium, which can improve the applicability, flexibility and high efficiency of the model quantization method.
According to an aspect of the present invention, there is provided a model quantization method including:
determining a current target quantization layer of a current target quantization block in the target quantization model;
reasoning the target quantization model to obtain reasoning activation data of the current target quantization layer; wherein the inference activation data comprises inference input data and/or inference output data;
quantizing the weight parameters and the activation data of the current target quantization layer according to the model quantization configuration file to obtain quantization data of the current target quantization layer; the quantization data of the current target quantization layer comprises quantization weight and quantization activation data of the current target quantization layer; the activation data comprise input data and/or output data of the current target quantization layer; the quantized activation data comprises quantized input data and/or quantized output data;
calculating quantization errors of the current target quantization layer according to the inference activation data and the quantization activation data;
And under the condition that each quantized block of the target quantized model is determined to finish the quantization process, updating the quantization parameters of each quantized block according to the quantization error of a quantized layer in each quantized block.
According to another aspect of the present invention, there is provided a model quantization apparatus comprising:
the current target quantization layer determining module is used for determining a current target quantization layer of a current target quantization block in the target quantization model;
the inference activation data acquisition module is used for inferring the target quantization model to obtain inference activation data of the current target quantization layer; wherein the inference activation data comprises inference input data and/or inference output data;
the quantization data acquisition module is used for quantizing the weight parameters and the activation data of the current target quantization layer according to the model quantization configuration file to obtain quantization data of the current target quantization layer; the quantization data of the current target quantization layer comprises quantization weight and quantization activation data of the current target quantization layer; the activation data comprise input data and/or output data of the current target quantization layer; the quantized activation data comprises quantized input data and/or quantized output data;
The quantization error calculation module is used for calculating the quantization error of the current target quantization layer according to the reasoning activation data and the quantization activation data;
and the quantization parameter updating module is used for updating the quantization parameters of each quantization block according to the quantization error of the quantization layer in each quantization block under the condition that each quantization block of the target quantization model is determined to finish the quantization process.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the model quantization method according to any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to implement the model quantization method according to any one of the embodiments of the present invention when executed.
In the embodiment of the invention, after the current target quantization layer of the current target quantization block in the target quantization model is determined during model quantization, inference is performed on the target quantization model to obtain the inference activation data of the current target quantization layer. Further, the weight parameters and the activation data of the current target quantization layer are quantized according to the model quantization configuration file to obtain quantization data of the current target quantization layer, such as the quantization weights and the quantized activation data, and the quantization error of the current target quantization layer is calculated according to the inference activation data and the quantized activation data. Correspondingly, once each quantization block of the target quantization model is determined to have finished the quantization process, the quantization parameters of each quantization block are updated according to the quantization errors of the quantization layers in each block. This solves the problems of low efficiency, poor flexibility and poor applicability of existing model quantization methods, and can improve the applicability, flexibility and efficiency of the model quantization method.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a model quantization method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a neural network model decoder according to an embodiment of the present invention;
FIG. 3 is a flowchart of a model quantization method according to a second embodiment of the present invention;
FIG. 4 is a flow chart of a model quantization method according to a second embodiment of the present invention;
FIG. 5 is a schematic diagram showing the effect of compressing a quantization weight of 4bits into a uint32 element or a uint64 element according to a second embodiment of the present invention;
FIG. 6 is a flow chart of a model operation method according to a third embodiment of the present invention;
FIG. 7 is a schematic diagram of a decompression process according to a third embodiment of the present invention;
FIG. 8 is a schematic diagram of a calculation flow of an operator in a target quantization model running process according to the third embodiment of the present invention;
FIG. 9 is a schematic diagram of a model quantization apparatus according to a third embodiment of the present invention;
FIG. 10 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and "object" in the description of the present invention and the claims and the above drawings are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a model quantization method provided in an embodiment of the present invention. This embodiment is applicable to fast quantization of a single-layer network structure within the current block of a model. The method may be performed by a model quantization device, which may be implemented in software and/or hardware and is generally integrated in an electronic device. The electronic device may be a terminal device or a server device, as long as it can process a model quantization process; the embodiment of the present invention does not limit the specific type of the electronic device. Accordingly, as shown in fig. 1, the method includes the following operations:
s110, determining a current target quantization layer of a current target quantization block in the target quantization model.
The target quantization model may be a neural network model that needs quantization processing. The current target quantization block may be a block in the target quantization model that is currently required to be quantized. The current target quantization layer may be a layer structure in the current target quantization block that is currently performing quantization operations.
In an embodiment of the present invention, the target quantization model may optionally be a network model having an encoder and/or decoder structure. By way of example, LLM models, Transformer models, visual network models, etc. having an encoder and/or decoder structure may be used as target quantization models. The target quantization model may include a plurality of blocks. A block may describe a single layer in a model, a component made up of multiple layers, or the entire model itself. Taking a decoder as an example, each block may include a decoder layer, and the decoder layer may include self-attention and MLP (Multi-Layer Perceptron) modules.
In order to improve the flexibility and quantization efficiency of model quantization, the target quantization model can be independently quantized according to a block-by-block structure. Furthermore, for each block structure requiring quantization processing, the calculation precision of each layer can be adaptively adjusted, and each layer structure in the block structure is independently quantized without depending on other layer structures. Therefore, when quantizing the target quantization model, the blocks of the target quantization model may be sequentially traversed for quantization.
Alternatively, the layer structure with multiplication units in the block, i.e. the layer structure weighted by weight, may be sequentially determined as the current target quantization layer. Fig. 2 is a schematic diagram of a neural network model decoder according to an embodiment of the present invention. In a specific example, as shown in fig. 2, each linear (fully connected or dense) layer in the decoder structure, such as q_linear, k_linear, v_linear, and matmul (function for tensor matrix multiplication) structure, may be sequentially traversed to determine the current target quantization layer to be quantized.
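For illustration only, the following sketch shows how the weighted layers of one block could be collected as candidate current target quantization layers. It assumes a PyTorch-style implementation; the helper name collect_target_layers and the restriction to nn.Linear modules are assumptions, not taken from the patent.

```python
import torch.nn as nn

def collect_target_layers(block: nn.Module):
    """Collect the layers of a block that carry weight/multiplication units (here: nn.Linear)."""
    targets = []
    for name, module in block.named_modules():
        if isinstance(module, nn.Linear):   # e.g. q_linear, k_linear, v_linear, MLP projections
            targets.append((name, module))
    return targets
```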
Meanwhile, before the target quantization model is quantized, a model quantization configuration file applicable to the target quantization model can be configured. The model quantization profile may hold all the configuration information required by the target quantization model in the quantization process, including but not limited to the number of weight quantization bits, whether per-channel quantization is used (different channels use different quantization parameters), whether symmetric quantization is used, and the number of input/output quantization bits. Alternatively, each layer structure may be configured with its own model quantization profile, or the entire target quantization model may share one model quantization profile; the embodiment of the present invention does not limit the number or scope of the model quantization profiles.
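For illustration, a hypothetical model quantization configuration covering the fields listed above might look as follows. The field names and values are assumptions made for the sketch, not the patent's actual file format.

```python
# Hypothetical quantization configuration for one layer (or the whole model).
quant_config = {
    "weight_bits": 4,     # number of weight quantization bits, e.g. INT4
    "per_channel": True,  # different channels use different quantization parameters
    "symmetric": True,    # symmetric quantization around zero
    "group_size": 128,    # assumed group size for per-group absmax statistics
    "input_bits": 8,      # quantization bit width for layer inputs
    "output_bits": 8,     # quantization bit width for layer outputs
}
```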
It should be noted that, the process of quantizing the target quantization model may be implemented off-line, for example, based on offline quantization performed by a CPU (Central Processing Unit ), which is not limited by the embodiment of the present invention.
S120, reasoning the target quantization model to obtain reasoning activation data of the current target quantization layer.
The inference activation data may include inference input data (inputs) obtained by performing an inference operation before quantizing the current target quantization layer of the current target quantization block, and/or inference output data (outputs) obtained by performing an inference operation before quantizing the current target quantization layer of the current target quantization block.
After determining the current target quantization layer and before performing inference on the target quantization model, an associated function for collecting the input data and the output data, such as a forward_hook function, may first be registered; through it, the inputs and outputs of the linear layer can be recorded. After the function registration is completed, a subset can be selected from the validation set of the model to serve as a calibration set (calibration dataset), and the calibration set is fed into the target quantization model to start the pre-quantization inference process. In this way, inputs and/or outputs of the current target quantization layer at the original precision of the target quantization model, such as fp16 or fp32, can be obtained to serve as the inference activation data of the current target quantization layer.
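A minimal sketch of this step, assuming a PyTorch-style forward_hook and hypothetical names such as model, calibration_loader and captured, is shown below; it only records the original-precision inputs and outputs of the linear layers while the calibration set is run through the model.

```python
import torch
import torch.nn as nn

captured = {}   # layer name -> {"inputs": [...], "outputs": [...]}

def make_stat_hook(name):
    def hook(module, inputs, output):
        entry = captured.setdefault(name, {"inputs": [], "outputs": []})
        entry["inputs"].append(inputs[0].detach().cpu())    # inference input data
        entry["outputs"].append(output.detach().cpu())      # inference output data
    return hook

handles = [m.register_forward_hook(make_stat_hook(n))
           for n, m in model.named_modules() if isinstance(m, nn.Linear)]

with torch.no_grad():
    for batch in calibration_loader:   # calibration set: a subset of the validation set
        model(batch)

for h in handles:                      # remove the hooks once the statistics are collected
    h.remove()
```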
S130, quantizing the weight parameters and the activation data of the current target quantization layer according to the model quantization configuration file to obtain quantization data of the current target quantization layer.
Wherein the quantization data of the current target quantization layer includes quantization weights and quantization activation data of the current target quantization layer. The activation data comprise input data and/or output data of the current target quantization layer; the quantized activation data comprises quantized input data and/or quantized output data.
The quantization weight may be a weight after quantization processing is performed on a weight parameter of the current target quantization layer. The quantization activation data may include quantization input data and/or quantization output data, the quantization input data may be data obtained by quantizing input data of the current target quantization layer, and the quantization output data may be data obtained by quantizing output data of the current target quantization layer.
Correspondingly, after the pre-quantization inference input data and/or inference output data are obtained through the inference process, the specific configuration information for quantizing the weight parameters and the activation data of the current target quantization layer can be determined from the model quantization configuration file: for the weight parameters, the number of quantization bits, whether per-channel quantization is used and whether symmetric quantization is used; for the activation data, the number of input and/or output quantization bits, and so on. The weight parameters and the activation data of the current target quantization layer are then quantized according to this configuration information, so that the quantization data of the current target quantization layer is obtained.
S140, calculating the quantization error of the current target quantization layer according to the reasoning activation data and the quantization activation data.
Wherein the quantization error can be used to measure the loss of accuracy of the quantization operation on the target quantization model.
In the embodiment of the invention, after each layer structure to be quantized of the target quantization model block is quantized, the quantization error of the layer can be calculated according to the inference activation data and the quantization activation data of the layer. Alternatively, when quantization is required for the output data and the weight parameters, only the quantization error of the output data may be calculated; when the input data, the output data, and the weight parameters need to be quantized at the same time, quantization errors of the input data and the output data can be calculated, respectively.
Alternatively, the quantization error calculated by each layer structure in the current target quantization block may be sequentially recorded in the quantization error list of the current target quantization block. Alternatively, each layer structure in the current target quantization block may establish a quantization error list separately. The quantization error list may be used to record all types of quantization errors.
And S150, under the condition that each quantized block of the target quantized model is determined to complete the quantization process, the quantization parameters of each quantized block are updated according to the quantization errors of the quantization layers in each quantized block.
Wherein the quantization block may be a block in the target quantization model for which the quantization operation has been completed. The quantization layer may be a layer in the quantization block in which the quantization operation has been completed. The quantization parameters, i.e. the relevant parameters of the quantization operation, may for example include, but are not limited to, the number of bits of the weight quantization and the quantization scale factor, as well as the number of bits of the quantization activation data and the quantization scale factor, etc.
In the embodiment of the invention, after the current target quantization block finishes the quantization operation, the output data quantized by the current target quantization block can be used as the input data of the next block to be quantized, the quantization error is transferred to the next layer, the next block to be quantized is updated to the current target quantization block, and the quantization operation is repeatedly performed on the current target quantization block until all the blocks to be quantized finish the quantization operation. It will be appreciated that after quantization of each layer structure is completed, quantization parameters of the layer structure may be correspondingly stored.
It should be noted that, through the above quantization process, low-bit quantization of the target quantization model can be realized, for example at 4-bit precision or even lower precisions such as 3 bits or 2 bits.
Correspondingly, after all layers to be quantized of each block in the target quantization model have been quantized, all quantization errors can be sorted and the unreasonable ones screened out. The layer structures corresponding to the unreasonable quantization errors are then quantized again, and their quantization parameters are updated, until the quantization errors of all layer structures are determined to remain within a reasonable range, finally yielding reasonable quantization parameters for every quantization layer of the target quantization model.
Therefore, this model quantization mode processes the quantization of each layer to be quantized independently, layer by layer, taking the block as the unit, and does not rely on the quantization results of other layers. This effectively improves the quantization efficiency of the model, makes the approach suitable for neural network models with an encoder and/or decoder, and improves the applicability, flexibility and efficiency of the model quantization method.
In the embodiment of the invention, after the current target quantization layer of the current target quantization block in the target quantization model is determined during model quantization, inference is performed on the target quantization model to obtain the inference activation data of the current target quantization layer. Further, the weight parameters and the activation data of the current target quantization layer are quantized according to the model quantization configuration file to obtain quantization data of the current target quantization layer, such as the quantization weights and the quantized activation data, and the quantization error of the current target quantization layer is calculated according to the inference activation data and the quantized activation data. Correspondingly, once each quantization block of the target quantization model is determined to have finished the quantization process, the quantization parameters of each quantization block are updated according to the quantization errors of the quantization layers in each block. This solves the problems of low efficiency, poor flexibility and poor applicability of existing model quantization methods, and can improve the applicability, flexibility and efficiency of the model quantization method.
Example two
Fig. 3 is a flowchart of a model quantization method according to a second embodiment of the present invention, and Fig. 4 is a flow chart of a model quantization method according to the second embodiment of the present invention. The method is implemented based on the above embodiment, and in the present embodiment various specific alternative implementations are provided for quantizing the weight parameters and the activation data of the current target quantization layer, calculating the quantization error of the current target quantization layer, updating the quantization parameters of each quantization block, and the subsequent operations thereof. Accordingly, as shown in fig. 3 and fig. 4, the method of this embodiment may include:
s210, determining a current target quantization layer of a current target quantization block in the target quantization model.
S220, reasoning the target quantization model to obtain reasoning activation data of the current target quantization layer.
Wherein the inference activation data comprises inference input data and/or inference output data.
S230, quantizing the weight parameters and the activation data of the current target quantization layer according to the model quantization configuration file to obtain quantization data of the current target quantization layer.
The quantization data of the current target quantization layer comprises quantization weight and quantization activation data of the current target quantization layer; the activation data comprise input data and/or output data of the current target quantization layer; the quantized activation data comprises quantized input data and/or quantized output data.
In an optional embodiment of the present invention, the quantizing the weight parameter and the activation data of the current target quantization layer according to the model quantization configuration file to obtain quantized data of the current target quantization layer may include: determining a weight parameter quantization threshold according to the weight parameter of the current target quantization layer; calculating a first quantization scale factor of the weight parameter according to the weight parameter quantization threshold; determining a first associated quantization parameter according to the model quantization configuration file, and quantizing a weight parameter of the current target quantization layer according to the first associated quantization parameter and the first quantization scale factor to obtain the quantization weight; determining a second quantization scale factor of activation data of the current target quantization layer; and determining a second associated quantization parameter according to the model quantization configuration file, and quantizing the activation data of the current target quantization layer according to the second associated quantization parameter and the second quantization scale factor to obtain the quantized activation data.
Wherein the weight parameter quantization threshold may be a threshold for quantization reference of the weight parameter. The first quantization scale factor may be a quantization scale factor applicable to the weight parameter. The first associated quantization parameter may be a parameter that is determined from the model quantization profile and is used for quantizing the weight parameter. The second quantization scale factor may be a quantization scale factor applicable to the activation data. The second associated quantization parameter may be a parameter that is determined from the model quantization profile and is used to reference quantization of the activation data. It should be noted that, the parameter types included in the second quantization scale factor and the second associated quantization parameter need to be determined according to the data types included in the activation data. For example, when only input data is included in the activation data, the second quantization scale factor and the second associated quantization parameter may include only relevant parameters of the input data, such as scale and associated quantization parameters of the input data. When only the output data is included in the activation data, the second quantization scale factor and the second associated quantization parameter may include only relevant parameters of the output data, such as scale and associated quantization parameters of the output data. When the activation data includes both input data and output data, the second quantization scale factor and the second associated quantization parameter may include both related parameters of the input data and the output data, such as scale and associated quantization parameters of the input data, and scale and associated quantization parameters of the output data.
Specifically, when the weight parameters and the activation data of the current target quantization layer are quantized according to the model quantization configuration file, the weight parameter quantization threshold may first be determined from the weight parameters of the current target quantization layer. For example, the maximum absolute value of the weight parameters in the current target quantization layer may be counted and set as the weight parameter quantization threshold of the current target quantization layer. Alternatively, as shown in fig. 4, the absmax (maximum absolute value) of the weight parameters in the current target quantization layer may be counted per group. Furthermore, the scale of the weight parameters may be calculated as the first quantization scale factor according to the weight parameter quantization threshold in combination with the associated parameters in the model quantization profile. For example, if the number of quantization bits of the weight parameters is 8 bits according to the associated parameters of the model quantization configuration file, the weight parameter quantization threshold may be divided by 127 to obtain the scale of the weight parameters; if the number of quantization bits of the weight parameters is 4 bits, the weight parameter quantization threshold is divided by 7 to obtain the scale of the weight parameters.
After the first quantization scale factor is obtained, the first associated quantization parameters used for quantizing the weight parameters, such as the number of quantization bits of the weight parameters, whether per-channel quantization is used, the group size, and whether symmetric quantization is used, can be determined according to the model quantization configuration file. Quantizing the weight parameters of the current target quantization layer according to the first associated quantization parameters and the first quantization scale factor may include, but is not limited to, dividing the weight parameters by the first quantization scale factor scale and then performing operations such as round (rounding) and clip (limiting to a maximum and minimum) on the result to obtain quantization weights with the set number of bits, such as 4-bit weights. Correspondingly, after the quantization of the weight parameters is completed, a weight inverse quantization process may be performed: multiplying the quantized weights by the first quantization scale factor scale recovers weight parameters at the original precision, for example fp16 weights.
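A minimal sketch of this weight quantization and inverse quantization step is given below. It assumes symmetric absmax quantization over the whole tensor (a per-channel or per-group variant would compute absmax along the corresponding dimension); it is an illustration, not the patent's exact formulation.

```python
import torch

def quantize_weight(weight: torch.Tensor, bits: int = 4):
    """Symmetric absmax quantization of a weight tensor; returns (quantized weight, scale)."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit, 127 for 8-bit weights
    absmax = weight.abs().max()                # weight parameter quantization threshold
    scale = absmax / qmax                      # first quantization scale factor
    q = torch.clamp(torch.round(weight / scale), -qmax, qmax)   # round + clip
    return q, scale

def dequantize_weight(q: torch.Tensor, scale: torch.Tensor):
    """Inverse quantization: recovers a weight tensor at the original precision (e.g. fp16)."""
    return q * scale
```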
Because each linear layer has a group of input data and/or output data, each sample data also correspondingly generates a group of input data and/or output data, the absolute value maximum value of the input data and/or output data can be counted, and the second quantization scale factor scale suitable for the activation data of the current target quantization layer is calculated according to the absolute value maximum value of the input data and/or output data. It should be noted that, the second quantization scale factor scale may include a quantization scale factor scale corresponding to the input data and/or the output data. That is, the second quantization scale factor scale may include a scale of data, such as a scale of input data or a scale of output data, or may include both a scale of input data and a scale of output data. Similarly, after determining the second quantization scale factor scale, a second associated quantization parameter used for quantizing the activation data, such as the number of quantization bits of the activation data, whether to perform symmetric quantization, etc., may be determined according to the model quantization configuration file. Similarly, the second associated quantization parameter may include an associated quantization parameter corresponding to the input data and/or the output data. That is, the second associated quantization parameter may include an associated quantization parameter of data, such as an associated quantization parameter of input data or an associated quantization parameter of output data, or may include both an associated quantization parameter of input data and an associated quantization parameter of output data.
Furthermore, the quantization of the activation data of the current target quantization layer according to the second associated quantization parameter and the second quantization scale factor may specifically include, but is not limited to, dividing the activation data by the corresponding second quantization scale factor scale, and performing operations such as round and clip on the obtained result to obtain quantized input data and/or quantized output data with a set number of bits after quantization, such as 8bits inputs/outputs.
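For illustration, the activation quantization and inverse quantization described above could be sketched as follows, assuming the second quantization scale factor is derived from the absolute maximum observed over the calibration activations; the function and variable names are assumptions.

```python
import torch

def activation_scale(samples, bits: int = 8):
    """Second quantization scale factor from the absmax over the calibration samples."""
    qmax = 2 ** (bits - 1) - 1
    absmax = max(s.abs().max() for s in samples)
    return absmax / qmax

def fake_quant_activation(x: torch.Tensor, scale, bits: int = 8):
    """Quantize (round + clip) an activation tensor; returns (quantized, dequantized)."""
    qmax = 2 ** (bits - 1) - 1
    q = torch.clamp(torch.round(x / scale), -qmax, qmax)   # quantized activation data
    return q, q * scale                                     # dequantized activation data

# Example usage on a list of captured input tensors (hypothetical):
# input_scale = activation_scale(captured_inputs, bits=8)
# q_x, deq_x = fake_quant_activation(captured_inputs[0], input_scale, bits=8)
```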
It should be noted that, the quantization process in the embodiment of the present invention may be set according to actual requirements. For example, only the output data may be quantized, not the input data and the weight parameters, and at this time, the output data obtained by inference may be directly quantized to obtain quantized output data.
Optionally, the input data and/or the weights may be quantized, but the output data is not quantized, where the input data and/or the weights may be quantized directly to obtain quantized input data and/or quantized weights.
Optionally, if the output data needs to be quantized, and meanwhile, the input data and/or the weights need to be quantized, the input data and/or the weights may be quantized first to obtain quantized input data and/or quantized weights, and then the input data and/or the weights are dequantized to obtain dequantized input data and/or dequantized weights, so that the output data is quantized according to the dequantized input data and/or dequantized weights. Specifically, if only the input data is quantized, the result of inverse-quantization of the input data and the weight parameter (not quantized) may be multiplied, and then the result of multiplication may be quantized to obtain quantized output data. If only the weight parameter is quantized, the quantized output data may be obtained by multiplying the input data (not quantized) and the result of the inverse quantization weight, and then quantizing the result of the multiplication. If the weight parameter and the input data are quantized at the same time, the result of inverse quantization of the input data and the result of inverse quantization of the weight can be multiplied, and the quantized output data can be obtained by quantizing the multiplied result.
In a specific example, in the process of performing quantization processing, when the activation data includes only the input data, the input data may be divided by a scale of the input data, and operations such as round and clip may be performed on the obtained result to obtain quantized input data with a set number of bits after quantization. When the activation data only includes the output data, the output data may be divided by the scale of the output data, and then the result may be subjected to operations such as round and clip to obtain quantized output data with a set number of quantized bits. When the activation data includes both input data and output data, operations such as round and clip can be performed on the input data divided by the scale of the input data to obtain quantized input data with a set number of quantized bits, and operations such as round and clip can be performed on the obtained result divided by the scale of the output data to obtain quantized output data with a set number of quantized bits.
Correspondingly, after the quantization of the activation data is completed, an activation data inverse quantization process may be performed, for example, after the quantized activation data is multiplied by the second quantization scale factor scale, inverse quantized activation data, that is, input data and/or output data with original precision, such as fp16 inputs/outputs, may be obtained.
The number of quantization bits of the activation data may be different from the weight parameter, and the number of quantization bits of the input data and the output data included in the activation data may be different. For example, the number of quantization bits of the weight parameter may be 4, and the number of quantization bits of the activation data may be 8 or 16, which is not limited in the embodiment of the present invention.
Correspondingly, if the activation data comprise input data and output data of the current target quantization layer; the quantization activation data includes quantized input data and quantized output data, and a subsequent flow of the model quantization method may include the operations of:
s240, performing inverse quantization processing on the quantized weight quantized by the current target quantization layer to obtain inverse quantization weight, and performing inverse quantization processing on quantized input data quantized by the current target quantization layer to obtain inverse quantized input data.
S250, calculating a product value of the inverse quantization input data and the inverse quantization weight to obtain an output data intermediate value; and carrying out quantization processing and inverse quantization processing on the output data intermediate value to obtain inverse quantized output data.
The inverse quantization weight may be data obtained by performing inverse quantization processing on the quantization weight quantized by the current target quantization layer. The dequantized input data may be data obtained by dequantizing quantized input data quantized by the current target quantization layer. The dequantized output data may be data obtained by performing quantization processing and dequantization processing on an intermediate value of the output data. When only the output data is quantized, the dequantized output data may be data obtained by directly performing quantization on the original output data to obtain a quantized result, and then performing dequantization on the quantized result. When the input data and/or the weight data are quantized, the dequantized output data may be data obtained by performing quantization processing on intermediate values of the output data to obtain a quantized result, and performing dequantization processing on the quantized result.
In the embodiment of the invention, when calculating the quantization error of the current target quantization layer, if the activation data comprises the input data and the output data of the current target quantization layer and the quantized activation data includes quantized input data and quantized output data, the input data and the output data are quantized at the same time. In this case, inverse quantization processing can be performed on the quantized weights of the current target quantization layer, that is, the quantized weights are multiplied by the first quantization scale factor to obtain the inverse quantization weights. Meanwhile, the quantized input data of the current target quantization layer can be inverse-quantized, i.e. multiplied by the second quantization scale factor corresponding to the input data, to obtain the inverse quantized input data. The product of the inverse quantized input data and the inverse quantization weights can then be calculated to obtain the output data intermediate value. Quantization and inverse quantization are performed on this intermediate value: the intermediate value is divided by the corresponding second quantization scale factor and operations such as round and clip are applied to obtain a quantized output data intermediate value with the set number of bits, and the quantized intermediate value is then multiplied by the second quantization scale factor corresponding to the output data to obtain the inverse quantized output data.
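A sketch of this calculation, under the assumption that both the inputs and the weights are quantized and that the layer is a linear layer without bias, might look like the following; the function name and arguments are illustrative.

```python
import torch

def simulate_layer_output(deq_input, deq_weight, output_scale, out_bits: int = 8):
    """Dequantized input x dequantized weight -> quantize/dequantize the intermediate output."""
    qmax = 2 ** (out_bits - 1) - 1
    intermediate = deq_input @ deq_weight.t()                      # output data intermediate value
    q_out = torch.clamp(torch.round(intermediate / output_scale), -qmax, qmax)
    return q_out * output_scale                                    # inverse quantized output data
```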
S260, calculating the difference value of the reasoning input data and the inverse quantization input data to obtain the quantization error of the input data, and calculating the difference value of the reasoning output data and the inverse quantization output data to obtain the quantization error of the output data.
In the embodiment of the invention, after the quantization weight and the quantized activation data are subjected to inverse quantization, the quantization error of the current target quantization layer can be calculated. Specifically, quantization errors under different quantization configurations may be counted. By way of example, the different quantization configurations may be, for example, whether inputs are quantized, whether outputs are quantized, etc. in the case of quantization weight. For example, if in the case of quantization weight, both inputs and outputs participate in quantization, quantization errors may be statistically calculated for inputs and outputs, respectively. Alternatively, the quantization error may be calculated by using a difference between the data after inverse quantization and the data before quantization, for example, calculating a difference between the inferred input data and the inversely quantized input data to obtain a quantization error of the input data, and calculating a difference between the inferred output data and the inversely quantized output data to obtain a quantization error of the output data. Alternatively, the difference may be characterized by an index type such as SNR (Signal-to-noise ratio) or MSE (Mean Square Error ). For example, SNR or MSE, etc., may be calculated for the inverse quantized activation data and the inferred activation data as quantization errors for the current target quantization layer.
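As an illustration, the MSE- and SNR-style quantization errors mentioned above could be computed as follows. This is a sketch; inference_input/dequantized_input and inference_output/dequantized_output are assumed tensors produced by the previous steps.

```python
import torch

def mse(reference: torch.Tensor, dequantized: torch.Tensor) -> float:
    """Mean squared error between pre-quantization data and its dequantized counterpart."""
    return torch.mean((reference - dequantized) ** 2).item()

def snr_db(reference: torch.Tensor, dequantized: torch.Tensor) -> float:
    """Signal-to-noise ratio in dB; a higher value means a smaller quantization error."""
    signal = torch.sum(reference ** 2)
    noise = torch.sum((reference - dequantized) ** 2) + 1e-12
    return (10.0 * torch.log10(signal / noise)).item()

# Example usage (tensors from the earlier steps, hypothetical):
# input_error  = mse(inference_input, dequantized_input)     # quantization error of the input data
# output_error = mse(inference_output, dequantized_output)   # quantization error of the output data
```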
Alternatively, after the quantization error calculation of the current target quantization layer is completed, the quantization error of the current target quantization layer may be added to the quantization error list for storage. Meanwhile, quantization parameters of the current target quantization layer, such as weight_scale (first quantization scale factor corresponding to quantization weight)/weight_bits (bit number of quantization weight), input_scale (second quantization scale factor corresponding to quantization input data)/input_bits (bit number of quantization input data), output_scale (second quantization scale factor corresponding to quantization output data)/output_bits (bit number of quantization output data), and the like, can be saved. Correspondingly, after the quantization parameters of the current target quantization layer are successfully stored, the occupied video memory of the current target quantization layer can be released.
In order to further improve the quantization efficiency, each quantization scale factor may be unified into one quantization scale factor for storage. For example, a first quantization scale factor corresponding to the quantization weight, a second quantization scale factor corresponding to the quantized input data, and a second quantization scale factor corresponding to the quantized output data may be integrated into one quantization scale factor for storage. And when the integrated quantization scale factors are used for quantization operation and inverse quantization operation, the quantization scale factors corresponding to each data can be respectively determined through the integrated quantization scale factors to perform a data calculation process. For example, the quantization and inverse quantization processes of the weight parameters can be performed by determining the quantization scale factors corresponding to the weight parameters through the integrated quantization scale factors.
So far, the quantization process of the current target quantization layer is completed. If the current target quantization block still has a layer structure needing quantization, the next layer structure to be quantized can be updated into the current target quantization layer, and the operation of reasoning the target quantization model is returned to be executed until the fact that all the layer structures needing quantization of the current target quantization block are determined to complete the determination and storage process of quantization parameters is determined.
After all layer structures of the current target quantization block that need quantization have stored their corresponding quantization parameters, in order to actually apply the quantization to the current target quantization block using these quantization parameters, related hook functions, such as a forward_pre_hook function and a forward_hook function, may be registered for quantizing the inputs and outputs of the linear layers in the current target quantization block. Furthermore, according to the stored quantization parameters, such as the weight_scale/weight_bits of the quantization weights, the input_scale/input_bits of the quantized input data and the output_scale/output_bits of the quantized output data stored by each layer structure in the current target quantization block, Quant/DeQuant operations are added to the weight/inputs/outputs in the current target quantization block, and a second inference is performed on the target quantization model according to these quantization parameters to obtain the quantized output data of the current target quantization block. After the quantized output data of the current target quantization block is obtained, the video memory occupied by the current target quantization block can be released, the quantized output data of the current target quantization block is assigned to the inputs of the next block to be quantized so that the quantization error is transferred to the next layer, the next block to be quantized is updated to be the current target quantization block, and the operation of determining the current target quantization layer of the current target quantization block in the target quantization model is executed again. The quantization operation is repeatedly performed on the updated current target quantization block until it is determined that all block structures in the target quantization model have completed the quantization process.
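A minimal sketch of this second inference pass is shown below, assuming PyTorch-style forward_pre_hook/forward_hook registration and a hypothetical saved_params dictionary holding the per-layer quantization parameters; it is an illustration, not the patent's implementation.

```python
import torch
import torch.nn as nn

def fake_quant(x, scale, bits):
    """Quantize then immediately dequantize a tensor (Quant/DeQuant)."""
    qmax = 2 ** (bits - 1) - 1
    return torch.clamp(torch.round(x / scale), -qmax, qmax) * scale

def make_pre_hook(p):                       # quantize/dequantize the layer inputs
    def pre_hook(module, inputs):
        return (fake_quant(inputs[0], p["input_scale"], p["input_bits"]),)
    return pre_hook

def make_post_hook(p):                      # quantize/dequantize the layer outputs
    def post_hook(module, inputs, output):
        return fake_quant(output, p["output_scale"], p["output_bits"])
    return post_hook

for name, module in block.named_modules():  # `block` and `saved_params` are assumed to exist
    if isinstance(module, nn.Linear):
        p = saved_params[name]              # weight_scale/bits, input_scale/bits, output_scale/bits
        module.weight.data = fake_quant(module.weight.data, p["weight_scale"], p["weight_bits"])
        module.register_forward_pre_hook(make_pre_hook(p))
        module.register_forward_hook(make_post_hook(p))
```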
S270, judging whether each quantization block of the target quantization model completes the quantization process, if yes, executing S280, otherwise, returning to executing S210.
S280, updating the quantization parameters of each quantization block according to the quantization errors of the quantization layers in each quantization block.
In an optional embodiment of the invention, the updating the quantization parameter of each quantization block according to the quantization error of the quantization layer in each quantization block may include: determining a quantization error threshold; determining a quantization layer with quantization error larger than the quantization error threshold value in each quantization block as a quantization layer to be updated; modifying quantization configuration parameters of the quantization layer to be updated in the model quantization configuration file, determining the quantization layer to be updated as the current target quantization layer, and returning to execute the operation of quantizing the weight parameters and the activation data of the current target quantization layer according to the model quantization configuration file until the quantization error of the quantization layer to be updated is determined to be smaller than or equal to the quantization error threshold; storing quantization parameters of a quantization layer in each quantization block; the quantization parameters comprise quantization scale factors of weight parameters and bits after quantization of the weight parameters, and quantization scale factors of activation data and bits after quantization of the activation data.
The quantization error threshold can be used for screening out unreasonable quantization errors, and specific values of the quantization error threshold can be set according to actual requirements. The quantization layer to be updated may be a layer structure requiring re-updating of quantization parameters.
In the embodiment of the invention, after all blocks in the target quantization model have completed the quantization process, layers with unreasonable quantization results can be screened out and quantized again. Specifically, a quantization error threshold may first be determined, all quantization errors in the quantization error list are compared with the quantization error threshold, the quantization errors whose values exceed the threshold are screened out, and each quantization layer whose quantization error is greater than the quantization error threshold is determined as a quantization layer to be updated. Further, the quantization configuration parameters of the quantization layer to be updated in the model quantization configuration file are modified, for example by increasing the number of quantization groups for the quantization weights, increasing the number of quantization bits, or raising the precision of the quantized activation data. After the modification is completed, the quantization layer to be updated can be set as the current target quantization layer, and the operation of quantizing the weight parameters and the activation data of the current target quantization layer according to the model quantization configuration file is executed again until the quantization error of the quantization layer to be updated is determined to be smaller than or equal to the quantization error threshold. When the quantization error of the quantization layer to be updated is smaller than or equal to the quantization error threshold, the quantization layer to be updated has completed a reasonable quantization operation, and the quantization parameters of the quantization layers in each quantization block, namely weight_scale/weight_bits, input_scale/input_bits and output_scale/output_bits, may be saved as the final quantization parameters.
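For illustration, the screening and re-quantization step could be sketched as follows; error_list, layer_configs and the concrete threshold value are assumptions made for the sketch rather than the patent's actual data structures.

```python
ERROR_THRESHOLD = 0.01   # quantization error threshold, set according to actual requirements

# Layers whose recorded quantization error exceeds the threshold need updating.
layers_to_update = [name for name, err in error_list.items() if err > ERROR_THRESHOLD]

for name in layers_to_update:
    cfg = layer_configs[name]
    cfg["weight_bits"] = min(cfg["weight_bits"] * 2, 8)     # e.g. raise 4-bit weights to 8-bit
    cfg["group_size"] = max(cfg["group_size"] // 2, 32)     # finer grouping for absmax statistics
    cfg["output_bits"] = max(cfg["output_bits"], 16)        # higher-precision activation data
    # Re-run the single-layer quantization for this layer with the modified configuration
    # until its error is at or below ERROR_THRESHOLD, then store its final quantization
    # parameters (weight/input/output scale and bit widths).
```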
In a specific example, as shown in fig. 4, the model quantization process may specifically include the following procedures:
Step one, traversing the Blocks in sequence for quantization:
Step two, sequentially quantizing the layers in a single Block whose weights participate in matrix multiplication, constructing a single-layer quantization layer for each of them, and configuring the quantization config (namely the model quantization configuration file) of the single-layer quantization layer, which may include, but is not limited to, the number of weight quantization bits, whether quantization is per-channel, whether quantization is symmetric, the number of input/output quantization bits, and the like.
Step three, registering a forward_hook for collecting the inputs and outputs of the linear layer (see the first sketch after this list).
Step four, first reasoning: model reasoning is performed on a calibration dataset (calibration set) to obtain the inputs and outputs of the linear layer at the actual original precision (such as fp16 or fp32).
Step five, computing the maximum absolute value of the weight as the quantization threshold, for example computing the per-group absmax of the weight, and calculating the scale of the quantized weight according to the quantization threshold and the quantization config (see the second sketch after this list).
Step six, Weight Quant: quantizing the weight according to the quantization config, specifically dividing the weight by the scale of the quantized weight and obtaining the quantized 4-bit weight through operations such as round and clip.
Step seven, Weight Dequant: multiplying the quantized weight by the scale of the quantized weight to obtain the fp16 weight; the weight obtained by inverse quantization can be used in the subsequent calculation of the quantization error.
Step eight, since each linear layer has a group of inputs/outputs and each sample also generates a corresponding group of inputs/outputs, the maximum absolute value of the inputs/outputs can be computed.
Step nine, Activation Quant (quantization of the activation data): dividing the activation data by the scale of the quantized activation data according to the quantization config, and obtaining the quantized inputs/outputs through operations such as round and clip. The quantized inputs/outputs are typically 4 bits, 8 bits, or fp16, and their precision may differ from that of the weight.
Step ten, Activation Dequant (inverse quantization of the activation data): multiplying the quantized activation data inputs/outputs by the corresponding scale to obtain fp16 inputs/outputs. The scales corresponding to the inputs and the outputs may be the same or different.
Step eleven, computing the quantization errors of the quantized weights/inputs/outputs under different quantization configurations (e.g., whether the inputs and/or the outputs are quantized while the weight is quantized), using metrics including but not limited to SNR and MSE, and appending the quantization errors to the per-layer quantization error list error_list.
Step twelve, saving the quantization parameters of the single layer: weight_scale/weight_bits, input_scale/input_bits, output_scale/output_bits.
Step thirteen, releasing the video memory occupied by the layer.
Step fourteen, repeating steps two to thirteen to complete the quantization of all layers in the single Block.
Step fifteen, selecting whether to quantize inputs/outputs according to the quantization config.
Step sixteen, registering forward_pre_hook and forward_hook for quantizing the inputs and outputs of the linear layer.
Step seventeen, second reasoning: according to the quantization configuration, the layer quantization parameters (weight_scale/weight_bits, input_scale/input_bits, output_scale/output_bits) obtained from the first reasoning are used to apply Quant/Dequant (i.e., quantization and inverse quantization) to the weight/inputs/outputs, and reasoning is carried out with these quantization parameters to obtain the output of the Block.
Step eighteen, releasing the video memory occupied by the Block.
Step nineteen, assigning the output of the Block to the input of the next Block so that the quantization error is propagated forward, and repeating steps one to eighteen (calculating the quantization parameters of the next Block and quantizing its weight/inputs/outputs).
Step twenty, after all Blocks are quantized, sorting the layers according to the recorded error_list and selecting the layers whose error exceeds the error threshold.
Step twenty-one, modifying the quantization config of the selected layers, including but not limited to increasing the number of quantization groups used for weight quantization, increasing the number of quantization bits, and using higher precision for inputs/outputs, then re-quantizing the layer and recalculating its error.
Step twenty-two, judging whether the error of the layer is still larger than the error threshold; if so, repeating step twenty-one; otherwise, proceeding to step twenty-three.
Step twenty-three, updating and saving the final, usable quantization parameters of the screened layers: weight_scale/weight_bits, input_scale/input_bits, and output_scale/output_bits.
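As a non-authoritative illustration of steps three and four, the following sketch registers forward hooks on the linear layers to collect the absolute-maximum statistics of their inputs and outputs during the first reasoning pass. It assumes PyTorch; the stats dictionary and the calibrate() helper are illustrative names, not part of the patent.

```python
import torch

stats = {}  # layer name -> (max |input|, max |output|) observed on the calibration set

def make_hook(name):
    def hook(module, inputs, output):
        x, y = inputs[0].detach(), output.detach()
        prev_in, prev_out = stats.get(name, (0.0, 0.0))
        stats[name] = (max(prev_in, x.abs().max().item()),
                       max(prev_out, y.abs().max().item()))
    return hook

def calibrate(model, calibration_loader):
    """First reasoning at original precision while hooks record input/output absmax."""
    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules()
               if isinstance(m, torch.nn.Linear)]
    with torch.no_grad():
        for batch in calibration_loader:
            model(batch)
    for h in handles:
        h.remove()
    return stats
```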
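The following sketch, again assuming PyTorch, illustrates steps five to seven and step eleven: per-group absmax scaling, weight quantization with round/clip, inverse quantization, and the MSE/SNR error metrics. The group size, bit width, and epsilon guard are illustrative choices rather than values prescribed by the patent.

```python
import torch

def quantize_weight(weight, bits=4, group_size=128):
    """Per-group symmetric quantization: absmax -> scale -> round/clip -> dequant."""
    qmax = 2 ** (bits - 1) - 1                                   # e.g. 7 for 4-bit symmetric
    w = weight.reshape(-1, group_size)
    scale = w.abs().max(dim=1, keepdim=True).values / qmax       # step five: per-group absmax scale
    scale = scale.clamp(min=1e-8)                                # guard against all-zero groups
    q = torch.clamp(torch.round(w / scale), min=-qmax - 1, max=qmax)  # step six: Weight Quant
    w_dq = (q * scale).reshape(weight.shape)                     # step seven: Weight Dequant
    return q.reshape(weight.shape), scale, w_dq

def quantization_error(reference, dequantized):
    """Step eleven: MSE and SNR between the original tensor and its dequantized version."""
    mse = torch.mean((reference - dequantized) ** 2)
    snr = 10 * torch.log10(reference.pow(2).mean() / (mse + 1e-12))
    return {"mse": mse.item(), "snr": snr.item()}

# Example: quantize one linear layer's weight and record its error in error_list.
layer = torch.nn.Linear(512, 512)
_, w_scale, w_dq = quantize_weight(layer.weight.data)
error_list = [quantization_error(layer.weight.data, w_dq)]
```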
According to the above technical scheme, quantizing the model block by block and layer by layer reduces video memory occupation during quantization, dedicated metrics (such as mean square error) are introduced when calculating quantization errors to compensate quantization precision, and the weight quantization data type and the input/output quantization data type of each layer can be adaptively adjusted according to the errors. Furthermore, according to the quantization error, the quantization data types of the input activation data of different dot operators (dot functions are mainly used for vector dot products and matrix multiplication) can be adaptively adjusted through the hook registration mechanism, so that dot computation is converted from floating-point multiplication into int4, int8, or mixed int4/fp16 and int8/fp16 computation or the like, and the inference scheme that best accelerates the model can be flexibly configured according to the hardware architecture. Meanwhile, the above model quantization method can configure the precision of the output activation data and quantize the outputs to reduce output IO (Input/Output). During quantization, the activation data can still be computed in floating point, so that the inference accuracy of the model is not lost.
The above model quantization method does not require quantization training of the model, can be implemented in hardware or firmware form, can flexibly adapt to model structures such as Decoder and Encoder through adaptive adjustment of the computation precision of each layer, is particularly suitable for LLM models such as GPT (Generative Pre-trained Transformer model), OPT (Open Pre-trained Transformer Language Models) and llama (Large Language Model Meta AI), and greatly improves quantization efficiency compared with model quantization methods based on quantization training. Through post-training quantization (Post Training Quantization), the matrix multiplications of the model can be converted from the original fp32/fp16 precision to int4, int8, fp16 computation or the like, the storage space occupied by the model weights is reduced to as little as 1/8, the computation memory is reduced, and with matched int4 2D and 1D compute power configured, the inference performance of the model can be greatly improved. Verified on the llama and OPT models of the GPT class, after the W4A16 operator is quantized by the above model quantization method, the accuracy drop is less than 1%, the model parameter size is reduced to 1/4 of the fp16 model, and the latency is reduced by more than 3x relative to fp16; after extending to mixed W4A16/W4A8 operator precision, the quantized model performance can be further improved.
In an optional embodiment of the present invention, after updating the quantization parameter of each quantization block according to the quantization error of the quantization layer in each quantization block, the method may further include: determining a target data type supported by a hardware structure of target equipment running the target quantization model; and compressing the quantization weight of the target quantization model according to the target data type supported by the hardware structure of the target equipment.
The target device may be a device that runs the target quantization model after quantization processing, and may be any type of terminal device, such as a smart phone, an intelligent detection system terminal, a tablet computer, or a personal computer. It may also be a server device, as long as there is a need to run the quantized target quantization model; the embodiment of the invention does not limit the specific device type of the target device. The target data type may be a data format type supported by the hardware structure of the target device.
In order to further reduce the on-chip memory occupation of the device running the model and reduce the loading time of the model, the quantization weights of the target quantization model may be compressed before the quantized target quantization model is loaded onto the target device for running. It can be understood that the hardware structures of some types of electronic devices may support only a limited set of data formats; for example, some devices support the int4 data format and some do not. Therefore, when compressing the quantization weights of the target quantization model, the target data type supported by the hardware structure of the target device running the target quantization model can be determined, and the quantization weights of the target quantization model are compressed according to that target data type, so as to enable efficient decompression when running on different target devices.
In an optional embodiment of the present invention, the compressing the quantization weights of the target quantization model according to the target data type supported by the hardware structure of the target device may include: under the condition that the target data type supported by the hardware structure of the target device comprises int4, determining the first quantization weight in the current quantization weight set to be compressed and the bit number of each quantization weight in the current quantization weight set to be compressed; under the condition that the first quantization weight is positive, performing left shift processing on the current quantization weight to be compressed according to the bit of each quantization weight, and then performing bit-wise or processing by combining the previous compressed quantization weight result of the current quantization weight to be compressed to obtain the current compression quantization weight; and under the condition that the first quantization weight is determined to be negative, carrying out complement processing on the first quantization weight according to the bit number of each quantization weight, carrying out left shift processing on the current quantization weight to be compressed, then carrying out low-order complement processing to obtain the complement quantization weight, and carrying out bitwise and processing on the complement quantization weight by combining the previous compressed quantization weight result of the current quantization weight to be compressed to obtain the current compression quantization weight.
The current quantization weight set to be compressed may be a set of quantization weights that currently need to be compressed. For example, the current quantization weight set to be compressed may include 8 quantization weights to be compressed. The embodiment of the invention does not limit the number of quantization weights included in the quantization weight group to be compressed currently.
Alternatively, if the hardware structure of the target device supports the int4 data format, when the weights are quantized to 4 bits or fewer, loading can be saved by compressing (packing) the quantized weights into uint32 or uint64. Fig. 5 is a schematic diagram of the effect of compressing 4-bit quantization weights into a uint32 element or a uint64 element according to the second embodiment of the present invention. Illustratively, as shown in FIG. 5, taking 4-bit weights as an example, the process of compressing 8 int4 weights into 1 uint32 value specifically includes the following operations: all quantization weights are divided into groups of 8, and compression is performed group by group. Correspondingly, if the first int4 quantization weight in the current quantization weight group to be compressed is positive, the Nth int4 quantization weight can be shifted left by 4*(N-1) bits in sequence and then bitwise OR-ed with the packed result of the preceding N-1 quantization weights in the group, to obtain the packed uint32 weight data. Correspondingly, if the first int4 quantization weight in the current quantization weight group to be compressed is negative, the first quantization weight is stored as int32 in two's-complement form, the Nth int4 quantization weight can be shifted left by 4*(N-1) bits with the low bits filled with 1, and then bitwise AND-ed with the packed result of the preceding N-1 quantization weights in the group, to obtain the packed uint32 weight data.
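A minimal, self-contained sketch of packing and unpacking eight signed int4 values into one 32-bit word is given below. Masking each value to its low 4 bits handles negative values through two's complement; this compact mask-and-OR form is an illustrative equivalent of the positive/negative OR/AND scheme described above, not the patent's exact bit manipulation.

```python
def pack_int4_to_uint32(weights):
    """weights: list of 8 ints in [-8, 7] -> one unsigned 32-bit integer."""
    assert len(weights) == 8
    packed = 0
    for n, w in enumerate(weights):
        assert -8 <= w <= 7
        packed |= (w & 0xF) << (4 * n)       # nth value occupies bits 4*n .. 4*n+3
    return packed & 0xFFFFFFFF

def unpack_uint32_to_int4(packed):
    """Inverse operation: recover the eight signed 4-bit values."""
    out = []
    for n in range(8):
        v = (packed >> (4 * n)) & 0xF
        out.append(v - 16 if v >= 8 else v)  # sign-extend the 4-bit value
    return out

# Example round trip.
vals = [3, -2, 7, -8, 0, 1, -1, 5]
assert unpack_uint32_to_int4(pack_int4_to_uint32(vals)) == vals
```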
Alternatively, if the hardware structure of the target device does not support the int4 data format, the quantization weights may be converted to int8 for saving. Specifically, two groups of 8 int4 quantization weights are interleaved one by one and arranged into 8 weight values in uint8 format, forming uint64 weight data, which is used for efficient decompression and parallel computation in the run stage.
In an optional embodiment of the present invention, after compressing the quantization weights of the target quantization model according to the target data type matched to the target device, the method may further include: loading the quantized data of the current quantization layer of the target quantization model and the target quantization reference parameters in the process of the target device running the target quantization model; and performing data calculation in the current quantization layer according to the quantized input data and the quantized weight results in the quantized data, in combination with the target quantization reference parameters.
The target quantization model is obtained by performing quantization processing by the model quantization method according to any one of the embodiments. The current quantization layer may be a layer structure in which data calculation is currently being performed during the process of the target device running the target quantization model. The target quantization reference parameter may be a quantization parameter used for decompressing quantized data of a current quantization layer, and may be a quantization scale factor, for example. Alternatively, the target quantization reference parameter may be a first quantization scale factor corresponding to the quantization weight, and a second quantization scale factor of the activation data of each layer.
In the embodiment of the present invention, after the target quantization model has been quantized by the model quantization method described in any of the above embodiments and the quantized weights obtained have been compressed, the model may be loaded onto the target device for running. In the process of the target device running the target quantization model, the quantized data of the current quantization layer of the target quantization model and the target quantization reference parameters, for example the compressed quantization weights and the scale parameters corresponding to the activation data, can be loaded, and the calculation of each layer can be performed in sequence. Alternatively, the hardware structure of the target device may support any type of data format, such as the int4 data format.
In an optional embodiment of the present invention, the performing data calculation in the current quantization layer according to the quantized input data and the quantized result of the quantization weight in the quantized data in combination with the target quantization reference parameter may include: under the condition that the quantized input data and the quantized result of the quantized weight are both int4, performing four-bit matrix multiplication on the quantized input data and the quantized weight, and performing product operation on the matrix multiplication result and the target quantized reference parameter to obtain current quantized layer output data; under the condition that the quantized result of the quantized input data is determined to be int8 and the quantized result of the quantized weight is determined to be int4, decompressing the quantized weight into the data type of int8, performing eight-bit integer multiplication on the quantized input data and the decompressed quantized weight, and performing product operation on the multiplication result and the target quantized reference parameter to obtain current quantized layer output data; under the condition that the quantized result of the quantized input data is fp16 and the quantized result of the quantized weight is int4, decompressing the quantized weight into the data type of int8, performing product operation on the decompressed quantized weight and the target quantized reference parameter, performing matrix multiplication on the quantized input data and the quantized weight after the product operation, and taking the matrix multiplication result as current quantized layer output data; and taking the current quantized layer output data or a quantized result of the current quantized layer output data as quantized input data of a next quantized layer, and outputting the quantized input data into the next quantized layer.
The current quantization layer output data may be data output by the current quantization layer.
Fig. 6 is a schematic flow chart of a model operation method provided by the third embodiment of the present invention, Fig. 7 is a schematic flow chart of decompression provided by the third embodiment of the present invention, and Fig. 8 is a schematic flow chart of the calculation of an operator during the operation of the target quantization model provided by the third embodiment of the present invention. In a specific example, as shown in Figs. 6, 7 and 8, when the target device runs the target quantization model and performs data calculation in the current quantization layer of the target quantization model, the number of bits of the quantized input data input and of the quantized weight weight may first be determined. If it is determined that both the quantized input data input and the quantized weight weight are int4, and the target device supports the int4 data format, then input and weight can be multiplied using four-bit matrix multiplication through an int4 matrix multiplication instruction, such as an NVIDIA mma (matrix multiply-accumulate) instruction, and the matrix multiplication result is multiplied by the scale parameter to obtain the output data of the current quantization layer.
If it is determined that the quantized input data input is int8 and the quantized weight weight is int4, the weight may first be decompressed into the int8 data type. Illustratively, when decompressing an int4 weight, its low 4 bits may be shifted left by 4 bits and then arithmetically shifted right by 4 bits to sign-extend the value. Further, the quantized input data input and the decompressed quantized weight are multiplied using eight-bit matrix multiplication through an int8 matrix multiplication instruction, and the matrix multiplication result is multiplied by the scale parameter to obtain the output data of the current quantization layer.
If it is determined that the quantized input data input is fp16 and the quantized weight weight is int4, the weight may first be decompressed into the int8 data type through shift and bit operations. Further, the decompressed quantized weight is multiplied by the scale parameter to obtain the fp16 weight, matrix multiplication is performed on the fp16 quantized input data input and the fp16 weight, and the matrix multiplication result is used as the output data of the current quantization layer.
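The decompression and mixed-precision dispatch described above can be sketched as follows, assuming PyTorch. The sign extension is written with torch.where, which is equivalent to the shift-left-then-arithmetic-shift-right trick in the text; the scale argument is assumed to be the fused weight/input/output scale mentioned in the following paragraph, and the arithmetic is simulated in float32 for portability, whereas the real kernels would use int8/fp16 matrix-multiplication hardware instructions.

```python
import torch

def decompress_int4_to_int8(packed: torch.Tensor) -> torch.Tensor:
    """Unpack eight signed 4-bit values per int32 word into an int8 tensor."""
    nibbles = [((packed >> (4 * n)) & 0xF).to(torch.int8) for n in range(8)]
    w = torch.stack(nibbles, dim=-1)
    # Two's-complement sign extension of each 4-bit value (values 8..15 map to -8..-1).
    return torch.where(w >= 8, w - 16, w).reshape(-1)

def linear_forward(x, packed_weight, scale, out_features, in_features, x_is_fp16):
    """Illustrative dispatch between the fp16*int4 and int8*int4 paths."""
    w_int8 = decompress_int4_to_int8(packed_weight).reshape(out_features, in_features)
    if x_is_fp16:
        w = w_int8.to(torch.float32) * scale                 # dequantize the weight first
        return x.to(torch.float32) @ w.T                     # matmul result is the layer output
    acc = x.to(torch.float32) @ w_int8.to(torch.float32).T   # stands in for the int8 matmul
    return acc * scale                                       # rescale the accumulated result
```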
In the above example, the scale parameter may be obtained by combining the scale parameter of the quantization weight, the scale parameter of the quantized input data, and the scale parameter of the quantized output data. Meanwhile, each of the above operations can be implemented by hardware instructions.
Correspondingly, the output data of the current quantization layer can be passed as the input of the next layer, and the next quantization layer is updated to be the current quantization layer, so that the corresponding computation (for example, four-bit matrix multiplication of the quantized input data and the quantization weight when both are int4) is performed in the new current quantization layer, until all layer structures have completed the data calculation process.
It should be noted that, if the relevant configuration information of quantization is configured for the current quantization layer output data in the model quantization configuration file, after the current quantization layer output data is obtained, the quantization result of the current quantization layer output data may be used as the quantization input data of the next quantization layer and output to the next quantization layer after the quantization processing is performed on the current quantization layer output data according to the model quantization configuration file.
According to the above technical scheme, the quantization weights are compressed, and are decompressed with hardware instructions while the target quantization model is running, realizing a collaborative design of software and hardware so that the target quantization model algorithm can be executed efficiently on the hardware and higher hardware performance can be achieved.
The model quantization method provided by the embodiment of the invention does not require quantization training of the model, can be implemented in hardware or firmware form, reduces video memory occupation during quantization by quantizing the model block by block and layer by layer, is flexibly applicable to models such as GPT, OPT and llama, and can greatly improve the efficiency, applicability and flexibility of model quantization.
It should be noted that any permutation and combination of the technical features in the above embodiments also belong to the protection scope of the present invention.
Example III
Fig. 9 is a schematic diagram of a model quantization apparatus according to a third embodiment of the present invention, as shown in fig. 9, where the apparatus includes: a current target quantization layer determination module 310, an inference activation data acquisition module 320, a quantization data acquisition module 330, a quantization error calculation module 340, and a quantization parameter update module 350, wherein:
a current target quantization layer determining module 310, configured to determine a current target quantization layer of a current target quantization block in the target quantization model;
the inference activation data acquisition module 320 is configured to infer the target quantization model to obtain inference activation data of the current target quantization layer; wherein the inference activation data comprises inference input data and/or inference output data;
the quantized data obtaining module 330 is configured to quantize the weight parameter and the activation data of the current target quantized layer according to a model quantized configuration file, so as to obtain quantized data of the current target quantized layer; the quantization data of the current target quantization layer comprises quantization weight and quantization activation data of the current target quantization layer; the activation data comprise input data and/or output data of the current target quantization layer; the quantized activation data comprises quantized input data and/or quantized output data;
A quantization error calculation module 340, configured to calculate a quantization error of the current target quantization layer according to the inference activation data and the quantization activation data;
and a quantization parameter updating module 350, configured to update the quantization parameter of each quantization block according to the quantization error of the quantization layer in each quantization block when it is determined that each quantization block of the target quantization model completes the quantization process.
In the embodiment of the invention, after the current target quantization layer of the current target quantization block in the target quantization model is determined in the model quantization process, the target quantization model is inferred to obtain the inference activation data of the current target quantization layer. Further, the weight parameters and the activation data of the current target quantization layer are quantized according to the model quantization configuration file to obtain quantization data such as the quantization weight and the quantization activation data of the current target quantization layer, and the quantization error of the current target quantization layer is calculated according to the inference activation data and the quantization activation data. Correspondingly, under the condition that each quantization block of the target quantization model is determined to have completed the quantization process, the quantization parameters of each quantization block are updated according to the quantization errors of the quantization layers in each quantization block, thereby solving the problems of low efficiency, poor flexibility and poor applicability of existing model quantization methods and improving the applicability, flexibility and efficiency of the model quantization method.
Optionally, the quantized data acquisition module 330 is specifically configured to: determining a weight parameter quantization threshold according to the weight parameter of the current target quantization layer;
calculating a first quantization scale factor of the weight parameter according to the weight parameter quantization threshold;
determining a first associated quantization parameter according to the model quantization configuration file, and quantizing a weight parameter of the current target quantization layer according to the first associated quantization parameter and the first quantization scale factor to obtain the quantization weight;
determining a second quantization scale factor of activation data of the current target quantization layer;
and determining a second associated quantization parameter according to the model quantization configuration file, and quantizing the activation data of the current target quantization layer according to the second associated quantization parameter and the second quantization scale factor to obtain the quantized activation data.
Optionally, if the activation data includes input data and output data of the current target quantization layer; the quantization activation data includes quantized input data and quantized output data, and the quantization error calculation module 340 is specifically configured to: performing inverse quantization processing on the quantized weight of the current target quantization layer to obtain an inverse quantized weight;
Performing inverse quantization processing on quantized input data quantized by the current target quantization layer to obtain inverse quantized input data;
calculating the product value of the inverse quantization input data and the inverse quantization weight to obtain an output data intermediate value;
performing quantization processing and inverse quantization processing on the output data intermediate value to obtain inverse quantized output data;
calculating the difference value between the reasoning input data and the inverse quantization input data to obtain a quantization error of the input data;
and calculating the difference value between the reasoning output data and the inverse quantization output data to obtain the quantization error of the output data.
Optionally, the quantization parameter updating module 350 is specifically configured to: determining a quantization error threshold;
determining a quantization layer with quantization error larger than the quantization error threshold value in each quantization block as a quantization layer to be updated;
modifying quantization configuration parameters of the quantization layer to be updated in the model quantization configuration file, determining the quantization layer to be updated as the current target quantization layer, and returning to execute the operation of quantizing the weight parameters and the activation data of the current target quantization layer according to the model quantization configuration file until the quantization error of the quantization layer to be updated is determined to be smaller than or equal to the quantization error threshold;
Storing quantization parameters of a quantization layer in each quantization block; the quantization parameters comprise quantization scale factors of weight parameters and bits after quantization of the weight parameters, and quantization scale factors of activation data and bits after quantization of the activation data.
Optionally, the model quantization apparatus further includes a quantization weight compression processing module, configured to: determining a target data type supported by a hardware structure of target equipment running the target quantization model;
and compressing the quantization weight of the target quantization model according to the target data type supported by the hardware structure of the target equipment.
Optionally, the quantization weight compression processing module is specifically configured to: under the condition that the target data type supported by the hardware structure of the target device comprises int4, determining the first quantization weight in the current quantization weight set to be compressed and the bit number of each quantization weight in the current quantization weight set to be compressed;
under the condition that the first quantization weight is positive, performing left shift processing on the current quantization weight to be compressed according to the bit of each quantization weight, and then performing bit-wise or processing by combining the previous compressed quantization weight result of the current quantization weight to be compressed to obtain the current compression quantization weight;
And under the condition that the first quantization weight is determined to be negative, carrying out complement processing on the first quantization weight according to the bit number of each quantization weight, carrying out left shift processing on the current quantization weight to be compressed, then carrying out low-order complement processing to obtain the complement quantization weight, and carrying out bitwise and processing on the complement quantization weight by combining the previous compressed quantization weight result of the current quantization weight to be compressed to obtain the current compression quantization weight.
Optionally, the model quantization apparatus further comprises a model data calculation module for: loading quantized data of a current quantized layer of the target quantized model and target quantized reference parameters in the process that the target equipment operates the target quantized model;
and according to quantized input data and quantized results of quantization weights in the quantized data, carrying out data calculation in the current quantized layer by combining the target quantized reference parameters.
Optionally, the model data calculation module is specifically configured to: under the condition that the quantized input data and the quantized result of the quantized weight are both int4, performing four-bit matrix multiplication on the quantized input data and the quantized weight, and performing product operation on the matrix multiplication result and the target quantized reference parameter to obtain current quantized layer output data;
Under the condition that the quantized result of the quantized input data is determined to be int8 and the quantized result of the quantized weight is determined to be int4, decompressing the quantized weight into the data type of int8, performing eight-bit integer multiplication on the quantized input data and the decompressed quantized weight, and performing product operation on the multiplication result and the target quantized reference parameter to obtain current quantized layer output data;
under the condition that the quantized result of the quantized input data is fp16 and the quantized result of the quantized weight is int4, decompressing the quantized weight into the data type of int8, performing product operation on the decompressed quantized weight and the target quantized reference parameter, performing matrix multiplication on the quantized input data and the quantized weight after the product operation, and taking the matrix multiplication result as current quantized layer output data;
and taking the current quantized layer output data or a quantized result of the current quantized layer output data as quantized input data of a next quantized layer, and outputting the quantized input data into the next quantized layer.
The model quantization device can execute the model quantization method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be referred to the model quantization method provided in any embodiment of the present invention.
Since the model quantization apparatus described above is an apparatus capable of executing the model quantization method in the embodiment of the present invention, a person skilled in the art can understand the specific implementation of the model quantization apparatus in the embodiment of the present invention and various modifications thereof based on the model quantization method described in the embodiment of the present invention, so how the model quantization apparatus implements the model quantization method in the embodiment of the present invention will not be described in detail herein. The apparatus used by those skilled in the art to implement the model quantization method in the embodiments of the present invention is within the scope of the protection sought herein.
Example IV
Fig. 10 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic equipment may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 10, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as model quantization methods.
In some embodiments, the model quantization method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the model quantization method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the model quantization method in any other suitable way (e.g., by means of firmware).
Alternatively, the model quantization method may include the operations of:
determining a current target quantization layer of a current target quantization block in the target quantization model;
reasoning the target quantization model to obtain reasoning activation data of the current target quantization layer; wherein the inference activation data comprises inference input data and/or inference output data;
quantizing the weight parameters and the activation data of the current target quantization layer according to the model quantization configuration file to obtain quantization data of the current target quantization layer; the quantization data of the current target quantization layer comprises quantization weight and quantization activation data of the current target quantization layer; the activation data comprise input data and/or output data of the current target quantization layer; the quantized activation data comprises quantized input data and/or quantized output data;
calculating quantization errors of the current target quantization layer according to the inference activation data and the quantization activation data;
and under the condition that each quantized block of the target quantized model is determined to finish the quantization process, updating the quantization parameters of each quantized block according to the quantization error of a quantized layer in each quantized block.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome.

Claims (11)

1. A method of model quantization, comprising:
determining a current target quantization layer of a current target quantization block in the target quantization model;
reasoning the target quantization model to obtain reasoning activation data of the current target quantization layer; wherein the inference activation data comprises inference input data and/or inference output data;
quantizing the weight parameters and the activation data of the current target quantization layer according to the model quantization configuration file to obtain quantization data of the current target quantization layer; the quantization data of the current target quantization layer comprises quantization weight and quantization activation data of the current target quantization layer; the activation data comprise input data and/or output data of the current target quantization layer; the quantized activation data comprises quantized input data and/or quantized output data;
calculating quantization errors of the current target quantization layer according to the inference activation data and the quantization activation data;
and under the condition that each quantized block of the target quantized model is determined to finish the quantization process, updating the quantization parameters of each quantized block according to the quantization error of a quantized layer in each quantized block.
2. The method according to claim 1, wherein quantizing the weight parameters and the activation data of the current target quantization layer according to the model quantization profile to obtain the quantization data of the current target quantization layer comprises:
determining a weight parameter quantization threshold according to the weight parameter of the current target quantization layer;
calculating a first quantization scale factor of the weight parameter according to the weight parameter quantization threshold;
determining a first associated quantization parameter according to the model quantization configuration file, and quantizing a weight parameter of the current target quantization layer according to the first associated quantization parameter and the first quantization scale factor to obtain the quantization weight;
determining a second quantization scale factor of activation data of the current target quantization layer;
and determining a second associated quantization parameter according to the model quantization configuration file, and quantizing the activation data of the current target quantization layer according to the second associated quantization parameter and the second quantization scale factor to obtain the quantized activation data.
3. The method of claim 1, wherein if the activation data comprises input data and output data of the current target quantization layer; the quantization activation data includes quantization input data and quantization output data, and the calculating quantization errors of the current target quantization layer according to the inference activation data and the quantization activation data includes:
Performing inverse quantization processing on the quantized weight of the current target quantization layer to obtain an inverse quantized weight;
performing inverse quantization processing on quantized input data quantized by the current target quantization layer to obtain inverse quantized input data;
calculating the product value of the inverse quantization input data and the inverse quantization weight to obtain an output data intermediate value;
performing quantization processing and inverse quantization processing on the output data intermediate value to obtain inverse quantized output data;
and calculating an input-output data quantization error according to the reasoning input-output data and the inverse quantization input-output data.
4. The method of claim 1, wherein said updating the quantization parameter of each quantization block based on the quantization error of the quantization layer in each quantization block comprises:
determining a quantization error threshold;
determining a quantization layer with quantization error larger than the quantization error threshold value in each quantization block as a quantization layer to be updated;
modifying quantization configuration parameters of the quantization layer to be updated in the model quantization configuration file, determining the quantization layer to be updated as the current target quantization layer, and returning to execute the operation of quantizing the weight parameters and the activation data of the current target quantization layer according to the model quantization configuration file until the quantization error of the quantization layer to be updated is determined to be smaller than or equal to the quantization error threshold;
Storing quantization parameters of a quantization layer in each quantization block; the quantization parameters comprise quantization scale factors of weight parameters and bits after quantization of the weight parameters, and quantization scale factors of activation data and bits after quantization of the activation data.
5. The method of claim 4, further comprising, after said updating the quantization parameter of each quantization block based on the quantization error of the quantization layer in each quantization block:
determining a target data type supported by a hardware structure of target equipment running the target quantization model;
and compressing the quantization weight of the target quantization model according to the target data type supported by the hardware structure of the target equipment.
6. The method of claim 5, wherein compressing quantization weights of the target quantization model according to a target data type supported by a hardware structure of the target device, comprises:
under the condition that the target data type supported by the hardware structure of the target equipment comprises int4, determining the first quantization weight in the current quantization weight set to be compressed and the bit number of each quantization weight in the current quantization weight set to be compressed;
Under the condition that the first quantization weight is positive, performing left shift processing on the current quantization weight to be compressed according to the bit of each quantization weight, and then performing bit-wise or processing by combining the previous compressed quantization weight result of the current quantization weight to be compressed to obtain the current compression quantization weight;
and under the condition that the first quantization weight is determined to be negative, carrying out complement processing on the first quantization weight according to the bit number of each quantization weight, carrying out left shift processing on the current quantization weight to be compressed, then carrying out low-order complement processing to obtain the complement quantization weight, and carrying out bitwise and processing on the complement quantization weight by combining the previous compressed quantization weight result of the current quantization weight to be compressed to obtain the current compression quantization weight.
7. The method according to claim 5 or 6, further comprising, after said compressing quantization weights of said target quantization model according to said target data type matched by said target device:
loading quantized data of a current quantized layer of the target quantized model and target quantized reference parameters in the process that the target equipment operates the target quantized model;
And according to quantized input data and quantized results of quantization weights in the quantized data, carrying out data calculation in the current quantized layer by combining the target quantized reference parameters.
8. The method according to claim 7, wherein said performing data calculation in the current quantization layer in combination with the target quantization reference parameter according to the quantization result of the quantization input data and the quantization weight in the quantization data comprises:
under the condition that the quantized input data and the quantized result of the quantized weight are both int4, performing four-bit matrix multiplication on the quantized input data and the quantized weight, and performing product operation on the matrix multiplication result and the target quantized reference parameter to obtain current quantized layer output data;
under the condition that the quantized result of the quantized input data is determined to be int8 and the quantized result of the quantized weight is determined to be int4, decompressing the quantized weight into the data type of int8, performing eight-bit integer matrix multiplication on the quantized input data and the decompressed quantized weight, and performing product operation on the multiplication result and the target quantized reference parameter to obtain output data of the current quantized layer;
Under the condition that the quantized result of the quantized input data is fp16 and the quantized result of the quantized weight is int4, decompressing the quantized weight into the data type of int8, performing product operation on the decompressed quantized weight and the target quantized reference parameter, performing matrix multiplication on the quantized input data and the quantized weight after the product operation, and taking the matrix multiplication result as current quantized layer output data;
and taking the current quantized layer output data or a quantized result of the current quantized layer output data as quantized input data of a next quantized layer, and outputting the quantized input data into the next quantized layer.
9. A model quantization apparatus, comprising:
the current target quantization layer determining module is used for determining a current target quantization layer of a current target quantization block in the target quantization model;
the inference activation data acquisition module is used for inferring the target quantization model to obtain inference activation data of the current target quantization layer;
the quantization data acquisition module is used for quantizing the weight parameters and the activation data of the current target quantization layer according to the model quantization configuration file to obtain quantization data of the current target quantization layer; the quantization data of the current target quantization layer comprises quantization weight and quantization activation data of the current target quantization layer; the activation data comprise input data and/or output data of the current target quantization layer; the quantized activation data comprises quantized input data and/or quantized output data;
The quantization error calculation module is used for calculating the quantization error of the current target quantization layer according to the reasoning activation data and the quantization activation data;
and the quantization parameter updating module is used for updating the quantization parameters of each quantization block according to the quantization error of the quantization layer in each quantization block under the condition that each quantization block of the target quantization model is determined to finish the quantization process.
10. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the model quantization method of any one of claims 1-8.
11. A computer readable storage medium storing computer instructions which, when executed, cause a processor to implement the model quantization method of any one of claims 1-8.
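The three computation paths recited in claim 8 amount to choosing where the target quantization reference parameter (a dequantization scale) is applied: after an integer matrix multiplication for int4/int4 and int8/int4 operands, or to the decompressed weight before a floating-point matrix multiplication for fp16 inputs. The following NumPy sketch only illustrates that control flow and is not the claimed implementation: the function name, the single per-tensor scale, the storage of int4 values inside int8 arrays, and the int32 accumulator are assumptions made for the example.

# Illustrative sketch of the three computation paths of claim 8 (assumptions:
# int4 values are stored in int8 arrays, one per-tensor scale stands in for the
# target quantization reference parameter, and int32 is used as the accumulator).
import numpy as np

def quantized_layer_forward(x_q, w_q_int4, scale, input_dtype):
    """Compute one quantization layer's output for the three cases of claim 8."""
    if input_dtype == "int4":
        # Four-bit matrix multiplication (emulated with int32 accumulation),
        # then multiply the integer result by the quantization reference parameter.
        acc = x_q.astype(np.int32) @ w_q_int4.astype(np.int32)
        return acc.astype(np.float16) * np.float16(scale)
    if input_dtype == "int8":
        # Decompress the int4 weights to int8, perform eight-bit integer matrix
        # multiplication, then multiply by the quantization reference parameter.
        w_int8 = w_q_int4.astype(np.int8)
        acc = x_q.astype(np.int32) @ w_int8.astype(np.int32)
        return acc.astype(np.float16) * np.float16(scale)
    if input_dtype == "fp16":
        # Decompress the int4 weights to int8, apply the quantization reference
        # parameter to the weights first, then multiply with the fp16 input.
        w_fp16 = w_q_int4.astype(np.int8).astype(np.float16) * np.float16(scale)
        return x_q.astype(np.float16) @ w_fp16
    raise ValueError(f"unsupported input dtype: {input_dtype}")

if __name__ == "__main__":
    x = np.array([[1, -2], [3, 4]], dtype=np.int8)   # int4 activations held in int8
    w = np.array([[2, 0], [-1, 5]], dtype=np.int8)   # int4 weights held in int8
    print(quantized_layer_forward(x, w, scale=0.01, input_dtype="int4"))

In all three cases the output, or its quantization result, would then be fed to the next quantization layer as its quantized input data, as recited in the last step of claim 8.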
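Claim 9 describes, in module form, a block-wise post-training quantization loop: determine the current target quantization layer of the current target quantization block, obtain its inference activation data by running inference, quantize its weights and activations according to the model quantization configuration file, compute a quantization error from the inference activation data and the quantized activation data, and, only after every quantization block has completed the quantization process, update each block's quantization parameters from the errors of its layers. The sketch below illustrates that control flow only; the symmetric per-tensor quantizer, the mean-squared-error metric, and the bit-width update rule are assumptions chosen for the example and are not specified by the claims.

# Illustrative sketch of the control flow of claim 9 (assumed details: symmetric
# per-tensor quantization, MSE as the quantization error, and a simple rule that
# widens a block's bit width when its worst layer error exceeds a threshold).
import numpy as np

def quantize_tensor(t, n_bits):
    """Symmetric per-tensor quantization; returns (quantized values, scale)."""
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = float(np.max(np.abs(t)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(t / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def quantization_error(inference_act, q_act, scale):
    """MSE between inference activations and dequantized quantized activations."""
    return float(np.mean((inference_act - q_act * scale) ** 2))

def quantize_model(blocks, config, default_bits=8, error_threshold=1e-2):
    """blocks: {block_name: {layer_name: {"weight": array, "activation": array}}}
    config: {layer_name: n_bits}. Returns per-layer errors and per-block bit widths."""
    errors = {}
    for block_name, layers in blocks.items():
        for layer_name, tensors in layers.items():
            n_bits = config.get(layer_name, default_bits)
            _w_q, _w_scale = quantize_tensor(tensors["weight"], n_bits)    # quantization weight
            a_q, a_scale = quantize_tensor(tensors["activation"], n_bits)  # quantized activation data
            errors[(block_name, layer_name)] = quantization_error(
                tensors["activation"], a_q, a_scale)
    # Only after every block has completed quantization are the per-block
    # quantization parameters updated from the errors of the layers in each block.
    updated_bits = {}
    for block_name in blocks:
        worst = max(e for (b, _), e in errors.items() if b == block_name)
        updated_bits[block_name] = default_bits if worst > error_threshold else 4
    return errors, updated_bits

A call such as quantize_model({"block0": {"layer0": {"weight": np.random.randn(4, 4), "activation": np.random.randn(8, 4)}}}, config={}) returns the per-layer quantization errors and the updated per-block bit widths under these assumptions.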
CN202311234994.9A 2023-09-22 2023-09-22 Model quantization method and device, electronic equipment and storage medium Pending CN117273092A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311234994.9A CN117273092A (en) 2023-09-22 2023-09-22 Model quantization method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311234994.9A CN117273092A (en) 2023-09-22 2023-09-22 Model quantization method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117273092A true CN117273092A (en) 2023-12-22

Family

ID=89204025

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311234994.9A Pending CN117273092A (en) 2023-09-22 2023-09-22 Model quantization method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117273092A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination