CN116451769A - Quantization method of language model and electronic equipment - Google Patents

Quantization method of language model and electronic equipment

Info

Publication number
CN116451769A
CN116451769A
Authority
CN
China
Prior art keywords
value
pairing
values
outlier
language model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310395701.9A
Other languages
Chinese (zh)
Inventor
冷静文
郭聪
唐嘉铭
过敏意
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202310395701.9A priority Critical patent/CN116451769A/en
Publication of CN116451769A publication Critical patent/CN116451769A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a quantization method for a language model and an electronic device. The method comprises the following steps: determining a quantization scale factor; identifying normal values and outliers based on the neural network parameters and the scale factor; pairing every two adjacent positions in a tensor of the neural network model, and when an outlier appears in a pair, pruning the other value in the pair and configuring an outlier identifier for the pair; and quantizing the pairs that contain outliers. The invention enables rapid quantization of large language models, accelerating inference and reducing the cost of operation while preserving model accuracy and performance.

Description

Quantization method of language model and electronic equipment
Technical Field
The invention relates to the technical field of artificial intelligence algorithms, in particular to the technical field of neural network model quantization.
Background
Artificial intelligence algorithms are gradually being applied to various fields of real life as deep learning algorithms continue to develop and advance. In recent years, Transformer-based large language models (LLMs) have developed particularly rapidly and have achieved great success in many natural language processing tasks such as machine translation, information retrieval, question answering, and summarization. However, large language models tend to obtain accuracy improvements by increasing the model size. According to incomplete statistics, language model size grows by about 240 times every two years, while hardware computing power grows by only about 3.1 times every two years on average, leaving a huge gap between the two and posing great challenges to the deployment and practical application of large language models. For example, the recent Transformer-based LLM OPT-175B has 175 billion parameters and cannot be deployed even on a single H100 GPU with 80 GB of video memory.
Neural network model quantization is an effective, hardware-friendly method that can increase neural network inference speed and reduce operating cost. It uses data types with smaller memory overhead to compress the size of the neural network model and accelerate its inference on hardware, at the cost of a negligible or small drop in recognition accuracy. However, as large language models grow in scale, their performance is strongly affected by outliers with very large magnitudes that appear during computation. As a result, many previously proposed quantization methods, when applied to large language models, cause a non-negligible drop in recognition accuracy and cannot guarantee the accuracy of the neural network.
The existing technique that can effectively quantize a large language model is mainly a mixed-precision quantization scheme based on outlier sparsity. Exploiting the fact that only a small number of outliers appear during computation in the neural network, this technique records all outliers using a sparse matrix storage format such as a coordinate list (COO). During quantization, a higher-precision data type represents the large-valued outliers while a lower-precision data type represents the normal values; this mixed-precision scheme achieves a good compression effect on large language models while maintaining model recognition accuracy and performance.
However, this technique has the problem that it must maintain an additional sparse coordinate list, or must use data types of different precisions for outliers and normal values, which introduces unaligned and random memory accesses and seriously affects the inference speed of large language models on hardware. In addition, the changed memory access pattern of such sparse-list techniques does not suit the graphics processors (GPUs) and tensor processors (TPUs) widely used for neural network inference today, and adapting them requires large hardware overhead.
Existing quantization techniques therefore have many problems for large language models and cannot efficiently improve inference speed while preserving model performance. Against the background of rapidly growing computing power demands, a quantization technique that is highly efficient and has little impact on model accuracy urgently needs to be developed.
Disclosure of Invention
In view of the above drawbacks of the prior art, an object of the present invention is to provide a quantization method for a language model and an electronic device, which achieve rapid quantization of a large language model and guarantee the accuracy and performance of the model while accelerating inference and reducing the cost of operation.
To achieve the above and other related objects, the present invention provides a method for quantizing a language model, comprising: determining a quantization scale factor; identifying normal values and outliers based on the neural network parameters and the scale factor; pairing every two adjacent positions in a tensor of the neural network model, and when an outlier appears in a pair, pruning the other value in the pair and configuring an outlier identifier for the pair; and quantizing the pairs that contain outliers.
In one embodiment of the present invention, determining the quantization scale factor includes: taking the 3-sigma value of the neural network parameters as the initial quantization scale factor; searching for a number of candidate scale factors based on the quantization error of the pair encoding with outliers; and selecting the candidate scale factor with the smallest encoding quantization error as the optimal scale factor, which is used as the scale factor for identifying normal values and outliers.
In an embodiment of the invention, pairing every two adjacent positions in the tensor of the neural network model includes: if the values at both adjacent positions are normal values, applying integer quantization; if one value is an outlier and the other is a normal value, pruning the normal value and recording it as the clipping value; if both values are outliers, pruning the outlier with the smaller magnitude.
In an embodiment of the present invention, configuring the outlier identifier for the pair includes: designating one end of the integer value range as the outlier identifier and removing that end from the integer value range; and, in a pair containing an outlier, replacing the clipping value with the outlier identifier.
In one embodiment of the present invention, quantizing the pairs that contain outliers includes: converting a floating point data type into a floating point data type with an offset:
sign×(1<<mb+mantissa)<<(exponent+bias);
wherein sign represents the sign, exponent represents the exponent, bias represents the offset, mantissa represents the mantissa, and mb represents the number of mantissa bits.
To achieve the above and other related objects, the present invention also provides an electronic device including a memory for storing a computer program and a processor for executing the computer program to implement the steps of the method for quantizing a language model as described above.
In an embodiment of the invention, the processor is a graphics processor, wherein each element dot product unit in each tensor computation core of the graphics processor is provided with an embedded decoder, an adder, and a shifter.
In an embodiment of the invention, the processor is a tensor processor, wherein decoders are configured at the boundary of the systolic array of the tensor processor; if the size of the systolic array is n×m, the number of decoders is n+m, where n and m are the numbers of rows and columns of the systolic array, respectively.
In one embodiment of the present invention, the decoder includes: an outlier judging unit for judging whether an outlier exists in a pair of input data; a normal value decoder that outputs the normal value in the pair in its original encoding; and an outlier decoder that decodes the outlier in the pair into an exponent-integer pair value.
In one embodiment of the invention, for the floating point data type with an offset, the decoder decodes the exponent from the relation between the exponent and the offset, and decodes the significand from the relation between the significand and the input data.
As described above, the language model quantization method and the electronic device of the present invention have the following advantages:
The invention enables rapid quantization of large language models: it accelerates inference, reduces operating cost, and preserves model accuracy and performance, achieving accuracy close to the original together with a large inference speedup on common language models; it handles outliers in the neural network model at low hardware cost, delivers a high performance improvement, and pushes the limit of 4-bit neural network quantization to a new state of the art. The invention can be effectively integrated into existing hardware processors, such as GPU tensor computation cores and TPU systolic arrays, providing efficient support for deploying large-scale neural network models on various devices.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart illustrating a method for quantifying a language model according to an embodiment of the present application;
FIG. 2 is a schematic diagram showing the outlier-clipping matching process in the quantization method of the language model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of outlier-clipping value pair coding in a quantization method of a language model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a graphics processor according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a tensor processor according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an outlier-clipping pair decoder according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an outlier decoder according to an embodiment of the present application;
fig. 8 is a schematic diagram illustrating an overall implementation of a method for quantizing a language model according to an embodiment of the present application.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the disclosure below, which describes embodiments of the present invention with reference to specific examples. The invention may also be practiced or applied through other, different embodiments, and the details in this description may be modified or changed in various ways without departing from the spirit of the invention. It should be noted that the following embodiments and the features in the embodiments may be combined with each other provided there is no conflict.
Existing quantization techniques have many problems for large language models and cannot efficiently improve inference speed while preserving model performance. Against the background of rapidly growing computing power demands, a quantization technique that is highly efficient and has little impact on model accuracy urgently needs to be developed.
This embodiment aims to provide a quantization method for a language model and an electronic device: a hardware-friendly, outlier-aware quantization method together with an efficient hardware design and implementation, which achieves rapid quantization of a large language model, accelerating inference and reducing the cost of operation while guaranteeing the accuracy and performance of the model.
The principle and implementation of the language model quantization method and the electronic device of the present invention are described in detail below, so that those skilled in the art can understand them without creative effort.
FIG. 1 is a flow chart illustrating a method for quantifying a language model according to an embodiment of the present application; as shown in fig. 1, the present embodiment provides a quantization method of a language model, the language model being a large language model (large language model, LLM), the method including the following steps S100 to S400.
Step S100, determining a quantization scale factor;
step S200, identifying normal values and outliers based on the neural network parameters and the scale factor;
step S300, pairing every two adjacent positions in a tensor of the neural network model; when an outlier appears in a pair, pruning the other value in the pair and configuring an outlier identifier for the pair;
step S400, quantizing the pairs that contain outliers.
The quantization method of this embodiment is also called OVP encoding, because it encodes outlier-clipping pairs (Outlier-Victim Pairs, OVP). Outliers are identified and aligned memory accesses are maintained through outlier-clipping pair encoding, after which the outliers are accurately quantized at low precision using an adaptively biased floating point data type.
The following describes the above steps S100 to S400 of the language model quantization method of the present embodiment in detail.
Step S100, a quantized scale factor is determined.
This embodiment uses an algorithm that minimizes the mean square error (MSE) to determine the quantization scale factor (i.e., the threshold that distinguishes outliers from normal values). On the one hand, a smaller threshold produces more outlier-clipping pairs, which can reduce the quantization error (MSE). On the other hand, it also increases the proportion of outlier-outlier pairs; if there are too many outlier-outlier pairs, the pruning of outliers they entail can increase the MSE. The invention keeps the proportion of outlier-outlier pairs under reasonable control so that it does not become too high, thereby improving the prediction accuracy of the model.
In this embodiment, the optimal scale factor, that is, the threshold between outliers and normal values, is obtained for each layer of neural network parameters by a scale factor selection algorithm. After the optimal scale factor is obtained, the neural network parameters and the optimal scale factor are taken as the input of the outlier-clipping value pair encoding algorithm, which produces the pair-encoded quantization result as output.
Specifically, in this embodiment, determining the quantization scale factor includes: taking the 3-sigma value of the neural network parameters as the initial quantization scale factor; searching for a number of candidate scale factors based on the quantization error of the pair encoding with outliers; and selecting the candidate scale factor with the smallest encoding quantization error as the optimal scale factor, which is used as the scale factor for identifying normal values and outliers.
The method of this embodiment is mainly aimed at post-training quantization, which does not require gradient-descent-based fine-tuning and is therefore best suited to quantizing large language models, because fine-tuning a large language model consumes a very large amount of hardware resources and considerable time. In this embodiment, only one batch of data from the training set is needed to select the scale factor, after which quantization of the large language model can be completed, making the method very efficient and convenient. Inspired by the 3-sigma rule, the method takes the 3-sigma value of the tensor as the initial quantization scale factor and then searches for the optimal scale factor with the minimum MSE within a specific range around this initial value; this works well in practical experiments. For quantization-aware training, which requires fine-tuning, a suitable scale factor can be obtained by fine-tuning with a straight-through estimator.
Specifically, in the scale factor selection algorithm, the 3-sigma value of the neural network parameters is computed as the initial scale factor, and a new candidate scale factor is then selected within a specific range by the scale factor search algorithm. The outlier-clipping pair encoding of the neural network parameters is recomputed using this scale factor, the quantization error of the encoding is estimated using the mean square error (MSE), and the currently best result is updated. This process loops until all candidate scale factors have been explored, and the scale factor with the smallest error, i.e., the optimal scale factor found by the search, is output.
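For illustration only, the following is a minimal Python sketch of the scale factor search described above; the function names (search_scale_factor, ovp_encode), the number of candidates, and the search window are assumptions of this description rather than a definitive implementation.

import numpy as np

def search_scale_factor(weights, ovp_encode, num_candidates=32):
    # Pick the scale factor (outlier threshold) minimizing the MSE of
    # outlier-clipping pair encoding; ovp_encode(weights, sf) is assumed to
    # return the dequantized reconstruction of the weights.
    init = 3.0 * np.std(weights)                       # 3-sigma initial scale factor
    best_sf, best_mse = init, float("inf")
    # Search a window around the initial value (window width is an assumption).
    for sf in np.linspace(0.5 * init, 1.5 * init, num_candidates):
        reconstructed = ovp_encode(weights, sf)        # re-encode with this candidate
        mse = float(np.mean((weights - reconstructed) ** 2))
        if mse < best_mse:
            best_sf, best_mse = sf, mse
    return best_sf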
Step S200, identifying normal values and outliers based on the neural network parameters and the scale factor.
Normal values and outliers in the neural network parameters are screened based on the scale factor: the scale factor is the threshold that distinguishes outliers from normal values, and the normal values and outliers are identified by comparing the neural network parameters with this threshold.
Step S300, pairing every two adjacent positions in a tensor of the neural network model; when an outlier appears in a pair, the other value in the pair is pruned and an outlier identifier is configured for the pair.
In this embodiment, pairing every two adjacent positions in the tensor of the neural network model includes:
1) If the values at both adjacent positions are normal values, integer quantization is applied;
2) If one of the two values is an outlier and the other is a normal value, the normal value is pruned and recorded as the clipping value;
3) If both values are outliers, the outlier with the smaller magnitude is pruned.
Experiments show that pruning outliers has a very large effect on the accuracy of the neural network model and can even cause it to fail, whereas pruning the same proportion of normal values has only a very small effect on accuracy. Because outliers are sparse, most values adjacent to an outlier are normal values, and experiments show that pruning the values adjacent to outliers has only a small effect on the accuracy of the neural network model.
As shown in fig. 2, in this embodiment the two adjacent positions in the tensor of the neural network model are first paired, and the numbers at the two positions fall into three cases:
(1) The values at both positions are normal values, e.g. the pair (1.5, 2.6). When both are normal values, no pruning is needed and simple quantization suffices. For normal values, this embodiment typically uses 4-bit integer (INT4) or 8-bit integer (INT8) quantization.
(2) One of the values is an outlier and the other is a normal value, e.g. the pair (17.6, 4.2). This embodiment prunes the normal value in the pair, referred to as the clipping value, to provide more encoding space for the outlier.
(3) Both values are outliers, e.g. the pair (30.7, 20.1). For pairs in which both values are outliers, this embodiment keeps the larger outlier and prunes the smaller one.
Therefore, in this embodiment, only normal value-normal value pairs and outlier-clipping value pairs remain in the tensor of the neural network model. These two kinds of pairs are then encoded.
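As an illustration of the three pairing cases, a minimal Python sketch follows; the representation of a pair and the function name ovp_pair are assumptions of this description, and the integer quantization of normal values and the identifier substitution described below are omitted for brevity.

def ovp_pair(values, scale_factor):
    # Pair adjacent elements and prune so that only normal-normal pairs and
    # outlier-clipping pairs remain; assumes an even number of elements.
    pairs = []
    for i in range(0, len(values), 2):
        a, b = values[i], values[i + 1]
        a_out, b_out = abs(a) > scale_factor, abs(b) > scale_factor
        if not a_out and not b_out:
            pairs.append(("normal", a, b))        # case (1): keep both, integer-quantize
        elif a_out and b_out:
            kept = a if abs(a) >= abs(b) else b   # case (3): keep the larger outlier
            pairs.append(("outlier", kept, 0.0))  # the smaller outlier is pruned
        else:
            kept = a if a_out else b              # case (2): prune the normal value
            pairs.append(("outlier", kept, 0.0))  # pruned slot becomes the clipping value
    return pairs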
To distinguish them from normal value-normal value pairs, a special identifier must be provided for outlier-clipping value pairs. This identifier must never appear in a normal value-normal value pair, which means one number must be excluded from the encoding of normal values.
Specifically, in this embodiment, configuring the outlier identifier for the pair includes: designating one end of the integer value range as the outlier identifier and removing that end from the integer value range.
The outlier identifier design used in this embodiment is shown in Table 1 below.
Table 1 outlier identifier design
As shown in Table 1, assume that normal values are quantized with signed int4 (4-bit integers). The original int4 can represent integers in the range [-8, 7], where the code 1000₂ represents the value -8. First, 1000₂ is taken as the outlier identifier and removed from the int4 encoding, so that the encoding range of normal values becomes [-7, 7]. Furthermore, the pair encoding of the present invention can be extended to higher-precision quantization (e.g., 8 bits). Likewise, the 8-bit normal value encoding also requires eliminating one number: int8 can represent [-128, 127], so the present invention uses 10000000₂ as the int8 outlier identifier and narrows its range to [-127, 127].
In this embodiment, in a pair containing an outlier, the clipping value is replaced with the outlier identifier. That is, for an outlier-clipping value pair, the clipping value is replaced with the identifier. As shown in Fig. 3, normal values are quantized normally, the clipping value is replaced with the 4-bit binary code 1000₂, and the outlier is quantized with an outlier-specific data type using a special outlier quantization scheme.
Step S400, quantizing the pairs that contain outliers.
In this embodiment, quantizing the pairs that contain outliers includes converting a floating point data type into a floating point data type with an offset:
sign×(1<<mb+mantissa)<<(exponent+bias);
wherein sign represents the sign, exponent represents the exponent, bias represents the offset, mantissa represents the mantissa, and mb represents the number of mantissa bits.
Since outliers typically have very large absolute values and span a wide range, this embodiment quantizes them with floating-point-based data. To be compatible with the encoding of normal values, this embodiment converts the floating point representation into a fixed-point representation. At the same time, to prevent the encoding range of outliers from overlapping with that of normal values and being wasted, and to maximize the numeric range a data type can represent at the same bit length, this embodiment proposes a new data type called the adaptively biased floating point data type. The key idea is to add a suitable offset to the exponent so that all encoded values skip the interval occupied by normal values, providing a larger encoding range for outliers. That is, this embodiment quantizes outliers with an adaptively biased floating point data type suited to outlier quantization, and can quantize outliers accurately with a small number of bits.
To be compatible with the normal value encoding and avoid fractional numbers, the floating point encoding is first converted into a fixed-point encoding with an exponent. In addition, fixed-point representations are friendlier to hardware implementation than floating-point representations, which leads to faster operation and lower hardware overhead. The original floating point data type with a bias consists of a sign, an exponent, and a mantissa, and can be expressed by the following formula:
sign × 2^(exponent+bias) × 1.mantissa
where the bias is a constant. For example, suppose 0101₂ has a 1-bit sign bit (0), 2-bit exponent bits (10), and a 1-bit mantissa bit (1), with the bias equal to 0. Then its sign is positive, its exponent is 10₂ = 2₁₀, and its mantissa is 1.1₂ = 1.5₁₀, so the final true value is 2^(2+0) × 1.5 = 6.
Further, the present embodiment converts floating point data types to floating point data types with offsets by:
sign×(1<<mb+mantissa)<<(exponent+bias);
wherein sign represents the sign, exponent represents the exponent, bias represents the offset, mantissa represents the mantissa, and mb represents the number of mantissa bits.
Since this involves only shift operations and no fractional numbers, such a fixed-point representation is very friendly and efficient for hardware implementation. This embodiment uses an adaptively biased floating point data type in E2M1 format, that is, a floating point data type with a 1-bit sign, a 2-bit exponent, and a 1-bit mantissa, to which an adaptive bias is added. All true values and their corresponding encodings for a bias of 0 are shown in Table 2.
Table 2 Encodings and true values of the adaptively biased floating point data type with bias 0
Binary code          Sign  Exponent  Significand  True value
0000                  +1      0          0         0
0001                  +1      0          3         3<<0 = 3
001x                  +1      1          2, 3      2<<1 = 4, 3<<1 = 6
010x                  +1      2          2, 3      2<<2 = 8, 3<<2 = 12
011x                  +1      3          2, 3      2<<3 = 16, 3<<3 = 24
1000 (identifier)     -1      0          0         -0
1001                  -1      0          3         -3<<0 = -3
101x                  -1      1          2, 3      -2<<1 = -4, -3<<1 = -6
110x                  -1      2          2, 3      -2<<2 = -8, -3<<2 = -12
111x                  -1      3          2, 3      -2<<3 = -16, -3<<3 = -24
Clearly, the encoding range of this floating point data type with bias 0 overlaps with the normal values. Therefore, this embodiment uses an adaptively biased floating point data type: if the bias is set to 2, the non-zero encoding range is expanded 4 times, from {3, …} to {12, …}, with absolute values complementary to the normal value range {1, …} of the 4-bit int, so that all encoded values can be used for quantizing outliers. In this embodiment, for 8-bit quantization, an adaptively biased floating point data type in E4M3 format is used, i.e., a floating point data type with a 1-bit sign, a 4-bit exponent, and a 3-bit mantissa.
Furthermore, this embodiment also excludes the outlier identifier from the encoding of this data type; otherwise outlier-clipping pairs could not be distinguished. The code 1000₂ can simply be disallowed for outliers to avoid a collision with the outlier identifier.
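As an illustration, the short Python sketch below enumerates the representable values of the 4-bit adaptively biased E2M1 type (excluding the identifier code 1000₂) and maps an outlier to the nearest representable value; the helper names are assumptions of this description.

def e2m1_decode(code, bias):
    # Decode one 4-bit code following Table 2: value = (2 + mantissa_bit) << (exponent + bias);
    # code 0000 is zero and code 1000 is reserved as the outlier identifier.
    sign = -1 if code & 0b1000 else 1
    exponent = (code >> 1) & 0b11
    mantissa_bit = code & 0b1
    if exponent == 0 and mantissa_bit == 0:
        return 0
    return sign * ((2 + mantissa_bit) << (exponent + bias))

def quantize_outlier(x, bias):
    # Map an outlier to the nearest representable adaptively biased E2M1 value.
    candidates = [e2m1_decode(c, bias) for c in range(16) if c != 0b1000]
    return min(candidates, key=lambda v: abs(v - x))

For example, with a bias of 2, quantize_outlier(50, 2) returns 48, which corresponds to the code 0101₂ used in the decoder example later in this description.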
As can be seen from the above, the outlier-clipping pair encoding algorithm of this embodiment uses the input scale factor to identify outliers and normal values, quantizes outliers with the adaptively biased floating point data type, and quantizes normal values with 4-bit integers. When an outlier appears in a pair, the other value is pruned and the outlier identifier is embedded, completing the outlier-clipping pair encoding.
The embodiment also provides an electronic device, including: a memory for storing a computer program; a processor for executing the computer program to implement the steps of the method for quantifying a language model as described above. Since the steps of the quantization method of the language model have been described in detail, they are not described in detail herein.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the method embodiments described above may be performed by computer program related hardware. The aforementioned computer program may be stored in a computer readable storage medium. The program, when executed, performs steps including the method embodiments described above; and the aforementioned storage medium includes: various media that can store program code, such as ROM, RAM, magnetic or optical disks.
The processor may be a central processing unit (CPU). The memory is connected to the processor via the system bus and communicates with it; the memory is used for storing a computer program, and the processor is used for running the computer program so as to execute the quantization method of the language model. The memory may comprise random access memory (RAM) and may also comprise non-volatile memory, such as at least one disk memory.
Furthermore, the present embodiment also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the language model quantization method. The above detailed description of the method for quantizing the language model is omitted here.
In this embodiment, as shown in fig. 4, the processor may be a graphics processor, wherein each element dot product unit in each tensor computation core of the graphics processor is provided with an embedded decoder, an adder, and a shifter.
The following describes how the method of the present embodiment is applied to the tensor computation core architecture of the GPU in fig. 4.
As shown in fig. 4, this embodiment takes the Turing architecture of a GPU as its basis, which has 68 streaming multiprocessors (SMs) with 8 tensor computation cores per SM (544 in total). Each tensor computation core has 8 FEDP (four-element dot product) units. Thus, there are 68×8×2×8×4 = 34,816 16-bit floating-point multipliers. The Turing architecture natively supports mixed-precision computing: for example, the RTX 2080 Ti, which uses the Turing architecture, provides 107.6, 215.2, and 430.3 TOPS of computing power for 16-bit floating point, 8-bit int, and 4-bit int operations, respectively. Thus, as shown in FIG. 4, the tensor computation core can support both 8-bit 8EDP (eight-element dot product) and 4-bit 16EDP (sixteen-element dot product). Here Int4 is the 4-bit integer mantissa, Exp4 represents the 4-bit exponent, and the computation is completed by multiplication and right shifting.
This embodiment can be easily embedded into a GPU adopting a SIMD architecture. Only a 4-bit outlier-clipping pair (OVP) decoder (right-most part of fig. 4) needs to be embedded for each 16EDP, and to support the new data type proposed in this embodiment, each 16EDP needs one added adder and shifter. Similarly, this embodiment also designs an 8-bit decoder for the 8EDP unit.
In this embodiment, as shown in fig. 5, the processor may also be a tensor processor, wherein decoders are configured at the boundary of the systolic array of the tensor processor; if the size of the systolic array is n×m, the number of decoders is n+m, where n and m are the numbers of rows and columns of the systolic array, respectively.
The following describes how the method of the present embodiment is applied to the systolic array of the tensor processor in fig. 5.
The integration of this embodiment into a systolic array is shown in fig. 5. The systolic array uses the same outlier-clipping pair (OVP) decoder design as the GPU. Unlike the GPU, however, decoders only need to be placed at the boundary of the systolic array, which saves a large portion of the decoders compared with the GPU: assuming an array size of n×m, only n+m decoders are required rather than n×m. This is an advantage of systolic arrays over the SIMD architecture of GPUs. Also, as shown in the multiply-add unit of fig. 4, the new data type of this embodiment can be supported by a systolic array processing element (PE) that adds only one extra adder and shifter, where Int4 is the 4-bit integer mantissa, Exp4 represents the 4-bit exponent, and the computation is completed by multiplication and right shifting. Moreover, only one extra adder needs to be added for every 4 PEs to support higher-precision quantization (e.g., int8).
To support outlier-clipping pair decoding, this embodiment designs a new decoder that is easily embedded into existing accelerators. In this embodiment, the decoder includes: an outlier judging unit for judging whether an outlier exists in a pair of input data; a normal value decoder that outputs the normal value in the pair in its original encoding; and an outlier decoder that decodes the outlier in the pair into an exponent-integer pair value.
FIG. 6 shows a 4-bit decoder that reads two 4-bit values, i.e., 1 byte (8 bits), per cycle; 1 byte is the smallest addressable memory unit in many hardware architectures and corresponds exactly to one outlier-clipping pair. If the pair contains the outlier identifier 1000₂, the decoder decodes the identifier to 0 and decodes the outlier with the outlier decoder; normal values are output directly in their original encoding. The outlier decoder decodes the input value into an Exp-Int pair (exponent, mantissa pair) to fit the multiply-add units in GPU and TPU architectures.
Similarly, this embodiment also designs an 8-bit decoder, which only requires the decoding bit width in fig. 6 to be expanded to 8 bits.
In this embodiment, for the floating point data type with an offset, the decoder decodes the exponent from the relation between the exponent and the offset, and decodes the significand from the relation between the significand and the input data.
Specifically, as shown in fig. 7, for the 4-bit signed adaptively biased floating point data type, assume the code is x = (b₂b₁b₀)₂. Its exponent and significand can then be decoded with the following formulas:
exponent = bias + (b₂b₁)₂
significand = (1 b₀)₂
For example, when the bias is 2, 0101₂ is decoded to 48₁₀, because its exponent is 2₁₀ + 10₂ = 4₁₀ and its significand is 11₂ = 3₁₀; its value is therefore 3 << 4 = 48. Fig. 7 shows the design of the decoder for the 4-bit signed adaptively biased floating point data type. The 8-bit outlier-clipping pair decoder is designed similarly; only the decoding bit width in fig. 7 needs to be extended to 8 bits.
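For illustration, a Python sketch of this decoding step, which turns a 4-bit adaptively biased code into the exponent-integer (Exp-Int) pair used by the multiply-add units; the bit layout follows the description of Fig. 7, and the function name is an assumption.

def decode_to_exp_int(code, bias):
    # Decode a 4-bit signed adaptively biased code (sign, b2, b1, b0) into an
    # (exponent, integer) pair such that value == integer << exponent.
    sign = -1 if code & 0b1000 else 1
    exponent = bias + ((code >> 1) & 0b11)   # exponent = bias + (b2 b1)_2
    integer = sign * (2 + (code & 0b1))      # significand = (1 b0)_2, i.e. 2 or 3
    return exponent, integer

# Example from the text: with bias 2, code 0101 decodes to exponent 4 and
# integer 3, i.e. the value 3 << 4 = 48.
assert decode_to_exp_int(0b0101, bias=2) == (4, 3)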
After the outliers and normal values are decoded, both are converted into uniform exponent-integer pairs. As shown in fig. 5, to support computation on the decoded exponent-integer pairs, a shifter and an adder are added to the multiply-add (multiply-and-accumulate) unit. For example, for two exponent-integer (Exp-Int) pairs <a, b> and <c, d>, where a and c are exponents and b and d are integers, <a, b> represents:
<a, b> = b << a
Thus, the result of their multiplication is (b×d) << (a+c) = <a+c, b×d>.
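The product rule can be illustrated with a short Python sketch (the function name is an assumption):

def exp_int_multiply(p, q):
    # <a, b> * <c, d> = <a + c, b * d>, since (b << a) * (d << c) == (b * d) << (a + c).
    (a, b), (c, d) = p, q
    return (a + c, b * d)

# e.g. <4, 3> (= 48) times <2, -2> (= -8) gives <6, -6> (= -384).
assert exp_int_multiply((4, 3), (2, -2)) == (6, -6)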
Next, how this embodiment supports mixed-precision computation is described. The tensor computation unit of the GPU natively supports mixed-precision operations. For the systolic array of the TPU, this embodiment can use 4 processing elements (PEs) to natively support 8-bit computation. An int8 value x is first decomposed into its upper 4 bits and lower 4 bits, as follows:
x = (h_x << 4) + l_x = <4, h_x> + <0, l_x>
The two int8 numbers x and y can then be multiplied as follows:
x × y = (<4, h_x> + <0, l_x>) × (<4, h_y> + <0, l_y>) = <8, h_x×h_y> + <4, h_x×l_y> + <4, l_x×h_y> + <0, l_x×l_y>
Thus, 4 processing elements (PEs) can be used to compute the multiplication of two int8 numbers.
Likewise, multiplication of the 8-bit adaptively biased floating point data type can be supported in the same way. A number z of the 8-bit adaptively biased floating point data type is first decoded into an exponent e_z and an integer i_z. The integer i_z is likewise split as i_z = (h_z << 4) + l_z, so that z = <4+e_z, h_z> + <e_z, l_z>. The same approach can therefore be used to compute multiplications of the 8-bit adaptively biased floating point data type, which involves only one more exponent e_z than int8.
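As an illustration of this decomposition, the sketch below splits two int8 values into 4-bit halves and accumulates the four partial products, one per processing element; treating the low nibble as unsigned and the high nibble as signed is an assumption of this description.

def split_int8(x):
    # x = (h << 4) + l, i.e. the Exp-Int pairs <4, h> and <0, l>.
    l = x & 0xF
    h = (x - l) >> 4
    return (4, h), (0, l)

def int8_multiply_via_4_pe(x, y):
    # Four 4-bit partial products (one per PE) reproduce the full int8 product.
    acc = 0
    for (ax, bx) in split_int8(x):
        for (ay, by) in split_int8(y):
            acc += (bx * by) << (ax + ay)
    return acc

assert int8_multiply_via_4_pe(-73, 45) == -73 * 45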
For the 4-bit tensor computation core, the instruction used by the Turing architecture is mma.s32.s4.s4.s32. Its four operands are the matrices D (int32), A (int4), B (int4), and C (int32), with D = A×B + C. To support the computation on the GPU, this embodiment designs a new instruction, mmaovp, whose operands are as follows:
mmaovp denotes an MMA (matrix multiply-add) instruction using OVP encoding. The first s32 is the address operand of the computation result in the original 32-bit integer format; the two multiplier operands of the multiply-add instruction use the 4-bit integer encoding int4 of the OVP code, from which -8 has been removed so that the range is [-7, +7]; the second s32 indicates that the addend of the multiply-add instruction is an original 32-bit integer; and the last s4 is an original 4-bit integer representing the bias value in the outlier floating-point encoding, which defaults to the E2M1 format.
Owing to the memory-aligned design of the data type, this embodiment preserves the original programming interface of the GPU. The original int-type mma instruction can simply be replaced with the instruction of this embodiment (e.g., mmaovp), so a quantization framework supporting this embodiment can easily be built. The framework of this embodiment therefore has general and practical applicability, which is also its most significant advantage.
Fig. 8 is a schematic diagram of the overall implementation flow of the language model quantization method and electronic device according to an embodiment of the present application. As shown in fig. 8, the optimal scale factor, i.e., the threshold between outliers and normal values, is first obtained for each layer of neural network parameters by the scale factor selection algorithm. After the optimal scale factor is obtained, the neural network parameters and the optimal scale factor are taken as the input of the outlier-clipping value pair encoding algorithm, which produces the pair-encoded quantization result as output.
In the scale factor selection algorithm, the 3-sigma value of the neural network parameters is computed as the initial scale factor, and a new candidate scale factor is then selected within a specific range by the scale factor search algorithm. The outlier-clipping pair encoding of the neural network parameters is recomputed using this scale factor, the quantization error of the encoding is estimated using the mean square error (MSE), and the currently best result is updated. This process loops until all candidate scale factors have been explored, and the scale factor with the smallest error, i.e., the optimal scale factor found by the search, is output.
The outlier-clipping pair encoding algorithm uses the input scale factor to identify outliers and normal values, quantizes outliers with the adaptively biased floating point data type, and quantizes normal values with 4-bit integers. When an outlier appears in a pair, the other value is pruned and the outlier identifier is embedded, completing the outlier-clipping pair encoding.
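Putting the pieces together, a hedged end-to-end sketch of the per-layer flow of Fig. 8 might look as follows, reusing the illustrative helpers sketched earlier (search_scale_factor, ovp_encode); these names are assumptions of this description, not a fixed interface.

def quantize_model(layers, ovp_encode):
    # layers: mapping from layer name to its weight array.
    quantized = {}
    for name, weights in layers.items():
        flat = weights.ravel()
        sf = search_scale_factor(flat, ovp_encode)    # 3-sigma init + MSE search
        quantized[name] = (ovp_encode(flat, sf), sf)  # outlier-clipping pair encoding
    return quantized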
In summary, this embodiment achieves rapid quantization of large language models: it accelerates inference, reduces operating cost, and preserves model accuracy and performance, achieving accuracy close to the original together with a large inference speedup on common language models; it handles outliers in the neural network model at low hardware cost, delivers a high performance improvement, and pushes the limit of 4-bit neural network quantization to a new state of the art. Moreover, this embodiment can be effectively integrated into existing hardware processors, such as GPU tensor computation cores and TPU systolic arrays, providing efficient support for deploying large-scale neural network models on various devices. This embodiment thus effectively overcomes various defects of the prior art and has high industrial utilization value.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications and changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the invention shall still be covered by the claims of the invention.

Claims (10)

1. A method for quantizing a language model, characterized by comprising the following steps:
determining a quantization scale factor;
identifying normal values and outliers based on neural network parameters and the scale factor;
pairing every two adjacent positions in a tensor of the neural network model, and when an outlier appears in a pair, pruning the other value in the pair and configuring an outlier identifier for the pair;
quantizing the pairs that contain outliers.
2. The method for quantizing a language model according to claim 1, wherein determining the quantization scale factor comprises:
taking the 3-sigma value of the neural network parameters as the initial quantization scale factor;
searching for a number of candidate scale factors based on the quantization error of the pair encoding with outliers;
determining the candidate scale factor with the smallest encoding quantization error as the optimal scale factor, which serves as the scale factor for identifying normal values and outliers.
3. The method for quantizing a language model according to claim 1, wherein pairing every two adjacent positions in the tensor of the neural network model comprises:
if the values at both adjacent positions are normal values, applying integer quantization;
if one of the two values is an outlier and the other is a normal value, pruning the normal value and recording it as the clipping value;
if both values are outliers, pruning the outlier with the smaller magnitude.
4. The method for quantizing a language model according to claim 3, wherein configuring the outlier identifier for the pair comprises:
designating one end of the integer value range as the outlier identifier and removing that end from the integer value range;
in a pair containing an outlier, replacing the clipping value with the outlier identifier.
5. The method for quantizing a language model according to claim 1 or 4, wherein quantizing the pairs that contain outliers comprises:
converting a floating point data type into a floating point data type with an offset:
sign×(1<<mb+mantissa)<<(exponent+bias);
wherein sign represents the sign, exponent represents the exponent, bias represents the offset, mantissa represents the mantissa, and mb represents the number of mantissa bits.
6. An electronic device, characterized by comprising: a memory for storing a computer program; and a processor for executing the computer program to implement the steps of the method for quantizing a language model according to any one of claims 1 to 5.
7. The electronic device of claim 6, wherein: the processor is a graphics processor; and each element dot product unit in each tensor computation core of the graphics processor is provided with an embedded decoder, an adder, and a shifter.
8. The electronic device of claim 6, wherein: the processor is a tensor processor; decoders are configured at the boundary of the systolic array of the tensor processor; the size of the systolic array is n×m and the number of decoders is n+m, where n and m are the numbers of rows and columns of the systolic array, respectively.
9. The electronic device according to claim 6 or 7, characterized in that: the decoder includes:
an outlier judging unit for judging whether an outlier exists in the pair of input data;
a normal value decoder that outputs the normal value in the pair in its original encoding;
an outlier decoder that decodes the outlier in the pair into an exponent-integer pair value.
10. The electronic device of claim 9, wherein: for the floating point data type with an offset, the decoder decodes the exponent from the relation between the exponent and the offset, and decodes the significand from the relation between the significand and the input data.
CN202310395701.9A 2023-04-13 2023-04-13 Quantization method of language model and electronic equipment Pending CN116451769A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310395701.9A CN116451769A (en) 2023-04-13 2023-04-13 Quantization method of language model and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310395701.9A CN116451769A (en) 2023-04-13 2023-04-13 Quantization method of language model and electronic equipment

Publications (1)

Publication Number Publication Date
CN116451769A true CN116451769A (en) 2023-07-18

Family

ID=87135191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310395701.9A Pending CN116451769A (en) 2023-04-13 2023-04-13 Quantization method of language model and electronic equipment

Country Status (1)

Country Link
CN (1) CN116451769A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117217318A (en) * 2023-11-07 2023-12-12 瀚博半导体(上海)有限公司 Text generation method and device based on Transformer network model
CN117217318B (en) * 2023-11-07 2024-01-26 瀚博半导体(上海)有限公司 Text generation method and device based on Transformer network model

Similar Documents

Publication Publication Date Title
US11043962B2 (en) Information processing apparatus, information processing method, and recording medium
Zhang et al. Lq-nets: Learned quantization for highly accurate and compact deep neural networks
CN111522528B (en) Multiplier, multiplication method, operation chip, electronic device, and storage medium
US10644721B2 (en) Processing core data compression and storage system
CN109859281B (en) Compression coding method of sparse neural network
Liu et al. Improving neural network efficiency via post-training quantization with adaptive floating-point
CN103067022A (en) Nondestructive compressing method, uncompressing method, compressing device and uncompressing device for integer data
JP2019139338A (en) Information processor, information processing method and program
US11544542B2 (en) Computing device and method
CN116451769A (en) Quantization method of language model and electronic equipment
JP2019057249A (en) Processing unit and processing method
TW202042559A (en) Methods and apparatuses for compressing parameters of neural networks
CN113126953A (en) Method and apparatus for floating point processing
CN110769263A (en) Image compression method and device and terminal equipment
US20140089276A1 (en) Search unit to accelerate variable length compression/decompression
JP2020135549A (en) Arithmetic processing device, information processing device and arithmetic processing method
CN114640354A (en) Data compression method and device, electronic equipment and computer readable storage medium
WO2015116184A1 (en) Constant hamming weight coding
CN117216466A (en) Data processing method, device, system and storage medium
US20100225508A1 (en) Variable-Length Code Decoding Device and Variable-Length Code Decoding Method
CN111797984B (en) Quantification and hardware acceleration method and device for multi-task neural network
WO2016110125A1 (en) Hash method for high dimension vector, and vector quantization method and device
CN113609313A (en) Data processing method and device, electronic equipment and storage medium
CN113177627A (en) Optimization system, retraining system, and method thereof, and processor and readable medium
US20220188077A1 (en) Arithmetic processing device, arithmetic processing method, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination