CN111814448B - Pre-training language model quantization method and device - Google Patents
- Publication number
- CN111814448B CN111814448B CN202010636126.3A CN202010636126A CN111814448B CN 111814448 B CN111814448 B CN 111814448B CN 202010636126 A CN202010636126 A CN 202010636126A CN 111814448 B CN111814448 B CN 111814448B
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a pre-trained language model quantization method and device, wherein the pre-trained language model quantization method comprises the following steps: performing a first fine-tuning of the pre-trained language model on a downstream task; clustering the data in the weight matrices of all embedding layers and all linear layers except the classification layer of the fine-tuned model using k-means clustering, with the number of clusters set to 2^n, where n is the number of bits occupied by each datum of the compressed target model; and performing a second fine-tuning of the quantized model on the downstream task while maintaining quantization, finally obtaining the quantized network. The scheme provided by the embodiments of the application shows that the influence of improving the underlying quantization scheme on the quantization effect has been greatly underestimated and ignored; it also shows that a very good compression effect can be achieved by simple k-means quantization without any tricks, so the k-means compression method has very large development space and application prospects.
Description
Technical Field
The invention belongs to the field of language model quantization, and particularly relates to a pre-trained language model quantization method and device.
Background
In the prior art, some quantization methods for pre-trained language models have already been presented, including 8-bit fixed-precision quantization and mixed-precision quantization based on the Hessian matrix.
8-bit fixed-precision quantization: all layers of the model that need to be quantized are quantized to 8 bits, and the model is then fine-tuned.
Mixed-precision quantization based on the Hessian matrix: the quantization precision of each layer is determined from the Hessian matrix of that layer's parameters. The larger the top eigenvalues of a layer's Hessian matrix, the higher the quantization precision assigned to that layer; conversely, layers with smaller eigenvalues receive lower precision. Quantization is followed by fine-tuning.
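The ranking idea behind such Hessian-based mixed precision can be sketched in a few lines. This is only a toy illustration of "larger eigenvalue, more bits", not the published method's exact assignment rule; the function name and the equal-group heuristic are our own assumptions:

```python
def assign_precisions(eigvals, bit_choices=(8, 4, 2)):
    """Toy bit assignment: layers whose parameters have larger top Hessian
    eigenvalues get more bits. Illustrates the ranking idea only."""
    order = sorted(range(len(eigvals)), key=lambda i: -eigvals[i])
    bits = [0] * len(eigvals)
    # Split the ranked layers into len(bit_choices) roughly equal groups.
    group = max(1, -(-len(eigvals) // len(bit_choices)))
    for rank, layer in enumerate(order):
        bits[layer] = bit_choices[min(rank // group, len(bit_choices) - 1)]
    return bits
```

For example, three layers with top eigenvalues 5.0, 0.1 and 1.0 would receive 8, 2 and 4 bits respectively under this heuristic.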
The underlying quantization scheme in both of the above methods is linear quantization. That is, each tensor to be quantized is quantized linearly on its own: the maximum and minimum values of the parameters in the tensor are found first, and the range between them is divided into equal parts — 2^n parts, i.e. 2^n classes, when quantizing to n bits. The mean of all parameters belonging to a class is taken as that class's central value, and each parameter is replaced by the central value of the class to which it belongs. The tensor is thus replaced by a tensor storing the central value of each class and a tensor storing the class index of each parameter.
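The linear quantization just described can be sketched as follows (a minimal plain-Python illustration; the function names are ours, not from any cited work):

```python
def linear_quantize(v, n_bits):
    """Linearly quantize a list of floats into 2**n_bits equal-width buckets.

    Returns (labels, centers): labels[i] is the bucket index of v[i], and
    centers[c] is the mean of all values that fell into bucket c.
    """
    lo, hi = min(v), max(v)
    k = 2 ** n_bits
    width = (hi - lo) / k or 1.0  # guard against a constant vector
    # Assign each value to one of k equal-width buckets over [lo, hi].
    labels = [min(int((x - lo) / width), k - 1) for x in v]
    # The center of each bucket is the mean of its members (empty -> 0.0).
    centers = []
    for c in range(k):
        members = [x for x, lab in zip(v, labels) if lab == c]
        centers.append(sum(members) / len(members) if members else 0.0)
    return labels, centers

def dequantize(labels, centers):
    """Rebuild the approximate vector: each value becomes its bucket's mean."""
    return [centers[lab] for lab in labels]
```

For instance, 1-bit quantization of [0.0, 0.1, 0.9, 1.0] yields two buckets with centers 0.05 and 0.95, so the stored tensor pair is ([0, 0, 1, 1], [0.05, 0.95]).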
The inventors found in the process of implementing the present application that the existing solution has at least the following drawbacks:
the compression effect of linear quantization is limited: the performance of the quantized model drops sharply at low precision, which prevents the model from being compressed to very low precision.
Linear quantization is not a good clustering method. The quantized vector does not represent the parameter distribution of the original vector well.
Disclosure of Invention
The embodiments of the invention provide a pre-trained language model quantization method and device, so as to solve at least one of the technical problems above.
In a first aspect, an embodiment of the present invention provides a method for quantizing a pre-trained language model, including: performing a first fine-tuning of the pre-trained language model on a downstream task; clustering the data in the weight matrices of all embedding layers and all linear layers except the classification layer of the fine-tuned model using k-means clustering, with the number of clusters set to 2^n, where n is the number of bits occupied by each datum of the compressed target model; and performing a second fine-tuning of the quantized model on the downstream task while maintaining quantization, finally obtaining a quantized network.
In a second aspect, an embodiment of the present invention provides a pre-trained language model quantization apparatus, including: a first fine-tuning module configured to perform a first fine-tuning of the pre-trained language model on a downstream task; a cluster compression module configured to cluster the data in the weight matrices of all embedding layers and all linear layers except the classification layer of the fine-tuned model using k-means clustering, with the number of clusters set to 2^n, where n is the number of bits occupied by each datum of the compressed target model; and a second fine-tuning module configured to perform a second fine-tuning of the quantized model on the downstream task while maintaining quantization, finally obtaining a quantized network.
In a third aspect, there is provided an electronic device, comprising: the system comprises at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the pre-trained language model quantization method of any of the embodiments of the present invention.
In a fourth aspect, embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the pre-trained language model quantization method of any of the embodiments of the present invention.
The scheme provided by the method and device of the present application shows that the influence of improving the underlying quantization scheme on the quantization effect has been greatly underestimated and ignored; it also shows that a very good compression effect can be achieved by simple k-means quantization without any tricks, so the k-means compression method has very large development space and application prospects.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a pre-trained language model quantization method according to an embodiment of the present invention;
FIG. 2 shows the algorithm for k-means quantization in the pre-trained language model quantization method according to an embodiment of the present invention;
FIG. 3 is a comparison of the average scores on 8 GLUE tasks of linear and k-means quantization on the BERT model, for a specific embodiment of the pre-trained language model quantization scheme of the present invention;
FIG. 4 is a comparison of the average scores on 8 GLUE tasks of linear and k-means quantization on the ALBERT model, for a specific embodiment of the pre-trained language model quantization scheme of the present invention;
FIG. 5 is a comparison of the average scores on 8 GLUE tasks of the k-means-quantized BERT and ALBERT models, for a specific embodiment of the pre-trained language model quantization scheme of the present invention;
FIG. 6 is a comparison of the performance of the k-means-quantized BERT and ALBERT models, each value being the average score of the quantized model as a percentage of the full-precision model's score, for a specific embodiment of the pre-trained language model quantization scheme of the present invention;
FIG. 7 is a block diagram of a pre-training language model quantization apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to FIG. 1, a flowchart of an embodiment of the pre-trained language model quantization method of the present application is shown; the application scenario is not limited herein.
As shown in fig. 1, in step 101, a pre-trained language model is first fine-tuned on a downstream task;
in step 102, the data in the weight matrices of all embedding layers and all linear layers except the classification layer of the fine-tuned model is clustered using k-means clustering, with the number of clusters set to 2^n, where n is the number of bits occupied by each datum of the compressed target model;
in step 103, the quantized model is subjected to a second fine tuning on the downstream task under the condition of maintaining quantization, and finally a quantized network is obtained.
In this embodiment, for each selected task, the following experiments are performed in sequence: fine-tuning the pre-trained models (e.g., BERT and ALBERT) on the downstream task; quantizing the task-specific model; and fine-tuning the quantized model. The performance of the resulting model is then tested on the validation set of each selected task.
To avoid the impact of other tricks, we apply the two quantization schemes (linear and k-means) following only a fixed-precision quantization strategy, without using any tricks. We quantize all weights of the embedding layer and of the fully connected layers (except the classification layer). After quantization, each weight vector is represented by the corresponding cluster index vector and mean vector, and each parameter of the weight vector is replaced by the mean of the cluster to which it belongs.
After model quantization, we fine-tune the model on the corresponding downstream task while maintaining quantization. For the forward pass, we reconstruct each quantized layer from its cluster index vector and mean vector. For the backward pass, we update the quantized parameters by training the mean vector, while the remaining parameters are updated normally. More specifically, the gradient of each parameter in the mean vector is calculated as the average of the gradients of the parameters belonging to the corresponding cluster. The mean vector is then updated by back-propagation like any other parameter.
In some alternative embodiments, clustering the data in the weight matrices of all embedding layers and all linear layers of the fine-tuned model except the classification layer using k-means clustering includes: partitioning the data into 2^k clusters (where k is as defined for n above) using k-means++ initialization, and initializing 2^k means for the 2^k clusters; classifying each datum into the nearest cluster according to its relation to each mean; after all data are classified, updating each mean to the average of all the data in its cluster; and repeatedly reclassifying each datum and updating the means until convergence is met or a preset maximum number of iteration rounds is reached.
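The clustering steps above can be sketched as follows (a minimal plain-Python illustration of the assign/update iterations on scalar weight data; in this sketch the initial means are passed in, however they were seeded):

```python
def kmeans_1d(data, means, max_rounds=3):
    """Assign/update (Lloyd) iterations on scalar data with given initial means.

    The embodiment caps the iterations at 3 rounds; we keep that default.
    Returns (labels, means) after convergence or max_rounds.
    """
    labels = [0] * len(data)
    for _ in range(max_rounds):
        # Step 1: classify every datum into the cluster of its nearest mean.
        new_labels = [min(range(len(means)), key=lambda c: abs(x - means[c]))
                      for x in data]
        # Step 2: update each mean to the average of its assigned data
        # (an empty cluster keeps its previous mean).
        new_means = []
        for c in range(len(means)):
            members = [x for x, lab in zip(data, new_labels) if lab == c]
            new_means.append(sum(members) / len(members) if members else means[c])
        if new_labels == labels and new_means == means:
            break  # converged: assignments and means stopped changing
        labels, means = new_labels, new_means
    return labels, means
```

For example, data [0.0, 0.2, 1.0, 1.2] with initial means [0.0, 1.0] converges in one round to labels [0, 0, 1, 1] and means [0.1, 1.1].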
In some alternative embodiments, partitioning the data into 2^k clusters and initializing the 2^k means for the 2^k clusters using k-means++ initialization includes: selecting one random datum from the data as the first mean; assigning each remaining datum a probability of being the next mean according to its minimum distance from the existing means, and selecting the next mean according to these probabilities; and repeating the probability calculation and mean selection until all 2^k means are generated.
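This k-means++ seeding can be sketched as follows. Note one assumption: the standard k-means++ algorithm weights each candidate by its squared distance to the nearest existing mean, which is what we show; the text above says only "minimum distance", so the exact weighting used in the embodiment may differ:

```python
import random

def kmeans_pp_init(data, k, rng=None):
    """k-means++ seeding: pick the first mean uniformly at random, then pick
    each subsequent mean with probability proportional to its squared
    distance to the nearest mean chosen so far."""
    rng = rng or random.Random(0)
    means = [rng.choice(data)]
    while len(means) < k:
        # Squared distance from every datum to its closest existing mean.
        d2 = [min((x - m) ** 2 for m in means) for x in data]
        total = sum(d2)
        if total == 0:  # all data already coincide with chosen means
            means.append(means[0])
            continue
        # Sample the next mean with probability d2[i] / total.
        r = rng.random() * total
        acc = 0.0
        for x, w in zip(data, d2):
            acc += w
            if acc >= r:
                means.append(x)
                break
        else:
            means.append(data[-1])
    return means
```

With two well-separated values, whichever is picked first, the other is certain to be picked second, since the first has zero weight.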
Further optionally, the preset maximum number of iteration rounds is set to 3.
In other alternative embodiments, when the quantized network performs the forward computation, the original weight matrix is restored from the stored class of each datum and the mean of each class, i.e., each datum is replaced by the mean of its corresponding class; when the quantized network performs the backward computation, the network parameters, in particular the quantized weight matrices, are updated with a gradient descent method, where the gradients of the elements in the same class are averaged and the average is used as the gradient of that class's mean to update each mean.
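The forward reconstruction and averaged-gradient update can be sketched framework-free as follows (in practice this would be implemented with autograd hooks in a framework such as PyTorch; the helper names here are our own):

```python
def reconstruct(labels, means):
    """Forward pass: rebuild the weight matrix from per-entry cluster
    indices and the per-cluster means."""
    return [[means[c] for c in row] for row in labels]

def mean_gradients(labels, grads, n_clusters):
    """Backward pass: the gradient of each cluster mean is the average of
    the gradients of all weights assigned to that cluster."""
    sums = [0.0] * n_clusters
    counts = [0] * n_clusters
    for label_row, grad_row in zip(labels, grads):
        for c, g in zip(label_row, grad_row):
            sums[c] += g
            counts[c] += 1
    return [s / n if n else 0.0 for s, n in zip(sums, counts)]

def sgd_step(means, mean_grads, lr):
    """Update only the cluster means; the cluster assignments stay fixed."""
    return [m - lr * g for m, g in zip(means, mean_grads)]
```

Only the mean vector is trained; the index matrix is frozen, so the number of distinct weight values never grows beyond 2^n.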
Further optionally, the pre-trained language model is BERT (Bidirectional Encoder Representations from Transformers) or ALBERT.
The following description is given to better understand the aspects of the present application by describing some of the problems encountered by the inventor in carrying out the present invention and one specific embodiment of the finally-determined aspects.
In order to improve the compression rate of pre-trained language models or the performance of the compressed models, most existing work relies on introducing additional tricks, such as variable-precision compression and group-wise compression; these either bring limited improvement or increase inference time by tens of times. The improvement that can be brought about by improving the underlying quantization scheme itself has been greatly underestimated, so few attempts have been made in this direction.
By changing the underlying quantization scheme from linear clustering to k-means clustering, the rationality of the grouping is greatly improved, and the pre-trained language model can be compressed to less than 15% of its original size while still maintaining more than 90% of the original model's performance.
The method comprises the following specific steps:
1) Fine-tuning the pre-trained language model on a specific downstream task;
2) Using k-means clustering, cluster the data in the weight matrices of all embedding layers and linear layers of the model except the classification layer, with the number of clusters set to 2^n (where n is the number of bits occupied by each datum of the compressed target model); initialize with the k-means++ initialization method, and set the maximum number of iteration rounds of the k-means method to 3;
3) Fine-tune the quantized model again on the corresponding downstream task while maintaining quantization, finally obtaining the quantized network.
In addition, when the quantized network performs the forward computation, the original weight matrix is restored from the stored class of each datum and the mean of each class, i.e., each datum is replaced by the mean of its corresponding class; in the backward computation, the network parameters, in particular the quantized weight matrices, are updated with a gradient descent method, where the gradients of the elements in the same class are averaged and used as the gradient of that class's mean to update each mean.
This scheme shows that the influence of improving the underlying quantization scheme on the quantization effect has been greatly underestimated and ignored; it also shows that a very good compression effect can be achieved by simple k-means quantization without any tricks, so the k-means compression method has very large development space and application prospects.
The following describes the process of implementing the embodiments of the present application, and some experimental procedures and corresponding experimental data in the process, so that those skilled in the art can better understand the technical solutions of the present application.
Recently, pre-trained language models like BERT have shown excellent performance on a variety of natural language processing tasks. However, the application of these models is limited by the large amount of space they require. One widely studied and effective way to reduce the size of a network is quantization. However, most of the efforts on BERT quantization use the rudimentary linear clustering method as the quantization scheme, and few efforts have been made to improve the scheme itself. This greatly limits the performance of quantization. Here we implemented k-means quantization and compared its performance with linear quantization under fixed-precision quantization of BERT. By this comparison, we verify that the performance gain from improving the underlying quantization scheme has been greatly underestimated, and that k-means quantization has great potential for development. Furthermore, we also compared the performance of the two quantization schemes on the ALBERT model to explore the difference in quantization robustness between different pre-trained models.
Keyword: k-means quantization, linear quantization, pre-training language model, GLUE.
Introduction to 1
Pre-trained models based on the self-attention mechanism (Transformers) have recently achieved state-of-the-art performance on various Natural Language Processing (NLP) tasks such as sequence tagging and sentence classification. Among them, the BERT model, based on the Transformer architecture, has attracted the most attention due to its excellent performance and versatility. However, the memory and computational consumption of these models is prohibitive. Even relatively small versions of the BERT model (e.g., the BERT-base model) contain over 100 million parameters. This over-parameterization makes deploying BERT models on resource-constrained devices such as smartphones and robots challenging. Therefore, compressing these models is an important need for industry.
One popular and efficient method for model compression is quantization. To reduce the size of the model, quantization represents the parameters of the model with fewer bits than the original 32. With proper hardware, quantization can greatly reduce the memory footprint and speed up computation. Much work in the computer vision field focuses on quantizing models, while much less has been done in NLP. Prior work on Transformer quantization successfully quantized Transformer models to 8 or 4 bits while maintaining comparable performance. However, to our knowledge, there are only two published works on BERT quantization. One of them applies 8-bit fixed-precision linear quantization to the BERT model and achieves a 4x compression rate with little degradation in accuracy. The other improves quantization performance by group-wise mixed-precision linear quantization based on the Hessian matrices of the parameter tensors.
However, as the underlying quantization scheme, most of the Transformer quantization works mentioned above, and especially the BERT quantization works, use linear clustering. Although it is fast and simple, its quantization result does not represent the original data distribution well. While some BERT quantization works achieve higher compression rates without improving the quantization scheme, the group-wise quantization methods they develop are quite time consuming and add significant latency. Although it is natural to believe that replacing linear clustering with a better clustering method may improve the performance of the quantized model, the effect of improving the quantization scheme has been underestimated. Therefore, here we explore the effect of simply improving the quantization scheme from linear clustering to k-means clustering and compare the performance of both schemes. Furthermore, to see the impact on other pre-trained language models, we also compared the two quantization schemes on ALBERT, an improved model of BERT.
In general, we applied k-means and linear quantization to BERT and ALBERT and tested their performance on the GLUE task set. In this way we verify that a simple improvement of the quantization scheme can lead to a great improvement in performance, and that simple k-means clustering has great potential as a BERT quantization scheme. Furthermore, we show that the number of k-means iteration rounds plays an important role in k-means quantization. By further comparison, we found that ALBERT is less robust to quantization than BERT, because parameter sharing reduces the redundancy of the parameters.
2 background: BERT and ALBERT
In this section, we briefly introduce the architecture of the BERT and ALBERT models and point out the version of the model we use in the experiments.
2.1 BERT
The BERT models are a special class of Transformer-based pre-trained networks. They consist mainly of an embedding layer, encoder blocks and an output layer; there is no decoder block in the BERT model. Each encoder block contains one self-attention layer (including three parallel linear layers corresponding to queries, keys and values) and 3 feed-forward layers (each containing one linear layer).
For each self-attention layer, BERT further improves performance using the multi-head technique. For each self-attention head there are three weight matrices W_q, W_k and W_v, where W_q, W_k, W_v ∈ R^{d×(d/h)} (h is the number of heads in each self-attention layer). Let X ∈ R^{l×d} denote the input of the corresponding self-attention layer. The output of a self-attention head is then calculated as

    head(X) = softmax( (X W_q)(X W_k)^T / sqrt(d/h) ) X W_v.

Then, for each self-attention layer, the outputs of all its self-attention heads are concatenated to produce the output of the layer.
Specifically, in our work we performed the following experiment using the BERT-base-uncased version of the BERT model, which has 12 encoder blocks, with 12 self-attention heads per self-attention layer.
2.2 ALBERT
Compared to BERT, ALBERT makes three major improvements. First, the ALBERT model factorizes the parameters of the embedding layer into the product of two smaller matrices. Second, it employs cross-layer parameter sharing to improve parameter efficiency. These two improvements significantly reduce the total number of parameters and make the model more efficient; in addition, parameter sharing can also stabilize network parameters. Third, in pre-training it replaces the next-sentence prediction (NSP) loss with a sentence-order prediction (SOP) loss. This focuses the model on modeling inter-sentence coherence rather than topic prediction, and improves performance on multi-sentence encoding tasks.
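The saving from the embedding factorization is easy to quantify: a V×d embedding table becomes a V×e table times an e×d projection. The sizes below (V=30000, d=768, e=128) are illustrative BERT/ALBERT-like values we assume for the sketch, not figures from this patent:

```python
def embedding_params(vocab, hidden, bottleneck=None):
    """Parameter count of an embedding table, optionally factorized into
    a vocab x bottleneck matrix and a bottleneck x hidden matrix
    (ALBERT-style factorization)."""
    if bottleneck is None:
        return vocab * hidden          # one dense V x d table
    return vocab * bottleneck + bottleneck * hidden  # V x e plus e x d

full = embedding_params(30000, 768)            # unfactorized, BERT-style
factored = embedding_params(30000, 768, 128)   # factorized, ALBERT-style
```

Under these assumed sizes the factorized embedding uses roughly one-sixth of the parameters of the dense table.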
Specifically, here we use the ALBERT-base-v2 version of the ALBERT model, which also has 12 encoder blocks (all parameters are shared between layers), with 12 self-attention heads per self-attention layer.
3 theory of methods
In this section, we first introduce the quantization process in the experiment, and then explain the two quantization schemes we use in detail.
3.1 Overview
To compare the linear and k-means quantization schemes on Transformer-based pre-trained models, we tested the performance of the quantized models on different downstream tasks. Specifically, for each selected task, the following experiments are performed in sequence: fine-tuning the pre-trained models (BERT and ALBERT) on the downstream task; quantizing the task-specific model; and fine-tuning the quantized model. The performance of the resulting model is then tested on the validation set of each selected task.
To avoid the impact of other tricks, we apply the two quantization schemes (linear and k-means) following only a fixed-precision quantization strategy, without using any tricks. We quantize all weights of the embedding layer and of the fully connected layers (except the classification layer). After quantization, each weight vector is represented by the corresponding cluster index vector and mean vector, and each parameter of the weight vector is replaced by the mean of the cluster to which it belongs.
After model quantization, we fine-tune the model on the corresponding downstream task while maintaining quantization. For the forward pass, we reconstruct each quantized layer from its cluster index vector and mean vector. For the backward pass, we update the quantized parameters by training the mean vector, while the remaining parameters are updated normally. More specifically, the gradient of each entry of the mean vector is calculated as the average of the gradients of the parameters belonging to the corresponding cluster. The mean vector is then updated by the same back-propagation method.
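A toy sketch of this fine-tune-while-quantized scheme (plain NumPy with hypothetical values, not the authors' actual training code): the forward pass rebuilds the weights from the cluster index vector and mean vector, and the backward pass averages the weight gradients within each cluster to obtain the gradient of the mean vector.

```python
import numpy as np

def reconstruct(q, means):
    # Forward pass: rebuild the full weight vector from the
    # cluster index vector q and the mean vector.
    return means[q]

def mean_gradient(grad_w, q, n_clusters):
    # Backward pass: the gradient of each entry of the mean vector
    # is the average of the gradients of the parameters belonging
    # to the corresponding cluster.
    g = np.zeros(n_clusters)
    for c in range(n_clusters):
        mask = q == c
        if mask.any():
            g[c] = grad_w[mask].mean()
    return g

# One SGD step on the mean vector only (toy values).
q = np.array([0, 1, 1, 0])                 # cluster index vector
means = np.array([-0.5, 0.5])              # mean vector (trainable)
w = reconstruct(q, means)                  # [-0.5, 0.5, 0.5, -0.5]
grad_w = np.array([0.2, -0.4, 0.0, 0.2])   # gradient w.r.t. reconstructed w
means -= 0.1 * mean_gradient(grad_w, q, len(means))
```

In a real framework the reconstruction would happen inside the layer's forward method, with the mean vector registered as the only trainable quantized parameter.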
3.2 Linear quantization
Let us assume that we need to quantize the vector v to k bits (k-bit quantization). We first find its minimum v_min and maximum v_max, and then divide the range [v_min, v_max] into 2^k clusters.

Define the cluster-assignment function Q as

Q(v_i) = min( floor( (v_i - v_min) / ((v_max - v_min) / 2^k) ), 2^k - 1 ),

whose value lies between 0 and 2^k - 1, so each parameter v_i belongs to the Q(v_i)-th cluster. v_i is then replaced by the mean of the Q(v_i)-th cluster, i.e., the mean of all parameters belonging to that cluster. The quantization function is therefore

Quant(v_i) = ( Σ_j 1{Q(v_j) = Q(v_i)} · v_j ) / ( Σ_j 1{Q(v_j) = Q(v_i)} ),

where 1{statement} equals 1 when the statement in braces is true, and 0 otherwise.
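The scheme above can be sketched in plain NumPy as follows (the helper name and example values are our own; the code assumes v_max > v_min):

```python
import numpy as np

def linear_quantize(v, k):
    """k-bit linear quantization: split [v_min, v_max] into 2**k
    equal-width clusters, assign each parameter to a cluster, and
    return the cluster index vector plus the per-cluster means."""
    n_clusters = 2 ** k
    v_min, v_max = v.min(), v.max()
    width = (v_max - v_min) / n_clusters
    # Q(v_i): cluster id in [0, 2**k - 1]; v_max itself is clipped
    # into the last cluster.
    q = np.clip(np.floor((v - v_min) / width).astype(int), 0, n_clusters - 1)
    means = np.zeros(n_clusters)
    for c in range(n_clusters):
        if (q == c).any():
            means[c] = v[q == c].mean()
    return q, means

v = np.array([-1.0, -0.5, 0.0, 0.4, 0.9, 1.0])
q, means = linear_quantize(v, k=2)
v_hat = means[q]   # each parameter replaced by its cluster mean
```

Note that the cluster boundaries depend only on v_min and v_max, so a few outliers can waste most of the 2^k levels — the weakness that k-means quantization addresses.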
3.3 K-means quantization
Let us assume that we need to quantize the vector v to k bits (k-bit quantization). For k-means quantization, we partition the vector v into 2^k clusters using k-means clustering with k-means++ initialization.
We first use the k-means++ initialization method to initialize the means (μ_1, μ_2, ..., μ_{2^k}) for the clusters (c_1, c_2, ..., c_{2^k}). Then each parameter v_i is classified into its nearest cluster. After all parameters in v are classified, each mean is updated to the mean of all parameters belonging to its cluster. The reclassification of the parameters and the updating of the means are then repeated until the convergence condition is met or the maximum number of iteration rounds is reached. The k-means++ initialization proceeds as follows: first, a random parameter is selected from the vector v as the first mean; then, each remaining parameter is assigned a likelihood of becoming the next mean according to its minimum distance to all existing means, and the next mean is selected according to these likelihoods; finally, the likelihood calculation and mean selection are repeated until all 2^k centroids are generated. For the specific algorithm, please refer to Fig. 2.
To limit the efficiency loss introduced by the improved quantization scheme, we set the maximum number of k-means clustering iterations to 3. After k-means clustering is completed, we use the obtained cluster-number vector as the cluster index vector, and the mean of each cluster as the corresponding entry of the mean vector. Each parameter v_i is then replaced by the mean of the cluster to which it belongs.
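A minimal NumPy sketch of this procedure — k-means++ initialization followed by at most three rounds of reclassification and mean updates (the function name, random seed, and example values are our own):

```python
import numpy as np

def kmeans_quantize(v, k, max_iter=3, seed=0):
    """k-bit k-means quantization of a 1-D weight vector v:
    k-means++ initialization, then at most max_iter Lloyd rounds."""
    rng = np.random.default_rng(seed)
    n_clusters = 2 ** k
    # k-means++ initialization: the first mean is a random parameter;
    # each next mean is drawn with probability proportional to the
    # squared distance to the nearest existing mean.
    means = [v[rng.integers(len(v))]]
    while len(means) < n_clusters:
        d2 = np.min((v[:, None] - np.array(means)[None, :]) ** 2, axis=1)
        p = d2 / d2.sum() if d2.sum() > 0 else np.full(len(v), 1.0 / len(v))
        means.append(v[rng.choice(len(v), p=p)])
    means = np.array(means, dtype=float)
    # Lloyd iterations: classify each parameter into its nearest
    # cluster, then update each mean to the mean of its members.
    for _ in range(max_iter):
        q = np.argmin(np.abs(v[:, None] - means[None, :]), axis=1)
        for c in range(n_clusters):
            if (q == c).any():
                means[c] = v[q == c].mean()
    q = np.argmin(np.abs(v[:, None] - means[None, :]), axis=1)
    return q, means   # cluster index vector and mean vector

v = np.array([0.0, 0.1, 0.05, 5.0, 5.1, 4.9])
q, means = kmeans_quantize(v, k=1)   # 1-bit: 2 clusters
```

With only three Lloyd iterations this mirrors the efficiency compromise described above; raising max_iter corresponds to the later Table 6 experiments on iteration rounds.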
4 Experiments
In this section, we first introduce the data set we use in the experiment, then explain the details of our experiments performed on BERT and ALBERT, and finally show our experimental results and the corresponding discussion.
4.1 Data set
We tested the performance of the quantized models on the General Language Understanding Evaluation (GLUE) task set, which comprises NLU tasks such as question answering, sentiment analysis, and textual entailment. Specifically, we used 8 tasks (QNLI, CoLA, RTE, SST-2, MRPC, STS-B, MNLI, and QQP) to test the performance of the different quantization schemes. The evaluation metric of each task is as follows: CoLA uses the Matthews correlation coefficient (mcc); QNLI, RTE, SST-2, and MNLI use accuracy (acc); MRPC and QQP use accuracy (acc) and F1 score; STS-B uses the Pearson and Spearman correlation coefficients (corr). We follow the default partitioning of the datasets. The datasets can be downloaded at https://gluebenchmark.com.
4.2 Experimental details
Prior to quantization, we fine-tuned the BERT-base-uncased version of the BERT model on the 8 tasks using an Adam optimizer (initial learning rate 5e-5, linear decay). For the ALBERT model, we first fine-tuned the ALBERT-base-v2 model on QNLI, CoLA, SST-2, MNLI, and QQP, and then fine-tuned on RTE, MRPC, and STS-B starting from the MNLI fine-tuning result. We used a linearly decayed Adam optimizer to fine-tune ALBERT and searched {1e-5, 2e-5, 3e-5, 4e-5, 5e-5} for the initial learning rate of each task.
Table 1. Fixed-precision linear quantization results for BERT on the GLUE task set.
Table 2. Fixed-precision k-means quantization results for BERT on the GLUE task set.
Table 3. Fixed-precision linear quantization results for ALBERT on the GLUE task set.
Table 4. Fixed-precision k-means quantization results for ALBERT on the GLUE task set.
Fig. 3 compares the average scores of the 8 GLUE tasks under linear and k-means quantization on the BERT model.
Fig. 4 compares the average scores of the 8 GLUE tasks under linear and k-means quantization on the ALBERT model.
After quantization, we further fine-tune the quantized model on the corresponding task. In particular, the learning rate of the quantized layers is multiplied by 10 (e.g., 5e-4 for all quantized BERT models), while the learning rate of the other layers remains unchanged.
4.3 Experimental results and discussion
We focus mainly on 1-5 bit fixed precision quantization. Tables 1 and 2 show the results of the linear and k-means quantization of BERT, respectively, and fig. 3 shows a further comparison between the average scores of the two experiments. Similarly, the results and comparisons of ALBERT are shown in table 3, table 4 and fig. 4, respectively.
4.3.1 BERT
The large gains from improving the quantization scheme. As shown in Table 1, Table 2, and Fig. 3, although the model performs poorly at low bit widths regardless of the quantization scheme, at the same bit width the k-means-quantized model performs significantly better than the linearly quantized model on all 8 tasks and on their average. In terms of the average performance over the 8 tasks, merely switching the quantization scheme from linear to k-means reduces the performance drop of 1-5 bit quantization relative to full precision from (38.8%, 34.7%, 27.6%, 17.1%, 4.8%) to (28.6%, 3.94%, 0.9%, 0.3%, -0.2%), respectively. The results show that a significant performance improvement can be achieved merely by improving the quantization scheme, which indicates that the room for improving quantization schemes is greatly underestimated. To further illustrate this, we repeated several experiments using a group-wise linear quantization scheme, an improvement over plain linear quantization with higher performance. The results are shown in Table 5. Compared with group-wise linear quantization, simple k-means quantization achieves higher or comparable performance while saving a lot of time.
The potential of k-means quantization. As shown in Table 2, the model can be compressed well simply by using k-means quantization with a fixed-precision strategy, and the quantized model still performs well even at some particularly low bit widths. For example, on the task RTE, a model quantized to 3 bits with k-means quantization incurs only a 2.16% performance drop. For most tasks, including QNLI, SST-2, MRPC, STS-B, MNLI, and QQP, the performance of the quantized model drops significantly only when compressed to 1 bit. Notably, these results are achieved by simple k-means quantization with a maximum of only 3 iteration rounds and without any other tricks, indicating that k-means quantization has great potential for development.
Table 5. Comparison between k-means quantization and group-wise linear quantization on BERT. The rightmost column is the average time cost of k-means quantization relative to group-wise linear quantization on RTE and MRPC. (In group-wise quantization, each matrix is divided into different groups, and each group is quantized separately. For the forward pass, the model needs to reconstruct each quantization group of each layer separately, rather than directly reconstructing the entire weight matrix of each quantized layer. This explains why group-wise quantization is very time-consuming. Specifically, in our group-wise quantization experiments, we divide each matrix into 128 groups.)
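The grouped scheme described in the parenthetical can be sketched as follows (our own minimal NumPy version, reusing plain linear quantization inside each group; names and example values are assumptions):

```python
import numpy as np

def linear_dequant(g, k):
    # Linearly quantize a 1-D group to k bits and return the
    # dequantized values (each entry replaced by its cluster mean).
    lo, hi = g.min(), g.max()
    if hi == lo:
        return g.astype(float).copy()
    q = np.clip(np.floor((g - lo) / ((hi - lo) / 2 ** k)).astype(int),
                0, 2 ** k - 1)
    means = np.array([g[q == c].mean() if (q == c).any() else 0.0
                      for c in range(2 ** k)])
    return means[q]

def group_linear_quantize(w, k, n_groups=128):
    # Split the flattened matrix into n_groups groups and quantize
    # each group separately with its own min/max range.
    flat = w.reshape(-1)
    parts = [linear_dequant(g, k) for g in np.array_split(flat, n_groups)]
    return np.concatenate(parts).reshape(w.shape)

w = np.arange(16.0).reshape(4, 4)
w_hat = group_linear_quantize(w, k=2, n_groups=4)
```

Because every group carries its own range and mean table, the forward pass must reconstruct group by group, which is what makes this variant slow relative to whole-matrix k-means quantization.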
4.3.2 ALBERT
In general, the two main conclusions drawn from the BERT experiments still hold. As shown in Table 3, Table 4, and Fig. 4, we again see the large gain brought by improving the quantization scheme and the great potential of k-means quantization. However, some anomalous results are worth discussing.
The effect of the number of k-means iteration rounds. The first set of anomalous results comes from the 1-bit quantization of QNLI, MRPC, and STS-B. Although the results of k-means quantization are generally better than those of linear quantization, these three sets of results do not follow this rule. We believe this is because the parameter distribution is so complex that k-means cannot give good clustering results with only 3 iterations. To verify this theory and further investigate the effect of the number of iteration rounds, we repeated the experiments on these anomalous results with the maximum number of iteration rounds expanded to 5, 10, and 20. The corresponding results are shown in Table 6. With more iteration rounds, k-means quantization performs better and eventually surpasses linear quantization. However, an overfitting problem exists: as the maximum number of iteration rounds increases from 10 to 20, the quantization performance on both QNLI and STS-B drops significantly. Therefore, in k-means quantization, the maximum number of k-means iteration rounds is also an important hyper-parameter that needs to be searched carefully.
Table 6. 1-bit quantization performance of ALBERT under different maximum numbers of k-means iteration rounds.
Fig. 5 shows a comparison of the average scores of the 8 GLUE tasks for the k-means-quantized BERT and ALBERT models.
Fig. 6 shows a performance comparison of the k-means-quantized BERT and ALBERT models. Each value is the average score of the quantized model as a percentage of the score of the full-precision model.
CoLA 0 and MRPC 68.4. Another set of anomalous results comes from the linear quantization of CoLA and MRPC, which are two classification tasks. We find that after fine-tuning, the quantized model always outputs "1"; the scores of 0 and 68.4 are determined only by the label distribution of the validation set. In other words, after being quantized to 1-5 bits by linear quantization, the model is almost dead and is difficult to train on these two tasks. We further experimented with quantizing the model to higher bit widths on the two tasks and found that, starting from 6-bit quantization, the performance of the quantized model is no longer 0 and 68.4.
Comparison between BERT and ALBERT. Furthermore, we compared the k-means quantization performance of BERT and ALBERT; the results are shown in Figs. 5 and 6. Whereas BERT retains 96.1% of its original performance after k-means 2-bit quantization, the performance of ALBERT has already dropped to 93.4% and 72.5% after k-means 4-bit and 3-bit quantization, respectively. Thus, ALBERT is less robust to quantization (in our work, robustness to quantization means the ability to be quantized to a lower bit width while maintaining high performance). Considering that the main improvement of ALBERT over BERT is parameter sharing, and that quantization can also be regarded as intra-layer parameter sharing, we speculate that parameter sharing and quantization have similar effects, meaning that the redundant information removed by parameter sharing and by quantization partially overlaps. Since ALBERT has already removed a lot of redundant information through parameter sharing (the total number of parameters drops from 108M to 12M) compared with BERT, further applying quantization to ALBERT easily damages useful information, resulting in its poor robustness to quantization. From another point of view, however, parameter sharing greatly reduces the number of parameters and can thus itself be considered a model compression method. Moreover, considering that full-precision ALBERT performs better than the 4-bit and 3-bit BERT models that occupy similar GPU memory, parameter sharing can even achieve better compression performance than quantization without any tricks. However, as a compression method, parameter sharing has a non-negligible disadvantage: it only reduces memory consumption, whereas most other compression methods reduce both memory consumption and computational cost (i.e., real time overhead).
5 Conclusion
Here we compared k-means quantization and linear quantization on the BERT and ALBERT models and reached three main conclusions. First, we find that models quantized with k-means perform significantly better than models quantized linearly; merely improving the underlying quantization scheme yields a huge performance improvement. Second, k-means quantization can compress the model to a relatively low bit width while maintaining high performance, even with a simple fixed-precision compression strategy and without any other tricks. This suggests that k-means quantization has great potential for development. Third, the number of k-means iteration rounds plays an important role in the performance of the quantized model and should be determined carefully. Furthermore, by comparing the k-means quantization results of BERT and ALBERT, we find that ALBERT is less robust to quantization than BERT. This suggests that parameter sharing and quantization have somewhat similar effects; thus, further applying quantization to a model with extensive parameter sharing is more likely to damage useful information, resulting in significant performance degradation.
Referring to fig. 7, a block diagram of a pre-training language model quantization apparatus according to an embodiment of the invention is shown.
As shown in fig. 7, the pre-training language model quantization apparatus 700 includes a first fine tuning module 710, a cluster compression module 720, and a second fine tuning module 730.
The first fine tuning module 710 is configured to perform a first fine-tuning of the pre-trained language model on a downstream task; the cluster compression module 720 is configured to cluster the data in the weight matrices of all embedding layers and all linear layers except the classification layer of the fine-tuned model using k-means clustering, with the number of categories set to 2^n, where n is the number of bits occupied by each data element of the compressed target model; and the second fine tuning module 730 is configured to perform a second fine-tuning of the quantized model on the downstream task while maintaining quantization, finally obtaining the quantized network.
In some alternative embodiments, the cluster compression module 720 is further configured to: partition the data into 2^k clusters using k-means++ initialization and initialize 2^k means for the 2^k clusters; classify each data point into the nearest cluster according to its relation to each mean; after all data points are classified, update each mean to the average of all data points of its cluster; and repeat the reclassification and centroid updating until convergence or a preset number of iterations is reached.
It should be understood that the modules depicted in fig. 7 correspond to the various steps in the method described with reference to fig. 1. Thus, the operations and features described above for the method and the corresponding technical effects are equally applicable to the modules in fig. 7, and are not described here again.
It should be noted that the modules in the embodiments of the present application do not limit the solution of the present application; for example, the receiving module may be described as a module that receives a speech recognition request. In addition, the related functional modules may be implemented by a hardware processor; for example, the receiving module may also be implemented by a processor, which is not described in detail here.
In other embodiments, embodiments of the present invention further provide a non-volatile computer storage medium having stored thereon computer-executable instructions for performing the pre-trained language model quantization method of any of the method embodiments described above;
as one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
performing first fine tuning on the pre-trained language model on a downstream task;
clustering data in the weight matrices of all embedding layers and all linear layers except the classification layer of the fine-tuned model using k-means clustering, and setting the number of categories to 2^n, where n is the number of bits occupied by each data element of the compressed target model;
and performing secondary fine tuning on the quantized model on the downstream task under the condition of maintaining quantization, and finally obtaining the quantized network.
The non-transitory computer readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from the use of the pre-trained language model quantization means, etc. Further, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes a memory remotely located with respect to the processor, the remote memory being connectable to the pre-trained language model quantification apparatus through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform any of the pre-trained language model quantification methods described above.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, as shown in fig. 8, where the device includes: one or more processors 810, and a memory 820, one processor 810 being illustrated in fig. 8. The apparatus for pre-training the language model quantization method may further include: an input device 830 and an output device 840. Processor 810, memory 820, input device 830, and output device 840 may be connected by a bus or other means, for example in fig. 8. Memory 820 is the non-volatile computer-readable storage medium described above. The processor 810 performs various functional applications of the server and data processing, i.e., implements the pre-trained language model quantization method of the method embodiment described above, by running non-volatile software programs, instructions, and modules stored in the memory 820. The input device 830 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the pre-trained language model quantization device. The output device 840 may include a display device such as a display screen.
The product can execute the method provided by the embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present invention.
As an embodiment, the electronic device is applied to a pre-training language model quantization apparatus, and includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to:
performing first fine tuning on the pre-trained language model on a downstream task;
clustering data in the weight matrices of all embedding layers and all linear layers except the classification layer of the fine-tuned model using k-means clustering, and setting the number of categories to 2^n, where n is the number of bits occupied by each data element of the compressed target model;
and performing secondary fine tuning on the quantized model on the downstream task under the condition of maintaining quantization, and finally obtaining the quantized network.
The electronic device of the embodiments of the present application exist in a variety of forms including, but not limited to:
(1) A mobile communication device: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice, data communications. Such terminals include smart phones (e.g., iPhone), multimedia phones, functional phones, and low-end phones, among others.
(2) Ultra mobile personal computer device: such devices are in the category of personal computers, having computing and processing functions, and generally also having mobile internet access characteristics. Such terminals include: PDA, MID, and UMPC devices, etc., such as iPad.
(3) Portable entertainment device: such devices may display and play multimedia content. Such devices include audio, video players (e.g., iPod), palm game consoles, electronic books, and smart toys and portable car navigation devices.
(4) Server: a device that provides computing services. The composition of a server is similar to a general computer architecture, but because highly reliable services must be provided, the requirements on processing capacity, stability, reliability, security, scalability, and manageability are high.
(5) Other electronic devices with data interaction function.
The apparatus embodiments described above are merely illustrative, wherein elements illustrated as separate elements may or may not be physically separate, and elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (8)
1. A method of pre-training language model improvement, wherein the pre-training language model is BERT or ALBERT, the method comprising:
performing first fine tuning on the pre-trained language model on a downstream task;
clustering data in the weight matrices of all embedding layers and all linear layers except the classification layer of the fine-tuned model using k-means clustering, and setting the number of categories to 2^n, where n is the number of bits occupied by each data element of the compressed target model;
and performing secondary fine tuning on the quantized model on the downstream task under the condition of maintaining quantization, and finally obtaining the quantized network.
2. The method of claim 1, wherein clustering data in the weight matrices of all embedding layers and all linear layers except the classification layer of the fine-tuned model using k-means clustering comprises:
partitioning the data into 2^k clusters using k-means++ initialization and initializing 2^k means for the 2^k clusters;
classifying each data into a nearest cluster according to the relation between the data and each mean value;
after all the data are classified, updating each mean to the average of all the data belonging to its cluster;
and repeatedly reclassifying each data and updating the mean value until convergence is met or a preset maximum iteration round number is reached.
3. The method of claim 2, wherein partitioning the data into 2^k clusters using k-means++ initialization and initializing 2^k means for the 2^k clusters comprises:
selecting a random data point from the data as the first mean;
assigning each remaining data point a likelihood of being the next mean according to its minimum distance to all existing means, and selecting the next mean according to these likelihoods;
repeating the likelihood calculation and mean selection until all 2^k means are generated.
4. The method of claim 2, wherein the preset maximum number of iteration rounds is 3.
5. The method according to claim 1, wherein the quantized network restores the original weight matrix from the class of each data element and the mean of each class, that is, each data element is replaced by the mean of its corresponding class;
when the quantized network propagates backward, a gradient descent method is used to update the network parameters, in particular the quantized weight matrix: the gradients of the elements in the same class are averaged, and each mean is updated using that average as the gradient of its class.
6. A pre-training language model improvement apparatus, wherein the pre-training language model is BERT or ALBERT, the apparatus comprising:
a first fine tuning module configured to perform a first fine tuning of the pre-trained language model on a downstream task;
the cluster compression module is configured to cluster the data in the weight matrices of all embedding layers and all linear layers except the classification layer of the fine-tuned model using k-means clustering, and set the number of categories to 2^n, where n is the number of bits occupied by each data element of the compressed target model;
And the second fine tuning module is configured to perform second fine tuning on the quantized model on the downstream task under the condition of maintaining quantization, and finally obtain a quantized network.
7. The apparatus of claim 6, wherein the cluster compression module is further configured to:
partitioning the data into 2^k clusters using k-means++ initialization and initializing 2^k means for the 2^k clusters;
classifying each data into a nearest cluster according to the relation between the data and each mean value;
after all the data are classified, updating each mean to the average of all the data belonging to its cluster;
and repeatedly reclassifying each data and updating the mean value until convergence is met or a preset maximum iteration round number is reached.
8. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010636126.3A CN111814448B (en) | 2020-07-03 | 2020-07-03 | Pre-training language model quantization method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111814448A CN111814448A (en) | 2020-10-23 |
CN111814448B true CN111814448B (en) | 2024-01-16 |
Family
ID=72856262
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010636126.3A Active CN111814448B (en) | 2020-07-03 | 2020-07-03 | Pre-training language model quantization method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111814448B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112100383B (en) * | 2020-11-02 | 2021-02-19 | 之江实验室 | Meta-knowledge fine tuning method and platform for multitask language model |
GB2609768A (en) * | 2020-11-02 | 2023-02-15 | Zhejiang Lab | Multi-task language model-oriented meta-knowledge fine tuning method and platform |
US20240104346A1 (en) * | 2022-09-15 | 2024-03-28 | Huawei Technologies Co., Ltd. | Method and device for compressing generative pre-trained language models via quantization |
CN118278535A (en) * | 2022-12-30 | 2024-07-02 | 中国电信股份有限公司 | Fine tuning method, device, equipment, medium and program for pre-training model |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106897734A (en) * | 2017-01-12 | 2017-06-27 | 南京大学 | K average clusters fixed point quantization method heterogeneous in layer based on depth convolutional neural networks |
CN107944553A (en) * | 2017-11-22 | 2018-04-20 | 浙江大华技术股份有限公司 | A kind of method for trimming and device of CNN models |
CN108415888A (en) * | 2018-02-12 | 2018-08-17 | 苏州思必驰信息科技有限公司 | Compression method and system for neural network language model |
CN110363281A (en) * | 2019-06-06 | 2019-10-22 | 上海交通大学 | A kind of convolutional neural networks quantization method, device, computer and storage medium |
CN110377686A (en) * | 2019-07-04 | 2019-10-25 | 浙江大学 | A kind of address information Feature Extraction Method based on deep neural network model |
CN110489555A (en) * | 2019-08-21 | 2019-11-22 | 创新工场(广州)人工智能研究有限公司 | A kind of language model pre-training method of combination class word information |
CN110597986A (en) * | 2019-08-16 | 2019-12-20 | 杭州微洱网络科技有限公司 | Text clustering system and method based on fine tuning characteristics |
CN111340186A (en) * | 2020-02-17 | 2020-06-26 | 之江实验室 | Compressed representation learning method based on tensor decomposition |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180107926A1 (en) * | 2016-10-19 | 2018-04-19 | Samsung Electronics Co., Ltd. | Method and apparatus for neural network quantization |
Non-Patent Citations (2)
Title |
---|
Effectiveness of Self-Supervised Pre-Training for ASR; Alexei Baevski et al.; IEEE Xplore; full text * |
A Yangtze River Delta patent matching algorithm based on BERT+ATT and DBSCAN; Cao Xuyou; Zhou Zhiping; Wang Li; Zhao Weidong; Information Technology (Issue 3); full text * |
Also Published As
Publication number | Publication date |
---|---|
CN111814448A (en) | 2020-10-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111814448B (en) | Pre-training language model quantization method and device | |
CN109977212B (en) | Reply content generation method of conversation robot and terminal equipment | |
US11334819B2 (en) | Method and system for distributed machine learning | |
CN107967515B (en) | Method and apparatus for neural network quantization | |
CN110546656B (en) | Feedforward generation type neural network | |
US9400955B2 (en) | Reducing dynamic range of low-rank decomposition matrices | |
US20140156575A1 (en) | Method and Apparatus of Processing Data Using Deep Belief Networks Employing Low-Rank Matrix Factorization | |
CN108319988B (en) | Acceleration method of deep neural network for handwritten Chinese character recognition | |
CN108415888A (en) | Compression method and system for neural network language model | |
JP6950756B2 (en) | Neural network rank optimizer and optimization method | |
CN112687266B (en) | Speech recognition method, device, computer equipment and storage medium | |
CN112840358B (en) | Cursor-based adaptive quantization for deep neural networks | |
CN116976428A (en) | Model training method, device, equipment and storage medium | |
US20220092382A1 (en) | Quantization for neural network computation | |
US20220027719A1 (en) | Compressing tokens based on positions for transformer models | |
CN111324731B (en) | Computer-implemented method for embedding words of corpus | |
CN116524941A (en) | Self-adaptive quantization compression method and system for voice model and electronic equipment | |
CN116644797A (en) | Neural network model quantization compression method, electronic device and storage medium | |
CN116312607A (en) | Training method for audio-visual voice separation model, electronic device and storage medium | |
Tai et al. | Learnable mixed-precision and dimension reduction co-design for low-storage activation | |
CN116384471A (en) | Model pruning method, device, computer equipment, storage medium and program product | |
Shahnawazuddin et al. | Sparse coding over redundant dictionaries for fast adaptation of speech recognition system | |
CN106847268B (en) | Neural network acoustic model compression and voice recognition method | |
CN111368976B (en) | Data compression method based on neural network feature recognition | |
Macoskey et al. | Learning a neural diff for speech models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | ||
Address after: Building 14, Tengfei Innovation Park, 388 Xinping Street, Suzhou Industrial Park, Suzhou, Jiangsu Province, 215123. Applicant after: Sipic Technology Co., Ltd. Address before: Building 14, Tengfei Innovation Park, 388 Xinping Street, Suzhou Industrial Park, Suzhou, Jiangsu Province, 215123. Applicant before: AI SPEECH Co., Ltd. |
GR01 | Patent grant | ||