CN111814448B - Pre-training language model quantization method and device - Google Patents
- Publication number
- CN111814448B CN111814448B CN202010636126.3A CN202010636126A CN111814448B CN 111814448 B CN111814448 B CN 111814448B CN 202010636126 A CN202010636126 A CN 202010636126A CN 111814448 B CN111814448 B CN 111814448B
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a pre-trained language model quantization method and device, wherein the pre-trained language model quantization method comprises the following steps: performing a first fine-tuning of the pre-trained language model on a downstream task; clustering the data in the weight matrices of all embedding layers and all linear layers except the classification layer of the fine-tuned model using k-means clustering, with the number of clusters set to 2^n, where n is the number of bits occupied by each datum of the compressed target model; and performing a second fine-tuning of the quantized model on the downstream task while maintaining quantization, finally obtaining the quantized network. The scheme provided by the embodiments of the application shows that the influence of improving the underlying quantization scheme on the quantization effect has been greatly underestimated and ignored; it also shows that a very good compression effect can be achieved by simple k-means quantization without any tricks, so the k-means compression method has very large development space and application prospects.
Description
Technical Field
The invention belongs to the field of language model quantization, and particularly relates to a pre-trained language model quantization method and device.
Background
In the prior art, some quantization methods for pre-trained language models have already been presented, including 8-bit fixed-precision quantization and mixed-precision quantization based on the Hessian matrix.
8-bit fixed-precision quantization: all layers of the model that need to be quantized are quantized to 8 bits, and the model is then fine-tuned.
Mixed-precision quantization based on the Hessian matrix: the quantization precision of each layer is determined from the Hessian matrix of that layer's parameters. The larger the top eigenvalues of a layer's Hessian matrix, the higher the quantization precision assigned to that layer; conversely, layers with smaller eigenvalues receive lower precision. Quantization is followed by fine-tuning.
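The ranking idea behind such Hessian-based mixed precision can be sketched in a few lines. This is only a toy illustration of "larger eigenvalue, more bits", not the published method's exact assignment rule; the function name and the equal-group heuristic are our own assumptions:

```python
def assign_precisions(eigvals, bit_choices=(8, 4, 2)):
    """Toy bit assignment: layers whose parameters have larger top Hessian
    eigenvalues get more bits. Illustrates the ranking idea only."""
    order = sorted(range(len(eigvals)), key=lambda i: -eigvals[i])
    bits = [0] * len(eigvals)
    # Split the ranked layers into len(bit_choices) roughly equal groups.
    group = max(1, -(-len(eigvals) // len(bit_choices)))
    for rank, layer in enumerate(order):
        bits[layer] = bit_choices[min(rank // group, len(bit_choices) - 1)]
    return bits
```

For example, three layers with top eigenvalues 5.0, 0.1 and 1.0 would receive 8, 2 and 4 bits respectively under this heuristic.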
The underlying quantization scheme in both of the above methods is linear quantization. That is, each tensor to be quantized is quantized linearly on its own: the maximum and minimum values of the parameters in the tensor are found first, and the range between them is divided into equal parts — 2^n parts, i.e. 2^n classes, when quantizing to n bits. The mean of all parameters belonging to a class is taken as that class's central value, and each parameter is replaced by the central value of the class to which it belongs. The tensor is thus replaced by a tensor storing the central value of each class and a tensor storing the class index of each parameter.
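The linear quantization just described can be sketched as follows (a minimal plain-Python illustration; the function names are ours, not from any cited work):

```python
def linear_quantize(v, n_bits):
    """Linearly quantize a list of floats into 2**n_bits equal-width buckets.

    Returns (labels, centers): labels[i] is the bucket index of v[i], and
    centers[c] is the mean of all values that fell into bucket c.
    """
    lo, hi = min(v), max(v)
    k = 2 ** n_bits
    width = (hi - lo) / k or 1.0  # guard against a constant vector
    # Assign each value to one of k equal-width buckets over [lo, hi].
    labels = [min(int((x - lo) / width), k - 1) for x in v]
    # The center of each bucket is the mean of its members (empty -> 0.0).
    centers = []
    for c in range(k):
        members = [x for x, lab in zip(v, labels) if lab == c]
        centers.append(sum(members) / len(members) if members else 0.0)
    return labels, centers

def dequantize(labels, centers):
    """Rebuild the approximate vector: each value becomes its bucket's mean."""
    return [centers[lab] for lab in labels]
```

For instance, 1-bit quantization of [0.0, 0.1, 0.9, 1.0] yields two buckets with centers 0.05 and 0.95, so the stored tensor pair is ([0, 0, 1, 1], [0.05, 0.95]).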
The inventors found in the process of implementing the present application that the existing solution has at least the following drawbacks:
the compression effect of linear quantization is limited: the performance of the quantized model drops sharply at low precision, which prevents the model from being compressed to very low precision.
Linear quantization is not a good clustering method. The quantized vector does not represent the parameter distribution of the original vector well.
Disclosure of Invention
The embodiments of the invention provide a pre-trained language model quantization method and device, so as to solve at least one of the technical problems above.
In a first aspect, an embodiment of the present invention provides a method for quantizing a pre-trained language model, including: performing a first fine-tuning of the pre-trained language model on a downstream task; clustering the data in the weight matrices of all embedding layers and all linear layers except the classification layer of the fine-tuned model using k-means clustering, with the number of clusters set to 2^n, where n is the number of bits occupied by each datum of the compressed target model; and performing a second fine-tuning of the quantized model on the downstream task while maintaining quantization, finally obtaining a quantized network.
In a second aspect, an embodiment of the present invention provides a pre-trained language model quantization apparatus, including: a first fine-tuning module configured to perform a first fine-tuning of the pre-trained language model on a downstream task; a cluster compression module configured to cluster the data in the weight matrices of all embedding layers and all linear layers except the classification layer of the fine-tuned model using k-means clustering, with the number of clusters set to 2^n, where n is the number of bits occupied by each datum of the compressed target model; and a second fine-tuning module configured to perform a second fine-tuning of the quantized model on the downstream task while maintaining quantization, finally obtaining a quantized network.
In a third aspect, there is provided an electronic device, comprising: the system comprises at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the pre-trained language model quantization method of any of the embodiments of the present invention.
In a fourth aspect, embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the pre-trained language model quantization method of any of the embodiments of the present invention.
The scheme provided by the method and device of the present application shows that the influence of improving the underlying quantization scheme on the quantization effect has been greatly underestimated and ignored; it also shows that a very good compression effect can be achieved by simple k-means quantization without any tricks, so the k-means compression method has very large development space and application prospects.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a pre-trained language model quantization method according to an embodiment of the present invention;
FIG. 2 shows the algorithm for k-means quantization in the pre-trained language model quantization method according to an embodiment of the present invention;
FIG. 3 is a comparison of the average scores on 8 GLUE tasks of linear and k-means quantization on the BERT model, for a specific embodiment of the pre-trained language model quantization scheme of the present invention;
FIG. 4 is a comparison of the average scores on 8 GLUE tasks of linear and k-means quantization on the ALBERT model, for a specific embodiment of the pre-trained language model quantization scheme of the present invention;
FIG. 5 is a comparison of the average scores on 8 GLUE tasks of the k-means-quantized BERT and ALBERT models, for a specific embodiment of the pre-trained language model quantization scheme of the present invention;
FIG. 6 is a comparison of the performance of the k-means-quantized BERT and ALBERT models, each value being the average score of the quantized model as a percentage of the full-precision model's score, for a specific embodiment of the pre-trained language model quantization scheme of the present invention;
FIG. 7 is a block diagram of a pre-training language model quantization apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to FIG. 1, a flowchart of an embodiment of the pre-trained language model quantization method of the present application is shown; the application scenario is not limited herein.
As shown in fig. 1, in step 101, a pre-trained language model is first fine-tuned on a downstream task;
in step 102, the data in the weight matrices of all embedding layers and all linear layers except the classification layer of the fine-tuned model is clustered using k-means clustering, with the number of clusters set to 2^n, where n is the number of bits occupied by each datum of the compressed target model;
in step 103, the quantized model is subjected to a second fine tuning on the downstream task under the condition of maintaining quantization, and finally a quantized network is obtained.
In this embodiment, for each selected task, the following experiments are performed in sequence: fine-tuning the pre-trained models (e.g., BERT and ALBERT) on the downstream task; quantizing the task-specific model; and fine-tuning the quantized model. The performance of the resulting model is then tested on the validation set of each selected task.
To avoid the impact of other tricks, we apply the two quantization schemes (linear and k-means) following only a fixed-precision quantization strategy, without using any tricks. We quantize all weights of the embedding layer and of the fully connected layers (except the classification layer). After quantization, each weight vector is represented by the corresponding cluster index vector and mean vector, and each parameter of the weight vector is replaced by the mean of the cluster to which it belongs.
After model quantization, we fine-tune the model on the corresponding downstream task while maintaining quantization. For the forward pass, we reconstruct each quantized layer from its cluster index vector and mean vector. For the backward pass, we update the quantized parameters by training the mean vector, while the remaining parameters are updated normally. More specifically, the gradient of each parameter in the mean vector is calculated as the average of the gradients of the parameters belonging to the corresponding cluster. The mean vector is then updated by back-propagation like any other parameter.
In some alternative embodiments, clustering the data in the weight matrices of all embedding layers and all linear layers of the fine-tuned model except the classification layer using k-means clustering includes: partitioning the data into 2^k clusters (where k is as defined for n above) using k-means++ initialization, and initializing 2^k means for the 2^k clusters; classifying each datum into the nearest cluster according to its relation to each mean; after all data are classified, updating each mean to the average of all the data in its cluster; and repeatedly reclassifying each datum and updating the means until convergence is met or a preset maximum number of iteration rounds is reached.
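The clustering steps above can be sketched as follows (a minimal plain-Python illustration of the assign/update iterations on scalar weight data; in this sketch the initial means are passed in, however they were seeded):

```python
def kmeans_1d(data, means, max_rounds=3):
    """Assign/update (Lloyd) iterations on scalar data with given initial means.

    The embodiment caps the iterations at 3 rounds; we keep that default.
    Returns (labels, means) after convergence or max_rounds.
    """
    labels = [0] * len(data)
    for _ in range(max_rounds):
        # Step 1: classify every datum into the cluster of its nearest mean.
        new_labels = [min(range(len(means)), key=lambda c: abs(x - means[c]))
                      for x in data]
        # Step 2: update each mean to the average of its assigned data
        # (an empty cluster keeps its previous mean).
        new_means = []
        for c in range(len(means)):
            members = [x for x, lab in zip(data, new_labels) if lab == c]
            new_means.append(sum(members) / len(members) if members else means[c])
        if new_labels == labels and new_means == means:
            break  # converged: assignments and means stopped changing
        labels, means = new_labels, new_means
    return labels, means
```

For example, data [0.0, 0.2, 1.0, 1.2] with initial means [0.0, 1.0] converges in one round to labels [0, 0, 1, 1] and means [0.1, 1.1].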
In some alternative embodiments, partitioning the data into 2^k clusters and initializing the 2^k means for the 2^k clusters using k-means++ initialization includes: selecting one random datum from the data as the first mean; assigning each remaining datum a probability of being the next mean according to its minimum distance from the existing means, and selecting the next mean according to these probabilities; and repeating the probability calculation and mean selection until all 2^k means are generated.
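This k-means++ seeding can be sketched as follows. Note one assumption: the standard k-means++ algorithm weights each candidate by its squared distance to the nearest existing mean, which is what we show; the text above says only "minimum distance", so the exact weighting used in the embodiment may differ:

```python
import random

def kmeans_pp_init(data, k, rng=None):
    """k-means++ seeding: pick the first mean uniformly at random, then pick
    each subsequent mean with probability proportional to its squared
    distance to the nearest mean chosen so far."""
    rng = rng or random.Random(0)
    means = [rng.choice(data)]
    while len(means) < k:
        # Squared distance from every datum to its closest existing mean.
        d2 = [min((x - m) ** 2 for m in means) for x in data]
        total = sum(d2)
        if total == 0:  # all data already coincide with chosen means
            means.append(means[0])
            continue
        # Sample the next mean with probability d2[i] / total.
        r = rng.random() * total
        acc = 0.0
        for x, w in zip(data, d2):
            acc += w
            if acc >= r:
                means.append(x)
                break
        else:
            means.append(data[-1])
    return means
```

With two well-separated values, whichever is picked first, the other is certain to be picked second, since the first has zero weight.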
Further optionally, the preset maximum number of iteration rounds is set to 3.
In other alternative embodiments, when the quantized network performs the forward computation, the original weight matrix is restored from the stored class of each datum and the mean of each class, i.e., each datum is replaced by the mean of its corresponding class; when the quantized network performs the backward computation, the network parameters, in particular the quantized weight matrices, are updated with a gradient descent method, where the gradients of the elements in the same class are averaged and the average is used as the gradient of that class's mean to update each mean.
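The forward reconstruction and averaged-gradient update can be sketched framework-free as follows (in practice this would be implemented with autograd hooks in a framework such as PyTorch; the helper names here are our own):

```python
def reconstruct(labels, means):
    """Forward pass: rebuild the weight matrix from per-entry cluster
    indices and the per-cluster means."""
    return [[means[c] for c in row] for row in labels]

def mean_gradients(labels, grads, n_clusters):
    """Backward pass: the gradient of each cluster mean is the average of
    the gradients of all weights assigned to that cluster."""
    sums = [0.0] * n_clusters
    counts = [0] * n_clusters
    for label_row, grad_row in zip(labels, grads):
        for c, g in zip(label_row, grad_row):
            sums[c] += g
            counts[c] += 1
    return [s / n if n else 0.0 for s, n in zip(sums, counts)]

def sgd_step(means, mean_grads, lr):
    """Update only the cluster means; the cluster assignments stay fixed."""
    return [m - lr * g for m, g in zip(means, mean_grads)]
```

Only the mean vector is trained; the index matrix is frozen, so the number of distinct weight values never grows beyond 2^n.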
Further optionally, the pre-trained language model is BERT (Bidirectional Encoder Representations from Transformers) or ALBERT.
The following description is given to better understand the aspects of the present application by describing some of the problems encountered by the inventor in carrying out the present invention and one specific embodiment of the finally-determined aspects.
In order to improve the compression rate of pre-trained language models or the performance of the compressed models, most existing work relies on introducing additional tricks, such as variable-precision compression and group-wise compression; these either bring limited improvement or increase inference time by tens of times. The improvement that can be brought about by improving the underlying quantization scheme itself has been greatly underestimated, so few attempts have been made in this direction.
By changing the underlying quantization scheme from linear clustering to k-means clustering, the rationality of the grouping is greatly improved, and the pre-trained language model can be compressed to less than 15% of its original size while still maintaining more than 90% of the original model's performance.
The method comprises the following specific steps:
1) Fine-tuning the pre-trained language model on a specific downstream task;
2) Using k-means clustering, cluster the data in the weight matrices of all embedding layers and linear layers of the model except the classification layer, with the number of clusters set to 2^n (where n is the number of bits occupied by each datum of the compressed target model); initialize with the k-means++ initialization method, and set the maximum number of iteration rounds of the k-means method to 3;
3) Fine-tune the quantized model again on the corresponding downstream task while maintaining quantization, finally obtaining the quantized network.
In addition, when the quantized network performs the forward computation, the original weight matrix is restored from the stored class of each datum and the mean of each class, i.e., each datum is replaced by the mean of its corresponding class; in the backward computation, the network parameters, in particular the quantized weight matrices, are updated with a gradient descent method, where the gradients of the elements in the same class are averaged and used as the gradient of that class's mean to update each mean.
This scheme shows that the influence of improving the underlying quantization scheme on the quantization effect has been greatly underestimated and ignored; it also shows that a very good compression effect can be achieved by simple k-means quantization without any tricks, so the k-means compression method has very large development space and application prospects.
The following describes the process of implementing the embodiments of the present application, and some experimental procedures and corresponding experimental data in the process, so that those skilled in the art can better understand the technical solutions of the present application.
Recently, pre-trained language models like BERT have shown excellent performance on a variety of natural language processing tasks. However, the application of these models is limited by the large amount of space they require. One widely studied and effective way to reduce the size of a network is quantization. However, most of the efforts on BERT quantization use the rudimentary linear clustering method as the quantization scheme, and few efforts have been made to improve the scheme itself. This greatly limits the performance of quantization. Here we implemented k-means quantization and compared its performance with linear quantization under fixed-precision quantization of BERT. By this comparison, we verify that the performance gain from improving the underlying quantization scheme has been greatly underestimated, and that k-means quantization has great potential for development. Furthermore, we also compared the performance of the two quantization schemes on the ALBERT model to explore the difference in quantization robustness between different pre-trained models.
Keyword: k-means quantization, linear quantization, pre-training language model, GLUE.
Introduction to 1
Pre-trained models based on the self-attention mechanism (Transformers) have recently achieved state-of-the-art performance on various Natural Language Processing (NLP) tasks such as sequence tagging and sentence classification. Among them, the BERT model, based on the Transformer architecture, has attracted the most attention due to its excellent performance and versatility. However, the memory and computational consumption of these models is prohibitive. Even relatively small versions of the BERT model (e.g., the BERT-base model) contain over 100 million parameters. This over-parameterization makes deploying BERT models on resource-constrained devices such as smartphones and robots challenging. Therefore, compressing these models is an important need for industry.
One popular and efficient method for model compression is quantization. To reduce the size of the model, quantization represents the parameters of the model with fewer bits than the original 32. With proper hardware, quantization can greatly reduce the memory footprint and speed up computation. Much work in the computer vision field focuses on quantizing models, while much less has been done in NLP. Prior work on Transformer quantization successfully quantized Transformer models to 8 or 4 bits while maintaining comparable performance. However, to our knowledge, there are only two published works on BERT quantization. One of them applies 8-bit fixed-precision linear quantization to the BERT model and achieves a 4x compression rate with little degradation in accuracy. The other improves quantization performance by group-wise mixed-precision linear quantization based on the Hessian matrices of the parameter tensors.
However, as the underlying quantization scheme, most of the Transformer quantization works mentioned above, and especially the BERT quantization works, use linear clustering. Although it is fast and simple, its quantization result does not represent the original data distribution well. While some BERT quantization works achieve higher compression rates without improving the quantization scheme, the group-wise quantization methods they develop are quite time consuming and add significant latency. Although it is natural to believe that replacing linear clustering with a better clustering method may improve the performance of the quantized model, the effect of improving the quantization scheme has been underestimated. Therefore, here we explore the effect of simply improving the quantization scheme from linear clustering to k-means clustering and compare the performance of both schemes. Furthermore, to see the impact on other pre-trained language models, we also compared the two quantization schemes on ALBERT, an improved model of BERT.
In general, we applied k-means and linear quantization to BERT and ALBERT and tested their performance on the GLUE task set. In this way we verify that a simple improvement of the quantization scheme can lead to a great improvement in performance, and that simple k-means clustering has great potential as a BERT quantization scheme. Furthermore, we show that the number of k-means iteration rounds plays an important role in k-means quantization. By further comparison, we found that ALBERT is less robust to quantization than BERT, because parameter sharing reduces the redundancy of the parameters.
2 background: BERT and ALBERT
In this section, we briefly introduce the architecture of the BERT and ALBERT models and point out the version of the model we use in the experiments.
2.1 BERT
The BERT models are a special class of Transformer-based pre-trained networks. They consist mainly of an embedding layer, encoder blocks and an output layer; there is no decoder block in the BERT model. Each encoder block contains one self-attention layer (including three parallel linear layers corresponding to queries, keys and values) and 3 feed-forward layers (each containing one linear layer).
For each self-attention layer, BERT further improves performance using the multi-head technique. For each self-attention head there are three weight matrices W_q, W_k and W_v, where W_q, W_k, W_v ∈ R^{d×(d/h)} (h is the number of heads in each self-attention layer). Let X ∈ R^{l×d} denote the input of the corresponding self-attention layer. The output of a self-attention head is then calculated as

    head(X) = softmax( (X W_q)(X W_k)^T / sqrt(d/h) ) X W_v.

Then, for each self-attention layer, the outputs of all its self-attention heads are concatenated to produce the output of the layer.
Specifically, in our work we performed the following experiment using the BERT-base-uncased version of the BERT model, which has 12 encoder blocks, with 12 self-attention heads per self-attention layer.
2.2 ALBERT
Compared to BERT, ALBERT makes three major improvements. First, the ALBERT model factorizes the parameters of the embedding layer into the product of two smaller matrices. Second, it employs cross-layer parameter sharing to improve parameter efficiency. These two improvements significantly reduce the total number of parameters and make the model more efficient; in addition, parameter sharing can also stabilize network parameters. Third, in pre-training it replaces the next-sentence prediction (NSP) loss with a sentence-order prediction (SOP) loss. This focuses the model on modeling inter-sentence coherence rather than topic prediction, and improves performance on multi-sentence encoding tasks.
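The saving from the embedding factorization is easy to quantify: a V×d embedding table becomes a V×e table times an e×d projection. The sizes below (V=30000, d=768, e=128) are illustrative BERT/ALBERT-like values we assume for the sketch, not figures from this patent:

```python
def embedding_params(vocab, hidden, bottleneck=None):
    """Parameter count of an embedding table, optionally factorized into
    a vocab x bottleneck matrix and a bottleneck x hidden matrix
    (ALBERT-style factorization)."""
    if bottleneck is None:
        return vocab * hidden          # one dense V x d table
    return vocab * bottleneck + bottleneck * hidden  # V x e plus e x d

full = embedding_params(30000, 768)            # unfactorized, BERT-style
factored = embedding_params(30000, 768, 128)   # factorized, ALBERT-style
```

Under these assumed sizes the factorized embedding uses roughly one-sixth of the parameters of the dense table.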
Specifically, here we use the ALBERT-base-v2 version of the ALBERT model, which also has 12 encoder blocks (all parameters are shared between layers), with 12 self-attention heads per self-attention layer.
3 theory of methods
In this section, we first introduce the quantization process in the experiment, and then explain the two quantization schemes we use in detail.
3.1 Overview
To compare the linear and k-means quantization schemes on Transformer-based pre-trained models, we tested the performance of the quantized models on different downstream tasks. Specifically, for each selected task, the following experiments are performed in sequence: fine-tuning the pre-trained models (BERT and ALBERT) on the downstream task; quantizing the task-specific model; and fine-tuning the quantized model. The performance of the resulting model is then tested on the validation set of each selected task.
To avoid the impact of other tricks, we apply the two quantization schemes (linear and k-means) following only a fixed-precision quantization strategy, without using any tricks. We quantize all weights of the embedding layer and of the fully connected layers (except the classification layer). After quantization, each weight vector is represented by the corresponding cluster index vector and mean vector, and each parameter of the weight vector is replaced by the mean of the cluster to which it belongs.
After model quantization, we fine-tune the model on the corresponding downstream task while maintaining quantization. For the forward pass, we reconstruct each quantized layer from its cluster index vector and mean vector. For the backward pass, we update the quantized parameters by training the mean vector, while the remaining parameters are updated normally. More specifically, the gradient of each entry of the mean vector is calculated as the average of the gradients of the parameters belonging to the corresponding cluster. The mean vector is then updated by the same back-propagation method.
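A toy sketch of this fine-tune-while-quantized scheme (plain NumPy with hypothetical values, not the authors' actual training code): the forward pass rebuilds the weights from the cluster index vector and mean vector, and the backward pass averages the weight gradients within each cluster to obtain the gradient of the mean vector.

```python
import numpy as np

def reconstruct(q, means):
    # Forward pass: rebuild the full weight vector from the
    # cluster index vector q and the mean vector.
    return means[q]

def mean_gradient(grad_w, q, n_clusters):
    # Backward pass: the gradient of each entry of the mean vector
    # is the average of the gradients of the parameters belonging
    # to the corresponding cluster.
    g = np.zeros(n_clusters)
    for c in range(n_clusters):
        mask = q == c
        if mask.any():
            g[c] = grad_w[mask].mean()
    return g

# One SGD step on the mean vector only (toy values).
q = np.array([0, 1, 1, 0])                 # cluster index vector
means = np.array([-0.5, 0.5])              # mean vector (trainable)
w = reconstruct(q, means)                  # [-0.5, 0.5, 0.5, -0.5]
grad_w = np.array([0.2, -0.4, 0.0, 0.2])   # gradient w.r.t. reconstructed w
means -= 0.1 * mean_gradient(grad_w, q, len(means))
```

In a real framework the reconstruction would happen inside the layer's forward method, with the mean vector registered as the only trainable quantized parameter.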
3.2 Linear quantization
Let us assume that we need to quantize the vector v to k bits (k-bit quantization). We first find its minimum v_min and maximum v_max, and then divide the range [v_min, v_max] into 2^k clusters.

Define the cluster-assignment function Q as

Q(v_i) = min( floor( (v_i - v_min) / ((v_max - v_min) / 2^k) ), 2^k - 1 ),

whose value lies between 0 and 2^k - 1, so each parameter v_i belongs to the Q(v_i)-th cluster. v_i is then replaced by the mean of the Q(v_i)-th cluster, i.e., the mean of all parameters belonging to that cluster. The quantization function is therefore

Quant(v_i) = ( Σ_j 1{Q(v_j) = Q(v_i)} · v_j ) / ( Σ_j 1{Q(v_j) = Q(v_i)} ),

where 1{statement} equals 1 when the statement in braces is true, and 0 otherwise.
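The scheme above can be sketched in plain NumPy as follows (the helper name and example values are our own; the code assumes v_max > v_min):

```python
import numpy as np

def linear_quantize(v, k):
    """k-bit linear quantization: split [v_min, v_max] into 2**k
    equal-width clusters, assign each parameter to a cluster, and
    return the cluster index vector plus the per-cluster means."""
    n_clusters = 2 ** k
    v_min, v_max = v.min(), v.max()
    width = (v_max - v_min) / n_clusters
    # Q(v_i): cluster id in [0, 2**k - 1]; v_max itself is clipped
    # into the last cluster.
    q = np.clip(np.floor((v - v_min) / width).astype(int), 0, n_clusters - 1)
    means = np.zeros(n_clusters)
    for c in range(n_clusters):
        if (q == c).any():
            means[c] = v[q == c].mean()
    return q, means

v = np.array([-1.0, -0.5, 0.0, 0.4, 0.9, 1.0])
q, means = linear_quantize(v, k=2)
v_hat = means[q]   # each parameter replaced by its cluster mean
```

Note that the cluster boundaries depend only on v_min and v_max, so a few outliers can waste most of the 2^k levels — the weakness that k-means quantization addresses.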
3.3 K-means quantization
Let us assume that we need to quantize the vector v to k bits (k-bit quantization). For k-means quantization, we partition the vector v into 2^k clusters using k-means clustering with k-means++ initialization.
We first use the k-means++ initialization method to initialize the means (μ_1, μ_2, ..., μ_{2^k}) for the clusters (c_1, c_2, ..., c_{2^k}). Then each parameter v_i is classified into its nearest cluster. After all parameters in v are classified, each mean is updated to the mean of all parameters belonging to its cluster. The reclassification of the parameters and the updating of the means are then repeated until the convergence condition is met or the maximum number of iteration rounds is reached. The k-means++ initialization proceeds as follows: first, a random parameter is selected from the vector v as the first mean; then, each remaining parameter is assigned a likelihood of becoming the next mean according to its minimum distance to all existing means, and the next mean is selected according to these likelihoods; finally, the likelihood calculation and mean selection are repeated until all 2^k centroids are generated. For the specific algorithm, please refer to Fig. 2.
To limit the efficiency loss introduced by the improved quantization scheme, we set the maximum number of k-means clustering iterations to 3. After k-means clustering is completed, we use the obtained cluster-number vector as the cluster index vector, and the mean of each cluster as the corresponding entry of the mean vector. Each parameter v_i is then replaced by the mean of the cluster to which it belongs.
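A minimal NumPy sketch of this procedure — k-means++ initialization followed by at most three rounds of reclassification and mean updates (the function name, random seed, and example values are our own):

```python
import numpy as np

def kmeans_quantize(v, k, max_iter=3, seed=0):
    """k-bit k-means quantization of a 1-D weight vector v:
    k-means++ initialization, then at most max_iter Lloyd rounds."""
    rng = np.random.default_rng(seed)
    n_clusters = 2 ** k
    # k-means++ initialization: the first mean is a random parameter;
    # each next mean is drawn with probability proportional to the
    # squared distance to the nearest existing mean.
    means = [v[rng.integers(len(v))]]
    while len(means) < n_clusters:
        d2 = np.min((v[:, None] - np.array(means)[None, :]) ** 2, axis=1)
        p = d2 / d2.sum() if d2.sum() > 0 else np.full(len(v), 1.0 / len(v))
        means.append(v[rng.choice(len(v), p=p)])
    means = np.array(means, dtype=float)
    # Lloyd iterations: classify each parameter into its nearest
    # cluster, then update each mean to the mean of its members.
    for _ in range(max_iter):
        q = np.argmin(np.abs(v[:, None] - means[None, :]), axis=1)
        for c in range(n_clusters):
            if (q == c).any():
                means[c] = v[q == c].mean()
    q = np.argmin(np.abs(v[:, None] - means[None, :]), axis=1)
    return q, means   # cluster index vector and mean vector

v = np.array([0.0, 0.1, 0.05, 5.0, 5.1, 4.9])
q, means = kmeans_quantize(v, k=1)   # 1-bit: 2 clusters
```

With only three Lloyd iterations this mirrors the efficiency compromise described above; raising max_iter corresponds to the later Table 6 experiments on iteration rounds.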
4 Experiments
In this section, we first introduce the data set we use in the experiment, then explain the details of our experiments performed on BERT and ALBERT, and finally show our experimental results and the corresponding discussion.
4.1 Data set
We tested the performance of the quantized models on the General Language Understanding Evaluation (GLUE) task set, which comprises NLU tasks such as question answering, sentiment analysis, and textual entailment. Specifically, we used 8 tasks (QNLI, CoLA, RTE, SST-2, MRPC, STS-B, MNLI, and QQP) to test the performance of the different quantization schemes. The evaluation metric of each task is as follows: CoLA uses the Matthews correlation coefficient (mcc); QNLI, RTE, SST-2, and MNLI use accuracy (acc); MRPC and QQP use accuracy (acc) and F1 score; STS-B uses the Pearson and Spearman correlation coefficients (corr). We follow the default partitioning of the datasets. The datasets can be downloaded at https://gluebenchmark.com.
4.2 Experimental details
Prior to quantization, we fine-tuned the BERT-base-uncased version of the BERT model on the 8 tasks using an Adam optimizer (initial learning rate 5e-5, linear decay). For the ALBERT model, we first fine-tuned the ALBERT-base-v2 model on QNLI, CoLA, SST-2, MNLI, and QQP, and then fine-tuned on RTE, MRPC, and STS-B starting from the MNLI fine-tuning result. We used a linearly decayed Adam optimizer to fine-tune ALBERT and searched {1e-5, 2e-5, 3e-5, 4e-5, 5e-5} for the initial learning rate of each task.
Table 1. Fixed-precision linear quantization results for BERT on the GLUE task set.
Table 2. Fixed-precision k-means quantization results for BERT on the GLUE task set.
Table 3. Fixed-precision linear quantization results for ALBERT on the GLUE task set.
Table 4. Fixed-precision k-means quantization results for ALBERT on the GLUE task set.
Fig. 3 compares the average scores of the 8 GLUE tasks under linear and k-means quantization on the BERT model.
Fig. 4 compares the average scores of the 8 GLUE tasks under linear and k-means quantization on the ALBERT model.
After quantization, we further fine-tune the quantized model on the corresponding task. In particular, the learning rate of the quantized layers is multiplied by 10 (e.g., 5e-4 for all quantized BERT models), while the learning rate of the other layers remains unchanged.
4.3 Experimental results and discussion
We focus mainly on 1-5 bit fixed precision quantization. Tables 1 and 2 show the results of the linear and k-means quantization of BERT, respectively, and fig. 3 shows a further comparison between the average scores of the two experiments. Similarly, the results and comparisons of ALBERT are shown in table 3, table 4 and fig. 4, respectively.
4.3.1 BERT
The large gains from improving the quantization scheme. As shown in Table 1, Table 2, and Fig. 3, although the model performs poorly at low bit widths regardless of the quantization scheme, at the same bit width the k-means-quantized model performs significantly better than the linearly quantized model on all 8 tasks and on their average. In terms of the average performance over the 8 tasks, merely switching the quantization scheme from linear to k-means reduces the performance drop of 1-5 bit quantization relative to full precision from (38.8%, 34.7%, 27.6%, 17.1%, 4.8%) to (28.6%, 3.94%, 0.9%, 0.3%, -0.2%), respectively. The results show that a significant performance improvement can be achieved merely by improving the quantization scheme, which indicates that the room for improving quantization schemes is greatly underestimated. To further illustrate this, we repeated several experiments using a group-wise linear quantization scheme, an improvement over plain linear quantization with higher performance. The results are shown in Table 5. Compared with group-wise linear quantization, simple k-means quantization achieves higher or comparable performance while saving a lot of time.
The potential of k-means quantization. As shown in Table 2, the model can be compressed well simply by using k-means quantization with a fixed-precision strategy, and the quantized model still performs well even at some particularly low bit widths. For example, on the task RTE, a model quantized to 3 bits with k-means quantization incurs only a 2.16% performance drop. For most tasks, including QNLI, SST-2, MRPC, STS-B, MNLI, and QQP, the performance of the quantized model drops significantly only when compressed to 1 bit. Notably, these results are achieved by simple k-means quantization with a maximum of only 3 iteration rounds and without any other tricks, indicating that k-means quantization has great potential for development.
Table 5. Comparison between k-means quantization and group-wise linear quantization on BERT. The rightmost column is the average time cost of k-means quantization relative to group-wise linear quantization on RTE and MRPC. (In group-wise quantization, each matrix is divided into different groups, and each group is quantized separately. For the forward pass, the model needs to reconstruct each quantization group of each layer separately, rather than directly reconstructing the entire weight matrix of each quantized layer. This explains why group-wise quantization is very time-consuming. Specifically, in our group-wise quantization experiments, we divide each matrix into 128 groups.)
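The grouped scheme described in the parenthetical can be sketched as follows (our own minimal NumPy version, reusing plain linear quantization inside each group; names and example values are assumptions):

```python
import numpy as np

def linear_dequant(g, k):
    # Linearly quantize a 1-D group to k bits and return the
    # dequantized values (each entry replaced by its cluster mean).
    lo, hi = g.min(), g.max()
    if hi == lo:
        return g.astype(float).copy()
    q = np.clip(np.floor((g - lo) / ((hi - lo) / 2 ** k)).astype(int),
                0, 2 ** k - 1)
    means = np.array([g[q == c].mean() if (q == c).any() else 0.0
                      for c in range(2 ** k)])
    return means[q]

def group_linear_quantize(w, k, n_groups=128):
    # Split the flattened matrix into n_groups groups and quantize
    # each group separately with its own min/max range.
    flat = w.reshape(-1)
    parts = [linear_dequant(g, k) for g in np.array_split(flat, n_groups)]
    return np.concatenate(parts).reshape(w.shape)

w = np.arange(16.0).reshape(4, 4)
w_hat = group_linear_quantize(w, k=2, n_groups=4)
```

Because every group carries its own range and mean table, the forward pass must reconstruct group by group, which is what makes this variant slow relative to whole-matrix k-means quantization.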
4.3.2 ALBERT
In general, the two main conclusions drawn from the BERT experiments still hold. As shown in Table 3, Table 4, and Fig. 4, we again see the large gain brought by improving the quantization scheme and the great potential of k-means quantization. However, some anomalous results are worth discussing.
The effect of the number of k-means iteration rounds. The first set of anomalous results comes from the 1-bit quantization of QNLI, MRPC, and STS-B. Although the results of k-means quantization are generally better than those of linear quantization, these three sets of results do not follow this rule. We believe this is because the parameter distribution is so complex that k-means cannot give good clustering results with only 3 iterations. To verify this theory and further investigate the effect of the number of iteration rounds, we repeated the experiments on these anomalous results with the maximum number of iteration rounds expanded to 5, 10, and 20. The corresponding results are shown in Table 6. With more iteration rounds, k-means quantization performs better and eventually surpasses linear quantization. However, an overfitting problem exists: as the maximum number of iteration rounds increases from 10 to 20, the quantization performance on both QNLI and STS-B drops significantly. Therefore, in k-means quantization, the maximum number of k-means iteration rounds is also an important hyper-parameter that needs to be searched carefully.
Table 6. 1-bit quantization performance of ALBERT under different maximum numbers of k-means iteration rounds.
Fig. 5 shows a comparison of the average scores of the 8 GLUE tasks for the k-means-quantized BERT and ALBERT models.
Fig. 6 shows a performance comparison of the k-means-quantized BERT and ALBERT models. Each value is the average score of the quantized model as a percentage of the score of the full-precision model.
CoLA 0 and MRPC 68.4. Another set of anomalous results comes from the linear quantization of CoLA and MRPC, which are two classification tasks. We find that after fine-tuning, the quantized model always outputs "1"; the scores of 0 and 68.4 are determined only by the label distribution of the validation set. In other words, after being quantized to 1-5 bits by linear quantization, the model is almost dead and is difficult to train on these two tasks. We further experimented with quantizing the model to higher bit widths on the two tasks and found that, starting from 6-bit quantization, the performance of the quantized model is no longer 0 and 68.4.
Comparison between BERT and ALBERT. Furthermore, we compared the k-means quantization performance of BERT and ALBERT; the results are shown in Figs. 5 and 6. Whereas BERT retains 96.1% of its original performance after k-means 2-bit quantization, the performance of ALBERT has already dropped to 93.4% and 72.5% after k-means 4-bit and 3-bit quantization, respectively. Thus, ALBERT is less robust to quantization (in our work, robustness to quantization means the ability to be quantized to a lower bit width while maintaining high performance). Considering that the main improvement of ALBERT over BERT is parameter sharing, and that quantization can also be regarded as intra-layer parameter sharing, we speculate that parameter sharing and quantization have similar effects, meaning that the redundant information removed by parameter sharing and by quantization partially overlaps. Since ALBERT has already removed a lot of redundant information through parameter sharing (the total number of parameters drops from 108M to 12M) compared with BERT, further applying quantization to ALBERT easily damages useful information, resulting in its poor robustness to quantization. From another point of view, however, parameter sharing greatly reduces the number of parameters and can thus itself be considered a model compression method. Moreover, considering that full-precision ALBERT performs better than the 4-bit and 3-bit BERT models that occupy similar GPU memory, parameter sharing can even achieve better compression performance than quantization without any tricks. However, as a compression method, parameter sharing has a non-negligible disadvantage: it only reduces memory consumption, whereas most other compression methods reduce both memory consumption and computational cost (i.e., real time overhead).
5 Conclusion
Here we compared k-means quantization and linear quantization on the BERT and ALBERT models and reached three main conclusions. First, we find that models quantized with k-means perform significantly better than models quantized linearly; merely improving the underlying quantization scheme yields a huge performance improvement. Second, k-means quantization can compress the model to a relatively low bit width while maintaining high performance, even with a simple fixed-precision compression strategy and without any other tricks. This suggests that k-means quantization has great potential for development. Third, the number of k-means iteration rounds plays an important role in the performance of the quantized model and should be determined carefully. Furthermore, by comparing the k-means quantization results of BERT and ALBERT, we find that ALBERT is less robust to quantization than BERT. This suggests that parameter sharing and quantization have somewhat similar effects; thus, further applying quantization to a model with extensive parameter sharing is more likely to damage useful information, resulting in significant performance degradation.
Referring to fig. 7, a block diagram of a pre-training language model quantization apparatus according to an embodiment of the invention is shown.
As shown in fig. 7, the pre-training language model quantization apparatus 700 includes a first fine tuning module 710, a cluster compression module 720, and a second fine tuning module 730.
The first fine tuning module 710 is configured to perform a first fine-tuning of the pre-trained language model on a downstream task; the cluster compression module 720 is configured to cluster the data in the weight matrices of all embedding layers and all linear layers except the classification layer of the fine-tuned model using k-means clustering, with the number of categories set to 2^n, where n is the number of bits occupied by each data element of the compressed target model; and the second fine tuning module 730 is configured to perform a second fine-tuning of the quantized model on the downstream task while maintaining quantization, finally obtaining the quantized network.
In some alternative embodiments, the cluster compression module 720 is further configured to: partition the data into 2^k clusters using k-means++ initialization and initialize 2^k means for the 2^k clusters; classify each data point into the nearest cluster according to its relation to each mean; after all data points are classified, update each mean to the average of all data points of its cluster; and repeat the reclassification and centroid updating until convergence or a preset number of iterations is reached.
It should be understood that the modules depicted in fig. 7 correspond to the various steps in the method described with reference to fig. 1. Thus, the operations and features described above for the method and the corresponding technical effects are equally applicable to the modules in fig. 7, and are not described here again.
It should be noted that the modules in the embodiments of the present application do not limit the solution of the present application; for example, the receiving module may be described as a module that receives a speech recognition request. In addition, the related functional modules may be implemented by a hardware processor; for example, the receiving module may also be implemented by a processor, which is not described in detail here.
In other embodiments, embodiments of the present invention further provide a non-volatile computer storage medium having stored thereon computer-executable instructions for performing the pre-trained language model quantization method of any of the method embodiments described above;
as one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
performing first fine tuning on the pre-trained language model on a downstream task;
clustering data in the weight matrices of all embedding layers and all linear layers except the classification layer of the fine-tuned model using k-means clustering, and setting the number of categories to 2^n, where n is the number of bits occupied by each data element of the compressed target model;
and performing secondary fine tuning on the quantized model on the downstream task under the condition of maintaining quantization, and finally obtaining the quantized network.
The non-transitory computer readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from the use of the pre-trained language model quantization means, etc. Further, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes a memory remotely located with respect to the processor, the remote memory being connectable to the pre-trained language model quantification apparatus through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform any of the pre-trained language model quantification methods described above.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, as shown in fig. 8, where the device includes: one or more processors 810, and a memory 820, one processor 810 being illustrated in fig. 8. The apparatus for pre-training the language model quantization method may further include: an input device 830 and an output device 840. Processor 810, memory 820, input device 830, and output device 840 may be connected by a bus or other means, for example in fig. 8. Memory 820 is the non-volatile computer-readable storage medium described above. The processor 810 performs various functional applications of the server and data processing, i.e., implements the pre-trained language model quantization method of the method embodiment described above, by running non-volatile software programs, instructions, and modules stored in the memory 820. The input device 830 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the pre-trained language model quantization device. The output device 840 may include a display device such as a display screen.
The product can execute the method provided by the embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present invention.
As an embodiment, the electronic device is applied to a pre-training language model quantization apparatus, and includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to:
performing first fine tuning on the pre-trained language model on a downstream task;
clustering data in the weight matrices of all embedding layers and all linear layers except the classification layer of the fine-tuned model using k-means clustering, and setting the number of categories to 2^n, where n is the number of bits occupied by each data element of the compressed target model;
and performing secondary fine tuning on the quantized model on the downstream task under the condition of maintaining quantization, and finally obtaining the quantized network.
The electronic device of the embodiments of the present application exist in a variety of forms including, but not limited to:
(1) A mobile communication device: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice, data communications. Such terminals include smart phones (e.g., iPhone), multimedia phones, functional phones, and low-end phones, among others.
(2) Ultra mobile personal computer device: such devices are in the category of personal computers, having computing and processing functions, and generally also having mobile internet access characteristics. Such terminals include: PDA, MID, and UMPC devices, etc., such as iPad.
(3) Portable entertainment device: such devices may display and play multimedia content. Such devices include audio, video players (e.g., iPod), palm game consoles, electronic books, and smart toys and portable car navigation devices.
(4) Server: a device that provides computing services. The composition of a server is similar to a general computer architecture, but because highly reliable services must be provided, the requirements on processing capacity, stability, reliability, security, scalability, and manageability are high.
(5) Other electronic devices with data interaction function.
The apparatus embodiments described above are merely illustrative, wherein elements illustrated as separate elements may or may not be physically separate, and elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (8)
1. A method of pre-training language model improvement, wherein the pre-training language model is BERT or ALBERT, the method comprising:
performing first fine tuning on the pre-trained language model on a downstream task;
clustering data in the weight matrices of all embedding layers and all linear layers except the classification layer of the fine-tuned model using k-means clustering, and setting the number of categories to 2^n, where n is the number of bits occupied by each data element of the compressed target model;
and performing secondary fine tuning on the quantized model on the downstream task under the condition of maintaining quantization, and finally obtaining the quantized network.
2. The method of claim 1, wherein clustering data in the weight matrices of all embedding layers and all linear layers except the classification layer of the fine-tuned model using k-means clustering comprises:
partitioning the data into 2^k clusters using k-means++ initialization and initializing 2^k means for the 2^k clusters;
classifying each data into a nearest cluster according to the relation between the data and each mean value;
after all the data are classified, updating each mean to the average of all the data belonging to its cluster;
and repeatedly reclassifying each data and updating the mean value until convergence is met or a preset maximum iteration round number is reached.
3. The method of claim 2, wherein partitioning the data into 2^k clusters using k-means++ initialization and initializing 2^k means for the 2^k clusters comprises:
selecting a random data point from the data as the first mean;
assigning each remaining data point a likelihood of being the next mean according to its minimum distance to all existing means, and selecting the next mean according to these likelihoods;
repeating the likelihood calculation and mean selection until all 2^k means are generated.
4. The method of claim 2, wherein the preset maximum number of iteration rounds is 3.
5. The method according to claim 1, wherein the quantized network restores the original weight matrix from the class of each data element and the mean of each class, that is, each data element is replaced by the mean of its corresponding class;
when the quantized network propagates backward, a gradient descent method is used to update the network parameters, in particular the quantized weight matrix: the gradients of the elements in the same class are averaged, and each mean is updated using that average as the gradient of its class.
6. A pre-training language model improvement apparatus, wherein the pre-training language model is BERT or ALBERT, the apparatus comprising:
a first fine tuning module configured to perform a first fine tuning of the pre-trained language model on a downstream task;
the cluster compression module is configured to cluster the data in the weight matrices of all embedding layers and all linear layers except the classification layer of the fine-tuned model using k-means clustering, and set the number of categories to 2^n, where n is the number of bits occupied by each data element of the compressed target model;
And the second fine tuning module is configured to perform second fine tuning on the quantized model on the downstream task under the condition of maintaining quantization, and finally obtain a quantized network.
7. The apparatus of claim 6, wherein the cluster compression module is further configured to:
partitioning the data into 2^k clusters using k-means++ initialization and initializing 2^k means for the 2^k clusters;
classifying each data into a nearest cluster according to the relation between the data and each mean value;
after all the data are classified, updating each mean to the average of all the data belonging to its cluster;
and repeatedly reclassifying each data and updating the mean value until convergence is met or a preset maximum iteration round number is reached.
8. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010636126.3A CN111814448B (en) | 2020-07-03 | 2020-07-03 | Pre-training language model quantization method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111814448A CN111814448A (en) | 2020-10-23 |
CN111814448B true CN111814448B (en) | 2024-01-16 |
Family
ID=72856262
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010636126.3A Active CN111814448B (en) | 2020-07-03 | 2020-07-03 | Pre-training language model quantization method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111814448B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112100383B (en) * | 2020-11-02 | 2021-02-19 | 之江实验室 | Meta-knowledge fine tuning method and platform for multitask language model |
GB2609768A (en) * | 2020-11-02 | 2023-02-15 | Zhejiang Lab | Multi-task language model-oriented meta-knowledge fine tuning method and platform |
US20240104346A1 (en) * | 2022-09-15 | 2024-03-28 | Huawei Technologies Co., Ltd. | Method and device for compressing generative pre-trained language models via quantization |
CN118278535A (en) * | 2022-12-30 | 2024-07-02 | 中国电信股份有限公司 | Fine tuning method, device, equipment, medium and program for pre-training model |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106897734A (en) * | 2017-01-12 | 2017-06-27 | 南京大学 | K average clusters fixed point quantization method heterogeneous in layer based on depth convolutional neural networks |
CN107944553A (en) * | 2017-11-22 | 2018-04-20 | 浙江大华技术股份有限公司 | A kind of method for trimming and device of CNN models |
CN108415888A (en) * | 2018-02-12 | 2018-08-17 | 苏州思必驰信息科技有限公司 | Compression method and system for neural network language model |
CN110363281A (en) * | 2019-06-06 | 2019-10-22 | 上海交通大学 | A kind of convolutional neural networks quantization method, device, computer and storage medium |
CN110377686A (en) * | 2019-07-04 | 2019-10-25 | 浙江大学 | A kind of address information Feature Extraction Method based on deep neural network model |
CN110489555A (en) * | 2019-08-21 | 2019-11-22 | 创新工场(广州)人工智能研究有限公司 | A kind of language model pre-training method of combination class word information |
CN110597986A (en) * | 2019-08-16 | 2019-12-20 | 杭州微洱网络科技有限公司 | Text clustering system and method based on fine tuning characteristics |
CN111340186A (en) * | 2020-02-17 | 2020-06-26 | 之江实验室 | Compressed representation learning method based on tensor decomposition |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180107926A1 (en) * | 2016-10-19 | 2018-04-19 | Samsung Electronics Co., Ltd. | Method and apparatus for neural network quantization |
Non-Patent Citations (2)
Title |
---|
Effectiveness of Self-Supervised Pre-Training for ASR; Alexei Baevski et al.; IEEE Xplore; full text * |
A Yangtze River Delta patent matching algorithm based on BERT+ATT and DBSCAN; Cao Xuyou; Zhou Zhiping; Wang Li; Zhao Weidong; Information Technology (Issue 3); full text * |
Also Published As
Publication number | Publication date |
---|---|
CN111814448A (en) | 2020-10-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111814448B (en) | Pre-training language model quantization method and device | |
CN109977212B (en) | Reply content generation method of conversation robot and terminal equipment | |
US11334819B2 (en) | Method and system for distributed machine learning | |
CN107967515B (en) | Method and apparatus for neural network quantization | |
CN110546656B (en) | Feedforward generation type neural network | |
US9400955B2 (en) | Reducing dynamic range of low-rank decomposition matrices | |
US20140156575A1 (en) | Method and Apparatus of Processing Data Using Deep Belief Networks Employing Low-Rank Matrix Factorization | |
CN108319988B (en) | Acceleration method of deep neural network for handwritten Chinese character recognition | |
CN108415888A (en) | Compression method and system for neural network language model | |
JP6950756B2 (en) | Neural network rank optimizer and optimization method | |
CN112687266B (en) | Speech recognition method, device, computer equipment and storage medium | |
CN112840358B (en) | Cursor-based adaptive quantization for deep neural networks | |
CN116976428A (en) | Model training method, device, equipment and storage medium | |
US20220092382A1 (en) | Quantization for neural network computation | |
US20220027719A1 (en) | Compressing tokens based on positions for transformer models | |
CN111324731B (en) | Computer-implemented method for embedding words of corpus | |
CN116524941A (en) | Self-adaptive quantization compression method and system for voice model and electronic equipment | |
CN116644797A (en) | Neural network model quantization compression method, electronic device and storage medium | |
CN116312607A (en) | Training method for audio-visual voice separation model, electronic device and storage medium | |
Tai et al. | Learnable mixed-precision and dimension reduction co-design for low-storage activation | |
CN116384471A (en) | Model pruning method, device, computer equipment, storage medium and program product | |
Shahnawazuddin et al. | Sparse coding over redundant dictionaries for fast adaptation of speech recognition system | |
CN106847268B (en) | Neural network acoustic model compression and voice recognition method | |
CN111368976B (en) | Data compression method based on neural network feature recognition | |
Macoskey et al. | Learning a neural diff for speech models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | ||
Address after: Building 14, Tengfei Innovation Park, 388 Xinping Street, Suzhou Industrial Park, Suzhou, Jiangsu Province, 215123. Applicant after: Sipic Technology Co., Ltd. Address before: Building 14, Tengfei Innovation Park, 388 Xinping Street, Suzhou Industrial Park, Suzhou, Jiangsu Province, 215123. Applicant before: AI SPEECH Co., Ltd. |
GR01 | Patent grant | ||