CN115238893B - Neural network model quantification method and device for natural language processing - Google Patents

Neural network model quantification method and device for natural language processing

Info

Publication number
CN115238893B
CN115238893B CN202211162125.5A CN202211162125A
Authority
CN
China
Prior art keywords
neural network
language model
quantization
network model
clipping
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211162125.5A
Other languages
Chinese (zh)
Other versions
CN115238893A (en)
Inventor
刘祥龙
魏秀颖
龚睿昊
李莹
吕金虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202211162125.5A priority Critical patent/CN115238893B/en
Publication of CN115238893A publication Critical patent/CN115238893A/en
Application granted granted Critical
Publication of CN115238893B publication Critical patent/CN115238893B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a neural network model quantization method and device for natural language processing. The method comprises the following steps: carrying out scaling parameter transfer for the LayerNorm structures in a full-precision pre-training language model to obtain an equivalent floating-point pre-training language model; determining a clipping range on the basis of the floating-point pre-training language model by using a word-based clipping step, based on a small amount of data; and calculating a quantization step size from the clipping range to obtain a quantized pre-training language model. With the method and device, a pre-training language model that is more amenable to quantization can be obtained without extra computational overhead, so that the required computation is significantly reduced, which is particularly suitable for the low-power requirements of edge devices.

Description

Neural network model quantification method and device for natural language processing
Technical Field
The invention relates to a neural network model quantization method for natural language processing, and to a corresponding neural network model quantization device, belonging to the technical field of computational linguistics.
Background
The neural networks used in natural language processing mainly fall into two types: sequential models represented by the recurrent neural network and the long short-term memory model, and parallel computation models represented by the Transformer and BERT.
Like the recurrent neural network, the Transformer can capture the historical information relevant to the current output by using a self-attention mechanism; like the feedforward neural network, it can process the current input and the nearby historical inputs in parallel, thereby overcoming the slow information processing of the recurrent neural network. Furthermore, the Transformer is the cornerstone of the currently mainstream pre-training language models such as BERT, GPT and T5. However, the parameter counts of these pre-training language models are generally large, and technical means such as model quantization need to be adopted to make them usable on lightweight devices.
Chinese invention patent No. ZL 202011470331.3 discloses a multi-task-oriented automatic compression method and platform for pre-training language models. A meta-network of a structure generator is designed, a knowledge-distillation coding vector is constructed based on a Transformer-layer-sampling knowledge distillation method, and the structure generator is used to generate the distillation structure model corresponding to the currently input coding vector; meanwhile, a Bernoulli-distribution sampling method is proposed to train the structure generator; in each iteration, each encoder unit is migrated by Bernoulli-distribution sampling to form the corresponding coding vector; by varying the coding vector input to the structure generator and the small batches of training data, and jointly training the structure generator and the corresponding distillation structures, a structure generator capable of generating weights for different distillation structures can be learned; finally, on the basis of the trained meta-learning network, the optimal compression structure is searched by an evolutionary algorithm, thereby obtaining an optimal task-independent general compression architecture for the pre-training language model.
In addition, Chinese invention application No. 202210540113.5 discloses a language model compression method based on uncertainty-estimation knowledge distillation. The method comprises the following steps: 1) compressing the original language model to half its size to obtain a compressed neural network; 2) reasonably initializing the parameters of the compressed neural network from the original language model; 3) adding a parameter-distillation loss function for the feedforward network structure, and designing an uncertainty-estimation loss function and a cross-entropy loss function for the natural language processing task; 4) training the compressed neural network model with the designed loss functions. This technical scheme reduces the amount of computation in network compression training, increases the network compression ratio, accelerates network inference, can be widely applied to model deployment and model compression tasks, and provides a new model compression solution for application scenarios with scarce hardware resources.
Disclosure of Invention
The invention aims to provide a neural network model quantification method oriented to natural language processing.
Another technical problem to be solved by the present invention is to provide a neural network model quantization apparatus for natural language processing.
In order to achieve the purpose, the invention adopts the following technical scheme:
according to a first aspect of the embodiments of the present invention, there is provided a natural language processing-oriented neural network model quantization method, including the following steps:
(1) Carrying out scaling parameter transfer for the LayerNorm structures in the full-precision pre-training language model to obtain an equivalent floating-point pre-training language model;
(2) Determining a clipping range on the basis of the floating-point pre-training language model obtained in step (1) by using a word-based clipping step, based on a small amount of data;
(3) Calculating a quantization step size according to the clipping range obtained in step (2) to obtain a quantized pre-training language model.
Preferably, in the step (1), the scaling parameters in the LayerNorm structure are extracted and transferred to the weights of the subsequent modules.
Preferably, when the subsequent module is a residual connection module, for the linear transformation branch in the residual connection module, the transferred scaling parameters are absorbed by the following formula:

(\gamma \odot x)\, W^{(n)} = x \left( \gamma^{\top} \odot W^{(n)} \right)

where x denotes the input vector, γ denotes the scaling parameters acting on the input vector, W^{(n)} denotes the weight of the n-th linear transformation branch, ⊙ denotes the Hadamard product of matrices, and n is a positive integer.
Preferably, when the subsequent module is a residual connection module, the scaling parameters are directly multiplied onto the shortcut (short-circuit) branch in the residual connection module.
Preferably, in step (2), the maximum value of each word's token embedding is used as the representative of its outliers, and the minimum value of each word's token embedding is used as the representative of its negative outliers.
Preferably, in step (2), the clipping rate is enumerated on the set of the maximum values of all words, and the corresponding clipping value is calculated.
Preferably, the α-quantile of the set of all word maxima is determined according to the α-quantile function to obtain the upper limit of the clipping range; the minimum values of all words are taken, and the lower limit of the clipping range is calculated according to the α-quantile; a quantization step size s is calculated from the upper and lower limits of the clipping range, the corresponding loss function L(s) is calculated, and the quantization step size with the minimum loss is finally selected.
According to a second aspect of the embodiments of the present invention, there is provided a natural language processing-oriented neural network model quantization apparatus, including a processor and a memory, where the processor reads a computer program in the memory for executing the above neural network model quantization method.
Compared with the prior art, the neural network model quantification method and device for natural language processing provided by the invention have the following technical characteristics:
1. By means of scaling parameter transfer, the scaling parameters in the LayerNorm structure are transferred to the subsequent residual connection module; combined with the word-based clipping step, a pre-training language model that is more amenable to quantization can be obtained without extra computational overhead;
2. The computational overhead required by the quantized pre-training language model is significantly reduced, which is particularly suitable for the low-power requirements of edge devices;
3. The neural network model quantization method and device are easy to implement, can be applied to various Transformer-based pre-training language models such as BERT, RoBERTa and BART, and therefore have a wide application range.
Drawings
FIG. 1 is a schematic diagram of the distribution of outliers of the LayerNorm output in the BERT pre-training language model;
FIG. 2 is a schematic diagram of the distribution of outliers of the scaling parameter γ in the BERT pre-training language model;
FIG. 3 is a schematic diagram of the distribution of outliers of the Non-Scaling LayerNorm output X̃′ in the BERT pre-training language model;
FIG. 4 is a schematic flow diagram of the quantization process of the pre-training language model before the scaling parameter transfer;
FIG. 5 is a schematic flow diagram of the quantization process of the pre-training language model after the scaling parameter transfer;
FIG. 6 is a diagram illustrating fast convergence of quantization step sizes using a coarse-to-fine paradigm;
fig. 7 is a schematic diagram of a neural network model quantization apparatus according to an embodiment of the present invention.
Detailed Description
The technical contents of the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
Daniel Jurafsky and James H. Martin discuss the basic structure of the Transformer and its working principle in natural language processing tasks in detail in the forthcoming third edition of Speech and Language Processing (see https://web.stanford.edu/~jurafsky/slp3/), so it is not described in detail here.
In further work, researchers found that such Transformer-based pre-training language models have an inherent quantization bottleneck. For example, these pre-training language models contain large outliers, and these sharp outliers (e.g., values close to 100) exhibit structured features, such as often appearing in particular dimensions or on particular words. This can cause a loss of up to 12% even for 8-bit quantization. For this problem, existing work has made some observations on the outliers, finding that they often appear in certain specific dimensions and often on words such as [SEP]. However, that work did not examine the outliers further, and instead circumvented the problem with finer quantization granularity, which increases computational complexity and is not necessarily suitable for practical deployment.
For this reason, we studied the effect of outlier clipping in depth and found that, for Transformer-based pre-training language models such as BERT, RoBERTa and BART, different outliers have different effects on performance when clipped. The more aggressive outliers contributed by a few words (e.g., the [SEP] delimiter) can be clipped sharply and safely with little impact on accuracy. Therefore, the neural network model quantization method provided by the embodiment of the invention first preliminarily detects the clipping range from the perspective of words, and then optimizes the clipping range in a fine-grained manner, so that less meaningful signals are skipped quickly and more attention is paid to the important parts. This is explained in more detail below.
it has been mentioned above that for a pre-trained language model based on Transformer, the quantization or lower (4-bit) quantization perception training after the standard 6/8-bit training results in a severe degradation of accuracy. By studying the degradation of precision and quantization errors caused by each quantizer, we recognize that the LayerNorm (layer normalized) structure and the output of the GELU activation function are the most problematic tensors. The LayerNorm structure's scaling parameters cause the output distribution to have sharp outliers, which should be the cause of large quantization errors.
To this end, we first focused on exploring the potential causes of the outliers. Since it is the output of the LayerNorm structure that exhibits the outliers, the basic form of the LayerNorm structure is analyzed first, namely:

\tilde{X}_{t,j} = \gamma_j \cdot \frac{X_{t,j} - \mu_t}{\sigma_t} + \beta_j        (1)

where the input matrix in the pre-training language model is denoted X, the corresponding input vectors are denoted x, the subscripts t and j denote the t-th token and the j-th dimension respectively, and the corresponding output is denoted X̃; "·" denotes scalar multiplication; μ_t and σ_t respectively denote the mean and standard deviation, at the token level, of the LayerNorm input vector x_t; γ_j denotes the scaling parameter of the j-th dimension and β_j the shift (translation) parameter of the j-th dimension, t and j both being positive integers.
With reference to FIG. 1 to FIG. 3, we further analyzed the distribution of the LayerNorm parameters in the BERT pre-training language model and found that, in the same dimensions in which the output exhibits outliers, the values of the scaling parameter γ are much sharper than the others. Moreover, the shift parameter β is small, so we ignore it when identifying the key factor. From this it can be deduced that the scaling parameter γ is the key reason why the output of the LayerNorm structure contains outliers, and that γ can be extracted from the equation to eliminate its influence.
By plotting the distribution of the LayerNorm parameters and of its output, we found that the scaling parameter and the output have outliers in the same dimensions. By extracting the scaling parameter γ, we obtain the Non-Scaling LayerNorm (non-scaling layer normalization) shown in equation (2):

\tilde{X}'_{t,j} = \frac{X_{t,j} - \mu_t}{\sigma_t} + \frac{\beta_j}{\gamma_j}        (2)

We find that the resulting tensor X̃′ has a much milder distribution, and computing the quantization error metric shows that the quantization error of X̃′ is indeed smaller.
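To make equations (1) to (3) concrete, the following PyTorch sketch builds a Non-Scaling LayerNorm from an ordinary nn.LayerNorm; the class and variable names are illustrative assumptions and not part of the claimed implementation:

    import torch
    import torch.nn as nn

    class NonScalingLayerNorm(nn.Module):
        """Sketch of equation (2): per-token normalization is kept, the shift
        becomes beta/gamma, and the scaling parameter gamma is held aside so
        that it can later be migrated into the subsequent module."""
        def __init__(self, ln: nn.LayerNorm):
            super().__init__()
            self.eps = ln.eps
            self.gamma = ln.weight.detach().clone()          # kept for migration
            self.shift = (ln.bias / ln.weight).detach()      # beta_j / gamma_j

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            mu = x.mean(dim=-1, keepdim=True)
            var = x.var(dim=-1, unbiased=False, keepdim=True)
            return (x - mu) / torch.sqrt(var + self.eps) + self.shift

    # Equation (3): multiplying the Non-Scaling output by gamma recovers the
    # original LayerNorm output, e.g.
    #   ln = nn.LayerNorm(768); ns = NonScalingLayerNorm(ln)
    #   x = torch.randn(4, 16, 768)
    #   assert torch.allclose(ln(x), ns(x) * ns.gamma, atol=1e-5)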
Based on the above findings, we extract the scaling parameter γ from the LayerNorm structure and transfer it to a subsequent module, such as the residual connection module, to obtain a pre-training language model that is more favorable for quantization.
Specifically, the LayerNorm structure in equation (1) is first rewritten as the Non-Scaling LayerNorm structure in equation (2), and the scaling parameter γ is then transferred. The relationship between the two is shown in equation (3):

\tilde{X}_{t,j} = \gamma_j \cdot \tilde{X}'_{t,j}        (3)
since many of the pre-trained language models such as LayerNorm of BERT, roBERTA, BART, etc. are followed by residual concatenation modules, we consider scaling parameters
Figure 957025DEST_PATH_IMAGE006
Transfer to the two branches of the residual concatenation module.
Here, for the linear transformation branch, we have the following equation (4) to absorb the shifted scaling parameters with weights.
Figure 863801DEST_PATH_IMAGE017
(4)
The above equation (4) represents a linear variation, where x represents an input vector,
Figure 750330DEST_PATH_IMAGE006
representing scaling parameters that act on the input vector,Wthe weight of the linear transformation branch is represented,
Figure 691741DEST_PATH_IMAGE003
represents the Hadamard product of the matrix,nis a positive integer. Equation (4) shows that the scaling parameters that act on the input vector can be transferred into the weights of the subsequent modules.
For a short circuit (short) branch, in the embodiment of the present invention, a scaling parameter may be directly multiplied on the short circuit branch
Figure 274032DEST_PATH_IMAGE006
FIG. 4 shows the quantization process of the pre-training language model before the scaling parameter transfer, and FIG. 5 shows the quantization process after the scaling parameter transfer. As can be seen by comparing FIG. 4 with FIG. 5, in the embodiment of the invention the "Quant" operation is applied to the output of the Non-Scaling LayerNorm; the linear transformation branch then performs the matrix multiplication with the quantized weights into which γ has been absorbed, while the other, shortcut branch is directly multiplied by the scaling parameter and passes through the "dequantization" step. In practice this means that the computation involving the scaling parameter γ is delayed. Thus, such a transfer of the scaling parameters does not increase the computational overhead.
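A minimal PyTorch sketch of this migration, assuming the LayerNorm output feeds an nn.Linear on the linear branch (the function name and the handling of the shortcut branch are illustrative assumptions, not the patented implementation):

    import torch
    import torch.nn as nn

    @torch.no_grad()
    def migrate_gamma(ln: nn.LayerNorm, linear: nn.Linear) -> torch.Tensor:
        """Turn `ln` into its Non-Scaling form and absorb gamma into the
        following linear branch (equation (4)); the returned gamma is what
        the shortcut branch must be multiplied by, as in FIG. 5."""
        gamma = ln.weight.detach().clone()
        # Equation (4): (gamma ⊙ x) @ W.T == x @ (W * gamma).T, since nn.Linear
        # stores W as (out_features, in_features) and acts on the last dimension,
        # so each input column of W is scaled by the corresponding gamma_j.
        linear.weight.mul_(gamma)
        # Rewrite the LayerNorm as Non-Scaling LayerNorm: gamma -> 1, beta -> beta/gamma.
        ln.bias.div_(gamma)
        ln.weight.fill_(1.0)
        return gamma

In a BERT-style block, the returned gamma would then be multiplied onto the shortcut branch together with the "dequantization" of the quantized Non-Scaling LayerNorm output, so the extra multiplication is fused with dequantization and, as noted above, adds no computational overhead.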
On the other hand, we first clipped the activation values on full-precision pre-training language models such as BERT, RoBERTa and BART, and judged the specific influence of clipping by observing how much the accuracy dropped. We found that clipping different outliers has very different impacts on accuracy. For example, some outliers, although sharper, can be clipped aggressively without affecting the final accuracy, while clipping others has a large impact. We also found that the ranges of outliers contributed by different words vary widely; the outliers with wide, long-tailed coverage are less important and correspond to only a small number of words. Therefore, we first identify the relatively important outliers from the dimension of words.
On this basis, we try to find a suitable clipping value and thereby a suitable quantization step size. Here the importance of the outliers must be taken into account, because some outliers, although sharper, are less important, while clipping others causes a large change in accuracy.
To this end, as shown in FIG. 6, we design a coarse-to-fine paradigm to find the quantization step size (or clipping value) s that minimizes the quantization loss, described as follows:

s^{*} = \arg\min_{s} L(s)        (5)

where L(s) denotes the loss corresponding to quantization step size s, defined as the ℓ2 norm of the difference between the output f̂ of the quantized network and the output of the unquantized network f.
In the coarse stage, the outliers that cover a wide range but are insignificant need to be skipped quickly. In one embodiment of the invention, we use the maximum value of each word's token embedding as the representative of its outliers (and the minimum value of each word's token embedding as the representative of its negative outliers), so that we obtain a new tensor:

\mathbf{m} = \{ \max(x_1), \max(x_2), \ldots, \max(x_T) \}        (6)

where T denotes the total number of words, x_1 denotes the first word, x_2 the second word, and so on, and m denotes the set of all word maxima.
In the subsequent fine stage, we enumerate the clipping rate and compute the corresponding clipping value with which the original tensor is clipped:

c_u = Q_{\alpha}(\mathbf{m})        (7)

where the upper limit c_u of the clipping range is obtained as the α-quantile of the set m of all word maxima, determined by the α-quantile function Q_α. Similarly, following the above steps, the minimum values of all words can be taken and the lower limit of the clipping range calculated according to the α-quantile; the quantization step size s is then calculated from the upper and lower limits of the clipping range, the corresponding loss function L(s) is computed, and the clipping value/quantization step size with the minimum loss is finally selected.
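The coarse-to-fine search of equations (5) to (7) can be sketched in PyTorch as follows; the candidate α values, the 6-bit asymmetric uniform quantizer and the function names are illustrative assumptions rather than the exact procedure claimed by the patent:

    import torch

    def fake_quant(x, lo, hi, bits=6):
        # Assumed asymmetric uniform quantizer over the clipping range [lo, hi].
        s = (hi - lo) / (2 ** bits - 1)                      # quantization step size
        q = torch.round((torch.clamp(x, lo, hi) - lo) / s)
        return q * s + lo

    def search_clip_range(x, out_fn, bits=6, alphas=(1.0, 0.999, 0.99, 0.95, 0.9)):
        """x: activation tensor of shape (T, d), one row per word (token).
        out_fn: maps activations to the network output, used for L(s) in eq. (5)."""
        token_max = x.max(dim=-1).values          # coarse stage, eq. (6): per-word maxima
        token_min = x.min(dim=-1).values          # and minima for the negative side
        fp_out = out_fn(x)                        # unquantized reference output
        best_range, best_loss = None, float("inf")
        for a in alphas:                          # fine stage, eq. (7): alpha-quantile clipping
            hi = torch.quantile(token_max, a).item()
            lo = torch.quantile(token_min, 1.0 - a).item()
            loss = torch.norm(out_fn(fake_quant(x, lo, hi, bits)) - fp_out, p=2).item()
            if loss < best_loss:                  # eq. (5): keep the setting with minimum loss
                best_range, best_loss = (lo, hi), loss
        return best_range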
In an embodiment of the present invention, for a full-precision pre-training language model, a pre-training language model that is more amenable to accurate quantization can be obtained through the following steps:
1. Scaling parameter transfer is carried out for the LayerNorm structures in the pre-training language model to obtain an equivalent floating-point pre-training language model.
2. Based on a small amount of data, a clipping range is determined on the basis of the floating-point pre-training language model obtained in the previous step, using the word-based clipping step.
For example, if the pre-training language model is used to decide whether an input sentence expresses positive or negative emotion, the small amount of data mentioned above refers to 100 or 200 pieces of emotion-classification data. It will be appreciated that when the pre-training language model is used for other natural language processing tasks, the content and scale of the small amount of data need to be adjusted accordingly. This is conventional practice well known to those skilled in the art and is not described further here.
3. A quantization step size is calculated according to the clipping range obtained in the previous step to obtain the final quantized pre-training language model.
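As a purely illustrative numerical example of step 3 (the clipping range, bit-width and asymmetric uniform quantizer below are assumptions, not values taken from the patent), the quantization step size follows directly from the clipping range:

    import torch

    lower, upper, bits = -8.0, 56.0, 6                 # hypothetical clipping range from step 2
    s = (upper - lower) / (2 ** bits - 1)              # step 3: quantization step size ~ 1.016

    def quantize(x: torch.Tensor) -> torch.Tensor:
        # Clip to the calibrated range, snap to the uniform grid, then dequantize.
        q = torch.round((torch.clamp(x, lower, upper) - lower) / s)
        return q * s + lower

    x = torch.tensor([-20.0, 0.3, 3.7, 95.0])          # 95.0 is an outlier beyond the range
    print(quantize(x))                                 # outliers are clipped to [lower, upper]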
In order to verify the actual effect of the neural network model quantization method provided by the invention, experiments were carried out on actual natural language processing tasks, as described below.
For the BERT, RoBERTa and BART pre-training language models, on the GLUE benchmark classification tasks, the neural network model quantization method improves over the prior art by about 8% (BERT), 8% (RoBERTa) and 12% (BART) under the 6-bit setting. Specifically, for classification tasks such as emotion classification (whether the emotion is positive or negative) and sentence-similarity judgment (how similar two sentences are), accurate results can be produced with high accuracy under low power consumption; for example, whether the emotion expressed by a sentence is positive or negative, or how similar two input sentences are, can be obtained quickly. On the SQuAD reading comprehension tasks, the method improves over the prior art by about 4%, 9% and 2% (the three pre-training language models on SQuAD v1) and 4%, 9% and 3% (the three pre-training language models on SQuAD v2) under the 6-bit setting. Specifically, for the reading comprehension task, i.e., extracting an answer from a text passage and a question, the neural network model quantization method can locate the answer in the original text efficiently and quickly and output it. For the BART pre-training language model, on the CNN/DailyMail and XSum generation tasks, the method improves over the prior art by about 3% on CNN/DailyMail and 4% on XSum under the 6-bit setting. Specifically, for the text generation task, i.e., generating an output text from an input text, the output text can be generated more quickly and accurately using the neural network model quantization method.
On the basis of the above neural network model quantization method for natural language processing, the invention further provides a neural network model quantization device for natural language processing. As shown in FIG. 7, the neural network model quantization device comprises one or more processors 71 and a memory 72. The memory 72 is coupled to the processor 71 and is used for storing one or more programs which, when executed by the one or more processors 71, cause the one or more processors 71 to implement the natural-language-processing-oriented neural network model quantization method of the above embodiment.
The processor 71 is configured to control the overall operation of the apparatus for quantizing a neural network model for natural language processing, so as to complete all or part of the steps of the method for quantizing a neural network model for natural language processing. The processor 71 may be a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processing (DSP) chip, or the like. The memory 72 is used to store various types of data to support the operation of the natural language processing oriented neural network model quantization method, such data may include, for example, instructions for any application or method operating on the natural language processing oriented neural network model quantization device, as well as application-related data.
The memory 72 may be implemented by any type or combination of volatile or non-volatile storage devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or the like.
In an exemplary embodiment, the apparatus for quantizing a neural network model for natural language processing may be specifically implemented by a computer chip or an entity, or implemented by a product with a certain function, and is configured to perform the method for quantizing a neural network model for natural language processing, and achieve a technical effect consistent with the method. One typical embodiment is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a vehicle human interaction device, a police checkpoint screening device, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
In another exemplary embodiment, the present invention further provides a computer readable storage medium including program instructions, which when executed by a processor, implement the steps of the natural language processing oriented neural network model quantization method in any one of the above embodiments. For example, the computer readable storage medium may be a memory including program instructions executable by a processor of a natural language processing oriented neural network model quantization apparatus to perform the above natural language processing oriented neural network model quantization method, and achieve technical effects consistent with the above method.
Compared with the prior art, the neural network model quantification method and device for natural language processing provided by the invention have the following technical characteristics:
1. By means of scaling parameter transfer, the scaling parameters in the LayerNorm structure are transferred to the subsequent residual connection module; combined with the word-based clipping step, a pre-training language model that is more amenable to quantization can be obtained without extra computational overhead;
2. The computational overhead required by the quantized pre-training language model is significantly reduced, which is particularly suitable for the low-power requirements of edge devices;
3. The neural network model quantization method and device are easy to implement, can be applied to various Transformer-based pre-training language models such as BERT, RoBERTa and BART, and therefore have a wide application range.
The method and device for quantizing a neural network model for natural language processing provided by the invention have been explained in detail above. It will be apparent to those skilled in the art that any obvious modification made to them without departing from the essence of the invention will constitute an infringement of the patent right of the invention, and the corresponding legal liability shall be borne.

Claims (3)

1. A neural network model quantification method for natural language processing is characterized by comprising the following steps:
(1) For the LayerNorm structure in the full-precision pre-training language model, extracting the scaling parameters in the LayerNorm structure and transferring them to the weights of the subsequent module, to obtain an equivalent floating-point pre-training language model; wherein, when the subsequent module is a residual connection module, for the linear transformation branch in the residual connection module, the transferred scaling parameters are absorbed by the following formula:

(\gamma \odot x)\, W^{(n)} = x \left( \gamma^{\top} \odot W^{(n)} \right)

where x denotes the input vector, γ denotes the scaling parameters acting on the input vector, W^{(n)} denotes the weight of the n-th linear transformation branch, ⊙ denotes the Hadamard product of matrices, and n is a positive integer;
(2) Determining a clipping range on the basis of the floating-point pre-training language model obtained in step (1) by using a word-based clipping step, based on a small amount of data; wherein the maximum value of each word's token embedding is used as the representative of its outliers, and the minimum value of each word's token embedding is used as the representative of its negative outliers; the clipping rate is enumerated on the set of the maximum values of all words and the corresponding clipping value is calculated; the α-quantile of the set of all word maxima is determined according to the α-quantile function to obtain the upper limit of the clipping range; the minimum values of all words are taken, and the lower limit of the clipping range is calculated according to the α-quantile;
(3) Calculating a quantization step size s from the upper and lower limits of the clipping range, calculating the corresponding loss function L(s), and finally selecting the quantization step size with the minimum loss, to obtain the quantized pre-training language model.
2. The neural network model quantization method of claim 1, wherein:
and when the subsequent module is a residual connecting module, directly multiplying the short-circuit branch in the residual connecting module by the scaling parameter.
3. A natural language processing-oriented neural network model quantization apparatus, characterized by comprising a processor and a memory, the processor reading a computer program in the memory for executing the neural network model quantization method of claim 1 or 2.
CN202211162125.5A 2022-09-23 2022-09-23 Neural network model quantification method and device for natural language processing Active CN115238893B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211162125.5A CN115238893B (en) 2022-09-23 2022-09-23 Neural network model quantification method and device for natural language processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211162125.5A CN115238893B (en) 2022-09-23 2022-09-23 Neural network model quantification method and device for natural language processing

Publications (2)

Publication Number Publication Date
CN115238893A CN115238893A (en) 2022-10-25
CN115238893B true CN115238893B (en) 2023-01-17

Family

ID=83667365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211162125.5A Active CN115238893B (en) 2022-09-23 2022-09-23 Neural network model quantification method and device for natural language processing

Country Status (1)

Country Link
CN (1) CN115238893B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116306796B (en) * 2023-05-17 2023-09-15 北京智源人工智能研究院 Model self-growth training acceleration method and device, electronic equipment and storage medium
CN116451770B (en) * 2023-05-19 2024-03-01 北京百度网讯科技有限公司 Compression method, training method, processing method and device of neural network model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257751A (en) * 2020-09-23 2021-01-22 华为技术有限公司 Neural network pruning method
CN114444679A (en) * 2020-11-06 2022-05-06 山东产研鲲云人工智能研究院有限公司 Method and system for quantizing binarization input model and computer readable storage medium
CN114580281A (en) * 2022-03-04 2022-06-03 北京市商汤科技开发有限公司 Model quantization method, apparatus, device, storage medium, and program product

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657780A (en) * 2018-06-15 2019-04-19 清华大学 A kind of model compression method based on beta pruning sequence Active Learning
CN113673260A (en) * 2020-05-15 2021-11-19 阿里巴巴集团控股有限公司 Model processing method, device, storage medium and processor
US20220292360A1 (en) * 2021-03-15 2022-09-15 Nvidia Corporation Pruning neural networks

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257751A (en) * 2020-09-23 2021-01-22 华为技术有限公司 Neural network pruning method
CN114444679A (en) * 2020-11-06 2022-05-06 山东产研鲲云人工智能研究院有限公司 Method and system for quantizing binarization input model and computer readable storage medium
CN114580281A (en) * 2022-03-04 2022-06-03 北京市商汤科技开发有限公司 Model quantization method, apparatus, device, storage medium, and program product

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Centered Weight Normalization; Lei Huang et al.; 2017 IEEE International Conference on Computer Vision (ICCV); 2017-12-25; pp. 2822-2830 *
Neural Network Models in Natural Language Processing; Feng Zhiwei et al.; Contemporary Foreign Language Studies; 2022-08-31 (No. 4); pp. 98-154 *

Also Published As

Publication number Publication date
CN115238893A (en) 2022-10-25

Similar Documents

Publication Publication Date Title
Dauphin et al. Language modeling with gated convolutional networks
CN115238893B (en) Neural network model quantification method and device for natural language processing
CN106502985B (en) neural network modeling method and device for generating titles
Li et al. Towards binary-valued gates for robust lstm training
KR20210029785A (en) Neural network acceleration and embedding compression system and method including activation sparse
US20200159832A1 (en) Device and text representation method applied to sentence embedding
CN110782008B (en) Training method, prediction method and device of deep learning model
CN108460028B (en) Domain adaptation method for integrating sentence weight into neural machine translation
CN112257858A (en) Model compression method and device
CN110457718B (en) Text generation method and device, computer equipment and storage medium
CN107292382A (en) A kind of neutral net acoustic model activation primitive pinpoints quantization method
CN111368037A (en) Text similarity calculation method and device based on Bert model
CN110781686B (en) Statement similarity calculation method and device and computer equipment
CN112818110B (en) Text filtering method, equipment and computer storage medium
CN115221846A (en) Data processing method and related equipment
CN116308754B (en) Bank credit risk early warning system and method thereof
CA3232610A1 (en) Convolution attention network for multi-label clinical document classification
CN113505193A (en) Data processing method and related equipment
CN114064852A (en) Method and device for extracting relation of natural language, electronic equipment and storage medium
Wei et al. EdgeCRNN: an edge-computing oriented model of acoustic feature enhancement for keyword spotting
CN116304748A (en) Text similarity calculation method, system, equipment and medium
EP3362951B1 (en) Neural random access machine
CN111259147A (en) Sentence-level emotion prediction method and system based on adaptive attention mechanism
CN114861907A (en) Data calculation method, device, storage medium and equipment
CN116450813B (en) Text key information extraction method, device, equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant