CN115238893B - Neural network model quantification method and device for natural language processing - Google Patents

Neural network model quantification method and device for natural language processing

Info

Publication number
CN115238893B
CN115238893B CN202211162125.5A CN202211162125A
Authority
CN
China
Prior art keywords
neural network
language model
quantization
network model
clipping
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211162125.5A
Other languages
Chinese (zh)
Other versions
CN115238893A (en)
Inventor
刘祥龙
魏秀颖
龚睿昊
李莹
吕金虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202211162125.5A priority Critical patent/CN115238893B/en
Publication of CN115238893A publication Critical patent/CN115238893A/en
Application granted granted Critical
Publication of CN115238893B publication Critical patent/CN115238893B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a neural network model quantization method and device for natural language processing. The method comprises the following steps: carrying out scaling parameter transfer for the LayerNorm structures in a full-precision pre-training language model to obtain an equivalent floating-point pre-training language model; determining a clipping range on the basis of the floating-point pre-training language model by using a word-based clipping step, based on a small amount of data; and calculating a quantization step size from the clipping range to obtain a quantized pre-training language model. With the method and device, a pre-training language model that is more amenable to quantization can be obtained without extra computational overhead, so that the required computation is significantly reduced, which is particularly suitable for the low-power requirements of edge devices.

Description

Neural network model quantification method and device for natural language processing
Technical Field
The invention relates to a neural network model quantization method for natural language processing, and to a corresponding neural network model quantization device, belonging to the technical field of computational linguistics.
Background
The neural networks used in natural language processing mainly fall into two types: sequential models represented by the recurrent neural network and the long short-term memory model, and parallel computation models represented by the Transformer and BERT.
Like the recurrent neural network, the Transformer can capture the historical information relevant to the current output by using a self-attention mechanism; like the feedforward neural network, it can process the current input and the nearby historical inputs in parallel, thereby overcoming the slow information processing of the recurrent neural network. Furthermore, the Transformer is the cornerstone of the currently mainstream pre-training language models such as BERT, GPT and T5. However, the parameter counts of these pre-training language models are generally large, and technical means such as model quantization need to be adopted to make them usable on lightweight devices.
Chinese invention patent No. ZL 202011470331.3 discloses a multi-task-oriented automatic compression method and platform for pre-training language models. A meta-network of a structure generator is designed, a knowledge-distillation coding vector is constructed based on a Transformer-layer-sampling knowledge distillation method, and the structure generator is used to generate the distillation structure model corresponding to the currently input coding vector; meanwhile, a Bernoulli-distribution sampling method is proposed to train the structure generator; in each iteration, each encoder unit is migrated by Bernoulli-distribution sampling to form the corresponding coding vector; by varying the coding vector input to the structure generator and the small batches of training data, and jointly training the structure generator and the corresponding distillation structures, a structure generator capable of generating weights for different distillation structures can be learned; finally, on the basis of the trained meta-learning network, the optimal compression structure is searched by an evolutionary algorithm, thereby obtaining an optimal task-independent general compression architecture for the pre-training language model.
In addition, Chinese invention application No. 202210540113.5 discloses a language model compression method based on uncertainty-estimation knowledge distillation. The method comprises the following steps: 1) compressing the original language model to half its size to obtain a compressed neural network; 2) reasonably initializing the parameters of the compressed neural network from the original language model; 3) adding a parameter-distillation loss function for the feedforward network structure, and designing an uncertainty-estimation loss function and a cross-entropy loss function for the natural language processing task; 4) training the compressed neural network model with the designed loss functions. This technical scheme reduces the amount of computation in network compression training, increases the network compression ratio, accelerates network inference, can be widely applied to model deployment and model compression tasks, and provides a new model compression solution for application scenarios with scarce hardware resources.
Disclosure of Invention
The invention aims to provide a neural network model quantification method oriented to natural language processing.
Another technical problem to be solved by the present invention is to provide a neural network model quantization apparatus for natural language processing.
In order to achieve the purpose, the invention adopts the following technical scheme:
according to a first aspect of the embodiments of the present invention, there is provided a natural language processing-oriented neural network model quantization method, including the following steps:
(1) Carrying out scaling parameter transfer for the LayerNorm structures in the full-precision pre-training language model to obtain an equivalent floating-point pre-training language model;
(2) Determining a clipping range on the basis of the floating-point pre-training language model obtained in step (1) by using a word-based clipping step, based on a small amount of data;
(3) Calculating a quantization step size according to the clipping range obtained in step (2) to obtain a quantized pre-training language model.
Preferably, in the step (1), the scaling parameters in the LayerNorm structure are extracted and transferred to the weights of the subsequent modules.
Preferably, when the subsequent module is a residual connection module, for the linear transformation branch in the residual connection module, the transferred scaling parameters are absorbed by the following formula:

(\gamma \odot x)\, W^{(n)} = x \left( \gamma^{\top} \odot W^{(n)} \right)

where x denotes the input vector, γ denotes the scaling parameters acting on the input vector, W^{(n)} denotes the weight of the n-th linear transformation branch, ⊙ denotes the Hadamard product of matrices, and n is a positive integer.
Preferably, when the subsequent module is a residual connection module, the scaling parameters are directly multiplied onto the shortcut (short-circuit) branch in the residual connection module.
Preferably, in step (2), the maximum value of each word's token embedding is used as the representative of its outliers, and the minimum value of each word's token embedding is used as the representative of its negative outliers.
Preferably, in step (2), the clipping rate is enumerated on the set of the maximum values of all words, and the corresponding clipping value is calculated.
Preferably, the α-quantile of the set of all word maxima is determined according to the α-quantile function to obtain the upper limit of the clipping range; the minimum values of all words are taken, and the lower limit of the clipping range is calculated according to the α-quantile; a quantization step size s is calculated from the upper and lower limits of the clipping range, the corresponding loss function L(s) is calculated, and the quantization step size with the minimum loss is finally selected.
According to a second aspect of the embodiments of the present invention, there is provided a natural language processing-oriented neural network model quantization apparatus, including a processor and a memory, where the processor reads a computer program in the memory for executing the above neural network model quantization method.
Compared with the prior art, the neural network model quantification method and device for natural language processing provided by the invention have the following technical characteristics:
1. By means of scaling parameter transfer, the scaling parameters in the LayerNorm structure are transferred to the subsequent residual connection module; combined with the word-based clipping step, a pre-training language model that is more amenable to quantization can be obtained without extra computational overhead;
2. The computational overhead required by the quantized pre-training language model is significantly reduced, which is particularly suitable for the low-power requirements of edge devices;
3. The neural network model quantization method and device are easy to implement, can be applied to various Transformer-based pre-training language models such as BERT, RoBERTa and BART, and therefore have a wide application range.
Drawings
FIG. 1 is a schematic diagram of the distribution of outliers of the LayerNorm output in the BERT pre-training language model;
FIG. 2 is a schematic diagram of the distribution of outliers of the scaling parameter γ in the BERT pre-training language model;
FIG. 3 is a schematic diagram of the distribution of outliers of the Non-Scaling LayerNorm output X̃′ in the BERT pre-training language model;
FIG. 4 is a schematic flow diagram of the quantization process of the pre-training language model before the scaling parameter transfer;
FIG. 5 is a schematic flow diagram of the quantization process of the pre-training language model after the scaling parameter transfer;
FIG. 6 is a diagram illustrating fast convergence of quantization step sizes using a coarse-to-fine paradigm;
fig. 7 is a schematic diagram of a neural network model quantization apparatus according to an embodiment of the present invention.
Detailed Description
The technical contents of the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
Daniel Jurafsky and James H. Martin discuss the basic structure of the Transformer and its working principle in natural language processing tasks in detail in the forthcoming third edition of Speech and Language Processing (see https://web.stanford.edu/~jurafsky/slp3/), so it is not described in detail here.
In further work, researchers found that such Transformer-based pre-training language models have an inherent quantization bottleneck. For example, these pre-training language models contain large outliers, and these sharp outliers (e.g., values close to 100) exhibit structured features, such as often appearing in particular dimensions or on particular words. This can cause a loss of up to 12% even for 8-bit quantization. For this problem, existing work has made some observations on the outliers, finding that they often appear in certain specific dimensions and often on words such as [SEP]. However, that work did not examine the outliers further, and instead circumvented the problem with finer quantization granularity, which increases computational complexity and is not necessarily suitable for practical deployment.
For this reason, we studied the effect of outlier clipping in depth and found that, for Transformer-based pre-training language models such as BERT, RoBERTa and BART, different outliers have different effects on performance when clipped. The more aggressive outliers contributed by a few words (e.g., the [SEP] delimiter) can be clipped sharply and safely with little impact on accuracy. Therefore, the neural network model quantization method provided by the embodiment of the invention first preliminarily detects the clipping range from the perspective of words, and then optimizes the clipping range in a fine-grained manner, so that less meaningful signals are skipped quickly and more attention is paid to the important parts. This is explained in more detail below.
it has been mentioned above that for a pre-trained language model based on Transformer, the quantization or lower (4-bit) quantization perception training after the standard 6/8-bit training results in a severe degradation of accuracy. By studying the degradation of precision and quantization errors caused by each quantizer, we recognize that the LayerNorm (layer normalized) structure and the output of the GELU activation function are the most problematic tensors. The LayerNorm structure's scaling parameters cause the output distribution to have sharp outliers, which should be the cause of large quantization errors.
To this end, we first focused on exploring the potential causes of the outliers. Since it is the output of the LayerNorm structure that exhibits the outliers, the basic form of the LayerNorm structure is analyzed first, namely:

\tilde{X}_{t,j} = \gamma_j \cdot \frac{X_{t,j} - \mu_t}{\sigma_t} + \beta_j        (1)

where the input matrix in the pre-training language model is denoted X, the corresponding input vectors are denoted x, the subscripts t and j denote the t-th token and the j-th dimension respectively, and the corresponding output is denoted X̃; "·" denotes scalar multiplication; μ_t and σ_t respectively denote the mean and standard deviation, at the token level, of the LayerNorm input vector x_t; γ_j denotes the scaling parameter of the j-th dimension and β_j the shift (translation) parameter of the j-th dimension, t and j both being positive integers.
With reference to FIG. 1 to FIG. 3, we further analyzed the distribution of the LayerNorm parameters in the BERT pre-training language model and found that, in the same dimensions in which the output exhibits outliers, the values of the scaling parameter γ are much sharper than the others. Moreover, the shift parameter β is small, so we ignore it when identifying the key factor. From this it can be deduced that the scaling parameter γ is the key reason why the output of the LayerNorm structure contains outliers, and that γ can be extracted from the equation to eliminate its influence.
By plotting the distribution of the LayerNorm parameters and of its output, we found that the scaling parameter and the output have outliers in the same dimensions. By extracting the scaling parameter γ, we obtain the Non-Scaling LayerNorm (non-scaling layer normalization) shown in equation (2):

\tilde{X}'_{t,j} = \frac{X_{t,j} - \mu_t}{\sigma_t} + \frac{\beta_j}{\gamma_j}        (2)

We find that the resulting tensor X̃′ has a much milder distribution, and computing the quantization error metric shows that the quantization error of X̃′ is indeed smaller.
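To make equations (1) to (3) concrete, the following PyTorch sketch builds a Non-Scaling LayerNorm from an ordinary nn.LayerNorm; the class and variable names are illustrative assumptions and not part of the claimed implementation:

    import torch
    import torch.nn as nn

    class NonScalingLayerNorm(nn.Module):
        """Sketch of equation (2): per-token normalization is kept, the shift
        becomes beta/gamma, and the scaling parameter gamma is held aside so
        that it can later be migrated into the subsequent module."""
        def __init__(self, ln: nn.LayerNorm):
            super().__init__()
            self.eps = ln.eps
            self.gamma = ln.weight.detach().clone()          # kept for migration
            self.shift = (ln.bias / ln.weight).detach()      # beta_j / gamma_j

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            mu = x.mean(dim=-1, keepdim=True)
            var = x.var(dim=-1, unbiased=False, keepdim=True)
            return (x - mu) / torch.sqrt(var + self.eps) + self.shift

    # Equation (3): multiplying the Non-Scaling output by gamma recovers the
    # original LayerNorm output, e.g.
    #   ln = nn.LayerNorm(768); ns = NonScalingLayerNorm(ln)
    #   x = torch.randn(4, 16, 768)
    #   assert torch.allclose(ln(x), ns(x) * ns.gamma, atol=1e-5)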
Based on the above findings, we extract the scaling parameter γ from the LayerNorm structure and transfer it to a subsequent module, such as the residual connection module, to obtain a pre-training language model that is more favorable for quantization.
Specifically, the LayerNorm structure in equation (1) is first rewritten as the Non-Scaling LayerNorm structure in equation (2), and the scaling parameter γ is then transferred. The relationship between the two is shown in equation (3):

\tilde{X}_{t,j} = \gamma_j \cdot \tilde{X}'_{t,j}        (3)
since many of the pre-trained language models such as LayerNorm of BERT, roBERTA, BART, etc. are followed by residual concatenation modules, we consider scaling parameters
Figure 957025DEST_PATH_IMAGE006
Transfer to the two branches of the residual concatenation module.
Here, for the linear transformation branch, we have the following equation (4) to absorb the shifted scaling parameters with weights.
Figure 863801DEST_PATH_IMAGE017
(4)
The above equation (4) represents a linear variation, where x represents an input vector,
Figure 750330DEST_PATH_IMAGE006
representing scaling parameters that act on the input vector,Wthe weight of the linear transformation branch is represented,
Figure 691741DEST_PATH_IMAGE003
represents the Hadamard product of the matrix,nis a positive integer. Equation (4) shows that the scaling parameters that act on the input vector can be transferred into the weights of the subsequent modules.
For a short circuit (short) branch, in the embodiment of the present invention, a scaling parameter may be directly multiplied on the short circuit branch
Figure 274032DEST_PATH_IMAGE006
FIG. 4 shows the quantization process of the pre-training language model before the scaling parameter transfer, and FIG. 5 shows the quantization process after the scaling parameter transfer. As can be seen by comparing FIG. 4 with FIG. 5, in the embodiment of the invention the "Quant" operation is applied to the output of the Non-Scaling LayerNorm; the linear transformation branch then performs the matrix multiplication with the quantized weights into which γ has been absorbed, while the other, shortcut branch is directly multiplied by the scaling parameter and passes through the "dequantization" step. In practice this means that the computation involving the scaling parameter γ is delayed. Thus, such a transfer of the scaling parameters does not increase the computational overhead.
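A minimal PyTorch sketch of this migration, assuming the LayerNorm output feeds an nn.Linear on the linear branch (the function name and the handling of the shortcut branch are illustrative assumptions, not the patented implementation):

    import torch
    import torch.nn as nn

    @torch.no_grad()
    def migrate_gamma(ln: nn.LayerNorm, linear: nn.Linear) -> torch.Tensor:
        """Turn `ln` into its Non-Scaling form and absorb gamma into the
        following linear branch (equation (4)); the returned gamma is what
        the shortcut branch must be multiplied by, as in FIG. 5."""
        gamma = ln.weight.detach().clone()
        # Equation (4): (gamma ⊙ x) @ W.T == x @ (W * gamma).T, since nn.Linear
        # stores W as (out_features, in_features) and acts on the last dimension,
        # so each input column of W is scaled by the corresponding gamma_j.
        linear.weight.mul_(gamma)
        # Rewrite the LayerNorm as Non-Scaling LayerNorm: gamma -> 1, beta -> beta/gamma.
        ln.bias.div_(gamma)
        ln.weight.fill_(1.0)
        return gamma

In a BERT-style block, the returned gamma would then be multiplied onto the shortcut branch together with the "dequantization" of the quantized Non-Scaling LayerNorm output, so the extra multiplication is fused with dequantization and, as noted above, adds no computational overhead.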
On the other hand, we first clipped the activation values on full-precision pre-training language models such as BERT, RoBERTa and BART, and judged the specific influence of clipping by observing how much the accuracy dropped. We found that clipping different outliers has very different impacts on accuracy. For example, some outliers, although sharper, can be clipped aggressively without affecting the final accuracy, while clipping others has a large impact. We also found that the ranges of outliers contributed by different words vary widely; the outliers with wide, long-tailed coverage are less important and correspond to only a small number of words. Therefore, we first identify the relatively important outliers from the dimension of words.
On this basis, we try to find a suitable clipping value and thereby a suitable quantization step size. Here the importance of the outliers must be taken into account, because some outliers, although sharper, are less important, while clipping others causes a large change in accuracy.
To this end, as shown in FIG. 6, we design a coarse-to-fine paradigm to find the quantization step size (or clipping value) s that minimizes the quantization loss, described as follows:

s^{*} = \arg\min_{s} L(s)        (5)

where L(s) denotes the loss corresponding to quantization step size s, defined as the ℓ2 norm of the difference between the output f̂ of the quantized network and the output of the unquantized network f.
In the coarse stage, the outliers that cover a wide range but are insignificant need to be skipped quickly. In one embodiment of the invention, we use the maximum value of each word's token embedding as the representative of its outliers (and the minimum value of each word's token embedding as the representative of its negative outliers), so that we obtain a new tensor:

\mathbf{m} = \{ \max(x_1), \max(x_2), \ldots, \max(x_T) \}        (6)

where T denotes the total number of words, x_1 denotes the first word, x_2 the second word, and so on, and m denotes the set of all word maxima.
In the subsequent fine stage, we enumerate the clipping rate and compute the corresponding clipping value with which the original tensor is clipped:

c_u = Q_{\alpha}(\mathbf{m})        (7)

where the upper limit c_u of the clipping range is obtained as the α-quantile of the set m of all word maxima, determined by the α-quantile function Q_α. Similarly, following the above steps, the minimum values of all words can be taken and the lower limit of the clipping range calculated according to the α-quantile; the quantization step size s is then calculated from the upper and lower limits of the clipping range, the corresponding loss function L(s) is computed, and the clipping value/quantization step size with the minimum loss is finally selected.
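The coarse-to-fine search of equations (5) to (7) can be sketched in PyTorch as follows; the candidate α values, the 6-bit asymmetric uniform quantizer and the function names are illustrative assumptions rather than the exact procedure claimed by the patent:

    import torch

    def fake_quant(x, lo, hi, bits=6):
        # Assumed asymmetric uniform quantizer over the clipping range [lo, hi].
        s = (hi - lo) / (2 ** bits - 1)                      # quantization step size
        q = torch.round((torch.clamp(x, lo, hi) - lo) / s)
        return q * s + lo

    def search_clip_range(x, out_fn, bits=6, alphas=(1.0, 0.999, 0.99, 0.95, 0.9)):
        """x: activation tensor of shape (T, d), one row per word (token).
        out_fn: maps activations to the network output, used for L(s) in eq. (5)."""
        token_max = x.max(dim=-1).values          # coarse stage, eq. (6): per-word maxima
        token_min = x.min(dim=-1).values          # and minima for the negative side
        fp_out = out_fn(x)                        # unquantized reference output
        best_range, best_loss = None, float("inf")
        for a in alphas:                          # fine stage, eq. (7): alpha-quantile clipping
            hi = torch.quantile(token_max, a).item()
            lo = torch.quantile(token_min, 1.0 - a).item()
            loss = torch.norm(out_fn(fake_quant(x, lo, hi, bits)) - fp_out, p=2).item()
            if loss < best_loss:                  # eq. (5): keep the setting with minimum loss
                best_range, best_loss = (lo, hi), loss
        return best_range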
In an embodiment of the present invention, for a full-precision pre-training language model, a pre-training language model that is more amenable to accurate quantization can be obtained through the following steps:
1. Scaling parameter transfer is carried out for the LayerNorm structures in the pre-training language model to obtain an equivalent floating-point pre-training language model.
2. Based on a small amount of data, a clipping range is determined on the basis of the floating-point pre-training language model obtained in the previous step, using the word-based clipping step.
For example, if the pre-training language model is used to decide whether an input sentence expresses positive or negative emotion, the small amount of data mentioned above refers to 100 or 200 pieces of emotion-classification data. It will be appreciated that when the pre-training language model is used for other natural language processing tasks, the content and scale of the small amount of data need to be adjusted accordingly. This is conventional practice well known to those skilled in the art and is not described further here.
3. A quantization step size is calculated according to the clipping range obtained in the previous step to obtain the final quantized pre-training language model.
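As a purely illustrative numerical example of step 3 (the clipping range, bit-width and asymmetric uniform quantizer below are assumptions, not values taken from the patent), the quantization step size follows directly from the clipping range:

    import torch

    lower, upper, bits = -8.0, 56.0, 6                 # hypothetical clipping range from step 2
    s = (upper - lower) / (2 ** bits - 1)              # step 3: quantization step size ~ 1.016

    def quantize(x: torch.Tensor) -> torch.Tensor:
        # Clip to the calibrated range, snap to the uniform grid, then dequantize.
        q = torch.round((torch.clamp(x, lower, upper) - lower) / s)
        return q * s + lower

    x = torch.tensor([-20.0, 0.3, 3.7, 95.0])          # 95.0 is an outlier beyond the range
    print(quantize(x))                                 # outliers are clipped to [lower, upper]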
In order to verify the actual effect of the neural network model quantization method provided by the invention, experiments were carried out on actual natural language processing tasks, as described below.
For the BERT, RoBERTa and BART pre-training language models, on the GLUE benchmark classification tasks, the neural network model quantization method improves over the prior art by about 8% (BERT), 8% (RoBERTa) and 12% (BART) under the 6-bit setting. Specifically, for classification tasks such as emotion classification (whether the emotion is positive or negative) and sentence-similarity judgment (how similar two sentences are), accurate results can be produced with high accuracy under low power consumption; for example, whether the emotion expressed by a sentence is positive or negative, or how similar two input sentences are, can be obtained quickly. On the SQuAD reading comprehension tasks, the method improves over the prior art by about 4%, 9% and 2% (the three pre-training language models on SQuAD v1) and 4%, 9% and 3% (the three pre-training language models on SQuAD v2) under the 6-bit setting. Specifically, for the reading comprehension task, i.e., extracting an answer from a text passage and a question, the neural network model quantization method can locate the answer in the original text efficiently and quickly and output it. For the BART pre-training language model, on the CNN/DailyMail and XSum generation tasks, the method improves over the prior art by about 3% on CNN/DailyMail and 4% on XSum under the 6-bit setting. Specifically, for the text generation task, i.e., generating an output text from an input text, the output text can be generated more quickly and accurately using the neural network model quantization method.
On the basis of the above neural network model quantization method for natural language processing, the invention further provides a neural network model quantization device for natural language processing. As shown in FIG. 7, the neural network model quantization device comprises one or more processors 71 and a memory 72. The memory 72 is coupled to the processor 71 and is used for storing one or more programs which, when executed by the one or more processors 71, cause the one or more processors 71 to implement the natural-language-processing-oriented neural network model quantization method of the above embodiment.
The processor 71 is configured to control the overall operation of the apparatus for quantizing a neural network model for natural language processing, so as to complete all or part of the steps of the method for quantizing a neural network model for natural language processing. The processor 71 may be a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processing (DSP) chip, or the like. The memory 72 is used to store various types of data to support the operation of the natural language processing oriented neural network model quantization method, such data may include, for example, instructions for any application or method operating on the natural language processing oriented neural network model quantization device, as well as application-related data.
The memory 72 may be implemented by any type or combination of volatile or non-volatile storage devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or the like.
In an exemplary embodiment, the apparatus for quantizing a neural network model for natural language processing may be specifically implemented by a computer chip or an entity, or implemented by a product with a certain function, and is configured to perform the method for quantizing a neural network model for natural language processing, and achieve a technical effect consistent with the method. One typical embodiment is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a vehicle human interaction device, a police checkpoint screening device, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
In another exemplary embodiment, the present invention further provides a computer readable storage medium including program instructions, which when executed by a processor, implement the steps of the natural language processing oriented neural network model quantization method in any one of the above embodiments. For example, the computer readable storage medium may be a memory including program instructions executable by a processor of a natural language processing oriented neural network model quantization apparatus to perform the above natural language processing oriented neural network model quantization method, and achieve technical effects consistent with the above method.
Compared with the prior art, the neural network model quantification method and device for natural language processing provided by the invention have the following technical characteristics:
1. By means of scaling parameter transfer, the scaling parameters in the LayerNorm structure are transferred to the subsequent residual connection module; combined with the word-based clipping step, a pre-training language model that is more amenable to quantization can be obtained without extra computational overhead;
2. The computational overhead required by the quantized pre-training language model is significantly reduced, which is particularly suitable for the low-power requirements of edge devices;
3. The neural network model quantization method and device are easy to implement, can be applied to various Transformer-based pre-training language models such as BERT, RoBERTa and BART, and therefore have a wide application range.
The method and device for quantizing a neural network model for natural language processing provided by the invention have been explained in detail above. It will be apparent to those skilled in the art that any obvious modification made to them without departing from the essence of the invention will constitute an infringement of the patent right of the invention, and the corresponding legal liability shall be borne.

Claims (3)

1. A neural network model quantification method for natural language processing is characterized by comprising the following steps:
(1) For the LayerNorm structure in the full-precision pre-training language model, extracting the scaling parameters in the LayerNorm structure and transferring them to the weights of the subsequent module, to obtain an equivalent floating-point pre-training language model; wherein, when the subsequent module is a residual connection module, for the linear transformation branch in the residual connection module, the transferred scaling parameters are absorbed by the following formula:

(\gamma \odot x)\, W^{(n)} = x \left( \gamma^{\top} \odot W^{(n)} \right)

where x denotes the input vector, γ denotes the scaling parameters acting on the input vector, W^{(n)} denotes the weight of the n-th linear transformation branch, ⊙ denotes the Hadamard product of matrices, and n is a positive integer;
(2) Determining a clipping range on the basis of the floating-point pre-training language model obtained in step (1) by using a word-based clipping step, based on a small amount of data; wherein the maximum value of each word's token embedding is used as the representative of its outliers, and the minimum value of each word's token embedding is used as the representative of its negative outliers; the clipping rate is enumerated on the set of the maximum values of all words and the corresponding clipping value is calculated; the α-quantile of the set of all word maxima is determined according to the α-quantile function to obtain the upper limit of the clipping range; the minimum values of all words are taken, and the lower limit of the clipping range is calculated according to the α-quantile;
(3) Calculating a quantization step size s from the upper and lower limits of the clipping range, calculating the corresponding loss function L(s), and finally selecting the quantization step size with the minimum loss, to obtain the quantized pre-training language model.
2. The neural network model quantization method of claim 1, wherein:
and when the subsequent module is a residual connecting module, directly multiplying the short-circuit branch in the residual connecting module by the scaling parameter.
3. A natural language processing-oriented neural network model quantization apparatus, characterized by comprising a processor and a memory, the processor reading a computer program in the memory for executing the neural network model quantization method of claim 1 or 2.
CN202211162125.5A 2022-09-23 2022-09-23 Neural network model quantification method and device for natural language processing Active CN115238893B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211162125.5A CN115238893B (en) 2022-09-23 2022-09-23 Neural network model quantification method and device for natural language processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211162125.5A CN115238893B (en) 2022-09-23 2022-09-23 Neural network model quantification method and device for natural language processing

Publications (2)

Publication Number Publication Date
CN115238893A CN115238893A (en) 2022-10-25
CN115238893B true CN115238893B (en) 2023-01-17

Family

ID=83667365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211162125.5A Active CN115238893B (en) 2022-09-23 2022-09-23 Neural network model quantification method and device for natural language processing

Country Status (1)

Country Link
CN (1) CN115238893B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116306796B (en) * 2023-05-17 2023-09-15 北京智源人工智能研究院 Model self-growth training acceleration method and device, electronic equipment and storage medium
CN116451770B (en) * 2023-05-19 2024-03-01 北京百度网讯科技有限公司 Compression method, training method, processing method and device of neural network model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257751A (en) * 2020-09-23 2021-01-22 华为技术有限公司 Neural network pruning method
CN114444679A (en) * 2020-11-06 2022-05-06 山东产研鲲云人工智能研究院有限公司 Method and system for quantizing binarization input model and computer readable storage medium
CN114580281A (en) * 2022-03-04 2022-06-03 北京市商汤科技开发有限公司 Model quantization method, apparatus, device, storage medium, and program product

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657780A (en) * 2018-06-15 2019-04-19 清华大学 A kind of model compression method based on beta pruning sequence Active Learning
CN113673260A (en) * 2020-05-15 2021-11-19 阿里巴巴集团控股有限公司 Model processing method, device, storage medium and processor
US20220292360A1 (en) * 2021-03-15 2022-09-15 Nvidia Corporation Pruning neural networks

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257751A (en) * 2020-09-23 2021-01-22 华为技术有限公司 Neural network pruning method
CN114444679A (en) * 2020-11-06 2022-05-06 山东产研鲲云人工智能研究院有限公司 Method and system for quantizing binarization input model and computer readable storage medium
CN114580281A (en) * 2022-03-04 2022-06-03 北京市商汤科技开发有限公司 Model quantization method, apparatus, device, storage medium, and program product

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Centered Weight Normalization; Lei Huang et al.; 2017 IEEE International Conference on Computer Vision (ICCV); 2017-12-25; pp. 2822-2830 *
Neural Network Models in Natural Language Processing; Feng Zhiwei et al.; Contemporary Foreign Language Studies; 2022-08-31 (No. 4); pp. 98-154 *

Also Published As

Publication number Publication date
CN115238893A (en) 2022-10-25

Similar Documents

Publication Publication Date Title
Dauphin et al. Language modeling with gated convolutional networks
CN115238893B (en) Neural network model quantification method and device for natural language processing
CN106502985B (en) neural network modeling method and device for generating titles
Li et al. Towards binary-valued gates for robust lstm training
KR20210029785A (en) Neural network acceleration and embedding compression system and method including activation sparse
US20200159832A1 (en) Device and text representation method applied to sentence embedding
CN110782008B (en) Training method, prediction method and device of deep learning model
CN108460028B (en) Domain adaptation method for integrating sentence weight into neural machine translation
CN112257858A (en) Model compression method and device
CN110457718B (en) Text generation method and device, computer equipment and storage medium
CN107292382A (en) A kind of neutral net acoustic model activation primitive pinpoints quantization method
CN111368037A (en) Text similarity calculation method and device based on Bert model
CN110781686B (en) Statement similarity calculation method and device and computer equipment
CN112818110B (en) Text filtering method, equipment and computer storage medium
CN115221846A (en) Data processing method and related equipment
CN116308754B (en) Bank credit risk early warning system and method thereof
CA3232610A1 (en) Convolution attention network for multi-label clinical document classification
CN113505193A (en) Data processing method and related equipment
CN114064852A (en) Method and device for extracting relation of natural language, electronic equipment and storage medium
Wei et al. EdgeCRNN: an edge-computing oriented model of acoustic feature enhancement for keyword spotting
CN116304748A (en) Text similarity calculation method, system, equipment and medium
EP3362951B1 (en) Neural random access machine
CN111259147A (en) Sentence-level emotion prediction method and system based on adaptive attention mechanism
CN114861907A (en) Data calculation method, device, storage medium and equipment
CN116450813B (en) Text key information extraction method, device, equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant