CN114418088A - Model training method - Google Patents
- Publication number
- CN114418088A (Application No. CN202111628710.5A)
- Authority
- CN
- China
- Prior art keywords
- tensor
- model
- linear layer
- quantized
- pint
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 76
- 238000012549 training Methods 0.000 title claims abstract description 70
- 238000004364 calculation method Methods 0.000 claims abstract description 52
- 238000013139 quantization Methods 0.000 claims abstract description 52
- 239000011159 matrix material Substances 0.000 claims abstract description 46
- 230000006870 function Effects 0.000 claims description 45
- 230000008569 process Effects 0.000 claims description 38
- 238000010606 normalization Methods 0.000 claims description 14
- 230000007246 mechanism Effects 0.000 claims description 10
- 238000013135 deep learning Methods 0.000 claims description 9
- 230000004913 activation Effects 0.000 claims description 7
- 230000000644 propagated effect Effects 0.000 claims description 7
- 238000005192 partition Methods 0.000 claims description 4
- 230000001902 propagating effect Effects 0.000 claims description 4
- 238000007306 functionalization reaction Methods 0.000 claims description 2
- 238000013528 artificial neural network Methods 0.000 description 16
- 238000010586 diagram Methods 0.000 description 14
- 238000003058 natural language processing Methods 0.000 description 6
- 238000000926 separation method Methods 0.000 description 4
- 238000013519 translation Methods 0.000 description 3
- 238000006243 chemical reaction Methods 0.000 description 2
- 238000011156 evaluation Methods 0.000 description 2
- 230000014509 gene expression Effects 0.000 description 2
- 239000003292 glue Substances 0.000 description 2
- 230000006978 adaptation Effects 0.000 description 1
- 238000004458 analytical method Methods 0.000 description 1
- 230000002457 bidirectional effect Effects 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000012512 characterization method Methods 0.000 description 1
- 238000007796 conventional method Methods 0.000 description 1
- 238000013500 data storage Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 230000008034 disappearance Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000004880 explosion Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000000306 recurrent effect Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000004044 response Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Machine Translation (AREA)
Abstract
The application provides a model training method, which comprises: defining a new quantized linear layer; quantizing all elements in the multi-dimensional input tensor of the quantized linear layer into the PINT data format and quantizing all elements in the tensor to be calculated of the quantized linear layer into the PINT data format; performing matrix multiplication on the quantized multi-dimensional input tensor and the quantized tensor to be calculated to obtain a fixed-point result; dequantizing the fixed-point result into floating-point numbers and propagating them to the subsequent network layer; and replacing the original linear layer in the model with the quantized linear layer, and training the model based on floating-point numbers and the PINT data format. By developing a quantized linear layer based on the PINT data format, applying this low-bit, high-representation-capability format to model training, and replacing the linear layers used in the model with the quantized linear layer, the method effectively reduces the requirements on data calculation, storage and the like while the accuracy of the trained model changes only slightly.
Description
Technical Field
The application relates to the technical field of natural language processing, in particular to a model training method.
Background
In recent years, models such as BERT, which are based on the Transformer network, have performed excellently in fields such as natural language processing. The Transformer is a classic model for NLP (Natural Language Processing) proposed by the Google team in 2017. It uses the self-attention mechanism and does not adopt the sequential structure of an RNN (Recurrent Neural Network), so the model can be trained in parallel and can capture global information of a sample. Popular models today, such as BERT (Bidirectional Encoder Representations from Transformers), are also implemented on the basis of the Transformer.
Taking the BERT model as an example, BERT is short for Bidirectional Encoder Representations from Transformers and is a pre-trained language representation model. The BERT model uses the Transformer encoder as its main structure. Instead of the traditional one-way language model, or the shallow splicing of two one-way language models, used for pre-training in the past, it adopts a training scheme of pre-training (Pre-training) and fine-tuning (Fine-Tuning), so that deep bidirectional language representations can be generated.
Such language models need to be trained on resource-limited edge computing platforms for reasons of online learning and data privacy. However, such models usually have huge network structures and large numbers of parameters, so the training process requires substantial computing and storage resources.
Disclosure of Invention
The application provides a model training method, which aims to solve the problem that huge calculation and storage resources are needed in the model training process.
A model training method, comprising:
defining a new quantized linear layer;
quantizing all elements in the multi-dimensional input tensor of the quantized linear layer into the PINT data format by using a preset quantization function, wherein the multi-dimensional input tensor refers to a multi-dimensional eigenvalue tensor in the forward propagation stage, the multi-dimensional eigenvalue tensor being propagated forward from the adjacent network layer; in the back propagation stage and the weight gradient calculation stage, the multi-dimensional input tensor is a multi-dimensional error tensor, the multi-dimensional error tensor being propagated back from the adjacent network layer; the PINT data format is a piecewise integer data format;
quantizing all elements in the tensor to be calculated of the quantized linear layer into the PINT data format by using the preset quantization function, wherein in the forward propagation stage and the back propagation stage the tensor to be calculated refers to the weight matrix of the quantized linear layer, and in the weight gradient calculation stage the tensor to be calculated refers to the eigenvalue tensor calculated in the forward propagation stage;
performing matrix multiplication calculation on the quantized multidimensional input tensor and the tensor to be calculated to obtain a fixed point result;
dequantizing the fixed point result to a floating point number, and propagating the floating point number to a subsequent network layer;
replacing an original linear layer in the model with the quantized linear layer, and training the model based on the floating point number and the PINT data format.
Further, the preset quantization function is quantization, and the preset quantization function is defined in Python programming language.
Further, the function of the quantized linear layer is the same as the linear layer function originally in the model.
Further, the quantized linear layer is represented by the PINT data format, and the multidimensional error tensor, the weight matrix and the floating point number are all represented by 32-bit floating point numbers, so that the model forms a mixed precision training method.
Further, the PINT data format includes two parameters of a data bit width and a partition point, and the PINT value is divided into three parts of coding spaces by combining the data bit width and the partition point, where each coding space corresponds to a scaling factor.
Further, the preset quantization function implements a quantization process by:
setting an overall scaling factor;
calculating each scaling factor corresponding to the three parts of coding space in the quantization process;
and determining the coding space to which the quantized numerical value belongs according to the whole scaling factor and each scaling factor, and obtaining a PINT data format corresponding to the numerical value.
Further, the overall scaling factor and each scaling factor are calculated by a preset formula.
Further, the computation of the network layer includes the forward propagation stage, the backward propagation stage, and the weight gradient computation stage.
Further, the quantized linear layer is defined in a Pytorch deep learning framework.
Further, the network layer includes a linear layer, an embedding layer, an attention mechanism, a residual concatenation, an activation function, and a normalization.
According to the above technical scheme, the application provides a model training method comprising the following steps: defining a new quantized linear layer; quantizing all elements in the multi-dimensional input tensor of the quantized linear layer into the PINT data format using a preset quantization function, quantizing all elements in the tensor to be calculated of the quantized linear layer into the PINT data format using the preset quantization function, and performing matrix multiplication on the quantized multi-dimensional input tensor and the quantized tensor to be calculated to obtain a fixed-point result; dequantizing the fixed-point result into floating-point numbers and propagating them to the subsequent network layer; and replacing the original linear layer in the model with the quantized linear layer, and training the model based on floating-point numbers and the PINT data format. By developing a quantized linear layer based on the PINT data format, applying this low-bit, high-representation-capability format to model training, and replacing the linear layers used in the model with the quantized linear layer, the method effectively reduces the requirements on data calculation, storage and the like while the accuracy of the trained model changes only slightly.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 is a schematic structural diagram of a Transformer model shown in an embodiment of the present application;
FIG. 2 is a schematic diagram of word embedding and position coding according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a self-attention computing process shown in an embodiment of the present application;
FIG. 4 is a schematic view of a multi-head attention calculation flow shown in the embodiment of the present application;
FIG. 5 is a schematic diagram of a feedforward neural network shown in an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating a Mask operation process according to an embodiment of the present application;
fig. 7 is a schematic diagram of a PINT (8,3) data format shown in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a BERT model shown in an embodiment of the present application;
FIG. 9 is a diagram illustrating a partially quantized BERT model according to an embodiment of the present application;
FIG. 10 is a schematic diagram illustrating a forward propagation process of a linear layer according to an embodiment of the present application;
FIG. 11 is a schematic diagram illustrating a data flow of a linear layer in a training process according to an embodiment of the present application;
FIG. 12 is a schematic diagram illustrating the backward propagation of a linear layer according to an embodiment of the present application;
FIG. 13 is a schematic diagram illustrating weight gradient calculation of a linear layer according to an embodiment of the present application;
fig. 14 is a schematic diagram of a quantization and calculation process of the QLinear layer in a back propagation stage according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
It should be noted that the brief descriptions of the terms in the present application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.
In the field of natural language processing, models usually have huge network structures and large numbers of parameters, and therefore require enormous computing and storage resources during training; the model training method of the present application is proposed for this application scenario. For ease of understanding, the BERT model is used as an example below, but it should be noted that the application is not limited to the BERT model; any similar model that uses linear layers can use the method provided by the present application.
In some embodiments, changing the data format used during training can significantly affect the computational and storage requirements. Taking the BERT model as an example, the method quantizes part of the calculation in the training process of the BERT model based on a special low-bit data format, which effectively reduces the requirements on data calculation, storage and the like while the accuracy of the trained model changes only slightly, and provides effective support for the design of hardware accelerators for model training.
In order to facilitate understanding of technical solutions of the embodiments of the present application, before describing specific embodiments of the present application, some technical terms in the technical fields described in the embodiments of the present application are first briefly explained.
Taking the Transformer model as an example, FIG. 1 is a schematic structural diagram of the Transformer model shown in the embodiment of the present application. As shown in FIG. 1, the entire model mainly includes an encoder group, a decoder group, embedding layers, position coding, a Softmax classifier, and the like. The encoder group consists of N encoders with the same structure, and the decoder group consists of N decoders with the same structure; N = 6 in the Transformer example model. The encoders and decoders mainly contain structural components such as multi-head attention, a feedforward neural network, residual connection, and normalization. The Softmax classifier uses the Softmax function, i.e., the normalized exponential function, which "compresses" a K-dimensional vector z containing arbitrary real numbers into another K-dimensional real vector σ(z) such that each element lies in the range (0, 1) and all elements sum to 1; it is used here to classify the output of the decoder and is commonly used in multi-class classification problems.
The working principle of the Transformer model is explained by taking text translation (translating a Chinese sentence into an English sentence) as an example. The Chinese sentence to be translated firstly passes through an input embedding layer and a position coding layer, and generates a corresponding input tensor (tensor is a multi-dimensional array which is a common concept in the deep learning technology) according to the dictionary and the word position information. In the encoding phase, the tensor is fed into the encoder bank to generate an encoding tensor. Inside each encoder, the tensor needs to sequentially pass through structures such as multi-head attention, residual connection and normalization, a feedforward neural network, residual connection and normalization and the like, and finally the encoding tensor with the same dimensionality as the tensor of the input sentence is obtained through calculation. The encoded tensor is transmitted to the decoder group to participate in the calculation of the decoding stage, and the calculation process of the tensor inside each structure is described in turn below.
1) Multi-Head Attention (Multi-Head Attention, MHA)
Acquisition of the encoder input: as shown in FIG. 2, a sentence to be translated (e.g., "I have a cat") passes through the embedding layer and the position coding layer to obtain an input tensor; d_model, one of the hyper-parameters of the model, is the length of the input tensor, i.e., the embedding dimension. The input matrix X is generated by splicing the input tensors, and its size is s × d_model (where s is the sentence length).
The self-attention mechanism: as shown in FIG. 3, the matrix X input to the encoder is multiplied by three weight matrices and the respective bias matrices (B_Q, B_K, B_V) are added. The three weight matrices are the Query weight matrix (W_Q), the Key weight matrix (W_K) and the Value weight matrix (W_V), and three result matrices are obtained: the query matrix Q, the key matrix K and the value matrix V. Each weight matrix has size d_model × d_k (d_k is one of the hyper-parameters of the model; in the Transformer example model, d_k = 64), so each result matrix has size s × d_k. Q is multiplied by the transpose K^T of K, the matrix S is obtained through a division and a Softmax operation, and S is multiplied by the value matrix V to obtain the output matrix Z of the self-attention structure, where the size of Z is s × d_k.
The multi-head attention consists of h self-attention mechanisms with the same structure (h is one of the hyper-parameters of the model; h = 8 in the Transformer example model). As shown in FIG. 4, the input matrix X is sent to each self-attention module, and the output result matrices Z_1, Z_2, …, Z_h are spliced to form a matrix Z of size s × (d_k × h), i.e., s × d_model. Z passes through a linear layer (nn.Linear): it is multiplied by the weight matrix W_H and an offset B_H is added, yielding the output matrix H of the MHA (of size s × d_model).
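For reference, the following is a minimal sketch of the self-attention and multi-head attention calculation described above, written with PyTorch tensor operations. The dimension names (s, d_model, d_k, h) follow the notation in this section; the bias matrices and the Mask details are omitted, and the function names are illustrative rather than taken from the patent.

```python
import math
import torch

def self_attention(X, W_Q, W_K, W_V):
    # X: (s, d_model); W_Q, W_K, W_V: (d_model, d_k)
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                            # each of size (s, d_k)
    S = torch.softmax(Q @ K.T / math.sqrt(K.shape[-1]), dim=-1)    # division + Softmax
    return S @ V                                                   # Z: (s, d_k)

def multi_head_attention(X, heads, W_H):
    # heads: list of h (W_Q, W_K, W_V) tuples; W_H: (h * d_k, d_model)
    Z = torch.cat([self_attention(X, *w) for w in heads], dim=-1)  # (s, h * d_k)
    return Z @ W_H                                                 # H: (s, d_model)

s, d_model, d_k, h = 5, 512, 64, 8
X = torch.randn(s, d_model)
heads = [tuple(torch.randn(d_model, d_k) for _ in range(3)) for _ in range(h)]
W_H = torch.randn(h * d_k, d_model)
print(multi_head_attention(X, heads, W_H).shape)                   # torch.Size([5, 512])
```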
2) Residual concatenation and normalization
The input of the multi-head attention is X and the output is H; addition and normalization operations are then applied to H in turn to obtain the result Y. The addition operation refers to X + H, i.e., a residual connection, which is usually used to ease the training of multi-layer networks by letting the network focus on the current difference; it is often used in networks such as ResNet. The normalization used in the Transformer model is Layer Normalization (LN), which is commonly used with RNN structures. LN considers the inputs of all dimensions of a layer together, calculates the layer's mean input value and input variance, and then converts the input of each dimension with the same normalization operation, thereby alleviating gradient vanishing/explosion, accelerating training, and providing regularization.
3) Feedforward neural network
The input X passes through the multi-head attention layer and then through residual connection and normalization to obtain the output Y, and Y enters the feedforward neural network. The feedforward neural network consists of two fully connected layers, i.e., linear layers (e.g., nn.Linear), where the activation function of the first layer is the ReLU function and the second layer uses no activation function. ReLU refers to the Rectified Linear Unit, a commonly used activation function in neural networks, whose expression is f(x) = max(0, x). The output of the feedforward neural network passes through residual connection and normalization to obtain the output encoded information matrix O of the current encoder (of size s × d_model), as shown in FIG. 5, where d_ff is one dimension of the linear layer weight matrix, which has size d_model × d_ff.
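As an illustration, a minimal sketch of this two-layer feedforward network in PyTorch is shown below; the value d_ff = 2048 is the common Transformer setting and is only an assumed example (BERT-base uses d_ff = 3072, as noted later in this text).

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Two linear layers with a ReLU in between, as described above (sketch)."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)    # first layer, followed by ReLU
        self.fc2 = nn.Linear(d_ff, d_model)    # second layer, no activation

    def forward(self, y):
        return self.fc2(torch.relu(self.fc1(y)))

Y = torch.randn(5, 512)        # s = 5 tokens
O = FeedForward()(Y)           # then residual connection and normalization in the encoder
print(O.shape)                 # torch.Size([5, 512])
```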
4) Encoder and decoder
As shown in fig. 1, the decoder structure of the Transformer is similar to the encoder, but there are some differences:
a) two multi-headed attention layers are included.
b) The first multi-headed attention layer employs a Mask operation (the first multi-headed attention in FIG. 1 is with a Mask).
c) The input K, V matrix of the second multi-headed attention layer is computed using the output encoded information matrix O of the encoder, and Q uses the output of the first multi-headed attention layer.
The first multi-head attention layer of the decoder adopts the Mask operation because the decoder translates sequentially during the translation process, i.e., the (i+1)-th word can only be translated after the i-th word has been translated. The Mask operation prevents the information of the words after the i-th word from being known when the i-th word is being translated. The principle is shown in FIG. 6: the Mask matrix M has size s × s, white represents 0 and gray represents 1, and before the Softmax in FIG. 3, M and QK^T are multiplied element-wise.
The above is an introduction to the Transformer model; the present application takes BERT as an example, which is now described as follows. BERT is pre-trained on a corpus containing 3.3 billion words, with two pre-training tasks. The first task randomly removes 15% of the words in a sentence, replaces them with a mask, and lets the model predict these words. In the second task, each training sample is a pair of sentences; in 50% of the samples the second sentence actually follows the first, and in the other 50% it is unrelated, and the model must judge the relationship between the two sentences. Each of these two tasks has its own loss, and the two losses are added together as the total loss to optimize the model.
After pre-training is completed, the model needs to be fine-tuned for specific tasks (such as single-sentence classification, multi-sentence classification, question answering, named entity recognition, and the like). For different types of tasks, BERT uses different output network layers at the end of the network, which in the embodiments of the present application may include a linear layer, an embedding layer, an attention mechanism, residual connection, an activation function, and normalization. On the task-specific dataset, all parameters of the BERT model and of the output network layer are trained together until the model converges.
In a specific implementation, the model training method can be implemented by the following steps: defining a new quantized linear layer; quantizing all elements in the multi-dimensional input tensor of the quantized linear layer into the PINT (Piecewise Integer) format using a preset quantization function, where the multi-dimensional input tensor refers to the multi-dimensional eigenvalue tensor in the forward propagation stage (propagated forward from the adjacent network layer) and to the multi-dimensional error tensor in the back propagation and weight gradient calculation stages (propagated back from the adjacent network layer), and the PINT data format is a piecewise integer data format; quantizing all elements in the tensor to be calculated of the quantized linear layer into the PINT data format using the preset quantization function, where the tensor to be calculated refers to the weight matrix of the quantized linear layer in the forward and back propagation stages, and to the eigenvalue tensor obtained in the forward propagation stage when in the weight gradient calculation stage; performing matrix multiplication on the quantized multi-dimensional input tensor and the quantized tensor to be calculated to obtain a fixed-point result; dequantizing the fixed-point result into floating-point numbers and propagating them to the subsequent network layer (in the embodiments of the present application, "backward" refers only to the calculation order of the back propagation stage); and finally, replacing the original linear layer in the model with the quantized linear layer and training the model based on floating-point numbers and the PINT data format.
In an actual scenario, training of the model includes calculation of a linear layer, and calculation of a network layer in a neural network training process may be generally divided into three stages, a Forward Propagation stage (FP), a Backward Propagation stage (BP), and a Weight Gradient calculation stage (WG).
In some embodiments, integer arithmetic is widely used in the deployment of neural network algorithms in order to reduce computational and data storage requirements. In a conventional data quantization scheme, given a floating-point value set x before quantization and an integer bit width k, the process of data quantization can be expressed as
q = Clamp(Round(x ÷ s), -2^(k-1), 2^(k-1)),
where s represents a scaling factor used to normalize the integer values; the k-bit integer and the scaling factor s together constitute a fixed-point number. The Round function represents a rounding operation, and the Clamp function limits the value to the range [-2^(k-1), 2^(k-1)].
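A minimal sketch of this conventional k-bit quantization is shown below; the max-based choice of the scaling factor s is only an illustrative assumption, since the text above does not fix how s is chosen.

```python
import torch

def quantize_int(x, k=8):
    """Conventional k-bit integer quantization sketch: q = Clamp(Round(x / s), ...)."""
    s = x.abs().max() / (2 ** (k - 1) - 1)                        # assumed scaling factor
    q = torch.clamp(torch.round(x / s), -2 ** (k - 1), 2 ** (k - 1) - 1)
    return q, s                                                   # (q, s) is the fixed-point number

x = torch.randn(4) * 3
q, s = quantize_int(x)
print(q, q * s)                                                   # q * s approximately reconstructs x
```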
Unlike conventional integer representation schemes, the embodiments of the present application apply the piecewise integer (PINT) data format to the training (fine-tuning) of the BERT model. The PINT data format has two parameters, the data bit width and the separation point; combining the data bit width and the separation point, the PINT numerical space is divided into three coding spaces, each with its own scaling factor. Specifically, PINT is defined by two parameters, the data bit width k and the separation point d, which specifies how the k-bit value is divided into two non-overlapping parts, the high bits (HB) and the low bits (LB). In addition, PINT data represented by (k, d) consists of a flag bit (flag) and a signed integer of k-1 bits. The flag and the separation point together divide the numerical space of PINT into three parts, each with its own scaling factor, as shown in the following equation. Taking PINT data in the (8,3) format as an example, the data format is shown in FIG. 7.
Wherein, the specific value Xp of the representation of the PINT data can be obtained by the following formula:
in the above equation, Xp means a specific value indicated by decoding of the 8-bit PINT data of the embodiment of the present application, indicating an or operation, wherein an (HB) ═ 1 indicates that the High Bit (HB) is all 0's or all 1's, in which case only the low bit data is active. As can be seen from the formula, the numerical representation range of the PINT with the bit width k is [ -2 [ ]2(k-2),22(k-2)-1]This range is the same as the common Integer representation capability with 2k-3 bits wide.
In one implementation, the preset quantization function may be quantization. The preset quantization function may implement the quantization process by: setting an overall scaling factor; calculating the scaling factor corresponding to each of the three coding spaces in the current quantization process; and determining, according to the overall scaling factor and the per-space scaling factors, the coding space to which the quantized numerical value belongs, thereby obtaining the PINT data format representation of the value. The overall scaling factor and each per-space scaling factor are calculated by preset formulas.
For example, the specific process may be: when the PINT data format is applied to training of the BERT model, the floating-point number needs to be converted into PINT, and this conversion process may also be referred to as quantization, and the specific process of quantization is shown in equations (1) to (8).
Assume that the set of values to be quantized is x = {x_1, x_2, …, x_n}, where n is a positive integer, and the quantized PINT data format is defined as (k, d). To make the calculation more efficient, the quantization process can be divided into the following steps:
1) First, set the overall scaling factor to a power of 2 according to the absolute maximum of the value set x, as shown in formula (1);
2) Then, calculate the scaling factor corresponding to each of the three PINT coding spaces in the current quantization process, as shown in formulas (2) to (7);
3) Then, according to the value range of the current element x, determine which PINT coding space the quantized value q of x belongs to, and obtain the k-bit PINT representation corresponding to q, as shown in formula (8), where Srout represents a stochastic rounding function that rounds its operand randomly, Clamp represents an interval limiting function that limits the value to the interval [-2^(k-2), 2^(k-2)-1], x is the set of values to be quantized, and m is the overall scaling factor.
m = ⌈log2(max|x|)⌉    (1)
r1 = 2^m    (2)
s1 = r1 ÷ 2^(k-2)    (3)
r2 = r1 ÷ 2^(k-2-d)    (4)
s2 = r2 ÷ 2^(k-2)    (5)
r3 = r2 ÷ 2^(k-2)    (6)
s3 = r3 ÷ 2^d    (7)
In the Pytorch deep learning framework, the present application defines a specialized quantization function named quantization to implement the calculations involved in the above formulas. The quantized linear layer is also defined in the Pytorch deep learning framework. In a specific implementation, the quantization function can be implemented by program code along the following lines (the code is only a schematic illustration for the understanding of the present application):
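The patent's original program code is not reproduced in this text. The following is a minimal sketch, assuming PyTorch, of what such a quantization function could look like when following formulas (1) to (7); the function names (quantize_pint, stochastic_round) and the way values are assigned to the three coding spaces (the step corresponding to formula (8), which is not shown above) are assumptions made for illustration, not the patent's actual implementation.

```python
import math
import torch

def stochastic_round(x):
    """Srout: stochastic rounding (illustrative implementation)."""
    floor = torch.floor(x)
    return floor + (torch.rand_like(x) < (x - floor)).float()

def quantize_pint(x, k=8, d=3):
    """Sketch of quantization to the PINT (k, d) format, following formulas (1)-(7)."""
    m = math.ceil(math.log2(x.abs().max().item()))   # (1) exponent of the overall scaling factor
    r1 = 2.0 ** m                                    # (2) overall scaling factor
    s1 = r1 / 2 ** (k - 2)                           # (3)
    r2 = r1 / 2 ** (k - 2 - d)                       # (4)
    s2 = r2 / 2 ** (k - 2)                           # (5)
    r3 = r2 / 2 ** (k - 2)                           # (6)
    s3 = r3 / 2 ** d                                 # (7)
    # Assumed coding-space selection: large magnitudes use s1, medium use s2, small use s3.
    s = torch.full_like(x, s3)
    s = torch.where(x.abs() >= r3, torch.full_like(x, s2), s)
    s = torch.where(x.abs() >= r2, torch.full_like(x, s1), s)
    q = torch.clamp(stochastic_round(x / s), -2 ** (k - 2), 2 ** (k - 2) - 1)
    return q, s                                      # the fixed-point value is q * s

x = torch.randn(6) * 4
q, s = quantize_pint(x)
print(q, q * s)                                      # q * s approximately reconstructs x
```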
the embodiment of the application shows a BERT model training method based on a PINT data format, and the PINT data format expands a numerical value representation range under the condition of using less bit width. The shorter data bit width has certain advantages in the aspects of calculation, storage and the like, and the calculation power and the storage requirements can be reduced. The PINT data format is applied to training (fine tuning) of a BERT model, a part of network layers are replaced by quantized network layers represented based on PINT data (the part refers to that linear layers are only a part of structures in the BERT model, in the model replacement process, calculation of the linear layers is quantized, calculation of other layers is still completed by using original floating point numbers), and the accuracy of the trained model on an NLP data set is kept at the same level as that of a model trained based on 32-bit floating point numbers (FP 32).
Specifically, in the embodiment of the present application, the linear layers (Linear Layer) used in the BERT model are replaced with a quantized linear layer (QLinear) whose calculation is based on the PINT data format, and the training of the BERT-base model is completed on the GLUE dataset. GLUE, short for General Language Understanding Evaluation, is a multi-task natural language understanding benchmark and analysis dataset. As shown in FIG. 8, the BERT model uses linear layers in many places in its structure. In the quantized training method, these linear layers are each replaced with the quantized linear layer, highlighted in gray in FIG. 9. It should be noted that the function of the quantized linear layer is the same as the function of the original linear layer in the model.
The linear layer is a class in the Pytorch deep learning framework and is generally used to build fully connected layers in a neural network; in various models developed on the basis of the Transformer structure, linear layers are used in many places to change the dimensionality of a matrix. Its definition takes the following form:
class torch.nn.Linear(in_features,out_features,bias=True)
the parameters mainly comprise:
in_features: the number of input features, i.e., the lowest dimension size of the input tensor; e.g., the input tensor size is (len, in_features), where len represents the input sentence length;
out_features: the number of output features, i.e., the lowest dimension size of the output tensor; e.g., the output tensor size is (len, out_features);
bias: whether an offset is added in the matrix calculation.
The main role of the Linear layer is to convert a matrix X of size (len, in_features) into a tensor Y of size (len, out_features). The process can be expressed by the formula
Y = X × W^T + b;
where W is a weight matrix of size (out_features, in_features) and b represents the bias, a tensor of length out_features. The calculation of the linear layer can be represented by FIG. 10.
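As a small illustration of this conversion, the following usage example checks that nn.Linear computes Y = X × W^T + b (the dimension values are arbitrary examples):

```python
import torch
import torch.nn as nn

linear = nn.Linear(in_features=768, out_features=64, bias=True)
X = torch.randn(10, 768)                   # len = 10
Y = linear(X)                              # Y = X @ W.T + b
print(Y.shape)                             # torch.Size([10, 64])
print(torch.allclose(Y, X @ linear.weight.T + linear.bias))  # True
```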
As mentioned above, the computation of a given network layer in the neural network training process is divided into three stages: the forward propagation stage, the back propagation stage, and the weight gradient calculation stage. Taking a linear layer in the BERT model as an example (see FIG. 11), the preceding network structure is residual connection and normalization, and the following network structure is the attention mechanism. The calculation of the linear layer in each stage of the training process is shown in FIG. 11. In the FP stage, the linear layer is calculated as shown in FIG. 10.
In the BP stage, the loss calculated by the loss function at the end of the neural network propagates back through the network in the form of errors. When the calculation reaches the linear layer, the input of the back propagation stage is the gradient dY of the matrix Y with respect to the loss, transferred from the subsequent network layer; the linear layer needs to calculate the gradient dX of X with respect to the loss. This process is shown in FIG. 12 and can be expressed by the following formula:
dX=dY×W;
Similarly, in the WG stage, the embodiment of the present application needs to calculate the gradient dW of the weight matrix with respect to the loss, using the gradient dY and the input X of the FP stage, in order to update the weights of the network model. This process is shown in FIG. 13 and can be expressed as:
dW = dY^T × X;
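As a quick sanity check (not part of the patent), the two formulas above can be verified against PyTorch autograd for a plain linear layer without bias:

```python
import torch

X = torch.randn(10, 16, requires_grad=True)
W = torch.randn(8, 16, requires_grad=True)     # size (out_features, in_features)
Y = X @ W.T                                    # forward propagation (bias omitted)
dY = torch.randn_like(Y)                       # error tensor from the subsequent layer
Y.backward(dY)

print(torch.allclose(X.grad, dY @ W))          # dX = dY x W
print(torch.allclose(W.grad, dY.T @ X))        # dW = dY^T x X
```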
in order to further understand the present solution, the following description will be made with reference to more specific examples. For example, taking the BP phase as an example, the quantization and calculation process of data in the QLinear layer is roughly divided into the following steps, which can be represented by the flowchart of fig. 14.
Step 1: quantize all elements of the error tensor dY, represented by 32-bit floating-point numbers, into the PINT data format using the quantization function, where dY is propagated back from the following network structure, such as the attention mechanism;
Step 2: quantize all elements of the weight matrix W of the linear layer into the PINT data format using the quantization function;
Step 3: complete the matrix multiplication using the quantized dY and W to obtain a result dXq represented by 32-bit fixed-point numbers;
step 4, inverse quantize the result dXq to a 32-bit floating point number dX, and back propagate to previous network structures such as residual join and normalize modules for subsequent computations. The conversion of fixed point numbers to floating point numbers is a matter of convention, using conventional methods.
The calculation process of the FP and WG phases is similar to that of the BP phase, and the example is not repeated here.
From the above, in the Pytorch deep learning framework, the present application defines a quantized linear layer module QLinear, which uses the quantization function quantization to quantize the data required for calculation from 32-bit floating-point numbers into the PINT data format before the training calculation of each stage, and then performs the corresponding calculation. For example, the main Python code of the quantized linear layer QLinear may take the following form (the code is only illustrative):
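The patent's original QLinear code is likewise not reproduced in this text. The following is a minimal sketch of what such a module could look like, reusing the quantize_pint sketch given earlier; the helper name fake_quant, the parameter initialization, and the use of a quantize-dequantize simulation instead of a true fixed-point matrix multiplication are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def fake_quant(t, k=8, d=3):
    """Quantize a tensor to PINT and immediately dequantize (simulation helper)."""
    q, s = quantize_pint(t, k, d)
    return q * s

class _QLinearFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, X, W):
        ctx.save_for_backward(X, W)
        # FP stage: quantize the input eigenvalue tensor and the weight matrix, then multiply
        return fake_quant(X) @ fake_quant(W).T

    @staticmethod
    def backward(ctx, dY):
        X, W = ctx.saved_tensors
        dX = fake_quant(dY) @ fake_quant(W)        # BP stage: dX = dY x W
        dW = fake_quant(dY).T @ fake_quant(X)      # WG stage: dW = dY^T x X
        return dX, dW

class QLinear(nn.Module):
    """Drop-in replacement for nn.Linear whose matrix multiplications use quantized data
    (sketch; the bias is kept in FP32 and added outside the quantized multiplication)."""
    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, X):
        Y = _QLinearFunction.apply(X, self.weight)
        return Y + self.bias if self.bias is not None else Y
```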
In the embodiment of the application, the linear layers in the BERT model are replaced with the quantized linear layer. The specific implementation is to define a new network layer QLinear in the Pytorch deep learning framework, whose function is the same as that of the nn.Linear layer used in the original model. When building the BERT model, the embodiment of the present application uses the QLinear module to replace the original linear layer, i.e., the nn.Linear module, so that the calculation of the linear layer is performed in the PINT data format, while the other network structures of the model, such as the embedding layer, the attention mechanism, residual connection, and normalization, remain unchanged and are still calculated with 32-bit floating-point numbers; a mixed-precision training method is thus obtained.
Specifically, the input of the linear layer is a multidimensional tensor transmitted by a previous adjacent network layer (except for the linear layer, such as an attention mechanism, an activation function and the like) in the BERT model, and the data format of the multidimensional tensor is a 32-bit floating point number. And after the input tensor data completes the calculation of each training stage in the linear layer, outputting the corresponding result tensor to the next adjacent network layer, wherein the data format of the result tensor is 32-bit floating point number. In the original linear layer, the input tensor data is calculated in a 32-bit floating point number format. In the newly defined quantization linear layer QLinear, the embodiment of the present application converts the input tensor data represented by 32-bit floating point number into PINT data format first, so that the computation of matrix multiplication in three stages of forward propagation, backward propagation and weight gradient computation in the model training process is completed in the PINT data format.
That is, when the model is built, the quantized linear layer is represented in the PINT data format, while the multi-dimensional error tensor, the weight matrix, and the floating-point outputs are all represented by 32-bit floating-point numbers, so that the model training uses mixed precision. It should be noted that mixed precision applies to the entire model and is not limited to the data of the quantized linear layer.
The embodiment of the application is oriented to the training of Transformer/BERT models and provides a mixed-precision partial quantization training method based on an efficient low-bit data format, which can be applied to the training (fine-tuning) of the BERT model. Specifically, a quantized neural network linear layer based on the PINT data format (hereinafter the quantized linear layer QLinear) is developed, which computes the matrix multiplications of the linear layer in each stage of the model training process in the quantized PINT data format and can replace the linear layers that are used in large numbers in the BERT model. The low-bit, high-representation-capability PINT data format is applied to the training (fine-tuning) process of the BERT model, the requirements on calculation and storage are reduced, the training of the BERT model is completed using the PINT data format and floating-point numbers together, and the precision of the trained model remains at the same level as that of a model trained with the full floating-point method. Therefore, the requirements on data calculation, storage and the like can be effectively reduced while the accuracy of the trained model changes only slightly, which solves the problem that huge calculation and storage resources are needed in the model training process.
In order to evaluate the feasibility and model accuracy of the quantized training method of the embodiment of the present application, a BERT-base model is trained (fine-tuned) using the GLUE dataset. The BERT-base model has 12 Transformer blocks (encoders), and its other major hyper-parameters include: d_model = 768, d_k = 64, d_ff = 3072, h = 12.
One evaluation setup is deployed on the NVIDIA TITAN Xp platform based on the Pytorch deep learning framework. Using a single TITAN Xp GPU, the training is configured with a batch size of 32 and 3 passes over the training set; the quantized training method uses the PINT (8,3) + FP32 (32-bit floating point) data formats, and the reference training method uses full FP32. The performance of the models trained with the two schemes on the tasks of the GLUE dataset is shown in Table 1.
Table 1: feasibility evaluation table of quantitative training method
Wherein, the definitions of each task are as follows:
CoLA (The Corpus of Linguistic Acceptability);
SST-2 (The Stanford Sentiment Treebank);
MRPC (The Microsoft Research Paraphrase Corpus);
STS-B (The Semantic Textual Similarity Benchmark);
QQP (The Quora Question Pairs);
MNLI (The Multi-Genre Natural Language Inference Corpus);
QNLI (Question-answering Natural Language Inference);
RTE (The Recognizing Textual Entailment datasets);
WNLI (Winograd Natural Language Inference).
The results in Table 1 show that the accuracy of the model trained with the PINT-based quantized neural network linear layer changes very little, so the requirements on data calculation, storage, and the like can be effectively reduced while the accuracy of the trained model remains almost unchanged.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
Claims (10)
1. A method of model training, comprising:
defining a new quantized linear layer;
quantizing all elements in the multi-dimensional input tensor of the quantized linear layer into a PINT data format by using a preset quantization function, wherein the multi-dimensional input tensor refers to a multi-dimensional eigenvalue tensor in a forward propagation stage, the multi-dimensional eigenvalue tensor being propagated forward from the adjacent network layer; the multi-dimensional input tensor is a multi-dimensional error tensor in a back propagation stage and a weight gradient calculation stage, the multi-dimensional error tensor being propagated back from the adjacent network layer; and the PINT data format is a piecewise integer data format;
quantizing all elements in the tensor to be calculated of the quantized linear layer into the PINT data format by using the preset quantization function, wherein in the forward propagation stage and the back propagation stage the tensor to be calculated refers to the weight matrix of the quantized linear layer, and in the weight gradient calculation stage the tensor to be calculated refers to the eigenvalue tensor calculated in the forward propagation stage;
performing matrix multiplication calculation on the quantized multidimensional input tensor and the tensor to be calculated to obtain a fixed point result;
dequantizing the fixed point result to a floating point number, and propagating the floating point number to a subsequent network layer;
replacing an original linear layer in the model with the quantized linear layer, and training the model based on the floating point number and the PINT data format.
2. The model training method of claim 1, wherein the preset quantization function is quantization, and the preset quantization function is defined in Python programming language.
3. The model training method of claim 1, wherein the function of the quantized linear layer is the same as the function of the original linear layer in the model.
4. The model training method of claim 1, wherein the quantized linear layer is represented by the PINT data format and the multidimensional error tensor, the weight matrix, and the floating point number are all represented by 32-bit floating point numbers when the model is built, so that the model forms a mixed-precision training method.
5. The model training method of claim 1, wherein the PINT data format comprises two parameters of a data bit width and a partition point, and the PINT value is divided into three parts of coding spaces by combining the data bit width and the partition point, wherein each part of coding spaces corresponds to a scaling factor.
6. The model training method of claim 5, wherein the predetermined quantization function is a quantization process implemented by:
setting an overall scaling factor;
calculating each scaling factor corresponding to the three parts of coding space in the quantization process;
and determining the coding space to which the quantized numerical value belongs according to the whole scaling factor and each scaling factor, and obtaining a PINT data format corresponding to the numerical value.
7. The model training method of claim 6, wherein the overall scaling factor and each scaling factor are calculated by a predetermined formula.
8. The model training method of claim 1, wherein the computation of the network layer comprises the forward propagation phase, the backward propagation phase, and the weight gradient computation phase.
9. The model training method of claim 1, wherein the quantized linear layers are defined in a Pytorch deep learning framework.
10. The model training method of claim 1, wherein the network layer comprises a linear layer, an embedding layer, an attention mechanism, residual concatenation, an activation function, and normalization.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111628710.5A CN114418088A (en) | 2021-12-28 | 2021-12-28 | Model training method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111628710.5A CN114418088A (en) | 2021-12-28 | 2021-12-28 | Model training method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114418088A true CN114418088A (en) | 2022-04-29 |
Family
ID=81269534
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111628710.5A Pending CN114418088A (en) | 2021-12-28 | 2021-12-28 | Model training method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114418088A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116562311A (en) * | 2023-07-07 | 2023-08-08 | 中铁四局集团有限公司 | Operation and maintenance method and system based on natural language machine translation |
CN117035123A (en) * | 2023-10-09 | 2023-11-10 | 之江实验室 | Node communication method, storage medium and device in parallel training |
CN118035628A (en) * | 2024-04-11 | 2024-05-14 | 清华大学 | Matrix vector multiplication operator realization method and device supporting mixed bit quantization |
- 2021
- 2021-12-28: CN application CN202111628710.5A filed; published as CN114418088A (en); status: active, Pending
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116562311A (en) * | 2023-07-07 | 2023-08-08 | 中铁四局集团有限公司 | Operation and maintenance method and system based on natural language machine translation |
CN116562311B (en) * | 2023-07-07 | 2023-12-01 | 中铁四局集团有限公司 | Operation and maintenance method and system based on natural language machine translation |
CN117035123A (en) * | 2023-10-09 | 2023-11-10 | 之江实验室 | Node communication method, storage medium and device in parallel training |
CN117035123B (en) * | 2023-10-09 | 2024-01-09 | 之江实验室 | Node communication method, storage medium and device in parallel training |
CN118035628A (en) * | 2024-04-11 | 2024-05-14 | 清华大学 | Matrix vector multiplication operator realization method and device supporting mixed bit quantization |
CN118035628B (en) * | 2024-04-11 | 2024-06-11 | 清华大学 | Matrix vector multiplication operator realization method and device supporting mixed bit quantization |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3906616B1 (en) | Neural network activation compression with outlier block floating-point | |
CN114418088A (en) | Model training method | |
CN109661664B (en) | Information processing method and related device | |
US20200210838A1 (en) | Neural network activation compression with narrow block floating-point | |
WO2022057776A1 (en) | Model compression method and apparatus | |
US12067495B2 (en) | Neural network activation compression with non-uniform mantissas | |
CN109785826B (en) | System and method for trace norm regularization and faster reasoning for embedded models | |
CN110059324B (en) | Neural network machine translation method and device based on dependency information supervision | |
CN109062897A (en) | Sentence alignment method based on deep neural network | |
CN109062910A (en) | Sentence alignment method based on deep neural network | |
EP4131024A1 (en) | Method and apparatus for extracting information, electronic device and storage medium | |
CN115238893B (en) | Neural network model quantification method and device for natural language processing | |
KR20200063281A (en) | Apparatus for generating Neural Machine Translation model and method thereof | |
WO2021003813A1 (en) | Answer generation method based on neural network model, and related device | |
CN113204633A (en) | Semantic matching distillation method and device | |
CN112434514A (en) | Multi-granularity multi-channel neural network based semantic matching method and device and computer equipment | |
US20230037227A1 (en) | Dual exponent bounding box floating-point processor | |
CN118095292A (en) | Text generation method and system based on prompt engineering and fine tuning technology | |
Huai et al. | Latency-constrained DNN architecture learning for edge systems using zerorized batch normalization | |
CN106847268B (en) | Neural network acoustic model compression and voice recognition method | |
RU45579U1 (en) | DEVICE FOR CODING SEMANTICS OF TEXT DOCUMENTS | |
CN118571254B (en) | Training method of deep learning model and voice synthesis method | |
US20230376769A1 (en) | Method and system for training machine learning models using dynamic fixed-point data representations | |
CN118503411B (en) | Outline generation method, model training method, device and medium | |
CN117808083B (en) | Distributed training communication method, device, system, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||