CN114861907A - Data calculation method, device, storage medium and equipment


Info

Publication number
CN114861907A
Authority
CN
China
Prior art keywords
target
quantization
weight
data
activation value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210424489.XA
Other languages
Chinese (zh)
Inventor
李恭政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202210424489.XA
Publication of CN114861907A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/082 - Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The application relates to the field of data calculation, and provides a data calculation method, a data calculation device, a storage medium and data calculation equipment. The method comprises the following steps: acquiring a floating-point-type target activation value and target weight that are to participate in a target matrix multiplication operation; respectively carrying out quantization processing on the target activation value and the target weight to obtain a fixed-point-type quantized activation value corresponding to the target activation value and a fixed-point-type quantized weight corresponding to the target weight; performing the target matrix multiplication operation by using the quantized activation value and the quantized weight; and performing inverse quantization processing on the result of the target matrix multiplication operation according to the quantization mode of the target activation value or the target weight to obtain floating-point-type target output data. By quantizing only the limited number of target activation values and target weights that meet the conditions, the method and the device reduce the resources and time consumed during calculation and save video memory, without reducing the calculation precision.

Description

Data calculation method, device, storage medium and equipment
Technical Field
The present application relates to the field of data computing, and more particularly, to a data computing method, apparatus, storage medium, and device.
Background
In some data calculation scenarios, data is generally input into a model corresponding to the scenario, and a result corresponding to the data is obtained through calculation of each layer in the model. The activation values of each layer participating in calculation in the model are usually floating point type data.
For example, when an image or text is generated by a generative model, a target text is obtained by inputting a generation target and a hidden variable into the model. The activation values participating in the calculation in each layer of the model are usually 32-bit floating point data, and the target text is obtained after the model performs its operations. Here, an activation value may refer to the input data or the output data of any layer in the model.
The existing data calculation approach usually quantizes all floating point data participating in the calculation into fixed point data, so as to speed up the calculation and improve efficiency. However, if the calculation is performed after all the floating point data have been quantized into fixed point data, the precision of the final result will deviate significantly.
Disclosure of Invention
In this context, embodiments of the present application desirably provide a data calculation method, apparatus, storage medium, and device, which perform quantization processing on a limited number of target activation values and target weights to be involved in matrix multiplication, that is, convert a limited number of floating point type data into fixed point type data, instead of converting all floating point type data into fixed point type data, so as to improve calculation efficiency and ensure calculation accuracy.
In a first aspect of the present application, there is provided a data calculation method comprising:
acquiring a target activation value and a target weight, wherein the target activation value and the target weight are to be subjected to target matrix multiplication and are floating point type data within a first preset threshold range;
respectively carrying out quantization processing on the target activation value and the target weight to obtain a quantization activation value corresponding to the target activation value and a quantization weight corresponding to the target weight, wherein the quantization activation value and the quantization weight are both fixed point type data within a second preset threshold range;
performing the target matrix multiplication operation by using the quantization activation value and the quantization weight;
and performing inverse quantization processing on the result of the target matrix multiplication operation according to the quantization mode of the target activation value or the target weight to obtain target output data, wherein the target output data is floating point type data within a first preset threshold range.
In one embodiment of the present application, the method is applied to a neural network model;
wherein the neural network model comprises at least one of the target matrix multiplication operations;
the target matrix multiplication operation is one of the following matrix multiplications:
QKV matrix multiplication of the attention calculation layer;
mapping matrix multiplication of a mapping layer;
a first fully-connected matrix multiplication of the feedforward neural network layer;
a second fully-connected matrix multiplication of the feedforward neural network layer.
In one embodiment of the present application, in the attention calculation layer, the target activation value is input data of the attention calculation layer, and the target weight includes a query weight, a keyword weight, and a value weight of the attention calculation layer;
performing quantization processing on the target weight, including:
respectively carrying out quantization processing according to the channel dimensions of the query weight, the keyword weight and the value weight to obtain a quantization query weight corresponding to the query weight, a quantization keyword weight corresponding to the keyword weight and a quantization value weight corresponding to the value weight;
the performing the target matrix multiplication operation by using the quantization activation value and the quantization weight includes:
performing matrix multiplication of the quantization activation value with the quantization query weight, the quantization keyword weight and the quantization value weight, respectively;
performing inverse quantization processing on the result of the target matrix multiplication according to the quantization mode of the target activation value or the target weight to obtain target output data, including:
performing inverse quantization processing on the results of the three matrix multiplication operations according to the quantization modes of the target activation values or the target weights respectively;
and calculating attention according to a preset rule by adopting three matrix multiplication results subjected to inverse quantization, and taking the attention as output data of the attention calculation layer.
In one embodiment of the present application, when the attention calculation layer is a masked multi-head attention calculation layer, the channel dimension of the weight is the attention head dimension of the weight.
In one embodiment of the application, quantization processing or inverse quantization processing is performed through a preset fusion operator;
the fusion operator comprises a quantitative fusion operator and an inverse quantitative fusion operator;
the quantization fusion operator is used for performing fusion calculation on data calculation before quantization processing of a target activation value and quantization processing of the target activation value;
and the inverse quantization fusion operator is used for performing fusion calculation on the data calculation after inverse quantization processing and the inverse quantization processing.
In one embodiment of the present application, each target matrix multiplication corresponds to one quantized fusion operator and/or one dequantized fusion operator;
the quantization fusion operator corresponding to the QKV matrix multiplication is used for performing fusion calculation on normalization processing and quantization processing of the activation value after the normalization processing; the inverse quantization fusion operator corresponding to the QKV matrix multiplication is used for performing fusion calculation by adding an offset term and inverse quantization processing of the operation result of the QKV matrix multiplication;
the quantization fusion operator corresponding to the mapping matrix multiplication is used for performing fusion calculation on permutation conversion processing and the quantization processing of the activation value after the permutation conversion processing; the inverse quantization fusion operator corresponding to the mapping matrix multiplication is used for performing fusion calculation on the inverse quantization processing, the offset term addition and the residual addition of the operation result of the mapping matrix multiplication;
the quantization fusion operator corresponding to the first full-connection matrix multiplication is used for performing fusion calculation on normalization processing and the quantization processing of the activation value after the normalization processing; the inverse quantization fusion operator corresponding to the first full-connection matrix multiplication is used for performing fusion calculation on inverse quantization processing, bias term addition and activation operation of the operation result of the first full-connection matrix multiplication;
and the inverse quantization fusion operator corresponding to the second full-connection matrix multiplication is used for performing fusion calculation on the inverse quantization processing, the addition of the bias terms and the normalization processing of the operation result of the second full-connection matrix multiplication.
In an embodiment of the present application, if the neural network model is in a parallel training state, performing quantization processing on the target activation value and the target weight according to a parallel training mode of the neural network model;
when the parallel training mode of the neural network model is data parallel, carrying out channel dimension quantization processing on the target weight meeting the condition, and carrying out tensor dimension quantization processing on the target activation value and the target weight not meeting the condition;
and when the parallel training mode of the neural network model is model parallel, respectively carrying out data block dimension quantization processing on the target activation value and the target weight, wherein the data block dimensions quantization modes of the target activation value and the target weight are different.
In an embodiment of the present application, the performing quantization processing on channel dimensions on the target weights meeting the condition includes:
obtaining the attention head dimension of the target weight;
partitioning the target weight into blocks according to the attention head dimension to obtain each target weight sub-block;
and respectively carrying out quantization processing on each target weight subblock.
In an embodiment of the application, the performing data-block-dimension quantization processing on the target activation value and the target weight respectively includes:
acquiring the parallel scale of the model and the attention head dimension of the target weight;
performing data partitioning on the target activation value according to the parallel scale of the model to obtain each target activation value sub-block; partitioning the target weight according to the parallel scale of the model and the dimensions of the attention head, and obtaining each target weight sub-block;
and respectively carrying out quantization processing on each target activation value sub-block and each target weight sub-block.
In an embodiment of the present application, the data partitioning of the target weight according to the parallel scale of the model and the attention head dimension includes:
and taking the product of the attention head dimension of the target weight and the parallel scale of the model as the divisor for data partitioning.
In a second aspect of the present application, there is provided a data computing apparatus comprising:
the acquisition module is configured to acquire a target activation value and a target weight, wherein the target activation value and the target weight are to be subjected to target matrix multiplication and are floating point type data within a first preset threshold range;
the quantization module is configured to perform quantization processing on the target activation value and the target weight respectively to obtain a quantization activation value corresponding to the target activation value and a quantization weight corresponding to the target weight, wherein the quantization activation value and the quantization weight are both fixed point type data within a second preset threshold range;
a calculation module configured to perform the target matrix multiplication operation using the quantization activation value and the quantization weight;
and the inverse quantization module is configured to perform inverse quantization processing on the result of the target matrix multiplication according to the quantization mode of the target activation value or the target weight to obtain target output data, wherein the target output data is floating point type data within a first preset threshold range.
In a third aspect of the present application, a computer-readable storage medium is provided, comprising instructions which, when run on a computer, cause the computer to perform the method according to the first aspect.
In a fourth aspect of the present application, a computing device is provided, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of the first aspect when executing the computer program.
Compared with the prior art, according to the data calculation method, the data calculation device, the data calculation storage medium and the data calculation equipment, a limited number of target activation values and target weights to be involved in matrix multiplication are subjected to quantization processing, namely, a limited number of floating point type data are converted into fixed point type data instead of converting all floating point type data into fixed point type data, so that resources and time consumed during calculation are reduced, a video memory is saved, calculation precision is not reduced, and better experience is brought to a user.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present application will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present application are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
fig. 1 is a schematic view of an application scenario of a data calculation method according to some embodiments of the present application;
FIG. 2 is a schematic flow chart diagram illustrating a data calculation method according to an embodiment of the present application;
FIG. 3 is a diagram illustrating a GPT model quantization calculation according to an embodiment of the present application;
FIG. 4 is a diagram illustrating an inverse quantization calculation of the GPT model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of quantization slicing during parallel training of a model according to another embodiment of the present application;
FIG. 6 is a schematic block diagram of a data computing device according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a computing device according to an embodiment of the present application.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present application will be described with reference to a number of exemplary embodiments. It is understood that these examples are given solely to enable those skilled in the art to better understand and to practice the present application, and are not intended to limit the scope of the present application in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present application may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
It is to be noted that the various embodiments or figures of the present application refer to terms such as:
float: floating point type data;
FP16/FP32: 16-bit/32-bit floating point type data;
per-tensor: each tensor;
per-channel: each channel;
per-block: each data block.
At present, a data calculation mode for a model constructed based on a neural network usually needs to calculate 32-bit data, so that the bandwidth pressure of data calculation is high, and the calculation performance is reduced.
Therefore, the embodiment of the present application provides a data calculation method, which can perform quantization processing on a limited number of target activation values and target weights to be involved in matrix multiplication on the premise of having a small influence on a data calculation result, that is, a limited number of floating point type data are converted into fixed point type data instead of converting all floating point type data into fixed point type data, thereby reducing bandwidth pressure of data calculation, improving calculation capability of data calculation equipment, and ensuring precision of the data calculation result.
The data calculation method provided by the embodiments of the present application can be applied to a neural network model realized based on artificial intelligence. Artificial Intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The basic artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
In the embodiment of the present application, the artificial intelligence software technology mainly involved includes the natural language processing technology and the deep learning direction.
For example, Deep Learning (Deep Learning) in Machine Learning (ML) may be involved, including various types of artificial neural networks (artificial neural networks).
First, an execution body of the embodiments of the present application will be described. The data calculation method provided by the application can be executed by a data computing device. The data computing device may be a server, where the server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data and artificial intelligence platforms. The data computing device may also be a terminal, where the terminal may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, or the like, but is not limited thereto. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the present application.
The data computing device may have the capability to implement automatic sentence generation and translation techniques in natural language processing techniques, and the like.
The data computing device may be provided with Machine Learning (ML) capabilities. ML is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specializes in studying how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence, is the fundamental way to make computers intelligent, and is applied in all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks.
In the embodiment of the present application, the neural network model applying the data calculation method mainly relates to application of various artificial neural networks, such as sequence generation through the neural network model.
It should be noted that the embodiment of the present application does not limit the type of the model for performing data calculation by using the method, the model may be any type of model, and in a possible implementation manner, the model may be a Recurrent Neural Network (RNN) model.
Next, a data calculation method provided by the embodiment of the present application is described with a server as an execution subject and in combination with an actual application scenario.
Referring to fig. 1, a schematic diagram of an application scenario of a data calculation method provided in an embodiment of the present application is shown. As shown in fig. 1, the application scenario includes a server 101, and the server 101 executes the data calculation method provided in the embodiment of the present application.
In the embodiment of the present application, when data calculation needs to be performed on input data, the server 101 may input the input data into a model for performing data calculation, so as to determine output data corresponding to the input data through the model.
The application scenarios of the method include speech analysis, speech noise reduction, sentence translation, text recognition, sequence generation and the like.
When the scene is a voice analysis scene, the model may be a voice analysis model, the input data may be voice data to be subjected to voice analysis, and the determined output data corresponding to the input data may be data for completing the voice analysis. When the scene is a voice noise reduction scene, the model may be a voice noise reduction model, the input data may be voice data to be noise reduced, and the determined output data corresponding to the input data may be the voice data obtained by noise reduction of the voice data to be noise reduced.
When the context is a sentence translation scenario, the model may be a sentence translation model, the input data may correspond to sentence data of a first language to be translated, and the determined output data corresponding to the input data may be sentence data of a second language obtained by translation. When the method is applied to a sequence generation scene, the model may be a sequence generation model, the input data may be data including a sequence to be generated, the determined output data corresponding to the input data is a sequence obtained by performing data calculation according to the sequence to be generated, and the like, which are not described again.
After the input data is input to the model, the model processes the input data to obtain an activation value flowing in each layer, the activation value may be an activation value input to a certain neural network layer or an activation value output by a certain neural network layer, and the weight value may be a weight value inherent to a certain neural network layer. In the process of the flow of the activation values, a target activation value and a target weight can be obtained, quantization processing is carried out on the target activation value and the target weight, and then matrix multiplication to be participated in is carried out on the quantized target activation value and the target weight continuously.
The principles and spirit of the present application are explained in detail below with reference to several representative embodiments of the present application.
For example, referring to fig. 1, the server 101 may take the input activation value of the linear superposition layer in the model as the target activation value, and obtain the target activation value as a one-dimensional sequence [-0.127, 0.126, 0.008, -0.007].
After obtaining the target activation values, the server 101 may quantize the target activation values, determine a quantized activation value corresponding to each target activation value, and record the quantized activation value as a quantized activation value. The quantized activation value is fixed-point type data, and specifically, in this example, the target activation value is quantized to an integer value.
In this embodiment, the target activation values may be quantized into the data range [-127, 127] by expanding them by a factor of 1000. The quantized activation values thus determined are -127, 126, 8 and -7, giving the one-dimensional sequence [-127, 126, 8, -7].
Thus, the quantization activation value and the quantization weight are forward-calculated.
After the calculation is completed, the result after the forward calculation can be inversely quantized according to the quantization mode of the target activation value or the target weight to obtain a corresponding output result, so that the influence of the quantized activation value and the quantized weight on the output result obtained by the forward calculation is reduced. That is, the difference between the result of calculating the activation value and the target weight and the result of calculating the quantized activation value and the quantized weight is within an allowable range.
In this embodiment, the result of the forward calculation may be scaled down by a factor of 1000 according to the quantization mode, so as to implement inverse quantization. As shown in fig. 1, after the forward calculation is performed with the quantized activation values and the quantized weights, a one-dimensional sequence [102, 103, -76, 8] is obtained; the data in the sequence can be dequantized by scaling them down by a factor of 1000 to obtain the sequence [0.102, 0.103, -0.076, 0.008], which is recorded as the corresponding output result.
And after the calculation of the input data is completed through the model, determining output data corresponding to the input data.
In this example, on the premise of not affecting the data calculation result as much as possible, the target activation value and the target weight are quantized into integer values with fewer digits, so that the bandwidth pressure of data calculation is reduced, and the calculation capability of the data calculation device is improved.
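The example above can be summarized in a short sketch. This is a minimal illustration only, assuming the fixed scale factor of 1000 and the clamping range [-127, 127] from the scenario; the small integer weight matrix used here is hypothetical and is not taken from the application.

```python
import numpy as np

def quantize(x, scale=1000.0, qmin=-127, qmax=127):
    """Expand the floating point values by `scale` and clamp them to the fixed point range."""
    return np.clip(np.round(x * scale), qmin, qmax).astype(np.int32)

def dequantize(x_q, scale=1000.0):
    """Shrink the fixed point result back by the same factor."""
    return x_q.astype(np.float32) / scale

target_activation = np.array([-0.127, 0.126, 0.008, -0.007], dtype=np.float32)
quant_activation = quantize(target_activation)              # -> [-127, 126, 8, -7]

# Hypothetical small integer weight standing in for the quantized weight of the layer.
quant_weight = np.array([[1, 0, -1, 2],
                         [0, 1,  1, 0],
                         [2, 1,  0, 1],
                         [1, 0,  1, 1]], dtype=np.int32)

output = dequantize(quant_activation @ quant_weight)        # floating point output data
```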
It should be noted that, although the above example only describes specifically how to quantize the target activation value and dequantize the output data, in some embodiments of the present application the target weight is also quantized, and the quantized weight and the quantized activation value are then processed in the same way the original activation value and weight would have been. For example, if a target activation value is input and is to be matrix-multiplied with the target weight, then when the data calculation method of the present application is used, the target activation value and the target weight are quantized respectively, the quantized activation value and the quantized weight are matrix-multiplied, and the calculation result is used as the corresponding output data, or is dequantized and then used as the corresponding output data.
It is understood that the quantization of the weight is similar to the quantization of the activation value, that is, a quantization coefficient is used to process the weight to obtain the corresponding quantized weight.
The following describes various aspects of the present application with reference to several specific embodiments.
In connection with the application scenario of fig. 1, a method for data computation according to an exemplary embodiment of the present application is described below with reference to fig. 2. It should be noted that the above application scenarios are merely illustrated for the convenience of understanding the spirit and principles of the present application, and the embodiments of the present application are not limited in this respect. Rather, embodiments of the present application may be applied to any scenario where applicable.
Next, a data calculation method provided by the embodiment of the present application will be described by taking a server as the data calculation device and a natural language generation scenario as an example. The model is deployed on the server; the model may be a model obtained after training, and the data calculation method is an inference process of the model.
Referring to fig. 2, the data calculation method includes:
step S110, acquiring a target activation value and a target weight, wherein the target activation value and the target weight are to be subjected to target matrix multiplication and are floating point type data within a first preset threshold range;
the data calculation method provided by the embodiment of the application quantizes each target activation value and target weight in the neural network model, the quantization of each target activation value and target weight can be divided into multiple groups, and one group of target activation values and target weights can comprise multiple target activation values and target weights.
It should be noted that a group of target activation values and target weights should participate in the same matrix multiplication operation. That is, this embodiment determines whether an activation value is a target activation value according to whether it is to be matrix-multiplied with a weight, and likewise determines whether a weight is a target weight according to whether it is to be matrix-multiplied with an activation value; in addition, if an activation value and a weight are to be matrix-multiplied with each other, they are determined as a group consisting of a target activation value and a target weight. For example, if activation value 1 is to be matrix-multiplied with weight 1, then activation value 1 and weight 1 form one group of target activation value and target weight.
It can be understood that, since the activation values and weights in a neural network model are themselves floating point data, this embodiment does not separately check whether the data types of the activation values and weights in the neural network model are floating point; that is, the activation values and weights in the neural network model are automatically regarded as floating point data. A person skilled in the art may additionally verify whether the activation value and the weight to be involved in the matrix multiplication are floating point data according to the actual application scenario, which is not limited in this embodiment.
Floating point data are real numbers used to represent decimals, and the first preset threshold range may be the commonly accepted threshold range of floating point data, for example 1.8E-308 to 1.8E+308.
It should be noted that the floating-point type data can express more information, that is, the result of calculation based on the floating-point type data will be more accurate, however, the floating-point type data needs to be quantized because the floating-point type data will take more time and calculation resources to perform calculation. In consideration of the fact that if all floating point type data in the neural network model are subjected to quantization processing, the accuracy of the final output result of the model is greatly lost, and therefore, in the embodiment, a target activation value and a target weight meeting certain conditions are acquired to be subjected to quantization processing.
The reason is that matrix multiplication of floating point data consumes extremely large computational resources and time, i.e., has extremely high computational complexity, whereas other multiplication and addition calculations consume far fewer resources than matrix multiplication. Therefore, a good resource-saving effect can be achieved by quantizing only the weights and activation values that need to participate in matrix multiplication: the quantized model computes faster and consumes far fewer resources, while the loss of precision remains small.
According to the above principle, the inventor finds that the neural network model includes at least one matrix multiplication operation requiring the participation of the weight and the activation value. Therefore, in one embodiment, the target weight and the target activation value can be determined based on matrix multiplication, that is, the matrix multiplication operation in which the activation value and the weight participate is considered as the target matrix multiplication operation; in particular the target matrix multiplication operation is one of the following matrix multiplications:
QKV matrix multiplication of the attention calculation layer;
mapping matrix multiplication of a mapping layer;
a first fully-connected matrix multiplication of the feedforward neural network layer;
a second fully-connected matrix multiplication of the feedforward neural network layer.
In one embodiment, the generative unsupervised pre-training model (Generative Pre-Training, GPT) includes all the target matrix multiplication operations listed in the above embodiments.
It should be noted that the GPT model includes a plurality of stacked Transformer decoders, each of which has the same structure, for example comprising a masked multi-head attention layer, a mapping layer, a feedforward neural network layer, a residual connection layer and a normalization layer; therefore, each Transformer decoder includes the respective target matrix multiplication operations listed in the above embodiments.
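For orientation, the following sketch outlines one such decoder layer and marks the four target matrix multiplications. It is an assumption-level simplification (single attention head, no mask, no bias terms), not the patent's implementation; the helper functions are defined only for the sketch.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def decoder_layer(x, Wq, Wk, Wv, Wo, W1, W2):
    h = layer_norm(x)
    q, k, v = h @ Wq, h @ Wk, h @ Wv   # QKV matrix multiplications -> quantized
    attn = softmax(q @ k.T) @ v        # Q x K and attn x V: no weights, kept in floating point
    x = x + attn @ Wo                  # mapping (projection) matrix multiplication -> quantized
    h = layer_norm(x)
    x = x + gelu(h @ W1) @ W2          # first / second fully-connected matmuls -> quantized
    return x
```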
Step S120, quantizing the target activation value and the target weight respectively to obtain a quantized activation value corresponding to the target activation value and a quantized weight corresponding to the target weight, wherein both the quantized activation value and the quantized weight are fixed point type data within a second preset threshold range;
if the content of the scene part is applied, the floating point type data is quantized, that is, the floating point type data is converted into the fixed point type data, and the conversion formula of the floating point type data and the fixed point type data is as follows:
the method includes the steps of (x _ out) ((x/scale + zero _ point), quant _ min, quant _ max) -zero _ point) ("scale"), where x _ out "represents fixed point type data obtained after quantization processing, x is floating point type data before quantization processing, quant _ min and quant _ max respectively represent the maximum and minimum values that can be represented by the fixed point type data under a specific bit width, and scale is a scaling factor (for example, 1000 times of expansion of an application scene part).
In the quantization process, the most important step is calculating the scale coefficient. The scale coefficient is the ratio between the statistical range of the tensor and the range of the fixed point numbers; therefore, how to estimate the range of the tensor is the key to quantization precision: the range should cover as many of the values as possible, while values that are close to each other should still remain distinguishable after quantization. Common methods for calculating scale include max, percentile, and the like.
In addition, quantization can be classified into symmetric quantization and asymmetric quantization according to whether zero_point is 0. zero_point is the fixed point value onto which the floating point value 0 is mapped; if the floating point value 0 is mapped onto the fixed point value 0, the quantization is symmetric, otherwise it is asymmetric. In deep learning, activation values and weights roughly follow a normal distribution with a mean of 0, so the embodiments of the present application adopt symmetric quantization in order to save computation.
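A minimal sketch of symmetric quantization as described above, assuming int8-style limits and the max method for the scale coefficient; the function names are illustrative, and the rounding step is a common convention rather than a detail stated here.

```python
import numpy as np

def compute_scale(tensor, quant_max=127):
    """max method: ratio of the statistical range of the tensor to the fixed point range."""
    return np.abs(tensor).max() / quant_max

def quantize_symmetric(x, scale, quant_min=-127, quant_max=127):
    # zero_point is 0 for symmetric quantization.
    return np.clip(np.round(x / scale), quant_min, quant_max).astype(np.int8)

def dequantize_symmetric(x_q, scale):
    return x_q.astype(np.float32) * scale

weight = np.random.randn(768, 768).astype(np.float32)
scale = compute_scale(weight)
weight_q = quantize_symmetric(weight, scale)
weight_restored = dequantize_symmetric(weight_q, scale)     # approximates the original weight
```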
Fixed point data are data in which the decimal point is, by convention, fixed at a certain position, and the second preset threshold range may be the commonly accepted threshold range of fixed point data, for example also 1.8E-308 to 1.8E+308.
To describe in more detail which activation values and weights in the neural network model need to be quantized, the following takes the GPT model as an example, with reference to fig. 3, to explain how each target activation value and target weight is quantized. In fig. 3, float input is the floating point input, i.e. the output activation value of the embedding layer. Since the embedding layer involves no matrix multiplication between weights and activation values, quantizing it would not improve performance, and its share of video memory is small; this embodiment therefore does not quantize the activation values and weights of the embedding layer, so as to preserve the accuracy of the model's data calculation.
In the attention calculation layer, the target activation value is input data of the attention calculation layer, and the target weight comprises a query weight, a keyword weight and a value weight of the attention calculation layer;
performing quantization processing on the target weight, including:
respectively carrying out quantization processing according to the channel dimensions of the query weight, the keyword weight and the value weight to obtain a quantization query weight corresponding to the query weight, a quantization keyword weight corresponding to the keyword weight and a quantization value weight corresponding to the value weight;
the performing the target matrix multiplication operation by using the quantization activation value and the quantization weight includes:
performing matrix multiplication of the quantization activation value with the quantization query weight, the quantization keyword weight and the quantization value weight, respectively;
performing inverse quantization processing on the result of the target matrix multiplication according to the quantization mode of the target activation value or the target weight to obtain target output data, including:
performing inverse quantization processing on the results of the three matrix multiplication operations according to the quantization modes of the target activation values or the target weights respectively;
and calculating attention according to a preset rule by adopting three matrix multiplication results subjected to inverse quantization, and taking the attention as output data of the attention calculation layer.
Referring to fig. 3, query-scale corresponds to the quantization coefficient of the query input in a Transformer decoder structure. The activation value query input into the attention calculation layer is matrix-multiplied with the query weight Q, the keyword weight K and the value weight V of the attention calculation layer (i.e. the QKV multiplication of the attention calculation layer; Gemm in fig. 3 represents matrix multiplication). As can be seen from fig. 3, the inputs of the 3 matrix multiplications are essentially the same activation value, so this embodiment merges the quantization operations on the inputs of the 3 matrix multiplications into one, thereby reducing the number of quantization steps while preserving quantization accuracy.
Additionally, in some embodiments, the attention calculation layer may be a masked multi-head attention calculation layer, i.e. the attention calculation layer includes multiple attention heads. Therefore, in order to make the quantization granularity finer and reduce quantization errors, in one embodiment, if the dimension of the target weight is (head_num, size_per_head), the size_per_head dimension of the target weight is regarded as the channel dimension, and the target weight is quantized along this channel dimension, i.e. the target weight is divided into target weight sub-blocks according to this dimension and each sub-block is quantized separately.
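As an illustration of this channel-dimension quantization of the target weight, the sketch below assumes the weight has been arranged as (head_num, size_per_head) and that the sub-blocks are taken per attention head, as in the weight-quantization embodiments above; one scale coefficient is computed per sub-block using the max method. Names and shapes are assumptions for illustration.

```python
import numpy as np

def quantize_per_head(weight, quant_max=127):
    """One scale coefficient per attention head sub-block (channel-dimension quantization)."""
    head_num = weight.shape[0]
    scales = np.empty(head_num, dtype=np.float32)
    quantized = np.empty(weight.shape, dtype=np.int8)
    for h in range(head_num):
        scales[h] = np.abs(weight[h]).max() / quant_max
        quantized[h] = np.clip(np.round(weight[h] / scales[h]),
                               -quant_max, quant_max).astype(np.int8)
    return quantized, scales

# Hypothetical query weight arranged as (head_num, size_per_head).
query_weight = np.random.randn(12, 64).astype(np.float32)
q_weight_q, q_scales = quantize_per_head(query_weight)
```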
Since the two matrix multiplications Q × K and attn × V involve no weights and are both multiplications between activation values, quantizing them cannot reduce the video memory occupation; therefore, this embodiment does not quantize the data participating in these two matrix multiplications.
With continued reference to fig. 3, out_scale, fc1_scale and fc2_scale correspond to the scaling coefficients of the projection matrix multiplication (i.e. the mapping matrix multiplication of the mapping layer), the fc1 matrix multiplication (i.e. the first fully-connected matrix multiplication of the feedforward neural network layer) and the fc2 matrix multiplication (i.e. the second fully-connected matrix multiplication of the feedforward neural network layer) of the decoder structure, respectively; for these, this embodiment performs the normal quantization operation in order to save computational resources and time.
After describing how to determine the target activation values and the target weights and how to perform quantization processing on the respective target activation values and the target weights, step S130 is performed next, and the target matrix multiplication is performed using the quantized activation values and the quantized weights.
In order to further improve the calculation efficiency, in an embodiment of the present application, quantization processing is performed through a preset fusion operator; the fusion operator comprises a quantitative fusion operator;
the quantization fusion operator is used for performing fusion calculation on data calculation before quantization processing of the target activation value and quantization processing of the target activation value.
Specifically, each target matrix multiplication may correspond to one quantized fusion operator;
the quantization fusion operator corresponding to the QKV matrix multiplication is used for performing fusion calculation on normalization processing and quantization processing of the activation value after the normalization processing; such as the layer normalized layerorm in the dashed box in fig. 3 and the input queue scale factor query-scale (equivalent to Quant in fig. 4).
The quantization fusion operator corresponding to the mapping matrix multiplication is used for performing fusion calculation on permutation conversion processing and the quantization processing of the activation value after the permutation conversion processing; such as a Transpose arrangement conversion process and a quantization process of the activation value Quant in a dotted frame in fig. 4.
The quantization fusion operator corresponding to the first full-connection matrix multiplication is used for performing fusion calculation on normalization processing and the quantization processing of the activation value after the normalization processing; such as layer normalized LayerNorm processing and quantization processing of the activation value Quant in the dashed box before the first fully-connected matrix multiplication FC1 Gemm in fig. 4.
The quantization fusion operator corresponding to the second fully-connected matrix multiplication is used for performing fusion calculation on the activation processing and the quantization processing of the activation value after the activation processing; for example, the GELU activation before the second fully-connected matrix multiplication in the dashed box in fig. 4 and the quantization of the activation value according to fc2_scale.
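A schematic sketch of one such quantization fusion operator, fusing layer normalization with the quantization of the normalized activation value (as before the QKV or FC1 matrix multiplication above). The function signature and the fixed example scale are assumptions for illustration, not the patent's kernel interface.

```python
import numpy as np

def fused_layernorm_quant(x, gamma, beta, scale, eps=1e-5, quant_max=127):
    """Layer normalization and quantization of the normalized activation in a single pass."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    normed = (x - mean) / np.sqrt(var + eps) * gamma + beta
    return np.clip(np.round(normed / scale), -quant_max, quant_max).astype(np.int8)

hidden = 768
x = np.random.randn(16, hidden).astype(np.float32)
gamma, beta = np.ones(hidden, dtype=np.float32), np.zeros(hidden, dtype=np.float32)
x_int8 = fused_layernorm_quant(x, gamma, beta, scale=0.02)   # ready for the int8 QKV GEMM
```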
After the target matrix multiplication is performed by using the quantization activation value and the quantization weight, in order to ensure the precision of the final calculation result, step S140 is further executed to perform inverse quantization processing on the result of the target matrix multiplication according to the quantization mode of the target activation value or the target weight, so as to obtain target output data, where the target output data is floating point type data within a first preset threshold range.
The inverse quantization process is an inverse operation of the quantization process, for example, the quantization process enlarges the data by 1000 times, and the inverse quantization process reduces the data by 1000 times.
In addition, since some target weights are quantized in the channel dimension, in some embodiments, the quantization mode of the target weights is different from that of the target activation values, and in this case, the result of the multiplication of the target matrix may be dequantized according to actual needs by using the quantization mode according to the target activation values or the target weights.
Similar to the quantization embodiment, in an embodiment of the present application, inverse quantization processing is further performed through a preset fusion operator;
the fusion operator is an inverse quantization fusion operator;
and the inverse quantization fusion operator is used for performing fusion calculation on the data calculation after inverse quantization processing and the inverse quantization processing.
In an embodiment of the present application, each target matrix multiplication further corresponds to an inverse quantization fusion operator;
the inverse quantization fusion operator corresponding to the QKV matrix multiplication is used for performing fusion calculation by adding an offset term and inverse quantization processing of the operation result of the QKV matrix multiplication; specifically, in the step of QKV multiplication, after QKV matrix multiplication is completed, the present embodiment fuses the dequantization process and the bias term bias addition together, for example, dequantization and query weight Q bias term addition Qbias in fig. 4.
The inverse quantization fusion operator corresponding to the mapping matrix multiplication is used for performing fusion calculation on the inverse quantization processing, the bias term addition and the residual addition of the operation result of the mapping matrix multiplication; for example, dequantization, mapping bias term addition Proj Bias and residual addition Add in fig. 4.
The inverse quantization fusion operator corresponding to the first fully-connected matrix multiplication is used for performing fusion calculation on the inverse quantization processing, the bias term addition and the activation operation of the operation result of the first fully-connected matrix multiplication; for example, dequantization, bias term addition FC1 Bias & act and the GELU activation operation (shown in fig. 3) in fig. 4.
The inverse quantization fusion operator corresponding to the second fully-connected matrix multiplication is used for performing fusion calculation on the inverse quantization processing, the bias term addition and the normalization processing of the operation result of the second fully-connected matrix multiplication; for example, dequantization, bias term addition FC2 Bias & act and normalization processing Add & Norm in fig. 4.
In addition, the quantization fusion operator can be obtained by fusing the original kernel function that precedes the quantization processing with the quantization processing; the inverse quantization fusion operator can be obtained by fusing the original kernel function that follows the inverse quantization processing with the inverse quantization processing.
When the embodiments of the present application perform quantized inference, the quantization and inverse quantization processes are fused into the preceding and following operators, so that video memory accesses are reduced and performance is improved. Since the weights and their quantization coefficients are known in advance, this embodiment can quantize the weights ahead of time.
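A schematic sketch of one dequantization fusion operator, taking the first fully-connected matrix multiplication as an example: dequantization of the integer GEMM result, bias addition and the GELU activation are done in one pass. The combined dequantization factor (activation scale multiplied by weight scale) and the tanh approximation of GELU are illustrative assumptions, not details stated in the application.

```python
import numpy as np

def fused_dequant_bias_gelu(gemm_out_int32, act_scale, weight_scale, bias):
    """Dequantize the integer GEMM result, add the bias term and apply GELU in one pass."""
    x = gemm_out_int32.astype(np.float32) * (act_scale * weight_scale) + bias
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

out = np.random.randint(-20000, 20000, size=(16, 3072))
activation = fused_dequant_bias_gelu(out, act_scale=0.02, weight_scale=0.01,
                                     bias=np.zeros(3072, dtype=np.float32))
```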
Some neural network models, such as large GPT models, have so many parameters that the video memory of a single graphics card is insufficient to hold the whole model. To solve this video memory shortage, the weights can be sliced along the tensor dimension and the sliced tensors stored on different graphics cards, so that larger models can be loaded.
Specifically, in one embodiment, if the neural network model is in a parallel training state, the target activation value and the target weight are respectively quantized according to a parallel training mode of the neural network model;
and when the parallel training mode of the neural network model is data parallel, carrying out channel dimension quantization processing on the target weight meeting the condition, and carrying out tensor dimension quantization processing on the target activation value and the target weight not meeting the condition. Wherein the coincidence condition may be a weight in which the target weight is output for a plurality of attention heads.
In one embodiment, the performing quantization processing on channel dimensions on the target weights meeting the condition includes:
obtaining the attention head dimension of the target weight;
carrying out data blocking on the target weight according to the attention head dimension to obtain each target weight sub-block;
and respectively carrying out quantization processing on each target weight subblock.
During parallel training of the model, for weight quantization, the data in a weight channel is generally distributed across different graphics cards, so the statistical range of the data within the channel cannot be obtained normally and the scale coefficient of the weight cannot be calculated; for the activation value, since different machines compute different parts of the activation value, the overall range of the activation value cannot be gathered, and thus its scale coefficient cannot be calculated either.
In order to solve the problem that the scale of an activation value or weight cannot be computed across machines during parallel training of the model, the embodiment of the present application provides a per-block (data block granularity) quantization method.
And when the parallel training mode of the neural network model is model parallel, respectively carrying out data-block-dimension quantization processing on the target activation value and the target weight, wherein the data-block-dimension quantization modes of the target activation value and the target weight are different.
In one embodiment, the performing quantization processing on the target activation values and the target weights respectively for the dimensions of the data blocks includes:
acquiring the parallel scale of the model and the attention head dimension of the target weight;
performing data partitioning on the target activation value according to the parallel scale of the model to obtain each target activation value sub-block; performing data partitioning on the target weight according to the parallel scale of the model and the attention head dimension to obtain each target weight sub-block;
and respectively carrying out quantization processing on each target activation value sub-block and each target weight sub-block.
In one embodiment, the data partitioning of the target weights according to the parallel scale of the model and the attention head dimension comprises:
and taking the product of the attention head dimension of the target weight and the parallel scale of the model as the divisor for data partitioning.
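For illustration, a hedged sketch of this partitioning rule follows; the flattening, the symmetric int8 scheme and all names are assumptions rather than the patent's implementation:

```python
# Hypothetical per-block quantization: the number of blocks is h * K, the product
# of the attention head dimension and the model-parallel scale.
import numpy as np

def quantize_per_block(w: np.ndarray, num_heads: int, parallel_size: int):
    num_blocks = num_heads * parallel_size                   # the divisor h * K
    blocks = np.split(w.reshape(-1), num_blocks)             # n / (h * K) values per block
    scales = [float(np.abs(b).max()) / 127.0 for b in blocks]
    q = [np.clip(np.round(b / s), -127, 127).astype(np.int8)
         for b, s in zip(blocks, scales)]
    return q, scales

w = np.random.randn(768, 3 * 768).astype(np.float32)         # e.g. a fused QKV weight
q_blocks, block_scales = quantize_per_block(w, num_heads=12, parallel_size=4)
```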
Per-block quantization granularity is finer than per-channel quantization granularity. The activation value is partitioned according to the parallel scale of the model: assuming the model parallel size is K and the number of data in the original activation value is n, the per-tensor quantization granularity is n and the per-block quantization granularity is n/K. The weight is likewise partitioned according to the parallel scale of the model: assuming the model parallel size is K, the number of data in the original weight is n and the number of attention heads of the weight is h, the per-channel quantization granularity is n/h and the per-block quantization granularity is n/(h × K).
Referring to fig. 5, in this figure one rectangle represents one tensor: an undivided rectangle represents the per-tensor quantization process, a rectangle divided horizontally represents the per-channel quantization process, and a rectangle divided both horizontally and vertically represents the per-block quantization process.
In fig. 5, parameters whose names begin with quant represent quantization of activation values, and parameters whose names end with weight represent quantization of weights.
The choice among the three quantization processes depends on the training method. Under data parallelism, this embodiment adopts per-tensor quantization for the activation values and per-channel quantization for the weights, so under data parallelism a parameter beginning with quant is a whole rectangle and a parameter ending with weight is a rectangle divided in two.
Under model parallelism, consider first the model-parallel process of attention: the QKV weight is split along the tensor dimension because of model parallelism, and since this split is not along the channel dimension it appears as two vertical split lines, giving a rectangle divided into four. quant_out, the output of the QKV matrix multiplications used to compute attention, is also cut longitudinally under the tensor split, as is out_weight, until the attention computation completes (i.e., the node represented by quant_fc1).
The computation process of the feed-forward neural network layer ffn is similar to the model-parallel process of attention. FC1_weight and FC2_weight become a quantization process divided into four per-blocks under the combined effect of model parallelism and per-channel quantization, and quant_fc2, the matrix multiplication result of FC1, changes from the original per-tensor quantization into per-block quantization divided in two longitudinally, until the ffn computation finishes, i.e., the output of the FC2 matrix multiplication is restored to the per-tensor quantization process.
For example, consider the matrix [[1,2,3,4],[5,6,7,8]]. Per-tensor quantization quantizes the 8 values of the matrix together; per-channel quantization quantizes [1,2,3,4] and [5,6,7,8] separately; per-block quantization first divides [1,2,3,4] into the two data blocks [1,2] and [3,4], quantizes [1,2] and [3,4] separately, and then repeats the same procedure for [5,6,7,8].
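The same example can be written out as a short sketch; the symmetric scaling helper q below is an assumption used only to make the three granularities concrete:

```python
# Hypothetical illustration of per-tensor, per-channel and per-block quantization
# on the example matrix.
import numpy as np

def q(block):
    scale = float(np.abs(block).max()) / 127.0
    return np.round(block / scale).astype(np.int8), scale

m = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=np.float32)

qt, _ = q(m)                                   # per-tensor: one scale for all 8 values
qc = [q(row) for row in m]                     # per-channel: one scale per row
qb = [q(half) for row in m                     # per-block: [1,2], [3,4], [5,6], [7,8],
      for half in np.split(row, 2)]            # each quantized with its own scale
```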
According to the data calculation method, the limited number of target activation values and target weights to be involved in matrix multiplication are subjected to quantization processing, that is, only this limited amount of floating point type data is converted into fixed point type data instead of converting all the floating point type data, so that the resources and time consumed during calculation are reduced, video memory is saved, calculation precision is not reduced, and a better experience is brought to the user. In addition, in some embodiments, quantization and inverse quantization are fused with the preceding and following operators, which further improves calculation efficiency. In addition, in some embodiments, data block granularity quantization is used for a model in the parallel training state, and quantizing the data at a finer granularity effectively reduces quantization error, avoids cross-machine communication overhead, and improves training efficiency.
Having described the method of the exemplary embodiment of the present application, next, the apparatus for data calculation of the exemplary embodiment of the present application is described with reference to fig. 6. The apparatus 60 includes:
an obtaining module 610 configured to obtain a target activation value and a target weight, where the target activation value and the target weight are to be subjected to target matrix multiplication and are both floating point type data within a first preset threshold range;
a quantization module 620, configured to perform quantization processing on the target activation value and the target weight respectively to obtain a quantization activation value corresponding to the target activation value and a quantization weight corresponding to the target weight, where the quantization activation value and the quantization weight are both fixed point type data within a second preset threshold range;
a calculation module 630 configured to perform the target matrix multiplication operation using the quantization activation value and the quantization weight;
and the inverse quantization module 640 is configured to perform inverse quantization processing on the result of the target matrix multiplication according to the quantization mode of the target activation value or the target weight to obtain target output data, where the target output data is floating point type data within a first preset threshold range.
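As a hedged sketch only, the cooperation of the four modules could look roughly as follows; the symmetric int8 scheme, the max-based scales and all function names are assumptions, not the apparatus itself:

```python
# Hypothetical end-to-end flow: quantize -> integer matrix multiplication -> dequantize.
import numpy as np

def quantize(x: np.ndarray, scale: float) -> np.ndarray:
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def data_calculation(activation_fp32: np.ndarray, weight_fp32: np.ndarray) -> np.ndarray:
    a_scale = float(np.abs(activation_fp32).max()) / 127.0   # quantization module
    w_scale = float(np.abs(weight_fp32).max()) / 127.0
    a_q = quantize(activation_fp32, a_scale)
    w_q = quantize(weight_fp32, w_scale)
    acc = a_q.astype(np.int32) @ w_q.astype(np.int32)        # calculation module (fixed point)
    return acc.astype(np.float32) * a_scale * w_scale        # inverse quantization module

out = data_calculation(np.random.randn(4, 768).astype(np.float32),
                       np.random.randn(768, 768).astype(np.float32))
```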
In one embodiment of the present application, the apparatus 60 is applied in a neural network model;
wherein the neural network model comprises at least one of the target matrix multiplication operations;
the target matrix multiplication operation is one of the following matrix multiplications:
QKV matrix multiplication of the attention calculation layer;
mapping matrix multiplication of a mapping layer;
a first fully-connected matrix multiplication of the feedforward neural network layer;
a second fully-connected matrix multiplication of the feedforward neural network layer.
In one embodiment of the present application, in the attention calculation layer, the target activation value is input data of the attention calculation layer, and the target weight includes a query weight, a keyword weight, and a value weight of the attention calculation layer;
the quantization module 620 is further configured to perform quantization processing according to the respective channel dimensions of the query weight, the keyword weight, and the value weight, to obtain a quantization query weight corresponding to the query weight, a quantization keyword weight corresponding to the keyword weight, and a quantization value weight corresponding to the value weight;
the calculation module 630 is further configured to perform matrix multiplication operations with the quantization activation value and the quantization query weight, the quantization keyword weight, and the quantization value weight, respectively;
the inverse quantization module 640 is further configured to perform inverse quantization processing on the results of the three matrix multiplication operations according to the quantization modes of the target activation values or the target weights respectively; and
and calculating attention according to a preset rule by adopting three matrix multiplication results subjected to inverse quantization, and taking the attention as output data of the attention calculation layer.
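A minimal sketch of this QKV path is given below; the per-output-channel scale choice, the shapes and every name are assumptions for illustration rather than the patent's implementation:

```python
# Hypothetical quantized QKV matrix multiplications with per-channel weight scales.
import numpy as np

def quant(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def qkv_matmul(x, w_query, w_key, w_value):
    x_scale = float(np.abs(x).max()) / 127.0
    x_q = quant(x, x_scale)                                  # quantize the activation once
    outputs = []
    for w in (w_query, w_key, w_value):
        w_scale = np.abs(w).max(axis=0) / 127.0              # one scale per output channel
        acc = x_q.astype(np.int32) @ quant(w, w_scale).astype(np.int32)
        outputs.append(acc.astype(np.float32) * x_scale * w_scale)  # inverse quantization
    return outputs                                           # dequantized Q, K, V for attention
```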
In one embodiment of the present application, when the attention calculation layer is a masked multi-head attention calculation layer, the channel dimension of the weight is the attention head dimension of the weight.
In one embodiment of the application, quantization processing or inverse quantization processing is performed through a preset fusion operator;
the fusion operator comprises a quantitative fusion operator and an inverse quantitative fusion operator;
the quantization fusion operator is used for performing fusion calculation on data calculation before quantization processing of a target activation value and quantization processing of the target activation value;
and the inverse quantization fusion operator is used for performing fusion calculation on the data calculation after inverse quantization processing and the inverse quantization processing.
In one embodiment of the present application, each target matrix multiplication corresponds to one quantized fusion operator and/or one dequantized fusion operator;
the quantization fusion operator corresponding to the QKV matrix multiplication is used for performing fusion calculation on normalization processing and quantization processing of the activation value after the normalization processing; the inverse quantization fusion operator corresponding to the QKV matrix multiplication is used for performing fusion calculation by adding an offset term and inverse quantization processing of the operation result of the QKV matrix multiplication;
the quantization fusion operator corresponding to the mapping matrix multiplication is used for performing fusion calculation on permutation conversion processing and the quantization processing of the activation value after the permutation conversion processing; the inverse quantization fusion operator corresponding to the mapping matrix multiplication is used for performing fusion calculation on the inverse quantization processing, the offset term addition and the residual addition of the operation result of the mapping matrix multiplication;
the quantization fusion operator corresponding to the first full-connection matrix multiplication is used for performing fusion calculation on normalization processing and the quantization processing of the activation value after the normalization processing; the inverse quantization fusion operator corresponding to the first full-connection matrix multiplication is used for performing fusion calculation on inverse quantization processing, bias term addition and activation operation of the operation result of the first full-connection matrix multiplication;
and the inverse quantization fusion operator corresponding to the second full-connection matrix multiplication is used for performing fusion calculation on the inverse quantization processing, the addition of the bias terms and the normalization processing of the operation result of the second full-connection matrix multiplication.
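As a hedged sketch of the kind of fusion described for the first fully-connected matrix multiplication, the epilogue below applies dequantization, bias addition and the activation in a single pass over the integer accumulator; the GELU choice and all names are assumptions:

```python
# Hypothetical fused dequantize + bias + activation epilogue.
import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(0.7978845608 * (x + 0.044715 * x ** 3)))

def fused_dequant_bias_act(acc_int32: np.ndarray, a_scale: float,
                           w_scale: float, bias: np.ndarray) -> np.ndarray:
    x = acc_int32.astype(np.float32) * (a_scale * w_scale)   # inverse quantization
    return gelu(x + bias)                                    # bias addition + activation, fused
```

Fusing these steps means the intermediate results are not written back to video memory between separate kernels, which matches the motivation for fusion stated above.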
In an embodiment of the present application, if the neural network model is in a parallel training state, performing quantization processing on the target activation value and the target weight according to a parallel training mode of the neural network model;
when the parallel training mode of the neural network model is data parallel, carrying out channel dimension quantization processing on the target weight meeting the condition, and carrying out tensor dimension quantization processing on the target activation value and the target weight not meeting the condition;
and when the parallel training mode of the neural network model is model parallel, respectively carrying out data block dimension quantization processing on the target activation value and the target weight, wherein the data block quantization modes of the target activation value and the target weight are different.
In an embodiment of the present application, the quantization module 620 is further configured to obtain the attention head dimension of the target weight; partition the target weight into blocks according to the attention head dimension to obtain each target weight sub-block; and respectively perform quantization processing on each target weight sub-block.
In an embodiment of the present application, the quantization module 620 is further configured to obtain the parallel scale of the model and the attention head dimension of the target weight; perform data partitioning on the target activation value according to the parallel scale of the model to obtain each target activation value sub-block; perform data partitioning on the target weight according to the parallel scale of the model and the attention head dimension to obtain each target weight sub-block; and respectively perform quantization processing on each target activation value sub-block and each target weight sub-block.
In an embodiment of the present application, the data blocking of the target weights according to the parallel scale of the model and the attention head dimension includes:
and taking the product of the attention head dimension of the target weight and the parallel scale of the model as the divisor for data partitioning.
According to the data calculation device, the limited number of target activation values and target weights to be involved in matrix multiplication are subjected to quantization processing, that is, only this limited amount of floating point type data is converted into fixed point type data instead of converting all the floating point type data, so that the resources and time consumed during calculation are reduced, video memory is saved, calculation precision is not reduced, and a better experience is brought to the user. In addition, in some embodiments, quantization and inverse quantization are fused with the preceding and following operators, which further improves calculation efficiency. In addition, in some embodiments, data block granularity quantization is used for a model in the parallel training state, and quantizing the data at a finer granularity effectively reduces quantization error, avoids cross-machine communication overhead, and improves training efficiency.
Having described the method and apparatus of the exemplary embodiments of the present application, next, a computer-readable storage medium of the exemplary embodiments of the present application is described with reference to fig. 7, which illustrates an optical disc 70, on which a computer program (i.e., a program product) is stored, where the computer program, when executed by a processor, implements the steps described in the above method embodiments, for example, obtains a target activation value and a target weight, where the target activation value and the target weight are to be subjected to a target matrix multiplication operation and are both floating point type data within a first preset threshold range; respectively carrying out quantization processing on the target activation value and the target weight to obtain a quantization activation value corresponding to the target activation value and a quantization weight corresponding to the target weight, wherein the quantization activation value and the quantization weight are both fixed point type data within a second preset threshold range; performing the target matrix multiplication operation by using the quantization activation value and the quantization weight; performing inverse quantization processing on the result of the target matrix multiplication operation according to the quantization mode of the target activation value or the target weight to obtain target output data, wherein the target output data is floating point type data within a first preset threshold range; the specific implementation of each step is not repeated here.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory, or other optical and magnetic storage media, which are not described in detail herein.
Having described the methods, apparatus, and storage media of the exemplary embodiments of the present application, a device for data computation of the exemplary embodiments of the present application will now be described with reference to fig. 8.
FIG. 8 illustrates a block diagram of an exemplary computing device 80 suitable for use in implementing embodiments of the present application, where the computing device 80 may be a computer system or server. The computing device 80 shown in fig. 8 is only one example and should not impose any limitations on the functionality or scope of use of embodiments of the application.
As shown in fig. 8, components of computing device 80 may include, but are not limited to: one or more processors or processing units 801, a system memory 802, and a bus 803 that couples various system components including the system memory 802 and the processing unit 801.
Computing device 80 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computing device 80 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 802 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)8021 and/or cache memory 8022. Computing device 80 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, ROM8023 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 8, and typically referred to as a "hard disk drive"). Although not shown in FIG. 8, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 803 by one or more data media interfaces. At least one program product may be included in system memory 802 having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the application.
Program/utility 8025, having a set (at least one) of program modules 8024, can be stored, for example, in system memory 802, and such program modules 8024 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment. Program modules 8024 generally perform the functions and/or methods of embodiments described herein.
Computing device 80 may also communicate with one or more external devices 804 (e.g., keyboard, pointing device, display, etc.). Such communication may be through input/output (I/O) interfaces 805. Moreover, computing device 80 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via network adapter 806. As shown in FIG. 8, the network adapter 806 communicates with other modules of the computing device 80, such as the processing unit 801, over the bus 803. It should be appreciated that although not shown in FIG. 8, other hardware and/or software modules may be used in conjunction with computing device 80.
The processing unit 801 executes various functional applications and data calculations by running a program stored in the system memory 802, for example, obtains a target activation value and a target weight, where the target activation value and the target weight are to be subjected to a target matrix multiplication operation and are both floating point type data within a first preset threshold range; respectively carrying out quantization processing on the target activation value and the target weight to obtain a quantization activation value corresponding to the target activation value and a quantization weight corresponding to the target weight, wherein the quantization activation value and the quantization weight are both fixed point type data within a second preset threshold range; performing the target matrix multiplication operation by using the quantization activation value and the quantization weight; and performing inverse quantization processing on the result of the target matrix multiplication operation according to the quantization mode of the target activation value or the target weight to obtain target output data, wherein the target output data is floating point type data within a first preset threshold range. The specific implementation of each step is not repeated here.
It should be noted that although several units/modules or sub-units/modules of the data computing device are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module, according to embodiments of the application. Conversely, the features and functions of one unit/module described above may be further divided so as to be embodied by a plurality of units/modules.
Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the application have been described with reference to several particular embodiments, it is to be understood that the application is not limited to the specific embodiments disclosed, and that the division into aspects is for convenience of description only and does not imply that features in these aspects cannot be combined to advantage. The application is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (13)

1. A method of data computation, comprising:
acquiring a target activation value and a target weight, wherein the target activation value and the target weight are to be subjected to target matrix multiplication and are floating point type data within a first preset threshold range;
respectively carrying out quantization processing on the target activation value and the target weight to obtain a quantization activation value corresponding to the target activation value and a quantization weight corresponding to the target weight, wherein the quantization activation value and the quantization weight are both fixed point type data within a second preset threshold range;
performing the target matrix multiplication operation by using the quantization activation value and the quantization weight;
and performing inverse quantization processing on the result of the target matrix multiplication operation according to the quantization mode of the target activation value or the target weight to obtain target output data, wherein the target output data is floating point type data within a first preset threshold range.
2. The data computation method of claim 1, applied in a neural network model;
wherein the neural network model comprises at least one of the target matrix multiplication operations;
the target matrix multiplication operation is one of the following matrix multiplications:
QKV matrix multiplication of the attention calculation layer;
mapping matrix multiplication of a mapping layer;
a first fully-connected matrix multiplication of the feedforward neural network layer;
a second fully-connected matrix multiplication of the feedforward neural network layer.
3. The data calculation method according to claim 2, wherein in the attention calculation layer, the target activation value is input data of the attention calculation layer, and the target weight includes a query weight, a keyword weight, and a value weight of the attention calculation layer;
performing quantization processing on the target weight, including:
respectively carrying out quantization processing according to the channel dimensions of the query weight, the keyword weight and the value weight to obtain a quantization query weight corresponding to the query weight, a quantization keyword weight corresponding to the keyword weight and a quantization value weight corresponding to the value weight;
the performing the target matrix multiplication operation by using the quantization activation value and the quantization weight includes:
matrix multiplication is carried out on the quantization activation value, the quantization inquiry weight, the quantization keyword weight and the quantization value weight respectively;
performing inverse quantization processing on the result of the target matrix multiplication according to the quantization mode of the target activation value or the target weight to obtain target output data, including:
performing inverse quantization processing on the results of the three matrix multiplication operations according to the quantization modes of the target activation values or the target weights respectively;
and calculating attention according to a preset rule by adopting three matrix multiplication results subjected to inverse quantization, and taking the attention as output data of the attention calculation layer.
4. The data calculation method of claim 3, wherein when the attention calculation layer is a masked multi-head attention calculation layer, the channel dimension of a weight is the attention head dimension of the weight.
5. The data calculation method according to claim 2, wherein quantization processing or inverse quantization processing is performed by a preset fusion operator;
the fusion operator comprises a quantitative fusion operator and an inverse quantitative fusion operator;
the quantization fusion operator is used for performing fusion calculation on data calculation before quantization processing of a target activation value and quantization processing of the target activation value;
and the inverse quantization fusion operator is used for performing fusion calculation on the data calculation after inverse quantization processing and the inverse quantization processing.
6. The data calculation method of claim 5, wherein each target matrix multiplication corresponds to one quantized fusion operator and/or one inverse quantized fusion operator;
the quantization fusion operator corresponding to the QKV matrix multiplication is used for performing fusion calculation on normalization processing and quantization processing of the activation value after the normalization processing; the inverse quantization fusion operator corresponding to the QKV matrix multiplication is used for performing fusion calculation by adding an offset term and inverse quantization processing of the operation result of the QKV matrix multiplication;
the quantization fusion operator corresponding to the mapping matrix multiplication is used for performing fusion calculation on permutation conversion processing and the quantization processing of the activation value after the permutation conversion processing; the inverse quantization fusion operator corresponding to the mapping matrix multiplication is used for performing fusion calculation on inverse quantization processing, bias term addition and residual addition of the operation result of the mapping matrix multiplication;
the quantization fusion operator corresponding to the first full-connection matrix multiplication is used for performing fusion calculation on normalization processing and the quantization processing of the activation value after the normalization processing; the inverse quantization fusion operator corresponding to the first full-connection matrix multiplication is used for performing fusion calculation on inverse quantization processing, bias term addition and activation operation of the operation result of the first full-connection matrix multiplication;
and the inverse quantization fusion operator corresponding to the second full-connection matrix multiplication is used for performing fusion calculation on the inverse quantization processing, the addition of the bias terms and the normalization processing of the operation result of the second full-connection matrix multiplication.
7. The data calculation method according to any one of claims 2 to 5, wherein if the neural network model is in a parallel training state, the target activation value and the target weight are respectively subjected to quantization processing according to a parallel training mode of the neural network model;
when the parallel training mode of the neural network model is data parallel, carrying out channel dimension quantization processing on the target weight meeting the condition, and carrying out tensor dimension quantization processing on the target activation value and the target weight not meeting the condition;
and when the parallel training mode of the neural network model is model parallel, respectively carrying out data block dimension quantization processing on the target activation value and the target weight, wherein the data block quantization modes of the target activation value and the target weight are different.
8. The data computing method of claim 7, wherein the performing quantization processing of channel dimensions on the target weights that meet conditions comprises:
obtaining the attention head dimension of the target weight;
partitioning the target weight into blocks according to the attention head dimension to obtain each target weight sub-block;
and respectively carrying out quantization processing on each target weight subblock.
9. The data calculation method of claim 7, wherein the performing quantization processing on the target activation values and the target weights respectively for data block dimensions comprises:
acquiring the parallel scale of the model and the attention head dimension of the target weight;
performing data partitioning on the target activation value according to the parallel scale of the model to obtain each target activation value sub-block; performing data blocking on the target weight according to the parallel scale of the model and the dimension of the attention head to obtain each target weight sub-block;
and respectively carrying out quantization processing on each target activation value sub-block and each target weight sub-block.
10. The data computing method of claim 7, wherein the data partitioning of the target weights according to the parallel scale of the model and the attention head dimension comprises:
and taking the product of the attention head dimension of the target weight and the parallel scale of the model as the divisor for data partitioning.
11. A data computing apparatus, comprising:
the acquisition module is configured to acquire a target activation value and a target weight, wherein the target activation value and the target weight are to be subjected to target matrix multiplication and are floating point type data within a first preset threshold range;
the quantization module is configured to perform quantization processing on the target activation value and the target weight respectively to obtain a quantization activation value corresponding to the target activation value and a quantization weight corresponding to the target weight, wherein the quantization activation value and the quantization weight are both fixed point type data within a second preset threshold range;
a calculation module configured to perform the target matrix multiplication operation using the quantization activation value and the quantization weight;
and the inverse quantization module is configured to perform inverse quantization processing on the result of the target matrix multiplication according to the quantization mode of the target activation value or the target weight to obtain target output data, wherein the target output data is floating point type data within a first preset threshold range.
12. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method of any one of claims 1-10.
13. A computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1-10 when executing the computer program.
CN202210424489.XA 2022-04-22 2022-04-22 Data calculation method, device, storage medium and equipment Pending CN114861907A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210424489.XA CN114861907A (en) 2022-04-22 2022-04-22 Data calculation method, device, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210424489.XA CN114861907A (en) 2022-04-22 2022-04-22 Data calculation method, device, storage medium and equipment

Publications (1)

Publication Number Publication Date
CN114861907A true CN114861907A (en) 2022-08-05

Family

ID=82633868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210424489.XA Pending CN114861907A (en) 2022-04-22 2022-04-22 Data calculation method, device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN114861907A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024067563A1 (en) * 2022-09-27 2024-04-04 杭州海康威视数字技术股份有限公司 Task processing method and apparatus based on model quantization, and device and storage medium
CN117035123A (en) * 2023-10-09 2023-11-10 之江实验室 Node communication method, storage medium and device in parallel training
CN117035123B (en) * 2023-10-09 2024-01-09 之江实验室 Node communication method, storage medium and device in parallel training

Similar Documents

Publication Publication Date Title
CN114861907A (en) Data calculation method, device, storage medium and equipment
CN110457718B (en) Text generation method and device, computer equipment and storage medium
CN109902301B (en) Deep neural network-based relationship reasoning method, device and equipment
CN110781686B (en) Statement similarity calculation method and device and computer equipment
CN111651573B (en) Intelligent customer service dialogue reply generation method and device and electronic equipment
CN113254620B (en) Response method, device and equipment based on graph neural network and storage medium
CN115238893B (en) Neural network model quantification method and device for natural language processing
CN111738435A (en) Online sparse training method and system based on mobile equipment
Huai et al. Zerobn: Learning compact neural networks for latency-critical edge systems
CN113761868A (en) Text processing method and device, electronic equipment and readable storage medium
CN112528598B (en) Automatic text abstract evaluation method based on pre-training language model and information theory
US11797769B1 (en) Artificial intelligence system using hybrid technique for task-oriented dialog management
CN113408704A (en) Data processing method, device, equipment and computer readable storage medium
CN116957043A (en) Model quantization method, device, equipment and medium
CN116975347A (en) Image generation model training method and related device
CN112418388A (en) Method and device for realizing deep convolutional neural network processing
CN116957006A (en) Training method, device, equipment, medium and program product of prediction model
CN114065913A (en) Model quantization method and device and terminal equipment
CN113849592B (en) Text emotion classification method and device, electronic equipment and storage medium
CN117540780B (en) Compression method and related device of neural network model
CN111091198A (en) Data processing method and device
CN117808083B (en) Distributed training communication method, device, system, equipment and storage medium
CN113298248B (en) Processing method and device for neural network model and electronic equipment
CN117609444B (en) Searching question-answering method based on large model
KR20230062008A (en) Inference method using transformer model and electronic device for performing the same

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination