CN115879504B - Device and method for splitting and quantizing layernorm operator - Google Patents

Device and method for splitting and quantizing layernorm operator

Info

Publication number
CN115879504B
CN115879504B (Application CN202211729854.4A)
Authority
CN
China
Prior art keywords
operator
unit
splitting
tensor
quantization
Prior art date
Legal status
Active
Application number
CN202211729854.4A
Other languages
Chinese (zh)
Other versions
CN115879504A (en)
Inventor
郝鑫 (Hao Xin)
吴晗 (Wu Han)
Current Assignee
Zhuhai Ouye Semiconductor Co ltd
Original Assignee
Zhuhai Ouye Semiconductor Co ltd
Priority date
Filing date
Publication date
Application filed by Zhuhai Ouye Semiconductor Co ltd
Priority to CN202211729854.4A
Publication of CN115879504A
Application granted
Publication of CN115879504B


Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application discloses a layernorm operator splitting and quantizing device and method. The device comprises a mean value calculation unit, a splitting unit, a preset number of parallel quantization units, and a splicing unit, wherein the mean value calculation unit is connected with the splitting unit, the splitting unit is connected with each quantization unit, each quantization unit is connected with the splicing unit, and the preset number is equal to the number of data categories contained in the input tensor. The splitting unit splits the input tensor into the preset number of sub-tensors, and each quantization unit processes one sub-tensor, so that tokens of different data categories in the input tensor are processed in parallel by different quantization units. This avoids the problem that the data values of some tokens are filtered out because the data value distributions of tokens of different data categories differ greatly, ensures the precision of the quantized layernorm operator, and improves the model precision of a network model carrying the layernorm operator deployed on an embedded device while maintaining execution efficiency.

Description

Device and method for splitting and quantizing layernorm operator
Technical Field
The application relates to the technical field of deep learning, in particular to a layernorm operator splitting and quantizing device and method.
Background
In visual classification, the Vision Transformer (ViT) network is widely used. The ViT introduces the Transformer structure, and the sequential-computation problem of RNNs (or LSTM, GRU, etc.) is solved by the self-attention mechanism in the Transformer structure, so that performance and accuracy superior to CNNs are obtained.
When the ViT network is applied to an embedded device, the deployed ViT must be quantized (for example, from float32 to int8) because of the power-consumption limits of the embedded device. However, since embedded devices generally do not support int8 division, the layernorm operator needs to be split into several small operators and then quantized. When the layernorm operator is split into a plurality of small operators for quantization, the quantized model commonly suffers from low precision.
There is thus a need for improvement in the art.
Disclosure of Invention
The application aims to solve the technical problem of providing a layernorm operator splitting and quantizing device and method aiming at the defects of the prior art.
In order to solve the above technical problem, a first aspect of an embodiment of the present application provides a layernorm operator splitting and quantization apparatus, the apparatus comprising: a mean value calculation unit, a splitting unit, a preset number of parallel quantization units, and a splicing unit, wherein the mean value calculation unit is connected with the splitting unit, the splitting unit is connected with each quantization unit, each quantization unit is connected with the splicing unit, and the preset number is the same as the number of data categories contained in the input tensor of the mean value calculation unit.
The layernorm operator splitting and quantizing device, wherein the mean value calculation unit is used for calculating a mean value corresponding to the input tensor and determining a candidate tensor based on the input tensor and the mean value; the splitting unit is used for splitting the candidate tensor into a preset number of sub-tensors based on the data categories contained in the input tensor; the quantization units are used for quantizing the sub-tensors to obtain quantized sub-tensors; the splicing unit is used for splicing the quantized sub-tensors and determining a batch normalization tensor corresponding to the input tensor based on the spliced tensor obtained by splicing.
The layernorm operator splitting and quantizing device, wherein the data categories comprise a classification category and an image category.
The layernorm operator splitting and quantizing device, wherein the quantization unit comprises a square operator, a mean operator, an addition operator, a square-root operator and a division operator, which are sequentially connected, and the division operator is connected with the splitting unit.
The layernorm operator splitting and quantizing device, wherein the splicing unit comprises a splicing operator, a multiplication operator and an addition operator which are sequentially connected, and the splicing operator is connected with each quantization unit.
The layernorm operator splitting and quantizing device, wherein all operators contained in the mean value calculation unit, the splitting unit, the parallel quantization units and the splicing unit adopt the int8 data type, except the division operator contained in the quantization unit.
The second aspect of the embodiment of the application provides a layernorm operator splitting and quantization method, which comprises the following steps:
inputting the input tensor into a mean value calculation unit, calculating a mean value corresponding to the input tensor through the mean value calculation unit, and determining a candidate tensor based on the input tensor and the mean value;
inputting the candidate tensor into a splitting unit, and splitting the candidate tensor into a preset number of sub-tensors through the splitting unit, wherein each sub-tensor corresponds to a data category, and the data categories corresponding to the sub-tensors are different;
inputting each sub-tensor into a respective quantization unit, and quantizing each sub-tensor by its quantization unit to obtain quantized sub-tensors, wherein the quantization units correspond one-to-one to the sub-tensors;
and inputting the quantized sub-tensors into a splicing unit, splicing the quantized sub-tensors through the splicing unit, and determining a batch normalization tensor corresponding to the input tensor based on the spliced tensor obtained by splicing.
The layernorm operator splitting and quantization method, wherein the data categories comprise a classification category and an image category.
A third aspect of the embodiments of the present application provides a computer-readable storage medium storing one or more programs executable by one or more processors to implement the steps in the layernorm operator splitting and quantization method as described in any of the above.
A fourth aspect of an embodiment of the present application provides an embedded device comprising a layernorm operator splitting and quantization apparatus as described above.
The beneficial effects are that: compared with the prior art, the application provides a layernorm operator splitting and quantizing device and method, wherein the device comprises: a mean value calculation unit, a splitting unit, a preset number of parallel quantization units, and a splicing unit, wherein the mean value calculation unit is connected with the splitting unit, the splitting unit is connected with each quantization unit, each quantization unit is connected with the splicing unit, and the preset number is equal to the number of data categories contained in the input tensor of the mean value calculation unit. The input tensor to be processed by the layernorm operator is split into the preset number of sub-tensors by the splitting unit, and the sub-tensors are then processed by the preset number of parallel quantization units, so that tokens of different data categories in the input tensor are processed in parallel by different quantization units. This avoids the problem that the data values of some tokens are filtered out because the data value distributions of tokens of different data categories differ greatly, ensures the precision of the quantized layernorm operator, and improves the model precision of the network model carrying the layernorm operator deployed on the embedded device while maintaining execution efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without creative effort for a person of ordinary skill in the art.
Fig. 1 is a schematic structural diagram of the layernorm operator splitting and quantizing device provided by the application.
Fig. 2 is a diagram illustrating an example of the layernorm operator splitting and quantizing device provided by the application.
Fig. 3 is a flowchart of the layernorm operator splitting and quantization method provided by the application.
Detailed Description
The application provides a layernorm operator splitting and quantizing device and method. In order to make the purposes, technical solutions and effects of the application clearer and more definite, the application is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
It should be understood that the sequence numbers of the steps in this embodiment do not imply an order of execution; the execution order of each process is determined by its function and internal logic, and should not be construed as limiting the implementation of the embodiments of the present application.
According to research, Vision Transformer (ViT) networks are widely applied to visual classification; the ViT introduces the Transformer structure, and the sequential-computation problem of RNNs (or LSTM, GRU, etc.) is solved through the self-attention mechanism in the Transformer structure, so that performance and accuracy superior to those of CNNs are achieved.
When the ViT network is applied to an embedded device, the deployed ViT must be quantized (for example, from float32 to int8) because of the power-consumption limits of the embedded device. However, since embedded devices generally do not support int8 division, the layernorm operator needs to be split into several small operators and then quantized. When the layernorm operator is split into a plurality of small operators for quantization, the quantized model commonly suffers from low precision.
In order to solve the above problems, the layernorm operator splitting and quantization apparatus provided in an embodiment of the present application is used for splitting and quantizing the layernorm operator, and the apparatus includes: a mean value calculation unit, a splitting unit, a preset number of parallel quantization units, and a splicing unit, wherein the mean value calculation unit is connected with the splitting unit, the splitting unit is connected with each quantization unit, each quantization unit is connected with the splicing unit, and the preset number is equal to the number of data categories contained in the input tensor of the mean value calculation unit. The input tensor to be processed by the layernorm operator is split into the preset number of sub-tensors by the splitting unit, and the sub-tensors are then processed by the preset number of parallel quantization units, so that tokens of different data categories in the input tensor are processed in parallel by different quantization units. This avoids the problem that the data values of some tokens are filtered out because the data value distributions of tokens of different data categories differ greatly, ensures the precision of the quantized layernorm operator, and improves the model precision of the network model carrying the layernorm operator deployed on the embedded device while maintaining execution efficiency.
The application will be further described by the description of embodiments with reference to the accompanying drawings.
The embodiment provides a layernorm operator splitting and quantizing device, which is used for splitting and quantizing a layernorm operator, wherein the calculation formula of the layernorm operator is as follows:

y = (x − E[x]) / sqrt(Var[x] + ε) * γ + β

where x represents the input tensor, y represents the batch normalized tensor, E[x] represents the mean, Var[x] represents the variance, ε represents a preset fixed value, and γ and β represent the transformation coefficients of the affine transformation.
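For reference, a minimal floating-point sketch of this formula, assuming a numpy-style implementation that normalizes over the last dimension (the function and variable names below are illustrative, not taken from the patent):

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    """Reference layernorm over the last dimension: y = (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta."""
    mean = x.mean(axis=-1, keepdims=True)                  # E[x]
    var = ((x - mean) ** 2).mean(axis=-1, keepdims=True)   # Var[x]
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

# A ViT-style activation with 50 tokens and 768 channels (illustrative shape).
x = np.random.randn(1, 50, 768).astype(np.float32)
y = layernorm(x, np.ones(768, dtype=np.float32), np.zeros(768, dtype=np.float32))
```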
As shown in fig. 1, the layernorm operator splitting and quantization device in this embodiment may include a mean value calculation unit 100, a splitting unit 200, a preset number of parallel quantization units 300, and a splicing unit 400, where the mean value calculation unit 100 is connected to the splitting unit 200, the splitting unit 200 is connected to each quantization unit 300, and each quantization unit 300 is connected to the splicing unit 400. The preset number is equal to the number of data categories contained in the input tensor to be batch-normalized by the layernorm operator, that is, the number of parallel quantization units is the same as the number of data categories contained in the input tensor, and each quantization unit quantizes the tokens of one data category. Thus, tokens of different data categories in the input tensor are processed in parallel by different quantization units, which avoids the problem that the data values of some tokens are filtered out because the data value distributions of tokens of different data categories differ greatly, ensures the precision of the quantized layernorm operator, and further improves the model precision of the network model carrying the layernorm operator deployed on the embedded device.
The mean value calculation unit is used for calculating the mean value corresponding to the input tensor and determining the candidate tensor based on the input tensor and the mean value. The input tensor is the input of the mean value calculation unit, and the candidate tensor is its output, where the candidate tensor is equal to the difference between the input tensor and the mean tensor corresponding to the input tensor. In one implementation, the mean value calculation unit is used to calculate x − E[x] in the calculation formula of the layernorm operator. As shown in fig. 2, the mean value calculation unit may include a mean operator ReduceMean and a subtraction operator Sub, where the mean operator ReduceMean is connected to the subtraction operator Sub; the subtraction operator Sub is connected to the operator that outputs the input tensor and to the splitting unit, and the subtraction operator Sub subtracts the output of the mean operator ReduceMean from the input tensor to obtain the candidate tensor, which is input into the splitting unit.
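A minimal sketch of the mean value calculation unit under the same assumptions (ReduceMean over the feature dimension followed by Sub; the names are illustrative):

```python
import numpy as np

def mean_calculation_unit(x):
    """ReduceMean followed by Sub: returns the candidate tensor x - E[x]."""
    mean = x.mean(axis=-1, keepdims=True)   # ReduceMean over the feature dimension
    return x - mean                         # Sub: candidate tensor passed to the splitting unit
```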
In addition, since the candidate tensor is obtained by subtracting the output of the mean operator from the input tensor, the mean value calculation unit does not change the number of tokens in the input tensor or the data category of each token; that is, the number of tokens and the data category of each token contained in the input tensor are the same as the number of tokens and the data category of each token contained in the candidate tensor. Thus, the subsequent splitting unit may split the candidate tensor according to the data categories contained in the input tensor.
The splitting unit is configured to split the candidate tensor into a preset number of sub-tensors based on the data categories contained in the input tensor, where the tokens contained in each of the preset number of sub-tensors are of the same data category, and the data categories of the tokens in different sub-tensors differ from each other. For example, the input tensor includes 50 tokens, where the data category of one token is the classification category and the data category of the other 49 tokens is the image category; the splitting unit then splits the candidate tensor into two sub-tensors, respectively denoted as sub-tensor A and sub-tensor B, where sub-tensor A contains the one token of the classification category and sub-tensor B contains the 49 tokens of the image category. Furthermore, in one implementation, as shown in fig. 2, the splitting unit may employ a Split operator, by which the candidate tensor is split into two sub-tensors.
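A sketch of this split for a candidate tensor of shape [1, 50, C], where the first token belongs to the classification category and the remaining 49 to the image category (the shape and names are illustrative):

```python
import numpy as np

def splitting_unit(candidate):
    """Split along the token axis into the classification-category and image-category sub-tensors."""
    sub_a = candidate[:, :1, :]    # sub-tensor A: [1, 1, C], the classification token
    sub_b = candidate[:, 1:, :]    # sub-tensor B: [1, 49, C], the 49 image-patch tokens
    return sub_a, sub_b
```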
The splitting in this embodiment is based on the data categories contained in the input tensor, so as to avoid the problem that the data values of some tokens are filtered out due to the large difference in data value distribution between tokens of different data categories, and to ensure the precision of the quantized layernorm operator. This is because research found that the layernorm operator in the ViT model leads to a serious degradation of the accuracy of the ViT model after splitting and quantization. Therefore, the application performs a layer-by-layer precision analysis on the model before and after quantization using cosine similarity (that is, the output tensor of each layer is compared by cosine similarity), and finds that after the division operator, the cosine similarity of the tensors before and after quantization differs obviously, while before the division operator the cosine similarity is basically the same. However, the computation of the division operator itself is independent of quantization, so the tensor of dimension [1,50,1] input to the division operator is analyzed; according to the distribution of this [1,50,1] tensor, the first value in [1,50,1] and the other 49 values are found not to be of the same order of magnitude, so that after quantization the first value is quantized to 0, thereby affecting the accuracy of the quantized model.
Further, studying [1,50,1], it was found that the 50 in [1,50,1] is the number of tokens of the transformer network (a token is a word or character in semantic-analysis models; in image classification it is actually an image cut into non-overlapping patch sequences): the first of the 50 tokens is the classification token, and the last 49 tokens are tokens of image patches. That is, the data category of the first token and the data category of the last 49 tokens are different, resulting in a large difference in distribution between the first token and the last 49 tokens. The quantization process has to choose a threshold, which results in the first token being filtered out because its distribution differs greatly from that of the other 49 tokens. Therefore, in this embodiment, after the candidate tensor is obtained, the candidate tensor is split into the preset number of sub-tensors by the splitting unit, and each sub-tensor is then quantized by a parallel quantization unit, so that the problem of large differences in value distribution caused by different data categories can be avoided, and the model precision of the quantized model can be improved.
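The layer-by-layer analysis and the filtering effect described above can be sketched as follows; the helper names and numeric values are illustrative, and a single shared int8 threshold over the whole [1, 50, 1] tensor is assumed in order to reproduce the reported behaviour:

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def fake_quant_int8(t, scale):
    """Symmetric int8 quantize/dequantize with one shared scale (threshold)."""
    return np.clip(np.round(t / scale), -128, 127) * scale

# Tensor of dimension [1, 50, 1] entering the division operator: the first
# (classification-token) value is not of the same order of magnitude as the other 49.
t = np.concatenate([np.full((1, 1, 1), 0.02), np.full((1, 49, 1), 30.0)], axis=1)
scale = np.abs(t).max() / 127.0                  # one threshold chosen over the whole tensor
print(fake_quant_int8(t, scale)[0, 0, 0])        # 0.0 -> the classification token is filtered out
print(cosine_similarity(t, fake_quant_int8(t, scale)))   # per-layer comparison metric
```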
In one implementation, the layernorm operator splitting and quantization device is applied to the ViT network, so the data categories included in the input tensor include a classification category and an image category, where the data category of the first token in the input tensor is the classification category and the data category of the last 49 tokens is the image category. Thus, when splitting the candidate tensor based on the data category, the first token may be split directly into one sub-tensor and the last 49 tokens into another sub-tensor according to the position of each token in the candidate tensor.
The network structure of each quantization unit in the preset number of quantization units is the same, and each quantization unit is used for quantizing its corresponding sub-tensor; that is, the preset number of quantization units corresponds one-to-one to the preset number of sub-tensors, and each quantization unit quantizes its corresponding sub-tensor. In one implementation, the preset number of quantization units is used to calculate the term (x − E[x]) / sqrt(Var[x] + ε) in the calculation formula of the layernorm operator. Correspondingly, as shown in fig. 2, the quantization unit includes a square operator Pow, a mean operator ReduceMean, an addition operator Add, a square-root operator Sqrt, and a division operator Div, where the square operator Pow, the mean operator ReduceMean, the addition operator Add, the square-root operator Sqrt, and the division operator Div are sequentially connected, and the division operator Div is connected with the splitting unit.
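A sketch of the operator chain inside one quantization unit, assuming it receives one sub-tensor of the candidate tensor x − E[x] (names are illustrative, and the int8 calibration of each operator is omitted):

```python
import numpy as np

def quantization_unit(sub, eps=1e-5):
    """Pow -> ReduceMean -> Add -> Sqrt -> Div over one sub-tensor of x - E[x]."""
    sq = sub ** 2                              # Pow
    var = sq.mean(axis=-1, keepdims=True)      # ReduceMean: variance of the centred sub-tensor
    denom = np.sqrt(var + eps)                 # Add (eps) followed by Sqrt
    return sub / denom                         # Div, which also receives the sub-tensor from the splitting unit
```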
The splicing unit is used for splicing the quantized sub-tensors and determining the batch normalization tensor corresponding to the input tensor based on the spliced tensor obtained by splicing; that is, the splicing unit splices the quantized tensors corresponding to the sub-tensors to obtain the spliced tensor, and then performs an affine transformation on the spliced tensor to obtain the batch normalization tensor. Based on this, in one implementation, as shown in fig. 2, the splicing unit includes a splicing operator Concat, a multiplication operator Mul, and an addition operator Add that are sequentially connected, where the splicing operator Concat is connected with each quantization unit.
Furthermore, because the division operator needs to use float data, quantizing the layernorm operator with the layernorm operator splitting and quantization device provided in this embodiment allows all of the layernorm operator except the division operator to be quantized to int8; that is, after the layernorm operator is quantized with the layernorm operator splitting and quantization device provided in this embodiment, all operators in the ViT network except the division operator can be of the int8 type. Therefore, the model precision of the quantized ViT network can be ensured, and the quantized ViT network can be deployed on an embedded device, so that the inference speed and inference accuracy of the model on the embedded device are improved.
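A sketch of how separate int8 scales per sub-tensor might be calibrated (max-abs calibration is an assumption; the patent does not prescribe a calibration method), with the division itself left in floating point:

```python
import numpy as np

def choose_scale(t):
    """Illustrative symmetric per-tensor int8 scale (max-abs calibration)."""
    return max(float(np.abs(t).max()), 1e-12) / 127.0

def quantize_int8(t, scale):
    return np.clip(np.round(t / scale), -128, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Separate scales for sub-tensor A (classification token) and sub-tensor B (image tokens)
# keep the small classification-token values from being rounded to zero by a scale chosen
# for the much larger image-token values.
sub_a = np.full((1, 1, 1), 0.02, dtype=np.float32)    # illustrative classification-token values
sub_b = np.full((1, 49, 1), 30.0, dtype=np.float32)   # illustrative image-token values
qa = dequantize(quantize_int8(sub_a, choose_scale(sub_a)), choose_scale(sub_a))  # ~0.02, preserved
qb = dequantize(quantize_int8(sub_b, choose_scale(sub_b)), choose_scale(sub_b))  # ~30.0
```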
In summary, the present embodiment provides a layernorm operator splitting and quantization device, where the device includes: a mean value calculation unit, a splitting unit, a preset number of parallel quantization units, and a splicing unit, wherein the mean value calculation unit is connected with the splitting unit, the splitting unit is connected with each quantization unit, each quantization unit is connected with the splicing unit, and the preset number is equal to the number of data categories contained in the input tensor of the mean value calculation unit. The input tensor to be processed by the layernorm operator is split into the preset number of sub-tensors by the splitting unit, and the sub-tensors are then processed by the preset number of parallel quantization units, so that tokens of different data categories in the input tensor are processed in parallel by different quantization units. This avoids the problem that the data values of some tokens are filtered out because the data value distributions of tokens of different data categories differ greatly, ensures the precision of the quantized layernorm operator, and further improves the model precision of the network model carrying the layernorm operator deployed on the embedded device.
Based on the above layernorm operator splitting and quantization device, the present embodiment provides a layernorm operator splitting and quantization method, as shown in fig. 3, where the method includes the following steps (a combined sketch of these steps, under the assumptions of the earlier sketches, follows step S40):
s10, inputting an input tensor into a mean value calculation unit, calculating a mean value corresponding to the input tensor through the mean value calculation unit, and determining a candidate tensor based on the input tensor and the mean value;
s20, inputting the candidate tensor into a splitting unit, and splitting the candidate tensor into a preset number of sub tensors through the splitting unit, wherein each sub tensor corresponds to a data category, and the data categories corresponding to the sub tensors are different;
s30, inputting each sub-tensor into each quantization unit respectively, and quantizing each sub-tensor through each quantization unit to obtain quantized sub-tensors, wherein each quantization unit corresponds to each sub-tensor one by one;
s40, inputting a plurality of quantized sub-tensors into a splicing unit, splicing the quantized sub-tensors through the splicing unit, and determining a batch normalization tensor corresponding to the input tensor based on the spliced tensor obtained by splicing.
In one implementation, the data categories include a classification category and an image category.
Based on the above layernorm operator splitting and quantization method, the present embodiment provides a computer-readable storage medium storing one or more programs executable by one or more processors to implement the steps in the layernorm operator splitting and quantization method as described in the above embodiment.
Based on the layernorm operator splitting and quantization device, the application further provides an embedded device, which includes the layernorm operator splitting and quantization device, so that layernorm operator splitting and quantization are performed by the layernorm operator splitting and quantization device.
In addition, the method is performed by the above device, and the execution process of the method has been described in detail in the description of each unit module of the device, so it is not repeated here; likewise, the specific processes by which the storage medium and the processors in the terminal device load and execute the plurality of instructions have been described in detail in the above method and are not repeated here.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A layernorm operator splitting and quantization apparatus, the apparatus comprising: a mean value calculation unit, a splitting unit, a preset number of parallel quantization units, and a splicing unit, wherein the mean value calculation unit is connected with the splitting unit, the splitting unit is connected with each quantization unit, each quantization unit is connected with the splicing unit, and the preset number is the same as the number of data categories contained in an input tensor of the mean value calculation unit;
the input tensor to be processed by the layerrnom operator is split into the token by the splitting unit, different quantization units are adopted for parallel processing, the filtering of partial token data values caused by large data value distribution difference among the token of different data types is avoided, the precision of the quantized layerrnom operator is ensured, and the model precision of a network model which is deployed on embedded equipment and carries the layerrnom operator is further improved on the premise of ensuring the execution efficiency.
2. The layernorm operator splitting and quantization apparatus according to claim 1, wherein the mean value calculation unit is configured to calculate a mean value corresponding to an input tensor and determine a candidate tensor based on the input tensor and the mean value; the splitting unit is used for splitting the candidate tensor into a preset number of sub-tensors based on the data categories contained in the input tensor; the quantization units are used for quantizing the sub-tensors to obtain quantized sub-tensors; the splicing unit is used for splicing the quantized sub-tensors and determining a batch normalization tensor corresponding to the input tensor based on the spliced tensor obtained by splicing.
3. The layernorm operator splitting and quantization apparatus of claim 1, wherein the data categories include a classification category and an image category.
4. The layernorm operator splitting and quantization apparatus according to claim 1, wherein the quantization unit comprises a square operator, a mean operator, an addition operator, a square-root operator and a division operator, which are sequentially connected, and the division operator is connected with the splitting unit.
5. The layernorm operator splitting and quantization apparatus according to claim 1, wherein the splicing unit comprises a splicing operator, a multiplication operator and an addition operator which are sequentially connected, and the splicing operator is connected with each quantization unit.
6. The layernorm operator splitting and quantization apparatus according to any of claims 1-5, wherein all operators contained in the mean value calculation unit, the splitting unit, the parallel quantization units and the splicing unit adopt the int8 data type, except the division operator contained in the quantization unit.
7. A layernorm operator splitting and quantization method, the method comprising:
inputting the input tensor into a mean value calculation unit, calculating a mean value corresponding to the input tensor through the mean value calculation unit, and determining a candidate tensor based on the input tensor and the mean value;
inputting the candidate tensor into a splitting unit, and splitting the candidate tensor into a preset number of sub-tensors through the splitting unit, wherein each sub-tensor corresponds to a data category, and the data categories corresponding to the sub-tensors are different;
inputting each sub-tensor into a respective quantization unit, and quantizing each sub-tensor by its quantization unit to obtain quantized sub-tensors, so that tokens of different data categories in the input tensor are processed in parallel by different quantization units, which avoids the filtering of the data values of some tokens caused by the large difference in data value distribution between tokens of different data categories, ensures the precision of the quantized layernorm operator, and further improves the model precision of a network model carrying the layernorm operator deployed on an embedded device while maintaining execution efficiency, wherein the quantization units correspond one-to-one to the sub-tensors;
and inputting the quantized sub-tensors into a splicing unit, splicing the quantized sub-tensors through the splicing unit, and determining a batch normalization tensor corresponding to the input tensor based on the spliced tensor obtained by splicing.
8. The layernorm operator splitting and quantization method of claim 7, wherein the data categories include a classification category and an image category.
9. A computer-readable storage medium storing one or more programs executable by one or more processors to implement the steps in the layernorm operator splitting and quantization method of claim 7 or 8.
10. An embedded device, comprising the layernorm operator splitting and quantization apparatus as claimed in any of claims 1-6.
CN202211729854.4A 2022-12-30 2022-12-30 Device and method for splitting and quantizing layernorm operator Active CN115879504B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211729854.4A CN115879504B (en) 2022-12-30 2022-12-30 Device and method for splitting and quantizing layernorm operator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211729854.4A CN115879504B (en) 2022-12-30 2022-12-30 Device and method for splitting and quantizing layernorm operator

Publications (2)

Publication Number Publication Date
CN115879504A CN115879504A (en) 2023-03-31
CN115879504B true CN115879504B (en) 2023-08-29

Family

ID=85757641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211729854.4A Active CN115879504B (en) 2022-12-30 2022-12-30 Device and method for splitting and quantizing layernorm operator

Country Status (1)

Country Link
CN (1) CN115879504B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110728298A (en) * 2019-09-05 2020-01-24 北京三快在线科技有限公司 Multi-task classification model training method, multi-task classification method and device
CN111562977A (en) * 2019-02-14 2020-08-21 上海寒武纪信息科技有限公司 Neural network model splitting method, device, storage medium and computer system
CN114595799A (en) * 2020-11-30 2022-06-07 华为技术有限公司 Model training method and device
CN114936619A (en) * 2022-06-21 2022-08-23 上海西井信息科技有限公司 Model quantization method, device, equipment and storage medium
CN115470900A (en) * 2022-09-02 2022-12-13 深圳市欧冶半导体有限公司 Pruning method, device and equipment of neural network model
CN115481718A (en) * 2022-09-15 2022-12-16 北京航空航天大学 Deep learning graph-calculation integrated optimizer based on simplified computation subset
CN115481729A (en) * 2022-09-20 2022-12-16 鹏城实验室 Hybrid operator model parallel training method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060153283A1 (en) * 2005-01-13 2006-07-13 Scharf Louis L Interference cancellation in adjoint operators for communication receivers
US20220391676A1 (en) * 2021-06-04 2022-12-08 Black Sesame International Holding Limited Quantization evaluator


Also Published As

Publication number Publication date
CN115879504A (en) 2023-03-31


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant