CN115879504A - Device and method for splitting and quantizing layernorm operator - Google Patents

Device and method for splitting and quantizing layernorm operator

Info

Publication number
CN115879504A
Authority
CN
China
Prior art keywords
operator
splitting
tensor
unit
quantizing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211729854.4A
Other languages
Chinese (zh)
Other versions
CN115879504B (en)
Inventor
郝鑫
吴晗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Ouye Semiconductor Co ltd
Original Assignee
Zhuhai Ouye Semiconductor Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Ouye Semiconductor Co ltd filed Critical Zhuhai Ouye Semiconductor Co ltd
Priority to CN202211729854.4A priority Critical patent/CN115879504B/en
Publication of CN115879504A publication Critical patent/CN115879504A/en
Application granted granted Critical
Publication of CN115879504B publication Critical patent/CN115879504B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The application discloses a device and a method for splitting and quantizing a layernorm operator. The device comprises a mean value calculation unit, a splitting unit, a preset number of parallel quantization units, and a splicing unit; the mean value calculation unit is connected with the splitting unit, the splitting unit is connected with each quantization unit, and each quantization unit is connected with the splicing unit, where the preset number equals the number of data categories contained in the input tensor. The splitting unit splits the input tensor into the preset number of sub-tensors, and each quantization unit then processes one sub-tensor in parallel. This avoids the problem that the data values of some tokens are filtered out because the data value distributions of tokens of different data categories differ greatly, preserves the precision of the quantized layernorm operator, and improves the model precision of a network model carrying the layernorm operator deployed on an embedded device while maintaining execution efficiency.

Description

Device and method for splitting and quantizing layernorm operator
Technical Field
The application relates to the technical field of deep learning, and in particular to a device and a method for splitting and quantizing a layernorm operator.
Background
Vision Transformer (ViT) networks are widely applied to visual classification. ViT introduces the Transformer structure and uses its self-attention mechanism to avoid the sequential computation problem of RNNs (or LSTM, GRU, etc.), thereby achieving performance and accuracy superior to those of CNNs.
When a ViT network is applied to an embedded device, the ViT deployed in the embedded device is quantized (e.g., float32 is quantized to int8) due to the power consumption limitations of the embedded device. However, since embedded devices generally do not support int8 division, the layernorm operator needs to be split into several small operators and then quantized. When the layernorm operator is split into a plurality of small operators for quantization, the quantized model commonly suffers from low model precision.
Thus, the prior art has yet to be improved and enhanced.
Disclosure of Invention
The technical problem to be solved by the present application is to provide a device and a method for splitting and quantizing a layernorm operator, aiming at the above defects of the prior art.
In order to solve the foregoing technical problem, a first aspect of the embodiments of the present application provides a device for splitting and quantizing a layernorm operator, where the device includes: a mean value calculation unit, a splitting unit, a preset number of parallel quantization units, and a splicing unit; the mean value calculation unit is connected with the splitting unit, the splitting unit is connected with each quantization unit, and each quantization unit is connected with the splicing unit, where the preset number equals the number of data categories contained in the input tensor of the mean value calculation unit.
In the layernorm operator splitting and quantizing device, the mean value calculation unit is used for calculating a mean value corresponding to an input tensor and determining a candidate tensor based on the input tensor and the mean value; the splitting unit is used for splitting the candidate tensor into a preset number of sub-tensors based on the data categories contained in the input tensor; each quantization unit is used for quantizing one sub-tensor to obtain a quantized sub-tensor; and the splicing unit is used for splicing the quantized sub-tensors and determining a normalized tensor corresponding to the input tensor based on the spliced tensor.
In the layernorm operator splitting and quantizing device, the data categories include a classification category and an image category.
In the layernorm operator splitting and quantizing device, each quantization unit includes a square operator, a mean operator, an addition operator, a square-root operator, and a division operator, which are connected in sequence, and the division operator is connected with the splitting unit.
In the layernorm operator splitting and quantizing device, the splicing unit includes a splicing operator, a multiplication operator, and an addition operator which are connected in sequence, and the splicing operator is connected with each quantization unit.
In the layernorm operator splitting and quantizing device, all operators contained in the mean value calculation unit, the splitting unit, the parallel quantization units, and the splicing unit adopt the int8 data type, except for the division operator included in each quantization unit.
A second aspect of the embodiments of the present application provides a method for splitting and quantizing a layernorm operator, where the method includes:
inputting the input tensor into a mean value calculation unit, calculating a mean value corresponding to the input tensor through the mean value calculation unit, and determining a candidate tensor based on the input tensor and the mean value;
inputting the candidate tensor into a splitting unit, and splitting the candidate tensor into a preset number of sub-tensors through the splitting unit, wherein each sub-tensor corresponds to one data category, and the data categories corresponding to the sub-tensors are different from each other;
respectively inputting each sub-tensor into its quantization unit, and quantizing each sub-tensor through the quantization unit to obtain quantized sub-tensors, wherein the quantization units correspond to the sub-tensors one to one;
inputting the quantized sub-tensors into a splicing unit, splicing the quantized sub-tensors through the splicing unit, and determining a normalized tensor corresponding to the input tensor based on the spliced tensor.
In the layernorm operator splitting and quantizing method, the data categories include a classification category and an image category.
A third aspect of the embodiments of the present application provides a computer-readable storage medium storing one or more programs, which are executable by one or more processors to implement the steps in the layernorm operator splitting and quantizing method described in any one of the above.
A fourth aspect of the embodiments of the present application provides an embedded device, where the embedded device includes the layernorm operator splitting and quantizing device described above.
Beneficial effects: compared with the prior art, the present application provides a device and a method for splitting and quantizing a layernorm operator, where the device includes: a mean value calculation unit, a splitting unit, a preset number of parallel quantization units, and a splicing unit; the mean value calculation unit is connected with the splitting unit, the splitting unit is connected with each quantization unit, and each quantization unit is connected with the splicing unit, where the preset number equals the number of data categories contained in the input tensor of the mean value calculation unit. The splitting unit splits the input tensor to be processed by the layernorm operator into the preset number of sub-tensors, and the preset number of parallel quantization units then process the sub-tensors, so that tokens of different data categories in the input tensor are processed in parallel by different quantization units. This avoids the problem that the data values of some tokens are filtered out because the data value distributions of tokens of different data categories differ greatly, ensures the precision of the quantized layernorm operator, and improves the model precision of a network model carrying the layernorm operator deployed on an embedded device while guaranteeing execution efficiency.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required for the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art may obtain other drawings from them without inventive work.
Fig. 1 is a schematic structural diagram of the layernorm operator splitting and quantizing device provided in the present application.
Fig. 2 is an example diagram of the layernorm operator splitting and quantizing device provided in the present application.
Fig. 3 is a flowchart of the method for splitting and quantizing a layernorm operator provided in the present application.
Detailed Description
In order to make the purpose, technical solution, and effect of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the application and are not intended to limit it.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any combination of one or more of the associated listed items.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
It should be understood that the sequence numbers of the steps in this embodiment do not imply an execution order; the execution order of each process is determined by its function and inherent logic and does not constitute any limitation on the implementation of this embodiment.
Research shows that Vision Transformer (ViT) networks are widely applied to visual classification. ViT introduces the Transformer structure and uses its self-attention mechanism to avoid the sequential computation problem of RNNs (or LSTM, GRU, etc.), thereby achieving performance and accuracy superior to those of CNNs.
When a ViT network is applied to an embedded device, the ViT deployed in the embedded device is quantized (e.g., float32 is quantized to int8) due to the power consumption limitations of the embedded device. However, since embedded devices generally do not support int8 division, the layernorm operator needs to be split into several small operators and then quantized. When the layernorm operator is split into a plurality of small operators for quantization, the quantized model commonly suffers from low model precision.
In order to solve the above problem, an apparatus for splitting and quantizing a layernorm operator is provided in this embodiment of the present application, where the apparatus includes: a mean value calculation unit, a splitting unit, a preset number of parallel quantization units, and a splicing unit; the mean value calculation unit is connected with the splitting unit, the splitting unit is connected with each quantization unit, and each quantization unit is connected with the splicing unit, where the preset number equals the number of data categories contained in the input tensor of the mean value calculation unit. The splitting unit splits the input tensor to be processed by the layernorm operator into the preset number of sub-tensors, and the preset number of parallel quantization units then process the sub-tensors, so that tokens of different data categories in the input tensor are processed in parallel by different quantization units. This avoids the problem that the data values of some tokens are filtered out because the data value distributions of tokens of different data categories differ greatly, ensures the precision of the quantized layernorm operator, and improves the model precision of a network model carrying the layernorm operator deployed on an embedded device while ensuring execution efficiency.
The following further describes the content of the application by describing the embodiments with reference to the attached drawings.
This embodiment provides a splitting and quantizing device for a layernorm operator, where the device is used for splitting and quantizing the layernorm operator. The calculation formula of the layernorm operator is as follows:
$$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta$$
where x represents the input tensor, y represents the normalized tensor, E[x] represents the mean, Var[x] represents the variance, ε represents a preset fixed value, and γ and β represent the coefficients of the affine transformation.
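For reference, a minimal numpy sketch of the un-split formula above (normalizing over the last axis, as in a ViT layernorm; the function name and shapes are illustrative assumptions, not from the patent):

```python
import numpy as np

def layernorm_reference(x, gamma, beta, eps=1e-5):
    """Plain float layernorm over the last axis:
    y = (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta."""
    mean = x.mean(axis=-1, keepdims=True)                  # E[x]
    var = ((x - mean) ** 2).mean(axis=-1, keepdims=True)   # Var[x]
    return (x - mean) / np.sqrt(var + eps) * gamma + beta
```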
Based on the calculation formula of the layernorm operator, as shown in Fig. 1, the layernorm operator splitting and quantizing device in this embodiment may include a mean value calculation unit 100, a splitting unit 200, a preset number of parallel quantization units 300, and a splicing unit 400, where the mean value calculation unit 100 is connected to the splitting unit 200, the splitting unit 200 is connected to each quantization unit 300, and each quantization unit 300 is connected to the splicing unit 400. The preset number equals the number of data categories contained in the input tensor on which the layernorm operator performs normalization; that is, the number of parallel quantization units is the same as the number of data categories contained in the input tensor, and each quantization unit is used for quantizing the tokens corresponding to one data category. In this way, tokens of different data categories in the input tensor are processed in parallel by different quantization units, which avoids the problem that the data values of some tokens are filtered out because the data value distributions of tokens of different data categories differ greatly, ensures the precision of the quantized layernorm operator, and improves the model precision of a network model carrying the layernorm operator deployed on the embedded device.
The mean value calculation unit is used for calculating a mean value corresponding to an input tensor and determining a candidate tensor based on the input tensor and the mean value. The input tensor is the input of the mean value calculation unit, and the candidate tensor is its output, where the candidate tensor equals the difference between the input tensor and the mean value tensor corresponding to the input tensor. In one implementation, the mean value calculation unit is used to calculate x - E[x] in the calculation formula of the layernorm operator. As shown in Fig. 2, the mean value calculation unit may include a mean operator ReduceMean and a subtraction operator Sub, where the mean operator ReduceMean is connected with the subtraction operator Sub; the subtraction operator Sub is connected with the operator that outputs the input tensor and with the splitting unit, subtracts the output of the mean operator ReduceMean from the input tensor to obtain the candidate tensor, and inputs the candidate tensor into the splitting unit.
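A minimal sketch of this mean value calculation unit (ReduceMean followed by Sub), assuming the input tensor has shape [batch, tokens, channels] and the mean is taken over the channel axis; names are illustrative:

```python
import numpy as np

def mean_calculation_unit(x):
    """Mean value calculation unit: ReduceMean then Sub.
    Returns the candidate tensor x - E[x]."""
    mean = x.mean(axis=-1, keepdims=True)  # ReduceMean over the channel axis
    return x - mean                        # Sub: candidate tensor fed to the splitting unit
```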
Furthermore, since the candidate tensor is obtained by subtracting the output of the mean operator from the input tensor, the mean value calculation unit does not change the number of tokens or the data category of each token in the input tensor; that is, the number of tokens and the data category of each token contained in the candidate tensor are the same as those contained in the input tensor. Thus, the subsequent splitting unit can split the candidate tensor according to the data categories contained in the input tensor.
The splitting unit is configured to split the candidate tensor into a preset number of sub-tensors based on the data categories contained in the input tensor, where the tokens contained in each of the preset number of sub-tensors have the same data category, and the data categories of tokens in different sub-tensors are different from each other. For example, if the input tensor contains 50 tokens, where one token has the classification category and 49 tokens have the image category, then the splitting unit splits the candidate tensor into two sub-tensors, denoted as sub-tensor A and sub-tensor B, where sub-tensor A contains the one token of the classification category and sub-tensor B contains the 49 tokens of the image category. Further, in one implementation, as shown in Fig. 2, the splitting unit may employ a Split operator, by which the candidate tensor is split into the two sub-tensors.
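A sketch of such a Split operator, assuming the token layout described later in this embodiment (the classification token at index 0, followed by the 49 image-patch tokens):

```python
import numpy as np

def split_unit(candidate):
    """Split operator: separate tokens by data category along the token axis.
    candidate is assumed to have shape [batch, 50, channels]."""
    sub_a = candidate[:, :1, :]   # sub-tensor A: the single classification-category token
    sub_b = candidate[:, 1:, :]   # sub-tensor B: the 49 image-category tokens
    return sub_a, sub_b
```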
The splitting in this embodiment is based on the data categories contained in the input tensor, so as to avoid the problem that the data values of some tokens are filtered out because the data value distributions of tokens of different data categories differ greatly, and to ensure the precision of the quantized layernorm operator. Research found that the accuracy of the ViT model drops severely after the layernorm operator in the ViT model is split and quantized. For this reason, a layer-by-layer precision analysis was performed on the models before and after quantization using cosine similarity (that is, the cosine similarity of the output tensor of each layer was compared). It was found that after the division operator, the cosine similarity of the tensors before and after quantization differs significantly, while before the division operator the cosine similarities are basically the same. However, the calculation of the division operator itself is independent of quantization, so the tensor of dimension [1,50,1] input to the division operator was analyzed. According to its distribution, the first value and the other 49 values in [1,50,1] are not on the same order of magnitude, so that after quantization the first value is quantized to 0, which affects the accuracy of the quantized model.
Further, study of the [1,50,1] tensor shows that the 50 in [1,50,1] corresponds, in the ViT model, to the number of tokens of the Transformer network (in semantic analysis models, tokens are words or phrases; in image classification, the image is actually cut into a sequence of non-overlapping patches). The first of the 50 tokens is the classification token, and the following 49 tokens are the tokens of the image patches; that is, the data category of the first token differs from that of the following 49 tokens, so the distribution of the first token differs greatly from that of the following 49 tokens. The threshold selected in the quantization process causes the first token to be filtered out, because it differs significantly from the distribution of the following 49 tokens. Therefore, in this embodiment, after the candidate tensor is obtained, the splitting unit splits the candidate tensor into a preset number of sub-tensors, and the parallel quantization units then quantize the sub-tensors, which avoids the large differences in numerical distribution caused by different data categories and thus improves the model accuracy of the quantized model.
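The failure mode described above can be illustrated with a small, made-up numerical example (the values and the symmetric max-scale quantizer are assumptions for illustration, not the patent's calibration method): with one int8 scale shared by all 50 values, the much smaller classification-token value rounds to 0, whereas quantizing the two categories separately preserves it.

```python
import numpy as np

def quantize_int8(t):
    """Symmetric per-tensor int8 quantization with a max-magnitude scale (illustrative)."""
    scale = np.abs(t).max() / 127.0
    return np.round(t / scale).astype(np.int8), scale

# Made-up [1, 50, 1]-like values: the first (classification) token is far smaller
# than the 49 image-patch tokens.
values = np.concatenate([[0.003], np.random.uniform(5.0, 20.0, size=49)]).astype(np.float32)

q_all, s_all = quantize_int8(values)      # one scale shared by all 50 tokens
q_cls, s_cls = quantize_int8(values[:1])  # separate scale for the classification token

print(int(q_all[0]))       # 0: the classification token is filtered out
print(q_cls[0] * s_cls)    # ~0.003: preserved when quantized in its own group
```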
In one implementation, the layernorm operator splitting and quantizing device is applied to the ViT network; the data categories contained in the input tensor therefore include a classification category and an image category, where the data category of the first token in the input tensor is the classification category and the data categories of the remaining 49 tokens are the image category. Thus, when splitting the candidate tensor based on the data category, the split may be performed directly based on the position of each token in the candidate tensor: the first token is split into one sub-tensor, and the following 49 tokens are split into another sub-tensor.
The network structures of the preset number of quantization units are all the same; the quantization units correspond one to one with the preset number of sub-tensors, and each quantization unit is configured to quantize its corresponding sub-tensor. In one implementation, the preset number of quantization units are used to calculate the term

$$\frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}}$$

of the calculation formula of the layernorm operator.
Correspondingly, as shown in Fig. 2, each quantization unit includes a square operator Pow, a mean operator ReduceMean, an addition operator Add, a square-root operator Sqrt, and a division operator Div, which are connected in sequence, and the division operator Div is also connected to the splitting unit.
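A sketch of the operator chain inside one quantization unit, written in plain float for readability (the per-operator int8 quantization itself is not shown; variable names are illustrative):

```python
import numpy as np

def quantization_unit(sub, eps=1e-5):
    """One quantization unit: Pow -> ReduceMean -> Add -> Sqrt -> Div.
    sub is one sub-tensor of the candidate tensor (x - E[x])."""
    sq = sub ** 2                            # Pow: element-wise square
    var = sq.mean(axis=-1, keepdims=True)    # ReduceMean: per-token variance
    denom = np.sqrt(var + eps)               # Add + Sqrt: sqrt(Var[x] + eps)
    return sub / denom                       # Div: also fed the sub-tensor from the splitting unit
```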
The splicing unit is used for splicing the quantized sub-tensors and determining the normalized tensor corresponding to the input tensor based on the spliced tensor; that is, the splicing unit splices the quantized sub-tensor corresponding to each sub-tensor to obtain the spliced tensor, and then performs an affine transformation on the spliced tensor to obtain the normalized tensor. Based on this, in one implementation, as shown in Fig. 2, the splicing unit includes a splicing operator Concat, a multiplication operator Mul, and an addition operator Add, which are connected in sequence, where the splicing operator Concat is connected with each quantization unit.
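A corresponding sketch of the splicing unit (Concat, then Mul, then Add), re-assembling the processed sub-tensors and applying the affine coefficients gamma and beta; the function signature is an illustrative assumption:

```python
import numpy as np

def splicing_unit(sub_a_norm, sub_b_norm, gamma, beta):
    """Splicing unit: Concat the processed sub-tensors along the token axis,
    then apply the affine transform (Mul by gamma, Add beta)."""
    spliced = np.concatenate([sub_a_norm, sub_b_norm], axis=1)  # Concat
    return spliced * gamma + beta                               # Mul + Add
```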
Furthermore, because the division operator needs to use float data, quantizing the layernorm operator with the layernorm operator splitting and quantizing device provided in this embodiment allows all operators of the layernorm operator except the division operator to be quantized to int8. That is, after the layernorm operator is quantized with the device provided in this embodiment, all operators in the ViT network except the division operator can adopt the int8 type, which not only ensures the model accuracy of the quantized ViT network but also allows the quantized ViT network to be deployed in the embedded device, thereby improving the inference speed and inference accuracy of the model on the embedded device.
To sum up, this embodiment provides a device for splitting and quantizing a layernorm operator, where the device includes: a mean value calculation unit, a splitting unit, a preset number of parallel quantization units, and a splicing unit; the mean value calculation unit is connected with the splitting unit, the splitting unit is connected with each quantization unit, and each quantization unit is connected with the splicing unit, where the preset number equals the number of data categories contained in the input tensor of the mean value calculation unit. The input tensor to be processed by the layernorm operator is split into the preset number of sub-tensors by the splitting unit, and the sub-tensors are then processed by the preset number of parallel quantization units, so that tokens of different data categories in the input tensor are processed in parallel by different quantization units. This avoids the problem that the data values of some tokens are filtered out because the data value distributions of tokens of different data categories differ greatly, ensures the precision of the quantized layernorm operator, and improves the model precision of a network model carrying the layernorm operator deployed on the embedded device.
Based on the above device for splitting and quantizing a layernorm operator, this embodiment provides a method for splitting and quantizing a layernorm operator. As shown in Fig. 3, the method includes:
s10, inputting the input tensor into a mean value calculation unit, calculating a mean value corresponding to the input tensor through the mean value calculation unit, and determining a candidate tensor based on the input tensor and the mean value;
s20, inputting the candidate tensor into a splitting unit, and splitting the candidate tensor into a preset number of sub-tensors through the splitting unit, wherein each sub-tensor corresponds to a data type, and the data types corresponding to the sub-tensors are different from each other;
s30, respectively inputting the sub-tensors into the quantization units, and quantizing the sub-tensors through the quantization units to obtain quantized sub-tensors, wherein the quantization units correspond to the sub-tensors one by one;
s40, inputting the plurality of quantized sub tensors into a splicing unit, splicing the quantized sub tensors through the splicing unit, and determining batch normalization tensors corresponding to the input tensors based on the spliced tensors obtained through splicing.
In one implementation, the data categories include a classification category and an image category.
Based on the above method for splitting and quantizing a layernorm operator, this embodiment provides a computer-readable storage medium storing one or more programs, which are executable by one or more processors to implement the steps in the method for splitting and quantizing a layernorm operator according to the above embodiment.
Based on the above layernorm operator splitting and quantizing device, the present application further provides an embedded device, where the embedded device includes the above layernorm operator splitting and quantizing device, so as to perform layernorm operator splitting and quantization through the device.
In addition, the method is executed by the above device, and its execution process has been described in detail in the description of each unit module of the device, so it is not repeated here; likewise, the specific processes loaded and executed by the processors from the storage medium and in the terminal device have been described in detail in the method and are not repeated here.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A device for splitting and quantizing a layernorm operator, wherein the device comprises: a mean value calculation unit, a splitting unit, a preset number of parallel quantization units, and a splicing unit; the mean value calculation unit is connected with the splitting unit, the splitting unit is connected with each quantization unit, and each quantization unit is connected with the splicing unit, wherein the preset number equals the number of data categories contained in the input tensor of the mean value calculation unit.
2. The layernorm operator splitting and quantizing device according to claim 1, wherein the mean value calculation unit is configured to calculate a mean value corresponding to an input tensor and determine a candidate tensor based on the input tensor and the mean value; the splitting unit is configured to split the candidate tensor into a preset number of sub-tensors based on the data categories contained in the input tensor; each quantization unit is configured to quantize one sub-tensor to obtain a quantized sub-tensor; and the splicing unit is configured to splice the quantized sub-tensors and determine a normalized tensor corresponding to the input tensor based on the spliced tensor.
3. The layernorm operator splitting and quantizing device according to claim 1, wherein the data categories comprise a classification category and an image category.
4. The layernorm operator splitting and quantizing device according to claim 1, wherein each quantization unit includes a square operator, a mean operator, an addition operator, a square-root operator, and a division operator, which are connected in sequence, and the division operator is connected to the splitting unit.
5. The layernorm operator splitting and quantizing device according to claim 1, wherein the splicing unit comprises a splicing operator, a multiplication operator, and an addition operator which are connected in sequence, and the splicing operator is connected to each quantization unit.
6. The layernorm operator splitting and quantizing device according to any one of claims 1 to 5, wherein all operators contained in the mean value calculation unit, the splitting unit, the parallel quantization units, and the splicing unit adopt the int8 data type, except for the division operator included in each quantization unit.
7. A method for splitting and quantizing a layernorm operator, wherein the method comprises:
inputting the input tensor into a mean value calculation unit, calculating a mean value corresponding to the input tensor through the mean value calculation unit, and determining a candidate tensor based on the input tensor and the mean value;
inputting the candidate tensor into a splitting unit, and splitting the candidate tensor into a preset number of sub-tensors through the splitting unit, wherein each sub-tensor corresponds to one data category, and the data categories corresponding to the sub-tensors are different from each other;
respectively inputting the sub-tensors into the quantization units, and quantizing the sub-tensors through the quantization units to obtain quantized sub-tensors, wherein the quantization units correspond to the sub-tensors one to one;
inputting the quantized sub-tensors into a splicing unit, splicing the quantized sub-tensors through the splicing unit, and determining a normalized tensor corresponding to the input tensor based on the spliced tensor obtained through splicing.
8. The layernorm operator splitting and quantizing method according to claim 7, wherein the data categories comprise a classification category and an image category.
9. A computer-readable storage medium storing one or more programs, wherein the one or more programs are executable by one or more processors to perform the steps in the layernorm operator splitting and quantizing method according to claim 7 or 8.
10. An embedded device, wherein the embedded device comprises the layernorm operator splitting and quantizing device according to any one of claims 1 to 6.
CN202211729854.4A 2022-12-30 2022-12-30 Device and method for splitting and quantizing layernorm operator Active CN115879504B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211729854.4A CN115879504B (en) 2022-12-30 2022-12-30 Device and method for splitting and quantizing layernorm operator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211729854.4A CN115879504B (en) 2022-12-30 2022-12-30 Device and method for splitting and quantizing layernorm operator

Publications (2)

Publication Number Publication Date
CN115879504A true CN115879504A (en) 2023-03-31
CN115879504B CN115879504B (en) 2023-08-29

Family

ID=85757641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211729854.4A Active CN115879504B (en) 2022-12-30 2022-12-30 Device and method for splitting and quantizing layernorm operator

Country Status (1)

Country Link
CN (1) CN115879504B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060153283A1 (en) * 2005-01-13 2006-07-13 Scharf Louis L Interference cancellation in adjoint operators for communication receivers
CN110728298A (en) * 2019-09-05 2020-01-24 北京三快在线科技有限公司 Multi-task classification model training method, multi-task classification method and device
CN111562977A (en) * 2019-02-14 2020-08-21 上海寒武纪信息科技有限公司 Neural network model splitting method, device, storage medium and computer system
CN114595799A (en) * 2020-11-30 2022-06-07 华为技术有限公司 Model training method and device
CN114936619A (en) * 2022-06-21 2022-08-23 上海西井信息科技有限公司 Model quantization method, device, equipment and storage medium
US20220391676A1 (en) * 2021-06-04 2022-12-08 Black Sesame International Holding Limited Quantization evaluator
CN115470900A (en) * 2022-09-02 2022-12-13 深圳市欧冶半导体有限公司 Pruning method, device and equipment of neural network model
CN115481729A (en) * 2022-09-20 2022-12-16 鹏城实验室 Hybrid operator model parallel training method, device, equipment and storage medium
CN115481718A (en) * 2022-09-15 2022-12-16 北京航空航天大学 Deep learning graph-calculation integrated optimizer based on simplified computation subset

Also Published As

Publication number Publication date
CN115879504B (en) 2023-08-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant