CN114648101A - Transformer structure-based softmax function quantization realization method and device - Google Patents


Info

Publication number: CN114648101A (granted as CN114648101B)
Application number: CN202210517307.3A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: data, input data, row, mapping, value
Legal status: Active (granted)
Inventors: 徐祥, 杨敏, 杨作兴, 艾国
Original and current assignee: Hangzhou Yanji Microelectronics Co ltd

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/047: Probabilistic or stochastic networks


Abstract

The invention provides a transformer structure-based softmax function quantization method and device, applied to the softmax module in a transformer structure, comprising the following steps: performing data adjustment and truncation on each row of input data in the input matrix transmitted from the MatMul module to the softmax module, according to the input data bit width of a global shared index mapping table, to obtain the truncated data of that row; searching the global shared index mapping table to determine the exponential mapping data corresponding to the truncated data of each row of input data; adding the exponential mapping data corresponding to the truncated data of each row to obtain the sum of the exponential mapping data for that row; determining the reciprocal mapping value of that sum according to a global shared reciprocal mapping function; and performing multiplication and shift operations on the exponential mapping data corresponding to the truncated data of each row and the reciprocal mapping value of the corresponding sum, to obtain the final result for that row.

Description

Transformer structure-based softmax function quantization realization method and device
Technical Field
The invention relates to the technical field of neural networks, in particular to a transformer structure-based softmax function quantization implementation method and device.
Background
The transformer translation model, built on the self-attention mechanism, does not use CNN or RNN modules; instead, it constructs its encoder-decoder around the attention mechanism to perform the translation operation.
Referring to fig. 1, fig. 1 is a schematic diagram of a prior art transformer structure. As shown in fig. 1, the transformer structure sequentially includes: a MatMul module that performs matrix multiplication on the input parameters Q (query) and K (key) of the transformer structure, a Scale module that scales the output data of the MatMul module, a Mask module that masks part of the output data of the Scale module, a softmax module that applies the softmax operation to the output data of the Mask module, and a MatMul module that multiplies the output data of the softmax module by the input parameter V (value) of the transformer structure. The operation of the softmax module in the transformer structure is thus preceded by the operation of the MatMul module, and is essentially performed row by row on the matrix output by the MatMul module.
However, because the data distribution is unbalanced between rows of the input matrix, the actual scales (the difference between the maximum and minimum values within a row) can differ considerably between rows, and quantization error can make two different rows yield identical quantized results. For example, [41.45, 42.15, 42.84, 43.89] and [40.81, 41.93, 43.36, 44.27] both quantize to [41, 42, 43, 44] (scale = 1), so the quantization accuracy is low.
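The collision described above is easy to reproduce. The sketch below is a minimal Python illustration (the `quantize` helper is hypothetical, not from the patent), showing both example rows rounding to the same integers at scale = 1:

```python
def quantize(row, scale=1):
    """Uniform quantization: round each value to the nearest multiple of `scale`."""
    return [round(x / scale) for x in row]

# Two rows with clearly different underlying data...
row_a = [41.45, 42.15, 42.84, 43.89]
row_b = [40.81, 41.93, 43.36, 44.27]

# ...collapse onto the same quantized row when scale = 1.
print(quantize(row_a))  # [41, 42, 43, 44]
print(quantize(row_b))  # [41, 42, 43, 44]
```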
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for implementing softmax function quantization based on a transformer structure, which are simple to compute and can effectively improve quantization precision.
In order to achieve the purpose, the invention provides the following technical scheme:
a softmax function quantization realization method based on a transform structure is applied to a softmax module in the transform structure and comprises the following steps:
acquiring an input matrix transmitted by a MatMul module in a transform structure;
performing data adjustment on each row of input data in an input matrix transmitted from the MatMul module to the softmax module according to the bit width of the input data of the global shared index mapping table configured for the transform structure in advance to obtain adjustment data of the row of input data;
cutting off the adjustment data of each row of input data to obtain cut-off data of the row of input data;
searching the global shared index mapping table to determine index mapping data corresponding to truncated data of each row of input data;
adding the exponential mapping data corresponding to the truncation data of each row of input data to obtain the sum of the exponential mapping data corresponding to the row of input data;
determining a reciprocal mapping value of the sum of exponential mapping data corresponding to each row of input data according to a global shared reciprocal mapping function configured for a transformer structure in advance;
performing multiplication and shift operation on the exponential mapping data corresponding to the truncation data of each row of input data and the reciprocal mapping value of the sum of the exponential mapping data corresponding to the row of input data to obtain a final result corresponding to the row of input data;
and outputting the final result to a subsequent neural network layer of the neural network layer to which the softmax module belongs.
A transformer structure-based softmax function quantization implementation device is applied to a softmax module in a transformer structure, and comprises the following components:
the acquisition unit is used for acquiring an input matrix transmitted by a MatMul module in a transformer structure;
the adjusting unit is used for performing data adjustment on each row of input data in an input matrix transmitted from the MatMul module to the softmax module according to the bit width of the input data of the global shared index mapping table configured for the transformer structure in advance to obtain the adjustment data of the row of input data; cutting off the adjustment data of each row of input data to obtain cut-off data of the row of input data;
the index unit is used for searching the global shared index mapping table to determine index mapping data corresponding to the truncated data of each row of input data;
the summing unit is used for summing the exponential mapping data corresponding to the truncation data of each row of input data to obtain the sum of the exponential mapping data corresponding to the row of input data;
the reciprocal unit is used for determining a reciprocal mapping value of the sum of exponential mapping data corresponding to each row of input data according to a global shared reciprocal mapping function configured for the transformer structure in advance;
the multiplication unit is used for carrying out multiplication and shift operation on the exponential mapping data corresponding to the truncation data of each row of input data and the reciprocal mapping value of the sum of the exponential mapping data corresponding to the row of input data to obtain a final result corresponding to the row of input data;
and the output unit is used for outputting the final result to a subsequent neural network layer of the neural network layer to which the softmax module belongs.
According to this technical scheme, after data adjustment and truncation are performed on each row of input data in the input matrix transmitted from the MatMul module to the softmax module in the transformer structure, the exponent and reciprocal operations are realized through the globally shared index mapping table and reciprocal mapping function configured in advance for the transformer structure. The calculation is simple, and the problem of low quantization accuracy caused by unbalanced data distribution between different rows of the input matrix can be effectively alleviated.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive labor.
FIG. 1 is a schematic diagram of a prior art transformer structure;
fig. 2 is a schematic diagram of the operation process of the softmax module in the transformer structure according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for implementing quantization of a softmax function based on a transformer structure according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for implementing quantization of a softmax function based on a transformer structure according to a second embodiment of the present invention;
FIG. 5 is an exploded view of the input data of the softmax module according to an embodiment of the invention;
FIG. 6 is a diagram of a prior art exp exponential function image;
FIG. 7 is a flowchart of a method for implementing quantization of a softmax function based on a transformer structure according to a third embodiment of the present invention;
FIG. 8 is a schematic diagram of a pipeline mode according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a device for implementing softmax function quantization based on a transformer structure according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not explicitly listed or inherent to such process, method, article, or apparatus.
The technical solution of the present invention will be described in detail with specific examples. Several of the following embodiments may be combined with each other and some details of the same or similar concepts or processes may not be repeated in some embodiments.
In the invention, a globally shared index mapping table and reciprocal mapping function are configured in advance for the transformer structure, so that the softmax module in the transformer structure can realize the exponent and reciprocal operations through table lookups. The calculation is simple, which reduces hardware resource consumption and power consumption. Moreover, the problem of low quantization precision caused by inconsistent data distribution between different rows of the input matrix transmitted from the MatMul module to the softmax module is resolved, effectively improving quantization precision.
Fig. 2 is a schematic diagram of the operation process of the softmax module in the transformer structure according to an embodiment of the present invention. As shown in fig. 2, assuming that each row of the input matrix transmitted from the MatMul module to the softmax module contains n + 1 input data, the (i+1)-th row of input data [X_i0, …, X_in] is processed as follows:
Step 1: determine the maximum value MAX of the row of input data [X_i0, …, X_in];
In this embodiment, the input data bit width of the softmax module, denoted i_bits in fig. 2, is generally 16 bits.
Step 2: subtract the row maximum MAX from each input datum in the row and add the offset (adjustment value) to obtain the adjustment data, then truncate the adjustment data to obtain the truncated data [X'_i0, …, X'_in] of the row.
Step 3: determine, according to the global shared index mapping table (GLUT-exp), the exponential mapping data [Q_i0, …, Q_in] corresponding to the truncated data [X'_i0, …, X'_in] of the row;
In this embodiment, the input data bit width t_bits and the output data bit width e_bits of the global shared index mapping table are indicated in fig. 2; t_bits is generally 8 bits, and e_bits may be appropriately increased as needed.
Step 4: add the exponential mapping data [Q_i0, …, Q_in] corresponding to the truncated data of the row to obtain their sum, sum = Q_i0 + Q_i1 + … + Q_in.
Step 5: determine the reciprocal mapping value corresponding to the row of input data according to the global shared reciprocal mapping function (GLUT-1/Q);
In this embodiment, the output data bit width of the global shared reciprocal mapping function, r_bits, is indicated in fig. 2; it is generally 16 bits and may be appropriately adjusted as needed.
Step 6: multiply the exponential mapping data [Q_i0, …, Q_in] corresponding to the truncated data of the row by the reciprocal mapping value of the sum of the row's exponential mapping data, and shift the products to obtain the final result [Y_i0, …, Y_in] for the row.
In this embodiment, the output data bit width of the softmax module, denoted o_bits in fig. 2, generally takes a value of 8 bits; it can be increased to 16 bits, at the cost of extra bandwidth consumed by the whole network.
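As a rough illustration, steps 1 through 6 can be sketched in Python. The bit widths (t_bits = 8, e_bits = 16, r_bits = 16, o_bits = 8), the fractional-bit count k = 5, and especially the shift amount in step 6 are assumptions made for this sketch; the patent text here does not pin down the final shift, so the last line is a plausible reading rather than the claimed implementation.

```python
import math

T_BITS, E_BITS, R_BITS, O_BITS = 8, 16, 16, 8   # assumed typical bit widths
K = 5                                            # assumed fractional bits
S = 1.0 / (1 << K)                               # minimum representable scale
OFFSET = (1 << T_BITS) - 1                       # adjustment value, 255

# Step 3's globally shared exp table: u in [0, OFFSET].
EXP_LUT = [round(math.exp((u - OFFSET) * S) * ((1 << E_BITS) - 1))
           for u in range(OFFSET + 1)]

# Step 5's reciprocal table over the preferred interval [256, 512).
RECIP_LUT = [round((1 << R_BITS) / x) for x in range(256, 512)]

def bits(val):
    """Significant bits, bits(val) = floor(log2(val)) + 1."""
    return val.bit_length()

def softmax_row(row):
    mx = max(row)                                    # step 1
    trunc = [max(0, x - mx + OFFSET) for x in row]   # step 2: adjust, clip at 0
    q = [EXP_LUT[t] for t in trunc]                  # step 3: exp via lookup
    total = sum(q)                                   # step 4
    shift = bits(total) - 9                          # map total into [256, 512)
    r = RECIP_LUT[(total >> shift) - 256]            # step 5: reciprocal lookup
    # Step 6: multiply, then shift back down to O_BITS (shift is an assumption).
    return [min((1 << O_BITS) - 1, (qi * r) >> (R_BITS + shift - O_BITS))
            for qi in q]

# Fixed-point inputs with K = 5 fractional bits: [0, 32, 64] represents [0, 1, 2].
print(softmax_row([0, 32, 64]))  # [23, 62, 170], close to softmax([0,1,2]) * 256
```

The outputs approximate softmax probabilities scaled by 2^o_bits, which is why the row above sums to roughly 256.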
The method supports an input data bit width of 16 bits and achieves high precision, so it can be adapted to a variety of scenarios, such as object classification, object detection, and the like.
The method for realizing the softmax function quantization based on the transform structure provided by the invention is described in detail with reference to specific embodiments as follows:
referring to fig. 3, fig. 3 is a flowchart of a method for implementing softmax function quantization based on a transform structure according to an embodiment of the present invention, where the method is applied to a softmax module in a transform structure, and as shown in fig. 3, the method mainly includes the following steps:
step 300, acquiring an input matrix transmitted by a MatMul module in a transform structure;
step 301, according to an input data bit width of a global shared index mapping table configured for a transformer structure in advance, performing data adjustment on each row of input data in an input matrix transmitted from a MatMul module to a softmax module to obtain adjustment data of the row of input data;
step 302, performing truncation processing on the adjustment data of each row of input data to obtain truncation data of the row of input data;
step 303, searching the global shared index mapping table to determine index mapping data corresponding to truncated data of each row of input data;
step 304, adding the exponential mapping data corresponding to the truncated data of each row of input data to obtain the sum of the exponential mapping data corresponding to the row of input data;
step 305, determining a reciprocal mapping value of the sum of exponential mapping data corresponding to each row of input data according to a global shared reciprocal mapping function configured for a transformer structure in advance;
step 306, performing multiplication and shift operation on the exponent mapping data corresponding to the truncation data of each row of input data and the reciprocal mapping value of the sum of the exponent mapping data corresponding to the row of input data to obtain a final result corresponding to the row of input data;
and 307, outputting the final result to a subsequent neural network layer of the neural network layer to which the softmax module belongs.
As can be seen from the method shown in fig. 3, in this embodiment, after data adjustment and truncation are performed on each row of input data in the input matrix transmitted from the MatMul module to the softmax module in the transformer structure, the exponent and reciprocal operations are implemented through the globally shared index mapping table and reciprocal mapping function configured in advance for the transformer structure. This simplifies the calculation process, reduces hardware resource consumption and power consumption, resolves the problem of inconsistent data distribution and precision between different rows of the input matrix, and effectively improves quantization precision.
Referring to fig. 4, fig. 4 is a flowchart of a method for implementing softmax function quantization based on a transformer structure according to a second embodiment of the present invention. The method is applied to the softmax module in a transformer structure and, as shown in fig. 4, mainly includes the following steps:
step 400, acquiring the input matrix transmitted by the MatMul module in the transformer structure;
step 4011, for each row of input data in the input matrix transferred from the MatMul module to the softmax module, performs the following operation steps 4012 to 4013:
step 4012, determining a maximum value in the input data of the row, and determining an adjustment value according to an input data bit width of a global shared index mapping table configured for a transformer structure in advance;
in this embodiment, a specific value of the input data bit width t _ bits of the global shared index mapping table may be 8 bits.
In this embodiment, determining an adjustment offset according to the input data bit width t _ bits of the global shared index mapping table may specifically be implemented by using the following formula: offset = (1< < t _ bits) -1, for example, when t _ bits =8, offset =255 can be determined by calculation according to the above formula.
Step 4013, calculating a difference between the input data and the maximum value for each input data in the row of input data, and adjusting the difference by using the adjustment value to obtain adjustment data of the input data;
in this embodiment, the difference is adjusted by using the adjustment value, specifically, the difference is added to the adjustment value.
In this embodiment, for each input data X_ij (j = 0, 1, 2, …, n) in the (i+1)-th row of input data, the difference between the input data and the maximum value is calculated and adjusted by the adjustment value to obtain the adjustment data X'_ij of the input data, specifically using the formula: X'_ij = X_ij - MAX + offset.
In this embodiment, the adjustment data of all the input data in the row of input data constitutes the adjustment data corresponding to the row of input data.
The above steps 4011 to 4013 are a detailed refinement of step 301 shown in fig. 3.
Step 402, regarding each input data in each row of input data, if the adjustment data of the input data is less than 0, truncating the adjustment data of the input data to 0 to be used as truncation data of the input data, otherwise, using the adjustment data of the input data as truncation data of the input data;
in this embodiment, the input data bit width i _ bits of the softmax module may be 16 bits. The input data of 16bits can be decomposed into three parts, as shown in fig. 5, the first part is a sign bit and includes 1bit, the second part and the third part are respectively an integer bit and a decimal bit, and occupy 15bits in total, wherein the more bits occupied by the decimal bit, the higher the precision of corresponding representation is, but the representation capability of the integer is reduced.
In practical applications, the value of the exp exponential function decays quite quickly. As shown in fig. 6, exp(-7) is already small enough to be neglected, so it suffices to keep the difference between the maximum and minimum input values of the softmax module at about 7.
Therefore, in this embodiment, after the difference between each input datum and the maximum value is computed and adjusted by the adjustment value to obtain the adjustment data, truncation is applied to the adjustment data to obtain the truncated data, so that only 3 integer bits are retained in the truncated data and the difference between the maximum and minimum truncated values stays around 7. In addition, the number of fractional bits can be preset according to the precision requirement, for example 5, 6, or 7 bits.
In the present embodiment, each input data X_ij is adjusted and truncated to obtain X'_ij, whose value range is [0, offset]; taking t_bits = 8 as an example, X'_ij ranges over [0, 255].
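A minimal sketch of the adjustment and truncation of steps 4012 to 4013 (the helper name is hypothetical; t_bits = 8 is the typical value given above):

```python
T_BITS = 8
OFFSET = (1 << T_BITS) - 1   # adjustment value offset = (1 << t_bits) - 1 = 255

def adjust_and_truncate(row):
    """Compute X'_ij = X_ij - MAX + offset, then truncate negatives to 0,
    so every result lies in [0, offset]."""
    mx = max(row)
    return [max(0, x - mx + OFFSET) for x in row]

# Values far below the row maximum are clipped to 0; the maximum maps to 255.
print(adjust_and_truncate([400, 500, 556]))  # [99, 199, 255]
```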
The above step 402 is a detailed refinement of step 302 shown in fig. 3.
Step 403, searching the global shared index mapping table to determine index mapping data corresponding to truncated data of each row of input data;
In this embodiment, the global shared index mapping table may be generated according to a preset minimum scale S representable after fixed-point quantization, together with the table's input data bit width t_bits and output data bit width e_bits. The table contains 1 << t_bits entries and occupies (1 << t_bits) × e_bits / 8 bytes of storage, where "<<" is the left-shift operator. The value of S can be preset from the number of fractional bits k required by the precision target (k a positive integer): S = 1/2^k. For example, with 5 fractional bits, S = 1/32; with 6, S = 1/64; with 7, S = 1/128.
In this embodiment, a specific generation method of the global shared index mapping table may be as follows: for each integer value u in the value range [0, offset] of the truncated data X'_ij of the softmax module's input data X_ij, compute its corresponding exponential mapping value v with the formula
v = Round(exp((u - offset) × S) × ((1 << e_bits) - 1)),
where the function Round(x) rounds the value x to the nearest integer. The resulting u-to-v mapping relations together form the global shared index mapping table.
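Under the stated formula, the table generation might be sketched as follows (the helper name is hypothetical, and the default bit widths and k = 5 fractional bits are assumptions):

```python
import math

def build_exp_lut(t_bits=8, e_bits=16, k=5):
    """Tabulate v = Round(exp((u - offset) * S) * ((1 << e_bits) - 1)) for
    every u in [0, offset], with offset = (1 << t_bits) - 1 and S = 1 / 2**k."""
    offset = (1 << t_bits) - 1
    s = 1.0 / (1 << k)
    return [round(math.exp((u - offset) * s) * ((1 << e_bits) - 1))
            for u in range(offset + 1)]

lut = build_exp_lut()
# 1 << 8 = 256 entries, occupying (1 << 8) * 16 / 8 = 512 bytes at e_bits = 16.
print(len(lut), lut[-1])  # 256 65535
```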
Step 404, adding the exponential mapping data corresponding to the truncated data of each row of input data to obtain the sum of the exponential mapping data corresponding to the row of input data;
In this embodiment, suppose the (i+1)-th row of input data contains n + 1 input data, and let Q_ij denote the exponential mapping data corresponding to the adjusted (truncated) data X'_ij of the j-th input datum X_ij. Then the sum of the exponential mapping data for the row is sum = Q_i0 + Q_i1 + … + Q_in.
Step 405, determining a reciprocal mapping value of the sum of exponential mapping data corresponding to each row of input data according to the global shared reciprocal mapping function configured in advance for the transformer structure;
In this embodiment, the global shared reciprocal mapping function may be a reciprocal mapping table generated from a preset function, a linear function obtained by linearly fitting the preset function, or a polynomial function obtained by polynomially fitting the preset function. The preset function is f(x) = 2^r_bits / x, where x ranges over a preset value interval, preferably [256, 512); r_bits is the output data bit width of the reciprocal mapping function and usually takes 16 bits.
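A table-based realization of the preset function f(x) = 2^r_bits / x over the preferred interval [256, 512) might look like this (the helper name and default arguments are assumptions):

```python
def build_recip_lut(r_bits=16, lo=256, hi=512):
    """Tabulate the preset function f(x) = 2**r_bits / x over [lo, hi)."""
    return [round((1 << r_bits) / x) for x in range(lo, hi)]

recip = build_recip_lut()
# 256 entries; the first is round(65536/256) = 256, the last round(65536/511) = 128.
print(recip[0], recip[-1])  # 256 128
```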
Step 406, performing multiplication and shift operation on the exponent mapping data corresponding to the truncated data of each row of input data and the reciprocal mapping value of the sum of the exponent mapping data corresponding to the row of input data to obtain a final result corresponding to the row of input data;
and step 407, outputting the final result to the neural network layer following the one to which the softmax module belongs.
As can be seen from the method shown in fig. 4, in this embodiment, each input datum in each row of the input matrix transmitted from the MatMul module to the softmax module undergoes data adjustment and truncation to yield the row's truncated data; the globally shared index mapping table realizes the exponent operation, yielding the exponential mapping data corresponding to the truncated data; summation yields the sum of the row's exponential mapping data; the globally shared reciprocal mapping function yields the reciprocal mapping value of that sum; and multiplication and shifting yield the final result for the row. By using the globally shared index mapping table and reciprocal mapping function, the calculation process is simplified, hardware resource consumption and power consumption are reduced, the problem of inconsistent data distribution and precision between different rows of the input matrix is resolved, and quantization precision is effectively improved.
Referring to fig. 7, fig. 7 is a flowchart of a method for implementing softmax function quantization based on a transformer structure according to a third embodiment of the present invention. The method is applied to the softmax module in a transformer structure and, as shown in fig. 7, mainly includes the following steps:
step 700, acquiring the input matrix transmitted by the MatMul module in the transformer structure;
step 701, performing data adjustment on each row of input data in the input matrix transmitted from the MatMul module to the softmax module, according to the input data bit width of the global shared index mapping table configured in advance for the transformer structure, to obtain the adjustment data of that row;
in this embodiment, when data is transferred from the MatMul module to the softmax module, a pipeline mode may be adopted for implementation, fig. 8 shows a schematic diagram of the pipeline mode, as shown in fig. 8, a gray part represents operations of the MatMul module, and a white part represents operations of the softmax module, and in the pipeline mode, the MatMul module only transfers partial data of the input matrix to the softmax module at a time, instead of transferring the entire input matrix to the softmax module at one time. The implementation mode enables the MatMul module and the softmax module to be strongly coupled, data output by the MatMul module is not required to be transmitted to the DDR, but is directly transmitted to the softmax module, bandwidth occupation can be reduced, and the speed of the softmax module for acquiring the data is increased.
In this embodiment, when data is transferred from the MatMul module to the softmax module in pipeline mode, the softmax module does not compute the final result for the entire input matrix in one pass, but obtains it row by row.
Step 702, performing truncation processing on the adjustment data of each row of input data to obtain truncation data of the row of input data;
step 703, searching the global shared index mapping table to determine index mapping data corresponding to truncated data of each row of input data;
step 704, adding the exponential mapping data corresponding to the truncated data of each row of input data to obtain the sum of the exponential mapping data corresponding to the row of input data;
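The row-wise flow of steps 701 to 704 can be sketched in integer arithmetic as follows. The exact derivation of the adjust value is not given at this point in the text; following the description of the adjusting unit later in this document, the sketch assumes the adjust value is (1 << t_bits) - 1, so that x - max(row) lands in the table's input range [0, 2**t_bits); `exp_table` stands in for the global shared index mapping table (i.e., the exponent lookup table).

```python
def softmax_row_fixed_point(row, exp_table, t_bits):
    """Sketch of steps 701-704: adjust, truncate, look up, sum.

    The adjust value (1 << t_bits) - 1 is an assumption, chosen so that
    x - max(row) is shifted into the table's input range [0, 2**t_bits).
    """
    adjust = (1 << t_bits) - 1          # hypothetical adjust value
    row_max = max(row)                  # maximum of the row of input data
    truncated = []
    for x in row:
        adj = (x - row_max) + adjust    # step 701: data adjustment
        trunc = adj if adj >= 0 else 0  # step 702: truncate values below 0
        truncated.append(trunc)
    exp_mapped = [exp_table[t] for t in truncated]  # step 703: table lookup
    return exp_mapped, sum(exp_mapped)              # step 704: row sum
```

With a toy 16-entry table and t_bits = 4, a row such as [10, 12, 8] is shifted by 15 after subtracting its maximum 12, giving table indices 13, 15, and 11.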
Step 7051, performing data conversion on the sum of the exponential mapping data corresponding to each row of input data;
in this embodiment, the data conversion of the sum of the exponent mapping data corresponding to each row of input data may specifically include:
s11, determining a first significant bit number required to represent the sum of the index mapping values corresponding to the row of input data and a second significant bit number required to represent the left boundary value of the preset value interval, and determining the difference between the first significant bit number and the second significant bit number as the first right-shift number required to map the sum of the index mapping values corresponding to the row of input data into the preset value interval;
in this embodiment, a specific method for determining the valid bit number bits (bit) (val) required to be occupied by a value val may be calculated by using the following formula bits (val) = floor (log2(val)) +1, where the function floor (x) is used to round the value x in parentheses downward.
Taking as an example a sum of exponent mapping values of 1600 and a preset value interval of [256, 512), it can be known from the above formula that the first significant bit number required to represent the sum of the exponent mapping values is floor(log2(1600)) + 1 = 11, and the second significant bit number required to represent the left boundary value 256 of the preset value interval [256, 512) is floor(log2(256)) + 1 = 9, so the first right-shift number required to map the sum of the exponent mapping values into the preset value interval is 11 - 9 = 2.
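As a minimal sketch, the significant-bit calculation and the first right-shift number for the worked example above can be written as:

```python
import math

def bits(val):
    # number of significant bits needed to represent val:
    # bits(val) = floor(log2(val)) + 1
    return math.floor(math.log2(val)) + 1

def first_right_shift(exp_sum, left_boundary=256):
    # difference between the significant bits of the exponent-mapping sum
    # and those of the left boundary value of the preset interval
    return bits(exp_sum) - bits(left_boundary)
```

With exp_sum = 1600 this gives bits(1600) = 11, bits(256) = 9, and a shift of 2; 1600 >> 2 = 400, which indeed falls in [256, 512).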
And S12, right shifting the exponent mapping data sum corresponding to the row of input data according to the first right shift number, mapping the right shifting result to a numerical value in a value range determined by the output data bit width of the exponent mapping table, and taking the numerical value as conversion data of the exponent mapping data sum corresponding to the row of input data.
In this embodiment, right-shifting the sum of the exponent mapping values corresponding to the row of input data according to the first right-shift number means shifting the sum right by the first right-shift number of bits. Mapping the right-shift result to a numerical value within the value range determined by the output data bit width of the index mapping table means subtracting the left boundary value of the preset value interval from the right-shift result. Taking the preset value interval [256, 512) as an example, the conversion data of the sum of the exponent mapping values corresponding to the row of input data may be obtained by the following formula: sum' = (sum >> bits) - 256, where ">>" is the right shift operator.
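Combining the shift count with the boundary subtraction, step S12 can be sketched as follows, assuming the preset interval [256, 512):

```python
import math

def convert_exp_sum(exp_sum, left_boundary=256):
    # first right-shift number: difference of the significant-bit counts
    # of the exponent-mapping sum and of the interval's left boundary
    bits1 = (math.floor(math.log2(exp_sum)) + 1) \
          - (math.floor(math.log2(left_boundary)) + 1)
    # sum' = (sum >> bits1) - left boundary of the preset interval
    return (exp_sum >> bits1) - left_boundary, bits1
```

For the example sum of 1600: 1600 >> 2 = 400, and 400 - 256 = 144.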
Step 7052, inputting the conversion result of the sum of the exponential mapping data corresponding to each row of input data into the global shared reciprocal mapping function configured for the transformer structure in advance to obtain a reciprocal mapping value of the sum of the exponential mapping data corresponding to the row of input data;
in this embodiment, the reciprocal mapping function may be: a reciprocal mapping table generated according to a preset function, a linear function obtained by linearly fitting the preset function, or a polynomial function generated by polynomial fitting of the preset function; wherein the preset function is f(x) = 2^(r_bits)/x, the value range of x is the preset value interval, preferably [256, 512), and r_bits is the output data bit width of the reciprocal mapping function, usually taking the value 16.
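Under the table-based option, a reciprocal mapping table for f(x) = 2^(r_bits)/x over [256, 512) with r_bits = 16 could be generated as follows. This is a sketch; the rounding mode is an assumption, since the text does not specify it.

```python
R_BITS = 16  # output data bit width of the reciprocal mapping function

# One entry per integer x in the preset interval [256, 512);
# entry i corresponds to x = 256 + i.
recip_table = [round((1 << R_BITS) / x) for x in range(256, 512)]
```

The first entry is 2^16 / 256 = 256, and the entry for the converted example value 144 (i.e., x = 400) is round(65536 / 400) = 164.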
The above steps 7051 to 7052 are specific refinements of step 305 shown in fig. 3.
Step 706, performing multiplication and shift operation on the inverse mapping value of the sum of the exponent mapping data corresponding to the truncated data of each row of input data and the exponent mapping data corresponding to the row of input data to obtain the final result corresponding to the row of input data.
In this embodiment, the performing multiplication and shift operation on the inverse mapping value of the sum of the exponent mapping data corresponding to the truncated data of each row of input data and the exponent mapping data corresponding to the row of input data to obtain the final result corresponding to the row of input data may specifically include:
s21, for each input data in the row of input data, performing the following operations S22 to S23:
s22, calculating the product of the inverse mapping value of the index mapping data corresponding to the truncation data of the input data and the sum of the index mapping data corresponding to the input data of the row, and rounding the product to obtain an intermediate result corresponding to the input data;
and S23, determining a second right shift number required for mapping the intermediate result corresponding to the input data to a value range determined by the output data bit width of the softmax module according to the first right shift number, the output data bit width of the global shared reciprocal mapping function and the output data bit width of the softmax module, and performing right shift on the intermediate result corresponding to the input data according to the second right shift number to obtain a final result corresponding to the input data.
In step S23, the second right-shift number bits2 required to map the intermediate result corresponding to the input data into the value range determined by the output data bit width of the softmax module is determined from the first right-shift number bits1, the output data bit width r_bits of the global shared reciprocal mapping function, and the output data bit width o_bits of the softmax module, specifically by the following formula: bits2 = r_bits - o_bits + bits1.
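Steps S22 and S23 can then be sketched as follows; o_bits = 8 is an assumed output bit width of the softmax module, and the other values continue the worked example (bits1 = 2, r_bits = 16):

```python
def final_result(exp_mapped, recip_value, bits1, r_bits=16, o_bits=8):
    # S22: product of the exponent mapping data and the reciprocal
    # mapping value of the row sum, rounded to an intermediate result
    intermediate = round(exp_mapped * recip_value)
    # S23: second right-shift number bits2 = r_bits - o_bits + bits1
    bits2 = r_bits - o_bits + bits1
    return intermediate >> bits2
```

For exp_mapped = 800 with a row sum of 1600 (reciprocal mapping value 164, bits1 = 2), this yields 800 × 164 = 131200, bits2 = 10, and 131200 >> 10 = 128, i.e. about half of the 8-bit output range, as expected for a value that is half the row sum.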
And step 707, outputting the final result to a subsequent neural network layer of the neural network layer to which the softmax module belongs.
As can be seen from the method shown in fig. 7, in this embodiment, after data adjustment and truncation processing are performed on each row of input data in the input matrix transmitted from the MatMul module to the softmax module, the globally shared index mapping table is used to perform the exponent operation and obtain the index mapping data corresponding to the truncated data of the row of input data; the sum of the index mapping data corresponding to the row of input data is obtained through summation; the sum of the index mapping data is converted and then input into the global shared reciprocal mapping function to obtain the reciprocal mapping value corresponding to the sum of the index mapping data; and finally the final result corresponding to the row of input data is obtained through multiplication and shift operations. It can be seen that, by using the globally shared index mapping table and reciprocal mapping function, this embodiment simplifies the calculation process, reduces hardware resource consumption and power consumption, solves the problem of inconsistent data distribution and data precision between different rows of the input matrix transmitted from the MatMul module to the softmax module, and effectively improves the quantization precision.
The above describes in detail a method for implementing softmax function quantization based on a transform structure in the embodiment of the present invention, and the embodiment of the present invention also provides a device for implementing softmax function quantization based on a transform structure, which is described in detail below with reference to fig. 9.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an apparatus for implementing softmax function quantization based on a transform structure according to an embodiment of the present invention, where the apparatus is applied to a softmax module in a transform structure, and as shown in fig. 9, the apparatus includes:
an obtaining unit 900, configured to obtain an input matrix transmitted by a MatMul module in a transformer structure;
an adjusting unit 901, configured to perform data adjustment on each row of input data in an input matrix transmitted from the MatMul module to the softmax module according to an input data bit width of a global shared index mapping table configured for a transformer structure in advance, to obtain adjustment data of the row of input data; cutting off the adjustment data of each row of input data to obtain cut-off data of the row of input data;
an index unit 902, configured to search the global shared index mapping table to determine index mapping data corresponding to truncated data of each row of input data;
a summing unit 903, configured to add the exponential mapping data corresponding to the truncated data of each row of input data to obtain an exponential mapping data sum corresponding to the row of input data;
a reciprocal unit 904, configured to determine a reciprocal mapping value of a sum of exponent mapping data corresponding to each row of input data according to a global shared reciprocal mapping function configured for the transformer structure in advance;
a multiplication unit 905, configured to perform multiplication and shift operation on the inverse mapping value of the sum of the exponent mapping data corresponding to the truncated data of each row of input data and the exponent mapping data corresponding to the row of input data, to obtain a final result corresponding to the row of input data;
the output unit 906 is configured to output the final result to a subsequent neural network layer of the neural network layer to which the softmax module belongs.
In the arrangement shown in figure 9 of the drawings,
the adjusting unit 901 performs data adjustment on each row of input data in an input matrix transmitted from the MatMul module to the softmax module according to an input data bit width of a global shared index mapping table configured for a transformer structure in advance, including:
determining the maximum value in the row of input data, and determining an adjusting value according to the bit width of the input data of the shared index mapping table;
calculating the difference value between the input data and the maximum value aiming at each input data in the row of input data, and adjusting the difference value by using the adjusting value to obtain the adjusting data of the input data;
the adjusting unit 901 performs truncation processing on the adjustment data of each line of input data to obtain truncated data of the line of input data, and includes:
and for each input data in the row of input data, if the adjustment data of the input data is less than 0, setting the truncation data of the input data to be 0, otherwise, setting the truncation data of the input data to be the adjustment data of the input data.
In the arrangement shown in figure 9 of the drawings,
the global shared index mapping table is generated according to a preset minimum scale S which can be expressed after fixed point quantization, an input data bit width t _ bits and an output data bit width e _ bits of the global shared index mapping table, and comprises 1< < t _ bits table entries, and the occupied storage space is (1< < t _ bits) multiplied by e _ bits divided by 8 bytes.
In the arrangement shown in figure 9 of the drawings,
the global shared reciprocal mapping function is a reciprocal mapping table generated according to a preset function, a linear function obtained by linearly fitting the preset function, or a polynomial function generated by polynomial fitting the preset function; wherein the preset function is f (x) =2 r_bitsThe value range of x is a preset value range, and r _ bits is the output value bit width of the reciprocal mapping function;
the reciprocal unit 904 determines a reciprocal mapping value of a sum of exponent mapping data corresponding to each row of input data according to a global shared reciprocal mapping function configured for a transformer structure in advance, including:
carrying out data conversion on the sum of the exponential mapping data corresponding to the row of input data;
and inputting the conversion result into the global shared reciprocal mapping function to obtain a reciprocal mapping value of the sum of the exponential mapping data corresponding to the input data of the row.
In the arrangement shown in figure 9 of the drawings,
the reciprocal unit 904 performs data conversion on the sum of the exponent mapping data corresponding to the row of input data, and includes:
determining a first effective bit number required to be occupied by the index mapping value sum corresponding to the row of input data and a second effective bit number required to be occupied by the left boundary value representing a preset value interval, and determining a difference value between the first effective bit number and the second effective bit number as a first right shift number required to map the index mapping value sum corresponding to the row of input data to the preset value interval;
and right shifting the exponent mapping data sum corresponding to the row of input data according to the first right shift number, mapping the right shift result to a value in a value range determined by the bit width of the output data of the exponent mapping table, and taking the value as conversion data of the exponent mapping data sum corresponding to the row of input data.
In the arrangement shown in figure 9 of the drawings,
the multiplication unit 905 performs multiplication and shift operation on the inverse mapping value of the sum of the exponent mapping data corresponding to the truncated data of each row of input data and the exponent mapping data corresponding to the row of input data to obtain the final result corresponding to the row of input data, and includes:
for each input data in the row of input data, performing the following operations:
calculating the product of the inverse mapping value of the index mapping data corresponding to the truncation data of the input data and the sum of the index mapping data corresponding to the input data of the row, and rounding the product to obtain an intermediate result corresponding to the input data;
and determining a second right shift number required for mapping the intermediate result corresponding to the input data to a value range determined by the output data bit width of the softmax module according to the first right shift number, the output data bit width of the global shared reciprocal mapping function and the output data bit width of the softmax module, and performing right shift on the intermediate result corresponding to the input data according to the second right shift number to obtain a final result corresponding to the input data.
In the arrangement shown in figure 9 of the drawings,
when the input matrix is transmitted to the softmax module from the MatMul module, a pipeline mode is adopted for realization;
and the softmax module obtains, through operation, the final result corresponding to each row of input data in the input matrix and temporarily stores it in a Static Random Access Memory (SRAM).
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (14)

1. A transformer structure-based softmax function quantization implementation method is applied to a softmax module in a transformer structure, and is characterized by comprising the following steps:
acquiring an input matrix transmitted by a MatMul module in a transformer structure;
performing data adjustment on each row of input data in an input matrix transmitted from the MatMul module to the softmax module according to the bit width of the input data of the global shared index mapping table configured for the transformer structure in advance to obtain adjustment data of the row of input data;
cutting off the adjustment data of each line of input data to obtain cut-off data of the line of input data;
searching the global shared index mapping table to determine index mapping data corresponding to truncated data of each row of input data;
adding the exponential mapping data corresponding to the truncation data of each row of input data to obtain the sum of the exponential mapping data corresponding to the row of input data;
determining a reciprocal mapping value of the sum of exponential mapping data corresponding to each row of input data according to a global shared reciprocal mapping function configured for a transformer structure in advance;
performing multiplication and shift operation on the exponential mapping data corresponding to the truncation data of each row of input data and the reciprocal mapping value of the sum of the exponential mapping data corresponding to the row of input data to obtain a final result corresponding to the row of input data;
and outputting the final result to a subsequent neural network layer of the neural network layer to which the softmax module belongs.
2. The method of claim 1,
according to the input data bit width of a global shared index mapping table configured for a transformer structure in advance, performing data adjustment on each row of input data in an input matrix transmitted from a MatMul module to a softmax module, wherein the data adjustment comprises the following steps:
determining the maximum value in the row of input data, and determining an adjusting value according to the bit width of the input data of the global shared index mapping table;
calculating the difference value between the input data and the maximum value aiming at each input data in the row of input data, and adjusting the difference value by using the adjusting value to obtain the adjusting data of the input data;
the method for intercepting the adjustment data of each line of input data to obtain the intercepted data of the line of input data comprises the following steps:
and for each input data in the row of input data, if the adjustment data of the input data is less than 0, setting the truncation data of the input data to be 0, otherwise, setting the truncation data of the input data to be the adjustment data of the input data.
3. The method of claim 1,
the global shared index mapping table is generated according to a preset minimum scale S which can be expressed after fixed point quantization, an input data bit width t _ bits and an output data bit width e _ bits of the global shared index mapping table, and comprises 1< < t _ bits table entries, the occupied storage space is (1< < t _ bits) × e _ bits/8 bytes, and < <' is a left shift operator.
4. The method of claim 1,
the global shared reciprocal mapping function is a reciprocal mapping table generated according to a preset function, a linear function obtained by linearly fitting the preset function, or a polynomial function generated by polynomial fitting the preset function; wherein the preset function is f (x) =2r_bitsThe value range of x is a preset value interval, and r _ bits is the output data bit width of the global shared reciprocal mapping function;
determining a reciprocal mapping value of the sum of exponential mapping data corresponding to each row of input data according to a global shared reciprocal mapping function configured for a transformer structure in advance, wherein the reciprocal mapping value comprises the following steps:
carrying out data conversion on the sum of the exponential mapping data corresponding to the row of input data;
and inputting the conversion result into the global shared reciprocal mapping function to obtain a reciprocal mapping value of the sum of the exponential mapping data corresponding to the input data of the row.
5. The method of claim 4,
and performing data conversion on the index mapping data sum corresponding to the row of input data, wherein the data conversion comprises the following steps:
determining a first effective bit number required to be occupied by the index mapping value sum corresponding to the row of input data and a second effective bit number required to be occupied by the left boundary value representing a preset value interval, and determining a difference value between the first effective bit number and the second effective bit number as a first right shift number required to map the index mapping value sum corresponding to the row of input data to the preset value interval;
and performing right shift on the index mapping data sum corresponding to the input data of the row according to the first right shift number, mapping the right shift result to a numerical value in a value range determined by the output data bit width of the global shared index mapping table, and taking the numerical value as conversion data of the index mapping data sum corresponding to the input data of the row.
6. The method of claim 5,
performing multiplication and shift operation on the exponent mapping data corresponding to the truncation data of each row of input data and the reciprocal mapping value of the sum of the exponent mapping data corresponding to the row of input data to obtain a final result corresponding to the row of input data, wherein the method comprises the following steps:
for each input data in the row of input data, performing the following operations:
calculating the product of the inverse mapping value of the index mapping data corresponding to the truncation data of the input data and the sum of the index mapping data corresponding to the input data of the row, and rounding the product to obtain an intermediate result corresponding to the input data;
and determining a second right shift number required for mapping the intermediate result corresponding to the input data to a value range determined by the output data bit width of the softmax module according to the first right shift number, the output data bit width of the global shared reciprocal mapping function and the output data bit width of the softmax module, and performing right shift on the intermediate result corresponding to the input data according to the second right shift number to obtain a final result corresponding to the input data.
7. The method of claim 1,
when the input matrix is transmitted to the softmax module from the MatMul module, a pipeline mode is adopted for realization;
and the softmax module is used for temporarily storing a final result corresponding to each row of input data in the input matrix in a Static Random Access Memory (SRAM) in an operation mode.
8. A transformer structure-based softmax function quantization implementation device is applied to a softmax module in a transformer structure, and is characterized by comprising the following steps:
the acquisition unit is used for acquiring an input matrix transmitted by a MatMul module in a transformer structure;
the adjusting unit is used for performing data adjustment on each row of input data in an input matrix transmitted from the MatMul module to the softmax module according to the bit width of the input data of the global shared index mapping table configured for the transformer structure in advance to obtain the adjustment data of the row of input data; cutting off the adjustment data of each row of input data to obtain cut-off data of the row of input data;
the index unit is used for searching the global shared index mapping table to determine index mapping data corresponding to the truncation data of each row of input data;
the summing unit is used for summing the exponential mapping data corresponding to the truncation data of each row of input data to obtain the sum of the exponential mapping data corresponding to the row of input data;
the reciprocal unit is used for determining a reciprocal mapping value of the sum of exponential mapping data corresponding to each row of input data according to a global shared reciprocal mapping function configured for the transformer structure in advance;
the multiplication unit is used for carrying out multiplication and shift operation on the exponential mapping data corresponding to the truncation data of each row of input data and the reciprocal mapping value of the sum of the exponential mapping data corresponding to the row of input data to obtain a final result corresponding to the row of input data;
and the output unit is used for outputting the final result to a subsequent neural network layer of the neural network layer to which the softmax module belongs.
9. The apparatus of claim 8,
the adjusting unit performs data adjustment on each row of input data in an input matrix transmitted from the MatMul module to the softmax module according to an input data bit width of a global shared index mapping table configured for a transformer structure in advance, and includes:
determining the maximum value in the row of input data, and determining an adjusting value according to the bit width of the input data of the shared index mapping table;
calculating the difference value between the input data and the maximum value aiming at each input data in the row of input data, and adjusting the difference value by using the adjusting value to obtain the adjusting data of the input data;
the adjusting unit is used for intercepting the adjusting data of each line of input data to obtain the intercepted data of the line of input data, and comprises:
and for each input data in the row of input data, if the adjustment data of the input data is less than 0, setting the truncation data of the input data to be 0, otherwise, setting the truncation data of the input data to be the adjustment data of the input data.
10. The apparatus of claim 8,
the global shared index mapping table is generated according to a preset minimum scale S which can be expressed after fixed point quantization, an input data bit width t _ bits and an output data bit width e _ bits of the global shared index mapping table, and comprises 1< < t _ bits table entries, the occupied storage space is (1< < t _ bits) × e _ bits/8 bytes, and < <' is a left shift operator.
11. The apparatus of claim 8,
the global shared reciprocal mapping function is a reciprocal mapping table generated according to a preset function, a linear function obtained by linearly fitting the preset function, or multiple functions generated by polynomial fitting of the preset functionA polynomial function; wherein the preset function is f (x) =2 r_bitsThe value range of the/x is a preset value interval, and the r _ bits is the output value bit width of the reciprocal mapping function;
the reciprocal unit determines a reciprocal mapping value of a sum of exponential mapping data corresponding to each row of input data according to a global shared reciprocal mapping function configured for a transformer structure in advance, and includes:
carrying out data conversion on the sum of the exponential mapping data corresponding to the row of input data;
and inputting the conversion result into the global shared reciprocal mapping function to obtain a reciprocal mapping value of the sum of the exponential mapping data corresponding to the input data of the row.
12. The apparatus of claim 11,
the reciprocal unit performs data conversion on the sum of the exponential mapping data corresponding to the row of input data, and includes:
determining a first effective bit number required to be occupied by the index mapping value sum corresponding to the row of input data and a second effective bit number required to be occupied by the left boundary value representing a preset value interval, and determining a difference value between the first effective bit number and the second effective bit number as a first right shift number required to map the index mapping value sum corresponding to the row of input data to the preset value interval;
and right shifting the exponent mapping data sum corresponding to the row of input data according to the first right shift number, mapping the right shift result to a numerical value in a value range determined by the output data bit width of the exponent mapping table, and taking the numerical value as conversion data of the exponent mapping data sum corresponding to the row of input data.
13. The apparatus of claim 12,
the multiplication unit performs multiplication and shift operation on the inverse mapping value of the sum of the exponent mapping data corresponding to the truncated data of each row of input data and the exponent mapping data corresponding to the row of input data to obtain the final result corresponding to the row of input data, and includes:
for each input data in the row of input data, performing the following operations:
calculating the product of the inverse mapping value of the index mapping data corresponding to the truncation data of the input data and the sum of the index mapping data corresponding to the input data of the row, and rounding the product to obtain an intermediate result corresponding to the input data;
and determining a second right shift number required for mapping the intermediate result corresponding to the input data to a value range determined by the output data bit width of the softmax module according to the first right shift number, the output data bit width of the global shared reciprocal mapping function and the output data bit width of the softmax module, and performing right shift on the intermediate result corresponding to the input data according to the second right shift number to obtain a final result corresponding to the input data.
14. The apparatus of claim 8,
when the input matrix is transmitted to the softmax module from the MatMul module, a pipeline mode is adopted for realization;
and the softmax module is used for temporarily storing a final result corresponding to each row of input data in the input matrix in a Static Random Access Memory (SRAM) in an operation mode.
CN202210517307.3A 2022-05-13 2022-05-13 Transformer structure-based softmax function quantization realization method and device Active CN114648101B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210517307.3A CN114648101B (en) 2022-05-13 2022-05-13 Transformer structure-based softmax function quantization realization method and device


Publications (2)

Publication Number Publication Date
CN114648101A true CN114648101A (en) 2022-06-21
CN114648101B CN114648101B (en) 2022-08-12

Family

ID=81997334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210517307.3A Active CN114648101B (en) 2022-05-13 2022-05-13 Transformer structure-based softmax function quantization realization method and device

Country Status (1)

Country Link
CN (1) CN114648101B (en)


Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108021537A (en) * 2018-01-05 2018-05-11 Nanjing University A softmax implementation based on a hardware platform
CN109308520A (en) * 2018-09-26 2019-02-05 Alibaba Group Holding Ltd FPGA circuit and method for computing the softmax function
CN110135086A (en) * 2019-05-20 2019-08-16 Hefei University of Technology Softmax function hardware circuit with adjustable computational precision and implementation method thereof
CN111258839A (en) * 2020-02-16 2020-06-09 Suzhou Inspur Intelligent Technology Co Ltd AI accelerator card simulation test system based on ResNet50 network and working method thereof
US20200211576A1 (en) * 2018-12-28 2020-07-02 Electronics And Telecommunications Research Institute Method and device for determining loss function for audio signal
CN112686031A (en) * 2020-12-24 2021-04-20 Beijing Youzhuju Network Technology Co Ltd Text feature extraction model quantization method, apparatus, device and storage medium
CN112685693A (en) * 2020-12-31 2021-04-20 Electric Power Research Institute of China Southern Power Grid Device for implementing the Softmax function
CN112699284A (en) * 2021-01-11 2021-04-23 Sichuan University Bus stop optimization visualization method based on multi-source data
CN112804726A (en) * 2021-01-06 2021-05-14 Nanjing University of Science and Technology Multi-agent reinforcement learning routing algorithm based on geographical position
CN113011571A (en) * 2021-03-03 2021-06-22 South China University of Technology INT8 offline quantization and integer inference method based on Transformer model
CN113052252A (en) * 2021-03-31 2021-06-29 Beijing ByteDance Network Technology Co Ltd Hyper-parameter determination method, apparatus, deep reinforcement learning framework, medium and device
CN113377332A (en) * 2021-05-28 2021-09-10 Nanjing University Softmax hardware implementation method based on linear segmentation
CN113449863A (en) * 2021-07-14 2021-09-28 National University of Defense Technology Neural network quantization method based on table lookup
WO2021262053A1 (en) * 2020-06-25 2021-12-30 Telefonaktiebolaget Lm Ericsson (Publ) Method and system for image compressing and coding with deep learning
CN114090740A (en) * 2021-11-19 2022-02-25 Beijing Youzhuju Network Technology Co Ltd Intent recognition method and apparatus, readable medium and electronic device
CN114169513A (en) * 2022-02-11 2022-03-11 Shenzhen MicroBT Electronics Technology Co Ltd Neural network quantization method and apparatus, storage medium and electronic device
CN114386469A (en) * 2020-10-22 2022-04-22 Gree Electric Appliances Inc of Zhuhai Method and device for quantizing a convolutional neural network model, and electronic device
CN114418929A (en) * 2021-11-19 2022-04-29 Northeastern University Weld defect identification method based on consistency multi-scale metric learning
CN114463591A (en) * 2022-02-13 2022-05-10 Beijing Normal University Deep neural network image classification method, apparatus, device and storage medium


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DANYANG ZHU et al.: "Efficient Precision-Adjustable Architecture for Softmax Function in Deep Learning", Journal of LaTeX Class Files *
ZHEXIN LI et al.: "Fixed-point Quantization for Vision Transformer", 2021 China Automation Congress (CAC) *
ZHIHANG YUAN et al.: "PTQ4ViT: Post-Training Quantization Framework for Vision Transformers", arXiv:2111.12293v1 *
LIU Haizhang et al.: "A Brief Analysis of the Design and Implementation of the SoftMax Activation Function", Western Radio and Television *
ZHOU Weikang et al.: "FPGA Implementation of the SoftMax Function Using Hardware-Software Co-design", Information &amp; Communications *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115469829A (en) * 2022-10-28 2022-12-13 摩尔线程智能科技(北京)有限责任公司 Arithmetic device and exponential operation method using arithmetic circuit
CN115469829B (en) * 2022-10-28 2023-07-04 摩尔线程智能科技(北京)有限责任公司 Arithmetic device and exponent arithmetic method based on arithmetic circuit

Also Published As

Publication number Publication date
CN114648101B (en) 2022-08-12

Similar Documents

Publication Publication Date Title
Hildebrandt et al. On linear functional operations and the moment problem for a finite interval in one or several dimensions
CN110070178A Convolutional neural network computing device and method
CN110688088A (en) General nonlinear activation function computing device and method for neural network
CN107633298B (en) Hardware architecture of recurrent neural network accelerator based on model compression
CN114648101B (en) Transformer structure-based softmax function quantization realization method and device
CN111062610B (en) Power system state estimation method and system based on information matrix sparse solution
CN110377875B (en) Matrix inversion method, apparatus, device and computer readable storage medium
CN101655803A (en) Method and mobile terminal for implementing mathematical model simulation
Buchholz et al. Complexity of Kronecker operations on sparse matrices with applications to the solution of Markov models
CN105103152A (en) Approximate query processing
Wu et al. Efficient dynamic fixed-point quantization of CNN inference accelerators for edge devices
CN209708122U Computing unit, array, module, and hardware system
CN111563598A (en) Method and system for predicting quantum computation simulation time
Rekha et al. FPGA implementation of exponential function using CORDIC IP core for extended input range
CN110837624B (en) Approximation calculation device for sigmoid function
CN115827555B (en) Data processing method, computer device, storage medium, and multiplier structure
CN109460535B (en) Finite field matrix inversion device and inversion method based on cloud
CN110633447B (en) Spherical distance fixed-point calculation method based on FPGA and calculation device thereof
US9213639B2 (en) Division of numerical values based on summations and memory mapping in computing systems
CN113778681B (en) Data processing method and device based on cloud computing and storage medium
CN115408061A (en) Hardware acceleration method, device, chip and storage medium for complex matrix operation
Lin et al. Hybrid dynamic fixed point quantization methodology for AI accelerators
CN110413647B Fast similarity computation system for unequal-length high-dimensional vector sequences
CN114385540A (en) Data unit conversion method and device
WO2023028884A1 (en) Floating-point number computing circuit and floating-point number computing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant