CN116822616A - Device for training Softmax function in large language model - Google Patents
- Publication number: CN116822616A
- Application number: CN202310881111.7A
- Authority
- CN
- China
- Prior art keywords
- vector
- calculation
- propagation path
- output
- adder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/57—Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
- G06F7/575—Basic arithmetic logic units, i.e. devices selectable to perform either addition, subtraction or one of several logical operations, using, at least partially, the same circuitry
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Abstract
The invention provides a device for training a Softmax function in a large language model. The upper half of the device is a forward propagation path and the lower half is a backward propagation path. The forward propagation path comprises an e^x exponential function unit, an adder, and a divider, with registers inserted between the e^x exponential function unit, the adder, and the divider. The backward propagation path comprises two multipliers, an adder A_1, and a multiplexer MUX for reconfiguring the data path; the two multipliers are a left multiplier B_1 and a right multiplier B_2. The multiplexer MUX is used to change the data flow inside the device. The forward propagation path and the backward propagation path share two random access memories, RAM1 and RAM2. The invention can be applied to the computation of the Softmax function at each training stage, thereby making better use of computation and storage resources to achieve higher performance and energy efficiency.
Description
Technical Field
The invention relates to a device for training a Softmax function in a large language model.
Background
The Transformer is a classic model for NLP (Natural Language Processing) proposed by the Google team in 2017. It uses a Self-Attention mechanism, which allows the model to be trained in parallel and to capture global information about a sample. Currently popular models such as BERT and GPT are likewise implemented on the Transformer infrastructure.
In recent years, Deep Neural Networks (DNNs) based on the Transformer have achieved excellent results in fields such as Natural Language Processing (NLP), Computer Vision (CV), and speech processing. Transformer-based models are typically pre-trained on large-scale datasets and then fine-tuned for downstream tasks. With the continuous expansion of the application scenarios of the Transformer model, and in view of the requirements of data privacy and real-time processing, training (fine-tuning) the model on edge platforms has become important. However, because of the huge parameter count of the Transformer model, its computational complexity is high, and deploying the fine-tuning training process on resource-limited edge platforms faces many challenges. A Transformer-class model consists of Transformer layers, in which an attention mechanism (Attention Mechanism) called self-attention is used. As the model scale keeps growing and the sequence length of the samples processed by the model keeps increasing, the amount and proportion of Softmax computation in the attention mechanism during inference and training also keep rising, and Softmax has become one of the bottlenecks restricting deployment efficiency. Existing hardware designs related to Softmax target the inference stage of models, and their applications mainly focus on traditional convolutional neural networks. These prior schemes mainly use mathematical transformations to convert the complex exponential function (e^x) and division operations into lower-complexity forms better suited to hardware implementation.
Disclosure of Invention
The aim of the invention: the technical problem to be solved by the invention is to provide a device for Softmax function training in a large language model, where the large language model is a Transformer-class model. The upper half of the device is a forward propagation path and the lower half is a backward propagation path;
the forward propagation path comprises an e^x exponential function unit, an adder, and a divider; registers are inserted between the e^x exponential function unit, the adder, and the divider;
the backward propagation path comprises two multipliers, an adder A_1, and a multiplexer MUX for reconfiguring the data path; the two multipliers are a left multiplier B_1 and a right multiplier B_2;
The multiplexer MUX is used for changing the data flow inside the device;
the forward propagation path and the backward propagation path share two random access memories RAM1 and RAM2.
In the forward propagation phase, the Softmax function is computed row by row on the matrix M. A row vector of the matrix is m = (m_1, m_2, …, m_s), where m_s denotes the s-th element of vector m, the subscript denotes the element's position in the vector, and the length of m is s. The result vector computed by Softmax is n = (n_1, n_2, …, n_s), and the i-th element n_i of n is computed as:

n_i = e^{m_i} / Σ_{j=1}^{s} e^{m_j}    (1)

where 1 ≤ i ≤ s.

The computation of equation (1) comprises three stages: the e^{m_i} calculation for each element, the Σ_{j=1}^{s} e^{m_j} summation, and the division.
The forward propagation path uses serialized processing to complete the forward propagation computation: the input data are the elements m_1, m_2, …, m_s of vector m, and are stored in RAM1. After computation starts, m_1, m_2, …, m_s are taken out of RAM1 in sequence and first pass through the e^x exponential function unit, which completes the exponential operation. The e^{m_i} results of the elements are accumulated in the adder, and at the same time each e^{m_i} result is stored in RAM2. After all elements of vector m have completed the exponential operation, the accumulated sum Σ_{j=1}^{s} e^{m_j} produced by the adder is stored in RAM2. Then Σ_{j=1}^{s} e^{m_j} is sent to the divisor input port of the divider, while the e^{m_i} values are fetched from RAM2 and sequentially sent to the dividend input port of the divider, whose outputs are, in turn, n_1, n_2, …, n_s.
During serialized processing, the division computation must wait until all elements of vector m have completed the e^x exponential operation and the accumulated result is available. The computation is therefore divided into two stages, the e^x exponentiation-and-accumulation stage and the division stage, and the two stages are executed in a pipelined manner.
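As a numerical reference for the three stages described above, a minimal sketch (NumPy assumed; the function name is illustrative, not part of the device):

```python
import numpy as np

def softmax_row(m):
    """Row-wise Softmax of equation (1), split into the three stages
    described above: exponentiation, accumulation, division."""
    e = np.exp(m)          # stage 1: e^{m_i} for each element
    total = e.sum()        # stage 2: accumulate sum_j e^{m_j}
    return e / total       # stage 3: divide each e^{m_i} by the sum

n = softmax_row(np.array([1.0, 2.0, 3.0]))
```

The output is a probability vector: its elements sum to 1 and preserve the ordering of the inputs.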
Let the input matrix in the forward propagation of the Softmax function be M ∈ R^{s×s}, a real matrix of s rows and s columns, and let the row vectors of M be denoted in turn as m_1, m_2, …, m_s, where m_s denotes the s-th row vector of M, m_s = (m_{s1}, m_{s2}, …, m_{ss}), m_{ss} denotes the s-th element of m_s, the subscripts denote the element's position in the vector, and the length of m_s is s. When m_1 finishes the exponentiation-and-accumulation operation and enters the division stage, m_2 starts a new exponentiation-and-accumulation operation without waiting for m_1 to finish; the two stages are computed in parallel.
In the backward propagation path, the backward propagation stage is performed. The gradient propagated back by the network structure following Softmax is dn, where the following network structure refers to the computation after the Softmax function in the Transformer-class model, such as matrix multiplication; dn denotes the gradient of the model's final output with respect to the output data n of the forward-stage Softmax function. In the backward propagation stage, the gradient dm of the model's final output with respect to vector m must be computed; dm is calculated by the following formula:
dm = dn · (diag(n) − nᵀ · n)
where diag denotes the diagonal-matrix function and T denotes the matrix transpose. The computation of element dm_i of dm expands as:

dm_i = n_i · dn_i − Σ_{j=1}^{s} n_i · n_j · dn_j

where 1 ≤ i ≤ s.
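The matrix form of dm and its element-wise expansion can be cross-checked numerically; a sketch assuming NumPy, with all names illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
s = 8
m = rng.normal(size=s)
n = np.exp(m) / np.exp(m).sum()   # forward Softmax output
dn = rng.normal(size=s)           # gradient from the following structure

# matrix form: dm = dn · (diag(n) - n^T · n)
dm_matrix = dn @ (np.diag(n) - np.outer(n, n))

# element-wise expansion: dm_i = n_i*dn_i - sum_j n_i*n_j*dn_j
dm_elem = np.array([n[i] * dn[i] - sum(n[i] * n[j] * dn[j] for j in range(s))
                    for i in range(s)])

assert np.allclose(dm_matrix, dm_elem)
```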
When the backward propagation path of the apparatus is used for the backward propagation computation, RAM1 stores the forward propagation results n_1, n_2, …, n_s, and RAM2 stores the gradient values dn_1, dn_2, …, dn_s delivered by the network structure following Softmax, where dn_s denotes the s-th element of vector dn, the subscript denotes the element's position in the vector, and the length of dn is s.
When computing the gradient dm of vector m, the backward propagation path reconfigures its internal data stream through the multiplexer, thereby completing the computation of each element of dm. For example, when computing any element dm_i of dm (dm_i denotes the i-th element of vector dm, the subscript i denotes the element's position in dm, and the length of dm is s), the procedure is as follows:
When computing dm_i, the connection of the backward propagation path in the 1st clock cycle has the following features:

the data read from RAM1 is n_i, and the data read from RAM2 is dn_i;

the inputs of the left multiplier B_1 are n_i and −n_i, and its output is −n_i·n_i;

the inputs of the right multiplier B_2 are n_i and dn_i, and the output of B_2 is n_i·dn_i;

the left input of the adder A_1 is n_i·dn_i, the upper input of A_1 is 0 (indicating that accumulation starts), and the output of A_1 in the current cycle is n_i·dn_i;
When computing dm_i, in the t-th clock cycle (2 ≤ t ≤ s+1), the connection of the backward propagation path has the following features:

the data read from RAM1 is n_t (n_t denotes the t-th element of vector n, the subscript denotes the element's position in the vector, and all elements of vector n are stored in RAM1); the data read from RAM2 is dn_{t−1} (dn_{t−1} denotes the (t−1)-th element of vector dn, the subscript denotes the element's position in the vector, and all elements of vector dn are stored in RAM2);

the left input of the left multiplier B_1 is held at n_i under control of the multiplexer MUX, the upper input of B_1 is −n_t, and the output of B_1 in the current cycle is −n_i·n_t;

the lower input of the right multiplier B_2 is dn_{t−1}; the left input of B_2 is the output of B_1 in the previous cycle, −n_i·n_{t−1}; the output of B_2 is −n_i·n_{t−1}·dn_{t−1};

the left input of the adder A_1 is the output of B_2, −n_i·n_{t−1}·dn_{t−1}; the upper input of A_1 is, under control of the multiplexer MUX, the output of A_1 in the previous cycle, so A_1 now functions as an accumulator;

after s+1 cycles, the accumulated output of A_1 is n_i·dn_i − Σ_{j=1}^{s} n_i·n_j·dn_j, i.e. dm_i has been computed.
Furthermore, a parallelized design can be used to further improve the computation throughput of the hardware module: the parallelism present in the training process of the Transformer-class model can be exploited by computing in parallel with two or more such devices.
Beneficial effects: for the Softmax computation in the Transformer-class model training process, the invention provides a flexible and efficient hardware architecture. By using a pipelined design method, it can be applied to the computation of the Softmax function in each training stage, making better use of computation and storage resources to achieve higher performance and energy efficiency. The solution proposed by the invention is, at present, the first effective scheme of its kind.
Drawings
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings and detailed description.
Fig. 1 is a block diagram of a Softmax training module.
Fig. 2 is a schematic diagram of a conventional design scheme with parallelism 4.
FIG. 3 is a forward propagation computation diagram of a serialization process.
FIG. 4 is a schematic diagram of forward propagation computation of a pipeline design.
Fig. 5 is a Softmax computation flow chart for each vector in the M matrix.
Fig. 6 is a schematic diagram of the reverse propagation path of the Softmax training module.
Fig. 7 is the data flow diagram of the 1st clock cycle when computing dm_1.

Fig. 8 is the data flow diagram of the t-th clock cycle when computing dm_1.
Detailed Description
The invention provides a device for Softmax function training in a large language model. Fig. 1 shows the structure of the Softmax module, which can be used for the computation of each training stage and comprises basic arithmetic units and matching storage modules. Two random access memories (Random Access Memory, RAM) cache the data required for computation and intermediate results; RAM1 and RAM2 store different data in the different stages of training and are shared by those stages. The upper half of the module is the forward propagation path and the lower half is the backward propagation path.
The forward propagation path comprises computation modules such as an e^x exponential function unit (exp), an adder, and a divider; registers are inserted between the computation modules to realize pipelined processing and thereby improve hardware frequency and throughput;
the backward propagation path comprises two multipliers, an adder, and a multiplexer (MUX) for reconfiguring the data path; during the different computation cycles of the backward propagation stage, the multiplexer can change the data flow inside the module to achieve correct computation. The data format supported by the computation modules can be adjusted as needed: if the data are floating-point numbers, the exp unit can be implemented by a corresponding floating-point arithmetic IP core; if the data are fixed-point numbers, it can be implemented using a look-up table (Look-Up Table) or a corresponding mathematical approximation unit.
Serialized and pipelined forward propagation path:
In the forward propagation phase, the Softmax function is computed row by row on the M matrix. A row vector of matrix M is denoted m = (m_1, m_2, …, m_s), and the result vector computed by Softmax is n = (n_1, n_2, …, n_s), where element n_i of n is computed as:

n_i = e^{m_i} / Σ_{j=1}^{s} e^{m_j}

The computation can be divided into three stages: the e^{m_i} calculation for each element, the Σ_{j=1}^{s} e^{m_j} summation, and the division.
Hardware designs for Softmax inference in conventional neural-network accelerators often adopt a parallelized design to improve throughput; for example, fig. 2 shows a conventional design scheme with parallelism 4: multiple parallel modules compute the e^{m_i} values simultaneously, an adder tree computes their sum, and multiple dividers then compute the individual results. The advantage of parallelized processing is that throughput can be raised continually by increasing the parallelism. However, in Transformer-class models the matrix dimensions in the self-attention mechanism are not fixed, because the sample sequence length s processed by the model keeps changing. The parallelism of parallelized processing is therefore an indeterminate quantity in a hardened hardware design. If the parallelism is set high, the computation cannot fully utilize the hardware resources when the sample sequence length s is small; if the parallelism is set low, the hardware cannot properly complete the required computation when the sample sequence length s is relatively large.
Considering the variable nature of Softmax computation in the Transformer model, the invention uses serialized and pipelined design methods to match the computation characteristics of the model. At the same time, there remains room for a parallelized design to further improve the computation throughput of the hardware module: the parallelism provided by the s×s matrix, multi-head attention, and multi-batch training can be exploited by computing with multiple modules in parallel.
Serializing:
As shown in fig. 3, the forward propagation path uses serialized processing to complete the forward propagation computation. The input data are the elements m_1, m_2, …, m_s of vector m, stored in RAM1. After computation starts, m_1, m_2, …, m_s are taken out of RAM1 in sequence, and the exp unit completes the exponential operation. The e^{m_i} result of each element is accumulated in the adder, and at the same time each e^{m_i} result is stored in RAM2. After all elements of vector m have completed the exponential operation, the accumulated sum Σ_{j=1}^{s} e^{m_j} produced by the adder is stored in RAM2. Then Σ_{j=1}^{s} e^{m_j} is sent to the divisor input port of the divider, and the e^{m_i} values are fetched from RAM2 and sequentially sent to the dividend input port of the divider. The outputs of the divider are thus, in turn, n_1, n_2, …, n_s.
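The serialized data flow of fig. 3 can be modeled behaviorally; in this sketch, plain Python lists stand in for RAM1 and RAM2, and only the order of operations is modeled, not the hardware timing:

```python
import math

def forward_serialized(ram1):
    """Behavioral model of the serialized forward path: RAM1 holds the
    input vector m; each e^{m_i} and the accumulated sum go into RAM2."""
    ram2 = []
    acc = 0.0
    for m_i in ram1:                 # elements fetched from RAM1 in sequence
        e = math.exp(m_i)            # exp unit
        acc += e                     # adder accumulates the running sum
        ram2.append(e)               # e^{m_i} is also stored in RAM2
    ram2.append(acc)                 # accumulated sum stored in RAM2 last
    total = ram2[-1]                 # sum -> divisor port of the divider
    return [e / total for e in ram2[:-1]]   # dividends streamed from RAM2

n = forward_serialized([0.5, 1.5, -0.3, 2.0])
```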
Pipelined design: in the serialized processing above, the division computation must wait until all elements of vector m have completed the e^x exponential operation and the accumulated result is available; this waiting between stages reduces computation efficiency. The invention therefore divides the computation into two stages: (1) e^x exponentiation and accumulation; (2) division. A pipeline is inserted between the two stages to improve computation efficiency and throughput.
As shown in fig. 4, let the row vectors of matrix M be denoted in turn as m_1, m_2, …, m_s, where m_1 = (m_{11}, m_{12}, …, m_{1s}), m_2 = (m_{21}, m_{22}, …, m_{2s}), and so on. When m_1 finishes the exponentiation-and-accumulation operation and enters the division stage, m_2 starts a new exponentiation-and-accumulation operation without waiting for m_1 to finish. The computations of the two stages proceed in parallel, no hardware module needs to wait, and the hardware utilization and computation throughput are effectively improved. Fig. 5 shows the flow of the Softmax computation for the vectors in the M matrix.
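The gain from overlapping the two stages can be illustrated with a simple cycle-count model; this is an illustrative assumption (each stage of one row takes s cycles and stage boundaries align perfectly), not a timing specification of the device:

```python
def cycles_sequential(rows, s):
    # no pipelining: each row runs exponentiation + accumulation (s cycles),
    # then division (s cycles), strictly one after another
    return rows * 2 * s

def cycles_pipelined(rows, s):
    # the division of row k overlaps the exponentiation of row k+1, so only
    # the final row's division stage is not hidden
    return rows * s + s

s = 128
speedup = cycles_sequential(s, s) / cycles_pipelined(s, s)
```

For an s×s matrix this speedup approaches 2x as s grows, consistent with two fully overlapped stages.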
Reconfigurable reverse propagation path:
In the backward propagation stage, the gradient propagated back by the network structure following Softmax is dn, and dm can be calculated by the following formula:

dm = dn · (diag(n) − nᵀ · n)

Taking element dm_i of dm (1 ≤ i ≤ s) as an example, the computation can be expanded as:

dm_i = n_i · dn_i − Σ_{j=1}^{s} n_i · n_j · dn_j

For example: dm_1 = n_1·dn_1 − (n_1·n_1·dn_1 + n_1·n_2·dn_2 + … + n_1·n_s·dn_s).
the structure of the backward propagation training path of the Softmax module is shown in fig. 6, and when used for the backward propagation calculation, the forward propagation calculation result n is stored in the RAM1 1 ,n 2 ,…,n s Stored in RAM2 is a gradient value dn delivered by a Softmax subsequent network structure, … 1 ,dn 2 ,…,dn s …. In calculating the gradient dm of vector m, the back propagation path reconstructs the internal data stream through the multiplexer, thereby accurately completing the calculation of each element in dm.
Taking dm_1 as an example, in the 1st clock cycle of the computation, the connection and data flow of the backward propagation path are as shown in fig. 7:

(1) The solid black lines in the figure indicate the active paths selected by the multiplexer, and the dashed lines indicate the unselected paths.

(2) The data read from RAM1 is n_1, and the data read from RAM2 is dn_1.

(3) The inputs of the left multiplier B_1 are n_1 and −n_1, and its output is −n_1·n_1.

(4) The inputs of the right multiplier B_2 are n_1 and dn_1, and the output of B_2 is n_1·dn_1.

(5) The left input of the adder A_1 is n_1·dn_1, the upper input of A_1 is 0 (indicating that accumulation starts), and the output of A_1 in the current cycle is n_1·dn_1.
In the t-th clock cycle (2 ≤ t ≤ s+1) of computing dm_1, the connection and data flow of the backward propagation path are as shown in fig. 8, with the following features:

(1) The data read from RAM1 is n_t, and the data read from RAM2 is dn_{t−1}.

(2) The left input of multiplier B_1 is held at n_1 under control of the multiplexer, the upper input of B_1 is −n_t, and the output of B_1 in the current cycle is −n_1·n_t.

(3) The lower input of multiplier B_2 is dn_{t−1}; the left input of B_2 is the output of B_1 in the previous cycle, −n_1·n_{t−1}; thus the output of B_2 is −n_1·n_{t−1}·dn_{t−1}.

(4) The left input of the adder A_1 is the output of B_2, i.e. −n_1·n_{t−1}·dn_{t−1}; the upper input of A_1 is controlled by the multiplexer to be the output of A_1 in the previous cycle, so A_1 now functions as an accumulator.

After s+1 cycles, the accumulated output of A_1 is n_1·dn_1 − Σ_{j=1}^{s} n_1·n_j·dn_j, i.e. dm_1.
The other elements of dm are computed in the same way as dm_1.
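The cycle-by-cycle behavior of figs. 7 and 8 can be sketched and checked against the closed form dm = dn·(diag(n) − nᵀ·n). NumPy is assumed; indices are 0-based here, and only dm_1 (index 0) is modeled, as in the figures:

```python
import numpy as np

def dm_first_element(n, dn):
    """Cycle-by-cycle model of figs. 7 and 8 computing dm_1 (index 0 here):
    B1's output is registered and consumed by B2 one cycle later, and A1
    feeds its own previous output back through the MUX as an accumulator."""
    s = len(n)
    # cycle 1: B1 -> -n_1*n_1 (registered), B2 -> n_1*dn_1, A1 <- 0 + B2
    b1_reg = -n[0] * n[0]
    a1 = n[0] * dn[0]
    # cycles t = 2 .. s+1: B2 multiplies B1's previous output by dn_{t-1}
    for t in range(2, s + 2):
        b2 = b1_reg * dn[t - 2]         # B2: (B1 previous output) * dn_{t-1}
        a1 += b2                        # A1 accumulates via the MUX feedback
        if t <= s:
            b1_reg = -n[0] * n[t - 1]   # B1: held n_1 times -n_t from RAM1
    return a1

rng = np.random.default_rng(1)
m = rng.normal(size=6)
n = np.exp(m) / np.exp(m).sum()
dn = rng.normal(size=6)
dm_ref = dn @ (np.diag(n) - np.outer(n, n))   # closed form from the description
```

After s+1 simulated cycles the accumulator matches the first element of the closed-form gradient.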
Examples
Taking the training of a Transformer model on a natural language processing task as an example, the improvement in computational efficiency of the serialized, pipelined forward propagation path over the conventional parallelized design scheme is illustrated here. Let the sample lengths of the natural language processing dataset lie between 1 and 128, i.e. the vector length to be processed by the Softmax function satisfies 1 ≤ s ≤ 128. In the conventional parallelized design, correctly completing the Softmax function computation requires a hardware design with parallelism 128 (as shown in fig. 2). Let the mean of all sample lengths s in the dataset be μ; then over the whole training process the hardware utilization of the parallelized design scheme is η_1 = μ/128, while in the forward propagation path of the serialized, pipelined scheme provided by the invention the hardware resource utilization over the whole training process rises to η_2 = μ/(μ+1). Comparing the two utilizations, η_1 ≥ η_2 holds only when μ ≥ 127. In the actual training process, the mean μ of the sample lengths s (1 ≤ s ≤ 128) is far less than 128, so η_2 is much larger than η_1. The hardware resource utilization and computational efficiency of the serialized, pipelined forward propagation path are therefore superior to those of the conventional parallelized design scheme.
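A hedged reconstruction of this comparison: assuming η_1 = μ/128 for the parallelized design and η_2 = μ/(μ+1) for the serialized pipelined path (these forms are an assumption, chosen to be consistent with the stated crossover at μ ≥ 127), the utilizations can be computed directly:

```python
def eta_parallel(mu, p=128):
    # parallelized design: on average mu of the p parallel lanes are busy
    return mu / p

def eta_pipelined(mu):
    # serialized pipelined path: mu useful cycles out of mu + 1 (sketch)
    return mu / (mu + 1)

for mu in (16, 64, 127, 128):
    print(mu, eta_parallel(mu), eta_pipelined(mu))
```

For any realistic mean length well below 128, the pipelined path's utilization dominates; the two curves cross exactly at μ = 127, matching the comparison in the text.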
The invention provides a device for Softmax function training in a large language model, and there are many methods and ways to realize this technical scheme; the above is only a preferred embodiment of the invention. It should be noted that a person skilled in the art can make several improvements and modifications without departing from the principle of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented with the prior art.
Claims (10)
1. A device for Softmax function training in a large language model, the large language model being a Transformer class model, characterized in that the upper half of the device is a forward propagation path and the lower half is a reverse propagation path;
the forward propagation path comprises an e^x exponential function unit, an adder, and a divider; registers are inserted between the e^x exponential function unit, the adder, and the divider;
the backward propagation path comprises two multipliers, an adder A_1, and a multiplexer MUX for reconfiguring the data path; the two multipliers are a left multiplier B_1 and a right multiplier B_2;
The multiplexer MUX is used for changing the data flow inside the device;
the forward propagation path and the backward propagation path share two random access memories RAM1 and RAM2.
2. The apparatus of claim 1, wherein in the forward propagation phase the Softmax function is calculated row by row on an M matrix; a row vector of the matrix M is m = (m_1, m_2, …, m_s), where m_s denotes the s-th element of vector m, the subscript denotes the element's position in the vector, and the length of m is s; the result vector computed by Softmax is n = (n_1, n_2, …, n_s), and the i-th element n_i of n is computed as:

n_i = e^{m_i} / Σ_{j=1}^{s} e^{m_j}    (1)

where 1 ≤ i ≤ s.
3. The apparatus of claim 2, wherein the calculation of equation (1) comprises three stages: the e^{m_i} calculation for each element, the Σ_{j=1}^{s} e^{m_j} summation, and the division.
4. The apparatus of claim 3, wherein the forward propagation path performs the forward propagation computation using serialized processing: the input data are the elements m_1, m_2, …, m_s of vector m, stored in RAM1; after computation starts, m_1, m_2, …, m_s are taken out of RAM1 in sequence and first pass through the e^x exponential function unit, which completes the exponential operation; the e^{m_i} results of the elements are accumulated in the adder, and at the same time each e^{m_i} result is stored in RAM2; after all elements of vector m have completed the exponential operation, the accumulated sum Σ_{j=1}^{s} e^{m_j} produced by the adder is stored in RAM2; then Σ_{j=1}^{s} e^{m_j} is sent to the divisor input port of the divider, while the e^{m_i} values are fetched from RAM2 and sequentially sent to the dividend input port of the divider, whose outputs are, in turn, n_1, n_2, …, n_s;

during serialized processing, the division computation must wait until all elements of vector m have completed the e^x exponential operation and the accumulated result is available; the computation is therefore divided into two stages, the e^x exponentiation-and-accumulation stage and the division stage, which are executed in a pipelined manner.
5. The apparatus of claim 4, wherein the input matrix in the forward propagation of the Softmax function is M ∈ R^{s×s}, a real matrix of s rows and s columns; the row vectors of matrix M are denoted in turn as m_1, m_2, …, m_s, where m_s denotes the s-th row vector of M, m_s = (m_{s1}, m_{s2}, …, m_{ss}), m_{ss} denotes the s-th element of m_s, the subscripts denote the element's position in the vector, and the length of m_s is s; when m_1 finishes the exponentiation-and-accumulation operation and enters the division stage, m_2 starts a new exponentiation-and-accumulation operation without waiting for m_1 to finish; the two stages are computed in parallel.
6. The apparatus of claim 5, wherein the backward propagation stage is performed on the backward propagation path with the gradient dn propagated back by the network structure following Softmax, the following network structure being the computation after the Softmax function in the Transformer-class model; dn denotes the gradient of the model's final output with respect to the output data n of the forward-stage Softmax function; in the backward propagation stage, the gradient dm of the model's final output with respect to vector m needs to be calculated.
7. The apparatus of claim 6, wherein dm is calculated by the following formula:
dm = dn · (diag(n) − n^T · n)
where diag denotes the diagonal-matrix function and T denotes the matrix transpose; the calculation of an element dm_i of dm expands as follows:
dm_i = n_i · dn_i − n_i · Σ_{j=1}^{s} n_j · dn_j
where 1 ≤ i ≤ s.
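A short numeric check that the matrix form and the per-element expansion above agree (the helper names are illustrative; this is a verification sketch, not the claimed hardware):

```python
import math

def softmax(m):
    e = [math.exp(x) for x in m]
    t = sum(e)
    return [v / t for v in e]

def dm_matrix_form(n, dn):
    """dm = dn · (diag(n) - n^T · n), written out without numpy:
    dm_i = sum_j dn_j * (diag(n) - n^T n)_{j,i}."""
    s = len(n)
    return [sum(dn[j] * ((n[i] if j == i else 0.0) - n[j] * n[i])
                for j in range(s))
            for i in range(s)]

def dm_elementwise(n, dn):
    """The expanded per-element form: dm_i = n_i*dn_i - n_i * Σ_j n_j*dn_j."""
    dot = sum(nj * dnj for nj, dnj in zip(n, dn))
    return [ni * dni - ni * dot for ni, dni in zip(n, dn)]
```

Because n sums to 1, the components of dm always sum to zero, which is a useful sanity check on any hardware implementation of this formula.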
8. The apparatus of claim 7, wherein, when the back-propagation path of the apparatus is used for the back-propagation calculation, RAM1 stores the forward-propagation calculation results n_1, n_2, …, n_s and RAM2 stores the gradient values dn_1, dn_2, …, dn_s delivered by the network structure following the Softmax function, where dn_s denotes the s-th element of the vector dn, the subscript indicates the position of the element in the vector, and the vector dn has length s.
9. The apparatus of claim 8, wherein, during the calculation of the gradient dm of the vector m, the back-propagation path reconfigures its internal data stream through the multiplexers, thereby completing the calculation of each element of dm; for any element dm_i of dm, the calculation process is as follows:
in the 1st clock cycle of calculating dm_i, the connections of the back-propagation path have the following characteristics:
at this time, the data read from RAM1 is n_i and the data read from RAM2 is dn_i;
the inputs of the left multiplier B_1 are n_i and −n_i, and its output is −n_i·n_i;
the inputs of the right multiplier B_2 are n_i and dn_i, and the output of B_2 is n_i·dn_i;
the left input of the adder A_1 is n_i·dn_i and the upper input of A_1 is 0, indicating the start of the accumulation; the output of A_1 in the current cycle is n_i·dn_i;
in the t-th clock cycle of calculating dm_i, where 2 ≤ t ≤ s+1, the connections of the back-propagation path have the following characteristics:
at this time, the data read from RAM1 is n_t, where n_t denotes the t-th element of the vector n, the subscript indicates the position of the element in the vector, and all elements of the vector n are stored in RAM1; the data read from RAM2 is dn_{t−1}, where dn_{t−1} denotes the (t−1)-th element of the vector dn, the subscript indicates the position of the element in the vector, and all elements of the vector dn are stored in RAM2;
the left input of the left multiplier B_1 is held at n_i under the control of the multiplexer MUX, the upper input of B_1 is −n_t, and the output of B_1 in the current cycle is −n_i·n_t;
the right input of the right multiplier B_2 is dn_{t−1}; at this time the left input of B_2 is the output of B_1 from the previous cycle, −n_i·n_{t−1}, and the output of B_2 is −n_i·n_{t−1}·dn_{t−1};
the left input of the adder A_1 is the output of B_2, −n_i·n_{t−1}·dn_{t−1}, and the upper input of A_1 is controlled by the multiplexer MUX to be the output of A_1 from the previous cycle, so that A_1 functions as an accumulator;
after s+1 cycles, the accumulated output of A_1 is dm_i = n_i·dn_i − n_i·Σ_{j=1}^{s} n_j·dn_j; at this point dm_i has been calculated.
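The cycle-by-cycle accumulation above can be sketched as a small simulation; B_1, B_2, A_1, and MUX are the patent's units, but the function itself, its zero-based indexing, and its variable names are assumptions of this sketch.

```python
def dm_i_datapath(n, dn, i):
    """Cycle-level sketch of the back-propagation path for one element
    dm_i: cycle 1 seeds the accumulator A1 with n_i*dn_i (A1's upper
    input is 0); cycles t = 2..s+1 add -n_i * n_{t-1} * dn_{t-1}.
    Indexing is zero-based here; the claim uses one-based indices."""
    s = len(n)
    ni = n[i]                   # held at B1's left input by the MUX
    a1 = ni * dn[i]             # cycle 1: B2 outputs n_i*dn_i
    for t in range(2, s + 2):   # cycles t = 2 .. s+1
        j = t - 2               # zero-based index of n_{t-1}, dn_{t-1}
        b1 = -ni * n[j]         # B1 output from the previous cycle
        b2 = b1 * dn[j]         # B2: -n_i * n_{t-1} * dn_{t-1}
        a1 += b2                # A1 accumulates (MUX feeds back A1's output)
    return a1                   # = n_i*dn_i - n_i * Σ_j n_j*dn_j
```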
10. The apparatus of claim 9, wherein the computation is performed in parallel by two or more of the apparatuses.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310881111.7A CN116822616A (en) | 2023-07-18 | 2023-07-18 | Device for training Softmax function in large language model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116822616A true CN116822616A (en) | 2023-09-29 |
Family
ID=88139161
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310881111.7A Pending CN116822616A (en) | 2023-07-18 | 2023-07-18 | Device for training Softmax function in large language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116822616A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN117349032A * | 2023-12-05 | 2024-01-05 | 城云科技(中国)有限公司 | Method and device for improving throughput of large language model |
CN117349032B * | 2023-12-05 | 2024-02-20 | 城云科技(中国)有限公司 | Method and device for improving throughput of large language model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||