CN116822616A - Device for training Softmax function in large language model - Google Patents
- Publication number: CN116822616A
- Application number: CN202310881111.7A
- Authority
- CN
- China
- Prior art keywords
- vector
- calculation
- propagation path
- output
- adder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/57—Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
- G06F7/575—Basic arithmetic logic units, i.e. devices selectable to perform either addition, subtraction or one of several logical operations, using, at least partially, the same circuitry
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Abstract
The invention provides a device for training a Softmax function in a large language model. The upper half of the device is a forward propagation path and the lower half is a backward propagation path. The forward propagation path comprises an e^x exponential function unit, an adder, and a divider, with registers inserted between the e^x exponential function unit, the adder, and the divider. The backward propagation path comprises two multipliers, an adder A_1, and a multiplexer MUX for reconfiguring the data path; the two multipliers are a left multiplier B_1 and a right multiplier B_2. The multiplexer MUX is used to change the data flow inside the device. The forward propagation path and the backward propagation path share two random access memories, RAM1 and RAM2. The invention can be applied to the computation of the Softmax function at each training stage, thereby making better use of computation and storage resources to achieve higher performance and energy efficiency.
Description
Technical Field
The invention relates to a device for training a Softmax function in a large language model.
Background
The Transformer is a classic model for NLP (Natural Language Processing) proposed by the Google team in 2017. It uses a Self-Attention mechanism, which allows the model to be trained in parallel and to capture global information about a sample. Currently popular models such as BERT and GPT are likewise implemented on the Transformer infrastructure.
In recent years, Deep Neural Networks (DNNs) based on the Transformer have achieved excellent results in fields such as Natural Language Processing (NLP), Computer Vision (CV), and speech processing. Transformer-based models are typically pre-trained on large-scale datasets and then fine-tuned for downstream tasks. With the continuous expansion of the application scenarios of the Transformer model, and in view of the requirements of data privacy and real-time processing, training (fine-tuning) the model on edge platforms has become important. However, because of the huge parameter count of the Transformer model, its computational complexity is high, and deploying the fine-tuning training process on resource-limited edge platforms faces many challenges. A Transformer-class model consists of Transformer layers, in which an attention mechanism (Attention Mechanism) called self-attention is used. As the model scale keeps growing and the sequence length of the samples processed by the model keeps increasing, the amount and proportion of Softmax computation in the attention mechanism during inference and training also keep rising, and Softmax has become one of the bottlenecks restricting deployment efficiency. Existing hardware designs related to Softmax target the inference stage of models, and their applications mainly focus on traditional convolutional neural networks. These prior schemes mainly use mathematical transformations to convert the complex exponential function (e^x) and division operations into lower-complexity forms better suited to hardware implementation.
Disclosure of Invention
The aim of the invention: the technical problem to be solved by the invention is to provide a device for Softmax function training in a large language model, where the large language model is a Transformer-class model. The upper half of the device is a forward propagation path and the lower half is a backward propagation path;
the forward propagation path comprises an e^x exponential function unit, an adder, and a divider; registers are inserted between the e^x exponential function unit, the adder, and the divider;
the backward propagation path comprises two multipliers, an adder A_1, and a multiplexer MUX for reconfiguring the data path; the two multipliers are a left multiplier B_1 and a right multiplier B_2;
The multiplexer MUX is used for changing the data flow inside the device;
the forward propagation path and the backward propagation path share two random access memories RAM1 and RAM2.
In the forward propagation phase, the Softmax function is computed row by row on the matrix M. A row vector of the matrix is m = (m_1, m_2, …, m_s), where m_s denotes the s-th element of vector m, the subscript denotes the element's position in the vector, and the length of m is s. The result vector computed by Softmax is n = (n_1, n_2, …, n_s), and the i-th element n_i of n is computed as:

n_i = e^{m_i} / Σ_{j=1}^{s} e^{m_j}    (1)

where 1 ≤ i ≤ s.

The computation of equation (1) comprises three stages: the e^{m_i} calculation for each element, the Σ_{j=1}^{s} e^{m_j} summation, and the division.
The forward propagation path uses serialized processing to complete the forward propagation computation: the input data are the elements m_1, m_2, …, m_s of vector m, and are stored in RAM1. After computation starts, m_1, m_2, …, m_s are taken out of RAM1 in sequence and first pass through the e^x exponential function unit, which completes the exponential operation. The e^{m_i} results of the elements are accumulated in the adder, and at the same time each e^{m_i} result is stored in RAM2. After all elements of vector m have completed the exponential operation, the accumulated sum Σ_{j=1}^{s} e^{m_j} produced by the adder is stored in RAM2. Then Σ_{j=1}^{s} e^{m_j} is sent to the divisor input port of the divider, while the e^{m_i} values are fetched from RAM2 and sequentially sent to the dividend input port of the divider, whose outputs are, in turn, n_1, n_2, …, n_s.
During serialized processing, the division computation must wait until all elements of vector m have completed the e^x exponential operation and the accumulated result is available. The computation is therefore divided into two stages, the e^x exponentiation-and-accumulation stage and the division stage, and the two stages are executed in a pipelined manner.
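As a numerical reference for the three stages described above, a minimal sketch (NumPy assumed; the function name is illustrative, not part of the device):

```python
import numpy as np

def softmax_row(m):
    """Row-wise Softmax of equation (1), split into the three stages
    described above: exponentiation, accumulation, division."""
    e = np.exp(m)          # stage 1: e^{m_i} for each element
    total = e.sum()        # stage 2: accumulate sum_j e^{m_j}
    return e / total       # stage 3: divide each e^{m_i} by the sum

n = softmax_row(np.array([1.0, 2.0, 3.0]))
```

The output is a probability vector: its elements sum to 1 and preserve the ordering of the inputs.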
Let the input matrix in the forward propagation of the Softmax function be M ∈ R^{s×s}, a real matrix of s rows and s columns, and let the row vectors of M be denoted in turn as m_1, m_2, …, m_s, where m_s denotes the s-th row vector of M, m_s = (m_{s1}, m_{s2}, …, m_{ss}), m_{ss} denotes the s-th element of m_s, the subscripts denote the element's position in the vector, and the length of m_s is s. When m_1 finishes the exponentiation-and-accumulation operation and enters the division stage, m_2 starts a new exponentiation-and-accumulation operation without waiting for m_1 to finish; the two stages are computed in parallel.
In the backward propagation path, the backward propagation stage is performed. The gradient propagated back by the network structure following Softmax is dn, where the following network structure refers to the computation after the Softmax function in the Transformer-class model, such as matrix multiplication; dn denotes the gradient of the model's final output with respect to the output data n of the forward-stage Softmax function. In the backward propagation stage, the gradient dm of the model's final output with respect to vector m must be computed; dm is calculated by the following formula:
dm = dn · (diag(n) − nᵀ · n)
where diag denotes the diagonal-matrix function and T denotes the matrix transpose. The computation of element dm_i of dm expands as:

dm_i = n_i · dn_i − Σ_{j=1}^{s} n_i · n_j · dn_j

where 1 ≤ i ≤ s.
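The matrix form of dm and its element-wise expansion can be cross-checked numerically; a sketch assuming NumPy, with all names illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
s = 8
m = rng.normal(size=s)
n = np.exp(m) / np.exp(m).sum()   # forward Softmax output
dn = rng.normal(size=s)           # gradient from the following structure

# matrix form: dm = dn · (diag(n) - n^T · n)
dm_matrix = dn @ (np.diag(n) - np.outer(n, n))

# element-wise expansion: dm_i = n_i*dn_i - sum_j n_i*n_j*dn_j
dm_elem = np.array([n[i] * dn[i] - sum(n[i] * n[j] * dn[j] for j in range(s))
                    for i in range(s)])

assert np.allclose(dm_matrix, dm_elem)
```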
When the backward propagation path of the apparatus is used for the backward propagation computation, RAM1 stores the forward propagation results n_1, n_2, …, n_s, and RAM2 stores the gradient values dn_1, dn_2, …, dn_s delivered by the network structure following Softmax, where dn_s denotes the s-th element of vector dn, the subscript denotes the element's position in the vector, and the length of dn is s.
When computing the gradient dm of vector m, the backward propagation path reconfigures its internal data stream through the multiplexer, thereby completing the computation of each element of dm. For example, when computing any element dm_i of dm (dm_i denotes the i-th element of vector dm, the subscript i denotes the element's position in dm, and the length of dm is s), the procedure is as follows:
When computing dm_i, the connection of the backward propagation path in the 1st clock cycle has the following features:

the data read from RAM1 is n_i, and the data read from RAM2 is dn_i;

the inputs of the left multiplier B_1 are n_i and −n_i, and its output is −n_i·n_i;

the inputs of the right multiplier B_2 are n_i and dn_i, and the output of B_2 is n_i·dn_i;

the left input of the adder A_1 is n_i·dn_i, the upper input of A_1 is 0 (indicating that accumulation starts), and the output of A_1 in the current cycle is n_i·dn_i;
When computing dm_i, in the t-th clock cycle (2 ≤ t ≤ s+1), the connection of the backward propagation path has the following features:

the data read from RAM1 is n_t (n_t denotes the t-th element of vector n, the subscript denotes the element's position in the vector, and all elements of vector n are stored in RAM1); the data read from RAM2 is dn_{t−1} (dn_{t−1} denotes the (t−1)-th element of vector dn, the subscript denotes the element's position in the vector, and all elements of vector dn are stored in RAM2);

the left input of the left multiplier B_1 is held at n_i under control of the multiplexer MUX, the upper input of B_1 is −n_t, and the output of B_1 in the current cycle is −n_i·n_t;

the lower input of the right multiplier B_2 is dn_{t−1}; the left input of B_2 is the output of B_1 in the previous cycle, −n_i·n_{t−1}; the output of B_2 is −n_i·n_{t−1}·dn_{t−1};

the left input of the adder A_1 is the output of B_2, −n_i·n_{t−1}·dn_{t−1}; the upper input of A_1 is, under control of the multiplexer MUX, the output of A_1 in the previous cycle, so A_1 now functions as an accumulator;

after s+1 cycles, the accumulated output of A_1 is n_i·dn_i − Σ_{j=1}^{s} n_i·n_j·dn_j, i.e. dm_i has been computed.
Furthermore, a parallelized design can be used to further improve the computation throughput of the hardware module: the parallelism present in the training process of the Transformer-class model can be exploited by computing in parallel with two or more such devices.
Beneficial effects: for the Softmax computation in the Transformer-class model training process, the invention provides a flexible and efficient hardware architecture. By using a pipelined design method, it can be applied to the computation of the Softmax function in each training stage, making better use of computation and storage resources to achieve higher performance and energy efficiency. The solution proposed by the invention is, at present, the first effective scheme of its kind.
Drawings
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings and detailed description.
Fig. 1 is a block diagram of a Softmax training module.
Fig. 2 is a schematic diagram of a conventional design scheme with parallelism 4.
FIG. 3 is a forward propagation computation diagram of a serialization process.
FIG. 4 is a schematic diagram of forward propagation computation of a pipeline design.
Fig. 5 is a Softmax computation flow chart for each vector in the M matrix.
Fig. 6 is a schematic diagram of the reverse propagation path of the Softmax training module.
Fig. 7 is the data flow diagram of the 1st clock cycle when computing dm_1.

Fig. 8 is the data flow diagram of the t-th clock cycle when computing dm_1.
Detailed Description
The invention provides a device for Softmax function training in a large language model. Fig. 1 shows the structure of the Softmax module, which can be used for the computation of each training stage and comprises basic arithmetic units and matching storage modules. Two random access memories (Random Access Memory, RAM) cache the data required for computation and intermediate results; RAM1 and RAM2 store different data in the different stages of training and are shared by those stages. The upper half of the module is the forward propagation path and the lower half is the backward propagation path.
The forward propagation path comprises computation modules such as an e^x exponential function unit (exp), an adder, and a divider; registers are inserted between the computation modules to realize pipelined processing and thereby improve hardware frequency and throughput;
the backward propagation path comprises two multipliers, an adder, and a multiplexer (MUX) for reconfiguring the data path; during the different computation cycles of the backward propagation stage, the multiplexer can change the data flow inside the module to achieve correct computation. The data format supported by the computation modules can be adjusted as needed: if the data are floating-point numbers, the exp unit can be implemented by a corresponding floating-point arithmetic IP core; if the data are fixed-point numbers, it can be implemented using a look-up table (Look-Up Table) or a corresponding mathematical approximation unit.
Serialized and pipelined forward propagation path:
In the forward propagation phase, the Softmax function is computed row by row on the M matrix. A row vector of matrix M is denoted m = (m_1, m_2, …, m_s), and the result vector computed by Softmax is n = (n_1, n_2, …, n_s), where element n_i of n is computed as:

n_i = e^{m_i} / Σ_{j=1}^{s} e^{m_j}

The computation can be divided into three stages: the e^{m_i} calculation for each element, the Σ_{j=1}^{s} e^{m_j} summation, and the division.
Hardware designs for Softmax inference in conventional neural-network accelerators often adopt a parallelized design to improve throughput; for example, fig. 2 shows a conventional design scheme with parallelism 4: multiple parallel modules compute the e^{m_i} values simultaneously, an adder tree computes their sum, and multiple dividers then compute the individual results. The advantage of parallelized processing is that throughput can be raised continually by increasing the parallelism. However, in Transformer-class models the matrix dimensions in the self-attention mechanism are not fixed, because the sample sequence length s processed by the model keeps changing. The parallelism of parallelized processing is therefore an indeterminate quantity in a hardened hardware design. If the parallelism is set high, the computation cannot fully utilize the hardware resources when the sample sequence length s is small; if the parallelism is set low, the hardware cannot properly complete the required computation when the sample sequence length s is relatively large.
Considering the variable nature of Softmax computation in the Transformer model, the invention uses serialized and pipelined design methods to match the computation characteristics of the model. At the same time, there remains room for a parallelized design to further improve the computation throughput of the hardware module: the parallelism provided by the s×s matrix, multi-head attention, and multi-batch training can be exploited by computing with multiple modules in parallel.
Serializing:
As shown in fig. 3, the forward propagation path uses serialized processing to complete the forward propagation computation. The input data are the elements m_1, m_2, …, m_s of vector m, stored in RAM1. After computation starts, m_1, m_2, …, m_s are taken out of RAM1 in sequence, and the exp unit completes the exponential operation. The e^{m_i} result of each element is accumulated in the adder, and at the same time each e^{m_i} result is stored in RAM2. After all elements of vector m have completed the exponential operation, the accumulated sum Σ_{j=1}^{s} e^{m_j} produced by the adder is stored in RAM2. Then Σ_{j=1}^{s} e^{m_j} is sent to the divisor input port of the divider, and the e^{m_i} values are fetched from RAM2 and sequentially sent to the dividend input port of the divider. The outputs of the divider are thus, in turn, n_1, n_2, …, n_s.
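The serialized data flow of fig. 3 can be modeled behaviorally; in this sketch, plain Python lists stand in for RAM1 and RAM2, and only the order of operations is modeled, not the hardware timing:

```python
import math

def forward_serialized(ram1):
    """Behavioral model of the serialized forward path: RAM1 holds the
    input vector m; each e^{m_i} and the accumulated sum go into RAM2."""
    ram2 = []
    acc = 0.0
    for m_i in ram1:                 # elements fetched from RAM1 in sequence
        e = math.exp(m_i)            # exp unit
        acc += e                     # adder accumulates the running sum
        ram2.append(e)               # e^{m_i} is also stored in RAM2
    ram2.append(acc)                 # accumulated sum stored in RAM2 last
    total = ram2[-1]                 # sum -> divisor port of the divider
    return [e / total for e in ram2[:-1]]   # dividends streamed from RAM2

n = forward_serialized([0.5, 1.5, -0.3, 2.0])
```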
Pipelined design: in the serialized processing above, the division computation must wait until all elements of vector m have completed the e^x exponential operation and the accumulated result is available; this waiting between stages reduces computation efficiency. The invention therefore divides the computation into two stages: (1) e^x exponentiation and accumulation; (2) division. A pipeline is inserted between the two stages to improve computation efficiency and throughput.
As shown in fig. 4, let the row vectors of matrix M be denoted in turn as m_1, m_2, …, m_s, where m_1 = (m_{11}, m_{12}, …, m_{1s}), m_2 = (m_{21}, m_{22}, …, m_{2s}), and so on. When m_1 finishes the exponentiation-and-accumulation operation and enters the division stage, m_2 starts a new exponentiation-and-accumulation operation without waiting for m_1 to finish. The computations of the two stages proceed in parallel, no hardware module needs to wait, and the hardware utilization and computation throughput are effectively improved. Fig. 5 shows the flow of the Softmax computation for the vectors in the M matrix.
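The gain from overlapping the two stages can be illustrated with a simple cycle-count model; this is an illustrative assumption (each stage of one row takes s cycles and stage boundaries align perfectly), not a timing specification of the device:

```python
def cycles_sequential(rows, s):
    # no pipelining: each row runs exponentiation + accumulation (s cycles),
    # then division (s cycles), strictly one after another
    return rows * 2 * s

def cycles_pipelined(rows, s):
    # the division of row k overlaps the exponentiation of row k+1, so only
    # the final row's division stage is not hidden
    return rows * s + s

s = 128
speedup = cycles_sequential(s, s) / cycles_pipelined(s, s)
```

For an s×s matrix this speedup approaches 2x as s grows, consistent with two fully overlapped stages.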
Reconfigurable reverse propagation path:
In the backward propagation stage, the gradient propagated back by the network structure following Softmax is dn, and dm can be calculated by the following formula:

dm = dn · (diag(n) − nᵀ · n)

Taking element dm_i of dm (1 ≤ i ≤ s) as an example, the computation can be expanded as:

dm_i = n_i · dn_i − Σ_{j=1}^{s} n_i · n_j · dn_j

For example: dm_1 = n_1·dn_1 − (n_1·n_1·dn_1 + n_1·n_2·dn_2 + … + n_1·n_s·dn_s).
the structure of the backward propagation training path of the Softmax module is shown in fig. 6, and when used for the backward propagation calculation, the forward propagation calculation result n is stored in the RAM1 1 ,n 2 ,…,n s Stored in RAM2 is a gradient value dn delivered by a Softmax subsequent network structure, … 1 ,dn 2 ,…,dn s …. In calculating the gradient dm of vector m, the back propagation path reconstructs the internal data stream through the multiplexer, thereby accurately completing the calculation of each element in dm.
Taking dm_1 as an example, in the 1st clock cycle of the computation, the connection and data flow of the backward propagation path are as shown in fig. 7:

(1) The solid black lines in the figure indicate the active paths selected by the multiplexer, and the dashed lines indicate the unselected paths.

(2) The data read from RAM1 is n_1, and the data read from RAM2 is dn_1.

(3) The inputs of the left multiplier B_1 are n_1 and −n_1, and its output is −n_1·n_1.

(4) The inputs of the right multiplier B_2 are n_1 and dn_1, and the output of B_2 is n_1·dn_1.

(5) The left input of the adder A_1 is n_1·dn_1, the upper input of A_1 is 0 (indicating that accumulation starts), and the output of A_1 in the current cycle is n_1·dn_1.
In the t-th clock cycle (2 ≤ t ≤ s+1) of computing dm_1, the connection and data flow of the backward propagation path are as shown in fig. 8, with the following features:

(1) The data read from RAM1 is n_t, and the data read from RAM2 is dn_{t−1}.

(2) The left input of multiplier B_1 is held at n_1 under control of the multiplexer, the upper input of B_1 is −n_t, and the output of B_1 in the current cycle is −n_1·n_t.

(3) The lower input of multiplier B_2 is dn_{t−1}; the left input of B_2 is the output of B_1 in the previous cycle, −n_1·n_{t−1}; thus the output of B_2 is −n_1·n_{t−1}·dn_{t−1}.

(4) The left input of the adder A_1 is the output of B_2, i.e. −n_1·n_{t−1}·dn_{t−1}; the upper input of A_1 is controlled by the multiplexer to be the output of A_1 in the previous cycle, so A_1 now functions as an accumulator.

After s+1 cycles, the accumulated output of A_1 is n_1·dn_1 − Σ_{j=1}^{s} n_1·n_j·dn_j, i.e. dm_1.
The other elements of dm are computed in the same way as dm_1.
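The cycle-by-cycle behavior of figs. 7 and 8 can be sketched and checked against the closed form dm = dn·(diag(n) − nᵀ·n). NumPy is assumed; indices are 0-based here, and only dm_1 (index 0) is modeled, as in the figures:

```python
import numpy as np

def dm_first_element(n, dn):
    """Cycle-by-cycle model of figs. 7 and 8 computing dm_1 (index 0 here):
    B1's output is registered and consumed by B2 one cycle later, and A1
    feeds its own previous output back through the MUX as an accumulator."""
    s = len(n)
    # cycle 1: B1 -> -n_1*n_1 (registered), B2 -> n_1*dn_1, A1 <- 0 + B2
    b1_reg = -n[0] * n[0]
    a1 = n[0] * dn[0]
    # cycles t = 2 .. s+1: B2 multiplies B1's previous output by dn_{t-1}
    for t in range(2, s + 2):
        b2 = b1_reg * dn[t - 2]         # B2: (B1 previous output) * dn_{t-1}
        a1 += b2                        # A1 accumulates via the MUX feedback
        if t <= s:
            b1_reg = -n[0] * n[t - 1]   # B1: held n_1 times -n_t from RAM1
    return a1

rng = np.random.default_rng(1)
m = rng.normal(size=6)
n = np.exp(m) / np.exp(m).sum()
dn = rng.normal(size=6)
dm_ref = dn @ (np.diag(n) - np.outer(n, n))   # closed form from the description
```

After s+1 simulated cycles the accumulator matches the first element of the closed-form gradient.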
Examples
Taking the training of a Transformer model on a natural language processing task as an example, the improvement in computational efficiency of the serialized, pipelined forward propagation path over the conventional parallelized design scheme is illustrated here. Let the sample lengths of the natural language processing dataset lie between 1 and 128, i.e. the vector length to be processed by the Softmax function satisfies 1 ≤ s ≤ 128. In the conventional parallelized design, correctly completing the Softmax function computation requires a hardware design with parallelism 128 (as shown in fig. 2). Let the mean of all sample lengths s in the dataset be μ; then over the whole training process the hardware utilization of the parallelized design scheme is η_1 = μ/128, while in the forward propagation path of the serialized, pipelined scheme provided by the invention the hardware resource utilization over the whole training process rises to η_2 = μ/(μ+1). Comparing the two utilizations, η_1 ≥ η_2 holds only when μ ≥ 127. In the actual training process, the mean μ of the sample lengths s (1 ≤ s ≤ 128) is far less than 128, so η_2 is much larger than η_1. The hardware resource utilization and computational efficiency of the serialized, pipelined forward propagation path are therefore superior to those of the conventional parallelized design scheme.
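A hedged reconstruction of this comparison: assuming η_1 = μ/128 for the parallelized design and η_2 = μ/(μ+1) for the serialized pipelined path (these forms are an assumption, chosen to be consistent with the stated crossover at μ ≥ 127), the utilizations can be computed directly:

```python
def eta_parallel(mu, p=128):
    # parallelized design: on average mu of the p parallel lanes are busy
    return mu / p

def eta_pipelined(mu):
    # serialized pipelined path: mu useful cycles out of mu + 1 (sketch)
    return mu / (mu + 1)

for mu in (16, 64, 127, 128):
    print(mu, eta_parallel(mu), eta_pipelined(mu))
```

For any realistic mean length well below 128, the pipelined path's utilization dominates; the two curves cross exactly at μ = 127, matching the comparison in the text.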
The invention provides a device for Softmax function training in a large language model, and there are many methods and ways to realize this technical scheme; the above is only a preferred embodiment of the invention. It should be noted that a person skilled in the art can make several improvements and modifications without departing from the principle of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented with the prior art.
Claims (10)
1. A device for Softmax function training in a large language model, the large language model being a Transformer class model, characterized in that the upper half of the device is a forward propagation path and the lower half is a reverse propagation path;
the forward propagation path comprises an e^x exponential function unit, an adder, and a divider; registers are inserted between the e^x exponential function unit, the adder, and the divider;
the backward propagation path comprises two multipliers, an adder A_1, and a multiplexer MUX for reconfiguring the data path; the two multipliers are a left multiplier B_1 and a right multiplier B_2;
The multiplexer MUX is used for changing the data flow inside the device;
the forward propagation path and the backward propagation path share two random access memories RAM1 and RAM2.
2. The apparatus of claim 1, wherein in the forward propagation phase the Softmax function is calculated row by row on an M matrix; a row vector of the matrix M is m = (m_1, m_2, …, m_s), where m_s denotes the s-th element of vector m, the subscript denotes the element's position in the vector, and the length of m is s; the result vector computed by Softmax is n = (n_1, n_2, …, n_s), and the i-th element n_i of n is computed as:

n_i = e^{m_i} / Σ_{j=1}^{s} e^{m_j}    (1)

where 1 ≤ i ≤ s.
3. The apparatus of claim 2, wherein the calculation of equation (1) comprises three stages: the e^{m_i} calculation for each element, the Σ_{j=1}^{s} e^{m_j} summation, and the division.
4. The apparatus of claim 3, wherein the forward propagation path performs the forward propagation computation using serialized processing: the input data are the elements m_1, m_2, …, m_s of vector m, stored in RAM1; after computation starts, m_1, m_2, …, m_s are taken out of RAM1 in sequence and first pass through the e^x exponential function unit, which completes the exponential operation; the e^{m_i} results of the elements are accumulated in the adder, and at the same time each e^{m_i} result is stored in RAM2; after all elements of vector m have completed the exponential operation, the accumulated sum Σ_{j=1}^{s} e^{m_j} produced by the adder is stored in RAM2; then Σ_{j=1}^{s} e^{m_j} is sent to the divisor input port of the divider, while the e^{m_i} values are fetched from RAM2 and sequentially sent to the dividend input port of the divider, whose outputs are, in turn, n_1, n_2, …, n_s;

during serialized processing, the division computation must wait until all elements of vector m have completed the e^x exponential operation and the accumulated result is available; the computation is therefore divided into two stages, the e^x exponentiation-and-accumulation stage and the division stage, which are executed in a pipelined manner.
5. The apparatus of claim 4, wherein the input matrix in the forward propagation of the Softmax function is M ∈ R^{s×s}, a real matrix of s rows and s columns; the row vectors of matrix M are denoted in turn as m_1, m_2, …, m_s, where m_s denotes the s-th row vector of M, m_s = (m_{s1}, m_{s2}, …, m_{ss}), m_{ss} denotes the s-th element of m_s, the subscripts denote the element's position in the vector, and the length of m_s is s; when m_1 finishes the exponentiation-and-accumulation operation and enters the division stage, m_2 starts a new exponentiation-and-accumulation operation without waiting for m_1 to finish; the two stages are computed in parallel.
6. The apparatus of claim 5, wherein the backward propagation stage is performed on the backward propagation path with the gradient dn propagated back by the network structure following Softmax, the following network structure being the computation after the Softmax function in the Transformer-class model; dn denotes the gradient of the model's final output with respect to the output data n of the forward-stage Softmax function; in the backward propagation stage, the gradient dm of the model's final output with respect to vector m needs to be calculated.
7. The apparatus of claim 6, wherein dm is calculated by the following formula:
dm = dn · (diag(n) − n^T · n)
where diag denotes the diagonal-matrix function and T denotes the matrix transpose; the calculation of an element dm_i of dm expands as follows:
dm_i = n_i · dn_i − n_i · Σ_{j=1}^{s} n_j · dn_j
where 1 ≤ i ≤ s.
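A short numeric check that the matrix form and the per-element expansion above agree (the helper names are illustrative; this is a verification sketch, not the claimed hardware):

```python
import math

def softmax(m):
    e = [math.exp(x) for x in m]
    t = sum(e)
    return [v / t for v in e]

def dm_matrix_form(n, dn):
    """dm = dn · (diag(n) - n^T · n), written out without numpy:
    dm_i = sum_j dn_j * (diag(n) - n^T n)_{j,i}."""
    s = len(n)
    return [sum(dn[j] * ((n[i] if j == i else 0.0) - n[j] * n[i])
                for j in range(s))
            for i in range(s)]

def dm_elementwise(n, dn):
    """The expanded per-element form: dm_i = n_i*dn_i - n_i * Σ_j n_j*dn_j."""
    dot = sum(nj * dnj for nj, dnj in zip(n, dn))
    return [ni * dni - ni * dot for ni, dni in zip(n, dn)]
```

Because n sums to 1, the components of dm always sum to zero, which is a useful sanity check on any hardware implementation of this formula.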
8. The apparatus of claim 7, wherein, when the back-propagation path of the apparatus is used for the back-propagation calculation, RAM1 stores the forward-propagation calculation results n_1, n_2, …, n_s and RAM2 stores the gradient values dn_1, dn_2, …, dn_s delivered by the network structure following the Softmax function, where dn_s denotes the s-th element of the vector dn, the subscript indicates the position of the element in the vector, and the vector dn has length s.
9. The apparatus of claim 8, wherein, during the calculation of the gradient dm of the vector m, the back-propagation path reconfigures its internal data stream through the multiplexers, thereby completing the calculation of each element of dm; for any element dm_i of dm, the calculation process is as follows:
in the 1st clock cycle of calculating dm_i, the connections of the back-propagation path have the following characteristics:
at this time, the data read from RAM1 is n_i and the data read from RAM2 is dn_i;
the inputs of the left multiplier B_1 are n_i and −n_i, and its output is −n_i·n_i;
the inputs of the right multiplier B_2 are n_i and dn_i, and the output of B_2 is n_i·dn_i;
the left input of the adder A_1 is n_i·dn_i and the upper input of A_1 is 0, indicating the start of the accumulation; the output of A_1 in the current cycle is n_i·dn_i;
in the t-th clock cycle of calculating dm_i, where 2 ≤ t ≤ s+1, the connections of the back-propagation path have the following characteristics:
at this time, the data read from RAM1 is n_t, where n_t denotes the t-th element of the vector n, the subscript indicates the position of the element in the vector, and all elements of the vector n are stored in RAM1; the data read from RAM2 is dn_{t−1}, where dn_{t−1} denotes the (t−1)-th element of the vector dn, the subscript indicates the position of the element in the vector, and all elements of the vector dn are stored in RAM2;
the left input of the left multiplier B_1 is held at n_i under the control of the multiplexer MUX, the upper input of B_1 is −n_t, and the output of B_1 in the current cycle is −n_i·n_t;
the right input of the right multiplier B_2 is dn_{t−1}; at this time the left input of B_2 is the output of B_1 from the previous cycle, −n_i·n_{t−1}, and the output of B_2 is −n_i·n_{t−1}·dn_{t−1};
the left input of the adder A_1 is the output of B_2, −n_i·n_{t−1}·dn_{t−1}, and the upper input of A_1 is controlled by the multiplexer MUX to be the output of A_1 from the previous cycle, so that A_1 functions as an accumulator;
after s+1 cycles, the accumulated output of A_1 is dm_i = n_i·dn_i − n_i·Σ_{j=1}^{s} n_j·dn_j; at this point dm_i has been calculated.
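The cycle-by-cycle accumulation above can be sketched as a small simulation; B_1, B_2, A_1, and MUX are the patent's units, but the function itself, its zero-based indexing, and its variable names are assumptions of this sketch.

```python
def dm_i_datapath(n, dn, i):
    """Cycle-level sketch of the back-propagation path for one element
    dm_i: cycle 1 seeds the accumulator A1 with n_i*dn_i (A1's upper
    input is 0); cycles t = 2..s+1 add -n_i * n_{t-1} * dn_{t-1}.
    Indexing is zero-based here; the claim uses one-based indices."""
    s = len(n)
    ni = n[i]                   # held at B1's left input by the MUX
    a1 = ni * dn[i]             # cycle 1: B2 outputs n_i*dn_i
    for t in range(2, s + 2):   # cycles t = 2 .. s+1
        j = t - 2               # zero-based index of n_{t-1}, dn_{t-1}
        b1 = -ni * n[j]         # B1 output from the previous cycle
        b2 = b1 * dn[j]         # B2: -n_i * n_{t-1} * dn_{t-1}
        a1 += b2                # A1 accumulates (MUX feeds back A1's output)
    return a1                   # = n_i*dn_i - n_i * Σ_j n_j*dn_j
```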
10. The apparatus of claim 9, wherein the computation is performed in parallel by two or more of the apparatuses.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310881111.7A CN116822616A (en) | 2023-07-18 | 2023-07-18 | Device for training Softmax function in large language model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116822616A true CN116822616A (en) | 2023-09-29 |
Family
ID=88139161
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310881111.7A Pending CN116822616A (en) | 2023-07-18 | 2023-07-18 | Device for training Softmax function in large language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116822616A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN117349032A * | 2023-12-05 | 2024-01-05 | 城云科技(中国)有限公司 | Method and device for improving throughput of large language model |
CN117349032B * | 2023-12-05 | 2024-02-20 | 城云科技(中国)有限公司 | Method and device for improving throughput of large language model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||