CN116822616A - Device for training Softmax function in large language model - Google Patents

Device for training Softmax function in large language model

Info

Publication number
CN116822616A
CN116822616A
Authority
CN
China
Prior art keywords
vector
calculation
propagation path
output
adder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310881111.7A
Other languages
Chinese (zh)
Inventor
王中风
邵海阔
鲁金铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202310881111.7A priority Critical patent/CN116822616A/en
Publication of CN116822616A publication Critical patent/CN116822616A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • G06F7/575Basic arithmetic logic units, i.e. devices selectable to perform either addition, subtraction or one of several logical operations, using, at least partially, the same circuitry
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a device for training a Softmax function in a large language model. The upper half of the device is a forward propagation path and the lower half is a backward propagation path. The forward propagation path includes an e^x exponential function unit, an adder, and a divider, with registers inserted between the e^x exponential function unit, the adder, and the divider. The backward propagation path comprises two multipliers, an adder A1, and a multiplexer MUX for reconfiguring the data path; the two multipliers are a left multiplier B1 and a right multiplier B2, and the multiplexer MUX is used to change the data flow inside the device. The forward propagation path and the backward propagation path share two random access memories, RAM1 and RAM2. The invention can be applied to the Softmax computation in every training stage, making better use of computation and storage resources and thereby achieving higher performance and energy efficiency.

Description

Device for training Softmax function in large language model
Technical Field
The invention relates to a device for training a Softmax function in a large language model.
Background
The Transformer is a classical model for NLP (Natural Language Processing) proposed by the Google team in 2017. It uses the Self-Attention mechanism, so the model can be trained in parallel and can capture global information about a sample. Popular models such as BERT and GPT are also built on the Transformer infrastructure.
In recent years, Transformer-based deep neural networks (DNNs) have achieved excellent results in fields such as natural language processing (NLP), computer vision (CV), and speech processing. Transformer-based models are typically pre-trained on large-scale datasets and then fine-tuned for downstream tasks. As the application scenarios of Transformer models keep expanding, training (fine-tuning) the models on edge platforms becomes important in view of data privacy and real-time processing requirements. However, because of the huge parameter count of Transformer models and their high computational complexity, deploying the fine-tuning training process on resource-limited edge platforms faces many challenges. A Transformer-class model consists of Transformer layers, in which an attention mechanism (Attention Mechanism) called self-attention is used. As the model scale and the sample sequence length processed by the model keep growing, the amount and proportion of Softmax computation in the attention mechanism during inference and training also keep increasing, and Softmax has become one of the bottlenecks restricting deployment efficiency. Existing hardware designs for Softmax target the inference stage and are mainly applied to traditional convolutional neural networks. These prior-art schemes mainly use mathematical transformations to convert the complex exponential function (e^x) and division operations into lower-complexity forms better suited to hardware implementation.
Disclosure of Invention
The invention aims to solve the technical problem of providing a device for training a Softmax function in a large language model, where the large language model is a Transformer-class model. The device is characterized in that its upper half is a forward propagation path and its lower half is a backward propagation path;
the forward propagation path packetContaining e x Exponential function unit, adder and divider, at e x Registers are inserted among the exponent function unit, the adder and the divider;
the backward propagation path comprises two multipliers and an adder A 1 And a multiplexer MUX for reconstructing the data path; the two multipliers are respectively left multiplier B 1 Right multiplier B 2
The multiplexer MUX is used for changing the data flow inside the device;
the forward propagation path and the backward propagation path share two random access memories RAM1 and RAM2.
In the forward propagation phase, the Softmax function is computed row by row on the M matrix. A row vector of M is denoted m = (m_1, m_2, …, m_s), where m_s is the s-th element of the vector m, the subscript indicates the position of the element in the vector, and the length of m is s. The vector produced by the Softmax calculation is n = (n_1, n_2, …, n_s), and the i-th element n_i of n is computed as follows:

n_i = e^{m_i} / Σ_{j=1}^{s} e^{m_j}    (1)

where 1 ≤ i ≤ s.

The calculation in equation (1) comprises three stages: computation of e^{m_i} for each element, summation Σ_{j=1}^{s} e^{m_j}, and division.
The forward propagation path uses serialized processing to complete the forward computation: the input data are the elements m_1, m_2, …, m_s of the vector m, which are stored in RAM1. After the computation starts, m_1, m_2, …, m_s are fetched from RAM1 in sequence and first pass through the e^x exponential function unit to complete the exponential operation. The e^x results e^{m_1}, e^{m_2}, …, e^{m_s} of the elements are accumulated in the adder, and at the same time each e^x result is stored in RAM2. After all elements of the vector m have completed the exponential operation, the adder has accumulated Σ_{j=1}^{s} e^{m_j}, which is stored in RAM2. Then Σ_{j=1}^{s} e^{m_j} is sent to the divisor input port of the divider, while the e^x results are fetched from RAM2 and sequentially sent to the dividend input port of the divider, whose outputs are, in turn, n_1, n_2, …, n_s.
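As a point of reference for the serialized data flow described above, the following Python sketch emulates the two phases in software, with RAM1 and RAM2 modelled as plain lists; the function and variable names are illustrative, not part of the hardware design.

```python
import math

def softmax_forward_serialized(m):
    """Emulate the serialized forward path: exp + accumulate, then divide.

    m -- one row vector of the M matrix (read from RAM1)
    returns n = softmax(m) and the e^x results cached in RAM2.
    """
    ram2 = []          # caches the e^{m_i} results
    acc = 0.0          # adder output (running sum)
    for m_i in m:      # elements fetched from RAM1 one per cycle
        e = math.exp(m_i)   # e^x exponential function unit
        ram2.append(e)
        acc += e            # accumulation in the adder
    # division stage: divisor is the accumulated sum, dividends stream from RAM2
    n = [e / acc for e in ram2]
    return n, ram2

# example: one row of scores
n, _ = softmax_forward_serialized([0.5, 1.0, -0.3, 2.1])
print(n, sum(n))   # the outputs sum to 1
```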
In the serialized processing, the division cannot start until all elements of the vector m have completed the e^x exponential operation and the accumulated result has been obtained. The computation is therefore divided into two stages, the e^x exponentiation-and-accumulation stage and the division stage, which are executed in a pipelined manner.
Let the input matrix in the forward propagation of the Softmax function be M ∈ R^{s×s}, a real matrix with s rows and s columns, and let the row vectors of M be denoted in turn m_1, m_2, …, m_s, where m_s is the vector of the s-th row of M, m_s = (m_{s1}, m_{s2}, …, m_{ss}), m_{ss} is the s-th element of m_s, the subscripts indicate the position of the element in the vector, and the length of m_s is s. When m_1 has finished the exponent-and-accumulation operation and moves on to the division operation, m_2 starts a new exponent-and-accumulation operation without waiting for the computation of m_1 to complete; the two stages of the computation are executed in parallel.
In the backward propagation path, the backward propagation phase is executed. The gradient propagated back by the network structure following Softmax is dn, where the following network structure refers to the computation after the Softmax function in the Transformer-class model, such as matrix multiplication; dn is the gradient of the final output of the model with respect to the output data n of the forward-propagation Softmax function. In the backward propagation stage, the gradient dm of the final output of the model with respect to the vector m must be computed, and dm is given by the following formula:

dm = dn · (diag(n) − n^T · n)

where diag denotes the diagonal-matrix function and T denotes the matrix transpose. The computation of an element dm_i of dm expands to:

dm_i = n_i · dn_i − n_i · Σ_{j=1}^{s} n_j · dn_j

where 1 ≤ i ≤ s.
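A minimal NumPy check of the gradient formulas above, assuming row-vector conventions for n and dn; the function names are illustrative.

```python
import numpy as np

def softmax_backward_matrix(n, dn):
    """dm = dn . (diag(n) - n^T . n), with n and dn treated as 1 x s row vectors."""
    n = n.reshape(1, -1)
    dn = dn.reshape(1, -1)
    jac = np.diag(n.ravel()) - n.T @ n
    return (dn @ jac).ravel()

def softmax_backward_elementwise(n, dn):
    """dm_i = n_i*dn_i - n_i * sum_j(n_j*dn_j), matching the hardware accumulation."""
    total = float(np.dot(n, dn))
    return n * dn - n * total

m = np.array([0.5, 1.0, -0.3, 2.1])
n = np.exp(m) / np.exp(m).sum()          # forward result
dn = np.array([0.1, -0.2, 0.05, 0.3])    # incoming gradient
assert np.allclose(softmax_backward_matrix(n, dn), softmax_backward_elementwise(n, dn))
```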
When the backward propagation path of the apparatus is used for the backward propagation calculation, RAM1 stores the forward propagation results n_1, n_2, …, n_s, and RAM2 stores the gradient values dn_1, dn_2, …, dn_s delivered by the network structure following Softmax, where dn_s is the s-th element of the vector dn, the subscript indicates the position of the element in the vector, and the length of dn is s.
When computing the gradient dm of the vector m, the backward propagation path reconfigures its internal data flow through the multiplexer, thereby completing the computation of each element of dm. For an arbitrary element dm_i of dm (dm_i is the i-th element of the vector dm, the subscript i indicates the position of the element in the vector, and the length of dm is s), the computation proceeds as follows:

In the 1st clock cycle of the computation of dm_i, the connections of the backward propagation path have the following characteristics:

the data read from RAM1 is n_i and the data read from RAM2 is dn_i;

the inputs of the left multiplier B1 are n_i and −n_i, and its output is −n_i·n_i;

the inputs of the right multiplier B2 are n_i and dn_i, and the output of B2 is n_i·dn_i;

the left input of adder A1 is n_i·dn_i and its upper input is 0, indicating that accumulation starts; the output of A1 in the current cycle is n_i·dn_i.

In the t-th clock cycle of the computation of dm_i, with 2 ≤ t ≤ s+1, the connections of the backward propagation path have the following characteristics:

the data read from RAM1 is n_t, where n_t is the t-th element of the vector n, the subscript indicates the position of the element in the vector, and all elements of n are stored in RAM1; the data read from RAM2 is dn_{t−1}, where dn_{t−1} is the (t−1)-th element of the vector dn, the subscript indicates the position of the element in the vector, and all elements of dn are stored in RAM2;

the left input of the left multiplier B1 is held at n_i under the control of the multiplexer MUX, the upper input of B1 is −n_t, and the output of B1 in the current cycle is −n_i·n_t;

the upper input of the right multiplier B2 is dn_{t−1}, its left input is the output of B1 in the previous cycle, −n_i·n_{t−1}, and the output of B2 is −n_i·n_{t−1}·dn_{t−1};

the left input of adder A1 is the output of B2, −n_i·n_{t−1}·dn_{t−1}, and the upper input of A1 is controlled by the multiplexer MUX to be A1's own output of the previous cycle, so that A1 acts as an accumulator.

After s+1 cycles, the accumulated output of A1 is n_i·dn_i − n_i·Σ_{j=1}^{s} n_j·dn_j, and dm_i has now been computed.
Furthermore, a parallelized design can be used to further increase the computational throughput of the hardware module: by performing parallel computation with two or more of the devices, the parallelism present in the training process of the Transformer model can be exploited.
Beneficial effects: for the Softmax computation in the training process of Transformer models, the invention provides a flexible and efficient hardware architecture. By using a pipelined design method, it can be applied to the Softmax computation in every training stage, making better use of computation and storage resources and thereby achieving higher performance and energy efficiency. The solution proposed by the invention is, at present, the first proposed and effective scheme of its kind.
Drawings
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings and detailed description.
Fig. 1 is a block diagram of a Softmax training module.
Fig. 2 is a schematic diagram of a conventional parallelized design with parallelism 4.
FIG. 3 is a forward propagation computation diagram of a serialization process.
FIG. 4 is a schematic diagram of forward propagation computation of a pipeline design.
Fig. 5 is a Softmax computation flow chart for each vector in the M matrix.
Fig. 6 is a schematic diagram of the reverse propagation path of the Softmax training module.
FIG. 7 is a data flow diagram of the 1st clock cycle during the computation of dm_1.
FIG. 8 is a data flow diagram of the t-th clock cycle during the computation of dm_1.
Detailed Description
The invention provides a device for training a Softmax function in a large language model. FIG. 1 shows the structure of the Softmax module, which can be used for the computation of each training stage and comprises basic arithmetic units and matching storage modules. Two random access memories (Random Access Memory, RAM) are used to cache the data required for the computation and the intermediate results; RAM1 and RAM2 store different data in different stages of training and are shared across the stages. The upper half of the module is the forward propagation path and the lower half is the backward propagation path.
The forward propagation path includes an e^x exponential function unit (exp), an adder, a divider, and other computing modules; registers are inserted between the computing modules to realize pipelined processing and thereby improve hardware frequency and throughput;
the back propagation path comprises two multipliers, an adder, and a Multiplexer (MUX) for reconstructing the data path, which multiplexer can change the data flow inside the module during different computation cycles of the back propagation phase to achieve accurate computation. The supported data format of the computing module may be adjusted as desired. For example, if the data is a floating point number, the exp unit may be implemented by a corresponding floating point arithmetic IP core, and if the data is a fixed point number, it may be implemented using a Look-Up Table (Look Up Table) or a corresponding mathematical approximation unit.
Pipelined forward propagation path:
in the forward propagation phase, the Softmax function is calculated on a row-by-row basis on an M matrix, with a certain row vector in the matrix M denoted as m= (M 1 ,m 2 ,…,m s ) The result vector calculated by Softmax is n= (n) 1 ,n 2 ,…,n s ) Element n of n i The calculation process of (1) is expressed as follows:
the calculation can be divided into three phases: of elementsCalculation of->Summing and dividing.
In hardware designs for Softmax inference in conventional neural-network accelerators, a parallelized design is often adopted to improve throughput. For example, Fig. 2 shows a conventional design with parallelism 4: e^{m_i} is computed simultaneously by several parallel modules, the sum is obtained with an addition tree, and the individual results are then computed with several dividers. The advantage of parallelized processing is that throughput can keep being improved by increasing the parallelism. However, in Transformer-class models the matrix dimensions in the self-attention mechanism are not fixed, because the sample sequence length s processed by the model keeps changing. The parallelism of parallelized processing is therefore an undetermined quantity in a hardened hardware design. If the parallelism is set high, the computation cannot fully utilize the hardware resources when the sample sequence length s is small; if the parallelism is set low, the hardware cannot properly complete the required computation when the sample sequence length s is relatively large.
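For contrast with Fig. 2, the conventional parallelized forward pass can be sketched as follows (illustrative Python, parallelism P assumed to be 4): when the row length s is smaller than P, or not a multiple of P, some lanes sit idle, which is the utilization problem discussed above.

```python
import math

def softmax_forward_parallel(m, P=4):
    """Conventional parallelized forward pass with P parallel exp units,
    an addition tree and P dividers; also counts the idle lane-cycles."""
    exps, idle = [], 0
    for base in range(0, len(m), P):               # one batch of P lanes per step
        chunk = m[base:base + P]
        idle += P - len(chunk)                     # lanes with no element sit idle
        exps.extend(math.exp(x) for x in chunk)    # P parallel exp units
    total = sum(exps)                              # addition tree
    return [e / total for e in exps], idle

n, idle = softmax_forward_parallel([0.5, 1.0, -0.3], P=4)
print(idle)   # one of the 4 lanes is idle when s = 3
```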
Considering the variable nature of the Softmax computation in Transformer models, the invention uses serialized and pipelined design methods to match the computation characteristics of the model. At the same time, there is still room to further improve the computational throughput of the hardware module through a parallelized design: the parallelism offered by the s×s matrix, multi-head attention, and multi-batch training can be exploited by parallel computation with several modules.
Serialized processing:
as shown in fig. 3, the forward propagation path uses a serialization process to complete the computation of the forward propagation. The input data is element m in vector m 1 ,m 2 ,…,m s Stored in RAM 1. After the calculation starts, m 1 ,m 2 ,…,m s Sequentially taking out from the RAM1, finishing exponential operation by an exp unit, and e of each element x Calculation resultComplete the accumulation in adder while e x The calculation result is stored in the RAM2. After all elements in vector m have completed the exponential operation, the adder adds up to get +.>Stored in RAM2. Then (I)>Is transmitted to the divisor input port of the divider, < >>Fetched from RAM2 and sequentially transferred to the dividend input port of the divider. Thus, the output of the divider is n in turn 1 ,n 2 ,…,n s
Pipelined design: in the serialized processing above, the division cannot start until all elements of the vector m have completed the e^x exponential operation and the accumulated result has been obtained, and this waiting between stages reduces computational efficiency. The invention therefore divides the computation into two stages: (1) e^x exponentiation and accumulation; (2) division. The two stages are pipelined to improve computational efficiency and throughput.
As shown in Fig. 4, the row vectors of the matrix M are denoted in turn m_1, m_2, …, m_s, where m_1 = (m_{11}, m_{12}, …, m_{1s}), m_2 = (m_{21}, m_{22}, …, m_{2s}), and so on. When m_1 has finished the exponent-and-accumulation operation and moves on to the division operation, m_2 starts a new exponent-and-accumulation operation without waiting for the computation of m_1 to complete. The computations of the two stages are performed in parallel, no hardware module has to wait, and the hardware utilization and computational throughput are effectively improved. The flow of the Softmax computation for the vectors of the M matrix is shown in Fig. 5.
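The two-stage overlap of Fig. 4 can be sketched as follows (illustrative Python; the function name is not from the patent): while the division stage streams out the results for one row, the exponent-and-accumulation stage already processes the next row.

```python
import math

def softmax_forward_pipelined(M):
    """Two-stage pipelined forward pass over the rows of M (a list of row vectors).

    Stage 1: exp + accumulate for the current row.
    Stage 2: division for the previous row, using the cached stage-1 results.
    """
    results = []
    pending = None                       # (exp values, sum) handed from stage 1 to stage 2
    for row in M + [None]:               # one extra step to drain the pipeline
        if pending is not None:          # stage 2: divide the previous row
            exps, total = pending
            results.append([e / total for e in exps])
        if row is not None:              # stage 1: exponent and accumulation
            exps = [math.exp(x) for x in row]
            pending = (exps, sum(exps))
        else:
            pending = None
    return results

print(softmax_forward_pipelined([[0.5, 1.0], [-0.3, 2.1]]))
```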
Reconfigurable backward propagation path:
in the back propagation phase, the gradient propagated by the subsequent network structure of Softmax is dn, dm can be calculated by the following formula:
dm=dn·(diag(n)-n T ·n)
with element dm in dm i For example, (1. Ltoreq.i.ltoreq.s), the calculation process can be expanded as follows:
for example:
the structure of the backward propagation training path of the Softmax module is shown in fig. 6, and when used for the backward propagation calculation, the forward propagation calculation result n is stored in the RAM1 1 ,n 2 ,…,n s Stored in RAM2 is a gradient value dn delivered by a Softmax subsequent network structure, … 1 ,dn 2 ,…,dn s …. In calculating the gradient dm of vector m, the back propagation path reconstructs the internal data stream through the multiplexer, thereby accurately completing the calculation of each element in dm.
Taking dm_1 as an example, in the 1st clock cycle of the computation the connections and data flow of the backward propagation path are as shown in Fig. 7:
(1) The solid black line in the figure indicates the active path controlled by the multiplexer and the dashed line indicates the unselected path.
(2) The data read from RAM1 is n_1 and the data read from RAM2 is dn_1.
(3) The inputs of the left multiplier B1 are n_1 and −n_1, and its output is −n_1·n_1.
(4) The inputs of the right multiplier B2 are n_1 and dn_1, and the output of B2 is n_1·dn_1.
(5) The left input of adder A1 is n_1·dn_1 and its upper input is 0, indicating that accumulation starts; the output of A1 in the current cycle is n_1·dn_1.
In the t-th clock cycle (2 ≤ t ≤ s+1) of the computation of dm_1, the connections and data flow of the backward propagation path are as shown in Fig. 8 and have the following features:
(1) The data read from RAM1 is n_t and the data read from RAM2 is dn_{t−1}.
(2) The left input of multiplier B1 is held at n_1 under the control of the multiplexer, the upper input of B1 is −n_t, and the output of B1 in the current cycle is −n_1·n_t.
(3) The upper input of multiplier B2 is dn_{t−1}, and its left input is the output of B1 in the previous cycle, −n_1·n_{t−1}; the output of B2 is therefore −n_1·n_{t−1}·dn_{t−1}.
(4) The left input of adder A1 is the output of B2, i.e. −n_1·n_{t−1}·dn_{t−1}, and the upper input of A1 is controlled by the multiplexer to be A1's own output of the previous cycle, so that A1 acts as an accumulator.

After s+1 cycles, the accumulated output of A1 is dm_1 = n_1·dn_1 − n_1·Σ_{j=1}^{s} n_j·dn_j.
The computation process of the other elements of dm is the same as that of dm_1.
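The cycle-by-cycle behaviour described above can be emulated in software as follows (an illustrative Python sketch, not RTL); it reproduces the s+1 accumulation cycles for dm_1 and checks the result against the expanded formula.

```python
def emulate_dm1_cycles(n, dn):
    """Cycle-by-cycle emulation of the backward path while computing dm_1
    (n and dn model the contents of RAM1 and RAM2, indexed from 0 here)."""
    s = len(n)
    # cycle 1: B1 <- (n_1, -n_1), B2 <- (n_1, dn_1), A1 starts accumulating from 0
    b1_out = -n[0] * n[0]
    a1_out = n[0] * dn[0]
    # cycles t = 2 .. s+1
    for t in range(2, s + 2):
        b2_out = b1_out * dn[t - 2]            # B2 left input = B1 output of the previous cycle
        a1_out = a1_out + b2_out               # A1 upper input = its own previous output (accumulator)
        n_t = n[t - 1] if t - 1 < s else 0.0   # RAM1 read; the final read is never consumed
        b1_out = -n[0] * n_t                   # B1 left input held at n_1, upper input -n_t
    return a1_out                              # equals dm_1

# sanity check against dm_1 = n_1*dn_1 - n_1 * sum_j(n_j*dn_j)
n = [0.1, 0.2, 0.3, 0.4]
dn = [0.5, -0.1, 0.2, 0.05]
expected = n[0] * dn[0] - n[0] * sum(a * b for a, b in zip(n, dn))
assert abs(emulate_dm1_cycles(n, dn) - expected) < 1e-12
```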
Examples
Taking the training process of a Transformer model in a natural language processing task as an example, the improvement in computational efficiency of the serialized, pipelined forward propagation path over the conventional parallelized design is illustrated. The sample lengths of the natural-language-processing dataset lie between 1 and 128, i.e. the vector length that the Softmax function must process satisfies 1 ≤ s ≤ 128. In the conventional parallelized design, a hardware design with parallelism 128 is required in order to complete the Softmax computation correctly (as shown in Fig. 2). Assuming that the mean of all sample lengths s of the dataset is μ, the hardware utilization of the parallelized design over the whole training process is η_1 = μ/128, while in the forward propagation path of the serialized, pipelined scheme provided by the invention the hardware resource utilization over the whole training process is η_2 = μ/(μ+1). Comparing the two utilizations, η_1 ≥ η_2 only when μ ≥ 127. In practical training the mean μ of the sample length s (1 ≤ s ≤ 128) is far smaller than 128, so η_2 is much larger than η_1. The hardware resource utilization and computational efficiency of the serialized, pipelined forward propagation path are therefore superior to those of the conventional parallelized design.
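A minimal numeric check of this comparison, assuming the utilization expressions η_1 = μ/128 and η_2 = μ/(μ+1) above; the function names are illustrative.

```python
def eta_parallel(mu, P=128):
    """Average utilization of a parallelism-P design for mean sequence length mu."""
    return mu / P

def eta_pipelined(mu):
    """Average utilization of the serialized, two-stage pipelined forward path."""
    return mu / (mu + 1)

for mu in (16, 32, 64, 127, 128):
    print(mu, round(eta_parallel(mu), 3), round(eta_pipelined(mu), 3))
# eta_parallel >= eta_pipelined only once mu >= 127
```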
The invention provides a device for training a Softmax function in a large language model, and there are many methods and ways to implement this technical scheme. The above is only a preferred embodiment of the invention, and it should be noted that a person skilled in the art may make several improvements and modifications, which should also be regarded as falling within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented using the prior art.

Claims (10)

1. A device for Softmax function training in a large language model, the large language model being a Transformer class model, characterized in that the upper half of the device is a forward propagation path and the lower half is a reverse propagation path;
the forward propagation path includes e x Exponential function unit, adder and divider, at e x Registers are inserted among the exponent function unit, the adder and the divider;
the backward propagation path comprises two multipliers and an adder A 1 And a multiplexer MUX for reconstructing the data path; the two multipliers are respectively left multiplier B 1 Right multiplier B 2
The multiplexer MUX is used for changing the data flow inside the device;
the forward propagation path and the backward propagation path share two random access memories RAM1 and RAM2.
2. The apparatus of claim 1, wherein in the forward propagation phase the Softmax function is computed row by row on an M matrix, a row vector of the matrix M being m = (m_1, m_2, …, m_s), where m_s is the s-th element of the vector m, the subscript indicates the position of the element in the vector, and the length of the vector m is s; the vector produced by the Softmax calculation is n = (n_1, n_2, …, n_s), and the i-th element n_i of n is computed as follows:

n_i = e^{m_i} / Σ_{j=1}^{s} e^{m_j}    (1)

where 1 ≤ i ≤ s.
3. The apparatus of claim 2, wherein the calculation of equation (1) comprises three stages: computation of e^{m_i} for each element, summation Σ_{j=1}^{s} e^{m_j}, and division.
4. The apparatus of claim 3, wherein the forward propagation path performs the forward propagation computation using serialized processing: the input data are the elements m_1, m_2, …, m_s of the vector m and are stored in RAM1; after the computation starts, m_1, m_2, …, m_s are fetched from RAM1 in sequence and first pass through the e^x exponential function unit to complete the exponential operation; the e^x results e^{m_1}, e^{m_2}, …, e^{m_s} of the elements are accumulated in the adder, and at the same time the e^x results are stored in RAM2; after all elements of the vector m have completed the exponential operation, the adder has accumulated Σ_{j=1}^{s} e^{m_j}, which is stored in RAM2; then Σ_{j=1}^{s} e^{m_j} is sent to the divisor input port of the divider, and the e^x results are fetched from RAM2 and sequentially sent to the dividend input port of the divider, whose outputs are, in turn, n_1, n_2, …, n_s;

in the serialized processing, the division cannot start until all elements of the vector m have completed the e^x exponential operation and the accumulated result has been obtained; the computation is divided into two stages, the e^x exponentiation-and-accumulation stage and the division stage, which are executed in a pipelined manner.
5. The apparatus of claim 4, wherein the input matrix during the forward propagation of the Softmax function is M ∈ R^{s×s}, a real matrix with s rows and s columns, and the row vectors of the matrix M are denoted in turn m_1, m_2, …, m_s, where m_s is the vector of the s-th row of M, m_s = (m_{s1}, m_{s2}, …, m_{ss}), m_{ss} is the s-th element of m_s, the subscripts indicate the position of the element in the vector, and the length of m_s is s; when m_1 has finished the exponent-and-accumulation operation and moves on to the division operation, m_2 starts a new exponent-and-accumulation operation without waiting for the computation of m_1 to complete; the computations of the two stages are performed in parallel.
6. The apparatus of claim 5, wherein the backward propagation phase is performed on the backward propagation path, with dn being the gradient propagated back by the network structure following Softmax, the following network structure being the computation after the Softmax function in the Transformer-class model; dn denotes the gradient of the final output of the model with respect to the output data n of the forward-propagation Softmax function; in the backward propagation stage, the gradient dm of the final output of the model with respect to the vector m needs to be computed.
7. The apparatus of claim 6, wherein dm is calculated by the following formula:

dm = dn · (diag(n) − n^T · n)

where diag denotes the diagonal-matrix function and T denotes the matrix transpose; the computation of an element dm_i of dm expands to:

dm_i = n_i · dn_i − n_i · Σ_{j=1}^{s} n_j · dn_j

where 1 ≤ i ≤ s.
8. The apparatus according to claim 7, wherein when the backward propagation path of the apparatus is used for the backward propagation calculation, RAM1 stores the forward propagation results n_1, n_2, …, n_s, and RAM2 stores the gradient values dn_1, dn_2, …, dn_s delivered by the network structure following Softmax, where dn_s is the s-th element of the vector dn, the subscript indicates the position of the element in the vector, and the length of the vector dn is s.
9. The apparatus according to claim 8, wherein during the calculation of the gradient dm of the vector m, the backward propagation path reconfigures the internal data flow through the multiplexer, thereby completing the calculation of each element of dm; for an arbitrary element dm_i of dm, the calculation proceeds as follows:

in the 1st clock cycle of the calculation of dm_i, the connections of the backward propagation path have the following characteristics:

the data read from RAM1 is n_i and the data read from RAM2 is dn_i;

the inputs of the left multiplier B1 are n_i and −n_i, and its output is −n_i·n_i;

the inputs of the right multiplier B2 are n_i and dn_i, and the output of B2 is n_i·dn_i;

the left input of adder A1 is n_i·dn_i and its upper input is 0, indicating that accumulation starts; the output of A1 in the current cycle is n_i·dn_i;

in the t-th clock cycle of the calculation of dm_i, with 2 ≤ t ≤ s+1, the connections of the backward propagation path have the following characteristics:

the data read from RAM1 is n_t, where n_t is the t-th element of the vector n, the subscript indicates the position of the element in the vector, and all elements of the vector n are stored in RAM1; the data read from RAM2 is dn_{t−1}, where dn_{t−1} is the (t−1)-th element of the vector dn, the subscript indicates the position of the element in the vector, and all elements of the vector dn are stored in RAM2;

the left input of the left multiplier B1 is held at n_i under the control of the multiplexer MUX, the upper input of B1 is −n_t, and the output of B1 in the current cycle is −n_i·n_t;

the upper input of the right multiplier B2 is dn_{t−1}, its left input is the output of B1 in the previous cycle, −n_i·n_{t−1}, and the output of B2 is −n_i·n_{t−1}·dn_{t−1};

the left input of adder A1 is the output of B2, −n_i·n_{t−1}·dn_{t−1}, and the upper input of A1 is controlled by the multiplexer MUX to be A1's own output of the previous cycle, so that A1 acts as an accumulator;

after s+1 cycles, the accumulated output of A1 is n_i·dn_i − n_i·Σ_{j=1}^{s} n_j·dn_j, and dm_i has now been computed.
10. The apparatus of claim 9, wherein the computation is performed in parallel by two or more of the devices.
CN202310881111.7A 2023-07-18 2023-07-18 Device for training Softmax function in large language model Pending CN116822616A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310881111.7A CN116822616A (en) 2023-07-18 2023-07-18 Device for training Softmax function in large language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310881111.7A CN116822616A (en) 2023-07-18 2023-07-18 Device for training Softmax function in large language model

Publications (1)

Publication Number Publication Date
CN116822616A true CN116822616A (en) 2023-09-29

Family

ID=88139161

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310881111.7A Pending CN116822616A (en) 2023-07-18 2023-07-18 Device for training Softmax function in large language model

Country Status (1)

Country Link
CN (1) CN116822616A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117349032A (en) * 2023-12-05 2024-01-05 城云科技(中国)有限公司 Method and device for improving throughput of large language model
CN117349032B (en) * 2023-12-05 2024-02-20 城云科技(中国)有限公司 Method and device for improving throughput of large language model

Similar Documents

Publication Publication Date Title
Yu et al. Lite-hrnet: A lightweight high-resolution network
Liu et al. Group fisher pruning for practical network compression
Chen et al. ReGAN: A pipelined ReRAM-based accelerator for generative adversarial networks
KR101781057B1 (en) Vector processing engine with merging circuitry between execution units and vector data memory, and related method
KR20160085337A (en) Vector processing engines employing a tapped-delay line for filter vector processing operations, and related vector processor systems and methods
US11593907B2 (en) System and methods for computing 2-D convolutions and cross-correlations
Xu et al. Reconfigurable and low-complexity accelerator for convolutional and generative networks over finite fields
CN113033794B (en) Light weight neural network hardware accelerator based on deep separable convolution
CN116822616A (en) Device for training Softmax function in large language model
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
Stevens et al. Manna: An accelerator for memory-augmented neural networks
US9058541B2 (en) Object detection method, object detector and object detection computer program
CN112836813A (en) Reconfigurable pulsation array system for mixed precision neural network calculation
Mao et al. F-DNA: Fast convolution architecture for deconvolutional network acceleration
Niu et al. SPEC2: Spectral sparse CNN accelerator on FPGAs
Xu et al. Using Fermat number transform to accelerate convolutional neural network
Jin et al. Sparse ternary connect: Convolutional neural networks using ternarized weights with enhanced sparsity
Nguyen-Thanh et al. Energy efficient techniques using FFT for deep convolutional neural networks
Chiper et al. An efficient unified framework for implementation of a prime-length DCT/IDCT with high throughput
CN109669666A (en) Multiply accumulating processor
CN112528224B (en) Matrix eigenvalue decomposition grouping circulation iteration flow realization method and system
CN115034360A (en) Processing method and processing device for three-dimensional convolution neural network convolution layer
CN114021070A (en) Deep convolution calculation method and system based on micro-architecture processor
Le et al. An opencl-based sift accelerator for image features extraction on fpga in mobile edge computing environment
CN114022366A (en) Image size adjusting structure based on data stream architecture, image size adjusting method based on data stream architecture and image size adjusting equipment based on data stream architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination