CN118095351A - Cooperative processing device and method for layer normalization calculation
- Publication number: CN118095351A (application CN202410437757.0A)
- Authority: CN (China)
- Prior art keywords: accumulation, layer normalization, unit, local, parameter
- Legal status: Granted
Classifications
- G06N3/0442—Recurrent networks, e.g. Hopfield networks, characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06F18/10—Pattern recognition: pre-processing; data cleansing
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/08—Learning methods
Abstract
The disclosure relates to the technical field of neural networks, and in particular to a cooperative processing device and method for layer normalization calculation. The device comprises K layer normalization processing units, where K is a positive integer greater than 1. The K layer normalization processing units are used for sharing their respective local accumulation results and cooperatively processing the layer normalization calculation; a local accumulation result comprises the accumulation result of parameters related to the input data, and the input data is the local portion of the original data input to that layer normalization processing unit. According to the embodiments of the disclosure, cooperative processing of layer normalization calculation can be realized by having multiple nodes (namely the K layer normalization processing units) share their respective local accumulation results, thereby realizing distributed calculation acceleration and improving the layer normalization calculation efficiency of the neural network.
Description
Technical Field
The disclosure relates to the technical field of neural networks, and in particular relates to a cooperative processing device and method for layer normalization calculation.
Background
The layer normalization operator is a nonlinear operator commonly used in neural network algorithms. It normalizes the data so that the neural network is more stable during training and its performance is improved. During the reduction calculation, all intermediate results of the calculation need to be aggregated to obtain the calculation parameters before the reduction calculation can be completed. Reduction computation is therefore one of the key factors affecting the computational performance of the algorithm.
In the related art, for large language models based on the Transformer network (English: Transformer), reduction calculation becomes a bottleneck of computational performance due to their huge scale and complicated calculation requirements. Large-model algorithms contain a large number of matrix-vector multiplication operations whose results require reduction calculation; however, because the vectors are long, completing the reduction in a single calculation unit takes a large amount of time and greatly degrades the computational performance of the computing system. No reasonable and effective processing approach is currently available.
Disclosure of Invention
In view of this, the disclosure proposes a cooperative processing device and method for layer normalization computation.
According to an aspect of the present disclosure, there is provided a cooperative processing apparatus for layer normalization computation, the apparatus including K layer normalization processing units, K being a positive integer greater than 1;
The K layer normalization processing units are used for sharing their respective local accumulation results and cooperatively processing the layer normalization calculation; the local accumulation result comprises the accumulation result of parameters related to the input data, and the input data is the local portion of the original data that is input to the layer normalization processing unit.
In one possible implementation manner, each layer normalization processing unit further comprises a layer normalization calculation processing unit and a layer normalization parameter router;
the layer normalization calculation processing unit is used for preprocessing the input data to obtain the local accumulation result;
The layer normalization parameter router is used for sending the local accumulation results to other layer normalization processing units for sharing and receiving K-1 local accumulation results sent by the other layer normalization processing units;
the layer normalization calculation processing unit is further used for performing reduction calculation on the input data by adopting a layer normalization operator according to the shared K local accumulation results to obtain a calculation result.
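Illustratively, and purely as a software sketch of this possible implementation (the Python function names and the choice of the RMS-LayerNorm form described later are assumptions, not the claimed hardware), the preprocessing, sharing and reduction flow may be modelled as follows:

```python
import math

def preprocess(local_data):
    # Local accumulation result: accumulation sum of the squared data
    # (target parameters) plus the local accumulation number.
    return {"sum_sq": sum(x * x for x in local_data), "count": len(local_data)}

def cooperative_rms_layernorm(all_inputs, eps=1e-5):
    # Step 1: each unit preprocesses its own input data.
    local_results = [preprocess(data) for data in all_inputs]
    # Step 2: the K local accumulation results are shared, so every unit
    # can form the same global accumulation result.
    total_sq = sum(r["sum_sq"] for r in local_results)
    total_n = sum(r["count"] for r in local_results)
    # Step 3: each unit performs the reduction calculation on its own data.
    rms = math.sqrt(total_sq / total_n + eps)
    return [[x / rms for x in data] for data in all_inputs]

raw = [float(v) for v in range(1, 9)]        # original data, N = 8
groups = [raw[0:4], raw[4:8]]                # K = 2 sets of input data
print(cooperative_rms_layernorm(groups))
```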
In another possible implementation manner, the K layer normalization parameter routers are connected by adopting a ring routing structure.
In another possible implementation manner, each layer normalization processing unit further comprises a memory unit and a direct memory access unit;
The memory unit is used for storing the input data;
The direct memory access unit is used for acquiring the stored input data and scheduling the input data to the layer normalization calculation processing unit;
the layer normalization computing processing unit is further configured to obtain the scheduled input data.
In another possible implementation manner, each layer normalization calculation processing unit includes a calculation unit, a preprocessing accumulation unit, and a preprocessing calculation unit;
the preprocessing accumulation unit is used for preprocessing the input data to obtain the local accumulation result;
The preprocessing accumulation unit is also used for accumulating the shared K local accumulation results to obtain a global accumulation result;
The preprocessing calculation unit is used for carrying out parameter calculation according to the global accumulation result to obtain intermediate parameters of the layer normalization calculation;
And the calculation unit is used for carrying out reduction calculation by adopting the layer normalization operator according to the intermediate parameter and the input data to obtain the calculation result.
In another possible implementation, the preprocessing accumulation unit includes a selector, an accumulation unit, and a local counter, the local accumulation result includes a parameter accumulation sum and a local accumulation number,
The selector is used for sending the input data to the accumulation unit;
the accumulation unit is used for accumulating parameters related to the input data to obtain the parameter accumulation sum;
The local counter is used for counting the accumulation times of the parameters related to the input data to obtain the local accumulation times.
In another possible implementation, the preprocessing accumulation unit further includes an accumulation completion unit,
The accumulation completion unit is used for generating a first effective signal when the local accumulation result is obtained through accumulation, and the first effective signal is used for indicating the accumulation unit to output the local accumulation result to a corresponding layer normalization parameter router.
In another possible implementation, the preprocessing accumulation unit further includes a waiting remote unit,
The local counter is further configured to control the waiting remote unit to generate a second valid signal, where the second valid signal is used to instruct the selector to send the received local accumulation result shared by the other layer normalization computation processing units to the accumulation unit;
the accumulation unit is also used for accumulating the received K-1 local accumulation results and the local accumulation results to obtain the global accumulation result.
In another possible implementation, the preprocessing accumulation unit further includes a remote completion unit,
The remote completion unit is used for controlling the accumulation completion unit to generate a third effective signal when the global accumulation result is obtained by accumulation, and the third effective signal is used for indicating the accumulation unit to transmit the global accumulation result to the preprocessing calculation unit for calculation;
The accumulation completion unit is also used for resetting the waiting remote unit, the local counter and the accumulation unit.
In another possible implementation, the layer normalized parameter router includes a remote parameter queue unit and a remote parameter register;
the remote parameter queue unit is used for receiving the local accumulation results sent by other layer normalization processing units;
The remote parameter register is used for registering the local accumulation results sent by other layer normalization processing units, and inputting the local accumulation results to the layer normalization calculation processing unit through a target data channel, wherein the target data channel is a data channel between the layer normalization parameter router and the layer normalization calculation processing unit.
In another possible implementation, the layer normalization parameter router further includes a local parameter register and a data selector;
the local parameter register is used for registering the local accumulation result of the current layer normalization processing unit;
The data selector is used for sending the local accumulation result to the layer normalization parameter routers of other layer normalization processing units for sharing.
In another possible implementation, the layer normalization parameter router further includes a data selector, a local register, and a remote register;
the local register is used for setting the value of the local register to be a first numerical value when the local accumulation result is input into the layer normalization calculation processing unit;
The remote register is used for setting the value of the remote register to be a first numerical value when the local accumulation result is sent to other layer normalization parameter routers through the data selector;
The remote parameter register is further configured to clear the remote parameter register when the values of the local register and the remote register are both the first value.
In another possible implementation, the local accumulation result is transmitted in the form of data packets between the layer normalized parameter routers, and the remote parameter queuing unit is further configured to:
receiving the data packet sent by other layer normalization processing units, wherein the data packet comprises a plurality of flits;
and splicing the plurality of flits in the data packet to obtain the local accumulation result.
In another possible implementation manner, the input data of each layer normalization processing unit includes a plurality of data, the parameter related to the input data includes a plurality of target parameters, the local accumulation result includes an accumulation sum of a plurality of the target parameters and a local accumulation number, the target parameters are squares of the data, and the local accumulation number is an accumulation total number of the plurality of target parameters; or alternatively
The parameters related to the input data include a plurality of the data and a plurality of the target parameters, and the local accumulation result includes an accumulation sum of the plurality of the data, an accumulation sum of the plurality of the target parameters, and the local accumulation number.
According to another aspect of the present disclosure, there is provided a cooperative processing method of layer normalization computation, for use in an apparatus provided in the first aspect or any one of the possible implementation manners of the first aspect, where the method includes:
Each layer normalization processing unit preprocesses the input data to obtain the local accumulation result;
Each layer normalization processing unit shares the local accumulation result to other layer normalization processing units;
And each layer normalization processing unit performs cooperative processing of layer normalization calculation according to the shared K local accumulation results.
In one possible implementation manner, each layer normalization processing unit includes a layer normalization parameter router, and each layer normalization processing unit shares the local accumulation result to other layer normalization processing units, including:
For each layer normalization processing unit, sending the local accumulation result to other layer normalization processing units for sharing through the layer normalization parameter router;
the method further comprises the steps of:
and for each layer normalization processing unit, receiving K-1 local accumulation results sent by other layer normalization processing units through the layer normalization parameter router.
In another possible implementation manner, the K layer normalization parameter routers are connected by adopting a ring routing structure.
In another possible implementation manner, each layer normalization processing unit includes a memory unit, a direct memory access unit, and a layer normalization calculation processing unit; the method further comprises the steps of:
For each of the layer normalization processing units, acquiring the input data stored in the memory unit through the direct memory access unit;
scheduling the input data to the layer normalization calculation processing unit through the direct memory access unit;
And for each layer normalization processing unit, acquiring the scheduled input data through the layer normalization computing processing unit.
In another possible implementation manner, each layer normalization calculation processing unit includes a calculation unit, a preprocessing accumulation unit, and a preprocessing calculation unit;
Each layer normalization processing unit performs preprocessing on the input data to obtain a local accumulation result, and the method comprises the following steps:
for each layer normalization processing unit, preprocessing the input data through the preprocessing accumulation unit to obtain the local accumulation result;
Each layer normalization processing unit performs cooperative processing of layer normalization calculation according to the shared K partial accumulation results, including:
For each layer normalization processing unit, accumulating the shared K local accumulation results through the preprocessing accumulation unit to obtain a global accumulation result;
After accumulation is completed, parameter calculation is carried out through the preprocessing calculation unit according to the global accumulation result to obtain intermediate parameters of the layer normalization calculation;
And carrying out reduction calculation by the calculation unit according to the intermediate parameters and the input data by adopting the layer normalization operator to obtain the calculation result.
In another possible implementation manner, the preprocessing accumulation unit includes a selector, an accumulation unit and a local counter, the local accumulation result includes a parameter accumulation sum and a local accumulation number, the preprocessing accumulation unit performs preprocessing on the input data to obtain the local accumulation result, and the method includes:
Transmitting the input data to the accumulating unit through the selector;
And accumulating the parameters related to the input data through the accumulation unit to obtain the parameter accumulation sum, and counting the accumulation times of the parameters related to the input data through the local counter to obtain the local accumulation times.
In another possible implementation manner, the preprocessing accumulation unit further includes an accumulation completion unit, and the method further includes:
when the local accumulation result is obtained through accumulation, a first effective signal is generated through the accumulation completion unit, and the first effective signal is used for indicating the accumulation unit to output the local accumulation result to a corresponding layer normalization parameter router.
In another possible implementation, the preprocessing accumulation unit further includes a waiting remote unit, the method further including:
Controlling the waiting remote unit to generate a second effective signal through the local counter, wherein the second effective signal is used for instructing the selector to send the received local accumulation results shared by other layer normalization calculation processing units to the accumulation unit;
The step of accumulating the shared K local accumulation results through the preprocessing accumulation unit to obtain a global accumulation result comprises the following steps:
and accumulating the received K-1 local accumulated results and the local accumulated results through the accumulating unit to obtain the global accumulated result.
In another possible implementation manner, the preprocessing accumulation unit further includes a remote completion unit, and the method further includes:
when the global accumulation result is obtained through accumulation, the remote completion unit controls the accumulation completion unit to generate a third effective signal, wherein the third effective signal is used for indicating the accumulation unit to transmit the global accumulation result to the preprocessing calculation unit for calculation;
And resetting the waiting remote unit, the local counter and the accumulation unit through the accumulation completion unit.
In another possible implementation, the layer normalized parameter router includes a remote parameter queue unit and a remote parameter register; the receiving, by the layer normalization parameter router, the K-1 local accumulation results sent by other layer normalization processing units, including:
Receiving the local accumulation results sent by other layer normalization processing units through the remote parameter queue unit for each of the K-1 local accumulation results;
the method further comprises the steps of:
Registering the local accumulation result in the remote parameter register;
And inputting the local accumulation result to a layer normalization calculation processing unit through a target data channel by the remote parameter register, wherein the target data channel is a data channel between the layer normalization parameter router and the layer normalization calculation processing unit.
In another possible implementation manner, the layer normalization parameter router further includes a local parameter register and a data selector, and the sending, by the layer normalization parameter router, the local accumulation result to other layer normalization processing units for sharing includes:
registering the local accumulation result of the current layer normalization processing unit through the local parameter register;
and sending the local accumulation result to the layer normalization parameter routers of other layer normalization processing units through the data selector for sharing.
In another possible implementation, the layer normalization parameter router further includes a data selector, a local register, and a remote register; the method further comprises the steps of:
when the local accumulation result is input into the layer normalization computation processing unit, setting the value of the local register to a first numerical value;
setting the value of the remote register to a first numerical value when the local accumulation result is sent to other layer normalization parameter routers through the data selector;
and when the values of the local register and the remote register are the first numerical value, resetting the remote parameter register.
In another possible implementation manner, the local accumulation results are transmitted between the layer normalization parameter routers in the form of data packets, and the receiving, by the remote parameter queue unit, the local accumulation results sent by the other layer normalization processing units includes:
Receiving the data packet sent by other layer normalization processing units through the remote parameter queue unit, wherein the data packet comprises a plurality of flits;
and splicing the plurality of flits in the data packet through the remote parameter queue unit to obtain the local accumulation result.
In another possible implementation, the input data of each of the layer normalization processing units comprises a plurality of data,
The parameters related to the input data comprise a plurality of target parameters, the local accumulation result comprises accumulation sums of a plurality of target parameters and local accumulation times, the target parameters are squares of the data, and the local accumulation times are accumulation total times of the plurality of target parameters; or alternatively
The parameters related to the input data include a plurality of the data and a plurality of the target parameters, and the local accumulation result includes an accumulation sum of the plurality of the data, an accumulation sum of the plurality of the target parameters, and the local accumulation number.
The embodiment of the disclosure designs a multi-node (namely K layer normalization processing units) cooperative processing architecture for layer normalization computation, and designs a preprocessing mechanism for the layer normalization calculation requirements. The preprocessing mechanism is used to perform local data processing in each node (namely each layer normalization processing unit); after the local data processing is completed and the local accumulation result is obtained, the multiple nodes share their respective local accumulation results to realize cooperative processing of the layer normalization calculation, without needing to share intermediate data, thereby realizing distributed calculation acceleration and improving the layer normalization calculation efficiency of the neural network.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a schematic architecture diagram of a layer normalized computing co-processing apparatus provided in an exemplary embodiment of the present disclosure.
Fig. 2 illustrates an architecture diagram of a cooperative processing device for layer normalization computation according to another exemplary embodiment of the present disclosure.
Fig. 3 shows a schematic architecture diagram of a layer normalization processing unit provided in an exemplary embodiment of the present disclosure.
Fig. 4 shows a schematic architecture diagram of a preprocessing accumulation unit provided in an exemplary embodiment of the present disclosure.
Fig. 5 shows a schematic architecture diagram of a preprocessing accumulation unit provided in another exemplary embodiment of the present disclosure.
Fig. 6 shows a schematic architecture diagram of a layer normalized parameter router provided by an exemplary embodiment of the present disclosure.
Fig. 7 shows a flowchart of a collaborative processing method of layer normalization computation provided by an exemplary embodiment of the present disclosure.
FIG. 8 is a block diagram illustrating a cooperative processing apparatus for layer normalization computation, according to an exemplary embodiment.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
First, an application scenario related to an embodiment of the present disclosure will be described. Referring to fig. 1, a schematic architecture diagram of a cooperative processing apparatus for layer normalization computation according to an exemplary embodiment of the present disclosure is shown. The apparatus 10 may be implemented as a computing device, which may be a terminal or a server. The device 10 includes K acceleration units, where K is a positive integer greater than 1, and each acceleration unit may include one layer normalization processing unit, that is, the device 10 may include K layer normalization processing units (i.e., layer normalization processing unit 1, layer normalization processing unit 2, …, layer normalization processing unit K).
An acceleration unit is a hardware device that is used to accelerate a particular type of computing task. The acceleration units may be field programmable gate array (Field Programmable Gate Array, FPGA) devices, graphics processing units (Graphics Processing Unit, GPU), central processing units (Central Processing Unit, CPU), neural processing units (Neural Processing Unit, NPU), or the like.
Each layer normalization processing unit is used for processing the calculation acceleration of the layer normalization operator and the reduction calculation of the layer normalization operator in the neural network calculation. The reduction calculation refers to an aggregation operation, such as summation, average, maximum, etc., on a set of data in the neural network. In the layer normalization operator, reduction calculations can be used to calculate the mean and variance of the data in order to normalize the input data.
The embodiment of the disclosure designs a dedicated acceleration unit for the calculation requirements of the layer normalization operator, realizes cooperative processing of the layer normalization calculation by multiple nodes (namely the K layer normalization processing units) through a preprocessing mechanism, and can realize distributed calculation during multi-node cooperative computation merely by sharing the respective local accumulation results.
Optionally, each layer normalization processing unit of the K layer normalization processing units is configured to obtain input data, where the input data is local data input to the layer normalization processing unit in the original data; preprocessing input data to obtain a local accumulation result, wherein the local accumulation result comprises an accumulation result of parameters related to the input data; sharing the local accumulation result to other layers of normalization processing units; and carrying out cooperative processing of layer normalization calculation according to the shared K partial accumulation results.
The original data may be data with a length greater than a preset length threshold. The raw data may include N data, the raw data being divided into K sets of input data, each set of input data including a plurality of data, N and K each being a positive integer greater than 1. For example, if N is 1000 and K is 10, the original data includes 1000 data and is divided into 10 sets of input data, each set including 100 data. One data item may be an activation value in a large language model. The embodiments of the present disclosure are not limited in this regard.
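A minimal sketch of this partitioning, assuming a simple contiguous split and hypothetical names:

```python
def split_raw_data(raw_data, k):
    # Divide the original data into K groups of input data, one group per layer
    # normalization processing unit, with no intersection between groups.
    n = len(raw_data)
    group_size = (n + k - 1) // k            # ceil(N / K)
    return [raw_data[i:i + group_size] for i in range(0, n, group_size)]

raw = list(range(1000))                      # N = 1000 data
groups = split_raw_data(raw, 10)             # K = 10 groups
assert len(groups) == 10 and all(len(g) == 100 for g in groups)
```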
Optionally, the K sets of input data are respectively input into the K layer normalization processing units, so that each layer normalization processing unit in the K layer normalization processing units obtains a set of input data. That is, the raw data includes input data of each of the K layer normalization processing units, and there is no intersection between the input data of each of the K layer normalization processing units.
Each layer of normalization processing unit is used for preprocessing input data, wherein the preprocessing comprises accumulation and calculation of preprocessing the input data, such as squaring each data in the input data, accumulating the squares of the input data, and counting the accumulation times of the squares of the input data, or directly accumulating the input data, so as to obtain a local accumulation result.
Wherein the local accumulation result comprises an accumulation result of a parameter associated with the input data. Because the input data of each layer of normalization processing units comprises a plurality of data, optionally, the parameters related to the input data can comprise a plurality of target parameters, and the corresponding local accumulation result comprises an accumulation sum and a local accumulation number of the target parameters, wherein one target parameter is the square of one data, and the local accumulation number is the accumulation total number of the target parameters. Optionally, the parameters related to the input data may also include a plurality of data and a plurality of target parameters, and the local accumulation result includes an accumulation sum of the plurality of data, an accumulation sum of the plurality of target parameters, and a local accumulation number.
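The two forms of the local accumulation result described above can be sketched as follows (hypothetical names; the target parameter is the square of each data item):

```python
def local_accumulation_result(input_data, include_data_sum=False):
    result = {
        "sum_sq": sum(x * x for x in input_data),   # accumulation sum of the target parameters
        "count": len(input_data),                   # local accumulation number
    }
    if include_data_sum:
        result["sum_x"] = sum(input_data)           # accumulation sum of the data themselves
    return result

print(local_accumulation_result([1.0, 2.0, 3.0]))                         # first form
print(local_accumulation_result([1.0, 2.0, 3.0], include_data_sum=True))  # second form
```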
Each layer normalization processing unit sends the local accumulation result to other layer normalization processing units for sharing, and receives remote K-1 local accumulation results sent by other layer normalization processing units, so that each layer normalization processing unit obtains shared K local accumulation results, wherein the K local accumulation results comprise the local accumulation results and the received remote K-1 local accumulation results, namely the K local accumulation results are shared in the K layer normalization processing units.
Data sharing among the K layers of normalized processing units includes, but is not limited to, several possible implementations: in one possible implementation manner, each layer normalization processing unit sends the local accumulation result to other K-1 layer normalization processing units respectively; in another possible implementation manner, the K layer normalization processing units perform data sharing through respective layer normalization parameter routers, the K layer normalization parameter routers may be connected by adopting a ring routing structure, and each layer normalization processing unit sends a local accumulation result to a layer normalization processing unit of a next stage based on the layer normalization parameter routers connected by the ring routing structure, so that the layer normalization processing unit of the next stage transmits the received local accumulation result to the layer normalization processing unit of the next stage, and so on, thereby sharing the local accumulation result to other K-1 layer normalization processing units.
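The ring-based sharing option can be modelled as each unit repeatedly passing what it currently holds to the next-stage unit. The following sketch (hypothetical names, software only) shows that after K-1 forwarding steps every unit has seen all K local accumulation results:

```python
def ring_share(local_results):
    # Each unit forwards the result it currently holds to the next-stage unit;
    # after K-1 steps every unit has received the other K-1 local results.
    k = len(local_results)
    seen = [[r] for r in local_results]        # each unit starts with its own result
    in_flight = list(local_results)            # result currently held by each unit
    for _ in range(k - 1):
        in_flight = [in_flight[(i - 1) % k] for i in range(k)]   # pass along the ring
        for i in range(k):
            seen[i].append(in_flight[i])
    return seen

shared = ring_share(["r0", "r1", "r2", "r3"])                    # K = 4
assert all(sorted(s) == ["r0", "r1", "r2", "r3"] for s in shared)
```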
In one possible implementation, as shown in fig. 2, each layer normalization processing unit includes a layer normalization calculation processing unit, and a layer normalization parameter router connected to the layer normalization calculation processing unit. That is, the apparatus 10 includes K layer normalization calculation processing units (i.e., layer normalization calculation processing unit 1, layer normalization calculation processing unit 2, …, layer normalization calculation processing unit K) and K layer normalization parameter routers (i.e., layer normalization parameter router 1, layer normalization parameter router 2, …, layer normalization parameter router K). The K layer normalization parameter routers can be connected by adopting a ring routing structure or other routing structures. The embodiments of the present disclosure are not limited in this regard.
And each layer of normalization calculation processing unit is used for acquiring input data, and preprocessing the input data to obtain a local accumulation result.
Each layer normalization parameter router is used for acquiring the parameters (namely the local accumulation results) generated in the reduction calculation by the connected layer normalization calculation processing unit, and transmitting the acquired parameters to other layer normalization processing units through the other connected layer normalization parameter routers.
Each layer normalization parameter router is further configured to receive, via the connected other layer normalization parameter routers, parameters (i.e., local accumulation results) generated by the other layer normalization processing units in the reduction calculation. That is, parameters (i.e., local accumulation results) generated in the reduction calculation are shared among the K layers of normalized processing units of the apparatus 10, and can be automatically completed without intervention of a system-level scheduling controller, thereby improving the data synchronization efficiency of the apparatus 10 and reducing the calculation time.
Each layer normalization processing unit in the K layer normalization processing units is further used for carrying out cooperative processing of layer normalization calculation according to the shared K local accumulation results. Optionally, each layer normalization processing unit is further configured to accumulate the shared K local accumulation results to obtain a global accumulation result, and perform reduction calculation on the input data by using a layer normalization operator according to the global accumulation result to obtain a calculation result.
The global accumulation result accumulated by each layer normalization processing unit may be the same. The global accumulation result is the sum of the local accumulation results corresponding to the K groups of input data, where the K groups of input data form original data, that is, the K groups of input data include N data altogether, and in one possible implementation manner, the global accumulation result may include the accumulation sum of N target parameters and the accumulation total number N of N target parameters. In another possible implementation, the global accumulation result includes an accumulation sum of N data, an accumulation sum of N target parameters, and an accumulation total number N of N target parameters.
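Written out as equations (a restatement of the description above, with I_k denoting the index set of the input data held by the k-th unit), the global accumulation result is independent of how the original data was partitioned:

$$\sum_{k=1}^{K}\sum_{i\in I_k}x_i=\sum_{i=1}^{N}x_i,\qquad \sum_{k=1}^{K}\sum_{i\in I_k}x_i^{2}=\sum_{i=1}^{N}x_i^{2},\qquad \sum_{k=1}^{K}\lvert I_k\rvert=N.$$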
It should be noted that, the calculation result obtained by performing the reduction calculation on the input data using the layer normalization operator may refer to the details in the following embodiments, which are not described herein.
Aiming at the requirement of the device 10 for calculating a layer normalization operator in a neural network scene, the embodiment of the disclosure designs a scheme for supporting multi-node cooperative work and parameter sharing. The multi-node collaborative computing capability is provided for the layer normalization operator, so that the reduction computing efficiency in the layer normalization operator can be improved, and the computing performance of the computing equipment can be improved.
The neural network may be a deep learning network, a convolutional neural network (Convolutional Neural Network, CNN), a recurrent neural network (Recurrent Neural Network, RNN), a long short-term memory network (Long Short-Term Memory, LSTM), or the like. For example, the neural network is a neural network of a large language model.
The layer normalization operator is used for performing layer normalization calculation on the data, and the layer normalization operator comprises a layer normalization (Layer Normalization, layerNorm) operator or a root mean square layer normalization (Root Mean Square Layer Normalization, RMS-LayerNorm) operator. The data can be data to be subjected to layer normalization calculation, which is output by any layer in the neural network, for example, the data can be an activation value output by a full connection layer in the neural network. It should be appreciated that the data may be any data in the neural network that requires layer normalization calculations, which are not limited by the disclosed embodiments.
In summary, the embodiment of the disclosure designs a layer normalization computing architecture for cooperative processing by multiple nodes (i.e., K layer normalization processing units), and designs a preprocessing mechanism for the layer normalization computing requirements. The preprocessing mechanism is used to perform local data processing in each node (i.e., each layer normalization processing unit); after the local data processing is completed and a local accumulation result is obtained, the multiple nodes share their respective local accumulation results to realize cooperative processing of the layer normalization calculation without needing to share intermediate data, thereby realizing distributed computing acceleration and improving the layer normalization calculation efficiency of the neural network.
In one possible implementation, an architecture diagram of each layer of normalized processing units is shown in fig. 3. The internal architecture of the layer normalization processing unit may comprise five parts, namely a memory unit, a direct memory access unit, a control unit, a layer normalization calculation processing unit and a layer normalization parameter router. The layer normalization computing processing unit can comprise three parts, namely a computing unit, a preprocessing accumulation unit and a preprocessing computing unit. The internal units of the layer normalization processing unit may be increased or decreased according to actual needs, which is not limited by the embodiments of the present disclosure.
When the layer normalization processing unit performs calculation processing, the direct memory access unit performs data access between the memory unit and the layer normalization calculation processing unit. Acquiring input data stored in a memory unit through a direct memory access unit; scheduling the input data to a layer normalization calculation processing unit through a direct memory access unit; and acquiring scheduled input data through a layer normalization computing processing unit.
After the input data enter the layer normalization calculation processing unit, preprocessing is carried out on the input parameters, namely, the preprocessing accumulation unit is used for preprocessing the input data to obtain local accumulation results, and accumulating the shared K local accumulation results to obtain global accumulation results. And after the accumulation is completed, the preprocessing calculation unit is used for carrying out parameter calculation according to the global accumulation result to obtain intermediate parameters of the layer normalization calculation. Alternatively, the intermediate parameter may include an average value of N data included in the original data.
After the preprocessing process of layer normalization is completed and the final intermediate parameters of layer normalization calculation are obtained, a calculation unit is used for carrying out reduction calculation on input data by adopting a layer normalization operator according to the intermediate parameters and the input data to obtain a calculation result. In the preprocessing process, if calculation needs to be performed in a plurality of layer normalization processing units at the same time, local accumulation results are shared through the layer normalization parameter router. The control unit controls data calculation and data sharing operations in the whole-layer normalization processing unit.
The layer normalization operator can be the LayerNorm operator or the RMS-LayerNorm operator; that is, the method provided by the embodiment of the disclosure can support the calculation of both layer normalization operators. Illustratively, the calculation formula of the LayerNorm operator is as follows:

$$y_i=\gamma\cdot\frac{x_i-\mu}{\sqrt{\sigma^{2}+\epsilon}}+\beta,\qquad \mu=\frac{1}{N}\sum_{i=1}^{N}x_i,\qquad \sigma^{2}=\frac{1}{N}\sum_{i=1}^{N}x_i^{2}-\mu^{2}$$

wherein x_i is the i-th data in the original data, x_i^2 is the square of the i-th data, i.e. the i-th target parameter, i is a positive integer ranging from 1 to N, N is a positive integer greater than 1, γ, β and ε are all preset parameter values, and μ is the average value of the N data.
In the LayerNorm operator, the global accumulation result includes the accumulation sum of the N data, the accumulation sum of the N target parameters, and the accumulation total number N of the N target parameters; the intermediate parameters may include μ and σ^2.
Illustratively, the calculation formula of the RMS-LayerNorm operator is as follows:

$$y_i=\gamma\cdot\frac{x_i}{\mathrm{RMS}(x)},\qquad \mathrm{RMS}(x)=\sqrt{\frac{1}{N}\sum_{i=1}^{N}x_i^{2}+\epsilon}$$

In the RMS-LayerNorm operator, the global accumulation result includes the accumulation sum of the N target parameters and the accumulation total number N of the N target parameters; the intermediate parameters may include RMS(x).
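As a hedged software sketch of both operators, computed directly from the global accumulation results (the function names and the default values of γ, β and ε are assumptions for illustration):

```python
import math

def layernorm_from_sums(x, sum_x, sum_sq, n, gamma=1.0, beta=0.0, eps=1e-5):
    # LayerNorm: the mean and variance are derived from the global accumulation
    # result (sum of data, sum of squared data, total count N).
    mu = sum_x / n
    var = sum_sq / n - mu * mu
    return [gamma * (xi - mu) / math.sqrt(var + eps) + beta for xi in x]

def rms_layernorm_from_sums(x, sum_sq, n, gamma=1.0, eps=1e-5):
    # RMS-LayerNorm: only the sum of squared data and the count N are needed.
    rms = math.sqrt(sum_sq / n + eps)
    return [gamma * xi / rms for xi in x]

data = [1.0, 2.0, 3.0, 4.0]
print(layernorm_from_sums(data, sum(data), sum(v * v for v in data), len(data)))
print(rms_layernorm_from_sums(data, sum(v * v for v in data), len(data)))
```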
When each layer normalization processing unit performs layer normalization calculation, firstly, the direct memory access unit dispatches the input data in the memory unit to the layer normalization calculation processing unit, and the layer normalization calculation processing unit performs accumulation and calculation on the input data. For the operation of multi-node cooperative computing, the accumulated local accumulation results are required to be transmitted to other layers of normalization processing units through a parameter sharing network (namely, a parameter sharing network formed by K layers of normalization parameter routers), and the local accumulation results of the other layers of normalization processing units are received for further accumulation, so that the final global accumulation results are obtained. After the accumulation is completed, the global accumulation result is sent to a preprocessing calculation unit for parameter calculation to obtain intermediate parameters of layer normalization calculation. After the parameter calculation is completed, the intermediate parameters are written into the parameter register of the preprocessing calculation unit to complete the preprocessing operation. And then reading input data from the memory unit, performing reduction calculation by adopting a layer normalization operator according to the intermediate parameters obtained by preprocessing to obtain a calculation result, and writing the calculation result back to the memory unit through the direct memory access unit.
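The per-unit data flow described above can be summarized in the following illustrative sketch (a software model only; the class and method names are hypothetical stand-ins for the memory unit, direct memory access unit and layer normalization parameter router):

```python
class _Memory:
    # Stand-in for the memory unit plus direct memory access unit.
    def __init__(self, data):
        self.data, self.result = list(data), None
    def read(self):
        return self.data
    def write(self, result):
        self.result = result

class _Router:
    # Stand-in for the layer normalization parameter router: it exposes the
    # K-1 remote local accumulation results once sharing has completed.
    def __init__(self):
        self.sent, self.inbox = None, []
    def share(self, local_result):
        self.sent = local_result
    def receive_all(self):
        return self.inbox

def run_unit(memory, router, eps=1e-5):
    x = memory.read()                                 # DMA schedules the input data
    local = (sum(v * v for v in x), len(x))           # preprocessing accumulation
    router.share(local)                               # share the local accumulation result
    sum_sq, n = local
    for r_sq, r_n in router.receive_all():            # accumulate the K-1 remote results
        sum_sq, n = sum_sq + r_sq, n + r_n
    inv_rms = 1.0 / (sum_sq / n + eps) ** 0.5         # intermediate parameter
    memory.write([v * inv_rms for v in x])            # reduction calculation + write back

mem, rt = _Memory([1.0, 2.0]), _Router()
rt.inbox = [(25.0, 2)]                                # pretend a second unit held [3.0, 4.0]
run_unit(mem, rt)
print(mem.result)
```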
In one possible implementation, the preprocessing accumulation unit may be divided into two parts. The first part is responsible for accumulation and sharing of a plurality of data included in the input data in the LayerNorm operator, and the second part is responsible for accumulation of a target parameter (i.e., square of data) in the LayerNorm operator or the RMS-LayerNorm operator, calculation of the total number of accumulated N, and sharing. These two parts are described separately below.
The first part is responsible for accumulation and sharing of data in LayerNorm operators in layer normalization, preprocessing accumulation unit, please refer to fig. 4, which illustrates a schematic architecture diagram of the preprocessing accumulation unit provided in an exemplary embodiment of the present disclosure, which may be implemented as part of the preprocessing accumulation unit. The internal architecture of the preprocessing accumulation unit may include several parts, namely a selector, an accumulation unit, a local counter, an accumulation completion unit, a waiting remote unit and a remote completion unit. The internal units of the preprocessing accumulation unit can be increased or decreased according to actual needs, and the embodiments of the present disclosure are not limited thereto.
When the preprocessing accumulation unit performs accumulation processing, there are two different processing modes: local accumulation processing and global accumulation processing. Local accumulation processing handles the accumulation of local parameters, and global accumulation processing handles the accumulation of parameters across the different layer normalization processing units. During local accumulation processing, the input data is sent to the accumulation unit through the selector; the accumulation unit accumulates the plurality of data contained in the input data to obtain the local accumulation result, and the local counter counts the local accumulation number, which is equal to the total number of data contained in the local input data. That is, when the last accumulation operation is completed, i.e. when the local accumulation result has been obtained by accumulation, the local counter performs an accumulation count and the accumulation completion unit generates a first effective signal, the first effective signal being used to instruct the accumulation unit to output the local accumulation result to the corresponding layer normalization parameter router; at this point the local accumulation result in the accumulation unit is output through the data channel to the corresponding layer normalization parameter router and shared by that router with the other layer normalization processing units.
When global accumulation processing is performed, after the local accumulation and sharing processing, the local counter is used for controlling the waiting remote unit to generate a second effective signal, the second effective signal acts on the selector, and the second effective signal is used for instructing the selector to send the received local accumulation result shared by the normalization processing units of other layers to the accumulation unit. Optionally, the states of the selector include a first state and a second state, the first state being different from the second state, the selector being configured to receive the input data via the first data channel when the selector is in the default first state; and under the condition that the selector receives the second valid signal, switching the state of the selector from the first state to the second state, wherein when the selector is in the second state, the selector is used for receiving the local accumulation result shared by other layers of normalization processing units through a second data channel, and the first data channel is different from the second data channel. The embodiments of the present disclosure do not limit the manner in which the selector is disposed.
When the accumulation unit receives a remote local accumulation result, it performs the accumulation operation again on the received result and the local accumulation result. When the last accumulation is finished, that is, when the accumulation of the K local accumulation results is finished and the global accumulation result is obtained, the remote completion unit controls the accumulation completion unit to generate a third effective signal, the third effective signal being used to instruct the accumulation unit to transmit the global accumulation result to the preprocessing calculation unit for calculation; at this point the global accumulation result in the accumulation unit is transmitted through the data channel to the next-stage preprocessing calculation unit for calculation processing. The waiting remote unit, the local counter and the accumulation unit are then reset by the accumulation completion unit to facilitate the next calculation. At this point, the accumulation and sharing process of the plurality of data included in the input data in the LayerNorm operator is completed. The whole process requires no participation of a system-level scheduling controller, and the plurality of layer normalization processing units can work cooperatively in a distributed manner.
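The control flow of these two processing modes can be sketched as a small behavioral model (hypothetical names; the effective signals are modelled as return values and a state switch, not as hardware wires):

```python
class PreprocAccumulationUnitModel:
    def __init__(self, local_len, k):
        self.local_len, self.k = local_len, k     # expected local count and number of units K
        self.acc, self.local_count, self.remote_count = 0.0, 0, 0
        self.state = "LOCAL"                      # selector in its default first state

    def push_local(self, value):
        # Local accumulation processing: the selector feeds input data to the accumulator.
        assert self.state == "LOCAL"
        self.acc += value
        self.local_count += 1
        if self.local_count == self.local_len:    # last local accumulation completed
            self.state = "REMOTE"                 # "second effective signal": switch the selector
            return self.acc                       # "first effective signal": output to the router
        return None

    def push_remote(self, remote_result):
        # Global accumulation processing: remote local accumulation results are added.
        assert self.state == "REMOTE"
        self.acc += remote_result
        self.remote_count += 1
        if self.remote_count == self.k - 1:       # all K local results accumulated
            global_result = self.acc              # "third effective signal": to the preprocessing calculation unit
            self.acc, self.local_count, self.remote_count = 0.0, 0, 0
            self.state = "LOCAL"                  # reset for the next calculation
            return global_result
        return None

unit = PreprocAccumulationUnitModel(local_len=3, k=2)
for v in [1.0, 2.0, 3.0]:
    local = unit.push_local(v)                    # local == 6.0 after the last push
print(unit.push_remote(4.0))                      # global accumulation result: 10.0
```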
The second part is responsible for the accumulation of the target parameters in LayerNorm operators or RMS-LayerNorm operators, the calculation of the total number of accumulated N, and the sharing process, and the first and second parts may be two parts that are not connected or connected to each other. The embodiments of the present disclosure are not limited in this regard. Referring to fig. 5, a schematic diagram of an architecture of a preprocessing accumulation unit provided in another exemplary embodiment of the present disclosure is shown, which may be implemented as part of the preprocessing accumulation unit. The internal architecture of the preprocessing accumulation unit may include several parts, namely a selector, an accumulation unit, a local counter, an element counter, an accumulation completion unit, a waiting remote unit and a remote completion unit. The internal units of the preprocessing accumulation unit can be increased or decreased according to actual needs, and the embodiments of the present disclosure are not limited thereto.
Under the condition that the preprocessing accumulation unit carries out accumulation processing of the target parameters, the input data (comprising a plurality of data) is sent to the square processing unit, where the squares of the plurality of data are computed to obtain a plurality of target parameters; the plurality of target parameters are then sent to the accumulation unit through the selector, and the accumulation unit accumulates the plurality of target parameters to obtain the local accumulation result. The accumulation processing of the target parameters is likewise divided into local accumulation processing and global accumulation processing, and the process of accumulating the target parameters can be understood by analogy with the accumulation calculation flow of the data. In the layer normalization calculation, it is necessary to know how many data in total participate in the calculation, so the number of accumulation operations must be counted. Since the data and the target parameters are accumulated the same number of times, and the accumulation calculation of the target parameters is slower, the number of accumulation operations is counted only in the accumulation calculation of the target parameters. Optionally, the total amount of data participating in the calculation, that is, the total number N of data included in the original data, can be counted at the same time as the calculation. An element counter is added in the accumulation unit, and the total number of data included in the original data, that is, the global accumulation number, is counted by the element counter. During local accumulation processing, the local counter triggers a counting operation of the element counter every time an accumulation operation is performed. After the local accumulation processing is finished, if the remote local accumulation results of the other layer normalization processing units need to be accumulated, the remote local accumulation results are input into the selector and are also used to count the element counter; after the accumulation processing is completely finished, the element counter outputs the recorded data total to the layer normalization parameter router and shares it through the layer normalization parameter router with the other layer normalization processing units for use.
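The element counting described above can be sketched as follows (hypothetical names): the element counter ticks once per accumulation of a target parameter and then absorbs the counts reported by the remote units, so that the final count equals the total data amount N of the original data.

```python
def accumulate_target_params(input_data):
    # Square each data item (target parameter), accumulate, and count elements.
    sum_sq, element_count = 0.0, 0
    for x in input_data:
        sum_sq += x * x
        element_count += 1                 # element counter ticks once per accumulation
    return sum_sq, element_count

def merge_remote(local, remotes):
    # Global accumulation: add the remote sums and the remote element counts.
    sum_sq, n = local
    for r_sum, r_n in remotes:
        sum_sq, n = sum_sq + r_sum, n + r_n
    return sum_sq, n                       # n equals the total data amount N of the original data

local = accumulate_target_params([1.0, 2.0])
print(merge_remote(local, [(25.0, 2)]))    # (30.0, 4)
```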
It should be noted that, for operators that do not require cooperative computation across multiple layer normalization processing units, the computation flow inside a single layer normalization processing unit is exactly the same, except that after the computation is completed the data does not need to be shared, the layer normalization processing unit does not need to wait for remote data, and once local accumulation processing is finished the computation result is output directly for the next parameter calculation.
Each layer normalization processing unit comprises a layer normalization parameter router. For each layer normalization processing unit, the local accumulation result is sent to the other layer normalization processing units for sharing through the layer normalization parameter router, and the K-1 local accumulation results sent by the other layer normalization processing units are received through the layer normalization parameter router. The K layer normalization parameter routers may form a routing loop; one layer normalization parameter router is described below as an example. In one possible implementation, please refer to fig. 6, which illustrates a schematic architecture diagram of a layer normalization parameter router provided by an exemplary embodiment of the present disclosure; the architecture may be implemented as part of the layer normalization parameter router. The internal architecture of the layer normalization parameter router may include a remote parameter queue unit, a remote parameter register, a local parameter register, a data selector, a local register and a remote register. The internal units of the layer normalization parameter router may be increased or decreased according to actual needs, which is not limited by the embodiments of the present disclosure.
In operation, a local accumulation result (each of the K-1 local accumulation results) sent by the upstream layer normalization processing unit enters the remote parameter queue unit; that is, the local accumulation results sent by the other layer normalization processing units are received through the remote parameter queue unit and registered in the remote parameter register. The index of the current layer normalization parameter router is marked by the broadcast counting unit and uniquely identifies the current layer normalization parameter router among the K layer normalization parameter routers, that is, the index indicates which of the K layer normalization parameter routers the current one is; its minimum value may be 1 and its maximum value may be K. The local accumulation result held in the remote parameter register is input to the layer normalization calculation processing unit through a target data channel, where the target data channel is a data channel between the layer normalization parameter router and the layer normalization calculation processing unit.
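The ring-based sharing over the routing loop can be sketched as each router forwarding, for K-1 hops, the value it most recently received from its upstream neighbour; the function below is a behavioral assumption for illustration, not the router microarchitecture.

```python
def ring_share(local_results):
    """Each unit ends up with all K local accumulation results.

    local_results: list of K local accumulation results, indexed by router index.
    Returns a list where entry i holds the K results gathered by unit i.
    """
    k = len(local_results)
    gathered = [[r] for r in local_results]     # each unit starts with its own result
    in_flight = list(local_results)             # value each router sends downstream
    for _ in range(k - 1):                      # K-1 hops around the routing loop
        # Router i receives from its upstream neighbour (i-1), registers the value
        # in its remote parameter register, and forwards it on the next hop.
        in_flight = [in_flight[(i - 1) % k] for i in range(k)]
        for i in range(k):
            gathered[i].append(in_flight[i])
    return gathered
```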
Optionally, the local accumulation result of the current layer normalization processing unit is registered through the local parameter register, and is sent to the layer normalization parameter routers of the other layer normalization processing units through the data selector for sharing.
The local register is used for indicating the state of the remote parameter register, and the remote register is used for indicating the state of the local parameter register; the initial values of the local register and the remote register are both a default second value, for example 0. When the local accumulation result held in the remote parameter register is input into the layer normalization calculation processing unit, the value of the local register is set to a first numerical value; when the local accumulation result held in the local parameter register is sent to the other layer normalization parameter routers through the data selector, the value of the remote register is set to the first numerical value. When the values of the local register and the remote register are both the first numerical value, the remote parameter register is cleared. The first numerical value is different from the second value, for example the first numerical value is 1.
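A minimal sketch of this two-flag handshake, assuming the first numerical value is 1 and the second value is 0 (class and method names are illustrative assumptions, not the register-level design):

```python
class RouterStatusFlags:
    """Illustrative model of the local/remote status registers."""
    def __init__(self):
        self.local_reg = 0            # set once the remote result has been fed to the compute unit
        self.remote_reg = 0           # set once the local result has been sent to the other routers
        self.remote_param_reg = None  # remote parameter register

    def store_remote(self, value):
        self.remote_param_reg = value

    def consume_remote(self):
        value = self.remote_param_reg  # forwarded to the layer normalization calculation processing unit
        self.local_reg = 1
        self._maybe_clear()
        return value

    def send_local(self, local_result):
        self.remote_reg = 1            # local result handed to the data selector for sharing
        self._maybe_clear()
        return local_result

    def _maybe_clear(self):
        if self.local_reg == 1 and self.remote_reg == 1:
            self.remote_param_reg = None   # clear the remote parameter register
```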
When data is transmitted between different layer normalization parameter routers, the data link width can be configured according to resource requirements. Optionally, the data link width of the routing path between the layer normalization parameter routers is pre-configured to be smaller than a preset width threshold, for example smaller than the data bit width of the local accumulation result to be transmitted, so that resource overhead is reduced.
If the data link width is smaller than the data bit width, the data can be shifted out when it is output, so that one data word is converted into a data packet comprising a plurality of flits. In the embodiment of the disclosure, a local accumulation result to be transmitted may be converted into a data packet including a plurality of flits for transmission; that is, the local accumulation result is transmitted between layer normalization parameter routers in the form of a data packet containing a plurality of flits. When a shared local accumulation result is received, the data packet sent by the other layer normalization processing units is received through the remote parameter queue unit, and the plurality of flits in the data packet are spliced by the remote parameter queue unit to recover the shared local accumulation result.
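Splitting a wide local accumulation result into narrower flits and splicing them back can be sketched as follows; the 32-bit word width, 8-bit flit width and little-endian flit ordering are assumptions chosen only for illustration.

```python
def to_flits(value, data_bits=32, flit_bits=8):
    """Split a data word into little-endian flits for a narrow routing link."""
    mask = (1 << flit_bits) - 1
    n_flits = (data_bits + flit_bits - 1) // flit_bits
    return [(value >> (i * flit_bits)) & mask for i in range(n_flits)]

def from_flits(flits, flit_bits=8):
    """Splice flits received by the remote parameter queue back into one word."""
    value = 0
    for i, flit in enumerate(flits):
        value |= flit << (i * flit_bits)
    return value

# Round trip: a 32-bit local accumulation result sent as four 8-bit flits.
assert from_flits(to_flits(0xDEADBEEF)) == 0xDEADBEEF
```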
It should be noted that, when the apparatus provided in the foregoing embodiments implements its functions, the division into the foregoing functional modules is merely used as an example; in practical applications, the functions may be allocated to different functional modules according to actual needs, that is, the internal structure of the device may be divided into different functional modules to implement all or part of the functions described above.
In the related art, when a plurality of computing nodes perform a reduction computation, they need to move the raw data to a common location to perform the reduction operation. Moving the original data causes higher delay and higher power consumption overhead, which is unfavorable for improving the calculation efficiency of the accelerator. In addition, the reduction calculation requires the plurality of computing nodes to communicate and synchronize with a global controller, which increases calculation delay and reduces calculation efficiency. Furthermore, the same data path is often used for data sharing among the plurality of computing nodes, which can produce data transmission conflicts and reduce transmission efficiency. The embodiments of the present disclosure design a dedicated preprocessing accumulation unit for the reduction calculation of the layer normalization operator, and the preprocessing accumulation unit can efficiently perform the accumulation operation in the layer normalization operator. In one aspect, the accumulation operation in the preprocessing supports both the reduction computation of a single node (i.e., a single layer normalization processing unit) and the collaborative reduction computation of multiple nodes (i.e., multiple layer normalization processing units). When layer normalization calculation is carried out, the reduction calculations of the plurality of layer normalization processing units can work cooperatively, and the synchronization operation is completed automatically inside the layer normalization calculation processing units without intervention of a system controller. In another aspect, for the collaborative reduction calculation of the layer normalization operator, local data processing can be performed in each layer normalization processing unit, and after the local data processing is completed only the local accumulation results are shared; the intermediate data produced during local processing does not need to be shared, thereby realizing distributed calculation acceleration. In a further aspect, a dedicated data path is designed for data sharing, and this dedicated data path (the routing path between layer normalization parameter routers) is used for sharing the local accumulation results, which improves data sharing efficiency and reduces data transmission conflicts. Moreover, the data sharing path can be configured with different link widths, thereby saving circuit overhead.
The following are method embodiments of the present disclosure; for parts of the method embodiments that are not described in detail, reference may be made to the technical details disclosed in the apparatus embodiments described above.
Referring to fig. 7, a flowchart of a cooperative processing method of layer normalization computation according to an exemplary embodiment of the present disclosure is shown. The method is described as being applied to the cooperative processing apparatus of layer normalization calculation described above. The method comprises the following steps.
In step 701, each layer normalization processing unit in the K layer normalization processing units acquires input data, where the input data is local data input to the layer normalization processing unit in the original data, and K is a positive integer greater than 1.
In step 702, each layer of normalization processing unit performs preprocessing on the input data to obtain a local accumulation result, where the local accumulation result includes an accumulation result of parameters related to the input data.
In step 703, each layer normalization processing unit shares the local accumulation result to other layer normalization processing units.
In step 704, each layer normalization processing unit performs cooperative processing of layer normalization calculation according to the shared K local accumulation results.
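Steps 701 to 704 can be summarised by the following numerical sketch, in which K units each hold a slice of the original data and share only their local accumulation results; the helper function and the choice of NumPy are assumptions for illustration only.

```python
import numpy as np

def cooperative_layernorm_stats(original_data, k):
    """Each of the K units accumulates its local slice (step 702), then shares only
    (local_sum, local_sum_sq, local_count); no raw data is exchanged (step 703)."""
    slices = np.array_split(np.asarray(original_data, dtype=np.float64), k)
    local_results = [(s.sum(), (s * s).sum(), s.size) for s in slices]

    # Step 704: after sharing, every unit reduces the same K local results.
    total_sum = sum(r[0] for r in local_results)
    total_sq = sum(r[1] for r in local_results)
    total_n = sum(r[2] for r in local_results)
    mean = total_sum / total_n
    var = total_sq / total_n - mean * mean
    return mean, var

data = np.random.randn(1024)
mean, var = cooperative_layernorm_stats(data, k=4)
assert np.isclose(mean, data.mean()) and np.isclose(var, data.var())
```

Because every unit reduces the same K local accumulation results, each one obtains the same global mean and variance as a direct computation over the full original data.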
In one possible implementation, each layer normalization processing unit includes a layer normalization parameter router, and each layer normalization processing unit shares the local accumulation result to other layer normalization processing units, including:
For each layer of normalization processing units, the local accumulation result is sent to other layers of normalization processing units for sharing through a layer normalization parameter router;
the method further comprises the steps of:
And for each layer of normalization processing units, receiving K-1 local accumulation results sent by other layers of normalization processing units through a layer normalization parameter router.
In another possible implementation, the K layers of normalized parameter routers are connected by a ring routing structure.
In another possible implementation manner, each layer normalization processing unit comprises a memory unit, a direct memory access unit and a layer normalization calculation processing unit; the method further comprises the steps of:
For each layer of normalization processing unit, acquiring input data stored in a memory unit through a direct memory access unit;
Scheduling the input data to a layer normalization calculation processing unit through a direct memory access unit;
And for each layer normalization processing unit, acquiring scheduled input data through a layer normalization calculation processing unit.
In another possible implementation manner, each layer of normalized calculation processing unit includes a calculation unit, a preprocessing accumulation unit, and a preprocessing calculation unit;
each layer of normalization processing unit is used for preprocessing input data to obtain a local accumulation result, and the method comprises the following steps:
for each layer of normalization processing unit, preprocessing the input data by a preprocessing accumulation unit to obtain a local accumulation result;
Each layer normalization processing unit performs cooperative processing of layer normalization calculation according to the shared K local accumulation results, and the cooperative processing comprises the following steps:
For each layer of normalization processing units, accumulating the shared K local accumulation results through a preprocessing accumulation unit to obtain a global accumulation result;
After accumulation is completed, parameter calculation is carried out through a preprocessing calculation unit according to the global accumulation result to obtain intermediate parameters of layer normalization calculation;
and carrying out reduction calculation by a calculation unit according to the intermediate parameters and the input data by adopting a layer normalization operator to obtain a calculation result.
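As one possible illustration of how the global accumulation result might be turned into intermediate parameters and then used for the reduction calculation, consider the sketch below; the ε term, γ, β and the function name are assumptions made for illustration rather than the actual hardware pipeline.

```python
import numpy as np

def finish_layernorm(local_x, global_sum, global_sum_sq, global_n,
                     gamma=1.0, beta=0.0, eps=1e-5):
    # Preprocessing calculation unit: intermediate parameters from the global sums.
    mean = global_sum / global_n
    var = global_sum_sq / global_n - mean * mean
    rstd = 1.0 / np.sqrt(var + eps)
    # Calculation unit: normalization applied only to the local input data.
    return gamma * (np.asarray(local_x) - mean) * rstd + beta
```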
In another possible implementation manner, the preprocessing accumulation unit includes a selector, an accumulation unit and a local counter, the local accumulation result includes a parameter accumulation sum and a local accumulation number, and preprocessing is performed on the input data by the preprocessing accumulation unit to obtain the local accumulation result, including:
transmitting the input data to the accumulating unit through the selector;
accumulating, by the accumulation unit, the parameters related to the input data to obtain the parameter accumulation sum, and counting, by the local counter, the number of accumulation operations performed on the parameters related to the input data to obtain the local accumulation number.
In another possible implementation manner, the preprocessing accumulation unit further includes an accumulation completion unit, and the method further includes:
When the local accumulation result is obtained by accumulation, a first effective signal is generated by an accumulation completion unit, and the first effective signal is used for indicating the accumulation unit to output the local accumulation result to the corresponding layer normalization parameter router.
In another possible implementation, the preprocessing accumulation unit further includes a waiting remote unit, the method further comprising:
The local counter is used for controlling the waiting remote unit to generate a second effective signal, and the second effective signal is used for instructing the selector to send the received local accumulation result shared by the normalization calculation processing units of other layers to the accumulation unit;
Accumulating the shared K local accumulation results through a preprocessing accumulation unit to obtain a global accumulation result, wherein the method comprises the following steps of:
And accumulating the received K-1 local accumulated results and the local accumulated results through an accumulating unit to obtain a global accumulated result.
In another possible implementation, the preprocessing accumulation unit further includes a remote completion unit, and the method further includes:
when the global accumulation result is obtained through accumulation, the remote completion unit controls the accumulation completion unit to generate a third effective signal, and the third effective signal is used for indicating the accumulation unit to transmit the global accumulation result to the preprocessing calculation unit for calculation;
And resetting the waiting remote unit, the local counter and the accumulation unit through the accumulation completion unit.
In another possible implementation, the preprocessing accumulation unit further includes an element counter, and the method further includes:
the total amount of data included in the original data is counted by the element counter.
In another possible implementation, the layer normalized parameter router includes a remote parameter queue unit and a remote parameter register; receiving K-1 local accumulation results sent by other layer normalization processing units through a layer normalization parameter router, wherein the K-1 local accumulation results comprise:
receiving a local accumulation result sent by other layer normalization processing units through a remote parameter queue unit for each local accumulation result in the K-1 local accumulation results;
the method further comprises the steps of:
registering the local accumulated result in a remote parameter register;
And the local accumulation result is input to the layer normalization calculation processing unit through a target data channel by a remote parameter register, wherein the target data channel is a data channel between the layer normalization parameter router and the layer normalization calculation processing unit.
In another possible implementation manner, the layer normalization parameter router further includes a local parameter register and a data selector, and the local accumulation result is sent to other layer normalization processing units for sharing through the layer normalization parameter router, including:
registering a local accumulation result of the current layer normalization processing unit through a local parameter register;
And sending the local accumulation result to the layer normalization parameter routers of other layer normalization processing units through the data selector for sharing.
In another possible implementation, the layer normalization parameter router further includes a data selector, a local register, and a remote register; the method further comprises the steps of:
When the local accumulation result is input into the layer normalization calculation processing unit, setting the value of the local register to be a first numerical value;
When the local accumulation result is sent to other layers of normalized parameter routers through the data selector, setting the value of a remote register to be a first numerical value;
when the values of the local register and the remote register are the first numerical value, the remote parameter register is cleared.
In another possible implementation manner, the local accumulation results are transmitted between the layer normalization parameter routers in the form of data packets, and the local accumulation results sent by other layer normalization processing units are received through a remote parameter queue unit, including:
receiving a data packet sent by other layers of normalization processing units through a remote parameter queue unit, wherein the data packet comprises a plurality of flits;
And splicing the plurality of flits in the data packet through the remote parameter queue unit to obtain a local accumulation result.
In another possible implementation, the input data of each layer normalization processing unit comprises a plurality of data,
The parameters related to the input data comprise a plurality of target parameters, the local accumulation result comprises accumulation sums of the plurality of target parameters and local accumulation times, the target parameters are squares of the data, and the local accumulation times are accumulation total times of the plurality of target parameters; or alternatively
The parameters related to the input data include a plurality of data and a plurality of target parameters, and the local accumulation result includes an accumulation sum of the plurality of data, an accumulation sum of the plurality of target parameters, and a local accumulation number.
In another possible implementation, the layer normalization operator comprises LayerNorm operators or RMS-LayerNorm operators.
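For reference, the commonly used definitions of the two operators (with scale γ, bias β and a small constant ε) make clear why the RMS-LayerNorm operator only needs the accumulated sum of squares, while the LayerNorm operator also needs the accumulated sum of the data:

```latex
% LayerNorm: requires both the accumulated sum (for \mu) and the accumulated sum of squares (for \sigma^2)
\mathrm{LayerNorm}(x_i) = \gamma \cdot \frac{x_i - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta,
\qquad \mu = \frac{1}{N}\sum_{j=1}^{N} x_j,
\qquad \sigma^{2} = \frac{1}{N}\sum_{j=1}^{N} x_j^{2} - \mu^{2}

% RMS-LayerNorm: requires only the accumulated sum of squares
\mathrm{RMSLayerNorm}(x_i) = \gamma \cdot \frac{x_i}{\sqrt{\tfrac{1}{N}\sum_{j=1}^{N} x_j^{2} + \epsilon}}
```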
It should be noted that, regarding the method in the above embodiment, the specific manner in which the operations are performed by the steps has been described in detail in the embodiment related to the apparatus, and will not be described in detail herein.
FIG. 8 is a block diagram illustrating a cooperative processing apparatus for layer normalization computation according to an exemplary embodiment. For example, the apparatus 10 may be provided as a server or a terminal device. Referring to fig. 8, the apparatus 10 includes a processing component 822, which further includes one or more processors, and memory resources represented by memory 832 for storing instructions executable by the processing component 822, such as application programs. The application programs stored in the memory 832 may include one or more modules, each corresponding to a set of instructions. Furthermore, the processing component 822 may include a plurality of acceleration units, each acceleration unit including one layer normalization processing unit, and the processing component 822 is configured to execute the instructions to perform the above-described method.
The apparatus 10 may also include a power component 826 configured to perform power management of the apparatus 10, a wired or wireless network interface 850 configured to connect the apparatus 10 to a network, and an input/output (I/O) interface 858. The apparatus 10 may operate based on an operating system stored in the memory 832, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanical encoding devices such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, light pulses through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to respective computing/processing devices, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
The computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), with state information of the computer readable program instructions, where the electronic circuitry can execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (15)
1. The cooperative processing device for layer normalization calculation is characterized by comprising K layer normalization processing units, wherein K is a positive integer greater than 1;
The K layer normalization processing units are used for sharing respective local accumulation results and carrying out cooperative processing of layer normalization calculation, the local accumulation results comprise accumulation results of parameters related to input data, and the input data is local data input to the layer normalization processing units in original data.
2. The apparatus of claim 1, wherein each of the layer normalization processing units further comprises a layer normalization calculation processing unit and a layer normalization parameter router;
the layer normalization calculation processing unit is used for preprocessing the input data to obtain the local accumulation result;
The layer normalization parameter router is used for sending the local accumulation results to other layer normalization processing units for sharing and receiving K-1 local accumulation results sent by the other layer normalization processing units;
the layer normalization calculation processing unit is further used for performing reduction calculation on the input data by adopting a layer normalization operator according to the shared K local accumulation results to obtain a calculation result.
3. The apparatus of claim 2, wherein K of the layer normalization parameter routers are connected by a ring routing fabric.
4. The apparatus of claim 2, wherein each of the layer normalization processing units further comprises a memory unit and a direct memory access unit;
The memory unit is used for storing the input data;
The direct memory access unit is used for acquiring the stored input data and scheduling the input data to the layer normalization calculation processing unit;
the layer normalization computing processing unit is further configured to obtain the scheduled input data.
5. The apparatus of claim 2, wherein each of the layer normalization computation processing units comprises a computation unit, a preprocessing accumulation unit, and a preprocessing computation unit;
the preprocessing accumulation unit is used for preprocessing the input data to obtain the local accumulation result;
The preprocessing accumulation unit is also used for accumulating the shared K local accumulation results to obtain a global accumulation result;
The preprocessing calculation unit is used for carrying out parameter calculation according to the global accumulation result to obtain intermediate parameters of the layer normalization calculation;
And the calculation unit is used for carrying out reduction calculation by adopting the layer normalization operator according to the intermediate parameter and the input data to obtain the calculation result.
6. The apparatus of claim 5, wherein the preprocessing accumulation unit comprises a selector, an accumulation unit and a local counter, the local accumulation result comprises a parameter accumulation sum and a local accumulation number,
The selector is used for sending the input data to the accumulation unit;
the accumulation unit is used for accumulating parameters related to the input data to obtain the parameter accumulation sum;
The local counter is used for counting the accumulation times of the parameters related to the input data to obtain the local accumulation times.
7. The apparatus of claim 6, wherein the preprocessing accumulation unit further comprises an accumulation completion unit,
The accumulation completion unit is used for generating a first effective signal when the local accumulation result is obtained through accumulation, and the first effective signal is used for indicating the accumulation unit to output the local accumulation result to a corresponding layer normalization parameter router.
8. The apparatus of claim 6, wherein the preprocessing accumulation unit further comprises a waiting remote unit,
The local counter is further configured to control the waiting remote unit to generate a second valid signal, where the second valid signal is used to instruct the selector to send the received local accumulation result shared by the other layer normalization computation processing units to the accumulation unit;
the accumulation unit is also used for accumulating the received K-1 local accumulation results and the local accumulation results to obtain the global accumulation result.
9. The apparatus of claim 7, wherein the preprocessing accumulation unit further comprises a remote completion unit,
The remote completion unit is used for controlling the accumulation completion unit to generate a third effective signal when the global accumulation result is obtained by accumulation, and the third effective signal is used for indicating the accumulation unit to transmit the global accumulation result to the preprocessing calculation unit for calculation;
The accumulation completion unit is also used for resetting the waiting remote unit, the local counter and the accumulation unit.
10. The apparatus of claim 2, wherein the layer normalization parameter router comprises a remote parameter queue unit and a remote parameter register;
the remote parameter queue unit is used for receiving the local accumulation results sent by other layer normalization processing units;
The remote parameter register is used for registering the local accumulation results sent by other layer normalization processing units, and inputting the local accumulation results to the layer normalization calculation processing unit through a target data channel, wherein the target data channel is a data channel between the layer normalization parameter router and the layer normalization calculation processing unit.
11. The apparatus of claim 2, wherein the layer normalization parameter router further comprises a local parameter register and a data selector;
the local parameter register is used for registering the local accumulation result of the current layer normalization processing unit;
The data selector is used for sending the local accumulation result to the layer normalization parameter routers of other layer normalization processing units for sharing.
12. The apparatus of claim 10, wherein the layer normalization parameter router further comprises a data selector, a local register, and a remote register;
the local register is used for setting the value of the local register to be a first numerical value when the local accumulation result is input into the layer normalization calculation processing unit;
The remote register is used for setting the value of the remote register to be a first numerical value when the local accumulation result is sent to other layer normalization parameter routers through the data selector;
The remote parameter register is further configured to clear the remote parameter register when the values of the local register and the remote register are both the first value.
13. The apparatus of claim 10, wherein the local accumulation results are communicated in data packets between the layer normalized parameter routers, the remote parameter queue unit further to:
receiving the data packet sent by other layer normalization processing units, wherein the data packet comprises a plurality of flits;
and splicing the plurality of flits in the data packet to obtain the local accumulation result.
14. The apparatus according to any one of claims 1 to 13, wherein the input data of each of the layer normalization processing units comprises a plurality of data,
The parameters related to the input data comprise a plurality of target parameters, the local accumulation result comprises accumulation sums of a plurality of target parameters and local accumulation times, the target parameters are squares of the data, and the local accumulation times are accumulation total times of the plurality of target parameters; or alternatively
The parameters related to the input data include a plurality of the data and a plurality of the target parameters, and the local accumulation result includes an accumulation sum of the plurality of the data, an accumulation sum of the plurality of the target parameters, and the local accumulation number.
15. A co-processing method of layer normalization computation, for use in the apparatus of any one of claims 1 to 14, the method comprising:
Each layer normalization processing unit preprocesses the input data to obtain the local accumulation result;
Each layer normalization processing unit shares the local accumulation result to other layer normalization processing units;
And each layer normalization processing unit performs cooperative processing of layer normalization calculation according to the shared K local accumulation results.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410437757.0A CN118095351B (en) | 2024-04-12 | 2024-04-12 | Cooperative processing device and method for layer normalization calculation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410437757.0A CN118095351B (en) | 2024-04-12 | 2024-04-12 | Cooperative processing device and method for layer normalization calculation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118095351A true CN118095351A (en) | 2024-05-28 |
CN118095351B CN118095351B (en) | 2024-07-02 |
Family
ID=91147726
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410437757.0A Active CN118095351B (en) | 2024-04-12 | 2024-04-12 | Cooperative processing device and method for layer normalization calculation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118095351B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190012170A1 (en) * | 2017-07-05 | 2019-01-10 | Deep Vision, Inc. | Deep vision processor |
CN109726806A (en) * | 2017-10-30 | 2019-05-07 | 上海寒武纪信息科技有限公司 | Information processing method and terminal device |
CN111144556A (en) * | 2019-12-31 | 2020-05-12 | 中国人民解放军国防科技大学 | Hardware circuit of range batch processing normalization algorithm for deep neural network training and reasoning |
CN112789627A (en) * | 2018-09-30 | 2021-05-11 | 华为技术有限公司 | Neural network processor, data processing method and related equipment |
US11237880B1 (en) * | 2020-12-18 | 2022-02-01 | SambaNova Systems, Inc. | Dataflow all-reduce for reconfigurable processor systems |
US20230394823A1 (en) * | 2022-06-03 | 2023-12-07 | Nvidia Corporation | Techniques to perform trajectory predictions |
CN117648454A (en) * | 2023-11-01 | 2024-03-05 | 西安工程大学 | Text-guided clothing image retrieval method based on feature enhancement and multi-granularity matching |
CN117651953A (en) * | 2021-07-21 | 2024-03-05 | 高通股份有限公司 | Hybrid machine learning architecture with neural processing unit and in-memory computing processing |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190012170A1 (en) * | 2017-07-05 | 2019-01-10 | Deep Vision, Inc. | Deep vision processor |
CN109726806A (en) * | 2017-10-30 | 2019-05-07 | 上海寒武纪信息科技有限公司 | Information processing method and terminal device |
CN112789627A (en) * | 2018-09-30 | 2021-05-11 | 华为技术有限公司 | Neural network processor, data processing method and related equipment |
CN111144556A (en) * | 2019-12-31 | 2020-05-12 | 中国人民解放军国防科技大学 | Hardware circuit of range batch processing normalization algorithm for deep neural network training and reasoning |
US11237880B1 (en) * | 2020-12-18 | 2022-02-01 | SambaNova Systems, Inc. | Dataflow all-reduce for reconfigurable processor systems |
CN117651953A (en) * | 2021-07-21 | 2024-03-05 | 高通股份有限公司 | Hybrid machine learning architecture with neural processing unit and in-memory computing processing |
US20230394823A1 (en) * | 2022-06-03 | 2023-12-07 | Nvidia Corporation | Techniques to perform trajectory predictions |
CN117648454A (en) * | 2023-11-01 | 2024-03-05 | 西安工程大学 | Text-guided clothing image retrieval method based on feature enhancement and multi-granularity matching |
Non-Patent Citations (2)
Title |
---|
ZELIN WU 等: ""Transmission Line Fault Location Based on the Stacked Sparse Auto-Encoder Deep Neural Network"", 2021 IEEE 5TH CONFERENCE ON ENERGY INTERNET AND ENERGY SYSTEM INTEGRATION (EI2), 25 February 2022 (2022-02-25) * |
ZHIHUI ZHANG 等: ""Neural Noise Embedding for End-To-End Speech Enhancement with Conditional Layer Normalization"", ICASSP 2021 - 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 13 May 2021 (2021-05-13) * |
Also Published As
Publication number | Publication date |
---|---|
CN118095351B (en) | 2024-07-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110389826B (en) | Method, apparatus and computer program product for processing a computing task | |
EP2189903B1 (en) | Barrier synchronization apparatus, barrier synchronization system, and barrier synchronization method | |
US10831536B2 (en) | Task scheduling using improved weighted round robin techniques | |
US20140310712A1 (en) | Sequential cooperation between map and reduce phases to improve data locality | |
US11055157B2 (en) | Method and apparatus for graph-based computing | |
US11556450B2 (en) | Hybrid data-model parallelism for efficient deep learning | |
US11086668B2 (en) | Method, electronic device and computer program product for processing task | |
JP2020505666A (en) | Neural network board with high area efficiency, resettable, high energy efficiency, high speed efficiency | |
CN111126613A (en) | Method, apparatus and computer program product for deep learning | |
CN112084027A (en) | Network-on-chip data transmission method, device, network-on-chip, equipment and medium | |
CN111859775A (en) | Software and hardware co-design for accelerating deep learning inference | |
CN111352711A (en) | Multi-computing engine scheduling method, device, equipment and storage medium | |
CN112418389A (en) | Data processing method and device, electronic equipment and computer readable storage medium | |
CN118095351B (en) | Cooperative processing device and method for layer normalization calculation | |
US20230156520A1 (en) | Coordinated load balancing in mobile edge computing network | |
CN115994040A (en) | Computing system, method for data broadcasting and data reduction, and storage medium | |
CN112925739B (en) | Communication method applied to many-core chip, many-core chip and storage medium | |
CN111756833B (en) | Node processing method, node processing device, electronic equipment and computer readable medium | |
CN114666263A (en) | High-dynamic intelligent route determining method and device, electronic equipment and storage medium | |
CN114546633A (en) | Distributed task processing method and system | |
CN114095289B (en) | Data multicast circuit, method, electronic device, and computer-readable storage medium | |
CN117251035B (en) | Heat dissipation control method, heat dissipation control device, electronic equipment and computer readable medium | |
CN112437021B (en) | Routing control method, device, routing equipment and storage medium | |
CN115208769B (en) | Ring communication method suitable for Dragon topology | |
CN117170986B (en) | Chip consistency processing system, method, device, equipment and medium thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |