CN118095351A - Collaborative processing device and method for layer normalization calculation - Google Patents
- Publication number: CN118095351A
- Application number: CN202410437757.0A
- Authority: CN (China)
- Prior art keywords: accumulation; layer normalization; unit; local; parameter
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N3/0442 — Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06F18/10 — Pattern recognition: pre-processing; data cleansing
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/08 — Neural networks: learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Description
Technical Field
The present disclosure relates to the field of neural network technology, and in particular to a collaborative processing device and method for layer normalization calculation.
Background
The layer normalization operator is a nonlinear operator commonly used in neural network algorithms. By performing a reduction calculation over the data, the layer normalization operator makes the neural network more stable during training and helps to improve its performance. During the reduction calculation, all intermediate results of the computation must be aggregated to obtain the calculation parameters before the reduction can be completed. The reduction calculation is therefore one of the key factors affecting the computational performance of the algorithm.
In the related art, for large language models based on the Transformer network, reduction calculations have become a performance bottleneck due to the models' large scale and complex computing requirements. Large-model algorithms contain a large number of matrix-vector multiplication operations whose results need to be reduced; because the vectors are long, completing the reduction within a single computing unit takes a large amount of time and greatly degrades the computing performance of the system. At present, no reasonable and effective way of handling this has been provided.
Summary of the Invention
In view of this, the present disclosure proposes a collaborative processing device and method for layer normalization calculation.
According to one aspect of the present disclosure, a collaborative processing device for layer normalization calculation is provided. The device comprises K layer normalization processing units, where K is a positive integer greater than 1.
The K layer normalization processing units are configured to share their respective local accumulation results and to perform collaborative processing of the layer normalization calculation. A local accumulation result comprises the accumulation results of parameters related to the input data, and the input data is the local portion of the original data that is input to the layer normalization processing unit.
In a possible implementation, each layer normalization processing unit further includes a layer normalization calculation processing unit and a layer normalization parameter router.
The layer normalization calculation processing unit is configured to preprocess the input data to obtain the local accumulation result.
The layer normalization parameter router is configured to send the local accumulation result to the other layer normalization processing units for sharing, and to receive the K-1 local accumulation results sent by the other layer normalization processing units.
The layer normalization calculation processing unit is further configured to perform, according to the shared K local accumulation results, a reduction calculation on the input data with the layer normalization operator to obtain a calculation result.
In another possible implementation, the K layer normalization parameter routers are connected in a ring routing structure.
In another possible implementation, each layer normalization processing unit further includes a memory unit and a direct memory access unit.
The memory unit is configured to store the input data.
The direct memory access unit is configured to fetch the stored input data and dispatch it to the layer normalization calculation processing unit.
The layer normalization calculation processing unit is further configured to obtain the dispatched input data.
In another possible implementation, each layer normalization calculation processing unit includes a calculation unit, a preprocessing accumulation unit and a preprocessing calculation unit.
The preprocessing accumulation unit is configured to preprocess the input data to obtain the local accumulation result.
The preprocessing accumulation unit is further configured to accumulate the shared K local accumulation results to obtain a global accumulation result.
The preprocessing calculation unit is configured to perform parameter calculation according to the global accumulation result to obtain the intermediate parameters of the layer normalization calculation.
The calculation unit is configured to perform, according to the intermediate parameters and the input data, a reduction calculation with the layer normalization operator to obtain the calculation result.
In another possible implementation, the preprocessing accumulation unit includes a selector, an accumulation unit and a local counter, and the local accumulation result includes a parameter accumulation sum and a local accumulation count.
The selector is configured to send the input data to the accumulation unit.
The accumulation unit is configured to accumulate the parameters related to the input data to obtain the parameter accumulation sum.
The local counter is configured to count the number of accumulations of the parameters related to the input data to obtain the local accumulation count.
In another possible implementation, the preprocessing accumulation unit further includes an accumulation completion unit.
The accumulation completion unit is configured to generate a first valid signal when the local accumulation result has been accumulated; the first valid signal instructs the accumulation unit to output the local accumulation result to the corresponding layer normalization parameter router.
In another possible implementation, the preprocessing accumulation unit further includes a wait-for-remote unit.
The local counter is further configured to control the wait-for-remote unit to generate a second valid signal; the second valid signal instructs the selector to send the received local accumulation results shared by the other layer normalization calculation processing units to the accumulation unit.
The accumulation unit is further configured to accumulate the received K-1 local accumulation results with the local accumulation result to obtain the global accumulation result.
In another possible implementation, the preprocessing accumulation unit further includes a remote completion unit.
The remote completion unit is configured to control the accumulation completion unit to generate a third valid signal when the global accumulation result has been accumulated; the third valid signal instructs the accumulation unit to transmit the global accumulation result to the preprocessing calculation unit for calculation.
The accumulation completion unit is further configured to reset and clear the wait-for-remote unit, the local counter and the accumulation unit.
In another possible implementation, the layer normalization parameter router includes a remote parameter queue unit and a remote parameter register.
The remote parameter queue unit is configured to receive the local accumulation results sent by the other layer normalization processing units.
The remote parameter register is configured to register the local accumulation results sent by the other layer normalization processing units, and to input the local accumulation results to the layer normalization calculation processing unit through a target data channel, the target data channel being the data channel between the layer normalization parameter router and the layer normalization calculation processing unit.
In another possible implementation, the layer normalization parameter router further includes a local parameter register and a data selector.
The local parameter register is configured to register the local accumulation result of the current layer normalization processing unit.
The data selector is configured to send the local accumulation result to the layer normalization parameter routers of the other layer normalization processing units for sharing.
In another possible implementation, the layer normalization parameter router further includes a data selector, a local register and a remote register.
The local register is set to a first value when the local accumulation result is input into the layer normalization calculation processing unit.
The remote register is set to the first value when the local accumulation result is sent to the other layer normalization parameter routers through the data selector.
The remote parameter register is further configured to be cleared to zero when the values of the local register and the remote register are both the first value.
In another possible implementation, the local accumulation results are transmitted between the layer normalization parameter routers in the form of data packets, and the remote parameter queue unit is further configured to:
receive the data packet sent by another of the layer normalization processing units, the data packet including a plurality of flits;
splice the plurality of flits in the data packet together to obtain the local accumulation result.
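A minimal illustrative sketch of this flit-based transfer is given below, assuming fixed-width flits and a raw integer encoding of the accumulation value; the flit width, flit count and helper names are hypothetical and are not specified by this disclosure.

```python
def split_into_flits(value_bits: int, flit_width: int = 32, num_flits: int = 2):
    """Split a local accumulation result (encoded as raw bits) into flits for the router."""
    mask = (1 << flit_width) - 1
    return [(value_bits >> (i * flit_width)) & mask for i in range(num_flits)]

def splice_flits(flits, flit_width: int = 32):
    """Remote parameter queue unit behaviour: splice received flits back into one value."""
    value = 0
    for i, flit in enumerate(flits):
        value |= flit << (i * flit_width)
    return value
```

For example, a 64-bit accumulation value would be carried as two 32-bit flits and spliced back together on arrival.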
In another possible implementation, the input data of each layer normalization processing unit includes a plurality of data items, the parameters related to the input data include a plurality of target parameters, and the local accumulation result includes the accumulation sum of the plurality of target parameters and the local accumulation count, where a target parameter is the square of a data item and the local accumulation count is the total number of accumulations of the plurality of target parameters; or,
the parameters related to the input data include the plurality of data items and the plurality of target parameters, and the local accumulation result includes the accumulation sum of the plurality of data items, the accumulation sum of the plurality of target parameters, and the local accumulation count.
According to another aspect of the present disclosure, a collaborative processing method for layer normalization calculation is provided, for use in the device provided by the first aspect or any possible implementation of the first aspect. The method includes:
each layer normalization processing unit preprocessing the input data to obtain the local accumulation result;
each layer normalization processing unit sharing the local accumulation result with the other layer normalization processing units;
each layer normalization processing unit performing collaborative processing of the layer normalization calculation according to the shared K local accumulation results.
In a possible implementation, each layer normalization processing unit includes a layer normalization parameter router, and the sharing, by each layer normalization processing unit, of the local accumulation result with the other layer normalization processing units includes:
for each layer normalization processing unit, sending the local accumulation result to the other layer normalization processing units for sharing through the layer normalization parameter router.
The method further includes:
for each layer normalization processing unit, receiving, through the layer normalization parameter router, the K-1 local accumulation results sent by the other layer normalization processing units.
In another possible implementation, the K layer normalization parameter routers are connected in a ring routing structure.
In another possible implementation, each layer normalization processing unit includes a memory unit, a direct memory access unit and a layer normalization calculation processing unit, and the method further includes:
for each layer normalization processing unit, obtaining, through the direct memory access unit, the input data stored in the memory unit;
dispatching the input data to the layer normalization calculation processing unit through the direct memory access unit;
for each layer normalization processing unit, obtaining the dispatched input data through the layer normalization calculation processing unit.
In another possible implementation, each layer normalization calculation processing unit includes a calculation unit, a preprocessing accumulation unit and a preprocessing calculation unit.
The preprocessing, by each layer normalization processing unit, of the input data to obtain the local accumulation result includes:
for each layer normalization processing unit, preprocessing the input data through the preprocessing accumulation unit to obtain the local accumulation result.
The collaborative processing, by each layer normalization processing unit, of the layer normalization calculation according to the shared K local accumulation results includes:
for each layer normalization processing unit, accumulating the shared K local accumulation results through the preprocessing accumulation unit to obtain a global accumulation result;
after the accumulation is completed, performing parameter calculation through the preprocessing calculation unit according to the global accumulation result to obtain the intermediate parameters of the layer normalization calculation;
performing, through the calculation unit and according to the intermediate parameters and the input data, a reduction calculation with the layer normalization operator to obtain the calculation result.
In another possible implementation, the preprocessing accumulation unit includes a selector, an accumulation unit and a local counter, the local accumulation result includes a parameter accumulation sum and a local accumulation count, and the preprocessing of the input data through the preprocessing accumulation unit to obtain the local accumulation result includes:
sending the input data to the accumulation unit through the selector;
accumulating, through the accumulation unit, the parameters related to the input data to obtain the parameter accumulation sum, and counting, through the local counter, the number of accumulations of the parameters related to the input data to obtain the local accumulation count.
In another possible implementation, the preprocessing accumulation unit further includes an accumulation completion unit, and the method further includes:
when the local accumulation result has been accumulated, generating a first valid signal through the accumulation completion unit, the first valid signal instructing the accumulation unit to output the local accumulation result to the corresponding layer normalization parameter router.
In another possible implementation, the preprocessing accumulation unit further includes a wait-for-remote unit, and the method further includes:
controlling, through the local counter, the wait-for-remote unit to generate a second valid signal, the second valid signal instructing the selector to send the received local accumulation results shared by the other layer normalization calculation processing units to the accumulation unit.
The accumulating of the shared K local accumulation results through the preprocessing accumulation unit to obtain the global accumulation result includes:
accumulating, through the accumulation unit, the received K-1 local accumulation results with the local accumulation result to obtain the global accumulation result.
In another possible implementation, the preprocessing accumulation unit further includes a remote completion unit, and the method further includes:
when the global accumulation result has been accumulated, controlling, through the remote completion unit, the accumulation completion unit to generate a third valid signal, the third valid signal instructing the accumulation unit to transmit the global accumulation result to the preprocessing calculation unit for calculation;
resetting and clearing the wait-for-remote unit, the local counter and the accumulation unit through the accumulation completion unit.
In another possible implementation, the layer normalization parameter router includes a remote parameter queue unit and a remote parameter register, and the receiving, through the layer normalization parameter router, of the K-1 local accumulation results sent by the other layer normalization processing units includes:
for each of the K-1 local accumulation results, receiving, through the remote parameter queue unit, the local accumulation result sent by another of the layer normalization processing units.
The method further includes:
registering the local accumulation result in the remote parameter register;
inputting the local accumulation result from the remote parameter register to the layer normalization calculation processing unit through a target data channel, the target data channel being the data channel between the layer normalization parameter router and the layer normalization calculation processing unit.
In another possible implementation, the layer normalization parameter router further includes a local parameter register and a data selector, and the sending of the local accumulation result to the other layer normalization processing units for sharing through the layer normalization parameter router includes:
registering the local accumulation result of the current layer normalization processing unit in the local parameter register;
sending the local accumulation result to the layer normalization parameter routers of the other layer normalization processing units through the data selector for sharing.
In another possible implementation, the layer normalization parameter router further includes a data selector, a local register and a remote register, and the method further includes:
when the local accumulation result is input into the layer normalization calculation processing unit, setting the value of the local register to a first value;
when the local accumulation result is sent to the other layer normalization parameter routers through the data selector, setting the value of the remote register to the first value;
when the values of the local register and the remote register are both the first value, clearing the remote parameter register to zero.
In another possible implementation, the local accumulation results are transmitted between the layer normalization parameter routers in the form of data packets, and the receiving, through the remote parameter queue unit, of the local accumulation result sent by another of the layer normalization processing units includes:
receiving, through the remote parameter queue unit, the data packet sent by the other layer normalization processing unit, the data packet including a plurality of flits;
splicing the plurality of flits in the data packet together through the remote parameter queue unit to obtain the local accumulation result.
In another possible implementation, the input data of each layer normalization processing unit includes a plurality of data items,
the parameters related to the input data include a plurality of target parameters, and the local accumulation result includes the accumulation sum of the plurality of target parameters and the local accumulation count, where a target parameter is the square of a data item and the local accumulation count is the total number of accumulations of the plurality of target parameters; or,
the parameters related to the input data include the plurality of data items and the plurality of target parameters, and the local accumulation result includes the accumulation sum of the plurality of data items, the accumulation sum of the plurality of target parameters, and the local accumulation count.
The embodiments of the present disclosure design a layer normalization computing architecture in which multiple nodes (i.e., K layer normalization processing units) cooperate, and design a preprocessing mechanism tailored to the requirements of layer normalization calculation. The preprocessing mechanism allows local data processing to be performed within each node (i.e., each layer normalization processing unit). Once the local data processing is completed and the local accumulation result is obtained, the nodes only need to share their respective local accumulation results to realize collaborative processing of the layer normalization calculation, without sharing intermediate data. This achieves distributed computing acceleration and improves the efficiency of layer normalization calculation in neural networks.
Further features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate exemplary embodiments, features and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 shows a schematic architecture diagram of a collaborative processing device for layer normalization calculation provided by an exemplary embodiment of the present disclosure.
FIG. 2 shows a schematic architecture diagram of a collaborative processing device for layer normalization calculation provided by another exemplary embodiment of the present disclosure.
FIG. 3 shows a schematic architecture diagram of a layer normalization processing unit provided by an exemplary embodiment of the present disclosure.
FIG. 4 shows a schematic architecture diagram of a preprocessing accumulation unit provided by an exemplary embodiment of the present disclosure.
FIG. 5 shows a schematic architecture diagram of a preprocessing accumulation unit provided by another exemplary embodiment of the present disclosure.
FIG. 6 shows a schematic architecture diagram of a layer normalization parameter router provided by an exemplary embodiment of the present disclosure.
FIG. 7 shows a flowchart of a collaborative processing method for layer normalization calculation provided by an exemplary embodiment of the present disclosure.
FIG. 8 is a block diagram of a collaborative processing device for layer normalization calculation according to an exemplary embodiment.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. The same reference numerals in the drawings denote elements with the same or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise specified.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration". Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are given in the following detailed description in order to better illustrate the present disclosure. Those skilled in the art should understand that the present disclosure can also be practiced without certain specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art are not described in detail, so as to highlight the gist of the present disclosure.
First, the application scenario involved in the embodiments of the present disclosure is introduced. Referring to FIG. 1, it shows a schematic architecture diagram of a collaborative processing device for layer normalization calculation provided by an exemplary embodiment of the present disclosure. The device 10 may be implemented as a computing device, and the computing device may be a terminal or a server. The device 10 includes K acceleration units, where K is a positive integer greater than 1, and each acceleration unit may include one layer normalization processing unit; that is, the device 10 may include K layer normalization processing units (i.e., layer normalization processing unit 1, layer normalization processing unit 2, ..., layer normalization processing unit K).
An acceleration unit is a hardware device used to accelerate a specific type of computing task. The acceleration unit may be a Field Programmable Gate Array (FPGA) device, a Graphics Processing Unit (GPU), a Central Processing Unit (CPU), a Neural Processing Unit (NPU), or the like.
Each layer normalization processing unit is used to accelerate the layer normalization operator in neural network computation and to handle the reduction calculation of the layer normalization operator. A reduction calculation is an aggregation operation over a set of data in a neural network, such as summation, averaging, or taking the maximum. In the layer normalization operator, the reduction calculation can be used to compute the mean and variance of the data so that the input data can be normalized.
The embodiments of the present disclosure design a dedicated acceleration unit for the computing requirements of the layer normalization operator, and realize collaborative processing of the layer normalization calculation by multiple nodes (i.e., K layer normalization processing units) through a preprocessing mechanism. When performing multi-node collaborative computing, distributed computing can be achieved by sharing only the respective local accumulation results.
Optionally, each of the K layer normalization processing units is used to obtain input data, where the input data is the local portion of the original data input to that layer normalization processing unit; to preprocess the input data to obtain a local accumulation result, where the local accumulation result includes the accumulation results of parameters related to the input data; to share the local accumulation result with the other layer normalization processing units; and to perform collaborative processing of the layer normalization calculation according to the shared K local accumulation results.
The original data may be data whose length is greater than a preset length threshold. The original data may include N data items and is divided into K groups of input data, each group including a plurality of data items, where N and K are both positive integers greater than 1. For example, if N is 1000 and K is 10, the original data includes 1000 data items and is divided into 10 groups of input data, each group including 100 data items. A data item may be an activation value in a large language model. The embodiments of the present disclosure do not limit this.
Optionally, the K groups of input data are respectively input into the K layer normalization processing units, so that each of the K layer normalization processing units obtains one group of input data. In other words, the original data consists of the input data of the K layer normalization processing units, and there is no intersection between the input data of different layer normalization processing units.
Each layer normalization processing unit preprocesses its input data. The preprocessing includes the accumulation needed by the layer normalization calculation, for example squaring each data item in the input data, accumulating the squares, and counting the number of accumulations of the squares; alternatively, the data items themselves may also be accumulated directly. The local accumulation result is obtained in this way.
The local accumulation result includes the accumulation results of parameters related to the input data. Since the input data of each layer normalization processing unit includes a plurality of data items, optionally, the parameters related to the input data may include a plurality of target parameters; correspondingly, the local accumulation result includes the accumulation sum of the plurality of target parameters and the local accumulation count, where one target parameter is the square of one data item and the local accumulation count is the total number of accumulations of the plurality of target parameters. Optionally, the parameters related to the input data may also include the plurality of data items and the plurality of target parameters, in which case the local accumulation result includes the accumulation sum of the plurality of data items, the accumulation sum of the plurality of target parameters, and the local accumulation count.
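A minimal sketch of this local preprocessing step is shown below, assuming each unit's chunk of the original data is available as a NumPy array; the function name and NumPy-based formulation are illustrative and not part of this disclosure. It produces the accumulation sum of the data items, the accumulation sum of their squares (the target parameters), and the local accumulation count.

```python
import numpy as np

def local_accumulate(chunk: np.ndarray, include_data_sum: bool = True):
    """Preprocess one unit's local input data into its local accumulation result.

    Returns (sum_x, sum_x2, count): sum_x2 is the accumulated target parameters
    (squares of the data items) and count is the local accumulation count.
    For the RMS-LayerNorm variant, sum_x may be omitted.
    """
    x = chunk.astype(np.float64)
    sum_x2 = float(np.sum(x ** 2))                       # accumulate the target parameters
    count = int(chunk.size)                              # local accumulation count
    sum_x = float(np.sum(x)) if include_data_sum else None
    return sum_x, sum_x2, count
```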
Each layer normalization processing unit sends its own local accumulation result to the other layer normalization processing units for sharing, and receives the K-1 remote local accumulation results sent by the other layer normalization processing units, so that each layer normalization processing unit obtains the shared K local accumulation results. The K local accumulation results include the unit's own local accumulation result and the K-1 remote local accumulation results it receives; in other words, the K local accumulation results are shared among the K layer normalization processing units.
The way data is shared among the K layer normalization processing units includes, but is not limited to, the following possible implementations. In one possible implementation, each layer normalization processing unit sends its local accumulation result directly to each of the other K-1 layer normalization processing units. In another possible implementation, the K layer normalization processing units share data through their respective layer normalization parameter routers, and the K layer normalization parameter routers may be connected in a ring routing structure. With the layer normalization parameter routers connected in a ring, each layer normalization processing unit sends its local accumulation result to the next-level layer normalization processing unit, the next-level layer normalization processing unit forwards the received local accumulation result to the layer normalization processing unit after it, and so on, so that the local accumulation result is shared with the other K-1 layer normalization processing units.
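A behavioural sketch of the ring-based sharing described above is given below, with K software nodes standing in for the K layer normalization parameter routers; the function name and data layout are illustrative assumptions. After K-1 forwarding steps, every node holds all K local accumulation results.

```python
def ring_share(local_results):
    """Each node forwards the result it last received to the next node in the ring."""
    k = len(local_results)
    received = [[local_results[i]] for i in range(k)]     # every node starts with its own result
    buffer = list(local_results)                          # value each node forwards next
    for _ in range(k - 1):
        buffer = [buffer[(i - 1) % k] for i in range(k)]  # node i receives from node i-1
        for i in range(k):
            received[i].append(buffer[i])
    return received                                       # K local accumulation results per node
```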
In a possible implementation, as shown in FIG. 2, each layer normalization processing unit includes a layer normalization calculation processing unit and a layer normalization parameter router connected to that layer normalization calculation processing unit. That is, the device 10 includes K layer normalization calculation processing units (i.e., layer normalization calculation processing unit 1, layer normalization calculation processing unit 2, ..., layer normalization calculation processing unit K) and K layer normalization parameter routers (i.e., layer normalization parameter router 1, layer normalization parameter router 2, ..., layer normalization parameter router K). The K layer normalization parameter routers may be connected in a ring routing structure or in another routing structure; the embodiments of the present disclosure do not limit this.
Each layer normalization calculation processing unit is used to obtain input data and to preprocess the input data to obtain a local accumulation result.
Each layer normalization parameter router is used to obtain the parameters (i.e., the local accumulation result) produced by its connected layer normalization calculation processing unit during the reduction calculation, and to send the obtained parameters to the other layer normalization processing units through the other connected layer normalization parameter routers.
Each layer normalization parameter router is also used to receive, through the other connected layer normalization parameter routers, the parameters (i.e., the local accumulation results) produced by the other layer normalization processing units during the reduction calculation. In other words, the parameters produced during the reduction calculation (i.e., the local accumulation results) are shared among the K layer normalization processing units of the device 10; this happens automatically without the involvement of a system-level scheduling controller, which improves the data synchronization efficiency of the device 10 and reduces the computation time.
Each of the K layer normalization processing units is further used to perform collaborative processing of the layer normalization calculation according to the shared K local accumulation results. Optionally, each layer normalization processing unit is further used to accumulate the shared K local accumulation results to obtain a global accumulation result, and to perform, according to the global accumulation result, a reduction calculation on its input data with the layer normalization operator to obtain a calculation result.
The global accumulation result obtained by each layer normalization processing unit is the same. The global accumulation result is the sum of the local accumulation results corresponding to the K groups of input data, and the K groups of input data constitute the original data, i.e., the K groups of input data include N data items in total. In one possible implementation, the global accumulation result may include the accumulation sum of the N target parameters and the total accumulation count N of the N target parameters. In another possible implementation, the global accumulation result includes the accumulation sum of the N data items, the accumulation sum of the N target parameters, and the total accumulation count N of the N target parameters.
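As a small illustrative sketch (assuming the tuple layout used in the earlier local_accumulate sketch), every unit can reduce the K shared local accumulation results into the same global accumulation result:

```python
def global_accumulate(shared_results):
    """Combine K shared local accumulation results into (sum_x, sum_x2, N)."""
    sum_x = sum(r[0] for r in shared_results if r[0] is not None)
    sum_x2 = sum(r[1] for r in shared_results)
    n = sum(r[2] for r in shared_results)
    return sum_x, sum_x2, n
```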
It should be noted that, for details of the reduction calculation performed on the input data with the layer normalization operator to obtain the calculation result, reference may be made to the related details in the following embodiments, which are not introduced here.
For the scenario in which the device 10 computes the layer normalization operator of a neural network, the embodiments of the present disclosure design a scheme that supports multi-node cooperation and parameter sharing. Providing multi-node collaborative computing capability for the layer normalization operator can improve the efficiency of the reduction calculation in the layer normalization operator and improve the computing performance of the computing device.
The neural network may be a deep learning network, a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Long Short-Term Memory (LSTM) network, or the like. For example, the neural network is the neural network of a large language model.
The layer normalization operator is used to perform layer normalization calculation on data, and includes the Layer Normalization (LayerNorm) operator or the Root Mean Square Layer Normalization (RMS-LayerNorm) operator. The data may be data output by any layer of the neural network that is to undergo layer normalization calculation; for example, the data may be the activation values output by a fully connected layer of the neural network. It should be understood that the data may be any data in the neural network that needs layer normalization calculation, which is not limited by the embodiments of the present disclosure.
In summary, the embodiments of the present disclosure design a layer normalization computing architecture in which multiple nodes (i.e., K layer normalization processing units) cooperate, and design a preprocessing mechanism tailored to the requirements of layer normalization calculation. The preprocessing mechanism allows local data processing to be performed within each node (i.e., each layer normalization processing unit). Once the local data processing is completed and the local accumulation result is obtained, the nodes only need to share their respective local accumulation results to realize collaborative processing of the layer normalization calculation, without sharing intermediate data. This achieves distributed computing acceleration and improves the efficiency of layer normalization calculation in neural networks.
In a possible implementation, the architecture of each layer normalization processing unit is shown schematically in FIG. 3. The internal architecture of the layer normalization processing unit may include five parts: a memory unit, a direct memory access unit, a control unit, a layer normalization calculation processing unit and a layer normalization parameter router. The layer normalization calculation processing unit may include three parts: a calculation unit, a preprocessing accumulation unit and a preprocessing calculation unit. According to actual needs, internal units of the layer normalization processing unit may be added or removed, which is not limited by the embodiments of the present disclosure.
When the layer normalization processing unit performs calculation processing, data is moved between the memory unit and the layer normalization calculation processing unit through the direct memory access unit: the direct memory access unit fetches the input data stored in the memory unit, the direct memory access unit dispatches the input data to the layer normalization calculation processing unit, and the layer normalization calculation processing unit obtains the dispatched input data.
After the input data enters the layer normalization calculation processing unit, the input parameters are preprocessed: the preprocessing accumulation unit preprocesses the input data to obtain the local accumulation result, and accumulates the shared K local accumulation results to obtain the global accumulation result. After the accumulation is completed, the preprocessing calculation unit performs parameter calculation according to the global accumulation result to obtain the intermediate parameters of the layer normalization calculation. Optionally, the intermediate parameters may include the mean of the N data items included in the original data.
After the layer normalization preprocessing is completed and the final intermediate parameters of the layer normalization calculation are obtained, the calculation unit performs, according to the intermediate parameters and the input data, a reduction calculation on the input data with the layer normalization operator to obtain the calculation result. During preprocessing, if the calculation needs to be carried out in multiple layer normalization processing units simultaneously, the local accumulation results are shared through the layer normalization parameter routers. The control unit controls the data calculation and data sharing operations of the entire layer normalization processing unit.
The layer normalization operator may be the LayerNorm operator or the RMS-LayerNorm operator; that is, the method provided by the embodiments of the present disclosure can support the calculation of both layer normalization operators. Schematically, the calculation formula of the LayerNorm operator is as follows:
yi = γ · (xi − μ) / √(σ² + ε) + β, where μ = (1/N)·Σ xi and σ² = (1/N)·Σ xi² − μ².
Here xi is the i-th data item in the original data, xi² is the square of the i-th data item, i.e., the i-th target parameter, i is a positive integer ranging from 1 to N, N is a positive integer greater than 1, γ, β and ε are preset parameter values, and μ is the mean of the N data items.
In the LayerNorm operator, the global accumulation result includes the accumulation sum Σ xi of the N data items, the accumulation sum Σ xi² of the N target parameters, and the total accumulation count N of the N target parameters; the intermediate parameters may include μ and σ².
Schematically, the calculation formula of the RMS-LayerNorm operator is as follows:
yi = γ · xi / RMS(x), where RMS(x) = √((1/N)·Σ xi² + ε).
In the RMS-LayerNorm operator, the global accumulation result includes the accumulation sum Σ xi² of the N target parameters and the total accumulation count N of the N target parameters; the intermediate parameters may include the root mean square term RMS(x).
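The following sketch shows how the two operators can be evaluated from the global accumulation result, assuming the standard LayerNorm and RMS-LayerNorm definitions reconstructed above; the function names, NumPy usage and default eps value are illustrative assumptions rather than the patented implementation.

```python
import numpy as np

def layer_norm_from_sums(x, sum_x, sum_x2, n, gamma, beta, eps=1e-5):
    """LayerNorm reduction using the global accumulation result (sum_x, sum_x2, n)."""
    mu = sum_x / n
    var = sum_x2 / n - mu ** 2                  # variance recovered from the two sums
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_layer_norm_from_sums(x, sum_x2, n, gamma, eps=1e-5):
    """RMS-LayerNorm reduction using only the accumulated squares and the count."""
    rms = np.sqrt(sum_x2 / n + eps)
    return gamma * x / rms
```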
When each layer normalization processing unit performs the layer normalization calculation, the input data in the memory unit is first dispatched to the layer normalization calculation processing unit through the direct memory access unit, and the layer normalization calculation processing unit accumulates the input data. For multi-node collaborative computing, the accumulated local accumulation result needs to be passed to the other layer normalization processing units through the parameter sharing network (i.e., the network formed by the K layer normalization parameter routers), and the local accumulation results of the other layer normalization processing units are received and accumulated further to obtain the final global accumulation result. After the accumulation is completed, the global accumulation result is sent to the preprocessing calculation unit, where parameter calculation is performed to obtain the intermediate parameters of the layer normalization calculation. After the parameter calculation is completed, the intermediate parameters are written into the parameter register of the preprocessing calculation unit, completing the preprocessing operation. The input data is then read from the memory unit again and, according to the intermediate parameters obtained by preprocessing, the reduction calculation is performed with the layer normalization operator to obtain the calculation result, which is written back to the memory unit through the direct memory access unit.
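Putting the preceding sketches together, a hedged end-to-end simulation of the flow just described (local accumulation, sharing over the parameter network, global accumulation, intermediate parameters, reduction and write-back) might look as follows. It reuses the illustrative helpers local_accumulate, ring_share, global_accumulate and layer_norm_from_sums defined above and is not the patented hardware flow itself.

```python
import numpy as np

def cooperative_layer_norm(chunks, gamma=1.0, beta=0.0, eps=1e-5):
    """Simulate K units cooperating on one LayerNorm over the concatenated data."""
    locals_ = [local_accumulate(c) for c in chunks]         # per-unit preprocessing
    shared_per_unit = ring_share(locals_)                   # parameter sharing network
    outputs = []
    for unit, chunk in enumerate(chunks):
        sum_x, sum_x2, n = global_accumulate(shared_per_unit[unit])
        outputs.append(layer_norm_from_sums(chunk, sum_x, sum_x2, n, gamma, beta, eps))
    return outputs                                          # each unit writes back its slice

# Example mirroring the N=1000, K=10 split described earlier.
x = np.random.randn(1000)
y = np.concatenate(cooperative_layer_norm(np.split(x, 10)))
```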
In a possible implementation, the preprocessing accumulation unit can be divided into two parts. The first part is responsible for the accumulation and sharing of the plurality of data items included in the input data for the LayerNorm operator; the second part is responsible for the accumulation of the target parameters (i.e., the squares of the data items) for the LayerNorm operator or the RMS-LayerNorm operator, the calculation of the total accumulation count N, and the sharing. The two parts are introduced separately below.
The first part is responsible for the accumulation and sharing of the data items in the LayerNorm operator. Referring to FIG. 4, it shows a schematic architecture diagram of a preprocessing accumulation unit provided by an exemplary embodiment of the present disclosure; this architecture can be implemented as a part of the preprocessing accumulation unit. The internal architecture of the preprocessing accumulation unit may include the following parts: a selector, an accumulation unit, a local counter, an accumulation completion unit, a wait-for-remote unit and a remote completion unit. According to actual needs, internal units of the preprocessing accumulation unit may be added or removed, which is not limited by the embodiments of the present disclosure.
在预处理累加单元进行累加处理时,有两种不同的处理方式:局部累加处理和全局累加处理。局部累加处理用于处理本地的参数累加,全局累加处理用于计算不同层归一化处理单元之间的参数累加。当进行局部累加处理时,通过选择器将输入数据发送至累加单元;通过累加单元对输入数据中包括的多个数据进行累加得到局部累加结果,并通过本地计数器统计局部累加次数,局部累加次数等于本地的输入数据中包括的数据总数量。也就是说,每进行一次累加操作,本地计数器会进行一次累加计数,当完成最后一次累加操作,即累加得到局部累加结果时,通过累加完成单元产生第一有效信号,第一有效信号用于指示累加单元将局部累加结果输出至对应的层归一化参数路由器,此时累加单元中的局部累加结果会通过数据通道输出至对应的层归一化参数路由器,通过层归一化参数路由器共享至其他层归一化处理单元使用。When the preprocessing accumulation unit performs accumulation processing, there are two different processing methods: local accumulation processing and global accumulation processing. Local accumulation processing is used to process local parameter accumulation, and global accumulation processing is used to calculate parameter accumulation between different layer normalization processing units. When performing local accumulation processing, the input data is sent to the accumulation unit through the selector; the accumulation unit accumulates multiple data included in the input data to obtain a local accumulation result, and the local accumulation number is counted by the local counter, and the local accumulation number is equal to the total number of data included in the local input data. In other words, each time an accumulation operation is performed, the local counter will perform an accumulation count. When the last accumulation operation is completed, that is, the local accumulation result is accumulated, the accumulation completion unit generates a first valid signal, and the first valid signal is used to instruct the accumulation unit to output the local accumulation result to the corresponding layer normalization parameter router. At this time, the local accumulation result in the accumulation unit will be output to the corresponding layer normalization parameter router through the data channel, and shared to other layer normalization processing units through the layer normalization parameter router.
当进行全局累加处理时，在上述的局部累加和共享处理之后，通过本地计数器控制等待远程单元产生第二有效信号，第二有效信号会作用在选择器上，第二有效信号用于指示选择器将接收到的其他层归一化处理单元共享的局部累加结果发送至累加单元。可选的，选择器的状态包括第一状态和第二状态，第一状态不同于第二状态，当选择器处于默认的第一状态时，选择器用于通过第一数据通道接收输入数据；在选择器接收到第二有效信号的情况下，将选择器的状态从第一状态切换为第二状态，当选择器处于第二状态时，选择器用于通过第二数据通道接收其他层归一化处理单元共享的局部累加结果，第一数据通道不同于第二数据通道。本公开实施例对选择器的设置方式不加以限定。When global accumulation processing is performed, after the above local accumulation and sharing, the local counter controls the waiting-remote unit to generate a second valid signal, which acts on the selector. The second valid signal is used to instruct the selector to send the received local accumulation results shared by other layer normalization processing units to the accumulation unit. Optionally, the states of the selector include a first state and a second state, and the first state is different from the second state. When the selector is in the default first state, the selector receives input data through a first data channel; when the selector receives the second valid signal, the state of the selector is switched from the first state to the second state. When the selector is in the second state, the selector receives, through a second data channel, the local accumulation results shared by other layer normalization processing units, and the first data channel is different from the second data channel. The embodiments of the present disclosure do not limit the way the selector is configured.
当累加单元接收到局部累加结果时，将接收到的局部累加结果与本地的局部累加结果再次进行累加操作。当完成最后一次累加，即K个局部累加结果的累加完成得到全局累加结果时，通过远程完成单元控制累加完成单元产生第三有效信号，第三有效信号用于指示累加单元将全局累加结果传输至预处理计算单元进行计算，此时累加单元中的全局累加结果通过数据通道传输到后一级的预处理计算单元中进行下一步计算处理。然后，通过累加完成单元对等待远程单元、本地计数器和累加单元进行复位清零，以便于进行下一次的计算处理。至此，LayerNorm算子中输入数据所包括的多个数据的累加以及共享处理完成。整个过程都不需要系统级的调度控制器参与，多个层归一化处理单元可以分布式协同工作。When the accumulation unit receives a shared local accumulation result, it accumulates the received local accumulation result with the locally obtained local accumulation result. When the last accumulation is completed, that is, when the accumulation of the K local accumulation results is completed and the global accumulation result is obtained, the remote completion unit controls the accumulation completion unit to generate a third valid signal, which instructs the accumulation unit to transmit the global accumulation result to the preprocessing calculation unit for calculation; at this time, the global accumulation result in the accumulation unit is transmitted through the data channel to the preprocessing calculation unit of the next stage for the next step of calculation. Then, the accumulation completion unit resets and clears the waiting-remote unit, the local counter, and the accumulation unit, in preparation for the next round of calculation. At this point, the accumulation and sharing of the multiple data included in the input data of the LayerNorm operator are completed. The entire process does not require the participation of a system-level scheduling controller, and multiple layer normalization processing units can work collaboratively in a distributed manner.
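For readers who want a concrete picture of this control flow, the following is a minimal behavioral sketch in Python rather than the patented circuit: the class and method names are invented for illustration, and the first, second and third valid signals are modeled simply as points in the control flow.

```python
# Behavioral sketch of the first part of the preprocessing accumulation unit.
# All names are illustrative; hardware signals are modeled as function calls.
class PreprocAccumulator:
    def __init__(self, local_count):
        self.local_count = local_count   # expected number of local inputs
        self.acc = 0                     # accumulation register
        self.counter = 0                 # local counter

    def local_accumulate(self, inputs):
        """Local accumulation: sum the data of the local input block."""
        for x in inputs:
            self.acc += x
            self.counter += 1
        assert self.counter == self.local_count   # last local accumulation done
        return self.acc                  # "first valid signal": share via the router

    def global_accumulate(self, remote_results):
        """Global accumulation: fold in the K-1 shared local results."""
        for r in remote_results:         # selector switched by the "second valid signal"
            self.acc += r
        global_result = self.acc         # "third valid signal": pass to preprocessing calc
        self.acc, self.counter = 0, 0    # reset for the next round of calculation
        return global_result
```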
第二部分负责LayerNorm算子或RMS-LayerNorm算子中目标参数的累加、累加总次数N的计算以及共享处理,第一部分和第二部分可以是彼此不连接的或者连接的两个部分。本公开实施例对此不加以限定。请参考图5,其示出了本公开另一个示例性实施例提供的预处理累加单元的架构示意图,该架构可以实现成为预处理累加单元的一部分。预处理累加单元的内部架构可以包括如下几个部分,分别为选择器、累加单元、本地计数器、元素计数器、累加完成单元、等待远程单元和远程完成单元。根据实际需要,预处理累加单元的内部单元可以增加或减少,本公开实施例对此不加以限定。The second part is responsible for the accumulation of target parameters in the LayerNorm operator or the RMS-LayerNorm operator, the calculation of the total number of accumulations N, and the shared processing. The first part and the second part can be two parts that are not connected to each other or connected. The embodiments of the present disclosure are not limited to this. Please refer to Figure 5, which shows an architectural schematic diagram of a preprocessing accumulation unit provided by another exemplary embodiment of the present disclosure, and the architecture can be implemented as a part of the preprocessing accumulation unit. The internal architecture of the preprocessing accumulation unit may include the following parts, namely, a selector, an accumulation unit, a local counter, an element counter, an accumulation completion unit, a waiting remote unit, and a remote completion unit. According to actual needs, the internal units of the preprocessing accumulation unit can be increased or decreased, and the embodiments of the present disclosure are not limited to this.
在预处理累加单元进行目标参数的累加处理的情况下，先将输入数据(包括多个数据)送入平方处理单元进行多个数据的平方计算得到多个目标参数，再通过选择器将多个目标参数发送至累加单元，通过累加单元对多个目标参数进行累加得到局部累加结果。目标参数的累加处理也分为局部累加处理和全局累加处理。目标参数的累加计算过程可类比参考上述数据的累加计算流程。在层归一化计算中，需要知道共有多少个数据参与计算，因此需要统计累加操作发生的次数，由于数据和目标参数的累加次数相同，且目标参数的累加计算更慢，因此在计算过程中仅在目标参数的累加计算过程中统计累加操作次数。可选的，在计算的同时还可以统计参与计算的数据总数量即原始数据中包括的数据总数量N。在累加单元中增加了元素计数器，通过元素计数器统计原始数据中包括的数据总数量，也就是全局累加次数。当累加单元进行局部累加处理时，每进行一次累加操作，本地计数器对元素计数器进行一次计数操作。局部累加处理完成后，若需要累加其他层归一化处理单元的远程的局部累加结果，则远程的局部累加结果在输入选择器的同时对元素计数器进行计数操作，当累加处理完全结束后，元素计数器将记录的数据总数量输出至层归一化参数路由器，通过层归一化参数路由器共享至其他层归一化处理单元使用。When the preprocessing accumulation unit performs accumulation processing of the target parameters, the input data (including multiple data) is first sent to the square processing unit, where each datum is squared to obtain multiple target parameters; the multiple target parameters are then sent to the accumulation unit through the selector, and the accumulation unit accumulates them to obtain a local accumulation result. The accumulation processing of the target parameters is also divided into local accumulation processing and global accumulation processing, and the accumulation calculation process of the target parameters can be understood by analogy with the accumulation calculation process of the data described above. In the layer normalization calculation, it is necessary to know how many data participate in the calculation, so the number of accumulation operations needs to be counted. Since the data and the target parameters are accumulated the same number of times, and the accumulation calculation of the target parameters is slower, the number of accumulation operations is counted only during the accumulation calculation of the target parameters. Optionally, the total number of data participating in the calculation, that is, the total number N of data included in the original data, can also be counted during the calculation. An element counter is added to the accumulation unit, and the element counter counts the total number of data included in the original data, that is, the global accumulation count. During local accumulation processing, each accumulation operation causes the local counter to trigger one counting operation of the element counter. After the local accumulation processing is completed, if the remote local accumulation results of other layer normalization processing units need to be accumulated, the element counter is also counted as each remote local accumulation result enters the selector. When the accumulation processing is completely finished, the element counter outputs the recorded total number of data to the layer normalization parameter router, through which it is shared with other layer normalization processing units.
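A similar sketch for the second part, assuming the squares are computed with NumPy; the function name and return convention are illustrative only.

```python
import numpy as np

def local_square_accumulate(x):
    # Square each datum (the target parameter) and accumulate the squares;
    # the element counter tracks how many data contributed locally.
    x = np.asarray(x, dtype=np.float64)
    sum_sq = float(np.sum(x * x))   # local accumulation of the target parameters
    n_local = int(x.size)           # element count after local accumulation
    return sum_sq, n_local          # both values are shared via the parameter router
```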
需要说明的是,对于不需要进行多个层归一化处理单元协同计算的操作,单层归一化处理单元内的计算流程完全相同,但是计算完成后,数据不需要共享,同时层归一化处理单元也不需要等待远程数据,在局部累加处理完成后,直接输出计算结果,用于下一步的参数计算。It should be noted that for operations that do not require collaborative calculations of multiple layer normalization processing units, the calculation process within a single layer normalization processing unit is exactly the same, but after the calculation is completed, the data does not need to be shared, and the layer normalization processing unit does not need to wait for remote data. After the local accumulation processing is completed, the calculation results are directly output for the next step of parameter calculation.
每个层归一化处理单元包括一个层归一化参数路由器,对于每个层归一化处理单元,通过层归一化参数路由器将局部累加结果发送至其他层归一化处理单元进行共享;对于每个层归一化处理单元,通过层归一化参数路由器接收其他层归一化处理单元发送的K-1个局部累加结果。K个层归一化参数路由器单元可以构成一个路由环路。以其中的一个层归一化参数路由器单元为例进行说明。在一种可能的实现方式中,请参考图6,其示出了本公开一个示例性实施例提供的层归一化参数路由器的架构示意图,该架构可以实现成为层归一化参数路由器的一部分。层归一化参数路由器的内部架构可以包括如下几个部分,分别为远程参数队列单元、远程参数寄存器、本地参数寄存器、选择器、本地寄存器和远程寄存器。根据实际需要,层归一化参数路由器的内部单元可以增加或减少,本公开实施例对此不加以限定。Each layer normalization processing unit includes a layer normalization parameter router. For each layer normalization processing unit, the local accumulation result is sent to other layer normalization processing units for sharing through the layer normalization parameter router; for each layer normalization processing unit, the K-1 local accumulation results sent by other layer normalization processing units are received through the layer normalization parameter router. K layer normalization parameter router units can constitute a routing loop. Take one of the layer normalization parameter router units as an example for explanation. In a possible implementation, please refer to Figure 6, which shows an architectural schematic diagram of a layer normalization parameter router provided by an exemplary embodiment of the present disclosure, and the architecture can be implemented as a part of the layer normalization parameter router. The internal architecture of the layer normalization parameter router may include the following parts, namely, a remote parameter queue unit, a remote parameter register, a local parameter register, a selector, a local register and a remote register. According to actual needs, the internal units of the layer normalization parameter router can be increased or decreased, and the embodiment of the present disclosure is not limited to this.
当进行工作时，上一级层归一化计算处理单元发送的局部累加结果(K-1个局部累加结果中的每个局部累加结果)进入远程参数队列单元，通过远程参数队列单元接收其他层归一化处理单元发送的局部累加结果，将局部累加结果寄存在远程参数寄存器中。通过广播计数单元标记当前的层归一化参数路由器的索引，索引用于在K个层归一化参数路由器中唯一标识当前的层归一化参数路由器，即索引用于指示是哪一级层归一化计算处理单元的层归一化参数路由器，最小值可以为1，最大值可以为K。通过远程参数寄存器将局部累加结果通过目标数据通道输入到层归一化计算处理单元，目标数据通道为层归一化参数路由器与层归一化计算处理单元之间的数据通道。In operation, each local accumulation result sent by the previous-stage layer normalization calculation processing unit (each of the K-1 local accumulation results) enters the remote parameter queue unit; that is, the local accumulation results sent by other layer normalization processing units are received through the remote parameter queue unit and are stored in the remote parameter register. A broadcast counting unit marks the index of the current layer normalization parameter router; the index uniquely identifies the current layer normalization parameter router among the K layer normalization parameter routers, that is, it indicates to which stage of layer normalization calculation processing unit the router belongs, with a minimum value of 1 and a maximum value of K. The local accumulation result in the remote parameter register is input to the layer normalization calculation processing unit through the target data channel, which is the data channel between the layer normalization parameter router and the layer normalization calculation processing unit.
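To illustrate how such a ring can deliver all K-1 remote results to every unit, the sketch below shifts the values one hop per step for K-1 steps; it models only the data movement, not the queueing and register details described above.

```python
# Ring all-gather sketch: after K-1 shift steps every router has seen the
# local results of all other routers. Purely illustrative.
def ring_allgather(local_values):
    k = len(local_values)
    received = [[] for _ in range(k)]
    in_flight = list(local_values)          # value currently held at each router
    for _ in range(k - 1):
        in_flight = [in_flight[(i - 1) % k] for i in range(k)]  # one hop around the ring
        for i in range(k):
            received[i].append(in_flight[i])
    return received                         # K-1 remote results per unit

assert ring_allgather([10, 20, 30]) == [[30, 20], [10, 30], [20, 10]]
```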
可选的,通过本地参数寄存器寄存当前层归一化处理单元的局部累加结果;通过数据选择器将局部累加结果发送至其他层归一化处理单元的层归一化参数路由器进行共享。Optionally, the local accumulation result of the normalization processing unit of the current layer is stored in a local parameter register; and the local accumulation result is sent to the layer normalization parameter router of other layer normalization processing units for sharing through a data selector.
本地寄存器用于指示远程参数寄存器的状态,远程寄存器用于指示本地参数寄存器的状态,本地寄存器和远程寄存器的初始值均为默认的第二数值,比如第二数值为0。当远程参数寄存器的局部累加结果输入到层归一化计算处理单元中时,将本地寄存器的值设置为第一数值;当本地参数寄存器的局部累加结果通过数据选择器发送至其他层归一化参数路由器时,将远程寄存器的值设置为第一数值;当本地寄存器和远程寄存器的值均为第一数值时,将远程参数寄存器清零,第一数值不同于第二数值,比如第一数值为1。The local register is used to indicate the state of the remote parameter register, and the remote register is used to indicate the state of the local parameter register. The initial values of the local register and the remote register are both the default second value, such as 0. When the local accumulation result of the remote parameter register is input into the layer normalization calculation processing unit, the value of the local register is set to the first value; when the local accumulation result of the local parameter register is sent to other layer normalization parameter routers through the data selector, the value of the remote register is set to the first value; when the values of the local register and the remote register are both the first value, the remote parameter register is cleared, and the first value is different from the second value, such as 1.
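The two flags can be read as a small handshake that clears the remote parameter register only after its content has been consumed locally and the local result has been sent out. A sketch of that rule follows; resetting the flags after the clear is an assumption added here so the example can run repeated rounds.

```python
# Handshake sketch for the parameter router status registers (illustrative names).
class RouterHandshake:
    def __init__(self):
        self.local_reg = 0        # default second value (0)
        self.remote_reg = 0
        self.remote_param = None  # remote parameter register

    def on_remote_result_consumed(self):    # remote result delivered to the compute unit
        self.local_reg = 1                  # set to the first value
        self._maybe_clear()

    def on_local_result_sent(self):         # local result sent to the other routers
        self.remote_reg = 1                 # set to the first value
        self._maybe_clear()

    def _maybe_clear(self):
        if self.local_reg == 1 and self.remote_reg == 1:
            self.remote_param = None        # clear the remote parameter register
            self.local_reg = self.remote_reg = 0   # assumed reset for the next round
```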
数据在不同层归一化参数路由器之间进行传输时,数据链路宽度可以根据资源需求进行配置,可选的,预先配置层归一化参数路由器之间的路由通路的数据链路宽度小于预设宽度阈值,比如数据链路宽度小于待传输的局部累加结果的数据位宽,从而减少资源开销。When data is transmitted between normalized parameter routers at different layers, the data link width can be configured according to resource requirements. Optionally, the data link width of the routing path between the pre-configured layer normalized parameter routers is smaller than a preset width threshold, such as the data link width is smaller than the data bit width of the local accumulation result to be transmitted, thereby reducing resource overhead.
如果数据链路宽度小于数据位宽,则在输出数据时,可以对数据进行移位输出,可以将一个数据转化为包括若干个微片(英文:flit)的数据包。在本公开实施例中,可以将待传输的局部累加结果转化为包括若干个微片的数据包进行传输,即局部累加结果是以数据包(数据包中包括多个微片)的形式在层归一化参数路由器之间进行传输的,在接收到共享的局部累加结果时,通过远程参数队列单元接收其他层归一化处理单元发送的数据包,通过远程参数队列单元将数据包中的多个微片进行拼接,得到共享的局部累加结果。If the data link width is smaller than the data bit width, the data can be shifted and output when outputting data, and one data can be converted into a data packet including several flits. In the disclosed embodiment, the local accumulation result to be transmitted can be converted into a data packet including several flits for transmission, that is, the local accumulation result is transmitted between layer normalization parameter routers in the form of a data packet (the data packet includes multiple flits). When receiving the shared local accumulation result, the data packet sent by other layer normalization processing units is received through the remote parameter queue unit, and the multiple flits in the data packet are spliced through the remote parameter queue unit to obtain the shared local accumulation result.
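Assuming, purely for illustration, a 32-bit accumulation result and an 8-bit link, the packing into flits and the splicing on reception can be sketched with integer shifts.

```python
# Split a wide value into flits matching a narrower link, then splice them back.
def to_flits(value, data_width=32, flit_width=8):
    mask = (1 << flit_width) - 1
    n_flits = (data_width + flit_width - 1) // flit_width
    return [(value >> (i * flit_width)) & mask for i in range(n_flits)]

def from_flits(flits, flit_width=8):
    value = 0
    for i, flit in enumerate(flits):
        value |= flit << (i * flit_width)
    return value

assert from_flits(to_flits(0x1234ABCD)) == 0x1234ABCD
```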
需要说明的是,上述实施例提供的装置在实现其功能时,仅以上述各个功能模块的划分进行举例说明,实际应用中,可以根据实际需要而将上述功能分配由不同的功能模块完成,即将设备的内容结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。It should be noted that the device provided in the above embodiment only uses the division of the above-mentioned functional modules as an example to implement its functions. In actual applications, the above-mentioned functions can be assigned to different functional modules according to actual needs, that is, the content structure of the device can be divided into different functional modules to complete all or part of the functions described above.
相关技术中,多个计算节点在进行归约计算时,需要将原始数据搬运到一起,进行归约操作。原始数据的搬运会造成较高的延迟和较大的功耗开销,不利于提升加速器的计算效率。并且,对于归约计算,需要多个计算节点与总控制器进行通信以及同步,会增加计算延迟,降低计算效率。并且,多个计算节点间的数据共享往往使用相同的数据通路,会产生数据传输冲突,降低传输效率。而本公开实施例针对层归一化算子的归约计算设计了专门的预处理累加单元,预处理累加单元可以高效地进行层归一化算子中累加的操作。另一方面,预处理中的累加操作同时支持单节点(即单个层归一化处理单元)的归约计算和多节点(即多个层归一化处理单元)的协同归约计算。在进行层归一化计算时多个层归一化处理单元的归约计算可以协同工作,同步操作在层归一化计算处理单元内自动完成,无需系统控制器介入同步操作。另一方面,对于层归一化算子的协同归约计算,可以在每个层归一化处理单元内部进行局部数据处理,当局部数据处理完成后,仅共享局部累加结果,无需共享局部数据处理过程中计算的中间数据,实现了分布式计算加速。另一方面,对于数据共享设计了专用的数据通路,局部累加结果的共享使用专用的数据通路(层归一化参数路由器之间的路由通路),提升了数据共享效率,减少了数据传输的冲突。并且,数据共享的通路可以配置不同的链路宽度,节省电路开销。In the related art, when multiple computing nodes perform reduction calculations, it is necessary to move the original data together to perform reduction operations. The movement of original data will cause high latency and high power consumption, which is not conducive to improving the computing efficiency of the accelerator. In addition, for reduction calculations, multiple computing nodes are required to communicate and synchronize with the master controller, which will increase the computing delay and reduce the computing efficiency. In addition, data sharing between multiple computing nodes often uses the same data path, which will cause data transmission conflicts and reduce transmission efficiency. However, the embodiment of the present disclosure designs a special pre-processing accumulation unit for the reduction calculation of the layer normalization operator, and the pre-processing accumulation unit can efficiently perform the accumulation operation in the layer normalization operator. On the other hand, the accumulation operation in the pre-processing simultaneously supports the reduction calculation of a single node (i.e., a single layer normalization processing unit) and the collaborative reduction calculation of multiple nodes (i.e., multiple layer normalization processing units). When performing layer normalization calculations, the reduction calculations of multiple layer normalization processing units can work together, and the synchronization operation is automatically completed in the layer normalization calculation processing unit, without the need for the system controller to intervene in the synchronization operation. On the other hand, for the collaborative reduction calculation of the layer normalization operator, local data processing can be performed inside each layer normalization processing unit. When the local data processing is completed, only the local accumulation results are shared, and there is no need to share the intermediate data calculated during the local data processing process, thus achieving distributed computing acceleration. On the other hand, a dedicated data path is designed for data sharing, and the sharing of local accumulation results uses a dedicated data path (the routing path between the layer normalization parameter routers), which improves data sharing efficiency and reduces data transmission conflicts. In addition, the data sharing path can be configured with different link widths to save circuit overhead.
以下为本公开实施例的方法实施例,对于方法实施例中未详细阐述的部分,可以参考上述装置实施例中公开的技术细节。The following is a method embodiment of the present disclosure. For parts not described in detail in the method embodiment, reference may be made to the technical details disclosed in the above-mentioned device embodiment.
请参考图7,其示出了本公开一个示例性实施例提供的层归一化计算的协同处理方法的流程图,本实施例以该方法用于上述的层归一化计算的协同处理装置中来举例说明。该方法包括以下几个步骤。Please refer to Figure 7, which shows a flowchart of a method for co-processing layer normalization calculation provided by an exemplary embodiment of the present disclosure. This embodiment uses the method used in the above-mentioned co-processing device for layer normalization calculation as an example. The method includes the following steps.
步骤701,K个层归一化处理单元中的每个层归一化处理单元获取输入数据,输入数据为原始数据中输入至层归一化处理单元的局部数据,K为大于1的正整数。Step 701: Each of the K layer normalization processing units obtains input data, where the input data is local data in the original data input to the layer normalization processing unit, and K is a positive integer greater than 1.
步骤702,每个层归一化处理单元对输入数据进行预处理得到局部累加结果,局部累加结果包括与输入数据相关的参数的累加结果。Step 702: Each layer normalization processing unit preprocesses the input data to obtain a local accumulation result, where the local accumulation result includes an accumulation result of parameters related to the input data.
步骤703,每个层归一化处理单元将局部累加结果共享至其他层归一化处理单元。Step 703: Each layer normalization processing unit shares the local accumulation result with other layer normalization processing units.
步骤704,每个层归一化处理单元根据共享的K个局部累加结果,进行层归一化计算的协同处理。Step 704: Each layer normalization processing unit performs collaborative processing of layer normalization calculations based on the shared K local accumulation results.
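Steps 701 to 704 can be checked numerically with the sketch below. It assumes the standard LayerNorm statistics (mean and variance over all N elements) and omits the scale and bias parameters; the split into K slices is arbitrary.

```python
import numpy as np

def collaborative_layernorm(slices, eps=1e-5):
    # Step 702: each unit preprocesses its own slice of the original data.
    local_sums    = [float(np.sum(s))     for s in slices]
    local_sq_sums = [float(np.sum(s * s)) for s in slices]
    local_counts  = [s.size               for s in slices]
    # Steps 703/704: after sharing, every unit derives the same global parameters.
    n    = sum(local_counts)
    mean = sum(local_sums) / n
    var  = sum(local_sq_sums) / n - mean ** 2
    # Each unit normalizes only its own slice with the shared parameters.
    return [(s - mean) / np.sqrt(var + eps) for s in slices]

x = np.random.randn(1024)
parts = np.array_split(x, 4)                         # K = 4 processing units
out = np.concatenate(collaborative_layernorm(parts))
ref = (x - x.mean()) / np.sqrt(x.var() + 1e-5)
assert np.allclose(out, ref)
```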
在一种可能的实现方式中,每个层归一化处理单元包括层归一化参数路由器,每个层归一化处理单元将局部累加结果共享至其他层归一化处理单元,包括:In a possible implementation, each layer normalization processing unit includes a layer normalization parameter router, and each layer normalization processing unit shares the local accumulation result with other layer normalization processing units, including:
对于每个层归一化处理单元,通过层归一化参数路由器将局部累加结果发送至其他层归一化处理单元进行共享;For each layer normalization processing unit, the local accumulation result is sent to other layer normalization processing units for sharing through the layer normalization parameter router;
该方法还包括:The method further includes:
对于每个层归一化处理单元,通过层归一化参数路由器接收其他层归一化处理单元发送的K-1个局部累加结果。For each layer normalization processing unit, K-1 local accumulation results sent by other layer normalization processing units are received through the layer normalization parameter router.
在另一种可能的实现方式中,K个层归一化参数路由器之间采用环形路由结构连接。In another possible implementation, the K layer normalization parameter routers are connected using a ring routing structure.
在另一种可能的实现方式中,每个层归一化处理单元包括存储器单元、直接存储器访问单元、层归一化计算处理单元;该方法还包括:In another possible implementation, each layer normalization processing unit includes a memory unit, a direct memory access unit, and a layer normalization calculation processing unit; the method further includes:
对于每个层归一化处理单元,通过直接存储器访问单元获取存储器单元中存储的输入数据;For each layer normalization processing unit, the input data stored in the memory unit is obtained through the direct memory access unit;
通过直接存储器访问单元将输入数据调度至层归一化计算处理单元;dispatching input data to a layer normalization calculation processing unit via a direct memory access unit;
对于每个层归一化处理单元,通过层归一化计算处理单元获取调度的输入数据。For each layer normalization processing unit, the scheduled input data is obtained through the layer normalization calculation processing unit.
在另一种可能的实现方式中,每个层归一化计算处理单元包括计算单元、预处理累加单元和预处理计算单元;In another possible implementation, each layer normalization calculation processing unit includes a calculation unit, a preprocessing accumulation unit, and a preprocessing calculation unit;
每个层归一化处理单元对输入数据进行预处理得到局部累加结果,包括:Each layer normalization processing unit preprocesses the input data to obtain local accumulation results, including:
对于每个层归一化处理单元,通过预处理累加单元对输入数据进行预处理得到局部累加结果;For each layer normalization processing unit, the input data is preprocessed by the preprocessing accumulation unit to obtain a local accumulation result;
每个层归一化处理单元根据共享的K个局部累加结果,进行层归一化计算的协同处理,包括:Each layer normalization processing unit performs collaborative processing of layer normalization calculation based on the shared K local accumulation results, including:
对于每个层归一化处理单元,通过预处理累加单元将共享的K个局部累加结果进行累加得到全局累加结果;For each layer normalization processing unit, the shared K local accumulation results are accumulated through the preprocessing accumulation unit to obtain the global accumulation result;
在累加完成后,通过预处理计算单元根据全局累加结果进行参数计算得到层归一化计算的中间参数;After the accumulation is completed, the preprocessing calculation unit performs parameter calculation based on the global accumulation result to obtain the intermediate parameters of the layer normalization calculation;
通过计算单元根据中间参数和输入数据,采用层归一化算子进行规约计算得到计算结果。The calculation unit uses a layer normalization operator to perform reduction calculations according to intermediate parameters and input data to obtain calculation results.
在另一种可能的实现方式中,预处理累加单元包括选择器、累加单元和本地计数器,局部累加结果包括参数累加和以及局部累加次数,通过预处理累加单元对输入数据进行预处理得到局部累加结果,包括:In another possible implementation, the preprocessing accumulation unit includes a selector, an accumulation unit, and a local counter, the local accumulation result includes a parameter accumulation sum and a local accumulation number, and the preprocessing accumulation unit preprocesses the input data to obtain the local accumulation result, including:
通过选择器将输入数据发送至累加单元;Send the input data to the accumulation unit through the selector;
通过累加单元对与输入数据相关的参数进行累加得到参数累加和,并通过本地计数器对与输入数据相关的参数的累加次数进行统计得到局部累加次数。The parameters related to the input data are accumulated by the accumulation unit to obtain the parameter accumulation sum, and the accumulation times of the parameters related to the input data are counted by the local counter to obtain the local accumulation times.
在另一种可能的实现方式中,预处理累加单元还包括累加完成单元,该方法还包括:In another possible implementation, the preprocessing accumulation unit further includes an accumulation completion unit, and the method further includes:
当累加得到局部累加结果时,通过累加完成单元产生第一有效信号,第一有效信号用于指示累加单元将局部累加结果输出至对应的层归一化参数路由器。When a local accumulation result is obtained by accumulation, a first valid signal is generated by the accumulation completion unit, and the first valid signal is used to instruct the accumulation unit to output the local accumulation result to the corresponding layer normalization parameter router.
在另一种可能的实现方式中,预处理累加单元还包括等待远程单元,该方法还包括:In another possible implementation, the preprocessing accumulation unit further includes waiting for the remote unit, and the method further includes:
通过本地计数器控制等待远程单元产生第二有效信号，第二有效信号用于指示选择器将接收到的其他层归一化计算处理单元共享的局部累加结果发送至累加单元；The local counter controls the waiting-remote unit to generate a second valid signal, where the second valid signal is used to instruct the selector to send the received local accumulation results shared by other layer normalization calculation processing units to the accumulation unit;
通过预处理累加单元将共享的K个局部累加结果进行累加得到全局累加结果,包括:The shared K local accumulation results are accumulated through the preprocessing accumulation unit to obtain a global accumulation result, including:
通过累加单元将接收到的K-1个局部累加结果与本地的局部累加结果进行累加得到全局累加结果。The accumulation unit accumulates the received K-1 local accumulation results with the locally obtained local accumulation result to obtain the global accumulation result.
在另一种可能的实现方式中,预处理累加单元还包括远程完成单元,该方法还包括:In another possible implementation, the preprocessing accumulation unit further includes a remote completion unit, and the method further includes:
当累加得到全局累加结果时,通过远程完成单元控制累加完成单元产生第三有效信号,第三有效信号用于指示累加单元将全局累加结果传输至预处理计算单元进行计算;When a global accumulation result is accumulated, the accumulation completion unit is controlled by the remote completion unit to generate a third valid signal, and the third valid signal is used to instruct the accumulation unit to transmit the global accumulation result to the preprocessing calculation unit for calculation;
通过累加完成单元对等待远程单元、本地计数器和累加单元进行复位清零。The waiting remote unit, the local counter and the accumulation unit are reset and cleared by the accumulation completion unit.
在另一种可能的实现方式中,预处理累加单元还包括元素计数器,该方法还包括:In another possible implementation, the preprocessing accumulation unit further includes an element counter, and the method further includes:
通过元素计数器统计原始数据中包括的数据总数量。The total number of data included in the original data is counted by the element counter.
在另一种可能的实现方式中,层归一化参数路由器包括远程参数队列单元和远程参数寄存器;通过层归一化参数路由器接收其他层归一化处理单元发送的K-1个局部累加结果,包括:In another possible implementation, the layer normalization parameter router includes a remote parameter queue unit and a remote parameter register; receiving K-1 local accumulation results sent by other layer normalization processing units through the layer normalization parameter router includes:
对K-1个局部累加结果中的每个局部累加结果，通过远程参数队列单元接收其他层归一化处理单元发送的局部累加结果；For each of the K-1 local accumulation results, the local accumulation result sent by another layer normalization processing unit is received through the remote parameter queue unit;
该方法还包括:The method further includes:
将局部累加结果寄存在远程参数寄存器中;Store the local accumulation result in the remote parameter register;
通过远程参数寄存器将局部累加结果通过目标数据通道输入到层归一化计算处理单元,目标数据通道为层归一化参数路由器与层归一化计算处理单元之间的数据通道。The local accumulation result is input to the layer normalization calculation processing unit through the target data channel via the remote parameter register. The target data channel is the data channel between the layer normalization parameter router and the layer normalization calculation processing unit.
在另一种可能的实现方式中,层归一化参数路由器还包括本地参数寄存器和数据选择器,通过层归一化参数路由器将局部累加结果发送至其他层归一化处理单元进行共享,包括:In another possible implementation, the layer normalization parameter router further includes a local parameter register and a data selector, and the local accumulation result is sent to other layer normalization processing units for sharing through the layer normalization parameter router, including:
通过本地参数寄存器寄存当前层归一化处理单元的局部累加结果;The local accumulation result of the normalization processing unit of the current layer is stored in the local parameter register;
通过数据选择器将局部累加结果发送至其他层归一化处理单元的层归一化参数路由器进行共享。The local accumulation results are sent to the layer normalization parameter routers of other layer normalization processing units through the data selector for sharing.
在另一种可能的实现方式中,层归一化参数路由器还包括数据选择器、本地寄存器和远程寄存器;该方法还包括:In another possible implementation, the layer normalization parameter router further includes a data selector, a local register, and a remote register; and the method further includes:
当局部累加结果输入到层归一化计算处理单元中时,将本地寄存器的值设置为第一数值;When the local accumulation result is input into the layer normalization calculation processing unit, the value of the local register is set to the first value;
当局部累加结果通过数据选择器发送至其他层归一化参数路由器时，将远程寄存器的值设置为第一数值；When the local accumulation result is sent to other layer normalization parameter routers through the data selector, the value of the remote register is set to the first value;
当本地寄存器和远程寄存器的值均为第一数值时,将远程参数寄存器清零。When the values of the local register and the remote register are both the first value, the remote parameter register is cleared.
在另一种可能的实现方式中,局部累加结果是以数据包的形式在层归一化参数路由器之间进行传输的,通过远程参数队列单元接收其他层归一化处理单元发送的局部累加结果,包括:In another possible implementation, the local accumulation result is transmitted between the layer normalization parameter routers in the form of a data packet, and the local accumulation result sent by other layer normalization processing units is received through the remote parameter queue unit, including:
通过远程参数队列单元接收其他层归一化处理单元发送的数据包，数据包中包括多个微片；receiving, through the remote parameter queue unit, a data packet sent by another layer normalization processing unit, wherein the data packet includes a plurality of flits;
通过远程参数队列单元将数据包中的多个微片进行拼接，得到局部累加结果。The multiple flits in the data packet are spliced together by the remote parameter queue unit to obtain the local accumulation result.
在另一种可能的实现方式中,每个层归一化处理单元的输入数据包括多个数据,In another possible implementation, the input data of each layer normalization processing unit includes multiple data,
与输入数据相关的参数包括多个目标参数,局部累加结果包括多个目标参数的累加和以及局部累加次数,目标参数为数据的平方,局部累加次数为多个目标参数的累加总次数;或者,The parameters related to the input data include multiple target parameters, the local accumulation result includes the cumulative sum of the multiple target parameters and the number of local accumulations, the target parameter is the square of the data, and the number of local accumulations is the total number of accumulations of the multiple target parameters; or,
与输入数据相关的参数包括多个数据和多个目标参数,局部累加结果包括多个数据的累加和、多个目标参数的累加和、以及局部累加次数。The parameters related to the input data include multiple data and multiple target parameters, and the local accumulation result includes the cumulative sum of the multiple data, the cumulative sum of the multiple target parameters, and the number of local accumulations.
在另一种可能的实现方式中,层归一化算子包括LayerNorm算子或RMS-LayerNorm算子。In another possible implementation manner, the layer normalization operator includes a LayerNorm operator or a RMS-LayerNorm operator.
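For reference, a minimal sketch of the two operators named above, using their commonly published definitions with the scale and bias terms omitted; it shows why RMS-LayerNorm needs only the sum of squares while LayerNorm also needs the plain sum.

```python
import numpy as np

def layernorm(x, eps=1e-5):
    n, s, q = x.size, float(x.sum()), float((x * x).sum())  # sum and sum of squares
    mean = s / n
    var = q / n - mean ** 2
    return (x - mean) / np.sqrt(var + eps)

def rms_layernorm(x, eps=1e-5):
    n, q = x.size, float((x * x).sum())                      # only the sum of squares
    return x / np.sqrt(q / n + eps)
```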
需要说明的是,关于上述实施例中的方法,其中各个步骤执行操作的具体方式已经在有关该装置实施例中进行了详细描述,此处将不做详细阐述说明。It should be noted that, regarding the method in the above embodiment, the specific manner of performing the operation in each step has been described in detail in the relevant device embodiment, and will not be elaborated here.
图8是根据一示例性实施例示出的一种层归一化计算的协同处理装置的框图。例如，装置10可以被提供为一服务器或终端设备。参照图8，装置10包括处理组件822，其进一步包括一个或多个处理器，以及由存储器832所代表的存储器资源，用于存储可由处理组件822执行的指令，例如应用程序。存储器832中存储的应用程序可以包括一个或一个以上的每一个对应于一组指令的模块。此外，处理组件822可以包括多个加速单元，每个加速单元包括1个层归一化处理单元，处理组件822被配置为执行指令，以执行上述方法。FIG8 is a block diagram of a collaborative processing device for layer normalization calculation according to an exemplary embodiment. For example, the device 10 may be provided as a server or a terminal device. Referring to FIG8, the device 10 includes a processing component 822, which further includes one or more processors, and memory resources represented by a memory 832 for storing instructions executable by the processing component 822, such as an application. The application stored in the memory 832 may include one or more modules, each of which corresponds to a set of instructions. In addition, the processing component 822 may include a plurality of acceleration units, each of which includes a layer normalization processing unit, and the processing component 822 is configured to execute instructions to perform the above method.
装置10还可以包括一个电源组件826被配置为执行装置10的电源管理,一个有线或无线网络接口850被配置为将装置10连接到网络,和一个输入输出接口858(I/O接口)。装置10可以操作基于存储在存储器832的操作系统,例如Windows ServerTM,Mac OS XTM,UnixTM, LinuxTM,FreeBSDTM或类似。The device 10 may also include a power supply component 826 configured to perform power management of the device 10, a wired or wireless network interface 850 configured to connect the device 10 to a network, and an input/output interface 858 (I/O interface). The device 10 may operate based on an operating system stored in the memory 832, such as Windows Server TM , Mac OS X TM , Unix TM , Linux TM , FreeBSD TM , or the like.
本公开可以是系统、方法和/或计算机程序产品。计算机程序产品可以包括计算机可读存储介质,其上载有用于使处理器实现本公开的各个方面的计算机可读程序指令。The present disclosure may be a system, a method and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement various aspects of the present disclosure.
计算机可读存储介质可以是可以保持和存储由指令执行设备使用的指令的有形设备。计算机可读存储介质例如可以是――但不限于――电存储设备、磁存储设备、光存储设备、电磁存储设备、半导体存储设备或者上述的任意合适的组合。计算机可读存储介质的更具体的例子(非穷举的列表)包括:便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、静态随机存取存储器(SRAM)、便携式压缩盘只读存储器(CD-ROM)、数字多功能盘(DVD)、记忆棒、软盘、机械编码设备、例如其上存储有指令的打孔卡或凹槽内凸起结构、以及上述的任意合适的组合。这里所使用的计算机可读存储介质不被解释为瞬时信号本身,诸如无线电波或者其他自由传播的电磁波、通过波导或其他传输媒介传播的电磁波(例如,通过光纤电缆的光脉冲)、或者通过电线传输的电信号。A computer-readable storage medium may be a tangible device that can hold and store instructions used by an instruction execution device. A computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disk read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanical encoding device, such as a punch card or a raised structure in a groove on which instructions are stored, and any suitable combination of the above. The computer-readable storage medium used herein is not to be interpreted as a transient signal itself, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., a light pulse through a fiber optic cable), or an electrical signal transmitted through a wire.
这里所描述的计算机可读程序指令可以从计算机可读存储介质下载到各个计算/处理设备,或者通过网络、例如因特网、局域网、广域网和/或无线网下载到外部计算机或外部存储设备。网络可以包括铜传输电缆、光纤传输、无线传输、路由器、防火墙、交换机、网关计算机和/或边缘服务器。每个计算/处理设备中的网络适配卡或者网络接口从网络接收计算机可读程序指令,并转发该计算机可读程序指令,以供存储在各个计算/处理设备中的计算机可读存储介质中。The computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to each computing/processing device, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network can include copper transmission cables, optical fiber transmissions, wireless transmissions, routers, firewalls, switches, gateway computers, and/or edge servers. The network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in the computer-readable storage medium in each computing/processing device.
用于执行本公开操作的计算机程序指令可以是汇编指令、指令集架构(ISA)指令、机器指令、机器相关指令、微代码、固件指令、状态设置数据、或者以一种或多种编程语言的任意组合编写的源代码或目标代码,所述编程语言包括面向对象的编程语言—诸如Smalltalk、C++等,以及常规的过程式编程语言—诸如“C”语言或类似的编程语言。计算机可读程序指令可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络—包括局域网(LAN)或广域网(WAN)—连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。在一些实施例中,通过利用计算机可读程序指令的状态信息来个性化定制电子电路,例如可编程逻辑电路、现场可编程门阵列(FPGA)或可编程逻辑阵列(PLA),该电子电路可以执行计算机可读程序指令,从而实现本公开的各个方面。The computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages, such as Smalltalk, C++, etc., and conventional procedural programming languages, such as "C" language or similar programming languages. The computer-readable program instructions may be executed entirely on the user's computer, partially on the user's computer, as a separate software package, partially on the user's computer, partially on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., using an Internet service provider to connect through the Internet). In some embodiments, by using the state information of the computer-readable program instructions to personalize an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), the electronic circuit may execute the computer-readable program instructions, thereby implementing various aspects of the present disclosure.
这里参照根据本公开实施例的方法、装置(系统)和计算机程序产品的流程图和/或框图描述了本公开的各个方面。应当理解,流程图和/或框图的每个方框以及流程图和/或框图中各方框的组合,都可以由计算机可读程序指令实现。Various aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present disclosure. It should be understood that each box in the flowchart and/or block diagram and the combination of boxes in the flowchart and/or block diagram can be implemented by computer-readable program instructions.
这些计算机可读程序指令可以提供给通用计算机、专用计算机或其它可编程数据处理装置的处理器,从而生产出一种机器,使得这些指令在通过计算机或其它可编程数据处理装置的处理器执行时,产生了实现流程图和/或框图中的一个或多个方框中规定的功能/动作的装置。也可以把这些计算机可读程序指令存储在计算机可读存储介质中,这些指令使得计算机、可编程数据处理装置和/或其他设备以特定方式工作,从而,存储有指令的计算机可读介质则包括一个制造品,其包括实现流程图和/或框图中的一个或多个方框中规定的功能/动作的各个方面的指令。These computer-readable program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, thereby producing a machine, so that when these instructions are executed by the processor of the computer or other programmable data processing device, a device that implements the functions/actions specified in one or more boxes in the flowchart and/or block diagram is generated. These computer-readable program instructions can also be stored in a computer-readable storage medium, and these instructions cause the computer, programmable data processing device, and/or other equipment to work in a specific manner, so that the computer-readable medium storing the instructions includes a manufactured product, which includes instructions for implementing various aspects of the functions/actions specified in one or more boxes in the flowchart and/or block diagram.
也可以把计算机可读程序指令加载到计算机、其它可编程数据处理装置、或其它设备上,使得在计算机、其它可编程数据处理装置或其它设备上执行一系列操作步骤,以产生计算机实现的过程,从而使得在计算机、其它可编程数据处理装置、或其它设备上执行的指令实现流程图和/或框图中的一个或多个方框中规定的功能/动作。Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device so that a series of operating steps are performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, thereby causing the instructions executed on the computer, other programmable data processing apparatus, or other device to implement the functions/actions specified in one or more boxes in the flowchart and/or block diagram.
附图中的流程图和框图显示了根据本公开的多个实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段或指令的一部分,所述模块、程序段或指令的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个连续的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或动作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。The flow chart and block diagram in the accompanying drawings show the possible architecture, function and operation of the system, method and computer program product according to multiple embodiments of the present disclosure. In this regard, each square box in the flow chart or block diagram can represent a part of a module, program segment or instruction, and a part of the module, program segment or instruction includes one or more executable instructions for realizing the specified logical function. In some alternative implementations, the functions marked in the square box can also occur in a sequence different from that marked in the accompanying drawings. For example, two continuous square boxes can actually be executed substantially in parallel, and they can sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each square box in the block diagram and/or flow chart, and the combination of the square boxes in the block diagram and/or flow chart can be implemented with a dedicated hardware-based system that performs the specified function or action, or can be implemented with a combination of special hardware and computer instructions.
以上已经描述了本公开的各实施例,上述说明是示例性的,并非穷尽性的,并且也不限于所披露的各实施例。在不偏离所说明的各实施例的范围和精神的情况下,对于本技术领域的普通技术人员来说许多修改和变更都是显而易见的。本文中所用术语的选择,旨在最好地解释各实施例的原理、实际应用或对市场中的技术改进,或者使本技术领域的其它普通技术人员能理解本文披露的各实施例。The embodiments of the present disclosure have been described above, and the above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and changes will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The selection of terms used herein is intended to best explain the principles of the embodiments, practical applications, or technical improvements in the market, or to enable other persons of ordinary skill in the art to understand the embodiments disclosed herein.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202410437757.0A (CN118095351B) | 2024-04-12 | 2024-04-12 | Cooperative processing device and method for layer normalization calculation
Publications (2)

Publication Number | Publication Date
---|---
CN118095351A | 2024-05-28
CN118095351B | 2024-07-02
Family ID: 91147726
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant