CN114897133A - Universal configurable Transformer hardware accelerator and implementation method thereof - Google Patents

Universal configurable Transformer hardware accelerator and implementation method thereof

Info

Publication number
CN114897133A
CN114897133A
Authority
CN
China
Prior art keywords
module
data
cache
calculation
engine array
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210427056.XA
Other languages
Chinese (zh)
Inventor
粟涛 (Su Tao)
杨鑫 (Yang Xin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat-sen University
Original Assignee
Sun Yat-sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202210427056.XA priority Critical patent/CN114897133A/en
Publication of CN114897133A publication Critical patent/CN114897133A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 - Digital computers in general; Data processing equipment in general
    • G06F15/76 - Architectures of general purpose stored program computers
    • G06F15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 - System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781 - On-chip cache; Off-chip memory
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/18 - Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/047 - Probabilistic or stochastic networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Computing Systems (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Algebra (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Operations Research (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a universal configurable Transformer hardware accelerator and an implementation method thereof. A compute engine array together with a Softmax operation module and a LayerNorm operation module covers all operation requirements of the whole network layer. The compute engine array increases the computational parallelism inside the accelerator, speeds up computation and raises the utilization of the data loaded into the accelerator's internal cache. Feature data, bias matrix data, weight data and parameter data are stored in separate regions of the on-chip cache module, which improves the efficiency of subsequent data reads, reduces memory-access latency and raises operation efficiency. The parallelism of each compute engine, the number of compute engines and the parallelism of the non-linear layer operations can be configured by the control module, so the model can be deployed on different amounts of computing resources. The accelerator can be widely applied in the technical field of neural network models.

Description

Universal configurable Transformer hardware accelerator and implementation method thereof
Technical Field
The invention relates to the technical field of neural network models, in particular to a universal configurable Transformer hardware accelerator and an implementation method thereof.
Background
The successive emergence of various neural network models has provided a continuing driving force for application scenarios such as human-computer interaction, autonomous driving, biomedicine and security monitoring. In recent years the Transformer model has attracted wide attention: it is an outstanding deep-learning model based on self-attention, its self-attention mechanism enables highly parallel computation, and its unified structure shows great potential for multi-modal inputs and multi-task scenarios. As the computational tasks that neural network models must handle become more intensive, deploying them on hardware requires large amounts of memory and computation. Compared with the inherent drawbacks of cloud processing, such as high transmission latency and data-security risks, deploying models on edge and embedded devices is increasingly important and necessary. Therefore, to guarantee real-time computation, a hardware accelerator is urgently needed to provide support at the hardware level.
Unlike traditional neural network architectures based on convolution, the Transformer model is essentially an encoder-decoder structure: the encoder maps the input into a hidden representation and the decoder maps the hidden representation into the output. Built on a self-attention mechanism and an encoder-decoder architecture, the Transformer has a distinctive network structure. Because the hardware architectures of existing general-purpose neural network accelerators do not match this structure, hardware resource utilization is low, scheduling efficiency is poor, and the computation of a Transformer model cannot be effectively accelerated. The Transformer model contains matrix multiplications of several different sizes, residual structures, the non-linear operations of the self-attention layer, normalization layers, fully-connected layers and so on; a large number of compute units are needed to exploit parallelism in the computation, a large number of memory accesses are required, and memory-access latency takes up a significant share of the network's overall running time. Existing Transformer-related hardware accelerators do not provide optimized computation circuits for all of these layers: they implement only matrix multiplication and the linear layer, or accelerate only the self-attention layer, and therefore do not accelerate the complete network structure, giving poor acceleration for applications that need the whole Transformer network accelerated. In addition, most FPGA-based Transformer accelerators are designed with HLS, do not consider scalability, and have inflexible storage structures and data-access patterns that cannot handle models with large parameter counts.
Disclosure of Invention
In order to solve the above technical problems, the present invention aims to provide a universal configurable Transformer hardware accelerator and an implementation method thereof, so as to improve the scalability of Transformer hardware accelerators and the operation efficiency of the Transformer model.
The first technical scheme adopted by the invention is as follows:
A universal configurable Transformer hardware accelerator, comprising:
a compute engine array comprising a plurality of configurable compute engines, each compute engine comprising an addition tree with a configurable number of stages, an accumulator, a ReLU unit and a plurality of parallel multipliers, the compute engine array being used to perform linear layer operations, self-attention layer operations, fully-connected layer operations and residual connection operations in parallel;
an on-chip cache module comprising a weight cache region, a bias cache region, a parameter cache region and a plurality of feature cache regions, the compute engine array being connected to off-chip storage through the on-chip cache module, and the on-chip cache module being used to realize data interaction between the compute engine array and the off-chip storage;
a non-linear layer acceleration module comprising a Softmax operation module and a LayerNorm operation module, the input of the Softmax operation module and the input of the LayerNorm operation module each being connected with the output of the compute engine array, the Softmax operation module and the LayerNorm operation module each being connected with the on-chip cache module and being used to perform non-linear layer operations;
and a control module, used to perform operation configuration and memory-access scheduling control on the compute engine array, the Softmax operation module and the LayerNorm operation module.
Further, the parallelism of the compute engine array is controlled by configuring the number of compute engines and parallel multipliers.
Further, the Transformer hardware accelerator further includes a data rearrangement module, the compute engine array is connected to the on-chip cache module through the data rearrangement module, and the data rearrangement module is configured to perform data rearrangement on output vectors of the compute engine array and transmit the output vectors to each cache region of the on-chip cache module.
Further, the feature cache regions include a first feature cache region and a second feature cache region; the storage bit width of the first feature cache region matches the bit width of the feature map vector input to the compute engine array, and the storage bit width of the second feature cache region matches the bit width of the output vector of the compute engine array.
Further, the Softmax operation module comprises a maximum value lookup unit, a maximum value cache unit and a Softmax calculation unit; the maximum value lookup unit looks up the maximum of each row of elements of the output vector of the compute engine array, the maximum value cache unit caches that maximum, and the Softmax calculation unit obtains feature cache data from the feature cache region and performs the normalized exponential operation according to the feature cache data and the maximum.
Further, the LayerNorm operation module includes a mean calculation unit, a mean cache unit, a variance calculation unit, a variance cache unit and a LayerNorm calculation unit; the mean calculation unit calculates the mean of each row of elements of the output vector of the compute engine array, the mean cache unit caches the mean, the variance calculation unit calculates the variance of each row of elements of the output vector, the variance cache unit caches the variance, and the LayerNorm calculation unit obtains bias matrix cache data and parameter cache data from the bias cache region and the parameter cache region and performs the layer-normalization function operation according to the bias matrix cache data, the parameter cache data, the mean and the variance.
Further, the control module is further configured to receive communication information from a host and to perform operation configuration and memory-access scheduling control on the compute engine array, the Softmax operation module and the LayerNorm operation module according to the communication information.
The second technical scheme adopted by the invention is as follows:
a method for implementing a universal configurable Transformer hardware accelerator is used for being implemented by the universal configurable Transformer hardware accelerator, and comprises the following steps:
performing operation configuration and memory-access scheduling control on the compute engine array, the Softmax operation module and the LayerNorm operation module through the control module;
inputting the feature map vector and the weight vector into the compute engine array to obtain an output vector;
rearranging the data of the output vector and storing them in sequence into the cache regions of the on-chip cache module;
inputting the output vector into the Softmax operation module and the LayerNorm operation module;
looking up the maximum value of the output vector through the Softmax operation module, and calling the feature cache data of the feature cache region to perform the normalized exponential operation to obtain a first operation result;
calculating the mean and the variance of each row of elements of the output vector through the LayerNorm operation module, and calling the bias matrix cache data of the bias cache region and the parameter cache data of the parameter cache region to perform the layer-normalization function operation to obtain a second operation result;
and outputting the first operation result and the second operation result.
The invention has the following beneficial effects. The invention provides a universal configurable Transformer hardware accelerator and an implementation method thereof: linear layer operations, self-attention layer operations, fully-connected layer operations and residual connection operations are realized by the compute engine array, and non-linear layer operations are realized by the Softmax operation module and the LayerNorm operation module, so all operation requirements of the whole network layer are met. The compute engine array increases the computational parallelism inside the accelerator, speeds up computation and raises the utilization of the data loaded into the accelerator's internal cache. Feature data, bias matrix data, weight data and parameter data are stored in separate regions of the on-chip cache module, which improves the efficiency of subsequent data reads, reduces memory-access latency and raises operation efficiency. The parallelism of each compute engine, the number of compute engines and the parallelism of the non-linear layer operations can be configured through the control module, so the deployment of the Transformer model can be completed on different amounts of computing resources.
Drawings
FIG. 1 is a schematic structural diagram of a universal configurable Transformer hardware accelerator according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a compute engine according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a Softmax operation module according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a LayerNorm operation module according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating the steps of a method for implementing a universal configurable Transformer hardware accelerator according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration; the order between the steps is not limited, and the execution order of the steps can be adapted according to the understanding of those skilled in the art.
In the description of the present invention, "a plurality of" means two or more. Terms such as "first" and "second", if present, are used only to distinguish technical features and are not intended to indicate or imply relative importance, the number of the indicated technical features, or their precedence. Furthermore, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in the description is for the purpose of describing particular embodiments only and is not intended to limit the invention.
Referring to fig. 1, an embodiment of the present invention provides a universal configurable Transformer hardware accelerator, including:
a compute engine array: the compute engine array comprises a plurality of configurable compute engines, each compute engine comprising an addition tree with a configurable number of stages, an accumulator, a ReLU unit and a plurality of parallel multipliers, and the compute engine array is used to perform linear layer operations, self-attention layer operations, fully-connected layer operations and residual connection operations in parallel;
an on-chip cache module: the on-chip cache module comprises a weight cache region, a bias cache region, a parameter cache region and a plurality of feature cache regions; the compute engine array is connected to off-chip storage through the on-chip cache module, and the on-chip cache module is used to realize data interaction between the compute engine array and the off-chip storage;
a non-linear layer acceleration module: the non-linear layer acceleration module comprises a Softmax operation module and a LayerNorm operation module; the input of the Softmax operation module and the input of the LayerNorm operation module are both connected with the output of the compute engine array, both modules are connected with the on-chip cache module, and they are used to perform non-linear layer operations;
a control module: the compute engine array, the Softmax operation module and the LayerNorm operation module are all connected with the control module, and the control module is used to perform operation configuration and memory-access scheduling control on the compute engine array, the Softmax operation module and the LayerNorm operation module.
Specifically, the Transformer hardware accelerator according to the embodiment of the present invention includes:
1) A general vector-multiplication-based compute engine array: the compute engine array consists of several configurable compute engines, and the number of compute engines is configurable. Each compute engine contains several parallel multipliers, an addition tree, an accumulator and a ReLU unit, combined into a pipeline structure, so that large amounts of matrix data can be computed in parallel. The matrix operations of the self-attention layer, the linear layers and the fully-connected layers can all be run and accelerated in parallel on the compute array.
2) An on-chip cache module: the on-chip cache module is divided into several feature cache regions, weight cache regions, bias and parameter cache regions, and so on. It realizes the data interaction between off-chip storage and the compute engine array, caching the data about to participate in computation and the data just computed in the on-chip cache; this reduces memory-access bandwidth, enables high-bandwidth internal access, shortens operation time and reduces operation power consumption.
3) A non-linear layer acceleration module: the Softmax operation module performs the Softmax non-linear operation of the network layer, and the LayerNorm operation module implements the LayerNorm-related non-linear operations and normalization. Part of the computation time of the Softmax and LayerNorm operation modules can be overlapped with the running time of the compute engine array, and the remaining part can be accelerated in parallel.
4) A control module: the control module is responsible for receiving communication information from the host, including control signals, configuration information and so on. In addition, the control module handles the scheduling control of the different computation types and of the memory-access logic within a network layer, so that memory-access time is hidden to the greatest extent, the operation latency of the whole network layer is reduced, and the operation efficiency is improved.
Further as an alternative embodiment, the parallelism of the compute engine array is controlled by configuring the number of compute engines and parallel multipliers.
The embodiment of the invention maps the different operators of the Transformer model onto the corresponding hardware modules to achieve efficient operation and shorten the overall running time. The parallelism is configured as follows: each compute engine contains N multipliers, there are M compute engines in total, and one data word is represented with W bits.
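A minimal sketch, not taken from the patent, of the parallelism parameters just described: N multipliers per compute engine, M compute engines in total, and W bits per data word. The derived buffer widths follow the NW-bit / MW-bit convention used later in the description; the concrete numbers below are only example values.

```python
from dataclasses import dataclass

@dataclass
class AcceleratorConfig:
    n_multipliers: int   # N: parallel multipliers inside one compute engine
    m_engines: int       # M: number of compute engines in the array
    word_bits: int       # W: bit width of one data word

    @property
    def input_row_bits(self) -> int:
        # one input feature-map / weight vector row occupies N*W bits
        return self.n_multipliers * self.word_bits

    @property
    def output_row_bits(self) -> int:
        # one output vector of the engine array occupies M*W bits
        return self.m_engines * self.word_bits

cfg = AcceleratorConfig(n_multipliers=16, m_engines=8, word_bits=16)
print(cfg.input_row_bits, cfg.output_row_bits)  # 256 128
```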
The compute engine array is the core computing part of the hardware accelerator and occupies most of its computing resources. The compute engines are designed around vector operations; each compute engine contains several parallel multipliers, an addition tree with a configurable number of stages, an accumulator and a ReLU unit. Under the control of its internal sub-controller it can realize the matrix multiplications of different sizes, residual-structure addition, bias addition and output-matrix transposition in the Transformer algorithm. The transposition is tied to the data arrangement in the cache, which makes data access and computation convenient. The parallelism is obtained by configuring the number of multipliers and the number of compute engines: one compute unit takes a group of NW-bit input feature-map vectors and a group of NW-bit weight vectors, the compute engine array produces an MW-bit output vector, and the output vector is written in sequence into the data cache units after data rearrangement.
Under the control of the controller, the compute engine can be configured to satisfy the different matrix operation types found in the various Transformer structures, for example:
1) the first layer of the feed-forward block in the Transformer model contains a ReLU operation, and ReLU can be enabled to support the computation of this type of layer;
2) besides accumulating several partial results, some computations also need to add bias data, add the residual data of a previous layer, or even both. These functions are realized in the accumulation stage through multiplexing, with the computation types of the Transformer model classified to fit the different configurations of the compute array.
The architecture of the compute engine of the embodiment of the invention is shown in fig. 2. Different types of operation can be switched by mode through the control module, giving strong flexibility and a high utilization rate of the arithmetic units.
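A behavioural sketch of one compute-engine step as described above, not the patent's RTL: N parallel multiplies feed an addition tree, the accumulator stage can optionally add a bias and/or a residual term, and ReLU can be enabled for layers such as the first feed-forward layer. Function and argument names are illustrative assumptions.

```python
from typing import Sequence

def compute_engine_step(features: Sequence[float],
                        weights: Sequence[float],
                        acc: float = 0.0,
                        bias: float = 0.0,
                        residual: float = 0.0,
                        use_bias: bool = False,
                        use_residual: bool = False,
                        use_relu: bool = False) -> float:
    # stage 1: N parallel multipliers
    products = [f * w for f, w in zip(features, weights)]
    # stage 2: addition tree (log2(N) stages in hardware)
    partial = sum(products)
    # stage 3: accumulator with configurable extra addends
    acc = acc + partial
    if use_bias:
        acc += bias
    if use_residual:
        acc += residual
    # stage 4: optional ReLU
    return max(acc, 0.0) if use_relu else acc
```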
Referring to fig. 1, as a further optional implementation manner, the Transformer hardware accelerator further includes a data rearrangement module; the compute engine array is connected to the on-chip cache module through the data rearrangement module, and the data rearrangement module is configured to rearrange the output vectors of the compute engine array and transmit them to the cache regions of the on-chip cache module.
Specifically, besides the core compute unit of the compute engine array itself, the embodiment of the present invention also considers the arrangement of the engine's input and output data. Since a matrix produced by the self-attention computation may also serve as the weight input of the next computation, the output data must be distributed and rearranged into different caches for the next access and computation, in order to maximize the utilization of the compute units, increase hardware reuse and reduce hardware idle time. In addition, matrix transposition is needed in several matrix computations, and performing all of them with a single hardware transposition module would cost extra time and resources. The embodiment of the present invention therefore places a data rearrangement module between the compute engine array and the on-chip cache module; this module performs a special rearrangement that realizes the transposition. The cost is the splitting and combination of the cache units: a large cache unit has to be built from caches with smaller bit widths, so that selective writes to certain bit-width slices of a wide cache become possible.
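A simplified software model, assumed for illustration rather than taken from the patent, of the rearrangement idea: each element of an output row is written into a different narrow buffer bank, so a later pass can read the matrix column-wise (effectively transposed) without a separate hardware transpose module. The function name and bank layout are assumptions.

```python
def rearrange_output_rows(rows):
    """rows: list of output vectors (each of length M). Returns M banks,
    where bank j holds column j of the output matrix."""
    if not rows:
        return []
    m = len(rows[0])
    banks = [[] for _ in range(m)]
    for row in rows:
        for j, value in enumerate(row):
            banks[j].append(value)   # selective write into bank j
    return banks

# Reading bank j back sequentially yields row j of the transposed matrix.
banks = rearrange_output_rows([[1, 2, 3], [4, 5, 6]])
assert banks[0] == [1, 4] and banks[2] == [3, 6]
```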
Referring to fig. 1, as a further alternative embodiment, the feature buffer includes a first feature buffer and a second feature buffer, where a storage bit width of the first feature buffer is consistent with a bit number of a feature map vector input to the compute engine array, and a storage bit width of the second feature buffer is consistent with a bit number of an output vector of the compute engine array.
Specifically, the Transformer model contains many computation types, and in typical natural-language-processing applications the storage required by the feature maps is relatively small. To improve the operation efficiency of the hardware accelerator, interaction with off-chip storage should be reduced; moreover, the Transformer structure reuses several output matrices in later computations, so a simple ping-pong cache design would still have to spill data back to off-chip storage and cannot meet the requirement. The embodiment of the invention therefore uses several data caches to hold intermediate data temporarily, achieving high on-chip data bandwidth. To better match the compute array and reduce data reordering, the size and bit width of the data caches are related to the configuration of the compute engine array.
Feature map caching is divided into two types. In the first type of feature cache region, the storage bit width of each address is NW bits; the data of one row are arranged from left to right, and once a row is finished the data of the next row follow in the same way, and so on. The second type of feature cache region keeps the same data arrangement order but changes the storage bit width to MW bits; it is mainly used for the addition of the residual-connection part.
The compute array contains several compute engines to achieve higher parallelism; to fetch several groups of weights simultaneously and send them to the compute array in parallel, the on-chip cache bandwidth must be increased. The embodiment of the invention adopts multiple groups of weight caches: each group of sub-weight caches holds part of the weight data, and together the groups hold the full amount of weights required. The first type of weight cache is a single large block of storage; the other type is spliced from several groups of small caches to form one large cache space. The reason is that the values produced by some matrix computations must be used again as weights in the next computation; to reduce data transposition and rearrangement and to lower the latency of the network-layer operations, a special data rearrangement is performed at the output stage of the compute array, which requires additional design of the data-write control signals and of the address generation module. To meet this requirement, selective writes to each W-bit slice of a storage block are needed, so N caches of W bits each are spliced into one complete cache space. When the next matrix computation reads data from the cache, NW bits of data can likewise be fetched at once.
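An illustrative model, an assumption rather than the patent's memory generator, of a weight cache spliced from N narrow banks: each bank stores W-bit words and can be written individually, while a read returns all N slices of one address at once, i.e. an NW-bit word. Class and parameter names are hypothetical.

```python
class BankedBuffer:
    def __init__(self, n_banks: int, depth: int):
        self.banks = [[0] * depth for _ in range(n_banks)]

    def write(self, addr: int, bank: int, value: int) -> None:
        # selective write: only the chosen W-bit slice is updated
        self.banks[bank][addr] = value

    def read(self, addr: int):
        # full-width read: all N slices of this address in one access
        return [bank[addr] for bank in self.banks]

buf = BankedBuffer(n_banks=4, depth=8)
buf.write(addr=0, bank=2, value=7)
print(buf.read(0))  # [0, 0, 7, 0]
```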
The parameter cache region and the bias cache region mainly store parameter data and bias data and occupy little memory. They are likewise divided into two types according to the bit width of the storage unit, one with a bit width of NW bits and the other with a bit width of MW bits; the choice corresponds to the network structure layer concerned.
Referring to fig. 3, as a further optional implementation manner, the Softmax operation module includes a maximum value lookup unit, a maximum value cache unit and a Softmax calculation unit; the maximum value lookup unit looks up the maximum of each row of elements of the output vector of the compute engine array, the maximum value cache unit caches that maximum, and the Softmax calculation unit obtains feature cache data from the feature cache region and performs the normalized exponential operation according to the feature cache data and the maximum.
Specifically, the Softmax operation module is located in the self-attention layer. The non-linear function is implemented by converting each result into an exponential, which guarantees that the probability is non-negative, and then dividing each converted result by the sum of all converted results, which can be understood as a percentage, yielding an approximate probability. To avoid overflow in the fixed-point representation for large inputs, the maximum value of the input vector is subtracted from the dot-product value being processed. The Softmax operation lies on a critical path of the computation, and the embodiment of the invention mainly adopts two optimizations: the time to find the maximum value is hidden inside the computation time of the preceding matrix, and the Softmax calculation module uses a pipeline to increase throughput, with several parallel calculation units further reducing the running latency.
Fig. 3 shows the structure of the Softmax operation module of the embodiment of the invention. As the compute engine array produces the values of the output matrix in sequence, several output values at a time are also sent to the maximum value lookup unit of the Softmax operation module, which obtains the maximum of each row of the feature matrix and stores it temporarily in the maximum value cache unit. After the compute engine array finishes the matrix operation that precedes the Softmax operation, the Softmax value of each element of the matrix is computed: the row maximum is read from the maximum value cache unit, and in every clock cycle one address of the feature cache region is read, containing several of the matrix values; these are sent to the Softmax operation module in parallel, and after a few cycles several Softmax values are obtained in parallel. A new batch of values can be sent in every clock cycle and flows through the Softmax operation module as a pipeline, which raises the throughput of the module.
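A minimal numerical sketch of the Softmax path described above: the row maximum is found while the preceding matrix is still being produced, and the exponential is taken of (x - max) to avoid fixed-point overflow. This is plain floating-point Python, not the module's fixed-point pipeline.

```python
import math

def softmax_row(row):
    row_max = max(row)                           # maximum value lookup unit
    exps = [math.exp(x - row_max) for x in row]  # shifted exponentials
    total = sum(exps)
    return [e / total for e in exps]             # normalization

print(softmax_row([1.0, 2.0, 3.0]))
```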
Referring to fig. 4, as a further alternative embodiment, the LayerNorm operation module includes a mean calculation unit, a mean cache unit, a variance calculation unit, a variance cache unit and a LayerNorm calculation unit; the mean calculation unit calculates the mean of each row of elements of the output vector of the compute engine array, the mean cache unit caches the mean, the variance calculation unit calculates the variance of each row of elements of the output vector, the variance cache unit caches the variance, and the LayerNorm calculation unit obtains the bias matrix cache data and the parameter cache data from the bias cache region and the parameter cache region and performs the layer-normalization function operation according to the bias matrix cache data, the parameter cache data, the mean and the variance.
Specifically, the LayerNorm operation module is included in the Transformer model to keep the distribution of the data features stable. It must compute the mean and the variance over each sample; its structure is shown in fig. 4. The LayerNorm operation module takes data directly from the compute engine array and sends it to the mean calculation unit and the variance calculation unit, so the time to compute the mean and the variance is hidden inside the preceding matrix operation and does not add latency on the critical path of the data flow. Moreover, the variance and mean computations run in parallel; because the variance computation has an extra multiplication stage before its addition tree, the mean becomes available one cycle before the variance, which exactly matches the timing at which the variance formula needs the mean. The parallelism of the mean and variance computations equals the number of compute engines, M.
After the variance and the mean are obtained, the LayerNorm operation is performed on every data point, again using a pipelined data flow with multiple parallel multipliers and adders; the parallelism equals the parallelism of the data in the cache, N. When computing the LayerNorm output, the corresponding data are read from the mean cache, the variance cache, and the bias cache region and parameter cache region of the on-chip cache module.
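A floating-point reference sketch of the LayerNorm computation described above: per-row mean and variance, then scale (the "parameter" data, often called gamma) and shift (the "bias" data, often called beta). The eps constant is an assumed small value for numerical stability; the hardware module uses fixed-point units instead.

```python
def layernorm_row(row, gamma, beta, eps=1e-5):
    n = len(row)
    mean = sum(row) / n                              # mean calculation unit
    var = sum((x - mean) ** 2 for x in row) / n      # variance calculation unit
    return [g * (x - mean) / (var + eps) ** 0.5 + b  # LayerNorm calculation unit
            for x, g, b in zip(row, gamma, beta)]

print(layernorm_row([1.0, 2.0, 3.0], gamma=[1.0] * 3, beta=[0.0] * 3))
```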
As a further optional implementation manner, the control module is further configured to receive communication information from the host and to perform operation configuration and memory-access scheduling control on the compute engine array, the Softmax operation module and the LayerNorm operation module according to the communication information.
It should be appreciated that any model containing a Transformer structure, and any neural network model containing the common operator structures, can be accelerated with the hardware architecture of the embodiment of the invention. The number of multipliers in a compute engine can be increased to raise parallelism and obtain better acceleration performance. The embodiment of the invention is suitable not only for deployment on FPGA devices but also for an ASIC realization of the hardware accelerator.
The structure and operating principle of the Transformer hardware accelerator of the embodiment of the invention have been explained above, and it can be seen that the embodiment of the invention has the following advantages:
1) the Transformer network contains many kinds of structure layers, covering linear layer operations, self-attention layer operations, fully-connected layer operations, residual connections, non-linear operations and other operation types; most of the first four types can be realized with the compute engine array of the embodiment of the invention, and the non-linear layer computation can be realized by deploying the Softmax operation module and the LayerNorm operation module, so the architecture of the embodiment of the invention meets all operation requirements of the whole network layer;
2) the vector-operation compute engine array increases the computational parallelism inside the accelerator, speeds up computation and raises the utilization of the data loaded into the accelerator's internal cache;
3) the data rearrangement module inside the accelerator adapts to changes in the data arrangement order and caches the data in classified partitions, which increases the completeness of the fetched data, improves the data-fetch efficiency and reduces the latency of the memory-access process;
4) the parallelism of the units inside each compute engine, the number of compute engines and the parallelism of the operations in the non-linear layer module can all be configured, so the model can be deployed on different amounts of computing resources.
Referring to fig. 5, an embodiment of the present invention provides an implementation method of a universal configurable Transformer hardware accelerator, which is implemented by the universal configurable Transformer hardware accelerator described above and includes the following steps:
S101, performing operation configuration and memory-access scheduling control on the compute engine array, the Softmax operation module and the LayerNorm operation module through the control module;
S102, inputting the feature map vector and the weight vector into the compute engine array to obtain an output vector;
S103, rearranging the data of the output vector and storing them in sequence into the cache regions of the on-chip cache module;
S104, inputting the output vector into the Softmax operation module and the LayerNorm operation module;
S105, looking up the maximum value of the output vector through the Softmax operation module, and calling the feature cache data of the feature cache region to perform the normalized exponential operation to obtain a first operation result;
S106, calculating the mean and the variance of each row of elements of the output vector through the LayerNorm operation module, and calling the bias matrix cache data of the bias cache region and the parameter cache data of the parameter cache region to perform the layer-normalization function operation to obtain a second operation result;
S107, outputting the first operation result and the second operation result.
Specifically, the embodiment of the invention realizes linear layer operations, self-attention layer operations, fully-connected layer operations and residual connection operations through the compute engine array, and non-linear layer operations through the Softmax operation module and the LayerNorm operation module, thereby meeting all operation requirements of the whole network layer. The compute engine array increases the computational parallelism inside the accelerator, speeds up computation and raises the utilization of the data loaded into the accelerator's internal cache. The feature data, bias matrix data, weight data and parameter data are stored in separate regions of the on-chip cache module, which improves the efficiency of subsequent data reads, reduces memory-access latency and raises operation efficiency. The parallelism of each compute engine, the number of compute engines and the parallelism of the non-linear layer operations can be configured through the control module, so the deployment of the Transformer model can be completed on different amounts of computing resources.
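A high-level software walk-through of steps S101-S107, given only as an illustration and assuming the sketch functions defined earlier (compute_engine_step, rearrange_output_rows, softmax_row, layernorm_row) are in scope: configure, run the engine array, rearrange the output, then compute the Softmax and LayerNorm results.

```python
def run_layer(feature_rows, weight_rows, gamma, beta):
    # S102: the engine array produces one output row per feature row
    out_rows = []
    for f in feature_rows:
        out_rows.append([compute_engine_step(f, w) for w in weight_rows])
    # S103: rearrange output rows into banked buffers
    banks = rearrange_output_rows(out_rows)
    # S104-S106: non-linear layer operations on each output row
    softmax_out = [softmax_row(r) for r in out_rows]                    # first result
    layernorm_out = [layernorm_row(r, gamma, beta) for r in out_rows]   # second result
    # S107: return both results (plus the rearranged banks)
    return softmax_out, layernorm_out, banks
```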
It can be understood that the contents in the system embodiments are all applicable to the method embodiments, the functions specifically implemented by the method embodiments are the same as the system embodiments, and the beneficial effects achieved by the method embodiments are also the same as the beneficial effects achieved by the system embodiments.
It should be recognized that embodiments of the present invention can be realized and implemented by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer readable memory. The above-described methods may be implemented in a computer program using standard programming techniques, including a non-transitory computer-readable storage medium configured with the computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner, according to the methods and figures described in the detailed description. Each program may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on a programmed application specific integrated circuit for this purpose.
Further, the operations of processes described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes described herein (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) collectively executed on one or more processors, by hardware, or combinations thereof. The computer program includes a plurality of instructions executable by one or more processors.
Further, the above-described methods may be implemented in any type of computing platform operatively connected to a suitable connection, including but not limited to a personal computer, mini computer, mainframe, workstation, networked or distributed computing environment, separate or integrated computer platform, or in communication with a charged particle tool or other imaging device, and the like. Aspects of the invention may be embodied in machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, optically read and/or write storage medium, RAM, ROM, or the like, such that it may be read by a programmable computer, which when read by the storage medium or device, is operative to configure and operate the computer to perform the procedures described herein. Further, the machine-readable code, or portions thereof, may be transmitted over a wired or wireless network. The invention described herein includes these and other different types of non-transitory computer-readable storage media when such media include instructions or programs that implement the steps described above in conjunction with a microprocessor or other data processor. The invention also includes the computer itself when programmed according to the methods and techniques described herein.
A computer program can be applied to input data to perform the functions described herein to transform the input data to generate output data that is stored to non-volatile memory. The output information may also be applied to one or more output devices, such as a display. In a preferred embodiment of the invention, the transformed data represents physical and tangible objects, including particular visual depictions of physical and tangible objects produced on a display.
The above description is only a preferred embodiment of the present invention, and the present invention is not limited to the above embodiment, and any modifications, equivalent substitutions, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention as long as the technical effects of the present invention are achieved by the same means. The invention is capable of other modifications and variations in its technical solution and/or its implementation, within the scope of protection of the invention.

Claims (8)

1. A universal configurable Transformer hardware accelerator, comprising:
a compute engine array comprising a plurality of configurable compute engines, each compute engine comprising an addition tree with a configurable number of stages, an accumulator, a ReLU unit and a plurality of parallel multipliers, the compute engine array being used to perform linear layer operations, self-attention layer operations, fully-connected layer operations and residual connection operations in parallel;
an on-chip cache module comprising a weight cache region, a bias cache region, a parameter cache region and a plurality of feature cache regions, the compute engine array being connected to off-chip storage through the on-chip cache module, and the on-chip cache module being used to realize data interaction between the compute engine array and the off-chip storage;
a non-linear layer acceleration module comprising a Softmax operation module and a LayerNorm operation module, the input of the Softmax operation module and the input of the LayerNorm operation module each being connected with the output of the compute engine array, the Softmax operation module and the LayerNorm operation module each being connected with the on-chip cache module and being used to perform non-linear layer operations;
and a control module, used to perform operation configuration and memory-access scheduling control on the compute engine array, the Softmax operation module and the LayerNorm operation module.
2. The universal configurable Transformer hardware accelerator of claim 1, wherein: the parallelism of the compute engine array is controlled by configuring the number of compute engines and parallel multipliers.
3. The universal configurable Transformer hardware accelerator of claim 1, wherein: the Transformer hardware accelerator further comprises a data rearrangement module, the compute engine array is connected with the on-chip cache module through the data rearrangement module, and the data rearrangement module is used to rearrange the output vectors of the compute engine array and transmit them to the cache regions of the on-chip cache module.
4. The universal configurable Transformer hardware accelerator of claim 1, wherein: the feature cache regions comprise a first feature cache region and a second feature cache region, the storage bit width of the first feature cache region matches the bit width of the feature map vector input into the compute engine array, and the storage bit width of the second feature cache region matches the bit width of the output vector of the compute engine array.
5. The universal configurable Transformer hardware accelerator of claim 1, wherein: the Softmax operation module comprises a maximum value lookup unit, a maximum value cache unit and a Softmax calculation unit, the maximum value lookup unit being used to look up the maximum of each row of elements of the output vector of the compute engine array, the maximum value cache unit being used to cache the maximum, and the Softmax calculation unit being used to obtain feature cache data from the feature cache region and perform the normalized exponential operation according to the feature cache data and the maximum.
6. The universal configurable Transformer hardware accelerator of claim 1, wherein: the LayerNorm operation module comprises a mean calculation unit, a mean cache unit, a variance calculation unit, a variance cache unit and a LayerNorm calculation unit, the mean calculation unit being used to calculate the mean of each row of elements of the output vector of the compute engine array, the mean cache unit being used to cache the mean, the variance calculation unit being used to calculate the variance of each row of elements of the output vector of the compute engine array, the variance cache unit being used to cache the variance, and the LayerNorm calculation unit being used to obtain bias matrix cache data and parameter cache data from the bias cache region and the parameter cache region and to perform the layer-normalization function operation according to the bias matrix cache data, the parameter cache data, the mean and the variance.
7. The universal configurable Transformer hardware accelerator of any one of claims 1 to 6, wherein: the control module is further used to receive communication information from a host and to perform operation configuration and memory-access scheduling control on the compute engine array, the Softmax operation module and the LayerNorm operation module according to the communication information.
8. A method for implementing a universal configurable Transformer hardware accelerator, executed by the universal configurable Transformer hardware accelerator of any one of claims 1 to 7, comprising the following steps:
performing operation configuration and memory-access scheduling control on the compute engine array, the Softmax operation module and the LayerNorm operation module through the control module;
inputting the feature map vector and the weight vector into the compute engine array to obtain an output vector;
rearranging the data of the output vector and storing them in sequence into the cache regions of the on-chip cache module;
inputting the output vector into the Softmax operation module and the LayerNorm operation module;
looking up the maximum value of the output vector through the Softmax operation module, and calling the feature cache data of the feature cache region to perform the normalized exponential operation to obtain a first operation result;
calculating the mean and the variance of each row of elements of the output vector through the LayerNorm operation module, and calling the bias matrix cache data of the bias cache region and the parameter cache data of the parameter cache region to perform the layer-normalization function operation to obtain a second operation result;
and outputting the first operation result and the second operation result.
CN202210427056.XA 2022-04-22 2022-04-22 Universal configurable Transformer hardware accelerator and implementation method thereof Pending CN114897133A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210427056.XA CN114897133A (en) 2022-04-22 2022-04-22 Universal configurable Transformer hardware accelerator and implementation method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210427056.XA CN114897133A (en) 2022-04-22 2022-04-22 Universal configurable Transformer hardware accelerator and implementation method thereof

Publications (1)

Publication Number Publication Date
CN114897133A true CN114897133A (en) 2022-08-12

Family

ID=82717814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210427056.XA Pending CN114897133A (en) 2022-04-22 2022-04-22 Universal configurable Transformer hardware accelerator and implementation method thereof

Country Status (1)

Country Link
CN (1) CN114897133A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117787366A (en) * 2024-02-28 2024-03-29 苏州元脑智能科技有限公司 Hardware accelerator and scheduling method thereof
CN117787366B (en) * 2024-02-28 2024-05-10 苏州元脑智能科技有限公司 Hardware accelerator and scheduling method thereof

Similar Documents

Publication Publication Date Title
US10691996B2 (en) Hardware accelerator for compressed LSTM
CN107239829B (en) Method for optimizing artificial neural network
CN111459877B (en) Winograd YOLOv2 target detection model method based on FPGA acceleration
CN106228238B (en) Accelerate the method and system of deep learning algorithm on field programmable gate array platform
CN107657316B (en) Design of cooperative system of general processor and neural network processor
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN111178518A (en) Software and hardware cooperative acceleration method based on FPGA
CN110659069B (en) Instruction scheduling method for performing neural network computation and corresponding computing system
JP2021532437A (en) Improving machine learning models to improve locality
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
CN108304926B (en) Pooling computing device and method suitable for neural network
CN113392973B (en) AI chip neural network acceleration method based on FPGA
US20200097796A1 (en) Computing device and method
CN114626516A (en) Neural network acceleration system based on floating point quantization of logarithmic block
CN108491924B (en) Neural network data serial flow processing device for artificial intelligence calculation
CN114416045A (en) Method and device for automatically generating operator
CN111860773B (en) Processing apparatus and method for information processing
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
CN113792621A (en) Target detection accelerator design method based on FPGA
CN114897133A (en) Universal configurable Transformer hardware accelerator and implementation method thereof
CN117435855B (en) Method for performing convolution operation, electronic device, and storage medium
CN116762080A (en) Neural network generation device, neural network operation device, edge device, neural network control method, and software generation program
CN117521752A (en) Neural network acceleration method and system based on FPGA
US20230325087A1 (en) Systems and methods for accelerating memory transfers and computation efficiency using a computation-informed partitioning of an on-chip data buffer and implementing computation-aware data transfer operations to the on-chip data buffer
Xia et al. PAI-FCNN: FPGA based inference system for complex CNN models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination