CN116861966B - Transformer model accelerator and construction and data processing methods and devices thereof - Google Patents

Transformer model accelerator and construction and data processing methods and devices thereof

Info

Publication number
CN116861966B
Authority
CN
China
Prior art keywords
layer
data
unit
neural network
cache
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311127908.4A
Other languages
Chinese (zh)
Other versions
CN116861966A (en)
Inventor
杨宏斌
董刚
赵雅倩
曹其春
胡克坤
王斌强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202311127908.4A priority Critical patent/CN116861966B/en
Publication of CN116861966A publication Critical patent/CN116861966A/en
Application granted granted Critical
Publication of CN116861966B publication Critical patent/CN116861966B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The application relates to a Transformer model accelerator and to methods and devices for constructing it and for processing data with it. The method comprises the following steps: performing statistical analysis on the computation process of the Transformer model to obtain all the computation types involved, and determining the corresponding functional computation units according to the computation types; dividing all the functional computation units into a multi-head attention layer and a feedforward neural network layer, and constructing a data cache plane between the multi-head attention layer and the feedforward neural network layer, to form a Transformer model accelerator topology; and, according to the Transformer model parameters and the on-chip resources, optimizing through computation the number of nodes of the multi-head attention layer and the feedforward neural network layer in the Transformer model accelerator topology and the correspondence between each functional computation unit and its node position. The method can improve the computation speed.

Description

Transformer model accelerator and construction and data processing methods and devices thereof
Technical Field
The application relates to the technical field of Transformer models within heterogeneous computing, and in particular to a Transformer model accelerator and construction and data processing methods and devices thereof.
Background
The Transformer model is a model for machine translation proposed by Google in 2017. It completely abandons the structure of the traditional recurrent neural network and instead adopts a structure based entirely on the attention mechanism, achieving remarkable results; since then, model designs based on the full attention mechanism have spread from the NLP field to the computer vision field. For example, ViT is a Transformer-based vision model that ended the long-standing dominance of CNNs in the image field. Compared with traditional convolutional neural network (CNN) and recurrent neural network (RNN) methods, the attention-based Transformer model performs excellently in natural language processing (NLP), image recognition, object detection, object tracking and other fields, and is very widely applied. There are many types of Transformer models at present, but their basic building blocks are similar: all adopt an encoder-decoder architecture built from stacked multi-head attention mechanisms plus fully connected layers. The Transformer model can model the global dependencies of an image and make full use of context information; however, as the Transformer model evolves, its parameter count grows and its computation becomes more complex, so how to effectively accelerate the computation of the common features of such models is a problem to be solved in this field.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a Transformer model accelerator, and construction and data processing methods and devices thereof, that can improve the computation speed of the Transformer model.
In one aspect, a method for constructing a Transformer model accelerator is provided, the method comprising:
performing statistical analysis on the computation process of the Transformer model to obtain all the computation types involved, and determining the corresponding functional computation units according to the computation types;
dividing all the functional computation units into a multi-head attention layer and a feedforward neural network layer, and constructing a data cache plane between the multi-head attention layer and the feedforward neural network layer, to form a Transformer model accelerator topology;
and, according to the Transformer model parameters and the on-chip resources, optimizing through computation the number of nodes of the multi-head attention layer and the feedforward neural network layer in the Transformer model accelerator topology and the correspondence between each functional computation unit and its node position.
In one embodiment, the step of determining the corresponding functional computation unit according to the computation type includes:
performing statistical analysis on the computation process of the Transformer model, the computation process involving five computation types: matrix multiplication, Softmax function, layer normalization, residual, and activation;
and determining the corresponding functional computation units according to the five computation types, including a matrix multiplication unit, a Softmax unit, a layer normalization unit, a residual unit, and an activation unit.
In one embodiment, the step of performing statistical analysis on the computation process of the Transformer model to obtain all the computation types involved includes:
analyzing the Transformer model structure to obtain the plurality of encoders and decoders arranged in series in the Transformer model;
analyzing the structures of the encoder and the decoder to obtain the main functional layers and auxiliary functional layers of the encoder and the decoder in the computation process, wherein the main functional layers include the multi-head attention layer and the feedforward neural network layer, and the auxiliary functional layers include the residual layer and layer normalization layer located after the multi-head attention layer and the feedforward neural network layer and the activation layer located inside the feedforward neural network layer;
and obtaining by statistics the five computation types in the main functional layers and the auxiliary functional layers: matrix multiplication, Softmax function, layer normalization, residual, and activation.
In one embodiment, the step of determining the corresponding functional computation unit according to the computation type includes:
setting the structure of each functional computation unit to include an input neighbor cache, an output neighbor cache, a weight cache, a computation unit, a control unit, an input/output Host cache, and a Flit generation unit; the input neighbor cache is used to receive data from adjacent nodes, the output neighbor cache is used to output data to adjacent nodes, the weight cache is used to store the weight data with which the current node participates in computation, the input/output Host cache is used for loading data from and outputting data to main memory for the multi-head attention layer and the feedforward neural network layer, the Flit generation unit is used to generate the Flit packets transmitted by each node in the multi-head attention layer and the feedforward neural network layer, and the control unit is used to control data access, packet generation, and packet parsing;
and arranging the computation units in the functional computation units to form the matrix multiplication unit, the Softmax unit, the layer normalization unit, the residual unit, and the activation unit.
In one embodiment, the step of dividing all the functional computation units into a multi-head attention layer and a feedforward neural network layer and constructing a data cache plane between the multi-head attention layer and the feedforward neural network layer to form a Transformer model accelerator topology includes:
constructing the multi-head attention layer and the feedforward neural network layer using a ring network-on-chip structure;
and constructing a data cache plane between the multi-head attention layer and the feedforward neural network layer, wherein the multi-head attention layer, the feedforward neural network layer, and the data cache plane are interconnected to form the Transformer model accelerator topology.
In one embodiment, the step of dividing all the functional computation units into a multi-head attention layer and a feedforward neural network layer and constructing a data cache plane between the multi-head attention layer and the feedforward neural network layer to form a Transformer model accelerator topology further includes:
selecting, from the functional computation units, a matrix multiplication unit, a Softmax unit, a layer normalization unit, and a residual unit as nodes of the ring network-on-chip structure of the multi-head attention layer;
selecting, from the functional computation units, a matrix multiplication unit, a layer normalization unit, a residual unit, and an activation unit as nodes of the ring network-on-chip structure of the feedforward neural network layer;
and setting the data cache plane to perform data interaction with its adjacent nodes on the multi-head attention layer and its adjacent nodes on the feedforward neural network layer.
In one embodiment, the step of constructing a data cache plane between the multi-head attention layer and the feedforward neural network layer includes:
setting the data cache plane to include an input feature cache, an output feature cache, a weight cache, and a residual cache, wherein the input feature cache is used to store and output input feature data, the weight cache is used to store and output weight data, the input feature data and the weight data are processed by the computation components and then written back to the output feature cache, the output feature cache outputs feature data to serve as the input feature data of the next group of computations, and the residual cache is used to store the input feature data of each multi-head attention layer and feedforward neural network layer for residual computation.
In one embodiment, the step of setting the data cache plane to perform data interaction with the adjacent nodes on the multi-head attention layer and the adjacent nodes on the feedforward neural network layer includes:
setting the node on the multi-head attention layer adjacent to the data cache plane and the node on the feedforward neural network layer adjacent to the data cache plane as input/output nodes;
and setting the data cache plane to distribute data to the input/output nodes using a Benes network or a butterfly network.
In one embodiment, after the step of forming the Transformer model accelerator topology, the method further comprises:
setting the data storage format in the Transformer model accelerator to use a parallel inner-product mode, with the data in each block stored by rows and columns.
In one embodiment, after the step of forming the Transformer model accelerator topology, the method further comprises:
setting a matrix multiplication component to compute on the data stored in the Transformer model accelerator, wherein the sub-processing units of the matrix multiplication component adopt a systolic array structure, and each sub-processing unit adopts a block-matrix parallel multiply-accumulate structure.
In one embodiment, after the step of performing statistical analysis on the computation process of the Transformer model to obtain all the computation types involved, the method further includes:
setting the instruction formats of the external memory access, matrix multiplication, Softmax function, layer normalization, residual, and activation operations, wherein the parameters of the instruction format include memory address offset, data amount, row width, row count, and cache block selection.
In one embodiment, the step of setting the structure of each functional computation unit to include the Flit generation unit includes:
setting the Flit generation unit to generate the Flit format of the Flit packets transmitted by each node in the multi-head attention layer and the feedforward neural network layer;
and controlling the Flit packets in each node in the multi-head attention layer and the feedforward neural network layer to be generated by the Flit generation unit according to the Flit format.
In one embodiment, the step of optimizing, through computation according to the Transformer model parameters and the on-chip resources, the number of nodes of the multi-head attention layer and the feedforward neural network layer in the Transformer model accelerator topology and the correspondence between each functional computation unit and its node position includes:
setting the parameters of the Transformer model to be computed and the on-chip resource constraints;
giving an initial configuration of the functional computation unit corresponding to each node of the multi-head attention layer and the feedforward neural network layer in the Transformer model accelerator topology;
computing the total computation time and the total data transmission path delay of the currently configured Transformer model accelerator;
adjusting the number and proportion of the functional computation units and the correspondence between each functional computation unit and its node position, then recomputing, so as to continuously reduce the time consumed;
and obtaining the Transformer model accelerator topology with the smallest sum of total computation time and total data packet transmission path delay within the limited on-chip resource constraints.
In one embodiment, the step of setting the parameters of the Transformer model to be computed and the on-chip resource constraints includes:
defining the hardware resources occupied by the matrix multiplication unit, the Softmax unit, the layer normalization unit, the residual unit, and the activation unit in the functional computation units as Lm, Ls, Ln, Lr, and La respectively, their computation times as Tm, Ts, Tn, Tr, and Ta respectively, their path delays as Dm, Ds, Dn, Dr, and Da respectively, and their unit counts as Nm, Ns, Nn, Nr, and Na respectively;
computing the total amount of resources used by all functional computation units as: L_total = Lm1 + Lm2 + … + Lm(Nm) + Ls1 + Ls2 + … + Ls(Ns) + Ln1 + Ln2 + … + Ln(Nn) + Lr1 + Lr2 + … + Lr(Nr) + La1 + La2 + … + La(Na);
assuming that the computation times of the functional computation units corresponding to the nodes of the multi-head attention layer and the feedforward neural network layer are equal: (Tm/Nm) = (Ts/Ns) = (Tn/Nn) = (Tr/Nr) = (Ta/Na);
and, assuming that the total amount of on-chip resources allocated to all the functional computation units is FL_total, the on-chip resource constraint is: L_total ≤ FL_total.
In one embodiment, the step of computing the total computation time and the total data transmission path delay of the currently configured Transformer model accelerator includes:
computing the total computation time as: T_total = (Tm/Nm) + (Ts/Ns) + (Tn/Nn) + (Tr/Nr) + (Ta/Na);
and, noting that after a data packet is sent out from the data cache plane its propagation path delay increases by 1 for each node it passes through, computing the total data transmission path delay as: D_total = Dm1 + Dm2 + … + Dm(Nm) + Ds1 + Ds2 + … + Ds(Ns) + Dn1 + Dn2 + … + Dn(Nn) + Dr1 + Dr2 + … + Dr(Nr) + Da1 + Da2 + … + Da(Na).
In one embodiment, the step of adjusting the number and proportion of the functional computation units and the correspondence between each functional computation unit and its node position so as to continuously reduce the time consumed includes:
under the condition that the computation times of the functional computation units corresponding to the nodes of the multi-head attention layer and the feedforward neural network layer are equal and under the on-chip resource constraint, adjusting the number and proportion of the nodes of the multi-head attention layer and the feedforward neural network layer and the correspondence between each functional computation unit and its node position;
and taking as the reward the reduction in total computation time and data packet transmission path delay between the Transformer model accelerator configurations before and after the adjustment, iterating continuously until the result converges.
In another aspect, a Transformer model accelerator is provided that includes a multi-head attention layer, a feedforward neural network layer, and a data cache plane between the multi-head attention layer and the feedforward neural network layer; the Transformer model accelerator is obtained by the above construction method of a Transformer model accelerator.
In one embodiment, the multi-head attention layer and the feedforward neural network layer both adopt a ring network-on-chip structure; the functional computation units include a matrix multiplication unit, a Softmax unit, a layer normalization unit, a residual unit, and an activation unit; a matrix multiplication unit, a Softmax unit, a layer normalization unit, and a residual unit are selected from the functional computation units as the nodes of the ring network-on-chip structure of the multi-head attention layer; a matrix multiplication unit, a layer normalization unit, a residual unit, and an activation unit are selected from the functional computation units as the nodes of the ring network-on-chip structure of the feedforward neural network layer; and the data cache plane performs data interaction with its adjacent nodes on the multi-head attention layer and its adjacent nodes on the feedforward neural network layer.
In one embodiment, the structure of each functional computation unit includes an input neighbor cache, an output neighbor cache, a weight cache, a computation unit, a control unit, an input/output Host cache, and a Flit generation unit; the input neighbor cache is used to receive data from adjacent nodes, the output neighbor cache is used to output data to adjacent nodes, the weight cache is used to store the weight data with which the current node participates in computation, the input/output Host cache is used for loading data from and outputting data to main memory for the multi-head attention layer and the feedforward neural network layer, the Flit generation unit is used to generate the Flit packets transmitted by each node in the multi-head attention layer and the feedforward neural network layer, and the control unit is used to control data access, packet generation, and packet parsing.
In one embodiment, the data cache plane includes an input feature cache, an output feature cache, a weight cache, and a residual cache, wherein the input feature cache is used to store and output input feature data, the weight cache is used to store and output weight data, the input feature data and the weight data are processed by the computation components and then written back to the output feature cache, the output feature cache outputs feature data to serve as the input feature data of the next group of computations, and the residual cache is used to store the input feature data of each multi-head attention layer and feedforward neural network layer for residual computation.
In one embodiment, the node on the multi-head attention layer adjacent to the data cache plane and the node on the feedforward neural network layer adjacent to the data cache plane serve as input/output nodes, and the data cache plane distributes data to the input/output nodes using a Benes network or a butterfly network.
In yet another aspect, a data processing method of a Transformer model accelerator is provided, which includes the following steps:
loading data from external memory into the input feature cache, the output feature cache, the weight cache, and the residual cache of the data cache plane;
distributing the corresponding weights to each node of the multi-head attention layer and the feedforward neural network layer;
distributing the corresponding instructions and parameters to each node of the multi-head attention layer and the feedforward neural network layer;
distributing the corresponding feature data to the nodes of the multi-head attention layer and the feedforward neural network layer;
after each node completes the computation flow issued to it, passing the result to the next node for further computation or returning it to the data cache;
and returning to the step of distributing the corresponding instructions and parameters to each node of the multi-head attention layer and the feedforward neural network layer, repeating until the computations of all nodes are completed, and outputting the computation result to external memory. A sketch of this flow is given below.
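Read as pseudocode only: the patent does not define a software API, so the object and method names below (`cache_plane.load_from`, `node.compute`, and so on) are assumptions used purely to illustrate the ordering of the steps above.

```python
def run_accelerator(external_memory, cache_plane, nodes):
    """High-level sketch of the accelerator's data processing flow."""
    # 1. Load data from external memory into the data cache plane.
    cache_plane.load_from(external_memory)
    # 2. Distribute the weights to each node of the MHA and FFN layers.
    for node in nodes:
        node.receive_weights(cache_plane.weights_for(node))
    while not all(node.done for node in nodes):
        # 3. Distribute the corresponding instructions and parameters.
        for node in nodes:
            node.receive_instruction(cache_plane.instruction_for(node))
        # 4. Distribute the corresponding feature data.
        for node in nodes:
            node.receive_features(cache_plane.features_for(node))
        # 5. Each node computes; results go to the next node or back to the cache.
        for node in nodes:
            result = node.compute()
            if node.next is not None:
                node.next.receive_features(result)
            else:
                cache_plane.write_back(result)
    # 6. Output the final result to external memory.
    external_memory.store(cache_plane.output_features())
```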
According to the Transformer model accelerator and the construction and data processing methods and devices thereof described above, the computation process of the Transformer model is statistically analyzed and the functional computation units corresponding to the computation types of the Transformer model structure are abstracted to form the multi-head attention layer and the feedforward neural network layer; a data cache plane is then set up in combination with the characteristics of the Transformer model to form the accelerator topology, and each basic module and the overall structure of the accelerator are optimized. By adjusting the number and proportion of the functional computation units and the correspondence between each functional computation unit and its node position, the Transformer model accelerator with the lowest sum of total model computation time and total data packet transmission path delay can finally be obtained.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention, and that a person skilled in the art could obtain other drawings from them without inventive effort.
FIG. 1 is a diagram of an application environment of a method for constructing a Transformer model accelerator in one embodiment;
FIG. 2 is a flow diagram of a method for constructing a Transformer model accelerator in one embodiment;
FIG. 3 is a schematic diagram of the topology of a Transformer model accelerator in one embodiment;
FIG. 4 is a schematic diagram of the data cache plane of a Transformer model accelerator in one embodiment;
FIG. 5 is a schematic diagram of the structure of each functional computing unit in one embodiment;
FIG. 6 is a schematic diagram of a structure of a Benes network used for data distribution of the data cache plane to the I/O nodes according to an embodiment;
FIG. 7 is a schematic diagram of a butterfly network used for data distribution from the data cache plane to the I/O nodes in one embodiment;
FIG. 8 is a schematic diagram of a 1 row 1 column inner product calculation process in one embodiment;
FIG. 9 is a diagram of inner product data loading and generation with 4 degrees of parallelism in one embodiment;
FIG. 10 is a schematic diagram of a corresponding memory data storage format in one embodiment;
FIG. 11 is a schematic diagram of matrix partitioning in one embodiment;
FIG. 12 is a diagram of the internal architecture of a matrix multiplication component in one embodiment;
FIG. 13 is an instruction format of external memory access, matrix multiplication, softmax, layer normalization, residual and activate operations in one embodiment;
FIG. 14 is the Flit format transmitted between nodes in a Transformer model accelerator in one embodiment;
FIG. 15 is a schematic diagram of the steps of optimizing, through computation according to the Transformer model parameters and on-chip resources, the number of nodes of the multi-head attention layer and the feedforward neural network layer in the Transformer model accelerator topology and the correspondence between each functional computation unit and its node position, in one embodiment;
FIG. 16 is a block diagram of the construction of a transducer model accelerator in one embodiment;
FIG. 17 is a flow diagram of a data processing method of a Transformer model accelerator in one embodiment;
fig. 18 is an internal structural view of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Network-on-chip (NoC) technology connects a large amount of computing resources inside a chip in the form of a network and separates the computation function from the communication function. A network on chip therefore combines the system-on-chip's highly integrated approach to data computation with the computer network's routing-and-switching approach to data communication, integrating the advantages of both.
The topology is one of the cores of the network on chip, which represents the shape of the network and also its physical connectivity, i.e. the way in which routers are interconnected. Different topologies have a direct impact on the overall NoC performance, and therefore different topologies need to be selected for different applications. NoC topologies are divided into two broad categories, regular and irregular, with regular including star, polygon, tree, butterfly, mesh, torus network, etc. There are also irregular structures, most of which are network structures tailored for certain specific applications.
A network on chip mainly comprises three aspects: the topology, the routers, and the network adapters. The topology determines the physical connections and how communication over the shared links is realized. The routers implement the communication protocol and arbitration, and also buffer the data transmitted in the network. The network adapters, also known as network interfaces, mainly provide high-level communication services and implement the logic that connects a local processor to the rest of the network.
Since each router of a ring NoC structure involves only three channels (left, right, and local), complex logic structures such as crossbars are avoided. The router of a ring NoC structure is therefore simpler in structure and better in performance. For this reason, both the Multi-Head Attention layer (MHA) and the Feedforward Neural Network layer (FFN) in the topology of the network-on-chip-based Transformer model accelerator (referred to simply as the accelerator) of the present application adopt a ring network-on-chip structure.
To give the accelerator better scalability, the multi-head attention layer (MHA) and the feedforward neural network layer (FFN) in the accelerator each contain multiple nodes, with adjacent nodes combined into lateral and longitudinal rings. Each ring consists of four nodes, and a lateral ring and a longitudinal ring intersect at a common node. When executing a Transformer, the data of the resource node at the common node can be selectively transmitted to each node on the lateral ring or each node on the longitudinal ring; as can be seen from the characteristics of the algorithm, in most cases the data only needs to be transmitted to the next adjacent node on the ring. The resource node of each node in the accelerator network uses a processor core with relatively comprehensive functions rather than a simple functional processing unit, mainly to improve the generality of the accelerator.
To solve the above problems, an embodiment of the present invention provides a method for constructing a Transformer model accelerator based on a network-on-chip structure, which analyzes existing Transformer models and abstracts the Transformer model structure. The accelerator topology is then designed in combination with the characteristics of the Transformer model, each basic module and the overall structure of the accelerator are optimized, and by adjusting the number, proportion, and node positions of the nodes and testing, the time consumed is continuously reduced, finally reaching the minimum total model computation time and data packet transmission path delay within the limited on-chip resources.
The construction method of the Transformer model accelerator can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices; the server 104 may be implemented by a stand-alone server or by a server cluster composed of a plurality of servers. The Transformer model accelerator is constructed in the server 104, and the terminal 102 may input data into the Transformer model accelerator constructed in the server 104.
In one embodiment, as shown in FIG. 2, a method for constructing a Transformer model accelerator is provided. The method is described, by way of illustration, as applied to the server 104 in FIG. 1, and includes the following steps:
Step S1: performing statistical analysis on the computation process of the Transformer model to obtain all the computation types involved, and determining the corresponding functional computation units according to the computation types;
Step S2: dividing all the functional computation units into a multi-head attention layer and a feedforward neural network layer, and constructing a data cache plane between the multi-head attention layer and the feedforward neural network layer, to form a Transformer model accelerator topology;
Step S3: according to the Transformer model parameters and the on-chip resources, optimizing through computation the number of each node of the multi-head attention layer and the feedforward neural network layer in the Transformer model accelerator topology and the correspondence between each functional computation unit and its node position.
Here, the on-chip resources refer to the resources of the FPGA chip.
As shown in FIG. 3, FIG. 3 is a schematic diagram of the topology of the Transformer model accelerator constructed by the above construction method.
In this embodiment, the step of determining the corresponding functional computation unit according to the computation type includes:
performing statistical analysis on the computation process of the Transformer model, the computation process involving five computation types: matrix multiplication, Softmax function, layer normalization, residual, and activation;
and determining the corresponding functional computation units according to the five computation types, including a matrix multiplication unit, a Softmax unit, a layer normalization unit, a residual unit, and an activation unit.
The Softmax function is a normalized exponential function and a generalization of the logistic function. Layer Normalization (LN) is a normalization technique used in deep neural networks: it normalizes the output of each neuron so that the outputs of each layer in the network have a similar distribution.
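For reference only, a minimal NumPy sketch of these two operations is given below; the function names, the epsilon value, and the per-row (last axis) treatment are conventional choices, not taken from the patent text.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    # Subtract the row maximum for numerical stability, then normalize.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray,
               eps: float = 1e-5) -> np.ndarray:
    # Normalize each row to zero mean and unit variance, then scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```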
In this embodiment, the step of performing statistical analysis on the computation process of the Transformer model to obtain all the computation types involved includes:
analyzing the Transformer model structure to obtain the plurality of encoders and decoders arranged in series in the Transformer model;
analyzing the structures of the encoder and the decoder to obtain the main functional layers and auxiliary functional layers of the encoder and the decoder in the computation process, wherein the main functional layers include the multi-head attention layer and the feedforward neural network layer, and the auxiliary functional layers include the residual layer and layer normalization layer located after the multi-head attention layer and the feedforward neural network layer and the activation layer located inside the feedforward neural network layer;
and obtaining by statistics the five computation types in the main functional layers and the auxiliary functional layers: matrix multiplication, Softmax function, layer normalization, residual, and activation.
Referring to FIG. 5, in this embodiment the step of determining the corresponding functional computation unit according to the computation type includes:
setting the structure of each functional computation unit to include an input neighbor cache, an output neighbor cache, a weight cache, a computation unit, a control unit, an input/output Host cache, and a Flit generation unit; the input neighbor cache is used to receive data from adjacent nodes, the output neighbor cache is used to output data to adjacent nodes, the weight cache is used to store the weight data with which the current node participates in computation, the input/output Host cache is used for loading data from and outputting data to main memory for the multi-head attention layer and the feedforward neural network layer, the Flit generation unit is used to generate the Flit packets transmitted by each node in the multi-head attention layer and the feedforward neural network layer, and the control unit is used to control data access, packet generation, and packet parsing;
and arranging the computation units in the functional computation units to form the matrix multiplication unit, the Softmax unit, the layer normalization unit, the residual unit, and the activation unit.
In this embodiment, the step of dividing all the functional computation units into a multi-head attention layer and a feedforward neural network layer and constructing a data cache plane between the multi-head attention layer and the feedforward neural network layer to form a Transformer model accelerator topology includes:
constructing the multi-head attention layer and the feedforward neural network layer using a ring network-on-chip structure;
and constructing a data cache plane between the multi-head attention layer and the feedforward neural network layer, wherein the multi-head attention layer, the feedforward neural network layer, and the data cache plane are interconnected to form the Transformer model accelerator topology.
In this embodiment, the step of dividing all the functional computation units into a multi-head attention layer and a feedforward neural network layer and constructing a data cache plane between the multi-head attention layer and the feedforward neural network layer to form a Transformer model accelerator topology further includes:
selecting, from the functional computation units, a matrix multiplication unit, a Softmax unit, a layer normalization unit, and a residual unit as nodes of the ring network-on-chip structure of the multi-head attention layer;
selecting, from the functional computation units, a matrix multiplication unit, a layer normalization unit, a residual unit, and an activation unit as nodes of the ring network-on-chip structure of the feedforward neural network layer;
and setting the data cache plane to perform data interaction with its adjacent nodes on the multi-head attention layer and its adjacent nodes on the feedforward neural network layer.
Referring to FIG. 4, in this embodiment the step of constructing a data cache plane between the multi-head attention layer and the feedforward neural network layer includes:
setting the data cache plane to include an input feature cache, an output feature cache, a weight cache, and a residual cache, wherein the input feature cache is used to store and output input feature data, the weight cache is used to store and output weight data, the input feature data and the weight data are processed by the computation components and then written back to the output feature cache, the output feature cache outputs feature data to serve as the input feature data of the next group of computations, and the residual cache is used to store the input feature data of each multi-head attention layer and feedforward neural network layer for residual computation.
Thus, apart from the weight cache, which holds fixed weight data, the input feature cache, the output feature cache, and the residual cache alternate their data accesses. The input feature cache is the cache that supplies input features to the multi-head attention layer and the feedforward neural network layer, and the output feature cache is the cache that receives the output features of the multi-head attention layer and the feedforward neural network layer.
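For intuition only, the alternating use of the feature caches between successive layers can be sketched as below; the class and method names are assumptions and do not appear in the patent.

```python
class DataCachePlane:
    """Minimal sketch of the data cache plane: the weight cache holds fixed
    weights, while the input/output feature caches swap roles after each layer
    so that one layer's output becomes the next layer's input."""

    def __init__(self):
        self.input_feature = []   # features fed to the current layer
        self.output_feature = []  # features written back by the current layer
        self.weight = {}          # weight data, fixed per node
        self.residual = []        # layer inputs kept for the residual path

    def start_layer(self, features):
        self.input_feature = features
        self.residual = list(features)  # keep a copy for residual computation
        self.output_feature = []

    def finish_layer(self):
        # The output of this layer becomes the input of the next group of computations.
        self.input_feature, self.output_feature = self.output_feature, []
        return self.input_feature
```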
In this embodiment, the step of setting the data cache plane to perform data interaction with the adjacent nodes on the multi-head attention layer and the adjacent nodes on the feedforward neural network layer includes:
setting the node on the multi-head attention layer adjacent to the data cache plane and the node on the feedforward neural network layer adjacent to the data cache plane as input/output nodes;
and setting the data cache plane to distribute data to the input/output nodes using a Benes network or a butterfly network.
Here, as shown in FIG. 6, the Benes network is a non-blocking N-input, N-output multi-stage network with 2log(N)+1 stages. As shown in FIG. 7, a butterfly network interconnection can also provide adequate throughput.
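As a toy illustration of the butterfly option only (the function name and the destination-tag routing scheme are assumptions, not the patent's distribution logic): in a butterfly network each stage fixes one address bit, so a path from a cache-plane port to an input/output node can be traced as follows.

```python
def butterfly_path(src: int, dst: int, n_stages: int) -> list[int]:
    """Destination-tag routing in a butterfly network: at stage i the i-th
    address bit (most significant first) of the current position is replaced
    by the corresponding bit of the destination."""
    pos, path = src, [src]
    for i in range(n_stages):
        bit = 1 << (n_stages - 1 - i)
        pos = (pos & ~bit) | (dst & bit)
        path.append(pos)
    return path

# Example: 8 ports (3 stages), routing from port 5 to port 2.
print(butterfly_path(5, 2, 3))  # [5, 1, 3, 2]
```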
In this embodiment, after the step of forming the Transformer model accelerator topology, the method further includes:
setting the data storage format in the Transformer model accelerator to use a parallel inner-product mode, with the data in each block stored by rows and columns.
In this embodiment, after the step of forming the Transformer model accelerator topology, the method further includes:
setting a matrix multiplication component to compute on the data stored in the Transformer model accelerator, wherein the sub-processing units of the matrix multiplication component adopt a systolic array structure, and each sub-processing unit adopts a block-matrix parallel multiply-accumulate structure.
The matrix multiplication component is based on a block-matrix computation scheme and can adapt to matrices of various sizes. To unify the data storage format, a parallel inner-product mode is used and the data are uniformly stored in blocks by rows (columns); the inner-product computation of one row with one column is shown in FIG. 8. FIG. 9 shows the loading and generation of inner-product data with a parallelism of 4: during data loading, the left matrix loads row data blocks in sequence and the right matrix loads the corresponding column data blocks, all columns are traversed for one group of rows to obtain one group of row results, and then the left matrix loads the next group of rows, until the computation of all rows is completed. FIG. 10 is a schematic diagram of the corresponding memory data storage format. FIG. 11 is a schematic diagram of matrix partitioning. FIG. 12 is a schematic diagram of the internal structure of the matrix multiplication component: the whole adopts a systolic array structure, and each sub-processing unit adopts a block-matrix parallel multiply-accumulate structure, which allows the parallelism of each row of data to be adjusted flexibly and achieves a trade-off between resources and computation time.
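Purely as an illustration of this row-group-by-column-block flow (a sketch under an assumed block size, not the patent's hardware implementation), the accumulation order can be written as:

```python
import numpy as np

def blocked_matmul(A: np.ndarray, B: np.ndarray, block: int = 4) -> np.ndarray:
    """Blocked matrix multiply mirroring the described flow: take one group of
    rows of the left matrix, traverse all column blocks of the right matrix to
    produce that group's results, then move on to the next group of rows."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, block):          # next group of rows of the left matrix
        for j in range(0, N, block):      # traverse all column blocks of the right matrix
            acc = np.zeros((min(block, M - i), min(block, N - j)), dtype=A.dtype)
            for k in range(0, K, block):  # parallel inner product, accumulated block by block
                acc += A[i:i+block, k:k+block] @ B[k:k+block, j:j+block]
            C[i:i+block, j:j+block] = acc
    return C

# Quick check against NumPy's reference result.
A, B = np.random.rand(6, 5), np.random.rand(5, 7)
assert np.allclose(blocked_matmul(A, B), A @ B)
```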
In this embodiment, after the step of performing statistical analysis on the computation process of the Transformer model to obtain all the computation types involved, the method further includes:
setting the instruction formats of the external memory access, matrix multiplication, Softmax function, layer normalization, residual, and activation operations, wherein the parameters of the instruction format include memory address offset, data amount, row width, row count, and cache block selection.
As shown in FIG. 13, FIG. 13 shows the instruction formats of the external memory access, matrix multiplication, Softmax, layer normalization, residual, and activation operations.
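For illustration only, such an instruction could be represented in software as follows; the field names follow the parameters listed above, while the opcode values and field widths are assumptions, since the patent text does not give them here.

```python
from dataclasses import dataclass
from enum import Enum

class OpCode(Enum):
    # Operation types named in the text; the numeric values are assumed.
    MEM_ACCESS = 0
    MATMUL = 1
    SOFTMAX = 2
    LAYERNORM = 3
    RESIDUAL = 4
    ACTIVATE = 5

@dataclass
class Instruction:
    op: OpCode
    mem_offset: int   # memory address offset
    data_amount: int  # total amount of data to move or process
    row_width: int    # width of one row
    row_count: int    # number of rows
    cache_block: int  # which cache block to use
```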
In this embodiment, the step of setting the structure of each functional computation unit to include the Flit generation unit includes:
setting the Flit generation unit to generate the Flit format of the Flit packets transmitted by each node in the multi-head attention layer and the feedforward neural network layer;
and controlling the Flit packets in each node in the multi-head attention layer and the feedforward neural network layer to be generated by the Flit generation unit according to the Flit format.
FIG. 14 shows the Flit format transmitted between nodes in the Transformer model accelerator; the data packets are generated by the Flit generation unit in each node. Each group of Flit packets consists of a Header Flit, Body Flits, and a Tail Flit. The Header Flit contains, in order, FT, PT, DT, Src, MDst, LEN, and Acc; the Body Flits and the Tail Flit contain, in order, FT and a payload. FT is the Flit Type, distinguishing feature data, weight data, and instructions. PT is the packet type, divided into MATMULT, Softmax, LayerNorm, Residual, and ReLU according to the type of computation node the packet is destined for. DT is the distribution type, either unicast or multicast. Src is the source node ID and MDst is the destination node ID; if DT is multicast, MDst also includes the other nodes to be traversed. LEN is the total number of bytes of Flit data in the packet. Acc is a count of the nodes along the Flit's path, used to measure data packet transmission path delay. These header fields carry the control information needed to ensure correct transmission and reception of data, while the Flit payload is the actual data content.
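As a reading aid only (the patent text does not specify bit widths here, so plain Python fields and string encodings are used as assumptions), the described packet structure could be modeled as:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class HeaderFlit:
    ft: str           # Flit Type: "feature", "weight", or "instruction"
    pt: str           # packet type: MATMULT, Softmax, LayerNorm, Residual, or ReLU
    dt: str           # distribution type: "unicast" or "multicast"
    src: int          # source node ID
    mdst: List[int]   # destination node ID(s); several entries when multicast
    length: int       # LEN: total number of bytes of Flit data
    acc: int = 0      # incremented at every node passed, measures path delay

@dataclass
class BodyFlit:
    ft: str           # Flit Type
    payload: bytes    # actual data content

@dataclass
class FlitPacket:
    header: HeaderFlit
    body: List[BodyFlit] = field(default_factory=list)
    tail: Optional[BodyFlit] = None
```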
Referring to FIG. 15, in this embodiment the step of optimizing, through computation according to the Transformer model parameters and the on-chip resources, the number of nodes of the multi-head attention layer and the feedforward neural network layer in the Transformer model accelerator topology and the correspondence between each functional computation unit and its node position includes:
setting the parameters of the Transformer model to be computed and the on-chip resource constraints;
giving an initial configuration of the functional computation unit corresponding to each node of the multi-head attention layer and the feedforward neural network layer in the Transformer model accelerator topology;
computing the total computation time and the total data transmission path delay of the currently configured Transformer model accelerator;
adjusting the number and proportion of the functional computation units and the correspondence between each functional computation unit and its node position, then recomputing, so as to continuously reduce the time consumed;
and obtaining the Transformer model accelerator topology with the smallest sum of total computation time and total data packet transmission path delay within the limited on-chip resource constraints.
As shown in FIG. 15: given the parameters of the Transformer model to be computed and the resource constraints of the FPGA chip, an initial configuration of the architecture is given; the total computation time of the currently configured model and the data packet transmission path delay are measured and fed back to the Agent; the Agent then takes an optimization action, adjusting the number, proportion, and node positions of the nodes, and tests again, continuously reducing the time consumed, finally reaching the minimum total model computation time and data packet transmission path delay within the limited on-chip resources.
In this embodiment, the step of setting the parameters of the Transformer model to be computed and the on-chip resource constraints includes:
defining the hardware resources occupied by the matrix multiplication unit, the Softmax unit, the layer normalization unit, the residual unit, and the activation unit in the functional computation units as Lm, Ls, Ln, Lr, and La respectively, their computation times as Tm, Ts, Tn, Tr, and Ta respectively, their path delays as Dm, Ds, Dn, Dr, and Da respectively, and their unit counts as Nm, Ns, Nn, Nr, and Na respectively;
computing the total amount of resources used by all functional computation units as: L_total = Lm1 + Lm2 + … + Lm(Nm) + Ls1 + Ls2 + … + Ls(Ns) + Ln1 + Ln2 + … + Ln(Nn) + Lr1 + Lr2 + … + Lr(Nr) + La1 + La2 + … + La(Na);
assuming that the computation times of the functional computation units corresponding to the nodes of the multi-head attention layer and the feedforward neural network layer are equal: (Tm/Nm) = (Ts/Ns) = (Tn/Nn) = (Tr/Nr) = (Ta/Na);
and, assuming that the total amount of on-chip resources allocated to all the functional computation units is FL_total, the on-chip resource constraint is: L_total ≤ FL_total.
In this embodiment, the step of computing the total computation time and the total data transmission path delay of the currently configured Transformer model accelerator includes:
computing the total computation time as: T_total = (Tm/Nm) + (Ts/Ns) + (Tn/Nn) + (Tr/Nr) + (Ta/Na);
and, noting that after a data packet is sent out from the data cache plane its propagation path delay increases by 1 for each node it passes through, computing the total data transmission path delay as: D_total = Dm1 + Dm2 + … + Dm(Nm) + Ds1 + Ds2 + … + Ds(Ns) + Dn1 + Dn2 + … + Dn(Nn) + Dr1 + Dr2 + … + Dr(Nr) + Da1 + Da2 + … + Da(Na). A small illustrative computation of these quantities follows.
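For illustration only, the three quantities above can be computed directly; the dictionaries keyed by unit type (m, s, n, r, a) and the function names below are assumed representations, not part of the patent.

```python
UNIT_TYPES = ["m", "s", "n", "r", "a"]  # matmul, Softmax, LayerNorm, residual, activation

def l_total(resources: dict) -> float:
    # resources["m"] is the list [Lm1, Lm2, ..., Lm(Nm)], and so on for each type.
    return sum(sum(resources[u]) for u in UNIT_TYPES)

def t_total(times: dict, counts: dict) -> float:
    # times["m"] = Tm, counts["m"] = Nm, etc.; T_total is the sum of T/N over unit types.
    return sum(times[u] / counts[u] for u in UNIT_TYPES)

def d_total(delays: dict) -> float:
    # delays["m"] is the list [Dm1, ..., Dm(Nm)], etc.; each hop adds 1 to the path delay.
    return sum(sum(delays[u]) for u in UNIT_TYPES)

def feasible(resources: dict, fl_total: float) -> bool:
    # On-chip resource constraint: L_total <= FL_total.
    return l_total(resources) <= fl_total
```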
In this embodiment, the step of adjusting the number and proportion of the nodes and the correspondence between each functional computation unit and its node position and then recomputing, so as to continuously reduce the time consumed, includes:
under the condition that the computation times of the functional computation units corresponding to the nodes of the multi-head attention layer and the feedforward neural network layer are equal and under the on-chip resource constraint, adjusting the number and proportion of the nodes of the multi-head attention layer and the feedforward neural network layer and the correspondence between each functional computation unit and its node position;
and taking as the reward the reduction in total computation time and data packet transmission path delay between the Transformer model accelerator configurations before and after the adjustment, iterating continuously until the result converges. A simplified sketch of such an iteration is given below.
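The reward-driven iteration can be sketched as the following loop. This is a simplified illustration only: the patent does not prescribe a particular search or reinforcement-learning algorithm, and `evaluate`, `propose_adjustment`, and `feasible` are assumed helpers (for example, `evaluate` could return the measured T_total + D_total of a configuration).

```python
def optimize(initial_config, evaluate, propose_adjustment, feasible, max_iters: int = 1000):
    """Iteratively adjust node counts, ratios, and unit-to-node assignments,
    keeping a change when the reward (reduction in total computation time plus
    packet path delay) is positive."""
    best = initial_config
    best_cost = evaluate(best)  # T_total + D_total of the current configuration
    for _ in range(max_iters):
        candidate = propose_adjustment(best)
        if not feasible(candidate):          # respect the on-chip resource constraint
            continue
        cost = evaluate(candidate)
        reward = best_cost - cost            # positive reward: configuration improved
        if reward > 0:
            best, best_cost = candidate, cost
    return best, best_cost
```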
In the above construction method of the Transformer model accelerator, the computation process of the Transformer model is statistically analyzed and the functional computation units corresponding to the computation types of the Transformer model structure are abstracted to form the multi-head attention layer and the feedforward neural network layer; then, in combination with the characteristics of the Transformer model, a data cache plane is set up to form the accelerator topology, and each basic module and the overall structure of the accelerator are optimized.
In one embodiment, as shown in FIG. 16, a construction device 10 for a Transformer model accelerator is provided, which includes a functional computation unit acquisition module 1, an accelerator topology construction module 2, and an accelerator topology optimization module 3.
The functional computation unit acquisition module 1 is used to perform statistical analysis on the computation process of the Transformer model to obtain all the computation types involved, and to determine the corresponding functional computation units according to the computation types.
The accelerator topology construction module 2 is used to divide all the functional computation units into a multi-head attention layer and a feedforward neural network layer, and to construct a data cache plane between the multi-head attention layer and the feedforward neural network layer, forming a Transformer model accelerator topology.
The accelerator topology optimization module 3 is used to optimize, through computation according to the Transformer model parameters and the on-chip resources, the number of nodes of the multi-head attention layer and the feedforward neural network layer in the Transformer model accelerator topology and the correspondence between each functional computation unit and its node position.
In this embodiment, the step of determining the corresponding functional computation unit according to the computation type includes:
performing statistical analysis on the computation process of the Transformer model, the computation process involving five computation types: matrix multiplication, Softmax function, layer normalization, residual, and activation;
and determining the corresponding functional computation units according to the five computation types, including a matrix multiplication unit, a Softmax unit, a layer normalization unit, a residual unit, and an activation unit.
In this embodiment, the step of performing statistical analysis on the computation process of the Transformer model to obtain all the computation types involved includes:
analyzing the Transformer model structure to obtain the plurality of encoders and decoders arranged in series in the Transformer model;
analyzing the structures of the encoder and the decoder to obtain the main functional layers and auxiliary functional layers of the encoder and the decoder in the computation process, wherein the main functional layers include the multi-head attention layer and the feedforward neural network layer, and the auxiliary functional layers include the residual layer and layer normalization layer located after the multi-head attention layer and the feedforward neural network layer and the activation layer located inside the feedforward neural network layer;
and obtaining by statistics the five computation types in the main functional layers and the auxiliary functional layers: matrix multiplication, Softmax function, layer normalization, residual, and activation.
In this embodiment, the step of determining the corresponding functional computation unit according to the computation type includes:
setting the structure of each functional computation unit to include an input neighbor cache, an output neighbor cache, a weight cache, a computation unit, a control unit, an input/output Host cache, and a Flit generation unit; the input neighbor cache is used to receive data from adjacent nodes, the output neighbor cache is used to output data to adjacent nodes, the weight cache is used to store the weight data with which the current node participates in computation, the input/output Host cache is used for loading data from and outputting data to main memory for the multi-head attention layer and the feedforward neural network layer, the Flit generation unit is used to generate the Flit packets transmitted by each node in the multi-head attention layer and the feedforward neural network layer, and the control unit is used to control data access, packet generation, and packet parsing;
and arranging the computation units in the functional computation units to form the matrix multiplication unit, the Softmax unit, the layer normalization unit, the residual unit, and the activation unit.
In this embodiment, the steps of splitting all the functional computing units into a multi-head attention layer and a feedforward neural network layer, and constructing a data buffer plane between the multi-head attention layer and the feedforward neural network layer to form a transducer model accelerator topology structure include:
Constructing a multi-head attention layer and a feedforward neural network layer by adopting a network-on-chip annular structure;
and constructing a data cache plane between the multi-head attention layer and the feedforward neural network layer, wherein the multi-head attention layer, the feedforward neural network layer and the data cache plane are mutually connected to form a transducer model accelerator topological structure.
In this embodiment, the steps of splitting all the function calculation units into a multi-head attention layer and a feedforward neural network layer, and constructing a data cache plane between the multi-head attention layer and the feedforward neural network layer to form the Transformer model accelerator topology, further include:
selecting matrix multiplication units, Softmax units, layer normalization units and residual units from the function calculation units as the nodes of the network-on-chip ring structure of the multi-head attention layer;
selecting matrix multiplication units, layer normalization units, residual units and activation units from the function calculation units as the nodes of the network-on-chip ring structure of the feedforward neural network layer;
and setting the data cache plane to perform data interaction with its adjacent nodes on the multi-head attention layer and its adjacent nodes on the feedforward neural network layer (a topology sketch follows this list).
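A minimal topology sketch under these choices is given below; the particular node counts, node ordering and the assumption that the cache plane attaches to node 0 of each ring are illustrative only and are not fixed by the patent:

# Sketch: two network-on-chip rings (attention layer and feedforward layer)
# joined through a shared data cache plane. Node labels are illustrative.
def build_ring(node_kinds):
    """Return ring edges (i -> i+1, wrapping around) over the given node kinds."""
    n = len(node_kinds)
    return [(i, (i + 1) % n) for i in range(n)]

attention_ring = ["matmul", "softmax", "matmul", "residual", "layer_norm"]
feedforward_ring = ["matmul", "activation", "matmul", "residual", "layer_norm"]

topology = {
    "attention_ring_edges": build_ring(attention_ring),
    "feedforward_ring_edges": build_ring(feedforward_ring),
    # The data cache plane exchanges data only with the ring nodes adjacent to it,
    # here assumed to be node 0 of each ring.
    "cache_plane_links": [("cache_plane", ("attention", 0)),
                          ("cache_plane", ("feedforward", 0))],
}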
In this embodiment, the step of constructing a data buffer plane between the multi-head attention layer and the feedforward neural network layer includes:
the data cache plane is set to comprise an input feature cache, an output feature cache, a weight cache and a residual cache, wherein the input feature cache is used for storing and outputting input feature data, the weight cache is used for storing and outputting weight data, the input feature data and the weight data are processed by the computing components and the results are written back to the output feature cache, which stores the output feature data; the output feature cache outputs the feature data as the input feature data of the next round of computation, and the residual cache stores the input feature data of each multi-head attention layer and feedforward neural network layer for residual computation.
In this embodiment, the step of setting the data cache plane to perform data interaction with the nodes adjacent to it on the multi-head attention layer and on the feedforward neural network layer includes:
setting the node on the multi-head attention layer adjacent to the data cache plane and the node on the feedforward neural network layer adjacent to the data cache plane as input/output nodes;
and setting the data cache plane to distribute data to the input/output nodes using a Benes network or a butterfly network.
In this embodiment, after the step of forming the Transformer model accelerator topology, the method further includes:
setting the data storage format in the Transformer model accelerator to use a parallel inner-product mode, with the data partitioned into blocks and stored by rows and columns within each block.
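One way to picture this blocked storage is sketched below; the block size and the partitioning helper are assumptions for illustration and do not describe the patent's exact on-chip layout:

# Sketch: partition a matrix into fixed-size blocks so that each block can feed
# a parallel inner-product datapath. Block size is an illustrative assumption.
import numpy as np

def partition_into_blocks(matrix: np.ndarray, block_rows: int, block_cols: int):
    """Yield ((block_row_idx, block_col_idx), block) pairs covering the matrix."""
    rows, cols = matrix.shape
    for br in range(0, rows, block_rows):
        for bc in range(0, cols, block_cols):
            yield (br // block_rows, bc // block_cols), matrix[br:br + block_rows, bc:bc + block_cols]

A = np.arange(16).reshape(4, 4)
blocks = dict(partition_into_blocks(A, 2, 2))   # four 2x2 blocks, addressed by (row, col) index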
In this embodiment, after the step of forming the Transformer model accelerator topology, the method further includes:
setting a matrix multiplication component to operate on the data stored in the Transformer model accelerator, wherein the sub-processing units of the matrix multiplication component are organized as a systolic array, and each sub-processing unit uses a blocked-matrix parallel multiply-accumulate structure.
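A minimal software analogue of the blocked multiply-accumulate performed by such a sub-processing unit is sketched below; this is illustrative only, since the actual hardware pipelines these partial products through the systolic array rather than looping over them:

# Sketch: blocked matrix multiplication with explicit multiply-accumulate over
# blocks, mirroring what each systolic sub-processing unit accumulates.
import numpy as np

def blocked_matmul(A: np.ndarray, B: np.ndarray, block: int) -> np.ndarray:
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(0, n, block):
        for j in range(0, m, block):
            for p in range(0, k, block):
                # Each (i, j) output block accumulates partial products over p.
                C[i:i+block, j:j+block] += A[i:i+block, p:p+block] @ B[p:p+block, j:j+block]
    return C

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(blocked_matmul(A, B, 4), A @ B)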
In this embodiment, after the step of statistically analyzing the computation process of the Transformer model to obtain all five computation types involved, the method further includes:
setting instruction formats for external memory access, matrix multiplication, the Softmax function, layer normalization, residual and activation operations, wherein the parameters of the instruction formats comprise memory address offset, data volume, line width, number of lines and cache block selection.
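A hedged sketch of such an instruction record follows; the field names stand in for the parameters listed above and the encoding is an assumption, not a normative format:

# Sketch: a generic instruction record for the accelerator's operations.
# Field names are illustrative; the patent specifies the parameters, not an encoding.
from dataclasses import dataclass
from enum import Enum, auto

class OpCode(Enum):
    MEM_ACCESS = auto()
    MATMUL = auto()
    SOFTMAX = auto()
    LAYER_NORM = auto()
    RESIDUAL = auto()
    ACTIVATION = auto()

@dataclass
class Instruction:
    opcode: OpCode
    mem_addr_offset: int     # offset into external memory
    data_volume: int         # total amount of data moved or computed
    line_width: int          # width of one line of data
    num_lines: int           # number of lines
    cache_block_select: int  # which cache block the operation targets

fetch_weights = Instruction(OpCode.MEM_ACCESS, mem_addr_offset=0x1000,
                            data_volume=4096, line_width=64, num_lines=64,
                            cache_block_select=1)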
In this embodiment, the step of setting the structure of each function calculation unit to include a Flit generating unit includes:
the Flit generating unit is used for generating the Flit format of the Flit packets transmitted between the nodes of the multi-head attention layer and the feedforward neural network layer;
and controlling the Flit packets in each node of the multi-head attention layer and the feedforward neural network layer to be generated by the Flit generating unit according to the Flit format.
The Flit format of each group of Flit packets comprises a Header Flit, Body Flits and a Tail Flit, wherein the Header Flit sequentially comprises FT, PT, DT, Src, MDst, LEN and Acc, and the Body Flit and the Tail Flit each sequentially comprise FT and a payload. FT is the Flit Type, covering feature data, weight data and instructions; PT is the packet type, divided into MATMULT, Softmax, LayerNorm, Residual and ReLU according to the type of computing node the packet is destined for; DT denotes the distribution type, either unicast or multicast; Src is the source node ID and Dst is the destination node ID; if DT is multicast, MDst lists the other nodes to be traversed; LEN is the total number of bytes of Flit data; Acc is a count of the nodes traversed along the path, used for testing the transmission path delay of the data packet. The Header Flit thus carries the control information needed to ensure correct transmission and reception of the data, while the Flit payload carries the actual data content.
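A hedged sketch of this Flit format is given below; the Python types and any concrete widths are illustrative assumptions, since the patent defines the fields rather than a bit-level layout:

# Sketch: Header/Body/Tail Flit records matching the fields described above.
# The concrete widths and encodings are assumptions for illustration.
from dataclasses import dataclass, field
from typing import List

@dataclass
class HeaderFlit:
    ft: str                 # Flit type: "feature" | "weight" | "instruction"
    pt: str                 # packet type: "MATMULT" | "Softmax" | "LayerNorm" | "Residual" | "ReLU"
    dt: str                 # distribution type: "unicast" | "multicast"
    src: int                # source node ID
    dst: int                # destination node ID
    mdst: List[int] = field(default_factory=list)  # extra nodes traversed when multicast
    length: int = 0         # total bytes of Flit data (LEN)
    acc: int = 0            # path node counter, incremented at each hop (Acc)

@dataclass
class BodyFlit:
    ft: str
    payload: bytes          # actual data content

def hop(header: HeaderFlit) -> None:
    """Increment the path counter as the packet passes one node."""
    header.acc += 1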
In this embodiment, the steps of optimizing, according to the Transformer model parameters and the on-chip resources, the number of nodes of the multi-head attention layer and the feedforward neural network layer in the Transformer model accelerator topology and the correspondence between each function calculation unit and its node position, include:
setting the parameters of the Transformer model to be computed and the on-chip resource constraint;
giving an initial configuration of the function calculation unit corresponding to each node of the multi-head attention layer and the feedforward neural network layer in the Transformer model accelerator topology;
calculating the total computation time and the total data transmission path delay of the currently configured Transformer model accelerator;
adjusting the number and proportion of the function calculation units and the correspondence between each function calculation unit and its node position, and recalculating, so as to continuously reduce the total time consumption;
and obtaining, within the available on-chip resource constraint, the Transformer model accelerator topology for which the sum of the total computation time and the total data packet transmission path delay is minimal.
In this embodiment, the step of setting the parameters of the Transformer model to be computed and the on-chip resource constraint includes:
defining the hardware resources occupied by the matrix multiplication unit, the Softmax unit, the layer normalization unit, the residual unit and the activation unit contained in the function calculation units as Lm, Ls, Ln, Lr and La respectively, their computation times as Tm, Ts, Tn, Tr and Ta respectively, their path delays as Dm, Ds, Dn, Dr and Da respectively, and their unit counts as Nm, Ns, Nn, Nr and Na respectively;
the total amount of resources used by all function calculation units is: L_total = Lm1 + Lm2 + … + Lm(Nm) + Ls1 + Ls2 + … + Ls(Ns) + Ln1 + Ln2 + … + Ln(Nn) + Lr1 + Lr2 + … + Lr(Nr) + La1 + La2 + … + La(Na);
assuming that the computation time is balanced across the function calculation units corresponding to the nodes of the multi-head attention layer and the feedforward neural network layer, Tm/Nm = Ts/Ns = Tn/Nn = Tr/Nr = Ta/Na;
and assuming that the total amount of on-chip resources allocated to all function calculation units is FL_total, the on-chip resource constraint is: L_total ≤ FL_total (a resource-accounting sketch follows this list).
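The resource accounting above can be sketched directly; the per-unit resource numbers and the resource budget below are placeholders, not values from the patent:

# Sketch: sum the resources of all instantiated function calculation units and
# check the on-chip constraint L_total <= FL_total. Numbers are placeholders.
def total_resources(resources_per_unit: dict) -> float:
    """resources_per_unit maps a unit kind to the resource cost of each instance."""
    return sum(sum(costs) for costs in resources_per_unit.values())

resources = {
    "matmul":     [120.0, 120.0, 120.0],   # Lm1..Lm(Nm), Nm = 3
    "softmax":    [40.0],                  # Ls1,          Ns = 1
    "layer_norm": [30.0, 30.0],            # Ln1..Ln(Nn),  Nn = 2
    "residual":   [10.0, 10.0],            # Lr1..Lr(Nr),  Nr = 2
    "activation": [15.0],                  # La1,          Na = 1
}

FL_total = 600.0
L_total = total_resources(resources)
assert L_total <= FL_total, "configuration exceeds the on-chip resource budget"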
In this embodiment, the step of calculating the total computation time and the total data transmission path delay of the currently configured Transformer model accelerator includes:
the total computation time is: T_total = (Tm/Nm) + (Ts/Ns) + (Tn/Nn) + (Tr/Nr) + (Ta/Na);
after a data packet is sent out from the data cache plane, its propagation path delay increases by 1 for every node it passes through; the total data transmission path delay is: D_total = Dm1 + Dm2 + … + Dm(Nm) + Ds1 + Ds2 + … + Ds(Ns) + Dn1 + Dn2 + … + Dn(Nn) + Dr1 + Dr2 + … + Dr(Nr) + Da1 + Da2 + … + Da(Na) (a cost-model sketch follows this list).
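A corresponding cost-model sketch is given below, reusing placeholder unit counts; the per-kind computation times and per-instance hop delays are illustrative assumptions:

# Sketch: evaluate the cost model T_total + D_total for one configuration.
# Per-kind computation times and per-instance hop delays are placeholders.
def total_time(comp_time: dict, counts: dict) -> float:
    """T_total = sum over kinds of (Tk / Nk)."""
    return sum(comp_time[k] / counts[k] for k in comp_time)

def total_delay(path_delays: dict) -> int:
    """D_total = sum of per-instance path-node counts (each hop adds 1)."""
    return sum(sum(d) for d in path_delays.values())

comp_time = {"matmul": 90.0, "softmax": 12.0, "layer_norm": 8.0, "residual": 4.0, "activation": 5.0}
counts    = {"matmul": 3,    "softmax": 1,    "layer_norm": 2,   "residual": 2,   "activation": 1}
delays    = {"matmul": [2, 3, 4], "softmax": [3], "layer_norm": [2, 5], "residual": [1, 4], "activation": [3]}

cost = total_time(comp_time, counts) + total_delay(delays)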
In this embodiment, the steps of adjusting the number and proportion of the nodes and the correspondence between each function calculation unit and its node position, and then recalculating so as to continuously reduce the time consumption, include:
under the balanced-computation-time condition for the function calculation units corresponding to the nodes of the multi-head attention layer and the feedforward neural network layer, and under the on-chip resource constraint, adjusting the number and proportion of the nodes of the multi-head attention layer and the feedforward neural network layer and the correspondence between each function calculation unit and its node position;
and taking the reduction in total computation time and in data packet transmission path delay between the Transformer model accelerator configurations before and after the adjustment as the reward, iterating continuously until the result converges (a search-loop sketch follows this list).
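This optimization loop can be pictured as a simple iterative search; the candidate-generation and evaluation functions below are assumptions standing in for the patent's adjustment strategy, not a prescribed algorithm:

# Sketch: iterative configuration search driven by the reduction in
# (T_total + D_total) as the reward. Candidate generation is illustrative.
import random

def search(initial_cfg, neighbors, evaluate, max_iters=1000):
    """Greedy search: keep any neighboring configuration whose cost is lower."""
    best_cfg, best_cost = initial_cfg, evaluate(initial_cfg)
    for _ in range(max_iters):
        candidate = random.choice(neighbors(best_cfg))
        cost = evaluate(candidate)
        reward = best_cost - cost            # reduction in time + delay
        if reward > 0:                       # accept only improving moves
            best_cfg, best_cost = candidate, cost
    return best_cfg, best_cost

# Here evaluate(cfg) would combine total_time and total_delay from the sketch above,
# and neighbors(cfg) would tweak unit counts / node assignments within FL_total.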
In the construction device of the Transformer model accelerator, the computation process of the Transformer model is statistically analyzed, the function calculation units corresponding to the computation types of the Transformer model structure are abstracted to form the multi-head attention layer and the feedforward neural network layer, the data cache plane is then set according to the characteristics of the Transformer model to form the accelerator topology, and the basic modules and the overall structure of the accelerator are optimized.
For the specific limitations of the construction device of the Transformer model accelerator, reference may be made to the above limitations of the construction method of the Transformer model accelerator, which are not repeated here. Each module in the above construction device of the Transformer model accelerator may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in or independent of a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the above modules.
In yet another aspect, a Transformer model accelerator is provided, which includes a multi-head attention layer, a feedforward neural network layer, and a data cache plane located between the multi-head attention layer and the feedforward neural network layer; the Transformer model accelerator is obtained using the above construction method of the Transformer model accelerator.
In this embodiment, the multi-head attention layer and the feedforward neural network layer both adopt a network-on-chip ring structure, and the function calculation units include a matrix multiplication unit, a Softmax unit, a layer normalization unit, a residual unit and an activation unit; matrix multiplication units, Softmax units, layer normalization units and residual units are selected from the function calculation units as the nodes of the network-on-chip ring structure of the multi-head attention layer; matrix multiplication units, layer normalization units, residual units and activation units are selected from the function calculation units as the nodes of the network-on-chip ring structure of the feedforward neural network layer; and the data cache plane performs data interaction with its adjacent nodes on the multi-head attention layer and on the feedforward neural network layer.
The input feature cache is used for storing and outputting input feature data, the weight cache is used for storing and outputting weight data, the input feature data and the weight data are processed by the computing components and the results are written back to the output feature cache for storing the output feature data, the output feature cache outputs the feature data as the input feature data of the next round of computation, and the residual cache is used for storing the input feature data of each multi-head attention layer and feedforward neural network layer for residual computation.
The function calculation units comprise a matrix multiplication unit, a Softmax unit, a layer normalization unit, a residual unit and an activation unit, whose corresponding computation types are matrix multiplication, Softmax function, layer normalization, residual and activation, respectively.
The structure of each function calculation unit comprises an input neighbor cache, an output neighbor cache, a weight cache, a calculation unit, a control unit, an input/output Host cache and a Flit generating unit; the input neighbor cache is used for receiving data from adjacent nodes, the output neighbor cache is used for outputting data to adjacent nodes, the weight cache is used for storing the weight data in which the current node participates in calculation, the input/output Host cache is used for loading data from and outputting data to main memory for the multi-head attention layer and the feedforward neural network layer, the Flit generating unit is used for generating the Flit packets transmitted between the nodes of the multi-head attention layer and the feedforward neural network layer, and the control unit is used for controlling data access, packet generation and packet parsing.
The Flit packets in each node of the multi-head attention layer and the feedforward neural network layer are generated by the Flit generating unit according to the Flit format. The Flit format of each group of Flit packets comprises a Header Flit sequentially comprising FT, PT, DT, Src, MDst, LEN and Acc, and Body Flits and a Tail Flit each sequentially comprising FT and a payload. FT is the Flit Type, covering feature data, weight data and instructions; PT is the packet type, divided into MATMULT, Softmax, LayerNorm, Residual and ReLU according to the type of computing node the packet is destined for; DT denotes the distribution type, either unicast or multicast; Src is the source node ID and Dst is the destination node ID; if DT is multicast, MDst lists the other nodes to be traversed; LEN is the total number of bytes of Flit data; Acc is a count of the nodes traversed along the path, used for testing the transmission path delay of the data packet. The Header Flit thus carries the control information needed to ensure correct transmission and reception of the data, while the Flit payload carries the actual data content.
The data cache plane comprises an input feature cache, an output feature cache, a weight cache and a residual cache, wherein the input feature cache is used for storing and outputting input feature data, the weight cache is used for storing and outputting weight data, the input feature data and the weight data are processed by the computing components and the results are written back to the output feature cache, the output feature cache outputs the feature data as the input feature data of the next round of computation, and the residual cache is used for storing the input feature data of each multi-head attention layer and feedforward neural network layer for residual computation.
The node on the multi-head attention layer adjacent to the data cache plane and the node on the feedforward neural network layer adjacent to the data cache plane serve as input/output nodes, and the data cache plane distributes data to the input/output nodes using a Benes network or a butterfly network.
The data storage format in the Transformer model accelerator uses a parallel inner-product mode, with the data partitioned into blocks by rows and stored by columns.
A matrix multiplication component is set to operate on the data stored in the Transformer model accelerator; the sub-processing units of the matrix multiplication component are organized as a systolic array, and each sub-processing unit uses a blocked-matrix parallel multiply-accumulate structure.
Instruction formats are set for external memory access, matrix multiplication, the Softmax function, layer normalization, residual and activation operations, and the parameters of the instruction formats comprise memory address offset, data volume, line width, number of lines and cache block selection.
For the specific limitations of the Transformer model accelerator, reference may be made to the above limitations of the construction method of the Transformer model accelerator, which are not repeated here.
In yet another aspect, as shown in fig. 17, a data processing method of a Transformer model accelerator is provided, which includes the following steps (a driver-loop sketch follows this list):
step S11, loading data from an external memory into the input feature cache, the output feature cache, the weight cache and the residual cache of the data cache plane;
step S12, distributing the corresponding weights to the nodes of the multi-head attention layer and the feedforward neural network layer;
step S13, distributing the corresponding instructions and parameters to the nodes of the multi-head attention layer and the feedforward neural network layer;
step S14, distributing the corresponding feature data to the nodes of the multi-head attention layer and the feedforward neural network layer;
step S15, after each node completes its computation according to the issued computation flow, the data either continues to the next node for further computation or is returned to the data cache;
and step S16, returning to step S13 of distributing the corresponding instructions and parameters to the nodes of the multi-head attention layer and the feedforward neural network layer, and repeating steps S13 to S15 until the computation of all nodes is completed, then outputting the computation result to the external memory.
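A minimal host-side driver sketch of steps S11 to S16 is given below; all helper methods are hypothetical placeholders for the accelerator's actual interfaces, since the patent describes the flow rather than an API:

# Sketch: the S11-S16 control flow on the host side. All helpers are
# hypothetical placeholders for the accelerator's actual interfaces.
def run_inference(accelerator, external_memory, schedule):
    accelerator.cache_plane.load_from(external_memory)             # S11: fill input/output/weight/residual caches
    accelerator.distribute_weights()                                # S12: weights to ring nodes
    for stage in schedule:                                          # repeat S13-S15 per computation stage
        accelerator.distribute_instructions(stage.instructions)     # S13: instructions and parameters
        accelerator.distribute_features(stage.features)             # S14: feature data
        accelerator.run_nodes()                                     # S15: each node computes, forwards or writes back
    external_memory.write(accelerator.cache_plane.read_results())   # S16: final results to external memory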
In one embodiment, a computer device is provided, which may be a server, and whose internal structure may be as shown in fig. 18. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store the construction data of the Transformer model accelerator. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method of constructing a Transformer model accelerator.
It will be appreciated by those skilled in the art that the structure shown in fig. 18 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application is applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
carrying out statistical analysis on the computation process of the Transformer model to obtain all the computation types involved, and determining the corresponding function calculation units according to the computation types;
splitting all the function calculation units into a multi-head attention layer and a feedforward neural network layer, and constructing a data cache plane between the multi-head attention layer and the feedforward neural network layer to form a Transformer model accelerator topology;
and optimizing, through computation according to the Transformer model parameters and the on-chip resources, the number of nodes of the multi-head attention layer and the feedforward neural network layer in the Transformer model accelerator topology and the correspondence between each function calculation unit and its node position.
For the specific limitations of the steps implemented when the processor executes the computer program, reference may be made to the above limitations of the method for constructing the Transformer model accelerator, which are not repeated here.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which, when executed by a processor, implements the following steps:
carrying out statistical analysis on the computation process of the Transformer model to obtain all the computation types involved, and determining the corresponding function calculation units according to the computation types;
splitting all the function calculation units into a multi-head attention layer and a feedforward neural network layer, and constructing a data cache plane between the multi-head attention layer and the feedforward neural network layer to form a Transformer model accelerator topology;
and optimizing, through computation according to the Transformer model parameters and the on-chip resources, the number of nodes of the multi-head attention layer and the feedforward neural network layer in the Transformer model accelerator topology and the correspondence between each function calculation unit and its node position.
For the specific limitations of the steps implemented when the computer program is executed by the processor, reference may be made to the above limitations of the method for constructing the Transformer model accelerator, which are not repeated here.
Those skilled in the art will appreciate that all or part of the above methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium, which, when executed, may include the flows of the embodiments of the methods described above. Any reference to memory, storage, database or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. The volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as the combinations of these technical features are not contradictory, they should be considered to be within the scope of this description.
The above embodiments merely represent several implementations of the present application, which are described in a specific and detailed manner, but are not to be construed as limiting the scope of the invention. It should be noted that various modifications and improvements can be made by those skilled in the art without departing from the concept of the present application, all of which fall within the protection scope of the present application. Accordingly, the protection scope of the present application shall be subject to the appended claims.

Claims (20)

1. A method for constructing a Transformer model accelerator, comprising:
carrying out statistical analysis on the computation process of a Transformer model to obtain all the computation types involved, and determining the corresponding function calculation units according to the computation types;
splitting all the function calculation units into a multi-head attention layer and a feedforward neural network layer, and constructing a data cache plane between the multi-head attention layer and the feedforward neural network layer to form a Transformer model accelerator topology;
optimizing, through computation according to the Transformer model parameters and on-chip resources, the number of nodes of the multi-head attention layer and the feedforward neural network layer in the Transformer model accelerator topology and the correspondence between each function calculation unit and its node position;
wherein the steps of splitting all the function calculation units into a multi-head attention layer and a feedforward neural network layer, and constructing a data cache plane between the multi-head attention layer and the feedforward neural network layer to form the Transformer model accelerator topology, include:
constructing the multi-head attention layer and the feedforward neural network layer using a network-on-chip ring structure;
constructing a data cache plane between the multi-head attention layer and the feedforward neural network layer, wherein the multi-head attention layer, the feedforward neural network layer and the data cache plane are interconnected to form the Transformer model accelerator topology;
and the step of constructing a data cache plane between the multi-head attention layer and the feedforward neural network layer includes: setting the data cache plane to comprise an input feature cache, an output feature cache, a weight cache and a residual cache, wherein the input feature cache is used for storing and outputting input feature data, the weight cache is used for storing and outputting weight data, the input feature data and the weight data are processed by the computing components and the results are written back to the output feature cache, the output feature cache outputs the feature data as the input feature data of the next round of computation, and the residual cache is used for storing the input feature data of each multi-head attention layer and feedforward neural network layer for residual computation.
2. The method for constructing a Transformer model accelerator according to claim 1, wherein the step of determining the corresponding function calculation units according to the computation types includes:
statistically analyzing the computation process of the Transformer model, which involves five computation types: matrix multiplication, Softmax function, layer normalization, residual and activation;
and determining, according to the five computation types, the corresponding function calculation units, which comprise a matrix multiplication unit, a Softmax unit, a layer normalization unit, a residual unit and an activation unit.
3. The method for constructing a Transformer model accelerator according to claim 2, wherein the step of statistically analyzing the computation process of the Transformer model to obtain all the computation types involved includes:
analyzing the Transformer model structure to obtain the plurality of encoders and decoders arranged in series in the Transformer model;
analyzing the structures of the encoder and the decoder to obtain the main functional layers and auxiliary functional layers involved in the calculation process, wherein the main functional layers comprise a multi-head attention layer and a feedforward neural network layer, and the auxiliary functional layers comprise a residual layer and a layer normalization layer located after the multi-head attention layer and the feedforward neural network layer, and an activation layer located inside the feedforward neural network layer;
and obtaining, through statistics, five calculation types in the main functional layers and the auxiliary functional layers: matrix multiplication, Softmax function, layer normalization, residual and activation.
4. The method for constructing a Transformer model accelerator according to claim 3, wherein the step of determining the corresponding function calculation units according to the computation types includes:
the structure of each function calculation unit comprises an input neighbor cache, an output neighbor cache, a weight cache, a calculation unit, a control unit, an input/output Host cache and a Flit generating unit; the input neighbor cache is used for receiving data from adjacent nodes, the output neighbor cache is used for outputting data to adjacent nodes, the weight cache is used for storing the weight data in which the current node participates in calculation, the input/output Host cache is used for loading data from and outputting data to main memory for the multi-head attention layer and the feedforward neural network layer, the Flit generating unit is used for generating the Flit packets transmitted between the nodes of the multi-head attention layer and the feedforward neural network layer, and the control unit is used for controlling data access, packet generation and packet parsing;
the calculation units within the function calculation units are configured to form a matrix multiplication unit, a Softmax unit, a layer normalization unit, a residual unit and an activation unit.
5. The method for constructing a Transformer model accelerator according to claim 4, wherein the steps of splitting all the function calculation units into a multi-head attention layer and a feedforward neural network layer, and constructing a data cache plane between the multi-head attention layer and the feedforward neural network layer to form the Transformer model accelerator topology, further include:
selecting matrix multiplication units, Softmax units, layer normalization units and residual units from the function calculation units as the nodes of the network-on-chip ring structure of the multi-head attention layer;
selecting matrix multiplication units, layer normalization units, residual units and activation units from the function calculation units as the nodes of the network-on-chip ring structure of the feedforward neural network layer;
and setting the data cache plane to perform data interaction with its adjacent nodes on the multi-head attention layer and its adjacent nodes on the feedforward neural network layer.
6. The method for constructing a Transformer model accelerator according to claim 5, wherein the step of setting the data cache plane to perform data interaction with the nodes adjacent to it on the multi-head attention layer and on the feedforward neural network layer includes:
setting the node on the multi-head attention layer adjacent to the data cache plane and the node on the feedforward neural network layer adjacent to the data cache plane as input/output nodes;
and setting the data cache plane to distribute data to the input/output nodes using a Benes network or a butterfly network.
7. The method for constructing a Transformer model accelerator according to claim 1, further comprising, after the step of forming the Transformer model accelerator topology:
setting the data storage format in the Transformer model accelerator to use a parallel inner-product mode, with the data partitioned into blocks and stored by rows and columns within each block.
8. The method for constructing a Transformer model accelerator according to claim 7, further comprising, after the step of forming the Transformer model accelerator topology:
setting a matrix multiplication component to operate on the data stored in the Transformer model accelerator, wherein the sub-processing units of the matrix multiplication component are organized as a systolic array, and each sub-processing unit uses a blocked-matrix parallel multiply-accumulate structure.
9. The method for constructing a Transformer model accelerator according to claim 2, further comprising, after the step of statistically analyzing the computation process of the Transformer model to obtain the five computation types involved:
setting instruction formats for external memory access, matrix multiplication, the Softmax function, layer normalization, residual and activation operations, wherein the parameters of the instruction formats comprise memory address offset, data volume, line width, number of lines and cache block selection.
10. The method according to claim 4, wherein the step of setting the structure of each function calculation unit to include a Flit generating unit comprises:
the Flit generating unit is used for generating the Flit format of the Flit packets transmitted between the nodes of the multi-head attention layer and the feedforward neural network layer;
and controlling the Flit packets in each node of the multi-head attention layer and the feedforward neural network layer to be generated by the Flit generating unit according to the Flit format.
11. The method for constructing a Transformer model accelerator according to claim 2, wherein the step of optimizing, through computation according to the Transformer model parameters and on-chip resources, the number of nodes of the multi-head attention layer and the feedforward neural network layer in the Transformer model accelerator topology and the correspondence between each function calculation unit and its node position comprises:
setting the parameters of the Transformer model to be computed and the on-chip resource constraint;
giving an initial configuration of the function calculation unit corresponding to each node of the multi-head attention layer and the feedforward neural network layer in the Transformer model accelerator topology;
calculating the total computation time and the total data transmission path delay of the currently configured Transformer model accelerator;
adjusting the number and proportion of the function calculation units and the correspondence between each function calculation unit and its node position, and recalculating, so as to continuously reduce the total time consumption;
and obtaining, within the available on-chip resource constraint, the Transformer model accelerator topology for which the sum of the total computation time and the total data packet transmission path delay is minimal.
12. The method for constructing a Transformer model accelerator according to claim 11, wherein the step of setting the parameters of the Transformer model to be computed and the on-chip resource constraint comprises:
defining the hardware resources occupied by the matrix multiplication unit, the Softmax unit, the layer normalization unit, the residual unit and the activation unit contained in the function calculation units as Lm, Ls, Ln, Lr and La respectively, their computation times as Tm, Ts, Tn, Tr and Ta respectively, their path delays as Dm, Ds, Dn, Dr and Da respectively, and their unit counts as Nm, Ns, Nn, Nr and Na respectively;
the total amount of resources used by all function calculation units is: L_total = Lm1 + Lm2 + … + Lm(Nm) + Ls1 + Ls2 + … + Ls(Ns) + Ln1 + Ln2 + … + Ln(Nn) + Lr1 + Lr2 + … + Lr(Nr) + La1 + La2 + … + La(Na);
assuming that the computation time is balanced across the function calculation units corresponding to the nodes of the multi-head attention layer and the feedforward neural network layer, Tm/Nm = Ts/Ns = Tn/Nn = Tr/Nr = Ta/Na;
and assuming that the total amount of on-chip resources allocated to all function calculation units is FL_total, the on-chip resource constraint is: L_total ≤ FL_total.
13. The method for constructing a Transformer model accelerator according to claim 12, wherein the step of calculating the total computation time and the total data transmission path delay of the currently configured Transformer model accelerator comprises:
the total computation time is: T_total = (Tm/Nm) + (Ts/Ns) + (Tn/Nn) + (Tr/Nr) + (Ta/Na);
after a data packet is sent out from the data cache plane, its propagation path delay increases by 1 for every node it passes through; the total data transmission path delay is: D_total = Dm1 + Dm2 + … + Dm(Nm) + Ds1 + Ds2 + … + Ds(Ns) + Dn1 + Dn2 + … + Dn(Nn) + Dr1 + Dr2 + … + Dr(Nr) + Da1 + Da2 + … + Da(Na).
14. The method for constructing a Transformer model accelerator according to claim 12, wherein the step of adjusting the number and proportion of the nodes and the correspondence between each function calculation unit and its node position, and then recalculating so as to continuously reduce the time consumption, comprises:
under the balanced-computation-time condition for the function calculation units corresponding to the nodes of the multi-head attention layer and the feedforward neural network layer, and under the on-chip resource constraint, adjusting the number and proportion of the nodes of the multi-head attention layer and the feedforward neural network layer and the correspondence between each function calculation unit and its node position;
and taking the reduction in total computation time and in data packet transmission path delay between the Transformer model accelerator configurations before and after the adjustment as the reward, iterating continuously until the result converges.
15. A Transformer model accelerator, comprising a multi-head attention layer, a feedforward neural network layer, and a data cache plane between the multi-head attention layer and the feedforward neural network layer; the Transformer model accelerator being obtained by the method for constructing a Transformer model accelerator according to any one of claims 1 to 14.
16. The Transformer model accelerator of claim 15, wherein the multi-head attention layer and the feedforward neural network layer each adopt a network-on-chip ring structure, and the function calculation units comprise a matrix multiplication unit, a Softmax unit, a layer normalization unit, a residual unit and an activation unit; matrix multiplication units, Softmax units, layer normalization units and residual units are selected from the function calculation units as the nodes of the network-on-chip ring structure of the multi-head attention layer; matrix multiplication units, layer normalization units, residual units and activation units are selected from the function calculation units as the nodes of the network-on-chip ring structure of the feedforward neural network layer; and the data cache plane performs data interaction with its adjacent nodes on the multi-head attention layer and on the feedforward neural network layer.
17. The Transformer model accelerator of claim 15, wherein the structure of each function calculation unit comprises an input neighbor cache, an output neighbor cache, a weight cache, a calculation unit, a control unit, an input/output Host cache and a Flit generating unit; the input neighbor cache is used for receiving data from adjacent nodes, the output neighbor cache is used for outputting data to adjacent nodes, the weight cache is used for storing the weight data in which the current node participates in calculation, the input/output Host cache is used for loading data from and outputting data to main memory for the multi-head attention layer and the feedforward neural network layer, the Flit generating unit is used for generating the Flit packets transmitted between the nodes of the multi-head attention layer and the feedforward neural network layer, and the control unit is used for controlling data access, packet generation and packet parsing.
18. The Transformer model accelerator of claim 15, wherein the data cache plane comprises an input feature cache, an output feature cache, a weight cache and a residual cache, the input feature cache being used for storing and outputting input feature data, the weight cache being used for storing and outputting weight data, the input feature data and the weight data being processed by the computing components and the results written back to the output feature cache for storing the output feature data, the output feature cache outputting the feature data as the input feature data of the next round of computation, and the residual cache being used for storing the input feature data of each multi-head attention layer and feedforward neural network layer for residual computation.
19. The Transformer model accelerator of claim 15, wherein the node on the multi-head attention layer adjacent to the data cache plane and the node on the feedforward neural network layer adjacent to the data cache plane are input/output nodes, and the data cache plane distributes data to the input/output nodes using a Benes network or a butterfly network.
20. A data processing method for the Transformer model accelerator according to any one of claims 15 to 19, comprising the following steps:
loading data from an external memory into the input feature cache, the output feature cache, the weight cache and the residual cache of the data cache plane;
distributing the corresponding weights to the nodes of the multi-head attention layer and the feedforward neural network layer;
distributing the corresponding instructions and parameters to the nodes of the multi-head attention layer and the feedforward neural network layer;
distributing the corresponding feature data to the nodes of the multi-head attention layer and the feedforward neural network layer;
after each node completes its computation according to the issued computation flow, the data either continues to the next node for further computation or is returned to the data cache;
and returning to the step of distributing the corresponding instructions and parameters to the nodes of the multi-head attention layer and the feedforward neural network layer, repeating the above steps until the computation of all nodes is completed, and outputting the computation result to the external memory.