CN116431562B - Multi-head attention mechanism fusion calculation distribution method based on acceleration processor


Info

Publication number
CN116431562B
Authority
CN
China
Prior art keywords
fusion
operator
data
calculation
slave core
Prior art date
Legal status
Active
Application number
CN202310687654.5A
Other languages
Chinese (zh)
Other versions
CN116431562A (en)
Inventor
徐旎林 (Xu Nilin)
闫夏超 (Yan Xiachao)
高伟 (Gao Wei)
Current Assignee
Taichu Wuxi Electronic Technology Co., Ltd.
Original Assignee
Taichu Wuxi Electronic Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Taichu Wuxi Electronic Technology Co., Ltd.
Priority to CN202310687654.5A
Publication of CN116431562A
Application granted
Publication of CN116431562B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 - Digital computers in general; Data processing equipment in general
    • G06F15/76 - Architectures of general purpose stored program computers
    • G06F15/80 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 - Arrangements for executing specific machine instructions
    • G06F9/30007 - Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001 - Arithmetic instructions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 - Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F9/5016 - Allocation of resources to service a request, the resource being the memory
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to the field of data processing and discloses a multi-head attention mechanism fusion calculation distribution method based on an acceleration processor, comprising the following steps: acquiring the slave core information, the data to be processed in the memory, and the calculation requirements of the data to be processed; performing fusion association on the operators of the slave cores based on the calculation requirements and the slave core information to obtain fusion operators and the calculation logic corresponding to each fusion operator; and sequentially calling the interfaces corresponding to the fusion operators to start the slave cores, so that each slave core calculates the data to be processed with the corresponding operators according to the calculation logic of each fusion operator in turn, obtaining the calculation result. Combining operators into fusion operators makes full use of hardware resources: compared with processing only a single operator at a time, it reduces the startup overhead of the slave cores, avoids frequent memory reads and writes, reduces the volume of memory traffic, avoids memory-bandwidth contention, and greatly reduces the impact of the memory bottleneck.

Description

Multi-head attention mechanism fusion calculation distribution method based on acceleration processor
Technical Field
The invention relates to the technical field of data processing, in particular to a multi-head attention mechanism fusion calculation distribution method based on an acceleration processor.
Background
With the growth of computing power, deep learning has become an important breakthrough in the field of artificial intelligence over the last decade. Increasingly complex applications must process ever larger volumes of data, driving the need for more efficient and faster supporting frameworks. To improve a framework's computational efficiency, the specific hardware structure must be combined with the algorithmic structure of each operator so that the two match and hardware resources are fully used, achieving the best computational performance.
Most current AI acceleration processors adopt a heterogeneous many-core design, which delivers efficient computing performance while reducing power consumption to some extent. The heterogeneous many-core architecture comprises a master core and a slave core array that share memory space. In the prior art, however, each calculation must write the operator's output back to memory, and before the next calculation the data must be read from main memory back into the slave-core cache. This repeated data movement inflates the volume of memory traffic, occupies bandwidth, and degrades performance.
Disclosure of Invention
In view of this, the embodiment of the invention provides a multi-head attention mechanism fusion calculation distribution method based on an acceleration processor, so as to solve the problems of large data access quantity and low efficiency in the data distribution calculation process in the prior art.
In a first aspect, an embodiment of the present invention provides a multi-head attention mechanism fusion computation allocation method based on an acceleration processor, which is applied to a main core of the acceleration processor, where the acceleration processor includes a main core, a memory, and a slave core array formed by a plurality of slave cores, each slave core processes operators of a plurality of operation types, and the method includes:
acquiring slave core information, data to be processed in a memory and calculation requirements of the data to be processed;
based on the calculation requirement and the slave core information, carrying out fusion association on operators of the slave core to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator;
and sequentially calling interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start the slave core, so that the slave core calculates the data to be processed by utilizing the corresponding operators according to the calculation logic of each fusion operator in sequence, and a calculation result is obtained.
By reasonably fusing parts of the calculation process, the fusion-operator combination makes full use of the slave cores. Compared with processing only a single operator at a time, fusing operators reduces the startup overhead of the slave cores, avoids frequent memory reads and writes, stores intermediate results directly in the slave core for the next calculation, reduces the volume of memory traffic, avoids memory-bandwidth contention, and greatly reduces the impact of the memory bottleneck.
In an optional implementation manner, the performing fusion association on the operators of the slave cores based on the calculation requirements and the slave core information to obtain a first fusion operator, a second fusion operator and a third fusion operator includes:
extracting the operation type of each operator in each slave core from the slave core information;
and carrying out fusion association on the operators according to the calculation requirement and the operation type to obtain a first fusion operator, a second fusion operator and a third fusion operator.
By dividing and merging operators, the data parallel processing capability of the slave cores of the acceleration processor is fully utilized, and a reasonable allocation strategy is provided for the calculation tasks of the operators by combining the hardware structure of the acceleration processor, so that the overall operator performance is improved.
In a second aspect, an embodiment of the present invention provides a multi-head attention mechanism fusion computation allocation method based on an acceleration processor, which is applied to a slave core of the acceleration processor, where the acceleration processor includes a master core, a memory, and a slave core array formed by a plurality of slave cores, and each slave core processes operators of a plurality of operation types, and the method includes:
acquiring an interface for starting a fusion operator corresponding to a slave core, and determining a current fusion operator to be processed and a corresponding calculation logic; the method comprises the steps that a master core carries out fusion association on operators of a slave core based on calculation requirements and slave core information by acquiring slave core information, to-be-processed data in a memory and calculation requirements of the to-be-processed data, and a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator are obtained, and interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator are sequentially called to start the slave core;
extracting the current to-be-processed data corresponding to the current to-be-processed fusion operator from the memory;
and calculating the current data to be processed by utilizing a corresponding operator according to the calculation logic to obtain a calculation result, and writing the calculation result into the memory.
The combination of fused operators reduces the startup overhead of the slave core, avoids frequent memory reads and writes, and stores intermediate results directly on the slave core for the next calculation. In particular, data produced by simple non-matrix-multiplication calculations need not be written back to memory at all, so the volume of memory traffic is reduced as far as possible, memory-bandwidth contention is avoided, and the impact of the memory bottleneck is greatly reduced.
In an optional implementation manner, when the fusion operator to be processed is the first fusion operator, the current data to be processed is the data corresponding to the first fusion operator in the memory and includes first data to be processed and first data to be transposed, and the calculating the current data to be processed by using the corresponding operator according to the calculation logic to obtain a calculation result and writing the calculation result into the memory includes:
performing matrix multiplication operation on the first data to be processed to obtain first intermediate data;
performing transposition on the first data to be transposed to obtain first transposed data;
performing matrix multiplication operation on the first transposed data to obtain second intermediate data;
and performing matrix multiplication operation based on the first intermediate data and the second intermediate data to obtain a first calculation result, and storing the first calculation result into the memory.
By this division of the fusion operator, the first intermediate data, the first transposed data and the second intermediate data can all be passed through the slave core's cache without being written back to memory, reducing repeated data fetches, the volume of memory traffic, and bandwidth pressure.
In an optional implementation manner, when the fusion operator to be processed is a second fusion operator, the current data to be processed is a second data to be processed corresponding to the second fusion operator in a memory and a first calculation result of the first fusion operator, and the calculating the current data to be processed by using the corresponding operator according to the calculation logic to obtain a calculation result and writing the calculation result into the memory includes:
performing matrix multiplication operation on the second data to be processed to obtain third intermediate data;
performing non-matrix multiplication operation on the first calculation result to obtain fourth intermediate data;
and performing matrix multiplication operation based on the third intermediate data and the fourth intermediate data to obtain a second calculation result, and storing the second calculation result into a memory.
By this division of the fusion operator, the third intermediate data and the fourth intermediate data can be passed through the slave core's cache without being written back to memory, reducing repeated data fetches, the volume of memory traffic, and bandwidth pressure.
In an optional implementation manner, when the fusion operator to be processed is a third fusion operator, the current data to be processed is third data to be processed corresponding to the third fusion operator in a memory and a second calculation result of the second fusion operator, and the calculating the current data to be processed by using the corresponding operator according to the calculation logic to obtain a calculation result and writing the calculation result into the memory includes:
and performing matrix multiplication operation on the third data to be processed and the second calculation result to obtain a third calculation result, and storing the third calculation result into a memory.
In an alternative embodiment, the slave core further comprises a first buffer space and a second buffer space;
The first buffer space and the second buffer space are used for alternately performing a process of acquiring data from the memory and a data processing process.
When the first buffer space is being used for calculation, the second buffer space fetches data from the memory; after the calculation in the first buffer space completes, the two buffer spaces are swapped. In this way, the memory-access process and the calculation process hide each other, different calculation modes proceed simultaneously, and overall performance improves.
In a third aspect, an embodiment of the present invention provides a multi-head attention mechanism fusion calculation allocation device based on an acceleration processor, which is applied to a main core of the acceleration processor, where the acceleration processor includes a main core, a memory, and a slave core array formed by a plurality of slave cores, each slave core processing operators of a plurality of operation types, the device including:
the first acquisition module is used for acquiring the slave core information, the data to be processed in the memory and the calculation requirement of the data to be processed;
the association module is used for carrying out fusion association on operators of the slave cores based on the calculation requirement and the slave core information to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator;
And the calling module is used for sequentially calling interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start the slave core, so that the slave core calculates the data to be processed by utilizing the corresponding operators according to the calculation logic of each fusion operator in sequence, and a calculation result is obtained.
In a fourth aspect, an embodiment of the present invention provides a multi-head attention mechanism fusion calculation allocation device based on an acceleration processor, which is applied to a slave core of the acceleration processor, where the acceleration processor includes a master core, a memory, and a slave core array formed by a plurality of slave cores, each slave core processing operators of a plurality of operation types, the device including:
the second acquisition module is used for acquiring the interface for starting the fusion operator corresponding to the slave core, and determining the current fusion operator to be processed and its corresponding calculation logic; the master core, by acquiring the slave core information, the data to be processed in the memory and the calculation requirements of the data to be processed, performs fusion association on the operators of the slave cores based on the calculation requirements and the slave core information to obtain the first fusion operator, the second fusion operator, the third fusion operator and the calculation logic corresponding to each fusion operator, and sequentially calls the interfaces corresponding to the first, second and third fusion operators to start the slave cores;
The extraction module is used for extracting current to-be-processed data corresponding to the current to-be-processed fusion operator from the memory;
and the calculation module is used for calculating the current data to be processed by utilizing a corresponding operator according to the calculation logic to obtain a calculation result, and writing the calculation result into the memory.
In a fifth aspect, embodiments of the present invention provide a multi-headed attention mechanism fusion computing dispensing system based on an acceleration processor, comprising: a master core, a memory and a slave core array formed by a plurality of slave cores, wherein each slave core processes operators of a plurality of operation types;
the master core acquires slave core information, data to be processed in a memory and calculation requirements of the data to be processed;
based on the calculation requirement and the slave core information, carrying out fusion association on operators of the slave core to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator;
sequentially calling interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start a slave core, so that the slave core calculates the data to be processed by utilizing the corresponding operators according to calculation logic of each fusion operator sequentially to obtain a calculation result;
The slave core acquires an interface for starting a fusion operator corresponding to the slave core, and determines a fusion operator to be processed currently and calculation logic corresponding to the fusion operator;
extracting current to-be-processed data corresponding to the current to-be-processed fusion operator from the memory;
and calculating the current data to be processed by utilizing a corresponding operator according to the calculation logic to obtain a calculation result, and writing the calculation result into the memory.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of a multi-head attention mechanism fusion calculation distribution method based on an acceleration processor in accordance with an embodiment of the present invention;
FIG. 2 is a flow diagram of another multi-head attention mechanism fusion calculation distribution method based on an acceleration processor in accordance with an embodiment of the present invention;
FIG. 3 is a flow diagram of yet another multi-head attention mechanism fusion calculation distribution method based on an acceleration processor in accordance with an embodiment of the present invention;
FIG. 4 is a flow diagram of yet another multi-head attention mechanism fusion calculation distribution method based on an acceleration processor in accordance with an embodiment of the present invention;
FIG. 5 is a block diagram of a multi-head attention mechanism fusion computing device based on an acceleration processor in accordance with an embodiment of the present invention;
FIG. 6 is a block diagram of another multi-head attention mechanism fusion computing device based on an acceleration processor in accordance with an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
With the growth of computing power, deep learning has become an important breakthrough in the field of artificial intelligence over the last decade. Increasingly complex applications must process ever larger volumes of data, driving the need for more efficient and faster supporting frameworks. To improve a framework's computational efficiency, the specific hardware structure must be combined with the algorithmic structure of each operator so that the two match and hardware resources are fully used, achieving the best computational performance.
The multi-headed attentiveness mechanism is an artificial intelligence technique that allows a neural network to concentrate on portions of critical information while ignoring non-important portions when processing sequence data. Attention mechanisms have been widely used in the fields of natural language processing, computer vision, speech recognition, and the like.
The calculation of the multi-head attention mechanism can be expressed as:

head_i = Attention(Q_i, K_i, V_i) = softmax(Q_i · K_iᵀ / √d) · V_i

O = MultiHead(Q, K, V) = Concat(head_1, …, head_h) · Wo

where Q (query) is the query, K (key) and V (value) are the keys and their corresponding values, O (output) is the output corresponding to the query, and d is the per-head projection dimension (proj_size below).

Q and O are in correspondence, i.e. the sequence lengths of Q and O are the same; K and V are in correspondence, i.e. the sequence lengths of K and V are the same. During calculation, each query in Q is scored against all of K, the scores are multiplied by the corresponding V, and finally the output results are combined.

In the above formula, the product of each K and Q is a single value; the values along the K sequence are normalized by Softmax, then multiplied by the corresponding V, and finally all the results are accumulated to obtain the corresponding output.
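As a concrete reference for the formula above, the following is a minimal NumPy sketch of the multi-head computation. It uses the dimension names (seq_len, hidden_size, nheads, proj_size) introduced later in this description; the helper names and the 1/√d scaling convention are assumptions of the sketch, not part of the patent text.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, Wq, Wk, Wv, Wo, nheads):
    seq_len, hidden_size = q.shape
    proj_size = Wq.shape[1] // nheads
    # Project the inputs, then split into heads: [nheads, seq_len, proj_size].
    Q = (q @ Wq).reshape(seq_len, nheads, proj_size).transpose(1, 0, 2)
    K = (k @ Wk).reshape(seq_len, nheads, proj_size).transpose(1, 0, 2)
    V = (v @ Wv).reshape(seq_len, nheads, proj_size).transpose(1, 0, 2)
    # Each query is scored against all keys: [nheads, seq_len, seq_len].
    S = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(proj_size))
    # Scores weight the values, then heads are merged and projected.
    H = (S @ V).transpose(1, 0, 2).reshape(seq_len, nheads * proj_size)
    return H @ Wo   # O: [seq_len, hidden_size]
```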
The current AI acceleration processor mostly adopts heterogeneous many-core design, and can obtain high-efficiency computing performance and reduce power consumption to a certain extent. The heterogeneous many-core architecture comprises a master core and a slave core array, which share memory space. Each slave core has its own independent cache space, and can process the calculation data in parallel. Each slave core is provided with an accelerating component which is specially used for processing matrix multiplication or other operations, so that the computing speed is improved to the greatest extent.
When such a processor currently handles deep-learning training applications, the main core usually calls the different operator interfaces in sequence according to the network's requirements; data is read from main memory into the slave core's cache space, the corresponding calculation is performed on it, and after processing completes the calculation result is written back to memory.
Taking the multi-head attention mechanism as an example, the dimension information of each parameter must first be defined. The dimension of the input attention_in (q, k, v) is [seq_len, hidden_size], the dimension of the input weights kernel (Wq, Wk, Wv) is [hidden_size, nheads × proj_size],
and the dimension of the output weight (Wo) is [nheads × proj_size, hidden_size]. If the multi-head attention mechanism is implemented according to the traditional method and the formula definition, it basically splits into three operators, namely the gemm operator, the batch_gemm operator and the softmax operator, with the following calculation steps:
1) First, attention_in and kernel are read from main memory into the slave-core cache, matrix calculation is performed in the slave core to obtain Q, K, V with dimensions [seq_len, nheads × proj_size], and Q, K, V are written back to main memory.
2) Q and K are read from main memory into the slave-core cache again and their dimensions are adjusted: Q is adjusted to [nheads, seq_len, proj_size] and K to [nheads, proj_size, seq_len], after which Q and K are written back to main memory once more.
3) Q and K are read from main memory into the slave-core cache, batch matrix calculation is performed on Q and K to obtain QK with dimensions [nheads, seq_len, seq_len], and QK is written back to main memory.
4) QK is read from main memory into the slave-core cache again, softmax calculation is performed on QK to obtain S_QK with dimensions [nheads, seq_len, seq_len], and S_QK is written back to main memory.
5) S_QK and V are read from main memory into the slave-core cache, batch matrix calculation is performed on S_QK and V to obtain V_S_QK with dimensions [nheads, seq_len, proj_size], and V_S_QK is written back to main memory.
6) V_S_QK and Wo are read from main memory into the slave-core cache, matrix calculation is performed on V_S_QK and Wo to obtain attention_out with dimensions [seq_len, hidden_size], and attention_out is written back to main memory.
The above calculation process has obvious disadvantages:
1) The output of each calculation serves as the input of the next. However, because the traditional approach splits the whole flow into three operators, gemm, batch_gemm and softmax, each operator's output must be written back to main memory and then read back into the slave-core cache before the next calculation can proceed. This repeated data movement inflates the volume of memory traffic, occupies bandwidth, and degrades performance. The most obvious redundant load occurs when computing the matrix multiplication of Q and K: Q and K must be read in twice, and the first read performs only a dimension adjustment in the slave core, without any additional computation.
2) A slave core performs only one of matrix computation (gemm, batch_gemm) or non-matrix computation (softmax) at any moment, yet on the acceleration processor matrix and non-matrix computations could run in parallel on different components within the core.
The embodiment of the invention provides a multi-head attention mechanism fusion calculation distribution method based on an acceleration processor, which is characterized by acquiring slave core information, data to be processed in a memory and calculation requirements of the data to be processed. And carrying out fusion association on operators of the slave cores based on the calculation requirements and the slave core information to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator. And sequentially calling interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start the slave core, so that the slave core calculates the data to be processed by utilizing the corresponding operators according to the calculation logic of each fusion operator sequentially, and a calculation result is obtained. Compared with the situation that only a single operator is processed at the same time, the combination of the fusion processing operators can reduce the starting cost of the slave core, avoid frequent memory read-write, directly store the intermediate result in the slave core for the next calculation, reduce the memory access data amount, avoid the memory bandwidth competition and greatly reduce the influence of memory bottleneck.
In accordance with an embodiment of the present invention, there is provided an embodiment of a multi-headed attention mechanism fusion computing method based on an acceleration processor, it being noted that the steps shown in the flowchart of the figures may be performed in a computer system such as a set of computer executable instructions, and, although a logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in an order other than that shown or described herein.
In this embodiment, a multi-head attention mechanism fusion calculation allocation method based on an acceleration processor is provided, which can be used for a main core of the acceleration processor, such as a SWAI chip, and the acceleration processor includes a main core, a memory, and a slave core array formed by a plurality of slave cores, where each slave core processes operators of a plurality of operation types, and fig. 1 is a flowchart of the multi-head attention mechanism fusion calculation allocation method based on the acceleration processor according to an embodiment of the present invention, and as shown in fig. 1, the flowchart includes the following steps:
step S101: the slave core information, the data to be processed in the memory and the calculation requirements of the data to be processed are acquired. Specifically, the slave core information includes the calculation types supported by each slave core, and the data to be processed in the memory can be partitioned along row and column dimensions to optimize the data access pattern.
Step S102: and carrying out fusion association on operators of the slave cores based on the calculation requirements and the slave core information to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator. Specifically, by dividing and fusing operators, the data parallel processing capability of the slave cores of the acceleration processor is fully utilized, and a reasonable allocation strategy is provided for the calculation tasks of the operators by combining the hardware structure of the acceleration processor, so that the overall operator performance is improved.
Step S103: and sequentially calling interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start the slave core, so that the slave core calculates the data to be processed by utilizing the corresponding operators according to the calculation logic of each fusion operator sequentially, and a calculation result is obtained. Specifically, the method of combining the fusion operators is favorable for fully using hardware resources, and compared with the condition that only a single operator is processed at the same time, the method of combining the fusion operators can reduce the starting overhead of the slave core and avoid frequent memory reading and writing.
Through steps S101 to S103, the multi-head attention mechanism fusion calculation distribution method based on the acceleration processor provided by the embodiment of the invention makes full use of hardware resources through the fusion-operator combination. Compared with processing only a single operator at a time, fusing operators reduces the startup overhead of the slave cores, avoids frequent memory reads and writes, keeps intermediate results in the slave core for the next calculation, reduces the volume of memory traffic, avoids memory-bandwidth contention, and greatly reduces the impact of the memory bottleneck.
In this embodiment, a multi-head attention mechanism fusion calculation allocation method based on an acceleration processor is provided, which can be used for a main core of the acceleration processor, such as a SWAI chip, and the acceleration processor includes a main core, a memory, and a slave core array formed by a plurality of slave cores, where each slave core processes operators of a plurality of operation types, and fig. 2 is a flowchart of the multi-head attention mechanism fusion calculation allocation method based on the acceleration processor according to an embodiment of the present invention, and as shown in fig. 2, the flowchart includes the following steps:
step S201: the slave core information, the data to be processed in the memory and the calculation requirements of the data to be processed are acquired. Specifically, the slave core information includes the calculation types supported by each slave core, and the data to be processed in the memory can be partitioned along row and column dimensions to optimize the data access pattern.
Step S202: and carrying out fusion association on operators of the slave cores based on the calculation requirements and the slave core information to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator. Specifically, by dividing and fusing operators, the data parallel processing capability of the slave cores of the acceleration processor is fully utilized, and a reasonable allocation strategy is provided for the calculation tasks of the operators by combining the hardware structure of the acceleration processor, so that the overall operator performance is improved.
Step S203: and sequentially calling interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start the slave core, so that the slave core calculates the data to be processed by utilizing the corresponding operators according to the calculation logic of each fusion operator sequentially, and a calculation result is obtained. Specifically, the method of combining the fusion operators is favorable for fully using hardware resources, and compared with the condition that only a single operator is processed at the same time, the method of combining the fusion operators can reduce the starting overhead of the slave core and avoid frequent memory reading and writing.
Specifically, step S202 described above includes:
step S2021: the operation type of each operator in each slave core is extracted from the slave core information.
Step S2022: and carrying out fusion association on operators according to the calculation requirements and the operation types to obtain a first fusion operator, a second fusion operator and a third fusion operator.
Specifically, by dividing and fusing operators, the data parallel processing capability of the slave cores of the acceleration processor is fully utilized, and a reasonable allocation strategy is provided for the calculation tasks of the operators by combining the hardware structure of the acceleration processor, so that the overall operator performance is improved.
Specifically, according to the calculation logic and the data dependencies, the whole calculation process can be decomposed into three large fusion operators. Within each fusion operator, only the output of the final calculation must be written back to the memory; the results of the other stages serve as intermediate results for the next calculation and need not be written back.
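The master-core side of this allocation can be pictured with the short sketch below; the operator groupings mirror the three fusion operators detailed later, while the function names and data structures are hypothetical stand-ins, not the chip's actual API.

```python
def launch_slave_cores(op_name, logic):
    """Stand-in for calling a fused operator's interface: a real
    implementation would start the slave-core array, let each core run
    `logic` on its share of the data, and block until all cores sync."""
    print(f"fused operator {op_name}: slave cores execute {logic}")

def master_core_dispatch():
    # Fusion association: which primitive operators each fused operator
    # covers, i.e. the calculation logic of each fused operator.
    fused_ops = [
        ("A", ["gemm(K)", "gemm(Q)", "batch_gemm(QK)"]),
        ("B", ["gemm(V)", "softmax(QK)", "batch_gemm(hi)"]),
        ("C", ["gemm(attention_out)"]),
    ]
    # One slave-core launch per fused operator instead of one per primitive
    # operator: intermediates stay in the slave-core caches, and only each
    # fused operator's final output is written back to main memory.
    for name, logic in fused_ops:
        launch_slave_cores(name, logic)

master_core_dispatch()
```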
In this embodiment, a multi-head attention mechanism fusion calculation allocation method based on an acceleration processor is provided, which may be used for a slave core of the acceleration processor, such as a SWAI chip, where the acceleration processor includes a master core, a memory, and a slave core array formed by a plurality of slave cores, each slave core processes operators of a plurality of operation types, and fig. 3 is a flowchart of the multi-head attention mechanism fusion calculation allocation method based on the acceleration processor according to an embodiment of the present invention, and as shown in fig. 3, the flowchart includes the following steps:
step S301: acquiring the interface for starting the fusion operator corresponding to the slave core, and determining the fusion operator currently to be processed and its corresponding calculation logic. That is, the master core, by acquiring the slave core information, the data to be processed in the memory and the calculation requirements of the data to be processed, performs fusion association on the operators of the slave cores based on the calculation requirements and the slave core information to obtain the first fusion operator, the second fusion operator, the third fusion operator and the calculation logic corresponding to each fusion operator, and sequentially calls the interfaces corresponding to the first, second and third fusion operators to start the slave cores. Specifically, by reasonably fusing parts of the calculation process, the fusion-operator combination makes full use of the slave cores: compared with processing only a single operator at a time, it reduces the startup overhead of the slave cores, avoids frequent memory reads and writes, stores intermediate results directly in the slave core for the next calculation, reduces the volume of memory traffic, avoids memory-bandwidth contention, and greatly reduces the impact of the memory bottleneck.
Step S302: extracting from the memory the current data to be processed corresponding to the fusion operator currently being processed. Specifically, the data in the memory can be partitioned along row and column dimensions, which optimizes the data access pattern, reduces the volume of memory traffic, improves access efficiency, and lessens the impact of the memory-access bottleneck on overall operator performance.
Step S303: calculating the current data to be processed with the corresponding operator according to the calculation logic to obtain a calculation result, and writing the calculation result into the memory. Specifically, the whole calculation process is decomposed into three large fusion operators; within each fusion operator, only the output of the final calculation must be written back to the memory, while the results of the other stages serve as intermediate results for the next calculation and need not be written back. This optimizes the data access pattern, reduces the volume of memory traffic, improves access efficiency, and lessens the impact of the memory-access bottleneck on overall operator performance.
In this embodiment, a multi-head attention mechanism fusion calculation allocation method based on an acceleration processor is provided, which can be used for a slave core of the acceleration processor, such as a SWAI chip, and the acceleration processor includes a master core, a memory, and a slave core array formed by a plurality of slave cores, where each slave core processes operators of a plurality of operation types. FIG. 4 is a flowchart of the multi-head attention mechanism fusion calculation allocation method based on the acceleration processor according to an embodiment of the present invention, and as shown in FIG. 4, the flow includes the following steps:
Step S401: acquiring the interface for starting the fusion operator corresponding to the slave core, and determining the fusion operator currently to be processed and its corresponding calculation logic. That is, the master core, by acquiring the slave core information, the data to be processed in the memory and the calculation requirements of the data to be processed, performs fusion association on the operators of the slave cores based on the calculation requirements and the slave core information to obtain the first fusion operator, the second fusion operator, the third fusion operator and the calculation logic corresponding to each fusion operator, and sequentially calls the interfaces corresponding to the first, second and third fusion operators to start the slave cores. Specifically, by reasonably fusing parts of the calculation process, the fusion-operator combination makes full use of the slave cores: compared with processing only a single operator at a time, it reduces the startup overhead of the slave cores, avoids frequent memory reads and writes, stores intermediate results directly in the slave core for the next calculation, reduces the volume of memory traffic, avoids memory-bandwidth contention, and greatly reduces the impact of the memory bottleneck.
Step S402: extracting from the memory the current data to be processed corresponding to the fusion operator currently being processed. Specifically, the data in the memory can be partitioned along row and column dimensions, which optimizes the data access pattern, reduces the volume of memory traffic, improves access efficiency, and lessens the impact of the memory-access bottleneck on overall operator performance.
Step S403: calculating the current data to be processed with the corresponding operator according to the calculation logic to obtain a calculation result, and writing the calculation result into the memory. Specifically, the whole calculation process is decomposed into three large fusion operators; within each fusion operator, only the output of the final calculation must be written back to the memory, while the results of the other stages serve as intermediate results for the next calculation and need not be written back. This optimizes the data access pattern, reduces the volume of memory traffic, improves access efficiency, and lessens the impact of the memory-access bottleneck on overall operator performance.
Specifically, when the fusion operator to be processed in the step S401 is the first fusion operator, the current data to be processed is the data to be processed corresponding to the first fusion operator in the memory, the current data to be processed includes the first data to be processed and the first data to be transposed, and the step S403 includes:
and a step a1 of performing matrix multiplication operation on the first data to be processed to obtain first intermediate data.
And a step a2, obtaining first transposed data by carrying out transposition processing on the first data to be transposed.
And a step a3, performing matrix multiplication operation on the first transposed data to obtain second intermediate data.
And a step a4, performing matrix multiplication operation based on the first intermediate data and the second intermediate data to obtain a first calculation result, and storing the first calculation result into a memory.
Specifically, since numerous data-transposition operations are involved in the calculation, the acceleration processor supports transposed data reads: its transposed-read instruction can replace the usual sequential read followed by a manual data rearrangement on the slave core, improving overall calculation efficiency.
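The effect can be illustrated with NumPy standing in for the slave-core cache: the fused variant below consumes the transposed view in a single pass instead of materializing the transpose in a first pass. The actual transposed-read instruction is chip-specific and is not shown.

```python
import numpy as np

k = np.random.rand(128, 64)

# Conventional: pass 1 reads k and writes kT back; pass 2 reads kT again.
kT = k.T.copy()          # first pass over the data, transpose only
out_two_pass = kT @ k    # second pass, the actual computation

# Transposed read: the transpose happens as part of the single load,
# so no intermediate is written back and the data is touched once.
out_one_pass = k.T @ k

assert np.allclose(out_two_pass, out_one_pass)
```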
Specifically, when the fusion operator to be processed in the step S401 is the second fusion operator, the current data to be processed is the second data to be processed corresponding to the second fusion operator in the memory and the first calculation result of the first fusion operator, and the step S403 includes:
and b1, performing matrix multiplication operation on the second data to be processed to obtain third intermediate data.
And b2, performing non-matrix multiplication operation on the first calculation result to obtain fourth intermediate data.
And b3, performing matrix multiplication operation based on the third intermediate data and the fourth intermediate data to obtain a second calculation result, and storing the second calculation result into a memory.
Specifically, when the fusion operator to be processed in the step S401 is the third fusion operator, the current data to be processed is the third data to be processed corresponding to the third fusion operator in the memory and the second calculation result of the second fusion operator, and the step S403 includes:
And c1, performing matrix multiplication operation on the third data to be processed and the second calculation result to obtain a third calculation result, and storing the third calculation result into a memory.
In this embodiment, a multi-head attention mechanism fusion calculation allocation method based on an acceleration processor is provided, which can be used for a slave core of the acceleration processor, such as a SWAI chip. The acceleration processor includes a master core, a memory, and a slave core array formed by a plurality of slave cores, where each slave core processes operators of a plurality of operation types, and each slave core further includes a first buffer space and a second buffer space. The first buffer space and the second buffer space are used alternately for fetching data from the memory and for data processing.
Specifically, by providing a double-buffer mechanism, memory access and computation are executed in parallel using the two cache spaces. During computation, a matrix-multiplication instruction and a non-matrix-multiplication instruction can be submitted at the same time, so the two types of computing components execute in parallel and the computation time is shortened; the memory-access time and the computation time thus hide each other, accelerating the overall calculation.
In one embodiment, a SWAI chip is described as an example, where the SWAI chip includes a master core and a slave core array, and the master core and the slave core share a memory space, and each slave core has its own cache space and an acceleration component that specifically processes matrix multiplication and other operations. Based on the hardware structure of the SWAI chip, the computation process of the multi-head attention mechanism can be decomposed into three large fusion operators:
1) Fuse the calculations Q = W_{q,i}·q and K = W_{k,i}·k together with QK = (kᵀ·W_{k,i}ᵀ)·(W_{q,i}·q);
2) fuse the calculation V = W_{v,i}·v with softmax(QK) and hi = batch_gemm(V, softmax(QK));
3) finally, combine all the obtained hi to complete the calculation attention_out = gemm(hi, Wo).
the three fusion operators are named as A, B and C, and the complete calculation process of the operator A can be described as follows:
1. the main core calls an A interface and starts the slave core;
2. W_{k,i} and k are read directly from main memory into the slave-core cache via hardware transposition and are stored there as kᵀ and W_{k,i}ᵀ;
3. on the slave core, the computation Kᵀ = gemm(kᵀ, W_{k,i}ᵀ) is completed using the matrix-multiplication acceleration component, while W_{q,i} and q are read from main memory into the slave-core cache at the same time;
4. on the slave core, the matrix-multiplication acceleration component completes the computation Q = gemm(W_{q,i}, q); since Kᵀ is already computed and has not been written back to main memory, the component can be used again immediately after Q to complete QK = gemm(kᵀ·W_{k,i}ᵀ, W_{q,i}·q), and during the computation the finished QK is written back block by block from the slave-core cache to main memory;
5. the master core confirms that all slave cores have completed their computation and synchronizes; the calculation of fusion operator A is then complete.
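Numerically, operator A amounts to the per-head NumPy sketch below. It uses row-major [seq_len, hidden_size] arrays, so the products appear transposed relative to the notation above; the dict standing in for main memory and the per-head weight slicing are assumptions of the sketch.

```python
import numpy as np

def fused_operator_A(mem, nheads):
    seq_len, hidden = mem["k"].shape
    proj = mem["Wk"].shape[1] // nheads
    QK = np.empty((nheads, seq_len, seq_len))
    for i in range(nheads):
        Wk_i = mem["Wk"][:, i * proj:(i + 1) * proj]
        Wq_i = mem["Wq"][:, i * proj:(i + 1) * proj]
        # Transposed read leaves k^T and Wk_i^T in the cache; K^T is
        # computed there and never written back to main memory.
        K_T = Wk_i.T @ mem["k"].T          # [proj, seq_len], cache only
        Q = mem["q"] @ Wq_i                # [seq_len, proj], cache only
        # Only QK, the fused operator's final result, goes back to memory.
        QK[i] = Q @ K_T
    mem["QK"] = QK
    return QK
```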
The complete calculation process for the B operator can be described as:
1. the master core calls the B interface and starts the slave core;
2. W_{v,i}, v and QK are read from main memory into the slave-core cache;
3. on the slave core, the computation V = gemm(W_{v,i}, v) is completed using the matrix-multiplication acceleration component, while the computation S_QK = softmax(QK) is completed simultaneously using the non-matrix-multiplication acceleration component;
4. on the slave core, the matrix-multiplication acceleration component performs the computation hi = batch_gemm(V, softmax(QK)), while the finished hi is written back block by block from the slave-core cache to main memory;
5. the master core confirms that all slave cores have completed their computation and synchronizes; the calculation of fusion operator B is then complete.
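A matching per-head sketch of operator B follows, under the same row-major convention and illustrative memory dict. In NumPy the matrix and non-matrix parts necessarily run in sequence, so the hardware parallelism between the two acceleration components is indicated only in the comments.

```python
import numpy as np

def fused_operator_B(mem, nheads):
    seq_len = mem["QK"].shape[1]
    proj = mem["Wv"].shape[1] // nheads
    hi = np.empty((nheads, seq_len, proj))
    for i in range(nheads):
        Wv_i = mem["Wv"][:, i * proj:(i + 1) * proj]
        # Matrix-multiplication component: V = gemm(Wv_i, v).
        V = mem["v"] @ Wv_i
        # Non-matrix component, on hardware in parallel with the gemm:
        # S_QK = softmax(QK), computed entirely in the slave-core cache.
        QK = mem["QK"][i]
        e = np.exp(QK - QK.max(-1, keepdims=True))
        S_QK = e / e.sum(-1, keepdims=True)
        # batch_gemm, then write the finished hi back to main memory.
        hi[i] = S_QK @ V
    mem["hi"] = hi
    return hi
```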
the complete calculation process for the C operator can be described as:
1. the master core calls a C interface and starts the slave core;
2. W_{o,i} and hi are read from main memory into the slave-core cache;
3. on the slave core, the computation attention_out = gemm(hi, W_{o,i}) is completed using the matrix-multiplication acceleration component, while the finished attention_out is written back block by block from the slave-core cache to main memory;
4. the master core confirms that all slave cores have completed their computation and synchronizes; the calculation of fusion operator C is then complete.
In the above calculation process, main-memory data can be fetched into the slave-core cache already transposed. For the computation Kᵀ = gemm(kᵀ, W_{k,i}ᵀ) in fusion operator A, it suffices to call the transposed-read instruction to fetch W_{k,i} and k into the slave-core cache in transposed form, after which the next calculation can proceed directly. There is no need to make two passes over the data as in the common approach, where the first pass only transposes the data and the second pass performs the actual computation, increasing the volume of memory traffic and occupying extra bandwidth.
Preferably, since the memory-access units of the SWAI chip are asynchronous, a double-buffering technique can be applied in the calculation of each fusion operator to improve data-transfer efficiency and reduce transfer latency. With double buffering there are two buffers, one receiving data while the other is being processed, so reception and processing proceed in parallel. For example, the operator A calculation allocates two buffers in the slave-core cache, the first as the compute buffer and the second as the memory-access buffer. A cold-load step comes first: a memory-access instruction is issued to fetch the first blocks of W_{k,i} and k into the compute buffer. Then the double-buffered loop begins: the next memory-access instruction is issued to fetch the next blocks of W_{k,i} and k into the memory-access buffer; once the data in the compute buffer has fully arrived, matrix multiplication is performed on it, in parallel with the fetch of the next block; and after the computation completes, the memory-access buffer and the compute buffer are swapped. In this way the memory-access process and the calculation process hide each other and overall performance improves.
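The loop structure can be sketched as follows, with Python lists standing in for the two slave-core buffers and a synchronous load standing in for the asynchronous memory-access instruction; on the real chip the load and the compute overlap, which the sketch can only indicate in comments.

```python
def load(block):
    """Stand-in for the asynchronous memory-access instruction."""
    return list(block)

def double_buffered(blocks, compute):
    results = []
    # Cold load: fetch the first block into the compute buffer.
    compute_buf = load(blocks[0])
    for nxt in blocks[1:]:
        access_buf = load(nxt)                # fetch of the next block...
        results.append(compute(compute_buf))  # ...overlaps this computation
        # Swap: the freshly loaded buffer becomes the compute buffer.
        compute_buf, access_buf = access_buf, compute_buf
    results.append(compute(compute_buf))      # drain the last block
    return results

print(double_buffered([[1, 2], [3, 4], [5, 6]], sum))   # [3, 7, 11]
```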
Each type of computation in the slave core corresponds to its own high-efficiency computing component and has asynchronous properties, which allows the double-buffer scheme to be optimized further for fusion operator B. There are three types of data operations in fusion operator B: matrix multiplication (gemm and batch_gemm), non-matrix multiplication (softmax), and memory access; on the SWAI chip these can be completed by three different high-efficiency components within the core, and the fusion operator lets them proceed in parallel. The specific method is as follows: two buffers are allocated in the slave-core cache, the first as the compute buffer and the second as the memory-access buffer. A cold-load step again comes first: the first blocks of W_{v,i} and v and the first block of QK are read from main memory into the compute buffer. Then the double-buffered loop begins: the next blocks of W_{v,i} and v and the next QK are read from main memory into the memory-access buffer; once the data in the compute buffer has fully arrived, the computation is performed on it, in parallel with the fetch of the next block; and after the computation completes, the memory-access buffer and the compute buffer are swapped. In this way the memory-access process and the calculation process hide each other and overall performance improves.
Within each iteration, a matrix-multiplication instruction is submitted first to calculate V = gemm(Wv,i, v); once that instruction has been issued, a non-matrix-multiplication instruction is submitted to calculate S_QK = softmax(QK). Throughout the computation, the matrix and non-matrix calculations are carried out by different hardware units and proceed in parallel. This combined computation is itself parallel to the fetch of the data for the next buffer, and after it completes the memory-access buffer and the compute buffer are swapped. In this way the memory access and the computation hide each other, different calculation types run simultaneously, and overall performance improves.
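The sketch below illustrates that intra-iteration parallelism; mme_submit_gemm(), simd_submit_softmax() and units_wait_all() are assumed, illustrative names for handles to the distinct matrix and non-matrix compute units, not a documented API.

```c
extern void mme_submit_gemm(const float *a, const float *b, float *c,
                            int m, int k, int n);       /* async submit */
extern void simd_submit_softmax(const float *in, float *out,
                                int rows, int cols);    /* async submit */
extern void units_wait_all(void);

void operator_b_step(const float *wv, const float *v, float *v_out,
                     const float *qk, float *s_qk,
                     int m, int k, int n, int rows, int cols)
{
    /* V = gemm(Wv,i, v) is issued to the matrix unit ... */
    mme_submit_gemm(wv, v, v_out, m, k, n);
    /* ... and S_QK = softmax(QK) runs concurrently on a non-matrix unit. */
    simd_submit_softmax(qk, s_qk, rows, cols);
    /* Both results feed the next multiplication, so join here. */
    units_wait_all();
}
```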
The SWAI chip has 32 slave computing cores per core group, arranged in 4 rows and 8 columns, and the operator computation tasks must be distributed to these 32 slave cores. Taking fusion operator C as an example, its core calculation is attention_out = gemm(Wo,i, hi), where the data dimension of hi is [nheads, seq_len, proj_size] and the data dimension of Wo,i is [nheads, proj_size, hidden_size]. The gemm calculation can be converted into one batch_gemm calculation, i.e., nheads gemm calculations, each child gemm completing a matrix multiplication of [seq_len, proj_size] x [proj_size, hidden_size]. In typical model networks nheads is usually a multiple of 4; if not, it can be padded with zeros to a multiple of 4. nheads can then be sliced across the 4 rows of each core group of the SWAI chip. The advantage of this slicing is that each row of the core group processes its own heads in full parallelism, without interfering with the others. Meanwhile, seq_len can be further sliced across the 8 columns of each core group, and the Wo,i data can be fetched from main memory to the slave core high-speed cache space in row-broadcast memory access mode, improving memory access efficiency. After each row completes the gemm calculations for its allocated heads, the results of the 4 rows must be merged once; this data reduction can be realized efficiently through slave core communication along the columns.
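The index arithmetic of this two-level split can be sketched as follows. The function and field names are illustrative assumptions, and nheads is assumed to have already been zero-padded to a multiple of 4.

```c
typedef struct {
    int head_begin, head_end;   /* heads handled by this core's row    */
    int seq_begin, seq_end;     /* seq_len slice handled by its column */
} work_t;

work_t split_operator_c(int core_id, int nheads, int seq_len)
{
    enum { ROWS = 4, COLS = 8 };         /* core-group geometry */
    int row = core_id / COLS;
    int col = core_id % COLS;

    int heads_per_row = nheads / ROWS;   /* exact after padding */
    int seq_per_col = (seq_len + COLS - 1) / COLS;

    work_t w;
    w.head_begin = row * heads_per_row;
    w.head_end = w.head_begin + heads_per_row;
    w.seq_begin = col * seq_per_col;
    w.seq_end = (col + 1) * seq_per_col;
    if (w.seq_end > seq_len)
        w.seq_end = seq_len;
    return w;
}
```

Each of the 32 cores calls this with its own core_id, so rows own disjoint head ranges and columns own disjoint seq_len slices, which is exactly the two-level slicing described above.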
The SWAI chip comprises a master core and a slave core array that share memory space; each slave core has its own cache space and can process computing tasks in parallel. Each slave core also has acceleration units dedicated to matrix multiplication and other operations. The multi-head attention mechanism invokes different operator combinations many times over the course of its computation, and these computations divide into two major categories: matrix multiplication operations and non-matrix multiplication operations.
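As a toy illustration of this two-way classification, the dispatch below routes each operator to the unit matching its category; all names are assumptions rather than a real SWAI API.

```c
typedef enum { OP_MATMUL, OP_NON_MATMUL } op_class_t;

typedef struct {
    op_class_t cls;      /* which category the operator belongs to */
    const char *name;    /* e.g. "gemm", "softmax", "transpose"    */
} op_t;

extern void run_on_matrix_unit(const op_t *op, void *args);
extern void run_on_general_unit(const op_t *op, void *args);

void dispatch(const op_t *op, void *args)
{
    if (op->cls == OP_MATMUL)
        run_on_matrix_unit(op, args);    /* dedicated matmul accelerator */
    else
        run_on_general_unit(op, args);   /* softmax, transpose, etc.     */
}
```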
This embodiment also provides a multi-head attention mechanism fusion calculation distribution device based on an acceleration processor, which is used to implement the foregoing embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
This embodiment provides a multi-head attention mechanism fusion calculation distribution device based on an acceleration processor, applied to the master core of the acceleration processor, where the acceleration processor includes the master core, a memory and a slave core array formed by a plurality of slave cores, and each slave core processes operators of a plurality of operation types. As shown in Fig. 5, the device includes:
The first obtaining module 501 is configured to obtain the slave core information, the data to be processed in the memory, and the computing requirement of the data to be processed.
The association module 502 is configured to perform fusion association on the operators of the slave core based on the calculation requirement and the slave core information, to obtain a first fusion operator, a second fusion operator, a third fusion operator, and calculation logic corresponding to each fusion operator.
The calling module 503 is configured to sequentially call the interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start the slave core, so that the slave core calculates the data to be processed with the corresponding operators according to the calculation logic of each fusion operator in turn, obtaining a calculation result.
In some alternative embodiments, the association module 502 includes:
The extraction unit is configured to extract the operation type of each operator in each slave core from the slave core information.
The fusion unit is configured to fuse and associate the operators according to the calculation requirement and the operation types to obtain the first fusion operator, the second fusion operator and the third fusion operator.
This embodiment further provides a multi-head attention mechanism fusion calculation distribution device based on an acceleration processor, which is used to implement the foregoing embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
This embodiment provides a multi-head attention mechanism fusion calculation distribution device based on an acceleration processor, applied to a slave core of the acceleration processor, where the acceleration processor includes a master core, a memory and a slave core array formed by a plurality of slave cores, and each slave core processes operators of a plurality of operation types. As shown in Fig. 6, the device includes:
The second obtaining module 601 is configured to obtain the interface that starts the fusion operator corresponding to the slave core, and to determine the fusion operator currently to be processed and its corresponding calculation logic. Here, the master core, having acquired the slave core information, the data to be processed in the memory, and the calculation requirements of that data, fuses and associates the operators of the slave core based on the calculation requirements and the slave core information to obtain the first fusion operator, the second fusion operator, the third fusion operator and the calculation logic corresponding to each, and sequentially calls the interfaces corresponding to the first, second and third fusion operators to start the slave core.
The extracting module 602 is configured to extract, from the memory, the current data to be processed corresponding to the fusion operator currently to be processed.
The calculating module 603 is configured to calculate, according to the calculating logic, the current data to be processed by using the corresponding operator to obtain a calculation result, and write the calculation result into the memory.
Further functional descriptions of the above respective modules and units are the same as those of the above corresponding embodiments, and are not repeated here.
The multi-head attention mechanism fusion calculation distribution device based on an acceleration processor of this embodiment is presented in the form of functional units, where a unit may be an ASIC (Application Specific Integrated Circuit), a processor and memory executing one or more software or firmware programs, and/or another device capable of providing the functionality described above.
Further functional descriptions of the above respective modules are the same as those of the above corresponding embodiments, and are not repeated here.
The embodiment of the invention also provides a multi-head attention mechanism fusion calculation distribution system based on the acceleration processor, which comprises the following steps: a master core, a memory and a slave core array formed by a plurality of slave cores, wherein each slave core processes operators of a plurality of operation types;
the master core acquires slave core information, data to be processed in a memory and calculation requirements of the data to be processed;
based on the calculation requirement and the slave core information, carrying out fusion association on operators of the slave core to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator;
sequentially calling interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start a slave core, so that the slave core calculates the data to be processed by utilizing the corresponding operators according to calculation logic of each fusion operator sequentially to obtain a calculation result;
the slave core acquires an interface for starting a fusion operator corresponding to the slave core, and determines a fusion operator to be processed currently and calculation logic corresponding to the fusion operator;
extracting current to-be-processed data corresponding to the current to-be-processed fusion operator from the memory;
and calculating the current data to be processed by utilizing a corresponding operator according to the calculation logic to obtain a calculation result, and writing the calculation result into the memory.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (8)

1. A multi-head attention mechanism fusion calculation distribution method based on an acceleration processor, which is applied to a main core of the acceleration processor, wherein the acceleration processor comprises the main core, a memory and a slave core array formed by a plurality of slave cores, and each slave core processes operators of a plurality of operation types, and the method is characterized by comprising the following steps:
Acquiring slave core information, data to be processed in a memory and calculation requirements of the data to be processed;
based on the calculation requirement and the slave core information, carrying out fusion association on operators of the slave core to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator;
sequentially calling interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start a slave core, so that the slave core calculates the data to be processed by utilizing the corresponding operators according to calculation logic of each fusion operator sequentially to obtain a calculation result;
wherein the performing fusion association on the operators of the slave cores based on the calculation requirement and the slave core information to obtain a first fusion operator, a second fusion operator and a third fusion operator comprises the following steps:
extracting the operation type of each operator in each slave core from the slave core information;
carrying out fusion association on the operators according to the calculation requirement and the operation type to obtain a first fusion operator, a second fusion operator and a third fusion operator;
the slave core further comprises a first buffer space and a second buffer space;
the first buffer space and the second buffer space are used for alternately performing a process of acquiring data from the memory and a data processing process.
2. A multi-headed attention mechanism fusion calculation allocation method based on an acceleration processor, applied to a slave core of the acceleration processor, the acceleration processor comprising a master core, a memory and a slave core array formed by a plurality of slave cores, each slave core processing operators of a plurality of operation types, the method comprising:
acquiring an interface for starting a fusion operator corresponding to the slave core, and determining a fusion operator currently to be processed and its corresponding calculation logic; wherein the master core, by acquiring slave core information, data to be processed in the memory and calculation requirements of the data to be processed, performs fusion association on the operators of the slave core based on the calculation requirements and the slave core information to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator, and sequentially calls interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start the slave core; and wherein the performing fusion association on the operators of the slave core based on the calculation requirements and the slave core information to obtain the first fusion operator, the second fusion operator and the third fusion operator comprises: extracting the operation type of each operator in each slave core from the slave core information; and performing fusion association on the operators according to the calculation requirements and the operation types to obtain the first fusion operator, the second fusion operator and the third fusion operator;
Extracting current to-be-processed data corresponding to the current to-be-processed fusion operator from the memory;
calculating the current data to be processed by utilizing a corresponding operator according to the calculation logic to obtain a calculation result, and writing the calculation result into the memory;
the slave core further comprises a first buffer space and a second buffer space;
the first buffer space and the second buffer space are used for alternately performing a process of acquiring data from the memory and a data processing process.
3. The multi-head attention mechanism fusion calculation distribution method based on an acceleration processor according to claim 2, wherein when the fusion operator to be processed is the first fusion operator, the current data to be processed is the data in the memory corresponding to the first fusion operator and includes first data to be processed and first data to be transposed, and wherein calculating the current data to be processed with the corresponding operator according to the calculation logic to obtain a calculation result and writing the calculation result into the memory comprises:
performing matrix multiplication operation on the first data to be processed to obtain first intermediate data;
performing transposition processing on the first data to be transposed to obtain first transposed data;
performing matrix multiplication operation on the first transposed data to obtain second intermediate data;
and performing matrix multiplication operation based on the first intermediate data and the second intermediate data to obtain a first calculation result, and storing the first calculation result into a memory.
4. The multi-head attention mechanism fusion calculation distribution method based on an acceleration processor according to claim 3, wherein when the fusion operator to be processed is the second fusion operator, the current data to be processed is second data to be processed in the memory corresponding to the second fusion operator together with the first calculation result of the first fusion operator, and wherein calculating the current data to be processed with the corresponding operator according to the calculation logic to obtain a calculation result and writing the calculation result into the memory comprises:
performing matrix multiplication operation on the second data to be processed to obtain third intermediate data;
performing non-matrix multiplication operation on the first calculation result to obtain fourth intermediate data;
and performing matrix multiplication operation based on the third intermediate data and the fourth intermediate data to obtain a second calculation result, and storing the second calculation result into a memory.
5. The multi-head attention mechanism fusion calculation distribution method based on an acceleration processor according to claim 4, wherein when the fusion operator to be processed is the third fusion operator, the current data to be processed is third data to be processed in the memory corresponding to the third fusion operator together with the second calculation result of the second fusion operator, and wherein calculating the current data to be processed with the corresponding operator according to the calculation logic to obtain a calculation result and writing the calculation result into the memory comprises:
And performing matrix multiplication operation on the third data to be processed and the second calculation result to obtain a third calculation result, and storing the third calculation result into a memory.
6. A multi-headed attention mechanism fusion computing and allocation device based on an acceleration processor, applied to a main core of the acceleration processor, the acceleration processor comprising a main core, a memory and a slave core array formed by a plurality of slave cores, each slave core processing operators of a plurality of operation types, the device comprising:
the first acquisition module is used for acquiring the slave core information, the data to be processed in the memory and the calculation requirement of the data to be processed;
the association module is used for performing fusion association on the operators of the slave cores based on the calculation requirement and the slave core information to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator; wherein the performing fusion association on the operators of the slave cores based on the calculation requirement and the slave core information to obtain the first fusion operator, the second fusion operator and the third fusion operator comprises: extracting the operation type of each operator in each slave core from the slave core information; and performing fusion association on the operators according to the calculation requirement and the operation types to obtain the first fusion operator, the second fusion operator and the third fusion operator;
The calling module is used for sequentially calling interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start the slave core, so that the slave core calculates the data to be processed by utilizing the corresponding operators according to the calculation logic of each fusion operator in sequence to obtain a calculation result; the slave core further comprises a first buffer space and a second buffer space; the first buffer space and the second buffer space are used for alternately performing a process of acquiring data from the memory and a data processing process.
7. A multi-headed attention mechanism fusion computing apparatus based on an acceleration processor, for use with a slave core of the acceleration processor, the acceleration processor comprising a master core, a memory and a slave core array of a plurality of slave cores, each slave core processing operators of a plurality of operation types, comprising:
the second acquisition module is used for acquiring an interface for starting a fusion operator corresponding to the slave core and determining a fusion operator currently to be processed and its corresponding calculation logic; wherein the master core, by acquiring slave core information, data to be processed in the memory and calculation requirements of the data to be processed, performs fusion association on the operators of the slave core based on the calculation requirements and the slave core information to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator, and sequentially calls interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start the slave core; and wherein the performing fusion association on the operators of the slave core based on the calculation requirements and the slave core information to obtain the first fusion operator, the second fusion operator and the third fusion operator comprises: extracting the operation type of each operator in each slave core from the slave core information; and performing fusion association on the operators according to the calculation requirements and the operation types to obtain the first fusion operator, the second fusion operator and the third fusion operator;
The extraction module is used for extracting current to-be-processed data corresponding to the current to-be-processed fusion operator from the memory;
the calculation module is used for calculating the current data to be processed by utilizing a corresponding operator according to the calculation logic to obtain a calculation result, and writing the calculation result into the memory; the slave core further comprises a first buffer space and a second buffer space; the first buffer space and the second buffer space are used for alternately performing a process of acquiring data from the memory and a data processing process.
8. A multi-headed attention mechanism fusion computing dispensing system based on an acceleration processor, comprising: a master core, a memory and a slave core array formed by a plurality of slave cores, wherein each slave core processes operators of a plurality of operation types;
the master core acquires slave core information, data to be processed in a memory and calculation requirements of the data to be processed;
based on the calculation requirement and the slave core information, performing fusion association on the operators of the slave core to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator; wherein the performing fusion association on the operators of the slave core based on the calculation requirement and the slave core information to obtain the first fusion operator, the second fusion operator and the third fusion operator comprises: extracting the operation type of each operator in each slave core from the slave core information; and performing fusion association on the operators according to the calculation requirement and the operation types to obtain the first fusion operator, the second fusion operator and the third fusion operator;
Sequentially calling interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start a slave core, so that the slave core calculates the data to be processed by utilizing the corresponding operators according to calculation logic of each fusion operator sequentially to obtain a calculation result; the slave core further comprises a first buffer space and a second buffer space; the first buffer space and the second buffer space are used for alternately carrying out a process of acquiring data from the memory and a data processing process;
the slave core acquires an interface for starting a fusion operator corresponding to the slave core, and determines a fusion operator to be processed currently and calculation logic corresponding to the fusion operator;
extracting current to-be-processed data corresponding to the current to-be-processed fusion operator from the memory;
and calculating the current data to be processed by utilizing a corresponding operator according to the calculation logic to obtain a calculation result, and writing the calculation result into the memory.
CN202310687654.5A 2023-06-12 2023-06-12 Multi-head attention mechanism fusion calculation distribution method based on acceleration processor Active CN116431562B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310687654.5A CN116431562B (en) 2023-06-12 2023-06-12 Multi-head attention mechanism fusion calculation distribution method based on acceleration processor


Publications (2)

Publication Number Publication Date
CN116431562A (en) 2023-07-14
CN116431562B (en) 2023-11-28

Family

ID=87081807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310687654.5A Active CN116431562B (en) 2023-06-12 2023-06-12 Multi-head attention mechanism fusion calculation distribution method based on acceleration processor

Country Status (1)

Country Link
CN (1) CN116431562B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116805155B (en) * 2023-08-25 2024-01-19 太初(无锡)电子科技有限公司 LSTM network processing method, device, equipment and readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095583A (en) * 2016-06-20 2016-11-09 Master-slave core cooperative computing and programming framework based on the new Sunway processor
CN112527393A (en) * 2019-09-18 2021-03-19 无锡江南计算技术研究所 Instruction scheduling optimization device and method for master-slave fusion architecture processor
CN112527262A (en) * 2019-09-19 2021-03-19 无锡江南计算技术研究所 Automatic vector optimization method for non-uniform width of deep learning framework compiler
CN113012023A (en) * 2021-02-22 2021-06-22 中国科学技术大学 Video analysis acceleration method and system based on many-core processor
CN114661460A (en) * 2022-02-15 2022-06-24 无锡江南计算技术研究所 AI framework two-stage parallel acceleration method for heterogeneous many-core processor
CN115203126A (en) * 2022-09-15 2022-10-18 太初(无锡)电子科技有限公司 Operator fusion processing method, device, equipment and storage medium
CN115952393A (en) * 2023-03-13 2023-04-11 山东大学 Forward computing method and system of multi-head attention mechanism based on super computer


Also Published As

Publication number Publication date
CN116431562A (en) 2023-07-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant