CN113901747A - Hardware accelerator with a configurable sparse attention mechanism - Google Patents

Hardware accelerator with a configurable sparse attention mechanism

Info

Publication number
CN113901747A
CN113901747A (application number CN202111197446.4A)
Authority
CN
China
Prior art keywords
matrix
sparse
configurable
module
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111197446.4A
Other languages
Chinese (zh)
Inventor
Yun Liang (梁云)
Liqiang Lu (卢丽强)
Zizhang Luo (罗梓璋)
Yicheng Jin (金奕成)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN202111197446.4A priority Critical patent/CN113901747A/en
Publication of CN113901747A publication Critical patent/CN113901747A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 - Computer-aided design [CAD]
    • G06F 30/30 - Circuit design
    • G06F 30/32 - Circuit design at the digital level
    • G06F 30/33 - Design verification, e.g. functional simulation or model checking
    • G06F 30/3308 - Design verification, e.g. functional simulation or model checking using simulation
    • G06F 30/331 - Design verification, e.g. functional simulation or model checking using simulation with hardware acceleration, e.g. by using field programmable gate array [FPGA] or emulation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 - Digital computers in general; Data processing equipment in general
    • G06F 15/76 - Architectures of general purpose stored program computers
    • G06F 15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7867 - Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • G06F 15/7871 - Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 - Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 - Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/52 - Multiplying; Dividing
    • G06F 7/523 - Multiplying only

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Geometry (AREA)
  • Evolutionary Computation (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a hardware accelerator with a configurable sparse attention mechanism, comprising: a sampled dense-dense matrix multiplication (SDDMM) module, a mask block-packing module, and a configurable sparse matrix multiplication (SpMM) module. The SDDMM module adopts a systolic-array hardware structure; the mask block-packing module comprises a column-index counter, row activation counters, and a buffer; the configurable SpMM module comprises configurable processing elements (PEs), a register array, and dividers, with the configurable processing elements decoupled from the register array. The invention efficiently and dynamically determines the sparsity pattern of the score matrix according to the characteristics of the input matrices, maintains high throughput even at high sparsity, and efficiently and dynamically accelerates the computation of the sparse attention mechanism.

Description

Hardware accelerator with a configurable sparse attention mechanism
Technical Field
The invention relates to hardware accelerators for artificial intelligence applications, and in particular to a hardware accelerator with a configurable sparse attention mechanism, namely a configurable multi-stage systolic-array hardware accelerator for the sparse attention mechanism.
Background
Artificial neural networks based on the attention mechanism occupy an important position in recent machine learning. Document [1] (A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 6000-6010) describes the attention mechanism. The Transformer structure, which has the attention mechanism as its basic building block, achieves excellent performance in a wide range of artificial intelligence tasks, such as language modeling, machine translation, text classification, and text generation in natural language processing, and image captioning, image generation, and image segmentation in computer vision.
The attention mechanism takes three matrices as input, called the query matrix Q, the key matrix K, and the value matrix V. First, a score matrix S is obtained by matrix multiplication of the Q and K matrices; the score matrix is then normalized row by row with the Softmax operation; finally, the normalized score matrix is multiplied with the V matrix to obtain the output matrix. Because the attention mechanism encodes information over the K and V matrices, implementing it requires a large amount of computation, and this cost grows quadratically with the length of the input sequence. For example, in the BERT model described in document [2] (J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in NAACL-HLT, 2019), the number of input tokens can be as high as 16,000, and the corresponding computation reaches 861.9 giga floating-point operations (GFLOPs).
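For reference, the dense attention computation described above can be written as the following short NumPy sketch. The description does not mention the 1/sqrt(d) scaling commonly used in Transformers, so it is omitted here as well; the row-wise maximum subtraction is only a standard numerical-stability measure and does not change the result.

```python
import numpy as np

def dense_attention(Q, K, V):
    """Reference computation of the attention mechanism described above.

    Q: (n, d) query matrix, K: (n, d) key matrix, V: (n, d) value matrix.
    Each score S[i, j] is the dot product of row i of Q with row j of K.
    """
    S = Q @ K.T                               # score matrix
    S = S - S.max(axis=1, keepdims=True)      # numerical stabilization only
    E = np.exp(S)                             # natural exponential
    P = E / E.sum(axis=1, keepdims=True)      # row-wise Softmax normalization
    return P @ V                              # output matrix
```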
The prior art starts from the sparsity of the score matrix S, which can save a large amount of the computation required by the attention mechanism. The sparsity of the score matrix S follows certain patterns that are determined dynamically in software, while the computation of the sparse attention mechanism still requires hardware support. Several dedicated hardware accelerators for the attention mechanism exist. FTRANS, described in document [3] (B. Li, S. Pandey, H. Fang, Y. Lyu, J. Li, J. Chen, M. Xie, L. Wan, H. Liu, and C. Ding, "FTRANS: Energy-efficient acceleration of transformers using FPGA," in Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, 2020), is a field programmable gate array (FPGA) accelerator that uses the fast Fourier transform to accelerate attention. However, this method restricts the weight matrices to block-circulant matrices, so it loses generality for other applications. Document [4] (H. Wang, Z. Zhang, and S. Han, "SpAtten: Efficient sparse attention architecture with cascade token and head pruning," in Proceedings of the International Symposium on High Performance Computer Architecture, 2021) employs cascade pruning techniques to remove some rows and columns of the attention matrix. However, its computation engine is designed for dense parallel matrix multiplication, so it performs poorly on unstructured dynamic sparse matrices. A^3, described in document [5] (T. J. Ham, S. J. Jung, S. Kim, Y. H. Oh, Y. Park, Y. Song, J. H. Park, S. Lee, K. Park, J. W. Lee et al., "A^3: Accelerating attention mechanisms in neural networks with approximation," in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), IEEE, 2020), focuses on approximating the attention mechanism to reduce computation. However, at the hardware level this work only designs a number of dot-product units, so its hardware acceleration capability is limited.
Disclosure of Invention
The invention aims to provide a hardware accelerator with a configurable sparse attention mechanism. With the novel hardware accelerator architecture provided by the invention, the computation of the sparse attention mechanism can be accelerated efficiently and dynamically, and the accelerator can be applied to a variety of artificial intelligence tasks, such as language modeling, machine translation, text classification, and text generation in natural language processing, and image captioning, image generation, and image segmentation in computer vision.
The invention first obtains a low-bit-width score-matrix sample mask through a sampled dense-dense matrix multiplication (SDDMM) module, then blocks and packs the sample mask, and inputs it into a configurable sparse matrix multiplication (SpMM) module. The configurable SpMM module is a score-matrix-stationary dataflow operator. Its characteristic is that the data-movement structure is separated from the processing elements; the sample mask can be used as configuration to dynamically determine the positions of the processing elements in the array, completing the two sparse matrix multiplications and the Softmax operation.
The technical scheme provided by the invention is as follows:
a hardware accelerator configurable sparse attention mechanism, comprising: the device comprises a sampling dense matrix multiplication module, a mask block packaging module and a configurable sparse matrix multiplication module; wherein the content of the first and second substances,
the sampling dense matrix multiplication operation module is used for dynamically performing low-bit matrix multiplication by adopting a hardware structure of a pulse array and outputting a fractional matrix sample mask;
the mask block packaging module comprises a column number counter, a plurality of row activation unit counters and a buffer area; the fractional matrix sample mask is used as input, and the fractional matrix sample mask is received column by column and records column numbers; setting a threshold value, collecting column numbers of comparison results exceeding the threshold value, and packing and storing the column numbers in a buffer area (cache) according to lines; and when the collected number in a certain row reaches the number of the operators in one row of the configurable sparse matrix multiplication module, integrally outputting the cached result to the configurable sparse matrix multiplication module as configuration information.
The configurable sparse matrix multiplication operation module comprises a configurable operation unit PE, a register array and a divider, and is used for realizing a data stream operation mode with a fixed fractional matrix. The configurable operation unit is separated from the register array. In each configurable operation unit, registers in the same row of the register array can be dynamically connected through a multi-selector in the configurable unit, different operation modes are realized through a multi-stage data path, and the output correctness is kept through a void controller.
The above-mentioned hardware accelerator with a configurable sparse attention mechanism is implemented by using a module for dynamically generating a fractional matrix sample mask and a block packing mechanism, instead of a complex module (a sorting module) based on sorting in the prior art; meanwhile, a hardware accelerator with better performance than the existing sparse matrix operation technology is obtained by adopting a data stream with fixed fractional matrix based on a systolic array and a configurable operation unit. The method comprises the following steps:
A. Establish the sampled dense-dense matrix multiplication (SDDMM) module and perform sampled dense-dense matrix multiplication: dynamically perform low-bit-width matrix multiplication with a systolic-array hardware structure and output a score-matrix sample mask. This comprises the following steps:
A1. Take the query matrix Q and the key matrix K as the input of the SDDMM module, and keep only the several most significant bits of Q and K to obtain Q and K matrices represented with a low bit width;
A2. Perform matrix multiplication on the low-bit-width Q and K matrices obtained in step A1 using a systolic array structure;
A3. Perform the Softmax operation on the result of the matrix multiplication;
A4. Compare the result of the Softmax operation against a threshold to obtain a comparison-result matrix, which is output as the score-matrix sample mask. The specific choice of the threshold depends on the target sparsity and is typically between 0.002 and 0.08.
B. Establish the mask block-packing module and perform mask block packing. This comprises the following steps:
B1. Perform block packing on the score-matrix sample mask output by the SDDMM module, receiving it column by column and recording the column index;
B2. Collect the column indices of the comparison-result matrix that exceed the threshold, and pack and store them row by row in the buffer;
B3. When the number collected in a row reaches the number of processing elements in one row of the configurable SpMM module, output the buffered result as a whole to the configurable SpMM module as configuration information.
C. Establish the configurable SpMM module, which uses a score-matrix-stationary dataflow structure and, on the basis of a systolic array, separates the processing elements from the data register array. Each processing element can be dynamically connected to the data registers in its row so as to correspond to the score-matrix sample mask. The processing elements support the following four computation stages, which realize the sparse attention computation (a functional reference sketch of these four stages is given after the step list below):
C1. Sparse score matrix computation stage: input the query matrix Q, the key matrix K, and the mask configuration information, configure the processing elements of each row to the positions where score elements must be computed, compute the sparse score matrix S with an output-stationary dataflow, and keep S inside the processing elements.
C2. Natural exponential computation stage: compute the natural exponential of the sparse score matrix S element by element and store it in place inside the processing elements.
C3. Sparse output matrix computation stage: input the value matrix V and compute the matrix product of the exponentiated sparse score matrix S and the value matrix V with a weight-stationary dataflow. The result is an output matrix that is shifted row by row towards the edge of the processing element array during the computation and is buffered and accumulated at the array edge.
C4. Division stage: divide the buffered output matrix element by element by the corresponding row sums to obtain the output matrix, i.e., the final result of the attention mechanism.
Through the above steps, the hardware accelerator with a configurable sparse attention mechanism is realized.
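As referenced in step C above, the four stages C1-C4 can be summarized by the following functional sketch in Python/NumPy. It assumes the mask has already been packed into per-row lists of active column indices (the name `active_cols` is illustrative and not part of the patent), and it models only the arithmetic, not the cycle-level behavior of the systolic array.

```python
import numpy as np

def sparse_attention_stages(Q, K, V, active_cols):
    """Behavioral sketch of stages C1-C4.

    Q, K, V: (n, d) matrices; active_cols[i] lists the column indices j
    for which S[i, j] must be computed, taken from the score-matrix sample mask.
    """
    n, _ = Q.shape
    out = np.zeros((n, V.shape[1]))
    for i in range(n):
        cols = active_cols[i]
        if not cols:                   # assumes at least one active column per row
            continue
        # C1: sparse score computation, S[i, j] = Q[i, :] . K[j, :]
        s = np.array([Q[i] @ K[j] for j in cols])
        # C2: element-wise natural exponential, kept in place in the PEs
        e = np.exp(s)
        # C3: sparse output, accumulated row by row at the array edge
        z = e @ V[cols, :]
        # C4: division by the row sum (Softmax denominator)
        out[i] = z / e.sum()
    return out
```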
Compared with the prior art, the beneficial effects of the invention are as follows:
The invention provides a hardware accelerator with a configurable sparse attention mechanism that can accelerate the computation of the sparse attention mechanism efficiently and dynamically. The technical advantages of the invention include:
A. The invention efficiently and dynamically determines the sparsity pattern of the score matrix according to the characteristics of the input matrices, saving at least 5.9x of the computation while preserving the accuracy of the output.
B. The invention efficiently accelerates sparse attention computation, with a high throughput of 529 GOP/s, a small chip area of 16.9 mm², and a low power consumption of 2.76 W. At higher sparsity it still maintains high throughput.
Drawings
FIG. 1 is a schematic diagram of the computation pattern of the score-matrix-stationary dataflow of the invention in the dense case;
In the figure, Q(0, k) denotes row 0 of the query matrix and Q(1, k) denotes its row 1; likewise, K(0, k) denotes row 0 of the key matrix, V(k, 0) denotes column 0 of the value matrix, O(0, j) denotes row 0 of the output matrix, and so on. PE denotes a processing element; S(i, j) denotes the element in row i and column j of the score matrix; exp denotes the natural exponential; div denotes a divider.
FIG. 2 is a schematic diagram of the computation pattern of the score-matrix-stationary dataflow of the invention in the sparse case;
In the figure, the black squares in the sparsity pattern indicate elements that need to be computed, and the white squares indicate elements that do not. The remaining labels are as in FIG. 1.
FIG. 3 is a schematic diagram of the configurable sparse matrix multiplication module;
In the figure, ① denotes the registers storing the matrix Q; ② denotes the registers storing the matrices K and V; ③ denotes a configurable processing element; ④ denotes a divider.
FIG. 4 is a schematic diagram of a configurable processing element;
In the figure, ⑤ denotes a multiplexer; ⑥ denotes the score matrix S register; ⑦ denotes the result matrix Z register; ⑧ denotes the natural exponential register and natural exponential operation module for the score matrix S; ⑨ denotes the bubble controller.
Detailed Description
The invention will be further described by way of examples with reference to the accompanying drawings, without limiting the scope of the invention in any way. The invention provides a hardware accelerator with a configurable sparse attention mechanism, comprising: a sampled dense-dense matrix multiplication (SDDMM) module, a mask block-packing module, and a configurable sparse matrix multiplication (SpMM) module. In natural language processing and computer vision tasks, when an artificial neural network containing a Transformer structure performs inference with the attention mechanism, the three input matrices Q, K, and V can be sent to the hardware accelerator provided by the invention and its output matrix received, thereby increasing the computation speed.
In natural language processing, the Q and K matrices are encoding matrices of the words in the text, and V is the encoding matrix of the translated words corresponding to K; in computer vision tasks, the Q and K matrices are image encoding matrices, and V is the transformed image encoding matrix corresponding to K. In a complete artificial neural network, the Q, K, and V matrices of the Transformers in the hidden layers are usually only intermediate data.
In the invention, the SDDMM module is a conventional matrix multiplication accelerator module realized with a systolic array, instead of the complex sorting modules used in the prior art to generate dynamic sparsity patterns. For the input data, the most significant bits of the fixed-point fractional representation are kept, usually 4 bits, giving matrices represented with a low bit width. The low-bit-width matrices are fed into the systolic array to obtain the matrix multiplication result. The result is shifted out of the systolic array and enters the natural exponential operators. After the results are buffered, each element is compared one by one against the product of its row sum and the threshold; the output is 1 if the element exceeds this product and 0 otherwise, yielding a one-bit mask matrix. The specific choice of the threshold depends on the target sparsity and is typically between 0.002 and 0.08.
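A behavioral sketch of this mask-generation flow is given below. The exact fixed-point format is not specified in the description, so "keeping the top 4 bits" is modeled here as truncating to 4 fractional bits; the helper names and the default threshold of 0.01 (within the stated 0.002-0.08 range) are illustrative assumptions.

```python
import numpy as np

def sample_mask(Q, K, threshold=0.01, bits=4):
    """Behavioral sketch of the SDDMM module: low-bit score sampling + thresholding."""

    def truncate(x):
        # keep only the top `bits` fractional bits of a fixed-point representation
        scale = 1 << bits
        return np.floor(x * scale) / scale

    s_low = truncate(Q) @ truncate(K).T          # low-bit-width score estimate
    e = np.exp(s_low)                            # natural exponential of the scores
    row_sum = e.sum(axis=1, keepdims=True)
    # element is kept (mask = 1) when it exceeds threshold * row sum
    return (e > threshold * row_sum).astype(np.uint8)
```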
The mask block-packing module comprises a column-index counter, a set of per-row activation counters, and buffers. The module takes the mask matrix produced by the SDDMM module as input column by column and increments the column-index counter. When an element with value 1 arrives in a row, the current column index is buffered in that row's buffer and that row's activation counter is incremented. When the activation counter of a row reaches the number of processing elements in one row of the SpMM module, the whole buffer is output to the SpMM module as its configuration data.
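The packing behavior can be sketched as follows; the flush of partially filled row buffers at the end of the matrix is an assumption added for completeness, and the data-structure names are illustrative.

```python
from collections import defaultdict

def pack_mask(mask, pes_per_row):
    """Behavioral sketch of the mask block-packing module.

    Scans the one-bit mask column by column; for every 1 it records the column
    index in that row's buffer.  Whenever a row has collected `pes_per_row`
    column indices, the buffered indices are emitted as one configuration block
    for the configurable SpMM module.
    """
    n_rows, n_cols = mask.shape
    buffers = defaultdict(list)          # per-row buffers of column indices
    config_blocks = []                   # packed configuration output
    for col in range(n_cols):            # column-index counter
        for row in range(n_rows):
            if mask[row, col]:
                buffers[row].append(col)             # row activation counter == len(buffer)
                if len(buffers[row]) == pes_per_row:
                    config_blocks.append((row, buffers[row]))
                    buffers[row] = []
    # assumed: flush partially filled rows once the whole mask has been scanned
    for row, cols in buffers.items():
        if cols:
            config_blocks.append((row, cols))
    return config_blocks
```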
The configurable SpMM module comprises configurable processing elements (PEs), a register array, dividers, and bubble controllers. On the basis of a conventional systolic array, the processing elements are separated from the register array. Inside each processing element, the registers in the same row can be dynamically connected through a multiplexer, different operation modes are realized through a multi-stage datapath, and output correctness is maintained by the bubble controller.
The configurable SpMM module implements a score-matrix-stationary dataflow. As shown in FIG. 1, four steps are required in the dense case. In the first step, each row of the matrix Q is fed into the corresponding row of the processing element array, each row starting one cycle later than the previous row; each row of the matrix K is fed into the corresponding column of the array, each column starting one cycle later than the previous column. Each processing element is responsible for the score-matrix element at its own position: in every cycle it multiplies its two inputs, accumulates the product into its locally stored value, and forwards the inputs to its neighbors, the Q data moving to the right and the K data moving downward. After the score matrix S has been computed, the second step computes the natural exponential of S inside each processing element and stores it locally. In the third step, the matrix V is fed into the array in the same way as the matrix K in the first step. Each processing element keeps its S value unchanged, multiplies the incoming V data by it, adds the product to the partial sum received from the processing element on its left, forwards the result to the processing element on its right, and forwards the V data downward. The leftmost processing elements receive 0 as input, and the rightmost processing elements store their outputs in the buffer. After this computation finishes, the fourth step passes the exponentiated S data to the right and accumulates them to obtain the row sums, and the buffered output matrix Z is divided element by element by the corresponding row sums to obtain the output matrix O.
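A single processing element's behavior in the first three steps can be sketched as follows; this is a simplified functional view with illustrative method names, not a cycle-accurate or RTL model.

```python
import math

class ProcessingElement:
    """Simplified functional model of one configurable processing element."""

    def __init__(self):
        self.s = 0.0     # locally held score element S(i, j) (registers ⑥/⑧)
        self.z = 0.0     # partial output forwarded to the right (register ⑦)

    def step1_score(self, q_in, k_in):
        """First step: multiply-accumulate into the local score register.
        The Q operand is forwarded to the right, the K operand downward."""
        self.s += q_in * k_in
        return q_in, k_in          # (to right neighbor, to lower neighbor)

    def step2_exp(self):
        """Second step: replace the local score with its natural exponential."""
        self.s = math.exp(self.s)

    def step3_output(self, v_in, acc_in):
        """Third step: keep S fixed, multiply it with the incoming V operand,
        add the partial sum from the left, and forward the result rightward;
        the V operand is forwarded downward."""
        self.z = acc_in + self.s * v_in
        return v_in, self.z        # (to lower neighbor, to right neighbor)
```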
In the sparse case, as shown in FIG. 2, the sparsity pattern is shown on the left, where the black squares represent the positions of S that must be computed and the white squares represent positions that need not be computed and can directly be treated as 0. The right side is a schematic diagram of the processing element array: the dataflow is the same as in the first and third steps of FIG. 1, but a processing element is configured only at the positions that must be computed according to the sparsity pattern, while the remaining positions use only a register to keep the data flowing.
The overall architecture of the configurable SpMM module is shown in FIG. 3. In the figure, ① denotes the registers storing the matrix Q, which can shift data along a row in a systolic manner. ② denotes the registers storing the matrices K and V, which can shift data downward along a column in a systolic manner. ③ denotes the configurable processing elements; each processing element is connected to all registers ① and ② in its row, so it can access any data in that row and is thus equivalent to being configurable at any position in the row. It is also connected to its right-hand neighbor in the same row as a data output channel, and the rightmost element is connected to a divider ④. The number of configurable processing elements ③ per row is related to the target sparsity; for example, when the sparsity does not exceed 50%, the number of ③ per row is half the number of ①.
The structure and connections of the configurable processing element are shown in FIG. 4, in which the left and right large boxes each represent one configurable processing element. ⑤ denotes a multiplexer, ⑥ the score matrix S register, ⑦ the result matrix Z register, ⑧ the natural exponential register and natural exponential operation module for the score matrix S, and ⑨ the bubble controller. The control signals are C1 to C5. C1 and C2 are position configuration signals: in the first step, C1 and C2 each select the position of the processing element within the row; in the third step, C1 selects reading data from ⑧, while C2 is the same as in the first step. C3 is a stage input control signal, selecting in the first step the data read from ⑥ and in the third step the partial sum from the left-hand side or the constant 0 at the leftmost position. C4 is a stage output control signal: in the first step the output is stored into ⑥, and in the third step into ⑦. C5 is a bubble control signal for a shift-register FIFO of controllable length; in the sparse mode its length is configured to the distance between the current processing element and the processing element on its right, so as to keep the accumulation of partial results correct. It is connected to the right-hand processing element and is used for the row summation in the fourth step.
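The bubble controller (⑨, signal C5) can be modeled as a shift-register FIFO of configurable depth; the sketch below is a behavioral approximation, and the convention that a depth of 0 means no extra delay is an assumption.

```python
from collections import deque

class BubbleController:
    """Sketch of the bubble controller: a shift-register FIFO whose depth equals
    the distance, in register columns, between this processing element and the
    one configured to its right, so that rightward-moving partial sums stay
    aligned with the downward-moving V operands."""

    def __init__(self, distance):
        # pre-fill with zero "bubbles"; depth 0 passes values through unchanged
        self.fifo = deque([0.0] * distance)

    def push(self, partial_sum):
        """Insert a new partial sum; return the value delayed by `distance` cycles."""
        self.fifo.append(partial_sum)
        return self.fifo.popleft()
```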
It is noted that the disclosed embodiments are intended to aid further understanding of the invention, but those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the disclosed embodiments; the scope of protection of the invention is defined by the appended claims.

Claims (10)

1. A hardware accelerator with a configurable sparse attention mechanism, characterized by comprising: a sampled dense-dense matrix multiplication (SDDMM) module, a mask block-packing module, and a configurable sparse matrix multiplication (SpMM) module; wherein:
the SDDMM module adopts a systolic-array hardware structure to dynamically perform low-bit-width matrix multiplication and output a score-matrix sample mask;
the mask block-packing module comprises a column-index counter, a set of per-row activation counters, and a buffer; it takes the score-matrix sample mask output by the SDDMM module as input, receives it column by column while recording the column index, collects the column indices whose comparison results exceed the threshold, and packs and stores them row by row in the buffer; when the number collected in a row reaches the number of processing elements in one row of the configurable SpMM module, the buffered result is output as a whole to the configurable SpMM module as configuration information;
the configurable SpMM module comprises configurable processing elements (PEs), a register array, and dividers, and implements a score-matrix-stationary dataflow; the configurable processing elements are separated from the register array; the registers in the same row of the register array support different operation modes through a multi-stage datapath, and output correctness is maintained by a bubble controller.
2. The hardware accelerator with a configurable sparse attention mechanism of claim 1, wherein the registers in the same row of the register array are dynamically connected through a multiplexer in each configurable processing element.
3. The hardware accelerator with a configurable sparse attention mechanism of claim 1, wherein the hardware accelerator is applicable to artificial intelligence tasks including natural language processing, machine translation, text classification and generation, and image processing.
4. The hardware accelerator with a configurable sparse attention mechanism of claim 1, wherein the hardware accelerator is obtained by using a module that dynamically generates the score-matrix sample mask together with a block-packing mechanism, while adopting a score-matrix-stationary dataflow based on a systolic array and configurable processing elements; comprising the following steps:
A. Establish the SDDMM module and perform sampled dense-dense matrix multiplication:
take the query matrix Q and the key matrix K as the input of the SDDMM module, dynamically perform low-bit-width matrix multiplication with a systolic-array hardware structure, and output a score-matrix sample mask;
B. Establish the mask block-packing module and perform mask block packing, comprising the following steps:
B1. Perform block packing on the score-matrix sample mask output by the SDDMM module, receiving it column by column and recording the column index;
B2. Set a threshold, collect the column indices of the comparison-result matrix that exceed the threshold, and pack and store them row by row in the buffer;
B3. When the number collected in a row reaches the number of processing elements in one row of the configurable SpMM module, output the buffered result as a whole to the configurable SpMM module as configuration information;
C. Establish the configurable SpMM module, use a score-matrix-stationary dataflow structure in the systolic array, and separate the processing elements from the data register array; each processing element is dynamically connected to the data registers in its row so as to correspond to the score-matrix sample mask; the processing elements realize the sparse attention computation through the following four stages:
C1. Sparse score matrix computation stage:
input the query matrix Q, the key matrix K, and the mask configuration information, configure the processing elements of each row to the positions where score elements must be computed, compute the sparse score matrix S with an output-stationary dataflow, and keep S inside the processing elements;
C2. Natural exponential computation stage:
compute the natural exponential of the sparse score matrix S element by element and store it in place inside the processing elements;
C3. Sparse output matrix computation stage:
input the value matrix V and compute the matrix product of the exponentiated sparse score matrix S and the value matrix V with a weight-stationary dataflow;
C4. Division stage:
divide the buffered output matrix element by element by the corresponding row sums to obtain the output matrix, i.e., the final result of the attention mechanism;
through the above steps, the hardware accelerator with a configurable sparse attention mechanism is realized.
5. The hardware accelerator with a configurable sparse attention mechanism of claim 4, wherein step A comprises the following steps:
A1. Take the query matrix Q and the key matrix K as the input of the SDDMM module, and keep only the several most significant bits of Q and K to obtain Q and K matrices represented with a low bit width;
A2. Perform matrix multiplication on the low-bit-width Q and K matrices obtained in step A1 using a systolic array structure;
A3. Perform the Softmax operation on the result of the matrix multiplication;
A4. Compare the result of the Softmax operation against a threshold to obtain a comparison-result matrix, which is output as the score-matrix sample mask.
6. The hardware accelerator with a configurable sparse attention mechanism of claim 5, wherein the specific choice of the threshold depends on the sparsity and is typically between 0.002 and 0.08.
7. The hardware accelerator with a configurable sparse attention mechanism of claim 4, wherein in step C3 the result of the matrix multiplication is an output matrix that is shifted row by row towards the edge of the processing element array during the computation and is buffered and accumulated at the array edge.
8. The hardware accelerator with a configurable sparse attention mechanism of claim 4, wherein in natural language processing and computer vision tasks the input matrices, comprising a query matrix Q, a key matrix K, and a value matrix V, are sent to the hardware accelerator with the configurable sparse attention mechanism, and an output matrix is obtained from the hardware accelerator.
9. The hardware accelerator with a configurable sparse attention mechanism of claim 4, wherein in natural language processing the Q and K matrices are both encoding matrices of the words in the text, and V is the encoding matrix of the translated words corresponding to K.
10. The hardware accelerator with a configurable sparse attention mechanism of claim 4, wherein in the computer vision task the Q and K matrices are image encoding matrices, and V is the transformed image encoding matrix corresponding to K.
CN202111197446.4A 2021-10-14 2021-10-14 Hardware accelerator capable of configuring sparse attention mechanism Pending CN113901747A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111197446.4A CN113901747A (en) 2021-10-14 2021-10-14 Hardware accelerator capable of configuring sparse attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111197446.4A CN113901747A (en) 2021-10-14 2021-10-14 Hardware accelerator capable of configuring sparse attention mechanism

Publications (1)

Publication Number Publication Date
CN113901747A true CN113901747A (en) 2022-01-07

Family

ID=79192129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111197446.4A Pending CN113901747A (en) 2021-10-14 2021-10-14 Hardware accelerator capable of configuring sparse attention mechanism

Country Status (1)

Country Link
CN (1) CN113901747A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination