CN114819127A - FPGA-based backpressure-indexed combination computing unit - Google Patents

FPGA-based backpressure-indexed combination computing unit

Info

Publication number
CN114819127A
CN114819127A
Authority
CN
China
Prior art keywords: weight, matrix, buffer, slice, node feature
Prior art date
Legal status
Granted
Application number
CN202210482666.XA
Other languages
Chinese (zh)
Other versions
CN114819127B (en)
Inventor
黄以华
许圣钧
Current Assignee
Sun Yat-sen University
Original Assignee
Sun Yat-sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat-sen University
Priority to CN202210482666.XA
Publication of CN114819127A
Application granted
Publication of CN114819127B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to an FPGA (field-programmable gate array)-based backpressure-indexed combination computing unit aimed at the combination computation stage of a graph neural network (GNN). Several sparse vertex feature vectors are merged and encoded so that each weight data slice fed into the computing unit can be indexed; the indexed values are then multiplied and accumulated into intermediate result registers to complete the GNN combination-stage computation. The method fully exploits the sparsity of the node feature vectors, greatly shortens the time required for the first combination layer under the constraint of limited FPGA on-chip computing resources, reduces pipeline stalls, and requires no complex control logic.

Description

FPGA-based backpressure-indexed combination computing unit
Technical Field
The invention relates to the technical field of machine learning, and in particular to an FPGA (field-programmable gate array)-based backpressure-indexed combination computing unit.
Background
Graph neural networks (GNNs) hold clear advantages over deep convolutional neural networks (DCNNs) when processing non-Euclidean data, and are therefore widely used in node classification, natural language processing, recommendation systems, graph clustering, and link prediction. Unlike the DCNN algorithm, which already has mature deployment schemes, deploying GNN algorithms still faces many problems. Compared with DCNNs, GNNs incur higher computation, storage, and bandwidth overheads, and introduce irregular computation and memory-access patterns. Acceleration techniques designed for DCNNs, such as nested-loop optimization, quantization strategies, and the Winograd algorithm, are therefore difficult to migrate directly into GNN accelerator designs.
As graphs are among the most widespread data formats, the combination of graph data and neural networks inevitably drives a large amount of dedicated hardware architecture design, just as image-centric DCNN applications catalyzed the development of GPUs. A GNN accelerator, i.e., a domain-specific system architecture for graph data processing, therefore has notable practical engineering value and academic research value. Compared with DCNNs, GNN computation is not efficient. Although the matrices involved in GNN computation are large, the fraction of non-zero elements in the adjacency matrix required for the computation is often only on the order of 10⁻³ to 10⁻², and they are irregularly distributed, so GNN computation speed is limited by problems such as low utilization of computing resources and load imbalance. When designing a GNN accelerator, one cannot focus solely on throughput; what matters more is scheduling data efficiently so as to raise the utilization of computing resources.
Most existing GNN accelerators are specially optimized for the aggregation stage of the GNN, whose memory accesses are irregular, while directly reusing existing DCNN techniques such as systolic arrays for the computation-intensive combination stage. However, the vectors in the combination stage are both high-dimensional and highly sparse, so directly applying DCNN computation methods wastes a large amount of computing resources and takes a long time. By exploiting the data characteristics of the combination-stage computation, computing resources and computation time can be saved, pipeline stalls reduced, resource utilization raised, and accelerator performance improved.
Disclosed in the prior art is a neural network unit (NNU) configured to convolve an input of H rows × W columns × C channels with F filters, each of R rows × S columns × C channels, to generate F outputs, each of Q rows × P columns. The NNU comprises: a first memory holding rows of N words logically divided into G input blocks of B words each; a second memory holding rows of N words logically divided into G filter blocks of B words each, where B is the smallest factor of N greater than W and N is at least 512; and an array of N processing units (PUs), each PU having an accumulator, a register configured to receive a respective word of the N words from a row of the second memory, a multiplexing register configured to selectively receive a respective word from a row of the first memory or a word rotated from the multiplexing register of a logically adjacent PU, and an arithmetic logic unit coupled to the accumulator, the register, and the multiplexing register. The N PUs are logically divided into G PU blocks of B PUs each. The input blocks are held in H rows of the first memory, each such row holding a respective two-dimensional slice of the corresponding input row; the slices are held in at least C of the G input blocks, each of which holds a row of words of the slice specified by a respective one of the C channels. The filter blocks are stored in R×S×C rows of the second memory; each of the F filter blocks in each such row stores P copies of the weight of the corresponding filter at the corresponding row, column, and channel. To convolve the input with the filters, the G PU blocks perform multiply-accumulate operations on the input blocks and filter blocks in column-channel-row order: they read one of the H rows of the at least C input blocks from the first memory and rotate it around the N PUs during part of the multiply-accumulate operations, so that each of the F PU blocks receives each of the at least C input blocks of that row before another of the H rows is read from the first memory. Applied to a GNN accelerator, this scheme still wastes a great deal of computing resources.
Disclosure of Invention
The invention provides an FPGA (field-programmable gate array)-based backpressure-indexed combination computing unit to improve the computing efficiency of the GNN combination stage.
In order to solve the technical problems, the technical scheme of the invention is as follows:
an FPGA-based backpressure-indexed combinatorial computation unit for computing combinatorial phases of a graph neural network into which a weight matrix slice is input per clock cycle, the combinatorial computation unit comprising an index buffer, a range buffer, a control unit, a weight tiling buffer, a multiplier, an accumulator, and m intermediate result registers, wherein:
the index buffer is used for storing the encoded non-zero elements of the node feature matrix;
the range buffer is used for determining the numbering range of the data in the weight matrix slice input in the current clock cycle;
the control unit judges whether the index number at the top of the index buffer falls within the numbering range of the range buffer; if so, it indexes the corresponding weight datum out of the weight matrix slice and feeds it into the multiplier together with the value of the corresponding non-zero element from the index buffer, and the accumulator accumulates the multiplier's result into the intermediate result register corresponding to the node number of that non-zero element;
the weight tile buffer is used for storing weight matrix slices that cannot be processed in the current clock cycle; after the control unit finishes its current task, it judges whether the weight matrix slices in the weight tile buffer still contain data to be indexed; if not, the slice is discarded, and if so, the indexed data are sent to the multiplier and accumulator for computation, until the weight tile buffer is empty;
and after all weight matrix slices have been sent and processed, the data in the m intermediate result registers are the final result of the combination computation.
Preferably, the node feature matrix is composed of the adjacency matrix and the node feature vectors, wherein the adjacency matrix represents the connection relations between nodes and each node feature vector represents the features of one node.
Preferably, the node feature matrix needs to be sliced, wherein the slicing of the node feature vectors only needs to correspond to the slicing of the adjacency matrix, specifically:
the node feature vectors of the nodes involved in a given slice of the adjacency matrix are taken out as a node feature vector slice.
Preferably, when the index buffer stores data, node feature vectors with the same number in different node-feature-vector slices are merged and arranged in ascending order of the in-row coordinates of the encoded non-zero elements. The same number in different node-feature-vector slices specifically means:
once the slice size of the adjacency matrix is determined, every node-feature-vector slice contains the same number of nodes; assuming N node feature vectors per slice, the nodes within each slice are numbered 0 to N-1, and node feature vectors bearing the same number in different slices are called same-numbered node feature vectors.
Preferably, the non-zero elements in the node feature matrix are encoded as follows:
each non-zero element in the node feature matrix is represented by a triple comprising a row, a column, and a value, where the row denotes the node number and the column denotes the coordinate of the non-zero element within its row.
Preferably, the numbering range of the range buffer is incremented every clock cycle by an amount equal to the number of weight data in a weight matrix slice.
Preferably, the control unit judges whether the top index number of the index buffer is within the numbering range of the range buffer as follows:
if the index value issued by the index buffer is smaller than the current value in the range buffer, the current weight matrix slice is determined to contain data satisfying that index value.
An FPGA-based backpressure-indexed combination computing system, comprising a preprocessing module, an array of combination computing units, and a plurality of weight memories, wherein:
the preprocessing module encodes and stores the non-zero elements of the node feature matrix;
the array of combination computing units comprises M×N FPGA-based backpressure-indexed combination computing units; the data formed from the non-zero elements of every m node feature vectors are input into one combination computing unit, and each combination computing unit in the same column is responsible for the combination computation of m node feature vectors;
the weight memories store the slices of the weight matrix; the slices of the weight matrix are broadcast to every combination computing unit in the same column of the array, one weight matrix slice per clock cycle; the number of weight memories equals the number of columns of the weight matrix, and each weight memory stores the data of one column of the weight matrix.
Preferably, the preprocessing module further obtains the sparsity of the node feature matrix through static statistics on the node feature matrix and, according to that sparsity, determines the slice fusion number m of the node feature vectors and the slice size of the weight matrix.
Preferably, when the weight tile buffer is full of data, the Full signal of the weight tile buffer is pulled high, the combination computing unit sends a backpressure signal to the weight memory, the transmission of weight matrix slice data and the operation of the system's subsequent modules are suspended, and the system waits until the exceptional condition of the current combination computing unit is resolved, that is, until the computation task of the last weight matrix slice is fully completed or the Full signal of the weight tile buffer is pulled low.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
1) The invention fully exploits the sparsity of the node feature vectors in the first combination layer of the GNN and completes the combination computation in a fixed number of cycles using the backpressure-index method, saving a large amount of time on high-dimensional vector computation while reducing the hardware overhead of each computing unit, so that the computing array can achieve higher parallelism.
2) The invention merges the non-zero elements of m node feature vectors, so the time-multiplexed computing unit achieves m times the parallelism of the unmerged case and supplies more ready vertices to the subsequent aggregation stage; under suitable parameters this effectively raises the utilization of the aggregation computing units and reduces pipeline stalls.
3) The invention makes the combination computation time depend on the slice size of the weight data, so the computation times of all computing units tend to be equal, providing a regular data flow for the subsequent aggregation stage with simple control logic. Meanwhile, by adjusting the slice size, the hardware resource overhead and the combination computation time can be traded off freely.
Drawings
FIG. 1 is a schematic diagram of the backpressure-indexed combination computing unit of the present invention.
FIG. 2 is a schematic diagram of the data flow of the backpressure-indexed combination computing unit.
FIG. 3 is a block diagram of the backpressure-indexed combination computing system.
FIG. 4 is a diagram illustrating the utilization of the computing units and the performance improvement of the accelerator under different parameters when the backpressure-indexed combination computing system is used.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
This embodiment provides an FPGA-based backpressure-indexed combination computing unit (CPE), as shown in FIG. 1. The combination computing unit computes the combination stage of a graph neural network, and one weight matrix slice is input into it per clock cycle. The unit comprises an index buffer (Index Buffer), a range buffer (Range Buffer), a control unit (Control Unit), a weight tile buffer (Weight Tile Buffer), a multiplier, an accumulator, and m intermediate result registers (Reg in the figure denotes the set of m intermediate result registers), wherein:
the index buffer is used for storing the encoded non-zero elements of the node feature matrix;
the range buffer is used for determining the numbering range of the data in the weight matrix slice input in the current clock cycle;
the control unit judges whether the index number at the top of the index buffer falls within the numbering range of the range buffer; if so, it indexes the corresponding weight datum out of the weight matrix slice and feeds it into the multiplier together with the value of the corresponding non-zero element from the index buffer, and the accumulator accumulates the multiplier's result into the intermediate result register corresponding to the node number of that non-zero element;
the weight tile buffer is used for storing weight matrix slices that cannot be processed in the current clock cycle; after the control unit finishes its current task, it judges whether the weight matrix slices in the weight tile buffer still contain data to be indexed; if not, the slice is discarded, and if so, the indexed data are sent to the multiplier and accumulator for computation, until the weight tile buffer is empty;
and after all weight matrix slices have been sent and processed, the data in the m intermediate result registers are the final result of the combination computation.
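To make the data path concrete, the following is a minimal behavioural sketch of one such unit in Python. It is an illustration, not the patent's implementation: the class and method names, the tile-buffer depth, and the (node, column, value) entry layout are assumptions, and the model serves at most one index per clock, as a single multiplier would.
```python
from collections import deque

class CPE:
    """Behavioural sketch of one backpressure-indexed combination computing
    unit. Assumed layout: index-buffer entries are (node, column, value)
    triples, merged from m node feature vectors and sorted by column."""

    def __init__(self, m, slice_size, tile_depth=8):
        self.index_buf = deque()        # encoded non-zero elements
        self.range_top = 0              # range buffer: highest number covered
        self.slice_size = slice_size
        self.tile_buf = deque()         # weight tile buffer
        self.tile_depth = tile_depth    # assumed depth, not given in the patent
        self.regs = [0.0] * m           # m intermediate result registers

    @property
    def full(self):
        # Full signal: asserted when the weight tile buffer is full; the
        # caller must back-pressure the weight memory while it is high.
        return len(self.tile_buf) >= self.tile_depth

    def clock(self, new_slice=None):
        """One clock cycle: optionally accept one weight slice, then serve
        at most one index (one multiply-accumulate)."""
        if new_slice is not None:       # caller should have checked `full`
            self.tile_buf.append((self.range_top, new_slice))
            self.range_top += self.slice_size   # range buffer increments
        # discard leading slices that hold no data left to index
        while self.tile_buf and not (
            self.index_buf
            and self.index_buf[0][1] < self.tile_buf[0][0] + self.slice_size
        ):
            self.tile_buf.popleft()
        if self.tile_buf:               # control unit: index hit in range
            base, w = self.tile_buf[0]
            node, col, value = self.index_buf.popleft()
            self.regs[node] += value * w[col - base]   # multiplier + accumulator
```
A driver clocks the unit once per cycle, holding off new slices while `full` is high; once every slice has been clocked in and the tile buffer has drained, `regs` holds the combination result of the m merged nodes.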
Example 2
This example continues to disclose the following on the basis of example 1:
the node feature matrix is composed of an adjacency matrix and a node feature vector, wherein the adjacency matrix is used for representing the connection relation between nodes, and the node feature vector is used for representing the feature of each node.
The node feature matrix needs to be sliced, wherein the slicing of the node feature vectors only needs to correspond to the slicing of the adjacency matrix, specifically:
the node feature vectors of the nodes involved in a given slice of the adjacency matrix are taken out as a node feature vector slice.
In a GNN, the adjacency matrix represents the connection relations between nodes, the node feature vectors represent the features of each node, and the core of GCN inference is the updating of the node feature vectors. Because both storage and computing resources on the FPGA chip are limited, the node feature matrix formed by the adjacency matrix and the node feature vectors must be sliced before being deployed on chip. As shown in FIG. 2, when the index buffer stores data, node feature vectors with the same number in different node-feature-vector slices are merged: if the merge count is m, the m node feature vectors are merged and then arranged in ascending order of the in-row coordinates of the encoded non-zero elements, and the data formed from the non-zero elements of every m node feature vectors are input into one backpressure-indexed combination computing unit. The same number in different node-feature-vector slices specifically means the following:
each slice of the adjacency matrix corresponds to part of the connection relations of the whole graph, i.e., a subgraph, and involves only some of the nodes. Therefore, the node-feature-vector slices only need to correspond to the adjacency matrix slices. Once the slice size of the adjacency matrix is determined, every node-feature-vector slice contains the same number of nodes; assuming N node feature vectors per slice, the nodes within each slice are numbered 0 to N-1, and node feature vectors bearing the same number in different slices are called same-numbered node feature vectors.
The non-zero elements in the node feature matrix are encoded as follows:
each non-zero element in the node feature matrix is represented by a triple comprising a row, a column, and a value, where the row denotes the node number and the column denotes the coordinate of the non-zero element within its row.
Across the whole GCN inference process, the vector dimension of the multiplications in the first-layer combination stage is extremely large, so the COO encoding mainly targets the raw input data. The input to the second layer's combination computation is the first layer's result, whose vector dimension is greatly reduced and which is dense, so it can be computed directly without encoding.
Since the amount of weight data sent by the weight memory per clock cycle is constant, the numbering range of the range buffer is incremented each clock cycle by an amount equal to the number of weight data in a weight matrix slice.
The control unit judges whether the top index number of the index buffer is within the numbering range of the range buffer as follows:
if the index value issued by the index buffer is smaller than the current value in the range buffer, the current weight matrix slice is determined to contain data satisfying that index value.
After the combination computation starts, the encoded non-zero elements are stored in the Index Buffer. Each weight slice covers a range of data numbers. If the index number at the top of the Index Buffer falls within the numbering range of the currently received weight slice, the corresponding weight datum in the slice is indexed and computed by the downstream arithmetic units, and the top entry of the Index Buffer is popped. When several data in one slice must be indexed, multiple clock cycles are needed to finish that slice, yet a new slice arrives every clock cycle; this creates backpressure. While the Control Unit has not finished indexing the previous slice, newly arriving slices are stored in the Weight Tile Buffer. When the Control Unit finishes its current task, it begins to judge whether the slices in the Weight Tile Buffer contain data to be indexed; if not, the slice is discarded, and if so, it is indexed, until the Weight Tile Buffer is empty.
Since one combination computing unit is responsible for the combination computation of m nodes simultaneously, the Control Unit decodes from each entry sent by the Index Buffer not only the coordinate of the non-zero element but also the node it belongs to and its value. After the corresponding weight datum is fetched from the weight slice according to the index value, the weight datum and the non-zero element's value are fed into the multiplier together, and the multiplier's result is accumulated into the intermediate result register corresponding to the node number of the non-zero element. When all weight slices have been sent, the data in the m intermediate result registers are the final result of the combination computation.
FIG. 2 shows the case where the adjacency-matrix slice size is 128 and the merge count m is 3. Let the node feature vector dimension be L. The in-row coordinates of the non-zero elements of node 0 are 0, 3, and L-4; those of node 128 are 3 and L-1; that of node 256 is 5. Merging the non-zero elements of the three node feature vectors and sorting them in ascending order of in-row coordinate yields the sequence shown on the right of the figure: (0,0), (0,3), (128,3), (256,5), (0,L-4), (128,L-1). This sequence is then placed into the Index Buffer. Assume the weight slice size is 4 and there are T slices, with one slice entering the backpressure-indexed computing unit per clock cycle. After the combination computation starts, the first weight slice arrives in the first clock cycle and covers data numbers 0 to 3. The index value at the top of the Index Buffer is 0, which satisfies the range, so datum 0 of the slice is fetched, multiplied by the non-zero element value in the Index Buffer's top entry, and accumulated into the intermediate result register of node 0 according to the node number. The indexing task of the top entry is then complete and the entry is popped. Slice 0 cannot be discarded yet, because the new top index value is 3 and slice 0 must be indexed again in the next clock cycle. Since slice 1 has already arrived, it is cached in the Weight Tile Buffer; once slice 0 has no more data to index, the Control Unit fetches the accumulated slices from the Weight Tile Buffer and computes them. After T clock cycles, all weight slices have been sent, and because the sparsity of the first layer's node feature vectors is high, most computing units have finished their computation tasks by then.
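This walk-through can be checked end to end with a short functional simulation. It is a sketch under assumed values (L = 16, so T = 4 slices of size 4; the weight column and non-zero values are made up), and the inner while-loop collapses the multi-cycle tile-buffer backpressure into one step: in hardware only one index is served per clock and late slices wait in the weight tile buffer.
```python
from collections import deque

L, SLICE, m = 16, 4, 3
# merged COO stream of FIG. 2, with nodes 0/128/256 mapped to 0/1/2
index_buf = deque([(0, 0, 1.0), (0, 3, 2.0), (1, 3, 4.0),
                   (2, 5, 6.0), (0, L - 4, 3.0), (1, L - 1, 5.0)])
weights = [0.1 * k for k in range(L)]   # one weight-matrix column (made up)
regs = [0.0] * m                        # m intermediate result registers
tile_buf = deque()                      # weight tile buffer
range_top = 0                           # range buffer

for t in range(L // SLICE):             # one weight slice per cycle, T = 4
    base = range_top
    range_top += SLICE                  # range buffer increments each cycle
    tile_buf.append((base, weights[base:base + SLICE]))
    # control unit: index everything the buffered slices can serve
    while tile_buf:
        b, w = tile_buf[0]
        if index_buf and index_buf[0][1] < b + SLICE:
            node, col, val = index_buf.popleft()
            regs[node] += val * w[col - b]   # multiplier + accumulator
        else:
            tile_buf.popleft()               # no data left to index: discard

print(regs)   # ≈ [4.2, 8.7, 3.0] for the three merged nodes
```
The same numbers fall out of a dense matrix-vector product over the three feature vectors, which is a quick way to sanity-check the indexed data path.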
Example 3
On the basis of Embodiments 1 and 2, this embodiment further discloses an FPGA-based backpressure-indexed combination computing system. As shown in FIG. 3, the system comprises a preprocessing module, an array of combination computing units, and a plurality of weight memories, wherein:
the preprocessing module encodes and stores non-zero elements in the node feature matrix;
the array of combination computing units comprises M×N FPGA-based backpressure-indexed combination computing units as described in Embodiment 1; the data formed from the non-zero elements of every m node feature vectors are input into one combination computing unit, and each combination computing unit in the same column is responsible for the combination computation of m node feature vectors;
the weight memories store the slices of the weight matrix; the slices of the weight matrix are broadcast to every combination computing unit in the same column of the array, one weight matrix slice per clock cycle; the number of weight memories equals the number of columns of the weight matrix, and each weight memory stores the data of one column of the weight matrix.
Considering that the sparsity of the node feature vectors differs across data sets, the preprocessing module obtains the sparsity of the node feature matrix through static statistics on the node feature matrix and, according to that sparsity, determines the slice fusion number m of the node feature vectors and the slice size of the weight matrix.
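How the static statistics could drive the parameter choice is sketched below. The selection policy (aiming at roughly one indexed non-zero per weight slice so the tile buffer rarely fills) is an illustrative assumption of ours; the patent states only that m and the slice size are derived from the measured sparsity.
```python
import numpy as np

def choose_parameters(feature_matrix, candidate_m=(2, 3, 4, 6, 8),
                      candidate_slices=(4, 8, 16, 32)):
    """Illustrative preprocessing pass: measure density statically, then
    pick the merge count m and weight slice size whose expected number of
    indexed non-zeros per slice is closest to one MAC per cycle."""
    density = np.count_nonzero(feature_matrix) / feature_matrix.size
    best = None
    for m in candidate_m:
        for s in candidate_slices:
            expected_hits = m * density * s   # non-zeros indexed per slice
            cost = abs(expected_hits - 1.0)
            if best is None or cost < best[0]:
                best = (cost, m, s)
    _, m, s = best
    return m, s

# e.g. a 0.5%-dense feature matrix (made-up shape) favours large m
demo = (np.random.rand(1024, 1433) < 0.005).astype(np.float32)
print(choose_parameters(demo))
```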
Because the non-zero elements of the node feature vectors are irregularly distributed, this computation scheme has two exceptional cases: 1) the last slice of an individual computing unit still has several data to be indexed, so that unit cannot finish within T clock cycles; 2) the weight tile buffer becomes full of data, its Full signal is pulled high, the combination computing unit sends a backpressure signal to the weight memory, the transmission of weight matrix slice data and the operation of the system's subsequent modules are suspended, and the system waits until the exceptional condition of the current combination computing unit is resolved, that is, until the computation task of the last weight matrix slice is fully completed or the Full signal of the weight tile buffer is pulled low. According to experiments, however, these exceptional cases occur with low probability and do not stall the system pipeline in most cases.
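The column-broadcast and Full-signal stall can be sketched as driver code for the CPE class given in Embodiment 1's sketch (any object with `full`, `clock`, and `tile_buf` works). The function name and loop structure are assumptions for illustration; in hardware the stall freezes the whole column in lockstep, which this sequential model only approximates.
```python
def broadcast_column(units, weight_memory, slice_size):
    """Drive one column of the array: each column owns one weight memory,
    and every slice read from it is broadcast to all M units of the column."""
    for base in range(0, len(weight_memory), slice_size):
        tile = weight_memory[base:base + slice_size]
        for cpe in units:            # same slice to all M units of the column
            while cpe.full:          # Full high: stall the weight memory and
                cpe.clock()          # let the unit drain its tile buffer
            cpe.clock(tile)
    for cpe in units:                # flush any tiles still buffered
        while cpe.tile_buf:
            cpe.clock()
```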
FIG. 4 shows the utilization of the combination computing units and the aggregation computing units under different slice sizes and merge counts m, and the performance improvement of the accelerator under different parameters when the proposed method is used.
The same or similar reference numerals correspond to the same or similar parts;
the terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. An FPGA-based backpressure-indexed combination computing unit for computing the combination stage of a graph neural network, wherein one weight matrix slice is input into the combination computing unit per clock cycle, and the combination computing unit comprises an index buffer, a range buffer, a control unit, a weight tile buffer, a multiplier, an accumulator, and m intermediate result registers, wherein:
the index buffer is used for storing the encoded non-zero elements of the node feature matrix;
the range buffer is used for determining the numbering range of the data in the weight matrix slice input in the current clock cycle;
the control unit judges whether the index number at the top of the index buffer falls within the numbering range of the range buffer; if so, it indexes the corresponding weight datum out of the weight matrix slice and feeds it into the multiplier together with the value of the corresponding non-zero element from the index buffer, and the accumulator accumulates the multiplier's result into the intermediate result register corresponding to the node number of that non-zero element;
the weight tile buffer is used for storing weight matrix slices that cannot be processed in the current clock cycle; after the control unit finishes its current task, it judges whether the weight matrix slices in the weight tile buffer still contain data to be indexed; if not, the slice is discarded, and if so, the indexed data are sent to the multiplier and accumulator for computation, until the weight tile buffer is empty;
and after all weight matrix slices have been sent and processed, the data in the m intermediate result registers are the final result of the combination computation.
2. The FPGA-based backpressure-indexed combination computing unit of claim 1, wherein the node feature matrix is composed of the adjacency matrix and the node feature vectors, the adjacency matrix representing the connection relations between nodes and each node feature vector representing the features of one node.
3. The FPGA-based backpressure-indexed combination computing unit of claim 2, wherein the node feature matrix needs to be sliced, and the slicing of the node feature vectors only needs to correspond to the slicing of the adjacency matrix, specifically:
the node feature vectors of the nodes involved in a given slice of the adjacency matrix are taken out as a node feature vector slice.
4. The FPGA-based backpressure-indexed combination computing unit of claim 3, wherein, when the index buffer stores data, node feature vectors with the same number in different node-feature-vector slices are merged and arranged in ascending order of the in-row coordinates of the encoded non-zero elements, where the same number in different node-feature-vector slices specifically means:
once the slice size of the adjacency matrix is determined, every node-feature-vector slice contains the same number of nodes; assuming N node feature vectors per slice, the nodes within each slice are numbered 0 to N-1, and node feature vectors bearing the same number in different slices are called same-numbered node feature vectors.
5. The FPGA-based backpressure-indexed combination computing unit of claim 4, wherein the non-zero elements in the node feature matrix are encoded as follows:
each non-zero element in the node feature matrix is represented by a triple comprising a row, a column, and a value, where the row denotes the node number and the column denotes the coordinate of the non-zero element within its row.
6. The FPGA-based backpressure-indexed combination computing unit of claim 5, wherein the numbering range of the range buffer is incremented every clock cycle by an amount equal to the number of weight data in a weight matrix slice.
7. The FPGA-based backpressure-indexed combination computing unit of claim 6, wherein the control unit judges whether the top index number of the index buffer is within the numbering range of the range buffer as follows:
if the index value issued by the index buffer is smaller than the current value in the range buffer, the current weight matrix slice is determined to contain data satisfying that index value.
8. An FPGA-based backpressure-indexed combination computing system, characterized by comprising a preprocessing module, an array of combination computing units, and a plurality of weight memories, wherein:
the preprocessing module encodes and stores non-zero elements in the node feature matrix;
the array of combination computing units comprises M×N FPGA-based backpressure-indexed combination computing units according to any one of claims 1 to 7; the data formed from the non-zero elements of every m node feature vectors are input into one combination computing unit, and each combination computing unit in the same column is responsible for the combination computation of m node feature vectors;
the weight memories store the slices of the weight matrix; the slices of the weight matrix are broadcast to every combination computing unit in the same column of the array, one weight matrix slice per clock cycle; the number of weight memories equals the number of columns of the weight matrix, and each weight memory stores the data of one column of the weight matrix.
9. The FPGA-based backpressure-indexed combination computing system of claim 8, wherein the preprocessing module further obtains the sparsity of the node feature matrix through static statistics on the node feature matrix and, according to that sparsity, determines the slice fusion number m of the node feature vectors and the slice size of the weight matrix.
10. The FPGA-based backpressure-indexed combination computing system of claim 9, wherein, when the weight tile buffer is full of data, the Full signal of the weight tile buffer is pulled high, the combination computing unit sends a backpressure signal to the weight memory, the transmission of weight matrix slice data and the operation of the system's subsequent modules are suspended, and the system waits until the exceptional condition of the current combination computing unit is resolved, that is, until the computation task of the last weight matrix slice is fully completed or the Full signal of the weight tile buffer is pulled low.
CN202210482666.XA 2022-05-05 2022-05-05 FPGA-based backpressure-indexed combination computing unit (granted as CN114819127B, Active)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210482666.XA CN114819127B (en) 2022-05-05 2022-05-05 FPGA-based backpressure-indexed combination computing unit

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210482666.XA CN114819127B (en) 2022-05-05 2022-05-05 FPGA-based backpressure-indexed combination computing unit

Publications (2)

Publication Number Publication Date
CN114819127A 2022-07-29
CN114819127B CN114819127B (en) 2024-03-29

Family

Family ID: 82510968

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210482666.XA Active CN114819127B (en) FPGA-based backpressure-indexed combination computing unit

Country Status (1)

Country Link
CN (1) CN114819127B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115982398A (en) * 2023-03-13 2023-04-18 苏州浪潮智能科技有限公司 Graph structure data processing method, system, computer device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704916A (en) * 2016-08-12 2018-02-16 北京深鉴科技有限公司 A kind of hardware accelerator and method that RNN neutral nets are realized based on FPGA
GB201808629D0 (en) * 2018-05-25 2018-07-11 Myrtle Software Ltd Processing matrix vector multiplication
CN113222133A (en) * 2021-05-24 2021-08-06 南京航空航天大学 FPGA-based compressed LSTM accelerator and acceleration method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704916A (en) * 2016-08-12 2018-02-16 北京深鉴科技有限公司 A kind of hardware accelerator and method that RNN neutral nets are realized based on FPGA
GB201808629D0 (en) * 2018-05-25 2018-07-11 Myrtle Software Ltd Processing matrix vector multiplication
CN113222133A (en) * 2021-05-24 2021-08-06 南京航空航天大学 FPGA-based compressed LSTM accelerator and acceleration method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘世培; 江先阳; 肖鹏; 汪波; 邓业东: "An efficient FPGA-based sparse-matrix multiplier" (一种基于FPGA的稀疏矩阵高效乘法器), Microelectronics (微电子学), no. 02, 30 April 2013 (2013-04-30), pages 153-157 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115982398A (en) * 2023-03-13 2023-04-18 苏州浪潮智能科技有限公司 Graph structure data processing method, system, computer device and storage medium
CN115982398B (en) * 2023-03-13 2023-05-16 苏州浪潮智能科技有限公司 Graph structure data processing method, system, computer device and storage medium

Also Published As

Publication number Publication date
CN114819127B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
CN111242289B (en) Convolutional neural network acceleration system and method with expandable scale
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
CN110276450B (en) Deep neural network structured sparse system and method based on multiple granularities
CN108564168A (en) A kind of design method to supporting more precision convolutional neural networks processors
CN113515370A (en) Distributed training method for large-scale deep neural network
Wu et al. A flexible and efficient FPGA accelerator for various large-scale and lightweight CNNs
CN112734020B (en) Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network
CN110580519B (en) Convolution operation device and method thereof
CN114402293A (en) Pipelined neural network processing with continuous and asynchronous updates
CN114819127B (en) 2024-03-29 FPGA-based backpressure-indexed combination computing unit
CN110069444A (en) A kind of computing unit, array, module, hardware system and implementation method
Chang et al. Towards design methodology of efficient fast algorithms for accelerating generative adversarial networks on FPGAs
CN109635937B (en) Low-power consumption system oriented to low-bit wide convolution neural network
Shu et al. High energy efficiency FPGA-based accelerator for convolutional neural networks using weight combination
Yin et al. FPGA-based high-performance CNN accelerator architecture with high DSP utilization and efficient scheduling mode
EP3842954A1 (en) System and method for configurable systolic array with partial read/write
CN113158132A (en) Convolution neural network acceleration system based on unstructured sparsity
Yang et al. A reconfigurable cnn accelerator using tile-by-tile computing and dynamic adaptive data truncation
CN116992203A (en) FPGA-based large-scale high-throughput sparse matrix vector integer multiplication method
US11615300B1 (en) System and method for implementing neural networks in integrated circuits
CN112508174B (en) Weight binary neural network-oriented pre-calculation column-by-column convolution calculation unit
CN106227696B (en) Method for rapidly reconstructing high-performance target array
CN110807479A (en) Neural network convolution calculation acceleration method based on Kmeans algorithm
Yu et al. Data stream oriented fine-grained sparse CNN accelerator with efficient unstructured pruning strategy
Wang et al. TB-DNN: A thin binarized deep neural network with high accuracy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant