CN110750265B - High-level synthesis method and system for graph calculation - Google Patents

High-level synthesis method and system for graph calculation

Info

Publication number
CN110750265B
CN110750265B (application CN201910842736.6A)
Authority
CN
China
Prior art keywords
module
data
graph
edge
source node
Prior art date
Legal status
Active
Application number
CN201910842736.6A
Other languages
Chinese (zh)
Other versions
CN110750265A (en)
Inventor
廖小飞
汤嘉武
郑龙
金海
陈绍鹏
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201910842736.6A
Publication of CN110750265A
Application granted
Publication of CN110750265B
Active legal status (current)
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation

Abstract

The invention discloses a high-level synthesis method and system for graph computation, belonging to the field of big data processing. The method comprises the following steps: (1) generating a graph computation program according to a point-centric functional programming model; (2) specifying architecture parameters and micro-architecture parameters by adding optimization instructions; (3) compiling the graph computation program into a modular dataflow intermediate representation according to the dataflow graph and the optimization instructions; (4) mapping the dataflow intermediate representation onto the underlying architecture according to the mapping between IR modules and hardware templates, and instantiating the pipelines and buffers in the hardware templates; (5) if the instantiated parameterized hardware templates and the overall architecture meet the constraint conditions, going to step (6); otherwise, modifying the optimization instructions and returning to step (3); (6) generating synthesizable hardware-language code. The method provides effective support for generating RTL from an upper-level language, improving the parallelism of graph computation executed on an FPGA.

Description

High-level synthesis method and system for graph calculation
Technical Field
The invention belongs to the field of big data processing, and particularly relates to a high-level synthesis method and system for graph-oriented computation.
Background
In the last decade, graph applications have become increasingly important with the rise of big-data analysis problems such as biological information networks, social networks, and web graphs. Graphs are the best expression of the association attributes of big data, and graph computation is the process of mining and analyzing massive, sparse, hyper-dimensional associations based on graph patterns. Machine learning and deep learning over big data both depend on graph computation, which has become one of the mainstream modes of big data processing.
Graph computation has a complex and irregular nature, which presents new challenges to current hardware. On a general-purpose Central Processing Unit (CPU), instruction-level parallelism is abnormally low even for well-optimized graph algorithms: mostly below 1.0, and often below 0.5. Throughput-oriented architectures such as Graphics Processing Units (GPUs) execute in a Single-Instruction Multiple-Data (SIMD) manner, but the power-law distribution of graph data and the irregularity of graph algorithms are inherently unfriendly to the SIMD mode, causing load imbalance and low bandwidth utilization; studies show that the GPU is fully utilized less than 16% of the time. Reconfigurable hardware such as the Field-Programmable Gate Array (FPGA) has received great attention for its low power consumption and reconfigurability, and researchers have designed a variety of effective graph-computation architectures on FPGAs; however, owing to the complexity of graph computation and the high barrier of hardware programming, writing complete graph-computation hardware code is extremely time-consuming even for professional researchers.
To relieve FPGA developers of tedious hardware details, High-Level Synthesis (HLS) systems have been proposed. An HLS system converts a program written in a high-level language (mostly C/C++) into Register-Transfer Level (RTL) code (e.g., Verilog or VHDL) and provides various optimization means, so that developers can optimize the hardware structure from the high-level-language layer; some systems also provide a visual view for conveniently analyzing circuit behavior in each clock cycle, further improving the performance of the generated RTL. These optimizations, however, are general-purpose and are not tailored to the dependences and conflicts peculiar to graph workloads; in summary, existing HLS systems fail to provide effective support for high-parallelism execution of graph computation on FPGAs.
Disclosure of Invention
In view of the defects and improvement requirements of the prior art, the invention provides a high-level synthesis method for graph computation, aiming to provide effective support for generating graph-application RTL from an upper-level language, so as to improve the parallelism of graph computation executed on an FPGA.
To achieve the above object, according to an aspect of the present invention, there is provided a high-level synthesis method for graph-oriented computation, including:
(1) generating a graph computation program for describing graph computation tasks according to a predefined target programming model;
the target programming model is a point-centric functional programming model that divides the graph computation task into seven graph operations: reading the active point set, reading edge offsets, reading edge data, reading target point data, edge data computation, merging computation results, and updating results and the active point set (an illustrative sketch of such a program follows step (6));
(2) assigning architecture parameters and micro-architecture parameters for the graph computation task by adding optimization instructions;
the architecture parameters comprise edge processing operation and updating operation related to graph calculation; the micro-architecture parameters include parallelism and data bit width;
(3) compiling the graph computation program into a modular dataflow intermediate representation according to a pre-designed dataflow graph and the added optimization instructions;
the data flow diagram decomposes each diagram operation into one or more IR modules and describes the connection relation among the IR modules; each IR module corresponds to a node in the dataflow graph, and each IR module is supported by a corresponding parameterized hardware template;
(4) mapping the compiled dataflow intermediate representation onto the bottom-layer architecture according to the mapping between IR modules and hardware templates, and instantiating the pipelines and buffers in the corresponding hardware templates according to the specified parallelism and data bit width;
(5) if the instantiated parameterized hardware templates and the overall architecture meet the predefined constraint conditions, going to step (6); otherwise, modifying the optimization instructions and returning to step (3);
(6) generating synthesizable hardware-language code according to the instantiated parameterized hardware templates and the overall architecture.
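For illustration only, the following C++ sketch shows what a program under the point-centric functional model of step (1) might look like, using PageRank as the example; all identifiers are assumptions of this rendering, not the patent's actual interface, and the four read operations are assumed to be carried out by the framework from the CSR arrays, so the user supplies only the computation hooks:

```cpp
#include <cstdint>
#include <vector>

// Illustrative vertex-centric functional program (PageRank). The three
// hooks correspond to the computation-related operations of step (1):
// edge data computation, merging computation results, and updating
// results; the read operations are assumed to be framework-generated.
struct PageRankProgram {
    std::vector<float>    value;       // per-vertex attribute (Value)
    std::vector<uint32_t> out_degree;  // per-vertex out-degree

    // edge data computation: contribution carried by one (src -> dst) edge
    float edge_process(uint32_t src) const {
        return value[src] / out_degree[src];
    }

    // merging computation results: associative combine of edge updates
    float merge(float a, float b) const { return a + b; }

    // updating results and the active point set: apply the merged value
    float apply(float /*old_value*/, float merged) const {
        const float damping = 0.85f;
        return (1.0f - damping) + damping * merged;
    }
};
```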
At present, no HLS tool can directly generate a high-parallelism, efficient pipeline structure for graph computation from an upper-level high-level language. On one hand, the limited expressiveness makes it difficult for users to accurately describe the required architecture and to specify micro-architecture parameters; on the other hand, once high parallelism is specified for a graph algorithm, a large amount of data dependence and conflict arises, and without sufficient bottom-layer optimization to realize the storage and computation structures that high parallelism requires, the finally generated hardware circuit either actually executes serially or cannot be generated at all due to resource exhaustion.
According to a functional programming model, the method divides the graph computation task into seven graph operations (reading the active point set, reading edge offsets, reading edge data, reading target point data, edge data computation, merging computation results, and updating results and the active point set), decomposes each graph operation into one or more IR modules through a dataflow graph, and defines the mapping between the IR modules and parameterized bottom-layer hardware templates, improving the upper-level language's ability to describe bottom-layer hardware and providing effective support for generating graph-application RTL from the upper-level language. In addition, while the graph computation program is compiled into the dataflow IR and mapped onto the bottom-layer architecture, micro-architecture parameters such as parallelism specified by the optimization instructions are propagated down and finally act on the bottom-layer architecture, so the hardware structure can be optimized from the high-level-language layer, effectively improving the parallelism of graph computation executed on the FPGA.
Furthermore, each IR module is provided with an input buffer and an output buffer: the input buffer receives data transmitted by the previous module and indicates its overflow condition, while the output buffer stores results produced by the current IR module for the next IR module to read and generates a control signal according to the buffer's overflow condition;
the connections between IR modules are realized by the input and output buffers.
By providing each IR module with its own input and output buffer, the invention effectively reduces pipeline stalls.
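The buffer pair can be modeled in software as follows; this is a minimal sketch under the assumption of a simple FIFO with backpressure, since the patent specifies the behavior (overflow indication and control signal) but not this exact structure:

```cpp
#include <cstddef>
#include <queue>

// Software model of a per-module stream buffer. The full() predicate
// plays the role of the overflow-derived control signal: when the
// downstream buffer is full, the upstream module stalls instead of
// dropping data, which is how pipeline stalls are kept local and short.
template <typename T>
class StreamBuffer {
    std::queue<T> q_;
    std::size_t   capacity_;
public:
    explicit StreamBuffer(std::size_t cap) : capacity_(cap) {}
    bool full()  const { return q_.size() >= capacity_; }  // overflow indication
    bool empty() const { return q_.empty(); }
    bool push(const T& v) {            // producer writes, observing backpressure
        if (full()) return false;      // control signal: producer must stall
        q_.push(v);
        return true;
    }
    bool pop(T& out) {                 // consumer (next IR module) reads
        if (empty()) return false;
        out = q_.front();
        q_.pop();
        return true;
    }
};
```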
Further, the dataflow graph includes 17 IR modules M1~M17.
IR module M1 performs the read-active-point-set graph operation: in each clock cycle, M1 generates n source nodes according to the read-point parallelism n;
IR modules M2~M3 perform the read-edge-offset graph operation: M2 reads n source nodes from M1 according to the read-point parallelism n, transmits the source-node information, and generates edge-offset memory-access requests for the read source nodes in order; M3 processes the edge-offset access requests obtained from M2 so as to read the edge offsets;
IR modules M3~M8 perform the read-edge-data graph operation: M4 receives the edge offsets from M3 while reading the n source nodes from M2 for continued transmission; M5 reads the edge offsets and source-node data from M4 and matches them, attaching the corresponding edge offset to each source node's data; after reading the source nodes and corresponding edge offsets from M5, M6 generates m edge-data access requests in order according to the read-edge parallelism m and continues transmitting the source-node information; M7 reads the edge-data access requests and source-node information from M6, marks edges that do not belong to a transmitted source node as invalid, and generates edge control information; M3 also processes the edge-data access requests obtained from M6 so as to read the edge data; M8 receives the edge data from M3 and the source-node information and edge control information from M7, and transmits all three;
IR modules M9~M13 perform the read-target-point-data graph operation: after receiving the edge data, source-node information, and edge control information from M8, M9 generates target-node access requests according to the edge data and transmits the source-node information and edge control information; M10 schedules the target-node access requests read from M9 so as to improve memory-access throughput; M11 processes the scheduled target-node access requests read from M10 so as to read the target-node data; after reading the target-node data from M11, M12 orders the target nodes by source node; M13 forwards the source-node information, the ordered target-node data, and the edge control information read from M12;
IR module M14 performs the edge-data-computation graph operation: M14 reads the target-node data from M13, performs the edge-data computation to obtain an update value, and transmits the update value together with the source-node information and edge control information;
IR module M15 performs the merge-computation-results graph operation: after reading the update values from M14, M15 merges them according to the read-edge parallelism m and transmits the source-node information and edge control information;
IR modules M16~M17 perform the update-results-and-active-point-set graph operation: M16 reads, from M15's output buffer, the update value corresponding to each valid source node and transmits the source-node information; M17 reads the valid source nodes' update results from M16's output buffer, merges the update values belonging to the same valid source node, and writes the merged values back to on-chip storage;
where the read-point parallelism n and the read-edge parallelism m are micro-architecture parameters specified by the optimization instructions, and among the source nodes transmitted by M6, those having edges among the m generated edge-data access requests are valid source nodes, while the rest are invalid source nodes used for padding.
Conventional HLS systems, in order to optimize the bottom-layer hardware structure, compile the upper-level language into a fine-grained intermediate representation during compilation and then mine the parallel opportunities within it; the various optimization means developed for loops act on all operations contained in a loop, and thus can neither specifically and practically resolve the dependences and conflicts of the individual graph operations nor generate dedicated support for each operation.
The dataflow graph provided by the invention defines the above 17 IR modules to effectively support the 7 main graph operations in graph computation tasks, so that a graph computation task can be displayed and expressed modularly at a higher level of abstraction, optimization support is provided precisely, a large number of data conflicts are avoided, and the execution parallelism of graph computation is improved.
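For reference, the module decomposition and connections described above can be summarized in the following illustrative C++ encoding; the stage names are paraphrases of the module functions rather than the patent's terminology, and the pair list is derived from the textual description of Fig. 4:

```cpp
// One pipeline stage per IR module; edges are the buffer connections.
enum IRModule {
    M1_ReadActiveSet = 1, M2_EmitOffsetReq,   M3_MemoryAccess,
    M4_RecvOffset,        M5_MatchOffset,     M6_EmitEdgeReq,
    M7_MarkInvalidEdges,  M8_JoinEdgeData,    M9_EmitDstReq,
    M10_ScheduleDstReq,   M11_ReadDstData,    M12_ReorderBySrc,
    M13_ForwardDst,       M14_EdgeCompute,    M15_MergeUpdates,
    M16_ReadValidUpdates, M17_CombineWriteBack
};

// Producer -> consumer connections as described in the text; note that
// the memory-access module M3 serves both the edge-offset requests from
// M2 and the edge-data requests from M6.
const int connections[][2] = {
    {M1_ReadActiveSet, M2_EmitOffsetReq}, {M2_EmitOffsetReq, M3_MemoryAccess},
    {M3_MemoryAccess, M4_RecvOffset},     {M2_EmitOffsetReq, M4_RecvOffset},
    {M4_RecvOffset, M5_MatchOffset},      {M5_MatchOffset, M6_EmitEdgeReq},
    {M6_EmitEdgeReq, M7_MarkInvalidEdges},{M6_EmitEdgeReq, M3_MemoryAccess},
    {M3_MemoryAccess, M8_JoinEdgeData},   {M7_MarkInvalidEdges, M8_JoinEdgeData},
    {M8_JoinEdgeData, M9_EmitDstReq},     {M9_EmitDstReq, M10_ScheduleDstReq},
    {M10_ScheduleDstReq, M11_ReadDstData},{M11_ReadDstData, M12_ReorderBySrc},
    {M12_ReorderBySrc, M13_ForwardDst},   {M13_ForwardDst, M14_EdgeCompute},
    {M14_EdgeCompute, M15_MergeUpdates},  {M15_MergeUpdates, M16_ReadValidUpdates},
    {M16_ReadValidUpdates, M17_CombineWriteBack},
};
```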
Further, each IR module stores its produced results and the information to be passed on into its own output buffer.
When an IR module Md reads data from another IR module Ms, the data in Ms's output buffer is first read into Md's input buffer, and Md then reads the data from its input buffer;
where Md and Ms are any two different modules among IR modules M1~M17.
Further, M6 reading the source nodes and corresponding edge offsets from M5, generating m edge-data access requests in order according to the read-edge parallelism m, and continuing to transmit the source-node information comprises:
after reading the source nodes and corresponding edge offsets from M5, M6 reads n source nodes from its input buffer and obtains the maximum edge offset e corresponding to the read source nodes and the value c of the read-edge counter;
if c + m = e, transmitting the n source nodes and m edge-data access requests, increasing the read-edge counter by m, and removing the n transmitted source nodes from M6's input buffer;
if c + m > e, transmitting the n source nodes and m edge-data access requests while keeping the read-edge counter unchanged, and removing the n transmitted source nodes from M6's input buffer;
if c + m < e and there exists a source node v whose right offset e_rv = c + m, transmitting source node v and the source nodes numbered below it, padding with invalid source nodes so that n source nodes are transmitted at once, transmitting m edge-data access requests, increasing the read-edge counter by m, and removing source node v and the source nodes numbered below it from M6's input buffer;
if c + m < e and there exists a source node u whose left offset e_lu < c + m and whose right offset e_ru > c + m, transmitting source node u and the source nodes numbered below it, padding with invalid source nodes so that n source nodes are transmitted at once, transmitting m edge-data access requests, increasing the read-edge counter by m, and removing only the source nodes numbered below u from M6's input buffer;
where the initial value of the read-edge counter is 0.
With the above optimized scheduling, M6 transmits n source nodes and m edge-data access requests in every clock cycle, which simplifies the bottom-layer hardware implementation. When the degree of a source node is small, i.e., the transmitted edge-data access requests contain edges that do not belong to that source node, all edges of several low-degree nodes can be processed together, realizing inter-node parallelism; when the degree of a source node is large, i.e., the transmitted edge-data access requests do not cover all of that node's edges, multiple edges of the high-degree node are processed, realizing intra-node parallelism. The invention can thus flexibly realize inter-node and intra-node parallelism according to the irregular nature of graph computation tasks, effectively improving resource utilization and execution parallelism.
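A software transcription of M6's per-cycle decision follows; this is a sketch that maps the four cases above directly onto sequential code (in hardware they would be evaluated combinationally within one clock cycle), and the data-structure layout is an assumption:

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <vector>

struct SrcNode { uint32_t id; uint32_t left_off, right_off; bool valid; };
struct M6Output { std::vector<SrcNode> nodes; uint32_t first_edge, num_edges; };

// in_buf: source nodes with attached edge offsets, from M5 (lowest number first)
// n, m  : read-point / read-edge parallelism; c: read-edge counter, initially 0
M6Output schedule_edges(std::deque<SrcNode>& in_buf, uint32_t n, uint32_t m,
                        uint32_t& c) {
    M6Output out{{}, c, m};                 // m edge requests covering [c, c+m)
    uint32_t e = 0;                         // max right offset of the read nodes
    for (uint32_t i = 0; i < n && i < in_buf.size(); ++i)
        e = std::max(e, in_buf[i].right_off);

    auto pad = [&] {                        // fill to n with invalid source nodes
        while (out.nodes.size() < n) out.nodes.push_back({0, 0, 0, false});
    };

    if (c + m >= e) {                       // cases c+m == e and c+m > e
        for (uint32_t i = 0; i < n && !in_buf.empty(); ++i) {
            out.nodes.push_back(in_buf.front());
            in_buf.pop_front();
        }
        pad();
        if (c + m == e) c += m;             // counter unchanged when c+m > e
        return out;
    }
    // c + m < e: emit nodes whose edges are fully covered by [c, c+m) ...
    while (!in_buf.empty() && in_buf.front().right_off <= c + m) {
        out.nodes.push_back(in_buf.front());
        in_buf.pop_front();
    }
    // ... and a node u straddling c+m is emitted but kept in the buffer,
    // so the next round of transmission resumes from u.
    if (!in_buf.empty() && in_buf.front().left_off < c + m)
        out.nodes.push_back(in_buf.front());
    pad();
    c += m;
    return out;
}
```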
Further, M10 scheduling the target-node access requests read from M9 comprises:
after reading a target-node access request from M9, M10 determines, from the request address, the on-chip point-data partition to which the request belongs and distributes the request to the request buffer corresponding to that partition;
where on-chip storage is divided into m on-chip point-data partitions according to the read-edge parallelism m, the m request buffers correspond one-to-one to the m partitions, and in each clock cycle access requests are issued to the m partitions through the m request buffers.
Because reading target point data involves a large amount of random memory access, the above scheduling optimization, which partitions on-chip storage according to the read-edge parallelism and generates a corresponding number of request buffers, reduces access conflicts and guarantees higher throughput.
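A sketch of M10's scheduler is given below; the partition-selection rule (simple address interleaving) is an assumption, since the patent only states that the partition is derived from the request address:

```cpp
#include <cstdint>
#include <queue>
#include <vector>

struct DstRequest { uint32_t addr; };

// On-chip vertex storage is split into m partitions, each with its own
// request buffer, so up to m conflict-free accesses issue per cycle.
class RequestScheduler {
    std::vector<std::queue<DstRequest>> buffers_;  // one per partition
public:
    explicit RequestScheduler(uint32_t m) : buffers_(m) {}

    // Distribute a request to the buffer of the partition it addresses
    // (interleaved partitioning assumed here for illustration).
    void dispatch(const DstRequest& r) {
        buffers_[r.addr % buffers_.size()].push(r);
    }

    // One access per partition per cycle; independent banks avoid conflicts.
    std::vector<DstRequest> issue_cycle() {
        std::vector<DstRequest> issued;
        for (auto& b : buffers_)
            if (!b.empty()) { issued.push_back(b.front()); b.pop(); }
        return issued;
    }
};
```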
Further, if the number of edges exceeds a preset threshold, the edge data is stored in off-chip DRAM, and IR module M3 reads edge data from the off-chip DRAM when processing edge-data access requests.
Conventional HLS systems use arrays to represent storage structures, and in practice arrays are difficult to map to the various hardware storage structures required for graph computation; the invention supports off-chip data transfer and can optimize the storage structure.
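The placement rule reduces to a single comparison; the following snippet illustrates it, with the capacity threshold left as a parameter since, per the text, it depends on the actual on-chip storage:

```cpp
#include <cstdint>

enum class EdgeStorage { OnChipBRAM, OffChipDRAM };

// Edges beyond on-chip capacity go to off-chip DRAM; M3 then issues
// DRAM reads when serving edge-data access requests.
EdgeStorage place_edges(uint64_t num_edges, uint64_t bram_edge_capacity) {
    return num_edges > bram_edge_capacity ? EdgeStorage::OffChipDRAM
                                          : EdgeStorage::OnChipBRAM;
}
```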
Further, the high-level synthesis method for graph-oriented computation provided by the present invention further includes:
(7) converting the obtained synthesizable hardware-language code into a bitstream file and running it on an FPGA development board;
(8) if the performance does not meet the preset performance requirement, modifying the optimization instructions, executing steps (3)-(6) again to obtain new synthesizable hardware-language code, and returning to step (7); otherwise, the procedure ends.
According to another aspect of the present invention, there is also provided a high-level synthesis system for graph-oriented computation, comprising: a graph-computation-program generation module, an optimization module, a compilation module, a bottom-layer mapping module, a constraint checking module, and a synthesis module;
a graph computation program generation module for generating a graph computation program for describing graph computation tasks according to a predefined target programming model; the target programming model is a functional programming model taking a point as a center, and the target programming model divides the graph calculation task into seven graph operations of reading active point set, reading edge offset, reading edge data, reading target point data, calculating edge data, combining calculation results and updating results and active point sets;
the optimization module is used for appointing architecture parameters and micro-architecture parameters for the graph calculation task by adding an optimization instruction; the architecture parameters comprise edge processing operation and updating operation related to graph calculation; the micro-architecture parameters include parallelism and data bit width;
the compiling module is used for compiling the graph calculation program into a modularized data flow intermediate representation according to a pre-designed data flow graph and the added optimization instruction; the data flow diagram decomposes each diagram operation into one or more IR modules and describes the connection relation among the IR modules; each IR module corresponds to a node in the dataflow graph, and each IR module is supported by a corresponding parameterized hardware template;
the bottom layer mapping module is used for mapping the compiled data stream intermediate representation to a bottom layer framework according to the mapping relation between the IR module and the hardware template, and instantiating a pipeline and a buffer area in the corresponding hardware template according to the specified parallelism and data bit width;
the constraint checking module is used for judging whether each instantiated parameterized hardware template and the whole framework meet predefined constraint conditions or not and modifying the optimization instruction when judging that the constraint conditions are not met;
and the synthesis module is used for generating a synthesizable hardware language code according to each parameterized hardware template and the whole framework when each instantiated parameterized hardware template and the whole framework meet predefined constraint conditions.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
(1) The high-level synthesis method and system for graph computation divide a graph computation task, according to a functional programming model, into seven graph operations (reading the active point set, reading edge offsets, reading edge data, reading target point data, edge data computation, merging computation results, and updating results and the active point set), decompose each graph operation into one or more IR modules through a dataflow graph, and define the mapping between the IR modules and parameterized bottom-layer hardware templates, improving the upper-level language's ability to describe bottom-layer hardware and providing effective support for generating graph-application RTL from the upper-level language. In addition, while the graph computation program is compiled into the dataflow IR and mapped onto the bottom-layer architecture, micro-architecture parameters such as parallelism specified by the optimization instructions are propagated down and finally act on the bottom-layer architecture, so the hardware structure can be optimized from the high-level-language layer. Overall, the method provides effective support for generating graph-application RTL from the upper-level language and improves the parallelism of graph computation executed on the FPGA.
(2) The high-level synthesis method and system for graph computation define, through the dataflow graph, 17 IR modules that effectively support the 7 main graph operations in graph computation tasks, so that graph computation tasks can be displayed and expressed modularly at a higher level of abstraction, optimization support is provided precisely, a large number of data conflicts are avoided, and the execution parallelism of graph computation is improved.
(3) In the high-level synthesis method and system for graph computation, when the degree of a source node is small, all edges of several low-degree nodes can be processed together, realizing inter-node parallelism; when the degree of a source node is large, multiple edges of the high-degree node can be processed, realizing intra-node parallelism. The invention can thus flexibly realize inter-node and intra-node parallelism according to the irregular nature of graph computation tasks, effectively improving resource utilization and execution parallelism.
(4) In the high-level synthesis method and system for graph computation, when processing memory-access requests for target point data, on-chip storage is partitioned into point-data partitions according to the read-edge parallelism and a corresponding number of request buffers is generated, which reduces access conflicts and guarantees higher throughput.
Drawings
FIG. 1 is a flowchart of a high-level synthesis method for graph-oriented computing according to an embodiment of the present invention;
FIG. 2 is a data flow diagram provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a point-centered functional programming model according to an embodiment of the present invention;
FIG. 4 is a block diagram of a data flow framework according to an embodiment of the present invention;
FIG. 5 is a diagram of a prior-art point-centric imperative programming model;
fig. 6 is a schematic diagram of a hardware architecture according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Before explaining the technical scheme of the invention in detail, terms related to graph computation are briefly introduced. A graph data structure is built on primitives reflecting the real world and comprises nodes (vertices, v), edges (e) connecting different nodes, and properties; both nodes and edges may carry properties, which can be data of any type. Compressed Sparse Row (CSR) is an input format for graph-computation programming models that introduces the concept of an edge offset: the offset, within the edge table, of the edges corresponding to the current vertex. It specifically includes a left offset, recording the starting position in the edge table of the vertex's edges, and a right offset, recording their end position. In concrete algorithm implementations an active point set (ActiveVertex) is usually maintained; a target node is denoted u and an attribute value is denoted Value. In the invention, the input format required by the target programming model is CSR. Regarding the execution model, the common execution models in graph computation are the Push model and the Pull model. In the Pull model, each iteration of the graph algorithm schedules all nodes and processes all of their incoming edges. The PageRank algorithm is better suited to the Pull model; a Push-based BFS must obtain source-point and target-point data simultaneously, which increases hardware overhead, and BFS based on Notify-Pull/Pull or Push/Pull shows lower performance.
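The CSR layout just described can be written down as follows; this is the standard encoding, with field names chosen for illustration:

```cpp
#include <cstdint>
#include <vector>

// CSR graph: for vertex v, its edges occupy edges[offset[v] .. offset[v+1]),
// so offset[v] is the left offset and offset[v+1] the right offset.
struct CSRGraph {
    uint32_t num_vertices;
    std::vector<uint32_t> offset;   // size num_vertices + 1
    std::vector<uint32_t> edges;    // neighbor vertex ids, size num_edges
    std::vector<float>    values;   // per-vertex attribute (Value)

    uint32_t left_offset (uint32_t v) const { return offset[v];     }
    uint32_t right_offset(uint32_t v) const { return offset[v + 1]; }
    uint32_t degree      (uint32_t v) const {
        return right_offset(v) - left_offset(v);
    }
};
```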
To provide effective support for generating graph-application RTL from an upper-level language and thereby improve the parallelism of graph computation executed on an FPGA, the invention provides a high-level synthesis method for graph computation which, as shown in FIG. 1, comprises:
(1) generating a graph computation program for describing graph computation tasks according to a predefined target programming model;
the target programming model is a point-centric functional programming model which, as shown in fig. 2, divides the graph computation task into seven graph operations: reading the active point set, reading edge offsets, reading edge data, reading target point data, edge data computation, merging computation results, and updating results and the active point set;
FIG. 3 shows pseudocode for the point-centric functional programming model of the invention. In the pseudocode: line 0 reads in the graph data from the CSR-format address specified by the user, and the storage mode and addresses of the data are then determined automatically from the code that follows. Lines 1 and 2 correspond to the graph operation of reading the active point set; under pull execution all points are active, so line 1 actually serves as the signal for judging whether the iteration has finished, and the read-point parallelism can be selected here. Line 3 corresponds to the graph operation of reading edge offsets; its parallelism can be specified and defaults to the read-point parallelism. Line 4 has no real effect, since the target programming model is not driven by loops. Line 5 corresponds to reading edge data, for which the read-edge parallelism can be specified; because sequential edge reading does not depend on source-point data, this amounts to intra-node parallelism when node degrees are large and inter-node parallelism when node degrees are small, keeping the hardware resources continuously busy and resolving the load imbalance caused by uneven degrees across nodes. Line 6 corresponds to the graph operation of reading target point data; since a large amount of random memory access occurs here, a corresponding number of buffers is generated according to the read-edge parallelism, the on-chip data is partitioned, and the access requests are scheduled to guarantee higher throughput. Line 7 corresponds to the edge data computation, for which the data bit width and data type can be specified. Line 8 corresponds to merging computation results according to the specified merge logic operation; the concrete architecture can be generated according to the read-edge parallelism and the parallel accumulation architecture in "An efficient graph accelerator with parallel data conflict management" (Yao Pengcheng et al.). Lines 10 and 11 correspond to updating the results and the active point set; line 10 also operates according to the specified merge logic, the parallelism of the update operation does not exceed the read-point parallelism, and the on-chip point-data storage can be configured accordingly so that no read-write conflicts arise;
(2) assigning architecture parameters and micro-architecture parameters for the graph computation task by adding optimization instructions;
the architecture parameters comprise edge processing operation and updating operation related to graph calculation;
the micro-architecture parameters include parallelism and data bit width; the parallelism specifically comprises the read-point parallelism and the read-edge parallelism, and the micro-architecture parameters specified by the optimization instructions may also include data type, format adjustment, and the like, according to actual needs (an illustrative directive sketch is given after step (6) below);
(3) compiling the graph computation program into a modular dataflow intermediate representation according to a pre-designed dataflow graph and the added optimization instructions;
the data flow diagram decomposes each diagram operation into one or more IR modules and describes the connection relation among the IR modules; each IR module corresponds to a node in the dataflow graph, and each IR module is supported by a corresponding parameterized hardware template;
as a preferred embodiment, each IR module is provided with an input buffer and an output buffer, wherein the input buffer is used for receiving data transmitted by a previous module and indicating an overflow condition, and the output buffer is used for storing a result generated by the current IR module for being read by a next IR module and generating a control signal according to the overflow condition of the buffer;
the connection between the IR modules is realized by an input buffer area and an output buffer area;
by setting a corresponding input buffer area and an output buffer area for each IR module, the pipeline pause can be effectively reduced;
in an embodiment of the invention, as shown in FIG. 4, the dataflow graph includes 17 IR modules M1~M17The numbers are 1-17 in sequence;
IR Module M1A graph operation to perform a set of read active points; IR Module M1Generating n source nodes according to the read point parallelism n in each clock cycle;
IR Module M2~M3Graph operations for performing read edge offsets; IR Module M2From IR module M according to read point parallelism n1Reading n source nodes to transmit source node information, and generating edge offset memory access requests aiming at the read source nodes in sequence; IR Module M3Slave IR module M2Processing after acquiring a side offset memory access request so as to read the side offset;
IR Module M3~M8Graph operations for performing read edge data; IR Module M4Slave IR module M3Receive side offset while slave IR module M2Read nThe source node continues to transmit; IR Module M5Slave IR module M4Reading the edge offset and the source node data and matching, thereby adding the corresponding edge offset in the source node data; IR Module M6Slave IR module M5After reading the source node and the corresponding edge offset, generating m edge data access requests according to the read edge parallelism m sequence, and continuously transmitting the source node information; IR Module M7Slave IR module M6Reading an edge data access request and source node information, and marking edges which do not belong to a transmitted source node as invalid edges to generate edge control information; IR Module M3Also from IR module M6Processing after acquiring a side data access request to read side data; IR Module M8Slave IR module M3Receiving side data, from IR module M7Receiving source node information and side control information, and transmitting the three types of information;
IR Module M9~M13A graph operation for reading target point data is performed; IR Module M9Slave IR module M8After receiving the side data, the source node information and the side control information, generating a target node access request according to the side data, and transmitting the source node information and the side control information; IR Module M10Slave IR module M9Scheduling after reading the access request of the target node so as to improve the access throughput; IR Module M11Slave IR module M10Processing after reading the scheduled target node access request so as to read target node data; IR Module M12Slave IR module M11After reading the target node data, sequencing the target nodes according to the source nodes; IR Module M13Transfer slave IR module M12Reading source node information, sequenced target node data and side control information;
IR Module M14Graph operations for performing edge data computations; IR Module M14Slave IR module M13Reading target node data, performing side data calculation to obtain an updated value, transmitting the updated value, and transmitting source node information and side control information;
IR Module M15Graph operations for performing merging computation results; IR mouldBlock M15Slave IR module M14After reading the updated values, merging the updated values according to the read edge parallelism m, and transmitting source node information and edge control information;
IR Module M16~M17Graph operations for executing the update results and the set of active points; IR Module M16Slave IR module M15The output buffer area reads the updating value corresponding to each effective source node and transmits the source node information; IR Module M17Slave IR module M16The output buffer area reads the updating result of the effective source node, and merges the updating values of the same effective source node, and after merging, the updating values are written back to the chip for storage;
wherein, the read point parallelism n and the read edge parallelism M are micro-architecture parameters appointed by the optimization instruction, and the IR module M6In the transmitted source nodes, the source node with edges belonging to the generated m edge data access requests is an effective source node, and the rest are invalid source nodes for filling;
as a further preferred embodiment, each IR module stores the results it produces and the information that needs to be transferred in its output buffer;
an IR module MdTo another IR module MsWhen reading data, the IR module MdFirstly, the IR module MsRead the data in the output buffer to the IR module MdThen the IR module MdReading data from its input buffer;
wherein, is an IR module MdAnd IR Module MsAs IR module M1~M17Two different IR modules;
(4) mapping the compiled dataflow intermediate representation onto the bottom-layer architecture according to the mapping between IR modules and hardware templates, and instantiating the pipelines and buffers in the corresponding hardware templates according to the specified parallelism and data bit width;
(5) if the instantiated parameterized hardware templates and the overall architecture meet the predefined constraint conditions, going to step (6); otherwise, modifying the optimization instructions and returning to step (3);
depending on the actual implementation, the constraint condition may be a requirement for resources, timing, etc.;
when the instantiated parameterized hardware template and the whole framework do not meet the constraint conditions of resources, time sequences and the like, correspondingly, the modification of the optimization instruction can specifically be the reduction of parallelism, the optimization of a data partitioning method, the reduction of data replication, the optimization of a pipeline structure and the like;
(6) generating a synthesizable hardware language code according to each instantiated parameterized hardware template and the whole architecture;
the specific hardware description language can select VHDL, Verilog and the like according to actual needs, and the generated hardware language code is RTL code.
At present, no HLS tool can directly generate a high-parallelism, efficient pipeline structure for graph computation from an upper-level high-level language. On one hand, the limited expressiveness makes it difficult for users to accurately describe the required architecture and to specify micro-architecture parameters; on the other hand, once high parallelism is specified for a graph algorithm, a large amount of data dependence and conflict arises, and without sufficient bottom-layer optimization to realize the storage and computation structures that high parallelism requires, the finally generated hardware circuit either actually executes serially or cannot be generated at all due to resource exhaustion.
The high-level synthesis method for graph computation divides the graph computation task, according to a functional programming model, into seven graph operations (reading the active point set, reading edge offsets, reading edge data, reading target point data, edge data computation, merging computation results, and updating results and the active point set), decomposes each graph operation into one or more IR modules through a dataflow graph, and defines the mapping between the IR modules and parameterized bottom-layer hardware templates, improving the upper-level language's ability to describe bottom-layer hardware and providing effective support for generating graph-application RTL from the upper-level language. In addition, while the graph computation program is compiled into the dataflow IR and mapped onto the bottom-layer architecture, micro-architecture parameters such as parallelism specified by the optimization instructions are propagated down and finally act on the bottom-layer architecture, so the hardware structure can be optimized from the high-level-language layer, effectively improving the parallelism of graph computation executed on the FPGA. In general, this high-level synthesis method provides effective support for generating graph-application RTL from the upper-level language and effectively improves the parallelism of graph computation executed on the FPGA.
Conventional HLS systems, in order to optimize the bottom-layer hardware structure, compile the upper-level language into a fine-grained intermediate representation during compilation and then mine the parallel opportunities within it; the various optimization means developed for loops act on all operations contained in a loop, and thus can neither specifically and practically resolve the dependences and conflicts of the individual graph operations nor generate dedicated support for each operation.
In the high-level synthesis method for graph computation, the provided dataflow graph defines the 17 IR modules to effectively support the 7 main graph operations in graph computation tasks, so that graph computation tasks can be displayed and expressed modularly at a higher level of abstraction, optimization support is provided precisely, a large number of data conflicts are avoided, and the execution parallelism of graph computation is improved.
In a preferred embodiment of the above high-level synthesis method for graph computation, M6 reading the source nodes and corresponding edge offsets from M5, generating m edge-data access requests in order according to the read-edge parallelism m, and continuing to transmit the source-node information comprises:
after reading the source nodes and corresponding edge offsets from M5, M6 reads n source nodes from its input buffer and obtains the maximum edge offset e corresponding to the read source nodes and the value c of the read-edge counter;
if c + m = e, transmitting the n source nodes and m edge-data access requests, increasing the read-edge counter by m, and removing the n transmitted source nodes from M6's input buffer;
if c + m > e, transmitting the n source nodes and m edge-data access requests while keeping the read-edge counter unchanged, and removing the n transmitted source nodes from M6's input buffer;
if c + m < e and there exists a source node v whose right offset e_rv = c + m, transmitting source node v and the source nodes numbered below it, padding with invalid source nodes so that n source nodes are transmitted at once, transmitting m edge-data access requests, increasing the read-edge counter by m, and removing source node v and the source nodes numbered below it from M6's input buffer;
if c + m < e and there exists a source node u whose left offset e_lu < c + m and whose right offset e_ru > c + m, transmitting source node u and the source nodes numbered below it, padding with invalid source nodes so that n source nodes are transmitted at once, transmitting m edge-data access requests, increasing the read-edge counter by m, and removing only the source nodes numbered below u from M6's input buffer, so that the next round of source-node transmission starts from source node u;
where the initial value of the read-edge counter is 0.
With the above optimized scheduling, M6 transmits n source nodes and m edge-data access requests in every clock cycle, which simplifies the bottom-layer hardware implementation. When the degree of a source node is small, i.e., the transmitted edge-data access requests contain edges that do not belong to that source node, all edges of several low-degree nodes can be processed together, realizing inter-node parallelism; when the degree of a source node is large, i.e., the transmitted edge-data access requests do not cover all of that node's edges, multiple edges of the high-degree node are processed, realizing intra-node parallelism.
Conventional HLS systems adopt imperative programming. FIG. 5 is a pseudocode diagram of a conventional imperative programming model, in which the loop on line 1 processes all active vertices in order and the loop on line 4 processes all edges of the current active vertex in order. Consider first coarse-grained pipelining of the nested loop, with lines 2-3 as one pipeline stage and the loop of lines 4-8 as a whole as another stage (currently only Spatial can realize coarse-grained pipelining at arbitrary positions), followed by fine-grained pipelining of the loop in lines 4-8: even at the maximum pipelining capability of the HLS tool, serious pipeline stalls remain under high parallelism. Consider then adding parallel directives, divided into inter-point parallelism and intra-point parallelism, corresponding to parallelizing the loops on line 1 and line 4 respectively; the two directives can be added at the same time. For intra-point parallelism at high parallelism (a graph accelerator's parallelism may be 16 or 32), many computing resources sit idle in every cycle whenever the vertex degree is small; the inter-point case is similar, and owing to the power-law distribution of graph data the load across the parallel pipelines is uneven.
By contrast, the high-level synthesis method for graph computation of the invention can flexibly realize inter-node and intra-node parallelism according to the irregular nature of graph computation tasks, effectively improving resource utilization and execution parallelism.
In a preferred embodiment of the above high-level synthesis method for graph computation, M10 scheduling the target-node access requests read from M9 comprises:
after reading a target-node access request from M9, M10 determines, from the request address, the on-chip point-data partition to which the request belongs and distributes the request to the request buffer corresponding to that partition;
where on-chip storage is divided into m on-chip point-data partitions according to the read-edge parallelism m, the m request buffers correspond one-to-one to the m partitions, and in each clock cycle access requests are issued to the m partitions through the m request buffers.
Because reading target point data involves a large amount of random memory access, the above scheduling optimization, which partitions on-chip storage according to the read-edge parallelism and generates a corresponding number of request buffers, reduces access conflicts and guarantees higher throughput.
In a preferred embodiment of the above high-level synthesis method for graph computation, if the number of edges exceeds a preset threshold, the edge data is stored in off-chip DRAM, and IR module M3 reads edge data from the off-chip DRAM when processing edge-data access requests; the specific threshold may be determined from the actual on-chip storage capacity.
Conventional HLS systems use arrays to represent storage structures, and in practice arrays are difficult to map to the various hardware storage structures required for graph computation; the invention supports off-chip data transfer and can optimize the storage structure.
To realize on-board execution on the FPGA, as shown in fig. 1, the high-level synthesis method for graph computation may further include:
(7) converting the obtained synthesizable hardware-language code into a bitstream file and running it on an FPGA development board;
(8) if the performance does not meet the preset performance requirement, modifying the optimization instructions, executing steps (3)-(6) again to obtain new synthesizable hardware-language code, and returning to step (7); otherwise, ending;
the specific performance requirements can be set according to the actual graph-computation task; when the performance requirement is not met, the modification of the optimization instructions may specifically be reducing the parallelism, optimizing the data-partitioning method, reducing data replication, optimizing the pipeline structure, and the like.
In this embodiment, after on-board FPGA execution is realized according to the above high-level synthesis method for graph computation, the finally realized bottom-layer hardware architecture is as shown in fig. 6. The actual pipeline is divided into 17 IR modules according to the dataflow intermediate representation: the modules executing reading the active point set, reading edge offsets, reading edge data, and reading target point data are IR modules unrelated to the graph computation proper, while the IR modules executing edge data computation, merging computation results, and updating results and the active point set are computation-related modules. For the computation-unrelated modules, the corresponding point-transmission pipelines, edge-control-signal transmission pipelines, buffers, and control-signal parameters within each module are instantiated according to the user-specified micro-architecture parameters such as parallelism and data bit width and type. The computation-related modules are instantiated according to the user's instruction parameters and contain only computation operations, with no control or memory-access operations, so the corresponding basic hardware calculators generated from the computation operations are connected inside the corresponding modules, and the connections between modules are realized by the input and output buffers. As shown in fig. 6, in the embodiment of the invention the on-chip instantiated pipelines mainly include a Read active vertex set pipeline, a Read edge data pipeline, a Read destination data pipeline, a task-scheduling (Processing scheduling) pipeline, an Edge process pipeline, a Merge process result pipeline, and an Update vertex data and active vertex set pipeline; the actual pipeline is divided into 17 IR modules according to the IR, each supported by a parameterized hardware template.
The invention also provides a high-level synthesis system for the graph-oriented calculation, which is used for executing the steps of the high-level synthesis method for the graph-oriented calculation; the system comprises: the system comprises a graph calculation program generation module, an optimization module, a compiling module, a bottom layer mapping module, a constraint inspection module and a synthesis module;
a graph computation program generation module for generating a graph computation program for describing graph computation tasks according to a predefined target programming model; the target programming model is a functional programming model taking a point as a center, and the target programming model divides the graph calculation task into seven graph operations of reading active point set, reading edge offset, reading edge data, reading target point data, calculating edge data, combining calculation results and updating results and active point sets;
the optimization module is used for appointing architecture parameters and micro-architecture parameters for the graph calculation task by adding an optimization instruction; the architecture parameters comprise edge processing operation and updating operation related to graph calculation; the micro-architecture parameters include parallelism and data bit width;
the compiling module is used for compiling the graph calculation program into a modularized data flow intermediate representation according to a pre-designed data flow graph and the added optimization instruction; the data flow diagram decomposes each diagram operation into one or more IR modules and describes the connection relation among the IR modules; each IR module represents a node in the corresponding dataflow graph, and each IR module has a corresponding parameterized hardware template support;
the bottom layer mapping module is used for mapping the compiled data stream intermediate representation to a bottom layer framework according to the mapping relation between the IR module and the hardware template, and instantiating a pipeline and a buffer area in the corresponding hardware template according to the specified parallelism and data bit width;
the constraint checking module is used for judging whether each instantiated parameterized hardware template and the whole framework meet predefined constraint conditions or not and modifying the optimization instruction when judging that the constraint conditions are not met;
the comprehensive module is used for generating a comprehensive hardware language code according to each parameterized hardware template and the whole framework when each instantiated parameterized hardware template and the whole framework meet predefined constraint conditions;
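As an aid to understanding the point-centric functional programming model, the following is a minimal sketch of how a PageRank-style graph computation task could be described against the seven graph operations. The patent does not fix a surface syntax, so the function names, the damping constants, and the convergence threshold below are all illustrative assumptions:

```python
# Hypothetical PageRank-style task in a point-centric functional style;
# every name and constant below is illustrative, not the patent's syntax.

def edge_process(src_value, edge_weight):
    # "computing edge data": produce an update value along one edge
    return src_value * edge_weight

def merge(a, b):
    # "merging computation results": combine two update values
    return a + b

def update(old_value, merged):
    # "updating results and the active point set": return the new vertex
    # value and a flag deciding whether the vertex stays active
    new_value = 0.15 + 0.85 * merged
    return new_value, abs(new_value - old_value) > 1e-6
```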
In this embodiment of the present invention, the detailed implementation of each module may refer to the description of the method embodiment above and will not be repeated here.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (7)

1. A graph-computation-oriented high-level synthesis method, characterized by comprising the following steps:
(1) generating a graph computation program for describing graph computation tasks according to a predefined target programming model;
the target programming model is a point-centric functional programming model, and the target programming model divides a graph computation task into seven graph operations: reading the active point set, reading edge offsets, reading edge data, reading target point data, computing edge data, merging computation results, and updating results and the active point set;
(2) assigning architecture parameters and micro-architecture parameters for the graph computation task by adding optimization instructions;
the architecture parameters comprise edge processing operations and updating operations related to graph calculation; the micro-architecture parameters include parallelism and data bit width;
(3) compiling the graph computation program into a modular dataflow intermediate representation according to a pre-designed dataflow graph and the added optimization instructions;
the dataflow graph decomposes each graph operation into one or more IR modules and describes the connection relations among the IR modules; each IR module corresponds to a node on the dataflow graph and is supported by a corresponding parameterized hardware template;
(4) mapping the compiled dataflow intermediate representation onto the underlying architecture according to the mapping relation between the IR modules and the hardware templates, and instantiating the pipelines and buffers in the corresponding hardware templates according to the specified parallelism and data bit width;
(5) if each instantiated parameterized hardware template and the overall architecture meet the predefined constraint conditions, going to step (6); otherwise, modifying the optimization instructions and returning to step (3);
(6) generating synthesizable hardware language code according to each instantiated parameterized hardware template and the overall architecture;
each IR module is provided with an input buffer and an output buffer; the input buffer is used for receiving the data transmitted by the previous IR module and indicating its overflow condition, and the output buffer is used for storing the results generated by the current IR module for the next IR module to read and for generating a control signal according to the overflow condition of the buffer;
the connections between the IR modules are realized by the input buffers and output buffers;
the dataflow graph includes 17 IR modules M1~M17;
the IR module M1 is used for performing the graph operation of reading the active point set; the IR module M1 generates n source nodes in each clock cycle according to the read-point parallelism n;
the IR modules M2~M3 are used for performing the graph operation of reading edge offsets; the IR module M2 reads n source nodes from the IR module M1 according to the read-point parallelism n to transmit the source node information, and generates edge-offset access requests for the read source nodes in sequence; the IR module M3 processes the edge-offset access requests obtained from the IR module M2 so as to read the edge offsets;
the IR modules M3~M8 are used for performing the graph operation of reading edge data; the IR module M4 receives edge offsets from the IR module M3 while reading the n source nodes from the IR module M2 for continued transmission; the IR module M5 reads the edge offsets and the source node data from the IR module M4 and matches them, thereby attaching the corresponding edge offsets to the source node data; the IR module M6, after reading source nodes and the corresponding edge offsets from the IR module M5, generates m edge-data access requests in sequence according to the read-edge parallelism m while continuing to transmit the source node information; the IR module M7 reads the edge-data access requests and the source node information from the IR module M6 and marks edges that do not belong to a transmitted source node as invalid edges, generating edge control information; the IR module M3 also processes the edge-data access requests obtained from the IR module M6 so as to read the edge data; the IR module M8 receives the edge data from the IR module M3 and the source node information and edge control information from the IR module M7, and transmits all three kinds of information;
the IR modules M9~M13 are used for performing the graph operation of reading target point data; the IR module M9, after receiving the edge data, source node information and edge control information from the IR module M8, generates target-node access requests according to the edge data while transmitting the source node information and edge control information; the IR module M10 schedules the target-node access requests read from the IR module M9 so as to improve memory-access throughput; the IR module M11 processes the scheduled target-node access requests read from the IR module M10 so as to read the target node data; the IR module M12, after reading the target node data from the IR module M11, sorts the target nodes according to their source nodes; the IR module M13 reads the source node information, the sorted target node data and the edge control information from the IR module M12 and passes them on;
the IR module M14 is used for performing the graph operation of computing edge data; the IR module M14 reads the target node data from the IR module M13, performs the edge data computation to obtain update values, and transmits the update values together with the source node information and edge control information;
the IR module M15 is used for performing the graph operation of merging computation results; the IR module M15, after reading the update values from the IR module M14, merges the update values according to the read-edge parallelism m and transmits the source node information and edge control information;
the IR modules M16~M17 are used for performing the graph operation of updating results and the active point set; the IR module M16 reads the update value corresponding to each valid source node from the output buffer of the IR module M15 and transmits the source node information; the IR module M17 reads the update results of the valid source nodes from the output buffer of the IR module M16, merges the update values belonging to the same valid source node, and writes the merged values back to on-chip storage;
wherein the read-point parallelism n and the read-edge parallelism m are micro-architecture parameters specified by the optimization instructions; among the source nodes transmitted by the IR module M6, a source node whose edges belong to the generated m edge-data access requests is a valid source node, and the rest are invalid source nodes used for padding.
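For illustration, the merging performed by modules M15~M17 above can be modeled behaviorally as follows. This Python sketch is an assumption-laden rendering of the claim text, not the hardware: the (source_node, value, valid) triple format, the function names, and the `storage` dict standing in for on-chip memory are all invented here:

```python
def merge_cycle(lane_updates, merge):
    """Module M15, behaviorally: combine the m update values of one cycle.
    `lane_updates` holds (source_node, value, valid) triples; invalid
    lanes are padding and are skipped."""
    out = {}
    for node, value, valid in lane_updates:
        if valid:
            out[node] = merge(out[node], value) if node in out else value
    return out

def write_back(cycles, merge, storage):
    """Module M17, behaviorally: combine update values that belong to the
    same valid source node across cycles, then write the results back."""
    acc = {}
    for lane_updates in cycles:
        for node, value in merge_cycle(lane_updates, merge).items():
            acc[node] = merge(acc[node], value) if node in acc else value
    storage.update(acc)
```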
2. The graph-computation-oriented high-level synthesis method of claim 1, wherein each IR module stores the results it produces and the information it needs to pass on in its output buffer;
when an IR module Md reads data from another IR module Ms, the IR module Md first reads the data in the output buffer of the IR module Ms into the input buffer of the IR module Md, and then the IR module Md reads the data from its own input buffer;
wherein the IR module Md and the IR module Ms are two different IR modules among the IR modules M1~M17.
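A minimal behavioral model of this two-step read protocol is sketched below; the class and function names are invented for illustration, and Python deques stand in for the hardware buffers:

```python
from collections import deque

class IRModuleBuffers:
    """Minimal model of one IR module's buffer pair (names invented)."""
    def __init__(self):
        self.in_buf = deque()   # filled from the upstream output buffer
        self.out_buf = deque()  # holds results for the downstream module

def read_from(md, ms):
    """Claim 2's two-step read: Md first moves a datum from Ms's output
    buffer into its own input buffer, then consumes from there."""
    if ms.out_buf:
        md.in_buf.append(ms.out_buf.popleft())          # step 1: move
    return md.in_buf.popleft() if md.in_buf else None   # step 2: consume
```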
3. The graph-computation-oriented high-level synthesis method of claim 2, wherein the IR module M6, after reading source nodes and the corresponding edge offsets from the IR module M5, generating m edge-data access requests in sequence according to the read-edge parallelism m and continuing to transmit the source node information comprises:
the IR module M6, after reading source nodes and the corresponding edge offsets from the IR module M5, reads n source nodes from its input buffer and obtains the maximum edge offset e of the read source nodes and the value c of the read-edge counter;
if c + m equals e, transmitting the n source nodes and m edge-data access requests, and increasing the read-edge counter by m; the transmitted n source nodes are removed from the input buffer of the IR module M6;
if c + m is larger than e, transmitting the n source nodes and m edge-data access requests while keeping the read-edge counter unchanged; the transmitted n source nodes are removed from the input buffer of the IR module M6;
if c + m is smaller than e and there exists a source node v whose right edge offset e_rv equals c + m, transmitting the source node v and the source nodes numbered smaller than v, padding with invalid source nodes so that n source nodes are transmitted at a time, transmitting m edge-data access requests, and increasing the read-edge counter by m; the source node v and the source nodes numbered smaller than v are removed from the input buffer of the IR module M6;
if c + m is smaller than e and there exists a source node u whose left edge offset e_lu is smaller than c + m and whose right edge offset e_ru is larger than c + m, transmitting the source node u and the source nodes numbered smaller than u, padding with invalid source nodes so that n source nodes are transmitted at a time, transmitting m edge-data access requests, and increasing the read-edge counter by m; the source nodes numbered smaller than u are removed from the input buffer of the IR module M6;
wherein the initial value of the read-edge counter is 0.
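The four cases of this claim can be rendered as one behavioral decision step. In the sketch below, `window` models the n source nodes in M6's input buffer in number order, and the dict fields 'id', 'left', and 'right' (the left and right edge offsets) are names invented here; the final fallback line is unreachable when edge offsets are contiguous and is kept only so the function always returns:

```python
def generate_requests(window, c, m, n):
    """One decision step of module M6 (behavioral sketch of claim 3).
    Returns (transmitted nodes, new counter value, nodes to remove
    from the input buffer); all field names are illustrative."""
    pad = {"id": None, "left": 0, "right": 0}       # invalid filler node
    e = max(v["right"] for v in window)             # maximum edge offset

    if c + m == e:
        return window, c + m, window                # case 1
    if c + m > e:
        return window, c, window                    # case 2: counter unchanged
    for v in window:                                # c + m < e
        if v["right"] == c + m:                     # case 3: v finishes exactly
            done = [u for u in window if u["id"] <= v["id"]]
            return done + [pad] * (n - len(done)), c + m, done
        if v["left"] < c + m < v["right"]:          # case 4: v stays buffered
            done = [u for u in window if u["id"] < v["id"]]
            return done + [v] + [pad] * (n - len(done) - 1), c + m, done
    return [pad] * n, c + m, []                     # unreachable for contiguous offsets
```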
4. The graph-computation-oriented high-level synthesis method of claim 1, wherein the scheduling performed by the IR module M10 after reading target-node access requests from the IR module M9 comprises:
the IR module M10, after reading a target-node access request from the IR module M9, obtains the on-chip point-data partition to which the request belongs according to the request address and dispatches the request to the request buffer corresponding to that partition;
wherein the on-chip storage is divided into m on-chip point-data partitions according to the read-edge parallelism m, the m request buffers correspond one-to-one to the m on-chip point-data partitions, and in each clock cycle access requests are issued to the m on-chip point-data partitions through the m request buffers.
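Behaviorally, this scheduling amounts to banking requests by partition so that up to m of them issue per cycle. In the sketch below, the mapping `addr % m` from request address to partition index is an assumption; the claim only states that the partition is derived from the request address:

```python
def schedule(requests, m):
    """Behavioral sketch of module M10: bank target-node access requests
    by on-chip point-data partition; each yielded list is one cycle's
    worth of requests, at most one per partition."""
    buffers = [[] for _ in range(m)]        # one request buffer per partition
    for addr in requests:
        buffers[addr % m].append(addr)      # partition choice is an assumption
    while any(buffers):
        yield [buf.pop(0) for buf in buffers if buf]
```

For example, `list(schedule([0, 1, 2, 5, 9], 4))` issues requests 0, 1, 2, 5 in the first cycle (four distinct partitions) and request 9 in the second, since 9 maps to the same partition as 1 and 5.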
5. The graph-computation-oriented high-level synthesis method of claim 1, wherein if the number of edges is greater than a preset threshold, the edge data is stored in off-chip DRAM, and the IR module M3 reads the edge data from the off-chip DRAM when processing edge-data access requests.
6. The graph-computation-oriented high-level synthesis method according to any one of claims 1 to 5, further comprising:
(7) converting the obtained synthesizable hardware language code into a bitstream file and running it on an FPGA development board;
(8) if the performance does not meet the preset performance requirement, modifying the optimization instructions, re-executing steps (3)-(6) to obtain synthesizable hardware language code again, and going to step (7); otherwise, ending the operation.
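Putting steps (3) through (8) together, the overall flow is a refine-until-satisfied loop. The following driver is a sketch under the assumption that each toolchain stage is available as a callable; none of these names correspond to a real API:

```python
def run_flow(program, directives, stages):
    """Driver for steps (3)-(8); every entry of `stages` is a callable
    standing in for a toolchain stage -- hypothetical, not a real API."""
    while True:
        ir = stages["compile"](program, directives)   # step (3)
        arch = stages["map"](ir, directives)          # step (4)
        if not stages["check"](arch):                 # step (5) fails
            directives = stages["refine"](directives)
            continue                                  # back to step (3)
        hdl = stages["emit"](arch)                    # step (6)
        perf = stages["run"](hdl)                     # step (7): board run
        if stages["ok"](perf):                        # step (8)
            return hdl
        directives = stages["refine"](directives)
```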
7. A graph-computation-oriented high-level synthesis system, characterized by comprising: a graph computation program generation module, an optimization module, a compiling module, an underlying mapping module, a constraint checking module, and a synthesis module;
the graph computation program generation module is used for generating a graph computation program for describing a graph computation task according to a predefined target programming model; the target programming model is a point-centric functional programming model, and the target programming model divides a graph computation task into seven graph operations: reading the active point set, reading edge offsets, reading edge data, reading target point data, computing edge data, merging computation results, and updating results and the active point set;
the optimization module is used for specifying architecture parameters and micro-architecture parameters for the graph computation task by adding optimization instructions; the architecture parameters comprise the edge processing operations and update operations related to graph computation; the micro-architecture parameters comprise the parallelism and the data bit width;
the compiling module is used for compiling the graph computation program into a modular dataflow intermediate representation according to a pre-designed dataflow graph and the added optimization instructions; the dataflow graph decomposes each graph operation into one or more IR modules and describes the connection relations among the IR modules; each IR module corresponds to a node on the dataflow graph and is supported by a corresponding parameterized hardware template;
the underlying mapping module is used for mapping the compiled dataflow intermediate representation onto the underlying architecture according to the mapping relation between the IR modules and the hardware templates, and for instantiating the pipelines and buffers in the corresponding hardware templates according to the specified parallelism and data bit width;
the constraint checking module is used for judging whether each instantiated parameterized hardware template and the overall architecture meet the predefined constraint conditions, and for modifying the optimization instructions when they do not;
the synthesis module is used for generating synthesizable hardware language code according to each parameterized hardware template and the overall architecture when each instantiated parameterized hardware template and the overall architecture meet the predefined constraint conditions;
each IR module is provided with an input buffer and an output buffer; the input buffer is used for receiving the data transmitted by the previous IR module and indicating its overflow condition, and the output buffer is used for storing the results generated by the current IR module for the next IR module to read and for generating a control signal according to the overflow condition of the buffer;
the connections between the IR modules are realized by the input buffers and output buffers;
the dataflow graph includes 17 IR modules M1~M17;
the IR module M1 is used for performing the graph operation of reading the active point set; the IR module M1 generates n source nodes in each clock cycle according to the read-point parallelism n;
the IR modules M2~M3 are used for performing the graph operation of reading edge offsets; the IR module M2 reads n source nodes from the IR module M1 according to the read-point parallelism n to transmit the source node information, and generates edge-offset access requests for the read source nodes in sequence; the IR module M3 processes the edge-offset access requests obtained from the IR module M2 so as to read the edge offsets;
the IR modules M3~M8 are used for performing the graph operation of reading edge data; the IR module M4 receives edge offsets from the IR module M3 while reading the n source nodes from the IR module M2 for continued transmission; the IR module M5 reads the edge offsets and the source node data from the IR module M4 and matches them, thereby attaching the corresponding edge offsets to the source node data; the IR module M6, after reading source nodes and the corresponding edge offsets from the IR module M5, generates m edge-data access requests in sequence according to the read-edge parallelism m while continuing to transmit the source node information; the IR module M7 reads the edge-data access requests and the source node information from the IR module M6 and marks edges that do not belong to a transmitted source node as invalid edges, generating edge control information; the IR module M3 also processes the edge-data access requests obtained from the IR module M6 so as to read the edge data; the IR module M8 receives the edge data from the IR module M3 and the source node information and edge control information from the IR module M7, and transmits all three kinds of information;
the IR modules M9~M13 are used for performing the graph operation of reading target point data; the IR module M9, after receiving the edge data, source node information and edge control information from the IR module M8, generates target-node access requests according to the edge data while transmitting the source node information and edge control information; the IR module M10 schedules the target-node access requests read from the IR module M9 so as to improve memory-access throughput; the IR module M11 processes the scheduled target-node access requests read from the IR module M10 so as to read the target node data; the IR module M12, after reading the target node data from the IR module M11, sorts the target nodes according to their source nodes; the IR module M13 reads the source node information, the sorted target node data and the edge control information from the IR module M12 and passes them on;
the IR module M14 is used for performing the graph operation of computing edge data; the IR module M14 reads the target node data from the IR module M13, performs the edge data computation to obtain update values, and transmits the update values together with the source node information and edge control information;
the IR module M15 is used for performing the graph operation of merging computation results; the IR module M15, after reading the update values from the IR module M14, merges the update values according to the read-edge parallelism m and transmits the source node information and edge control information;
the IR modules M16~M17 are used for performing the graph operation of updating results and the active point set; the IR module M16 reads the update value corresponding to each valid source node from the output buffer of the IR module M15 and transmits the source node information; the IR module M17 reads the update results of the valid source nodes from the output buffer of the IR module M16, merges the update values belonging to the same valid source node, and writes the merged values back to on-chip storage;
wherein the read-point parallelism n and the read-edge parallelism m are micro-architecture parameters specified by the optimization instructions; among the source nodes transmitted by the IR module M6, a source node whose edges belong to the generated m edge-data access requests is a valid source node, and the rest are invalid source nodes used for padding.
CN201910842736.6A 2019-09-06 2019-09-06 High-level synthesis method and system for graph calculation Active CN110750265B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910842736.6A CN110750265B (en) 2019-09-06 2019-09-06 High-level synthesis method and system for graph calculation


Publications (2)

Publication Number Publication Date
CN110750265A CN110750265A (en) 2020-02-04
CN110750265B true CN110750265B (en) 2021-06-11






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant