CN110750265B - High-level synthesis method and system for graph calculation - Google Patents

High-level synthesis method and system for graph calculation

Info

Publication number
CN110750265B
CN110750265B (application CN201910842736.6A)
Authority
CN
China
Prior art keywords
module
data
graph
edge
source node
Prior art date
Legal status
Active
Application number
CN201910842736.6A
Other languages
Chinese (zh)
Other versions
CN110750265A (en)
Inventor
廖小飞
汤嘉武
郑龙
金海
陈绍鹏
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201910842736.6A
Publication of CN110750265A
Application granted
Publication of CN110750265B
Active legal status (current)
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation

Abstract

The invention discloses a high-level synthesis method and system for graph computation, belonging to the field of big data processing. The method comprises the following steps: (1) generating a graph computation program according to a point-centric functional programming model; (2) specifying architecture parameters and micro-architecture parameters by adding optimization instructions; (3) compiling the graph computation program into a modular dataflow intermediate representation according to the dataflow graph and the optimization instructions; (4) mapping the dataflow intermediate representation onto the underlying architecture according to the mapping between IR modules and hardware templates, and instantiating the pipelines and buffers in the hardware templates; (5) if the instantiated parameterized hardware templates and the overall architecture meet the constraint conditions, going to step (6); otherwise, modifying the optimization instructions and returning to step (3); (6) generating synthesizable hardware-language code. The method provides effective support for generating RTL from an upper-level language, improving the parallelism of graph computation executed on an FPGA.

Description

High-level synthesis method and system for graph calculation
Technical Field
The invention belongs to the field of big data processing, and particularly relates to a high-level synthesis method and system for graph-oriented computation.
Background
In the last decade, graph applications have become increasingly important with the rise of big-data analysis problems such as biological information networks, social networks, and web graphs. Graphs are the best expression of the association attributes of big data, and graph computation is the process of mining and analyzing massive, sparse, hyper-dimensional associations based on graph patterns. Machine learning and deep learning over big data both depend on graph computation, which has become one of the mainstream modes of big data processing.
Graph computation has a complex and irregular nature, which presents new challenges to current hardware. On a general-purpose Central Processing Unit (CPU), instruction-level parallelism is abnormally low even for well-optimized graph algorithms: mostly below 1.0, and often below 0.5. Throughput-oriented architectures such as Graphics Processing Units (GPUs) execute in a Single-Instruction Multiple-Data (SIMD) manner, but the power-law distribution of graph data and the irregularity of graph algorithms are inherently unfriendly to the SIMD mode, causing load imbalance and low bandwidth utilization; studies show that the GPU is fully utilized less than 16% of the time. Reconfigurable hardware such as the Field-Programmable Gate Array (FPGA) has received great attention for its low power consumption and reconfigurability, and researchers have designed a variety of effective graph-computation architectures on FPGAs; however, owing to the complexity of graph computation and the high barrier of hardware programming, writing complete graph-computation hardware code is extremely time-consuming even for professional researchers.
To relieve FPGA developers of tedious hardware details, High-Level Synthesis (HLS) systems have been proposed. An HLS system converts a program written in a high-level language (mostly C/C++) into Register-Transfer Level (RTL) code (e.g., Verilog or VHDL) and provides various optimization means, so that developers can optimize the hardware structure from the high-level-language layer; some systems also provide a visual view for conveniently analyzing circuit behavior in each clock cycle, further improving the performance of the generated RTL. These optimizations, however, are general-purpose and are not tailored to the dependences and conflicts peculiar to graph workloads; in summary, existing HLS systems fail to provide effective support for high-parallelism execution of graph computation on FPGAs.
Disclosure of Invention
In view of the defects and improvement requirements of the prior art, the invention provides a high-level synthesis method for graph computation, aiming to provide effective support for generating graph-application RTL from an upper-level language, so as to improve the parallelism of graph computation executed on an FPGA.
To achieve the above object, according to an aspect of the present invention, there is provided a high-level synthesis method for graph-oriented computation, including:
(1) generating a graph computation program for describing graph computation tasks according to a predefined target programming model;
the target programming model is a point-centric functional programming model that divides the graph computation task into seven graph operations: reading the active point set, reading edge offsets, reading edge data, reading target point data, edge data computation, merging computation results, and updating results and the active point set (an illustrative sketch of such a program follows step (6));
(2) assigning architecture parameters and micro-architecture parameters for the graph computation task by adding optimization instructions;
the architecture parameters comprise edge processing operation and updating operation related to graph calculation; the micro-architecture parameters include parallelism and data bit width;
(3) compiling the graph computation program into a modular dataflow intermediate representation according to a pre-designed dataflow graph and the added optimization instructions;
the data flow diagram decomposes each diagram operation into one or more IR modules and describes the connection relation among the IR modules; each IR module corresponds to a node in the dataflow graph, and each IR module is supported by a corresponding parameterized hardware template;
(4) mapping the compiled dataflow intermediate representation onto the bottom-layer architecture according to the mapping between IR modules and hardware templates, and instantiating the pipelines and buffers in the corresponding hardware templates according to the specified parallelism and data bit width;
(5) if the instantiated parameterized hardware templates and the overall architecture meet the predefined constraint conditions, going to step (6); otherwise, modifying the optimization instructions and returning to step (3);
(6) generating synthesizable hardware-language code according to the instantiated parameterized hardware templates and the overall architecture.
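For illustration only, the following C++ sketch shows what a program under the point-centric functional model of step (1) might look like, using PageRank as the example; all identifiers are assumptions of this rendering, not the patent's actual interface, and the four read operations are assumed to be carried out by the framework from the CSR arrays, so the user supplies only the computation hooks:

```cpp
#include <cstdint>
#include <vector>

// Illustrative vertex-centric functional program (PageRank). The three
// hooks correspond to the computation-related operations of step (1):
// edge data computation, merging computation results, and updating
// results; the read operations are assumed to be framework-generated.
struct PageRankProgram {
    std::vector<float>    value;       // per-vertex attribute (Value)
    std::vector<uint32_t> out_degree;  // per-vertex out-degree

    // edge data computation: contribution carried by one (src -> dst) edge
    float edge_process(uint32_t src) const {
        return value[src] / out_degree[src];
    }

    // merging computation results: associative combine of edge updates
    float merge(float a, float b) const { return a + b; }

    // updating results and the active point set: apply the merged value
    float apply(float /*old_value*/, float merged) const {
        const float damping = 0.85f;
        return (1.0f - damping) + damping * merged;
    }
};
```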
At present, no HLS tool can directly generate a high-parallelism, efficient pipeline structure for graph computation from an upper-level high-level language. On one hand, the limited expressiveness makes it difficult for users to accurately describe the required architecture and to specify micro-architecture parameters; on the other hand, once high parallelism is specified for a graph algorithm, a large amount of data dependence and conflict arises, and without sufficient bottom-layer optimization to realize the storage and computation structures that high parallelism requires, the finally generated hardware circuit either actually executes serially or cannot be generated at all due to resource exhaustion.
According to a functional programming model, the method divides the graph computation task into seven graph operations (reading the active point set, reading edge offsets, reading edge data, reading target point data, edge data computation, merging computation results, and updating results and the active point set), decomposes each graph operation into one or more IR modules through a dataflow graph, and defines the mapping between the IR modules and parameterized bottom-layer hardware templates, improving the upper-level language's ability to describe bottom-layer hardware and providing effective support for generating graph-application RTL from the upper-level language. In addition, while the graph computation program is compiled into the dataflow IR and mapped onto the bottom-layer architecture, micro-architecture parameters such as parallelism specified by the optimization instructions are propagated down and finally act on the bottom-layer architecture, so the hardware structure can be optimized from the high-level-language layer, effectively improving the parallelism of graph computation executed on the FPGA.
Furthermore, each IR module is provided with an input buffer and an output buffer: the input buffer receives data transmitted by the previous module and indicates its overflow condition, while the output buffer stores results produced by the current IR module for the next IR module to read and generates a control signal according to the buffer's overflow condition;
the connections between IR modules are realized by the input and output buffers.
By providing each IR module with its own input and output buffer, the invention effectively reduces pipeline stalls.
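The buffer pair can be modeled in software as follows; this is a minimal sketch under the assumption of a simple FIFO with backpressure, since the patent specifies the behavior (overflow indication and control signal) but not this exact structure:

```cpp
#include <cstddef>
#include <queue>

// Software model of a per-module stream buffer. The full() predicate
// plays the role of the overflow-derived control signal: when the
// downstream buffer is full, the upstream module stalls instead of
// dropping data, which is how pipeline stalls are kept local and short.
template <typename T>
class StreamBuffer {
    std::queue<T> q_;
    std::size_t   capacity_;
public:
    explicit StreamBuffer(std::size_t cap) : capacity_(cap) {}
    bool full()  const { return q_.size() >= capacity_; }  // overflow indication
    bool empty() const { return q_.empty(); }
    bool push(const T& v) {            // producer writes, observing backpressure
        if (full()) return false;      // control signal: producer must stall
        q_.push(v);
        return true;
    }
    bool pop(T& out) {                 // consumer (next IR module) reads
        if (empty()) return false;
        out = q_.front();
        q_.pop();
        return true;
    }
};
```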
Further, the dataflow graph includes 17 IR modules M1~M17.
IR module M1 performs the read-active-point-set graph operation: in each clock cycle, M1 generates n source nodes according to the read-point parallelism n;
IR modules M2~M3 perform the read-edge-offset graph operation: M2 reads n source nodes from M1 according to the read-point parallelism n, transmits the source-node information, and generates edge-offset memory-access requests for the read source nodes in order; M3 processes the edge-offset access requests obtained from M2 so as to read the edge offsets;
IR modules M3~M8 perform the read-edge-data graph operation: M4 receives the edge offsets from M3 while reading the n source nodes from M2 for continued transmission; M5 reads the edge offsets and source-node data from M4 and matches them, attaching the corresponding edge offset to each source node's data; after reading the source nodes and corresponding edge offsets from M5, M6 generates m edge-data access requests in order according to the read-edge parallelism m and continues transmitting the source-node information; M7 reads the edge-data access requests and source-node information from M6, marks edges that do not belong to a transmitted source node as invalid, and generates edge control information; M3 also processes the edge-data access requests obtained from M6 so as to read the edge data; M8 receives the edge data from M3 and the source-node information and edge control information from M7, and transmits all three;
IR modules M9~M13 perform the read-target-point-data graph operation: after receiving the edge data, source-node information, and edge control information from M8, M9 generates target-node access requests according to the edge data and transmits the source-node information and edge control information; M10 schedules the target-node access requests read from M9 so as to improve memory-access throughput; M11 processes the scheduled target-node access requests read from M10 so as to read the target-node data; after reading the target-node data from M11, M12 orders the target nodes by source node; M13 forwards the source-node information, the ordered target-node data, and the edge control information read from M12;
IR module M14 performs the edge-data-computation graph operation: M14 reads the target-node data from M13, performs the edge-data computation to obtain an update value, and transmits the update value together with the source-node information and edge control information;
IR module M15 performs the merge-computation-results graph operation: after reading the update values from M14, M15 merges them according to the read-edge parallelism m and transmits the source-node information and edge control information;
IR modules M16~M17 perform the update-results-and-active-point-set graph operation: M16 reads, from M15's output buffer, the update value corresponding to each valid source node and transmits the source-node information; M17 reads the valid source nodes' update results from M16's output buffer, merges the update values belonging to the same valid source node, and writes the merged values back to on-chip storage;
where the read-point parallelism n and the read-edge parallelism m are micro-architecture parameters specified by the optimization instructions, and among the source nodes transmitted by M6, those having edges among the m generated edge-data access requests are valid source nodes, while the rest are invalid source nodes used for padding.
Conventional HLS systems, in order to optimize the bottom-layer hardware structure, compile the upper-level language into a fine-grained intermediate representation during compilation and then mine the parallel opportunities within it; the various optimization means developed for loops act on all operations contained in a loop, and thus can neither specifically and practically resolve the dependences and conflicts of the individual graph operations nor generate dedicated support for each operation.
The dataflow graph provided by the invention defines the above 17 IR modules to effectively support the 7 main graph operations in graph computation tasks, so that a graph computation task can be displayed and expressed modularly at a higher level of abstraction, optimization support is provided precisely, a large number of data conflicts are avoided, and the execution parallelism of graph computation is improved.
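For reference, the module decomposition and connections described above can be summarized in the following illustrative C++ encoding; the stage names are paraphrases of the module functions rather than the patent's terminology, and the pair list is derived from the textual description of Fig. 4:

```cpp
// One pipeline stage per IR module; edges are the buffer connections.
enum IRModule {
    M1_ReadActiveSet = 1, M2_EmitOffsetReq,   M3_MemoryAccess,
    M4_RecvOffset,        M5_MatchOffset,     M6_EmitEdgeReq,
    M7_MarkInvalidEdges,  M8_JoinEdgeData,    M9_EmitDstReq,
    M10_ScheduleDstReq,   M11_ReadDstData,    M12_ReorderBySrc,
    M13_ForwardDst,       M14_EdgeCompute,    M15_MergeUpdates,
    M16_ReadValidUpdates, M17_CombineWriteBack
};

// Producer -> consumer connections as described in the text; note that
// the memory-access module M3 serves both the edge-offset requests from
// M2 and the edge-data requests from M6.
const int connections[][2] = {
    {M1_ReadActiveSet, M2_EmitOffsetReq}, {M2_EmitOffsetReq, M3_MemoryAccess},
    {M3_MemoryAccess, M4_RecvOffset},     {M2_EmitOffsetReq, M4_RecvOffset},
    {M4_RecvOffset, M5_MatchOffset},      {M5_MatchOffset, M6_EmitEdgeReq},
    {M6_EmitEdgeReq, M7_MarkInvalidEdges},{M6_EmitEdgeReq, M3_MemoryAccess},
    {M3_MemoryAccess, M8_JoinEdgeData},   {M7_MarkInvalidEdges, M8_JoinEdgeData},
    {M8_JoinEdgeData, M9_EmitDstReq},     {M9_EmitDstReq, M10_ScheduleDstReq},
    {M10_ScheduleDstReq, M11_ReadDstData},{M11_ReadDstData, M12_ReorderBySrc},
    {M12_ReorderBySrc, M13_ForwardDst},   {M13_ForwardDst, M14_EdgeCompute},
    {M14_EdgeCompute, M15_MergeUpdates},  {M15_MergeUpdates, M16_ReadValidUpdates},
    {M16_ReadValidUpdates, M17_CombineWriteBack},
};
```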
Further, each IR module stores its produced results and the information to be passed on into its own output buffer.
When an IR module Md reads data from another IR module Ms, the data in Ms's output buffer is first read into Md's input buffer, and Md then reads the data from its input buffer;
where Md and Ms are any two different modules among IR modules M1~M17.
Further, M6 reading the source nodes and corresponding edge offsets from M5, generating m edge-data access requests in order according to the read-edge parallelism m, and continuing to transmit the source-node information comprises:
after reading the source nodes and corresponding edge offsets from M5, M6 reads n source nodes from its input buffer and obtains the maximum edge offset e corresponding to the read source nodes and the value c of the read-edge counter;
if c + m = e, transmitting the n source nodes and m edge-data access requests, increasing the read-edge counter by m, and removing the n transmitted source nodes from M6's input buffer;
if c + m > e, transmitting the n source nodes and m edge-data access requests while keeping the read-edge counter unchanged, and removing the n transmitted source nodes from M6's input buffer;
if c + m < e and there exists a source node v whose right offset e_rv = c + m, transmitting source node v and the source nodes numbered below it, padding with invalid source nodes so that n source nodes are transmitted at once, transmitting m edge-data access requests, increasing the read-edge counter by m, and removing source node v and the source nodes numbered below it from M6's input buffer;
if c + m < e and there exists a source node u whose left offset e_lu < c + m and whose right offset e_ru > c + m, transmitting source node u and the source nodes numbered below it, padding with invalid source nodes so that n source nodes are transmitted at once, transmitting m edge-data access requests, increasing the read-edge counter by m, and removing only the source nodes numbered below u from M6's input buffer;
where the initial value of the read-edge counter is 0.
With the above optimized scheduling, M6 transmits n source nodes and m edge-data access requests in every clock cycle, which simplifies the bottom-layer hardware implementation. When the degree of a source node is small, i.e., the transmitted edge-data access requests contain edges that do not belong to that source node, all edges of several low-degree nodes can be processed together, realizing inter-node parallelism; when the degree of a source node is large, i.e., the transmitted edge-data access requests do not cover all of that node's edges, multiple edges of the high-degree node are processed, realizing intra-node parallelism. The invention can thus flexibly realize inter-node and intra-node parallelism according to the irregular nature of graph computation tasks, effectively improving resource utilization and execution parallelism.
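A software transcription of M6's per-cycle decision follows; this is a sketch that maps the four cases above directly onto sequential code (in hardware they would be evaluated combinationally within one clock cycle), and the data-structure layout is an assumption:

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <vector>

struct SrcNode { uint32_t id; uint32_t left_off, right_off; bool valid; };
struct M6Output { std::vector<SrcNode> nodes; uint32_t first_edge, num_edges; };

// in_buf: source nodes with attached edge offsets, from M5 (lowest number first)
// n, m  : read-point / read-edge parallelism; c: read-edge counter, initially 0
M6Output schedule_edges(std::deque<SrcNode>& in_buf, uint32_t n, uint32_t m,
                        uint32_t& c) {
    M6Output out{{}, c, m};                 // m edge requests covering [c, c+m)
    uint32_t e = 0;                         // max right offset of the read nodes
    for (uint32_t i = 0; i < n && i < in_buf.size(); ++i)
        e = std::max(e, in_buf[i].right_off);

    auto pad = [&] {                        // fill to n with invalid source nodes
        while (out.nodes.size() < n) out.nodes.push_back({0, 0, 0, false});
    };

    if (c + m >= e) {                       // cases c+m == e and c+m > e
        for (uint32_t i = 0; i < n && !in_buf.empty(); ++i) {
            out.nodes.push_back(in_buf.front());
            in_buf.pop_front();
        }
        pad();
        if (c + m == e) c += m;             // counter unchanged when c+m > e
        return out;
    }
    // c + m < e: emit nodes whose edges are fully covered by [c, c+m) ...
    while (!in_buf.empty() && in_buf.front().right_off <= c + m) {
        out.nodes.push_back(in_buf.front());
        in_buf.pop_front();
    }
    // ... and a node u straddling c+m is emitted but kept in the buffer,
    // so the next round of transmission resumes from u.
    if (!in_buf.empty() && in_buf.front().left_off < c + m)
        out.nodes.push_back(in_buf.front());
    pad();
    c += m;
    return out;
}
```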
Further, M10 scheduling the target-node access requests read from M9 comprises:
after reading a target-node access request from M9, M10 determines, from the request address, the on-chip point-data partition to which the request belongs and distributes the request to the request buffer corresponding to that partition;
where on-chip storage is divided into m on-chip point-data partitions according to the read-edge parallelism m, the m request buffers correspond one-to-one to the m partitions, and in each clock cycle access requests are issued to the m partitions through the m request buffers.
Because reading target point data involves a large amount of random memory access, the above scheduling optimization, which partitions on-chip storage according to the read-edge parallelism and generates a corresponding number of request buffers, reduces access conflicts and guarantees higher throughput.
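A sketch of M10's scheduler is given below; the partition-selection rule (simple address interleaving) is an assumption, since the patent only states that the partition is derived from the request address:

```cpp
#include <cstdint>
#include <queue>
#include <vector>

struct DstRequest { uint32_t addr; };

// On-chip vertex storage is split into m partitions, each with its own
// request buffer, so up to m conflict-free accesses issue per cycle.
class RequestScheduler {
    std::vector<std::queue<DstRequest>> buffers_;  // one per partition
public:
    explicit RequestScheduler(uint32_t m) : buffers_(m) {}

    // Distribute a request to the buffer of the partition it addresses
    // (interleaved partitioning assumed here for illustration).
    void dispatch(const DstRequest& r) {
        buffers_[r.addr % buffers_.size()].push(r);
    }

    // One access per partition per cycle; independent banks avoid conflicts.
    std::vector<DstRequest> issue_cycle() {
        std::vector<DstRequest> issued;
        for (auto& b : buffers_)
            if (!b.empty()) { issued.push_back(b.front()); b.pop(); }
        return issued;
    }
};
```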
Further, if the number of edges exceeds a preset threshold, the edge data is stored in off-chip DRAM, and IR module M3 reads edge data from the off-chip DRAM when processing edge-data access requests.
Conventional HLS systems use arrays to represent storage structures, and in practice arrays are difficult to map to the various hardware storage structures required for graph computation; the invention supports off-chip data transfer and can optimize the storage structure.
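The placement rule reduces to a single comparison; the following snippet illustrates it, with the capacity threshold left as a parameter since, per the text, it depends on the actual on-chip storage:

```cpp
#include <cstdint>

enum class EdgeStorage { OnChipBRAM, OffChipDRAM };

// Edges beyond on-chip capacity go to off-chip DRAM; M3 then issues
// DRAM reads when serving edge-data access requests.
EdgeStorage place_edges(uint64_t num_edges, uint64_t bram_edge_capacity) {
    return num_edges > bram_edge_capacity ? EdgeStorage::OffChipDRAM
                                          : EdgeStorage::OnChipBRAM;
}
```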
Further, the high-level synthesis method for graph-oriented computation provided by the present invention further includes:
(7) converting the obtained synthesizable hardware-language code into a bitstream file and running it on an FPGA development board;
(8) if the performance does not meet the preset performance requirement, modifying the optimization instructions, executing steps (3)-(6) again to obtain new synthesizable hardware-language code, and returning to step (7); otherwise, the procedure ends.
According to another aspect of the present invention, there is also provided a high-level synthesis system for graph-oriented computation, comprising: a graph-computation-program generation module, an optimization module, a compilation module, a bottom-layer mapping module, a constraint checking module, and a synthesis module;
a graph computation program generation module for generating a graph computation program for describing graph computation tasks according to a predefined target programming model; the target programming model is a functional programming model taking a point as a center, and the target programming model divides the graph calculation task into seven graph operations of reading active point set, reading edge offset, reading edge data, reading target point data, calculating edge data, combining calculation results and updating results and active point sets;
the optimization module is used for appointing architecture parameters and micro-architecture parameters for the graph calculation task by adding an optimization instruction; the architecture parameters comprise edge processing operation and updating operation related to graph calculation; the micro-architecture parameters include parallelism and data bit width;
the compiling module is used for compiling the graph calculation program into a modularized data flow intermediate representation according to a pre-designed data flow graph and the added optimization instruction; the data flow diagram decomposes each diagram operation into one or more IR modules and describes the connection relation among the IR modules; each IR module corresponds to a node in the dataflow graph, and each IR module is supported by a corresponding parameterized hardware template;
the bottom layer mapping module is used for mapping the compiled data stream intermediate representation to a bottom layer framework according to the mapping relation between the IR module and the hardware template, and instantiating a pipeline and a buffer area in the corresponding hardware template according to the specified parallelism and data bit width;
the constraint checking module is used for judging whether each instantiated parameterized hardware template and the whole framework meet predefined constraint conditions or not and modifying the optimization instruction when judging that the constraint conditions are not met;
and the synthesis module is used for generating a synthesizable hardware language code according to each parameterized hardware template and the whole framework when each instantiated parameterized hardware template and the whole framework meet predefined constraint conditions.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
(1) The high-level synthesis method and system for graph computation divide a graph computation task, according to a functional programming model, into seven graph operations (reading the active point set, reading edge offsets, reading edge data, reading target point data, edge data computation, merging computation results, and updating results and the active point set), decompose each graph operation into one or more IR modules through a dataflow graph, and define the mapping between the IR modules and parameterized bottom-layer hardware templates, improving the upper-level language's ability to describe bottom-layer hardware and providing effective support for generating graph-application RTL from the upper-level language. In addition, while the graph computation program is compiled into the dataflow IR and mapped onto the bottom-layer architecture, micro-architecture parameters such as parallelism specified by the optimization instructions are propagated down and finally act on the bottom-layer architecture, so the hardware structure can be optimized from the high-level-language layer. Overall, the method provides effective support for generating graph-application RTL from the upper-level language and improves the parallelism of graph computation executed on the FPGA.
(2) The high-level synthesis method and system for graph computation define, through the dataflow graph, 17 IR modules that effectively support the 7 main graph operations in graph computation tasks, so that graph computation tasks can be displayed and expressed modularly at a higher level of abstraction, optimization support is provided precisely, a large number of data conflicts are avoided, and the execution parallelism of graph computation is improved.
(3) In the high-level synthesis method and system for graph computation, when the degree of a source node is small, all edges of several low-degree nodes can be processed together, realizing inter-node parallelism; when the degree of a source node is large, multiple edges of the high-degree node can be processed, realizing intra-node parallelism. The invention can thus flexibly realize inter-node and intra-node parallelism according to the irregular nature of graph computation tasks, effectively improving resource utilization and execution parallelism.
(4) In the high-level synthesis method and system for graph computation, when processing memory-access requests for target point data, on-chip storage is partitioned into point-data partitions according to the read-edge parallelism and a corresponding number of request buffers is generated, which reduces access conflicts and guarantees higher throughput.
Drawings
FIG. 1 is a flowchart of a high-level synthesis method for graph-oriented computing according to an embodiment of the present invention;
FIG. 2 is a data flow diagram provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a point-centered functional programming model according to an embodiment of the present invention;
FIG. 4 is a block diagram of a data flow framework according to an embodiment of the present invention;
FIG. 5 is a diagram of a prior-art point-centric imperative programming model;
fig. 6 is a schematic diagram of a hardware architecture according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Before explaining the technical scheme of the invention in detail, terms related to graph computation are briefly introduced. A graph data structure is built on primitives reflecting the real world and comprises nodes (vertices, v), edges (e) connecting different nodes, and properties; both nodes and edges may carry properties, which can be data of any type. Compressed Sparse Row (CSR) is an input format for graph-computation programming models that introduces the concept of an edge offset: the offset, within the edge table, of the edges corresponding to the current vertex. It specifically includes a left offset, recording the starting position in the edge table of the vertex's edges, and a right offset, recording their end position. In concrete algorithm implementations an active point set (ActiveVertex) is usually maintained; a target node is denoted u and an attribute value is denoted Value. In the invention, the input format required by the target programming model is CSR. Regarding the execution model, the common execution models in graph computation are the Push model and the Pull model. In the Pull model, each iteration of the graph algorithm schedules all nodes and processes all of their incoming edges. The PageRank algorithm is better suited to the Pull model; a Push-based BFS must obtain source-point and target-point data simultaneously, which increases hardware overhead, and BFS based on Notify-Pull/Pull or Push/Pull shows lower performance.
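The CSR layout just described can be written down as follows; this is the standard encoding, with field names chosen for illustration:

```cpp
#include <cstdint>
#include <vector>

// CSR graph: for vertex v, its edges occupy edges[offset[v] .. offset[v+1]),
// so offset[v] is the left offset and offset[v+1] the right offset.
struct CSRGraph {
    uint32_t num_vertices;
    std::vector<uint32_t> offset;   // size num_vertices + 1
    std::vector<uint32_t> edges;    // neighbor vertex ids, size num_edges
    std::vector<float>    values;   // per-vertex attribute (Value)

    uint32_t left_offset (uint32_t v) const { return offset[v];     }
    uint32_t right_offset(uint32_t v) const { return offset[v + 1]; }
    uint32_t degree      (uint32_t v) const {
        return right_offset(v) - left_offset(v);
    }
};
```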
To provide effective support for generating graph-application RTL from an upper-level language and thereby improve the parallelism of graph computation executed on an FPGA, the invention provides a high-level synthesis method for graph computation which, as shown in FIG. 1, comprises:
(1) generating a graph computation program for describing graph computation tasks according to a predefined target programming model;
the target programming model is a point-centric functional programming model which, as shown in fig. 2, divides the graph computation task into seven graph operations: reading the active point set, reading edge offsets, reading edge data, reading target point data, edge data computation, merging computation results, and updating results and the active point set;
FIG. 3 shows pseudocode for the point-centric functional programming model of the invention. In the pseudocode: line 0 reads in the graph data from the CSR-format address specified by the user, and the storage mode and addresses of the data are then determined automatically from the code that follows. Lines 1 and 2 correspond to the graph operation of reading the active point set; under pull execution all points are active, so line 1 actually serves as the signal for judging whether the iteration has finished, and the read-point parallelism can be selected here. Line 3 corresponds to the graph operation of reading edge offsets; its parallelism can be specified and defaults to the read-point parallelism. Line 4 has no real effect, since the target programming model is not driven by loops. Line 5 corresponds to reading edge data, for which the read-edge parallelism can be specified; because sequential edge reading does not depend on source-point data, this amounts to intra-node parallelism when node degrees are large and inter-node parallelism when node degrees are small, keeping the hardware resources continuously busy and resolving the load imbalance caused by uneven degrees across nodes. Line 6 corresponds to the graph operation of reading target point data; since a large amount of random memory access occurs here, a corresponding number of buffers is generated according to the read-edge parallelism, the on-chip data is partitioned, and the access requests are scheduled to guarantee higher throughput. Line 7 corresponds to the edge data computation, for which the data bit width and data type can be specified. Line 8 corresponds to merging computation results according to the specified merge logic operation; the concrete architecture can be generated according to the read-edge parallelism and the parallel accumulation architecture in "An efficient graph accelerator with parallel data conflict management" (Yao Pengcheng et al.). Lines 10 and 11 correspond to updating the results and the active point set; line 10 also operates according to the specified merge logic, the parallelism of the update operation does not exceed the read-point parallelism, and the on-chip point-data storage can be configured accordingly so that no read-write conflicts arise;
(2) assigning architecture parameters and micro-architecture parameters for the graph computation task by adding optimization instructions;
the architecture parameters comprise edge processing operation and updating operation related to graph calculation;
the micro-architecture parameters include parallelism and data bit width; the parallelism specifically comprises the read-point parallelism and the read-edge parallelism, and the micro-architecture parameters specified by the optimization instructions may also include data type, format adjustment, and the like, according to actual needs (an illustrative directive sketch is given after step (6) below);
(3) compiling the graph computation program into a modular dataflow intermediate representation according to a pre-designed dataflow graph and the added optimization instructions;
the data flow diagram decomposes each diagram operation into one or more IR modules and describes the connection relation among the IR modules; each IR module corresponds to a node in the dataflow graph, and each IR module is supported by a corresponding parameterized hardware template;
as a preferred embodiment, each IR module is provided with an input buffer and an output buffer, wherein the input buffer is used for receiving data transmitted by a previous module and indicating an overflow condition, and the output buffer is used for storing a result generated by the current IR module for being read by a next IR module and generating a control signal according to the overflow condition of the buffer;
the connection between the IR modules is realized by an input buffer area and an output buffer area;
by setting a corresponding input buffer area and an output buffer area for each IR module, the pipeline pause can be effectively reduced;
in an embodiment of the invention, as shown in FIG. 4, the dataflow graph includes 17 IR modules M1~M17The numbers are 1-17 in sequence;
IR Module M1A graph operation to perform a set of read active points; IR Module M1Generating n source nodes according to the read point parallelism n in each clock cycle;
IR Module M2~M3Graph operations for performing read edge offsets; IR Module M2From IR module M according to read point parallelism n1Reading n source nodes to transmit source node information, and generating edge offset memory access requests aiming at the read source nodes in sequence; IR Module M3Slave IR module M2Processing after acquiring a side offset memory access request so as to read the side offset;
IR Module M3~M8Graph operations for performing read edge data; IR Module M4Slave IR module M3Receive side offset while slave IR module M2Read nThe source node continues to transmit; IR Module M5Slave IR module M4Reading the edge offset and the source node data and matching, thereby adding the corresponding edge offset in the source node data; IR Module M6Slave IR module M5After reading the source node and the corresponding edge offset, generating m edge data access requests according to the read edge parallelism m sequence, and continuously transmitting the source node information; IR Module M7Slave IR module M6Reading an edge data access request and source node information, and marking edges which do not belong to a transmitted source node as invalid edges to generate edge control information; IR Module M3Also from IR module M6Processing after acquiring a side data access request to read side data; IR Module M8Slave IR module M3Receiving side data, from IR module M7Receiving source node information and side control information, and transmitting the three types of information;
IR Module M9~M13A graph operation for reading target point data is performed; IR Module M9Slave IR module M8After receiving the side data, the source node information and the side control information, generating a target node access request according to the side data, and transmitting the source node information and the side control information; IR Module M10Slave IR module M9Scheduling after reading the access request of the target node so as to improve the access throughput; IR Module M11Slave IR module M10Processing after reading the scheduled target node access request so as to read target node data; IR Module M12Slave IR module M11After reading the target node data, sequencing the target nodes according to the source nodes; IR Module M13Transfer slave IR module M12Reading source node information, sequenced target node data and side control information;
IR Module M14Graph operations for performing edge data computations; IR Module M14Slave IR module M13Reading target node data, performing side data calculation to obtain an updated value, transmitting the updated value, and transmitting source node information and side control information;
IR Module M15Graph operations for performing merging computation results; IR mouldBlock M15Slave IR module M14After reading the updated values, merging the updated values according to the read edge parallelism m, and transmitting source node information and edge control information;
IR Module M16~M17Graph operations for executing the update results and the set of active points; IR Module M16Slave IR module M15The output buffer area reads the updating value corresponding to each effective source node and transmits the source node information; IR Module M17Slave IR module M16The output buffer area reads the updating result of the effective source node, and merges the updating values of the same effective source node, and after merging, the updating values are written back to the chip for storage;
wherein, the read point parallelism n and the read edge parallelism M are micro-architecture parameters appointed by the optimization instruction, and the IR module M6In the transmitted source nodes, the source node with edges belonging to the generated m edge data access requests is an effective source node, and the rest are invalid source nodes for filling;
as a further preferred embodiment, each IR module stores the results it produces and the information that needs to be transferred in its output buffer;
an IR module MdTo another IR module MsWhen reading data, the IR module MdFirstly, the IR module MsRead the data in the output buffer to the IR module MdThen the IR module MdReading data from its input buffer;
wherein, is an IR module MdAnd IR Module MsAs IR module M1~M17Two different IR modules;
(4) mapping the compiled dataflow intermediate representation onto the bottom-layer architecture according to the mapping between IR modules and hardware templates, and instantiating the pipelines and buffers in the corresponding hardware templates according to the specified parallelism and data bit width;
(5) if the instantiated parameterized hardware templates and the overall architecture meet the predefined constraint conditions, going to step (6); otherwise, modifying the optimization instructions and returning to step (3);
depending on the actual implementation, the constraint condition may be a requirement for resources, timing, etc.;
when the instantiated parameterized hardware template and the whole framework do not meet the constraint conditions of resources, time sequences and the like, correspondingly, the modification of the optimization instruction can specifically be the reduction of parallelism, the optimization of a data partitioning method, the reduction of data replication, the optimization of a pipeline structure and the like;
(6) generating a synthesizable hardware language code according to each instantiated parameterized hardware template and the whole architecture;
the specific hardware description language can select VHDL, Verilog and the like according to actual needs, and the generated hardware language code is RTL code.
At present, no HLS tool can directly generate a high-parallelism, efficient pipeline structure for graph computation from an upper-level high-level language. On one hand, the limited expressiveness makes it difficult for users to accurately describe the required architecture and to specify micro-architecture parameters; on the other hand, once high parallelism is specified for a graph algorithm, a large amount of data dependence and conflict arises, and without sufficient bottom-layer optimization to realize the storage and computation structures that high parallelism requires, the finally generated hardware circuit either actually executes serially or cannot be generated at all due to resource exhaustion.
The high-level synthesis method for graph computation divides the graph computation task, according to a functional programming model, into seven graph operations (reading the active point set, reading edge offsets, reading edge data, reading target point data, edge data computation, merging computation results, and updating results and the active point set), decomposes each graph operation into one or more IR modules through a dataflow graph, and defines the mapping between the IR modules and parameterized bottom-layer hardware templates, improving the upper-level language's ability to describe bottom-layer hardware and providing effective support for generating graph-application RTL from the upper-level language. In addition, while the graph computation program is compiled into the dataflow IR and mapped onto the bottom-layer architecture, micro-architecture parameters such as parallelism specified by the optimization instructions are propagated down and finally act on the bottom-layer architecture, so the hardware structure can be optimized from the high-level-language layer, effectively improving the parallelism of graph computation executed on the FPGA. In general, this high-level synthesis method provides effective support for generating graph-application RTL from the upper-level language and effectively improves the parallelism of graph computation executed on the FPGA.
Conventional HLS systems, in order to optimize the bottom-layer hardware structure, compile the upper-level language into a fine-grained intermediate representation during compilation and then mine the parallel opportunities within it; the various optimization means developed for loops act on all operations contained in a loop, and thus can neither specifically and practically resolve the dependences and conflicts of the individual graph operations nor generate dedicated support for each operation.
In the high-level synthesis method for graph computation, the provided dataflow graph defines the 17 IR modules to effectively support the 7 main graph operations in graph computation tasks, so that graph computation tasks can be displayed and expressed modularly at a higher level of abstraction, optimization support is provided precisely, a large number of data conflicts are avoided, and the execution parallelism of graph computation is improved.
In a preferred embodiment of the above high-level synthesis method for graph computation, M6 reading the source nodes and corresponding edge offsets from M5, generating m edge-data access requests in order according to the read-edge parallelism m, and continuing to transmit the source-node information comprises:
after reading the source nodes and corresponding edge offsets from M5, M6 reads n source nodes from its input buffer and obtains the maximum edge offset e corresponding to the read source nodes and the value c of the read-edge counter;
if c + m = e, transmitting the n source nodes and m edge-data access requests, increasing the read-edge counter by m, and removing the n transmitted source nodes from M6's input buffer;
if c + m > e, transmitting the n source nodes and m edge-data access requests while keeping the read-edge counter unchanged, and removing the n transmitted source nodes from M6's input buffer;
if c + m < e and there exists a source node v whose right offset e_rv = c + m, transmitting source node v and the source nodes numbered below it, padding with invalid source nodes so that n source nodes are transmitted at once, transmitting m edge-data access requests, increasing the read-edge counter by m, and removing source node v and the source nodes numbered below it from M6's input buffer;
if c + m < e and there exists a source node u whose left offset e_lu < c + m and whose right offset e_ru > c + m, transmitting source node u and the source nodes numbered below it, padding with invalid source nodes so that n source nodes are transmitted at once, transmitting m edge-data access requests, increasing the read-edge counter by m, and removing only the source nodes numbered below u from M6's input buffer, so that the next round of source-node transmission starts from source node u;
where the initial value of the read-edge counter is 0.
With the above optimized scheduling, M6 transmits n source nodes and m edge-data access requests in every clock cycle, which simplifies the bottom-layer hardware implementation. When the degree of a source node is small, i.e., the transmitted edge-data access requests contain edges that do not belong to that source node, all edges of several low-degree nodes can be processed together, realizing inter-node parallelism; when the degree of a source node is large, i.e., the transmitted edge-data access requests do not cover all of that node's edges, multiple edges of the high-degree node are processed, realizing intra-node parallelism.
Conventional HLS systems adopt imperative programming. FIG. 5 is a pseudocode diagram of a conventional imperative programming model, in which the loop on line 1 processes all active vertices in order and the loop on line 4 processes all edges of the current active vertex in order. Consider first coarse-grained pipelining of the nested loop, with lines 2-3 as one pipeline stage and the loop of lines 4-8 as a whole as another stage (currently only Spatial can realize coarse-grained pipelining at arbitrary positions), followed by fine-grained pipelining of the loop in lines 4-8: even at the maximum pipelining capability of the HLS tool, serious pipeline stalls remain under high parallelism. Consider then adding parallel directives, divided into inter-point parallelism and intra-point parallelism, corresponding to parallelizing the loops on line 1 and line 4 respectively; the two directives can be added at the same time. For intra-point parallelism at high parallelism (a graph accelerator's parallelism may be 16 or 32), many computing resources sit idle in every cycle whenever the vertex degree is small; the inter-point case is similar, and owing to the power-law distribution of graph data the load across the parallel pipelines is uneven.
By contrast, the high-level synthesis method for graph computation of the invention can flexibly realize inter-node and intra-node parallelism according to the irregular nature of graph computation tasks, effectively improving resource utilization and execution parallelism.
In a preferred embodiment of the above high-level synthesis method for graph computation, M10 scheduling the target-node access requests read from M9 comprises:
after reading a target-node access request from M9, M10 determines, from the request address, the on-chip point-data partition to which the request belongs and distributes the request to the request buffer corresponding to that partition;
where on-chip storage is divided into m on-chip point-data partitions according to the read-edge parallelism m, the m request buffers correspond one-to-one to the m partitions, and in each clock cycle access requests are issued to the m partitions through the m request buffers.
Because reading target point data involves a large amount of random memory access, the above scheduling optimization, which partitions on-chip storage according to the read-edge parallelism and generates a corresponding number of request buffers, reduces access conflicts and guarantees higher throughput.
In a preferred embodiment of the above high-level synthesis method for graph computation, if the number of edges exceeds a preset threshold, the edge data is stored in off-chip DRAM, and IR module M3 reads edge data from the off-chip DRAM when processing edge-data access requests; the specific threshold may be determined from the actual on-chip storage capacity.
Conventional HLS systems use arrays to represent storage structures, and in practice arrays are difficult to map to the various hardware storage structures required for graph computation; the invention supports off-chip data transfer and can optimize the storage structure.
To realize on-board execution on the FPGA, as shown in fig. 1, the high-level synthesis method for graph computation may further include:
(7) converting the obtained synthesizable hardware-language code into a bitstream file and running it on an FPGA development board;
(8) if the performance does not meet the preset performance requirement, modifying the optimization instructions, executing steps (3)-(6) again to obtain new synthesizable hardware-language code, and returning to step (7); otherwise, ending;
the specific performance requirements can be set according to the actual graph-computation task; when the performance requirement is not met, the modification of the optimization instructions may specifically be reducing the parallelism, optimizing the data-partitioning method, reducing data replication, optimizing the pipeline structure, and the like.
In this embodiment, after on-board FPGA execution is realized according to the above high-level synthesis method for graph computation, the finally realized bottom-layer hardware architecture is as shown in fig. 6. The actual pipeline is divided into 17 IR modules according to the dataflow intermediate representation: the modules executing reading the active point set, reading edge offsets, reading edge data, and reading target point data are IR modules unrelated to the graph computation proper, while the IR modules executing edge data computation, merging computation results, and updating results and the active point set are computation-related modules. For the computation-unrelated modules, the corresponding point-transmission pipelines, edge-control-signal transmission pipelines, buffers, and control-signal parameters within each module are instantiated according to the user-specified micro-architecture parameters such as parallelism and data bit width and type. The computation-related modules are instantiated according to the user's instruction parameters and contain only computation operations, with no control or memory-access operations, so the corresponding basic hardware calculators generated from the computation operations are connected inside the corresponding modules, and the connections between modules are realized by the input and output buffers. As shown in fig. 6, in the embodiment of the invention the on-chip instantiated pipelines mainly include a Read active vertex set pipeline, a Read edge data pipeline, a Read destination data pipeline, a task-scheduling (Processing scheduling) pipeline, an Edge process pipeline, a Merge process result pipeline, and an Update vertex data and active vertex set pipeline; the actual pipeline is divided into 17 IR modules according to the IR, each supported by a parameterized hardware template.
The invention also provides a high-level synthesis system for the graph-oriented calculation, which is used for executing the steps of the high-level synthesis method for the graph-oriented calculation; the system comprises: the system comprises a graph calculation program generation module, an optimization module, a compiling module, a bottom layer mapping module, a constraint inspection module and a synthesis module;
a graph computation program generation module for generating a graph computation program for describing graph computation tasks according to a predefined target programming model; the target programming model is a functional programming model taking a point as a center, and the target programming model divides the graph calculation task into seven graph operations of reading active point set, reading edge offset, reading edge data, reading target point data, calculating edge data, combining calculation results and updating results and active point sets;
the optimization module is used for appointing architecture parameters and micro-architecture parameters for the graph calculation task by adding an optimization instruction; the architecture parameters comprise edge processing operation and updating operation related to graph calculation; the micro-architecture parameters include parallelism and data bit width;
the compiling module is used for compiling the graph calculation program into a modularized data flow intermediate representation according to a pre-designed data flow graph and the added optimization instruction; the data flow diagram decomposes each diagram operation into one or more IR modules and describes the connection relation among the IR modules; each IR module represents a node in the corresponding dataflow graph, and each IR module has a corresponding parameterized hardware template support;
the bottom layer mapping module is used for mapping the compiled data stream intermediate representation to a bottom layer framework according to the mapping relation between the IR module and the hardware template, and instantiating a pipeline and a buffer area in the corresponding hardware template according to the specified parallelism and data bit width;
the constraint checking module is used for judging whether each instantiated parameterized hardware template and the whole framework meet predefined constraint conditions or not and modifying the optimization instruction when judging that the constraint conditions are not met;
the comprehensive module is used for generating a comprehensive hardware language code according to each parameterized hardware template and the whole framework when each instantiated parameterized hardware template and the whole framework meet predefined constraint conditions;
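As an aid to understanding the point-centric functional programming model, the following is a minimal sketch of how a PageRank-style graph computation task could be described against the seven graph operations. The patent does not fix a surface syntax, so the function names, the damping constants, and the convergence threshold below are all illustrative assumptions:

```python
# Hypothetical PageRank-style task in a point-centric functional style;
# every name and constant below is illustrative, not the patent's syntax.

def edge_process(src_value, edge_weight):
    # "computing edge data": produce an update value along one edge
    return src_value * edge_weight

def merge(a, b):
    # "merging computation results": combine two update values
    return a + b

def update(old_value, merged):
    # "updating results and the active point set": return the new vertex
    # value and a flag deciding whether the vertex stays active
    new_value = 0.15 + 0.85 * merged
    return new_value, abs(new_value - old_value) > 1e-6
```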
In this embodiment of the present invention, the detailed implementation of each module may refer to the description of the method embodiment above and will not be repeated here.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (7)

1. A graph-computation-oriented high-level synthesis method, characterized by comprising the following steps:
(1) generating a graph computation program for describing graph computation tasks according to a predefined target programming model;
the target programming model is a point-centric functional programming model, and the target programming model divides a graph computation task into seven graph operations: reading the active point set, reading edge offsets, reading edge data, reading target point data, computing edge data, merging computation results, and updating results and the active point set;
(2) assigning architecture parameters and micro-architecture parameters for the graph computation task by adding optimization instructions;
the architecture parameters comprise edge processing operations and updating operations related to graph calculation; the micro-architecture parameters include parallelism and data bit width;
(3) compiling the graph computation program into a modular dataflow intermediate representation according to a pre-designed dataflow graph and the added optimization instructions;
the dataflow graph decomposes each graph operation into one or more IR modules and describes the connection relations among the IR modules; each IR module corresponds to a node on the dataflow graph and is supported by a corresponding parameterized hardware template;
(4) mapping the compiled dataflow intermediate representation onto the underlying architecture according to the mapping relation between the IR modules and the hardware templates, and instantiating the pipelines and buffers in the corresponding hardware templates according to the specified parallelism and data bit width;
(5) if each instantiated parameterized hardware template and the overall architecture meet the predefined constraint conditions, going to step (6); otherwise, modifying the optimization instructions and returning to step (3);
(6) generating synthesizable hardware language code according to each instantiated parameterized hardware template and the overall architecture;
each IR module is provided with an input buffer and an output buffer; the input buffer is used for receiving the data transmitted by the previous IR module and indicating its overflow condition, and the output buffer is used for storing the results generated by the current IR module for the next IR module to read and for generating a control signal according to the overflow condition of the buffer;
the connections between the IR modules are realized by the input buffers and output buffers;
the dataflow graph includes 17 IR modules M1~M17;
the IR module M1 is used for performing the graph operation of reading the active point set; the IR module M1 generates n source nodes in each clock cycle according to the read-point parallelism n;
the IR modules M2~M3 are used for performing the graph operation of reading edge offsets; the IR module M2 reads n source nodes from the IR module M1 according to the read-point parallelism n to transmit the source node information, and generates edge-offset access requests for the read source nodes in sequence; the IR module M3 processes the edge-offset access requests obtained from the IR module M2 so as to read the edge offsets;
the IR modules M3~M8 are used for performing the graph operation of reading edge data; the IR module M4 receives edge offsets from the IR module M3 while reading the n source nodes from the IR module M2 for continued transmission; the IR module M5 reads the edge offsets and the source node data from the IR module M4 and matches them, thereby attaching the corresponding edge offsets to the source node data; the IR module M6, after reading source nodes and the corresponding edge offsets from the IR module M5, generates m edge-data access requests in sequence according to the read-edge parallelism m while continuing to transmit the source node information; the IR module M7 reads the edge-data access requests and the source node information from the IR module M6 and marks edges that do not belong to a transmitted source node as invalid edges, generating edge control information; the IR module M3 also processes the edge-data access requests obtained from the IR module M6 so as to read the edge data; the IR module M8 receives the edge data from the IR module M3 and the source node information and edge control information from the IR module M7, and transmits all three kinds of information;
the IR modules M9~M13 are used for performing the graph operation of reading target point data; the IR module M9, after receiving the edge data, source node information and edge control information from the IR module M8, generates target-node access requests according to the edge data while transmitting the source node information and edge control information; the IR module M10 schedules the target-node access requests read from the IR module M9 so as to improve memory-access throughput; the IR module M11 processes the scheduled target-node access requests read from the IR module M10 so as to read the target node data; the IR module M12, after reading the target node data from the IR module M11, sorts the target nodes according to their source nodes; the IR module M13 reads the source node information, the sorted target node data and the edge control information from the IR module M12 and passes them on;
the IR module M14 is used for performing the graph operation of computing edge data; the IR module M14 reads the target node data from the IR module M13, performs the edge data computation to obtain update values, and transmits the update values together with the source node information and edge control information;
the IR module M15 is used for performing the graph operation of merging computation results; the IR module M15, after reading the update values from the IR module M14, merges the update values according to the read-edge parallelism m and transmits the source node information and edge control information;
the IR modules M16~M17 are used for performing the graph operation of updating results and the active point set; the IR module M16 reads the update value corresponding to each valid source node from the output buffer of the IR module M15 and transmits the source node information; the IR module M17 reads the update results of the valid source nodes from the output buffer of the IR module M16, merges the update values belonging to the same valid source node, and writes the merged values back to on-chip storage;
wherein the read-point parallelism n and the read-edge parallelism m are micro-architecture parameters specified by the optimization instructions; among the source nodes transmitted by the IR module M6, a source node whose edges belong to the generated m edge-data access requests is a valid source node, and the rest are invalid source nodes used for padding.
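For illustration, the merging performed by modules M15~M17 above can be modeled behaviorally as follows. This Python sketch is an assumption-laden rendering of the claim text, not the hardware: the (source_node, value, valid) triple format, the function names, and the `storage` dict standing in for on-chip memory are all invented here:

```python
def merge_cycle(lane_updates, merge):
    """Module M15, behaviorally: combine the m update values of one cycle.
    `lane_updates` holds (source_node, value, valid) triples; invalid
    lanes are padding and are skipped."""
    out = {}
    for node, value, valid in lane_updates:
        if valid:
            out[node] = merge(out[node], value) if node in out else value
    return out

def write_back(cycles, merge, storage):
    """Module M17, behaviorally: combine update values that belong to the
    same valid source node across cycles, then write the results back."""
    acc = {}
    for lane_updates in cycles:
        for node, value in merge_cycle(lane_updates, merge).items():
            acc[node] = merge(acc[node], value) if node in acc else value
    storage.update(acc)
```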
2. The graph-computation-oriented high-level synthesis method of claim 1, wherein each IR module stores the results it produces and the information it needs to pass on in its output buffer;
when an IR module Md reads data from another IR module Ms, the IR module Md first reads the data in the output buffer of the IR module Ms into the input buffer of the IR module Md, and then the IR module Md reads the data from its own input buffer;
wherein the IR module Md and the IR module Ms are two different IR modules among the IR modules M1~M17.
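A minimal behavioral model of this two-step read protocol is sketched below; the class and function names are invented for illustration, and Python deques stand in for the hardware buffers:

```python
from collections import deque

class IRModuleBuffers:
    """Minimal model of one IR module's buffer pair (names invented)."""
    def __init__(self):
        self.in_buf = deque()   # filled from the upstream output buffer
        self.out_buf = deque()  # holds results for the downstream module

def read_from(md, ms):
    """Claim 2's two-step read: Md first moves a datum from Ms's output
    buffer into its own input buffer, then consumes from there."""
    if ms.out_buf:
        md.in_buf.append(ms.out_buf.popleft())          # step 1: move
    return md.in_buf.popleft() if md.in_buf else None   # step 2: consume
```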
3. The graph-computation-oriented high-level synthesis method of claim 2, wherein the IR module M6, after reading source nodes and the corresponding edge offsets from the IR module M5, generating m edge-data access requests in sequence according to the read-edge parallelism m and continuing to transmit the source node information comprises:
the IR module M6, after reading source nodes and the corresponding edge offsets from the IR module M5, reads n source nodes from its input buffer and obtains the maximum edge offset e of the read source nodes and the value c of the read-edge counter;
if c + m equals e, transmitting the n source nodes and m edge-data access requests, and increasing the read-edge counter by m; the transmitted n source nodes are removed from the input buffer of the IR module M6;
if c + m is larger than e, transmitting the n source nodes and m edge-data access requests while keeping the read-edge counter unchanged; the transmitted n source nodes are removed from the input buffer of the IR module M6;
if c + m is smaller than e and there exists a source node v whose right edge offset e_rv equals c + m, transmitting the source node v and the source nodes numbered smaller than v, padding with invalid source nodes so that n source nodes are transmitted at a time, transmitting m edge-data access requests, and increasing the read-edge counter by m; the source node v and the source nodes numbered smaller than v are removed from the input buffer of the IR module M6;
if c + m is smaller than e and there exists a source node u whose left edge offset e_lu is smaller than c + m and whose right edge offset e_ru is larger than c + m, transmitting the source node u and the source nodes numbered smaller than u, padding with invalid source nodes so that n source nodes are transmitted at a time, transmitting m edge-data access requests, and increasing the read-edge counter by m; the source nodes numbered smaller than u are removed from the input buffer of the IR module M6;
wherein the initial value of the read-edge counter is 0.
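The four cases of this claim can be rendered as one behavioral decision step. In the sketch below, `window` models the n source nodes in M6's input buffer in number order, and the dict fields 'id', 'left', and 'right' (the left and right edge offsets) are names invented here; the final fallback line is unreachable when edge offsets are contiguous and is kept only so the function always returns:

```python
def generate_requests(window, c, m, n):
    """One decision step of module M6 (behavioral sketch of claim 3).
    Returns (transmitted nodes, new counter value, nodes to remove
    from the input buffer); all field names are illustrative."""
    pad = {"id": None, "left": 0, "right": 0}       # invalid filler node
    e = max(v["right"] for v in window)             # maximum edge offset

    if c + m == e:
        return window, c + m, window                # case 1
    if c + m > e:
        return window, c, window                    # case 2: counter unchanged
    for v in window:                                # c + m < e
        if v["right"] == c + m:                     # case 3: v finishes exactly
            done = [u for u in window if u["id"] <= v["id"]]
            return done + [pad] * (n - len(done)), c + m, done
        if v["left"] < c + m < v["right"]:          # case 4: v stays buffered
            done = [u for u in window if u["id"] < v["id"]]
            return done + [v] + [pad] * (n - len(done) - 1), c + m, done
    return [pad] * n, c + m, []                     # unreachable for contiguous offsets
```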
4. The graph-computation-oriented high-level synthesis method of claim 1, wherein the scheduling performed by the IR module M10 after reading target-node access requests from the IR module M9 comprises:
the IR module M10, after reading a target-node access request from the IR module M9, obtains the on-chip point-data partition to which the request belongs according to the request address and dispatches the request to the request buffer corresponding to that partition;
wherein the on-chip storage is divided into m on-chip point-data partitions according to the read-edge parallelism m, the m request buffers correspond one-to-one to the m on-chip point-data partitions, and in each clock cycle access requests are issued to the m on-chip point-data partitions through the m request buffers.
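Behaviorally, this scheduling amounts to banking requests by partition so that up to m of them issue per cycle. In the sketch below, the mapping `addr % m` from request address to partition index is an assumption; the claim only states that the partition is derived from the request address:

```python
def schedule(requests, m):
    """Behavioral sketch of module M10: bank target-node access requests
    by on-chip point-data partition; each yielded list is one cycle's
    worth of requests, at most one per partition."""
    buffers = [[] for _ in range(m)]        # one request buffer per partition
    for addr in requests:
        buffers[addr % m].append(addr)      # partition choice is an assumption
    while any(buffers):
        yield [buf.pop(0) for buf in buffers if buf]
```

For example, `list(schedule([0, 1, 2, 5, 9], 4))` issues requests 0, 1, 2, 5 in the first cycle (four distinct partitions) and request 9 in the second, since 9 maps to the same partition as 1 and 5.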
5. The graph-computation-oriented high-level synthesis method of claim 1, wherein if the number of edges is greater than a preset threshold, the edge data is stored in off-chip DRAM, and the IR module M3 reads the edge data from the off-chip DRAM when processing edge-data access requests.
6. The graph-computation-oriented high-level synthesis method according to any one of claims 1 to 5, further comprising:
(7) converting the obtained synthesizable hardware language code into a bitstream file and running it on an FPGA development board;
(8) if the performance does not meet the preset performance requirement, modifying the optimization instructions, re-executing steps (3)-(6) to obtain synthesizable hardware language code again, and going to step (7); otherwise, ending the operation.
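Putting steps (3) through (8) together, the overall flow is a refine-until-satisfied loop. The following driver is a sketch under the assumption that each toolchain stage is available as a callable; none of these names correspond to a real API:

```python
def run_flow(program, directives, stages):
    """Driver for steps (3)-(8); every entry of `stages` is a callable
    standing in for a toolchain stage -- hypothetical, not a real API."""
    while True:
        ir = stages["compile"](program, directives)   # step (3)
        arch = stages["map"](ir, directives)          # step (4)
        if not stages["check"](arch):                 # step (5) fails
            directives = stages["refine"](directives)
            continue                                  # back to step (3)
        hdl = stages["emit"](arch)                    # step (6)
        perf = stages["run"](hdl)                     # step (7): board run
        if stages["ok"](perf):                        # step (8)
            return hdl
        directives = stages["refine"](directives)
```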
7. A graph-computation-oriented high-level synthesis system, characterized by comprising: a graph computation program generation module, an optimization module, a compiling module, an underlying mapping module, a constraint checking module, and a synthesis module;
the graph computation program generation module is used for generating a graph computation program for describing a graph computation task according to a predefined target programming model; the target programming model is a point-centric functional programming model, and the target programming model divides a graph computation task into seven graph operations: reading the active point set, reading edge offsets, reading edge data, reading target point data, computing edge data, merging computation results, and updating results and the active point set;
the optimization module is used for specifying architecture parameters and micro-architecture parameters for the graph computation task by adding optimization instructions; the architecture parameters comprise the edge processing operations and update operations related to graph computation; the micro-architecture parameters comprise the parallelism and the data bit width;
the compiling module is used for compiling the graph computation program into a modular dataflow intermediate representation according to a pre-designed dataflow graph and the added optimization instructions; the dataflow graph decomposes each graph operation into one or more IR modules and describes the connection relations among the IR modules; each IR module corresponds to a node on the dataflow graph and is supported by a corresponding parameterized hardware template;
the underlying mapping module is used for mapping the compiled dataflow intermediate representation onto the underlying architecture according to the mapping relation between the IR modules and the hardware templates, and for instantiating the pipelines and buffers in the corresponding hardware templates according to the specified parallelism and data bit width;
the constraint checking module is used for judging whether each instantiated parameterized hardware template and the overall architecture meet the predefined constraint conditions, and for modifying the optimization instructions when they do not;
the synthesis module is used for generating synthesizable hardware language code according to each parameterized hardware template and the overall architecture when each instantiated parameterized hardware template and the overall architecture meet the predefined constraint conditions;
each IR module is provided with an input buffer and an output buffer; the input buffer is used for receiving the data transmitted by the previous IR module and indicating its overflow condition, and the output buffer is used for storing the results generated by the current IR module for the next IR module to read and for generating a control signal according to the overflow condition of the buffer;
the connections between the IR modules are realized by the input buffers and output buffers;
the dataflow graph includes 17 IR modules M1~M17;
the IR module M1 is used for performing the graph operation of reading the active point set; the IR module M1 generates n source nodes in each clock cycle according to the read-point parallelism n;
the IR modules M2~M3 are used for performing the graph operation of reading edge offsets; the IR module M2 reads n source nodes from the IR module M1 according to the read-point parallelism n to transmit the source node information, and generates edge-offset access requests for the read source nodes in sequence; the IR module M3 processes the edge-offset access requests obtained from the IR module M2 so as to read the edge offsets;
the IR modules M3~M8 are used for performing the graph operation of reading edge data; the IR module M4 receives edge offsets from the IR module M3 while reading the n source nodes from the IR module M2 for continued transmission; the IR module M5 reads the edge offsets and the source node data from the IR module M4 and matches them, thereby attaching the corresponding edge offsets to the source node data; the IR module M6, after reading source nodes and the corresponding edge offsets from the IR module M5, generates m edge-data access requests in sequence according to the read-edge parallelism m while continuing to transmit the source node information; the IR module M7 reads the edge-data access requests and the source node information from the IR module M6 and marks edges that do not belong to a transmitted source node as invalid edges, generating edge control information; the IR module M3 also processes the edge-data access requests obtained from the IR module M6 so as to read the edge data; the IR module M8 receives the edge data from the IR module M3 and the source node information and edge control information from the IR module M7, and transmits all three kinds of information;
the IR modules M9~M13 are used for performing the graph operation of reading target point data; the IR module M9, after receiving the edge data, source node information and edge control information from the IR module M8, generates target-node access requests according to the edge data while transmitting the source node information and edge control information; the IR module M10 schedules the target-node access requests read from the IR module M9 so as to improve memory-access throughput; the IR module M11 processes the scheduled target-node access requests read from the IR module M10 so as to read the target node data; the IR module M12, after reading the target node data from the IR module M11, sorts the target nodes according to their source nodes; the IR module M13 reads the source node information, the sorted target node data and the edge control information from the IR module M12 and passes them on;
the IR module M14 is used for performing the graph operation of computing edge data; the IR module M14 reads the target node data from the IR module M13, performs the edge data computation to obtain update values, and transmits the update values together with the source node information and edge control information;
the IR module M15 is used for performing the graph operation of merging computation results; the IR module M15, after reading the update values from the IR module M14, merges the update values according to the read-edge parallelism m and transmits the source node information and edge control information;
the IR modules M16~M17 are used for performing the graph operation of updating results and the active point set; the IR module M16 reads the update value corresponding to each valid source node from the output buffer of the IR module M15 and transmits the source node information; the IR module M17 reads the update results of the valid source nodes from the output buffer of the IR module M16, merges the update values belonging to the same valid source node, and writes the merged values back to on-chip storage;
wherein the read-point parallelism n and the read-edge parallelism m are micro-architecture parameters specified by the optimization instructions; among the source nodes transmitted by the IR module M6, a source node whose edges belong to the generated m edge-data access requests is a valid source node, and the rest are invalid source nodes used for padding.
CN201910842736.6A 2019-09-06 2019-09-06 High-level synthesis method and system for graph calculation Active CN110750265B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910842736.6A CN110750265B (en) 2019-09-06 2019-09-06 High-level synthesis method and system for graph calculation


Publications (2)

Publication Number Publication Date
CN110750265A CN110750265A (en) 2020-02-04
CN110750265B true CN110750265B (en) 2021-06-11






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant