CN115687707A - Acceleration subgraph matching method based on CPU-FPGA hybrid platform - Google Patents


Info

Publication number: CN115687707A
Application number: CN202210864241.5A
Authority: CN (China)
Legal status: Pending
Prior art keywords: graph, matching, FPGA, CPU, data
Other languages: Chinese (zh)
Inventor: 张显
Original and current assignee: Huaihua University
Application filed by Huaihua University; priority to CN202210864241.5A

Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02D: Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses an accelerated subgraph matching method based on a CPU-FPGA (central processing unit - field-programmable gate array) hybrid platform. It proposes a cooperative CPU-FPGA hybrid architecture to accelerate subgraph matching together with a staged hybrid subgraph matching algorithm, and achieves load balance across the FPGA's multiple compute units through a partition method over the candidate-vertex auxiliary data structure set. In addition, multiple FPGA compute units are designed, Host-to-Kernel Dataflow is enabled, and the enumeration kernel function accelerates subgraph matching using optimization techniques such as Loop Pipelining, Loop Unrolling, Dataflow, and Function Inlining. Finally, the performance of the proposed hybrid subgraph matching method is evaluated through extensive experiments on real and synthetic data sets, and the results show that it outperforms the state-of-the-art subgraph matching methods.

Description

Acceleration subgraph matching method based on CPU-FPGA hybrid platform
Technical Field
The invention relates to the technical field of subgraph matching, in particular to an accelerated subgraph matching method based on a CPU-FPGA (central processing unit - field-programmable gate array) hybrid platform.
Background
With the advent of the big data age, more and more data is represented and stored in graph structures. A graph can express complex relations among data objects, and graphs are widely applied in social networks, web networks, material structures, transportation, city planning, medical information, and other fields. Accordingly, graph data processing is becoming more and more important, and as society becomes increasingly digitized, the graph data generated in these fields is growing at an explosive rate. By 2021, the social network Facebook had nearly 3 billion users, connected by a far larger number of friend relationships. Subgraph matching is widely applied in graph database queries, molecular interaction network analysis, recommendation systems, social network analysis, biological data analysis, and the like.
The subgraph matching problem is: given a query graph q and a data graph G, find all subgraphs of G that have the same structure and the same node labels as q. For example, in FIG. 1, given a data graph G and a query graph q, {(u0, v0), (u1, v3), (u2, v4), (u3, v12)} is a subgraph match of q in G. However, computing all subgraph matches is an NP-hard problem, especially when finding all subgraph embeddings on large graphs. Efficiently solving the subgraph matching problem is therefore an urgent challenge.
Most prior-art subgraph matching methods on CPUs are based on backtracking, i.e., they recursively extend partial embeddings by mapping query vertices to data vertices. This approach is limited by the design of general-purpose CPUs: the sequential solution exhibits poor response time and poor scalability when processing large amounts of graph data. Furthermore, the single-instruction multiple-data (SIMD) instruction-level parallelism of general-purpose CPUs is not flexible enough for the highly parallel workloads of graph data processing, and general-purpose CPUs handle the poor data locality of irregular graphs inefficiently.
Unlike the fixed architectures of CPUs and GPUs, field-programmable gate arrays (FPGAs) are reconfigurable computing platforms whose compute engines can be flexibly defined by the user according to application characteristics, without the time overhead and production cost of producing an application-specific integrated circuit (ASIC). An FPGA chip includes many resources that help users build custom applications, such as adaptive logic modules (ALMs), block RAM (BRAM), and embedded multipliers for digital signal processing, as well as peripheral interfaces such as external memory (DDR, HMC, etc.) and PCI Express (PCIe). When software is "executed" on an FPGA, it is not compiled and assembled into instructions as on a CPU or GPU; instead, the data stream flows through a deep pipeline customized on the FPGA to match the operations expressed in the software. Because the dataflow pipeline hardware is matched to the software, control overhead is eliminated, improving performance and efficiency. In addition, FPGA parallelism combines data parallelism (SIMD), task parallelism (multiple pipelines), superscalar execution (multiple independent instructions executed in parallel), and pipeline parallelism to achieve the best parallel performance. Meanwhile, thanks to their low power consumption, FPGAs offer a better energy-efficiency ratio than GPUs and CPUs. FPGAs thus provide an alternative for computational acceleration at the hardware level, with a great advantage in parallelism compared to CPUs and a great advantage in power consumption compared to GPUs.
In industry, FPGAs have been applied to accelerate complex systems. For example, Microsoft uses FPGAs to accelerate Bing search and Azure machine learning, and cloud providers such as Amazon, Alibaba, Tencent, and Huawei offer FPGA-accelerated cloud servers. In academia, FPGAs are used to accelerate a variety of research problems, including many graph processing problems, AI acceleration, and more.
However, the enumeration stage of subgraph matching is a compute-intensive task. The question is therefore how to fully exploit the FPGA's pipeline and dataflow mechanisms to achieve both spatial and temporal data parallelism, so as to accelerate the enumeration task and thereby solve the subgraph matching problem efficiently.
Disclosure of Invention
To address these problems, the invention provides a hybrid subgraph matching algorithm that automatically selects the matching order according to the density of the data graph and, based on a CPU-FPGA cooperative framework, accelerates subgraph matching by exploiting the FPGA's pipeline mechanism and low power consumption.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
an acceleration subgraph matching method based on a CPU-FPGA hybrid platform is characterized by comprising the following steps:
step 1: inputting a data graph and a query graph to be queried into a memory of a CPU (central processing unit);
step 2: performing subgraph matching based on a CPU-FPGA collaborative hybrid architecture, finding from the given data graph all subgraphs that have the same structure and the same node labels as the query graph, in the following three stages:
the first stage, a CPU end is used as a host end, a candidate vertex set is generated according to a data graph and a query graph, filtering is carried out through LDF and NLF rules, and then an auxiliary data structure is constructed by utilizing a CFL method;
in the second stage, the CPU end selects a GQL method or an RI method to generate a query matching sequence according to the density of the data graph;
in the third stage, subgraph matching enumeration is respectively carried out on a CPU end and an FPGA end according to the candidate vertex set, the auxiliary data structure and the query matching sequence, and results of the CPU end and the FPGA end are summarized to obtain a final subgraph matching result;
step 3: outputting the final subgraph matching embedding results.
Further, the step of constructing the candidate vertex set and the auxiliary data structure in the first stage is:
step 201: for a given query graph q and data graph G, vertex u ∈ V (q) in the query graph q, vertex V ∈ V (G) of the data graph G, and if there is a corresponding match from graph q to graph G for (u, V), adding vertex V to the set of candidate vertices, where V represents the set of vertices;
step 202: filtering the candidate vertex set using the LDF and NLF rules to obtain a filtered candidate vertex set C(u);
step 203: constructing a BFS tree tq from the given query graph q;
step 204: accessing the vertices of the query graph q step by step from top to bottom and finding the corresponding vertices in the data graph G so as to construct an auxiliary data structure A; then performing bottom-up refinement with CFL, eliminating unpromising candidate vertices and deleting nonexistent vertices from the adjacency lists.
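As a concrete illustration, the LDF and NLF filtering rules of steps 201-202 can be sketched in Python. This is a simplified sketch on toy dictionary-based graphs; the graph representation, names, and example graphs are illustrative assumptions, not the patented implementation.

```python
from collections import Counter

def ldf_nlf_candidates(q_adj, q_label, g_adj, g_label):
    """Build C(u) for every query vertex u using the LDF and NLF rules.

    LDF: L(v) == L(u) and d(v) >= d(u).
    NLF: for every label, v has at least as many neighbors carrying
    that label as u does.  (Sketch only; not the patented code.)
    """
    def nlf(adj, label, x):                       # neighbor label frequencies
        return Counter(label[y] for y in adj[x])

    C = {}
    for u in q_adj:
        need = nlf(q_adj, q_label, u)
        C[u] = [v for v in g_adj
                if g_label[v] == q_label[u]               # LDF: same label
                and len(g_adj[v]) >= len(q_adj[u])        # LDF: degree rule
                and all(nlf(g_adj, g_label, v)[l] >= c    # NLF: neighbor
                        for l, c in need.items())]        # label frequencies
    return C

# toy query graph: triangle u0-u1-u2 labeled A, B, B (hypothetical example)
q_adj = {'u0': {'u1', 'u2'}, 'u1': {'u0', 'u2'}, 'u2': {'u0', 'u1'}}
q_label = {'u0': 'A', 'u1': 'B', 'u2': 'B'}
# toy data graph: triangle v0-v1-v2 plus a pendant vertex v3
g_adj = {'v0': {'v1', 'v2', 'v3'}, 'v1': {'v0', 'v2'},
         'v2': {'v0', 'v1'}, 'v3': {'v0'}}
g_label = {'v0': 'A', 'v1': 'B', 'v2': 'B', 'v3': 'B'}

C = ldf_nlf_candidates(q_adj, q_label, g_adj, g_label)
# v3 is filtered out of C(u1) and C(u2): its degree 1 fails the LDF rule
```

Note how the degree part of LDF alone suffices to discard the pendant vertex v3, before any enumeration work is done.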
Further, the step of generating the query matching sequence in the second stage includes:
step 301: if the data graph is dense, select the GQL method to generate the query matching order and go to step 302; if the data graph is sparse, select the RI method to generate the corresponding query matching order and go to step 303;
step 302: the GQL method: first select u* = argmin_{u ∈ V(q)} |C(u)| as the start vertex of the matching order φ; then, in each iteration, the GQL method selects u* = argmin_{u ∈ N(φ)} |C(u)| as the next vertex of the matching order, where N(φ) denotes the not-yet-ordered neighbors of the vertices already in φ;
step 303: the RI method: first select u* = argmax_{u ∈ V(q)} d(u) as the start vertex of the matching order φ; then iteratively select u* = argmax_{u ∈ V(q)\φ} |N(u) ∩ φ| as the next vertex of the matching order.
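The two ordering heuristics of steps 301-303 can be sketched as follows. The Python below is a simplified illustration; the alphabetical tie-breaking and the toy inputs are assumptions, not part of the patent.

```python
def gql_order(q_adj, C):
    """GQL-style order: start at the query vertex with the smallest
    candidate set; then repeatedly extend with the frontier vertex
    (a neighbor of the current order) having the smallest |C(u)|."""
    order = [min(q_adj, key=lambda u: len(C[u]))]
    while len(order) < len(q_adj):
        frontier = sorted({u for o in order for u in q_adj[o]} - set(order))
        order.append(min(frontier, key=lambda u: len(C[u])))
    return order

def ri_order(q_adj):
    """RI-style order: start at the max-degree query vertex; then
    repeatedly extend with the unordered vertex that has the most
    neighbors already inside the order."""
    order = [max(q_adj, key=lambda u: len(q_adj[u]))]
    while len(order) < len(q_adj):
        rest = sorted(u for u in q_adj if u not in order)
        order.append(max(rest, key=lambda u: len(q_adj[u] & set(order))))
    return order

# toy query graph: path u0 - u1 - u2 with candidate-set sizes 3, 1, 2
q_adj = {'u0': {'u1'}, 'u1': {'u0', 'u2'}, 'u2': {'u1'}}
C = {'u0': ['a', 'b', 'c'], 'u1': ['a'], 'u2': ['a', 'b']}
```

GQL greedily keeps the candidate space small (it starts at u1 with one candidate), while RI maximizes connectivity to already-ordered vertices, which is independent of the candidate sets.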
Further, the third-stage subgraph matching comprises the following steps:
step 401: calculating local candidates by adopting an intersection-based embedding method at the CPU end according to the obtained matching sequence to obtain a corresponding sub-graph matching result;
step 402: partitioning the candidate vertex set based on the auxiliary data structure and the number of kernel compute units, dividing it into several blocks, of which one block is executed at the CPU end and the others on the FPGA, finally achieving load balance across the multiple compute units of the CPU end and the FPGA end;
step 403: using the FPGA's spatial data parallelism to design multiple compute units that run the enumeration kernel function of the intersection-based embedding method, and using the FPGA's temporal data parallelism to enable Host-to-Kernel Dataflow;
step 404: aggregating the subgraph matching enumeration results of the FPGA end and the CPU end.
Further, the specific operation steps of step 402 are:
step 4021: determining a partition factor k = min(k, |C(u)|);
step 4022: and starting partitioning candidates from the root vertex of A, and if the number of the root vertices of the auxiliary data structure A is less than k, continuing partitioning along the candidate of the next vertex in the matching order until the last vertex of A is reached.
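Steps 4021-4022 can be illustrated with a simplified Python sketch. Here the partition operates on plain candidate lists rather than the full auxiliary structure A, and the round-robin split across blocks is an assumption made for illustration.

```python
def partition_candidates(C, order, k):
    """Split the candidates of one query vertex into up to k blocks,
    one per compute unit.  Following step 4022, if the first level of
    the matching order is narrower than k, move down the order until
    a wide-enough level (or the last vertex) is found."""
    split = next((u for u in order if len(C[u]) >= k), order[-1])
    k = min(k, len(C[split]))                # step 4021: cap the factor
    blocks = [dict(C) for _ in range(k)]
    for i in range(k):
        blocks[i][split] = C[split][i::k]    # round-robin the level
    return blocks

# hypothetical example: 4 root candidates split across 2 compute units
order = ['u0', 'u1']
C = {'u0': ['v0', 'v1', 'v2', 'v3'], 'u1': ['v4']}
blocks = partition_candidates(C, order, 2)
```

Each block is a complete candidate structure whose first level is disjoint from the other blocks', so the blocks can be enumerated independently on the CPU and the FPGA compute units.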
The invention has the beneficial effects that:
firstly, the invention provides a sub-graph matching algorithm with a staged mixing, wherein a filtering candidate vertex set uses LDF (the label and depth filter) and NLF (the neighbor borwood label frequency filter) to filter mapping vertexes; constructing an auxiliary data structure by using a CFL method; and selecting GQL and RI according to the query vertex sequence and the sparsity of the data graph, and using a set intersection-based method for enumerating subgraph matching results. Meanwhile, in the enumeration stage, the matching sequence of the query graph is automatically selected according to whether the data graph is a dense graph or a sparse graph, namely, a GOL (generic object language) query sequence method is selected for the dense data graph, and an RI (inverse representation) query sequence method is selected for the sparse data graph, so that the problem of the matching speed in the enumeration stage in subgraph matching can be solved.
Secondly, the invention provides a CPU-FPGA collaborative hybrid architecture to accelerate subgraph matching, and a host (CPU) mainly processes and filters candidate vertexes, constructs a candidate vertex data structure, constructs a query matching sequence, partitions the candidate vertex data and enumerates partial subgraph matching; the kernel side (FPGA) mainly processes enumeration tasks of subgraph matching, and can obviously accelerate enumeration speed of subgraph matching.
Thirdly, the invention provides a partition division method of the candidate vertex auxiliary data structure set, which can effectively realize the load balance of multiple computing units at a host end and an FPGA end.
Fourthly, the FPGA's spatial data parallelism and temporal data parallelism are used to accelerate the enumeration speed of subgraph matching.
The performance of the hybrid subgraph matching method is tested on real and synthetic data sets through extensive experiments, and the final results show that it outperforms the state-of-the-art subgraph matching methods.
Drawings
FIG. 1 is a given data graph G and query graph q, where FIG. 1 (a) is the query graph q and FIG. 1 (b) is the data graph G;
FIG. 2 is a structure of an auxiliary data structure A;
FIG. 3 is the subgraph matching time of the GraphQL method;
FIG. 4 is the subgraph matching time of the CECI method;
FIG. 5 is the CPU usage of GraphQL and CECI method during subgraph matching;
FIG. 6 is a system structure diagram of the present invention for CPU and FPGA cooperative processing;
fig. 7 is an example of an auxiliary data structure set, showing, from left to right, the original auxiliary data structure set A, the first layer of A, the second layer of A, and the third layer of A;
FIG. 8 is a conceptual diagram of a Host-to-Kernel Dataflow;
FIG. 9 is a result of an eu2005 data set enumeration performance test;
FIG. 10 is a Youtube dataset enumeration performance test result;
FIG. 11 is the result of enumeration performance evaluation under the synthetic data set SD 05;
FIG. 12 is the result of enumeration performance evaluation under the synthetic data set SD 10;
fig. 13 shows the result of enumeration of performance evaluation in the synthetic data set SD 15.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the following description will be made with reference to the accompanying drawings and embodiments.
1. Graph and subgraph matching background
Definition 1: a graph G is defined as a tuple G = (V, E, Σ_V, L), where V represents a non-empty set of vertices in the graph, E ⊆ V × V represents the set of edges in the graph, Σ_V is the set of vertex labels in the graph, and L is a labeling function used to assign labels to vertices; L(u) represents the label of vertex u.
Definition 2: an undirected graph is a labeled graph whose edges have no direction. An edge in an undirected graph is an unordered pair, usually written with parentheses; e.g., (u, v) indicates that there is an edge between vertex u and vertex v, where u, v ∈ V, and (u, v) is equivalent to (v, u).
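The unordered-pair property of Definition 2, that (u, v) and (v, u) denote the same edge, is commonly captured in code by storing each edge as a two-element frozenset, as in this small sketch (an illustrative convention, not from the patent):

```python
def edge(u, v):
    """Canonical undirected edge: edge(u, v) == edge(v, u)."""
    return frozenset((u, v))

# inserting both orientations of u0-u1 yields a single edge in the set
E = {edge('u0', 'u1'), edge('u1', 'u0'), edge('u1', 'u2')}
```

Because frozensets are hashable and orientation-free, the edge set E deduplicates the two orientations automatically.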
Definition 3: given two graphs G = (V, E, Σ) and G' = (V', E', Σ'), if:
(1) V' ⊆ V;
(2) E' ⊆ E;
(3) for every edge (u, v) ∈ E, u ∈ V and v ∈ V (i.e., the two endpoints u and v of any edge in E are in the vertex set V);
(4) for every edge (u', v') ∈ E', u' ∈ V' and v' ∈ V' (i.e., the two endpoints u' and v' of any edge in E' are in the vertex set V');
then the graph G' is called a subgraph of the graph G.
Definition 4: given a data graph G = (V, E, L) and a query graph Q = (V', E', L'), if there is an injective function f: V' → V such that the following conditions hold:
(1) for every v' ∈ V', L'(v') = L(f(v'));
(2) for every edge (u', v') ∈ E', (f(u'), f(v')) ∈ E;
then graph Q is subgraph-isomorphic to graph G, denoted Q ⊆ G.
Based on the above definitions, given a query graph q and a data graph G, subgraph matching is to find all subgraphs of the data graph G that are isomorphic to the given query graph q.
The invention refers to a subgraph isomorphism mapping as a subgraph matching embedding. In the following, the query graph q is assumed to be connected with |V(q)| ≥ 3 vertices, since finding all matches of a single vertex or a single edge is a simple enumeration.
for example, for query graph q in fig. 1 (a) and data graph G in fig. 1 (b), all sub-graph matching embedding for graph q in graph G can be expressed as: { (u) 00 ),(u 13 ),(u 24 ),(u 312 )}, {(u 00 ),(u 15 ),(u 24 ),(u 312 )},{(u 00 ),(u 15 ),(u 24 ),(u 313 )},{(u 0 , ν 1 ),(u 16 ),(u 27 ),(u 315 )},{(u 02 ),(u 18 ),(u 29 ),(u 316 )},{(u 02 ),(u 1 , ν 8 ),(u 29 ),(u 317 )}。
The graphs of interest in the present invention are undirected labeled graphs G = (V, E, Σ, L), where V represents the set of vertices of the graph and E represents the set of edges. Given a vertex u ∈ V, N(u) denotes the set of neighbor vertices of u. The labeling function L assigns a label l (l ∈ Σ) to each vertex u (u ∈ V), and d(u) denotes the degree of vertex u in G. Common symbols of the invention are given in Table 1.
TABLE 1 common symbols
Definition 5: given a query graph q and a data graph G, the candidate vertex set C(u) is a set of data graph vertices: for u ∈ V(q) and v ∈ V(G), if (u, v) can appear in some match from graph q to graph G, then vertex v belongs to C(u).
Example 2: for the query graph q in fig. 1 (a) and the data graph G in fig. 1 (b), the corresponding BFS tree tq, the set of candidate vertices C (u) and the auxiliary data structure a are shown in fig. 2.
For a given query graph q and its BFS tree tq, in the candidate auxiliary data structure A, for adjacent vertices u and u_n, if (u, u_n) ∈ E(q) but (u, u_n) ∉ E(tq), then u_n is a non-tree neighbor of u. For non-tree neighbors u and u_n in A, v (v ∈ C(u)) and v_n (v_n ∈ C(u_n)) are called non-tree candidate neighbors. A inherits the parent-child vertex relationships of tq, where the parent and child vertices of u are denoted u_p and u_c, respectively.
Definition 6: a matching order φ is a permutation of the query graph vertex set V(q), and φ[i] denotes the i-th vertex in the matching order. φ[i:j] denotes the set of vertices from matching order index i to j (1 ≤ i ≤ j ≤ |V(q)|).
Algorithm 1 below describes the staged flow of a general subgraph matching algorithm. Its input is a query graph q and a data graph G; its output is all matching embeddings of q in G. A backtracking-search-based subgraph matching algorithm can be divided into the following stages: input the data graph G and query graph q; filter to generate the candidate vertex set C(u); construct the candidate data structure A; generate the matching order; enumerate subgraph matches according to C(u), A, and the matching order; and output all subgraph matching embeddings.
Algorithm 1
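The enumeration stage of the staged flow that Algorithm 1 describes can be sketched as a backtracking search over the candidate sets. This is a Python sketch under simplifying assumptions: the filtering and ordering stages are taken as already done, and the toy inputs below are hypothetical.

```python
def enumerate_matches(q_adj, order, C, g_adj):
    """Extend a partial embedding along the matching order; a data
    vertex v is accepted for query vertex u only if it is unused and
    adjacent to the images of all already-mapped neighbors of u."""
    results, emb = [], {}

    def extend(i):
        if i == len(order):
            results.append(dict(emb))        # a complete embedding
            return
        u = order[i]
        mapped = [w for w in q_adj[u] if w in emb]
        for v in C[u]:
            if (v not in emb.values()        # injectivity
                    and all(v in g_adj[emb[w]] for w in mapped)):
                emb[u] = v                   # map u -> v and recurse
                extend(i + 1)
                del emb[u]                   # backtrack

    extend(0)
    return results

# toy triangle query on a triangle-plus-pendant data graph (assumed inputs)
q_adj = {'u0': {'u1', 'u2'}, 'u1': {'u0', 'u2'}, 'u2': {'u0', 'u1'}}
g_adj = {'v0': {'v1', 'v2', 'v3'}, 'v1': {'v0', 'v2'},
         'v2': {'v0', 'v1'}, 'v3': {'v0'}}
C = {'u0': ['v0'], 'u1': ['v1', 'v2'], 'u2': ['v1', 'v2']}
results = enumerate_matches(q_adj, ['u0', 'u1', 'u2'], C, g_adj)
```

The recursive `extend` loop is the compute-intensive part that the invention offloads to the FPGA compute units; everything before it (building C and the order) stays on the CPU.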
2. Existing subgraph matching methods on CPU, GPU, and FPGA platforms
1. CPU-based method
Existing CPU-based subgraph matching algorithms fall into three categories. The first category is based on a direct enumeration framework, which enumerates all results by searching the data graph G, e.g., QuickSI, RI, and VF2++. The second category is based on an index enumeration framework, which builds an index on the data graph G and then completes all matching queries with the help of the index, e.g., GADDI, SPath, and SGMath. The third category is based on a preprocessing enumeration framework: first generate a candidate vertex set from the vertices of the query graph, construct an auxiliary data structure to maintain the edges between the candidate vertex sets, then generate a query matching order, and finally enumerate over the auxiliary data structure according to the matching order to produce all matching results, e.g., GraphQL, TurboIso, CFL-Match, CECI, and DP-iso. Meanwhile, subgraph matching algorithms have also been widely studied on multi-core CPU platforms, e.g., VF3P, PGX.ISO, and PSM.
2. GPU-based method
A GPU consists of tens of streaming multiprocessors, each containing hundreds of cores, and therefore offers massive parallelism. Many subgraph matching algorithms are built on GPU platforms to exploit this parallelism. For example: GpSM runs on GPUs and takes edges as the basic unit; in its verification stage it joins candidate edges in parallel to form a partial solution, and repeats this construction until the final solution is obtained. GunrockSM runs on a GPU platform and adopts a binary join strategy, collecting the candidate set of each edge of the query graph q and joining them to find the final matching results. GSI provides a pre-allocation merging method with a filtering part and a joining part, replacing the edge-based joining strategy with a vertex-based one; during joining, new results are first written to a buffer and, once a new intermediate table has been allocated, written into that table, so the joining process never has to be redone, saving a large amount of work and improving overall performance. PBE is a partition-based enumeration method that uses a GPU to accelerate subgraph matching: the graph is split into partitions, each of which fits in GPU memory; the GPU processes one partition at a time and searches it for matching subgraphs of the given pattern as if it were a small graph.
3. FPGA-based method
An FPGA is an array of a large number of small processing units containing up to millions of programmable 1-bit adaptive logic blocks, and therefore offers deep pipeline parallelism and reconfigurability. FPGAs provide an energy-efficient solution for building specialized hardware for graph processing applications. More and more researchers have recently developed various graph algorithms and graph processing frameworks on FPGA platforms. For example: MACIEJ proposes using FPGAs to accelerate substream-centric maximum matching. To use FPGA resources effectively, this substream-centric method divides the input data stream into substreams that are processed independently, achieving high parallelism while reducing communication cost. FAST uses a CPU-FPGA co-designed architecture and exploits the FPGA's pipeline parallelism to accelerate subgraph matching on a single machine. Beyond accelerating individual graph algorithms, a number of general-purpose frameworks have been designed on FPGAs to ease the deployment of graph algorithms, such as ThunderGP, Edge-centralized panel, Graff, and GraphGen. However, these frameworks are built around specific programming models (e.g., edge-centric or vertex-centric), which limits the implementation of highly optimized subgraph matching algorithms.
3. The matching method provided by the invention
The invention first measures the per-stage execution time of the state-of-the-art subgraph matching algorithms on the CPU. Since the enumeration stage of subgraph matching is a compute-intensive task, a CPU-FPGA-based cooperative hybrid architecture is designed to accelerate subgraph matching, together with a staged hybrid subgraph matching algorithm. A partition method over the candidate vertex auxiliary data structure set is also implemented to achieve load balance across the FPGA's multiple compute units. Multiple FPGA compute units are designed, Host-to-Kernel Dataflow is enabled, and the enumeration kernel function accelerates subgraph matching with optimization techniques such as Loop Pipelining, Loop Unrolling, Dataflow, and Function Inlining.
Subgraph matching mainly comprises three stages: filtering the candidate vertex set, constructing the auxiliary structures, and enumerating subgraphs. To analyze which of the three stages consumes the most resources, the execution time of each stage of the subgraph matching methods GraphQL and CECI was measured. In the tests, the data graph is the HPRD data set, and the query graphs are sparse and dense graphs containing 8, 16, and 32 vertices, respectively. The time spent in each stage was measured while the GraphQL method queried 3×10^8 subgraphs; fig. 3 shows the per-stage execution time of the GraphQL method. Likewise, the time spent in each stage was measured while the CECI method queried 9×10^9 subgraphs; fig. 4 shows the per-stage execution time of the CECI method.
In FIGS. 3-4, d denotes a dense graph (d(q) ≥ 3), s denotes a sparse graph (d(q) < 3), and the i in d_i and s_i denotes the number of query graph vertices. The test results clearly show that, with both the GraphQL and CECI methods, and regardless of whether the query graph is sparse or dense, the enumeration stage has by far the longest execution time, occupying more than 99% of the total execution time of all stages.
The invention also measures CPU utilization during subgraph matching with the GraphQL and CECI methods, where the data graph is the HPRD data set and the query graph is d_32. FIG. 5 shows the CPU utilization when performing subgraph matching with the GraphQL and CECI methods.
The test results show that CPU utilization is low in the initial stages of subgraph matching but becomes very high once the enumeration stage is entered, occupying more than 90% of CPU resources. These experiments show that in subgraph matching, the enumeration stage of each algorithm takes the longest CPU time and requires the most CPU resources.
Based on this analysis, the invention provides a system architecture for cooperative processing by a CPU and an FPGA. The host end is a general-purpose CPU, mainly used for generating the candidate vertex set C(u), constructing the auxiliary data structure A, generating the query graph matching order, partitioning the candidate vertex auxiliary data structure set, performing subgraph matching enumeration on part of the partitioned C(u), and aggregating result statistics.
And the kernel end is an FPGA acceleration card and is mainly used for accelerating the enumeration function matched with the sub-graph. The FPGA and the CPU are communicated through PCIe. The overall architecture of the system is shown in fig. 6.
When sub-graph matching is carried out, a data graph and a query graph are firstly read into a CPU (central processing unit) end memory, and the overall execution flow of the system is as follows:
(1) The CPU end generates a candidate vertex set by filtering with the LDF and NLF rules according to the data graph and the query graph, and constructs a candidate auxiliary data structure A using the CFL method;
(2) At the CPU end, a corresponding query graph matching order is generated according to the sparsity of the data graph. The matching order is selected automatically in the enumeration stage according to whether the data graph is dense or sparse: the GQL ordering method is used for dense graphs and the RI ordering method for sparse graphs;
(3) Dividing A into a plurality of data sets based on the constructed candidate auxiliary data structure A and the number of kernel computing units, and realizing load balance of multiple computing units at a host end and an FPGA end;
(4) Multiple compute units are designed with the FPGA's spatial data parallelism to run the subgraph matching enumeration kernel function; Host-to-Kernel Dataflow is enabled with the FPGA's temporal data parallelism, and the enumeration kernel function enables Loop Pipelining, Loop Unrolling, Dataflow, Function Inlining, and other optimizations to accelerate enumeration.
(5) Finally, the CPU accumulates the subgraph-matching results enumerated by the kernel at the FPGA end together with the subgraph-matching results of the partial vertex data structure set processed at the CPU end, and outputs all subgraph-matching results in the input data graph that are isomorphic to the input query graph.
The staged hybrid subgraph matching algorithm is described below. It covers the candidate vertex filtering and auxiliary data structure construction, the query matching order methods, the enumeration method, and the overall staged hybrid method.
1. Generation of C(u) and construction of A
The candidate vertex set C(u) is generated from the query graph and the data graph and filtered with LDF (the label and degree filter) and NLF (the neighborhood label frequency filter). LDF is a label-and-degree rule: the labels of the vertices must be equal and the degree of v must be greater than or equal to the degree of u. NLF is a neighbor label frequency rule: the neighbor label frequency of v must be greater than or equal to that of u. This yields the filtered candidate vertex set.
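As an illustration of the two rules, here is a minimal Python sketch of LDF/NLF candidate filtering; function and variable names are illustrative, not taken from the patent:

```python
from collections import Counter

def ldf_nlf_filter(query, data, labels_q, labels_g):
    """Build C(u) for every query vertex u with the LDF and NLF rules.
    query/data are adjacency dicts {vertex: set(neighbors)};
    labels_q/labels_g map vertices to labels."""
    def nlf(v, adj, labels):
        # frequency of each label among v's neighbors
        return Counter(labels[n] for n in adj[v])

    candidates = {}
    for u in query:
        need = nlf(u, query, labels_q)
        cu = set()
        for v in data:
            # LDF: equal labels and deg(v) >= deg(u)
            if labels_g[v] != labels_q[u] or len(data[v]) < len(query[u]):
                continue
            # NLF: v's neighbor-label frequencies dominate u's
            have = nlf(v, data, labels_g)
            if all(have[lbl] >= cnt for lbl, cnt in need.items()):
                cu.add(v)
        candidates[u] = cu
    return candidates
```

On a triangle query embedded in a small labeled data graph, this prunes any data vertex whose degree or neighbor-label frequencies fall short.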
The auxiliary data structure A is constructed using the CFL method: the vertices of the query graph are visited step by step in a top-down construction, and the corresponding vertices in the data graph G are found, thereby building A. That is, A contains C(u) for every vertex of the query graph q; for example, if q has three vertices u1, u2 and u3, there are three candidate vertex sets C(u1), C(u2) and C(u3), which together form the auxiliary data structure A.
The top-down construction of the auxiliary data structure A includes three phases: forward candidate generation (generating candidate sets in a forward pass); backward candidate pruning (pruning unwanted candidates in a backward pass); and adjacency list construction (building the adjacency lists between each query vertex's candidates and its parent's candidates in the data graph). Finally, CFL performs bottom-up refinement, including candidate refinement and adjacency list pruning.
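The top-down pass can be sketched as follows, assuming a simple adjacency-set graph representation. This toy version performs only the forward candidate generation and parent adjacency-list construction; CFL's backward pruning and bottom-up refinement are omitted:

```python
from collections import deque

def build_auxiliary(query, data, candidates, root):
    """Visit query vertices in BFS order and keep, for each non-root u, only
    candidates adjacent to at least one surviving candidate of u's BFS parent;
    also record those parent adjacency lists (A[u][v])."""
    parent, order = {root: None}, [root]
    q = deque([root])
    while q:  # BFS over the query graph
        u = q.popleft()
        for w in query[u]:
            if w not in parent:
                parent[w] = u
                order.append(w)
                q.append(w)
    A = {u: {} for u in order}        # A[u][v] = candidate parents adjacent to v
    kept = {root: set(candidates[root])}
    for u in order[1:]:
        p = parent[u]
        kept[u] = set()
        for v in candidates[u]:
            links = data[v] & kept[p]  # surviving parent candidates next to v
            if links:
                kept[u].add(v)
                A[u][v] = links
    return kept, A
```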
2. Query order method
The GQL method or the RI method is selected automatically according to the density of the data graph: a dense data graph uses the GQL method to generate the query matching order, and a sparse data graph uses the RI method. The enumeration stage then enumerates with the corresponding query matching order. The ideas behind the GQL and RI ordering methods are described below.
GQL: the GQL method uses a left-deep-join based approach that models the query as a left-deep join tree whose leaf nodes are candidate vertex sets. The GQL method first selects u' = argmin_{u∈V(q)} |C(u)| as the starting vertex of the matching order π; it then iteratively selects u' = argmin_{u∈N(π)\π} |C(u)| as the next vertex of the matching order, and after several iterations the final query matching order is obtained.
RI: the RI method generates the matching order from the structure of the query graph q alone. It first selects u' = argmax_{u∈V(q)} d(u) as the start of the matching order π, then iteratively selects u' = argmax_{u∈N(π)\π} |N(u)∩π| as the next vertex. The RI method thus places vertices with more neighbors earlier in the matching order π.
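A minimal sketch of the two ordering heuristics; ties are broken by sorted vertex name here, and the real GQL/RI implementations differ in detail:

```python
def matching_order(query, candidates, dense):
    """Pick the matching order pi: GQL-style (min |C(u)|) for dense data
    graphs, RI-style (max connectivity to the chosen prefix) for sparse ones.
    Assumes the query graph is connected."""
    if dense:  # GQL-style: start with the smallest candidate set
        pi = [min(sorted(query), key=lambda u: len(candidates[u]))]
        select = lambda opts: min(opts, key=lambda u: len(candidates[u]))
    else:      # RI-style: start with max degree, then max back-edges into pi
        pi = [max(sorted(query), key=lambda u: len(query[u]))]
        select = lambda opts: max(opts, key=lambda u: len(query[u] & set(pi)))
    while len(pi) < len(query):
        frontier = sorted({w for u in pi for w in query[u]} - set(pi))
        pi.append(select(frontier))
    return pi
```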
3. Enumeration method
The enumeration stage is the most resource-hungry and longest-running stage of subgraph matching, so algorithmic optimization of the enumeration stage is the key to accelerating it. In the enumeration stage, the recursive enumeration procedure of algorithm 1 is used to find all matching embeddings. When computing local candidates, a set-intersection based enumeration technique is adopted; the set-intersection procedure is shown as algorithm 2. Enumeration proceeds according to the matching order of the query graph q and the auxiliary data structure A to obtain the final subgraph matching results.
Algorithm 2: computing local candidates using an intersection-based method
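The recursive enumeration with set-intersection based local candidates (the flavor of algorithms 1 and 2) can be sketched as follows; this is an illustrative reconstruction, not the patent's kernel code:

```python
def enumerate_matches(query, data, candidates, pi):
    """Recursive backtracking enumeration. Local candidates for pi[i] are
    obtained by intersecting, over every already-matched query neighbor w of
    pi[i], the data-graph neighbors of M[w] (the set-intersection step), then
    filtering against C(pi[i])."""
    results, M = [], {}

    def extend(i):
        if i == len(pi):
            results.append(dict(M))
            return
        u = pi[i]
        local = set(candidates[u])
        for w in query[u]:
            if w in M:                 # intersect with the data neighbors
                local &= data[M[w]]    # of the vertex matched to w
        for v in sorted(local):
            if v in M.values():        # enforce injective embeddings
                continue
            M[u] = v
            extend(i + 1)
            del M[u]

    extend(0)
    return results
```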
4. Staged hybrid method
The subgraph matching process of the invention uses a staged hybrid algorithm: when generating the candidate vertex set C(u), the LDF and NLF rules are used for filtering and the auxiliary data structure set A is built with the CFL method; in the order generation stage, the GQL or RI method is selected automatically according to the density of the data graph to generate the matching order; and in the enumeration stage, the set-intersection based enumeration method is used. The detailed staged hybrid method is shown as algorithm 3.
Algorithm 3: staged hybrid algorithm
5. Auxiliary data set partitioning
To achieve load balance of the auxiliary data structure sets between each computing unit in the FPGA kernel and the CPU, the partitioning strategy introduced in algorithm 4 is adopted. The first step determines the partitioning factor k; then, to partition the auxiliary data structure set A, we start partitioning the candidates from the root vertex of A, and if the number of root candidates of A is less than k, we continue partitioning along the candidates of the next vertex in the matching order π.
Algorithm 4: APartition(A, C(u), k, π_i)
Example: as shown in fig. 7, given the original auxiliary data structure set and assuming k = 3, the root candidate vertices are first partitioned into 3 parts: {v0}, {v1} and {v2}; then, for each root candidate vertex, the sets of neighboring vertices are selected layer by layer from the auxiliary data structure set A.
In general, C(u) is divided into several blocks: one block is executed by the CPU, and several computing units are arranged on the FPGA, each executing one block. There is no overlapping search space between partitions, so no duplicated enumeration results can occur across partitions. At the same time, the partitioning lets each computing unit achieve better load balance.
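A simplified sketch of the root-level partitioning; the descent to the next vertex of π when |C(root)| < k, as in algorithm 4, is omitted here:

```python
def partition_aux(candidates, pi, k):
    """Split the search space into at most k partitions by dividing the root
    candidates C(pi[0]); each partition is the full candidate structure with
    the root set restricted. Partitions share no root candidate, so no
    embedding is enumerated twice across partitions."""
    root = pi[0]
    roots = sorted(candidates[root])
    k = min(k, len(roots))
    chunks = [roots[i::k] for i in range(k)]   # round-robin split
    parts = []
    for chunk in chunks:
        part = dict(candidates)                # shallow copy is enough here
        part[root] = set(chunk)
        parts.append(part)
    return parts
```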
6. FPGA kernel design
The deep-pipeline parallelism, reconfigurability and low energy consumption of the FPGA are the keys to accelerating data-intensive operations, and the compute-intensive task in subgraph matching is the enumeration stage. Therefore, the kernel function designed here is the subgraph-matching enumeration function. The invention uses the Spatial Data Parallelism and Temporal Data Parallelism of the FPGA to accelerate the enumeration stage. Spatial data parallelism means designing several computing units to run the subgraph-matching enumeration kernel; the overall design structure is the kernel end shown in fig. 6, and in the experiments the number of computing units is set to 1, 2, 4, 8 and 16 for performance testing. The enumeration kernel is compiled into multiple computing units, and the clEnqueueTask command is called multiple times in an out-of-order command queue to achieve data parallelism.
Temporal data parallelism on the FPGA is realized by enabling Host-to-Kernel Dataflow. Enabling Host-to-Kernel Dataflow can further improve accelerator performance by letting the kernel start processing a new set of data while the previous set is still being processed. A conceptual diagram of Host-to-Kernel Dataflow is shown in fig. 8.
The longer the kernel takes to process a set of data from start to finish, the greater the opportunity to improve performance with host-to-kernel dataflow. Temporal parallelism is achieved when different stages of the same kernel process different sets of data from multiple clEnqueueTask commands in a pipelined manner. To implement host-to-kernel dataflow, the kernel must implement the ap_ctrl_chain protocol via an HLS interface pragma, i.e., by placing the directive #pragma HLS INTERFACE ap_ctrl_chain port=return bundle=control in the kernel function.
To realize the deep-pipeline parallelism of the FPGA, the enumeration kernel enables Loop Pipelining, Loop Unrolling, Dataflow, Function Inlining, etc., to obtain optimized enumeration acceleration.
In the enumeration stage of subgraph matching, loops are the main way the enumeration process is implemented, and loop optimization is an important aspect of high-performance pipelined accelerators on an FPGA architecture. By default, loops are neither pipelined nor unrolled, so the loops in the enumeration kernel need to be optimized with loop pipelining and loop unrolling. Concretely, the following directives are used inside the loop body:
#pragma HLS PIPELINE//Loop Pipeline
#pragma HLS UNROLL//Loop unroll
For loop unrolling, fully unrolling a loop can consume a large amount of device resources, especially when the iteration count is large, so partial loop unrolling can improve performance with fewer hardware resources. The directive for partial loop unrolling is:
#pragma HLS UNROLL factor=n//n is a number
Therefore, we first apply loop pipelining in the enumeration kernel, while keeping loop bodies as small as possible and using bounded partial unrolling (i.e., setting factor=n) to further improve performance.
Dataflow optimization is a powerful technique for improving kernel performance: it supports task-level pipeline parallelism inside the kernel, allowing the compiler to schedule multiple kernel functions to run concurrently for higher throughput and lower latency.
Examples
The invention accelerates subgraph matching on a CPU-FPGA hybrid platform; the matching algorithm is the staged hybrid subgraph matching algorithm, and the pipeline parallelism of the FPGA is exploited in the enumeration stage. The experimental data include several real datasets and synthetic datasets. The experimental environment and datasets are described in detail below.
1. Experimental configuration
The experimental environment is as follows: the CPU end is an AMD Ryzen Threadripper 3970X 32-Core Processor with a base clock of 3.7 GHz, 32 cores, 64 threads, and 128 GB of DDR memory.
The FPGA is a Xilinx Alveo U250 data center accelerator card, providing 1.3M LUTs, 11.5k DSP slices, 64 GB of DDR4 memory (77 GB/s total bandwidth), dual QSFP28 network interfaces, 4320 BRAM blocks and a PCIe interface.
Datasets: real datasets and synthetic datasets are used to evaluate system performance.
Real datasets: eight real datasets were used: eu2005, Youtube, DBLP, Yeast, Human, HPRD, WordNet and US Patents; their characteristics are shown in table 2. The datasets Yeast, Human, HPRD and WordNet are labeled. For an unlabeled dataset, a label is randomly selected from the label set and assigned to each vertex.
Synthetic datasets: the synthetic datasets are generated with the PaRMAT tool, a multi-threaded RMAT graph generator. The four RMAT parameters are set to a = 0.45, b = 0.15, c = 0.15, d = 0.25, since the ratios a:b and a:c are approximately 3:1 in many real-world graphs. Different labels are assigned randomly to the vertex set. We vary the label set |Σ| from 4 to 32, the vertex count |V| from 4M (million) to 128M, and the average degree d from 4 to 32. Unless otherwise stated, the synthetic dataset used by default is |V| = 64M, d = 16, |Σ| = 16; |V|, d and |Σ| are then varied individually to test the scalability of the algorithm and platform.
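For reference, a minimal single-threaded R-MAT edge generator with the same a, b, c, d parameters; the actual experiments use a multithreaded tool, so this sketch only illustrates the recursive quadrant-selection idea:

```python
import random

def rmat_edges(scale, n_edges, a=0.45, b=0.15, c=0.15, d=0.25, seed=1):
    """Generate n_edges R-MAT edges over 2**scale vertices. Each edge picks
    one quadrant of the adjacency matrix per bit of the vertex id."""
    rng = random.Random(seed)
    edges = []
    for _ in range(n_edges):
        src = dst = 0
        for _ in range(scale):
            r = rng.random()
            src <<= 1
            dst <<= 1
            if r < a:            # top-left quadrant: neither bit set
                pass
            elif r < a + b:      # top-right: destination bit set
                dst |= 1
            elif r < a + b + c:  # bottom-left: source bit set
                src |= 1
            else:                # bottom-right: both bits set
                src |= 1
                dst |= 1
        edges.append((src, dst))
    return edges
```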
Varying |V|: we synthesized 5 data graphs with 4M, 16M, 32M, 64M and 128M vertices, respectively.
Varying |Σ|: we generated 5 data graphs with 4, 8, 16, 20 and 32 labels, respectively.
Varying d: we synthesized 5 data graphs with average degrees of 4, 8, 16, 20 and 32, respectively.
The detailed characteristics of the synthetic datasets are shown in table 3.
TABLE 2 characteristics of real world datasets
TABLE 3 characteristics of the synthetic data set
Query graphs: as described above, subgraphs are randomly extracted from each data graph to generate the query graphs, with the number of query vertices varied from 4 to 32. For each vertex count, a dense query graph (d(q) ≥ 3) and a sparse query graph (d(q) < 3) are generated. QiD and QiS denote a dense and a sparse query graph with i vertices, respectively. The query graph characteristics are shown in table 4.
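One plausible way to "randomly extract subgraphs from each data graph" is a random BFS-style walk that keeps the induced edges; the procedure below is an assumption about the extraction method, sketched for illustration:

```python
import random

def extract_query(data, n_vertices, seed=0):
    """Pick a random start vertex, grow a connected vertex set by a random
    BFS-like expansion up to n_vertices, and return the induced subgraph."""
    rng = random.Random(seed)
    start = rng.choice(sorted(data))
    chosen, frontier = {start}, [start]
    while frontier and len(chosen) < n_vertices:
        u = frontier.pop(rng.randrange(len(frontier)))
        for w in sorted(data[u]):
            if w not in chosen and len(chosen) < n_vertices:
                chosen.add(w)
                frontier.append(w)
    # induced subgraph: keep only edges between chosen vertices
    return {u: data[u] & chosen for u in chosen}
```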
TABLE 4 characteristics of query atlas
2. Results of the experiment
According to the experimental configuration, the matching method provided by the invention is utilized for matching.
First, with the number of kernel computing units n = 4, the enumeration performance of the CPU-FPGA-based staged hybrid subgraph matching algorithm is tested against the state-of-the-art subgraph matching algorithms GQL, CFL, CECI and DP-iso. The data graphs are the eu2005 and Youtube datasets, the query graphs use Q16D, Q D and Q16S, Q S respectively, and 9 × 10^9 subgraph matches are queried. The evaluation results are as follows:
based on the number of core computing units being 4, enumeration performance evaluation is performed on the eu2005 data set, and the result of enumeration performance comparison of each algorithm is shown in fig. 9.
On the eu2005 dataset, our method PH-CF enumerates better, as this dataset is relatively large among the real datasets. The highest speedups of PH-CF's enumeration over the CFL, CECI and DP-iso methods are 16.07x, 38.61x and 11.46x, respectively. The enumeration performance of the PH-CF method is also relatively stable.
An enumeration performance test is performed on the Youtube data set, and the results of the enumeration performance evaluation comparison of the algorithms are shown in fig. 10.
When querying 9 × 10^9 subgraph matches on the Youtube dataset, the highest speedups of the invention's PH-CF over the CFL, CECI and DP-iso methods are 5.16x, 5.87x and 3.99x, respectively.
Performance evaluations were next performed on the synthetic datasets. With the number of kernel computing units n = 4, the synthetic datasets SD05, SD10 and SD15 from table 3 are selected, the query graphs use Q16D, Q D and Q16S, Q S respectively, 9 × 10^10 subgraph matches are queried, and the performance of the CFL, CECI and DP-iso algorithms is tested against our algorithm PH-CF.
An enumeration performance evaluation is performed on the SD05 data set, and the result of comparing enumeration performance of each algorithm is shown in fig. 11.
On the synthetic dataset SD05, 9 × 10^10 subgraph matches were queried; the evaluation shows that the highest speedups of our method PH-CF over CFL, CECI and DP-iso are 4.08x, 4.67x and 3.16x, respectively.
An enumeration performance evaluation is performed on the synthetic dataset SD10; the comparison of the algorithms is shown in fig. 12. On SD10, 9 × 10^10 subgraph matches were queried, and the evaluation shows that the highest speedups of our method PH-CF over the CFL, CECI and DP-iso methods are 2.78x, 3.41x and 3.78x, respectively.
An enumeration performance evaluation is performed on the synthetic dataset SD15; the comparison of the algorithms is shown in fig. 13. On SD15, 9 × 10^10 subgraph matches were queried, and the evaluation shows that the highest speedups of our method PH-CF over CFL, CECI and DP-iso are 7.47x, 5.49x and 3.82x, respectively. In addition, the evaluation shows that the PH-CF method's enumeration performance is robust across datasets and its enumeration time is more stable.
The foregoing shows and describes the general principles, principal features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (5)

1. An acceleration subgraph matching method based on a CPU-FPGA hybrid platform is characterized by comprising the following steps:
step 1: inputting the data graph and the query graph to be queried into the memory of the CPU;
step 2: performing subgraph matching based on the CPU-FPGA collaborative hybrid architecture, finding from the given data graph all subgraphs that have the same structure and the same node labels as the query graph, which comprises the following three stages:
in the first stage, the CPU end, serving as the host end, generates the candidate vertex set from the data graph and the query graph, filters it with the LDF and NLF rules, and then constructs the auxiliary data structure using the CFL method;
in the second stage, the CPU end selects the GQL method or the RI method to generate the query matching order according to the density of the data graph;
in the third stage, subgraph matching enumeration is performed at the CPU end and the FPGA end respectively according to the candidate vertex set, the auxiliary data structure and the query matching order, and the results from the CPU end and the FPGA end are combined to obtain the final subgraph matching result;
step 3: outputting the final subgraph matching embedding results.
2. The CPU-FPGA-based hybrid platform accelerated subgraph matching method of claim 1, wherein the construction steps of the candidate vertex set and the auxiliary data structure in the first stage are as follows:
step 201: for a given query graph q and data graph G, for a vertex u ∈ V(q) of the query graph and a vertex v ∈ V(G) of the data graph, if (u, v) is a candidate match from q to G, adding v to the candidate vertex set, where V denotes a vertex set;
step 202: filtering the candidate vertex set using the LDF and NLF rules to obtain the filtered candidate vertex set C(u);
step 203: constructing a BFS tree qt according to a given query graph q;
step 204: visiting the vertices of the query graph q step by step from top to bottom and finding the corresponding vertices in the data graph G so as to construct the auxiliary data structure A, then performing bottom-up refinement with CFL, eliminating unwanted candidate vertices and deleting nonexistent vertices from the adjacency lists.
3. The CPU-FPGA-based hybrid platform accelerated sub-graph matching method of claim 1, wherein the step of generating the query matching order in the second stage comprises:
step 301: if the data graph is a dense graph, selecting the GQL method to generate the query matching order and going to step 302; if the data graph is a sparse graph, selecting the RI method to generate the corresponding query matching order and going to step 303;
step 302: the GQL method: first selecting u' = argmin_{u∈V(q)} |C(u)| as the starting vertex of the matching order π, then iteratively selecting u' = argmin_{u∈N(π)\π} |C(u)| as the next vertex of the matching order;
step 303: the RI method: first selecting u' = argmax_{u∈V(q)} d(u) as the start of the matching order π, then iteratively selecting u' = argmax_{u∈N(π)\π} |N(u)∩π| as the next vertex of the matching order.
4. The CPU-FPGA-based hybrid platform accelerated sub-graph matching method of claim 1, wherein the third-stage sub-graph matching step comprises:
step 401: at the CPU end, computing local candidates with the intersection-based embedding method according to the obtained matching order to obtain the corresponding subgraph matching results;
step 402: partitioning the candidate vertex set based on the auxiliary data structure and the number of kernel computing units, dividing it into several blocks, one block being executed at the CPU end and the remaining blocks on the FPGA, thereby achieving load balance across the computing units at the CPU end and the FPGA end;
step 403: using the spatial data parallelism of the FPGA to design several computing units that run the enumeration kernel of the intersection-based embedding method in parallel, and using the temporal data parallelism of the FPGA to enable Host-to-Kernel Dataflow;
step 404: summarizing the subgraph-matching enumeration results of the FPGA end and the CPU end.
5. The CPU-FPGA-based hybrid platform accelerated sub-graph matching method as set forth in claim 4, wherein the specific operation steps of step 402 are:
step 4021: determining the partition factor k = min(k, |C(u)|);
step 4022: starting to partition the candidates from the root vertex of A; if the number of root candidates of the auxiliary data structure A is less than k, continuing to partition along the candidates of the next vertex in the matching order until the last vertex of A is reached.
CN202210864241.5A 2022-07-21 2022-07-21 Acceleration subgraph matching method based on CPU-FPGA hybrid platform Pending CN115687707A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210864241.5A CN115687707A (en) 2022-07-21 2022-07-21 Acceleration subgraph matching method based on CPU-FPGA hybrid platform


Publications (1)

Publication Number Publication Date
CN115687707A true CN115687707A (en) 2023-02-03



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination