CN110990638B - Large-scale data query acceleration device and method based on FPGA-CPU heterogeneous environment


Info

Publication number: CN110990638B (granted publication of application CN110990638A)
Application number: CN201911029459.3A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: data, fpga, cpu, query, graph
Legal status: Active (granted)
Inventors: 邹磊, 林殷年, 苏勋斌
Current assignee: Peking University
Original assignee: Peking University
Application filed by Peking University
Related application: PCT/CN2020/124541 (WO2021083239A1)

Classifications

    • G PHYSICS — G06 Computing; calculating or counting — G06F Electric digital data processing
    • G06F 16/00 Information retrieval; database structures therefor; file system structures therefor
    • G06F 16/901 Indexing; data structures therefor; storage structures
    • G06F 16/9024 Graphs; linked lists
    • G06F 16/90335 Query processing
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a large-scale data query acceleration method based on an FPGA-CPU heterogeneous environment, together with a device implementing it on an FPGA. The large-scale data to be queried is represented as a large-scale graph dataset in the Resource Description Framework (RDF) format, and query acceleration is realized in the FPGA-CPU heterogeneous environment, solving the problem of fast querying over large-scale datasets and accelerating graph database queries. The invention can be widely applied in technical fields based on graph data processing, and has been applied to intelligent natural-language question answering. The implementation shows that the method achieves a query speedup of more than 2x, reaching up to 10x, and can better satisfy applications with strict response-time requirements.

Description

Large-scale data query acceleration device and method based on FPGA-CPU heterogeneous environment
Technical Field
The invention belongs to the technical field of information retrieval and querying, relates to large-scale data retrieval acceleration, and particularly relates to a large-scale data retrieval acceleration method based on an FPGA-CPU heterogeneous environment and a device implementing it on an FPGA.
Background
In large-scale data retrieval and querying, a graph database is a database that uses graph structures for semantic queries, representing and storing data as nodes, edges, and attributes. Its key concept is the graph, which directly associates stored data items through nodes and sets of edges representing the relationships between those nodes. These relationships allow the stored data to be linked together and retrieved through dedicated operations and query languages. The most important content in a graph database is the relationships between data items, which are stored persistently in the database itself; the speed of querying entity relationships is the main factor in evaluating a graph database's performance. A graph database lets users display entity relationships intuitively, which is indispensable for processing highly interconnected data. Data that primarily represents relationships between real-world entities, such as a power network, can be retrieved, extended, and analyzed quickly and efficiently through a graph database.
Retrieving data from a graph database requires a query language other than SQL, which was designed to process data in relational systems and therefore cannot traverse graphs efficiently. As of 2019, no graph query language as universal as SQL had yet emerged, but there have been standardization efforts alongside the query languages provided by large vendors. Among them, SPARQL (SPARQL Protocol and RDF Query Language) expresses user queries as triple patterns, supports multiple aggregation functions, and is supported by most vendors.
To date, graph databases have found application in a wide range of fields: automatic association recommendation in the Google search engine, fraud detection in e-commerce transaction networks, pattern discovery in social networks, and so on all require graph database support. Currently, most graph databases achieve acceptable performance on datasets with millions to tens of millions of edges, but performance degrades significantly when the data scales up to hundreds of millions or even billions of edges, and the degradation is not linear but geometric. This is partly because the strategies and algorithms adopted inside these graph databases are not suited to such huge data sizes, and partly because most graph processing runs entirely on the CPU with serial algorithms and must read data from slow storage such as disks, so it cannot benefit from the advances of modern hardware.
To a large extent, the performance bottleneck of a graph database is that when the graph data is large, join operations on the graph consume a great deal of time and computing resources. The most common way to implement a join on a graph is to traverse one of the sets, read the adjacency list of each node, intersect it with the other set, and finally add the result to the result set. The main disadvantage of this approach is that the order in which points are traversed has a great impact on computational performance, so the traversal order must be chosen carefully. Moreover, this operation causes a large number of memory accesses, which severely limits processing speed.
Currently, the performance of most graph databases on small-scale datasets has reached a reasonable level; however, some of their strategies fail once the dataset is scaled up, even preventing efficient execution of the system. The invention aims to solve the problem of fast data querying over large-scale datasets by exploiting an FPGA hardware platform.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a large-scale data query acceleration method for graph databases based on an FPGA-CPU heterogeneous environment, together with a device implementing it on an FPGA. The invention solves the problem of fast querying over large-scale datasets, accelerates graph database queries, and can be widely applied in technical fields based on graph data processing.
The technical scheme provided by the invention is as follows:
a large-scale data query acceleration device based on an FPGA-CPU heterogeneous environment comprises: the system comprises a large-scale database module, a data preprocessor, a CPU end control unit and an FPGA end (comprising an FPGA end storage unit and an FPGA end calculation unit).
The large-scale database module stores large-scale graph datasets represented in RDF (Resource Description Framework) format, such as the LUBM (Lehigh University Benchmark) dataset, with tens of millions of nodes or more and hundreds of millions of edges or more. The large-scale database module comprises the graph data itself and a data-query support module. RDF data consists of nodes and edges: nodes represent entities/resources and attributes, and edges represent the relationships between entities and attributes. Conventionally, the source node of an edge in the graph is called the subject, the label on the edge is called the predicate, and the node pointed to is called the object.
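As an illustration of the subject-predicate-object convention above, the following minimal sketch (all names are hypothetical, not from the patent) stores RDF-style triples and answers a single triple pattern by scanning:

```python
# Hypothetical miniature RDF graph: each edge is a (subject, predicate, object) triple.
triples = [
    ("Alice", "worksAt", "PekingUniversity"),
    ("Alice", "knows", "Bob"),
    ("Bob", "worksAt", "PekingUniversity"),
]

def match(pattern, triples):
    """Return all triples matching a single pattern; None marks a variable position."""
    s, p, o = pattern
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Who works at PekingUniversity? (the subject position is the variable)
results = match((None, "worksAt", "PekingUniversity"), triples)
```

A SPARQL query is, in essence, a conjunction of such triple patterns whose shared variables must bind consistently; the join of the per-pattern candidate sets is exactly the operation the invention accelerates.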
The data preprocessor preprocesses the data in preparation for retrieval queries. Preprocessing comprises: extracting the portion of the data required by the graph database's join operations, and generating an access-friendly data index according to the CSR (Compressed Sparse Row) format specification. In addition, exploiting the fact that bit operations perform best in a hardware environment, a self-designed Block ID & Bit Stride method (BID & BS) is adopted to compress and store the adjacency-list data and the candidate-point-list data. Specifically, for an array whose elements are all non-negative integers within a fixed range, a binary string can represent the array, where bit i being 0 indicates that i is absent from the array and bit i being 1 indicates that i is present. With this representation, taking an intersection only requires a bitwise AND of two binary strings, which is very well suited to hardware implementation on the FPGA side. However, such binary strings tend to be sparse and waste a great deal of storage space. Therefore, the binary string is divided into several blocks of uniform length, and each block is given a unique number (Block ID) represented with the same number of bits. A block whose bits are all 0 can be discarded directly. When taking an intersection, the block numbers are matched first, and two blocks with matching numbers are intersected bitwise, directly yielding that portion of the intersection result.
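The BID & BS scheme described above can be sketched as follows (a minimal software model; the 64-bit block width is an assumption — the patent leaves the exact stride to the implementation):

```python
BLOCK_BITS = 64  # assumed block width; the patent does not fix the stride

def compress(values):
    """Compress a set of non-negative integers into sorted (block_id, bitmask) pairs.
    All-zero blocks are simply never emitted, which is where the space saving comes from."""
    blocks = {}
    for v in values:
        bid, off = divmod(v, BLOCK_BITS)
        blocks[bid] = blocks.get(bid, 0) | (1 << off)
    return sorted(blocks.items())

def intersect(a, b):
    """Intersect two compressed lists: match block IDs, then bitwise-AND the masks."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] < b[j][0]:
            i += 1
        elif a[i][0] > b[j][0]:
            j += 1
        else:
            mask = a[i][1] & b[j][1]
            if mask:  # drop blocks whose intersection is empty
                out.append((a[i][0], mask))
            i += 1
            j += 1
    return out

def decompress(blocks):
    """Expand (block_id, bitmask) pairs back into the sorted list of integers."""
    return [bid * BLOCK_BITS + off
            for bid, mask in blocks
            for off in range(BLOCK_BITS) if mask >> off & 1]
```

Note that `intersect` never calls `decompress`: the intersection is computed entirely on the compressed form, which is the property the patent relies on for the FPGA datapath.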
This ensures that, when the large-scale data query acceleration device based on the FPGA-CPU heterogeneous environment performs query acceleration, intersection computation can proceed directly on the compressed data without decompression, greatly improving query performance and the efficiency of graph computation algorithms based on graph data queries.
The CPU-side control unit invokes the data preprocessor to preprocess the data; specifically, it obtains from the preprocessor the compressed and converted graph data suited to being read by the FPGA-side computing units, and then transfers the data each FPGA-side computing unit requires to that unit, making the data more convenient for it to read. The CPU-side control unit also receives the results returned by the FPGA-side computing units; each result takes the form of a tuple composed of several node IDs and one BID & BS block. The CPU-side control unit controls multiple FPGA-side computing units simultaneously, and the units are ordered: the first FPGA-side computing unit handles the join of two tables, the second handles the join of three tables, and so on. The CPU-side control unit converts the data output by each FPGA-side computing unit either into final join results or into the input of the next FPGA-side computing unit; the conversion method is the same in both cases. Suppose a result tuple has N elements. The first N-1 elements are node IDs mapped onto the offset array of some CSR structure; when outputting final results they must be mapped back to numbers in the graph database. The Nth element is a BID & BS block, which must be decompressed; each number obtained from it forms a final result together with the first N-1 elements. In other words, each final result is a tuple of length N whose elements are positive integers, and each tuple represents a number combination that satisfies the join operation.
The FPGA-side computing unit executes multi-table join algorithms on FPGA hardware; the algorithms are implemented on the FPGA's logic units and optimized for the characteristics of hardware execution. The specific optimization strategy is as follows: first, the data are partitioned across the memory ports of different DRAMs (Dynamic Random Access Memory), improving the efficiency of the FPGA hardware's parallel access to memory; second, data buffers are built from the high-speed BRAM (Block RAM, on-chip random access memory) units on the FPGA; finally, exploiting the customizability of FPGA hardware, a three-stage pipeline of data reading, computation, and write-back is constructed, with parallel processing logic designed for each operation as appropriate. As a result, the FPGA-side computing unit can quickly return the result of a single multi-table join operation to the CPU-side control unit.
The invention further provides a large-scale data query acceleration method based on the FPGA-CPU heterogeneous environment, which realizes FPGA-CPU-based query acceleration for a large-scale graph dataset expressed in the Resource Description Framework (RDF) format. The method comprises the following steps:
1) Identify the large-scale data to be queried as several kinds of elements: entities, entity attributes, and relationships. Represent entities as nodes in the RDF format, represent the inherent attributes of entities as node attributes, and represent entity-entity and entity-attribute relationships as edges in the RDF format. Then preprocess the large-scale data expressed in RDF format. The RDF data consists of nodes and edges; the source node of an edge in the graph is the subject, the label on the edge is the predicate, and the node pointed to is the object.
When the specific implementation is applied to intelligent natural-language question answering, individuals in the knowledge base that can carry attributes and relationships are defined as entities and represented as nodes in the RDF format; node attributes are defined as the inherent attributes of those individuals, such as a person's birthday or a thing's name. Entity-entity relationships and entity-attribute relationships are represented as edges in the RDF format.
The data preprocessing comprises the following steps: 11) Divide the graph data in the large-scale graph dataset into multiple subgraphs according to predicates, and store the adjacency matrix of each subgraph in the CSR format after optimizing storage consumption.
The large-scale data expressed in RDF format consists of nodes and edges, where nodes represent entities/resources and attributes, and edges represent the relationships between entities and attributes. Conventionally, the source node of an edge in the graph is called the subject, the label on the edge is called the predicate, and the node pointed to is called the object.
The large-scale data query acceleration method based on the FPGA-CPU heterogeneous environment needs to read, contiguously and at random, the list of all nodes linked to an arbitrary node in the graph; this list is called an adjacency list. Adjacency lists are divided into out-edge and in-edge adjacency lists, representing the node's adjacency when it acts as subject or object, respectively. Arranging the adjacency lists of all nodes in a fixed order yields a two-dimensional list, i.e., the adjacency matrix. Storing the graph in the CSR and CSC formats improves access efficiency.
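A minimal sketch of the CSR layout described above (the helper names are illustrative, not from the patent): an offset array indexes into a flat column array, so each node's adjacency list is one contiguous slice:

```python
def to_csr(num_nodes, edges):
    """Build CSR (offset array + flat column array) from a directed edge list."""
    adj = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].append(v)
    offsets, columns = [0], []
    for nbrs in adj:
        columns.extend(sorted(nbrs))
        offsets.append(len(columns))  # offsets[u+1] - offsets[u] = out-degree of u
    return offsets, columns

def neighbors(offsets, columns, u):
    """The adjacency list of node u is the contiguous slice columns[offsets[u]:offsets[u+1]]."""
    return columns[offsets[u]:offsets[u + 1]]
```

The contiguity of each slice is what makes CSR "access-friendly" for hardware: reading one adjacency list is a single sequential burst from memory rather than a chain of pointer dereferences.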
12) For the adjacency list in each CSR-format subgraph, compress using the Block ID & Bit Stride (BID & BS) method to obtain compressed CSR-format data.
The Block ID & Bit Stride method (BID & BS) compresses and stores the adjacency-list data and the candidate-point-table data. The specific steps are as follows:
121) Represent each adjacency list in the CSR (an array of positive integers, whose elements are node numbers) as a binary string whose elements are 0 or 1.
For an array whose elements are all positive integers within a fixed range, a binary string can represent the array, where bit i being 0 indicates that i is absent from the array and bit i being 1 indicates that i is present. With this representation, taking an intersection only requires a bitwise AND of two binary strings, which is very well suited to hardware implementation on the FPGA side.
122) Divide the binary string into several blocks of identical length (in bits), and give each block a unique number (Block ID) of the same length.
Binary strings tend to be sparse and waste a great deal of storage space. The invention divides the binary string into several blocks of uniform length and represents each block's unique number (Block ID) with the same number of bits; any block whose bits are all 0 is discarded directly.
123) When taking the intersection of two sets each formed of several blocks, first match the block numbers, then intersect bitwise the pairs of blocks whose numbers match; this directly yields the intersection result.
2) Before computation begins, write the preprocessed CSR-format data, in which subgraphs correspond to different predicates, into the FPGA-side storage unit, and obtain the physical addresses, in the FPGA-side storage unit, of the adjacency lists compressed into CSR format.
3) When the user makes a query, the following operations are performed:
31) After parsing the user query, obtain the candidate-point-table data from the data in the graph.
In the implementation, the large-scale graph dataset is managed with gStore, which filters the possible values of each variable using the degree of each variable in the SPARQL query and the labels on its edges, thereby deriving the candidate-point-table data.
32) The CPU-side control unit transfers the candidate-point-table data, together with the control signals the FPGA-side computing unit requires during operation, to the FPGA-side computing unit, which performs and optimizes the table join computation. The control signals include the signal that starts the FPGA-side computing unit, the hardware addresses of the CSR-format data in the FPGA-side storage unit, and auxiliary information (such as the length of the candidate point table and which CSR-format data are needed).
The method comprises the following steps:
321) Partition the data across the memory ports of different DRAMs, improving the efficiency of the FPGA hardware's parallel access to memory;
322) Build data buffers from the high-speed on-chip BRAM units on the FPGA hardware;
323) Exploiting the customizability of FPGA hardware, construct a three-stage pipeline of data reading, computation, and write-back, with a parallel processing module in each computation and read operation.
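The three-stage pipeline of step 323) can be modeled in software as a chain of streaming stages (a conceptual sketch only; on the FPGA the stages operate concurrently on successive items, which plain generators do not capture, and all names here are illustrative):

```python
def read_stage(tasks, csr):
    """Stage 1: for each (node, candidate set) task, fetch the node's adjacency slice."""
    offsets, columns = csr
    for u, candidate_set in tasks:
        yield candidate_set, set(columns[offsets[u]:offsets[u + 1]])

def compute_stage(stream):
    """Stage 2: intersect the candidate set with the fetched adjacency list."""
    for candidate_set, adjacency in stream:
        yield candidate_set & adjacency

def write_stage(stream, sink):
    """Stage 3: write non-empty intersection results back."""
    for result in stream:
        if result:
            sink.append(sorted(result))
```

In hardware, while item k is being written back, item k+1 is being intersected and item k+2 is being read, so the three stages overlap; the generator chain reproduces only the dataflow, not that overlap.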
33) After the FPGA-side computing unit finishes its computation, it transmits the result back to the CPU-side control unit; the CPU side uses these results to generate the final computation result and converts it into the corresponding string result, i.e., the query result output to the user. The specific operations comprise the following steps:
331) The CPU-side control unit receives the results returned by the FPGA-side computing units; each result takes the form of a tuple composed of several node IDs and one BID & BS block. The CPU-side control unit controls multiple FPGA-side computing units simultaneously, and the units are ordered: the first FPGA-side computing unit handles the join of two tables, the second handles the join of three tables, and so on.
332) The CPU-side control unit converts the result of each FPGA-side computing unit either into the final join result or into the input of the next FPGA-side computing unit.
Suppose a result tuple from an FPGA-side computing unit has N elements. The first N-1 elements are node IDs mapped onto the offset array of a CSR structure, and must be mapped back to numbers in the graph database for output; the Nth element is a BID & BS block, which is decompressed to obtain the node numbers that belong in the intersection result. Each of these numbers forms a final result together with the first N-1 elements. The final result is thus a tuple of length N whose elements are node numbers; each tuple represents a number combination satisfying the query's join operation. In such a combination, the numbers correspond one-to-one with node data and determine a node set that should appear in the query answer, i.e., the query result the system finally outputs to the user.
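The decoding step described above can be sketched as follows (the names and the 64-bit block width are illustrative assumptions; `id_map` stands in for the mapping from CSR offset-array IDs back to database node numbers):

```python
BLOCK_BITS = 64  # assumed block width, matching the compression side

def decode_results(raw_tuples, id_map):
    """Expand FPGA result tuples: the first N-1 elements are node IDs (mapped back
    through the hypothetical id_map into database node numbers); the last element
    is a compressed (block_id, bitmask) pair that decompresses into one or more
    node numbers, each yielding one final N-tuple."""
    final = []
    for *prefix, (bid, mask) in raw_tuples:
        mapped = [id_map[p] for p in prefix]
        for off in range(BLOCK_BITS):
            if mask >> off & 1:
                final.append(tuple(mapped) + (bid * BLOCK_BITS + off,))
    return final
```

One compressed tuple can thus fan out into many answer rows, which keeps the FPGA-to-CPU transfer compact and defers the expansion to the CPU side.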
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a large-scale data query acceleration method based on an FPGA-CPU heterogeneous environment and a device implementing it on an FPGA, applicable to database queries whose join processing involves intersections between candidate point tables and adjacency lists. It solves the problem of fast querying over large-scale datasets and accelerates graph database queries, and can be widely applied in graph-data-processing fields such as social networks, financial risk control, Internet of Things applications, relationship analysis, IT operations, and recommendation engines. By adjusting input and output formats, the invention can be combined with a graph database system to improve its query response speed. When applied to intelligent natural-language question answering, people, events, and things in the question-answering data are identified as entities and represented as nodes in the RDF format; the attributes of an entity are defined as node attributes, and the relationships between entities and attributes are represented as edges in the RDF format. The specific implementation shows that, with the technical scheme of the invention, the query speedup exceeds 2x and can reach 10x, better satisfying applications with strict response-time requirements, such as real-time graph pattern query and discovery.
Drawings
Fig. 1 is a schematic diagram of a CSR structure (format) used in an embodiment of the present invention.
Fig. 2 is a schematic diagram illustrating implementation of a BID & BS compression method according to an embodiment of the present invention.
Fig. 3 is a structural block diagram of the acceleration device provided by the invention.
Fig. 4 is a block diagram of the layer-0 Kernel structure of the acceleration device provided by the invention.
FIG. 5 is a block flow diagram of an intersection calculation in a query process in accordance with an embodiment of the present invention.
Detailed Description
The invention is further described below by way of examples with reference to the accompanying drawings, which in no way limit the scope of the invention.
The invention provides a large-scale data query acceleration method for graph databases based on an FPGA-CPU heterogeneous environment, together with a device implementing it on an FPGA. Accelerated graph database queries can support a large number of graph-data-based application scenarios. One example is quickly finding a fixed pattern in graph data: shareholding relationships between companies, for instance, can be expressed as graph data, and fraud detection is a highly suitable application scenario for graph databases. In modern fraud and financial crimes of all kinds, such as bank fraud, credit card fraud, e-commerce fraud, and insurance fraud, fraudsters often change their identity to escape risk-control rules. However, it is difficult for fraudsters to change all of their network associations, or to perform the same operations synchronously across every network group involved in order to evade risk control, so graph data can establish a tracking view of the user that follows the whole network. Without an accelerator, however, the time cost is unacceptable; with the acceleration device, algorithm running speed can be improved by more than 2x. Another class of application scenarios requires support for custom real-time queries: many research institutions and companies build domain-specific or open-domain natural-language question-answering systems on top of graph data (typically a knowledge graph). At the bottom of these question-answering systems, graph database support is required to quickly obtain from the graph data the information needed to parse and answer natural-language questions.
For example, in the framework of an intelligent question-answering system developed by an artificial intelligence company, machine learning tools such as sentence-vector encoding, sentence-pattern analysis, word-sense filtering, emotion recognition, and text clustering and classification first convert natural language into SPARQL that the graph database can recognize; a graph database storing the background knowledge base in triple form is then searched. The knowledge base covers common sense, classical poetry, life events, music, and other domains, at a scale approaching one hundred million edges. Such visits number around 200,000 per day, with about 5 concurrent connections during peak hours and an average of 10 requests per second; the latency of this step is approximately 30% of the overall latency. After the technical scheme of the invention is used to accelerate the graph database query, the latency share can be reduced to 15% and peak-hour throughput doubled, helping the company achieve sub-second response.
In the large-scale data query acceleration method based on the FPGA-CPU heterogeneous environment, the large-scale data resides in a database, such as a graph database, whose join processing involves intersections between candidate point tables and adjacency lists; the method can be applied to accelerating natural-language question-answering queries, among other applications. The specific implementation takes the representation of a natural-language knowledge base as graph data as an example and comprises the following operations:
1) Determine the real-world meaning of nodes and edges in the graph data. In a natural-language knowledge base, a node generally represents any subject in the knowledge base that can enter into a relationship or carry attributes, such as a person, thing, or place; an edge represents a relationship between subjects, such as spouse, birthplace, or location.
2) Determine the attributes of nodes and edges in the graph data. Node attributes generally represent inherent characteristics of the corresponding entity, such as a person's age, sex, or date of birth, or a place's name. Edge attributes generally represent characteristics of the relationship itself; for example, a spouse relationship may carry attributes such as start time and end time.
3) According to the definitions above, convert the data into graph data using a chosen graph data format. Taking the RDF format as an example, the format specifies in detail how nodes, edges, and their respective attributes are defined, and the data conversion follows those specifications.
In the prior art, a method for converting natural language into SPARQL queries in a graph database may include the steps of:
1) Entity recognition is performed, and elements in natural language are associated with nodes in the graph database.
An element mentioned in natural language may have a corresponding node in the graph database, yet the node's label need not match the natural-language mention verbatim, so the two must be linked. Current mainstream methods identify entities by pattern matching or deep learning, aided by the information in the graph database.
2) Determining a dependency relationship;
A dependency is a semantic relationship between entities in natural language. Typically, one dependency corresponds to two nodes and one edge in the graph data. Generating a dependency tree is currently the common method for determining dependencies.
3) A query is generated.
Using the entity information and dependency information obtained above, a query recognizable by the graph database can be generated by machine learning methods.
Graph database queries are fundamental graph data operations, and whether the query interface is exposed to the user directly or used through an application interface built on queries, there are performance requirements on query execution. When the graph data is large, Join operations on the graph consume a great deal of time and computing resources. A Join operation on a graph is similar to a table Join in a relational database: matching items in two sets are searched for according to some condition. The difference is that relational databases typically use equality conditions to decide whether elements match, while Join operations in a graph database must decide whether elements match by checking whether relationships exist between them. Join operations in a graph database therefore involve more memory reads and computation than Join operations in a relational database, and are correspondingly more complex.
Essentially, the purpose of performing Join operations on the graph is to compute subgraph isomorphism. In most graph data, a user query can be represented as a query graph, and executing the query is equivalent to finding the subgraphs of the entire data graph that are isomorphic to the query graph. In computer science, the graph isomorphism problem is defined as follows: two simple graphs G and H are isomorphic if and only if there is a one-to-one correspondence σ mapping nodes 1…n of G to nodes 1…n of H, such that any two nodes i and j in G are connected if and only if the corresponding nodes σ(i) and σ(j) in H are connected. If G and H are directed graphs, the definition further requires that for any two connected nodes i and j in G, the edge (i, j) has the same direction as its corresponding edge (σ(i), σ(j)) in H.
The parallel computing efficiency of the FPGA is high. An FPGA performs parallel computation and can execute the logic of many instructions at once, while traditional ASICs, DSPs, and even CPUs compute serially, processing one instruction stream at a time; when an ASIC or CPU needs to speed up, the usual approach is to raise the clock frequency, which is why their clock rates are generally higher. Although the typical clock frequency of an FPGA is lower, for certain specialized tasks a large number of relatively slow parallel units is more efficient than a small number of fast serial units. Moreover, in a sense there is no "computation" inside an FPGA at all: much like an ASIC, the result is produced almost directly by the hard-wired circuit, so execution efficiency is greatly improved.
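Restated compactly in notation (this merely formalizes the definition above, under the usual convention that V and E denote a graph's node and edge sets):

```latex
G \cong H \iff \exists\, \sigma : V(G) \to V(H)\ \text{bijective such that}\quad
\{i, j\} \in E(G) \iff \{\sigma(i), \sigma(j)\} \in E(H).
```

For directed graphs the condition becomes $(i,j) \in E(G) \iff (\sigma(i), \sigma(j)) \in E(H)$, so edge directions are preserved by $\sigma$.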
RDF (Resource Description Framework) is essentially a data model. It provides a unified standard for describing entities/resources; in short, it is a way of representing things. RDF is formally expressed as SPO (subject, predicate, object) triples, sometimes called statements; in a knowledge graph, a triple is also referred to as a piece of knowledge.
An RDF graph is composed of nodes and edges: nodes represent entities/resources and attributes, and edges represent the relationships between entities and attributes. In general, the source node of an edge in the graph is called the subject, the label on the edge is called the predicate, and the node pointed to is called the object.
In the present invention, it is necessary to repeatedly and randomly read the list of all nodes linked to any given node in the graph. In graph theory, this list is called an adjacency list. In a directed graph, the adjacency list is divided into an outgoing-edge adjacency list and an incoming-edge adjacency list, which record a node's neighbors when it acts as subject or object, respectively. The two-dimensional list formed by arranging the adjacency lists of all nodes in a certain order is called an adjacency matrix. However, in a computer system, continuous random access to memory results in poor operating efficiency. Researchers have therefore proposed the CSR and CSC (Compressed Sparse Column) storage formats for graphs, by adapting the CSR (Compressed Sparse Row) storage format of sparse matrices.
The CSR storage format in FIG. 1 consists of two arrays, C and E. E is formed by concatenating the adjacency lists of all nodes end to end. Since a graph database system typically assigns node IDs to nodes, the adjacency lists of all nodes can be concatenated in order of node ID from small to large. The length of C equals the number of nodes in the graph, and the value at index i equals the position in E of the first element of the adjacency list of the node with ID i. When the adjacency lists are outgoing-edge adjacency lists the format is called CSR, and when they are incoming-edge adjacency lists it is called CSC. Since array C stores the offsets of the adjacency lists within E, it is also called the offset array.
The technical scheme of the invention comprises a data preprocessing part, a CPU control unit part, and an FPGA computing unit part. In a specific implementation, the computing unit on the FPGA side is written in Verilog, and the control unit program on the CPU side is written in C++. Development and operation are based on the Xilinx U200 FPGA board and its accompanying runtime and development environment. The U200 is equipped with 4 × 16 GB of memory and 2175 × 18 Kb of on-chip memory. Accordingly, when the invention is implemented on the U200, the adjacency list is divided into 128 groups and 128-way parallel intersection computation is performed. The accompanying development environment maps the hardware logic described in Verilog onto the FPGA hardware.
The invention is based on the open-source graph database system gStore, and realizes a CPU-FPGA heterogeneous graph database query accelerator targeting a 5-billion-edge LUBM database and the join queries involved in the corresponding benchmark.
Data preprocessing section
Currently, there are a variety of standard formats for storing and representing graph data, among which the RDF format is widely used. The graph data format to which the present invention is applicable is the RDF format. In the calculation process, the adjacency matrix of the graph data is stored using the CSR format.
In the RDF format, relationships between nodes are represented by triples. When processing a user query, the subject, object, and predicate values in the triples all need to be matched. However, neighbors linked by different predicates cannot be distinguished in the CSR format. Therefore, in the present invention, the graph data is partitioned according to predicates: all edges with the same predicate are extracted to form a subgraph, and a separate CSR structure is generated for each subgraph. During query processing, additional parameters determine which CSR data is read.
However, a drawback remains. A subgraph obtained by predicate partitioning may be far smaller than the original graph, so the offset array in its CSR is very sparse and a large amount of storage space is wasted. Therefore, the present invention maintains a mapping structure for each subgraph in advance. Let n be the number of nodes with total degree ≥ 1 in a given subgraph; these nodes are renumbered 0…n-1 in order of ID from small to large, the offset array of the CSR is then built under the new numbering, and the E array is kept unchanged, i.e., the elements in the adjacency lists are still the node IDs from before renumbering.
Meanwhile, in order to improve the parallelism of the algorithm on FPGA hardware and reduce computational complexity, the invention adopts a data compression scheme based on binary bit strings, referred to as the BID & BS structure.
From a hardware perspective, binary bit manipulation is the fastest operation. To compute intersections with bit operations, the most intuitive idea is to represent a set as a binary string in which the ith bit is 1 if the element numbered i is present in the set and 0 otherwise (this assumes the elements are numbered; in fact, most graph databases number their nodes). Then only a single bitwise AND of the two binary strings is needed, and the result is the binary-string representation of the intersection. However, most graphs in practical applications are sparse and their average degree is low, so the binary string contains long runs of 0s, which hurts performance.
The basic idea of the BID & BS compression method is to divide the above binary string into blocks, give each block a unique Block ID (BID), and within each block express the presence or absence of set elements by a binary string called the Bit Stride (BS). A block containing no elements can then simply be dropped from the computation, achieving data compression and alleviating the sparsity problem to some extent. For blocks containing at least one element, a merge-based method can be used: identical BIDs are found by comparison, and the corresponding BSs are bitwise ANDed to obtain the result. For convenience, let the length of each BID be g bits, the length of each BS be s bits, and the size of the element universe be Ω. Clearly, at most Ω/s distinct blocks can arise, each assigned a unique BID; since the size of g is configurable, there is no concern about running out of BID space.
For example, assume two sets S_0 = {0,2,3,5,6,7,64,65,66} and S_1 = {2,3,7,18}, and let s = 4, g = 5, Ω = 128. The resulting BID & BS structure is shown in FIG. 2:
CPU control unit part
1. Loading phase (offline):
The data is read from the local index into host memory. Since the two-table join is keyed on a predicate, the graph data is divided into multiple CSR structures according to predicate IDs, one CSR per predicate. Because the graph is directed, two sets of CSRs are required: one storing the outgoing-edge CSR structures and one storing the incoming-edge CSR structures. Since the two sets do not affect each other, they are stored in the memory units of two FPGA cards. To make it easier for the FPGA computing unit to access the CSRs, the discontinuous vertex IDs of each CSR are mapped to continuous index IDs.
2. Query execution phase (online, taking trivariate queries as an example):
first, a candidate point set for each variable is found.
Second, for each candidate point set, its priority score is calculated with an evaluation function, preferring to begin the join with variables that have few candidate points.
Third, execute the two-table join computing unit program: the CPU control unit transmits two candidate point tables to the FPGA side. For the first table, vertex IDs must be mapped to the index IDs of the corresponding CSR. The second table is compressed: the whole candidate point table becomes a binary string in which the kth bit is 1 if the vertex with ID k appears in the table; every w bits form a group, and the group ID spliced with those w binary bits represents a group of adjacent vertex IDs; groups whose w bits are all 0 need not be stored. Finally, the FPGA computing unit returns the intermediate result of the two-table join.
Fourth, execute the three-table join computing unit program: the third table is joined with the result of the first two. Each item of the two-table intermediate result received by the CPU-side control unit is a data pair whose second element is a set of vertex IDs; these vertex IDs must be decompressed and mapped to CSR index IDs, then transmitted as new input to the three-table computing unit on the FPGA side, after which the CPU-side control unit obtains the final result.
Since the control unit and the computing units exchange data in streaming fashion, the two-table join computing unit and the three-table join computing unit may reside on two FPGA cards. For the benchmark (performance evaluation criteria) of LUBM (the Lehigh University Benchmark), the two computing units can execute independently, each accessing a different CSR structure without conflict.
User queries with three or more variables can be handled with the same logic, except that more FPGA hardware resources are required; this demonstrates the scalability of the design.
FPGA computing unit part
Fig. 3 is a block diagram of the accelerator device according to the present invention. In the loading stage, the control unit program on the CPU writes the CSR data required for join computation into the memory of the FPGA hardware. In the query execution stage, under a specific execution order, the join is split into multiple layers, and data is passed between layers in streaming fashion. That is, as soon as one layer produces an intermediate result, that result can be passed to the next layer and the next layer's computation can begin, without waiting for all intermediate results to be computed, which greatly improves parallelism. For each layer, the control unit program on the CPU side transmits the candidate point table and the control parameters required by that layer's computation to the FPGA. Likewise, the control unit program transfers them as streams, achieving an end-to-end data flow.
Taking layer 0 as an example, the modular structure at each layer is shown in fig. 4. The adjacency list read from the FPGA memory and the candidate point list obtained from the CPU end are transmitted to a plurality of modules for processing the intersection of two or more tables. The specific number of parallel modules depends on the size of the data set and the specific configuration of the FPGA hardware.
Also taking layer 0 as an example, fig. 5 illustrates the flow of intersection computation in the query process according to an embodiment of the present invention. The incoming candidate point table is divided evenly into N parts, which are placed into N buckets. Each time an adjacency list is received, each of its elements is routed to the corresponding bucket according to the node-ID ranges of the N buckets, and merge-based intersection computation is then performed within each bucket. Once a valid result is obtained, it is returned to the CPU.
In a specific implementation, the large-scale database module is used to store large-scale graph datasets represented in RDF (Resource Description Framework) format, with tens of millions of nodes or more and hundreds of millions of edges or more. For example, the LUBM (the Lehigh University Benchmark) dataset, used to test the performance of the present invention, contains 5 billion edges and about 900 million nodes. Another example is the DBpedia dataset extracted from Wikipedia, which includes 3 billion edges and 16 million nodes. Such large-scale datasets place high demands on the performance of a single-machine graph database. The invention selects gStore (https://github.com/pkumod/gStore), developed at Peking University, to provide the graph database software support, because of its good single-machine query performance on large-scale data.
It should be noted that the purpose of the disclosed embodiments is to aid further understanding of the present invention, but those skilled in the art will appreciate that: various alternatives and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the disclosed embodiments, but rather the scope of the invention is defined by the appended claims.

Claims (8)

1. The method for accelerating the query of the large-scale data based on the heterogeneous environment of the FPGA-CPU comprises the steps of expressing the large-scale data to be queried as a large-scale graph data set in an RDF format of a resource description framework, and realizing the acceleration of the query based on the heterogeneous environment of the FPGA-CPU; the method comprises the following steps:
1) Identifying large-scale data to be queried as a plurality of elements, including: entity, entity attribute, and relationship; correspondingly representing the entities as nodes in the RDF format, representing the inherent attributes of the entities as the attributes of the nodes in the RDF format, and representing the relationship between the entities and the relationships between the entities and the attributes as edges in the RDF format; RDF format data consists of nodes and edges, wherein a source point of one edge in the graph is a subject, a label on the edge is a predicate, and a pointed node is an object;
Data preprocessing is carried out on large-scale data expressed in an RDF format, and the following operations are carried out:
11 Dividing the graph data in the large-scale graph data set into a plurality of subgraphs according to predicates, and storing an adjacent matrix of each subgraph data by using a CSR format after optimized storage consumption; the adjacency matrix is a two-dimensional list formed by arranging adjacency lists of all nodes according to a certain sequence; the adjacency list is a list of all nodes linked by any node in the graph; the diagram adopts a CSR storage format or a CSC storage format;
12 ) For the adjacency list in each CSR-format graph data, compressing using the block-number and binary-segment (BID & BS) method to obtain compressed CSR-format data; the following operations are performed:
121 A binary string is used for representing a positive integer array, the positive integer array is an adjacent table in CSR, the value of an element in the string is 0 or 1, and the element in the array is the number of a node;
122 Dividing the binary string into a plurality of blocks with identical lengths, and setting a unique number Block ID with the same length for each Block; when the binary strings in the block are all 0, the binary strings are directly discarded;
123 When the intersection is taken from two sets formed by a plurality of blocks, firstly matching block numbers, and taking intersection according to the bit from two blocks with the same block numbers to obtain an intersection result;
2) Writing the CSR format data obtained by preprocessing into a storage unit at the FPGA end to obtain physical addresses of adjacent lists compressed into CSR formats in the storage unit at the FPGA end, wherein the subgraphs correspond to different predicates;
3) When the user makes a query, the following operations are performed:
31 After analyzing the user inquiry, obtaining candidate point table data according to the data in the graph;
32 The CPU control unit transmits the candidate point table data and the control signal of the FPGA end computing unit in the operation process to the FPGA end computing unit for table connection computation and optimization; the following operations are performed:
321 The data are put into the memory unit ports of the DRAM by dividing the data, so that the parallel access efficiency of the FPGA hardware to the memory units is improved;
322 A data buffer area is arranged by utilizing a high-speed on-chip random access memory BRAM unit on FPGA hardware;
323 Based on FPGA hardware, constructing a three-stage pipeline for data reading, calculating and writing back, and setting a parallel processing module in each calculation and reading operation;
33 After the calculation unit at the FPGA end finishes the calculation, the result is transmitted back to the control unit at the CPU end, the CPU end generates a numbered result and then converts the numbered result into a corresponding character string result, namely a query result which is output to a user.
2. The method for accelerating the large-scale data query based on the heterogeneous environment of the FPGA-CPU according to claim 1, wherein in step 31), gStore filters the possible values of each variable in the large-scale graph dataset using the degree of each variable in the SPARQL query and the labels on its edges, thereby obtaining the candidate point table data.
3. The method for accelerating the large-scale data query based on the heterogeneous environment of the FPGA-CPU according to claim 1, wherein in the step 32), the control signals comprise signals for controlling the calculation unit at the FPGA end to start calculation, hardware addresses of CSR format data on the storage unit at the FPGA end and auxiliary information.
4. The method for accelerating the large-scale data query based on the heterogeneous environment of the FPGA-CPU according to claim 1, wherein the step 33) comprises the following operations:
331 ) The FPGA-side computing units' table connection operations are processed in order; the CPU control unit receives the results returned by the FPGA-side computing units, each result being a tuple formed by a plurality of node IDs and a BID & BS block; the CPU-side control unit controls a plurality of FPGA-side computing units simultaneously;
332 The CPU end control unit converts the result of each FPGA end calculation unit into the final result of the connection operation or the input of the next FPGA end calculation unit, and the method comprises the following steps:
Setting N elements in a result tuple of the FPGA end computing unit, mapping the first N-1 elements in the tuple to node IDs on an offset array in a CSR format, and mapping the node IDs back to numbers in a graph database for outputting; the nth element is a BID & BS block;
decompressing the N-th element to obtain a plurality of node numbers;
the obtained node number and the N-1 elements in the front form a final result;
the final result is in the form of a tuple with the length of N and the element of the tuple is the node number; each tuple represents a numbered combination of a join operation meeting the query requirements;
the number combination is the query result finally output to the user by the system.
5. The method for accelerating the large-scale data query based on the heterogeneous environment of the FPGA-CPU according to claim 1, wherein the method is applied to natural language intelligent question-answer query, and the large-scale data to be queried and processed are natural language question-answer data; identifying characters, events and things in the natural language question-answering data as entities, and correspondingly representing the characters, the events and the things as nodes in an RDF format; the attributes of the entities are defined as attributes of the nodes.
6. A large-scale data query acceleration device based on an FPGA-CPU heterogeneous environment comprises: the system comprises a large-scale database module, a data preprocessor, a CPU end control unit and an FPGA end; the FPGA end comprises an FPGA end storage unit and an FPGA end calculation unit;
A. The large-scale database module is used for storing a large-scale graph data set expressed in a resource description framework RDF format; the number of data nodes is tens of millions or more, and the number of edges is hundreds of millions or more; the large-scale database module comprises a graph database data and a data query support module;
B. the data preprocessor is used for preprocessing data and comprises the following steps:
connecting operation of the graph database;
extracting part of data in the data;
generating a memory-friendly data index according to the compressed sparse row (CSR) format;
compressing and storing the adjacency list data and the candidate point table data using the block-number and binary-segment (BID & BS) method;
the CPU end control unit is used for calling the data preprocessor to preprocess the data, and specifically comprises the following steps:
calling a data preprocessor to obtain compressed and converted graph data, and reading the compressed and converted graph data by an FPGA end computing unit;
then transmitting the data required by the FPGA end computing unit to the FPGA end computing unit;
meanwhile, the CPU control unit also receives results returned by the FPGA end computing units, each result is in the form of a tuple formed by a plurality of node IDs and a BID & BS block, and the CPU end control unit simultaneously controls the plurality of FPGA end computing units;
The CPU terminal control unit converts the data output by each FPGA terminal calculation unit into a final connection operation result or the input of the next FPGA terminal calculation unit; the final result is in the form of a tuple with a length of N and positive integers, and each tuple represents a numbering combination of a connection operation meeting the query requirement;
the FPGA end computing unit is used for executing a plurality of table connection operations on FPGA hardware and optimizing the implementation on a logic unit of the FPGA; comprising the following steps:
firstly, dividing data, and putting the data into different DRAM memory cell ports;
secondly, designing a data buffer area by utilizing a high-speed on-chip random access memory (BRAM) unit on FPGA hardware;
and finally, constructing a three-stage pipeline for reading, calculating and writing back data by utilizing FPGA hardware, and designing parallel processing logic in each operation, so that a calculation unit at the FPGA end quickly returns a result obtained by single multi-table connection operation to a CPU control unit.
7. The device for accelerating the large-scale data query based on the heterogeneous environment of the FPGA-CPU according to claim 6, wherein the large-scale graph data set expressed in the RDF format is specifically a LUBM data set.
8. The large-scale data query acceleration device based on the heterogeneous environment of FPGA-CPU as claimed in claim 6, wherein the block number and binary segment BID & BS method comprises the following specific steps:
121 A binary string is used for representing a positive integer array, the value of an element in the string is 0 or 1, and the element in the array is the number of a node;
122 Dividing the binary string into a plurality of blocks with identical lengths, and setting a unique number Block ID with the same length for each Block; when the binary strings in the block are all 0, the binary strings are directly discarded;
123 When the intersection is taken from two sets formed by a plurality of blocks, firstly matching the block numbers, and taking intersection from two blocks with identical block numbers according to the bit to obtain an intersection result.
Publications (2)

Publication Number Publication Date
CN110990638A CN110990638A (en) 2020-04-10
CN110990638B true CN110990638B (en) 2023-04-28

Family

ID=70082620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911029459.3A Active CN110990638B (en) 2019-10-28 2019-10-28 Large-scale data query acceleration device and method based on FPGA-CPU heterogeneous environment

Country Status (2)

Country Link
CN (1) CN110990638B (en)
WO (1) WO2021083239A1 (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3608866A1 (en) * 2018-08-06 2020-02-12 Ernst & Young GmbH Wirtschaftsprüfungsgesellschaft System and method of determining tax liability of entity
CN110990638B (en) * 2019-10-28 2023-04-28 北京大学 Large-scale data query acceleration device and method based on FPGA-CPU heterogeneous environment
CN111553834B (en) * 2020-04-24 2023-11-03 上海交通大学 Concurrent graph data preprocessing method based on FPGA
CN111241356B (en) * 2020-04-26 2020-08-11 腾讯科技(深圳)有限公司 Data search method, device and equipment based on analog quantum algorithm
CN111538854B (en) * 2020-04-27 2023-08-08 北京百度网讯科技有限公司 Searching method and device
CN111625558A (en) * 2020-05-07 2020-09-04 苏州浪潮智能科技有限公司 Server architecture, database query method thereof and storage medium
CN112069216A (en) * 2020-09-18 2020-12-11 山东超越数控电子股份有限公司 Join algorithm implementation method, system, device and medium based on FPGA
US20220129770A1 (en) * 2020-10-23 2022-04-28 International Business Machines Corporation Implementing relation linking for knowledge bases
CN112463870B (en) * 2021-02-03 2021-05-04 南京新动态信息科技有限公司 Database SQL acceleration method based on FPGA
CN113220710A (en) * 2021-05-11 2021-08-06 北京百度网讯科技有限公司 Data query method and device, electronic equipment and storage medium
CN113626594B (en) * 2021-07-16 2023-09-01 上海齐网网络科技有限公司 Operation and maintenance knowledge base establishing method and system based on multiple intelligent agents
CN113837777B (en) * 2021-09-30 2024-02-20 浙江创邻科技有限公司 Anti-fraud management and control method, device and system based on graph database and storage medium
CN113609347B (en) * 2021-10-08 2021-12-28 支付宝(杭州)信息技术有限公司 Data storage and query method, device and database system
CN114298674B (en) * 2021-12-27 2024-04-12 四川启睿克科技有限公司 Shift system and method for shift allocation calculation based on complex rules
CN115544069B (en) * 2022-09-26 2023-06-20 山东浪潮科学研究院有限公司 Reconfigurable database query acceleration processor and system
CN117155400A (en) * 2023-10-30 2023-12-01 山东浪潮数据库技术有限公司 Line data decompression method, system and FPGA heterogeneous data acceleration card

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528648A (en) * 2016-10-14 2017-03-22 福州大学 Distributed keyword approximate search method for RDF in combination with Redis memory database
CN109325029A (en) * 2018-08-30 2019-02-12 天津大学 RDF data storage and querying method based on sparse matrix
CN110109898A (en) * 2019-04-23 2019-08-09 山东超越数控电子股份有限公司 Hash connection accelerated method and system based on BRAM in FPGA piece

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8983990B2 (en) * 2010-08-17 2015-03-17 International Business Machines Corporation Enforcing query policies over resource description framework data
US8756237B2 (en) * 2012-10-12 2014-06-17 Architecture Technology Corporation Scalable distributed processing of RDF data
JP6352958B2 (en) * 2016-01-27 2018-07-04 Yahoo Japan Corporation Graph index search device and operation method of graph index search device
US11120082B2 (en) * 2018-04-18 2021-09-14 Oracle International Corporation Efficient, in-memory, relational representation for heterogeneous graphs
CN109271458A (en) * 2018-09-14 2019-01-25 Linewell Software Co., Ltd. Relationship network query method and system based on a graph database
CN109726305A (en) * 2018-12-30 2019-05-07 Information Science Academy of China Electronics Technology Group Corporation Complex relational data storage and retrieval method based on a graph structure
CN110334159A (en) * 2019-05-29 2019-10-15 Suning Financial Services (Shanghai) Co., Ltd. Information query method and device based on a relationship graph
CN110990638B (en) * 2019-10-28 2023-04-28 Peking University Large-scale data query acceleration device and method based on FPGA-CPU heterogeneous environment

Non-Patent Citations (2)

Title
T. Wang et al. Parallel Processing SPARQL Theta Join on Large Scale RDF Graphs. 2018 IEEE Global Communications Conference (GLOBECOM). 2018, full text. *
Zou Lei et al. A Survey of Distributed RDF Data Management. Journal of Computer Research and Development. 2017, (issue undefined), full text. *

Also Published As

Publication number Publication date
WO2021083239A1 (en) 2021-05-06
CN110990638A (en) 2020-04-10

Similar Documents

Publication Publication Date Title
CN110990638B (en) Large-scale data query acceleration device and method based on FPGA-CPU heterogeneous environment
US9053210B2 (en) Graph query processing using plurality of engines
CN108920556B (en) Expert recommending method based on discipline knowledge graph
CN109739939A Data fusion method and device for knowledge graphs
CN108509543B (en) Streaming RDF data multi-keyword parallel search method based on Spark Streaming
JP2015099586A (en) System, apparatus, program and method for data aggregation
JP2017037648A (en) Hybrid data storage system, method, and program for storing hybrid data
Choi et al. SPIDER: a system for scalable, parallel/distributed evaluation of large-scale RDF data
CN106021457A (en) Keyword-based RDF distributed semantic search method
CN109325029A (en) RDF data storage and querying method based on sparse matrix
CN109166615A Medical CT image storage and retrieval method based on random forest hashing
CN106909554A Method and device for loading database text table data
US20230056760A1 (en) Method and apparatus for processing graph data, device, storage medium, and program product
WO2023155508A1 (en) Graph convolutional neural network and knowledge base-based paper correlation analysis method
US7672925B2 (en) Accelerating queries using temporary enumeration representation
US8321429B2 (en) Accelerating queries using secondary semantic column enumeration
Yang et al. Aggregated squeeze-and-excitation transformations for densely connected convolutional networks
US9305080B2 (en) Accelerating queries using delayed value projection of enumerated storage
Du et al. A novel KNN join algorithms based on Hilbert R-tree in MapReduce
CN114911826A (en) Associated data retrieval method and system
Shen et al. ANGraph: attribute-interactive neighborhood-aggregative graph representation learning
Zheng et al. GSBRL: Efficient RDF graph storage based on reinforcement learning
Chang et al. Optimizing Big Data Retrieval and Job Scheduling Using Deep Learning Approaches.
Bai et al. An efficient skyline query algorithm in the distributed environment
Abdallah et al. Towards a GML-Enabled Knowledge Graph Platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant