CN110990638A

CN110990638A - Large-scale data query acceleration device and method based on FPGA-CPU heterogeneous environment

Info

Publication number: CN110990638A
Application number: CN201911029459.3A
Authority: CN
Inventors: 邹磊; 林殷年; 苏勋斌
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2019-10-28
Filing date: 2019-10-28
Publication date: 2020-04-10
Anticipated expiration: 2039-10-28
Also published as: CN110990638B; WO2021083239A1

Abstract

The invention discloses a large-scale data query acceleration method for a graph database based on an FPGA-CPU heterogeneous environment and an implementation device thereof on the FPGA, which represent large-scale data to be queried and processed into a large-scale graph data set in a Resource Description Framework (RDF) format, realize query acceleration based on the FPGA-CPU heterogeneous environment, solve the problem of rapidly querying the data on the large-scale data set, accelerate the query of the graph database and can be widely applied to the technical field of application based on graph data processing. The method is applied to natural language question-answering intelligent query. The implementation shows that the query acceleration ratio is more than two times and can reach ten times of acceleration by adopting the method disclosed by the invention, and the application requirement with higher response time requirement can be better met.

Description

Large-scale data query acceleration device and method based on FPGA-CPU heterogeneous environment

Technical Field

The invention belongs to the technical field of information search and query, relates to a large-scale data search acceleration technology, and particularly relates to a large-scale data query acceleration method based on an FPGA-CPU heterogeneous environment and an implementation device of the method on an FPGA.

Background

In large scale data retrieval and querying, a graph database is a database that uses graph structures for semantic queries, using nodes, edges, and attributes to represent and store data. The key concept of the system is a graph, which directly associates data items in storage with data nodes and sets of edges between nodes representing relationships. These relationships enable the data in the storage units to be linked together and retrieved through a particular normalized operation and query language. The most important content in a graph database is the relationships between data that are permanently stored in the database itself, and the speed of querying the entity associations in the graph database is a major factor in evaluating the performance of the graph database. Users can intuitively display the association of entities using a graph database, which is essential for processing highly interconnected data. A series of data with important significance of the relation between physical bodies in real life, such as a power network, is quickly and efficiently retrieved, added and deleted and analyzed through a graph database.

Retrieving data from graph databases requires a query language other than SQL, which is designed to process data in relational systems and therefore cannot efficiently process traversal graphs. By 2019, a graph query language such as SQL is not available. However, there has been some standardization work to provide query languages for large suppliers. Among them, the sparql (sparql Protocol and RDF Query language) language uses triples to express user queries, supports multiple aggregation functions, and is supported by most vendors.

Graph databases have found application in a wide range of fields to date. Such as automatic association recommendations in the Google search engine, fraud detection in the e-commerce transaction network, pattern discovery in social networks, and so forth, all require support from a graph database. Most graph databases currently achieve some performance when processing millions to tens of millions of data sets, but show significant performance degradation as the data scale expands to hundreds of millions or even billions. And not a linear reduction but a geometric reduction. On the one hand, the strategies and algorithms adopted inside these graph databases are not suitable for such huge data size, and on the other hand, most graph data run on the CPU completely, are processed by adopting serial algorithms, and need to read data from a slow storage unit such as a disk, and cannot enjoy the advantages brought by the current development of new hardware.

The performance bottleneck of the graph database is that, when the graph data is large in scale, a great deal of time and computing resources are consumed for performing Join operation on the graph. To implement join operations on a graph, the most common approach at present is to traverse one of the sets, read the adjacency list of each node, then intersect it with the other set, and finally add the result to the result set. The main drawback of this approach is that the order of traversing points in the graph has a significant impact on computational performance, and therefore, careful selection of the order of traversing points is required. Secondly, such operations can result in large memory cell accesses, which severely affect the speed of processing.

At present, most graph databases achieve a certain performance on small-scale data sets, however, some strategies in the graph databases fail after the data sets are enlarged, and even prevent the system from being executed efficiently. The invention aims to solve the problem of rapidly inquiring data on a large-scale data set by utilizing an FPGA hardware platform.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a large-scale data query acceleration method for a graph database based on an FPGA-CPU heterogeneous environment and a realization device thereof on the FPGA, which solve the problem of rapidly querying data on a large-scale data set, accelerate the query of the graph database and can be widely applied to the technical field of application based on graph data processing.

The technical scheme provided by the invention is as follows:

a large-scale data query accelerating device based on an FPGA-CPU heterogeneous environment comprises: the system comprises a large-scale database module, a data preprocessor, a CPU end control unit and an FPGA end (comprising an FPGA end storage unit and an FPGA end calculation unit).

The large-scale database module is used for storing a large-scale graph data set represented in an RDF (Resource description framework) format, such as a lubm (the high University benchmark) data set. The number of nodes is in the tens of millions and above, and the number of edges is in the hundreds of millions. The large-scale database module comprises a database data and a data query support module. RDF consists of nodes representing entities/resources, attributes, and edges representing relationships between entities and attributes. Generally, the source point of an edge in the graph is called a subject, the label of the edge is called a predicate, and the pointed node becomes an object.

The data preprocessor is used for preprocessing data and preparing for data retrieval and query. The pretreatment comprises the following steps: extracting part of data required by connection operation (join operation) of a graph database; the memory-friendly data index is generated according to the specification of the CSR (Compressed Sparse Row) format. Meanwhile, aiming at the characteristic of best bit operation performance in a hardware environment, a self-designed Block number and binary segment method (BID & BS, Block ID & BitStride method) is adopted to compress and store the adjacency list data and the candidate point list data. The specific steps are that, for an array, if the elements in the array are all positive integers and have a fixed range, a binary string can be used to represent the array, wherein the ith bit is 0 to indicate that the number i does not appear in the array, and the ith bit is 1 to indicate that the number i appears in the array. When the array of the expression method is used for taking the intersection, only two binary strings need to be subjected to bit-based AND operation, and the method is very suitable for FPGA end hardware implementation. However, such binary strings are often sparse, wasting a lot of storage space. Thus, a binary string may be divided into blocks (blocks) of uniform length, with as many bits representing a unique number (Block id) for each Block. Wherein if all binary strings within this block are 0, then it can be directly discarded. When the intersection is taken, the block numbers are matched firstly, and the intersection is taken according to the position for the two blocks with the matched block numbers, so that the intersection result of the part can be directly obtained. The method can normally perform intersection calculation without data decompression when the large-scale data query accelerating device based on the FPGA-CPU heterogeneous environment is used for query acceleration, thereby greatly improving the query performance and improving the efficiency of a graph calculation algorithm based on graph data query;

the CPU end control unit is used for calling the data preprocessor to carry out certain preprocessing on the data, specifically, calling the data preprocessor to obtain the image data which is compressed and converted by the preprocessor and is suitable for being read by the FPGA end computing unit. And then, data required by the FPGA end computing unit is transmitted to the FPGA end computing unit, so that the FPGA end computing unit can read the data more conveniently. Meanwhile, the CPU control unit also receives results returned by the FPGA end computing units, each result is in a form of a tuple consisting of a plurality of node IDs and a BID & BS block, the CPU end control unit simultaneously controls a plurality of FPGA end computing units, the FPGA end computing units are ordered, the first FPGA end computing unit processes the connection operation of two tables, the second FPGA end computing unit processes the connection operation of three tables, and the like. The CPU side control unit converts the data output by each FPGA side calculation unit into the result of the final connection operation or the input of the next FPGA side calculation unit, the specific conversion method is consistent, if N elements exist in the result, for the first N-1 elements in the tuple, the first N-1 elements are node IDs mapped to an offset array of a certain CSR format (structure), when the final result is output, the first N-1 elements need to be mapped back to numbers in a database, the Nth element is a BID & BS block, the first N-1 elements need to be decompressed, and each obtained number can form a final result together with the first N-1 elements. In other words, the final result is in the form of a tuple with length N and positive integer elements, each tuple representing the required combination of the numbers of a join operation;

the FPGA end computing unit is used for executing a plurality of table connection (join) algorithms on FPGA hardware, the algorithms are realized on logic units of the FPGA, and optimization is carried out aiming at the running characteristics of the hardware. The specific optimization strategy is as follows: firstly, dividing data, and putting the data into different DRAM (Dynamic Random Access Memory) Memory unit ports to improve the parallel Access efficiency of FPGA hardware to the Memory units; secondly, designing a data buffer area by using a high-speed BRAM (Block Random Access Memory) unit on FPGA hardware; and finally, constructing a three-stage pipeline for data reading, calculation and write-back by utilizing the self-defined characteristic of FPGA hardware, and designing parallel processing logic according to actual conditions in each operation. Finally, the FPGA end computing unit can quickly return the result obtained by the single-time multi-table connection operation to the CPU control unit.

The invention also provides a large-scale data query acceleration method based on the FPGA-CPU heterogeneous environment, which aims at realizing the query acceleration based on the FPGA-CPU aiming at the large-scale graph data set expressed in the resource description framework RDF format; the method comprises the following steps:

1) identifying large-scale data to be queried as a plurality of elements, including: entities, entity attributes and relationships; correspondingly representing the entity as a node in an RDF format, representing the inherent attribute of the entity as the attribute of the node in the RDF format, and representing the relationship between the entity and the relationship between the entity and the attribute as edges in the RDF format; and performing data preprocessing on large-scale data expressed in an RDF format, wherein the RDF format data consists of nodes and edges, a source point of one edge in the graph is a subject, labels on the edges are predicates, and pointed nodes are objects.

In the application of the intelligent question-answering query to the natural language, individuals capable of being endowed with attributes and relations in a knowledge base, including characters, events, things and the like, are defined as entities and correspondingly expressed as nodes in an RDF format, and the attributes of the nodes are defined as the inherent attributes of the individuals, including the birthdays of the characters, the names of the things and the like; the relationships between entities and the relationships between entities and attributes are represented as edges in the RDF format.

The data preprocessing comprises the following steps: 11) dividing the graph data in the large-scale graph data set into a plurality of sub-graphs according to predicates, and storing the adjacent matrix of each sub-graph data by using a CSR format after optimizing storage consumption.

The large-scale data represented by the RD F format consists of nodes and edges, wherein the nodes represent entities/resources and attributes, and the edges represent the relationships between the entities and the attributes. Generally, the source point of an edge in the graph is called a subject, the label of the edge is called a predicate, and the pointed node becomes an object.

The large-scale data query acceleration method based on the FPGA-CPU heterogeneous environment continuously and randomly reads a list of all nodes linked with any node in a graph, wherein the list is called an adjacency list. The adjacency list is divided into an out-edge adjacency list and an in-edge adjacency list, which respectively represent adjacency lists when the node is used as a subject or an object. And arranging the adjacency lists of all the nodes in a certain sequence to form a two-dimensional list, namely the adjacency matrix. The CSR storage format and CSC storage format of the graph may improve the operational efficiency of the access.

12) And compressing the adjacency list in each CSR format graph data by using a block number and binary segment method BID & BS method to obtain compressed CSR format data.

The Block number and binary segment method (BID & BS, Block ID & Bit Stride method) compresses and stores the adjacency list data and candidate point list data. The method comprises the following specific steps:

121) representing an adjacency list (which is a positive integer array) in the CSR by using a binary string, wherein elements in the string take values of 0 or 1, and the elements in the array are the serial numbers of nodes;

for an array, if the elements are positive integers and have a fixed range, a binary string may be used to represent the array, where an ith bit of 0 indicates that i is not present in the array and an ith bit of 1 indicates that i is present in the array. When the array of the expression method is used for taking the intersection, only two binary strings need to be subjected to bit-based AND operation, and the method is very suitable for FPGA end hardware implementation.

122) Dividing the binary string into a plurality of blocks (blocks) with the same length (bits), wherein each Block is provided with a unique number (Block ID) with the same length;

binary strings are often sparse, wasting a lot of storage space. The invention divides the binary string into a plurality of blocks (Block) with consistent length, and the same number of bits is used for representing the unique number (Block ID) of each Block; when the binary strings in the block are all 0, directly discarding; .

123) When the intersection of two sets formed by a plurality of blocks is taken, the block numbers are firstly matched, and the intersection is taken according to the position for the two blocks with the consistent block numbers, so that the intersection result can be directly obtained. 2) Before calculation is started, writing the preprocessed CSR format data into a storage unit at the FPGA end to obtain the physical address of the subgraph corresponding to different predicates, which is compressed into an adjacent list in the CSR format, in the storage unit at the FPGA end;

3) when a user makes a query, the following operations are performed:

31) after analyzing the user query, obtaining candidate point table data according to the data in the graph;

in specific implementation, the large-scale graph data set adopts gStore, and the gStore filters possible values of each variable by using degrees and labels on sides of each variable in SPARQL query to obtain candidate point table data;

32) the CPU control unit transmits candidate point table data and control signals required by the FPGA end calculation unit in the operation process to the calculation unit of the FPGA end to perform table connection (join) algorithm calculation and optimization; the control signal includes a signal for controlling the FPGA side computing unit to start computing, and a hardware address and auxiliary information (such as a length of the candidate point table, which CSR format data is needed) of the CSR format data on the FPGA side storage unit.

The method comprises the following steps:

321) by dividing data, the data are put into different DRAM memory unit ports, and the parallel access efficiency of FPGA hardware to the memory units is improved;

322) setting a data buffer area by utilizing a BRAM unit of a high-speed on-chip random access memory on FPGA hardware;

323) constructing a three-stage pipeline for data reading, calculation and write-back based on the self-defined characteristics of FPGA hardware, and setting a parallel processing module in each calculation and reading operation;

33) and after the calculation of the calculation unit at the FPGA end is finished, the result is transmitted back to the CPU end control unit, and the CPU end generates a final calculation result by using the result and converts the final calculation result into a corresponding character string result, namely the query result output to the user. The specific operation comprises the following steps:

331) the CPU control unit receives results returned by the FPGA end computing units, each result is in a tuple formed by a plurality of node IDs and a BID & BS block, the CPU control unit controls a plurality of FPGA end computing units simultaneously, the FPGA end computing units are ordered, the first FPGA end computing unit processes the connection operation of two tables, the second FPGA end computing unit processes the connection operation of three tables, and the like.

332) And the CPU end control unit converts the result of each FPGA end calculation unit into a final result of the connection operation or the input of the FPGA end calculation unit in the next step.

Setting N elements in the result (one tuple) of the FPGA end calculation unit, mapping the first N-1 elements in the tuple to the node ID on the offset array of the CSR format (structure), and mapping the node ID back to the number in the graph database for output; the nth element is a BID & BS block, and the nth element is decompressed to obtain a plurality of node numbers, which should be included in the intersection result. These resulting numbers all form a final result with the first N-1 elements. The final result is in the form of a tuple of length N and node number elements, each tuple representing the combination of the numbers of a join operation that meets the query requirements. In the number combination, the numbers and the node data are in one-to-one correspondence, and a node set which should appear in a query answer is determined, namely the query result which is finally output to the user by the system.

Compared with the prior art, the invention has the beneficial effects that:

the invention provides a large-scale data query acceleration method for a graph database based on an FPGA-CPU heterogeneous environment and an implementation device thereof on the FPGA, which can be applied to database query of intersection operation of a candidate point list and an adjacent list during processing connection operation, solve the problem of fast query of data on a large-scale data set, accelerate graph database query, and can be widely applied to the technical field of application based on graph data processing, such as social networks, financial wind control, application of Internet of things, relationship analysis, IT operation and maintenance, recommendation engines and the like. By adjusting the input and output formats, the invention can be combined with a graph database system, and the query response speed of the graph database system is improved. The method is applied to natural language question-answer intelligent query, characters, events and things in natural language question-answer data are identified as entities and correspondingly expressed as nodes in an RDF format; the attribute of the entity is defined as the attribute of the node, and the relationship between the entity and the relationship between the entity and the attribute are represented as edges in the RDF format. The specific implementation shows that by adopting the technical scheme of the invention, the query acceleration rate is more than two times and can reach ten times of acceleration, so that the application requirement with higher response time requirement can be better met, and for example, real-time graph mode query discovery can be realized.

Drawings

Fig. 1 is a schematic diagram of a CSR structure (format) used in the embodiment of the present invention.

Fig. 2 is a schematic diagram of the implementation of the BID & BS compression method according to the embodiment of the present invention.

Fig. 3 is a block diagram of an accelerator apparatus/accelerator according to the present invention.

Fig. 4 is a Kernel structure block diagram of the 0 th layer of the acceleration device provided by the present invention.

FIG. 5 is a block diagram illustrating a process for intersection calculation during a query process according to an embodiment of the present invention.

Detailed Description

The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.

The invention provides a large-scale data query acceleration method for a graph database based on an FPGA-CPU heterogeneous environment and an implementation device of the method on an FPGA. Accelerated graph database queries may support a large number of application scenarios based on graph data. For example: there is a need to quickly find some fixed pattern of application scenarios in graph data: such as holdings between companies, can be expressed in the form of graph data. Fraud detection is a very suitable application scenario for graph databases. In modern fraud and various financial crimes, such as bank fraud, credit card fraud, e-commerce fraud, insurance fraud, etc., the fraudster usually uses means of changing its identity to avoid the fraud of the wind control rule. However, it is often difficult for a fraudster to change all the associations relating to the network and to perform the same operation in all the involved network groups simultaneously to avoid the wind control. While the graph data may establish a user tracking perspective that tracks global. However, without the use of an accelerator, the time penalty is unacceptable. After the accelerating device is used, the operation speed of the algorithm can be improved by more than 2 times. In addition, there are application scenarios that need to support customized real-time queries, which include many research institutions and business companies building domain-specific or open-domain natural language question-and-answer systems based on graph data (typically a knowledge graph). At the bottom of these question-answering systems, graph database support is required in order to quickly obtain from graph data the information needed to parse and answer natural language questions. For example, in an intelligent question-answering system framework developed by a certain artificial intelligence company, after a natural language is converted into a SPARQL which can be identified by a graph database by using machine learning tools such as sentence vector coding, sentence pattern analysis, word meaning filtering, emotion identification, text clustering and classification, the graph database is used for storing a background in a triple form, and a knowledge base which covers the fields of common knowledge, ancient poems, life events, music and the like and has a scale close to hundred million and edge levels is searched. Such accesses are around 20 million times per day, with about 5 links concurrent during peak hours, averaging 10 requests per second. The delay of this step is about 30% of the overall delay. After the technical scheme of the invention is used for accelerating the query of the graph database, the time delay ratio can be reduced to 15 percent, and the throughput in the peak period is doubled, thereby helping the company to realize the response speed within the second level.

In the method for accelerating the large-scale data query based on the FPGA-CPU heterogeneous environment, the large-scale data is a database which relates to intersection operation of a candidate point table and an adjacent list during processing connection operation, such as a graph database, and can be applied to application such as acceleration of natural language question and answer query. The specific implementation takes the natural language knowledge base as an example to be represented as graph data, and comprises the following operations:

1) and determining the actual meanings corresponding to the nodes and the edges in the graph data. In a natural language knowledge base, nodes generally represent all bodies in the knowledge base that can generate interrelations, and have attributes, such as people, things, places, and the like, and edges represent interrelations between the bodies, such as spouses, birth places, and the like.

2) Attributes of nodes and edges in the graph data are determined. Attributes of a node typically represent inherent characteristics of the entity to which the node corresponds, such as the age, gender, date of birth, name of the location, and so forth of the person. The attributes of an edge typically represent the properties that a relationship carries. For example, in a mutual spouse relationship, there may be attributes such as a start time, an end time, and the like.

3) According to the definition, a certain graph data format is adopted to convert the data into graph data. Taking the RDF format as an example, the definitions of the nodes and edges and the definition modes of the respective attributes in the RDF format have detailed format specifications, and data conversion is performed according to the format specifications.

In the prior art, a method for converting natural language to SPARQL queries in a graph database may include the steps of:

1) and performing entity recognition, and establishing association between elements in the natural language and nodes in the graph database.

Elements included in natural language may have corresponding nodes in the graph database, but the names of the nodes and the natural language do not necessarily coincide literally. For example, a natural language refers to "duel", and in a graph database, there is a possibility that a label of a corresponding node is "duel (poetry of down)". It is necessary to link the two. The current methods generally use pattern matching or deep learning methods to identify entities based on information in a graph database.

2) Determining a dependency relationship;

the dependency relationship refers to a semantic relationship between entities in a natural language. Typically, a dependency corresponds to two nodes and an edge in graph data. At present, a dependency relationship is determined by a dependency relationship tree generation method.

3) A query is generated.

By using the entity information and the dependency relationship information obtained above, queries that can be identified by a graph database can be generated by a machine learning method.

Graph database queries are very basic graph data operations that require a certain amount of query data, whether the query itself is provided to the user or a query-based application interface. When the graph data is large in scale, it takes a lot of time and computing resources to perform Join operations on the graph. The Join operation on the graph is similar to the Join operation of the table in the relational database, and the matching items in the two sets are searched according to certain conditions. The difference is that the relational database usually uses equivalence conditions to determine whether elements match, and Join operations in the graph database need to determine whether elements match by determining whether there is a relationship between elements. Join operations in graph databases are more complex, since they involve more memory reads and computation operations than Join operations in relational databases.

Essentially, the purpose of performing Join operations on a graph is to perform computation of sub-graph isomorphism. In most graph data, a user query may be represented as a query graph. The execution of a query is equivalent to finding a subgraph in the overall data graph that has a isomorphic relationship to the query graph. In computer theory, the definition of isomorphic problem is: the two simple graphs G and H are said to be isomorphic if and only if there is a one-to-one correspondence σ that maps node 1 … … n of G to node 1 … … n of H, such that any two nodes i and j in G are connected, and if and only if the corresponding two nodes σ (i) and σ (j) in H are connected. If G and H are directed graphs, then the isomorphic definition further requires that for any two connected nodes i and j in G, the edge (i, j) be in the same direction as its corresponding edge (σ (i), σ (j)) in H. The parallel computing efficiency of the FPGA is high. FPGA belongs to parallel computation, and can execute algorithms of a plurality of instructions once, while traditional ASIC, DSP and even CPU are all serial computation, and can only process one instruction set once, if ASIC and CPU need to be accelerated, more methods are to increase frequency, so the main frequency of ASIC and CPU is generally higher. Although FPGAs generally have a low dominant frequency, for some specific tasks, a large number of relatively slow parallel units are more efficient than a small number of efficient units. In addition, from a certain point of view, the FPGA does not have so-called 'calculation', and the final result is almost similar to ASIC 'circuit direct supply', so that the execution efficiency is greatly improved.

RDF (resource DescriptionFramework), namely a resource description framework, is essentially a Data Model (Data Model). It provides a uniform standard for describing entities/resources. In short, it is a method and means of representing things. RDF is formally represented as an SPO triple, sometimes referred to as a statement (state), which we also refer to as a piece of knowledge in the knowledge graph.

RDF consists of nodes representing entities/resources, attributes, and edges representing relationships between entities and attributes. Generally, the source point of an edge in the graph is called a subject, the label of the edge is called a predicate, and the pointed node becomes an object.

In the present invention, it is necessary to continuously read at random the list of all nodes linked to any node in the graph. In graph theory, this list is called a neighbor list. In the directed graph, the adjacency list is divided into an outgoing adjacency list and an incoming adjacency list, and each of the lists indicates an adjacency list when a node is a subject or an object. A two-dimensional list in which the adjacency lists of all nodes are arranged in a certain order is called an adjacency matrix. However, in computer systems, continuous random access to memory locations results in poor operating efficiency. Therefore, researchers have proposed the CSR storage format and the CSC (Compressed Sparse storage) storage format of the graph by using the CSR (Compressed Sparse Row) storage format of the Sparse matrix.

The CSR storage format of fig. 1 consists of two arrays C and E. Wherein E is formed by connecting the adjacency lists of all nodes end to end. Since the graph database system will typically set node IDs for nodes, the adjacency list of all nodes can be combined in order of node IDs from large to small. And C is the same length as the number of nodes in the graph, the value of the node indexed by i is equal to the position of the first element in E of the adjacency list of nodes with ID i. When the adjacency list is an out-edge adjacency list, it is called a CSR format, and when the adjacency list is an in-edge adjacency list, it is called a CSC format. Array C is also called the offset array because it represents the offset of the adjacency list in E.

The technical scheme of the invention comprises the following steps: the system comprises a data preprocessing part, a CPU control unit part and an FPGA computing unit part. In specific implementation, the invention adopts Verilog language to write the calculation unit of the FPGA end and uses C + + to write the control unit program of the CPU end. The development and operation of the invention are based on the U200 FPGA board card sold by Xilinx company and the matched operation and development environment thereof. U200 is equipped with 16GB by 4 memory space, 2175 by 18K of on-chip memory space. Therefore, when the present invention is implemented in U200, the adjacency list is divided into 128 groups, and 128-way parallel intersection calculation is performed. The matched development environment can map the hardware logic corresponding to the Verilog language into FPGA hardware.

The invention is based on an open source graph database system gStore, aiming at a lubm database with the scale of 50 hundred million and a join query related to a benchmark corresponding to the lubm database, and realizing a CPU-FPGA heterogeneous graph database system query accelerator.

Data preprocessing section

Currently, there are a variety of standard formats for storing and representing graph data, with the RDF format being widely used. The graph data format to which the present invention is applicable is the RDF format. In the calculation process, the adjacency matrix of the graph data is stored using the CSR format.

In the RDF format, relationships between nodes are represented using triples. When the user query processing is carried out, the values of the subject, the object and the predicate in the triple need to be matched. However, neighbors linked by different predicates cannot be distinguished in the CSR format. Therefore, in the present invention, graph data is divided according to predicates. All edges with the same predicate are extracted to form a sub-graph, and a separate CSR structure (format) is generated for each sub-graph. During the query process, it is determined by additional parameters which data in the CSR is to be read.

Thus, there is still a drawback. The size of the subgraph obtained according to the predicate may be much smaller than that of the original graph, so that the offset array in the CSR is very sparse, and a large amount of storage space is wasted. Thus, for each sub-graph, the present invention maintains a mapping structure in advance. And (3) setting the number of nodes with the total degree of being more than or equal to 1 in a given subgraph to be n, renumbering the nodes with the number of n to be 0 … … n-1 according to the sequence of IDs from small to large, and constructing an offset matrix in the CSR under the new numbering, wherein the E matrix is kept unchanged, namely the elements in the adjacency list are still the IDs of the nodes before renumbering.

Meanwhile, in order to improve the parallelism of the algorithm on FPGA hardware and reduce the computational complexity, the invention adopts a data compression mode based on binary bit strings, which is called a BID & BS structure.

From a hardware perspective, the fastest to run is a binary bit operation. The most intuitive idea to use bit manipulation to accomplish intersection fetching is to represent the elements in the set as a binary string, where the ith bit is 1, which indicates that there is an element numbered i in the set, otherwise there is no element (assuming that the elements in the set are numbered, in fact, most graph databases will number nodes). Then, only one bitwise and operation needs to be performed on the two binary strings, and the result is a binary string representation of the intersection. However, in the practical application, most of the graphs are sparse graphs, and the average degree is not high, so that a large number of continuous 0 in the binary string is caused, and the performance is affected.

The basic idea of the BID & BS compression method is to divide the binary string into a number of blocks, give a unique Block ID (BID) to each part, and inside each Block, indicate the presence or absence of an aggregate element with a binary string Bit Stride (BS). Therefore, a block without elements can be directly removed without calculation, so that the effect of data compression is achieved, and the problem of data sparsity is relieved to a certain extent. For a block with at least one element, the merging method can be used to find the same BID by comparison, and then perform bitwise and operation on the corresponding BS to obtain the result. For convenience, it is noted that the length of each BID is g bits, the length of each BS is s bits, and the number of elements in the complete set is Ω, and it is obvious that Ω/s different blocks can be obtained at this time, and each block is allocated with a unique BID.

For example, assuming that two existing sets S _0 ═ {0,2,3,5,6,7,64,65,66}, S _1 ═ {2,3,7,18}, S ═ 4, g ═ 5, and Ω ═ 128, the generated BID & BS structure is as shown in fig. 2:

CPU control unit section

1. Loading phase (offline):

reading the data into a host memory from a local index, and dividing the data into a plurality of CSR structures (formats) according to predicate ids because two table joins are based on a predicate, wherein each CSR corresponds to one predicate id. Because of the directed graph, two sets of CSRs are required, one for storing the CSR structures (formats) of the outgoing edges and one for storing the CSR structures (formats) of the incoming edges. The two CSRs are stored in the memory cells of the two FPGA cards because the two CSRs cannot influence each other. To facilitate FPGA compute unit access to CSRs, the non-contiguous vertex ids of each CSR are mapped to a contiguous index id.

2. Query execution phase (online, taking three-variable query as an example):

in the first step, a set of candidate points for each variable is found.

In the second step, for each candidate point set, a merit function is used to calculate its priority score, tending to start with join from variables with a small number of candidate points.

And thirdly, executing the calculation unit programs of two table join: for two table joins, the CPU control unit transmits two candidate point tables to the FPGA end, the former needs to map the vertex id to the index id corresponding to the CSR, the latter compresses the candidate point tables, the whole candidate point table is changed into a binary string, the kth bit is 1 to indicate that the vertex with the id of k appears in the candidate point table, every w bits are divided into a group, the binary bit and the w bits of the id of the group are spliced to represent a group of adjacent vertex ids, if the w bits are all 0, the storage can be avoided, and finally the FPGA calculation unit returns the intermediate result of the two table joins.

Step four, executing the calculation unit program of the three-table join: for the third table, it may need join with both the first two tables, because each item of the intermediate result of the join of the two tables received by the CPU end control unit is a data pair, the second item of the data pair is a group of vertex ids, and after decompression is needed, the second item is mapped into the index id of the CSR, and is transmitted to the FPGA end computing unit of the join of the three tables as a new input, and then the CPU end control unit obtains the final result.

Since the control unit and the calculation unit use a data flow manner, the calculation unit of the two-table join and the calculation unit of the three-table join can correspond to two FPGA cards. For the benchmark (performance evaluation standard) of lubm (the high University benchmark), two computing units can be executed independently, and each accesses different CSR structures (formats) without conflict.

User queries for three or more variables can be handled with the same logic, except that more FPGA hardware resources are required. Thereby allowing scalability of the design.

FPGA computing unit part

Fig. 3 is a block diagram illustrating the structure of an accelerator apparatus/accelerator according to the present invention. In the loading phase, the control unit program on the CPU writes the CSR data needed to compute the join into the memory of the FPGA hardware. In the query execution stage, under a specific execution sequence, Join is divided into a plurality of layers, and data is transmitted between the layers in a stream mode. Namely, when the previous layer of calculation obtains an intermediate result, the intermediate result can be immediately transmitted to the next layer, and the calculation of the next layer can be started without waiting for the completion of the calculation of all the intermediate results, thereby greatly improving the parallelism. For each level, a control unit program at the CPU end needs to transmit a candidate point table and control parameters required for calculation of the level to the FPGA. Likewise, the control unit program transmits them in the form of streams to achieve an overall data stream.

Taking layer 0 as an example, the module structure at each layer is shown in fig. 4. And the adjacent list read from the FPGA memory and the candidate point list acquired from the CPU end are transmitted to a plurality of modules for processing the intersection of two or more lists. The specific number of parallel modules depends on the size of the data set and the specific configuration of the FPGA hardware.

Also taking layer 0 as an example, fig. 5 shows a flow of intersection calculation in the query process according to the embodiment of the present invention. The incoming candidate point list is equally divided into N equal parts, which is equivalent to putting into N buckets. And each time an adjacency list is received, each element can be placed into a corresponding bucket according to the range of the node IDs in the N buckets, and then the calculation of merging and taking intersection is carried out in each bucket. After a valid result is obtained, the result is transmitted back to the CPU.

In specific implementation, the large-scale database module is used for storing a large-scale graph data set represented in an RDF (resource DescriptionFramework) format. The number of nodes is in the tens of millions and above, and the number of edges is in the hundreds of millions. For example, the LUBM (the Lehigh University benchmark) dataset, which was used to test the performance of the present invention, contains 50 hundred million edges, about 9 million nodes. As another example, the dbpedia dataset extracted from Wikipedia includes 30 hundred million edges and 1600 ten thousand nodes. Such large-scale data sets have high requirements for the performance of stand-alone graph databases. The invention selects the gStore (https:// github. com/pkumod/gStore) developed by university to provide database software support because of the better performance of single-machine query on large-scale data.

It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims

1. A large-scale data query acceleration method based on an FPGA-CPU heterogeneous environment is characterized in that large-scale data to be queried and processed are expressed as a large-scale graph data set in a Resource Description Framework (RDF) format, and query acceleration is realized based on the FPGA-CPU heterogeneous environment; the method comprises the following steps:

1) identifying large-scale data to be queried as a plurality of elements, including: entities, entity attributes and relationships; correspondingly representing the entity as a node in an RDF format, representing the inherent attribute of the entity as the attribute of the node in the RDF format, and representing the relationship between the entity and the relationship between the entity and the attribute as edges in the RDF format; the RDF format data consists of nodes and edges, a source point of one edge in the graph is a subject, labels on the edges are predicates, and pointed nodes are objects;

carrying out data preprocessing on large-scale data expressed in an RDF format, and executing the following operations:

11) dividing graph data in the large-scale graph data set into a plurality of sub-graphs according to predicates, and storing an adjacent matrix of each sub-graph data by using a CSR format after optimized storage consumption; the adjacency matrix is a two-dimensional list formed by arranging adjacency lists of all nodes in a certain sequence; the adjacency list is a list of all nodes linked by any node in the graph; the map adopts a CSR storage format or a CSC storage format;

12) compressing the adjacency list in each CSR format graph data by using a block number and binary segment method BID & BS method to obtain compressed CSR format data; the following operations are performed:

121) representing a positive integer array by using a binary string, wherein the positive integer array is an adjacent table in the CSR, the value of an element in the string is 0 or 1, and the element in the array is the number of a node;

122) dividing the binary string into a plurality of blocks with the same length, wherein each Block is provided with a unique serial number Block ID with the same length; when the binary strings in the block are all 0, directly discarding;

123) when the intersection is taken for two sets formed by a plurality of blocks, firstly matching the block numbers, and taking the intersection according to the position for the two blocks with the consistent block numbers to obtain the intersection result;

2) writing the CSR format data obtained by preprocessing into a storage unit at the FPGA end to obtain the physical address of an adjacency list, compressed into the CSR format, of subgraphs corresponding to different predicates in the storage unit at the FPGA end;

3) when a user makes a query, the following operations are performed:

32) transmitting candidate point table data and a control signal of the FPGA end calculation unit in the operation process to a calculation unit of an FPGA end through a CPU control unit for table connection calculation and optimization; the following operations are performed:

323) constructing a three-stage pipeline for data reading, calculation and write-back based on FPGA hardware, and setting a parallel processing module in each calculation and reading operation;

33) and after the calculation of the calculation unit at the FPGA end is finished, the result is transmitted back to the CPU end control unit, and the CPU end generates a numbering result and converts the numbering result into a corresponding character string result, namely the query result output to the user.

2. The large-scale data query acceleration method based on the FPGA-CPU heterogeneous environment as claimed in claim 1, wherein in step 31), the possible values of each variable are filtered and analyzed by using the degree and the label on the side of each variable in SPARQL query on the large-scale map data set gStore to obtain the candidate point table data.

3. The method as claimed in claim 1, wherein in step 32), the control signals include a signal for controlling the FPGA-side computing unit to start computing, a hardware address of the CSR format data on the FPGA-side storage unit, and auxiliary information.

4. The large-scale data query acceleration method based on the FPGA-CPU heterogeneous environment as claimed in claim 1, wherein the step 33) specifically includes the operations of:

331) the FPGA end computing unit is used for orderly processing the connection operation of the table; the CPU control unit receives a result returned by the FPGA end computing unit, and the returned result is in a tuple consisting of a plurality of node IDs and a BID & BS block; the CPU end control unit simultaneously controls a plurality of FPGA end computing units;

332) the CPU end control unit converts the result of each FPGA end calculation unit into the final result of the connection operation or the input of the FPGA end calculation unit in the next step, and the method comprises the following processes:

setting N elements in a result tuple of a computation unit at the FPGA end, mapping the first N-1 elements in the tuple to a node ID on an offset array in a CSR format, and mapping the node ID back to a number in a graph database for output; the nth element is a BID & BS block;

decompressing the Nth element to obtain a plurality of node numbers;

the obtained node number and the previous N-1 elements form a final result;

the final result is in the form of a tuple with the length of N and the element of the node number; each tuple represents a combination of numbers of a join operation that meets the query requirements;

the combination of the numbers is the query result which is finally output to the user by the system.

5. The large-scale data query acceleration method based on the FPGA-CPU heterogeneous environment as claimed in claim 1, wherein the method is applied to natural language intelligent question-answer query, and the large-scale data to be queried and processed are natural language question-answer data; identifying characters, events and things in the natural language question-answering data as entities, and correspondingly representing the entities as nodes in an RDF format; the attributes of the entities are defined as attributes of the nodes.

6. A large-scale data query accelerating device based on an FPGA-CPU heterogeneous environment comprises: the system comprises a large-scale database module, a data preprocessor, a CPU (central processing unit) end control unit and an FPGA (field programmable gate array) end; the FPGA end comprises an FPGA end storage unit and an FPGA end calculation unit;

A. the large-scale database module is used for storing a large-scale graph data set represented in a Resource Description Framework (RDF) format; the number of data nodes is in the tens of millions and above, and the number of edges is above hundred million; the large-scale database module comprises a database data and a data query support module;

B. the data preprocessor is used for preprocessing data and comprises the following steps:

connecting graph databases;

extracting partial data from the data;

generating a data index friendly to access and storage according to the sparse row compression CSR format;

adopting a block number and binary segment method BID & BS to compress and store the adjacency list data and the candidate point list data;

the CPU end control unit is used for calling a data preprocessor to preprocess data, and specifically comprises the following steps:

calling a data preprocessor to obtain compressed and converted graph data, and reading the graph data by a FPGA end computing unit;

then transmitting data required by the FPGA end computing unit to the FPGA end computing unit;

meanwhile, the CPU control unit also receives results returned by the FPGA end computing unit, each result is in the form of a tuple consisting of a plurality of node IDs and a BID & BS block, and the CPU control unit controls the FPGA end computing units at the same time;

the CPU end control unit converts the data output by each FPGA end computing unit into a result of final connection operation or the input of the next FPGA end computing unit; the final result is in the form of a tuple with the length of N and positive integers of elements, and each tuple represents a serial number combination of a join operation meeting the query requirement;

the FPGA end computing unit is used for executing a plurality of table connection operations on FPGA hardware and realizing optimization on a logic unit of the FPGA; the method comprises the following steps:

firstly, data are divided and put into different DRAM memory unit ports;

secondly, designing a data buffer area by utilizing a BRAM unit of a high-speed on-chip random access memory on FPGA hardware;

and finally, constructing a three-stage pipeline for data reading, calculation and write-back by using FPGA hardware, and designing parallel processing logic in each operation, so that the FPGA-side calculation unit can quickly return the result obtained by the single-time multi-table connection operation to the CPU control unit.

7. The large-scale data query acceleration apparatus based on the FPGA-CPU heterogeneous environment as claimed in claim 6, wherein the large-scale graph dataset expressed in RDF format specifically adopts LUBM dataset.

8. The large-scale data query acceleration device based on the FPGA-CPU heterogeneous environment as claimed in claim 6, wherein the block number and binary segment BID & BS method comprises the specific steps of:

121) representing a positive integer array by using a binary string, wherein elements in the string take values of 0 or 1, and the elements in the array are the serial numbers of nodes;

123) when the intersection is taken for two sets formed by a plurality of blocks, the block numbers are firstly matched, and the intersection is taken for two blocks with the consistent block numbers according to the position to obtain the intersection result.