CA3176758A1

CA3176758A1 - Method and apparatus for introducing data to a graph database

Info

Publication number: CA3176758A1
Application number: CA3176758A
Authority: CA
Inventors: Bo Wang
Original assignee: 10353744 Canada Ltd
Current assignee: 10353744 Canada Ltd
Priority date: 2019-04-09
Filing date: 2019-09-29
Publication date: 2020-10-15
Also published as: WO2020206952A1; CN110110108A; CN110110108B

Abstract

A graph database data import method and apparatus, the method comprising: registering a custom spark udf function to a graph database program, so that the graph database establishes a connection with a spark resource by means of the spark udf function (S1); creating a node attribute index in the graph database (S2); using the spark resource to query a hive database, and acquiring queried data (S3); after re-partitioning, registering the queried data to a temporary data table (S4); and, by means of the spark udf function and the node attribute index, importing the temporary data table to the graph database (S5). Real-time import of data can be implemented by means of using the combination of spark and a graph database, without the need to export data in a csv format; the use of spark technology facilitates spark performance optimisation and data import speed adjustment; and, by means of using the concurrency feature of spark, data import speed can be increased without data loss.

Description

METHOD AND APPARATUS FOR INTRODUCING DATA TO A GRAPH DATABASE
BACKGROUND OF THE INVENTION
Technical Field

[0001] The present invention relates to the technical field of data processing, and more particularly to a method and an apparatus for introducing data to a graph database.
Description of Related Art

[0002] Spark is a cluster-based, in-memory data processing technology. It can process massive data sets across a cluster of multiple machines, and can be integrated with graph computing frameworks to compute data. Besides supporting various forms of integration, Spark can pre-process data (through aggregation, filtering, and conversion) and introduce the pre-processed data into graph databases.

[0003] In the prior art, there are several ways to introduce data into graph databases, including writing create statements, using load csv statements, and using a batch inserter, batch import, and neo4j-import. Except for create statements, all of these ways require the data of interest to be converted into the csv format first, which is troublesome in real-world production and development environments. For example, in a production environment, data are usually confidential, so most companies cannot export data from the production environment into a csv file; this approach also does not support real-time insertion and is particularly impractical for massive data. As to the latter three ways, they are incapable of real-time introduction. To be specific, introducing data with them requires shutting down the neo4j (a type of graph database) server, and thus cannot be real-time.

[0004] Hence, how to introduce data into a graph database with increased speed is crucial to construction of graphs, and is a pressing issue to address in the art.

Date Reçue/Date Received 2022-09-23

SUMMARY OF THE INVENTION

[0005] For addressing the issues of the prior art, embodiments of the present invention provide a method and an apparatus for introducing data to a graph database, which overcome the problems of the prior art such as the necessity of converting data into the csv format before the data can be introduced into a graph database and the incapability of adjusting the speed of data introduction.

[0006] To solve the foregoing one or more technical problems, the present invention adopts the following technical schemes.

[0007] In one aspect, the present invention provides a method for introducing data to a graph database. The method comprises the following steps:

[0008] registering a user-defined spark udf to a graph database program, so that the graph database is connected to a spark resource through the spark udf;

[0009] creating node attribute indexes in the graph database;

[0010] using the spark resource to make enquiry to a hive database to acquire enquiry-generated data;

[0011] re-partitioning the enquiry-generated data and registering them as a temporary data table; and

[0012] introducing the temporary data table to the graph database using the spark udf and the node attribute indexes.

[0013] Further, before the step of registering a user-defined spark udf to a graph database program, the method further comprises:

[0014] setting up a driver in the spark udf for connecting the graph database and writing it in a static method.

[0015] Further, before the step of registering a user-defined spark udf to a graph database program the method further comprises:

[0016] defining the input and output parameters of the spark udf.

[0017] Further, after the step of introducing the temporary data table to the graph database the method further comprises:

[0018] turning off the driver of the graph database and the spark resource.

[0019] Further, the step of using the spark resource to make enquiry to a hive database comprises:

[0020] using a reduce operator of the spark resource to perform corresponding computation on the enquiry-generated data.

[0021] In another aspect, the present invention provides an apparatus for introducing data to a graph database. The apparatus comprises:

[0022] a connecting module, for registering a user-defined spark udf to a graph database program, so that the graph database is connected to a spark resource through the spark udf;

[0023] a creating module, for creating node attribute indexes in the graph database;

[0024] an enquiring module, for using the spark resource to make enquiry to a hive database to acquire enquiry-generated data;

[0025] a partitioning module, for re-partitioning the enquiry-generated data and registering them as a temporary data table; and

[0026] an introducing module, for introducing the temporary data table to the graph database using the spark udf and the node attribute indexes.

[0027] Further, the apparatus further comprises:

[0028] a driver connecting module, for setting up a driver in the spark udf for connecting the graph database and writing it in a static method.

[0029] Further, the apparatus further comprises:

[0030] a configuring module, for defining the input and output parameters of the spark udf.

[0031] Further, the apparatus further comprises:

[0032] a deactivating module, for turning off the driver of the graph database and the spark resource.

[0033] Further, the step of using the spark resource to make enquiry to a hive database comprises:

[0034] using a reduce operator of the spark resource to perform corresponding computation on the enquiry-generated data.

[0035] The technical schemes of the embodiments of the present invention provide the following beneficial effects:

[0036] 1. The method and apparatus for introducing data to a graph database of the embodiments of the present invention use the combination of spark and graph databases to realize real-time introduction of data while eliminating the need to export data into the csv format;

[0037] 2. The method and apparatus for introducing data to a graph database of the embodiments of the present invention use the spark technology to achieve easy optimization of spark performance and adjustment of speed of data introduction; and

[0038] 3. The method and apparatus for introducing data to a graph database of the embodiments of the present invention use the concurrency of spark to accelerate introduction of data without data loss.
BRIEF DESCRIPTION OF THE DRAWINGS

[0039] To better illustrate the technical schemes disclosed in the embodiments of the present invention, the accompanying drawings referred to in the description of the embodiments below are introduced briefly. It is apparent that the accompanying drawings recited in the following description merely show some of the possible embodiments of the present invention, and people of ordinary skill in the art would be able to obtain further drawings from those provided herein without creative effort, wherein:

[0040] FIG. 1 is a flowchart of a method for introducing data to a graph database according to one exemplary embodiment; and

[0041] FIG. 2 is a structural diagram of an apparatus for introducing data to a graph database according to one exemplary embodiment.
DETAILED DESCRIPTION OF THE INVENTION

[0042] To make the foregoing objectives, features, and advantages of the present invention clearer and more understandable, the following description will be directed to some embodiments as depicted in the accompanying drawings to detail the technical schemes disclosed in these embodiments. It is, however, to be understood that the embodiments referred to herein are only a part of all possible embodiments and thus not exhaustive. Based on the embodiments of the present invention, all other embodiments that people of ordinary skill in the art can conceive without creative labor shall be encompassed in the scope of the present invention.

[0043] FIG. 1 is a flowchart of a method for introducing data to a graph database according to one exemplary embodiment. As shown, the method comprises the following steps.

[0044] Step S1 involves registering a user-defined spark udf to a graph database program, so that the graph database is connected to a spark resource through the spark udf.

[0045] Specifically, in the embodiment of the present invention, the user-defined spark udf combines the graph database and the spark resource (i.e., establishes a connection between them), so that data can be introduced into the graph database in real time without having to export the data into the csv format. The user-defined spark udf can be written in the Java language or another programming language. Additionally, the written user-defined spark udf must first be registered in the graph database program, because in Java-based methods only a registered user-defined udf can be used. It is to be noted that, in the embodiment of the present invention, the use of the spark udf facilitates optimization of spark performance by, for example, deciding into how many partitions the data queried from hive are re-partitioned, how to set spark concurrency, how many computing nodes (executors) to assign to a spark task, how much memory to assign to each executor, how many cores to set for one executor, and how much memory to assign to the driver.
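
The tuning choices listed above correspond to ordinary Spark configuration properties. The following is a minimal sketch, not the applicant's code, that collects them in a plain Java map. The class name `ImportTuning`, the method name, and the example values are hypothetical, and mapping "spark concurrency" to `spark.default.parallelism` is an assumption about what the description intends.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: gathers the Spark properties that correspond to the
// tuning choices named in paragraph [0045].
class ImportTuning {
    static Map<String, String> sparkSettings(int executors, int coresPerExecutor,
                                             String executorMemory, String driverMemory,
                                             int parallelism) {
        Map<String, String> conf = new LinkedHashMap<>();
        // Number of computing nodes (executors) assigned to the Spark job.
        conf.put("spark.executor.instances", Integer.toString(executors));
        // Number of cores per executor.
        conf.put("spark.executor.cores", Integer.toString(coresPerExecutor));
        // Memory per executor and memory for the driver, e.g. "4g".
        conf.put("spark.executor.memory", executorMemory);
        conf.put("spark.driver.memory", driverMemory);
        // Assumption: "spark concurrency" is expressed through this property.
        conf.put("spark.default.parallelism", Integer.toString(parallelism));
        return conf;
    }
}
```

Each entry could be passed to `spark-submit` as `--conf key=value`. The number of partitions used for re-partitioning is not a property in this sketch because it is normally chosen in the job code itself.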

[0046] It is to be noted that a spark udf exists so that, when the functions provided by spark are unable to satisfy user needs, a user-defined function can be used to implement the user's own business logic. An example follows:

[0047] import java.io.Serializable;
import org.apache.spark.sql.api.java.UDF5;

public class CreateACCTOBANKCARD2 implements UDF5<String, String, Integer, String, String, String>, Serializable {
    @Override
    public String call(final String pay_acct_no, final String ttl_amt, final Integer ttl_times,
                       final String latest_pay_time, final String rcvr_user) throws Exception {

[0048]         // This udf has five input parameters and one output parameter
        // Its own function works here
        return "1";
    }
}

[0049] S2 is about creating node attribute indexes in the graph database.

[0050] Specifically, for introduction of massive data, in order to prevent duplicate nodes and relationships and to ensure fast search, in the embodiment of the present invention, node attribute indexes are created in the graph database to provide every node in the graph database with an attribute index. Without the attribute indexes, data insertion can slow down significantly.

[0051] For example:

[0052] private static Driver driver = null;
static {
    // Connect once, when the class is loaded
    driver = OperateNeo4j.connectNeo4j(Constants.url, Constants.neo4jUser, Constants.neo4jPassword);
}
Session session = driver.session();
// The identity card number is taken as the index
session.run("create index on :IDNTY_NMBR(Idnty_Nmbr)");
// The account number is taken as the index
session.run("create index on :ACCT_NMBR(Acct_No)");

[0053] S3 involves using the spark resource to make enquiry to a hive database to acquire enquiry-generated data.

[0054] Specifically, the embodiment of the present invention is about introducing the data of a hive table into the graph database. To do so, the hive database can first be queried using the spark resource (specifically, by writing spark sql statements that retrieve the data from the hive database), so as to acquire the enquiry-generated data.

[0055] S4 involves re-partitioning the enquiry-generated data and registering them as a temporary data table.

[0056] Specifically, in the embodiment of the present invention, the data retrieved from the hive database using spark sql are re-partitioned. When computing on a resilient distributed dataset (RDD), spark launches one task for every partition, so the number of partitions of the RDD determines the total number of tasks. In this way, when optimizing spark performance, the total number of tasks can be set by setting the number of partitions of the RDD. By setting the number of required computing nodes (executors) and the number of cores in every computing node, these tasks can be executed concurrently, so as to accelerate introduction of data into the graph database. It is to be noted that an RDD is the basic data abstraction in Spark. It represents an immutable, partitionable collection whose elements can be computed in parallel. An RDD has the features of a data flow model, including automatic fault tolerance, location-aware scheduling, and scalability. An RDD allows users to explicitly cache data in memory across multiple queries, so that subsequent queries can reuse these data, which significantly improves query speed. In addition, since Spark features concurrent computing, each task processes one part of the whole data, so the data are covered completely without loss.
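
The relationship between partitions and tasks described above can be imitated in plain Java. The following is only an analogy under stated assumptions and does not use the Spark API: `PartitionedImport`, `repartition`, and `runTasks` are hypothetical names, a thread pool stands in for the executor cores, and counting rows stands in for writing a partition to the graph database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Plain-Java analogy: one task per partition, executed by a fixed number of slots.
class PartitionedImport {
    // Distribute rows round-robin over the requested number of partitions.
    static <T> List<List<T>> repartition(List<T> rows, int numPartitions) {
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int i = 0; i < rows.size(); i++) {
            partitions.get(i % numPartitions).add(rows.get(i));
        }
        return partitions;
    }

    // Submit one task per partition and return the total number of rows handled.
    static <T> int runTasks(List<List<T>> partitions, int slots) {
        ExecutorService pool = Executors.newFixedThreadPool(slots);
        try {
            List<Future<Integer>> tasks = new ArrayList<>();
            for (List<T> partition : partitions) {
                // Stand-in for "write this partition to the graph database".
                Callable<Integer> task = () -> partition.size();
                tasks.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> task : tasks) {
                total += task.get();
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException("a partition task failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because every row lands in exactly one partition and every partition is handled by exactly one task, the total returned by `runTasks` equals the number of input rows regardless of how many partitions or slots are chosen.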

[0057] It is to be noted that, in the embodiment of the present invention, the reason for setting the partitions is that, without re-partitioning, the number of partitions of the enquiry results would be the same as the number of partitions in the hive table, so the concurrency level, and in turn the number of tasks executed concurrently, could not be raised. With the re-partitioning step, the number of tasks executed concurrently can be increased, thereby accelerating execution.
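
A small worked calculation makes the effect of re-partitioning concrete. This is a sketch under the assumption that each task occupies one core (Spark's default), so the number of task slots is executors multiplied by cores per executor; the class and method names are hypothetical.

```java
// Hypothetical estimator for how many tasks can run at once.
class ConcurrencyEstimate {
    // Tasks that can run at the same time: limited by both partitions and slots.
    static int concurrentTasks(int partitions, int executors, int coresPerExecutor) {
        int slots = executors * coresPerExecutor;
        return Math.min(partitions, slots);
    }

    // Number of rounds needed to run all tasks (ceiling of partitions / slots).
    static int waves(int partitions, int executors, int coresPerExecutor) {
        int slots = executors * coresPerExecutor;
        return (partitions + slots - 1) / slots;
    }
}
```

For example, with 4 executors of 2 cores each, a result that inherits only 2 partitions from the hive table keeps 6 of the 8 slots idle (`concurrentTasks(2, 4, 2)` is 2), whereas re-partitioning into 16 partitions fills all 8 slots and finishes in 2 waves.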

[0058] S5 involves introducing the temporary data table to the graph database using the spark udf and the node attribute indexes.

[0059] Specifically, by means of the foregoing user-defined spark udf, together with the node attribute indexes created in the graph database, the temporary data table can be introduced into the graph database. In the embodiment of the present invention, since the hive database is connected through Spark, data can be introduced into the graph database directly, without having to export the data into a csv file, and real-time insertion can be achieved.

[0060] As a preferred implementation, in the embodiment of the present invention, before the step of registering a user-defined spark udf to a graph database program, the method further comprises:

[0061] setting up a driver in the spark udf for connecting the graph database and writing it in a static method.

[0062] Specifically, in the embodiment of the present invention, the driver in the user-defined spark udf for connecting the graph database (e.g., neo4j) must be written in a static method. This reduces the number of connections made between the spark udf and the graph database, thereby reducing resource consumption.
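
The benefit of holding the driver statically can be shown with a short sketch. It is illustrative only: `FakeDriver` is a hypothetical stand-in for a real graph database driver, and the counter exists purely to make the number of connections observable.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: the driver is created once per JVM, not once per udf call.
class SharedDriverUdf {
    static final AtomicInteger CONNECTIONS_OPENED = new AtomicInteger();

    // Hypothetical stand-in for a graph database driver.
    static final class FakeDriver {
        FakeDriver() {
            CONNECTIONS_OPENED.incrementAndGet();
        }

        String run(String statement) {
            return "ok: " + statement;
        }
    }

    private static final FakeDriver DRIVER;

    static {
        // Runs a single time, when the class is first loaded.
        DRIVER = new FakeDriver();
    }

    // Body of the udf: every call reuses the same driver.
    static String call(String accountNo) {
        return DRIVER.run("write account " + accountNo);
    }
}
```

If the driver were instead created inside `call`, a new connection would be opened for every row of the temporary data table.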

[0063] As a preferred implementation, in the embodiment of the present invention, before the step of registering a user-defined spark udf to a graph database program the method further comprises:

[0064] defining the input and output parameters of the spark udf.

[0065] Specifically, in the embodiment of the present invention, the input and output parameters of the spark udf must be defined. In other words, it is necessary to define the number and types of the input parameters and the type of the output parameter, and to ensure that the return value is never null.
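
As a minimal sketch of such a contract, the method below fixes five typed inputs and one String output and never returns null. The parameter names follow the example udf in the description; the status values "1" (row handled) and "0" (row skipped), and the rule for skipping, are assumptions made for illustration.

```java
// Hypothetical udf body with a fixed parameter list and a non-null result.
class RelationshipUdfContract {
    static String call(String payAcctNo, String ttlAmt, Integer ttlTimes,
                       String latestPayTime, String rcvrUser) {
        // Assumption: a row without both endpoints cannot form a relationship.
        if (payAcctNo == null || rcvrUser == null) {
            return "0";
        }
        // Writing the nodes and the relationship to the graph database would go here.
        return "1";
    }
}
```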

[0066] As a preferred implementation, in the embodiment of the present invention, after the step of introducing the temporary data table to the graph database the method further comprises:

[0067] turning off the driver of the graph database and the spark resource.

[0068] Specifically, after the temporary data table is introduced into the graph database, the driver of the graph database and the spark resource need to be turned off to save resources.
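
One common way to guarantee this shutdown in Java is try-with-resources. The sketch below uses a hypothetical `Resource` class in place of the real graph database driver and Spark session, and records events in a list so the closing order is visible.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: both resources are closed even if the import throws.
class ShutdownOrder {
    // Hypothetical stand-in for the graph database driver or the Spark session.
    static final class Resource implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Resource(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add(name + " closed");
        }
    }

    static List<String> runImport() {
        List<String> log = new ArrayList<>();
        // Resources are closed in reverse order of declaration.
        try (Resource spark = new Resource("spark", log);
             Resource driver = new Resource("driver", log)) {
            log.add("import done");
        }
        return log;
    }
}
```

Declaring the Spark resource first means the graph database driver is closed first and the Spark resource last, matching the order given in the description.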

[0069] As a preferred implementation, in the embodiment of the present invention, the step of using the spark resource to make enquiry to a hive database comprises:

[0070] using a reduce operator of the spark resource to perform corresponding computation on the enquiry-generated data.

[0071] Specifically, an action operator of spark is required to trigger execution of spark, because only action operators cause computation to be executed. As to selection of the operator, in the embodiment of the present invention, the reduce operator is used rather than other action operators such as collect and show. This is because operators like collect and show can have a negative impact on performance, and the show operator does not compute all of the data. In other words, in the embodiment of the present invention, the reduce operator is used to trigger execution of spark, which accelerates data introduction while eliminating the risk of data loss.
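
Java streams behave analogously and can illustrate the point without Spark: intermediate operations such as `map` are lazy, and nothing runs until a terminal operation is invoked. The sketch below is an analogy only; `ReduceTrigger` and `importAll` are hypothetical names, and incrementing a counter stands in for the udf writing one row to the graph database.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Analogy: reduce forces the mapping step to run for every row and
// returns only a single small value instead of the full data set.
class ReduceTrigger {
    static int importAll(List<String> rows, AtomicInteger written) {
        return rows.parallelStream()
                .map(row -> {
                    // Stand-in for the udf writing this row to the graph database.
                    written.incrementAndGet();
                    return 1;
                })
                // Terminal operation: sums the per-row status values.
                .reduce(0, Integer::sum);
    }
}
```

Gathering all rows back into one list, which is what collect does in Spark, would hold the whole result in a single place, whereas the reduction keeps only a running sum.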

[0072] FIG. 2 is a structural diagram of an apparatus for introducing data to a graph database according to one exemplary embodiment. As shown, the apparatus comprises:

[0073] a connecting module, for registering a user-defined spark udf to a graph database program, so that the graph database is connected to a spark resource through the spark udf;

[0074] a creating module, for creating node attribute indexes in the graph database;

[0075] an enquiring module, for using the spark resource to make enquiry to a hive database to acquire enquiry-generated data;

[0076] a partitioning module, for re-partitioning the enquiry-generated data and registering them as a temporary data table; and

[0077] an introducing module, for introducing the temporary data table to the graph database using the spark udf and the node attribute indexes.

[0078] As a preferred implementation, in the embodiment of the present invention, the apparatus further comprises:

[0079] a driver connecting module, for setting up a driver in the spark udf for connecting the graph database and writing it in a static method.

[0080] As a preferred implementation, in the embodiment of the present invention, the apparatus further comprises:

[0081] a configuring module, for defining the input and output parameters of the spark udf.

[0082] As a preferred implementation, in the embodiment of the present invention, the apparatus further comprises:

[0083] a deactivating module, for turning off the driver of the graph database and the spark resource.

[0084] As a preferred implementation, in the embodiment of the present invention, the step of using the spark resource to make enquiry to a hive database comprises:

[0085] using a reduce operator of the spark resource to perform corresponding computation on the enquiry-generated data.

[0086] To sum up, the technical schemes of the embodiments of the present invention provide the following beneficial effects:

[0087] 1. The method and apparatus for introducing data to a graph database of the embodiments of the present invention use the combination of spark and graph databases to realize real-time introduction of data while eliminating the need to export data into the csv format;

[0088] 2. The method and apparatus for introducing data to a graph database of the embodiments of the present invention use the spark technology to achieve easy optimization of spark performance and adjustment of speed of data introduction; and

[0089] 3. The method and apparatus for introducing data to a graph database of the embodiments of the present invention use the concurrency of spark to accelerate introduction of data without data loss.

[0090] It is to be noted that the division of work among the foregoing functional modules of the apparatus for introducing data to a graph database of the present embodiment is merely exemplary. In practical implementations, the work may be divided among different functional modules as needed. In other words, the internal architecture of the apparatus may be reconfigured with different functional modules to perform all or a part of the functions described previously. In addition, since the apparatus of the present embodiment and the method for introducing data to a graph database of the previous embodiment stem from the same conception, the details of its implementation can be learned from the description of the method embodiment and are not repeated herein.

[0091] As will be appreciated by people of ordinary skill in the art, all or a part of the steps of the method of the present invention as described previously may be implemented by a program instructing related hardware components. The program may be stored in a computer-readable storage medium and, when executed, performs the individual steps of the methods described in the foregoing embodiments. The storage medium may be a ROM/RAM, a hard drive, an optical disk, or the like.

[0092] The preferred embodiments of the present invention described previously are not intended to limit the present invention. Any modification, equivalent replacement, and improvement made under the spirit and principle of the present invention shall be included in the scope of the present invention.

Claims

What is claimed is:

1. A method for introducing data to a graph database, the method comprising:
registering a user-defined spark udf to a graph database program, so that the graph database is connected to a spark resource through the spark udf;
creating node attribute indexes in the graph database;
using the spark resource to make enquiry to a hive database to acquire enquiry-generated data;
re-partitioning the enquiry-generated data and registering them as a temporary data table; and
introducing the temporary data table to the graph database using the spark udf and the node attribute indexes.

2. The method for introducing data to a graph database of claim 1, wherein before the step of registering a user-defined spark udf to a graph database program, the method further comprises:
setting up a driver in the spark udf for connecting the graph database and writing it in a static method.

3. The method for introducing data to a graph database of claim 1 or 2, wherein before the step of registering a user-defined spark udf to a graph database program, the method further comprises:
defining parameters for exporting and importing the spark udf.

4. The method for introducing data to a graph database of claim 2, wherein after the step of introducing the temporary data table to the graph database, the method further comprises:
turning off the driver of the graph database and the spark resource.

5. The method for introducing data to a graph database of claim 1 or 2, wherein the step of using the spark resource to make enquiry to a hive database comprises:
using a reduce operator of the spark resource to perform corresponding computation on the enquiry-generated data.

6. An apparatus for introducing data to a graph database, the apparatus comprising:
a connecting module, for registering a user-defined spark udf to a graph database program, so that the graph database is connected to a spark resource through the spark udf;
a creating module, for creating node attribute indexes in the graph database;
an enquiring module, for using the spark resource to make enquiry to a hive database to acquire enquiry-generated data;
a partitioning module, for re-partitioning the enquiry-generated data and registering them as a temporary data table; and
an introducing module, for introducing the temporary data table to the graph database using the spark udf and the node attribute indexes.

7. The apparatus for introducing data to a graph database of claim 6, further comprising:
a driver connecting module, for setting up a driver in the spark udf for connecting the graph database and writing it in a static method.

8. The apparatus for introducing data to a graph database of claim 6 or 7, further comprising:
a configuring module, for defining parameters for exporting and importing the spark udf.

9. The apparatus for introducing data to a graph database of claim 7, further comprising:
a deactivating module, for turning off the driver of the graph database and the resource.

10. The apparatus for introducing data to a graph database of claim 6 or 7, using the spark resource to make enquiry to a hive database is achieved by:
using a reduce operator of the spark resource to perform corresponding computation on the enquiry-generated data.
