CN115470284A - Method and device for importing multi-source heterogeneous data source into Janusgraph database

Info

Publication number: CN115470284A
Application number: CN202211247857.4A
Authority: CN
Inventors: 周松; 周旺; 曹雪峰; 罗辑
Original assignee: CLP Cloud Digital Intelligence Technology Co Ltd
Current assignee: CLP Cloud Digital Intelligence Technology Co Ltd
Priority date: 2022-10-12
Filing date: 2022-10-12
Publication date: 2022-12-13

Abstract

The invention relates to the technical field of graph databases and graph storage, and provides a method and a device for importing a multi-source heterogeneous data source into a Janusgraph graph database. The method comprises the following steps: acquiring the data source type of the data to be imported, and obtaining the Datax read plug-in for that data source through matching; calculating the number of parallel read channels of the Datax read plug-in according to the volume of the data to be imported, and reading the data into the read channels in batches; extending a Datax write plug-in, initializing write channels, matching each initialized write channel with a read channel, and continuously moving data from each read channel into its corresponding write channel; and performing deduplication verification on the data in the write channels of the write plug-in, and writing the data that passes verification into the graph database. The method and the device according to the exemplary embodiments of the invention can adapt to data from a variety of heterogeneous data sources and import it rapidly.

Description

Method and device for importing multi-source heterogeneous data source into Janusgraph database

Technical Field

The invention relates to the technical field of graph databases and graph storage, and in particular to a method and a device for importing a multi-source heterogeneous data source into a Janusgraph database.

Background

With the vigorous development of knowledge graphs in various fields, more and more users store complex domain knowledge in a graph database by means of data mining, information processing and knowledge measurement, so that the association relationships between data can be found simply and quickly. However, in most application scenarios the existing data already resides in conventional relational databases or in other NoSQL databases, so data from these different sources needs to be imported uniformly into the Janusgraph graph database.

At present, the mainstream graph database import schemes fall into three modes: batch import based on the Janusgraph API, batch import based on GremlinServer, and batch import based on the Bulk loader. The first mode, batch import based on the Janusgraph API, frequently acquires graph instances and their transaction objects and occupies a large amount of memory, so it is only suitable for scenarios with a small data volume. The second mode is similar in nature to the first, except that insertion requests are submitted to GremlinServer as Gremlin statements for execution; the GremlinServer connection pool is costly to tune and maintain and occupies a large amount of memory. The third mode can import data in bulk at large scale through Hadoop and Spark clusters, but supports only a small number of data formats: json, csv, xml and kryo. In summary, the three modes support few data source types, consume a large amount of memory, and write slowly. When multi-source heterogeneous data needs to be imported quickly, the current schemes cannot meet users' requirements.

Therefore, how to provide a method that adapts to various heterogeneous data sources and rapidly imports their data into a graph database is a technical problem to be solved urgently.

Disclosure of Invention

In view of the above, the present invention aims to overcome the following disadvantages of existing schemes for importing data into a Janusgraph database: few data source types are supported for import, and writing is slow.

In one aspect, the invention provides a method for importing a multi-source heterogeneous data source into a Janusgraph database, which comprises the following steps:

step S1: acquiring the type of a data source of data to be imported, and acquiring a read plug-in of a Datax of the data source through matching;

step S2: according to the data volume of the data to be imported, calculating the number of parallel reading channels of the reading plug-in of the Datax, and reading the data into the reading channels in batches;

step S3: extending a write plug-in of the Datax, initializing write channels, matching each initialized write channel with a read channel, and continuously moving data from each read channel into its corresponding write channel;

step S4: performing deduplication verification on the data in the write channels of the write plug-in, and writing the data that passes deduplication verification into the graph database.

Further, in the method for importing the multi-source heterogeneous data source into the Janusgraph database of the invention, step S1 comprises:

step S11: receiving link information of a data source, and acquiring metadata information of the data source of current data through a universal database metadata object, wherein the metadata information comprises a database name and version information;

step S12: finding the corresponding plug-in name by fuzzy string matching on the acquired database name and version information.

Further, in the method for importing the multi-source heterogeneous data source into the Janusgraph database of the invention, step S2 comprises:

step S21: acquiring the total amount of the data to be read through the general query language of the database, using the link information of the data source received in step S11;

step S22: setting an upper limit on the data read in each batch, and calculating the number of parallel channels from that upper limit and the total amount of the data to be read;

step S23: reading the data into the parallel channels in batches, each batch bounded by the upper limit set in step S22.

Further, in the method for importing the multi-source heterogeneous data source into the Janusgraph database according to the present invention, in step S22 the number of parallel channels is calculated from the upper limit of the data read in each batch and the total amount of the data to be read, as follows:

N = ROUNDDOWN((T + M - 1) / M, 0)

where N is the number of parallel channels, T is the total amount of data to be read, and M is the upper limit of the data read in each batch.

Further, in the method for importing the multi-source heterogeneous data source into the Janusgraph database, when the calculated number of parallel channels is less than 1, the number of parallel channels is corrected to 1.
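For a non-negative integer T and a positive integer M, rounding the quotient down as in the formula is the same as taking the ceiling of T/M. Under that assumption (the integer domains are not stated in the text), the formula and the correction to 1 can be restated together as:

```latex
N = \max\!\left(1,\ \left\lfloor \frac{T + M - 1}{M} \right\rfloor\right)
  = \max\!\left(1,\ \left\lceil \frac{T}{M} \right\rceil\right),
\qquad T \in \mathbb{Z}_{\ge 0},\ M \in \mathbb{Z}_{\ge 1}.
```

Under these assumptions the correction only takes effect when T = 0, because any T of at least 1 already yields N of at least 1.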

Further, in the method for importing the multi-source heterogeneous data source into the Janusgraph database of the invention, step S3 comprises:

step S31: writing a JanusgraphWriter plug-in according to the plug-in format of Datax;

step S32: initializing, through the written JanusgraphWriter plug-in, as many write channels as there are read channels;

step S33: numbering the write channels and the read channels respectively, and putting the numbered write channels and the numbered read channels in one-to-one correspondence;

step S34: monitoring the read channels through the JanusgraphWriter plug-in, and, when a read channel has data, moving the data of that read channel into the corresponding write channel.

Further, in the method for importing the multi-source heterogeneous data source into the Janusgraph database of the invention, step S4 comprises: verifying whether the data primary key of the data acquired by a write channel exists in a Redis cache; when the data primary key exists in the Redis cache, discarding the data and reading in the next piece of data; and when the data primary key does not exist in the Redis cache, writing the data into the graph database.

Further, in the method for importing a multi-source heterogeneous data source into a Janusgraph database according to the present invention, in step S4, when the data primary key does not exist in the Redis cache, writing the data into the graph database includes: storing the data primary key of the data and the association information between data primary keys into the Hbase of the Janusgraph database by calling an interface of the Janusgraph database, and storing the full information of the data into the Elasticsearch of the Janusgraph database.

Further, the method for importing the multi-source heterogeneous data source into the Janusgraph database further comprises: after the data is successfully written into the Janusgraph database, caching the data primary key of the data into the Redis cache for subsequent deduplication verification.

In another aspect, the present invention provides an apparatus for importing a multi-source heterogeneous data source into a Janusgraph database, including:

an acquisition unit, configured to determine the data source type of the data to be imported, and exposing an input port and an output port; the input of the input port is the data source link information of the data to be imported, and the output of the output port is the metadata information of that data source, including information such as the data source name and version;

a response unit, configured to obtain, by matching on the data source type, the Datax read plug-in for the data source, and exposing an input port and an output port; the input of the input port is the name and version of the data source, and the output of the output port is the information of the read plug-in;

a computing unit, configured to calculate the number of parallel read channels of the Datax read plug-in according to the volume of the data to be read, and exposing an input port and an output port; the input of the input port is the total data volume of the data source to be imported, and the output of the output port is the number of parallel channels;

a reading unit, configured to provide the read channels for reading the data source to be imported, and providing a plurality of input ports and output ports that can be enabled and disabled;

and a writing unit, configured to provide the write channels that perform deduplication verification on the data and write it into the Janusgraph database, and providing a plurality of input ports and output ports that can be enabled and disabled.

The device for importing the multi-source heterogeneous data source into the Janusgraph database has the following beneficial effects:

1. data in various data source formats can be imported, and the data source access capability of a graph database is enriched;

2. the data reading speed is increased based on multiple channels;

3. based on the Redis cache, duplicate data can be removed more efficiently during import, and extra memory consumption is reduced;

4. based on multiple channels, the data writing speed of the Janusgraph database can be improved, and more data can be written into the graph database in unit time.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings based on them without creative effort.

Fig. 1 is a flowchart of a method for importing a multi-source heterogeneous data source into a Janusgraph database according to an exemplary first embodiment of the present invention.

Fig. 2 is a flowchart of a method for importing a multi-source heterogeneous data source into a Janusgraph database according to an exemplary second embodiment of the present invention.

Fig. 3 is a flowchart of a method for importing a multi-source heterogeneous data source into a Janusgraph database according to an exemplary third embodiment of the present invention.

Fig. 4 is a flowchart of a method for importing a multi-source heterogeneous data source into a Janusgraph database according to an exemplary fifth embodiment of the present invention.

Fig. 5 is an architecture diagram of an apparatus for importing a multi-source heterogeneous data source into a Janusgraph database according to an exemplary eighth embodiment of the present invention.

Detailed Description

Embodiments of the present invention are described in detail below with reference to the accompanying drawings.

It should be noted that, in the case of no conflict, the features in the following embodiments and examples may be combined with each other; moreover, all other embodiments that can be derived by one of ordinary skill in the art from the embodiments disclosed herein without making any creative effort fall within the scope of the present disclosure.

It is noted that various aspects of the embodiments are described below within the scope of the appended claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the disclosure, one skilled in the art should appreciate that one aspect described herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. Additionally, such an apparatus may be implemented and/or such a method may be practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.

The technical principle of the invention is as follows. The data source type is acquired. The read plug-in for the data source is obtained by matching, relying on the ability of the open source component Datax to read many types of data source. The number of parallel channels is calculated dynamically, using the channel mechanism of the Datax read plug-in, to increase the speed at which the data source is read. The Datax write plug-in is extended to provide the ability to write into a Janusgraph graph database. The open source cache component Redis is used to deduplicate the data read: if the primary key of a record already exists in Redis, the rest of the record does not need to be read and the record is discarded directly; if the primary key does not exist in Redis, the record is written into the graph database. The storage core of the Janusgraph database consists of Hbase and Elasticsearch: when a record is written, its primary key and the association information between primary keys are stored in Hbase, its full information is stored in Elasticsearch, and, once the record has been stored successfully, its primary key is stored in the Redis cache for subsequent deduplication. In this way Datax enriches the data sources that the graph database can access and increases the speed at which they are read, and the Redis cache provides fast batch deduplication on write. As a result, more data sources can be matched and more data can be written into the graph database per unit time.

The terms referred to in the following examples are to be construed as follows:

Janusgraph database: a graph is a data structure used to describe individuals in the real world and the network of relations among them, and the Janusgraph database is a database used for storing graph structures. The storage core of Janusgraph relies on open source distributed components such as HBase and Elasticsearch, which gives it distributed operation and large-scale scalability. Its general, expressive structure can be used to model all kinds of scenarios, from the construction of a space rocket to a road system, and from the supply chain and origin of food to people's medical histories, among many others.

Datax: DataX is an offline synchronization tool for heterogeneous data sources, open-sourced by Alibaba. It aims to provide stable and efficient data synchronization between heterogeneous data sources such as relational databases (MySQL, Oracle and the like), HDFS, Hive, ODPS, HBase and FTP.

Hbase: HBase is a highly reliable, high-performance, column-oriented, scalable distributed database system with real-time read and write, based on the Google BigTable paper and built on HDFS. Hbase can be used when real-time random read-write access to very large datasets is required.

Elasticsearch: a distributed, highly scalable, highly real-time search and data analysis engine that makes it easy to search, analyze and explore large amounts of data. Taking advantage of the horizontal scalability of Elasticsearch makes data more valuable in a production environment.

Fig. 1 is a flowchart of a method for importing a multi-source heterogeneous data source into a Janusgraph database according to an exemplary first embodiment of the present invention. As shown in Fig. 1, the method in this embodiment includes:

step S1: acquiring a data source type of data to be imported, and acquiring a Datax read plug-in of the data source through matching;

Fig. 2 is a flowchart of a method for importing a multi-source heterogeneous data source into a Janusgraph database according to an exemplary second embodiment of the present invention. This embodiment is a preferred embodiment of the method shown in Fig. 1. As shown in Fig. 2, step S1 of the method of this embodiment includes:

In practical applications, for example, the database name is mysql and the version information is 5.36. The method of this embodiment searches the Datax plug-in list and finds that the matching read plug-in name is Mysql5Reader. The read plug-in name of every data source supported by Datax can be obtained in this way, which enriches the types of data source that can be imported into the Janusgraph database.
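The matching in step S12 can be illustrated with a minimal Python sketch. The text does not specify the fuzzy-matching algorithm, so the use of `difflib.SequenceMatcher`, the function name `match_reader_plugin`, the way the lookup key is built from the database name and major version, and every plug-in name in the list other than Mysql5Reader are assumptions made for illustration only.

```python
from difflib import SequenceMatcher

# Hypothetical plug-in list; only "Mysql5Reader" is named in the text.
READER_PLUGINS = ["Mysql5Reader", "Mysql8Reader", "OracleReader", "PostgresqlReader"]


def match_reader_plugin(db_name, version, plugins=READER_PLUGINS):
    """Return the plug-in name most similar to '<name><major version>reader'."""
    major = version.split(".")[0]
    key = f"{db_name}{major}reader".lower()
    # Score every candidate by character similarity and keep the best one.
    return max(plugins, key=lambda p: SequenceMatcher(None, key, p.lower()).ratio())
```

With the example values from the text, `match_reader_plugin("mysql", "5.36")` builds the key `mysql5reader` and selects `Mysql5Reader`. A data source whose version has no dedicated plug-in still resolves to the closest name in the list, which is the point of matching fuzzily rather than exactly.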

Fig. 3 is a flowchart of a method for importing a multi-source heterogeneous data source into a Janusgraph database according to an exemplary third embodiment of the present invention. This embodiment is a preferred embodiment of the method shown in Fig. 1. As shown in Fig. 3, step S2 of the method of this embodiment includes:

An exemplary fourth embodiment of the present invention provides a method for importing a multi-source heterogeneous data source into a Janusgraph database. This embodiment is a preferred embodiment of the method shown in Fig. 1 and Fig. 3. In step S22 of the method in this embodiment, the number of parallel channels is calculated from the upper limit of the data read in each batch and the total amount of the data to be read, as follows:

N = ROUNDDOWN((T + M - 1) / M, 0)

In the method of this embodiment, when the number of parallel channels obtained by calculation is less than 1, the number of parallel channels is corrected to 1.

In practical applications, to control the utilization of system resources, the upper limit of the data read from the data source in each batch may be set to 5000. Using the formula of this embodiment, if the total amount of data is 50000, the number of parallel channels of the Datax read plug-in is 10; if the total amount of data is 4999, the number of parallel channels is 1. Once the number of parallel channels has been calculated, the records, each containing for example a primary key, a number, foreign keys referencing the primary keys of other data, a name, a time and remarks, are read into the channels in batches of 5000, which ensures data reading efficiency.
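The two worked values can be checked with a short Python sketch of the channel calculation. The function names are hypothetical, and the per-channel (offset, size) split in `batch_plan` is an assumption: the text only states that data is read into the channels in batches bounded by the upper limit.

```python
def channel_count(total, batch_limit):
    """N = ROUNDDOWN((T + M - 1) / M, 0), corrected to 1 when below 1."""
    if batch_limit < 1:
        raise ValueError("batch_limit must be a positive integer")
    # Integer floor division plays the role of ROUNDDOWN(..., 0).
    return max(1, (total + batch_limit - 1) // batch_limit)


def batch_plan(total, batch_limit):
    """Assumed split: one (offset, size) batch of at most batch_limit rows per channel."""
    n = channel_count(total, batch_limit)
    return [
        (i * batch_limit, max(0, min(batch_limit, total - i * batch_limit)))
        for i in range(n)
    ]
```

For a total of 50000 and a limit of 5000, `channel_count` gives 10 and the last planned batch starts at offset 45000. For a total of 4999 it gives 1, with a single batch of 4999 records.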

Fig. 4 is a flowchart of a method for importing a multi-source heterogeneous data source into a Janusgraph database according to an exemplary fifth embodiment of the present invention. This embodiment is a preferred embodiment of the method shown in Fig. 1. As shown in Fig. 4, step S3 of the method of this embodiment includes:

step S33: the writing channels and the reading channels are numbered respectively, and the numbered writing channels and the numbered reading channels correspond to each other one by one;

In this embodiment, a JanusgraphWriter plug-in is written according to the plug-in format of Datax. The plug-in initializes as many write channels as there are read channels and associates the write channel numbers with the read channel numbers one to one, which ensures that the data read stays in the same batch. The plug-in also monitors the read channels and fetches data from a read channel whenever that channel has data. For example, assuming that the read channels are numbered A, B and C, the JanusgraphWriter plug-in creates 3 write channels numbered A, B and C. The data of write channel A is always fetched from read channel A: when read channel A holds a record containing information such as a primary key, a number, a foreign key, a name, a time and remarks, that record is pushed into write channel A.
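The one-to-one pairing of read and write channels can be sketched in Python with one worker thread per channel pair. This is an illustration under stated assumptions, not the Datax plug-in interface: modelling a read channel as a `queue.Queue`, modelling a write channel as a list, the `END_OF_DATA` marker and the function names are all hypothetical.

```python
import queue
import threading

END_OF_DATA = object()  # hypothetical marker: the read channel is exhausted


def pump(read_channel, write_channel):
    """Move records from one read channel into its paired write channel."""
    while True:
        record = read_channel.get()  # blocks until the read channel has data
        if record is END_OF_DATA:
            break
        write_channel.append(record)


def run_paired_channels(read_channels):
    """Create one write channel per read channel, keyed by the same number."""
    write_channels = {name: [] for name in read_channels}
    workers = [
        threading.Thread(target=pump, args=(read_channels[name], write_channels[name]))
        for name in read_channels
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return write_channels
```

Given read channels keyed A, B and C, the returned dictionary has write channels A, B and C, and the records in write channel A come only from read channel A, in the order they were read.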

An exemplary sixth embodiment of the present invention provides a method for importing a multi-source heterogeneous data source into a Janusgraph database, which is a preferred embodiment of the method shown in Fig. 1.

Step S4 of the method of this embodiment includes: verifying whether a data primary key of data acquired by a writing channel exists in a Redis cache or not, and discarding the data and reading in the next piece of data when the data primary key exists in the Redis cache; and when the data primary key does not exist in the Redis cache, writing the data into the graph database.

In this embodiment, the low-latency query capability of Redis avoids querying the Janusgraph database directly for primary keys during deduplication, which greatly improves the efficiency of duplicate checking.

In step S4 of the method according to this embodiment, when the data primary key does not exist in the Redis cache, writing the data into the graph database includes: storing the data primary key of the data and the association information between data primary keys into the Hbase of the Janusgraph database by calling an interface of the Janusgraph database, and storing the full information of the data into the Elasticsearch of the Janusgraph database.

For example, take a piece of data containing information such as a primary key, a number, a foreign key, a name, a time and remarks. The JanusgraphWriter plug-in strips the primary key and foreign key information from the data and stores it into Hbase, and stores the full data into Elasticsearch.
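The split described in this example can be sketched in Python as follows. The field names and the function name `split_record` are hypothetical, since the text only lists the kinds of information a record contains, and the sketch stops at producing the two parts: how each part reaches Hbase and Elasticsearch through the Janusgraph interface is not shown.

```python
# Hypothetical field names for the key information stripped from a record.
KEY_FIELDS = ("primary_key", "foreign_key")


def split_record(record):
    """Return (key_part, full_document) for one record.

    key_part holds the primary key and foreign key (destined for Hbase);
    full_document is an unchanged copy of the record (destined for Elasticsearch).
    """
    key_part = {field: record[field] for field in KEY_FIELDS if field in record}
    full_document = dict(record)
    return key_part, full_document
```

Copying the record rather than returning it directly keeps the two parts independent, so a later change to one does not silently alter the other.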

An exemplary seventh embodiment of the present invention provides a method for importing a multi-source heterogeneous data source into a Janusgraph database, which is a preferred embodiment of the method shown in Fig. 1. The method of this embodiment further comprises: after the data is successfully written into the Janusgraph database, caching the data primary key of the data into the Redis cache for subsequent deduplication verification.
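The check, write and cache sequence of the sixth and seventh embodiments can be sketched in Python. To keep the sketch self-contained, an in-memory class stands in for the Redis cache; that class, the `primary_key` field name, and the `write_to_graph` callback that returns True on success are assumptions made for illustration, not part of the described plug-in.

```python
class InMemoryKeyCache:
    """Stand-in for the Redis cache: remembers primary keys already written."""

    def __init__(self):
        self._keys = set()

    def contains(self, key):
        return key in self._keys

    def add(self, key):
        self._keys.add(key)


def write_with_dedup(records, cache, write_to_graph):
    """Write each record at most once; return the number of records written."""
    written = 0
    for record in records:
        key = record["primary_key"]
        if cache.contains(key):
            continue  # primary key already cached: discard and read the next record
        if write_to_graph(record):
            cache.add(key)  # cache the key only after a successful write
            written += 1
    return written
```

Because the key is cached only after the write succeeds, a record whose write fails is not marked as seen and can be retried later. Note that the check and the later insert are two separate steps in this sketch, so two channels handling the same primary key at the same moment could both pass the check; the text does not say how that case is handled.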

Fig. 5 is an architecture diagram of a device for importing a multi-source heterogeneous data source into a Janusgraph database according to an exemplary eighth embodiment of the present invention. As shown in Fig. 5, the device of this embodiment includes:

The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for importing a multi-source heterogeneous data source into a Janusgraph database, characterized by comprising the following steps:

2. The method for importing the multi-source heterogeneous data source into the Janusgraph database according to claim 1, wherein the step S1 comprises:

3. The method for importing the multi-source heterogeneous data source into the Janusgraph database according to claim 1, wherein the step S2 comprises:

4. The method for importing the multi-source heterogeneous data source into the Janusgraph database according to claim 3, wherein in step S22, the number of parallel channels is calculated from the upper limit of the data read in each batch and the total amount of the data to be read, as follows:

N = ROUNDDOWN((T + M - 1) / M, 0)

5. The method for importing the multi-source heterogeneous data source into the Janusgraph database according to claim 3, wherein when the number of the calculated parallel channels is less than 1, the number of the parallel channels is corrected to 1.

6. The method for importing the multi-source heterogeneous data source into the Janusgraph database according to claim 1, wherein the step S3 comprises:

step S31: writing a JanusgraphWriter plug-in according to the plug-in format of the Datax;

step S32: initializing, through the written JanusgraphWriter plug-in, as many write channels as there are read channels;

step S34: monitoring the read channels through the JanusgraphWriter plug-in, and, when a read channel has data, moving the data of that read channel into the corresponding write channel.

7. The method for importing the multi-source heterogeneous data source into the Janusgraph database according to claim 1, wherein the step S4 includes: verifying whether the data primary key of the data acquired by a write channel exists in a Redis cache; when the data primary key exists in the Redis cache, discarding the data and reading in the next piece of data; and when the data primary key does not exist in the Redis cache, writing the data into the graph database.

8. The method for importing the multi-source heterogeneous data source into the Janusgraph database according to claim 1, wherein in step S4, when the data primary key does not exist in the Redis cache, writing the data into the graph database includes: storing the data primary key of the data and the association information between data primary keys into the Hbase of the Janusgraph database by calling an interface of the Janusgraph database, and storing the full information of the data into the Elasticsearch of the Janusgraph database.

9. The method for importing the multi-source heterogeneous data source into the Janusgraph database according to claim 1, further comprising: and after the data is successfully written into the Janusgraph database, caching the data primary key of the data into a Redis cache for subsequent deduplication verification.

10. An apparatus for importing a multi-source heterogeneous data source into a Janusgraph database, the apparatus comprising:

an acquisition unit, configured to determine the data source type of the data to be imported, and exposing an input port and an output port; the input of the input port is the data source link information of the data to be imported, and the output of the output port is the metadata information of that data source, including information such as the data source name and version;

a computing unit, configured to calculate the number of parallel read channels of the Datax read plug-in according to the volume of the data to be read, and exposing an input port and an output port; the input of the input port is the total data volume of the data source to be imported, and the output of the output port is the number of parallel channels;

and a writing unit, configured to provide the write channels that perform deduplication verification on the data and write it into the Janusgraph database, and providing a plurality of input ports and output ports that can be enabled and disabled.