CN114138735A

CN114138735A - Method for quickly loading JanusGraph data in batches

Info

Publication number: CN114138735A
Application number: CN202111267971.9A
Authority: CN
Inventors: 马杲灵; 游飞龙; 张林林; 汪睿铭; 陈雪; 石尧; 董博; 廖海峰
Original assignee: Guizhou Shulian Mingpin Technology Co ltd
Current assignee: Guizhou Shulian Mingpin Technology Co ltd
Priority date: 2021-10-29
Filing date: 2021-10-29
Publication date: 2022-03-04

Abstract

The invention relates to a method for quickly loading JanusGraph data in batches, comprising the following stages. Data preparation stage: the vertex and edge data to be batch-loaded are stored in sharded Hive tables so that the Spark computing engine can read them in parallel, the Schema of the JanusGraph graph data to be loaded is created, and the mapping between the Hive tables and the Schema is configured. HBase data loading stage: the Spark computing engine reads the vertices and edges in the Hive tables in parallel, constructs an RDD (Resilient Distributed Dataset) data set according to JanusGraph's storage structure and encoding in the HBase database, and loads the RDD data set, in the form of HFile files, into the HBase database that backs the JanusGraph storage. Elasticsearch index construction stage: the Spark computing engine reads the vertices and edges in the Hive tables in parallel, extracts the vertex or edge properties that need to be indexed, constructs an RDD data set according to JanusGraph's storage structure in the Elasticsearch index, and writes the RDD data set into the Elasticsearch index in parallel.

Description

Method for quickly loading JanusGraph data in batches

Technical Field

The invention relates to the technical field of big data in computer science, and in particular to a method for quickly batch-loading data into the distributed graph database JanusGraph.

Background

A graph is a mathematical object that represents relationships between entities. It is expressed as a pair G = (V, E) consisting of N vertices (the vertex set V) and M edges (the edge set E). Each vertex is incident to some number of edges (at most M), and each edge connects two vertices. An edge may have a direction: if the graph contains directed edges it is called a directed graph, otherwise it is an undirected graph.

A graph database is a type of NoSQL (non-relational) database that uses graph theory to store relationship information between entities; the most common example is interpersonal relationships in a social network. Relational databases are poorly suited to storing such "relationship" data: queries over it are complex, slow, and fall short of expectations. The design of graph databases addresses exactly this deficiency.

JanusGraph is an open-source distributed graph database that is widely used in graph data analysis because of its generality, high performance, and open source code. JanusGraph supports databases such as Cassandra and HBase as storage backends that hold the complete graph structure data, and supports Elasticsearch, Solr, and the like as index backends, which enables real-time retrieval of indexed vertices and edges. Because HBase and Elasticsearch are widely used in the big data field and perform well, this scheme targets the scenario in which an HBase cluster serves as storage and an Elasticsearch cluster serves as the index.

In this scenario, the existing JanusGraph has the following problems when loading data:

(1) Data loaded into the graph through the API provided by JanusGraph is committed transactionally, which adds performance overhead to data loading, whereas offline data loading does not need transactions to guarantee data consistency.

(2) The API provided by JanusGraph calls the API of the HBase database to store data. During this period HBase frequently performs flush, compact, and split operations, which consumes a large amount of resources unnecessarily and reduces loading efficiency. In addition, if the rate of API calls exceeds the write capacity of the HBase database, some writes may be lost, which results in missing graph data.

(3) The API provided by JanusGraph first calls the API of the HBase database to store the data and, only after that call returns successfully, calls the API of the Elasticsearch index to build the index. This serial write pattern cannot make full use of the cluster resources of HBase and Elasticsearch.

Disclosure of Invention

The invention aims to provide a method for quickly loading JanusGraph data in batches that makes full use of cluster resources, improves the performance of batch-loading data into JanusGraph, and solves two problems: the slow batch loading of massive data, and the loss of graph data caused by partial write loss that may occur when massive data is written in parallel through the API of the HBase database.

In order to achieve the above object, the embodiments of the present invention provide the following technical solutions:

A method for quickly loading JanusGraph data in batches comprises the following stages:

Data preparation stage: storing the vertex and edge data to be batch-loaded in sharded Hive tables so that the Spark computing engine can read them in parallel, creating the Schema of the JanusGraph graph data to be loaded, and configuring the mapping between the Hive tables and the Schema;

HBase data loading stage: using the Spark computing engine to read the vertices and edges in the Hive tables in parallel, constructing a first RDD data set according to JanusGraph's storage structure and encoding in the HBase database, and loading the first RDD data set, in the form of HFile files, into the HBase database that backs the JanusGraph storage;

Elasticsearch index construction stage: using the Spark computing engine to read the vertices and edges in the Hive tables in parallel, extracting the vertex or edge properties that need to be indexed, constructing a second RDD data set according to JanusGraph's storage structure in the Elasticsearch index, and writing the second RDD data set into the Elasticsearch index in parallel.

In this scheme, the graph data is written into the HBase database through HBase's HFile bulk-loading mechanism, and the Elasticsearch index is built by the Spark computing engine, which bypasses the transaction overhead of calling the JanusGraph API.

Further, the step of storing the vertex and edge data to be batch-loaded in sharded Hive tables so that the Spark computing engine can read them in parallel includes:

dividing the data to be batch-loaded into the graph into vertices and edges, forming vertex Hive tables and edge Hive tables respectively, and storing each vertex Hive table and each edge Hive table in shards;

using the Spark computing engine to assign a globally unique vertex ID to every vertex in all vertex Hive tables, replacing the vertices referenced by every edge in all edge Hive tables with the assigned vertex IDs, and assigning a globally unique edge ID to every edge in all edge Hive tables.

Further, the step of creating the Schema of the JanusGraph graph data to be loaded and configuring the mapping between the Hive tables and the Schema includes:

creating the Schema of the JanusGraph graph data to be loaded, the Schema comprising PropertyKey (property), VertexLabel (vertex label), EdgeLabel (edge label), and Mixed index;

creating a configuration file that maps the Hive table names and fields of the vertices and edges to the labels and properties of JanusGraph.

In the above scheme, data is loaded without using the API provided by JanusGraph, so the JanusGraph internal IDs of the vertices and edges in the Hive tables must be allocated in advance. The JanusGraph Schema must also be created and the mapping between the Hive tables and the Schema configured, so that data loading and index construction can place data into the graph according to the configured mapping.
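The ID pre-allocation described above can be sketched in a few lines. This is a single-process illustration only, not the patented Spark job: the function name `assign_ids` and the field names passed to it are assumptions made for the example, and the way JanusGraph actually lays out its internal IDs is not reproduced here. On a cluster, a comparable result could plausibly be obtained with Spark's `zipWithUniqueId` followed by a join on the business key.

```python
from itertools import count


def assign_ids(vertices, edges, key_field, src_field, dst_field):
    """Assign unique integer IDs to vertices and edges (illustrative only).

    vertices: list of dicts, each holding a business key under key_field.
    edges: list of dicts whose src_field/dst_field hold business keys.
    Returns (vertices_with_ids, edges_with_ids); edge endpoints are
    rewritten from business keys to the assigned vertex IDs.
    """
    vertex_ids = count(1)
    key_to_id = {}
    id_vertices = []
    for vertex in vertices:
        vid = next(vertex_ids)
        key_to_id[vertex[key_field]] = vid
        id_vertices.append({"vertex_id": vid, **vertex})

    edge_ids = count(1)
    id_edges = []
    for edge in edges:
        rewritten = dict(edge)
        # Replace the referenced vertices with their assigned IDs.
        rewritten[src_field] = key_to_id[edge[src_field]]
        rewritten[dst_field] = key_to_id[edge[dst_field]]
        id_edges.append({"edge_id": next(edge_ids), **rewritten})
    return id_vertices, id_edges
```

Keeping vertex and edge counters separate mirrors the text, which requires uniqueness within all vertices and within all edges respectively.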

Further, the step of using the Spark computing engine to read the vertices and edges in the Hive tables in parallel and constructing the first RDD data set according to JanusGraph's storage structure and encoding in the HBase database includes:

connecting to the Schema created in JanusGraph, obtaining the relevant PropertyKey (property), VertexLabel (vertex label), and EdgeLabel (edge label) information, and reading the mapping;

using the Spark computing engine to read, in parallel, the data in the vertex Hive tables and edge Hive tables to which IDs have been assigned, converting and encoding the data according to JanusGraph's storage structure and encoding in the HBase database so that each record contains the HBase row key, column family, column name, and value, and merging all converted and encoded records into the first RDD data set.

Further, the step of loading the first RDD data set, in the form of HFile files, into the HBase database that backs the JanusGraph storage includes:

reading the row key partition information of the HBase database that backs JanusGraph, repartitioning the generated first RDD data set accordingly, sorting the data within each partition by row key, column family, and column name, and then storing the first RDD data set as sharded HFile files in HDFS (Hadoop Distributed File System);

reading the HFile files in HDFS and loading them into the HBase database that backs the JanusGraph storage.

In this scheme, during the HBase data loading stage, the Spark computing engine reads the vertices and edges in the Hive tables in parallel and constructs an RDD (Resilient Distributed Dataset) data set according to JanusGraph's data storage structure and encoding in the HBase database. The RDD data set is then partitioned and sorted according to the partitioning of the HBase database to generate HFile files. This ensures that the HFile file generated from a single partition does not span multiple HBase partitions and that the data within each partition is stored in order, so that loading the HFile files into the HBase database causes neither errors nor performance degradation. Finally, all generated HFile files are loaded into the HBase database.
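The partition-then-sort step can be illustrated with plain Python. The sketch below assumes that the HBase partition boundaries are available as a sorted list of split keys and that each record is a `(row key, column family, column name, value)` tuple of bytes; the function names are illustrative, and writing the actual HFile format is out of scope.

```python
from bisect import bisect_right


def region_index(row_key, split_keys):
    """Return the index of the HBase partition that owns row_key.

    split_keys are the sorted start keys of partitions 1..n; partition 0
    starts at the empty key. A row key equal to a split key belongs to
    the partition that starts at that key.
    """
    return bisect_right(split_keys, row_key)


def partition_and_sort(cells, split_keys):
    """Group cells by owning partition, then sort each group.

    Sorting by (row key, column family, column name) gives the order the
    text requires before a partition is written out as one file.
    """
    partitions = [[] for _ in range(len(split_keys) + 1)]
    for cell in cells:
        partitions[region_index(cell[0], split_keys)].append(cell)
    for part in partitions:
        part.sort(key=lambda c: (c[0], c[1], c[2]))
    return partitions
```

Because every output group maps to exactly one partition of the target table, a file written from one group cannot straddle a partition boundary, which is the property the paragraph above relies on. Python compares `bytes` lexicographically by unsigned byte value, which matches how HBase orders row keys.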

Because the HBase database is loaded from HFile files, the HBase API is not used for loading data, and the transaction overhead of the JanusGraph API is avoided as well. Data loading and index construction can therefore run in parallel, which greatly improves the overall efficiency of loading JanusGraph data.

Further, the detailed steps of the Elasticsearch index construction phase include:

obtaining the created Mixed index information, which comprises the index name, index type, label, and indexed property names, and reading the configured mapping;

using the Spark computing engine to read, in parallel, the vertex Hive tables and edge Hive tables prepared in the data preparation stage, checking each vertex or edge against the index information, and, if an index is configured for it, extracting the property data required by the index and constructing a second RDD data set according to JanusGraph's index storage structure in Elasticsearch;

writing the extracted and converted second RDD data set of vertices or edges into the Elasticsearch index in parallel, at the location where JanusGraph stores its Elasticsearch index.

In the above scheme, because the data is loaded through HBase's HFile bulk-loading mechanism rather than through the JanusGraph API, the index data must be written separately. To keep index writing efficient, the invention uses Spark to read and load the index data in parallel. In the Elasticsearch index construction stage, Spark reads the vertices and edges in the Hive tables in parallel, extracts the vertex or edge properties that need to be indexed, constructs a second RDD data set according to JanusGraph's index storage structure in Elasticsearch, and then writes the second RDD data set in parallel into the location that JanusGraph designates in Elasticsearch.

Compared with the prior art, the invention has the beneficial effects that:

(1) The graph data is written into the HBase database through HBase's HFile bulk-loading mechanism, and the Elasticsearch index is built by the Spark computing engine, which bypasses the transaction overhead of calling the JanusGraph API.

(2) Writing the graph data into the HBase database through the HFile bulk-loading mechanism avoids the unnecessary resource consumption caused by the HBase API and solves the problem of graph data loss caused by partial write loss that may occur when massive data is written in parallel through the HBase API.

(3) Loading the HBase database and building the Elasticsearch index can be executed at the same time, that is, stage (II) and stage (III) can run concurrently. This makes full use of the HBase cluster and Elasticsearch cluster resources and greatly improves the performance of batch-loading massive data into JanusGraph.

(4) The generated HFile files can be kept as a backup snapshot of the graph data, and rebuilding the graph from these HFile files is much more efficient.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.

FIG. 1 is a flow chart of the method for quickly loading JanusGraph data in batches according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

Example 1:

The invention is particularly suitable for batch-loading massive data, on the scale of a hundred million records or more, into JanusGraph, and significantly improves graph loading performance in the scenario where an HBase cluster is used as storage and an Elasticsearch cluster as the index. In a production environment, after the massive existing data has been quickly batch-loaded into the graph by this method, incremental data is written into the graph through the JanusGraph API, and the consistency between the stored data and the index is guaranteed by the transactions that JanusGraph provides. The invention is implemented by the following technical scheme. As shown in FIG. 1, a method for quickly loading JanusGraph data in batches comprises the following stages:

(I) Data preparation stage: the vertex and edge data to be batch-loaded into the graph are stored in sharded Hive tables so that the Spark computing engine can read them in parallel, the Schema of the JanusGraph graph data to be loaded is created, and the mapping between the Hive tables and the Schema is configured.

Step 1-1, divide the data to be batch-loaded into the graph into vertices and edges, form vertex Hive tables and edge Hive tables respectively, and store each vertex Hive table and each edge Hive table in shards.

Step 1-2, use the Spark computing engine to assign a globally unique vertex ID to every vertex in all vertex Hive tables, replace the vertices referenced by every edge in all edge Hive tables with the assigned vertex IDs, and assign a globally unique edge ID to every edge in all edge Hive tables.

Step 1-3, create the Schema of the JanusGraph graph data to be loaded, comprising PropertyKey (property), VertexLabel (vertex label), EdgeLabel (edge label), and Mixed index.

Step 1-4, create a configuration file that maps the Hive table names and fields of the vertices and edges to the labels and properties of JanusGraph.

(II) HBase data loading stage: the Spark computing engine reads the vertices and edges in the Hive tables in parallel, constructs a first RDD data set according to JanusGraph's storage structure and encoding in the HBase database, and loads the first RDD data set, in the form of HFile files, into the HBase database that backs the JanusGraph storage.

Step 2-1, connect to the Schema created in step 1-3 of the data preparation stage, obtain the relevant PropertyKey (property), VertexLabel (vertex label), and EdgeLabel (edge label) information, and read the mapping configured in step 1-4 of the data preparation stage.

Step 2-2, use the Spark computing engine to read, in parallel, the vertex Hive tables and edge Hive tables to which IDs were assigned in step 1-2 of the data preparation stage, and convert and encode the data according to JanusGraph's storage structure and encoding in the HBase database.

The globally unique vertex ID assigned in step 1-2 of the data preparation stage is used as the JanusGraph internal vertex ID, and the assigned globally unique edge ID is used as the JanusGraph internal edge ID. At the same time, all data is encoded with JanusGraph's encoding using the vertex and edge information from step 1-1, so that each record contains the HBase row key, column family, column name, and value. All converted and encoded records are merged into the first RDD data set.

When the globally unique vertex ID is converted into an HBase row key, the last N bits of the vertex ID are used as the JanusGraph vertex partition, which avoids the performance problems caused by imbalance between the partitions of the HBase database and those of the first RDD data set.

Step 2-3, read the row key partition information of the HBase database that backs JanusGraph, repartition the first RDD data set generated in step 2-2 accordingly, sort the data within each partition by row key, column family, and column name, and then store the first RDD data set as sharded HFile files in HDFS.

If the HBase partitions are finer than the JanusGraph vertex partitions, the first RDD data set can be split into correspondingly smaller partitions, based on the HBase partition information, when it is repartitioned. Each resulting HFile shard then belongs to a single HBase partition. This increases the number of partitions of the first RDD data set and reduces the data volume of each, which lowers the resource consumption of the per-partition sort and shortens the sorting time.

Step 2-4, read the HFile files in HDFS and load them into the HBase database that backs the JanusGraph storage.
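The row-key conversion mentioned in step 2-2, where the last N bits of the vertex ID act as the vertex partition, can be sketched as follows. This is a simplified illustration under stated assumptions, not JanusGraph's actual ID or key layout: the value `PARTITION_BITS = 5`, the 8-byte big-endian key, and the choice to move the partition bits to the front of the key are all assumptions made for the example.

```python
import struct

PARTITION_BITS = 5  # hypothetical N; the source does not fix a value


def vertex_partition(vertex_id, bits=PARTITION_BITS):
    """Return the partition encoded in the last `bits` bits of the ID."""
    return vertex_id & ((1 << bits) - 1)


def row_key(vertex_id, bits=PARTITION_BITS):
    """Build an illustrative 8-byte row key for an unsigned 64-bit ID.

    The partition bits are moved to the most significant position so
    that all keys of one partition form a contiguous, sorted key range.
    """
    partition = vertex_partition(vertex_id, bits)
    rest = vertex_id >> bits
    return struct.pack(">Q", (partition << (64 - bits)) | rest)
```

With keys built this way, consecutive IDs cycle through the partitions, so a sequentially assigned ID range spreads evenly over the key space instead of piling into one range.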

(III) Elasticsearch index construction stage: the Spark computing engine reads the vertices and edges in the Hive tables in parallel, extracts the vertex or edge properties that need to be indexed, constructs a second RDD data set according to JanusGraph's storage structure in the Elasticsearch index, and writes the second RDD data set into the Elasticsearch index in parallel.

Step 3-1, obtain the Mixed index information created in step 1-3 of the data preparation stage, which comprises the index name, index type, label, indexed property names, and similar information, and read the mapping configured in step 1-4.

Step 3-2, use the Spark computing engine to read, in parallel, the vertex Hive tables and edge Hive tables prepared in step 1-2 of the data preparation stage, and check each vertex or edge against the index information from step 3-1. If an index is configured for a vertex or edge, extract the property data required by the index and construct a second RDD data set according to JanusGraph's index storage structure in Elasticsearch.

Step 3-3, write the extracted and converted second RDD data set of vertices or edges into the Elasticsearch index in parallel, at the location where JanusGraph stores its Elasticsearch index.
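Steps 3-2 and 3-3 amount to filtering rows by the index configuration and emitting one document per indexed element. The sketch below produces action and document pairs in the shape used by Elasticsearch bulk requests. The layout of `index_config`, the function name, and the use of the plain numeric ID as the document `_id` are assumptions for illustration; the document IDs and field names that JanusGraph itself expects in its index are not reproduced here.

```python
def build_index_actions(rows, label, index_config, id_field="vertex_id"):
    """Yield (action, document) pairs shaped like Elasticsearch bulk items.

    index_config is a hypothetical dict:
        {"index_name": str, "label": str, "fields": [str, ...]}
    Rows whose label has no configured index, or that carry none of the
    indexed properties, produce no output.
    """
    if label != index_config["label"]:
        return
    for row in rows:
        # Keep only the properties that the index covers.
        doc = {
            field: row[field]
            for field in index_config["fields"]
            if row.get(field) is not None
        }
        if not doc:
            continue
        action = {
            "index": {
                "_index": index_config["index_name"],
                "_id": str(row[id_field]),
            }
        }
        yield action, doc
```

Since each pair depends only on its own row, the function can be applied independently to every partition of the input, which is what allows the write to proceed in parallel.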

The invention uses the HBase database as the storage for JanusGraph graph data and the Elasticsearch index as the index for JanusGraph graph data. The data preparation, data loading, and data indexing tasks are all carried out with the Spark computing engine, which submits them to a YARN cluster for execution, and HDFS and Hive are used to store the original and intermediate data. HBase, YARN, HDFS, and Hive are all deployed on the same Hadoop cluster, while the Hadoop cluster and the Elasticsearch cluster are deployed independently on different servers.

In this scheme, the graph data is written into the HBase database through HBase's HFile bulk-loading mechanism and the Elasticsearch index is built by the Spark computing engine. This bypasses the transaction overhead of the JanusGraph API and the unnecessary resource consumption caused by the HBase API, and solves the problem of graph data loss caused by partial write loss when massive data is written in parallel through the HBase API. In addition, loading the HBase database and building the Elasticsearch index can be executed at the same time, that is, stage (II) and stage (III) can run concurrently, which makes full use of the HBase cluster and Elasticsearch cluster resources and greatly improves the performance of batch-loading massive data into JanusGraph.
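The concurrency between stage (II) and stage (III) can be shown with a small driver. This is a sketch only: the three callables stand in for the real Spark jobs, and the function name `run_load` is an assumption for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_load(prepare, load_hbase, build_index):
    """Run stage (I), then stages (II) and (III) concurrently.

    prepare, load_hbase and build_index are caller-supplied callables;
    the latter two receive whatever prepare() returns.
    """
    prepared = prepare()
    with ThreadPoolExecutor(max_workers=2) as pool:
        hbase_future = pool.submit(load_hbase, prepared)
        index_future = pool.submit(build_index, prepared)
        # result() re-raises any exception raised inside a stage.
        return hbase_future.result(), index_future.result()
```

Both stages read only the prepared Hive data and write to different clusters, so neither has to wait for the other; the driver simply waits until both have finished.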

Example 2:

Based on Embodiment 1, this embodiment takes an enterprise investment relationship graph as an example of data to be batch-loaded into the graph. Assume that the enterprise basic information table enterprise_info is the vertex Hive table and that the enterprise investment relationship table enterprise_invest_relation is the edge Hive table.

A data sample of the enterprise_info table is as follows:

Enterprise name	Enterprise unified social information code
Information technology Co Ltd	123*456
B network technology Ltd	789*147

A data sample of the enterprise_invest_relation table is as follows:

(I) Data preparation stage

Step 1-1, import the enterprise_info table and the enterprise_invest_relation table into Hive for distributed storage, forming the vertex Hive table and the edge Hive table.

Step 1-2, use the Spark computing engine to assign a globally unique vertex ID to every vertex in the vertex Hive table enterprise_info, assign a globally unique edge ID to every edge in the edge Hive table enterprise_invest_relation, and replace the vertices referenced in the edge Hive table with the assigned vertex IDs.

After the replacement is completed, a data sample of the enterprise_info table is as follows:

Vertex ID	Enterprise name	Enterprise unified social information code
1	Information technology Co Ltd	123*456
2	B network technology Ltd	789*147

A data sample of the enterprise_invest_relation table is as follows:

Step 1-3, create the Schema of the JanusGraph graph data to be loaded. In this example the PropertyKeys (properties) to be created are "enterprise name", "enterprise unified social information code", and "investment proportion"; the only VertexLabel (vertex label) is "enterprise", and the only EdgeLabel (edge label) is "invest". Because enterprises need to be searched by "enterprise name" or "enterprise unified social information code", a Mixed index must be created on these two properties.

Step 1-4, create a configuration file that maps the Hive table names and fields of the vertices and edges to the labels and properties of JanusGraph. In this example, the enterprise_info table is mapped to the "enterprise" VertexLabel (vertex label) of JanusGraph, and the enterprise_invest_relation table is mapped to the "invest" EdgeLabel (edge label) of JanusGraph, with "investor ID" mapped to the start vertex ID and "invested ID" mapped to the end vertex ID of the "invest" edge. The "enterprise name" and "enterprise unified social information code" fields of the enterprise_info table and the "investment proportion" field of the enterprise_invest_relation table are each mapped to the corresponding PropertyKey (property) in JanusGraph.
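A mapping of the kind described in step 1-4 could look like the following. The source does not give the configuration file format, so the dictionary layout, the key names, and the Hive column names (`name`, `code`, `ratio`, `investor_id`, `invested_id`, `vertex_id`, `edge_id`) are all hypothetical; the table names are written here as enterprise_info and enterprise_invest_relation. Only the labels and property names are taken from the example. The small validator shows one check such a file would likely need before loading starts.

```python
# Schema elements named in step 1-3 of this example.
SCHEMA = {
    "property_keys": {
        "enterprise name",
        "enterprise unified social information code",
        "investment proportion",
    },
    "vertex_labels": {"enterprise"},
    "edge_labels": {"invest"},
}

# Hypothetical mapping: Hive column name -> JanusGraph property name.
MAPPING = {
    "vertices": [
        {
            "hive_table": "enterprise_info",
            "label": "enterprise",
            "id_column": "vertex_id",
            "properties": {
                "name": "enterprise name",
                "code": "enterprise unified social information code",
            },
        }
    ],
    "edges": [
        {
            "hive_table": "enterprise_invest_relation",
            "label": "invest",
            "id_column": "edge_id",
            "source_column": "investor_id",
            "target_column": "invested_id",
            "properties": {"ratio": "investment proportion"},
        }
    ],
}


def validate_mapping(mapping, schema):
    """Return a list of problems; an empty list means the two agree."""
    problems = []
    checks = (("vertices", "vertex_labels"), ("edges", "edge_labels"))
    for section, label_set in checks:
        for entry in mapping.get(section, []):
            table = entry["hive_table"]
            if entry["label"] not in schema[label_set]:
                problems.append(f"{table}: unknown label {entry['label']!r}")
            for column, prop in entry["properties"].items():
                if prop not in schema["property_keys"]:
                    problems.append(f"{table}.{column}: unknown property {prop!r}")
    return problems
```

Calling `validate_mapping(MAPPING, SCHEMA)` before stages (II) and (III) catches a mistyped label or property early, which matters because the bulk path bypasses the checks the JanusGraph API would otherwise perform.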

(II) HBase data loading stage

Step 2-1, connect to the Schema created in step 1-3 and read the mapping configured in step 1-4.

Step 2-2, use the Spark computing engine to read, in parallel, the enterprise_info table and the enterprise_invest_relation table to which IDs were assigned in step 1-2 of the data preparation stage. Using the information obtained in step 2-1, convert and encode the data of both tables according to JanusGraph's storage structure and encoding in the HBase database, so that each record contains the HBase row key, column family, column name, and value, and merge all converted and encoded records into a first RDD data set.

Step 2-3, read the row key partition information of the HBase database that backs JanusGraph, repartition the first RDD data set generated in step 2-2 according to the HBase row key partitioning, sort the data within each partition by row key, column family, and column name, and then store the first RDD data set as sharded HFile files in HDFS (Hadoop Distributed File System).

Step 2-4, read the HFile files in HDFS and load them into the HBase database that backs the JanusGraph storage.

(III) Elasticsearch index construction stage

Step 3-1, obtain the Mixed index information created in step 1-3 on the "enterprise name" and "enterprise unified social information code" PropertyKeys, and read the mapping configured in step 1-4.

Step 3-2, use the Spark computing engine to read the data of the two Hive tables prepared in step 1-2, enterprise_info and enterprise_invest_relation, check them against the Mixed index information from step 3-1, extract the "enterprise name" and "enterprise unified social information code" data of the enterprise_info table for which the index must be created, and construct a second RDD data set according to JanusGraph's storage structure in the Elasticsearch index.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for quickly loading JanusGraph data in batches, characterized in that the method comprises the following steps:

2. The method for quickly loading JanusGraph data in batches as claimed in claim 1, wherein the step of storing the vertex and edge data to be batch-loaded in sharded Hive tables so that the Spark computing engine can read them in parallel comprises:

3. The method for quickly loading JanusGraph data in batches as claimed in claim 2, wherein the step of creating the Schema of the JanusGraph graph data to be loaded and configuring the mapping between the Hive tables and the Schema comprises:

creating the Schema of the JanusGraph graph data to be loaded, the Schema comprising properties, vertex labels, edge labels, and Mixed indexes;

4. The method for quickly loading JanusGraph data in batches as claimed in claim 3, wherein the step of using the Spark computing engine to read the vertices and edges in the Hive tables in parallel and constructing the first RDD data set according to JanusGraph's storage structure and encoding in the HBase database comprises:

connecting to the Schema created in JanusGraph, obtaining the relevant information of the properties, vertex labels, and edge labels, and reading the mapping;

5. The method for quickly loading JanusGraph data in batches as claimed in claim 4, wherein the step of loading the first RDD data set, in the form of HFile files, into the HBase database that backs the JanusGraph storage comprises:

6. The method for quickly loading JanusGraph data in batches as claimed in claim 3, wherein the detailed steps of the Elasticsearch index construction stage comprise: