CN114138735A - Method for quickly loading JanusGraph data in batches - Google Patents

Method for quickly loading JanusGraph data in batches

Info

Publication number
CN114138735A
Authority
CN
China
Prior art keywords
data
edge
vertex
index
janusgraph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111267971.9A
Other languages
Chinese (zh)
Inventor
马杲灵
游飞龙
张林林
汪睿铭
陈雪
石尧
董博
廖海峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Shulian Mingpin Technology Co ltd
Original Assignee
Guizhou Shulian Mingpin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Shulian Mingpin Technology Co ltd
Priority to CN202111267971.9A
Publication of CN114138735A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for quickly loading JanusGraph data in batches, comprising the following stages. Data preparation stage: the vertex and edge data to be batch-loaded into the graph are saved to Hive tables in sharded form so that the Spark compute engine can read them in parallel, the Schema of the JanusGraph graph to be loaded is created, and the mapping between the Hive tables and the Schema is configured. HBase data loading stage: the Spark compute engine reads the vertices and edges in the Hive tables in parallel, constructs an RDD (Resilient Distributed Dataset) data set according to the storage structure and encoding scheme that JanusGraph uses in the HBase database, and loads the RDD data set, in the form of HFile files, into the HBase database that backs the JanusGraph storage. Elasticsearch index construction stage: the Spark compute engine reads the vertices and edges in the Hive tables in parallel, extracts the properties of the vertices or edges that need to be indexed, constructs an RDD data set according to the storage structure that JanusGraph uses in the Elasticsearch index, and writes the RDD data set into the Elasticsearch index in parallel.

Description

Method for quickly loading JanusGraph data in batches
Technical Field
The invention relates to the technical field of big data in computer science, and in particular to a method for quickly loading data into the distributed graph database JanusGraph in batches.
Background
A graph is a mathematical object that represents relationships between entities. It is expressed mathematically as a pair G = (V, E) consisting of N vertices (V = vertices) and M edges (E = edges). Each vertex is incident to a number of edges (at most M), and each edge connects two vertices. An edge may have a direction: if the graph contains directed edges it is called a directed graph, otherwise it is an undirected graph.
A graph database is a kind of NoSQL (non-relational) database that uses graph theory to store relationship information between entities; the most common example is interpersonal relationships in a social network. Relational databases handle such "relationship" data poorly: queries are complex, slow, and often fall short of expectations. The distinctive design of graph databases makes up for this deficiency.
JanusGraph is an open-source distributed graph database that is widely used in the field of graph data analysis because of its generality, high performance, and open source code. JanusGraph supports databases such as Cassandra and HBase as the storage that holds the complete graph structure data, and supports Elasticsearch, Solr, and similar systems as the index, which enables real-time retrieval of indexed vertices and edges. Because HBase and Elasticsearch are widely used in the big data field and perform well, this scheme mainly targets the scenario in which an HBase cluster serves as storage and an Elasticsearch cluster serves as the index.
In this scenario, the existing JanusGraph has the following problems when loading data:
(1) Data loaded into the graph through the API provided by JanusGraph is committed transactionally, which adds performance overhead to data loading, whereas offline data loading does not need transactions to guarantee data consistency.
(2) The API provided by JanusGraph calls the API of the HBase database to store data in the HBase database. During this process HBase frequently performs flush, compact, and split operations, which consumes a large amount of resources unnecessarily and reduces ingestion efficiency. Moreover, if the rate of API calls exceeds the write capacity of the HBase database, some writes may be lost, resulting in missing graph data.
(3) The API provided by JanusGraph first calls the API of the HBase database to store the data and, only after that call returns successfully, calls the API of Elasticsearch to build the index. This serial write pattern cannot make full use of the resources of the HBase cluster and the Elasticsearch cluster.
Disclosure of Invention
The invention aims to provide a method for quickly loading JanusGraph data in batches that makes full use of cluster resources, improves the performance of batch-loading data into JanusGraph, solves the problem that batch loading of massive data is slow, and solves the problem of missing graph data caused by the partial write loss that may occur when massive data is written in parallel through the API of the HBase database.
In order to achieve the above object, the embodiments of the present invention provide the following technical solutions:
A method for quickly loading JanusGraph data in batches comprises the following stages:
a data preparation stage: saving the vertex and edge data to be batch-loaded into the graph to Hive tables in sharded form so that the Spark compute engine can read them in parallel, creating the Schema of the JanusGraph graph to be loaded, and configuring the mapping between the Hive tables and the Schema;
an HBase data loading stage: using the Spark compute engine to read the vertices and edges in the Hive tables in parallel, constructing a first RDD data set according to the storage structure and encoding scheme that JanusGraph uses in the HBase database, and loading the first RDD data set, in the form of HFile files, into the HBase database that backs the JanusGraph storage;
an Elasticsearch index construction stage: using the Spark compute engine to read the vertices and edges in the Hive tables in parallel, extracting the properties of the vertices or edges that need to be indexed, constructing a second RDD data set according to the storage structure that JanusGraph uses in the Elasticsearch index, and writing the second RDD data set into the Elasticsearch index in parallel.
In this scheme, the graph data is written into the HBase database through HBase's HFile bulk-load mechanism, and the Elasticsearch index is built by the Spark compute engine, so the transaction overhead of calling the JanusGraph API is bypassed.
Further, the step of saving the vertex and edge data to be batch-loaded into the graph to Hive tables in sharded form, so that the Spark compute engine can read them in parallel, includes:
dividing the data to be batch-loaded into the graph into vertices and edges, forming vertex Hive tables and edge Hive tables respectively, and storing each vertex Hive table and each edge Hive table in sharded form;
and using the Spark compute engine to assign a globally unique vertex ID to every vertex in all vertex Hive tables, replace the vertices referenced by every edge in all edge Hive tables with the assigned vertex IDs, and assign a globally unique edge ID to every edge in all edge Hive tables.
Further, the step of creating the Schema of the JanusGraph graph to be loaded and configuring the mapping between the Hive tables and the Schema includes:
creating the Schema of the JanusGraph graph to be loaded, wherein the Schema comprises PropertyKey (property), VertexLabel (vertex label), EdgeLabel (edge label), and Mixed Index;
and creating a configuration file that configures the mapping from the Hive table names and fields of the vertices and edges to the labels and properties of JanusGraph.
In the above scheme, data is loaded without using the API provided by JanusGraph, so the JanusGraph internal IDs of the vertices and edges in the Hive tables must be allocated in advance. The JanusGraph Schema must also be created and the mapping between the Hive tables and the Schema configured, so that data can be loaded into the graph according to the configured mapping during data loading and index construction.
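The ID pre-allocation described above can be illustrated with a minimal single-machine sketch. It is not the patented Spark job: the function name `assign_ids`, the field names, and the use of simple sequential counters are all assumptions made for illustration, and the sketch does not attempt to produce IDs in JanusGraph's real internal ID format.

```python
from itertools import count


def assign_ids(vertices, edges, key_field, src_field, dst_field):
    """Assign unique IDs to vertices and edges and rewrite edge endpoints.

    All names here are hypothetical; a real job would run on Spark and
    would have to follow JanusGraph's internal ID layout.
    """
    vertex_ids = {}
    next_vertex_id = count(1)
    out_vertices = []
    for row in vertices:
        vid = next(next_vertex_id)
        vertex_ids[row[key_field]] = vid  # business key -> assigned ID
        out_vertices.append({"vertex_id": vid, **row})

    out_edges = []
    for eid, row in enumerate(edges, start=1):
        out_edges.append({
            **row,
            "edge_id": eid,
            # Replace the business keys of both endpoints with vertex IDs.
            src_field: vertex_ids[row[src_field]],
            dst_field: vertex_ids[row[dst_field]],
        })
    return out_vertices, out_edges


vertices = [{"name": "company_a"}, {"name": "company_b"}]
edges = [{"investor": "company_a", "invested": "company_b", "ratio": 0.3}]
id_vertices, id_edges = assign_ids(vertices, edges, "name", "investor", "invested")
```

The dictionary lookup stands in for what would be a distributed join between the edge table and the ID-annotated vertex table.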
Further, the step of using the Spark compute engine to read the vertices and edges in the Hive tables in parallel and constructing the first RDD data set according to the storage structure and encoding scheme that JanusGraph uses in the HBase database includes:
connecting to the Schema created in JanusGraph, obtaining the relevant information of PropertyKey (property), VertexLabel (vertex label), and EdgeLabel (edge label), and reading the mapping;
and using the Spark compute engine to read in parallel the data in the vertex Hive tables and edge Hive tables to which IDs have been assigned, converting and encoding the data according to the storage structure and encoding scheme that JanusGraph uses in the HBase database so that each record contains the row key, column family, column name, and value of the HBase database, and merging all converted and encoded records into the first RDD data set.
Further, the step of loading the first RDD data set, in the form of HFile files, into the HBase database that backs the JanusGraph storage includes:
reading the row-key partition information of the HBase database that backs JanusGraph, repartitioning the generated first RDD data set accordingly, sorting the data within each partition by row key, column family, and column name, and then saving the first RDD data set in sharded form as HFile files in the HDFS (Hadoop Distributed File System) distributed file system;
reading the HFile files in the HDFS distributed file system and loading them into the HBase database that backs the JanusGraph storage.
In this scheme, during the HBase data loading stage the Spark compute engine reads the vertices and edges in the Hive tables in parallel and constructs an RDD (Resilient Distributed Dataset) data set according to the data storage structure and encoding scheme that JanusGraph uses in the HBase database. The RDD data set is then partitioned and sorted according to the partitioning of the HBase database to generate HFile files. This ensures that the HFile generated from a single partition does not span multiple partitions of the HBase database and that the data within a single partition is stored in order, so that loading the HFile files into the HBase database causes neither errors nor performance degradation. All generated HFile files are then loaded into the HBase database.
Because the HBase database is loaded through HFile files, the API of the HBase database is not used to load data, and the transaction overhead of the JanusGraph API is avoided as well. Data loading and index construction can also be executed in parallel at the same time, which greatly improves the overall efficiency of loading JanusGraph data.
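The partition-then-sort step can be sketched in plain Python as follows. This is an illustrative model only: cells are represented as hypothetical `(row_key, column_family, qualifier, value)` tuples, the region start keys are made up, and the column family `b"e"` is a placeholder. In a Spark job, an operation such as `repartitionAndSortWithinPartitions` with a custom partitioner might play this role; the source does not name the operation it uses.

```python
from bisect import bisect_right
from collections import defaultdict


def partition_and_sort(cells, region_start_keys):
    """Group cells by HBase region, then sort each group.

    `region_start_keys` must be sorted and begin with b"" (the start of
    the first region). Each cell is (row_key, family, qualifier, value).
    """
    partitions = defaultdict(list)
    for cell in cells:
        row_key = cell[0]
        # Index of the last region whose start key is <= row_key.
        region = bisect_right(region_start_keys, row_key) - 1
        partitions[region].append(cell)
    # Within a region, order by (row key, column family, qualifier).
    return {
        region: sorted(group, key=lambda c: c[:3])
        for region, group in sorted(partitions.items())
    }


region_start_keys = [b"", b"\x40", b"\x80"]  # hypothetical split points
cells = [
    (b"\x81\x01", b"e", b"q2", b"v4"),
    (b"\x01\x02", b"e", b"q1", b"v1"),
    (b"\x41\x00", b"e", b"q1", b"v3"),
    (b"\x01\x02", b"e", b"q0", b"v0"),
]
layout = partition_and_sort(cells, region_start_keys)
```

Because every group holds row keys from exactly one region and is already in key order, a file written from one group cannot span two regions, which is the property the text relies on.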
Further, the detailed steps of the Elasticsearch index construction stage include:
connecting to the created Mixed Index information, which comprises the index name, index type, label, and indexed property names, and reading the configured mapping;
using the Spark compute engine to read in parallel the vertex Hive tables and edge Hive tables prepared in the data preparation stage, checking them against the index information, and, if a vertex or an edge has an index configured, extracting the property data to be indexed and constructing a second RDD data set according to the index storage structure that JanusGraph uses in Elasticsearch;
and writing the extracted and converted second RDD data set of vertices or edges into the Elasticsearch index in parallel, at the storage location of the JanusGraph Elasticsearch index.
In the above scheme, because the data is loaded through the HFile bulk-load mechanism of the HBase database rather than through the JanusGraph API, the index data must be written separately. To keep index writing efficient, the invention uses Spark to read and load the index data in parallel. In the Elasticsearch index construction stage, Spark reads the vertices and edges in the Hive tables in parallel, extracts the properties of the vertices or edges that need to be indexed, constructs a second RDD data set according to the index storage structure that JanusGraph uses in Elasticsearch, and then writes the second RDD data set in parallel to the designated JanusGraph storage location in Elasticsearch.
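The extraction of indexed properties can be illustrated with the sketch below. The function `build_index_docs`, the field names, and the document shape are assumptions for illustration; JanusGraph uses its own document ID encoding and field naming in Elasticsearch, which this sketch does not reproduce.

```python
def build_index_docs(rows, id_field, field_to_property, indexed_properties):
    """Yield (doc_id, doc) pairs that contain only indexed properties.

    `field_to_property` maps Hive field names to graph property names;
    `indexed_properties` is the set of properties covered by the index.
    """
    for row in rows:
        doc = {
            prop: row[field]
            for field, prop in field_to_property.items()
            if prop in indexed_properties and row.get(field) is not None
        }
        if doc:  # elements with nothing to index produce no document
            yield str(row[id_field]), doc


rows = [
    {"vertex_id": 1, "name": "company_a", "code": "123", "address": "x"},
    {"vertex_id": 2, "name": None, "code": None, "address": "y"},
]
field_to_property = {
    "name": "enterprise_name",
    "code": "enterprise_code",
    "address": "address",
}
indexed = {"enterprise_name", "enterprise_code"}
docs = list(build_index_docs(rows, "vertex_id", field_to_property, indexed))
```

The resulting pairs are what a parallel writer would send to the index in bulk; the unindexed `address` field is dropped, and the second row yields no document because none of its indexed fields has a value.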
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention writes the graph data into the HBase database through HBase's HFile bulk-load mechanism and builds the Elasticsearch index with the Spark compute engine, so the transaction overhead of calling the JanusGraph API is bypassed.
(2) By writing the graph data into the HBase database through HBase's HFile bulk-load mechanism, the method avoids the unnecessary resource consumption caused by the API of the HBase database and solves the problem of missing graph data caused by the partial write loss that may occur when massive data is written in parallel through that API.
(3) Loading the HBase database and building the Elasticsearch index can be executed at the same time, that is, stage (II) and stage (III) can run concurrently, which makes full use of the resources of the HBase cluster and the Elasticsearch cluster and greatly improves the performance of batch-loading massive data into JanusGraph.
(4) The HFile files generated by the conversion can be kept as a backup snapshot of the graph data, and they greatly improve the efficiency of rebuilding the graph.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a flow chart of the method for quickly loading JanusGraph data in batches according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
Example 1:
The invention is particularly suitable for batch-loading massive data, on the scale of hundreds of millions of records or more, into JanusGraph, and can significantly improve loading performance in the scenario where an HBase cluster serves as storage and an Elasticsearch cluster serves as the index. In a production environment, the massive existing data can first be loaded into the graph quickly in batches by this method; incremental data is then written to the graph through the JanusGraph API, with the transactions provided by JanusGraph ensuring consistency between the stored data and the index. The invention is realized by the following technical scheme. As shown in FIG. 1, a method for quickly loading JanusGraph data in batches comprises the following stages:
(I) Data preparation stage: the vertex and edge data to be batch-loaded into the graph are saved to Hive tables in sharded form so that the Spark compute engine can read them in parallel, the Schema of the JanusGraph graph to be loaded is created, and the mapping between the Hive tables and the Schema is configured.
Step 1-1, divide the data to be batch-loaded into the graph into vertices and edges, form vertex Hive tables and edge Hive tables respectively, and store each vertex Hive table and each edge Hive table in sharded form.
Step 1-2, use the Spark compute engine to assign a globally unique vertex ID to every vertex in all vertex Hive tables, replace the vertices referenced by every edge in all edge Hive tables with the assigned vertex IDs, and assign a globally unique edge ID to every edge in all edge Hive tables.
Step 1-3, create the Schema of the JanusGraph graph to be loaded, wherein the Schema comprises PropertyKey (property), VertexLabel (vertex label), EdgeLabel (edge label), and Mixed Index.
Step 1-4, create a configuration file that configures the mapping from the Hive table names and fields of the vertices and edges to the labels and properties of JanusGraph.
(II) HBase data loading stage: the Spark compute engine reads the vertices and edges in the Hive tables in parallel, constructs a first RDD data set according to the storage structure and encoding scheme that JanusGraph uses in the HBase database, and loads the first RDD data set, in the form of HFile files, into the HBase database that backs the JanusGraph storage.
Step 2-1, connect to the Schema created in step 1-3 of the data preparation stage, obtain the relevant information of PropertyKey (property), VertexLabel (vertex label), and EdgeLabel (edge label), and read the mapping configured in step 1-4 of the data preparation stage.
Step 2-2, use the Spark compute engine to read in parallel the data in the vertex Hive tables and edge Hive tables to which IDs were assigned in step 1-2 of the data preparation stage, and convert and encode the data according to the storage structure and encoding scheme that JanusGraph uses in the HBase database.
The globally unique vertex IDs assigned in step 1-2 of the data preparation stage are used as the JanusGraph internal vertex IDs, and the assigned globally unique edge IDs are used as the JanusGraph internal edge IDs. At the same time, all data is encoded in the JanusGraph encoding scheme using the vertex and edge information from step 1-1, so that each record contains the row key, column family, column name, and value of the HBase database, and all converted and encoded records are merged into the first RDD data set.
When a globally unique vertex ID is converted into an HBase row key, the last N bits of the vertex ID are used as the JanusGraph vertex partition, which avoids the performance problems caused by an imbalance between the partitions of the HBase database and those of the first RDD data set.
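The idea of taking the last N bits of the vertex ID as the partition can be shown with simple bit arithmetic. The sketch below is a simplified illustration, not JanusGraph's actual key encoding: the value of `PARTITION_BITS`, the 64-bit key width, and the choice to place the partition in the leading bits of the key are all assumptions, and the real ID layout contains additional fields that are ignored here.

```python
PARTITION_BITS = 5  # hypothetical value of N


def partition_of(vertex_id, partition_bits=PARTITION_BITS):
    """Return the partition stored in the lowest `partition_bits` bits."""
    return vertex_id & ((1 << partition_bits) - 1)


def row_key(vertex_id, partition_bits=PARTITION_BITS, id_bits=64):
    """Build a big-endian key with the partition moved to the top bits.

    Keys that share a partition then share a prefix, so they sort next
    to each other and fall into the same key range.
    """
    partition = partition_of(vertex_id, partition_bits)
    counter = vertex_id >> partition_bits
    key = (partition << (id_bits - partition_bits)) | counter
    return key.to_bytes(id_bits // 8, "big")


key = row_key(35)  # 35 = 0b1_00011: counter 1, partition 3
```

With this layout, spreading the assigned IDs evenly over the 2^N partition values spreads the row keys evenly over the key space, which is one way to read the balance argument made above.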
Step 2-3, read the row-key partition information of the HBase database that backs JanusGraph, repartition the first RDD data set generated in step 2-2, sort the data within each partition by row key, column family, and column name, and then save the first RDD data set in sharded form as HFile files in the HDFS distributed file system.
If the partitions of the HBase database are smaller than the JanusGraph vertex partitions, the first RDD data set can be split into smaller partitions based on the partition information of the HBase database when it is repartitioned, so that each resulting HFile shard belongs to a single partition of the HBase database. This also increases the number of partitions of the first RDD data set and reduces the data volume of each partition, which lowers the resource consumption of the in-partition sort and shortens the sorting time.
Step 2-4, read the HFile files in the HDFS distributed file system and load them into the HBase database that backs the JanusGraph storage.
(III) Elasticsearch index construction stage: the Spark compute engine reads the vertices and edges in the Hive tables in parallel, extracts the properties of the vertices or edges that need to be indexed, constructs a second RDD data set according to the storage structure that JanusGraph uses in the Elasticsearch index, and writes the second RDD data set into the Elasticsearch index in parallel.
Step 3-1, connect to the Mixed Index information created in step 1-3 of the data preparation stage, which comprises information such as the index name, index type, label, and indexed property names, and read the mapping configured in step 1-4.
Step 3-2, use the Spark compute engine to read in parallel the vertex Hive tables and edge Hive tables prepared in step 1-2 of the data preparation stage and check them against the index information from step 3-1; if a vertex or an edge has an index configured, extract the property data to be indexed and construct a second RDD data set according to the index storage structure that JanusGraph uses in Elasticsearch.
Step 3-3, write the extracted and converted second RDD data set of vertices or edges into the Elasticsearch index in parallel, at the storage location of the JanusGraph Elasticsearch index.
The invention uses the HBase database as storage for the JanusGraph data and the Elasticsearch index as the index for the JanusGraph data. The data preparation, data loading, and data indexing tasks are all carried out with the Spark compute engine, which submits them to a YARN cluster for execution, and HDFS and Hive are used to store the raw data and the intermediate data. HBase, YARN, HDFS, and Hive are all deployed on the same Hadoop cluster, and the Hadoop cluster and the Elasticsearch cluster are deployed independently on different servers.
In this scheme, the graph data is written into the HBase database through HBase's HFile bulk-load mechanism and the Elasticsearch index is built by the Spark compute engine. This bypasses the transaction overhead of the JanusGraph API and the unnecessary resource consumption caused by the API of the HBase database, and it solves the problem of missing graph data caused by the partial write loss that may occur when massive data is written in parallel through the API of the HBase database. Loading the HBase database and building the Elasticsearch index can also be executed at the same time, that is, stage (II) and stage (III) can run concurrently, which makes full use of the resources of the HBase cluster and the Elasticsearch cluster and greatly improves the performance of batch-loading massive data into JanusGraph.
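Because stage (II) and stage (III) read the same prepared Hive tables and write to different clusters, a driver could start them side by side. The sketch below only illustrates that control flow: `load_hbase` and `build_es_index` are hypothetical placeholders, and a real deployment would more likely submit two Spark jobs to YARN than call two local functions.

```python
from concurrent.futures import ThreadPoolExecutor


def load_hbase():
    """Placeholder for stage (II): generate HFile files and load them."""
    return "hbase loaded"


def build_es_index():
    """Placeholder for stage (III): write index documents in parallel."""
    return "index built"


def run_stages_in_parallel():
    """Start both stages together and wait for both to finish."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        hbase_future = pool.submit(load_hbase)
        index_future = pool.submit(build_es_index)
        # result() re-raises any exception raised inside a stage.
        return hbase_future.result(), index_future.result()


results = run_stages_in_parallel()
```

Neither stage consumes the other's output, so the only synchronization needed is waiting for both to complete before the graph is opened for queries.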
Example 2:
Based on Example 1, this example takes an enterprise investment relationship graph as the data to be batch-loaded into the graph. Assume that the enterprise basic information table enterprise_info is the vertex Hive table and the enterprise investment relationship table enterprise_invest_relationship is the edge Hive table.
A data sample of the enterprise_info table is as follows:
Enterprise name | Enterprise unified social information code
Information technology Co Ltd | 123*456
B network technology Ltd | 789*147
A data sample of the enterprise_invest_relationship table is as follows:
(Table image not reproduced in the text; figure reference BDA0003327578340000111.)
(I) Data preparation stage
Step 1-1, import the enterprise_info table and the enterprise_invest_relationship table into Hive for distributed storage to form the vertex Hive table and the edge Hive table.
Step 1-2, use the Spark compute engine to assign a globally unique vertex ID to every vertex in the vertex Hive table enterprise_info, assign a globally unique edge ID to every edge in the edge Hive table enterprise_invest_relationship, and replace the vertices referenced in the edge Hive table with the assigned vertex IDs.
After the replacement is completed, a data sample of the enterprise_info table is as follows:
Vertex ID | Enterprise name | Enterprise unified social information code
1 | Information technology Co Ltd | 123*456
2 | B network technology Ltd | 789*147
A data sample of the enterprise_invest_relationship table is as follows:
(Table images not reproduced in the text; figure references BDA0003327578340000112 and BDA0003327578340000121.)
Step 1-3, create the Schema of the JanusGraph graph to be loaded. In this example the PropertyKeys (properties) to be created are "enterprise name", "enterprise unified social information code", and "investment proportion"; there is a single VertexLabel (vertex label), "enterprise", and a single EdgeLabel (edge label), "invest". Because enterprises need to be searched by "enterprise name" or "enterprise unified social information code", a Mixed Index must be created on these two properties.
Step 1-4, create a configuration file that configures the mapping from the Hive table names and fields of the vertices and edges to the labels and properties of JanusGraph. In this example, the enterprise_info table is mapped to the "enterprise" VertexLabel (vertex label) of JanusGraph, and the enterprise_invest_relationship table is mapped to the "invest" EdgeLabel (edge label) of JanusGraph, with "investor ID" mapped to the starting vertex ID of the "invest" edge and "invested ID" mapped to its ending vertex ID. The "enterprise name" and "enterprise unified social information code" fields of the enterprise_info table and the "investment proportion" field of the enterprise_invest_relationship table are each mapped to the corresponding PropertyKey (property) in JanusGraph.
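A mapping of this kind might be expressed as the structure below. The source does not give the format of the configuration file, so the layout of `MAPPING`, every key in it, the Hive field names, and the helper `map_row` are hypothetical and serve only to make the mapping concrete.

```python
# Hypothetical mapping: Hive table -> JanusGraph label and properties.
MAPPING = {
    "enterprise_info": {
        "kind": "vertex",
        "label": "enterprise",
        "properties": {
            "enterprise_name": "enterprise_name",
            "unified_code": "unified_social_information_code",
        },
    },
    "enterprise_invest_relationship": {
        "kind": "edge",
        "label": "invest",
        "source_field": "investor_id",  # starting vertex ID
        "target_field": "invested_id",  # ending vertex ID
        "properties": {"investment_proportion": "investment_proportion"},
    },
}


def map_row(table, row, mapping=MAPPING):
    """Translate one Hive row into a labelled graph element."""
    spec = mapping[table]
    element = {
        "label": spec["label"],
        "properties": {
            prop: row[field] for field, prop in spec["properties"].items()
        },
    }
    if spec["kind"] == "edge":
        element["source"] = row[spec["source_field"]]
        element["target"] = row[spec["target_field"]]
    return element


edge = map_row(
    "enterprise_invest_relationship",
    {"investor_id": 1, "invested_id": 2, "investment_proportion": 0.3},
)
```

Keeping the mapping in data rather than in code lets the same loading and indexing jobs serve other tables by editing the configuration only.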
(II) HBase data loading stage
Step 2-1, connect to the Schema created in step 1-3 and read the mapping configured in step 1-4.
Step 2-2, use the Spark compute engine to read in parallel the enterprise_info table and the enterprise_invest_relationship table to which IDs were assigned in step 1-2 of the data preparation stage. Using the information obtained in step 2-1, convert and encode the data of both tables according to the storage structure and encoding scheme that JanusGraph uses in the HBase database, so that each record contains the row key, column family, column name, and value of the HBase database, and merge all converted and encoded records into the first RDD data set.
Step 2-3, read the row-key partition information of the HBase database that backs JanusGraph, repartition the first RDD data set generated in step 2-2 according to the row-key partitioning of the HBase database, sort the data within each partition by row key, column family, and column name, and then save the first RDD data set in sharded form in the HDFS (Hadoop Distributed File System) distributed file system.
Step 2-4, read the HFile files from the HDFS distributed file system and load them into the HBase database that backs the JanusGraph storage.
(III) Elasticissearch index construction phase
And 3-1, connecting the Mixed index information created by the enterprise name and the enterprise unified social information code PropertyKey in the step 1-3, and reading the mapping relation configured in the step 1-4.
And 3-2, reading the data of two Hive tables of the entry _ info table and the entry _ event _ relationship table prepared in the step 1-2 by using a Spark calculation engine, judging by using Mixed index information in the step 3-1, extracting the data of the enterprise name and the enterprise unified social information code of the entry _ info table needing to create the index, and constructing a second RDD data set according to the storage structure of the Janusgraph in the elastic search index.
Step 3-3, write the extracted and converted second RDD data set of vertices or edges into the Elasticsearch index in parallel, according to the storage location of the JanusGraph Elasticsearch index.
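One possible way to perform the write of step 3-3 is through the Elasticsearch bulk API, which accepts newline-delimited JSON where each document is preceded by an action line. The sketch below only builds such a request body for one partition of documents; the index name, document IDs and field names are hypothetical, and sending the body (for example with an HTTP client or an Elasticsearch client library) is left out.

```python
import json
from typing import Any, Dict, Iterable, Tuple

# (index name, document id, indexed fields)
Doc = Tuple[str, str, Dict[str, Any]]


def to_bulk_body(docs: Iterable[Doc]) -> str:
    """Serialize documents as newline-delimited JSON for a bulk request."""
    lines = []
    for index_name, doc_id, fields in docs:
        # Action line: which index and which document ID to write.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the indexed fields themselves.
        lines.append(json.dumps(fields, ensure_ascii=False))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n" if lines else ""


# Hypothetical documents for one partition of the second data set.
partition_docs = [
    ("enterprise_search", "1", {"name": "Acme", "code": "X1"}),
    ("enterprise_search", "2", {"name": "Globex", "code": "X2"}),
]
body = to_bulk_body(partition_docs)
```

Building one body per partition lets each Spark task send its own requests, which is what allows the index to be written in parallel.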
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A method for rapidly loading JanusGraph data in batches, characterized in that the method comprises the following steps:
a data preparation stage: storing the data of the vertices and edges to be loaded in batches into Hive tables in sharded form so that a Spark calculation engine can read them in parallel, creating a Schema structure for the JanusGraph graph data to be loaded, and configuring a mapping relation between the Hive tables and the Schema structure;
an HBase data loading stage: using the Spark calculation engine to read the vertices and edges in the Hive tables in parallel, constructing a first RDD data set according to the storage structure and encoding mode of JanusGraph in the HBase database, and loading the first RDD data set, in the form of HFile files, into the HBase database corresponding to the JanusGraph storage;
an Elasticsearch index construction phase: using the Spark calculation engine to read the vertices and edges in the Hive tables in parallel, extracting the attributes of the vertices or edges that need to be indexed, constructing a second RDD data set according to the storage structure of JanusGraph in the Elasticsearch index, and writing the second RDD data set into the Elasticsearch index in parallel.
2. The method for rapidly loading JanusGraph data in batches as claimed in claim 1, wherein the step of storing the data of the vertices and edges to be loaded in batches into Hive tables in sharded form so that the Spark calculation engine can read them in parallel comprises:
dividing the data to be loaded into the graph in batches into vertices and edges, forming vertex Hive tables and edge Hive tables respectively, and storing each vertex Hive table and each edge Hive table in sharded form;
using the Spark calculation engine to assign a globally unique vertex ID to every vertex of all vertex Hive tables, replacing the associated vertices of every edge of all edge Hive tables with the assigned vertex IDs, and assigning a globally unique edge ID to every edge of all edge Hive tables.
3. The method for rapidly loading JanusGraph data in batches as claimed in claim 2, wherein creating the Schema structure of the JanusGraph graph data to be loaded and configuring the mapping relation between the Hive tables and the Schema structure comprises:
creating the Schema structure of the JanusGraph graph data to be loaded, wherein the Schema structure comprises attributes, vertex labels, edge labels and Mixed indexes;
creating a configuration file, and configuring in it the mapping relation between the Hive table names and fields of the vertices and edges and the labels and attributes of JanusGraph.
4. The method for rapidly loading JanusGraph data in batches as claimed in claim 3, wherein the step of using the Spark calculation engine to read the vertices and edges in the Hive tables in parallel and constructing the first RDD data set according to the storage structure and encoding mode of JanusGraph in the HBase database comprises:
connecting to JanusGraph to obtain the created Schema structure, acquiring the related information of the attributes, vertex labels and edge labels, and reading the mapping relation;
using the Spark calculation engine to read in parallel the data in the vertex Hive tables and edge Hive tables to which IDs have been assigned, converting and encoding the data according to the storage structure and encoding mode of JanusGraph in the HBase database so that each piece of data contains the row key, column family, column name and value of the HBase database, and merging all the converted and encoded data into the first RDD data set.
5. The method for rapidly loading JanusGraph data in batches as claimed in claim 4, wherein the step of loading the first RDD data set, in the form of HFile files, into the HBase database corresponding to the JanusGraph storage comprises:
reading the row key partition information of the HBase database corresponding to JanusGraph, repartitioning the generated first RDD data set, sorting the data within each partition by row key, column family and column name, and then storing the first RDD data set in sharded form, as HFile files, in the HDFS distributed file system;
reading the HFile files in the HDFS distributed file system, and loading the HFile files into the HBase database corresponding to the JanusGraph storage.
6. The method for rapidly loading JanusGraph data in batches as claimed in claim 3, wherein the detailed steps of the Elasticsearch index construction phase comprise:
connecting to JanusGraph to obtain the created Mixed index information, wherein the Mixed index information comprises an index name, an index type, a label and an index attribute name, and reading the configured mapping relation;
using the Spark calculation engine to read in parallel the vertex Hive tables and edge Hive tables prepared in the data preparation stage, and using the index information to determine whether a given vertex or edge has an index configured; if so, extracting the required index attribute data and constructing the second RDD data set according to the index storage structure of JanusGraph in Elasticsearch;
writing the extracted and converted second RDD data set of vertices or edges into the Elasticsearch index in parallel, according to the storage location of the JanusGraph Elasticsearch index.
CN202111267971.9A 2021-10-29 2021-10-29 Method for quickly loading Janus graph data in batches Pending CN114138735A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111267971.9A CN114138735A (en) 2021-10-29 2021-10-29 Method for quickly loading Janus graph data in batches

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111267971.9A CN114138735A (en) 2021-10-29 2021-10-29 Method for quickly loading Janus graph data in batches

Publications (1)

Publication Number Publication Date
CN114138735A true CN114138735A (en) 2022-03-04

Family

ID=80394843

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111267971.9A Pending CN114138735A (en) 2021-10-29 2021-10-29 Method for quickly loading Janus graph data in batches

Country Status (1)

Country Link
CN (1) CN114138735A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116383311A (en) * 2023-06-05 2023-07-04 云筑信息科技(成都)有限公司 Method for real-time fusion search of provider portrait data in building industry
CN116383311B (en) * 2023-06-05 2023-08-18 云筑信息科技(成都)有限公司 Method for real-time fusion search of provider portrait data in building industry

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination