CN114297173A - Knowledge graph construction method and system for large-scale mass data - Google Patents

Knowledge graph construction method and system for large-scale mass data

Info

Publication number
CN114297173A
CN114297173A (application CN202110677218.0A)
Authority
CN
China
Prior art keywords: cluster, distributed, database, data, graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110677218.0A
Other languages
Chinese (zh)
Inventor
赵俊峰
王亚沙
徐涌鑫
杨恺
单中原
王子健
尹思菁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN202110677218.0A
Publication of CN114297173A
Legal status: Pending

Abstract

The invention discloses a knowledge graph construction method and system for large-scale mass data. The method comprises the following steps: S100, building each distributed cluster with a Master-Slave structure based on docker-compose and Apache Hadoop technology to provide distributed storage, indexing and computation for a graph database, where the distributed clusters comprise a distributed storage cluster, a distributed index cluster and a distributed computing cluster; S200, jointly storing and retrieving massive knowledge graph data with the graph database and a document database to realize the construction of a massive knowledge graph. The invention achieves rapid, customizable cluster deployment through docker-compose, and exploits the respective strengths of the graph database and the document database in their suited retrieval scenarios, greatly improving knowledge graph construction and retrieval efficiency for massive data.

Description

Knowledge graph construction method and system for large-scale mass data
Technical Field
The invention relates to the technical field of knowledge graph construction, and in particular to a knowledge graph construction method and system for large-scale mass data.
Background
In the real world, with the rapid development of fixed broadband, the mobile internet, the Internet of Things and so on, data is growing explosively. According to the "Data Age 2025" report published by Seagate Technology and International Data Corporation (IDC), data will keep growing rapidly and the global data volume will reach 163 ZB by 2025. In a real application scenario such as the financial field, under the data collection standard of the Zhengzhou branch of the People's Bank of China, a single anti-money-laundering business accumulates a total data volume on the TB scale over four years, with daily growth on the GB scale. Facing such rapid data growth in real application scenarios, a domain knowledge graph tool needs to perform high-level modeling of massive multi-source heterogeneous data through knowledge extraction: the schemas of multi-source heterogeneous database tables are mapped, manually or mechanically, onto an ontology established by domain experts in the domain knowledge graph, the ontology serves as an intermediary to achieve semantic fusion of the heterogeneous data, and under its guidance the import of massive data and the construction, self-growth and self-evolution of the domain knowledge graph are completed. Massive multi-source heterogeneous data also brings great challenges to the construction and retrieval of knowledge graphs:
(1) In terms of cluster deployment. From the technical route, graph databases fall into stand-alone graph databases and distributed graph databases. A stand-alone graph database such as the Neo4j database (community edition) has long been popular in the market and the industry thanks to its convenience, availability, free open-source license and mature technology, and has for years led the graph database ranking on DB-Engines. However, a stand-alone database can hardly keep up with ever-growing data: because it cannot form a cluster or store data in a distributed manner, performance and capacity can only be improved by upgrading a single machine's hard disk, memory and SSD, which is expensive. Distributed graph databases such as Dgraph and JanusGraph can cope with the problems that massive data growth causes for a single-machine system by scaling the cluster horizontally, thereby reducing hardware cost, but storage and index components have to be installed manually on every machine and the network connections configured, which is a tedious process.
(2) In terms of graph construction. For a stand-alone graph database such as the Neo4j database (community edition), importing data statement by statement is too slow and cannot support real-time reads and writes. The officially provided tool for importing massive data into a knowledge graph involves complicated, largely manual steps, and can only import data in one pass into an empty database while the service is stopped, so massive incremental data cannot be imported into the graph database and the self-growth and self-evolution of a domain knowledge graph cannot be realized. Although the official CSV import tool supports incremental import, its efficiency declines sharply once the data volume reaches the tens of millions. Distributed graph databases such as JanusGraph are relatively new; no official tool is provided for importing massive knowledge graph data, open-source implementations of batch import for JanusGraph are almost nonexistent, and importing statement by statement makes graph construction lag behind the speed at which data is updated and grows, limiting the application of domain knowledge graph tools.
(3) In terms of graph retrieval. A stand-alone graph database such as the Neo4j graph database (community edition) is inefficient for multi-hop queries because indexes cannot be built on relationships. A distributed graph database can build indexes on relationships and improve multi-hop query efficiency, but super nodes in the graph database greatly degrade retrieval performance; a super node is a node whose degree (in-degree plus out-degree) reaches the order of tens of thousands or more. Super nodes appear in many application scenarios; for example, in a financial domain knowledge graph, tens of millions of transaction edges may exist between a financial institution or corporate customer and its many customers, forming a "super node". A query that passes through a super node must traverse all of that node's adjacent edges, so the graph database loses the advantage it gains by modeling relationships as data and replacing multi-table JOIN operations with multi-hop queries.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a knowledge graph construction method and system for large-scale mass data that achieve rapid, customizable cluster deployment through docker-compose, exploit the respective strengths of a graph database and a document database in their suited retrieval scenarios, and greatly improve knowledge graph construction and retrieval efficiency for massive data.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
a knowledge graph construction method for large-scale mass data comprises the following steps:
S100, building each distributed cluster with a Master-Slave structure based on docker-compose and Apache Hadoop technology to provide distributed storage, indexing and computation for a graph database, where the distributed clusters comprise a distributed storage cluster, a distributed index cluster and a distributed computing cluster;
S200, jointly storing and retrieving massive knowledge graph data with the graph database and a document database to realize the construction of a massive knowledge graph.
Further, in the method as described above, the distributed storage cluster uses the HBase component, the distributed index cluster uses the Elasticsearch component, the distributed computing cluster uses the Spark component, and the graph database is a distributed graph database based on the open-source JanusGraph.
Further, in the method as described above, for the distributed computing cluster, JanusGraph's Gremlin Server, the Spark Master, the YARN ResourceManager and the HDFS NameNode are deployed on the Master machine and the corresponding worker nodes are deployed on the Slave machines; the distributed storage cluster and the distributed index cluster are deployed in the same manner.
Further, the method as described above, S100, comprises:
S101, building the distributed clusters based on a docker-compose.yml file to provide distributed storage, indexing and computation for the graph database;
S102, specifying the number of Worker container nodes in each distributed cluster through the scale parameter of docker-compose, and specifying the relevant configuration items of the YAML file through the e parameter of docker-compose as environment variable parameters, the relevant configuration items including the container network subnet IP, the IPs of the Worker container nodes, and the CPU cores and memory resources allocated to the Spark Worker nodes;
S103, embedding the docker-compose up command that deploys each distributed cluster into a Linux Shell script, exposing the scale and e parameters of the docker-compose command as parameters passed in by the user through the Shell script, so that the container network, IPs and resource allocation can be customized for different data volumes and application scenario requirements;
and S104, starting and stopping each distributed cluster with one command through docker-compose up and docker-compose down.
Further, the method as described above, S200 includes:
S201, analyzing the characteristics of massive knowledge graph data, modeling the edges of the same type between a pair of head and tail entities as an edge cluster, storing the edge cluster as a single edge in the graph database, establishing a cluster ID attribute on that edge to identify the cluster it belongs to, and storing the attribute information of the edges in the edge cluster in a document database;
S202, based on the above analysis, storing the basic network structure of the massive knowledge graph data, i.e. nodes and edges, in the graph database, and storing the attribute information of the nodes and relations in the document database, where a relation refers to an edge in the graph database, the attribute information of a node comprises its ID and type, and the attribute information of a relation comprises its cluster ID and type;
S203, automatically assigning node primary key IDs and relation primary key IDs according to the mapping between the database tables of structured/semi-structured data and the domain ontology, automatically identifying the data information that needs to be stored in both the graph database and the document database, including names, node types and relation types, and automatically partitioning the data for separate storage.
A knowledge graph construction system for large-scale mass data comprises:
the building module is used for building each distributed cluster with a Master-Slave structure based on docker-compose and Apache Hadoop technology to provide distributed storage, indexing and computation for a graph database, where the distributed clusters comprise a distributed storage cluster, a distributed index cluster and a distributed computing cluster;
and the construction module is used for jointly storing and retrieving massive knowledge graph data with the graph database and a document database to realize the construction of a massive knowledge graph.
Further, in the system as described above, the distributed storage cluster uses the HBase component, the distributed index cluster uses the Elasticsearch component, the distributed computing cluster uses the Spark component, and the graph database is a distributed graph database based on the open-source JanusGraph.
Further, in the system as described above, for the distributed computing cluster, JanusGraph's Gremlin Server, the Spark Master, the YARN ResourceManager and the HDFS NameNode are deployed on the Master machine and the corresponding worker nodes are deployed on the Slave machines; the distributed storage cluster and the distributed index cluster are deployed in the same manner.
Further, as for the system described above, the building module is specifically configured to:
building the distributed clusters based on a docker-compose.yml file to provide distributed storage, indexing and computation for the graph database;
specifying the number of Worker container nodes in each distributed cluster through the scale parameter of docker-compose, and specifying the relevant configuration items of the YAML file through the e parameter of docker-compose as environment variable parameters, the relevant configuration items including the container network subnet IP, the IPs of the Worker container nodes, and the CPU cores and memory resources allocated to the Spark Worker nodes;
embedding the docker-compose up command that deploys each distributed cluster into a Linux Shell script, exposing the scale and e parameters of the docker-compose command as parameters passed in by the user through the Shell script, so that the container network, IPs and resource allocation can be customized for different data volumes and application scenario requirements;
and starting and stopping each distributed cluster with one command through docker-compose up and docker-compose down.
Further, in the system as described above, the construction module is specifically configured to:
analyze the characteristics of massive knowledge graph data, model the edges of the same type between a pair of head and tail entities as an edge cluster, store the edge cluster as a single edge in the graph database, establish a cluster ID attribute on that edge to identify the cluster it belongs to, and store the attribute information of the edges in the edge cluster in a document database;
based on the above analysis, store the basic network structure of the massive knowledge graph data, i.e. nodes and edges, in the graph database, and store the attribute information of the nodes and relations in the document database, where a relation refers to an edge in the graph database, the attribute information of a node comprises its ID and type, and the attribute information of a relation comprises its cluster ID and type;
and automatically assign node primary key IDs and relation primary key IDs according to the mapping between the database tables of structured/semi-structured data and the domain ontology, automatically identify the data information that needs to be stored in both the graph database and the document database, including names, node types and relation types, and automatically partition the data for separate storage.
The invention has the following beneficial effects: it effectively solves the problems that a stand-alone graph database cannot meet the storage and retrieval requirements of massive data and that deploying a distributed graph database is tedious. For cluster deployment, a plug-and-play distributed graph database cluster framework not yet realized in the open-source community is proposed for the first time. The heterogeneous database storage and retrieval scheme provided by the invention is of great significance for alleviating the retrieval performance degradation caused by super nodes in the graph database and for improving overall retrieval efficiency.
Drawings
Fig. 1 is a schematic flowchart of a knowledge graph construction method for large-scale mass data according to an embodiment of the present invention;
FIG. 2 is a diagram of a distributed cluster provided in an embodiment of the present invention;
FIG. 3 is a block diagram of a heterogeneous database storage and retrieval architecture provided in an embodiment of the present invention;
fig. 4 is a diagram of an experimental result of performing heterogeneous database storage and retrieval according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a knowledge graph construction system for large-scale mass data according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems solved, the technical solutions adopted, and the technical effects achieved by the present invention clearer, the technical solutions of the embodiments of the present invention will be further described in detail with reference to the accompanying drawings.
In terms of cluster deployment, to solve the difficulty a single-machine system has in coping with ever-growing massive data, the construction of the distributed storage, computing and index clusters is mainly based on the related components of the Apache Hadoop big data platform. Aiming at the problem that super nodes in a graph database greatly reduce graph retrieval performance, the invention proposes joint storage and retrieval with a graph database and a document database. The knowledge graph system built by this method achieves rapid, customizable cluster deployment through docker-compose, and exploits the respective strengths of the graph database and the document database in their suited retrieval scenarios, greatly improving knowledge graph construction and retrieval efficiency for massive data. The invention mainly comprises the following: (1) one-click, customizable construction of the distributed graph database cluster is realized on the basis of docker-compose, where the user can customize the subnet IP of the cluster's container network and the number, IPs and host names of the slave nodes of the distributed storage, index and computing clusters according to their own requirements. (2) A heterogeneous database hybrid scheme is adopted, in which the data stored in the graph database and the document database are associated through the assigned node primary key IDs and relation primary key IDs, comprehensively exploiting the graph database's efficiency for multi-hop queries and the document database's speed for conditional queries and statistical analysis.
The embodiment of the invention provides a knowledge graph construction method for large-scale mass data, as shown in figure 1, the method comprises the following steps:
S100, building each distributed cluster with a Master-Slave structure based on docker-compose and Apache Hadoop technology to provide distributed storage, indexing and computation for a graph database, where the distributed clusters comprise a distributed storage cluster, a distributed index cluster and a distributed computing cluster.
In the embodiment of the invention, to implement the method, the invention provides a distributed cluster that supports massive data storage and efficient retrieval; as shown in Fig. 2, the distributed cluster built by the invention adopts a Master-Slave structure. Further, for the distributed computing cluster, JanusGraph's Gremlin Server, the Spark Master, the YARN ResourceManager and the HDFS NameNode are deployed on the Master machine and the corresponding worker nodes are deployed on the Slave machines; the distributed storage cluster and the distributed index cluster are deployed in the same manner.
Specifically, the distributed clusters are built on Apache Hadoop. Apache Hadoop is an open-source software framework that supports data-intensive distributed applications and is released under the Apache 2.0 license. It supports applications running on large clusters built of commodity hardware. Briefly, Hadoop is a software platform that makes it easier to develop and run applications that process large-scale data. The platform is implemented in the object-oriented programming language Java and has good portability.
In the embodiment of the invention, the distributed storage cluster uses the HBase component. HBase is an Apache top-level project: an open-source, distributed, column-oriented database suited to storing massive unstructured data. Modeled after Google's BigTable and implemented in Java, it runs on the HDFS file system and provides BigTable-like capabilities on top of Hadoop, so it offers very high fault tolerance for sparse data. Facing the ever-growing massive data of various application scenarios, horizontal scaling flexibly guarantees the storage of massive data and avoids the performance loss and high cost of continuously upgrading a single machine.
In the embodiment of the invention, the distributed index cluster uses the Elasticsearch (ES) component. ES is a search engine based on the Apache Lucene library. It provides a distributed, multi-tenant full-text search engine with an HTTP web interface and schema-free JSON documents. ES is developed in Java and released as open-source software under the Apache license. Official clients are available in Java, .NET (C#), PHP, Python, Apache Groovy, Ruby and many other languages. ES is the most popular enterprise search engine, as shown by the DB-Engines ranking. ES is also distributed: it can scale to hundreds of servers and process PB-level structured or unstructured data, guaranteeing retrieval performance over massive data.
In the embodiment of the invention, the distributed computing cluster uses the Spark component. Apache Spark is an open-source cluster computing framework. Building on the distributed computing model of MapReduce, Spark adds in-memory computation, so intermediate results can be analyzed in RAM without being written to disk; programs running in RAM can be up to 100 times faster than Hadoop MapReduce. Colloquially, distributed computing can be understood as one hundred people working at the same time, which is far more efficient than one person working alone. The project exploits the parallelism and speed of the distributed computing cluster to realize rapid batch import of massive data and OLAP (online analytical processing) operations, as sketched below.
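The following Python/PySpark snippet is a minimal sketch of such a parallel batch import, not the patent's actual import tool: it distributes edge records across Spark workers and writes each partition to the graph database through Gremlin Server. The Gremlin Server address, the CSV path and the column names are assumptions made for the example.

```python
# Minimal sketch: parallel bulk import of edge records into a Gremlin-compatible
# graph database (e.g. JanusGraph) using Spark partitions as the unit of parallelism.
from pyspark.sql import SparkSession

def import_partition(rows):
    # One Gremlin connection per partition; runs in parallel on Spark workers.
    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
    from gremlin_python.process.anonymous_traversal import traversal
    conn = DriverRemoteConnection("ws://master:8182/gremlin", "g")  # assumed address
    g = traversal().withRemote(conn)
    for row in rows:
        (g.V().has("account", "accountId", row["src"]).as_("s")
          .V().has("account", "accountId", row["dst"])
          .addE("transaction").from_("s")
          .property("clusterId", row["clusterId"])
          .iterate())
    conn.close()

spark = SparkSession.builder.appName("kg-bulk-import").getOrCreate()
edges = spark.read.option("header", True).csv("hdfs:///data/edges.csv")  # assumed path
edges.foreachPartition(import_partition)
```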
In the embodiment of the invention, the development of the domain knowledge graph tool is based on the open-source distributed graph database JanusGraph. JanusGraph is a graph database engine that can work with various storage backends such as Apache HBase, Apache Cassandra and Berkeley DB, and various index backends such as Elasticsearch and Apache Solr. The distributed nature of JanusGraph lies mainly in the distributed cluster deployment of its storage and index backends: when ever-growing massive data pushes a single-machine system to its performance bottleneck, adding machines and scaling the cluster horizontally relieves the pressure, and horizontal scaling of the storage and index backends raises the capacity for storing and retrieving massive data. Although a distributed graph database can cope with the problems massive data growth causes for a single-machine system by scaling the cluster horizontally, the traditional deployment approach requires running docker run commands against downloaded Docker images to create a container for each component, establishing the network in advance, assigning an IP to each container, and configuring JanusGraph's connection files to connect to the storage and index backend clusters, which is a very tedious process. Even with hand-written scripts, user customization cannot be realized, for example letting the user specify, through input parameters, the number of slaves of the storage backend HBase cluster and of the index backend ES cluster for application scenarios with different data volumes, or customizing the network, such as the subnet IP of the container network and the IP of each container.
In the embodiment of the invention, the steps of building the distributed cluster are as follows:
S101, building the distributed clusters based on a docker-compose.yml file to provide distributed storage, indexing and computation for the graph database;
S102, specifying the number of Worker container nodes in each distributed cluster through the scale parameter of docker-compose, and specifying the relevant configuration items of the YAML file through the e parameter of docker-compose as environment variable parameters, the relevant configuration items including the container network subnet IP, the IPs of the Worker container nodes, and the CPU cores and memory resources allocated to the Spark Worker nodes;
S103, embedding the docker-compose up command that deploys each distributed cluster into a Linux Shell script, exposing the scale and e parameters of the docker-compose command as parameters passed in by the user through the Shell script, so that the container network, IPs and resource allocation can be customized for different data volumes and application scenario requirements;
and S104, starting and stopping each distributed cluster with one command through docker-compose up and docker-compose down (a minimal deployment-script sketch is given after these steps).
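As an illustration only, the following Python sketch wraps docker-compose into a one-command deployment entry point; the service name, file names and environment variable names (SUBNET, SPARK_WORKER_CORES, SPARK_WORKER_MEMORY) are assumptions and not the patent's concrete script.

```python
# Minimal sketch of a one-command cluster deployment wrapper around docker-compose.
import os
import subprocess
import sys

def deploy(compose_file: str, workers: int, subnet: str, cores: int, memory: str) -> None:
    env = dict(os.environ,
               SUBNET=subnet,                    # container network subnet IP
               SPARK_WORKER_CORES=str(cores),    # cores allocated to each Spark Worker
               SPARK_WORKER_MEMORY=memory)       # memory allocated to each Spark Worker
    subprocess.run(
        ["docker-compose", "-f", compose_file, "up", "-d",
         "--scale", f"worker={workers}"],        # number of Worker container nodes
        env=env, check=True)

def teardown(compose_file: str) -> None:
    subprocess.run(["docker-compose", "-f", compose_file, "down"], check=True)

if __name__ == "__main__":
    # e.g. python deploy.py docker-compose.yml 3 172.20.0.0/16 4 8g
    deploy(sys.argv[1], int(sys.argv[2]), sys.argv[3], int(sys.argv[4]), sys.argv[5])
```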
S200, jointly storing and retrieving massive knowledge graph data with the graph database and a document database to realize the construction of a massive knowledge graph.
The core idea of the invention is divide and conquer: an original task is decomposed into a number of semantically equivalent subtasks, the subtasks are executed in parallel by dedicated worker threads, and the result of the original task is obtained by integrating the processing results of the subtasks.
Furthermore, to achieve better results, the joint graph database and document database storage scheme is transparent to users and adapts efficiently to different scenarios. For the characteristics of massive data and the business requirements, a heterogeneous database hybrid scheme is adopted, in which the data stored in the graph database and the document database are associated through the assigned node primary key IDs and relation primary key IDs, comprehensively exploiting the graph database's efficiency for multi-hop queries and the document database's speed for conditional queries and statistical analysis. The graph database is used for multi-hop relational queries and relational inference queries, and the document database is used for conditional filtering and statistical analysis.
In the embodiment of the present invention, as shown in fig. 3, the heterogeneous database storage and retrieval includes the following steps:
S201, analyzing the characteristics of massive knowledge graph data, modeling the edges of the same type between a pair of head and tail entities as an edge cluster, storing the edge cluster as a single edge in the graph database, establishing a cluster ID attribute on that edge to identify the cluster it belongs to, and storing the attribute information of the edges in the edge cluster in a document database;
First, the data characteristics of the knowledge graph are analyzed. It is common for many edges of the same type to exist between a pair of head and tail entities (for example, one account transfers money to another account thousands to tens of thousands of times over ten years). There is no need to store every one of these same-type (same-Label) edges in the graph database; instead, the edges of the same type between the pair of head and tail entities are modeled as an edge cluster, the edge cluster is stored as a single edge in the graph database, and a cluster ID attribute is established on that edge to identify the cluster it belongs to, which reduces the degree of super nodes in the graph database. The attribute information of the edges in the edge cluster is stored in the document database and identified by the cluster ID; the same applies to nodes. A minimal sketch of this write path is given below.
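The following Python sketch illustrates the split write path under stated assumptions: gremlinpython is used for the graph side and MongoDB (via pymongo) stands in for the document database, which the patent does not name; the vertex label, property keys and connection addresses are likewise illustrative.

```python
# Minimal sketch: collapse same-type edges between one head/tail pair into a single
# "edge cluster" edge in the graph database, while the per-edge attributes go to the
# document database keyed by the cluster ID.
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from pymongo import MongoClient

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")   # assumed address
g = traversal().withRemote(conn)
edge_attrs = MongoClient("mongodb://localhost:27017")["kg"]["edge_attrs"]

def store_edge_cluster(head_id, tail_id, label, cluster_id, records):
    # Graph database: one edge per (head, tail, label) cluster, carrying only the cluster ID.
    (g.V().has("account", "accountId", head_id).as_("h")
      .V().has("account", "accountId", tail_id)
      .addE(label).from_("h").property("clusterId", cluster_id)
      .iterate())
    # Document database: the detailed attributes of every edge in the cluster.
    edge_attrs.insert_many(
        [{"clusterId": cluster_id, "type": label, **r} for r in records])

store_edge_cluster(3, 5, "transfer", 1,
                   [{"time": "2013-03-03", "amount": 2500000},
                    {"time": "2014-08-31", "amount": 4000000}])
conn.close()
```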
S202, based on the above analysis, storing the basic network structure of the massive knowledge graph data, i.e. nodes and edges, in the graph database, and storing the attribute information of the nodes and relations in the document database, where a relation refers to an edge in the graph database, the attribute information of a node comprises its ID and type, and the attribute information of a relation comprises its cluster ID and type;
Only the basic network structure is stored in the graph database, for example: an edge with cluster ID 1 and type "transfer" between the node with ID 3 and type "account" and the node with ID 5 and type "account", which fully exploits the graph database's efficiency for multi-hop queries. The attribute information of nodes and relations is stored in the document database, which fully exploits the document database's strength in conditional queries and statistical analysis. A retrieval sketch combining the two databases is given below.
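As an illustration of the combined retrieval, under the same assumptions as the write-path sketch (gremlinpython plus MongoDB, with illustrative labels, keys and addresses), the query first walks the graph database to find the edge cluster between two accounts and then filters the detailed attributes in the document database:

```python
# Minimal sketch: structural lookup in the graph database, then conditional filtering
# of the edge-cluster attributes in the document database.
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from pymongo import MongoClient

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")   # assumed address
g = traversal().withRemote(conn)
edge_attrs = MongoClient("mongodb://localhost:27017")["kg"]["edge_attrs"]

def find_transfers(head_id, tail_id, t_from, t_to, amt_min, amt_max):
    # Graph database: locate the "transfer" edge cluster(s) between the two nodes.
    cluster_ids = (g.V().has("account", "accountId", head_id)
                    .outE("transfer")
                    .where(__.inV().has("account", "accountId", tail_id))
                    .values("clusterId").toList())
    # Document database: filter the individual transactions of those clusters by condition.
    return list(edge_attrs.find({
        "clusterId": {"$in": cluster_ids},
        "time": {"$gte": t_from, "$lte": t_to},
        "amount": {"$gte": amt_min, "$lte": amt_max}}))

# e.g. transactions between account 3 and account 5 in a given period and amount range
rows = find_transfers(3, 5, "2013-03-03", "2014-08-31", 2000000, 5000000)
conn.close()
```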
S203, automatically assigning node primary key IDs and relation primary key IDs according to the mapping between the database tables of structured/semi-structured data and the domain ontology, automatically identifying the data information that needs to be stored in both the graph database and the document database, including names, node types and relation types, and automatically partitioning the data for separate storage;
According to the mapping between the database tables of structured/semi-structured data and the domain ontology, node primary key IDs and relation primary key IDs are assigned automatically, the data information that needs to be stored in both the graph database and the document database, such as names, node types and relation types, is identified automatically, and the data is partitioned automatically for separate storage. Retrieval then follows the constraints and filtering rules. A sketch of this partitioning step is given below.
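The following Python sketch shows one possible form of this automatic partitioning; the record layout, the mapping structure and the ID-assignment rule are assumptions made for the example, not the patent's concrete implementation.

```python
# Minimal sketch: split a source record, mapped onto the domain ontology, into the part
# kept in the graph database (structure plus IDs/types) and the part kept in the
# document database (detailed attributes keyed by the same primary key).
from itertools import count

_node_ids = count(1)   # automatically assigned node primary key IDs

def partition_record(record: dict, mapping: dict):
    """`mapping` marks, per source column, whether it belongs to the graph side,
    the document side, or both (e.g. names and types are stored in both)."""
    node_id = next(_node_ids)
    graph_part = {"id": node_id, "type": mapping["node_type"]}
    doc_part = {"id": node_id, "type": mapping["node_type"]}
    for column, value in record.items():
        target = mapping["columns"].get(column, "document")
        if target in ("graph", "both"):
            graph_part[column] = value
        if target in ("document", "both"):
            doc_part[column] = value
    return graph_part, doc_part

graph_part, doc_part = partition_record(
    {"name": "Xiaoming", "balance": 1200.5},
    {"node_type": "account", "columns": {"name": "both", "balance": "document"}})
```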
Fig. 4 shows the experimental results of the heterogeneous database storage and retrieval in the embodiment of the invention. To test whether joint storage and retrieval with a graph database and a document database improves retrieval efficiency over massive graph data, data was constructed and experiments designed for four common query application scenarios. The first application scenario is: given the IDs and other information of two nodes in the graph, query the edges of a given Label between the two nodes that satisfy certain conditions, for example, query which transactions between the node with ID 4096 and the node with ID 4104 occurred between March 3, 2013 and August 31, 2014 with a transaction amount between 2,000,000 and 5,000,000 yuan. The second application scenario is: given the ID and other information of one node in the graph, query the edges of a given Label incident to or from that node that satisfy certain conditions, for example, query which transactions of the node with ID 4096 occurred between March 3, 2013 and August 31, 2014 with a transaction amount between 2,000,000 and 5,000,000 yuan. The third application scenario is: find the nodes that satisfy a condition on node attribute information, for example, query the nodes whose name is "Xiaoming". The fourth application scenario is: given attribute information that uniquely locates nodes in the graph, query the edges of a given Label between two nodes that satisfy certain conditions, for example, query the transactions between the node named "Xiaoming" and the node named "Xiaoqiang" that occurred between March 3, 2013 and August 31, 2014 with a transaction amount between 2,000,000 and 5,000,000 yuan.
For the first and second application scenarios, the constructed data set is: 402 nodes in total and 10,000,000 edges in total, all incident to the node with ID 4096 and divided into 420 clusters; there are 20 clusters between the node with ID 4096 and the node with ID 4104, each containing 100,000 edges, and one cluster between the node with ID 4096 and each of the other nodes, each containing 20,000 edges; each edge has the Label "transaction" and a "clusterId" attribute, and its attributes include transaction time, transaction amount and so on; the data set size is 3.5 GB. For the third application scenario, the constructed data set is: 10,000,000 nodes, each with a name attribute; the data set size is 1.6 GB. For the fourth application scenario, the constructed data set is: 10,000,000 nodes, each with a name attribute, and 2,000,000 edges in 2,000 edge clusters; there are 2,000 clusters between the node named "Xiaoming" and the node named "Xiaoqiang", each containing 1,000 edges; each edge has the Label "transaction" and a "clusterId" attribute, and its attributes include transaction time, transaction amount and so on; the data set size is 2.3 GB.
For the random data, because the scheme that queries only the graph database is compared against the scheme that stores and queries with the graph database and the document database jointly, the same random seeds are used when constructing the data for both schemes to ensure, for fairness, that the two constructed data sets are identical. For the indexes, both schemes build indexes on the fields to be queried. As shown in Fig. 4, across the four common application scenarios the query and retrieval speed of the hybrid database scheme is improved by a factor of up to 30 compared with storing and querying with the graph database alone.
The method provided by the embodiment of the invention effectively solves the problems that a stand-alone graph database cannot meet the storage and retrieval requirements of massive data and that deploying a distributed graph database is tedious. For cluster deployment, a plug-and-play distributed graph database cluster framework not yet realized in the open-source community is proposed for the first time. The heterogeneous database storage and retrieval scheme provided by the invention is of great significance for alleviating the retrieval performance degradation caused by super nodes in the graph database and for improving overall retrieval efficiency.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
According to another aspect of the embodiments of the present invention, there is also provided a knowledge graph constructing system for large-scale mass data, as shown in fig. 5, including:
the building module 100 is used for building each distributed cluster with a Master-Slave structure based on docker-compose and Apache Hadoop technology to provide distributed storage, indexing and computation for a graph database, where the distributed clusters comprise a distributed storage cluster, a distributed index cluster and a distributed computing cluster. The distributed storage cluster uses the HBase component, the distributed index cluster uses the Elasticsearch component, the distributed computing cluster uses the Spark component, and the graph database is a distributed graph database based on the open-source JanusGraph. For the distributed computing cluster, JanusGraph's Gremlin Server, the Spark Master, the YARN ResourceManager and the HDFS NameNode are deployed on the Master machine and the corresponding worker nodes are deployed on the Slave machines; the distributed storage cluster and the distributed index cluster are deployed in the same manner.
The building module 100 is specifically configured to: build the distributed clusters based on a docker-compose.yml file to provide distributed storage, indexing and computation for the graph database; specify the number of Worker container nodes in each distributed cluster through the scale parameter of docker-compose, and specify the relevant configuration items of the YAML file through the e parameter of docker-compose as environment variable parameters, the relevant configuration items including the container network subnet IP, the IPs of the Worker container nodes, and the CPU cores and memory resources allocated to the Spark Worker nodes; embed the docker-compose up command that deploys each distributed cluster into a Linux Shell script, exposing the scale and e parameters of the docker-compose command as parameters passed in by the user through the Shell script, so that the container network, IPs and resource allocation can be customized for different data volumes and application scenario requirements; and start and stop each distributed cluster with one command through docker-compose up and docker-compose down.
The construction module 200 is configured to jointly store and retrieve massive knowledge graph data with the graph database and the document database, realizing the construction of a massive knowledge graph.
The construction module 200 is specifically configured to: analyze the characteristics of massive knowledge graph data, model the edges of the same type between a pair of head and tail entities as an edge cluster, store the edge cluster as a single edge in the graph database, establish a cluster ID attribute on that edge to identify the cluster it belongs to, and store the attribute information of the edges in the edge cluster in a document database; based on the above analysis, store the basic network structure of the massive knowledge graph data, i.e. nodes and edges, in the graph database, and store the attribute information of the nodes and relations in the document database, where a relation refers to an edge in the graph database, the attribute information of a node comprises its ID and type, and the attribute information of a relation comprises its cluster ID and type; and automatically assign node primary key IDs and relation primary key IDs according to the mapping between the database tables of structured/semi-structured data and the domain ontology, automatically identify the data information that needs to be stored in both the graph database and the document database, including names, node types and relation types, and automatically partition the data for separate storage.
It should be noted that the system for constructing the knowledge graph for large-scale mass data and the method for constructing the knowledge graph for large-scale mass data belong to the same inventive concept, and detailed description is omitted.
The system provided by the embodiment of the invention effectively solves the problems that a stand-alone graph database cannot meet the storage and retrieval requirements of massive data and that deploying a distributed graph database is tedious. For cluster deployment, a plug-and-play distributed graph database cluster framework not yet realized in the open-source community is proposed for the first time. The heterogeneous database storage and retrieval scheme provided by the invention is of great significance for alleviating the retrieval performance degradation caused by super nodes in the graph database and for improving overall retrieval efficiency.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is intended to include such modifications and variations.

Claims (10)

1. A knowledge graph construction method for large-scale mass data is characterized by comprising the following steps:
S100, building each distributed cluster with a Master-Slave structure based on docker-compose and Apache Hadoop technology to provide distributed storage, indexing and computation for a graph database, where the distributed clusters comprise a distributed storage cluster, a distributed index cluster and a distributed computing cluster;
S200, jointly storing and retrieving massive knowledge graph data with the graph database and a document database to realize the construction of a massive knowledge graph.
2. The method of claim 1, wherein the distributed storage cluster uses an HBase component, the distributed index cluster uses an Elasticsearch component, the distributed computing cluster uses a Spark component, and the graph database is a distributed graph database based on the open-source JanusGraph.
3. The method of claim 2, wherein, for the distributed computing cluster, JanusGraph's Gremlin Server, the Spark Master, the YARN ResourceManager and the HDFS NameNode are deployed on the Master machine, the corresponding worker nodes are deployed on the Slave machines, and the distributed storage cluster and the distributed index cluster are deployed in the same manner.
4. The method of claim 3, wherein S100 comprises:
S101, building the distributed clusters based on a docker-compose.yml file to provide distributed storage, indexing and computation for the graph database;
S102, specifying the number of Worker container nodes in each distributed cluster through the scale parameter of docker-compose, and specifying the relevant configuration items of the YAML file through the e parameter of docker-compose as environment variable parameters, the relevant configuration items including the container network subnet IP, the IPs of the Worker container nodes, and the CPU cores and memory resources allocated to the Spark Worker nodes;
S103, embedding the docker-compose up command that deploys each distributed cluster into a Linux Shell script, exposing the scale and e parameters of the docker-compose command as parameters passed in by the user through the Shell script, so that the container network, IPs and resource allocation can be customized for different data volumes and application scenario requirements;
and S104, starting and stopping each distributed cluster with one command through docker-compose up and docker-compose down.
5. The method according to any one of claims 1-4, wherein S200 comprises:
S201, analyzing the characteristics of massive knowledge graph data, modeling the edges of the same type between a pair of head and tail entities as an edge cluster, storing the edge cluster as a single edge in the graph database, establishing a cluster ID attribute on that edge to identify the cluster it belongs to, and storing the attribute information of the edges in the edge cluster in a document database;
S202, based on the above analysis, storing the basic network structure of the massive knowledge graph data, i.e. nodes and edges, in the graph database, and storing the attribute information of the nodes and relations in the document database, where a relation refers to an edge in the graph database, the attribute information of a node comprises its ID and type, and the attribute information of a relation comprises its cluster ID and type;
S203, automatically assigning node primary key IDs and relation primary key IDs according to the mapping between the database tables of structured/semi-structured data and the domain ontology, automatically identifying the data information that needs to be stored in both the graph database and the document database, including names, node types and relation types, and automatically partitioning the data for separate storage.
6. A knowledge graph construction system for large-scale mass data is characterized by comprising:
the building module is used for building each distributed cluster with a Master-Slave structure based on docker-compose and Apache Hadoop technology to provide distributed storage, indexing and computation for a graph database, where the distributed clusters comprise a distributed storage cluster, a distributed index cluster and a distributed computing cluster;
and the construction module is used for jointly storing and retrieving massive knowledge graph data with the graph database and a document database to realize the construction of a massive knowledge graph.
7. The system of claim 6, wherein the distributed storage cluster uses an HBase component, the distributed index cluster uses an Elasticsearch component, the distributed computing cluster uses a Spark component, and the graph database is a distributed graph database based on the open-source JanusGraph.
8. The system of claim 7, wherein, for the distributed computing cluster, JanusGraph's Gremlin Server, the Spark Master, the YARN ResourceManager and the HDFS NameNode are deployed on the Master machine, the corresponding worker nodes are deployed on the Slave machines, and the distributed storage cluster and the distributed index cluster are deployed in the same manner.
9. The system according to claim 8, characterized in that the building module is specifically configured to:
build the distributed clusters based on a docker-compose.yml file to provide distributed storage, indexing and computation for the graph database;
specify the number of Worker container nodes in each distributed cluster through the scale parameter of docker-compose, and specify the relevant configuration items of the YAML file through the e parameter of docker-compose as environment variable parameters, the relevant configuration items including the container network subnet IP, the IPs of the Worker container nodes, and the CPU cores and memory resources allocated to the Spark Worker nodes;
embed the docker-compose up command that deploys each distributed cluster into a Linux Shell script, exposing the scale and e parameters of the docker-compose command as parameters passed in by the user through the Shell script, so that the container network, IPs and resource allocation can be customized for different data volumes and application scenario requirements;
and start and stop each distributed cluster with one command through docker-compose up and docker-compose down.
10. The system according to any one of claims 6 to 9, wherein the construction module is specifically configured to:
analyze the characteristics of massive knowledge graph data, model the edges of the same type between a pair of head and tail entities as an edge cluster, store the edge cluster as a single edge in the graph database, establish a cluster ID attribute on that edge to identify the cluster it belongs to, and store the attribute information of the edges in the edge cluster in a document database;
based on the above analysis, store the basic network structure of the massive knowledge graph data, i.e. nodes and edges, in the graph database, and store the attribute information of the nodes and relations in the document database, where a relation refers to an edge in the graph database, the attribute information of a node comprises its ID and type, and the attribute information of a relation comprises its cluster ID and type;
and automatically assign node primary key IDs and relation primary key IDs according to the mapping between the database tables of structured/semi-structured data and the domain ontology, automatically identify the data information that needs to be stored in both the graph database and the document database, including names, node types and relation types, and automatically partition the data for separate storage.
CN202110677218.0A 2021-06-18 2021-06-18 Knowledge graph construction method and system for large-scale mass data Pending CN114297173A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110677218.0A CN114297173A (en) 2021-06-18 2021-06-18 Knowledge graph construction method and system for large-scale mass data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110677218.0A CN114297173A (en) 2021-06-18 2021-06-18 Knowledge graph construction method and system for large-scale mass data

Publications (1)

Publication Number Publication Date
CN114297173A true CN114297173A (en) 2022-04-08

Family

ID=80964565

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110677218.0A Pending CN114297173A (en) 2021-06-18 2021-06-18 Knowledge graph construction method and system for large-scale mass data

Country Status (1)

Country Link
CN (1) CN114297173A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114844707A (en) * 2022-05-07 2022-08-02 南京南瑞信息通信科技有限公司 Graph database-based power grid network security analysis method and system
CN114844707B (en) * 2022-05-07 2024-04-02 南京南瑞信息通信科技有限公司 Power grid network security analysis method and system based on graph database
CN114817275A (en) * 2022-07-01 2022-07-29 国网智能电网研究院有限公司 Data reading and writing method, device and equipment of graph database and storage medium
CN116821155A (en) * 2023-06-27 2023-09-29 上海螣龙科技有限公司 Network asset data storage and query method, device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination