WO2021032002A1

WO2021032002A1 - Big data processing method based on heterogeneous distributed knowledge graph, device, and medium

Info

Publication number: WO2021032002A1
Application number: PCT/CN2020/109226
Authority: WO
Inventors: 宋群豪
Original assignee: 星环信息科技(上海)股份有限公司
Priority date: 2019-08-20
Filing date: 2020-08-14
Publication date: 2021-02-25
Also published as: CN110472068A; CN110472068B

Abstract

Disclosed are a big data processing method based on a heterogeneous distributed knowledge graph, a device, and a medium. The big data processing method based on the heterogeneous distributed knowledge graph comprises: constructing a node table and a relationship table of a heterogeneous distributed knowledge graph according to a data structure of the heterogeneous distributed knowledge graph; determining a graph computation scene according to a graph computation request, and determining the type and/or attribute of a node and the type and/or attribute of an edge required by the graph computation scene; extracting, from the node table and the relationship table, at least one computing node corresponding to the graph computation scene; filtering out, from the heterogeneous distributed knowledge graph, node data corresponding to the at least one computing node; and performing data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph.

Description

Big data processing method, equipment and medium based on heterogeneous distributed knowledge graph

This application claims priority to Chinese patent application No. 201910770620.6, filed with the Chinese Patent Office on August 20, 2019, the entire content of which is incorporated into this application by reference.

Technical field

The present disclosure relates to knowledge graph technology, for example, to a big data processing method, device and medium based on heterogeneous distributed knowledge graph.

Background art

A knowledge graph, also known as a scientific knowledge graph, is called knowledge domain visualization or knowledge domain mapping in the library and information industry. The life cycle of a knowledge graph consists of the following parts: data extraction, transformation and loading (Extract-Transform-Load, ETL), knowledge extraction, graph definition, data import, knowledge reasoning and knowledge application.

Knowledge graphs are divided into heterogeneous knowledge graphs and homogeneous knowledge graphs. The nodes and edges in a homogeneous knowledge graph all have the same type, that is, no type distinction is made. The nodes and edges in a heterogeneous knowledge graph can have different types and even different attributes. Heterogeneous knowledge graphs are described in the form of triples, quintuples, or seven-tuples. For example, a large-scale directed knowledge graph composed of "points and edges" is represented by "concepts, relationships, and rules". Describing the knowledge graph in tuple form can clearly express the relationships between concepts and concepts, between concepts and entities, between entities and entities, between entities and attributes, and between attributes and attribute values.

Although the tuple form brings many benefits, it is not concise enough when calculations are performed on a heterogeneous distributed knowledge graph: it contains a lot of redundant information, which makes it hard to filter out the node data of interest and greatly increases computational complexity.

Summary of the invention

The present disclosure provides a big data processing method, device, equipment and medium based on a heterogeneous distributed knowledge graph to provide an effective data processing scheme for a heterogeneous distributed knowledge graph.

A big data processing method based on a heterogeneous distributed knowledge graph is provided, including:

According to the data structure of the heterogeneous distributed knowledge graph, construct the node table and relation table of the heterogeneous distributed knowledge graph;

According to the graph calculation request, determine the graph calculation scenario, and determine at least one of the type and the attribute of the nodes required by the graph calculation scenario, and at least one of the type and the attribute of the edges required by the graph calculation scenario;

According to the at least one of the type and the attribute of the nodes and the at least one of the type and the attribute of the edges required by the graph calculation scenario, extract at least one computing node corresponding to the graph calculation scenario from the node table and the relationship table;

Filter out node data corresponding to at least one computing node from the heterogeneous distributed knowledge graph;

Perform data processing on the filtered node data to obtain data processing results based on heterogeneous distributed knowledge graphs;

Wherein the node table includes: the identifiers of multiple nodes, the types of multiple nodes, the attributes of multiple nodes, and the type set and attribute set of the nodes; the relationship table includes: the start node identifiers of multiple edges, the target node identifiers of multiple edges, the types of multiple edges, the attributes of multiple edges, and the type set and attribute set of the edges.

A big data processing device based on heterogeneous distributed knowledge graphs is also provided, including:

The building module is set to construct the node table and relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph;

The determining module is configured to determine the graph calculation scenario according to the graph calculation request, and determine at least one of the type and the attribute of the nodes required by the graph calculation scenario, and at least one of the type and the attribute of the edges required by the graph calculation scenario;

The computing node acquisition module is configured to extract, according to the at least one of the type and the attribute of the nodes and the at least one of the type and the attribute of the edges required by the graph calculation scenario, at least one computing node corresponding to the graph calculation scenario from the node table and the relationship table;

A filtering module, configured to filter out node data corresponding to at least one computing node from the heterogeneous distributed knowledge graph;

The calculation module is set to perform data processing on the filtered node data to obtain the data processing result based on the heterogeneous distributed knowledge graph;

A computer device is also provided, including a processor and a memory, where the memory is used to store instructions, and when the instructions are executed, the processor performs the following operations:

According to the graph calculation request, determine the graph calculation scenario, and determine at least one of the type and the attribute of the nodes required by the graph calculation scenario, and at least one of the type and the attribute of the edges required by the graph calculation scenario;

A storage medium is also provided, the storage medium is used to store instructions, and the instructions are used to execute:

Description of the drawings

FIG. 1 is a flowchart of a big data processing method based on a heterogeneous distributed knowledge graph provided by Embodiment 1 of the present invention;

FIG. 2 is a flowchart of a big data processing method based on a heterogeneous distributed knowledge graph provided by Embodiment 2 of the present invention;

FIG. 3 is a schematic structural diagram of a big data processing device based on a heterogeneous distributed knowledge graph provided by Embodiment 3 of the present invention;

FIG. 4 is a schematic structural diagram of a computer device provided by Embodiment 4 of the present invention.

Detailed description

The present disclosure will be described below with reference to the drawings and embodiments. For ease of description, only a part of the structure related to the present disclosure is shown in the accompanying drawings instead of all of the structure.

Embodiment 1

FIG. 1 is a flowchart of a big data processing method based on a heterogeneous distributed knowledge graph provided by Embodiment 1 of the present invention. This embodiment is applicable to performing data processing on a heterogeneous distributed knowledge graph. The method can be executed by a big data processing device based on a heterogeneous distributed knowledge graph. The device can be composed of hardware and/or software and integrated in a computer device. The software can be written in the Scala programming language or the Java programming language.

In one embodiment, heterogeneous is the opposite of homogeneous. The nodes and edges in a homogeneous knowledge graph all have the same type, that is, no type distinction is made, while the nodes and edges in a heterogeneous knowledge graph have different types. For example, each node in a homogeneous knowledge graph represents a person, and each relationship between people represents an acquaintance relationship. The nodes in a heterogeneous knowledge graph can represent people, accounts, companies, etc.; the relationship between a person and an account is ownership, and the relationship between a person and a company is employment. Moreover, each type of node and edge has its own attributes. The heterogeneous distributed knowledge graph in this embodiment refers to a heterogeneous knowledge graph stored on multiple devices in a distributed manner. Such a knowledge graph contains a huge amount of data with different types and attributes, and there has not yet been an effective data processing method for it. Based on this, with reference to FIG. 1, the big data processing method provided in this embodiment includes the following operations.

Step 110: Construct a node table and relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph.

In this embodiment, the heterogeneous distributed knowledge graph (hereinafter referred to as the graph) corresponds to a node table and a relationship table. The node table includes: the types and attributes of multiple nodes, and the relationship table includes: the types and attributes of multiple edges. Among them, different types of nodes or edges have different attributes.

Exemplarily, the node table includes: the identifiers of multiple nodes, the types of multiple nodes, the attributes of multiple nodes, and the type set and attribute set of the nodes. The identifier of a node serves as its unique identifier and can be a character string or a number. The type of a node is an essential element in a heterogeneous graph and is a string. The attributes of a node can be of a dictionary (map) type, map<string, string>, for example, map<gender->man, age->20>. Using a map to store data of different formats (including numbers and strings) keeps the data format uniform while remaining very flexible; when nodes with particular attributes are processed later, the methods provided by the map can be used for targeted extraction. The node type set and attribute set indicate which types of nodes exist in the graph and which attributes of each type of node can be used for calculation. The type set and attribute set of the nodes can be stored in a hidden column of the node table, and there is only one copy of them for the graph. Based on this, the serialized information can be recorded in the meta information of the Schema column, so this data does not need to be carried repeatedly in each row, which saves space.

The relationship table includes: the start node identifiers of multiple edges, the target node identifiers of multiple edges, the types of multiple edges, the attributes of multiple edges, and the type set and attribute set of the edges. The start node identifier and the target node identifier can be character strings or numbers. The type of an edge is an essential element in a heterogeneous graph and is a string. The attributes of an edge can be of a dictionary (map) type, map<string, string>, for example, map<gender->man, age->20, province->M, city->P>. Using a map to store data of different formats (including numbers and strings) keeps the data format uniform while remaining very flexible; when edges with particular attributes are processed later, the methods provided by the map can be used for targeted extraction. The edge type set and attribute set indicate which types of edges exist in the graph and which attributes of each type of edge can be used for calculation. The type set and attribute set of the edges can be stored in a hidden column of the relationship table, and there is only one copy of them for the graph. Based on this, the serialized information can be recorded in the meta information of the Schema column, so this data does not need to be carried repeatedly in each row, which saves space.
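The two tables described above can be sketched as follows. This is a minimal illustration in Python, not the implementation of the disclosure: the names `NodeRow`, `EdgeRow`, `Table` and `add_row` are hypothetical, and the table-level `type_set` and `attr_set` fields stand in for the hidden column or Schema meta information, which is kept once per table rather than once per row.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class NodeRow:
    node_id: str            # unique identifier; numbers can be stored as strings
    type: str               # node type, mandatory in a heterogeneous graph
    attrs: Dict[str, str]   # map<string, string>


@dataclass
class EdgeRow:
    src_id: str             # start node identifier
    dst_id: str             # target node identifier
    type: str               # edge type
    attrs: Dict[str, str]   # map<string, string>


@dataclass
class Table:
    rows: List = field(default_factory=list)
    # Table-level metadata, kept once per table instead of once per row.
    type_set: Set[str] = field(default_factory=set)
    attr_set: Dict[str, Set[str]] = field(default_factory=dict)

    def add_row(self, row) -> None:
        """Append a row and keep the type set and attribute set up to date."""
        self.rows.append(row)
        self.type_set.add(row.type)
        self.attr_set.setdefault(row.type, set()).update(row.attrs)


node_table = Table()
node_table.add_row(NodeRow("001", "person", {"gender": "man", "age": "20"}))
node_table.add_row(NodeRow("abc.com", "website", {}))

relation_table = Table()
relation_table.add_row(EdgeRow("001", "abc.com", "click", {}))
```

Keeping `attr_set` keyed by type records which attributes each type of node or edge offers for calculation, without repeating that information in every row.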

Optionally, constructing the node table and relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph may include: loading the node data source and the edge data source used to construct the heterogeneous distributed knowledge graph; identifying the data structure of the node data source and the data structure of the edge data source; constructing the node table according to the data structure of the node data source, and constructing the relationship table according to the data structure of the edge data source; and, according to the node table and the relationship table, reading the data in the node data source and the edge data source into the graph database corresponding to the heterogeneous distributed knowledge graph.

Optionally, constructing the node table according to the data structure of the node data source may include: generating the identifiers of multiple nodes in the node table according to the number column in the node data source; generating the types of multiple nodes in the node table according to the type field in the node data source; generating the attributes of multiple nodes in the node table according to the attribute fields in the node data source; and summarizing the types and the attributes of the multiple nodes separately to generate the type set and attribute set of the nodes.

Optionally, constructing the relationship table according to the data structure of the edge data source may include: generating the start node identifiers of multiple edges in the relationship table according to the number column corresponding to the start node in the edge data source; generating the target node identifiers of multiple edges in the relationship table according to the number column corresponding to the target node in the edge data source; generating the types of multiple edges in the relationship table according to the type field in the edge data source; generating the attributes of multiple edges in the relationship table according to the attribute fields in the edge data source; and summarizing the types and the attributes of the multiple edges separately to generate the type set and attribute set of the edges.
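The construction steps above can be illustrated with a short Python sketch. It assumes that each data source has already been loaded as a list of records, and the function name `build_table`, the column names (`id`, `src`, `dst`, `type`) and the sample records are hypothetical. Every column that is neither a number column nor the type field is treated as an attribute field.

```python
from typing import Any, Dict, Iterable, List, Set, Tuple

Record = Dict[str, Any]


def build_table(records: Iterable[Record], key_cols: List[str], type_col: str = "type"
                ) -> Tuple[List[Record], Set[str], Dict[str, Set[str]]]:
    """Turn raw source records into table rows plus the type set and attribute set."""
    rows: List[Record] = []
    type_set: Set[str] = set()
    attr_set: Dict[str, Set[str]] = {}
    for rec in records:
        # Attribute fields: everything except the number columns and the type field.
        attrs = {k: str(v) for k, v in rec.items()
                 if k not in key_cols and k != type_col and v is not None}
        row: Record = {k: str(rec[k]) for k in key_cols}
        row["type"] = rec[type_col]
        row["attrs"] = attrs
        rows.append(row)
        # Summarize types and attributes into the table-level sets.
        type_set.add(rec[type_col])
        attr_set.setdefault(rec[type_col], set()).update(attrs)
    return rows, type_set, attr_set


# Hypothetical sample data sources.
node_source = [
    {"id": "001", "type": "person", "gender": "man", "age": 20},
    {"id": "abc.com", "type": "website"},
    {"id": "bcd.com", "type": "website"},
]
edge_source = [
    {"src": "001", "dst": "abc.com", "type": "click", "count": 3},
    {"src": "abc.com", "dst": "bcd.com", "type": "link"},
]

node_rows, node_types, node_attrs = build_table(node_source, ["id"])
edge_rows, edge_types, edge_attrs = build_table(edge_source, ["src", "dst"])
```

The same helper serves both tables because the only difference is which number columns act as identifiers: one for nodes, two (start and target) for edges.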

Step 120: According to the graph calculation request, determine the graph calculation scenario, determine the type and/or attribute of the node required by the graph calculation scenario, and the type and/or attribute of the edge.

When data processing needs to be performed on the graph, it is not necessary to extract all the node data; it is only necessary to determine the graph calculation scenario according to the graph calculation request and to extract the node data required by that scenario, which reduces the amount of calculation. As described for S110, this embodiment uses a node table and a relationship table to represent all nodes and edges in the graph. In other words, the node table and the relationship table constitute an index of the node data in the graph. Based on this, the type and/or attribute of the node required by the graph calculation scenario is determined from the node table, and the type and/or attribute of the edge required by the graph calculation scenario is determined from the relationship table.

Optionally, the type and/or attribute of the node required by the graph calculation scenario is determined from the node type set and attribute set, and the type and/or attribute of the edge required by the graph calculation scenario is determined from the edge type set and attribute set.
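One way to picture this step is a lookup from the requested scenario to the types it needs, checked against the type sets of the two tables. The sketch below is only an assumption about how such a lookup could be organized: the `SCENARIOS` registry, the request format and the function `resolve_requirements` are hypothetical, and the PageRank entry mirrors the website, link and click types used in the application scenario later in this embodiment.

```python
from typing import Dict, Set, Tuple

# Hypothetical registry: which node and edge types each scenario needs.
SCENARIOS: Dict[str, Dict[str, Set[str]]] = {
    "pagerank": {"node_types": {"website"}, "edge_types": {"link", "click"}},
}


def resolve_requirements(request: Dict[str, str],
                         node_type_set: Set[str],
                         edge_type_set: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Map a graph calculation request to the node and edge types it requires."""
    required = SCENARIOS[request["scenario"]]
    # The type sets act as an index: fail early if the graph lacks a required type.
    missing = (required["node_types"] - node_type_set) | (required["edge_types"] - edge_type_set)
    if missing:
        raise ValueError(f"graph does not contain required types: {sorted(missing)}")
    return required["node_types"], required["edge_types"]


node_types_needed, edge_types_needed = resolve_requirements(
    {"scenario": "pagerank"},
    node_type_set={"person", "website"},
    edge_type_set={"click", "link"},
)
```

Because only the table-level type sets are consulted, this check does not touch any node data.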

Step 130: According to the type and/or attribute of the node and the type and/or attribute of the edge required for the graph calculation scene, extract at least one calculation node corresponding to the graph calculation scene from the node table and the relation table.

Optionally, according to the type and/or attribute of the node and the type and/or attribute of the edge required by the graph calculation scenario, the node identifiers are looked up in the node table, and the start node identifiers and target node identifiers are looked up in the relationship table; at least one computing node is then determined according to the node identifiers, the start node identifiers and the target node identifiers.
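The lookup described above can be sketched as follows, assuming the tables are held as lists of dictionaries. The function name `extract_compute_nodes` and the row layout are hypothetical; the identifiers are the ones used in the PageRank application scenario of this embodiment.

```python
from typing import Dict, List, Set


def extract_compute_nodes(node_rows: List[Dict], edge_rows: List[Dict],
                          node_types: Set[str], edge_types: Set[str]) -> Set[str]:
    """Collect the identifiers of all nodes needed by the graph calculation scenario."""
    # Node identifiers whose type matches, looked up in the node table.
    ids = {n["id"] for n in node_rows if n["type"] in node_types}
    # Start and target node identifiers of matching edges, looked up in the relationship table.
    for e in edge_rows:
        if e["type"] in edge_types:
            ids.update((e["src"], e["dst"]))
    return ids


node_rows = [
    {"id": "001", "type": "person"},
    {"id": "abc.com", "type": "website"},
    {"id": "bcd.com", "type": "website"},
]
edge_rows = [
    {"src": "001", "dst": "abc.com", "type": "click"},
    {"src": "abc.com", "dst": "bcd.com", "type": "link"},
]

compute_nodes = extract_compute_nodes(node_rows, edge_rows, {"website"}, {"link", "click"})
```

Note that the person node 001 is selected through the click edge even though its own type was not requested.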

Step 140: Filter out node data corresponding to at least one computing node from the heterogeneous distributed knowledge graph.

In an optional implementation manner, after at least one computing node is determined, the corresponding node data is filtered from the heterogeneous distributed knowledge graph according to the identification of the at least one computing node, the starting node identification, and the target node identification.

In another optional implementation, the node data carries its own node type and attributes, as well as the types and attributes of its corresponding edges. In that case, all the node data is searched for node data that carries the type and/or attribute of the node and the type and/or attribute of the edge required for the calculation.

In this embodiment, a graph database is used to store the graph. Graph databases generally use the property graph as the basic representation. For example, in the Neo4j graph database, nodes and relationships can contain attributes, which makes it easier to express real business scenarios and better suited to storing heterogeneous distributed knowledge graphs. In an embodiment, the node data corresponding to at least one computing node is filtered from the graph database corresponding to the heterogeneous distributed knowledge graph.

Step 150: Perform data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph.

After the node data is filtered out, the node data is calculated based on the current calculation scenario. The following uses an application scenario for calculating page rank (PageRank) to illustrate the calculation method provided in this embodiment. Assuming that there is a graph of the relationship between Internet users and webpage visits and the link relationship between webpages, PageRank statistics need to be performed on all webpages. Table 1 is the node table, and Table 2 is the relationship table.

Table 1 Node table

Table 2 Relationship table

In the schema of the node table and the relationship table, the website type, click type, and link type required to calculate PageRank are found. The identifiers of the website-type nodes found in the node table are abc.com and bcd.com. In the relationship table, the start node identifier abc.com and the target node identifier bcd.com of the link edge are found, as well as the start node identifier 001 and the target node identifier abc.com of the click edge. The node data corresponding to 001, abc.com, and bcd.com is then filtered from the graph database corresponding to the graph. PageRank is calculated on the node data of the website type to obtain the PageRank value of each website. For example, the PageRank value of abc.com is 1, and the PageRank value of bcd.com is 2.

Optionally, the data processing result is added to the attribute set in the node table or the attribute set in the relationship table; and/or the data processing result is added to the attributes of the corresponding node in the node table or the attributes of the corresponding edge in the relationship table. Following the above application scenario, Table 3 shows the new node table.

Table 3 New node table

It can be seen that the PageRank attribute is added to the attributes of multiple nodes, and the PageRank attribute is added to the attribute set of the nodes.
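The calculation and write-back can be sketched in Python as follows. This is a textbook power-iteration PageRank run on the filtered website nodes, not necessarily the calculation used in this embodiment; the function name, the damping factor, the iteration count and the row layout are assumptions, and only the link edge between the two websites is used. The scores it produces are normalized to sum to 1, so they are not meant to reproduce the illustrative values 1 and 2 given above.

```python
from typing import Dict, List, Tuple


def pagerank(nodes: List[str], links: List[Tuple[str, str]],
             damping: float = 0.85, iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank; the ranks sum to 1. Link endpoints must be in `nodes`."""
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    out_links = {v: [dst for src, dst in links if src == v] for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A node without outgoing links spreads its rank over all nodes.
            targets = out_links[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Filtered website nodes and the link edge between them (identifiers from the text).
node_rows = [
    {"id": "abc.com", "type": "website", "attrs": {}},
    {"id": "bcd.com", "type": "website", "attrs": {}},
]
links = [("abc.com", "bcd.com")]
node_attr_set = {"website": set()}

ranks = pagerank([n["id"] for n in node_rows], links)

# Write the result back: per-node attribute plus the table-level attribute set.
for n in node_rows:
    n["attrs"]["PageRank"] = f"{ranks[n['id']]:.4f}"
    node_attr_set[n["type"]].add("PageRank")
```

Storing the score as a string keeps the attribute map in the map<string, string> form described for the node table.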

In the embodiment of the present invention, the heterogeneous distributed knowledge graph is represented by a node table and a relationship table, from which the type and/or attribute of the node and the type and/or attribute of the edge required by the graph calculation scenario can be determined. The corresponding node data is then filtered from the heterogeneous distributed knowledge graph according to the type and/or attribute of the node and the type and/or attribute of the edge required by the graph calculation scenario, and data processing is performed on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph, without performing data processing on the entire graph. It can be seen that this embodiment provides an efficient data processing method for heterogeneous distributed knowledge graphs based on the node table and the relationship table.

Since the heterogeneous distributed knowledge graph is stored on multiple devices in a distributed manner, the graph needs to be processed in a distributed manner. In an optional embodiment, filtering out the node data corresponding to the at least one computing node from the heterogeneous distributed knowledge graph includes: filtering out, from each device, the node data corresponding to the at least one computing node. Performing data processing on the filtered node data to obtain the data processing result based on the heterogeneous distributed knowledge graph includes: performing data processing on the node data corresponding to the at least one computing node filtered from each device; and summarizing the data processing results of the devices to obtain the data processing result based on the heterogeneous distributed knowledge graph.

In an embodiment, the node table further includes the device on which each of the multiple nodes is stored. In the node table, the devices storing the nodes required for the calculation are determined according to the type and/or attribute of the node and the type and/or attribute of the edge required by the graph calculation scenario, and the node data is filtered from the corresponding devices. The filtered node data is calculated separately on each device; optionally, a library such as NetworkX is used for the calculation on each device. The calculation results of the devices are then summarized to obtain the final distributed calculation result.
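The filter, per-device calculation and summary steps can be sketched as below. This is a simplified single-process illustration under stated assumptions: the `device` column values (`dev-a`, `dev-b`) are hypothetical, and a per-type node count stands in for whatever calculation actually runs on each device (the text mentions a library such as NetworkX, which is not used here so that the sketch has no dependencies).

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set


def partition_by_device(node_rows: List[Dict], wanted_ids: Set[str]) -> Dict[str, List[Dict]]:
    """Group the required nodes by the device recorded in the node table."""
    parts: Dict[str, List[Dict]] = defaultdict(list)
    for n in node_rows:
        if n["id"] in wanted_ids:
            parts[n["device"]].append(n)
    return parts


def local_compute(rows: List[Dict]) -> Counter:
    """Placeholder per-device calculation: count the filtered nodes by type."""
    return Counter(r["type"] for r in rows)


def summarize(partials: Iterable[Counter]) -> Counter:
    """Merge the per-device results into the final result."""
    total: Counter = Counter()
    for p in partials:
        total.update(p)
    return total


node_rows = [
    {"id": "001", "type": "person", "device": "dev-a"},
    {"id": "abc.com", "type": "website", "device": "dev-a"},
    {"id": "bcd.com", "type": "website", "device": "dev-b"},
    {"id": "002", "type": "person", "device": "dev-b"},
]

parts = partition_by_device(node_rows, {"001", "abc.com", "bcd.com"})
result = summarize(local_compute(rows) for rows in parts.values())
```

In a real deployment each call to `local_compute` would run on the device that stores the rows, and only the small partial results would be sent for summarizing.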

Embodiment 2

This embodiment is described on the basis of the foregoing embodiment. Optionally, before obtaining the node table and the relationship table of the heterogeneous distributed knowledge graph, the process of constructing the heterogeneous distributed knowledge graph is also included. This embodiment is applicable to the construction of a heterogeneous distributed knowledge graph in the case of multiple data sources. Among them, multiple data sources include traditional relational databases (such as Oracle, MySQL, etc.), distributed relational databases (such as Hive, etc.), distributed non-relational databases (such as HBase, ElasticSearch, etc.), TXT files, and CSV files.

The data needed to construct a knowledge graph often comes from multiple different data sources, including both structured relational data and unstructured text data. Whether the data from different data sources is unified into one complete data source or a separate import script is written for each data source, a lot of resources and time are required, so the entire data ETL process consumes a great deal of time and manpower. To overcome these deficiencies of the related technology, with reference to FIG. 2, the method provided by the embodiment of the present invention includes the following operations.

Step 210: Load a node data source and an edge data source used to construct a heterogeneous distributed knowledge graph.

In this embodiment, both the node data source and the edge data source can be structured databases, such as traditional relational databases (such as Oracle, MySQL, etc.), distributed relational databases (such as Hive, etc.), and distributed non-relational databases (such as HBase, ElasticSearch, etc.); they can also include unstructured text, such as TXT files and CSV files.

In this embodiment, an abstract data source interface is designed so that the computing device of the heterogeneous distributed knowledge graph can connect seamlessly to multiple data sources. After different data sources are connected through the data source interface, they are operated on in a unified way; there is no need to import the data from different data sources into a unified data source. The data source interface includes a data structure and definition interface and a data reading interface. In addition, it may also include at least one of a data status checking interface and a data writing interface. The data structure and definition interface encapsulates the data structure identification method and the graph interface definition method. The above-mentioned interfaces are all application programming interfaces (Application Programming Interface, API).
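The idea of one abstract interface with per-source adapters can be sketched in Python as follows. The class and method names (`DataSource`, `identify_structure`, `read`, `check_status`, `write`, `CsvTextSource`) are hypothetical and are not the API of the disclosure; the CSV adapter reads from an in-memory string so that the sketch stays self-contained.

```python
import csv
import io
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List


class DataSource(ABC):
    """Hypothetical unified interface that every data source adapter implements."""

    @abstractmethod
    def identify_structure(self) -> List[str]:
        """Data structure and definition interface: return the column names."""

    @abstractmethod
    def read(self) -> Iterator[Dict[str, str]]:
        """Data reading interface: yield one record per row."""

    def check_status(self) -> bool:
        """Optional data status checking interface."""
        return True

    def write(self, row: int, column: str, value: str) -> None:
        """Optional data writing interface."""
        raise NotImplementedError


class CsvTextSource(DataSource):
    """Adapter for CSV content held in memory (stands in for a CSV file)."""

    def __init__(self, text: str) -> None:
        reader = csv.DictReader(io.StringIO(text))
        self._rows = list(reader)
        self._columns = list(reader.fieldnames or [])

    def identify_structure(self) -> List[str]:
        return self._columns

    def read(self) -> Iterator[Dict[str, str]]:
        return iter(self._rows)

    def write(self, row: int, column: str, value: str) -> None:
        self._rows[row][column] = value


source = CsvTextSource("id,type,age\n001,person,20\n")
columns = source.identify_structure()
records = list(source.read())
```

An adapter for a relational database or another store would implement the same four methods, so the code that builds the node table and the relationship table never needs to know which kind of source it is reading.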

The computing device of the heterogeneous distributed knowledge graph can load the corresponding node data source and edge data source through the storage path of the node data source and the storage path of the edge data source.

Step 220: Identify the data structure of the node data source and the data structure of the edge data source.

In this embodiment, the data structure and definition interface are called to identify the data structure of the node data source and the edge data source.

In one embodiment, the data structure of the node data source includes a number column, type field, attribute field, and node data, and the data structure of the edge data source includes a number column corresponding to the start node, a number column corresponding to the target node, a type field, and Attribute field.

Step 230: Construct a node table according to the data structure of the node data source; and construct a relationship table according to the data structure of the edge data source.

Optionally, the data structure and definition interface is called to construct the node table according to the data structure of the node data source and to construct the relationship table according to the data structure of the edge data source.

When constructing the node table, generate the identification of multiple nodes in the node table according to the number column in the node data source; generate the types of multiple nodes in the node table according to the type field in the node data source; according to the attributes in the node data source Field, generate the attributes of multiple nodes in the node table; summarize the types and attributes of multiple nodes respectively, and generate the type set and attribute set of the node.

When constructing the relationship table, according to the number column corresponding to the starting node in the edge data source, the starting node identification of multiple edges in the relationship table is generated; according to the number column corresponding to the target node in the edge data source, the multiple The target node identifier of the edge; according to the type field in the edge data source, the types of multiple edges in the relational table are generated; according to the attribute fields in the edge data source, the attributes of multiple edges in the relational table are generated; for multiple edges The types and attributes are summarized separately to generate the edge type set and attribute set.

Step 240: According to the node table and the relation table, read the data in the node data source and the edge data source into the graph database corresponding to the heterogeneous distributed knowledge graph.

Optionally, call the data reading interface in the data source interface, and read the data in the node data source and the edge data source into the graph database corresponding to the heterogeneous distributed knowledge graph according to the node table and the relation table.

The data reading interface encapsulates a method for reading data into the graph database. According to the description of operation S130, the node table and the relationship table include the multiple nodes with their types and attributes, the types and attributes of the edges, and the connection relationships between nodes and edges. Based on this, the data required by the graph database can be determined according to the node table and the relationship table, and the node data can be read into the graph database.

Step 250: Obtain the node table and the relationship table of the heterogeneous distributed knowledge graph.

Step 260: According to the graph calculation request, determine the graph calculation scenario, determine the type and/or attribute of the node required by the graph calculation scenario, and the type and/or attribute of the edge.

Step 270: According to the type and/or attribute of the node required by the graph calculation scene, and the type and/or attribute of the edge, the corresponding node data is filtered from the heterogeneous distributed knowledge graph.

Step 280: Perform data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph.

In the embodiment of the present invention, the data structure of each data source can be identified through the data structure and definition interface in the data source interface, so there is no need to import different data sources into a unified database, filter data, write import scripts, or use an import tool. By calling the data reading interface in the data source interface, the data in the data sources is read into the graph database according to the node table and the relationship table, so the data is read in automatically, without business knowledge and without the collaboration of business experts and engineering experts. This embodiment only needs to identify the data structure through the data source interface and then automatically import data according to the node table and the relationship table, so that the data import process saves a lot of manpower and resources and reduces costs when different data sources are connected.

On the basis of the foregoing embodiment, the data source interface further includes: a data status checking interface and/or a data writing interface. Based on this, the calculation method based on the heterogeneous distributed knowledge graph further includes at least one of the following two implementation manners.

The first implementation mode: after the node data source and the edge data source used to construct the heterogeneous distributed knowledge graph are loaded, the data status checking interface is called to check whether the working status of the node data source and the edge data source is normal, and data sources whose working status is abnormal are reported to the user.

Optionally, whether the working status of the node data source and the edge data source is normal is checked when the data structures of the node data source and the edge data source are identified and when the data is read into the graph database; the working status of the node data source and the edge data source can also be checked periodically. If a data source is online and the user has access rights, the data source status is normal; if the data source is offline or the user does not have access rights, the data source status is abnormal. This effectively ensures the stability and security of the graph construction process.

The second implementation mode: after calling the data reading interface in the data source interface and reading, according to the node table and the relationship table, the data in the node data source and the edge data source into the graph database corresponding to the heterogeneous distributed knowledge graph, the data writing interface is called to write the data in the graph database back to the corresponding data source. In an embodiment, the data in the graph database carries its source data source and its position in the source data source, such as the row and column. Therefore, the data writing interface is called to write the data back to the corresponding position of the corresponding data source, according to the source data source of the data in the graph database and its position in the source data source.

This embodiment implements, through the data writing interface, the function of writing the data in the knowledge graph back into the data source, which facilitates restoring the knowledge graph and mutual verification between the data source and the knowledge graph.
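As a rough illustration of this write-back, the sketch below assumes that every value in the graph database is stored together with its source data source name, row and column. The `GraphValue` record and the use of in-memory row lists in place of real data sources are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class GraphValue:
    """A value in the graph database plus its provenance (hypothetical layout)."""
    value: object
    source: str   # name of the source data source
    row: int      # row position in that source
    column: str   # column position in that source


def write_back(values, sources):
    """Write each graph value to its recorded position in its source.

    `sources` maps a source name to a list of row dicts, standing in for
    real data sources reached through a data writing interface.
    """
    for v in values:
        sources[v.source][v.row][v.column] = v.value
    return sources
```

Because each value carries its own position, the write-back needs no extra mapping between the graph and the sources.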

Example three

FIG. 3 is a schematic structural diagram of a big data processing device based on a heterogeneous distributed knowledge graph provided by Embodiment 3 of the present invention. This embodiment is suitable for the case of performing big data processing on a heterogeneous distributed knowledge graph. With reference to FIG. 3, the device includes: a construction module 31, a determination module 32, a computing node acquisition module 33, a filtering module 34 and a calculation module 35.

The construction module 31 is configured to construct the node table and the relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph;

The determining module 32 is configured to determine the graph calculation scenario according to the graph calculation request, and determine the type and/or attribute of the node required for the graph calculation scenario, and the type and/or attribute of the edge;

The computing node obtaining module 33 is configured to extract, from the node table and the relationship table, at least one computing node corresponding to the graph calculation scenario according to the type and/or attribute of the node and the type and/or attribute of the edge required for the graph calculation scenario;

The filtering module 34 is configured to filter out node data corresponding to at least one computing node from the heterogeneous distributed knowledge graph;

The calculation module 35 is configured to perform data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph.

In an embodiment, the node table includes: identifiers of multiple nodes, types of multiple nodes, attributes of multiple nodes, a type set of the nodes and an attribute set of the nodes; the relationship table includes: start node identifiers of multiple edges, target node identifiers of multiple edges, types of multiple edges, attributes of multiple edges, a type set of the edges and an attribute set of the edges.

Optionally, the heterogeneous distributed knowledge graph is stored in multiple devices in a distributed manner. When filtering out the node data corresponding to at least one computing node from the heterogeneous distributed knowledge graph, the filtering module 34 is configured to filter out the node data corresponding to at least one computing node from each device. When performing data processing on the filtered node data to obtain the data processing result based on the heterogeneous distributed knowledge graph, the calculation module 35 is configured to: perform data processing on the corresponding node data filtered from each device; and summarize the data processing results of each device to obtain the data processing result based on the heterogeneous distributed knowledge graph.
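The per-device filtering, per-device processing and final summarizing can be sketched as follows. The sketch is illustrative only: each device is modeled as a dictionary from node identifier to node data, and the processing (an average of a hypothetical `amount` attribute) is chosen simply because it can be combined from partial results.

```python
def filter_on_device(device_nodes, computing_node_ids):
    """Keep only node data whose identifier is one of the computing nodes."""
    return {nid: data for nid, data in device_nodes.items()
            if nid in computing_node_ids}


def process_on_device(node_data):
    """Per-device processing: partial sum and count of the 'amount' attribute."""
    total = sum(d["amount"] for d in node_data.values())
    return total, len(node_data)


def summarize(partials):
    """Combine the per-device partial results into one overall average."""
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else 0.0


def run(devices, computing_node_ids):
    partials = [process_on_device(filter_on_device(d, computing_node_ids))
                for d in devices]
    return summarize(partials)
```

Only the small partial results leave each device, so the unfiltered node data never has to be moved to one place.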

Optionally, the device further includes an adding module configured to, after data processing is performed on the filtered node data to obtain the data processing result based on the heterogeneous distributed knowledge graph, add the data processing result to the attribute set in the node table or the attribute set in the relationship table; and/or add the data processing result to the attributes of the node in the node table corresponding to the data processing result, or to the attributes of the edge in the relationship table corresponding to the data processing result.
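A minimal sketch of the node-side case is given below, assuming the node table is held as a dictionary with an `attributes` mapping and an `attribute_set`; this layout and the attribute name `risk_score` used in the test are hypothetical. The edge-side case would follow the same pattern against the relationship table.

```python
def add_result_to_node_table(node_table, attr_name, results):
    """Record a per-node data processing result as a new node attribute.

    node_table: {"attributes": {node_id: {...}}, "attribute_set": set()}
    results:    {node_id: value} produced by the data processing step
    """
    # Register the new attribute name in the table-level attribute set.
    node_table["attribute_set"].add(attr_name)
    # Attach each result to the attributes of the node it belongs to.
    for node_id, value in results.items():
        node_table["attributes"][node_id][attr_name] = value
    return node_table
```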

Optionally, when extracting at least one computing node corresponding to the graph calculation scenario from the node table and the relationship table according to the type and/or attribute of the node and the type and/or attribute of the edge required for the graph calculation scenario, the filtering module 34 is configured to: look up the identifier of the node in the node table, and look up the start node identifier and the target node identifier in the relationship table, according to the type and/or attribute of the node and the type and/or attribute of the edge required for the graph calculation scenario; and determine at least one computing node according to the identifier of the node, the start node identifier and the target node identifier.
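The lookup can be sketched as below. The text does not state how the three kinds of identifiers are combined, so the sketch assumes one possible reading: a computing node is a node of a required type that is also the start or target of an edge of a required type. The simplified table layouts are likewise assumptions.

```python
def find_computing_nodes(node_table, relation_table, node_types, edge_types):
    """Determine computing nodes from node ids and edge endpoint ids.

    node_table:     {node_id: node_type}
    relation_table: list of (start_id, target_id, edge_type)
    """
    # Look up identifiers of nodes whose type matches the scenario.
    node_ids = {nid for nid, t in node_table.items() if t in node_types}
    # Look up start and target identifiers of edges whose type matches.
    endpoint_ids = set()
    for start, target, edge_type in relation_table:
        if edge_type in edge_types:
            endpoint_ids.update((start, target))
    # Assumed combination rule: keep nodes that satisfy both lookups.
    return node_ids & endpoint_ids
```

Filtering by attributes instead of, or in addition to, types would add a further condition to each lookup in the same way.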

Optionally, when constructing the node table and the relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph, the construction module 31 is configured to: load the node data source and the edge data source used to construct the heterogeneous distributed knowledge graph; identify the data structure of the node data source and the data structure of the edge data source; construct the node table according to the data structure of the node data source, and construct the relationship table according to the data structure of the edge data source; and read, according to the node table and the relationship table, the data in the node data source and the edge data source into the graph database corresponding to the heterogeneous distributed knowledge graph.

Optionally, when constructing the node table according to the data structure of the node data source, the construction module 31 is configured to: generate the identifiers of multiple nodes in the node table according to the number column in the node data source; generate the types of multiple nodes in the node table according to the type field in the node data source; generate the attributes of multiple nodes in the node table according to the attribute fields in the node data source; and summarize the types and attributes of the multiple nodes respectively to generate the type set and the attribute set of the nodes.
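The four steps for the node table can be sketched as follows, assuming the node data source has already been read as a list of row dictionaries. The column names `id` and `type` and the dictionary layout of the resulting table are assumptions for illustration; every remaining column is treated as an attribute field.

```python
def build_node_table(rows, id_col="id", type_col="type"):
    """Build a node table from rows of a node data source."""
    table = {"ids": [], "types": {}, "attributes": {},
             "type_set": set(), "attribute_set": set()}
    for row in rows:
        node_id = row[id_col]                       # number column -> identifier
        node_type = row[type_col]                   # type field -> node type
        attrs = {k: v for k, v in row.items()       # other fields -> attributes
                 if k not in (id_col, type_col)}
        table["ids"].append(node_id)
        table["types"][node_id] = node_type
        table["attributes"][node_id] = attrs
        table["type_set"].add(node_type)            # summarized type set
        table["attribute_set"].update(attrs)        # summarized attribute set
    return table
```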

Optionally, when constructing the relationship table according to the data structure of the edge data source, the construction module 31 is configured to: generate the start node identifiers of multiple edges in the relationship table according to the number column corresponding to the start node in the edge data source; generate the target node identifiers of multiple edges in the relationship table according to the number column corresponding to the target node in the edge data source; generate the types of multiple edges in the relationship table according to the type field in the edge data source; generate the attributes of multiple edges in the relationship table according to the attribute fields in the edge data source; and summarize the types and attributes of the multiple edges respectively to generate the type set and the attribute set of the edges.
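The corresponding steps for the relationship table can be sketched in the same way. The column names `start_id`, `target_id` and `type`, and the parallel-list layout of the result, are hypothetical choices made for the example.

```python
def build_relation_table(rows, start_col="start_id", target_col="target_id",
                         type_col="type"):
    """Build a relationship table from rows of an edge data source."""
    table = {"start_ids": [], "target_ids": [], "types": [], "attributes": [],
             "type_set": set(), "attribute_set": set()}
    reserved = (start_col, target_col, type_col)
    for row in rows:
        attrs = {k: v for k, v in row.items() if k not in reserved}
        table["start_ids"].append(row[start_col])    # start node number column
        table["target_ids"].append(row[target_col])  # target node number column
        table["types"].append(row[type_col])         # type field -> edge type
        table["attributes"].append(attrs)            # other fields -> attributes
        table["type_set"].add(row[type_col])         # summarized type set
        table["attribute_set"].update(attrs)         # summarized attribute set
    return table
```

The i-th entries of the four lists together describe the i-th edge.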

The big data processing device based on the heterogeneous distributed knowledge graph provided by the embodiment of the present invention can execute the big data processing method based on the heterogeneous distributed knowledge graph provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed method.

Example four

Fig. 4 is a schematic structural diagram of a computer device provided in the fourth embodiment of the present invention. As shown in Fig. 4, the computer device includes a processor 40, a memory 41, an input device 42, and an output device 43. The number of processors 40 in the computer device can be one or more; in Fig. 4, one processor 40 is taken as an example. The processor 40, the memory 41, the input device 42, and the output device 43 in the computer device can be connected by a bus or by other means; in Fig. 4, connection by a bus is taken as an example.

As a computer-readable storage medium, the memory 41 can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the big data processing method based on a heterogeneous distributed knowledge graph in the embodiment of the present invention (for example, the construction module 31, the determining module 32, the computing node obtaining module 33, the filtering module 34, and the calculation module 35 in the big data processing device based on a heterogeneous distributed knowledge graph). The processor 40 executes multiple functional applications and data processing of the electronic device by running the software programs, instructions, and modules stored in the memory 41, that is, realizes the aforementioned big data processing method based on the heterogeneous distributed knowledge graph.

The memory 41 may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created according to the use of the terminal, and the like. In addition, the memory 41 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage devices. In some examples, the memory 41 may include a memory remotely provided with respect to the processor 40, and these remote memories may be connected to the electronic device through a network. Examples of the aforementioned networks include the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 42 can be used to receive input digital or character information, and generate key signal inputs related to user settings and function control of the computer equipment, such as node tables and data tables. The output device 43 may include a display device such as a display screen, and the display device is used to display the big data processing result.

Example five

The fifth embodiment of the present invention also provides a storage medium storing instructions. When the instructions are executed by a computer processor, they are used to execute a big data processing method based on a heterogeneous distributed knowledge graph, and the method includes:

According to the graph calculation request, determine the graph calculation scenario, determine the type and/or attribute of the node required for the graph calculation scenario, and the type and/or attribute of the edge;

According to the type and/or attribute of the node and the type and/or attribute of the edge required for the graph calculation scene, extract at least one calculation node corresponding to the graph calculation scene from the node table and the relation table;

Wherein, the node table includes: identifiers of multiple nodes, types of multiple nodes, attributes of multiple nodes, a type set of the nodes and an attribute set of the nodes; the relationship table includes: start node identifiers of multiple edges, target node identifiers of multiple edges, types of multiple edges, attributes of multiple edges, a type set of the edges and an attribute set of the edges.

An embodiment of the present invention provides a storage medium on which instructions are stored. The stored instructions are not limited to the above method operations, and can also perform related operations in the big data processing method based on the heterogeneous distributed knowledge graph provided by any embodiment of the present invention.

Through the above description of the embodiments, the present disclosure can be implemented by software and necessary general-purpose hardware, or can be implemented by hardware. The present disclosure can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk or optical disk, and includes multiple instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute the methods of the multiple embodiments of the present disclosure.

In the above embodiment of the big data processing device based on the heterogeneous distributed knowledge graph, the multiple units and modules included are only divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be realized; in addition, the names of multiple functional units are only for the convenience of distinguishing them from each other and are not used to limit the protection scope of the present disclosure.

Claims

A big data processing method based on heterogeneous distributed knowledge graph, including:

Constructing the node table and relation table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph;

According to the graph calculation request, determine the graph calculation scenario, and determine at least one of the following of the nodes required for the graph calculation scenario: type, attribute; and at least one of the following of the edges required for the graph calculation scenario: type, attribute;

According to at least one of the following of the nodes required for the graph calculation scenario: type, attribute; and at least one of the following of the edges: type, attribute, extracting from the node table and the relationship table at least one computing node corresponding to the graph calculation scenario;

Filtering out node data corresponding to the at least one computing node from the heterogeneous distributed knowledge graph;

Perform data processing on the filtered node data to obtain a data processing result based on a heterogeneous distributed knowledge graph;

Wherein, the node table includes: identifiers of multiple nodes, types of multiple nodes, attributes of multiple nodes, a type set of the nodes, and an attribute set of the nodes; and the relationship table includes: start node identifiers of multiple edges, target node identifiers of multiple edges, types of multiple edges, attributes of multiple edges, a type set of the edges, and an attribute set of the edges.
The method according to claim 1, wherein the heterogeneous distributed knowledge graph is stored in a plurality of devices in a distributed manner;

The filtering out the node data corresponding to the at least one computing node from the heterogeneous distributed knowledge graph includes:

Filtering out node data corresponding to the at least one computing node from each device;

The performing data processing on the filtered node data to obtain a data processing result based on a heterogeneous distributed knowledge graph includes:

Performing data processing on the node data corresponding to the at least one computing node filtered from each device;

Summarize the data processing results of each device to obtain the data processing results based on the heterogeneous distributed knowledge graph.
The method according to claim 1, wherein after performing data processing on the filtered node data to obtain a data processing result based on a heterogeneous distributed knowledge graph, the method further comprises at least one of the following:

Adding the data processing result to the attribute set in the node table or the attribute set in the relationship table;

The data processing result is added to the attribute of the node corresponding to the data processing result in the node table, or the attribute of the edge corresponding to the data processing result in the relationship table.
The method according to claim 3, wherein the extracting, from the node table and the relationship table, at least one computing node corresponding to the graph calculation scenario according to at least one of the following of the nodes required for the graph calculation scenario: type, attribute; and at least one of the following of the edges: type, attribute, includes:

According to at least one of the following of the nodes required for the graph calculation scenario: type, attribute; and at least one of the following of the edges: type, attribute, looking up the identifier of the node in the node table, and looking up the start node identifier and the target node identifier in the relationship table;

The at least one computing node is determined according to the identification of the node, the identification of the starting node, and the identification of the target node.
The method according to claim 1, wherein the constructing the node table and the relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph comprises:

Load the node data source and the edge data source used to construct the heterogeneous distributed knowledge graph;

Identifying the data structure of the node data source and the data structure of the edge data source;

Constructing the node table according to the data structure of the node data source, and constructing the relationship table according to the data structure of the edge data source;

According to the node table and the relationship table, the data in the node data source and the edge data source are read into the graph database corresponding to the heterogeneous distributed knowledge graph.
The method according to claim 5, wherein the constructing the node table according to the data structure of the node data source comprises:

Generate the identities of multiple nodes in the node table according to the number column in the node data source;

Generating the types of multiple nodes in the node table according to the type field in the node data source;

Generating the attributes of multiple nodes in the node table according to the attribute fields in the node data source;

The types and attributes of multiple nodes are respectively summarized to generate the type set of the node and the attribute set of the node.
The method according to claim 5, wherein the constructing the relationship table according to the data structure of the edge data source comprises:

According to the serial number column corresponding to the start node in the edge data source, generating the start node identifiers of the multiple edges in the relationship table;

Generate the target node identifiers of multiple edges in the relationship table according to the number column corresponding to the target node in the edge data source;

Generate the types of multiple edges in the relational table according to the type field in the edge data source;

Generating the attributes of multiple edges in the relationship table according to the attribute fields in the edge data source;

Summarize the types and attributes of multiple edges separately to generate an edge type set and attribute set.
A computer device includes a processor and a memory, the memory is used to store instructions, and when the instructions are executed, the processor is caused to perform the following operations:

Constructing the node table and relation table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph;

According to the graph calculation request, determine the graph calculation scenario, and determine at least one of the following of the nodes required for the graph calculation scenario: type, attribute; and at least one of the following of the edges required for the graph calculation scenario: type, attribute;

According to at least one of the following of the nodes required for the graph calculation scenario: type, attribute; and at least one of the following of the edges: type, attribute, extracting from the node table and the relationship table at least one computing node corresponding to the graph calculation scenario;

Filtering out node data corresponding to the at least one computing node from the heterogeneous distributed knowledge graph;

Perform data processing on the filtered node data to obtain a data processing result based on a heterogeneous distributed knowledge graph;

Wherein, the node table includes: identifiers of multiple nodes, types of multiple nodes, attributes of multiple nodes, a type set of the nodes, and an attribute set of the nodes; and the relationship table includes: start node identifiers of multiple edges, target node identifiers of multiple edges, types of multiple edges, attributes of multiple edges, a type set of the edges, and an attribute set of the edges.
The computer device according to claim 8, wherein the heterogeneous distributed knowledge graph is stored in a plurality of devices in a distributed manner;

The processor is configured to filter out the node data corresponding to the at least one computing node from the heterogeneous distributed knowledge graph in the following manner:

Filtering out node data corresponding to the at least one computing node from each device;

The processor is set to obtain the data processing result based on the heterogeneous distributed knowledge graph in the following manner:

Performing data processing on the node data corresponding to the at least one computing node filtered from each device;

Summarize the data processing results of each device to obtain the data processing results based on the heterogeneous distributed knowledge graph.
The computer device according to claim 8, wherein the processor is further configured to:

After obtaining the data processing result based on the heterogeneous distributed knowledge graph, perform at least one of the following:

Adding the data processing result to the attribute set in the node table or the attribute set in the relationship table;

The data processing result is added to the attribute of the node corresponding to the data processing result in the node table, or the attribute of the edge corresponding to the data processing result in the relationship table.
The computer device according to claim 10, wherein the processor is configured to extract at least one computing node corresponding to the graph computing scene in the following manner:

According to at least one of the following of the nodes required for the graph calculation scenario: type, attribute; and at least one of the following of the edges: type, attribute, looking up the identifier of the node in the node table, and looking up the start node identifier and the target node identifier in the relationship table;

The at least one computing node is determined according to the identification of the node, the identification of the starting node, and the identification of the target node.
The computer device according to claim 8, wherein the processor is configured to construct a node table and a relationship table of the heterogeneous distributed knowledge graph in the following manner:

Load the node data source and the edge data source used to construct the heterogeneous distributed knowledge graph;

Identifying the data structure of the node data source and the data structure of the edge data source;

Constructing the node table according to the data structure of the node data source, and constructing the relationship table according to the data structure of the edge data source;

According to the node table and the relationship table, the data in the node data source and the edge data source are read into the graph database corresponding to the heterogeneous distributed knowledge graph.
The computer device according to claim 12, wherein the processor is configured to construct the node table in the following manner:

Generate the identities of multiple nodes in the node table according to the number column in the node data source;

Generating the types of multiple nodes in the node table according to the type field in the node data source;

Generating the attributes of multiple nodes in the node table according to the attribute fields in the node data source;

The types and attributes of multiple nodes are respectively summarized to generate a set of types and attributes of nodes.
The computer device according to claim 12, wherein the processor is configured to construct the relationship table in the following manner:

According to the serial number column corresponding to the start node in the edge data source, generating the start node identifiers of the multiple edges in the relationship table;

Generate the target node identifiers of multiple edges in the relationship table according to the number column corresponding to the target node in the edge data source;

Generate the types of multiple edges in the relational table according to the type field in the edge data source;

Generating the attributes of multiple edges in the relationship table according to the attribute fields in the edge data source;

Summarize the types and attributes of multiple edges separately to generate an edge type set and attribute set.
A storage medium for storing instructions, and the instructions are used for executing the method for processing big data based on a heterogeneous distributed knowledge graph according to any one of claims 1-7.