CN110472068B

CN110472068B - Big data processing method, equipment and medium based on heterogeneous distributed knowledge graph

Info

Publication number: CN110472068B
Application number: CN201910770620.6A
Authority: CN
Inventors: 宋群豪
Original assignee: Transwarp Technology Shanghai Co Ltd
Current assignee: Transwarp Technology Shanghai Co Ltd
Priority date: 2019-08-20
Filing date: 2019-08-20
Publication date: 2020-04-24
Anticipated expiration: 2039-08-20
Also published as: CN110472068A; WO2021032002A1

Abstract

The embodiment of the invention discloses a big data processing method, equipment and a medium based on a heterogeneous distributed knowledge graph. The method comprises the following steps: constructing a node table and a relation table of the heterogeneous distributed knowledge graph according to a data structure of the heterogeneous distributed knowledge base; determining a graph calculation scene according to a graph calculation request, and determining the type and/or attribute of a node and the type and/or attribute of an edge required by the graph calculation scene; extracting at least one computing node corresponding to the graph calculation scene from the node table and the relation table; filtering out node data corresponding to the at least one computing node from the heterogeneous distributed knowledge graph; and performing data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph. The embodiment thus provides an efficient data processing method for a heterogeneous distributed knowledge graph based on a node table and a relation table.

Description

Big data processing method, equipment and medium based on heterogeneous distributed knowledge graph

Technical Field

The embodiment of the invention relates to a knowledge graph technology, in particular to a big data processing method, equipment and a medium based on a heterogeneous distributed knowledge graph.

Background

A knowledge graph, also known as a scientific knowledge graph, is referred to in the library and information science community as knowledge domain visualization or knowledge domain mapping. The life cycle of a knowledge graph consists of the following parts: data ETL (Extract-Transform-Load), knowledge extraction, graph definition, data import, knowledge reasoning and knowledge application.

Knowledge graphs are generally divided into heterogeneous knowledge graphs and homogeneous knowledge graphs. In a homogeneous knowledge graph, all nodes share one type and all edges share one type, that is, types are not distinguished; in a heterogeneous knowledge graph, nodes and edges can have different types and even different attributes. At present, heterogeneous knowledge graphs are generally described in the form of multi-tuples such as triples, quintuples and septuples; for example, a large-scale directed knowledge graph composed of points and edges is represented by "concept, relationship, and rule". Describing the knowledge graph in multi-tuple form makes it possible to clearly represent the relations between concepts, between concepts and entities, between entities and attributes, between attributes and attribute values, and so on.

Although the multi-tuple form brings many benefits, it is not concise and contains a large amount of redundant information. When computation is performed on a heterogeneous distributed knowledge graph, this makes it hard to filter out the node data of interest and therefore greatly increases the computational complexity.

Disclosure of Invention

The embodiments of the invention provide a big data processing method, apparatus, equipment and medium based on a heterogeneous distributed knowledge graph, with the aim of providing an effective data processing scheme for heterogeneous distributed knowledge graphs.

In a first aspect, an embodiment of the present invention provides a big data processing method based on a heterogeneous distributed knowledge graph, including:

constructing a node table and a relation table of the heterogeneous distributed knowledge graph according to a data structure of the heterogeneous distributed knowledge base;

determining a graph calculation scene according to the graph calculation request, and determining the type and/or attribute of a node and the type and/or attribute of an edge required by the graph calculation scene;

extracting at least one computing node corresponding to the graph computing scene from the node table and the relation table according to the type and/or attribute of the node and the type and/or attribute of the edge required by the graph computing scene;

filtering node data corresponding to at least one computing node from the heterogeneous distributed knowledge graph;

performing data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph;

wherein the node table includes: the identifier of each node, the type of each node, the attribute of each node, and the type set and attribute set of the nodes; and the relation table includes: the identifier of the start node of each edge, the identifier of the target node of each edge, the type of each edge, the attribute of each edge, and the type set and attribute set of the edges.

In a second aspect, an embodiment of the present invention further provides a big data processing apparatus based on a heterogeneous distributed knowledge graph, including:

the building module is used for building a node table and a relation table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge base;

the determining module is used for determining a graph calculation scene according to the graph calculation request, and determining the type and/or attribute of a node and the type and/or attribute of an edge required by the graph calculation scene;

the calculation node acquisition module is used for extracting at least one calculation node corresponding to the graph calculation scene from the node table and the relation table according to the type and/or attribute of the node and the type and/or attribute of the edge required by the graph calculation scene;

the filtering module is used for filtering node data corresponding to at least one computing node from the heterogeneous distributed knowledge graph;

the computing module is used for carrying out data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph;

In a third aspect, an embodiment of the present invention further provides a computer device, including a processor and a memory, where the memory is used to store instructions which, when executed, cause the processor to perform the operations of the big data processing method described in the first aspect.

In a fourth aspect, an embodiment of the present invention further provides a storage medium configured to store instructions for performing the operations of the big data processing method described in the first aspect.

In the embodiments of the invention, the heterogeneous distributed knowledge graph is represented by a node table and a relation table. The type and/or attribute of the nodes and the type and/or attribute of the edges required by the graph calculation scene can be determined from these two tables, so that the corresponding node data can be filtered out of the heterogeneous distributed knowledge graph accordingly. Data processing is then performed only on the filtered node data, rather than on the whole graph, to obtain a data processing result based on the heterogeneous distributed knowledge graph. The embodiments thus provide an efficient data processing method for heterogeneous distributed knowledge graphs based on the node table and the relation table.

Drawings

Fig. 1 is a flowchart of a big data processing method based on a heterogeneous distributed knowledge graph according to a first embodiment of the present invention;

Fig. 2 is a flowchart of a big data processing method based on a heterogeneous distributed knowledge graph according to a second embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a big data processing apparatus based on a heterogeneous distributed knowledge graph according to a third embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Example one

Fig. 1 is a flowchart of a big data processing method based on a heterogeneous distributed knowledge graph according to a first embodiment of the present invention. The embodiment is applicable to data processing on a heterogeneous distributed knowledge graph. The method can be executed by a big data processing apparatus based on a heterogeneous distributed knowledge graph; the apparatus can be composed of hardware and/or software and is generally integrated in computer equipment, and the software can be written in the Scala and Java programming languages.

The heterogeneous knowledge graph is the counterpart of the homogeneous knowledge graph. In a homogeneous knowledge graph, all nodes share one type and all edges share one type, that is, types are not distinguished, whereas the nodes and edges in a heterogeneous knowledge graph have different types. For example, each node in a homogeneous knowledge graph represents a person, and each edge between two people represents an acquaintance relation. Nodes in a heterogeneous knowledge graph can represent people, accounts, companies and the like; the relation between a person and an account is an ownership relation, and the relation between a person and a company is an employment relation. Moreover, each type of node and edge has its own attributes. The heterogeneous distributed knowledge graph in this embodiment is a heterogeneous knowledge graph stored in a distributed manner across a plurality of devices. Its data volume is huge and its types and attributes vary, and no effective data processing method for such a graph currently exists. Based on this, with reference to Fig. 1, the big data processing method provided in this embodiment includes the following operations:

Step 110, constructing a node table and a relation table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge base.

In this embodiment, the heterogeneous distributed knowledge graph (hereinafter referred to as the graph) corresponds to a node table and a relation table. The node table includes the type and attribute of each node, and the relation table includes the type and attribute of each edge, where different types of nodes or edges have different attributes.

Illustratively, the node table includes: the identifier of each node, the type of each node, the attribute of each node, and the type set and attribute set of the nodes. The identifier serves as the unique identifier of the node and may be a character string or a number. The type of the node is a mandatory element in a heterogeneous graph. The attribute of the node can be a dictionary (map) whose keys and values are character strings, map<string, string>, e.g., map<gender->man, age->20>. Storing data of different formats (including numbers and character strings) in the map keeps the data format uniform while remaining very flexible, so that when nodes with certain attributes are processed later, the methods provided by the map can be used for targeted extraction. The type set and attribute set of the nodes record which types of nodes exist in the graph and which attributes of each node type can be used for computation. They are unique to the graph and may be stored in a hidden column of the node table; accordingly, this information can be serialized and recorded in the meta information of the Schema column. The data then does not need to be repeated in every row, which saves space.

The relation table includes: the identifier of the start node of each edge, the identifier of the target node of each edge, the type of each edge, the attribute of each edge, and the type set and attribute set of the edges. The start node identifier and the target node identifier can be character strings or numbers. The type of the edge is a mandatory element in a heterogeneous graph. The attribute of the edge can be a dictionary (map) whose keys and values are character strings, map<string, string>, such as map<gender->man, age->20, progress->M, city->P>. Storing data of different formats (including numbers and character strings) in the map keeps the data format uniform while remaining very flexible, so that when edges with certain attributes are processed later, the methods provided by the map can be used for targeted extraction. The type set and attribute set of the edges record which types of edges exist in the graph and which attributes of each edge type can be used for computation. They are unique to the graph and may be stored in a hidden column of the relation table; accordingly, this information can be serialized and recorded in the meta information of the Schema column. The data then does not need to be repeated in every row, which saves space.
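As a rough illustration of the two tables described above, the following Python sketch models each row with a string-to-string attribute dictionary and keeps the type set and attribute set once per table instead of in every row. The class and field names (NodeRow, EdgeRow, GraphTable, schema) and the sample rows are assumptions made for illustration only; they are not defined by the embodiment.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Union


@dataclass
class NodeRow:
    node_id: str            # unique identifier of the node
    node_type: str          # mandatory in a heterogeneous graph
    attrs: Dict[str, str]   # map<string, string> attribute dictionary


@dataclass
class EdgeRow:
    src_id: str             # start node identifier
    dst_id: str             # target node identifier
    edge_type: str          # mandatory in a heterogeneous graph
    attrs: Dict[str, str]   # map<string, string> attribute dictionary


@dataclass
class GraphTable:
    """A node table or relation table plus its table-level metadata."""
    rows: List[Union[NodeRow, EdgeRow]] = field(default_factory=list)
    # type -> attribute names; stored once per table, not in every row
    schema: Dict[str, Set[str]] = field(default_factory=dict)

    def add(self, row: Union[NodeRow, EdgeRow]) -> None:
        row_type = row.node_type if isinstance(row, NodeRow) else row.edge_type
        self.rows.append(row)
        self.schema.setdefault(row_type, set()).update(row.attrs.keys())


# Hypothetical sample data.
node_table = GraphTable()
node_table.add(NodeRow("001", "user", {"gender": "man", "age": "20"}))
node_table.add(NodeRow("abc.com", "website", {"title": "ABC"}))

relation_table = GraphTable()
relation_table.add(EdgeRow("001", "abc.com", "click", {"count": "3"}))
```

Keeping the schema on the table object mirrors the idea of recording the type set and attribute set in table-level meta information, so a later step can inspect the available types without scanning the rows.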

Optionally, constructing a node table and a relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge base may include: loading a node data source and an edge data source for constructing a heterogeneous distributed knowledge graph; identifying a data structure of a node data source and a data structure of an edge data source; constructing a node table according to the data structure of the node data source; constructing a relation table according to the data structure of the edge data source; and reading the data in the node data source and the edge data source into a graph database corresponding to the heterogeneous distributed knowledge graph according to the node table and the relation table.

Optionally, constructing a node table according to the data structure of the node data source may include: generating the identifier of each node in the node table according to the number column in the node data source; generating the type of each node in the node table according to the type field in the node data source; generating the attribute of each node in the node table according to the attribute fields in the node data source; and summarizing the types and the attributes of the nodes respectively to generate the type set and attribute set of the nodes.

Optionally, constructing the relation table according to the data structure of the edge data source may include: generating the start node identifier of each edge in the relation table according to the number column corresponding to the start node in the edge data source; generating the target node identifier of each edge in the relation table according to the number column corresponding to the target node in the edge data source; generating the type of each edge in the relation table according to the type field in the edge data source; generating the attribute of each edge in the relation table according to the attribute fields in the edge data source; and summarizing the types and the attributes of the edges respectively to generate the type set and attribute set of the edges.
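The construction steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the source records are already available as Python dictionaries, the column names id, type, src and dst are hypothetical, and every remaining column is treated as an attribute field.

```python
from typing import Dict, Iterable, List, Set, Tuple

Record = Dict[str, object]
Schema = Dict[str, Set[str]]


def build_node_table(records: Iterable[Record],
                     id_col: str = "id",
                     type_col: str = "type") -> Tuple[List[dict], Schema]:
    """Return (rows, schema); schema maps node type -> attribute names."""
    rows: List[dict] = []
    schema: Schema = {}
    for rec in records:
        # Every column other than the number and type columns is an attribute.
        attrs = {k: str(v) for k, v in rec.items() if k not in (id_col, type_col)}
        node_type = str(rec[type_col])
        rows.append({"id": str(rec[id_col]), "type": node_type, "attrs": attrs})
        schema.setdefault(node_type, set()).update(attrs)
    return rows, schema


def build_relation_table(records: Iterable[Record],
                         src_col: str = "src",
                         dst_col: str = "dst",
                         type_col: str = "type") -> Tuple[List[dict], Schema]:
    """Return (rows, schema); schema maps edge type -> attribute names."""
    rows: List[dict] = []
    schema: Schema = {}
    for rec in records:
        attrs = {k: str(v) for k, v in rec.items()
                 if k not in (src_col, dst_col, type_col)}
        edge_type = str(rec[type_col])
        rows.append({"src": str(rec[src_col]), "dst": str(rec[dst_col]),
                     "type": edge_type, "attrs": attrs})
        schema.setdefault(edge_type, set()).update(attrs)
    return rows, schema


# Hypothetical source records.
node_source = [
    {"id": 1, "type": "user", "age": 20},
    {"id": "abc.com", "type": "website"},
]
edge_source = [{"src": 1, "dst": "abc.com", "type": "click", "count": 3}]

node_rows, node_schema = build_node_table(node_source)
edge_rows, edge_schema = build_relation_table(edge_source)
```

Converting identifiers and attribute values to strings keeps the rows uniform regardless of whether the source stores numbers or text.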

Step 120, determining a graph calculation scene according to the graph calculation request, and determining the type and/or attribute of the node and the type and/or attribute of the edge required by the graph calculation scene.

When data processing is to be performed on the graph, not all node data needs to be extracted. Only the specific graph calculation scene needs to be determined according to the graph calculation request, after which the node data required by that scene is extracted, which saves computation. As described for step 110, this embodiment uses a node table and a relation table to represent all nodes and edges in the graph; in other words, the node table and the relation table constitute an index over the data of each node in the graph. Based on this, the types and/or attributes of the nodes required by the graph calculation scene are determined from the node table, and the types and/or attributes of the edges required by the graph calculation scene are determined from the relation table.

Optionally, the types and/or attributes of the nodes required by the graph calculation scene are determined from the type set and attribute set of the nodes, and the types and/or attributes of the edges required by the graph calculation scene are determined from the type set and attribute set of the edges.
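One possible way to map a graph calculation request to the types it needs is sketched below. The registry SCENE_REQUIREMENTS, the request format and the function name are hypothetical; the embodiment does not prescribe how a scene is mapped to its required types. The sketch only checks the requested types against the type sets stored with the two tables.

```python
from typing import Dict, Set

# Hypothetical registry: scene name -> node and edge types it needs.
SCENE_REQUIREMENTS = {
    "pagerank": {"node_types": {"website"}, "edge_types": {"click", "link"}},
}


def resolve_requirements(request: dict,
                         node_schema: Dict[str, Set[str]],
                         edge_schema: Dict[str, Set[str]]) -> dict:
    """Return the types a scene needs, validated against the type sets
    recorded for the node table and the relation table."""
    scene = request["scene"]
    required = SCENE_REQUIREMENTS[scene]
    missing_nodes = required["node_types"] - set(node_schema)
    missing_edges = required["edge_types"] - set(edge_schema)
    if missing_nodes or missing_edges:
        raise ValueError(
            f"scene {scene!r} needs types absent from the graph: "
            f"{sorted(missing_nodes | missing_edges)}"
        )
    return required


# Hypothetical type sets: type -> attribute names.
node_schema = {"user": {"age"}, "website": set()}
edge_schema = {"click": set(), "link": set()}

needed = resolve_requirements({"scene": "pagerank"}, node_schema, edge_schema)
```

Because only the table-level type sets are consulted, this step does not touch any row data.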

Step 130, extracting at least one computing node corresponding to the graph computing scene from the node table and the relation table according to the type and/or attribute of the node and the type and/or attribute of the edge required by the graph computing scene.

Optionally, according to the type and/or attribute of the node and the type and/or attribute of the edge required by the graph calculation scene, the identifier of the node is searched in the node table, and the identifier of the starting node and the identifier of the target node are searched in the relation table; and determining at least one computing node according to the node identifier, the starting node identifier and the target node identifier.
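The lookup described above can be sketched as a union of the node identifiers found in the node table and the start and target identifiers found in the relation table. The row layout (dictionaries with id, type, src and dst keys), the user node 002 and the follow edge are assumptions added for illustration.

```python
from typing import Iterable, List, Set


def extract_computing_nodes(node_rows: List[dict],
                            edge_rows: List[dict],
                            node_types: Iterable[str],
                            edge_types: Iterable[str]) -> Set[str]:
    """Collect identifiers of every node relevant to the scene."""
    node_types, edge_types = set(node_types), set(edge_types)
    # Node identifiers whose type is required by the scene.
    ids = {n["id"] for n in node_rows if n["type"] in node_types}
    # Start and target identifiers of edges whose type is required.
    for e in edge_rows:
        if e["type"] in edge_types:
            ids.add(e["src"])
            ids.add(e["dst"])
    return ids


node_rows = [
    {"id": "001", "type": "user"},
    {"id": "002", "type": "user"},          # hypothetical, has no relevant edge
    {"id": "abc.com", "type": "website"},
    {"id": "bcd.com", "type": "website"},
]
edge_rows = [
    {"src": "abc.com", "dst": "bcd.com", "type": "link"},
    {"src": "001", "dst": "abc.com", "type": "click"},
    {"src": "002", "dst": "001", "type": "follow"},  # hypothetical, not required
]

computing_nodes = extract_computing_nodes(
    node_rows, edge_rows, {"website"}, {"click", "link"}
)
```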

Step 140, filtering node data corresponding to at least one computing node from the heterogeneous distributed knowledge graph.

In an optional embodiment, after at least one computing node is determined, corresponding node data is filtered from the heterogeneous distributed knowledge graph according to the identifier of each computing node, the identifier of the start node, and the identifier of the target node.

In another optional implementation, each piece of node data carries its own node type and attributes as well as the types and attributes of its corresponding edges; the node data carrying the node types and/or attributes and the edge types and/or attributes required for the calculation is then searched for among all the node data.

In this embodiment, a graph database is used to store the graph. A graph database, for example the Neo4j graph database, generally takes the property graph as its basic representation, in which both nodes and relations can carry attributes. This makes real business scenarios easier to express and suits the storage of a heterogeneous distributed knowledge graph. Specifically, the node data corresponding to the at least one computing node is filtered out of the graph database corresponding to the heterogeneous distributed knowledge graph.
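The two optional filtering strategies of step 140 can be contrasted in a small sketch. Here a plain Python dictionary stands in for the graph database; it is not the API of Neo4j or of any other real graph database, and the record layout and function names are assumptions.

```python
from typing import Dict, Iterable, Set

# Hypothetical stand-in for the graph store: node id -> node data.
graph_store: Dict[str, dict] = {
    "001": {"type": "user", "attrs": {"age": "20"}},
    "002": {"type": "user", "attrs": {}},
    "abc.com": {"type": "website", "attrs": {}},
    "bcd.com": {"type": "website", "attrs": {}},
}


def filter_by_ids(store: Dict[str, dict], ids: Set[str]) -> Dict[str, dict]:
    """Strategy 1: filter by the identifiers of the computing nodes."""
    return {nid: data for nid, data in store.items() if nid in ids}


def filter_by_carried_type(store: Dict[str, dict],
                           node_types: Set[str],
                           attr_names: Iterable[str] = ()) -> Dict[str, dict]:
    """Strategy 2: scan all node data and keep records whose carried
    type (and, optionally, attribute names) match the requirement."""
    attr_names = list(attr_names)
    return {
        nid: data
        for nid, data in store.items()
        if data["type"] in node_types
        and all(name in data["attrs"] for name in attr_names)
    }


by_id = filter_by_ids(graph_store, {"001", "abc.com", "bcd.com"})
by_type = filter_by_carried_type(graph_store, {"website"})
users_with_age = filter_by_carried_type(graph_store, {"user"}, ["age"])
```

The first strategy relies on the identifiers found through the node table and relation table, while the second scans every record, so it is likely to cost more on a large graph.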

Step 150, performing data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph.

After the node data is filtered out, the node data is calculated based on the current calculation scene. The following describes the calculation method provided in this embodiment in detail with an application scenario of calculating a web page rank (PageRank).

Assume that there is a graph of the access relations between internet users and web pages and of the link relations between web pages, and that PageRank statistics need to be computed for all web pages. Table 1 is the node table, and Table 2 is the relation table.

TABLE 1 node table

TABLE 2 relationship table

First, the website type, the click type and the link type required for computing PageRank are found in the Schema of the node table and the relation table. Then, the identifiers of the nodes of the website type are found in the node table: abc.com and bcd.com. Meanwhile, in the relation table, the start node identifier abc.com and the target node identifier bcd.com of the link edge are found, as are the start node identifier 001 and the target node identifier abc.com of the click edge. The node data corresponding to 001, abc.com and bcd.com is therefore filtered out of the graph database corresponding to the graph. Next, the PageRank calculation is performed on the node data of the website type to obtain the PageRank value of each website; for example, the PageRank value of abc.com is 1, and the PageRank value of bcd.com is 2.
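For readers unfamiliar with PageRank, the calculation on the filtered website nodes can be sketched with a standard power iteration. This is a textbook formulation, not necessarily the one used by the embodiment: the damping factor 0.85, the iteration count and the even redistribution of rank from pages without outgoing links are all assumptions. Only link edges between websites are used here. The scores it produces are normalized to sum to 1 and are not meant to reproduce the illustrative values 1 and 2 given above.

```python
from typing import Dict, Iterable, List, Tuple


def pagerank(nodes: List[str],
             links: Iterable[Tuple[str, str]],
             damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over directed (source, target) links."""
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    out: Dict[str, List[str]] = {v: [] for v in nodes}
    for src, dst in links:
        out[src].append(dst)
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[v] for v in nodes if not out[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in out[v]:
                new[w] += damping * rank[v] / len(out[v])
        rank = new
    return rank


# Website nodes and the link edge from the scenario above.
website_ids = ["abc.com", "bcd.com"]
link_edges = [("abc.com", "bcd.com")]

ranks = pagerank(website_ids, link_edges)
```

Since abc.com links to bcd.com and nothing links back, bcd.com ends up with the higher score, which agrees with the ordering in the example.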

Optionally, the calculation result is added to the attribute set in the node table or the attribute set in the relation table; and/or the calculation result is added to the attribute of the corresponding node in the node table or the attribute of the corresponding edge in the relation table. Following the application scenario described above, Table 3 shows the new node table.

TABLE 3 New node table

As can be seen, the PageRank attribute is added to the attributes of each node, and the PageRank attribute is added to the attribute set of the node.
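Writing a result back into the node table might look like the sketch below, which updates both the per-node attribute dictionary and the table-level attribute set. The row layout and the function name are assumptions; the PageRank values are the illustrative ones from the example.

```python
from typing import Dict, List, Set


def add_result_attribute(node_rows: List[dict],
                         node_schema: Dict[str, Set[str]],
                         attr_name: str,
                         values: Dict[str, object]) -> None:
    """Store a computed value per node and register the new attribute name
    in the attribute set of the node's type."""
    for row in node_rows:
        if row["id"] in values:
            row["attrs"][attr_name] = str(values[row["id"]])
            node_schema.setdefault(row["type"], set()).add(attr_name)


node_rows = [
    {"id": "001", "type": "user", "attrs": {}},
    {"id": "abc.com", "type": "website", "attrs": {}},
    {"id": "bcd.com", "type": "website", "attrs": {}},
]
node_schema: Dict[str, Set[str]] = {"user": set(), "website": set()}

add_result_attribute(node_rows, node_schema, "PageRank",
                     {"abc.com": 1, "bcd.com": 2})
```

After the call, later graph calculation scenes can discover the PageRank attribute from the attribute set without inspecting individual rows.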

The heterogeneous distributed knowledge graph is stored in a distributed manner across a plurality of devices, so distributed data processing of the graph is required. In a preferred embodiment, filtering node data corresponding to at least one computing node from the heterogeneous distributed knowledge graph comprises: filtering out, from each device, the node data corresponding to the at least one computing node. Calculating the filtered node data to obtain a calculation result based on the heterogeneous distributed knowledge graph comprises: calculating the node data filtered from each device, and summarizing the calculation results of the devices to obtain the calculation result based on the heterogeneous distributed knowledge graph.

Specifically, the node table further records the device on which each node is stored. According to the type and/or attribute of the node and the type and/or attribute of the edge required by the graph calculation scene, the devices storing the nodes required for the calculation are determined from the node table, and the node data is filtered out of those devices. The filtered node data on each device is then computed separately, optionally using a library such as NetworkX. Finally, the calculation results of the devices are summarized to obtain the final distributed calculation result.
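The filter-per-device, compute-per-device and summarize pattern can be illustrated with a deliberately simple computation. In this sketch the devices are simulated by an in-process dictionary, the device names and edge rows are hypothetical, and counting clicks per website is chosen only because it decomposes cleanly into partial results; an iterative algorithm such as PageRank would need several rounds of exchange between devices rather than a single summary step.

```python
from collections import Counter
from typing import Dict, Iterable, List


def local_click_counts(edge_rows: List[dict]) -> Counter:
    """Runs on one device: keep click edges, count clicks per target."""
    return Counter(e["dst"] for e in edge_rows if e["type"] == "click")


def summarize(partials: Iterable[Counter]) -> Counter:
    """Merge the per-device partial results into the final result."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical partitioning of the relation data across two devices.
devices: Dict[str, List[dict]] = {
    "device-1": [
        {"src": "001", "dst": "abc.com", "type": "click"},
        {"src": "abc.com", "dst": "bcd.com", "type": "link"},
    ],
    "device-2": [
        {"src": "002", "dst": "abc.com", "type": "click"},
        {"src": "002", "dst": "bcd.com", "type": "click"},
    ],
}

partials = [local_click_counts(rows) for rows in devices.values()]
total_clicks = summarize(partials)
```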

Example two

This embodiment is a further optimization of the above embodiment. Optionally, before the node table and the relation table of the heterogeneous distributed knowledge graph are obtained, the method further includes a process of constructing the heterogeneous distributed knowledge graph. The method is suitable for constructing a heterogeneous distributed knowledge graph from multiple data sources, which include, but are not limited to, traditional relational databases (e.g., Oracle, MySQL), distributed relational databases (e.g., Hive), distributed non-relational databases (e.g., HBase, Elasticsearch), TXT files and CSV files.

The data required for constructing a knowledge graph usually comes from several different data sources and includes both structured relational data and unstructured text data. Whether the data of the different sources is first unified into one complete data source or a separate import script is written for each source, considerable resources and time are needed, so the whole data ETL process is costly in time and labor. To overcome this defect of the prior art, with reference to Fig. 2, the method provided by this embodiment specifically includes the following operations:

Step 210, loading a node data source and an edge data source for constructing the heterogeneous distributed knowledge graph.

In this embodiment, both the node data source and the edge data source may be structured data sources, such as a traditional relational database (e.g., Oracle, MySQL), a distributed relational database (e.g., Hive) or a distributed non-relational database (e.g., HBase, Elasticsearch), and may also be unstructured text, such as TXT files and CSV files.

This embodiment designs an abstract data source interface so that the computing device of the heterogeneous distributed knowledge graph can connect seamlessly to various data sources. Once different data sources are connected through the data source interface, they are operated on in a uniform way, and their data does not need to be imported into a unified data source. The data source interface comprises a data structure and definition interface and a data reading interface, and may further comprise at least one of a data status check interface and a data writing interface. The data structure and definition interface encapsulates a data structure identification method and a graph interface definition method. These interfaces are Application Programming Interfaces (APIs).
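An abstract data source interface of this kind could be sketched as an abstract base class with one concrete subclass per source type. The class and method names below (DataSource, structure, read, check_status, CsvSource, ListSource, load) are hypothetical, and the two sources are simulated in memory rather than connected to a real file system or database.

```python
import csv
import io
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List


class DataSource(ABC):
    """Hypothetical abstract data source interface."""

    @abstractmethod
    def structure(self) -> List[str]:
        """Data structure and definition interface: return column names."""

    @abstractmethod
    def read(self) -> Iterator[Dict[str, str]]:
        """Data reading interface: yield one record per row."""

    def check_status(self) -> bool:
        """Optional data status check interface."""
        return True


class CsvSource(DataSource):
    """CSV text with a header row, held in memory for this sketch."""

    def __init__(self, text: str) -> None:
        self._text = text

    def structure(self) -> List[str]:
        return next(csv.reader(io.StringIO(self._text)))

    def read(self) -> Iterator[Dict[str, str]]:
        return iter(csv.DictReader(io.StringIO(self._text)))


class ListSource(DataSource):
    """Stands in for a database table already fetched as dictionaries."""

    def __init__(self, rows: List[Dict[str, str]]) -> None:
        self._rows = rows

    def structure(self) -> List[str]:
        return list(self._rows[0].keys()) if self._rows else []

    def read(self) -> Iterator[Dict[str, str]]:
        return iter(self._rows)


def load(source: DataSource) -> List[Dict[str, str]]:
    """Uniform operation: works the same way for every data source."""
    return [dict(record) for record in source.read()]


csv_source = CsvSource("id,type,age\n001,user,20\n")
list_source = ListSource([{"id": "001", "type": "user", "age": "20"}])
```

The caller of load never needs to know which kind of source it was given, which is the point of operating on all sources through one interface.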

The computing device of the heterogeneous distributed knowledge graph can load the corresponding node data source and the edge data source through the storage path of the node data source and the storage path of the edge data source.

Step 220, identifying the data structure of the node data source and the data structure of the edge data source.

In this embodiment, the data structure and definition interface is called to identify the data structures of the node data source and the edge data source.

The data structure of the node data source comprises a number column, a type field, an attribute field and node data, and the data structure of the edge data source comprises a number column corresponding to the starting node, a number column corresponding to the target node, a type field and an attribute field.

Step 230, constructing a node table according to the data structure of the node data source; and constructing a relation table according to the data structure of the edge data source.

Optionally, the data structure and definition interface is called to construct the node table according to the data structure of the node data source and to construct the relation table according to the data structure of the edge data source.

When the node table is constructed, the identifier of each node in the node table is generated according to the number column in the node data source; the type of each node is generated according to the type field in the node data source; the attribute of each node is generated according to the attribute fields in the node data source; and the types and the attributes of the nodes are summarized respectively to generate the type set and attribute set of the nodes.

When the relation table is constructed, the start node identifier of each edge in the relation table is generated according to the number column corresponding to the start node in the edge data source; the target node identifier of each edge is generated according to the number column corresponding to the target node in the edge data source; the type of each edge is generated according to the type field in the edge data source; the attribute of each edge is generated according to the attribute fields in the edge data source; and the types and the attributes of the edges are summarized respectively to generate the type set and attribute set of the edges.

Step 240, reading the data in the node data source and the edge data source into a graph database corresponding to the heterogeneous distributed knowledge graph according to the node table and the relation table.

Optionally, a data reading interface in the data source interface is called, and data in the node data source and the edge data source are read into the graph database corresponding to the heterogeneous distributed knowledge graph according to the node table and the relationship table.

The data reading interface encapsulates a method for reading data into the graph database. As described in step 110, the node table and the relation table include the types and attributes of the nodes and edges and the connection relations between nodes and edges. Based on this, the data required by the graph database can be determined according to the node table and the relation table, and the node data is read into the graph database.

Step 250, acquiring a node table and a relation table of the heterogeneous distributed knowledge graph.

Step 260, determining a graph calculation scene according to the graph calculation request, and determining the type and/or attribute of the node and the type and/or attribute of the edge required by the graph calculation scene.

Step 270, filtering corresponding node data from the heterogeneous distributed knowledge graph according to the type and/or attribute of the node and the type and/or attribute of the edge required by the graph calculation scene.

Step 280, performing data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph.

In this embodiment of the invention, the data structure of each data source can be identified through the data structure and definition interface of the data source interface, so there is no need to import the different data sources into a unified database, screen the data, write import scripts or use import tools. By calling the data reading interface of the data source interface, the data in the data sources is read into the graph database according to the node table and the relation table. The data is thus read automatically, without requiring business knowledge or collaboration between business experts and engineering experts. Because the embodiment only needs to identify the data structure through the data source interface and then imports the data automatically according to the node table and the relation table, it saves a large amount of manpower and resources and reduces cost when connecting to different data sources during data import.

On the basis of the above embodiments, the data source interface further includes: a data status check interface and/or a data write interface. Based on the above, the heterogeneous distributed knowledge graph-based computing method further comprises at least one of the following two embodiments.

The first embodiment: after a node data source and an edge data source for constructing the heterogeneous distributed knowledge graph are loaded, a data state check interface is called, whether the working states of the node data source and the edge data source are normal or not is checked, and the data source with the abnormal working state is fed back to a user.

Optionally, the working states of the node data source and the edge data source are checked when their data structures are identified and when data is read into the graph database; the working states may also be checked periodically. If a data source is online and the user has access rights, the data source is in a normal state; if the data source is offline or the user does not have access rights, the state of the data source is abnormal. This check helps ensure the stability and security of the graph construction process.
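As a non-limiting illustration, the state check described above might be sketched in Python as follows. The `DataSource` class, its `online` flag and its `allowed_users` field are hypothetical names chosen for this sketch; the embodiment does not prescribe any particular representation of a data source or of access rights.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    """Hypothetical stand-in for a node data source or an edge data source."""
    name: str
    online: bool = True
    allowed_users: set = field(default_factory=set)


def check_state(source: DataSource, user: str) -> bool:
    """A source is normal only if it is online and the user has access rights."""
    return source.online and user in source.allowed_users


def abnormal_sources(sources, user):
    """Return the names of the sources whose working state is abnormal."""
    return [s.name for s in sources if not check_state(s, user)]


sources = [
    DataSource("node_db", online=True, allowed_users={"alice"}),
    DataSource("edge_csv", online=False, allowed_users={"alice"}),
    DataSource("edge_hdfs", online=True, allowed_users={"bob"}),
]
# The list of abnormal sources is what would be fed back to the user.
report = abnormal_sources(sources, "alice")
```

In this sketch the offline source and the source without access rights are both reported; a periodic check would simply call `abnormal_sources` again on a timer.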

The second embodiment: a data reading interface in the data source interface is called to read the data in the node data source and the edge data source into the graph database corresponding to the heterogeneous distributed knowledge graph according to the node table and the relation table; a data writing interface is then called to write the data in the graph database back into the corresponding data source. Specifically, each piece of data in the graph database carries its source data source and its position in that source, for example a row number and a column number. When the data writing interface is called, the data is therefore written back to the corresponding position of the corresponding data source according to the recorded source and position.

The data writing interface realizes the function of reversely writing the data in the knowledge graph into the data source, and is beneficial to restoring the knowledge graph and mutual verification between the data source and the knowledge graph.
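The reverse write can be illustrated with a short Python sketch. It assumes, purely for illustration, that each data source is a list of rows and that every value in the graph database is stored with its source name, row number and column number; `GraphValue` and `write_back` are hypothetical names, not part of the disclosed interface.

```python
from dataclasses import dataclass


@dataclass
class GraphValue:
    """A value held in the graph database together with its provenance."""
    value: object
    source: str  # name of the originating data source
    row: int     # row number in that source
    column: int  # column number in that source


def write_back(values, sources):
    """Write each graph value to its recorded position in its own source."""
    for v in values:
        sources[v.source][v.row][v.column] = v.value


# Assumption: a data source is modelled as a list of rows.
sources = {
    "people.csv": [["1", "person", "Ann"], ["2", "person", "Bob"]],
}
graph_values = [GraphValue("Robert", "people.csv", row=1, column=2)]
write_back(graph_values, sources)
```

Because only the recorded position is overwritten, the other rows of the source stay untouched, which is what allows the data source and the knowledge graph to be verified against each other.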

Example three

Fig. 3 is a schematic structural diagram of a big data processing apparatus based on a heterogeneous distributed knowledge graph according to a third embodiment of the present invention. This embodiment is suitable for the case of processing big data on a heterogeneous distributed knowledge graph. With reference to fig. 3, the apparatus comprises: a building module 31, a determining module 32, a computing node obtaining module 33, a filtering module 34 and a computing module 35.

The building module 31 is configured to build a node table and a relationship table of the heterogeneous distributed knowledge graph according to a data structure of the heterogeneous distributed knowledge base;

the determining module 32 is configured to determine a graph computation scenario according to the graph computation request, and to determine the type and/or attribute of a node and the type and/or attribute of an edge required by the graph computation scenario;

the computing node obtaining module 33 is configured to extract at least one computing node corresponding to the graph computation scenario from the node table and the relationship table according to the type and/or attribute of the node and the type and/or attribute of the edge required by the graph computation scenario;

the filtering module 34 is configured to filter out node data corresponding to the at least one computing node from the heterogeneous distributed knowledge graph;

the computing module 35 is configured to perform data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph.

Optionally, the heterogeneous distributed knowledge graph is stored in a distributed manner across a plurality of devices. When filtering out node data corresponding to at least one computing node from the heterogeneous distributed knowledge graph, the filtering module 34 is specifically configured to: filter out node data corresponding to the at least one computing node from each device. When processing the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph, the computing module 35 is specifically configured to: perform calculation on the corresponding node data filtered from each device; and summarize the calculation result of each device to obtain the data processing result based on the heterogeneous distributed knowledge graph.
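The per-device filtering, per-device calculation and final summarization can be sketched as follows. This is a minimal single-process Python illustration: each device is modelled as a list of rows, and the calculation (a count and a sum over a hypothetical `amount` field) is an assumption chosen only because it can be combined across devices; the embodiment does not fix the calculation.

```python
def filter_device(device_rows, computing_nodes):
    """Keep only the rows on one device that belong to a computing node."""
    return [r for r in device_rows if r["id"] in computing_nodes]


def compute_device(rows):
    """Per-device partial result: a row count and the sum of 'amount'."""
    return {"count": len(rows), "total": sum(r["amount"] for r in rows)}


def summarize(partials):
    """Combine the per-device partial results into a single result."""
    return {
        "count": sum(p["count"] for p in partials),
        "total": sum(p["total"] for p in partials),
    }


# Two devices, each holding part of the graph's node data (illustrative data).
devices = [
    [{"id": 1, "amount": 10}, {"id": 2, "amount": 5}],
    [{"id": 3, "amount": 7}, {"id": 4, "amount": 1}],
]
computing_nodes = {1, 3, 4}
partials = [compute_device(filter_device(d, computing_nodes)) for d in devices]
result = summarize(partials)
```

Filtering before calculation means that each device only processes the rows relevant to the scenario, and only the small partial results need to be brought together.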

Optionally, the apparatus further includes an adding module, configured to add the calculation result to an attribute set in the node table or an attribute set in the relationship table after performing data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph; and/or adding the calculation result into the attribute of the corresponding node in the node table or the attribute of the corresponding edge in the relation table.
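A minimal Python sketch of the adding module's behaviour for the node table is given below. The dictionary layout of the node table and the attribute name `pagerank` are assumptions made for illustration only.

```python
node_table = {
    "nodes": {
        1: {"type": "person", "attrs": {"name": "Ann"}},
        2: {"type": "person", "attrs": {"name": "Bob"}},
    },
    "attr_set": {"name"},
}


def add_result(node_table, attr_name, per_node_result):
    """Register a new attribute and store each node's computed value under it."""
    # Add the result to the attribute set of the node table.
    node_table["attr_set"].add(attr_name)
    # Add the result to the attributes of each corresponding node.
    for node_id, value in per_node_result.items():
        node_table["nodes"][node_id]["attrs"][attr_name] = value


add_result(node_table, "pagerank", {1: 0.6, 2: 0.4})
```

Once stored in this way, the result can be selected like any other attribute by a later graph calculation scene.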

Optionally, when determining at least one computing node corresponding to the graph computation scenario from the node table and the relationship table according to the type and/or attribute of the node and the type and/or attribute of the edge required by the graph computation scenario, the computing node obtaining module 33 is specifically configured to: search the node table for node identifiers, and search the relationship table for starting node identifiers and target node identifiers, according to the type and/or attribute of the node and the type and/or attribute of the edge required by the graph computation scenario; and determine the at least one computing node according to the node identifiers, the starting node identifiers and the target node identifiers.
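The lookup described above may be illustrated by the following Python sketch. The table layouts are hypothetical, and the rule used to combine the three kinds of identifiers (a computing node must match the required node types and also be an endpoint of an edge of a required type) is an assumption of this sketch; the embodiment only states that the computing nodes are determined from these identifiers.

```python
node_table = [
    {"id": 1, "type": "person", "attrs": {"city": "Shanghai"}},
    {"id": 2, "type": "person", "attrs": {"city": "Beijing"}},
    {"id": 3, "type": "company", "attrs": {"city": "Shanghai"}},
]
relation_table = [
    {"start": 1, "target": 3, "type": "works_at", "attrs": {}},
    {"start": 2, "target": 1, "type": "knows", "attrs": {}},
]


def find_computing_nodes(node_table, relation_table, node_types, edge_types):
    """Look up node ids by type and edge endpoints by type, then combine them."""
    node_ids = {n["id"] for n in node_table if n["type"] in node_types}
    endpoint_ids = set()
    for e in relation_table:
        if e["type"] in edge_types:
            endpoint_ids.update((e["start"], e["target"]))
    # Assumed combination rule: matching node type AND endpoint of a matching edge.
    return node_ids & endpoint_ids


computing_nodes = find_computing_nodes(
    node_table,
    relation_table,
    node_types={"person", "company"},
    edge_types={"works_at"},
)
```

Filtering by attribute instead of, or in addition to, type would follow the same pattern with a predicate over `attrs`.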

Optionally, when constructing the node table and the relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge base, the building module 31 is specifically configured to: load a node data source and an edge data source for building the heterogeneous distributed knowledge graph; identify a data structure of the node data source and a data structure of the edge data source; construct the node table according to the data structure of the node data source; construct the relationship table according to the data structure of the edge data source; and read the data in the node data source and the edge data source into a graph database corresponding to the heterogeneous distributed knowledge graph according to the node table and the relationship table.

Optionally, when the building module builds the node table according to the data structure of the node data source, the building module is specifically configured to: generate an identifier of each node in the node table according to the number column in the node data source; generate the type of each node in the node table according to the type field in the node data source; generate the attributes of each node in the node table according to the attribute fields in the node data source; and summarize the types and the attributes of the nodes respectively to generate a type set and an attribute set of the nodes.
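The construction of the node table can be sketched in Python as follows. The node data source is modelled as a list of dictionaries, and the column names `no`, `kind`, `name` and `age` are hypothetical; in practice they would come from the identified data structure of the node data source.

```python
def build_node_table(rows, id_col, type_field, attr_fields):
    """Build node entries plus the summarized type set and attribute set."""
    nodes, type_set, attr_set = [], set(), set()
    for row in rows:
        # Attributes: only the attribute fields actually present in this row.
        attrs = {f: row[f] for f in attr_fields if f in row}
        nodes.append({"id": row[id_col], "type": row[type_field], "attrs": attrs})
        type_set.add(row[type_field])
        attr_set.update(attrs)  # adds the attribute names (dictionary keys)
    return {"nodes": nodes, "type_set": type_set, "attr_set": attr_set}


rows = [
    {"no": 1, "kind": "person", "name": "Ann", "age": 30},
    {"no": 2, "kind": "company", "name": "Acme"},
]
node_table = build_node_table(
    rows, id_col="no", type_field="kind", attr_fields=["name", "age"]
)
```

The type set and attribute set are what a later graph calculation scene selects from, so they are collected once while the table is built.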

Optionally, when the building module builds the relationship table according to the data structure of the edge data source, the building module is specifically configured to: generate a starting node identifier of each edge in the relationship table according to the number column corresponding to the starting node in the edge data source; generate a target node identifier of each edge in the relationship table according to the number column corresponding to the target node in the edge data source; generate the type of each edge in the relationship table according to the type field in the edge data source; generate the attributes of each edge in the relationship table according to the attribute fields in the edge data source; and summarize the types and the attributes of the edges respectively to generate a type set and an attribute set of the edges.
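The relationship table can be built in the same manner. The following Python sketch again models the edge data source as a list of dictionaries; the column names `from_no`, `to_no`, `rel` and `since` are hypothetical.

```python
def build_relation_table(rows, start_col, target_col, type_field, attr_fields):
    """Build edge entries plus the summarized type set and attribute set."""
    edges, type_set, attr_set = [], set(), set()
    for row in rows:
        attrs = {f: row[f] for f in attr_fields if f in row}
        edges.append(
            {
                "start": row[start_col],    # starting node identifier
                "target": row[target_col],  # target node identifier
                "type": row[type_field],
                "attrs": attrs,
            }
        )
        type_set.add(row[type_field])
        attr_set.update(attrs)  # adds the attribute names (dictionary keys)
    return {"edges": edges, "type_set": type_set, "attr_set": attr_set}


rows = [
    {"from_no": 1, "to_no": 3, "rel": "works_at", "since": 2015},
    {"from_no": 2, "to_no": 1, "rel": "knows"},
]
relation_table = build_relation_table(
    rows, start_col="from_no", target_col="to_no", type_field="rel", attr_fields=["since"]
)
```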

The big data processing device based on the heterogeneous distributed knowledge graph provided by the embodiment of the invention can execute the big data processing method based on the heterogeneous distributed knowledge graph provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.

Example four

Fig. 4 is a schematic structural diagram of a computer apparatus according to a fourth embodiment of the present invention, as shown in fig. 4, the computer apparatus includes a processor 40, a memory 41, an input device 42, and an output device 43; the number of processors 40 in the computer device may be one or more, and one processor 40 is taken as an example in fig. 4; the processor 40, the memory 41, the input device 42 and the output device 43 in the computer apparatus may be connected by a bus or other means, and the connection by the bus is exemplified in fig. 4.

The memory 41 serves as a computer-readable storage medium, and may be used for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the heterogeneous distributed knowledge-graph-based big data processing method in the embodiment of the present invention (for example, the building module 31, the determining module 32, the computing node obtaining module 33, the filtering module 34, and the computing module 35 in the heterogeneous distributed knowledge-graph-based big data processing apparatus). The processor 40 executes various functional applications and data processing of the electronic device by executing software programs, instructions and modules stored in the memory 41, that is, implements the data processing method based on the heterogeneous distributed knowledge graph.

The memory 41 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 41 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 41 may further include memory located remotely from processor 40, which may be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 42 is operable to receive input numeric or character information, such as node tables and data tables, and to generate key signal inputs associated with user settings and function controls of the computer apparatus. The output device 43 may include a display device such as a display screen for displaying the result of the big data processing.

Example five

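The fifth embodiment of the present invention further provides a storage medium having instructions stored thereon. The instructions, when executed by a computer processor, perform a heterogeneous distributed knowledge-graph based big data processing method, the method comprising:

constructing a node table and a relation table of the heterogeneous distributed knowledge graph according to a data structure of the heterogeneous distributed knowledge base;

determining a graph calculation scene according to the graph calculation request, and determining the type and/or attribute of a node and the type and/or attribute of an edge required by the graph calculation scene;

extracting at least one computing node corresponding to the graph calculation scene from the node table and the relation table according to the type and/or attribute of the node and the type and/or attribute of the edge required by the graph calculation scene;

filtering out node data corresponding to the at least one computing node from the heterogeneous distributed knowledge graph;

and performing data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph.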

Of course, the storage medium storing the instructions provided by the embodiments of the present invention is not limited to the above method operations, and may also perform related operations in the heterogeneous distributed knowledge graph-based big data processing method provided by any embodiment of the present invention.

From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

It should be noted that, in the embodiment of the big data processing apparatus based on the heterogeneous distributed knowledge graph, the included units and modules are only divided according to the functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A big data processing method based on a heterogeneous distributed knowledge graph is characterized by comprising the following steps:

constructing a node table and a relation table of the heterogeneous distributed knowledge graph according to a data structure of a heterogeneous distributed knowledge base;

filtering out node data corresponding to the at least one computing node from the heterogeneous distributed knowledge graph;

wherein the node table includes: the identifier of each node, the type of each node, the attribute of each node, and the type set and the attribute set of the nodes; and the relation table includes: the starting node identifier of each edge, the target node identifier of each edge, the type of each edge, the attribute of each edge, and the type set and the attribute set of the edges.

2. The method of claim 1, wherein the heterogeneous distributed knowledge graph is stored in a distributed manner across a plurality of devices;

filtering node data corresponding to the at least one computing node from the heterogeneous distributed knowledge graph, including:

filtering out node data corresponding to the at least one computing node from each device;

the step of calculating the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph comprises the following steps:

calculating the corresponding node data filtered from each device;

and summarizing the calculation result of each device to obtain a data processing result based on the heterogeneous distributed knowledge graph.

3. The method according to claim 1, wherein after performing data processing on the filtered node data to obtain a data processing result based on a heterogeneous distributed knowledge graph, the method further comprises:

adding the data processing result to an attribute set in the node table or an attribute set in the relation table; and/or,

and adding the data processing result to the attribute of the corresponding node in the node table or the attribute of the corresponding edge in the relation table.

4. The method according to claim 3, wherein the determining at least one computing node corresponding to the data computing scenario from the node table and the relationship table according to the type and/or attribute of the node and the type and/or attribute of the edge required by the graph computing scenario comprises:

according to the type and/or attribute of the node and the type and/or attribute of the edge required by the scene of the graph calculation, searching the identifier of the node in a node table, and searching the identifier of the initial node and the identifier of the target node in a relation table;

and determining the at least one computing node according to the node identifier, the starting node identifier and the target node identifier.

5. The method of claim 1, wherein constructing the node tables and the relationship tables of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge base comprises:

loading a node data source and an edge data source for constructing a heterogeneous distributed knowledge graph;

identifying a data structure of the node data source and a data structure of an edge data source;

constructing the node table according to a data structure of a node data source; constructing the relation table according to the data structure of the edge data source;

and reading the data in the node data source and the edge data source into the graph database corresponding to the heterogeneous distributed knowledge graph according to the node table and the relation table.

6. The method of claim 5, wherein constructing the node table according to the data structure of the node data source comprises:

generating an identifier of each node in the node table according to the serial number column in the node data source;

generating the type of each node in the node table according to the type field in the node data source;

generating attributes of each node in a node table according to attribute fields in a node data source;

and respectively summarizing the types and the attributes of the nodes to generate a type set and an attribute set of the nodes.

7. The method of claim 5, wherein constructing the relational table according to the data structure of the edge data source comprises:

generating an initial node identifier of each edge in a relation table according to a number column corresponding to the initial node in the edge data source;

generating a target node identifier of each edge in a relation table according to a number column corresponding to a target node in an edge data source;

generating the type of each edge in the relation table according to the type field in the edge data source;

generating the attribute of each edge in the relation table according to the attribute field in the edge data source;

and summarizing the types and the attributes of the edges respectively to generate a type set and an attribute set of the edges.

8. A computer device comprising a processor and a memory, the memory to store instructions that, when executed, cause the processor to:

9. The computer device of claim 8, wherein the heterogeneous distributed knowledge graph is stored in a distributed manner across a plurality of devices;

the processor is configured to filter node data corresponding to the at least one computing node from the heterogeneous distributed knowledge graph by:

the processor is configured to obtain a data processing result based on the heterogeneous distributed knowledge graph by:

calculating the corresponding node data filtered from each device;

10. The computer device of claim 8, wherein the processor is further configured to:

after data processing results based on the heterogeneous distributed knowledge graph are obtained,

11. The computer device of claim 10, wherein the processor is configured to determine the at least one compute node corresponding to the data computation scenario by:

12. The computer apparatus of claim 8, wherein the processor is configured to construct the node tables and relationship tables of the heterogeneous distributed knowledge graph by:

13. The computer device of claim 12, wherein the processor is configured to construct the node table by:

14. The computer device of claim 12, wherein the processor is configured to construct the relationship table by:

15. A storage medium for storing instructions for performing the heterogeneous distributed knowledge graph-based big data processing method according to any one of claims 1 to 7.