WO2021032002A1 - Big data processing method based on heterogeneous distributed knowledge graph, device, and medium - Google Patents

Big data processing method based on heterogeneous distributed knowledge graph, device, and medium

Info

Publication number
WO2021032002A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
data
attribute
graph
knowledge graph
Prior art date
Application number
PCT/CN2020/109226
Other languages
French (fr)
Chinese (zh)
Inventor
宋群豪
Original Assignee
星环信息科技(上海)股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 星环信息科技(上海)股份有限公司
Publication of WO2021032002A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Definitions

  • The present disclosure relates to knowledge graph technology, for example, to a big data processing method, device and medium based on a heterogeneous distributed knowledge graph.
  • A knowledge graph, also known as a scientific knowledge graph, is called knowledge domain visualization or a knowledge domain mapping map in the library and information industry.
  • the life cycle of the knowledge graph consists of the following parts: data extraction, transformation and loading (Extract-Transform-Load, ETL), knowledge extraction, definition graph, data import, knowledge reasoning and knowledge application.
  • Knowledge graphs are divided into heterogeneous knowledge graphs and isomorphic knowledge graphs.
  • the nodes and edges in the isomorphic knowledge graph have the same type, that is, no type distinction is made.
  • the nodes and edges in the heterogeneous knowledge graph can have different types, and can even have different attributes.
  • Heterogeneous knowledge graphs are described in the form of triples, quintuples, or seven-tuples. For example, a large-scale directed knowledge graph composed of "points and edges" is represented by "concepts, relationships, and rules". Describing the knowledge graph in the form of multiple groups can clearly express the relationship between concepts and concepts, between concepts and entities, between entities and entities, between entities and attributes, and between attributes and attribute values.
  • Although the multi-tuple form brings many benefits, when computing on a heterogeneous distributed knowledge graph the multi-tuple form is not concise enough and contains a lot of redundant information, which makes it hard to filter the node data of interest and greatly increases computational complexity.
  • the present disclosure provides a big data processing method, device, equipment and medium based on a heterogeneous distributed knowledge graph to provide an effective data processing scheme for a heterogeneous distributed knowledge graph.
  • a big data processing method based on heterogeneous distributed knowledge graph including:
  • according to the graph calculation request, determine the graph calculation scenario, and determine at least one of the type and the attribute of the nodes required for the graph calculation scenario, and at least one of the type and the attribute of the edges;
  • the node table includes: the identifiers of multiple nodes, the types of multiple nodes, the attributes of multiple nodes, the type set of the nodes, and the attribute set of the nodes;
  • the relationship table includes: the start node identifiers of multiple edges, the target node identifiers of multiple edges, the types of multiple edges, the attributes of multiple edges, the type set of the edges, and the attribute set of the edges.
  • a big data processing device based on heterogeneous distributed knowledge graphs including:
  • the building module is set to construct the node table and relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph;
  • the determining module is configured to determine the graph calculation scenario according to the graph calculation request, and determine at least one of the type and the attribute of the nodes required for the graph calculation scenario, and at least one of the type and the attribute of the edges;
  • the computing node acquisition module is configured to extract, from the node table and the relationship table, at least one computing node corresponding to the graph calculation scenario, according to at least one of the type and the attribute of the nodes and at least one of the type and the attribute of the edges required for the graph calculation scenario;
  • a filtering module configured to filter out node data corresponding to the at least one computing node from the heterogeneous distributed knowledge graph;
  • the calculation module is configured to perform data processing on the filtered node data to obtain the data processing result based on the heterogeneous distributed knowledge graph;
  • the node table includes: the identifiers of multiple nodes, the types of multiple nodes, the attributes of multiple nodes, the type set of the nodes, and the attribute set of the nodes;
  • the relationship table includes: the start node identifiers of multiple edges, the target node identifiers of multiple edges, the types of multiple edges, the attributes of multiple edges, the type set of the edges, and the attribute set of the edges.
  • a computer device including a processor and a memory, where the memory is used to store instructions, and when the instructions are executed, the processor performs the following operations:
  • according to the graph calculation request, determine the graph calculation scenario, and determine at least one of the type and the attribute of the nodes required for the graph calculation scenario, and at least one of the type and the attribute of the edges;
  • the node table includes: the identifiers of multiple nodes, the types of multiple nodes, the attributes of multiple nodes, the type set of the nodes, and the attribute set of the nodes;
  • the relationship table includes: the start node identifiers of multiple edges, the target node identifiers of multiple edges, the types of multiple edges, the attributes of multiple edges, the type set of the edges, and the attribute set of the edges.
  • a storage medium is also provided, the storage medium is used to store instructions, and the instructions are used to execute:
  • according to the graph calculation request, determine the graph calculation scenario, and determine at least one of the type and the attribute of the nodes required for the graph calculation scenario, and at least one of the type and the attribute of the edges;
  • the node table includes: the identifiers of multiple nodes, the types of multiple nodes, the attributes of multiple nodes, the type set of the nodes, and the attribute set of the nodes;
  • the relationship table includes: the start node identifiers of multiple edges, the target node identifiers of multiple edges, the types of multiple edges, the attributes of multiple edges, the type set of the edges, and the attribute set of the edges.
  • FIG. 1 is a flowchart of a big data processing method based on a heterogeneous distributed knowledge graph provided by Embodiment 1 of the present invention;
  • FIG. 2 is a flowchart of a big data processing method based on a heterogeneous distributed knowledge graph provided by Embodiment 2 of the present invention;
  • FIG. 3 is a schematic structural diagram of a big data processing device based on a heterogeneous distributed knowledge graph provided by Embodiment 3 of the present invention;
  • FIG. 4 is a schematic structural diagram of a computer device provided by Embodiment 4 of the present invention.
  • FIG. 1 is a flowchart of a big data processing method based on a heterogeneous distributed knowledge graph provided by Embodiment 1 of the present invention. This embodiment may be applicable to the case of performing data processing on a heterogeneous distributed knowledge graph.
  • the method can be executed by a big data processing device based on a heterogeneous distributed knowledge graph.
  • the device can be composed of hardware and/or software and integrated in a computer device.
  • the software can be written in the Scala or Java programming language.
  • heterogeneity is opposite to isomorphism.
  • the nodes and edges in the isomorphic knowledge graph have the same type, that is, no type distinction is made, while the nodes and edges in the heterogeneous knowledge graph have different types.
  • For example, each node in an isomorphic knowledge graph represents a person, and each edge represents an acquaintance relationship between people.
  • the nodes in the heterogeneous knowledge graph can represent people, accounts, companies, etc.
  • the relationship between a person and an account is ownership, and the relationship between a person and a company is employment.
  • each type of node and edge also has different attributes.
  • the heterogeneous distributed knowledge graph in this embodiment refers to a heterogeneous knowledge graph stored in multiple devices in a distributed manner.
  • the knowledge graph includes a huge amount of data with different types and attributes. There has not yet been an effective data processing method for such knowledge graphs.
  • Based on this, in conjunction with FIG. 1, the big data processing method provided in this embodiment includes the following operations.
  • Step 110: Construct a node table and relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph.
  • the heterogeneous distributed knowledge graph corresponds to a node table and a relationship table.
  • the node table includes: the types and attributes of multiple nodes, and the relationship table includes: the types and attributes of multiple edges. Among them, different types of nodes or edges have different attributes.
  • the node table includes: identifications of multiple nodes, types of multiple nodes, attributes of multiple nodes, type collections and attribute collections of nodes.
  • the identifier of the node serves as the unique identifier of the node, which can be a character string or a number.
  • the type of a node is an essential element in a heterogeneous graph, and is a string type.
  • the attribute of a node can be a dictionary (map) type: map<string, string>, for example, map<gender->man, age->20>.
  • the node type set and attribute set indicate which types of nodes exist in the graph, and which attributes of each node type can be used for calculation.
  • the type set and attribute set of the node can be stored in the hidden column of the node table, which is unique to the graph. Based on this, the serialized information can be recorded in the meta information of the Schema column. There is no need to carry this data repeatedly in each row of data to save space.
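As a minimal Python sketch of the node table described above, the following keeps one row per node and stores the type set and attribute set once, as table-level metadata, instead of repeating them in every row. The names `NodeRow`, `NodeTable`, `type_set` and `attribute_sets` are illustrative assumptions, not names from the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class NodeRow:
    node_id: str                 # unique identifier (a string here; could be a number)
    node_type: str               # e.g. "person", "website"
    attributes: dict[str, str]   # e.g. {"gender": "man", "age": "20"}


@dataclass
class NodeTable:
    rows: list[NodeRow] = field(default_factory=list)
    # Table-level metadata, stored once rather than in each row.
    type_set: set[str] = field(default_factory=set)
    attribute_sets: dict[str, set[str]] = field(default_factory=dict)

    def add(self, row: NodeRow) -> None:
        """Append a row and keep the type set and attribute set up to date."""
        self.rows.append(row)
        self.type_set.add(row.node_type)
        self.attribute_sets.setdefault(row.node_type, set()).update(row.attributes)


table = NodeTable()
table.add(NodeRow("001", "person", {"gender": "man", "age": "20"}))
table.add(NodeRow("abc.com", "website", {"language": "en"}))  # "language" is a made-up attribute
```

Keeping the sets on the table object mirrors the space-saving idea in the text: a query can learn which types and attributes exist without scanning the rows.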
  • the relationship table includes: the start node identifiers of multiple edges, the target node identifiers of multiple edges, the types of multiple edges, the attributes of multiple edges, the set of edge types and the set of attributes.
  • the start node identifier and the target node identifier can be character strings or numbers.
  • the edge type is an essential element in a heterogeneous graph, and is a string type.
  • the attribute of an edge can be a dictionary (map) type: map<string, string>, for example, map<gender->man, age->20, province->M, city->P>.
  • the edge type set and attribute set indicate which types of edges exist in the graph, and which attributes of each edge type can be used for calculation.
  • the type set and attribute set of the edge can be stored in the hidden column of the relational table, which is unique to the graph. Based on this, the serialized information can be recorded in the meta information of the Schema column. There is no need to carry this data repeatedly in each row of data to save space.
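The idea of recording the serialized type set and attribute set once in schema metadata can be sketched as follows. This is only an illustration in Python: the row layout, the `schema_metadata` dictionary, the JSON encoding and the edge attributes `time` and `anchor` are all assumptions, since the disclosure does not specify a serialization format.

```python
import json

# Relationship table rows: start node ID, target node ID, edge type, edge attributes.
edges = [
    {"start": "001", "target": "abc.com", "type": "click", "attrs": {"time": "09:00"}},
    {"start": "abc.com", "target": "bcd.com", "type": "link", "attrs": {"anchor": "home"}},
]


def summarize(rows):
    """Collect the edge types and the attribute names seen for each edge type."""
    summary = {}
    for row in rows:
        summary.setdefault(row["type"], set()).update(row["attrs"])
    return summary


def serialize_metadata(summary):
    # Sets are not JSON-serializable, so sorted lists are written instead.
    return json.dumps({t: sorted(names) for t, names in summary.items()}, sort_keys=True)


# Stored once alongside the table schema, not repeated in each row.
schema_metadata = {"edge_sets": serialize_metadata(summarize(edges))}

# Reading the metadata back restores the type set (keys) and attribute sets (values).
restored = {t: set(names) for t, names in json.loads(schema_metadata["edge_sets"]).items()}
```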
  • Constructing the node table and relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph may include: loading the node data source and edge data source used to construct the heterogeneous distributed knowledge graph; identifying the data structure of the node data source and the data structure of the edge data source; constructing the node table according to the data structure of the node data source, and constructing the relationship table according to the data structure of the edge data source; and, according to the node table and the relationship table, reading the data in the node data source and the edge data source into the graph database corresponding to the heterogeneous distributed knowledge graph.
  • Constructing the node table according to the data structure of the node data source may include: generating the identifiers of multiple nodes in the node table according to the number column in the node data source; generating the types of multiple nodes in the node table according to the type field in the node data source; generating the attributes of multiple nodes in the node table according to the attribute fields in the node data source; and summarizing the types and attributes of the multiple nodes respectively to generate the type set and attribute set of the nodes.
  • Constructing the relationship table according to the data structure of the edge data source may include: generating the start node identifiers of multiple edges in the relationship table according to the number column corresponding to the start node in the edge data source; generating the target node identifiers of multiple edges in the relationship table according to the number column corresponding to the target node in the edge data source; generating the types of multiple edges in the relationship table according to the type field in the edge data source; generating the attributes of multiple edges in the relationship table according to the attribute fields in the edge data source; and summarizing the types and attributes of the multiple edges respectively to generate the type set and attribute set of the edges.
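The construction steps above can be sketched in Python as two small functions that map a number column, a type field and attribute fields onto table rows, and summarize the types and attributes as they go. The function names, the dictionary layout, and the sample column names (`id`, `kind`, `from`, `to`, `count`) are assumptions made for illustration.

```python
def build_node_table(records, id_col, type_field, attr_fields):
    """Build node rows plus the summarized type set and per-type attribute set."""
    table = {"rows": [], "type_set": set(), "attribute_sets": {}}
    for rec in records:
        # Keep only attribute fields that actually hold a value for this record.
        attrs = {k: rec[k] for k in attr_fields if rec.get(k) not in (None, "")}
        table["rows"].append({"id": rec[id_col], "type": rec[type_field], "attrs": attrs})
        table["type_set"].add(rec[type_field])
        table["attribute_sets"].setdefault(rec[type_field], set()).update(attrs)
    return table


def build_relation_table(records, start_col, target_col, type_field, attr_fields):
    """Build edge rows plus the summarized edge type set and attribute set."""
    table = {"rows": [], "type_set": set(), "attribute_sets": {}}
    for rec in records:
        attrs = {k: rec[k] for k in attr_fields if rec.get(k) not in (None, "")}
        table["rows"].append({
            "start": rec[start_col], "target": rec[target_col],
            "type": rec[type_field], "attrs": attrs,
        })
        table["type_set"].add(rec[type_field])
        table["attribute_sets"].setdefault(rec[type_field], set()).update(attrs)
    return table


node_source = [
    {"id": "001", "kind": "person", "gender": "man", "age": "20"},
    {"id": "abc.com", "kind": "website", "gender": "", "age": ""},
]
edge_source = [{"from": "001", "to": "abc.com", "kind": "click", "count": "3"}]

node_table = build_node_table(node_source, "id", "kind", ["gender", "age"])
relation_table = build_relation_table(edge_source, "from", "to", "kind", ["count"])
```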
  • Step 120: According to the graph calculation request, determine the graph calculation scenario, determine the type and/or attribute of the node required by the graph calculation scenario, and the type and/or attribute of the edge.
  • this embodiment uses a node table and a relationship table to represent all nodes and edges in the graph.
  • the node table and the relationship table constitute an index of multiple node data in the graph. Based on this, the type and/or attribute of the node required for graph calculation is determined from the node table, and the type and/or attribute of the edge required for the graph calculation scenario is determined from the relationship table.
  • From the node type set and attribute set, determine the type and/or attribute of the nodes required for the graph calculation scenario; from the edge type set and attribute set, determine the type and/or attribute of the edges required for the graph calculation scenario.
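One way to picture this step is a lookup from scenario name to required types, checked against the type sets held in the two tables. The Python sketch below assumes a hypothetical `SCENARIO_REQUIREMENTS` registry; the disclosure does not say how a scenario's requirements are stored, so the registry and its contents are illustrative only.

```python
# Hypothetical registry: which node and edge types each scenario needs.
SCENARIO_REQUIREMENTS = {
    "pagerank": {"node_types": {"website"}, "edge_types": {"link", "click"}},
}


def resolve_requirements(scenario, node_type_set, edge_type_set):
    """Return the node and edge types a scenario needs, verified against the graph."""
    required = SCENARIO_REQUIREMENTS[scenario]
    missing = (required["node_types"] - node_type_set) | (required["edge_types"] - edge_type_set)
    if missing:
        raise ValueError(f"graph lacks types required by {scenario}: {sorted(missing)}")
    return required["node_types"], required["edge_types"]


node_types, edge_types = resolve_requirements(
    "pagerank",
    node_type_set={"website", "user", "account"},
    edge_type_set={"link", "click", "owns"},
)
```

Checking against the type sets first means an unsatisfiable request fails before any node data is touched.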
  • Step 130: According to the type and/or attribute of the nodes and the type and/or attribute of the edges required for the graph calculation scenario, extract at least one computing node corresponding to the graph calculation scenario from the node table and the relationship table.
  • According to the type and/or attribute of the nodes and the type and/or attribute of the edges required by the graph calculation scenario, look up the node identifiers in the node table, and look up the start node identifiers and target node identifiers in the relationship table; then determine at least one computing node according to the node identifiers, the start node identifiers and the target node identifiers.
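The lookup in Step 130 can be sketched as a union of matching node identifiers and the endpoints of matching edges. This Python example uses illustrative row dictionaries; the node type `user` for node `001` and the extra `account` node are assumptions added to show what gets excluded.

```python
def extract_computing_nodes(node_rows, edge_rows, node_types, edge_types):
    """Collect IDs of nodes with a required type and endpoints of edges with a required type."""
    ids = {row["id"] for row in node_rows if row["type"] in node_types}
    for row in edge_rows:
        if row["type"] in edge_types:
            ids.update((row["start"], row["target"]))
    return ids


node_rows = [
    {"id": "001", "type": "user"},         # node type "user" is an assumption
    {"id": "abc.com", "type": "website"},
    {"id": "bcd.com", "type": "website"},
    {"id": "acct-9", "type": "account"},   # hypothetical node not needed by the scenario
]
edge_rows = [
    {"start": "abc.com", "target": "bcd.com", "type": "link"},
    {"start": "001", "target": "abc.com", "type": "click"},
    {"start": "001", "target": "acct-9", "type": "owns"},
]

computing_nodes = extract_computing_nodes(node_rows, edge_rows, {"website"}, {"link", "click"})
```

Only the two tables are scanned here; the node data itself stays untouched until the filtering step.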
  • Step 140: Filter out node data corresponding to at least one computing node from the heterogeneous distributed knowledge graph.
  • the corresponding node data is filtered from the heterogeneous distributed knowledge graph according to the identification of the at least one computing node, the starting node identification, and the target node identification.
  • If the node data carries its own node type and attributes, as well as the types and attributes of its corresponding edges, then all the node data is searched for the node data that carries the type and/or attribute of the nodes required for calculation and the type and/or attribute of the edges.
  • a graph database is used as the storage method for graphs.
  • the graph database generally uses property graphs as the basic representation.
  • In a property graph, nodes and relationships can contain attributes, which makes it easier to express realistic business scenarios and more suitable for storing heterogeneous distributed knowledge graphs.
  • the node data corresponding to at least one computing node is filtered from the graph database corresponding to the heterogeneous distributed knowledge graph.
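The two filtering strategies described for Step 140 (by computing node identifier, or by the type carried in the node data itself) can be sketched as follows. The in-memory dictionary `graph_db` stands in for a real graph database and is an assumption of this Python example.

```python
# Stand-in for the graph database: node ID -> node data carrying its own type.
graph_db = {
    "001": {"type": "user", "attrs": {}},
    "abc.com": {"type": "website", "attrs": {}},
    "bcd.com": {"type": "website", "attrs": {}},
    "acct-9": {"type": "account", "attrs": {}},
}


def filter_by_ids(db, computing_node_ids):
    """Keep only the node data whose ID belongs to a computing node."""
    return {node_id: db[node_id] for node_id in computing_node_ids if node_id in db}


def filter_by_carried_type(db, node_types):
    """Scan all node data and keep entries that carry a required node type."""
    return {node_id: data for node_id, data in db.items() if data["type"] in node_types}


by_id = filter_by_ids(graph_db, {"001", "abc.com", "bcd.com"})
by_type = filter_by_carried_type(graph_db, {"website"})
```

The first variant touches only the requested keys, while the second scans every entry, which is the cost the node table and relationship table are meant to avoid.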
  • Step 150: Perform data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph.
  • the node data is calculated based on the current calculation scenario.
  • For example, in a PageRank (page rank) calculation scenario, the website type, click type, and link type required to calculate PageRank are found.
  • the identifiers of the nodes of the website type found in the node table include: abc.com and bcd.com; at the same time, the starting node identifier abc.com and the target node identifier bcd.com of the edge of the link relationship are found in the relationship table, And the start node identifier 001 and the target node identifier abc.com of the edge of the click relationship, the node data corresponding to 001, abc.com, and bcd.com are filtered from the graph database corresponding to the graph. Perform PageRank calculation on the node data of the website type to obtain the PageRank value of each website. For example, the PageRank value of abc.com is 1, and the PageRank value of bcd.com is 2.
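For readers unfamiliar with the calculation, the following is a small Python sketch of the standard power-iteration form of PageRank, run only on the filtered website nodes and link edges. The damping factor, iteration count and dangling-node handling are conventional choices assumed here, not details from the disclosure, and the sketch does not reproduce the illustrative values 1 and 2 quoted above.

```python
def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (start, target) edges between the given nodes."""
    nodes = sorted(nodes)
    n = len(nodes)
    if n == 0:
        return {}
    out_links = {v: [] for v in nodes}
    for start, target in edges:
        if start in out_links and target in out_links:
            out_links[start].append(target)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for target in out_links[v]:
                new_rank[target] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank


# Filtered website nodes and the link edge between them, as in the example above.
ranks = pagerank({"abc.com", "bcd.com"}, [("abc.com", "bcd.com")])
```

Because only the filtered nodes are passed in, the iteration cost depends on the size of the relevant subgraph rather than the whole knowledge graph.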
  • the data processing result is added to the attribute set in the node table, or the attribute set in the relation table; and/or, the data processing result is added to the attribute of the corresponding node in the node table, or the relation table In the attributes of the corresponding edge.
  • Table 3 shows the new node table.
  • the PageRank attribute is added to the attributes of multiple nodes, and the PageRank attribute is added to the attribute set of the nodes.
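Writing a result back can be sketched as adding the new attribute to each affected row and to the attribute set of its node type. The table layout and the function name `add_result` in this Python example are illustrative assumptions; the PageRank values are the ones quoted in the example above.

```python
node_table = {
    "rows": [
        {"id": "abc.com", "type": "website", "attrs": {}},
        {"id": "bcd.com", "type": "website", "attrs": {}},
        {"id": "001", "type": "user", "attrs": {}},  # node type "user" is an assumption
    ],
    "attribute_sets": {"website": set(), "user": set()},
}


def add_result(table, attr_name, values):
    """Store a per-node result as a new attribute and register it in the attribute set."""
    for row in table["rows"]:
        if row["id"] in values:
            row["attrs"][attr_name] = str(values[row["id"]])
            table["attribute_sets"].setdefault(row["type"], set()).add(attr_name)


add_result(node_table, "PageRank", {"abc.com": 1, "bcd.com": 2})
```

Registering the attribute in the attribute set makes the result discoverable by later graph calculation requests in the same way as any original attribute.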
  • In this embodiment, the heterogeneous distributed knowledge graph is represented by a node table and a relationship table. From the node table and the relationship table, the types and/or attributes of the nodes and the types and/or attributes of the edges required for the graph calculation scenario can be determined, so that the corresponding node data can be filtered from the heterogeneous distributed knowledge graph accordingly. Data processing is then performed only on the filtered node data, and the data processing result based on the heterogeneous distributed knowledge graph is obtained without processing the entire graph. This embodiment thus provides an efficient data processing method for heterogeneous distributed knowledge graphs based on the node table and the relationship table.
  • Filtering out the node data corresponding to at least one computing node from the heterogeneous distributed knowledge graph includes: filtering out the node data corresponding to the at least one computing node from each device. Performing data processing on the filtered node data to obtain the data processing result based on the heterogeneous distributed knowledge graph includes: performing data processing on the node data corresponding to the at least one computing node filtered from each device; and summarizing the data processing results of the devices to obtain the data processing result based on the heterogeneous distributed knowledge graph.
  • the node table further includes the devices on which the multiple nodes are stored.
  • From the node table, according to the type and/or attribute of the nodes required by the graph calculation scenario and the type and/or attribute of the edges, determine the devices on which the nodes required for the calculation are stored, and filter the node data from the corresponding devices.
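The per-device flow (locate devices from the node table, filter and process locally, then summarize) can be sketched as below. In this Python example the `device` column, the device names, the `visits` attribute and the summing operation are all hypothetical; a real deployment would run the per-device step on the devices themselves rather than in one process.

```python
from collections import defaultdict


def plan_devices(node_rows, node_types):
    """Use the node table's device column to find which IDs to fetch from which device."""
    plan = defaultdict(set)
    for row in node_rows:
        if row["type"] in node_types:
            plan[row["device"]].add(row["id"])
    return dict(plan)


def process_on_device(store, ids):
    """Per-device step: filter the local node data and compute a partial result."""
    return sum(store[node_id]["visits"] for node_id in ids if node_id in store)


def run(node_rows, devices, node_types):
    """Summarize the partial results of all devices into one result."""
    plan = plan_devices(node_rows, node_types)
    return sum(process_on_device(devices[device], ids) for device, ids in plan.items())


node_rows = [
    {"id": "abc.com", "type": "website", "device": "dev-1"},
    {"id": "bcd.com", "type": "website", "device": "dev-2"},
    {"id": "001", "type": "user", "device": "dev-1"},
]
devices = {
    "dev-1": {"abc.com": {"visits": 3}, "001": {"visits": 0}},
    "dev-2": {"bcd.com": {"visits": 5}},
}

total_visits = run(node_rows, devices, {"website"})
```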
  • the process of constructing the heterogeneous distributed knowledge graph is also included.
  • This embodiment is applicable to the construction of a heterogeneous distributed knowledge graph in the case of multiple data sources.
  • multiple data sources include traditional relational databases (such as Oracle, MySQL, etc.), distributed relational databases (such as Hive, etc.), distributed non-relational databases (such as HBase, ElasticSearch, etc.), TXT files, and CSV files.
  • the method provided by the embodiment of the present invention includes the following operations.
  • Step 210: Load a node data source and an edge data source used to construct a heterogeneous distributed knowledge graph.
  • Both the node data source and the edge data source can be structured databases, such as traditional relational databases (such as Oracle and MySQL), distributed relational databases (such as Hive), and distributed non-relational databases (such as HBase and ElasticSearch), or unstructured text, such as TXT files and CSV files.
  • an abstract data source interface is designed so that the computing device of the heterogeneous distributed knowledge graph can seamlessly connect with multiple data sources. After different data sources are connected through the data source interface, a unified operation method is performed; there is no need to import data from different data sources into a unified data source.
  • the data source interface includes a data structure and definition interface and a data reading interface. In addition, it may also include at least one of a data status checking interface and a data writing interface. Among them, the data structure and definition interface encapsulate the data structure identification method and the graph interface definition method.
  • the above-mentioned interfaces are all application programming interfaces (Application Programming Interface, API).
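An abstract data source interface of this kind might look like the following Python sketch, with two required operations (structure identification and reading) and two optional ones (status checking and writing). The class and method names are assumptions chosen for illustration; only the standard-library `abc`, `csv` and `io` modules are used, and the CSV-backed class is a toy example of one concrete source.

```python
import csv
import io
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Uniform operations that every connected data source is assumed to offer."""

    @abstractmethod
    def identify_structure(self) -> list[str]:
        """Return the column names of the source."""

    @abstractmethod
    def read(self) -> list[dict]:
        """Return all records as dictionaries keyed by column name."""

    def check_status(self) -> bool:
        """Optional: report whether the source is usable."""
        return True

    def write(self, row: int, column: str, value: str) -> None:
        """Optional: write a value back to the source."""
        raise NotImplementedError("this source is read-only")


class CsvTextSource(DataSource):
    """A CSV-backed source held in memory, standing in for a CSV file."""

    def __init__(self, text: str):
        self._text = text

    def identify_structure(self) -> list[str]:
        return next(csv.reader(io.StringIO(self._text)))

    def read(self) -> list[dict]:
        return list(csv.DictReader(io.StringIO(self._text)))


source = CsvTextSource("id,type,age\n001,person,20\n")
columns = source.identify_structure()
records = source.read()
```

A database-backed source would implement the same four methods, so the table construction code never needs to know which kind of source it is reading.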
  • the computing device of the heterogeneous distributed knowledge graph can load the corresponding node data source and edge data source through the storage path of the node data source and the storage path of the edge data source.
  • Step 220: Identify the data structure of the node data source and the data structure of the edge data source.
  • The data structure and definition interface is called to identify the data structure of the node data source and the edge data source.
  • the data structure of the node data source includes a number column, a type field, an attribute field, and node data.
  • the data structure of the edge data source includes a number column corresponding to the start node, a number column corresponding to the target node, a type field, and an attribute field.
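Structure identification can be pictured as classifying a source's columns into the roles listed above. The Python sketch below assumes the role of each column is known by name (for example `id`, `type`, `start_id`, `target_id`), which is a simplifying assumption; the disclosure does not describe how the roles are detected.

```python
def identify_node_structure(columns, id_col="id", type_col="type"):
    """Split node source columns into number column, type field and attribute fields."""
    missing = {id_col, type_col} - set(columns)
    if missing:
        raise ValueError(f"node source lacks columns: {sorted(missing)}")
    attributes = [c for c in columns if c not in (id_col, type_col)]
    return {"id": id_col, "type": type_col, "attributes": attributes}


def identify_edge_structure(columns, start_col="start_id", target_col="target_id", type_col="type"):
    """Split edge source columns into start/target number columns, type field and attribute fields."""
    missing = {start_col, target_col, type_col} - set(columns)
    if missing:
        raise ValueError(f"edge source lacks columns: {sorted(missing)}")
    attributes = [c for c in columns if c not in (start_col, target_col, type_col)]
    return {"start": start_col, "target": target_col, "type": type_col, "attributes": attributes}


node_structure = identify_node_structure(["id", "type", "gender", "age"])
edge_structure = identify_edge_structure(["start_id", "target_id", "type", "count"])
```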
  • Step 230: Construct a node table according to the data structure of the node data source; and construct a relationship table according to the data structure of the edge data source.
  • Call the data structure and definition interface to construct the node table according to the data structure of the node data source, and to construct the relationship table according to the data structure of the edge data source.
  • When constructing the node table, generate the identifiers of multiple nodes in the node table according to the number column in the node data source; generate the types of multiple nodes in the node table according to the type field in the node data source; generate the attributes of multiple nodes in the node table according to the attribute fields in the node data source; and summarize the types and attributes of the multiple nodes respectively to generate the type set and attribute set of the nodes.
  • When constructing the relationship table, generate the start node identifiers of multiple edges in the relationship table according to the number column corresponding to the start node in the edge data source; generate the target node identifiers of multiple edges in the relationship table according to the number column corresponding to the target node in the edge data source; generate the types of multiple edges in the relationship table according to the type field in the edge data source; generate the attributes of multiple edges in the relationship table according to the attribute fields in the edge data source; and summarize the types and attributes of the multiple edges respectively to generate the type set and attribute set of the edges.
  • Step 240: According to the node table and the relation table, read the data in the node data source and the edge data source into the graph database corresponding to the heterogeneous distributed knowledge graph.
  • call the data reading interface in the data source interface and read the data in the node data source and the edge data source into the graph database corresponding to the heterogeneous distributed knowledge graph according to the node table and the relation table.
  • the data reading interface encapsulates a method for reading data to the graph database.
  • the node table and the relationship table include the types and attributes of multiple nodes, the types and attributes of multiple edges, and the connection relationships between nodes and edges. Based on this, the data required by the graph database can be determined according to the node table and the relationship table, and the node data can be read into the graph database.
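Reading data into the graph database according to the two tables can be sketched as building a node map plus an outgoing-edge list per node. In this Python example the in-memory `graph_db` dictionary stands in for a real graph database, and skipping edges with unknown endpoints is an assumed policy, not one stated in the disclosure.

```python
def load_graph(node_table, relation_table):
    """Populate an in-memory property-graph stand-in from the node and relationship tables."""
    graph_db = {"nodes": {}, "out_edges": {}}
    for row in node_table["rows"]:
        graph_db["nodes"][row["id"]] = {"type": row["type"], "attrs": dict(row["attrs"])}
        graph_db["out_edges"][row["id"]] = []
    for row in relation_table["rows"]:
        # Assumed policy: skip edges whose endpoints are absent from the node table.
        if row["start"] in graph_db["nodes"] and row["target"] in graph_db["nodes"]:
            graph_db["out_edges"][row["start"]].append(
                (row["target"], row["type"], dict(row["attrs"]))
            )
    return graph_db


node_table = {"rows": [
    {"id": "abc.com", "type": "website", "attrs": {}},
    {"id": "bcd.com", "type": "website", "attrs": {}},
]}
relation_table = {"rows": [
    {"start": "abc.com", "target": "bcd.com", "type": "link", "attrs": {}},
    {"start": "abc.com", "target": "zzz.com", "type": "link", "attrs": {}},  # unknown target
]}

graph_db = load_graph(node_table, relation_table)
```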
  • Step 250: Obtain the node table and the relationship table of the heterogeneous distributed knowledge graph.
  • Step 260: According to the graph calculation request, determine the graph calculation scenario, determine the type and/or attribute of the node required by the graph calculation scenario, and the type and/or attribute of the edge.
  • Step 270: According to the type and/or attribute of the node required by the graph calculation scenario, and the type and/or attribute of the edge, filter the corresponding node data from the heterogeneous distributed knowledge graph.
  • Step 280: Perform data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph.
  • the data structure of each data source can be identified through the data structure and definition interface in the data source interface, so there is no need to import different data sources into a unified database, and no need to filter data, write import scripts, or use import tools.
  • By calling the data reading interface in the data source interface, the data in the data source is read into the graph database according to the node table and the relationship table, so that the data is read in automatically, without requiring business knowledge or the collaboration of business experts and engineering experts.
  • This embodiment only needs to identify the data structure through the data source interface, and then automatically import data according to the node table and the relationship table, so that the data import process saves a lot of manpower and resources and reduces costs when connecting different data sources.
  • the data source interface further includes: a data status checking interface and/or a data writing interface. Based on this, the calculation method based on the heterogeneous distributed knowledge graph further includes at least one of the following two implementation manners.
  • The first implementation: after loading the node data source and edge data source used to construct the heterogeneous distributed knowledge graph, call the data status checking interface to check whether the working status of the node data source and the edge data source is normal, and report any data source with an abnormal working status to the user.
  • The working status of the node data source and the edge data source can also be checked when identifying their data structures and when reading the data into the graph database, or checked periodically. If the data source is online and the user has access rights, the data source status is normal; if the data source is offline or the user does not have access rights, the data source status is abnormal. This helps ensure the stability and security of the graph construction process.
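The status rule stated above (normal only when the source is online and the user has access rights) is simple enough to write down directly. In this Python sketch, `SourceState` and the source names are hypothetical; how the online and access flags are actually obtained is outside the scope of the example.

```python
from dataclasses import dataclass


@dataclass
class SourceState:
    name: str
    online: bool
    user_has_access: bool


def is_normal(state: SourceState) -> bool:
    """A source is normal only if it is online and the user has access rights."""
    return state.online and state.user_has_access


def abnormal_sources(states):
    """Names of the sources that should be reported to the user."""
    return [state.name for state in states if not is_normal(state)]


states = [
    SourceState("mysql-nodes", online=True, user_has_access=True),
    SourceState("hive-edges", online=False, user_has_access=True),
    SourceState("csv-extra", online=True, user_has_access=False),
]
to_report = abnormal_sources(states)
```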
  • The second implementation: after calling the data reading interface in the data source interface and reading the data in the node data source and the edge data source into the graph database corresponding to the heterogeneous distributed knowledge graph according to the node table and the relationship table, the data writing interface is called to write the data in the graph database back to the corresponding data source.
  • The data in the graph database carries its source data source and its position in that data source, such as the row and column. Therefore, the data writing interface is called to write the data back to the corresponding position of the corresponding data source, according to the source data source of the data in the graph database and its position in that data source.
  • Through the data writing interface, this embodiment implements the function of writing the data in the knowledge graph back to the data source, which is beneficial for restoring the knowledge graph and for mutual verification between the data source and the knowledge graph.
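Write-back driven by carried provenance can be sketched as follows: each value remembers its source name, row and column, and is written to exactly that position. The Python data layout, the source name `nodes.csv`, and the choice to skip unknown positions are assumptions for illustration.

```python
def reverse_write(values, sources):
    """Write each value back to the source, row and column it came from."""
    written = 0
    for item in values:
        rows = sources.get(item["source"])
        if rows is None or not 0 <= item["row"] < len(rows):
            continue  # unknown source or position: skip rather than guess
        rows[item["row"]][item["column"]] = item["value"]
        written += 1
    return written


# Stand-in for external data sources: source name -> list of row dictionaries.
sources = {
    "nodes.csv": [{"id": "abc.com", "pagerank": ""}, {"id": "bcd.com", "pagerank": ""}],
}
# Graph data carrying its provenance (source, row, column).
graph_values = [
    {"source": "nodes.csv", "row": 0, "column": "pagerank", "value": "1"},
    {"source": "nodes.csv", "row": 1, "column": "pagerank", "value": "2"},
    {"source": "missing.csv", "row": 0, "column": "pagerank", "value": "9"},
]

written = reverse_write(graph_values, sources)
```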
  • FIG. 3 is a schematic structural diagram of a big data processing device based on a heterogeneous distributed knowledge graph provided by Embodiment 3 of the present invention. This embodiment is suitable for the case of performing big data processing on a heterogeneous distributed knowledge graph.
  • the device includes: a construction module 31, a determination module 32, a computing node acquisition module 33, a filtering module 34 and a calculation module 35.
  • the construction module 31 is configured to construct the node table and the relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph;
  • the determining module 32 is configured to determine the graph calculation scenario according to the graph calculation request, and determine the type and/or attribute of the node required for the graph calculation scenario, and the type and/or attribute of the edge;
  • the computing node obtaining module 33 is configured to extract at least one computing node corresponding to the graph calculation scenario from the node table and the relationship table, according to the type and/or attribute of the nodes and the type and/or attribute of the edges required for the graph calculation scenario;
  • the filtering module 34 is configured to filter out node data corresponding to at least one computing node from the heterogeneous distributed knowledge graph;
  • the calculation module 35 is configured to perform data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph;
  • the node table includes: identifiers of multiple nodes, types of multiple nodes, attributes of multiple nodes, a node type set and a node attribute set; the relationship table includes: start node identifiers of multiple edges, target node identifiers of multiple edges, types of multiple edges, attributes of multiple edges, an edge type set and an edge attribute set.
  • The heterogeneous distributed knowledge graph is represented by a node table and a relationship table. From these two tables, the types and/or attributes of the nodes and the types and/or attributes of the edges required by the graph calculation scenario can be determined, and the corresponding node data can then be filtered out of the heterogeneous distributed knowledge graph. Data processing is performed only on the filtered node data, so a data processing result based on the heterogeneous distributed knowledge graph is obtained without processing the entire graph. This embodiment therefore provides an efficient data processing method for heterogeneous distributed knowledge graphs based on the node table and the relationship table.
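A small sketch may make the table-driven filtering concrete. The row layout (`id`, `type`, `attrs`, `src`, `dst`), the function names, and the sample data are assumptions for illustration. The text does not specify how the node identifiers found in the node table are combined with the start and target identifiers found in the relationship table; the sketch assumes an intersection.

```python
# Assumed toy tables; the field names are illustrative only.
node_table = [
    {"id": "p1", "type": "person", "attrs": {"age": "20"}},
    {"id": "a1", "type": "account", "attrs": {"balance": "100"}},
    {"id": "c1", "type": "company", "attrs": {"city": "P"}},
]
relationship_table = [
    {"src": "p1", "dst": "a1", "type": "owns", "attrs": {}},
    {"src": "p1", "dst": "c1", "type": "works_at", "attrs": {}},
]


def select_computing_nodes(nodes, edges, node_types, edge_types):
    """Return identifiers of nodes that have a required type and also
    appear as start or target node of an edge with a required type."""
    by_type = {n["id"] for n in nodes if n["type"] in node_types}
    on_edges = set()
    for e in edges:
        if e["type"] in edge_types:
            on_edges.update((e["src"], e["dst"]))
    return by_type & on_edges  # assumption: intersection of both lookups


def filter_node_data(graph_rows, computing_ids):
    """Keep only the rows that belong to the selected computing nodes."""
    return [r for r in graph_rows if r["id"] in computing_ids]


computing_ids = select_computing_nodes(
    node_table, relationship_table, {"person", "account"}, {"owns"}
)
filtered = filter_node_data(node_table, computing_ids)
```

Only the person and account rows reach the processing step here; the company node is never touched, which is the saving the paragraph above describes.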
  • the heterogeneous distributed knowledge graph is stored in multiple devices in a distributed manner.
  • the filtering module 34 is configured to filter out the node data corresponding to at least one computing node from each device.
  • When the calculation module 35 performs data processing on the filtered node data to obtain the data processing result based on the heterogeneous distributed knowledge graph, it is configured to: perform data processing on the corresponding node data filtered from each device; and aggregate the data processing results of the devices to obtain the data processing result based on the heterogeneous distributed knowledge graph.
  • The device further includes an adding module configured to, after data processing is performed on the filtered node data to obtain the data processing result based on the heterogeneous distributed knowledge graph, add the data processing result to the attribute set in the node table or to the attribute set in the relationship table; and/or add the data processing result to the attributes of the node in the node table corresponding to the data processing result, or to the attributes of the edge in the relationship table corresponding to the data processing result.
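The per-device processing followed by aggregation can be illustrated with a short sketch. The choice of computation (counting nodes by type), the device names, and the function names are assumptions; the text does not prescribe a particular calculation.

```python
from collections import Counter


def process_partition(rows, computing_ids):
    """Filter one device's rows to the computing nodes and count them by type."""
    return Counter(r["type"] for r in rows if r["id"] in computing_ids)


def aggregate(partials):
    """Merge the per-device partial results into one overall result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Assumed sample: the graph is split across two devices.
devices = {
    "device-a": [{"id": "p1", "type": "person"}, {"id": "c1", "type": "company"}],
    "device-b": [{"id": "p2", "type": "person"}, {"id": "a1", "type": "account"}],
}
computing_ids = {"p1", "p2", "a1"}

result = aggregate(
    process_partition(rows, computing_ids) for rows in devices.values()
)
```

In a real deployment each `process_partition` call would run on the device that holds the partition, and only the small partial results would be sent for aggregation.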
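As a hedged illustration of the adding module, the sketch below records a computed value both in the table-level attribute set and in the attributes of the affected node rows. The function name `add_result`, the attribute name `score`, and the row layout are assumptions; values are stored as strings only because the description elsewhere gives the attribute map as map<string,string>.

```python
def add_result(node_table, attribute_set, node_type, attr_name, results):
    """Register a new attribute for a node type and store per-node results.

    attribute_set: dict mapping a node type to the set of its attribute names.
    results: dict mapping a node identifier to the computed value.
    """
    attribute_set.setdefault(node_type, set()).add(attr_name)
    for row in node_table:
        if row["id"] in results:
            # Attributes are kept as strings, following map<string,string>.
            row["attrs"][attr_name] = str(results[row["id"]])


node_table = [
    {"id": "p1", "type": "person", "attrs": {"age": "20"}},
    {"id": "p2", "type": "person", "attrs": {"age": "31"}},
]
attribute_set = {"person": {"age"}}

add_result(node_table, attribute_set, "person", "score", {"p1": 0.42})
```

Once the attribute name is in the attribute set, a later graph calculation can request it like any attribute that came from the original data source.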
  • When the filtering module 34 extracts at least one computing node corresponding to the graph calculation scenario from the node table and the relationship table according to the type and/or attribute of the node and the type and/or attribute of the edge required by the graph calculation scenario, it is configured to: according to the required type and/or attribute of the node and type and/or attribute of the edge, look up node identifiers in the node table, and look up start node identifiers and target node identifiers in the relationship table; and determine at least one computing node according to the node identifiers, the start node identifiers and the target node identifiers.
  • When the construction module 31 constructs the node table and the relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph, it is configured to: load the node data source and the edge data source used to construct the heterogeneous distributed knowledge graph; identify the data structure of the node data source and the data structure of the edge data source; construct the node table according to the data structure of the node data source, and construct the relationship table according to the data structure of the edge data source; and read the data in the node data source and the edge data source into the graph database corresponding to the heterogeneous distributed knowledge graph according to the node table and the relationship table.
  • When the construction module constructs the node table according to the data structure of the node data source, it is configured to: generate the identifiers of multiple nodes in the node table according to the number column in the node data source; generate the types of multiple nodes in the node table according to the type field in the node data source; generate the attributes of multiple nodes in the node table according to the attribute fields in the node data source; and summarize the types and attributes of the multiple nodes respectively to generate the node type set and the node attribute set.
  • When the construction module constructs the relationship table according to the data structure of the edge data source, it is configured to: generate the start node identifiers of multiple edges in the relationship table according to the number column corresponding to the start node in the edge data source; generate the target node identifiers of multiple edges in the relationship table according to the number column corresponding to the target node in the edge data source; generate the types of multiple edges in the relationship table according to the type field in the edge data source; generate the attributes of multiple edges in the relationship table according to the attribute fields in the edge data source; and summarize the types and attributes of the multiple edges respectively to generate the edge type set and the edge attribute set.
  • The big data processing device based on the heterogeneous distributed knowledge graph provided by the embodiment of the present invention can execute the big data processing method based on the heterogeneous distributed knowledge graph provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed method.
  • Fig. 4 is a schematic structural diagram of a computer device provided in the fourth embodiment of the present invention.
  • The computer device includes a processor 40, a memory 41, an input device 42, and an output device 43. The number of processors 40 in the computer device can be one or more; in FIG. 4, one processor 40 is taken as an example.
  • The processor 40, memory 41, input device 42, and output device 43 in the computer device can be connected by a bus or by other means; in FIG. 4, a bus connection is taken as an example.
  • the memory 41 can be used to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the big data processing method based on a heterogeneous distributed knowledge graph in the embodiment of the present invention (for example, the building module 31, the determining module 32, the computing node obtaining module 33, the filtering module 34, and the computing module 35 in a big data processing device based on a heterogeneous distributed knowledge graph).
  • the processor 40 executes multiple functional applications and data processing of the electronic device by running the software programs, instructions, and modules stored in the memory 41, that is, realizes the aforementioned data processing method based on the heterogeneous distributed knowledge graph.
  • the memory 41 may include a program storage area and a data storage area.
  • the program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created according to the use of the terminal, and the like.
  • the memory 41 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage devices.
  • the memory 41 may include a memory remotely provided with respect to the processor 40, and these remote memories may be connected to the electronic device through a network. Examples of the aforementioned networks include the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • the input device 42 can be used to receive input digital or character information, and generate key signal inputs related to user settings and function control of the computer equipment, such as node tables and data tables.
  • the output device 43 may include a display device such as a display screen, and the display device is used to display the big data processing result.
  • the fifth embodiment of the present invention also provides a storage medium storing instructions.
  • When executed by a computer processor, the instructions are used to execute a big data processing method based on a heterogeneous distributed knowledge graph, and the method includes:
  • according to the graph calculation request, determining the graph calculation scenario, and determining the type and/or attribute of the node and the type and/or attribute of the edge required by the graph calculation scenario;
  • wherein the node table includes: identifiers of multiple nodes, types of multiple nodes, attributes of multiple nodes, a node type set and a node attribute set; and the relationship table includes: start node identifiers of multiple edges, target node identifiers of multiple edges, types of multiple edges, attributes of multiple edges, an edge type set and an edge attribute set.
  • An embodiment of the present invention provides a storage medium on which instructions are stored.
  • The stored instructions are not limited to the method operations described above, and can also perform related operations in the big data processing method based on a heterogeneous distributed knowledge graph provided by any embodiment of the present invention.
  • the present disclosure can be implemented by software and necessary general-purpose hardware, or can be implemented by hardware.
  • the present disclosure can be embodied in the form of a software product.
  • The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, Read-Only Memory (ROM), Random Access Memory (RAM), flash memory (FLASH), hard disk or optical disk, and includes multiple instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the multiple embodiments of the present disclosure.
  • The multiple units and modules included are divided only according to functional logic; the division is not limited to the one described above, as long as the corresponding functions can be realized. In addition, the names of the functional units are only for ease of distinguishing them from one another and are not used to limit the protection scope of the present disclosure.

Abstract

Disclosed are a big data processing method based on a heterogeneous distributed knowledge graph, a device, and a medium. The big data processing method based on the heterogeneous distributed knowledge graph comprises: constructing a node table and a relationship table of a heterogeneous distributed knowledge graph according to a data structure of the heterogeneous distributed knowledge graph; determining a graph computation scene according to a graph computation request, and determining the type and/or attribute of a node and the type and/or attribute of an edge required by the graph computation scene; extracting, from the node table and the relationship table, at least one computing node corresponding to the graph computation scene; filtering out, from the heterogeneous distributed knowledge graph, node data corresponding to the at least one computing node; and performing data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph.

Description

Big data processing method, device, and medium based on heterogeneous distributed knowledge graph
This application claims priority to Chinese patent application No. 201910770620.6, filed with the China Patent Office on August 20, 2019, the entire content of which is incorporated herein by reference.
Technical Field
The present disclosure relates to knowledge graph technology, for example, to a big data processing method, device, and medium based on a heterogeneous distributed knowledge graph.
Background
A knowledge graph, also called a scientific knowledge graph, is known in the library and information science community as knowledge domain visualization or a knowledge domain map. The life cycle of a knowledge graph consists of the following parts: Extract-Transform-Load (ETL) of data, knowledge extraction, graph definition, data import, knowledge reasoning, and knowledge application.
Knowledge graphs are divided into heterogeneous knowledge graphs and homogeneous knowledge graphs. In a homogeneous knowledge graph, all nodes share one type and all edges share one type, that is, no type distinction is made; in a heterogeneous knowledge graph, nodes and edges can have different types and even different attributes. A heterogeneous knowledge graph is described in the form of triples, quintuples, septuples, and the like; for example, "concept, relationship, rule" is used to represent a large-scale directed knowledge graph composed of nodes and edges. Describing a knowledge graph in tuple form can clearly express the relationships between concepts, between concepts and entities, between entities, between entities and attributes, and between attributes and attribute values.
Although the tuple form brings many benefits, it is not concise enough for computation on a heterogeneous distributed knowledge graph and contains a large amount of redundant information. This makes it hard to filter out the node data of interest and greatly increases computational complexity.
Summary
The present disclosure provides a big data processing method, apparatus, device, and medium based on a heterogeneous distributed knowledge graph, so as to provide an effective data processing scheme for heterogeneous distributed knowledge graphs.
A big data processing method based on a heterogeneous distributed knowledge graph is provided, including:
constructing a node table and a relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph;
determining a graph computation scenario according to a graph computation request, and determining at least one of the type and the attribute of the nodes, and at least one of the type and the attribute of the edges, required by the graph computation scenario;
extracting, from the node table and the relationship table, at least one computing node corresponding to the graph computation scenario according to the at least one of the type and the attribute of the nodes and the at least one of the type and the attribute of the edges required by the graph computation scenario;
filtering out node data corresponding to the at least one computing node from the heterogeneous distributed knowledge graph;
performing data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph;
wherein the node table includes: identifiers of multiple nodes, types of multiple nodes, attributes of multiple nodes, a node type set, and a node attribute set; and the relationship table includes: start node identifiers of multiple edges, target node identifiers of multiple edges, types of multiple edges, attributes of multiple edges, an edge type set, and an edge attribute set.
A big data processing apparatus based on a heterogeneous distributed knowledge graph is also provided, including:
a construction module, configured to construct a node table and a relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph;
a determining module, configured to determine a graph computation scenario according to a graph computation request, and to determine at least one of the type and the attribute of the nodes, and at least one of the type and the attribute of the edges, required by the graph computation scenario;
a computing node obtaining module, configured to extract, from the node table and the relationship table, at least one computing node corresponding to the graph computation scenario according to the at least one of the type and the attribute of the nodes and the at least one of the type and the attribute of the edges required by the graph computation scenario;
a filtering module, configured to filter out node data corresponding to the at least one computing node from the heterogeneous distributed knowledge graph;
a calculation module, configured to perform data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph;
wherein the node table includes: identifiers of multiple nodes, types of multiple nodes, attributes of multiple nodes, a node type set, and a node attribute set; and the relationship table includes: start node identifiers of multiple edges, target node identifiers of multiple edges, types of multiple edges, attributes of multiple edges, an edge type set, and an edge attribute set.
A computer device is also provided, including a processor and a memory, where the memory is used to store instructions that, when executed, cause the processor to perform the following operations:
constructing a node table and a relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph;
determining a graph computation scenario according to a graph computation request, and determining at least one of the type and the attribute of the nodes, and at least one of the type and the attribute of the edges, required by the graph computation scenario;
extracting, from the node table and the relationship table, at least one computing node corresponding to the graph computation scenario according to the at least one of the type and the attribute of the nodes and the at least one of the type and the attribute of the edges required by the graph computation scenario;
filtering out node data corresponding to the at least one computing node from the heterogeneous distributed knowledge graph;
performing data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph;
wherein the node table includes: identifiers of multiple nodes, types of multiple nodes, attributes of multiple nodes, a node type set, and a node attribute set; and the relationship table includes: start node identifiers of multiple edges, target node identifiers of multiple edges, types of multiple edges, attributes of multiple edges, an edge type set, and an edge attribute set.
A storage medium is also provided, where the storage medium is used to store instructions, and the instructions are used to execute:
constructing a node table and a relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph;
determining a graph computation scenario according to a graph computation request, and determining at least one of the type and the attribute of the nodes, and at least one of the type and the attribute of the edges, required by the graph computation scenario;
extracting, from the node table and the relationship table, at least one computing node corresponding to the graph computation scenario according to the at least one of the type and the attribute of the nodes and the at least one of the type and the attribute of the edges required by the graph computation scenario;
filtering out node data corresponding to the at least one computing node from the heterogeneous distributed knowledge graph;
performing data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph;
wherein the node table includes: identifiers of multiple nodes, types of multiple nodes, attributes of multiple nodes, a node type set, and a node attribute set; and the relationship table includes: start node identifiers of multiple edges, target node identifiers of multiple edges, types of multiple edges, attributes of multiple edges, an edge type set, and an edge attribute set.
Brief Description of the Drawings
FIG. 1 is a flowchart of a big data processing method based on a heterogeneous distributed knowledge graph provided by Embodiment 1 of the present invention;
FIG. 2 is a flowchart of a big data processing method based on a heterogeneous distributed knowledge graph provided by Embodiment 2 of the present invention;
FIG. 3 is a schematic structural diagram of a big data processing apparatus based on a heterogeneous distributed knowledge graph provided by Embodiment 3 of the present invention;
FIG. 4 is a schematic structural diagram of a computer device provided by Embodiment 4 of the present invention.
Detailed Description
The present disclosure is described below with reference to the drawings and embodiments. For ease of description, the drawings show only the parts related to the present disclosure rather than the entire structure.
Embodiment 1
FIG. 1 is a flowchart of a big data processing method based on a heterogeneous distributed knowledge graph provided by Embodiment 1 of the present invention. This embodiment is applicable to performing data processing on a heterogeneous distributed knowledge graph. The method can be executed by a big data processing apparatus based on a heterogeneous distributed knowledge graph. The apparatus can be composed of hardware and/or software and integrated in a computer device, and the software can be written in the Scala programming language or the Java programming language.
In an embodiment, heterogeneous is the opposite of homogeneous. In a homogeneous knowledge graph, all nodes share one type and all edges share one type, that is, no type distinction is made; in a heterogeneous knowledge graph, nodes and edges have different types. For example, in a homogeneous knowledge graph every node represents a person, and every relationship between people represents an acquaintance relationship. In a heterogeneous knowledge graph, nodes can represent people, accounts, companies, and so on; the relationship between a person and an account is an ownership relationship, and the relationship between a person and a company is an employment relationship. Moreover, each type of node and edge has different attributes. The heterogeneous distributed knowledge graph in this embodiment is a heterogeneous knowledge graph stored across multiple devices in a distributed manner. Such a knowledge graph contains a huge amount of data with varied types and attributes, and no effective data processing method for this kind of knowledge graph has yet emerged. On this basis, and with reference to FIG. 1, the big data processing method provided in this embodiment includes the following operations.
Step 110: Construct a node table and a relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph.
In this embodiment, the heterogeneous distributed knowledge graph (hereinafter referred to as the graph) corresponds to one node table and one relationship table. The node table includes the types and attributes of multiple nodes, and the relationship table includes the types and attributes of multiple edges. Nodes or edges of different types have different attributes.
Exemplarily, the node table includes: identifiers of multiple nodes, types of multiple nodes, attributes of multiple nodes, a node type set, and a node attribute set. The identifier of a node uniquely identifies the node and can be a string or a number. The node type is a mandatory element of a heterogeneous graph and is of string type. The attributes of a node can be of dictionary (map) type, map<string,string>, for example map<gender->man,age->20>. Storing data of different formats (including numbers and strings) in a map keeps the data format uniform while remaining very flexible; when nodes with certain attributes are processed later, the methods provided by map can be used to extract those attributes selectively. The node type set and the node attribute set indicate which types of nodes exist in the graph and which attributes of each node type are available for computation. They can be stored in a hidden column of the node table and are unique for the graph. Accordingly, this information can be serialized and recorded in the meta information of the Schema column, so that it does not need to be repeated in every data row, which saves space.
The relationship table includes: start node identifiers of multiple edges, target node identifiers of multiple edges, types of multiple edges, attributes of multiple edges, an edge type set, and an edge attribute set. The start node identifier and the target node identifier can be strings or numbers. The edge type is a mandatory element of a heterogeneous graph and is of string type. The attributes of an edge can be of dictionary (map) type, map<string,string>, for example map<gender->man,age->20,province->M,city->P>. Storing data of different formats (including numbers and strings) in a map keeps the data format uniform while remaining very flexible; when edges with certain attributes are processed later, the methods provided by map can be used to extract those attributes selectively. The edge type set and the edge attribute set indicate which types of edges exist in the graph and which attributes of each edge type are available for computation. They can be stored in a hidden column of the relationship table and are unique for the graph. Accordingly, this information can be serialized and recorded in the meta information of the Schema column, so that it does not need to be repeated in every data row, which saves space.
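The two table layouts described above can be sketched as plain data classes. The class names (`NodeRow`, `EdgeRow`, `Table`), the field names, and the representation of the type set and attribute set as one table-level `meta` dictionary are assumptions made for illustration; the description only states that this information is stored once per table rather than in each row.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class NodeRow:
    id: str
    type: str
    # Attributes as map<string,string>: numbers are stored as strings too.
    attrs: Dict[str, str] = field(default_factory=dict)


@dataclass
class EdgeRow:
    src: str   # start node identifier
    dst: str   # target node identifier
    type: str
    attrs: Dict[str, str] = field(default_factory=dict)


@dataclass
class Table:
    rows: List
    # Type set and attribute set, kept once per table instead of per row:
    # each key is a type, each value the attribute names usable for it.
    meta: Dict[str, Set[str]]


nodes = Table(
    rows=[NodeRow("p1", "person", {"gender": "man", "age": "20"})],
    meta={"person": {"gender", "age"}},
)

# Targeted extraction of one attribute through the map interface.
ages = [int(r.attrs["age"]) for r in nodes.rows if "age" in r.attrs]
```

Keeping every attribute value as a string gives all rows the same shape regardless of node type, at the cost of converting values (as with `int` above) when a computation needs them.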
Optionally, constructing the node table and the relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph may include: loading the node data source and the edge data source used to construct the heterogeneous distributed knowledge graph; identifying the data structure of the node data source and the data structure of the edge data source; constructing the node table according to the data structure of the node data source, and constructing the relationship table according to the data structure of the edge data source; and reading the data in the node data source and the edge data source into the graph database corresponding to the heterogeneous distributed knowledge graph according to the node table and the relationship table.
Optionally, constructing the node table according to the data structure of the node data source may include: generating the identifiers of multiple nodes in the node table according to the number column in the node data source; generating the types of multiple nodes in the node table according to the type field in the node data source; generating the attributes of multiple nodes in the node table according to the attribute fields in the node data source; and summarizing the types and attributes of the multiple nodes respectively to generate the node type set and the node attribute set.
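The node table construction can be sketched with a CSV-like node data source. The column names `id` and `type`, the rule that every remaining non-empty column is an attribute, and the function name `build_node_table` are assumptions for this example; a real data source may be structured differently.

```python
import csv
import io


def build_node_table(source, id_col="id", type_col="type"):
    """Build node rows plus the per-type attribute set from a CSV source."""
    rows, meta = [], {}
    for rec in csv.DictReader(source):
        # Every non-empty column other than id/type is treated as an attribute.
        attrs = {k: v for k, v in rec.items()
                 if k not in (id_col, type_col) and v}
        rows.append({"id": rec[id_col], "type": rec[type_col], "attrs": attrs})
        # Summarize: remember which attributes occur for each node type.
        meta.setdefault(rec[type_col], set()).update(attrs)
    return rows, meta


# Assumed sample source: a person row and a company row.
node_source = io.StringIO("id,type,age,city\n1,person,20,\n2,company,,P\n")
rows, meta = build_node_table(node_source)
```

The keys of `meta` form the node type set and its values form the attribute set per type, so both summaries fall out of the same pass over the data source.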
Optionally, constructing the relationship table according to the data structure of the edge data source may include: generating the start node identifiers of multiple edges in the relationship table according to the number column corresponding to the start node in the edge data source; generating the target node identifiers of the edges in the relationship table according to the number column corresponding to the target node in the edge data source; generating the types of the edges in the relationship table according to the type field in the edge data source; generating the attributes of the edges in the relationship table according to the attribute fields in the edge data source; and summarizing the types and the attributes of the edges separately to generate the type set and the attribute set of the edges.
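The two construction procedures above can be sketched as follows, assuming each data source has already been read into a list of dictionaries. The function names and the column names `id`, `type`, `src`, and `dst` are illustrative assumptions, not names taken from the text.

```python
def build_node_table(records, id_col="id", type_col="type"):
    """Build node rows and the node type set / attribute set."""
    rows, type_set, attr_set = [], set(), set()
    for rec in records:
        # Every column other than the number and type columns is an attribute.
        attrs = {k: str(v) for k, v in rec.items() if k not in (id_col, type_col)}
        rows.append({"id": str(rec[id_col]), "type": rec[type_col], "attrs": attrs})
        type_set.add(rec[type_col])
        attr_set.update(attrs)
    return rows, type_set, attr_set


def build_relationship_table(records, src_col="src", dst_col="dst", type_col="type"):
    """Build edge rows and the edge type set / attribute set."""
    rows, type_set, attr_set = [], set(), set()
    for rec in records:
        skip = (src_col, dst_col, type_col)
        attrs = {k: str(v) for k, v in rec.items() if k not in skip}
        rows.append({"src": str(rec[src_col]), "dst": str(rec[dst_col]),
                     "type": rec[type_col], "attrs": attrs})
        type_set.add(rec[type_col])
        attr_set.update(attrs)
    return rows, type_set, attr_set


node_source = [
    {"id": "001", "type": "user", "gender": "man", "age": 20},
    {"id": "abc.com", "type": "website"},
]
edge_source = [{"src": "001", "dst": "abc.com", "type": "click"}]

node_rows, node_types, node_attrs = build_node_table(node_source)
edge_rows, edge_types, edge_attrs = build_relationship_table(edge_source)
```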
Step 120: Determine the graph computation scenario according to the graph computation request, and determine the node types and/or attributes and the edge types and/or attributes required by the graph computation scenario.
When data processing is to be performed on the graph, it is not necessary to extract all node data. It is only necessary to determine the graph computation scenario according to the graph computation request and to extract the node data that this scenario requires, which reduces the amount of computation. As described for S110, this embodiment uses one node table and one relationship table to represent all nodes and edges in the graph; in other words, the node table and the relationship table form an index over the node data in the graph. On this basis, the node types and/or attributes required by the graph computation scenario are determined from the node table, and the edge types and/or attributes required by the graph computation scenario are determined from the relationship table.
Optionally, the node types and/or attributes required by the graph computation scenario are determined from the type set and attribute set of the nodes, and the edge types and/or attributes required by the graph computation scenario are determined from the type set and attribute set of the edges.
Step 130: According to the node types and/or attributes and the edge types and/or attributes required by the graph computation scenario, extract at least one computing node corresponding to the graph computation scenario from the node table and the relationship table.
Optionally, according to the node types and/or attributes and the edge types and/or attributes required by the graph computation scenario, node identifiers are looked up in the node table, and start node identifiers and target node identifiers are looked up in the relationship table; at least one computing node is then determined from the node identifiers, the start node identifiers, and the target node identifiers.
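As an illustration of this lookup, the sketch below selects computing nodes by type only. The row layout (`id`, `type`, `src`, `dst`) and the function name `select_computing_nodes` are assumptions made for the example; filtering by attribute would follow the same pattern.

```python
def select_computing_nodes(node_rows, edge_rows, node_types, edge_types):
    """Return the identifiers of nodes needed by a computation scenario."""
    # Node identifiers found in the node table by node type.
    ids = {n["id"] for n in node_rows if n["type"] in node_types}
    # Start and target identifiers found in the relationship table by edge type.
    for e in edge_rows:
        if e["type"] in edge_types:
            ids.update((e["src"], e["dst"]))
    return ids


node_rows = [
    {"id": "001", "type": "user"},
    {"id": "abc.com", "type": "website"},
    {"id": "bcd.com", "type": "website"},
]
edge_rows = [
    {"src": "001", "dst": "abc.com", "type": "click"},
    {"src": "abc.com", "dst": "bcd.com", "type": "link"},
]

computing_nodes = select_computing_nodes(
    node_rows, edge_rows, node_types={"website"}, edge_types={"click", "link"}
)
```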
Step 140: Filter out the node data corresponding to the at least one computing node from the heterogeneous distributed knowledge graph.
In one optional implementation, after the at least one computing node is determined, the corresponding node data is filtered out from the heterogeneous distributed knowledge graph according to the identifier, the start node identifier, and the target node identifier of the at least one computing node.
In another optional implementation, the node data carries its own node type and attributes as well as the types and attributes of its corresponding edges. In that case, all node data is searched for the node data that carries the node types and/or attributes and the edge types and/or attributes required for the computation.
This embodiment uses a graph database to store the graph. A graph database generally uses a property graph as its basic representation. In the Neo4j graph database, for example, nodes and relationships can carry properties, which makes it easier to express real business scenarios and is better suited to storing a heterogeneous distributed knowledge graph. In one embodiment, the node data corresponding to the at least one computing node is filtered out from the graph database corresponding to the heterogeneous distributed knowledge graph.
Step 150: Perform data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph.
After the node data is filtered out, it is computed according to the current computation scenario. The computation method provided by this embodiment is illustrated below with an application scenario that computes web page rank (PageRank). Assume there is a graph of the visit relationships between Internet users and web pages and of the link relationships between web pages, and that PageRank statistics are needed for all web pages. Table 1 is the node table, and Table 2 is the relationship table.
Table 1 Node table
Figure PCTCN2020109226-appb-000001
Table 2 Relationship table
Figure PCTCN2020109226-appb-000002
In the Schema of the node table and the relationship table, the website type, the click type, and the link type required for computing PageRank are found. The identifiers of the website-type nodes found in the node table include abc.com and bcd.com. In the relationship table, the start node identifier abc.com and the target node identifier bcd.com of the link edge are found, as well as the start node identifier 001 and the target node identifier abc.com of the click edge. The node data corresponding to 001, abc.com, and bcd.com is then filtered out from the graph database corresponding to the graph. PageRank is computed on the website-type node data to obtain the PageRank value of each website; for example, the PageRank value of abc.com is 1 and the PageRank value of bcd.com is 2.
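For readers who want to see the computation itself, here is a small sketch of PageRank by power iteration over the filtered website nodes and the link edge from the example. It is a generic textbook formulation and not necessarily the computation used in this embodiment; restricting the input to link edges, the damping factor 0.85, and the iteration count are assumptions. The values it produces are normalized probabilities, so they differ from the illustrative values 1 and 2 quoted above.

```python
def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a small directed graph."""
    nodes = sorted(nodes)
    n = len(nodes)
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no out-links) is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for dst in out_links[v]:
                new_rank[dst] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank


websites = {"abc.com", "bcd.com"}
link_edges = [("abc.com", "bcd.com")]
ranks = pagerank(websites, link_edges)
```

Because bcd.com receives a link from abc.com and gives none back, it ends up with the higher rank, which agrees in ordering with the example values.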
Optionally, the data processing result is added to the attribute set in the node table or to the attribute set in the relationship table; and/or the data processing result is added to the attributes of the corresponding nodes in the node table or to the attributes of the corresponding edges in the relationship table. Continuing the above application scenario, Table 3 shows the new node table.
Table 3 New node table
Figure PCTCN2020109226-appb-000003
As can be seen, the PageRank attribute has been added to the attributes of multiple nodes, and the PageRank attribute has also been added to the attribute set of the nodes.
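The update of the node table can be sketched as below, assuming node rows are dictionaries with an `attrs` map and the attribute set is a plain Python set. The helper name `add_result_attribute` is hypothetical; the PageRank values 1 and 2 are the ones given in the example.

```python
def add_result_attribute(node_rows, attr_set, name, values):
    """Write a computed result into node attributes and the attribute set."""
    for row in node_rows:
        if row["id"] in values:
            # Attribute values are stored as strings, as in map<string, string>.
            row["attrs"][name] = str(values[row["id"]])
    attr_set.add(name)


node_rows = [
    {"id": "abc.com", "type": "website", "attrs": {}},
    {"id": "bcd.com", "type": "website", "attrs": {}},
]
node_attr_set = set()
add_result_attribute(node_rows, node_attr_set, "PageRank", {"abc.com": 1, "bcd.com": 2})
```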
In this embodiment of the present invention, the heterogeneous distributed knowledge graph is represented by one node table and one relationship table, and the node types and/or attributes and the edge types and/or attributes required by the graph computation scenario can be determined from the node table and the relationship table. The corresponding node data is then filtered out from the heterogeneous distributed knowledge graph according to those types and/or attributes, and data processing is performed on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph, without processing the entire graph. This embodiment thus provides, on the basis of the node table and the relationship table, an efficient data processing method for heterogeneous distributed knowledge graphs.
Because the heterogeneous distributed knowledge graph is stored across multiple devices, data processing on the graph must also be distributed. In an optional implementation, filtering out the node data corresponding to the at least one computing node from the heterogeneous distributed knowledge graph includes: filtering out the node data corresponding to the at least one computing node from each device. Performing data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph includes: performing data processing on the node data corresponding to the at least one computing node filtered out from each device; and aggregating the data processing results of all devices to obtain the data processing result based on the heterogeneous distributed knowledge graph.
In one embodiment, the node table further includes the device on which each of the multiple nodes is stored. In the node table, according to the node types and/or attributes and the edge types and/or attributes required by the graph computation scenario, the devices storing the nodes required for the computation are determined, and the node data is filtered out from those devices. The node data filtered out from each device is computed separately, optionally using the methods of a library such as NetworkX. The computation results of all devices are then aggregated to obtain the final distributed computation result.
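The filter-per-device, compute, then aggregate pattern can be sketched as follows. In-degree counting stands in for the real per-device computation purely because its partial results merge by simple addition; the device names, the `device` column, and the extra edge from 001 to bcd.com are invented for the illustration.

```python
from collections import Counter, defaultdict


def group_by_device(node_rows, wanted_ids):
    """Use the node table's device column to find where needed nodes live."""
    by_device = defaultdict(set)
    for row in node_rows:
        if row["id"] in wanted_ids:
            by_device[row["device"]].add(row["id"])
    return dict(by_device)


def local_in_degree(local_edges):
    """Per-device partial result: in-degree of each target node."""
    return Counter(dst for _, dst in local_edges)


def aggregate(partials):
    """Merge the partial results of all devices."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


node_rows = [
    {"id": "abc.com", "device": "dev-1"},
    {"id": "bcd.com", "device": "dev-2"},
    {"id": "001", "device": "dev-1"},
]
edges_on_device = {
    "dev-1": [("001", "abc.com")],
    "dev-2": [("abc.com", "bcd.com"), ("001", "bcd.com")],
}

placement = group_by_device(node_rows, {"abc.com", "bcd.com"})
result = aggregate(local_in_degree(e) for e in edges_on_device.values())
```

Computations whose partial results do not merge by addition, such as PageRank, would need an iterative exchange between devices rather than a single aggregation step.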
Embodiment 2
This embodiment builds on the foregoing embodiment. Optionally, before the node table and the relationship table of the heterogeneous distributed knowledge graph are obtained, the method further includes a process of constructing the heterogeneous distributed knowledge graph. This embodiment applies to constructing a heterogeneous distributed knowledge graph from multiple data sources. The multiple data sources include traditional relational databases (such as Oracle and MySQL), distributed relational databases (such as Hive), distributed non-relational databases (such as HBase and ElasticSearch), TXT files, and CSV files.
The data needed to construct a knowledge graph often comes from several different data sources and includes both structured relational data and unstructured text data. Whether the data from the different sources is unified into one complete data source or a separate import script is written for each data source, a great deal of resources and time is required, so the whole ETL process consumes considerable time and labor. To address this deficiency of the related art, and with reference to FIG. 2, the method provided by this embodiment of the present invention includes the following operations.
Step 210: Load the node data source and the edge data source used to construct the heterogeneous distributed knowledge graph.
In this embodiment, both the node data source and the edge data source may be structured databases, such as traditional relational databases (for example Oracle and MySQL), distributed relational databases (for example Hive), and distributed non-relational databases (for example HBase and ElasticSearch), and may also include unstructured text, such as TXT files and CSV files.
This embodiment designs an abstract data source interface so that the computing apparatus for the heterogeneous distributed knowledge graph can connect seamlessly to many kinds of data sources. Once different data sources are connected through this interface, they are operated on in a uniform way, and their data does not need to be imported into a single unified data source. The data source interface includes a data structure and definition interface and a data reading interface, and may further include at least one of a data status checking interface and a data writing interface. The data structure and definition interface encapsulates a method for identifying data structures and a method for defining the graph interface. All of these interfaces are application programming interfaces (APIs).
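One way to picture such an interface is an abstract base class with one method per sub-interface. The sketch below is an assumption about shape only: the class names `DataSource` and `CsvTextSource` and the method names are hypothetical, and a CSV string stands in for a real data source.

```python
import csv
import io
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Hypothetical abstract data source interface."""

    @abstractmethod
    def describe_schema(self):
        """Data structure and definition interface: return the column names."""

    @abstractmethod
    def read_rows(self):
        """Data reading interface: return rows as dictionaries."""

    def check_status(self):
        """Optional data status checking interface."""
        return True

    def write_rows(self, rows):
        """Optional data writing interface."""
        raise NotImplementedError("this source is read-only")


class CsvTextSource(DataSource):
    """A CSV document held in memory, used here as a stand-in source."""

    def __init__(self, text):
        self._text = text

    def describe_schema(self):
        return next(csv.reader(io.StringIO(self._text)))

    def read_rows(self):
        return list(csv.DictReader(io.StringIO(self._text)))


source = CsvTextSource("id,type\n001,user\nabc.com,website\n")
columns = source.describe_schema()
rows = source.read_rows()
```

A source backed by MySQL or Hive would implement the same four methods, so the graph construction code would not need to know which kind of source it is reading.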
The computing apparatus for the heterogeneous distributed knowledge graph can load the node data source and the edge data source through their respective storage paths.
Step 220: Identify the data structure of the node data source and the data structure of the edge data source.
In this embodiment, the data structure and definition interface is called to identify the data structures of the node data source and the edge data source.
In one embodiment, the data structure of the node data source includes a number column, a type field, attribute fields, and node data, and the data structure of the edge data source includes a number column corresponding to the start node, a number column corresponding to the target node, a type field, and attribute fields.
Step 230: Construct the node table according to the data structure of the node data source, and construct the relationship table according to the data structure of the edge data source.
Optionally, the data structure and definition interface is called to construct the node table according to the data structure of the node data source and to construct the relationship table according to the data structure of the edge data source.
When the node table is constructed, the identifiers of multiple nodes in the node table are generated according to the number column in the node data source; the types of the nodes are generated according to the type field in the node data source; the attributes of the nodes are generated according to the attribute fields in the node data source; and the types and the attributes of the nodes are summarized separately to generate the type set and the attribute set of the nodes.
When the relationship table is constructed, the start node identifiers of multiple edges in the relationship table are generated according to the number column corresponding to the start node in the edge data source; the target node identifiers of the edges are generated according to the number column corresponding to the target node in the edge data source; the types of the edges are generated according to the type field in the edge data source; the attributes of the edges are generated according to the attribute fields in the edge data source; and the types and the attributes of the edges are summarized separately to generate the type set and the attribute set of the edges.
Step 240: According to the node table and the relationship table, read the data in the node data source and the edge data source into the graph database corresponding to the heterogeneous distributed knowledge graph.
Optionally, the data reading interface of the data source interface is called to read the data in the node data source and the edge data source into the graph database corresponding to the heterogeneous distributed knowledge graph according to the node table and the relationship table.
The data reading interface encapsulates a method for reading data into the graph database. According to the description of operation S130, the node table and the relationship table include the multiple nodes with their types and attributes, the types and attributes of the edges, and the connection relationships between nodes and edges. On this basis, the data required by the graph database can be determined from the node table and the relationship table, and the node data can be read into the graph database.
Step 250: Obtain the node table and the relationship table of the heterogeneous distributed knowledge graph.
Step 260: Determine the graph computation scenario according to the graph computation request, and determine the node types and/or attributes and the edge types and/or attributes required by the graph computation scenario.
Step 270: According to the node types and/or attributes and the edge types and/or attributes required by the graph computation scenario, filter out the corresponding node data from the heterogeneous distributed knowledge graph.
Step 280: Perform data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph.
In this embodiment of the present invention, the data structure and definition interface of the data source interface makes it possible to identify the data structure of each data source, so there is no need to import the different data sources into a unified database, to screen data, to write import scripts, or to use import tools. By calling the data reading interface of the data source interface, the data in the data sources is read into the graph database according to the node table and the relationship table. Data is thus read in automatically, without business knowledge and without coordination between business experts and engineering experts. This embodiment only needs to identify the data structure through the data source interface and then import the data automatically according to the node table and the relationship table, which saves considerable labor and resources and reduces cost when the import process connects to different data sources.
On the basis of the foregoing embodiment, the data source interface further includes a data status checking interface and/or a data writing interface. Accordingly, the computation method based on the heterogeneous distributed knowledge graph further includes at least one of the following two implementations.
First implementation: after the node data source and the edge data source used to construct the heterogeneous distributed knowledge graph are loaded, the data status checking interface is called to check whether the node data source and the edge data source are working normally, and any data source in an abnormal state is reported to the user.
Optionally, the working status of the node data source and the edge data source is checked when their data structures are identified and when their data is read into the graph database; the working status may also be checked periodically. If a data source is online and the user has access permission, its status is normal; if a data source is offline or the user does not have access permission, its status is abnormal. This helps ensure the stability and security of the graph construction process.
Second implementation: after the data reading interface of the data source interface is called to read the data in the node data source and the edge data source into the graph database corresponding to the heterogeneous distributed knowledge graph according to the node table and the relationship table, the data writing interface is called to write the data in the graph database back to the corresponding data source. In one embodiment, the data in the graph database carries its source data source and its position within that source, for example its row and column. The data writing interface is therefore called to write the data back to the corresponding position in the corresponding data source, according to the source data source of the data in the graph database and its position in that source.
Through the data writing interface, this implementation provides the function of writing the data in the knowledge graph back to the data sources, which facilitates restoring the knowledge graph and cross-checking between the data sources and the knowledge graph.
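The write-back idea can be sketched with a provenance record attached to each value. Everything concrete here is an assumption for illustration: the record fields `source`, `row`, `column`, and `value`, the function name `write_back`, and the use of in-memory lists in place of real data sources.

```python
def write_back(graph_values, sources):
    """Write graph values back to the position recorded in their provenance."""
    for item in graph_values:
        table = sources[item["source"]]
        # Row and column are zero-based positions inside the source table.
        table[item["row"]][item["column"]] = item["value"]


# A stand-in source whose second row has lost its gender value.
sources = {"users": [["001", "man"], ["002", ""]]}
graph_values = [
    {"source": "users", "row": 1, "column": 1, "value": "woman"},
]
write_back(graph_values, sources)
```

Because each value records where it came from, the same records could also be used to compare the graph against the source instead of overwriting it.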
Embodiment 3
FIG. 3 is a schematic structural diagram of a big data processing apparatus based on a heterogeneous distributed knowledge graph according to Embodiment 3 of the present invention. This embodiment applies to performing big data processing on a heterogeneous distributed knowledge graph. With reference to FIG. 3, the apparatus includes: a construction module 31, a determination module 32, a computing node acquisition module 33, a filtering module 34, and a calculation module 35.
The construction module 31 is configured to construct the node table and the relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph;
The determination module 32 is configured to determine the graph computation scenario according to the graph computation request, and to determine the node types and/or attributes and the edge types and/or attributes required by the graph computation scenario;
The computing node acquisition module 33 is configured to extract, from the node table and the relationship table, at least one computing node corresponding to the graph computation scenario according to the node types and/or attributes and the edge types and/or attributes required by the graph computation scenario;
The filtering module 34 is configured to filter out the node data corresponding to the at least one computing node from the heterogeneous distributed knowledge graph;
The calculation module 35 is configured to perform data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph.
In one embodiment, the node table includes: the identifiers of multiple nodes, the types of the nodes, the attributes of the nodes, and the type set and attribute set of the nodes. The relationship table includes: the start node identifiers of multiple edges, the target node identifiers of the edges, the types of the edges, the attributes of the edges, and the type set and attribute set of the edges.
In this embodiment of the present invention, the heterogeneous distributed knowledge graph is represented by one node table and one relationship table, and the node types and/or attributes and the edge types and/or attributes required by the graph computation scenario can be determined from the node table and the relationship table. The corresponding node data is then filtered out from the heterogeneous distributed knowledge graph according to those types and/or attributes, and data processing is performed on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph, without processing the entire graph. This embodiment thus provides, on the basis of the node table and the relationship table, an efficient data processing method for heterogeneous distributed knowledge graphs.
Optionally, the heterogeneous distributed knowledge graph is stored across multiple devices. When filtering out the node data corresponding to the at least one computing node from the heterogeneous distributed knowledge graph, the filtering module 34 is configured to filter out the node data corresponding to the at least one computing node from each device. When computing on the filtered node data to obtain a computation result based on the heterogeneous distributed knowledge graph, the calculation module 35 is configured to perform data processing on the corresponding node data filtered out from each device, and to aggregate the data processing results of all devices to obtain the data processing result based on the heterogeneous distributed knowledge graph.
Optionally, the apparatus further includes an adding module configured to, after data processing is performed on the filtered node data and the data processing result based on the heterogeneous distributed knowledge graph is obtained, add the data processing result to the attribute set in the node table or to the attribute set in the relationship table; and/or add the data processing result to the attributes of the node in the node table that corresponds to the data processing result, or to the attributes of the edge in the relationship table that corresponds to the data processing result.
Optionally, when extracting, from the node table and the relationship table, the at least one computing node corresponding to the graph computation scenario according to the node types and/or attributes and the edge types and/or attributes required by the graph computation scenario, the filtering module 34 is configured to: look up node identifiers in the node table and look up start node identifiers and target node identifiers in the relationship table according to the node types and/or attributes and the edge types and/or attributes required by the graph computation scenario; and determine the at least one computing node from the node identifiers, the start node identifiers, and the target node identifiers.
Optionally, when constructing the node table and the relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph, the construction module 31 is configured to: load the node data source and the edge data source used to construct the heterogeneous distributed knowledge graph; identify the data structure of the node data source and the data structure of the edge data source; construct the node table according to the data structure of the node data source, and construct the relationship table according to the data structure of the edge data source; and, according to the node table and the relationship table, read the data in the node data source and the edge data source into the graph database corresponding to the heterogeneous distributed knowledge graph.
Optionally, when constructing the node table according to the data structure of the node data source, the construction module is configured to: generate the identifiers of multiple nodes in the node table according to the number column in the node data source; generate the types of the nodes according to the type field in the node data source; generate the attributes of the nodes according to the attribute fields in the node data source; and summarize the types and the attributes of the nodes separately to generate the type set and the attribute set of the nodes.
Optionally, when constructing the relationship table according to the data structure of the edge data source, the construction module is configured to: generate the start node identifiers of multiple edges in the relationship table according to the number column corresponding to the start nodes in the edge data source; generate the target node identifiers of the multiple edges in the relationship table according to the number column corresponding to the target nodes in the edge data source; generate the types of the multiple edges in the relationship table according to the type field in the edge data source; generate the attributes of the multiple edges in the relationship table according to the attribute fields in the edge data source; and aggregate the types and the attributes of the multiple edges separately to generate the type set and the attribute set of the edges.
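The relationship table can be sketched in the same spirit, here reading the edge data source from CSV text with Python's standard `csv` module. The column names `src`, `dst`, and `type`, the function name `build_relation_table`, and the choice of CSV as the source format are hypothetical; the disclosure does not prescribe a file format.

```python
import csv
import io


def build_relation_table(edge_csv, start_col="src", target_col="dst", type_col="type"):
    """Build a relationship table from an edge data source given as CSV text.

    Assumption: `start_col` and `target_col` are the number columns of the
    start and target nodes, `type_col` is the type field, and every other
    non-empty column is treated as an edge attribute field.
    """
    reserved = (start_col, target_col, type_col)
    edges = []
    type_set, attr_set = set(), set()
    for row in csv.DictReader(io.StringIO(edge_csv)):
        attrs = {k: v for k, v in row.items()
                 if k not in reserved and v not in ("", None)}
        edges.append({"start": row[start_col], "target": row[target_col],
                      "type": row[type_col], "attrs": attrs})
        type_set.add(row[type_col])   # aggregate the edge types
        attr_set.update(attrs)        # aggregate the attribute names
    return {"edges": edges, "type_set": type_set, "attr_set": attr_set}
```

Empty cells are skipped, so each edge stores only the attributes it actually has, while the attribute set still lists every attribute name seen across all edges.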
The big data processing apparatus based on a heterogeneous distributed knowledge graph provided by this embodiment of the present invention can execute the big data processing method based on a heterogeneous distributed knowledge graph provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed method.
Embodiment 4
Fig. 4 is a schematic structural diagram of a computer device provided by Embodiment 4 of the present invention. As shown in Fig. 4, the computer device includes a processor 40, a memory 41, an input device 42, and an output device 43. The computer device may have one or more processors 40; one processor 40 is taken as an example in Fig. 4. The processor 40, the memory 41, the input device 42, and the output device 43 in the computer device may be connected by a bus or in other ways; connection by a bus is taken as an example in Fig. 4.
As a computer-readable storage medium, the memory 41 can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the big data processing method based on a heterogeneous distributed knowledge graph in the embodiments of the present invention (for example, the construction module 31, the determining module 32, the computing node obtaining module 33, the filtering module 34, and the computing module 35 in the big data processing apparatus based on a heterogeneous distributed knowledge graph). By running the software programs, instructions, and modules stored in the memory 41, the processor 40 executes the various functional applications and data processing of the electronic device, that is, implements the above data processing method based on a heterogeneous distributed knowledge graph.
The memory 41 may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created through use of the terminal, and the like. In addition, the memory 41 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. In some examples, the memory 41 may include memories located remotely from the processor 40, and these remote memories may be connected to the electronic device through a network. Examples of such networks include the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 42 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the computer device, for example, input of a node table and a data table. The output device 43 may include a display device such as a display screen, and the display device is used to display the big data processing result.
Embodiment 5
Embodiment 5 of the present invention further provides a storage medium storing instructions. When executed by a computer processor, the instructions are used to perform a big data processing method based on a heterogeneous distributed knowledge graph, the method including:
constructing a node table and a relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph;
determining a graph computing scenario according to a graph computing request, and determining the node type and/or attribute and the edge type and/or attribute required by the graph computing scenario;
extracting, from the node table and the relationship table, at least one computing node corresponding to the graph computing scenario according to the node type and/or attribute and the edge type and/or attribute required by the graph computing scenario;
filtering out, from the heterogeneous distributed knowledge graph, the node data corresponding to the at least one computing node;
performing data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph;
wherein the node table includes: identifiers of multiple nodes, types of the multiple nodes, attributes of the multiple nodes, a type set of the nodes, and an attribute set of the nodes; and the relationship table includes: start node identifiers of multiple edges, target node identifiers of the multiple edges, types of the multiple edges, attributes of the multiple edges, a type set of the edges, and an attribute set of the edges.
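The last two steps of the method listed above, filtering the node data and then processing it, can be sketched as follows. The in-memory graph layout, the function names `filter_node_data` and `out_degree`, and the use of out-degree as the data processing task are all assumptions chosen for illustration; the disclosure leaves the concrete computation to the graph computing scenario.

```python
from collections import Counter


def filter_node_data(graph_nodes, graph_edges, computing_ids):
    """Keep only the data that belongs to the selected computing nodes.

    graph_nodes: dict mapping node identifier -> node data
    graph_edges: list of (start_id, target_id) pairs
    computing_ids: set of identifiers extracted from the two tables
    """
    nodes = {nid: data for nid, data in graph_nodes.items() if nid in computing_ids}
    edges = [(s, t) for s, t in graph_edges
             if s in computing_ids and t in computing_ids]
    return nodes, edges


def out_degree(nodes, edges):
    """Example processing step (an assumption): out-degree of each kept node."""
    counts = Counter(start for start, _ in edges)
    return {nid: counts.get(nid, 0) for nid in nodes}
```

The point of filtering first is that the processing step sees only the subgraph relevant to the scenario rather than the whole knowledge graph, which is where the reduction in computation comes from.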
In the storage medium storing instructions provided by this embodiment of the present invention, the stored instructions are not limited to the method operations described above, and can also perform related operations in the big data processing method based on a heterogeneous distributed knowledge graph provided by any embodiment of the present invention.
From the above description of the embodiments, it is clear that the present disclosure may be implemented by software together with the necessary general-purpose hardware, or by hardware. The present disclosure may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a computer floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk, and includes multiple instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the multiple embodiments of the present disclosure.
In the above embodiment of the big data processing apparatus based on a heterogeneous distributed knowledge graph, the multiple units and modules included are divided only according to functional logic, but the division is not limited to the one described, as long as the corresponding functions can be realized. In addition, the names of the multiple functional units are only for ease of distinguishing them from one another and are not intended to limit the protection scope of the present disclosure.

Claims (15)

  1. A big data processing method based on a heterogeneous distributed knowledge graph, comprising:
    constructing a node table and a relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph;
    determining a graph computing scenario according to a graph computing request, and determining at least one of the type and the attribute of the nodes required by the graph computing scenario, and at least one of the type and the attribute of the edges required by the graph computing scenario;
    extracting, from the node table and the relationship table, at least one computing node corresponding to the graph computing scenario according to the at least one of the type and the attribute of the nodes and the at least one of the type and the attribute of the edges required by the graph computing scenario;
    filtering out, from the heterogeneous distributed knowledge graph, node data corresponding to the at least one computing node;
    performing data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph;
    wherein the node table comprises: identifiers of multiple nodes, types of the multiple nodes, attributes of the multiple nodes, a type set of the nodes, and an attribute set of the nodes; and the relationship table comprises: start node identifiers of multiple edges, target node identifiers of the multiple edges, types of the multiple edges, attributes of the multiple edges, a type set of the edges, and an attribute set of the edges.
  2. The method according to claim 1, wherein the heterogeneous distributed knowledge graph is stored in a distributed manner across multiple devices;
    the filtering out, from the heterogeneous distributed knowledge graph, node data corresponding to the at least one computing node comprises:
    filtering out, from each device, the node data corresponding to the at least one computing node;
    the performing data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph comprises:
    performing data processing on the node data corresponding to the at least one computing node filtered out from each device;
    aggregating the data processing results of all the devices to obtain the data processing result based on the heterogeneous distributed knowledge graph.
  3. The method according to claim 1, further comprising, after the performing data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph, at least one of the following:
    adding the data processing result to the attribute set in the node table or to the attribute set in the relationship table;
    adding the data processing result to the attributes of the node corresponding to the data processing result in the node table, or to the attributes of the edge corresponding to the data processing result in the relationship table.
  4. The method according to claim 3, wherein the extracting, from the node table and the relationship table, at least one computing node corresponding to the graph computing scenario according to the at least one of the type and the attribute of the nodes and the at least one of the type and the attribute of the edges required by the graph computing scenario comprises:
    looking up node identifiers in the node table, and looking up start node identifiers and target node identifiers in the relationship table, according to the at least one of the type and the attribute of the nodes and the at least one of the type and the attribute of the edges required by the graph computing scenario;
    determining the at least one computing node according to the node identifiers, the start node identifiers, and the target node identifiers.
  5. The method according to claim 1, wherein the constructing a node table and a relationship table of the heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph comprises:
    loading a node data source and an edge data source used to build the heterogeneous distributed knowledge graph;
    identifying the data structure of the node data source and the data structure of the edge data source;
    constructing the node table according to the data structure of the node data source, and constructing the relationship table according to the data structure of the edge data source;
    reading, in accordance with the node table and the relationship table, the data in the node data source and the edge data source into the graph database corresponding to the heterogeneous distributed knowledge graph.
  6. The method according to claim 5, wherein the constructing the node table according to the data structure of the node data source comprises:
    generating the identifiers of multiple nodes in the node table according to the number column in the node data source;
    generating the types of the multiple nodes in the node table according to the type field in the node data source;
    generating the attributes of the multiple nodes in the node table according to the attribute fields in the node data source;
    aggregating the types and the attributes of the multiple nodes separately to generate the type set of the nodes and the attribute set of the nodes.
  7. The method according to claim 5, wherein the constructing the relationship table according to the data structure of the edge data source comprises:
    generating the start node identifiers of multiple edges in the relationship table according to the number column corresponding to the start nodes in the edge data source;
    generating the target node identifiers of the multiple edges in the relationship table according to the number column corresponding to the target nodes in the edge data source;
    generating the types of the multiple edges in the relationship table according to the type field in the edge data source;
    generating the attributes of the multiple edges in the relationship table according to the attribute fields in the edge data source;
    aggregating the types and the attributes of the multiple edges separately to generate the type set of the edges and the attribute set of the edges.
  8. A computer device, comprising a processor and a memory, wherein the memory is configured to store instructions that, when executed, cause the processor to perform the following operations:
    constructing a node table and a relationship table of a heterogeneous distributed knowledge graph according to the data structure of the heterogeneous distributed knowledge graph;
    determining a graph computing scenario according to a graph computing request, and determining at least one of the type and the attribute of the nodes required by the graph computing scenario, and at least one of the type and the attribute of the edges required by the graph computing scenario;
    extracting, from the node table and the relationship table, at least one computing node corresponding to the graph computing scenario according to the at least one of the type and the attribute of the nodes and the at least one of the type and the attribute of the edges required by the graph computing scenario;
    filtering out, from the heterogeneous distributed knowledge graph, node data corresponding to the at least one computing node;
    performing data processing on the filtered node data to obtain a data processing result based on the heterogeneous distributed knowledge graph;
    wherein the node table comprises: identifiers of multiple nodes, types of the multiple nodes, attributes of the multiple nodes, a type set of the nodes, and an attribute set of the nodes; and the relationship table comprises: start node identifiers of multiple edges, target node identifiers of the multiple edges, types of the multiple edges, attributes of the multiple edges, a type set of the edges, and an attribute set of the edges.
  9. The computer device according to claim 8, wherein the heterogeneous distributed knowledge graph is stored in a distributed manner across multiple devices;
    the processor is configured to filter out, from the heterogeneous distributed knowledge graph, the node data corresponding to the at least one computing node in the following manner:
    filtering out, from each device, the node data corresponding to the at least one computing node;
    the processor is configured to obtain the data processing result based on the heterogeneous distributed knowledge graph in the following manner:
    performing data processing on the node data corresponding to the at least one computing node filtered out from each device;
    aggregating the data processing results of all the devices to obtain the data processing result based on the heterogeneous distributed knowledge graph.
  10. The computer device according to claim 8, wherein the processor is further configured to:
    perform, after obtaining the data processing result based on the heterogeneous distributed knowledge graph, at least one of the following:
    adding the data processing result to the attribute set in the node table or to the attribute set in the relationship table;
    adding the data processing result to the attributes of the node corresponding to the data processing result in the node table, or to the attributes of the edge corresponding to the data processing result in the relationship table.
  11. The computer device according to claim 10, wherein the processor is configured to extract the at least one computing node corresponding to the graph computing scenario in the following manner:
    looking up node identifiers in the node table, and looking up start node identifiers and target node identifiers in the relationship table, according to the at least one of the type and the attribute of the nodes and the at least one of the type and the attribute of the edges required by the graph computing scenario;
    determining the at least one computing node according to the node identifiers, the start node identifiers, and the target node identifiers.
  12. The computer device according to claim 8, wherein the processor is configured to construct the node table and the relationship table of the heterogeneous distributed knowledge graph in the following manner:
    loading a node data source and an edge data source used to build the heterogeneous distributed knowledge graph;
    identifying the data structure of the node data source and the data structure of the edge data source;
    constructing the node table according to the data structure of the node data source, and constructing the relationship table according to the data structure of the edge data source;
    reading, in accordance with the node table and the relationship table, the data in the node data source and the edge data source into the graph database corresponding to the heterogeneous distributed knowledge graph.
  13. The computer device according to claim 12, wherein the processor is configured to construct the node table in the following manner:
    generating the identifiers of multiple nodes in the node table according to the number column in the node data source;
    generating the types of the multiple nodes in the node table according to the type field in the node data source;
    generating the attributes of the multiple nodes in the node table according to the attribute fields in the node data source;
    aggregating the types and the attributes of the multiple nodes separately to generate the type set of the nodes and the attribute set of the nodes.
  14. The computer device according to claim 12, wherein the processor is configured to construct the relationship table in the following manner:
    generating the start node identifiers of multiple edges in the relationship table according to the number column corresponding to the start nodes in the edge data source;
    generating the target node identifiers of the multiple edges in the relationship table according to the number column corresponding to the target nodes in the edge data source;
    generating the types of the multiple edges in the relationship table according to the type field in the edge data source;
    generating the attributes of the multiple edges in the relationship table according to the attribute fields in the edge data source;
    aggregating the types and the attributes of the multiple edges separately to generate the type set of the edges and the attribute set of the edges.
  15. A storage medium configured to store instructions, wherein the instructions are used to execute the big data processing method based on a heterogeneous distributed knowledge graph according to any one of claims 1-7.
PCT/CN2020/109226 2019-08-20 2020-08-14 Big data processing method based on heterogeneous distributed knowledge graph, device, and medium WO2021032002A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910770620.6 2019-08-20
CN201910770620.6A CN110472068B (en) 2019-08-20 2019-08-20 Big data processing method, equipment and medium based on heterogeneous distributed knowledge graph

Publications (1)

Publication Number Publication Date
WO2021032002A1 true WO2021032002A1 (en) 2021-02-25

Family

ID=68512958

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/109226 WO2021032002A1 (en) 2019-08-20 2020-08-14 Big data processing method based on heterogeneous distributed knowledge graph, device, and medium

Country Status (2)

Country Link
CN (1) CN110472068B (en)
WO (1) WO2021032002A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114282011A (en) * 2022-03-01 2022-04-05 支付宝(杭州)信息技术有限公司 Knowledge graph construction method and device, and graph calculation method and device

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046115B (en) * 2019-12-24 2023-08-08 四川文轩教育科技有限公司 Heterogeneous database interconnection management method based on knowledge graph
CN111324643B (en) * 2020-03-30 2023-08-29 北京百度网讯科技有限公司 Knowledge graph generation method, relationship mining method, device, equipment and medium
CN113704559A (en) * 2020-05-21 2021-11-26 北京金山数字娱乐科技有限公司 Data processing method and device
CN111708895B (en) * 2020-05-28 2023-06-20 北京赛博云睿智能科技有限公司 Knowledge graph system construction method and device
CN111708894B (en) * 2020-05-28 2023-06-20 北京赛博云睿智能科技有限公司 Knowledge graph creation method
CN113761286B (en) * 2020-06-01 2024-01-02 杭州海康威视数字技术股份有限公司 Knowledge graph embedding method and device and electronic equipment
CN111858956B (en) * 2020-07-07 2024-04-12 咪咕文化科技有限公司 Knowledge graph construction method, knowledge graph construction device, network equipment and storage medium
CN111931069B (en) * 2020-09-25 2021-01-22 浙江口碑网络技术有限公司 User interest determination method and device and computer equipment
CN112364045A (en) * 2020-10-23 2021-02-12 济南慧天云海信息技术有限公司 Heterogeneous data aggregation method
CN112271001B (en) * 2020-11-17 2022-08-16 中山大学 Medical consultation dialogue system and method applying heterogeneous graph neural network
CN114615027A (en) * 2022-02-24 2022-06-10 奇安信科技集团股份有限公司 Behavior data processing method, behavior data processing device, behavior data processing equipment and storage medium
CN114491085B (en) * 2022-04-15 2022-08-09 支付宝(杭州)信息技术有限公司 Graph data storage method and distributed graph data calculation method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160378851A1 (en) * 2015-06-25 2016-12-29 International Business Machines Corporation Knowledge Canvassing Using a Knowledge Graph and a Question and Answer System
CN106503035A (en) * 2016-09-14 2017-03-15 海信集团有限公司 A kind of data processing method of knowledge mapping and device
CN109657065A (en) * 2018-10-31 2019-04-19 百度在线网络技术(北京)有限公司 Knowledge mapping processing method, device and electronic equipment
CN109766445A (en) * 2018-12-13 2019-05-17 平安科技(深圳)有限公司 A kind of knowledge mapping construction method and data processing equipment
CN109885698A (en) * 2019-02-13 2019-06-14 北京航空航天大学 A kind of knowledge mapping construction method and device, electronic equipment

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8504568B2 (en) * 2009-01-08 2013-08-06 Fluid Operations Gmbh Collaborative workbench for managing data from heterogeneous sources
US9665620B2 (en) * 2010-01-15 2017-05-30 Ab Initio Technology Llc Managing data queries
CN108446367A (en) * 2018-03-15 2018-08-24 湖南工业大学 A kind of the packaging industry data search method and equipment of knowledge based collection of illustrative plates
CN109240821B (en) * 2018-07-20 2022-01-14 北京航空航天大学 Distributed cross-domain collaborative computing and service system and method based on edge computing
CN109213747B (en) * 2018-08-08 2021-11-16 麒麟合盛网络技术股份有限公司 Data management method and device
CN109388663A (en) * 2018-08-24 2019-02-26 中国电子科技集团公司电子科学研究院 A kind of big data intellectualized analysis platform of security fields towards the society
CN109918478A (en) * 2019-02-26 2019-06-21 北京悦图遥感科技发展有限公司 The method and apparatus of knowledge based map acquisition geographic products data
CN110119463A (en) * 2019-04-04 2019-08-13 厦门快商通信息咨询有限公司 Information processing method, device, equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114282011A (en) * 2022-03-01 2022-04-05 支付宝(杭州)信息技术有限公司 Knowledge graph construction method and device, and graph calculation method and device
CN114282011B (en) * 2022-03-01 2022-08-23 支付宝(杭州)信息技术有限公司 Knowledge graph construction method and device, and graph calculation method and device

Also Published As

Publication number Publication date
CN110472068A (en) 2019-11-19
CN110472068B (en) 2020-04-24

Similar Documents

Publication Publication Date Title
WO2021032002A1 (en) Big data processing method based on heterogeneous distributed knowledge graph, device, and medium
US11693833B2 (en) Computer-implemented method for storing unlimited amount of data as a mind map in relational database systems
CN109101652B (en) Label creating and managing system
CN104881424B (en) A kind of acquisition of electric power big data, storage and analysis method based on regular expression
TWI634449B (en) Method and device for auditing sql
TWI564737B (en) Web search methods and devices
WO2017096892A1 (en) Index construction method, search method, and corresponding device, apparatus, and computer storage medium
EP3732587B1 (en) Systems and methods for context-independent database search paths
US20140019454A1 (en) Systems and Methods for Caching Data Object Identifiers
US20200320216A1 (en) Systems and methods for determining database permissions
US10747786B2 (en) Spontaneous networking
CN115544183A (en) Data visualization method and device, computer equipment and storage medium
CN104765823A (en) Method and device for collecting website data
Song et al. Parallel incremental association rule mining framework for public opinion analysis
Cai et al. A semi-transparent selective undo algorithm for multi-user collaborative editors
CN106250456A (en) Bid winning announcement extraction method and device
US9092338B1 (en) Multi-level caching event lookup
He et al. The high-activity parallel implementation of data preprocessing based on MapReduce
CN103488757A (en) Clustering feature equivalent histogram maintaining method based on cloud computing
CN113190718A (en) Data processing method and device for graph database, electronic equipment and storage medium
CN110851517A (en) Source data extraction method, device and equipment and computer storage medium
Zhu Creating a NoSQL database for the Internet of Things: Creating a key-value store on the SensibleThings platform
CN112860812B (en) Method and device for non-invasively determining data field level association relation in big data
Zhang Design and implementation of data mining based on distributed computing
Barquero et al. Comparison and Performance Evaluation of Processing Platforms–Technical Report

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20854558

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20854558

Country of ref document: EP

Kind code of ref document: A1