CN116796007B - Target knowledge graph embedding method, target knowledge graph embedding device and electronic equipment - Google Patents

Target knowledge graph embedding method, target knowledge graph embedding device and electronic equipment

Info

Publication number
CN116796007B
CN116796007B (application CN202310970701.7A)
Authority
CN
China
Prior art keywords
graph, nodes, sub, target, knowledge
Prior art date
Legal status
Active
Application number
CN202310970701.7A
Other languages
Chinese (zh)
Other versions
CN116796007A (en)
Inventor
朱洪银 (Zhu Hongyin)
张闯 (Zhang Chuang)
王敏 (Wang Min)
Current Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202310970701.7A
Publication of CN116796007A
Application granted
Publication of CN116796007B
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval; Database Structures and File System Structures Therefor (AREA)

Abstract

The embodiment of the application provides a target knowledge graph embedding method, a target knowledge graph embedding device and electronic equipment, wherein the method comprises the following steps: obtaining graph data of a target knowledge graph; dividing the target knowledge graph into a first subgraph and a second subgraph according to the graph data of the target knowledge graph and the degrees of the nodes in the target knowledge graph, wherein the degrees of the nodes in the first subgraph are greater than the degrees of the nodes in the second subgraph; dividing the second subgraph into a plurality of third subgraphs according to the graph data of the second subgraph and the degrees of the nodes in the second subgraph; and training a knowledge graph embedding model according to the graph data of the first subgraph and the graph data of the plurality of third subgraphs, wherein the trained knowledge graph embedding model is used for generating an embedded representation of the target knowledge graph. According to the application, various knowledge graph embedding models can be applied to large-scale knowledge graphs, which solves the problem that knowledge graph embedding models in the related art have difficulty processing large-scale knowledge graphs.

Description

Target knowledge graph embedding method, target knowledge graph embedding device and electronic equipment
Technical Field
The embodiment of the application relates to the field of computers, in particular to a target knowledge graph embedding method, a target knowledge graph embedding device and electronic equipment.
Background
With the continuous generation of big data, knowledge graph data keeps increasing. A knowledge graph is a general formalized description framework for semantic knowledge, in which nodes represent entities or concepts and edges represent semantic relationships between entities or concepts. In a knowledge graph, entities mainly describe real-world instances and ontologies mainly describe concepts; the knowledge graph can be stored using a graph database and visually presented as a graph structure. Nodes in the knowledge graph are connected by edges that express the relationships between them, such as (Client A, affiliated with, Company B) or (Person A, friend of, Person B). Knowledge graphs in different fields have different entity types and relation types, and the real world contains rich entities and relation types.
With the explosive growth of graph data scale, graph databases suffer from some performance limitations; for example, they are not easy to use for distributed computation or for direct semantic understanding. Knowledge graph embedding represents the entities and relations in a knowledge graph as low-dimensional vectors. Embedding the entities and relations of the knowledge graph into a continuous vector space preserves the structural information of the knowledge graph while making computation convenient. A knowledge graph embedding summarizes the semantic and local structural information of a given node.
However, large-scale knowledge graphs generally have hundreds of millions of nodes; limited by server memory, existing knowledge graph embedding models have difficulty processing such large-scale knowledge graphs.
Disclosure of Invention
The embodiment of the application provides a target knowledge graph embedding method, a target knowledge graph embedding device and electronic equipment, so as to at least solve the problem that knowledge graph embedding models in the related art have difficulty processing large-scale knowledge graphs.
According to an embodiment of the present application, there is provided a target knowledge graph embedding method, including: obtaining graph data of a target knowledge graph; dividing the target knowledge graph into a first subgraph and a second subgraph according to the graph data of the target knowledge graph and the degrees of the nodes in the target knowledge graph, wherein the degrees of the nodes in the first subgraph are greater than the degrees of the nodes in the second subgraph; dividing the second subgraph into a plurality of third subgraphs according to the graph data of the second subgraph and the degrees of the nodes in the second subgraph; and training a knowledge graph embedding model according to the graph data of the first subgraph and the graph data of the plurality of third subgraphs, wherein the trained knowledge graph embedding model is used for generating an embedded representation of the target knowledge graph.
In one exemplary embodiment, dividing the target knowledge graph into a first subgraph and a second subgraph according to the graph data of the target knowledge graph and the degrees of the nodes in the target knowledge graph includes: determining a first preset number according to the graph data of the target knowledge graph, wherein the first preset number is smaller than the number of all nodes of the target knowledge graph; determining, according to the degrees of the nodes in the target knowledge graph, a plurality of first nodes satisfying the first preset number among all nodes of the target knowledge graph, wherein the nodes other than the plurality of first nodes among all nodes of the target knowledge graph are a plurality of second nodes, and the degree of each first node in the plurality of first nodes is greater than the degree of each second node in the plurality of second nodes; and dividing, according to the plurality of first nodes and the plurality of second nodes, the target knowledge graph into a first subgraph corresponding to the plurality of first nodes and a second subgraph corresponding to the plurality of second nodes.
In one exemplary embodiment, the first preset number is less than or equal to one fifth of the number of all nodes of the target knowledge graph.
In one exemplary embodiment, determining a plurality of first nodes satisfying the first preset number among all nodes of the target knowledge graph includes: determining the degree of each node among all nodes of the target knowledge graph according to the target knowledge graph; sorting the degrees of all nodes of the target knowledge graph according to a preset rule to obtain a target node sequence, wherein the preset rule is one of the following: the degrees of the nodes increase from top to bottom, or the degrees of the nodes decrease from top to bottom; and acquiring a plurality of target nodes satisfying the first preset number from the target node sequence to obtain the plurality of first nodes.
In one exemplary embodiment, determining the degree of each node among all nodes of the target knowledge graph according to the target knowledge graph includes: representing the graph data of the target knowledge graph in the form of an adjacency list; and calculating the degree of each node among all nodes of the target knowledge graph according to the graph data of the target knowledge graph represented as an adjacency list.
In an exemplary embodiment, sorting the degrees of all nodes of the target knowledge graph according to the preset rule to obtain the target node sequence includes: determining the number of all nodes of the target knowledge graph according to the graph data of the target knowledge graph; acquiring a pre-stored node list according to the number of all nodes of the target knowledge graph; and adding the degree of each node among all nodes of the target knowledge graph to the pre-stored node list to obtain a target node list containing the target node sequence.
In an exemplary embodiment, sorting the degrees of all nodes of the target knowledge graph according to the preset rule to obtain the target node sequence further includes: initializing the pre-stored node list before the degree of each node among all nodes of the target knowledge graph is added to the pre-stored node list.
In an exemplary embodiment, the above method further includes: storing the target node list into a database to obtain a new pre-stored node list, wherein the database stores a plurality of pre-stored node lists, and any one of the pre-stored node lists differs from at least one other stored node list.
In an exemplary embodiment, acquiring a plurality of target nodes satisfying the first preset number from the target node sequence to obtain the plurality of first nodes includes: in the case that the preset rule is that the degrees of the nodes increase from top to bottom, extracting target nodes from the target node sequence in descending order of the sequence numbers of the target node sequence until the number of extracted target nodes satisfies the first preset number; and in the case that the preset rule is that the degrees of the nodes decrease from top to bottom, extracting target nodes from the target node sequence in ascending order of the sequence numbers of the target node sequence until the number of extracted target nodes satisfies the first preset number.
In one exemplary embodiment, dividing the second subgraph into a plurality of third subgraphs according to the graph data of the second subgraph and the degrees of the nodes in the second subgraph includes: dividing the graph data of the second subgraph into a plurality of data sets according to the degrees of the nodes in the second subgraph, wherein each data set in the plurality of data sets corresponds to at least one node among all nodes of the second subgraph; and dividing each data set in the plurality of data sets into a plurality of sub-data-sets by adopting a preset graph partitioning algorithm to obtain the graph data of the plurality of third subgraphs, wherein the sub-data-sets obtained by dividing the plurality of data sets are in one-to-one correspondence with the plurality of third subgraphs.
In one exemplary embodiment, dividing the graph data of the second subgraph into a plurality of data sets according to the degrees of the nodes in the second subgraph includes: determining a second preset number according to the number of nodes in the second subgraph and the degrees of the nodes, wherein the second preset number is smaller than the number of all nodes of the second subgraph; and dividing the graph data of the second subgraph into a plurality of data sets satisfying the second preset number, wherein the degrees of the nodes corresponding to different data sets in the plurality of data sets are different.
In an exemplary embodiment, dividing each data set in the plurality of data sets into a plurality of sub-data-sets using the preset graph partitioning algorithm to obtain the graph data of the plurality of third subgraphs includes: determining a third preset number according to each data set, wherein the third preset number is smaller than the number of nodes corresponding to each data set; and dividing each data set into a plurality of sub-data-sets satisfying the third preset number by adopting the Metis algorithm to obtain the graph data of the plurality of third subgraphs.
In one exemplary embodiment, training the knowledge graph embedding model according to the graph data of the first subgraph and the graph data of the plurality of third subgraphs includes: acquiring a knowledge graph embedding model; determining a target training framework according to the knowledge graph embedding model; and training the knowledge graph embedding model by using the graph data of the first subgraph and the graph data of the plurality of third subgraphs as training samples and adopting the target training framework.
In one exemplary embodiment, determining the target training framework according to the knowledge graph embedding model includes: judging whether a first training framework supports the knowledge graph embedding model; in the case that the first training framework supports the knowledge graph embedding model, determining the first training framework as the target training framework; and in the case that the first training framework does not support the knowledge graph embedding model, determining a second training framework as the target training framework.
In one exemplary embodiment, the first training framework is the DGL-KE framework and the second training framework is the OpenKE framework.
In an exemplary embodiment, training the knowledge graph embedding model by adopting the target training framework with the graph data of the first subgraph and the graph data of the plurality of third subgraphs as training samples includes: taking each subgraph in a plurality of subgraphs as a training sample and training the knowledge graph embedding model by adopting the target training framework, wherein the plurality of subgraphs include the first subgraph and the plurality of third subgraphs; and generating a plurality of embedded representations corresponding to the plurality of subgraphs according to the trained knowledge graph embedding model.
In an exemplary embodiment, the above method further includes: sequentially uploading the embedded representations corresponding to the subgraphs to a storage node of a data center according to the order in which the subgraphs served as training samples.
In an exemplary embodiment, the above method further includes: merging the plurality of embedded representations of the plurality of subgraphs uploaded to the storage node.
In an exemplary embodiment, the above method further includes: acquiring the knowledge graph embedding model from a knowledge graph embedding model pool according to a preset instruction, wherein the knowledge graph embedding model pool includes translational distance models and semantic matching models.
According to another embodiment of the present application, there is provided a target knowledge graph embedding apparatus, including: an acquisition module, configured to acquire graph data of a target knowledge graph; a first dividing module, configured to divide the target knowledge graph into a first subgraph and a second subgraph according to the graph data of the target knowledge graph and the degrees of the nodes in the target knowledge graph, wherein the degrees of the nodes in the first subgraph are greater than the degrees of the nodes in the second subgraph; a second dividing module, configured to divide the second subgraph into a plurality of third subgraphs according to the graph data of the second subgraph and the degrees of the nodes in the second subgraph; and a training module, configured to train a knowledge graph embedding model according to the graph data of the first subgraph and the graph data of the plurality of third subgraphs, wherein the trained knowledge graph embedding model is used for generating an embedded representation of the target knowledge graph.
According to a further embodiment of the application, there is also provided a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
According to a further embodiment of the application there is also provided an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
According to the method and the device, after the graph data of the target knowledge graph are obtained, the target knowledge graph is divided into the first subgraph and the second subgraph according to the graph data of the target knowledge graph and the degrees of the nodes in the target knowledge graph, so that the degrees of the nodes in the first subgraph are greater than the degrees of the nodes in the second subgraph, and the second subgraph is then divided into a plurality of third subgraphs according to the graph data of the second subgraph and the degrees of the nodes in the second subgraph. By combining a low-complexity coarse partitioning algorithm with a high-complexity fine partitioning algorithm, the partitioning efficiency of large-scale knowledge graphs can be improved, and the problem in the related art that the computing performance and the information loss of large-scale knowledge graph partitioning cannot be balanced can be avoided. Therefore, by providing a manner that combines a two-layer graph partitioning process with distributed knowledge graph embedding training, the application solves the problem that knowledge graph embedding models in the related art have difficulty processing large-scale knowledge graphs.
Drawings
FIG. 1 is a block diagram of a hardware architecture of a computer device for a target knowledge-graph embedding method according to an embodiment of the present application;
FIG. 2 is a flow chart of a target knowledge-graph embedding method, according to an embodiment of the application;
FIG. 3 is a workflow diagram of distributed OpenKE-DGL in a target knowledge-graph embedding method, according to an embodiment of the present application;
FIG. 4 is a block diagram of a target knowledge-graph embedding apparatus according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in detail below with reference to the accompanying drawings in conjunction with the embodiments.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
For convenience of description, the following will describe some terms or terminology involved in the embodiments of the present application:
TransE, TransR, TransD models: knowledge graph embedding models that interpret the vector representation of a relation as a translation vector between the head entity vector and the tail entity vector, thereby modeling the relationship.
Knowledge graph embedding toolkit (OpenKE): also called a knowledge graph representation learning toolkit, an open-source framework developed on top of the symbolic math system TensorFlow and the open-source Python machine learning library PyTorch, used for embedding knowledge graphs into a low-dimensional continuous vector space for representation.
Graph data processing toolkit (NetworkX): an open-source Python package that makes it convenient for users to create, manipulate, and study complex networks. With NetworkX, networks can be stored in standardized and non-standardized data formats, various random and classical networks can be generated, network structures can be analyzed, network models can be built, new network algorithms can be designed, networks can be drawn, and so on.
Graph model framework (PyTorch-BigGraph, PBG): an embedding system that can scale to graphs with billions of nodes and trillions of edges. The PBG framework supports distributed computation and a graph partitioning algorithm, and can process large-scale knowledge graphs.
The method embodiments provided in the embodiments of the present application may be performed on a mobile terminal, a computer terminal, or a similar computing device. Taking operation on a computer terminal as an example, FIG. 1 is a block diagram of the hardware structure of a computer terminal for a target knowledge graph embedding method according to an embodiment of the present application. As shown in FIG. 1, the computer terminal may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a microprocessor (MCU), a programmable logic device (FPGA), or another processing device) and a memory 104 for storing data, and may further include a transmission device 106 for communication functions and an input/output device 108. It will be appreciated by those skilled in the art that the configuration shown in FIG. 1 is merely illustrative and does not limit the configuration of the computer terminal described above. For example, the computer terminal may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to the target knowledge graph embedding method in the embodiment of the present application, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, implement the method described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the computer terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of a computer terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as a NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
In this embodiment, a target knowledge graph embedding method running on the computer terminal is provided. FIG. 2 is a flowchart of the target knowledge graph embedding method according to an embodiment of the present application. As shown in FIG. 2, the flow includes the following steps:
Step S202, obtaining graph data of a target knowledge graph;
Step S204, dividing the target knowledge graph into a first subgraph and a second subgraph according to the graph data of the target knowledge graph and the degrees of the nodes in the target knowledge graph, wherein the degrees of the nodes in the first subgraph are greater than the degrees of the nodes in the second subgraph;
Step S206, dividing the second subgraph into a plurality of third subgraphs according to the graph data of the second subgraph and the degrees of the nodes in the second subgraph;
Step S208, training a knowledge graph embedding model according to the graph data of the first subgraph and the graph data of the plurality of third subgraphs, wherein the trained knowledge graph embedding model is used for generating an embedded representation of the target knowledge graph.
It should be noted that the degree of any node in the target knowledge graph can be understood as the number of edges connected to the node. Illustratively, if node A is connected to N nodes by N edges, the degree of node A is N.
For a knowledge graph of very large scale, complex semantic link relationships exist between objects. Graph partitioning is the basic work of distributed graph processing: it partitions a large-scale knowledge graph onto different machines, and graph partitioning is an NP-hard problem. By the weakest-link principle, the speed of the whole process depends on the machine with the lowest performance. In addition, graph computation involves communication between different machines, and the communication cost is high, so the quality of graph partitioning has a great influence on computing performance. Some edges are cut during graph partitioning; a high-quality partition needs to satisfy load balancing while minimizing the number of cut edges or nodes, i.e., the number of cross-partition edges is minimal, or the number of duplicated nodes is minimal. Therefore, how to produce a high-quality partition for a knowledge graph of very large scale, balancing improved computing performance against reduced information loss, is also a problem to be solved.
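For reference, this partition-quality requirement corresponds to the standard balanced k-way minimum edge-cut formulation (a textbook formulation supplied here for clarity, not quoted from the patent):

$$\min_{P_1,\dots,P_k} \bigl|\{(u,v)\in E : u\in P_i,\ v\in P_j,\ i\neq j\}\bigr| \quad \text{s.t.}\quad |P_i|\le (1+\epsilon)\,\frac{|V|}{k},\ i=1,\dots,k$$

where $V$ and $E$ are the node and edge sets of the graph, $P_1,\dots,P_k$ are the partitions, and $\epsilon$ is the allowed load imbalance.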
Specifically, graph partitioning in the related art uses NetworkX, but NetworkX is mainly applied in a single-machine environment and can process small knowledge graphs; its processing of large-scale knowledge graphs is not ideal, and it cannot be used directly in a distributed environment. The related art also processes large-scale knowledge graphs with PBG. PBG can overcome the limitation of CPU memory by dividing nodes into disjoint partitions and storing them in memory, but loading data from memory still wastes GPU resources.
In the method provided by the embodiment of the application, through the above steps S202 to S208, combining low-complexity coarse graph partitioning with high-complexity fine graph partitioning not only improves the partitioning efficiency of large-scale knowledge graphs but also avoids the problem in the related art that computing performance and information loss are not balanced when partitioning large-scale knowledge graphs, so that large-scale knowledge graphs can be processed by applying the graph data of the first subgraph and the graph data of the plurality of third subgraphs obtained by the two-layer graph partitioning to the training of the knowledge graph embedding model. Therefore, by providing a manner that combines the two-layer graph partitioning process with distributed knowledge graph embedding training, the application solves the problem that knowledge graph embedding models in the related art have difficulty processing large-scale knowledge graphs, and various knowledge graph embedding models developed for a single machine can be applied to large-scale knowledge graphs.
Furthermore, the two-layer graph partitioning manner provided by the embodiment of the application enables parallel knowledge graph embedding training over multiple subgraphs, so the problem of distributed adaptation can be solved, the difficulty of distributed knowledge graph training is resolved, and various knowledge graph embedding models developed for a single machine can be applied to large-scale knowledge graphs.
In addition, the method provided by the embodiment of the application has low coupling, can quickly adapt to and access new service data, does not need manually annotated data, and can be applied to various field scenarios involving graph data analysis, such as graph databases.
The above-described distributed knowledge graph embedding training of embodiments of the application includes two subtasks, namely graph partitioning (graph partition) and knowledge graph embedding (knowledge graph embedding). Graph partitioning can be formalized as: given graph data $G$, divide it into multiple subgraphs $\{G_1, G_2, \dots, G_k\}$ through a graph partitioning process. The knowledge graph embedding task can be formalized as: given a triple set $S=\{(h,r,t)\}$, where one knowledge graph consists of a plurality of triples $(h,r,t)$, $h$ and $t$ denote the head node and tail node of a relationship, $r$ denotes the relation, $E$ denotes the entity set, and $R$ denotes the relation type set, knowledge graph embedding aims at finding a mapping $f: E\cup R\rightarrow\mathbb{R}^d$ that maps the representations of the entities and relations of the knowledge graph to low-dimensional dense vectors.
The embodiment of the application provides a two-layer graph partitioning processing manner, which can improve the processing efficiency of large-scale knowledge graphs. Specifically, in the above steps S202 to S206, the graph data of the target knowledge graph are obtained; the target knowledge graph is divided into a first subgraph and a second subgraph according to the graph data of the target knowledge graph and the degrees of the nodes in the target knowledge graph, with the degrees of the nodes in the first subgraph greater than the degrees of the nodes in the second subgraph; and the second subgraph is then divided into a plurality of third subgraphs according to the graph data of the second subgraph and the degrees of the nodes in the second subgraph.
In some optional embodiments, dividing the target knowledge graph into a first subgraph and a second subgraph according to the graph data of the target knowledge graph and the degrees of the nodes in the target knowledge graph includes: determining a first preset number according to the graph data of the target knowledge graph, wherein the first preset number is smaller than the number of all nodes of the target knowledge graph; determining, according to the degrees of the nodes in the target knowledge graph, a plurality of first nodes satisfying the first preset number among all nodes of the target knowledge graph, wherein the nodes other than the plurality of first nodes are a plurality of second nodes, and the degree of each first node is greater than the degree of each second node; and dividing, according to the plurality of first nodes and the plurality of second nodes, the target knowledge graph into a first subgraph corresponding to the plurality of first nodes and a second subgraph corresponding to the plurality of second nodes.
In the above optional embodiment, after the graph data of the target knowledge graph are obtained, the number of all nodes in the target knowledge graph may first be determined, and a preset number smaller than that number, i.e., the first preset number, is determined accordingly. By extracting nodes satisfying the first preset number from all nodes of the target knowledge graph such that the degrees of the remaining nodes are smaller than the degrees of the extracted nodes, all nodes of the target knowledge graph can be divided by the first preset number into two parts, namely a plurality of first nodes and a plurality of second nodes, and the target knowledge graph can then be divided into the first subgraph and the second subgraph according to these two parts of nodes.
In the above alternative embodiment, the first preset number may be less than or equal to one fifth of the number of all nodes of the target knowledge graph. The graph data of a knowledge graph generally follow a long-tail distribution: nodes with high out-degrees and in-degrees account for only a small portion (for example, 10%), and most nodes are sparsely connected. Therefore, by setting the first preset number to be far smaller than the number of all nodes in the knowledge graph, the embodiment of the application can find the portion of nodes with the highest degrees, which facilitates coarse-to-fine two-layer graph partitioning.
In some optional embodiments, determining a plurality of first nodes satisfying the first preset number among all nodes of the target knowledge graph includes: determining the degree of each node among all nodes of the target knowledge graph according to the target knowledge graph; sorting the degrees of all nodes of the target knowledge graph according to a preset rule to obtain a target node sequence, wherein the preset rule is one of the following: the degrees of the nodes increase from top to bottom, or the degrees of the nodes decrease from top to bottom; and acquiring a plurality of target nodes satisfying the first preset number from the target node sequence to obtain the plurality of first nodes.
In the above alternative embodiment, the degrees of all nodes of the target knowledge graph may first be determined, and the degrees of all nodes sorted to form a sequence; since the degrees of the nodes differ, the sequence may be ordered with degree increasing or decreasing from top to bottom.
In the above optional embodiment, acquiring a plurality of target nodes satisfying the first preset number from the target node sequence to obtain the plurality of first nodes may include: in the case that the preset rule is that the degrees of the nodes increase from top to bottom, extracting target nodes from the target node sequence in descending order of the sequence numbers of the target node sequence until the number of extracted target nodes satisfies the first preset number; and in the case that the preset rule is that the degrees of the nodes decrease from top to bottom, extracting target nodes from the target node sequence in ascending order of the sequence numbers of the target node sequence until the number of extracted target nodes satisfies the first preset number.
Specifically, for a sequence whose degrees decrease from top to bottom, extracting the head nodes satisfying the preset number yields the plurality of nodes with larger degrees, i.e., the first nodes; the remaining nodes, whose degrees are smaller than those at the head of the sequence, are the second nodes. For a sequence whose degrees increase from top to bottom, extracting the tail nodes satisfying the preset number yields the plurality of nodes with larger degrees, i.e., the first nodes; the remaining nodes, whose degrees are smaller than those at the tail of the sequence, are the second nodes.
In the above optional embodiment, determining the degree of each node among all nodes of the target knowledge graph according to the target knowledge graph may include: representing the graph data of the target knowledge graph in the form of an adjacency list; and calculating the degree of each node among all nodes of the target knowledge graph according to the graph data of the target knowledge graph represented as an adjacency list.
Specifically, the knowledge graph is represented in the form of an adjacency list, a chained storage scheme for graphs comprising two parts: vertices and adjacent points. A vertex contains the vertex information and a pointer to its first adjacent point; an adjacent point contains the storage index of the adjacent node and a pointer to the next adjacent point, and all adjacent points of a vertex form a singly linked list. Concretely, the adjacency list can be understood as a one-dimensional array in which each position stores the head node of a singly linked list, the head node having a data field and a pointer field pointing to the next node (adjacent point). Based on the knowledge graph represented as an adjacency list, the degree of each node in the knowledge graph can be computed conveniently.
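As an illustration, the adjacency-list representation and degree computation can be sketched in Python as follows (the dict-of-lists layout and the names are illustrative assumptions, not the patent's code):

from collections import defaultdict

def build_adjacency_list(triples):
    # adjacency list: node -> list of adjacent nodes
    adj = defaultdict(list)
    for head, relation, tail in triples:
        adj[head].append(tail)   # edge head -> tail
        adj[tail].append(head)   # count the edge at both endpoints
    return adj

def node_degrees(adj):
    # degree of a node = number of edges connected to it
    return {node: len(neighbors) for node, neighbors in adj.items()}

triples = [("A", "friend_of", "B"), ("A", "works_at", "C"), ("B", "works_at", "C")]
print(node_degrees(build_adjacency_list(triples)))  # {'A': 2, 'B': 2, 'C': 2}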
In some optional embodiments, sorting the degrees of all nodes of the target knowledge graph according to the preset rule to obtain the target node sequence includes: determining the number of all nodes of the target knowledge graph according to the graph data of the target knowledge graph; acquiring a pre-stored node list according to the number of all nodes of the target knowledge graph; and adding the degree of each node among all nodes of the target knowledge graph to the pre-stored node list to obtain a target node list containing the target node sequence.
In the above alternative embodiment, a plurality of different node lists may be pre-stored, each with a different number of nodes. According to the number of all nodes in the target knowledge graph, a pre-stored node list whose node count is greater than or equal to that of the target knowledge graph can be selected from the plurality of pre-stored node lists as the target pre-stored node list, so that the degrees of all nodes in the target knowledge graph can be added to it, yielding a list storing the degrees of all nodes of the target knowledge graph.
In the above optional embodiment, sorting the degrees of all nodes of the target knowledge graph according to the preset rule to obtain the target node sequence may further include: initializing the pre-stored node list before the degree of each node among all nodes of the target knowledge graph is added to the pre-stored node list.
Specifically, the aforementioned pre-stored node list may already contain the degrees of all nodes of a historical knowledge graph. In that case, in order to add the degrees of all nodes of the target knowledge graph to the pre-stored node list, the pre-stored node list needs to be initialized first, clearing the stored degrees of the historical knowledge graph's nodes, after which the degrees of all nodes of the target knowledge graph are added to the pre-stored node list.
In some optional implementations, the above method provided in the embodiments of the present application further includes: storing the target node list into a database to obtain a new pre-stored node list, wherein the database stores a plurality of pre-stored node lists, and any one of the pre-stored node lists differs from at least one other stored node list.
In the above alternative embodiment, the plurality of pre-stored node lists stored in the database may contain the degrees of all nodes of a plurality of historical knowledge graphs. Since the numbers of nodes in the historical knowledge graphs may differ, the numbers of nodes added to the pre-stored node lists corresponding to different historical knowledge graphs differ; therefore, by storing in the database the pre-stored node lists corresponding to all the different historical knowledge graphs, node degrees can be stored for knowledge graphs with different numbers of nodes.
In the above step S204 of the embodiment of the present application, coarse graph partitioning faces the original large-scale knowledge graph with hundreds of millions of nodes, where many conventional graph partitioning algorithms become unusable. The algorithm adopted for coarse graph partitioning in the above two-layer graph partitioning algorithm is exemplified as follows:
input: graph data Graph of the target knowledge Graph, and the number N of nodes with highest degrees;
And (3) outputting: a partitioned sub-graph (i.e., a second sub-graph);
In the algorithm, graph data Graph of the target knowledge Graph and the number N of nodes with the highest degree in the target knowledge Graph are input. Line 1 is in the form of representing the graph data as a contiguous linked list. Line 2 initializes a list of node degrees. Lines 3 to 6 are the process of calculating the degree of all nodes in the target knowledge graph, where line 4 is the degree of each node calculated, and line 5 is the degree added to the list. Line 7 is the operation of sorting according to the degree of the node, and the returned index is the coordinates of the array element. The 8 th row divides the nodes according to the coordinates after the nodes are ordered, wherein the first N nodes are node sets high_set with high degrees, the rest nodes are node sets low_set with lower degrees, and the returned result is a sub-graph after division.
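Following this line-by-line description, a minimal Python sketch of the coarse partition (variable names follow the description where given; everything else is an illustrative assumption, since the patent publishes no source code):

import numpy as np

def coarse_partition(adj, N):
    # adj: adjacency list, node -> list of neighbors (line 1)
    nodes = list(adj.keys())
    degree_list = []                           # line 2: initialize the degree list
    for node in nodes:                         # lines 3-6: compute all node degrees
        degree_list.append(len(adj[node]))     # lines 4-5: degree = number of edges
    order = np.argsort(degree_list)            # line 7: sort by degree, return indexes
    high_set = {nodes[i] for i in order[-N:]}  # line 8: the N highest-degree nodes
    low_set = {nodes[i] for i in order[:-N]}   # remaining lower-degree nodes
    return high_set, low_set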
With this algorithm, the head nodes of the knowledge graph are determined by setting a proportion (i.e., the ratio of the number N of nodes to the number of all nodes in the knowledge graph). Because their edges are relatively dense, these nodes have very rich local structural features, so embedded representations of relatively high quality can be learned without communicating with other parts during knowledge graph embedding training, which greatly saves communication cost. The subgraph formed by the other, low-degree nodes of the knowledge graph can then be processed by the fine partitioning algorithm.
In some alternative embodiments, dividing the second subgraph into a plurality of third subgraphs according to the graph data of the second subgraph and the degrees of the nodes in the second subgraph includes: dividing the graph data of the second subgraph into a plurality of data sets according to the degrees of the nodes in the second subgraph, wherein each data set corresponds to at least one node among all nodes of the second subgraph; and dividing each data set into a plurality of sub-data-sets by adopting a preset graph partitioning algorithm to obtain the graph data of the plurality of third subgraphs, wherein the sub-data-sets obtained by dividing the plurality of data sets are in one-to-one correspondence with the plurality of third subgraphs.
In the above alternative embodiment, the second subgraph obtained by low-complexity coarse partitioning has the nodes with high out-degrees and in-degrees removed, and most of its nodes are sparsely connected, so it can be finely partitioned by a graph partitioning algorithm in the related art to obtain the plurality of third subgraphs. This improves the processing efficiency of the target knowledge graph and helps balance the improvement of computing performance against the reduction of information loss on the machines processing the partitioned knowledge graph.
In the above alternative embodiment, dividing the graph data of the second subgraph into a plurality of data sets according to the degrees of the nodes in the second subgraph may include: determining a second preset number according to the number of nodes in the second subgraph and the degrees of the nodes, wherein the second preset number is smaller than the number of all nodes of the second subgraph; and dividing the graph data of the second subgraph into a plurality of data sets satisfying the second preset number according to the degrees of the nodes in the second subgraph, wherein the degrees of the nodes corresponding to different data sets are different.
Specifically, the second subgraph is obtained by low-complexity coarse partitioning, and most of its nodes are sparsely connected. A preset number for dividing the second subgraph is determined according to the number of nodes in the second subgraph and the degrees of all its nodes, and the graph data of the second subgraph are divided according to this preset number, so that the degrees of the nodes corresponding to each data set obtained after division are different.
In the above optional embodiment, dividing each data set into a plurality of sub-data-sets using the preset graph partitioning algorithm to obtain the graph data of the plurality of third subgraphs may include: determining a third preset number according to each data set, wherein the third preset number is smaller than the number of nodes corresponding to each data set; and dividing each data set into a plurality of sub-data-sets satisfying the third preset number by adopting the Metis algorithm to obtain the graph data of the plurality of third subgraphs.
Specifically, after the graph data of the second subgraph are divided to obtain a plurality of data sets satisfying the preset number, the graph data are finely partitioned with the Metis algorithm to obtain the plurality of third subgraphs corresponding to the plurality of sub-data-sets. By combining low-complexity coarse partitioning with high-complexity fine partitioning, the target knowledge graph is divided into the plurality of third subgraphs, and the partitioning efficiency of large-scale knowledge graphs can be improved.
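As an illustration of this step, the Metis partition of one data set could be invoked through the pymetis Python binding roughly as follows (the patent names only the Metis algorithm; the pymetis binding and the helper below are assumptions):

import pymetis  # Python binding for the Metis partitioner (assumed here)

def metis_partition(adjacency, num_parts):
    # adjacency[i]: list of neighbor indexes of node i
    n_cuts, membership = pymetis.part_graph(num_parts, adjacency=adjacency)
    parts = [[] for _ in range(num_parts)]
    for node, part in enumerate(membership):
        parts[part].append(node)      # one node list per third subgraph
    return parts

# e.g. a 6-node ring divided into 3 sub-data-sets
ring = [[1, 5], [0, 2], [1, 3], [2, 4], [3, 5], [4, 0]]
print(metis_partition(ring, 3))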
In the above two-layer graph partitioning algorithm provided by the embodiment of the present application, the algorithm adopted for fine graph partitioning may be, for example, a graph partitioning algorithm from the related art such as Metis or Louvain. Taking the Metis algorithm as an example, the algorithm adopted for fine graph partitioning in the two-layer graph partitioning algorithm is as follows:
input: the graph data low_set of the second sub-graph, the degree_list of the nodes, the parallel quantity C and the quantity M of divided mini subgraphs (namely the third sub-graph)
And (3) outputting: divided mini subgraphs (i.e., third subgraph)
In the algorithm, the inputs are the subgraph low_set obtained by the coarse partitioning algorithm, the node degree list degree_list computed during coarse partitioning, the parallelism C, and the number M of mini subgraphs (i.e., third subgraphs) to produce. Since graph partitioning algorithms in the related art cannot be applied directly to the low_set graph data (hundreds of millions of nodes), the low_set graph data needs to be divided into C shares, and each share is then divided into M mini subgraphs; dividing the large-scale knowledge graph into C shares according to the input parallelism C improves computational efficiency.
In the algorithm, line 1 constructs graph data from the node set of the subgraph obtained by coarse partitioning. Line 2 is the parallel processing of dividing the graph data into C shares, which can be divided using the node degrees computed during coarse partitioning. Lines 3 to 7 finely partition each of the C shares, where line 4 constructs graph data for each set in the C shares, line 5 finely partitions the graph data with the Metis algorithm to obtain a plurality of mini subgraphs, and line 6 saves the mini subgraphs to disk. The above procedure can also run in multiple processes in parallel while each set in the C shares is being partitioned.
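Putting this walkthrough together, a minimal sketch of the parallel fine partition, reusing metis_partition from the sketch above (the multiprocessing layout and file naming are illustrative assumptions):

import pickle
from multiprocessing import Pool

def fine_partition_share(args):
    # lines 3-7: partition one of the C shares into M mini subgraphs and save them
    share_id, adjacency, M = args
    for i, part in enumerate(metis_partition(adjacency, M)):   # line 5: Metis
        with open(f"mini_subgraph_{share_id}_{i}.pkl", "wb") as f:
            pickle.dump(part, f)                               # line 6: save to disk

def fine_partition(shares, M):
    # shares: the C adjacency structures produced by the degree-based split (line 2)
    with Pool(processes=len(shares)) as pool:   # process the C shares in parallel
        pool.map(fine_partition_share, [(i, s, M) for i, s in enumerate(shares)])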
In the above step S208, the knowledge graph embedding model is trained according to the graph data of the first subgraph and the graph data of the plurality of third subgraphs, and the trained knowledge graph embedding model is used to generate the embedded representation of the target knowledge graph. Knowledge graph embedding aims at converting the nodes and edges of a knowledge graph into a general low-dimensional dense vector representation, thereby helping machine learning systems better utilize and fuse knowledge.
In the related art, the whole knowledge graph mostly needs to be placed on a GPU for accelerated computation. Small knowledge graphs can be computed this way, but as graph scale grows, a knowledge graph with 5 million nodes can already overflow the memory of a high-performance GPU (e.g., 40 GB), let alone a large-scale knowledge graph with hundreds of millions of nodes. For example, a knowledge graph with 2 billion nodes, where each node uses a 100-dimensional floating-point embedding, requires 800 GB of memory to store its parameters; when faced with large-scale knowledge graphs, many standard methods exceed the memory capacity of typical commercial servers.
In the above step S208 of the embodiment of the present application, only the graph data of the first subgraph and the graph data of the plurality of third subgraphs of the target knowledge graph need to be used as training samples to train the knowledge graph embedding model, so the limitation that GPU memory imposes on large-scale knowledge graphs can be avoided, and large-scale knowledge graph embedding training can be realized.
In some alternative embodiments, training the knowledge graph embedding model according to the graph data of the first subgraph and the graph data of the plurality of third subgraphs includes: acquiring a knowledge graph embedding model; determining a target training framework according to the knowledge graph embedding model; and training the knowledge graph embedding model by using the graph data of the first subgraph and the graph data of the plurality of third subgraphs as training samples and adopting the target training framework.
Specifically, the knowledge graph embedding algorithm is mainly used for learning semantic embeddings. Through knowledge graph embedding representations, knowledge graphs can be conveniently used for downstream tasks. Among knowledge graph embedding algorithms, the TransE algorithm optimizes the vector representations to satisfy $\mathbf{h}+\mathbf{r}\approx\mathbf{t}$, where a knowledge graph consists of a plurality of triples $(h, r, t)$, $h$ and $t$ denoting the head and tail entities of a relationship and $r$ the relation. Learning can use a margin-based ranking loss (triplet ranking loss), so that the knowledge graph embedding satisfies the following formula:

$$L=\sum_{(h,r,t)\in S}\ \sum_{(h',r,t')\in S'}\bigl[\gamma+d(\mathbf{h}+\mathbf{r},\mathbf{t})-d(\mathbf{h'}+\mathbf{r},\mathbf{t'})\bigr]_+$$

where $\gamma$ is the margin parameter and $[\cdot]_+$ denotes taking only the positive part. $S'$ is the negative sample set, whose elements are as shown in the following formula:

$$S'=\{(h',r,t)\mid h'\in E\}\cup\{(h,r,t')\mid t'\in E\}$$

where $h'$ and $t'$ are randomly sampled head and tail entities, but not both replaced at once.
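For concreteness, the margin-based ranking loss above can be written in PyTorch along the following lines (a minimal sketch, not the OpenKE or DGL-KE implementation):

import torch

def transe_loss(h, r, t, h_neg, t_neg, gamma=1.0, p=1):
    # d(h + r, t): L1 or L2 distance between the translated head and the tail
    pos = torch.norm(h + r - t, p=p, dim=-1)
    neg = torch.norm(h_neg + r - t_neg, p=p, dim=-1)
    # [gamma + d(h + r, t) - d(h' + r, t')]_+ averaged over the batch
    return torch.clamp(gamma + pos - neg, min=0).mean()

# embeddings for a batch of 4 triples in a 100-dimensional space
h, r, t, hn, tn = (torch.randn(4, 100) for _ in range(5))
print(transe_loss(h, r, t, hn, tn))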
In this optional implementation manner, the method of the embodiment of the present application may further include: acquiring the knowledge graph embedding model from a knowledge graph embedding model pool according to a preset instruction.
The knowledge graph embedding models in the knowledge graph embedding model pool are mainly divided into distance-based models, semantic matching models, and other innovative models. Distance-based models, i.e., translational distance models, optimize the objective function by modeling the relative distance of entities in the vector space. The TransE model optimizes toward $\mathbf{h}+\mathbf{r}\approx\mathbf{t}$; it has few parameters and low computational complexity, but it cannot handle complex relations such as one-to-many relationships. The TransH model projects the entity vectors into a relation-specific space by adding a parameter matrix, so that an entity has different representations under different relations. The TransR model builds separate semantic spaces for the different relations, projects the entity representations of a triple into the corresponding relation space, and learns the vector representations of the head entity, tail entity, and relation based on the TransE model. The TransD model reduces model complexity and increases the diversity of entity and relation representations by setting different projection matrices for the head and tail entities into the relation space. KG2E models the uncertainty of entities and relations by using Gaussian distributions and probability densities, representing entities and relations as sampled random vectors.
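For clarity, the standard optimization targets of these translational models, as given in the literature (summarized here for reference; the patent text itself elides the formulas), are:

$$\text{TransE:}\ \mathbf{h}+\mathbf{r}\approx\mathbf{t};\qquad \text{TransH:}\ \mathbf{h}_\perp+\mathbf{r}\approx\mathbf{t}_\perp,\ \ \mathbf{h}_\perp=\mathbf{h}-\mathbf{w}_r^{\top}\mathbf{h}\,\mathbf{w}_r;\qquad \text{TransR:}\ \mathbf{M}_r\mathbf{h}+\mathbf{r}\approx\mathbf{M}_r\mathbf{t}$$

where $\mathbf{w}_r$ is the normal vector of the relation-specific hyperplane in TransH and $\mathbf{M}_r$ is the projection matrix into the relation space in TransR.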
Semantic matching models optimize the objective function by computing the matching degree of entity combinations and relations in the embedding space. The RESCAL model is a bilinear model that encodes the knowledge graph as adjacency matrices, forming a tensor over the relation types, and obtains entity vectors and relation matrix representations through tensor decomposition; the relation matrix models the pairwise interactions between the head and tail entity embeddings. DistMult, also a bilinear model, defines the relation matrix as a diagonal matrix, thereby simplifying the learning process of the RESCAL model and reducing model parameters. The holographic embedding (HolE) model combines the expressive power of the RESCAL model with the efficiency of the DistMult model, representing entities and relations as vectors in the entity-combination vector space; HolE adopts a circular operation when computing entity combinations, which realizes asymmetric relation modeling while reducing model parameters.
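Likewise, the standard scoring functions of the semantic matching models mentioned above (from the literature, supplied here for reference) are:

$$\text{RESCAL:}\ f(h,r,t)=\mathbf{h}^{\top}\mathbf{M}_r\mathbf{t};\qquad \text{DistMult:}\ f(h,r,t)=\mathbf{h}^{\top}\operatorname{diag}(\mathbf{r})\,\mathbf{t};\qquad \text{HolE:}\ f(h,r,t)=\mathbf{r}^{\top}(\mathbf{h}\star\mathbf{t})$$

where $\mathbf{M}_r$ is the relation matrix, $\operatorname{diag}(\mathbf{r})$ its diagonal restriction, and $\star$ denotes circular correlation.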
Models other than the distance-based models and the semantic matching models address problems such as dynamic knowledge graph embedding, multi-source information fusion, consideration of entity attributes, consideration of temporal information, and consideration of graph structure.
In the above optional embodiment, determining the target training framework according to the knowledge graph embedding model includes: judging whether the first training framework supports the knowledge graph embedding model; in the case that the first training framework supports the knowledge graph embedding model, determining the first training framework as the target training framework; and in the case that the first training framework does not support the knowledge graph embedding model, determining the second training framework as the target training framework.
Specifically, the first training framework may be the DGL-KE framework, and the second training framework may be the OpenKE framework.
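This selection logic amounts to a small dispatch function; a minimal sketch follows (the supported-model set below is illustrative rather than an exhaustive list of what DGL-KE ships):

DGLKE_MODELS = {"TransE", "TransR", "DistMult", "ComplEx", "RESCAL", "RotatE"}

def choose_framework(model_name):
    # first training framework: DGL-KE; second (fallback): OpenKE
    return "DGL-KE" if model_name in DGLKE_MODELS else "OpenKE"

print(choose_framework("TransE"))  # DGL-KE, distributed training
print(choose_framework("TransD"))  # OpenKE, per-subgraph single-machine training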
In the above optional embodiment, training the knowledge graph embedding model by adopting the target training framework with the graph data of the first subgraph and the graph data of the plurality of third subgraphs as training samples may include: taking each subgraph in the plurality of subgraphs as a training sample and training the knowledge graph embedding model with the target training framework, wherein the plurality of subgraphs include the first subgraph and the plurality of third subgraphs; and generating a plurality of embedded representations corresponding to the plurality of subgraphs according to the trained knowledge graph embedding model.
Specifically, after the target knowledge graph is divided into a plurality of sub-graphs by the aforementioned double-layer graph partitioning algorithm, the plurality of sub-graphs are taken as training samples and the knowledge-graph embedding model is trained with a target training framework, where the target training framework may be selected from the DGL-KE framework and the OpenKE framework.
In some optional implementations, the foregoing method in the embodiments of the present application further includes: sequentially uploading the embedded representations corresponding to the sub-graphs to a storage node of the data center in the order in which the sub-graphs served as training samples, and merging the embedded representations of the sub-graphs uploaded to the storage node. Specifically, after each sub-graph is trained, the corresponding knowledge-graph embedding is uploaded to the data-center storage node, and after all sub-graphs have been uploaded, the knowledge-graph embeddings are merged automatically.
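A minimal sketch of this merge step, assuming each sub-graph's embedding is uploaded as a mapping from global entity ID to vector; the data layout and names are assumptions for illustration.

import numpy as np

def merge_subgraph_embeddings(uploaded):
    # `uploaded` is a list of dicts (entity ID -> vector) in the order
    # in which the sub-graphs were trained; later uploads overwrite
    # earlier ones for any entity shared between sub-graphs.
    merged = {}
    for sub_embedding in uploaded:
        merged.update(sub_embedding)
    return merged

part1 = {0: np.zeros(4), 1: np.ones(4)}
part2 = {2: np.full(4, 0.5)}
print(len(merge_subgraph_embeddings([part1, part2])))  # 3 entities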
Taking the first training framework as the DGL-KE framework and the second training framework as the OpenKE framework as an example: the OpenKE framework has the advantage that the models it contains are relatively comprehensive, but it is limited to single-machine training, does not support distributed training, and therefore cannot train a large-scale knowledge graph. The DGL-KE framework supports distributed training and CPU-GPU hybrid training, optimizes single-machine multi-process training based on shared memory, and can train large-scale knowledge graphs; its limitation is that it supports fewer knowledge-graph embedding models, giving a narrower selection. Many knowledge-graph embedding models are very challenging, or even impossible, to adapt to a distributed implementation.
There are many models for training knowledge-graph embeddings, and different models achieve their best performance in different scenarios. However, most of these models are algorithms proposed for a single-machine environment and cannot be applied directly to large-scale knowledge graphs. Developing a distributed version for each model individually is very time-consuming and inefficient, and many models are inherently unsuitable for conversion into a distributed algorithm. Based on this, the embodiment of the present application further provides a distributed OpenKE-DGL knowledge-graph embedding training method for learning embedded representations of large-scale knowledge graphs; by integrating the two frameworks, it covers a more complete set of knowledge-graph embedding models and thus solves the problem of adapting most knowledge-graph embedding models.
The distributed training algorithm adopted by the distributed OpenKE-DGL knowledge-graph embedding training method is as follows:
Input: graph data Graph of the target knowledge graph and the model name model
Output: knowledge-graph embedding V
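The numbered algorithm listing is not reproduced in this text; the sketch below reconstructs its control flow from the line-by-line explanation that follows. All helpers are hypothetical stubs standing in for the partitioner, the two frameworks, and the storage node, not real DGL-KE or OpenKE APIs.

_uploaded = []

def two_layer_graph_partition(graph):
    return [graph]  # trivial stub: real code returns the first and third sub-graphs

def dgl_ke_supports(model):
    return model in {"TransE", "TransR", "DistMult", "ComplEx"}  # assumed list

def train_with_dgl_ke(subgraphs, model):
    return {"framework": "DGL-KE", "model": model}  # stub embedding

def train_with_openke(subgraph, model):
    return {"framework": "OpenKE", "model": model}  # stub embedding

def upload_to_storage_node(embedding):
    _uploaded.append(embedding)

def merge_uploaded_embeddings():
    return _uploaded

def distributed_openke_dgl_training(graph, model):
    subgraphs = two_layer_graph_partition(graph)      # line 1: partition
    if dgl_ke_supports(model):                        # line 2: support check
        return train_with_dgl_ke(subgraphs, model)    # line 3: DGL-KE training
    for subgraph in subgraphs:                        # lines 5-8: per-sub-graph loop
        v = train_with_openke(subgraph, model)        # line 5: OpenKE training
        upload_to_storage_node(v)                     # line 7: upload embedding
    return merge_uploaded_embeddings()                # line 9: merge all uploads

print(distributed_openke_dgl_training({"edges": []}, "KG2E"))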
In the algorithm, the graph data Graph of the target knowledge graph and the model name model are input, and the trained knowledge-graph embedding V is output. Line 1 divides the target knowledge graph into a plurality of sub-graphs, namely the first sub-graph and the plurality of third sub-graphs, using the aforementioned double-layer graph partitioning algorithm. Line 2 determines whether the DGL-KE framework supports the corresponding model. If so, line 3 uses the DGL-KE framework for model training. If the corresponding model is not supported, lines 5-8 use the OpenKE framework for model training. Because OpenKE does not support multi-machine multi-card parallel training, the embodiment of the present application provides a multi-sub-graph parallel knowledge-graph embedding training strategy. Line 5 processes the data of each sub-graph and trains it with the OpenKE framework. Line 7 uploads the knowledge-graph embedding generated by training each sub-graph to the data-center storage node server. Line 9 merges all uploaded knowledge-graph embeddings once the training of all sub-graphs has finished.
The distributed OpenKE-DGL workflow in the target knowledge graph embedding method provided by the embodiment of the present application may be as shown in fig. 3, and specifically includes the following steps:
1. Input a small-scale knowledge graph, and let the user specify the selected model; the system preferentially selects the DGL-KE framework according to whether the model exists there. When the OpenKE framework is selected, the knowledge-graph embedding is uploaded to the data-center storage node after OpenKE training completes.
2. Input a large-scale knowledge graph: first execute the aforementioned double-layer graph partitioning algorithm, then execute the distributed training algorithm. During this process, determine whether the selected model (TransE, TransR, TransD, HolE, ComplEx …) exists, preferentially selecting the DGL-KE framework. If the model does not exist there, select the OpenKE framework for training; after each sub-graph is trained, upload the corresponding knowledge-graph embedding to the data-center storage node, and automatically merge the knowledge-graph embeddings after all sub-graphs have been uploaded.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, or of course by means of hardware, though in many cases the former is the preferred embodiment. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the method according to the embodiments of the present application.
The embodiment of the present application further provides a target knowledge graph embedding device, which is used for implementing the foregoing embodiment and the preferred implementation manner, and the description is omitted herein. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 4 is a block diagram of a target knowledge-graph embedding apparatus according to an embodiment of the present application, as shown in fig. 4, the apparatus includes:
An acquisition module 41, configured to acquire graph data of a target knowledge graph;
a first dividing module 42, configured to divide the target knowledge-graph into a first sub-graph and a second sub-graph according to the graph data of the target knowledge-graph and the degrees of the nodes in the target knowledge-graph, where the degrees of the nodes in the first sub-graph are greater than the degrees of the nodes in the second sub-graph;
a second dividing module 43, configured to divide the second sub-graph into a plurality of third sub-graphs according to the graph data of the second sub-graph and the degrees of the nodes in the second sub-graph;
The training module 44 is configured to train the knowledge-graph embedding model according to the graph data of the first sub-graph and the graph data of the plurality of third sub-graphs, where the trained knowledge-graph embedding model is used to generate an embedded representation of the target knowledge-graph.
With the above device, the acquisition module obtains the graph data of the target knowledge graph; the first dividing module then divides the target knowledge graph into the first sub-graph and the second sub-graph according to the graph data and the degrees of the nodes, so that the degrees of the nodes in the first sub-graph are greater than those of the nodes in the second sub-graph; and the second dividing module divides the second sub-graph into the plurality of third sub-graphs according to the graph data of the second sub-graph and the degrees of its nodes. By combining a low-complexity coarse partitioning algorithm with a high-complexity fine partitioning algorithm, the partitioning efficiency for large-scale knowledge graphs can be improved, avoiding the problem in the related art that computational performance and information loss cannot be balanced when partitioning a large-scale knowledge graph. The training module then applies the graph data of the first sub-graph and of the plurality of third sub-graphs, obtained through the double-layer graph partitioning, to the training of the knowledge-graph embedding model, so that large-scale knowledge graphs can be processed. Therefore, by combining the double-layer graph partitioning process with distributed knowledge-graph embedding training for large-scale knowledge-graph embedding, the present application solves the problem in the related art that knowledge-graph embedding models have difficulty processing large-scale knowledge graphs, and enables various knowledge-graph embedding models developed for a single machine to be applied to large-scale knowledge graphs.
In some optional embodiments, the first dividing module includes: the first determining submodule is used for determining a first preset number according to graph data of the target knowledge graph, wherein the first preset number is smaller than the number of all nodes of the target knowledge graph; the second determining submodule is used for determining a plurality of first nodes meeting a first preset number in all nodes of the target knowledge graph according to the degrees of the nodes in the target knowledge graph, wherein the nodes except the first nodes in all the nodes of the target knowledge graph are a plurality of second nodes, and the degrees of each first node in the plurality of first nodes are larger than the degrees of each second node in the plurality of second nodes; the first dividing sub-module is used for dividing the target knowledge graph into a first sub-graph corresponding to the first nodes and a second sub-graph corresponding to the second nodes according to the first nodes and the second nodes.
In some alternative embodiments, the first preset number is less than or equal to one fifth of all nodes of the target knowledge-graph.
In some optional embodiments, the second determining submodule includes: the third determining submodule is used for determining the degree of each node in all nodes of the target knowledge graph according to the target knowledge graph; the sequencing sub-module is used for sequencing the degrees of each node in all the nodes of the target knowledge graph according to a preset rule to obtain a target node sequence, wherein the preset rule comprises one of the following steps: the degrees of the plurality of nodes are increased from top to bottom, and the degrees of the plurality of nodes are decreased from top to bottom; the first acquisition sub-module is used for acquiring a plurality of target nodes meeting a first preset number from the target node sequence to obtain a plurality of first nodes.
In some optional embodiments, the third determining submodule includes: the representation submodule, used for representing the graph data of the target knowledge graph in the form of an adjacency list; and the first calculation sub-module, used for calculating the degree of each node in all nodes of the target knowledge graph according to the graph data of the target knowledge graph represented as an adjacency list.
In some alternative embodiments, the sorting submodule includes: a fourth determining submodule, configured to determine, according to graph data of the target knowledge graph, the number of all nodes of the target knowledge graph; the second acquisition sub-module is used for acquiring a pre-stored node list according to the number of all the nodes of the target knowledge graph; and the adding sub-module is used for adding the degree of each node in all the nodes of the target knowledge graph to the pre-stored node list to obtain a target node list with a target node sequence.
In some optional embodiments, the sorting sub-module further includes: an initializing sub-module, configured to initialize a pre-stored node list before adding the degrees of each node in all nodes of the target knowledge graph to the pre-stored node list.
In some alternative embodiments, the apparatus further comprises: the storage module is used for storing the target node list into the database to obtain a new pre-stored node list, wherein the database is stored with a plurality of pre-stored node lists, and any one of the pre-stored node lists is different from the rest of at least one pre-stored node list in the number of nodes used for storage.
In some optional embodiments, the first obtaining submodule includes: the first extraction submodule is used for sequentially extracting target nodes from the target node sequence according to the sequence number corresponding to the target node sequence from large to small under the condition that the preset rule is that the degrees of the plurality of nodes are increased from top to bottom, and the number of the extracted target nodes meets the first preset number; and the second extraction submodule is used for sequentially extracting the target nodes from the target node sequence according to the sequence number corresponding to the target node sequence from small to large under the condition that the degree of the plurality of nodes is reduced from top to bottom according to the preset rule, and the number of the extracted target nodes meets the first preset number.
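A minimal sketch of this coarse (first-layer) partition is given below, assuming the graph arrives as an adjacency list mapping each node to its neighbours; the one-fifth threshold follows the optional embodiment described earlier, and all names are illustrative.

def coarse_partition(adjacency):
    # Degree of each node, read directly off the adjacency list.
    degrees = {node: len(neigh) for node, neigh in adjacency.items()}
    # Target node sequence: nodes ordered by decreasing degree.
    ordered = sorted(degrees, key=degrees.get, reverse=True)
    # First preset number: here at most one fifth of all nodes.
    first_preset = max(1, len(ordered) // 5)
    first_nodes = set(ordered[:first_preset])    # high-degree first sub-graph
    second_nodes = set(ordered[first_preset:])   # remaining second sub-graph
    return first_nodes, second_nodes

adjacency = {
    "a": ["b", "c", "d"], "b": ["a"], "c": ["a"],
    "d": ["a", "e"], "e": ["d"],
}
print(coarse_partition(adjacency))  # ({'a'}, {'b', 'c', 'd', 'e'})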
In some optional embodiments, the second dividing module includes: the second dividing sub-module is used for dividing the graph data of the second sub-graph into a plurality of data sets according to the degrees of the nodes in the second sub-graph, wherein each data set in the plurality of data sets corresponds to at least one node in all the nodes of the second sub-graph respectively; and the third dividing sub-module is used for dividing each data set in the plurality of data sets into a plurality of sub-data sets by adopting a preset graph dividing algorithm to obtain graph data of a plurality of third sub-graphs, wherein the plurality of sub-data sets obtained by dividing the plurality of data sets are in one-to-one correspondence with the plurality of third sub-graphs.
In some optional embodiments, the second dividing submodule includes: a fifth determining sub-module, configured to determine a second preset number according to the number of nodes in the second sub-graph and the degrees of the nodes, where the second preset number is smaller than the number of all nodes in the second sub-graph; and a fourth dividing sub-module, configured to divide the graph data of the second sub-graph into a plurality of data sets that satisfy a second preset number, where the degrees of nodes corresponding to different data sets in the plurality of data sets are different.
In some optional embodiments, the third partitioning submodule includes: a sixth determining submodule, configured to determine a third preset number according to each data set, where the third preset number is smaller than the number of nodes corresponding to each data set; and a fifth dividing sub-module, configured to divide each data set into a plurality of sub-data sets meeting the third preset number by adopting the Metis algorithm, so as to obtain map data of the plurality of third sub-graphs.
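A minimal sketch of this fine (second-layer) partition using the Metis algorithm through the pymetis binding is given below, assuming pymetis is installed; the ring graph and the number of parts are illustrative assumptions.

import pymetis  # Python binding for the METIS graph partitioner

def fine_partition(adjacency, n_parts):
    # adjacency[i] lists the neighbour indices of node i, the layout
    # pymetis expects; membership[i] gives node i's assigned part.
    _edge_cuts, membership = pymetis.part_graph(n_parts, adjacency=adjacency)
    parts = [[] for _ in range(n_parts)]
    for node, part in enumerate(membership):
        parts[part].append(node)
    return parts

# A ring of six nodes divided into two sub-data sets.
ring = [[1, 5], [0, 2], [1, 3], [2, 4], [3, 5], [4, 0]]
print(fine_partition(ring, 2))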
In some alternative embodiments, the training module includes: the third acquisition submodule is used for acquiring the knowledge graph embedding model; a seventh determining submodule, configured to determine a target training frame according to the knowledge-graph embedding model; the first training sub-module is used for training the knowledge graph embedding model by taking graph data of the first sub-graph and graph data of a plurality of third sub-graphs as training samples and adopting a target training frame.
In some optional embodiments, the seventh determining submodule includes: the judging submodule is used for judging whether the first training frame supports the knowledge graph embedding model or not; an eighth determining submodule, configured to determine, when the first training frame supports the knowledge-graph embedding model, that the first training frame is a target training frame; and the ninth determining submodule is used for determining the second training frame as the target training frame under the condition that the first training frame does not support the knowledge graph embedding model.
In some alternative embodiments, the first training framework is the DGL-KE framework and the second training framework is the OpenKE framework.
In some optional embodiments, the first training submodule includes: the second training sub-module, used for training the knowledge-graph embedding model by taking each sub-graph in the plurality of sub-graphs as a training sample and adopting the target training framework, wherein the plurality of sub-graphs comprise the first sub-graph and the plurality of third sub-graphs; and the generation sub-module, used for generating a plurality of embedded representations corresponding to the plurality of sub-graphs according to the trained knowledge-graph embedding model.
In some alternative embodiments, the apparatus further comprises: an uploading module, used for sequentially uploading the embedded representations corresponding to the sub-graphs to a storage node of the data center in the order in which the sub-graphs served as training samples.
In some alternative embodiments, the apparatus further comprises: and the merging module is used for merging the embedded representations of the sub-maps uploaded to the storage node.
In some alternative embodiments, the apparatus further comprises: a fourth acquisition submodule, used for acquiring the knowledge-graph embedding model from a knowledge-graph embedding model pool according to a preset instruction, wherein the knowledge-graph embedding model pool comprises a translation distance model and a semantic matching model.
It should be noted that each of the above modules may be implemented by software or hardware, and for the latter, it may be implemented by, but not limited to: the modules are all located in the same processor; or the above modules may be located in different processors in any combination.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
In one exemplary embodiment, the computer-readable storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing a computer program.
An embodiment of the application also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
In an exemplary embodiment, the electronic device may further include a transmission device connected to the processor, and an input/output device connected to the processor.
Specific examples in this embodiment may refer to the examples described in the foregoing embodiments and the exemplary implementation, and this embodiment is not described herein.
It will be appreciated by those skilled in the art that the modules or steps of the application described above may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed across a network of computing devices; they may be implemented in program code executable by computing devices, so that they may be stored in a storage device and executed by the computing devices; in some cases the steps shown or described may be performed in an order different from that described herein; and they may be fabricated separately as individual integrated-circuit modules, or multiple modules or steps among them may be fabricated as a single integrated-circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principle of the present application should be included in the protection scope of the present application.

Claims (16)

1. The target knowledge graph embedding method is characterized by comprising the following steps of:
obtaining graph data of a target knowledge graph;
Dividing the target knowledge graph into a first sub-graph and a second sub-graph according to graph data of the target knowledge graph and degrees of nodes in the target knowledge graph, wherein the method comprises the following steps: determining a first preset number according to the graph data of the target knowledge graph, wherein the first preset number is smaller than the number of all nodes of the target knowledge graph; determining the degree of each node in all nodes of the target knowledge graph according to the target knowledge graph; ordering the degree of each node in all nodes of the target knowledge graph according to a preset rule to obtain a target node sequence, wherein the nodes in the target knowledge graph at least comprise clients, companies and persons, the relationships between the nodes in the target knowledge graph at least comprise affiliation relationships between the clients and the companies and friendship relationships between the persons, and the preset rule comprises one of the following: the degrees of the plurality of nodes increase from top to bottom, and the degrees of the plurality of nodes decrease from top to bottom; obtaining a plurality of target nodes meeting the first preset number from the target node sequence to obtain a plurality of first nodes, wherein nodes except the plurality of first nodes in all nodes of the target knowledge graph are a plurality of second nodes, and the degree of each first node in the plurality of first nodes is greater than the degree of each second node in the plurality of second nodes; dividing the target knowledge graph into a first sub-graph corresponding to the plurality of first nodes and a second sub-graph corresponding to the plurality of second nodes according to the plurality of first nodes and the plurality of second nodes, wherein the degrees of the nodes in the first sub-graph are greater than those of the nodes in the second sub-graph;
Dividing the second sub-graph into a plurality of third sub-graphs according to the graph data of the second sub-graph and the degrees of the nodes in the second sub-graph, wherein the method comprises the following steps: dividing the graph data of the second sub-graph into a plurality of data sets according to the degrees of the nodes in the second sub-graph, wherein each data set in the plurality of data sets corresponds to at least one node in all nodes of the second sub-graph; dividing each data set in the plurality of data sets into a plurality of sub-data sets by adopting a preset graph dividing algorithm to obtain graph data of the plurality of third sub-graphs, wherein the plurality of sub-data sets obtained by dividing the plurality of data sets are in one-to-one correspondence with the plurality of third sub-graphs;
Training a knowledge-graph embedding model on a GPU graphics card according to the graph data of the first sub-graph and the graph data of the plurality of third sub-graphs, wherein the trained knowledge-graph embedding model is used for generating an embedded representation of the target knowledge graph, and the training comprises the following steps: acquiring the knowledge-graph embedding model; judging whether a first training framework supports the knowledge-graph embedding model or not; under the condition that the first training framework supports the knowledge-graph embedding model, determining the first training framework as a target training framework; under the condition that the first training framework does not support the knowledge-graph embedding model, determining a second training framework as the target training framework, wherein the first training framework is a DGL-KE framework and the second training framework is an OpenKE framework; and training the knowledge-graph embedding model on the GPU graphics card by adopting the target training framework with the graph data of the first sub-graph and the graph data of the plurality of third sub-graphs as training samples.
2. The method of claim 1, wherein the first preset number is less than or equal to one fifth of all nodes of the target knowledge-graph.
3. The method of claim 1, wherein determining the degree of each node in all nodes of the target knowledge-graph according to the target knowledge-graph comprises:
Representing the graph data of the target knowledge graph in the form of an adjacency list;
And calculating the degree of each node in all nodes of the target knowledge graph according to the graph data of the target knowledge graph represented as the adjacency list.
4. The method of claim 1, wherein the sorting the degrees of each node in all nodes of the target knowledge graph according to a preset rule to obtain a target node sequence includes:
Determining the number of all nodes of the target knowledge graph according to the graph data of the target knowledge graph;
Acquiring a pre-stored node list according to the number of all the nodes of the target knowledge graph;
And adding the degree of each node in all the nodes of the target knowledge graph to the pre-stored node list to obtain a target node list with the target node sequence.
5. The method of claim 4, wherein the ranking the degrees of each node in all nodes of the target knowledge graph according to a preset rule to obtain a target node sequence further comprises:
Initializing the pre-stored node list before adding the degree of each node in all nodes of the target knowledge graph to the pre-stored node list.
6. The method as recited in claim 5, further comprising:
And storing the target node list into a database to obtain a new pre-stored node list, wherein the database stores a plurality of pre-stored node lists, and any one of the pre-stored node lists is different from the rest of at least one pre-stored node list in the number of nodes for storage.
7. The method of claim 1, wherein the obtaining a plurality of target nodes from the target node sequence that satisfy the first preset number, to obtain the plurality of first nodes, includes:
Under the condition that the preset rule is that the degrees of the plurality of nodes are increased from top to bottom, extracting target nodes from the target node sequence in sequence from big to small according to the sequence number corresponding to the target node sequence until the number of the extracted target nodes meets the first preset number;
and under the condition that the degree of the plurality of nodes is decreased from top to bottom according to the preset rule, extracting target nodes from the target node sequence in sequence from small to large according to the sequence number corresponding to the target node sequence until the number of the extracted target nodes meets the first preset number.
8. The method of claim 1, wherein the dividing the graph data of the second sub-graph into a plurality of data sets according to degrees of nodes in the second sub-graph comprises:
Determining a second preset number according to the number of nodes in the second sub-map and the degree of the nodes, wherein the second preset number is smaller than the number of all the nodes of the second sub-map;
dividing the graph data of the second sub-graph into a plurality of data sets satisfying the second preset number, wherein the degrees of nodes corresponding to different data sets in the plurality of data sets are different.
9. The method of claim 1, wherein dividing each of the plurality of data sets into a plurality of sub-data sets using a preset map dividing algorithm to obtain map data of the plurality of third sub-maps, comprises:
determining a third preset number according to each data set, wherein the third preset number is smaller than the number of nodes corresponding to each data set;
Dividing each data set into a plurality of sub data sets meeting the third preset number by adopting a Metis algorithm, and obtaining graph data of the plurality of third sub graphs.
10. The method of claim 1, wherein training the knowledge-graph embedding model using the target training framework with the graph data of the first sub-graph and the graph data of the plurality of third sub-graphs as training samples comprises:
respectively taking each sub-graph in a plurality of sub-graphs as a training sample, and training the knowledge-graph embedding model by adopting the target training framework, wherein the plurality of sub-graphs comprise the first sub-graph and the plurality of third sub-graphs;
and generating a plurality of embedded representations corresponding to the plurality of sub-maps according to the trained knowledge-graph embedded model.
11. The method as recited in claim 10, further comprising:
and sequentially uploading the embedded representations corresponding to the plurality of sub-graphs to a storage node of a data center in the order in which the plurality of sub-graphs served as training samples.
12. The method as recited in claim 11, further comprising:
Merging the plurality of embedded representations of the plurality of sub-graphs uploaded to the storage node.
13. The method as recited in claim 1, further comprising:
And acquiring the knowledge-graph embedding model from a knowledge-graph embedding model pool according to a preset instruction, wherein the knowledge-graph embedding model pool comprises a translation distance model and a semantic matching model.
14. A target knowledge graph embedding device, characterized by comprising:
the acquisition module is used for acquiring graph data of the target knowledge graph;
The first dividing module is configured to divide the target knowledge graph into a first sub-graph and a second sub-graph according to graph data of the target knowledge graph and degrees of nodes in the target knowledge graph, where the degrees of nodes in the first sub-graph are greater than the degrees of nodes in the second sub-graph, and the first dividing module includes: the first determining submodule, used for determining a first preset number according to the graph data of the target knowledge graph, wherein the first preset number is smaller than the number of all nodes of the target knowledge graph; the third determining submodule, used for determining the degree of each node in all nodes of the target knowledge graph according to the target knowledge graph; the sequencing sub-module, used for ordering the degree of each node in all nodes of the target knowledge graph according to a preset rule to obtain a target node sequence, wherein the nodes in the target knowledge graph at least comprise clients, companies and persons, the relationships between the nodes in the target knowledge graph at least comprise affiliation relationships between the clients and the companies and friendship relationships between the persons, and the preset rule comprises one of the following: the degrees of the plurality of nodes increase from top to bottom, and the degrees of the plurality of nodes decrease from top to bottom; the first acquisition submodule, used for acquiring a plurality of target nodes meeting the first preset number from the target node sequence to obtain a plurality of first nodes, wherein nodes except the plurality of first nodes in all nodes of the target knowledge graph are a plurality of second nodes, and the degree of each first node in the plurality of first nodes is greater than that of each second node in the plurality of second nodes; and the first dividing sub-module, used for dividing the target knowledge graph into a first sub-graph corresponding to the plurality of first nodes and a second sub-graph corresponding to the plurality of second nodes according to the plurality of first nodes and the plurality of second nodes;
the second dividing module is configured to divide the second sub-graph into a plurality of third sub-graphs according to the graph data of the second sub-graph and degrees of nodes in the second sub-graph, where the second dividing module includes: the second dividing sub-module, used for dividing the graph data of the second sub-graph into a plurality of data sets according to the degrees of the nodes in the second sub-graph, wherein each data set in the plurality of data sets corresponds to at least one node in all the nodes of the second sub-graph respectively; and the third dividing sub-module, used for dividing each data set in the plurality of data sets into a plurality of sub-data sets by adopting a preset graph dividing algorithm to obtain graph data of the plurality of third sub-graphs, wherein the plurality of sub-data sets obtained by dividing the plurality of data sets are in one-to-one correspondence with the plurality of third sub-graphs;
The training module is configured to train a knowledge-graph embedding model on a GPU graphics card according to the graph data of the first sub-graph and the graph data of the plurality of third sub-graphs, where the trained knowledge-graph embedding model is used to generate an embedded representation of the target knowledge graph, and the training module includes: the third acquisition submodule, used for acquiring the knowledge-graph embedding model; the judging submodule, used for judging whether a first training framework supports the knowledge-graph embedding model or not; an eighth determining submodule, configured to determine, when the first training framework supports the knowledge-graph embedding model, that the first training framework is the target training framework; a ninth determining submodule, configured to determine, when the first training framework does not support the knowledge-graph embedding model, that a second training framework is the target training framework, wherein the first training framework is a DGL-KE framework and the second training framework is an OpenKE framework; and the first training sub-module, used for training the knowledge-graph embedding model on the GPU graphics card by adopting the target training framework with the graph data of the first sub-graph and the graph data of the plurality of third sub-graphs as training samples.
15. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program, wherein the computer program, when executed by a processor, implements the steps of the method of any of claims 1 to 13.
16. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method as claimed in any one of claims 1 to 13 when the computer program is executed.
CN202310970701.7A 2023-08-03 2023-08-03 Target knowledge graph embedding method, target knowledge graph embedding device and electronic equipment Active CN116796007B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310970701.7A CN116796007B (en) 2023-08-03 2023-08-03 Target knowledge graph embedding method, target knowledge graph embedding device and electronic equipment

Publications (2)

Publication Number Publication Date
CN116796007A CN116796007A (en) 2023-09-22
CN116796007B (en) 2024-05-03

Family

ID=88049853

Country Status (1)

Country Link
CN (1) CN116796007B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105824802A (en) * 2016-03-31 2016-08-03 清华大学 Method and device for acquiring knowledge graph vectoring expression
CN112242187A (en) * 2020-10-26 2021-01-19 平安科技(深圳)有限公司 Medical scheme recommendation system and method based on knowledge graph representation learning
CN114691891A (en) * 2022-04-19 2022-07-01 同方知网数字出版技术股份有限公司 Knowledge graph-oriented question-answer reasoning method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant