CN116775893A - Knowledge graph dividing method, device, equipment and storage medium - Google Patents

Knowledge graph dividing method, device, equipment and storage medium

Info

Publication number
CN116775893A
CN116775893A CN202211425870.4A
Authority
CN
China
Prior art keywords
node
nodes
graph
knowledge
partition
Prior art date
Legal status
Pending
Application number
CN202211425870.4A
Other languages
Chinese (zh)
Inventor
傅强
李路中
杨晓明
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Suzhou Software Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202211425870.4A priority Critical patent/CN116775893A/en
Publication of CN116775893A publication Critical patent/CN116775893A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a knowledge graph dividing method, device, equipment and storage medium. The method comprises the following steps: determining the weight of nodes in a knowledge graph according to the semantic information of the nodes in the knowledge graph and the semantic information of edges between the nodes; partitioning the knowledge graph according to the weights of the nodes in the knowledge graph and the structural information of the knowledge graph; and dividing a node to be divided into a target partition according to the semantic similarity between the node to be divided and each partition of the knowledge graph. The technical scheme provided by the embodiments of the disclosure is applicable to dynamic knowledge graphs, so that newly added nodes can be divided in real time and processing efficiency is improved.

Description

Knowledge graph dividing method, device, equipment and storage medium
Technical Field
The disclosure relates to the field of computer technology, and in particular to a knowledge graph dividing method, device, equipment and storage medium.
Background
With the rapid development of artificial intelligence (AI) in various fields, information and problems across these fields have become interwoven, increasingly complex and enormous in volume, and a complete set of data models is needed to represent all resources; this is how the Knowledge Graph emerged.
The knowledge graph was proposed by Google in 2012 and is an extended form of the graph G = (V, E), where V is the set of vertices and E is the set of edges representing the association relationships between entities; current knowledge graphs have reached the scale of millions of vertices and billions of edges. RDF (Resource Description Framework) is one of the most widely used data types in knowledge graphs; it describes resources in the form of triples (s, p, o), where s, p and o represent the subject, predicate and object, respectively. Taking DBpedia (the Wikipedia knowledge graph) as an example, it currently contains more than 3 billion triples. Due to the dramatic growth of RDF graph data, conventional stand-alone data processing can no longer meet current demands, and distributed storage and processing of large-scale RDF graph data has become a necessary choice. The main problem faced by distributed storage and query of RDF graph data is RDF graph partitioning, namely dividing the RDF graph data into multiple subgraphs while meeting specific requirements such as fewer crossing edges, high query efficiency and low communication cost.
In the related art, schemes for knowledge graph division mainly fall into two categories: one is based on distributed multi-level graph partitioning algorithms, and the other is based on graph partitioning algorithms using local search and simulated annealing. However, nodes and edges in a knowledge graph are added over time; for example, in a knowledge graph applied to a social network, relationships between entities change. The above schemes are not suitable for dynamic knowledge graphs and cannot divide newly added nodes of the knowledge graph in real time.
Disclosure of Invention
The embodiment of the disclosure provides a knowledge graph dividing method, a knowledge graph dividing device, knowledge graph dividing equipment and a storage medium.
In a first aspect of an embodiment of the present disclosure, a knowledge-graph dividing method is provided, where the method includes:
determining the weight of the nodes in the knowledge graph according to the semantic information of the nodes in the knowledge graph and the semantic information of edges between the nodes;
partitioning the knowledge graph according to the weight of the nodes in the knowledge graph and the structural information of the knowledge graph;
and dividing the nodes to be divided into target partitions according to semantic similarity between the nodes to be divided in the knowledge graph and each partition of the knowledge graph.
In one embodiment, the determining the weight of the node in the knowledge-graph according to the semantic information of the node in the knowledge-graph and the semantic information of the edge between the nodes includes:
determining the initialization weight of each side according to the semantic information of the side in the knowledge graph;
determining the initialization weight of each node according to the semantic information of the node in the knowledge graph;
and determining the weight of each node according to the initialization weight of each edge and the initialization weight of each node.
In one embodiment, the determining the initialization weight of each edge according to the semantic information of the edge in the knowledge graph includes:
determining a first semantic hierarchy according to attribute information of edges in the knowledge graph and association relations among different edges;
a weight of each of the edges in the first semantic hierarchy is determined.
In one embodiment, the determining the initialization weight of each node according to the semantic information of the node in the knowledge graph includes:
determining a second semantic hierarchy according to attribute information of nodes in the knowledge graph and association relations among the nodes;
and determining the weight of each node in the second semantic hierarchy.
In one embodiment, the determining the weight of each node according to the initialization weight of each edge and the initialization weight of each node includes:
for each of the nodes, performing the following operations:
when the outgoing edge set of the node is not empty, determining the weight of the node according to the initialization weight of the node, the initialization weights of the outgoing edges of the node, and the weights of the target nodes corresponding to the outgoing edges;
and when the outgoing edge set of the node is empty, determining the initialization weight of the node as the weight of the node.
In one embodiment, the partitioning the knowledge-graph according to the weight of the node in the knowledge-graph and the structural information of the knowledge-graph includes:
selecting a plurality of first nodes according to the preset partition number and the ordering of the nodes from high to low according to the weight;
taking each first node as an initial node, and initializing and partitioning the knowledge graph by combining the structural information of the knowledge graph and the structural information of a preset query template;
for each partition of the knowledge graph, when a second node exists in leaf nodes in the partition, the second node is reclassified; wherein the semantic similarity between the second node and the partition is less than a similarity threshold.
In one embodiment, for each partition of the knowledge-graph, when determining that a second node exists in leaf nodes in the partition, the repartitioning the second node includes:
calculating the semantic similarity between the leaf nodes in each partition and the partition where the leaf nodes are located through the slave nodes in the distributed system in parallel, and determining whether the second node exists in the leaf nodes in each partition;
When receiving node information of a second node sent by a first slave node, a master node in the distributed system sends the node information of the second node to a plurality of second slave nodes so as to acquire semantic similarity between the second node and partitions stored by each second slave node;
and the second node is reclassified through the master node according to the maximum value of the semantic similarity corresponding to the second node.
In one embodiment, the dividing the node to be divided into the target partitions according to the semantic similarity between the node to be divided in the knowledge graph and each partition of the knowledge graph includes:
determining the maximum value in semantic similarity between the nodes to be divided and each partition;
and taking a partition corresponding to the maximum value in the plurality of semantic similarity as the target partition, and dividing the node to be divided into the target partition.
In one embodiment, for any one of the partitions of the knowledge-graph, the semantic similarity between the nodes in the knowledge-graph and the partition is: and a statistical value of semantic similarity of the node and the node in the partition.
In a second aspect of the embodiments of the present disclosure, there is provided a knowledge-graph dividing apparatus, the apparatus including:
the determining module is used for determining the weight of the nodes in the knowledge graph according to the semantic information of the nodes in the knowledge graph and the semantic information of edges between the nodes;
the dividing module is used for dividing the knowledge graph in a partitioning way according to the weight of the nodes in the knowledge graph and the structural information of the knowledge graph;
the dividing module is further configured to divide the node to be divided into target partitions according to semantic similarity between the node to be divided in the knowledge graph and each partition of the knowledge graph.
In a third aspect of the embodiments of the present disclosure, a computer device is provided, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the knowledge-graph dividing method according to any one of the first aspect when executing the program.
In a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the knowledge-graph dividing method according to any one of the first aspects.
The embodiments of the disclosure provide a knowledge graph dividing method, device, equipment and storage medium. The weight of a node in the knowledge graph is determined according to the semantic information of the nodes in the knowledge graph and the semantic information of the edges between nodes, so that, because the semantic information of nodes and edges is considered when calculating node weight information, the weights of nodes in knowledge graphs of different fields can differ. The knowledge graph is then partitioned according to the weights of the nodes and the structural information of the knowledge graph, and a node to be divided is divided into a target partition according to the semantic similarity between that node and each partition of the knowledge graph, so that for a dynamic knowledge graph, newly added nodes can be divided in real time and processing efficiency is improved.
Drawings
Fig. 1 is a flowchart of a knowledge graph dividing method according to an embodiment of the present disclosure;
fig. 2 is a flowchart of a knowledge graph dividing method according to an embodiment of the present disclosure;
fig. 3 is a flowchart of a knowledge graph dividing method according to an embodiment of the present disclosure;
fig. 4 is a flowchart of a knowledge graph dividing method according to an embodiment of the present disclosure;
Fig. 5 is a schematic diagram of a knowledge graph dividing architecture according to an embodiment of the present disclosure;
fig. 6 is a flowchart of a knowledge graph dividing method according to an embodiment of the present disclosure;
fig. 7 is a block diagram of a knowledge graph dividing apparatus according to an embodiment of the present disclosure;
fig. 8 is a block diagram of a computer device according to an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the embodiments of the present disclosure. Rather, they are merely examples of apparatus and methods consistent with aspects of embodiments of the present disclosure as detailed in the accompanying claims.
The terminology used in the embodiments of the disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in embodiments of the present disclosure to describe various information, this information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of embodiments of the present disclosure. The word "if" as used herein may be interpreted as "when", "upon" or "in response to determining", depending on the context.
It is to be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and that the same or similar features may be referred to each other for brevity and will not be repeated.
In the related art, the partition scheme of the knowledge graph mainly carries out parallelization transformation on the existing single graph partition algorithm or directly proposes a new algorithm under a distributed framework. The distributed graph dividing algorithm is mainly divided into two types, wherein one type is based on a distributed multi-level graph dividing algorithm, and the other type is based on a graph dividing algorithm of local search and simulated annealing.
A representative distributed multi-level graph partitioning algorithm is the KaPPa algorithm, which employs a matching algorithm in the coarsening stage, stores the weights of edges and nodes in an adjacency matrix, and records migrated vertices and associated edges in a hash table. Before the local improvement algorithm runs, the vertex set corresponding to the boundary of each partition is generated through breadth-first search, and a refinement operation is then performed on that vertex set. The KaFFPa algorithm was proposed on the basis of KaPPa and extends the multi-level iteration algorithm; its main contribution is performing multi-level iteration in the coarsening and uncoarsening stages, so that once the graph has been divided, the crossing edges between different partitions do not shrink, which ensures partition quality. Conventional multi-level partitioning algorithms perform coarsening and uncoarsening only once, which is also referred to as a V-cycle. KaFFPa uses two new global search methods, the F-cycle and the W-cycle.
A representative graph partitioning algorithm based on local search and simulated annealing is the JA-BE-JA distributed graph partitioning algorithm. The algorithm first randomly assigns a color to each node; π_v denotes the color of vertex v, and nodes with the same color belong to the same partition. N_v(c) denotes the set of neighbors of node v that have color c, d_v denotes the number of neighbors of node v, and d_v(c) = |N_v(c)| denotes the number of neighbors with color c. The energy of the graph is defined as the number of edges between nodes with different colors. Neighboring nodes and a random node set are searched as candidate nodes; if swapping the colors of two nodes reduces the energy of the graph, the swap is performed, and the algorithm terminates when no vertex whose color can be exchanged is found. On this basis, the JA-BE-JA-VC algorithm was proposed; its core idea is to change vertex partitioning into edge partitioning. The algorithm randomly divides all edges into different partitions, then defines an energy function for each edge and each partition, decides whether to exchange edges according to the system energy, and then iteratively improves the initial division by applying a local search algorithm.
However, the above method ignores semantic information, structural information, and attribute information contained in the knowledge-graph.
The KaPPa-based graph partitioning scheme introduces a distributed matching algorithm: a large-scale graph is converted into a small-scale graph by coarsening, the small-scale graph is divided with an existing algorithm, and the result is then converted back to the original graph by uncoarsening. This algorithm needs to store the vertices and edges in advance, does not consider the semantic information contained in the nodes and edges of the knowledge graph when calculating their weight information, and must divide the graph again when new nodes and edges are added, so newly added nodes and edges cannot be divided in real time. Meanwhile, in the iterative process of graph division, the number of cut edges is chosen as the division criterion; considering that the weights of edges differ across fields, using the number of cut edges as the division criterion is not applicable to RDF knowledge graph division, that is, the final states of these division algorithms differ.
The JA-BE-JA-based graph partitioning algorithm adopts a graph-coloring formulation, is computationally simple and highly local, and can avoid local optima, but its running time grows as the number of iterations increases, so it is not suitable for dynamic graphs such as RDF knowledge graphs, which keep growing over time. Moreover, this graph partitioning algorithm mainly targets undirected graphs and is not applicable to directed graphs such as RDF knowledge graphs.
Fig. 1 shows a flowchart of a knowledge graph dividing method provided by an embodiment of the present disclosure. As shown in fig. 1, the knowledge graph dividing method may include the steps of:
101, determining the weight of the nodes in the knowledge graph according to the semantic information of the nodes in the knowledge graph and the semantic information of edges between the nodes;
102, partitioning the knowledge graph according to the weight of the nodes in the knowledge graph and the structural information of the knowledge graph;
and 103, dividing the nodes to be divided into target partitions according to semantic similarity between the nodes to be divided in the knowledge graph and each partition of the knowledge graph.
The knowledge graph dividing method provided by the embodiment of the disclosure can be executed by any device, equipment, platform or equipment cluster with computing and processing capabilities.
In the embodiment of the disclosure, the knowledge graph may be generated based on structured data and/or unstructured data in a preset domain. The preset fields may include, but are not limited to: telecom, finance, medical and/or insurance etc.
The knowledge graph may be an RDF graph. An RDF graph describes relationships between entities, or between an entity and its attribute values, in the form of RDF triples. In addition, the knowledge graph may be another type of knowledge graph, such as a property graph or a directed labeled graph.
Nodes in the knowledge graph are used to represent entities. Edges in the knowledge graph are directed and connect two nodes, representing a relationship between two entities or between an entity and an attribute value.
The semantic information of a node refers to the semantic features of the node's information. The node information may include node attribute information and/or association relationships between nodes. The node attribute information may include a node identifier, a node name, a node type, etc. The node identifier uniquely identifies the node in the knowledge graph and may be a number assigned to the node according to a certain ordering rule; for example, the initial value may be set to v1, and the nodes may be numbered sequentially as v1, v2, and so on.
The semantic information of an edge refers to the semantic features of the edge's information. The edge information may include edge attribute information and/or association relationships of the edge. The edge attribute information may include an edge identifier, an edge name and an edge type. The edge identifier uniquely identifies the edge in the knowledge graph and may be a number assigned to the edge according to a certain ordering rule; for example, the initial value may be set to e1, and the edges may be numbered sequentially as e1, e2, and so on.
Each node in the knowledge graph has a corresponding weight value, which is used to reflect the weight relationship among the entities corresponding to the nodes.
Each edge in the knowledge graph has a corresponding weight value, which is used to reflect the weight relationship among the relationships corresponding to the edges.
In the step 101, the initialization weight of each node and the initialization weight of each side may be determined according to the semantic information of the node and the semantic information of the side in the knowledge graph, and the weight of each node may be determined according to the initialization weight of each side and the initialization weight of each node.
In the embodiments of the disclosure, because the semantic information of the nodes and edges in the RDF knowledge graph is considered when calculating node weight information, the weights of nodes in knowledge graphs of different fields can differ.
In the step 102, the nodes may be sorted in descending order of their weights in the knowledge graph, the top K nodes in this ordering may be selected as initial nodes, and the knowledge graph may be partitioned by combining the structural information of the knowledge graph and the structural information of a preset query template, so as to obtain multiple partitions that do not overlap with each other.
In step 103, for each node to be divided in the knowledge graph, the semantic similarity between the node to be divided and each partition of the knowledge graph may be determined, and the node to be divided may be divided into the target partition corresponding to the maximum semantic similarity according to the semantic similarity.
In some examples, the node to be partitioned may be a leaf node in any partition, and the semantic similarity between the leaf node and the partition in which the leaf node is located is less than a similarity threshold.
Here, leaf nodes within a partition may be included in a set of nodes corresponding to the boundaries of the partition.
In other examples, the nodes to be partitioned may be newly added nodes after the knowledge-graph partition.
In some examples, for any one partition of the knowledge-graph, the semantic similarity between a node in the knowledge-graph and that partition is: statistics of semantic similarity of the node to nodes within the partition.
Here, the statistical value of the semantic similarity may be an average value or a median value of the semantic similarity, or the like.
The embodiment of the disclosure provides a knowledge graph dividing method, which determines the weights of nodes in a knowledge graph according to the semantic information of the nodes and the semantic information of the edges between nodes, so that the different weights of nodes and edges in different fields can be reflected and the resulting divisions differ, meaning that the semantic information of nodes and edges can be distinguished for different knowledge graphs. The knowledge graph is then partitioned according to the weights of the nodes and the structural information of the knowledge graph, and a node to be divided is divided into a target partition according to the semantic similarity between that node and each partition of the knowledge graph, so that dynamic knowledge graphs can be handled, newly added nodes can be divided in real time, and processing efficiency is improved.
In one embodiment, as shown in fig. 2, in step 101, determining the weight of the node in the knowledge-graph according to the semantic information of the node in the knowledge-graph and the semantic information of the edge between the nodes includes:
and 201, determining the initialization weight of each side according to the semantic information of the side in the knowledge graph.
In some examples, the implementation of step 201 may include:
determining a first semantic hierarchy according to attribute information of edges in the knowledge graph and association relations among different edges; weights of the edges in the first semantic hierarchy are determined.
Taking the RDF knowledge graph as an example, RDFS (RDF Schema, the resource description framework schema language used to describe resources) exists in the RDF knowledge graph and contains a predicate named rdfs:subPropertyOf, which indicates that an inheritance relationship exists between two relations. For example, the RDF triple (dbo:champion rdfs:subPropertyOf owl:hasParticipant) means that if someone is the champion of a sport, it can be deduced that this person is a participant in that sport; that is, inheritance relationships exist between edges in the RDF knowledge graph, so the weights of the edges cannot be uniformly initialized to the same value.
In this embodiment, when dividing the knowledge graph, a first semantic hierarchy (i.e., a semantic hierarchy of edges) of a tree may be formed according to edges and association relationships between edges existing in the knowledge graph, and then the edges are initialized according to weights of each edge in the first semantic hierarchy.
All sides representing the same association in the knowledge graph correspond to the same side in the first semantic hierarchy. For example, all edges in the knowledge graph that represent the relationship between entities as "friends" correspond to the same edge in the first semantic hierarchy.
Wherein, the determining the weight of each edge in the first semantic hierarchy may include:
and determining the initialization weight of the current edge according to the number of the current edge, the number of all child nodes contained in the first semantic hierarchy by the current edge and the total number of the edges in the first semantic hierarchy.
The initialization weight of the current edge may be the ratio of the sum of the number of current edges and the number of all child nodes the current edge contains in the first semantic hierarchy to the total number of edges in the first semantic hierarchy.
Illustratively, the initialization weight of each edge may be calculated using the following formula (1):
ω(e_i) = (|e_i| + c(e_i)) / c(E)    (1)
where ω(e_i) denotes the initialization weight of edge e_i, c(e_i) denotes the number of child nodes of e_i in the first semantic hierarchy, |e_i| denotes the number of current edges, and c(E) denotes the total number of edges in the semantic hierarchy of edges.
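For illustration only, the following Python sketch applies formula (1) to a tree-shaped semantic hierarchy of edges; the data layout (dictionaries for the hierarchy and the edge occurrence counts) and helper names are assumptions, not taken from the patent:

```python
# Hypothetical illustration of formula (1): omega(e_i) = (|e_i| + c(e_i)) / c(E).

def count_descendants(children, edge_type):
    """Number of child nodes that edge_type contains in the semantic hierarchy of edges."""
    return sum(1 + count_descendants(children, c) for c in children.get(edge_type, []))

def edge_init_weight(edge_type, children, total_edge_types, occurrences):
    """children: edge type -> list of child edge types (tree-shaped hierarchy).
    occurrences: edge type -> number of such edges currently in the knowledge graph.
    total_edge_types: total number of edges in the semantic hierarchy, i.e. c(E)."""
    return (occurrences[edge_type] + count_descendants(children, edge_type)) / total_edge_types

# Worked example given later in the description: 20 occurrences, 2 children, 100 edge types.
children = {"owl:hasParticipant": ["dbo:champion", "dbo:olympicOathSwornBy"]}
occurrences = {"owl:hasParticipant": 20}
print(edge_init_weight("owl:hasParticipant", children, 100, occurrences))  # 0.22
```

The node initialization weight of formula (2) below can be computed in exactly the same way over the semantic hierarchy of nodes.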
202, determining the initialization weight of each node according to the semantic information of the nodes in the knowledge graph.
In some examples, the implementation of step 202 may include:
determining a second semantic hierarchy according to attribute information of nodes in the knowledge graph and association relations among the nodes; weights of the nodes in the second semantic hierarchy are determined.
Taking the RDF knowledge graph as an example, RDFS contains a predicate named rdfs:subClassOf, which indicates that an inheritance relationship exists between two entity types. For example, the RDF triple (dbo:Software rdfs:subClassOf dbo:Work) indicates that if a thing belongs to the software class, it can be deduced that the thing belongs to the work class; that is, inheritance relationships also exist between entities in the RDF knowledge graph, so the weights of the nodes cannot be uniformly initialized to 1 for calculation.
In this embodiment, when the knowledge graph is divided, node types and association relations between nodes existing in the current RDF knowledge graph can be sorted out to form a tree-shaped second semantic hierarchy (i.e., a semantic hierarchy of nodes), and then the nodes are initialized according to weights of each node in the second semantic hierarchy.
All nodes representing the same type in the knowledge graph correspond to the same node in the second semantic hierarchy. For example, all nodes in the knowledge graph that represent types of "people" correspond to the same node in the second semantic hierarchy.
Wherein, the determining the weight of each node in the second semantic hierarchy may include:
and determining the initialization weight of the current node according to the number of the current node, the number of all the child nodes contained in the second semantic hierarchy by the current node and the total number of the nodes in the second semantic hierarchy.
The initialization weight of the current node may be the ratio of the sum of the number of current nodes and the number of all child nodes the current node contains in the second semantic hierarchy to the total number of nodes in the second semantic hierarchy.
Illustratively, the initialization weight of each node may be calculated using the following formula (2):
ω(v_i) = (|v_i| + c(v_i)) / c(V)    (2)
where ω(v_i) denotes the initialization weight of node v_i, c(v_i) denotes the number of child nodes of v_i in the second semantic hierarchy, |v_i| denotes the number of current nodes, and c(V) denotes the total number of nodes in the semantic hierarchy of nodes.
203, determining the weight of each node according to the initialization weight of each side and the initialization weight of each node.
In some examples, the following is performed for each node:
when the outgoing edge set of the node is not empty, determining the weight of the node according to the initialization weight of the node, the initialization weights of the outgoing edges of the node, and the weights of the target nodes corresponding to the outgoing edges; when the outgoing edge set of the node is empty, the initialization weight of the node is determined as the weight of the node.
In this embodiment, considering that in the RDF knowledge graph query process, a node is generally used as a starting point of the query, and an outgoing edge is used as a query direction to perform the query, the weight of the node can be determined according to the initialization weight of the node, the outgoing edge set of the node and the weight of the target node.
Illustratively, the weight of the node may be calculated using the following equation (3):
where θ(v) denotes the set of outgoing edges of node v and θ(e_i) denotes the target node corresponding to edge e_i. The weight of node v is the normalized value obtained by combining the initialization weight of node v with the initialization weights of its outgoing edges and the weights of the corresponding target nodes; when the outgoing edge set of node v is empty, the weight of the node is its initialization weight.
In this embodiment, in the process of calculating node weights, the nodes whose outgoing edge set is empty may first be selected for weight assignment; the source nodes of the incoming edges of those nodes are then calculated, and when the weight of a target node corresponding to a node's outgoing edge has not yet been assigned, the node's weight is calculated only after that target node has been assigned.
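The following Python sketch follows that processing order: nodes with an empty outgoing-edge set keep their initialization weight, and every other node is processed once the weights of all target nodes of its outgoing edges are available. The aggregation used here (initialization weight plus edge-weighted target weights, normalized by the number of terms) is an assumption standing in for formula (3), which is not reproduced verbatim in the text:

```python
def node_weights(init_w_node, init_w_edge, out_edges):
    """init_w_node / init_w_edge: initialization weights from formulas (2) and (1).
    out_edges: node -> list of (edge_id, target_node) pairs."""
    # Nodes with an empty outgoing-edge set keep their initialization weight.
    weight = {v: init_w_node[v] for v in init_w_node if not out_edges.get(v)}
    remaining = [v for v in init_w_node if v not in weight]
    while remaining:
        progressed, waiting = False, []
        for v in remaining:
            edges = out_edges[v]
            if all(t in weight for _, t in edges):
                # Assumed stand-in for formula (3): combine the node's initialization
                # weight with its outgoing edges' weights and the target nodes' weights.
                total = init_w_node[v] + sum(init_w_edge[e] * weight[t] for e, t in edges)
                weight[v] = total / (1 + len(edges))
                progressed = True
            else:
                waiting.append(v)  # wait until all target-node weights are assigned
        if not progressed:  # cycle among outgoing edges: fall back to initialization weights
            weight.update({v: init_w_node[v] for v in waiting})
            break
        remaining = waiting
    return weight
```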
In one embodiment, as shown in fig. 3, in the step 102, partitioning the knowledge graph according to the weights of the nodes in the knowledge graph and the structural information of the knowledge graph includes:
301, selecting a plurality of first nodes according to the preset partition number and the ordering of the nodes from high to low according to the weight.
Here, the preset number of partitions may be determined according to actual application requirements. For example, the number of partitions is set to 5 or 6 or other suitable value.
In this embodiment, according to the number of partitions, the nodes ranked in the top K positions by weight may be selected as the first nodes. K is a positive integer greater than 1. For example, the value of K is equal to N times the number of partitions, where N is a positive integer greater than or equal to 1.
302, taking each first node as an initial node, and initializing and partitioning the knowledge graph by combining the structural information of the knowledge graph and the structural information of a preset query template.
The structural information of the query template may include, but is not limited to: chain structures and/or fork-like structures.
In some examples, a first node may be used as a starting point, an edge structure corresponding to the first node is searched from the knowledge graph according to structure information of a preset query template, and the knowledge graph is initialized and partitioned according to a search result.
In some examples, if the number of nodes of the first node is equal to the number of partitions, the plurality of first nodes respectively correspond to the start nodes of the respective partitions. In some examples, if the number of nodes of the first node is N times the number of partitions, the number of first nodes as start nodes in each partition is the same, and at least two first nodes in the same partition are used as start nodes. Wherein N is a positive integer greater than 1.
In some examples, when the number of first nodes is N times the number of partitions, the difference between the sum of the weights of the first nodes in the n-th partition and the sum of the weights of the first nodes in the (n-1)-th partition is less than a preset threshold, where n is greater than 1 and less than or equal to the number of partitions, and N is a positive integer greater than 1.
Here, the preset threshold may be set according to actual application needs. It can be understood that the smaller the preset threshold value is, the better the effect of initializing the partition of the knowledge graph is.
In practical application, when the knowledge graph is initially partitioned, minimizing the number of edges crossing partitions may be used as the partitioning criterion.
In the embodiment, in the partition division of the knowledge graph, the probability of cross-partition query is reduced from the query level by combining the structural information of the query template, so that the efficiency of the knowledge graph query processing is improved.
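A simplified sketch of this initial partitioning step follows: the top-K nodes by weight serve as seeds, and each partition is grown breadth-first from its seeds along outgoing edges. Reducing the query template to "follow outgoing edges" and balancing partitions by round-robin growth are assumptions made for brevity; the patent additionally matches chain- or fork-shaped template structures and balances seed weights against a preset threshold.

```python
import heapq
from collections import deque

def initial_partition(weights, out_edges, num_partitions, seeds_per_partition=1):
    """weights: node -> weight; out_edges: node -> list of (edge_id, target_node)."""
    k = num_partitions * seeds_per_partition
    seeds = heapq.nlargest(k, weights, key=weights.get)   # top-K nodes by weight
    partition, frontiers = {}, [deque() for _ in range(num_partitions)]
    for p in range(num_partitions):
        for s in seeds[p::num_partitions]:                # spread seeds over partitions
            partition[s] = p
            frontiers[p].append(s)
    grew = True
    while grew:
        grew = False
        for p, frontier in enumerate(frontiers):          # round-robin growth for balance
            if frontier:
                grew = True
                v = frontier.popleft()
                for _edge, target in out_edges.get(v, []):
                    if target not in partition:
                        partition[target] = p
                        frontier.append(target)
    return partition                                       # unreachable nodes stay unassigned
```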
303, for each partition of the knowledge graph, when determining that a second node exists in the partition, re-dividing the second node; wherein the semantic similarity between the second node and the partition is less than a similarity threshold.
Here, the semantic similarity between the second node and the n-th partition may be a statistical value of the semantic similarities between the second node and the nodes in the n-th partition, for example the average or the median of those semantic similarities, where n is a positive integer less than or equal to the number of partitions.
The similarity threshold may be determined based on expert experience or experimental data training, and is not particularly limited herein.
In some examples, the semantic similarity may be one of:
semantic similarity based on distance;
Semantic similarity based on attributes;
semantic similarity based on adaptive weighting.
1) Distance-based semantic similarity
According to the semantic hierarchy of the node, not only the weight information of the node in the semantic hierarchy of the node can be calculated, but also the semantic similarity between brother nodes can be obtained according to the inheritance relationship between the nodes.
The semantic distance between two nodes is calculated mainly through the depth of the nodes in the semantic hierarchy based on the semantic similarity of the distance. That is, the similarity of two nodes is proportional to the depth of their nearest common ancestor and inversely proportional to the respective depth.
In some examples, the distance-based semantic similarity may be calculated using equation (4) as follows:
where σ_D(v_i, v_j) is the distance-based semantic similarity between nodes v_i and v_j, NCA(v_i, v_j) denotes the nearest common ancestor of the two nodes in the semantic hierarchy, D_v denotes the path depth from a node to the root node, and D_NCA(v_i, v_j) denotes the path depth from the nearest common ancestor to the root node. If the nearest common ancestor of two nodes is the root node, their semantic similarity is 0.
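Formula (4) itself is not reproduced above; a Wu-Palmer-style sketch that is consistent with the stated properties (proportional to the depth of the nearest common ancestor, inversely proportional to the nodes' own depths, and 0 when the nearest common ancestor is the root) is given below as an assumption, not as the patent's exact formula:

```python
def depth(parent, v):
    """Path depth from node v to the root of the semantic hierarchy (root has depth 0)."""
    d = 0
    while v in parent:
        v, d = parent[v], d + 1
    return d

def nearest_common_ancestor(parent, a, b):
    ancestors = {a}
    while a in parent:
        a = parent[a]
        ancestors.add(a)
    while b not in ancestors:
        b = parent[b]
    return b

def distance_similarity(parent, root, v_i, v_j):
    """Assumed form of formula (4); parent maps every non-root node to its parent."""
    nca = nearest_common_ancestor(parent, v_i, v_j)
    if nca == root:
        return 0.0
    return 2 * depth(parent, nca) / (depth(parent, v_i) + depth(parent, v_j))
```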
2) Semantic similarity based on attributes
In the RDFS resource description language, nodes typically have attributes, and the attributes of nodes have inheritance relationships analogous to the semantic hierarchy of edges. For example, the attributes owned by a child node are typically inherited from those owned by its parent node, while the child node may also own attributes that the parent node does not. The semantic similarity between nodes can therefore be measured according to the number of attributes the nodes have in common.
In some examples, the attribute-based semantic similarity may be calculated using the following formula:
where σ_P(v_i, v_j) is the attribute-based semantic similarity between nodes v_i and v_j, δ(v) denotes the set of attributes owned by node v, α is a coefficient calculated from the depths of v_i and v_j, δ(v_i) ∩ δ(v_j) denotes the set of attributes owned by both v_i and v_j, and δ(v_i)/δ(v_j) denotes the set of attributes owned by v_i but not by v_j.
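The exact expressions of formulas (5) and (6) are not reproduced above; a Tversky-style sketch consistent with the described symbols (common attributes weighted against the attributes each node owns exclusively, with α derived from the node depths) is shown below as an assumption:

```python
def attribute_similarity(attrs, depth_of, v_i, v_j):
    """Assumed stand-in for formulas (5)/(6).
    attrs: node -> set of attribute names; depth_of: node -> depth in the node hierarchy."""
    common = attrs[v_i] & attrs[v_j]
    only_i = attrs[v_i] - attrs[v_j]
    only_j = attrs[v_j] - attrs[v_i]
    d_i, d_j = depth_of[v_i], depth_of[v_j]
    # alpha is computed from the two nodes' depths, as the description states;
    # this particular expression is an assumption.
    alpha = d_i / (d_i + d_j) if (d_i + d_j) else 0.5
    denom = len(common) + alpha * len(only_i) + (1 - alpha) * len(only_j)
    return len(common) / denom if denom else 0.0
```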
3) Semantic similarity based on adaptive weighting
Comprehensive semantic similarity algorithms in the related art can improve the accuracy of similarity calculation to a certain extent, but their weighting coefficients are usually determined by domain experts, which is somewhat subjective and case-specific and cannot be applied to semantic similarity calculation for RDF knowledge graphs in different fields; moreover, the actual structural information of the RDF knowledge graph is not incorporated into the similarity calculation, and formulas (4) and (6) only compute similarity at the level of the semantic hierarchy. In actual graph partitioning, the distance between nodes and the weights of the edges between nodes should also be considered.
In some examples, the adaptively weighted semantic similarity may be determined by weighted summing the distance-based semantic similarity with the attribute-based semantic similarity and combining the weighted sum with the weights of the edges between the nodes.
In some examples, the calculation formula for semantic similarity based on adaptive weighting is as follows:
where σ(v_i, v_j) is the adaptively weighted semantic similarity between nodes v_i and v_j, λ(v_i, v_j) is determined by the set of edges traversed from node v_i to node v_j and its value is the weight of the edges between the two nodes, taken as the average of the weights of the edges on the path, β is the weight corresponding to the distance-based semantic similarity σ_D(v_i, v_j) between the nodes, γ is the weight corresponding to the attribute-based semantic similarity σ_P(v_i, v_j), and β + γ = 1.
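Based only on the description above (a weighted sum of the distance-based and attribute-based similarities, combined with the average edge weight λ on the path between the two nodes, with β + γ = 1), a hedged sketch of formula (7) follows; treating the combination as a product and leaving β as a parameter are assumptions, and how β is adapted per knowledge graph is not specified in this excerpt:

```python
def adaptive_similarity(v_i, v_j, sigma_d, sigma_p, path_edge_weights, beta=0.5):
    """Assumed form of formula (7): lambda * (beta * sigma_D + gamma * sigma_P), beta + gamma = 1.
    sigma_d / sigma_p: callables returning the distance- and attribute-based similarities.
    path_edge_weights: weights of the edges on a path from v_i to v_j (may be empty)."""
    gamma = 1.0 - beta
    lam = sum(path_edge_weights) / len(path_edge_weights) if path_edge_weights else 0.0
    return lam * (beta * sigma_d(v_i, v_j) + gamma * sigma_p(v_i, v_j))
```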
In the embodiment, the self-adaptive weighted semantic similarity calculation can be adopted in the repartitioning stage, so that the method is applicable to the semantic similarity calculation of knowledge maps in different fields, and the generalization capability is improved.
It should be noted that the division of the knowledge graph requires multiple iterations. In each iteration, the partition information of each partition of the current round is determined according to the partition result of the previous iteration, the structural information of the query template and the structural information of the knowledge graph; for each partition of the current round, when it is determined that a second node exists in the partition, the second node is re-divided to obtain the partition result of the current round.
For example, assume that the knowledge graph contains 100 nodes, the number of partitions is 2, and the top-K nodes by weight (with K = 4) are nodes v1 to v4 in order. Nodes v1 to v4 are each used as initial nodes, and the knowledge graph is initially partitioned into two partitions by combining the structural information of the query template and the structural information of the knowledge graph. Suppose the partition result is: partition 1 contains nodes v1, v3, v5 and v6, and partition 2 contains nodes v2, v4, v7 and v8; the semantic similarity between node v5, a leaf node of partition 1, and partition 1 is smaller than the similarity threshold, and the semantic similarity between node v7, a leaf node of partition 2, and partition 2 is smaller than the similarity threshold, so nodes v5 and v7 are re-divided. If node v5 is re-divided into partition 2 and node v7 into partition 1, the partition result of this round of iteration is: partition 1 contains nodes v1, v3, v7 and v6, and partition 2 contains nodes v2, v4, v5 and v8.
In some examples, after a plurality of iterations, when the semantic similarity between a leaf node in a partition and the partition is greater than or equal to a similarity threshold, the leaf node is not repartitioned, so that the possibility that the repartition falls into local optimum can be effectively reduced.
In one embodiment, as shown in fig. 4, for each partition of the knowledge-graph in step 303, when determining that a second node exists in the partition, the repartitioning the second node includes:
and 401, calculating the semantic similarity between the leaf nodes in each partition and the partition where the leaf nodes are located through the slave nodes in the distributed system in parallel, and determining whether a second node exists in the leaf nodes in each partition.
Wherein different slave nodes are used for storing different partitions of the knowledge graph.
In this embodiment, the master node in the distributed system may send the partition information of each partition of the knowledge graph to the slave node in the distributed system, and the slave node calculates the semantic similarity between the leaf node in each partition and the partition in parallel, and compares the calculated semantic similarity with the similarity threshold value to determine whether there is a second node whose semantic similarity is smaller than the similarity threshold value in the leaf nodes in the partition.
When any slave node determines that a second node exists in the stored partition, the slave node sends node information of the second node to the master node so that the master node performs repartition of the second node.
Illustratively, the semantic similarity of leaf nodes within each partition to that partition is calculated in parallel by the slave node as shown in equation (8):
where P_t denotes partition t, v is a leaf node within partition t, sim(v, P_t) denotes the semantic similarity between node v and partition t, and |P_t| denotes the number of nodes in partition t; that is, the semantic similarity between node v and partition t is the average of the semantic similarities between node v and each node in partition t, sim(v, P_t) = (1/|P_t|) Σ_{u∈P_t} σ(v, u).
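A direct sketch of formula (8), averaging the pairwise semantic similarity between a leaf node and every node in its partition (the pairwise similarity function, for example the adaptively weighted similarity above, is passed in):

```python
def partition_similarity(v, partition_nodes, pairwise_similarity):
    """Formula (8): sim(v, P_t) = (1 / |P_t|) * sum of sigma(v, u) over all u in P_t."""
    if not partition_nodes:
        return 0.0
    return sum(pairwise_similarity(v, u) for u in partition_nodes) / len(partition_nodes)

# A slave node would then flag "second nodes" whose similarity falls below the threshold:
# second_nodes = [v for v in leaf_nodes if partition_similarity(v, part, sim) < threshold]
```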
And 402, when receiving node information of a second node sent by a first slave node, a master node in the distributed system sends the node information of the second node to a plurality of second slave nodes so as to acquire semantic similarity between the second node and partitions stored by each second slave node.
Here, the second slave node is a slave node different from the first slave node in the distributed system.
403, repartitioning the second node by the master node according to the maximum value of the semantic similarities corresponding to the second node.
In this embodiment, the master node may repartition the second node into the partition corresponding to the largest semantic similarity according to the largest value among the plurality of semantic similarities corresponding to the second node.
In one embodiment, in step 103, according to the semantic similarity between the node to be divided in the knowledge graph and each partition of the knowledge graph, dividing the node to be divided into the target partitions includes:
Determining the maximum value in semantic similarity between the node to be divided and each partition; and taking the partition corresponding to the maximum value in the plurality of semantic similarities as a target partition, and dividing the node to be divided into the target partition.
For example, if the partitions of the knowledge graph include partition 1, partition 2, partition 3, and partition 4, when it is determined that the semantic similarity between the node to be divided and partition 1 is the greatest according to the semantic similarity between the node to be divided and each partition, the node to be divided is divided into partition 1.
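Reusing the partition_similarity sketch above, assigning a node to be divided (a low-similarity leaf node or a newly added node) reduces to an argmax over the partitions; a minimal sketch:

```python
def assign_to_partition(node, partitions, pairwise_similarity):
    """partitions: partition id -> set of nodes currently in that partition.
    Returns the id of the partition with the largest semantic similarity to `node`."""
    return max(
        partitions,
        key=lambda p: partition_similarity(node, partitions[p], pairwise_similarity),
    )

# A newly added node can thus be placed in real time:
# partitions[assign_to_partition(new_node, partitions, similarity_fn)].add(new_node)
```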
The technical solutions provided in the present disclosure are further described in detail below with reference to fig. 5 and 6.
The knowledge graph processing method provided by the disclosure can be applied to the division of large-scale RDF knowledge graphs. The method mainly comprises the steps of data preprocessing, partition initialization, semantic similarity calculation, repartition and the like.
1. Data preprocessing
When the node weight is initialized, the semantic information of the node and the semantic information of the edge are integrated.
1) Edge weight initialization
As can be seen from the edge semantic information shown in fig. 5, there is a triple (dbo:champion rdfs:subPropertyOf owl:hasParticipant), which means: if someone is the champion of a sport, it can be deduced that this person is a participant in that sport; that is, inheritance relationships exist between edges in the RDF knowledge graph, so the weights of the edges cannot simply all be initialized to 1 or to the same value. The edges existing in different RDF knowledge graphs differ, so for different RDF knowledge graph divisions, the edges existing in the current RDF knowledge graph and the association relationships between them need to be sorted out first into the tree structure shown in fig. 5, namely the semantic hierarchy of edges; the edges are then initialized according to the weight of each edge in the current semantic hierarchy using calculation formula (1).
For example, the current edge is named owl:hasParticipant and contains 2 child nodes, dbo:champion and dbo:olympicOathSwornBy. Assuming that the number of current edges (i.e. the number of such edges in the knowledge graph) is 20 and the total number of edges in the semantic hierarchy of edges is 100, the initialization weight of the current edge calculated with formula (1) is (20 + 2) / 100, i.e. 0.22.
2) Weight initialization of nodes
As can be seen from the node semantic information shown in fig. 5, there is a triple (dbo:Software rdfs:subClassOf dbo:Work), which means: if a thing belongs to the software class, it can be deduced that the thing belongs to the work class; that is, inheritance relationships also exist between entities in the RDF knowledge graph, so the weights of the nodes cannot be uniformly initialized to 1 for calculation. The nodes existing in different RDF knowledge graphs differ, so the node types existing in the current RDF knowledge graph and the association relationships between the nodes need to be sorted out first into the tree structure shown in fig. 5, namely the semantic hierarchy of nodes; the nodes are then initialized according to the weight of each node in the current semantic hierarchy using calculation formula (2).
For example, the current node is named owl:Thing and contains 3 child nodes, dbo:Place, dbo:Species and dbo:Work. Assuming that the number of current nodes (i.e. the number of such nodes in the knowledge graph) is 30 and the total number of nodes in the semantic hierarchy of nodes is 100, the initialization weight of the current node calculated with formula (2) is (30 + 3) / 100, i.e. 0.33.
3) Weight calculation of nodes
In the RDF knowledge graph query process, a node is often used as the query starting point and an outgoing edge as the query direction; that is, the weight of a source node can depend on the number of its outgoing edges and the weights of the target nodes.
After the initializing weights of the edges and the nodes are calculated according to the semantic hierarchy, the weights of the nodes are recalculated by using the calculation formula (3).
In the calculation process, the nodes whose outgoing edge set is empty are first selected for weight assignment; the source nodes of the incoming edges of those nodes are then calculated, and when a node being calculated has an outgoing edge whose target node weight has not yet been assigned, the calculation waits until that target node has been assigned.
2. Initializing partitions
As shown in fig. 6, after the weight calculation of the nodes is completed, the nodes of the weight TopK are selected for partition initialization according to the number of partitions actually required.
With the top-K nodes as starting points, the corresponding edge structures are found for partitioning in combination with the query template shown in fig. 5. When the partitions are initialized, reducing the number of crossing edges is used as the dividing criterion, and the partitions are subsequently re-divided according to semantic similarity.
3. Semantic similarity calculation
Under the initial partitioning of nodes and query templates according to the weight TopK, information of each partition is sent to each computing slave node in the distributed computing framework in FIG. 5, and semantic similarity between leaf nodes in each partition and the partition is calculated in parallel by the slave nodes.
The semantic similarity between a leaf node in any partition and that partition may be an average of the semantic similarity between the leaf node and each node in that partition.
Here, the similarity may be a semantic similarity based on adaptive weighting, that is, the distance-based semantic similarity is calculated by using the above calculation formula (4), the attribute-based semantic similarity is calculated by using the above calculation formulas (5) and (6), and then the adaptive weighting-based semantic similarity is calculated by using the above calculation formula (7). After the semantic similarity between the leaf node and each node in the partition is calculated, calculating by using the calculation formula (8) to obtain an average value of the semantic similarity between the leaf node and each node in the partition, namely, the semantic similarity between the leaf node and the partition.
4. Repartitioning
After the semantic similarity calculation in each round of iteration is completed, if the semantic similarity is smaller than a similarity threshold, the slave node sends node information of nodes with the semantic similarity smaller than the similarity threshold to the master node, the master node reclassifies, and sends the division result to each node to enter the next round of calculation.
The primary node repartitioning process comprises the following steps:
the master node sends the received node information of the nodes to each slave node so as to acquire semantic similarity between the node calculated by each slave node and the partition stored by each slave node, and the master node reclassifies the nodes according to the maximum value in the semantic similarity and combines the structural information of the query template and the structural information of the knowledge graph to obtain a partitioning result.
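The master/slave exchange described above can be summarized with the following hedged sketch, reusing the partition_similarity and assign_to_partition sketches from earlier; the message passing is simulated as plain function calls, and the leaf_nodes_of helper is an assumption standing in for the partition-boundary vertex sets:

```python
def repartition_round(partitions, leaf_nodes_of, pairwise_similarity, threshold):
    """One round: each slave scans the leaf nodes of its own partition and reports
    low-similarity nodes; the master moves each reported node to the partition with
    the highest similarity. partitions: partition id -> set of nodes."""
    reported = []
    for p, nodes in partitions.items():                      # slaves, in parallel in practice
        for v in list(leaf_nodes_of(p, nodes)):
            if partition_similarity(v, nodes - {v}, pairwise_similarity) < threshold:
                reported.append((v, p))                       # node information sent to master
    for v, src in reported:                                   # master re-divides reported nodes
        target = assign_to_partition(v, partitions, pairwise_similarity)
        if target != src:
            partitions[src].discard(v)
            partitions[target].add(v)
    return bool(reported)                                      # True if another round is needed
```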
In order to avoid the repartition from falling into local optimum, the scheme supports threshold control on the semantic similarity, and when the semantic similarity of leaf nodes in the partition is greater than or equal to a similarity threshold after a certain number of iterations, the node is not repartitioned.
The technical scheme provided by the embodiment of the disclosure has at least the following beneficial effects:
1. Semantic hierarchy information of the RDF knowledge graph is integrated into the weight initialization of the nodes and the edges, so that the semantic information of nodes and edges can be distinguished for different RDF knowledge graphs.
2. By integrating the weights of the edges into the weights of the nodes and, during initial partitioning, combining the structural information of the common knowledge-graph query templates contained in different knowledge graphs, the high communication cost and long query processing time brought by cross-partition queries are reduced at the query level, so that query efficiency is improved.
3. In the repartitioning process, the semantic similarity between each leaf node in a partition and that partition is calculated, and nodes are migrated according to the semantic similarity. When new nodes and edges are added to the partitioned RDF knowledge graph, they can be divided in real time by calculating their semantic similarity to each partition, which improves efficiency. In addition, the adaptive weighted semantic similarity calculation scheme can meet the different weighting requirements that different knowledge graphs place on distance and attributes. Furthermore, a similarity threshold can be set in the iterative process, which avoids getting trapped in a local optimum and reduces communication cost and calculation time.
Fig. 7 shows a block diagram of a knowledge graph dividing apparatus according to an embodiment of the present disclosure. As shown in fig. 7, the knowledge-graph dividing apparatus 100 includes: a determining module 110 and a dividing module 120, wherein:
a determining module 110, configured to determine a weight of a node in a knowledge-graph according to semantic information of the node in the knowledge-graph and semantic information of an edge between the nodes;
the dividing module 120 is configured to partition the knowledge graph according to the weight of the nodes in the knowledge graph and the structural information of the knowledge graph;
the dividing module 120 is further configured to divide the node to be divided into target partitions according to semantic similarity between the node to be divided in the knowledge graph and each partition of the knowledge graph.
In one embodiment, the determining module 110 is configured to:
determining the initialization weight of each edge according to the semantic information of the edges in the knowledge graph;
determining the initialization weight of each node according to the semantic information of the node in the knowledge graph;
and determining the weight of each node according to the initialization weight of each edge and the initialization weight of each node.
In one embodiment, the determining module 110 is configured to:
determining a first semantic hierarchy according to attribute information of edges in the knowledge graph and association relations among different edges;
a weight of each of the edges in the first semantic hierarchy is determined.
In one embodiment, the determining module 110 is configured to:
determining a second semantic hierarchy according to attribute information of nodes in the knowledge graph and association relations among the nodes;
And determining the weight of each node in the second semantic hierarchy.
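The initialization formula itself does not appear in this part of the description, so the sketch below only illustrates the idea with an assumed rule: an element that sits deeper in its semantic hierarchy is more specific and is given a larger initialization weight. The hierarchy representation (a parent map) and the depth-based rule are assumptions, and the same sketch applies to the first semantic hierarchy of edges and the second semantic hierarchy of nodes.

```python
def hierarchy_depth(parent_of, item):
    """parent_of: dict mapping an element (node or edge label) to its parent in the
    semantic hierarchy; top-level elements are absent or mapped to None."""
    depth, current = 0, item
    while parent_of.get(current) is not None:
        depth += 1
        current = parent_of[current]
    return depth

def initialization_weights(parent_of, items):
    # Assumed rule: deeper (more specific) elements receive larger initialization
    # weights, normalized to the range (0, 1].
    depths = {item: hierarchy_depth(parent_of, item) for item in items}
    max_depth = max(depths.values(), default=0)
    return {item: (depth + 1) / (max_depth + 1) for item, depth in depths.items()}
```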
In one embodiment, the determining module 110 is configured to:
for each of the nodes, performing the following operations:
when the set of outgoing edges of the node is not empty, determining the weight of the node according to the initialization weight of the node, the initialization weights of the outgoing edges of the node, and the weights of the target nodes corresponding to the outgoing edges;
and when the set of outgoing edges of the node is empty, determining the initialization weight of the node as the weight of the node.
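The exact combination rule is likewise not given in this part of the description, so the following sketch assumes one plausible form: the weight of a node is its own initialization weight plus, for each outgoing edge, the edge's initialization weight multiplied by the weight of the edge's target node, while a node with no outgoing edges keeps its initialization weight. The formula, the data structures, and the acyclicity of the followed out-edge structure are all assumptions for illustration.

```python
def node_weight(node, init_node_w, init_edge_w, out_edges, cache=None):
    """Compute a node weight from initialization weights.

    out_edges: dict mapping a node to a list of (edge_id, target_node) pairs.
    init_node_w / init_edge_w: initialization weights of nodes and edges.
    Assumes the followed out-edge structure contains no cycles; results are memoized."""
    if cache is None:
        cache = {}
    if node in cache:
        return cache[node]
    edges = out_edges.get(node, [])
    if not edges:
        weight = init_node_w[node]  # empty out-edge set: keep the initialization weight
    else:
        # Assumed rule: add each outgoing edge's contribution, weighted by the
        # initialization weight of the edge and the weight of its target node.
        weight = init_node_w[node] + sum(
            init_edge_w[edge_id] * node_weight(target, init_node_w, init_edge_w, out_edges, cache)
            for edge_id, target in edges)
    cache[node] = weight
    return weight
```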
In one embodiment, the dividing module 120 is configured to:
and partitioning the knowledge graph according to the weight of the nodes in the knowledge graph, the structural information of the knowledge graph and the structural information of a preset query template.
In one embodiment, the dividing module 120 is configured to:
selecting a plurality of first nodes according to the preset partition number and the ordering of the nodes from high to low according to the weight;
taking each first node as an initial node, and initializing and partitioning the knowledge graph by combining the structural information of the knowledge graph and the structural information of a preset query template;
For each partition of the knowledge graph, when a second node exists in leaf nodes in the partition, the second node is reclassified; wherein the semantic similarity between the second node and the partition is less than a similarity threshold.
In one embodiment, the dividing module 120 is configured to:
calculating the semantic similarity between the leaf nodes in each partition and the partition where the leaf nodes are located through the slave nodes in the distributed system in parallel, and determining whether the second node exists in the leaf nodes in each partition;
when receiving node information of a second node sent by a first slave node, a master node in the distributed system sends the node information of the second node to a plurality of second slave nodes so as to acquire semantic similarity between the second node and partitions stored by each second slave node;
and the second node is reclassified through the master node according to the maximum value of the semantic similarity corresponding to the second node.
In one embodiment, the dividing module 120 is configured to:
determining the maximum value in semantic similarity between the nodes to be divided and each partition;
and taking the partition corresponding to the maximum value among the plurality of semantic similarities as the target partition, and dividing the node to be divided into the target partition.
In one embodiment, for any one of the partitions of the knowledge graph, the semantic similarity between a node in the knowledge graph and the partition is a statistical value of the semantic similarities between that node and the nodes in the partition.
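For the dynamic case emphasized by this scheme, the brief sketch below shows how a newly added node could be placed: compute its semantic similarity to every existing partition (for example, the average described above) and choose the partition with the maximum value. The `partition_similarity` callable is an assumed stand-in for that per-partition statistic.

```python
def assign_new_node(node, partitions, partition_similarity):
    """Place a newly added node into the partition with the highest semantic similarity.

    partitions: dict mapping partition_id -> set of node ids.
    partition_similarity(node, nodes): semantic similarity between the node and a
    partition, e.g. the average over that partition's nodes."""
    best_pid = max(partitions,
                   key=lambda pid: partition_similarity(node, partitions[pid]))
    partitions[best_pid].add(node)
    return best_pid
```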
It should be noted that: the knowledge graph dividing apparatus provided in the above embodiment is illustrated only by the division of the above program modules when implementing the knowledge graph dividing method; in practical applications, the above processing may be allocated to different program modules as needed, that is, the internal structure of the apparatus may be divided into different program modules to complete all or part of the processing described above. In addition, the knowledge graph dividing apparatus provided in the above embodiment belongs to the same concept as the embodiments of the corresponding method; its specific implementation process is detailed in the method embodiments and is not repeated here.
FIG. 8 is a block diagram of a computer device according to an embodiment of the present disclosure; as shown in fig. 8, the computer device 900 includes: a processor 901 and a memory 902 for storing a computer program capable of running on the processor; the processor 901 is configured to implement the steps of the knowledge graph dividing method when running the computer program.
In actual use, the computer device 900 may also include: at least one network interface 903. The various components in the computer device 900 are coupled together by a bus system 904. It is appreciated that the bus system 904 is used to realize connection and communication between these components. In addition to a data bus, the bus system 904 includes a power bus, a control bus, and a status signal bus. However, for clarity of illustration, the various buses are all labeled as the bus system 904 in fig. 8. The number of processors 901 may be at least one. The network interface 903 is used for wired or wireless communication between the computer device 900 and other devices.
The memory 902 in the disclosed embodiments is used to store various types of data to support the operation of the computer device 900.
The method disclosed in the embodiments of the present disclosure may be applied to the processor 901 or implemented by the processor 901. The processor 901 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 901 or by instructions in the form of software. The processor 901 may be a general purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 901 may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present disclosure. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present disclosure may be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium, the storage medium being located in the memory 902; the processor 901 reads the information in the memory 902 and completes the steps of the foregoing method in combination with its hardware.
In an exemplary embodiment, the computer device 900 can be implemented by one or more application specific integrated circuits (ASIC, Application Specific Integrated Circuit), DSPs, programmable logic devices (PLD, Programmable Logic Device), complex programmable logic devices (CPLD, Complex Programmable Logic Device), field-programmable gate arrays (FPGA, Field-Programmable Gate Array), general purpose processors, controllers, microcontrollers (MCU, Micro Controller Unit), microprocessors (Microprocessor), or other electronic components for performing the aforementioned methods.
The disclosed embodiments also provide a computer-readable storage medium having a computer program stored thereon; the computer program, when executed by a processor, implements the steps of the aforementioned knowledge graph dividing method.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The device embodiments described above are only illustrative; for example, the division of units is only a logical function division, and there may be other divisions in actual implementation, such as: multiple units or components may be combined or may be integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be completed by hardware related to program instructions; the foregoing program may be stored in a computer-readable storage medium, and when executed, the program performs the steps including those of the above method embodiments; the aforementioned storage medium includes: a removable storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
Alternatively, the above-described integrated units of the present disclosure may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solutions of the embodiments of the present disclosure may be embodied, in essence or in the part contributing to the prior art, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic disk, an optical disk, or other media capable of storing program code.
The foregoing is merely specific embodiments of the disclosure, but the protection scope of the disclosure is not limited thereto; any changes or substitutions that a person skilled in the art can easily conceive of within the technical scope of the disclosure shall be covered by the protection scope of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (12)

1. A knowledge graph dividing method, characterized by comprising the following steps:
determining the weight of the nodes in the knowledge graph according to the semantic information of the nodes in the knowledge graph and the semantic information of edges between the nodes;
partitioning the knowledge graph according to the weight of the nodes in the knowledge graph and the structural information of the knowledge graph;
and dividing the nodes to be divided into target partitions according to semantic similarity between the nodes to be divided in the knowledge graph and each partition of the knowledge graph.
2. The method of claim 1, wherein determining the weights of the nodes in the knowledge-graph based on semantic information of the nodes in the knowledge-graph and semantic information of edges between the nodes comprises:
determining the initialization weight of each edge according to the semantic information of the edges in the knowledge graph;
determining the initialization weight of each node according to the semantic information of the node in the knowledge graph;
and determining the weight of each node according to the initialization weight of each edge and the initialization weight of each node.
3. The method of claim 2, wherein determining the initialization weight for each edge according to the semantic information of the edge in the knowledge-graph comprises:
Determining a first semantic hierarchy according to attribute information of edges in the knowledge graph and association relations among different edges;
a weight of each of the edges in the first semantic hierarchy is determined.
4. The method of claim 2, wherein determining the initialization weight of each node according to the semantic information of the node in the knowledge-graph comprises:
determining a second semantic hierarchy according to attribute information of nodes in the knowledge graph and association relations among the nodes;
and determining the weight of each node in the second semantic hierarchy.
5. The method of claim 2, wherein determining the weight of each of the nodes based on the initialization weight of each of the edges and the initialization weight of each of the nodes comprises:
for each of the nodes, performing the following operations:
when the set of outgoing edges of the node is not empty, determining the weight of the node according to the initialization weight of the node, the initialization weights of the outgoing edges of the node, and the weights of the target nodes corresponding to the outgoing edges;
and when the set of outgoing edges of the node is empty, determining the initialization weight of the node as the weight of the node.
6. The method according to claim 1, wherein the partitioning the knowledge-graph according to the weights of the nodes in the knowledge-graph and the structural information of the knowledge-graph includes:
selecting a plurality of first nodes according to the preset partition number and the ordering of the nodes from high to low according to the weight;
taking each first node as an initial node, and initializing and partitioning the knowledge graph by combining the structural information of the knowledge graph and the structural information of a preset query template;
for each partition of the knowledge graph, when a second node exists in leaf nodes in the partition, the second node is reclassified; wherein the semantic similarity between the second node and the partition is less than a similarity threshold.
7. The method of claim 6, wherein for each partition of the knowledge-graph, upon determining that a second node exists in leaf nodes within the partition, repartitioning the second node comprises:
calculating the semantic similarity between the leaf nodes in each partition and the partition where the leaf nodes are located through the slave nodes in the distributed system in parallel, and determining whether the second node exists in the leaf nodes in each partition;
When receiving node information of a second node sent by a first slave node, a master node in the distributed system sends the node information of the second node to a plurality of second slave nodes so as to acquire semantic similarity between the second node and partitions stored by each second slave node;
and the second node is reclassified through the master node according to the maximum value of the semantic similarity corresponding to the second node.
8. The method according to claim 1, wherein the dividing the node to be divided into the target partitions according to semantic similarity between the node to be divided in the knowledge-graph and each partition of the knowledge-graph includes:
determining the maximum value in semantic similarity between the nodes to be divided and each partition;
and taking the partition corresponding to the maximum value among the plurality of semantic similarities as the target partition, and dividing the node to be divided into the target partition.
9. The method according to any one of claims 1 to 8, wherein for any one of the partitions of the knowledge-graph, the semantic similarity between a node in the knowledge-graph and the partition is a statistical value of the semantic similarities between that node and the nodes in the partition.
10. A knowledge-graph dividing apparatus, characterized in that the apparatus comprises:
the determining module is used for determining the weight of the nodes in the knowledge graph according to the semantic information of the nodes in the knowledge graph and the semantic information of edges between the nodes;
the dividing module is used for dividing the knowledge graph in a partitioning way according to the weight of the nodes in the knowledge graph and the structural information of the knowledge graph;
the dividing module is further configured to divide the node to be divided into target partitions according to semantic similarity between the node to be divided in the knowledge graph and each partition of the knowledge graph.
11. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the knowledge graph dividing method of any one of claims 1 to 9 when executing the program.
12. A computer readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the knowledge graph dividing method of any one of claims 1 to 9.
CN202211425870.4A 2022-11-14 2022-11-14 Knowledge graph dividing method, device, equipment and storage medium Pending CN116775893A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211425870.4A CN116775893A (en) 2022-11-14 2022-11-14 Knowledge graph dividing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211425870.4A CN116775893A (en) 2022-11-14 2022-11-14 Knowledge graph dividing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116775893A true CN116775893A (en) 2023-09-19

Family

ID=87990262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211425870.4A Pending CN116775893A (en) 2022-11-14 2022-11-14 Knowledge graph dividing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116775893A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117725555A (en) * 2024-02-08 2024-03-19 暗物智能科技(广州)有限公司 Multi-source knowledge tree association fusion method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination