CN114625811A

CN114625811A - Method and system for improving sub-graph matching efficiency

Info

Publication number: CN114625811A
Application number: CN202210529123.9A
Authority: CN
Inventors: 游东海
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2022-05-16
Filing date: 2022-05-16
Publication date: 2022-06-14
Anticipated expiration: 2042-05-16
Also published as: CN114625811B

Abstract

Embodiments of this specification disclose a method and a system for improving subgraph matching efficiency. The method comprises: acquiring a subgraph matching task, wherein the subgraph matching task comprises one or more matching units and one or more attribute constraints, the one or more matching units are obtained by splitting a query graph, each attribute constraint corresponds to a graph element, and the graph element is a node or an edge; determining the matching units corresponding to the one or more attribute constraints and the constraint strengths thereof; and determining, based on the matching units corresponding to the one or more attribute constraints and the constraint strengths thereof, the matching and connection order of the matching units when the subgraph matching task is executed, wherein the order is used to guide a graph computation engine in executing the subgraph matching task on a data graph so as to obtain instances satisfying the query graph relationship.

Description

Method and system for improving sub-graph matching efficiency

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a method and a system for improving sub-graph matching efficiency.

Background

A knowledge graph (or simply a graph) uses visualization technology to describe knowledge resources and their carriers. It aims to describe the various objects that exist in the real world and the relationships among them, and it forms a huge semantic network graph.

A knowledge graph has a complex structure, diverse attribute types, and multi-level learning tasks, and making full use of it helps to better solve problems in various applications. Subgraph matching is a basic application of knowledge graphs; its goal is to match, from a data graph, subgraphs that conform to a certain pattern (e.g., a query graph).

The embodiment of the specification provides a method and a system for improving sub-graph matching efficiency, and aims to improve the sub-graph matching efficiency.

Disclosure of Invention

One aspect of embodiments of the present specification provides a method for improving subgraph matching efficiency. The method comprises: acquiring a subgraph matching task, wherein the subgraph matching task comprises one or more matching units and one or more attribute constraints, the one or more matching units are obtained by splitting a query graph, each attribute constraint corresponds to a graph element, and the graph element is a node or an edge; determining the matching units corresponding to the one or more attribute constraints and the constraint strengths thereof; and determining, based on the matching units corresponding to the one or more attribute constraints and the constraint strengths thereof, the matching and connection order of the matching units when the subgraph matching task is executed, wherein the order is used to guide a graph computation engine in executing the subgraph matching task on a data graph so as to obtain instances satisfying the query graph relationship.

Another aspect of embodiments of the present specification provides a system for improving subgraph matching efficiency. The system comprises: a subgraph matching task acquisition module, configured to acquire a subgraph matching task, wherein the subgraph matching task comprises one or more matching units and one or more attribute constraints, the one or more matching units are obtained by splitting a query graph, each attribute constraint corresponds to a graph element, and the graph element is a node or an edge; a constraint strength determination module, configured to determine the matching units corresponding to the one or more attribute constraints and the constraint strengths thereof; and a matching connection order determination module, configured to determine, based on the matching units corresponding to the one or more attribute constraints and the constraint strengths thereof, the matching and connection order of the matching units when the subgraph matching task is executed, wherein the order is used to guide a graph computation engine in executing the subgraph matching task on a data graph so as to obtain instances satisfying the query graph relationship.

Another aspect of embodiments of the present specification provides an apparatus for improving subgraph matching efficiency comprising at least one storage medium and at least one processor, the at least one storage medium storing computer instructions; the at least one processor is configured to execute the computer instructions to implement a method of improving subgraph matching efficiency.

Another aspect of the embodiments of the present specification provides a computer-readable storage medium storing computer instructions, and when the computer reads the computer instructions in the storage medium, the computer executes a method for improving sub-graph matching efficiency.

Drawings

The present description will be further explained by way of exemplary embodiments, which will be described in detail by way of the accompanying drawings. These embodiments are not intended to be limiting, and in these embodiments like numerals are used to indicate like structures, wherein:

FIG. 1 is an exemplary diagram of subgraph splitting according to some embodiments of the present description;

FIG. 2 is an exemplary flow diagram of a method for improving subgraph matching efficiency in accordance with some embodiments described herein;

FIG. 3 is an exemplary flow diagram illustrating determining a constraint strength of an attribute constraint according to some embodiments of the present description;

FIG. 4 is an exemplary block diagram of a system for improving subgraph matching efficiency in accordance with some embodiments described herein.

Detailed Description

In order to more clearly illustrate the technical solutions of the embodiments of the present specification, the drawings used in the description of the embodiments will be briefly described below. It is obvious that the drawings in the following description are only examples or embodiments of the present description, and that for a person skilled in the art, the present description can also be applied to other similar scenarios on the basis of these drawings without inventive effort. Unless otherwise apparent from the context, or otherwise indicated, like reference numbers in the figures refer to the same structure or operation.

It should be understood that "system", "device", "unit" and/or "module" as used herein are a way of distinguishing different components, elements, parts, portions or assemblies at different levels. However, these words may be replaced by other expressions if those expressions accomplish the same purpose.

As used in this specification and the appended claims, the terms "a," "an," and/or "the" do not refer specifically to the singular and may also include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; these steps and elements do not form an exclusive list, and a method or apparatus may include other steps or elements.

Flowcharts are used in this specification to illustrate the operations performed by the system according to embodiments of the present specification. It should be understood that the preceding or following operations are not necessarily performed in the exact order shown. Rather, the various steps may be processed in reverse order or simultaneously. Meanwhile, other operations may be added to the processes, or one or more steps may be removed from the processes.

A knowledge graph is a semantic network that reveals relationships between entities (also known as objects). Nodes in the graph represent entities. There may be multiple types of nodes, called node types, that indicate various types of entities. The edges in the graph represent relationships, and there may be multiple types of edges, called edge types, that indicate various types of relationships. An entity may refer to something in the real world, such as a person, a place name, a concept, a medicine, a company, and so forth. Relationships express connections between different entities, e.g., Zhang San and Li Si are "friends," a social account has a login relationship with a mobile terminal, and so on.

The knowledge-graph may be a directed graph or an undirected graph, i.e., edges in the knowledge-graph may be directed or undirected. The directional edges may be unidirectional or bidirectional to indicate the directionality of the relationship. When the knowledge-graph is an undirected graph, edges may indicate that a relationship has no directionality or that a relationship is bidirectional (e.g., a "friend" relationship). An edge that points to a node may be referred to as an in-edge of the node, and an edge that points from a node (i.e., points to other nodes) may be referred to as an out-edge of the node.

An instance of a knowledge graph may be referred to as a data graph (or as graph data; it may also simply be referred to as a knowledge graph where this causes no confusion). The data graph contains specific knowledge data (also referred to as instance data, including node instance data and edge instance data), and each piece of knowledge may be represented as a triple that contains two entities and the relationship between them. For example, in a social network graph, there may be both "person" entities, such as Zhang San and Li Si, and "company" entities, such as company A and company B. The relationship between two people may be "friends" or "colleagues", and the relationship between a person and a company may be "employed at" or "formerly employed at". Relationships (edges) may have directionality; for example, a "friends" relationship may be bidirectional, and an "employed at" or "formerly employed at" relationship may be unidirectional.

Knowledge graphs may include homogeneous graphs and heterogeneous graphs. A homogeneous graph has only one type of node and one type of relationship; for example, all nodes are of the type "person" and all relationships among the nodes are "friends". A heterogeneous graph has more than one type of node or relationship; for example, the node types include people, accounts, terminals, etc., and the relationships between the nodes include "friends", "registration time", "terminal type", etc. In practice, most knowledge graphs use heterogeneous attribute graphs. Compared with a heterogeneous graph, the graph data of a heterogeneous attribute graph also includes additional attribute information. The nodes and relationships in a heterogeneous attribute graph carry labels and attributes. A label refers to the type of the node or relationship; for example, the type of a certain node is "user". Attributes are additional descriptive information about the node or relationship; for example, a "user" node may have attributes such as "name", "registration time", "registration address", and "age".
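The distinction between labels and attributes can be illustrated with a minimal sketch. The class names `NodeInstance` and `EdgeInstance`, the identifiers, and the sample data are all hypothetical and are not taken from the specification; they only show one possible in-memory shape for instance data of a heterogeneous attribute graph.

```python
from dataclasses import dataclass, field


@dataclass
class NodeInstance:
    node_id: str
    label: str                                   # node type, e.g. "user"
    attrs: dict = field(default_factory=dict)    # extra descriptive attributes


@dataclass
class EdgeInstance:
    src: str
    dst: str
    label: str                                   # edge type, e.g. "friend"
    attrs: dict = field(default_factory=dict)


# Hypothetical sample data.
nodes = [
    NodeInstance("u1", "user", {"name": "Zhang San", "age": 30}),
    NodeInstance("u2", "user", {"name": "Li Si", "age": 45}),
    NodeInstance("c1", "company", {"name": "Company A"}),
]
edges = [
    EdgeInstance("u1", "u2", "friend"),
    EdgeInstance("u1", "c1", "employed_at", {"since": 2020}),
]

# A graph is heterogeneous when it has more than one node type or edge type.
node_types = {n.label for n in nodes}
edge_types = {e.label for e in edges}
is_heterogeneous = len(node_types) > 1 or len(edge_types) > 1
```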

A knowledge graph may have a corresponding schema definition. The schema definition, which may also be referred to as knowledge graph ontology definition data, may define the type of any node in the knowledge graph, as well as the types of edges that may exist between two nodes of given types. For example, the knowledge graph ontology definition data defines a "user" node, and instances of the "user" node in the corresponding data graph may be Zhang San, Li Si, Wang Wu, and the like. It should be noted that the knowledge graph referred to in the embodiments of the present specification generally refers to a heterogeneous attribute graph, that the nodes and edges referred to are concepts at the schema level, and that their instance data is referred to as node instance data or edge instance data, or simply as instance data or an instance where this causes no confusion.

Knowledge graph-based data queries have many applications, such as search, recommendation, intelligent question answering, and graph feature mining. When graph data is used in such applications, a graph query is first needed to extract the relevant subgraph data from the data graph for processing. The graph data of a knowledge graph has a graph structure, and in order to meet the application requirements of related scenarios (such as the graph query described above), the search for a specific graph relationship (such as a query graph) can be completed through query matching technology. The query graph defines, at the schema level, the nodes, the edges, and the connection relationships among them for which instance data is to be searched. The query graph can correspond to a query request of a user and can reflect the query conditions input by the user. The core problem of searching with a query graph is to determine whether the graph data (such as a data graph) contains a subgraph satisfying the nodes, edges, and connection relationships described by the query graph; graph query may therefore also be called subgraph matching or pattern matching.

Subgraph matching may be performed through a series of join operations, and each join operation may include a matching operation and a connecting operation. Thus, a join operation may also be referred to as a matching and connecting operation. Matching may be querying the graph data for instances that satisfy the structural relationship of a matching unit, and connecting may be associating the instances obtained by the current matching with the instances obtained by previous matching. A matching unit may be an edge, a node, or a V-shaped structure (two edges sharing one vertex) obtained by splitting the query graph.

In some embodiments, the splitting of the query graph may be as shown in FIG. 1. FIG. 1 is an exemplary schematic diagram of query graph splitting according to some embodiments of the present specification, where 100 denotes the query graph before splitting, and 101, 102, and 103 denote the matching units obtained after splitting. 101 and 102 are exemplary V-shaped structures obtained by splitting, 103 is an exemplary edge structure obtained by splitting, and a node may be included in a V-shaped structure or an edge structure. In some embodiments, a V-shaped structure may include more than two edges, for example, more than two edges (not shown) formed by one shared vertex and more than two other vertices.

Current related schemes for subgraph matching fall into two types. The first type is a hash join strategy based on MapReduce, which completes subgraph matching by performing multiple rounds of parallel join operations. For example, assume that the query graph has four nodes A, B, C, and D and three edges A-B, B-C, and C-D, and that it is split by edge, so that the three matching units obtained are A-B, B-C, and C-D. In the hash join subgraph matching process, instance data satisfying the corresponding relationship is first queried in the data graph for each of the three matching units, and the instance data is then joined (connected) on the node instances it has in common, so as to obtain subgraph data satisfying the relationship of the whole query graph. However, such schemes have the following disadvantages: (1) There is too much invalid computation, and the intermediate results are very large. When a join operation is executed, even if edge A-B contains very few instances when it is joined with edge B-C, the scheme still needs to compute all instance data of edge B-C because of the parallel computation, which causes invalid computation. (2) An invalid query cannot be identified quickly. When some relationship in the query graph cannot exist in the data graph, the scheme still needs to perform a large amount of computation before reaching that conclusion. For example, in the above example, if no instance data corresponding to edge C-D exists in the data graph, the scheme still needs to query the instance data of A-B and B-C, and the query can be determined to be invalid only when C-D finally fails to match. (3) When a very large task is executed, the system is prone to crashing, and computing resources are wasted.

The second type is designed specifically for subgraph matching and, compared with the hash join strategy, is optimized in aspects such as data organization and join strategy. Examples include StarJoin and BigJoin. The main problems with this type of scheme are: (1) only homogeneous graphs are considered and attribute constraints are not, so that in a real service scenario, namely a heterogeneous attribute graph, the effect is seriously degraded or the scheme is even unusable; (2) the join strategy is not optimized by making full use of the graph data; (3) whether a task can be executed cannot be judged in advance; (4) subgraph queries that obviously cannot return results are not filtered out.
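The hash join strategy described above can be sketched as follows. This is a single-process illustration only, not the MapReduce implementation the text refers to; the function name `hash_join` and the sample instance data are hypothetical. Note that all instances of every matching unit are fetched before any join takes place, which is the source of the invalid computation described in disadvantage (1).

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Join two lists of partial matches (dicts) on a shared query node."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    joined = []
    for row in left:
        for other in index.get(row[key], []):
            joined.append({**row, **other})
    return joined


# Hypothetical instance data, queried independently for each matching unit.
ab = [{"A": "a1", "B": "b1"}, {"A": "a2", "B": "b2"}]
bc = [{"B": "b1", "C": "c1"}, {"B": "b3", "C": "c3"}]
cd = [{"C": "c1", "D": "d1"}]

# Join on the node instances the matching units have in common.
abc = hash_join(ab, bc, "B")
abcd = hash_join(abc, cd, "C")
```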

While studying subgraph matching schemes, the inventor found that statistical indexes of the graph data, such as the number of instances of each type, edge connectivity, hot-spot information, and attribute distribution information, are generally generated while a knowledge graph is constructed, and that this information is a valuable reference for improving matching efficiency in subgraph matching tasks.

Therefore, the embodiment of the present specification provides a method and a system for improving sub-graph matching efficiency, which combine ontology definition data and statistical indexes of a knowledge graph to optimize a sub-graph matching method and improve sub-graph matching efficiency. The technical solutions disclosed in the present specification are explained in detail by the description of the drawings below.

FIG. 2 is an exemplary flow diagram of a method for improving subgraph matching efficiency, according to some embodiments described herein. In some embodiments, flow 200 may be performed by a processing device. For example, the process 200 may be stored in a storage device (e.g., an onboard storage unit of a processing device or an external storage device) in the form of a program or instructions that, when executed, may implement the process 200. The flow 200 may include the following operations.

Step 206, a subgraph matching task is obtained. In some embodiments, step 206 may be performed by the subgraph matching task acquisition module 410.

The sub-graph matching task can be used for acquiring relevant sub-graph data from a corresponding data graph according to conditions given by a user.

In some embodiments, a subgraph matching task may include one or more matching units and one or more attribute constraints.

A matching unit may be a basic data structure that is obtained by parsing the query graph and is used for performing the matching operation. In some embodiments, the structure of a matching unit may be a node, an edge, a V-shaped structure, or the like. One or more matching units can be obtained by splitting the query graph.

Each attribute constraint corresponds to a graph element, which may be a node or an edge. It should be noted that, unless otherwise specified, the graph elements referred to in the embodiments of the present specification are generally schema-level concepts, and a node or an edge refers to a node or an edge at the schema level. The instance data of a node or edge in a data graph may be referred to as a node instance or an edge instance, or simply as an instance where this causes no confusion.

An attribute constraint may refer to a matching rule that needs to be followed when performing subgraph matching. It may be understood that an attribute constraint filters the instance data of a node or an edge during subgraph matching, and an attribute constraint therefore generally has a graph element corresponding to it. For example, for an "age" attribute, which has a numeric value, the attribute constraint may correspond to a user node and specify 20 < age < 80. As another example, a house may have attributes such as area and location, and the attribute constraint may correspond to a house node and require that the house area be greater than 100 square meters and that the house be located in a first-tier city. By combining the attribute constraints with the query graph, instance data can be filtered while the instance data corresponding to a matching unit is being queried, yielding subgraph data that satisfies both the query graph relationship and the attribute constraints, so that subgraph matching or subgraph data query can be performed more accurately.
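As a minimal sketch of how an attribute constraint filters node instances, the age example can be written as a predicate. The dictionary layout of the instances and the function name `age_constraint` are assumptions made for illustration.

```python
# Hypothetical node instances of the schema-level "user" node.
users = [
    {"id": "u1", "label": "user", "age": 18},
    {"id": "u2", "label": "user", "age": 35},
    {"id": "u3", "label": "user", "age": 80},
]


def age_constraint(node):
    """Attribute constraint on the user node: 20 < age < 80 (strict bounds)."""
    return 20 < node["age"] < 80


# Filter while querying: keep only user instances that satisfy the constraint.
matched = [u["id"] for u in users if u["label"] == "user" and age_constraint(u)]
```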

In some embodiments, the processing device may split the query graph by a common graph splitting method to obtain one or more matching units, for example, splitting by edge, and the like.

In some embodiments, the processing device may obtain the sub-graph matching task by receiving a user input, calling a related data interface, and the like.

Step 208, determining the matching units corresponding to the one or more attribute constraints and the constraint strengths thereof. In some embodiments, step 208 may be performed by constraint strength determination module 420.

An attribute constraint generally corresponds to a graph element; therefore, a matching unit containing the graph element that corresponds to the attribute constraint can be determined as the matching unit corresponding to that attribute constraint. For example, if the attribute constraint "20 < age < 80" corresponds to a "user" node, and node A in the matching unit A-B is a "user" node, then the matching unit corresponding to the attribute constraint can be determined to be A-B.
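The lookup described above can be sketched as follows, assuming the query graph has been split by edge. The function name `units_for_element`, the constraint records, and the rule strings are hypothetical; a node element is written as a string and an edge element as a pair of node names.

```python
# Query graph A-B-C-D split by edge: each edge is one matching unit.
query_edges = [("A", "B"), ("B", "C"), ("C", "D")]
matching_units = list(query_edges)


def units_for_element(element, units):
    """Return the matching units that contain the given graph element."""
    if isinstance(element, tuple):          # the element is an edge
        return [u for u in units if u == element]
    return [u for u in units if element in u]   # the element is a node


# Hypothetical attribute constraints, each tied to one graph element.
constraints = [
    {"element": "A", "rule": "20 < age < 80"},
    {"element": ("C", "D"), "rule": "place == 'Beijing'"},
]

unit_map = {
    c["rule"]: units_for_element(c["element"], matching_units)
    for c in constraints
}
```

A node shared by two edges, such as B here, belongs to more than one matching unit, so a constraint on it would map to both A-B and B-C.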

The constraint strength may reflect the degree to which an attribute constraint filters instance data during subgraph matching. Generally, the less instance data in the data graph satisfies the attribute constraint, the greater the constraint strength of that attribute constraint. In other words, the constraint strength may be inversely related to the number of instances obtained by matching.

In some embodiments, the constraint strength may be expressed in various ways. For example, it may be expressed as a numerical value whose magnitude is positively or negatively correlated with the constraint strength; as another example, it may be expressed as a star level, where a higher star level indicates a greater constraint strength.

In some embodiments, the processing device may determine the constraint strength of the corresponding attribute constraint based on the attribute distribution information of the graph elements in the data graph.
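One possible way to derive a constraint strength from attribute distribution information is sketched below. The formula (one minus the fraction of instances that satisfy the constraint) is an assumption chosen for illustration and is not stated in the specification; the histogram values and the function name `constraint_strength` are likewise hypothetical.

```python
def constraint_strength(histogram, predicate):
    """Estimate strength as the fraction of instances the constraint filters out.

    histogram maps an attribute value to the number of instances having it.
    Assumed formula: strength = 1 - satisfied / total, so fewer satisfying
    instances give a greater strength.
    """
    total = sum(histogram.values())
    if total == 0:
        return 1.0
    satisfied = sum(n for value, n in histogram.items() if predicate(value))
    return 1.0 - satisfied / total


# Hypothetical distribution of a "transaction place" attribute.
place_hist = {"Beijing": 100, "Shanghai": 400, "Hangzhou": 500}

strong = constraint_strength(place_hist, lambda p: p == "Beijing")
weak = constraint_strength(place_hist, lambda p: p != "Beijing")
```

Because the estimate uses only precomputed statistics, it can be obtained without scanning the data graph itself.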

For a detailed description of determining the matching units corresponding to the attribute constraints and the constraint strengths thereof, reference may be made to the related description of FIG. 3, which is not repeated herein.

Step 210, determining the matching and connection order of each matching unit when executing the sub-graph matching task based on the matching unit corresponding to the one or more attribute constraints and the constraint strength thereof. In some embodiments, step 210 may be performed by the matching connection order determination module 430.

In some embodiments, matching and connecting may be considered a single operation consisting of one matching and one connecting. Matching may refer to querying the data graph for instances that satisfy the structural relationship of the matching unit, and connecting may refer to associating the currently matched instances with the previously matched instances; if the current matching is the first one, no association is performed. In some embodiments, one matching and connecting may be referred to as one join operation. Association may be the step of splicing together the instance data corresponding to different matching units, so as to obtain the instance data corresponding to the whole query graph. The connection points for splicing may be the instances shared by the instance data corresponding to the different matching units.

For example, the first matching unit is an A-B edge whose type is transaction, the second matching unit is a B-C edge whose type is friend, and the attribute constraint of the A-B edge is that the consumption amount is more than 1000. After the first matching unit is matched, a plurality of instances satisfying the attribute constraint can be obtained; assume that they include the instance data Zhang San (transfers 1500 yuan) Li Si. Matching is then performed based on the second matching unit, the B-C edge, whose attribute constraint is that the friend relationship is satisfied, and a plurality of instances satisfying the attribute constraint are obtained; assume that they include the instance data Li Si (friend) Wang Wu. After the instance data of the two matching units has been matched, the result of the first matching and the result of the second matching may be associated; for example, Zhang San (transfers 1500 yuan) Li Si (friend) Wang Wu may be obtained. In this way, the instance data corresponding to different matching units can be spliced together through association.
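The matching and association in this example can be sketched as follows. The tuples and the extra sample rows are hypothetical; only the Zhang San, Li Si, and Wang Wu instances come from the example above.

```python
# Hypothetical edge instances: (source, target, amount) and (source, target).
transfers = [("Zhang San", "Li Si", 1500), ("Zhao Er", "Wang Wu", 800)]
friends = [("Li Si", "Wang Wu"), ("Zhao Er", "Li Si")]

# Match the first unit A-B (type: transaction) under the constraint amount > 1000.
first = [{"A": s, "B": d, "amount": amt} for s, d, amt in transfers if amt > 1000]

# Match the second unit B-C (type: friend).
second = [{"B": s, "C": d} for s, d in friends]

# Associate: splice the two results on the shared node instance B.
joined = [{**r1, **r2} for r1 in first for r2 in second if r1["B"] == r2["B"]]
```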

In some embodiments, when the query graph is split into a plurality of matching units, these matching units may be matched and connected in a certain order. The matching and connection order can therefore be used to guide the graph computation engine in executing the subgraph matching task on the data graph so as to obtain instances satisfying the query graph relationship. Different matching and connection orders lead to different efficiencies in completing the subgraph matching task. For example, when the matching and connection order is the second matching unit, the first matching unit, and then the third matching unit, the graph computation engine performs the matching and connection operations in that order. The graph computation engine may first perform the matching operation of the second matching unit (as this is the first matching, no association is performed) to obtain the instance data corresponding to the second matching unit, and then perform the matching and association operation of the first matching unit based on that instance data to obtain the instance data corresponding to the first matching unit.

Illustratively, the second matching unit is an A-B edge whose type is transaction. After matching, a plurality of instances satisfying the attribute constraint can be obtained, including the instance data Zhang San (transfers 1500 yuan) Li Si and Zhang San (transfers 2000 yuan) Wang Wu. The first matching unit is a B-A edge whose type is friend. When the first matching unit is matched, matching can start from the node instances Li Si and Wang Wu, yielding Li Si (friend) Zhao Er and Wang Wu (friend) Maxi. The matching and association operation of the third matching unit is then performed on the basis of the instance data obtained so far, so that instance data satisfying the relationship of the whole query graph can be obtained.

In some embodiments, the processing device may determine the order of matching and connecting based on the constraint strength of the matching unit corresponding to the attribute constraint. For example, the stronger the constraint strength of the attribute constraint corresponding to the matching unit, the earlier the matching and connection order thereof.

When the constraint strength of the attribute constraint condition is higher, the number of instances obtained by executing the matching operation of the matching unit corresponding to the attribute constraint condition is smaller, and accordingly, when the matching and the association of the next matching unit are executed, the number of instances needing matching and association is smaller, the required calculation amount is smaller, and the calculation efficiency is higher.
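A minimal sketch of ordering matching units by constraint strength follows. The greedy rule used here (start with the strongest unit, then repeatedly pick the strongest remaining unit that shares a node with the units already chosen) is an assumption added so that each connecting step has a common node to join on; the specification only states that stronger constraints come earlier. The function name `plan_order` and the strength values are hypothetical.

```python
def plan_order(units, strength):
    """Order matching units so that strongly constrained units come first.

    units maps a unit name to the set of query nodes it contains.
    strength maps a unit name to a number (larger means stronger).
    """
    remaining = set(units)
    first = max(sorted(remaining), key=lambda u: strength.get(u, 0))
    order, covered = [first], set(units[first])
    remaining.remove(first)
    while remaining:
        # Prefer units that share a node with what has been matched so far.
        connected = [u for u in remaining if units[u] & covered]
        candidates = sorted(connected or remaining)
        nxt = max(candidates, key=lambda u: strength.get(u, 0))
        order.append(nxt)
        covered |= units[nxt]
        remaining.remove(nxt)
    return order


units = {"A-B": {"A", "B"}, "B-C": {"B", "C"}, "C-D": {"C", "D"}}
order = plan_order(units, {"A-B": 1, "B-C": 5, "C-D": 10})
```

With these hypothetical strengths the planned order is C-D, then B-C, then A-B, so each step extends the instances already matched.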

Take the query graph A-B-C-D as an example. The attribute constraint corresponding to the transaction relationship of edge A-B is that the consumption amount is greater than 1000 yuan, the attribute constraint of edge B-C is that the friend relationship is satisfied, and the attribute constraint corresponding to the transaction-place relationship of edge C-D is that the transaction place is Beijing. Suppose that, according to the attribute distribution information of the respective attributes, the constraint strength of edge C-D is the largest and that of edge A-B is smaller, the constraint strengths of the attribute constraints of the edges being 10, 5, and 1, respectively.

Suppose the matching and connection order is not adjusted according to constraint strength, that is, matching is performed in the original sequence. The number of instances obtained after the first matching (edge A-B) is 1000; the number obtained after the second matching (edge B-C), based on those 1000 instances, is 500; and the number obtained after the third matching (edge C-D), based on those 500 instances, is 100. The number of instances that finally satisfy the query graph is therefore 100. In this process, 500 instances from the first matching and 400 instances from the second matching do not go on to form instance data that finally satisfies the query graph relationship, which amounts to a large quantity of wasted work.

If instead the matching and connection order is adjusted according to constraint strength, the C-D edge is matched first, the B-C edge second, and the A-B edge third. The number of instances obtained after the first matching is then 100. Because only 100 instances of node C satisfy the attribute constraint of the C-D edge after the first matching, the second matching needs to be performed only for these 100 instances, and the number of instances that it yields does not exceed 100; the same holds for the third matching. By comparison, the number of intermediate results in the matching process is effectively reduced.
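The effect of the order on intermediate results can be reproduced on a toy data graph. The sketch below uses far smaller, hypothetical instance counts (10, 5, and 1 instead of 1000, 500, and 100), and the function name `execute` is an assumption; it records the number of partial matches after each matching and connection step.

```python
def execute(order, instances):
    """Match and connect units in the given order; return results and step sizes.

    order is a list of (src_var, dst_var) matching units.
    instances maps each unit to its list of (src, dst) edge instances.
    """
    partial, sizes = None, []
    for unit in order:
        s, d = unit
        rows = [{s: a, d: b} for a, b in instances[unit]]
        if partial is None:
            partial = rows          # first matching: nothing to associate with
        else:
            partial = [
                {**p, **r}
                for p in partial
                for r in rows
                if all(p[k] == r[k] for k in p.keys() & r.keys())
            ]
        sizes.append(len(partial))
    return partial, sizes


# Hypothetical data graph: the C-D constraint is the most selective.
instances = {
    ("A", "B"): [(f"a{i}", f"b{i}") for i in range(10)],
    ("B", "C"): [(f"b{i}", f"c{i}") for i in range(5)],
    ("C", "D"): [("c0", "d0")],
}

forward, forward_sizes = execute([("A", "B"), ("B", "C"), ("C", "D")], instances)
reverse, reverse_sizes = execute([("C", "D"), ("B", "C"), ("A", "B")], instances)
```

Both orders return the same final instance, but the forward order carries 10 and then 5 partial matches through the pipeline, whereas the strength-based order never holds more than one.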

In this embodiment, the attribute constraints of the matching units are evaluated and the matching and connection operations of the strongly constrained parts are performed first, so that the magnitude of the intermediate results can be greatly reduced. The foregoing example is relatively simple and intuitive, and a query graph and its splitting manner in practical applications may be more complex. Nevertheless, the example shows that moving forward the matching and connection operations corresponding to attribute constraints with a large constraint strength reduces the number of intermediate results to a certain extent and thereby improves the efficiency of subgraph matching.

In an actual application scenario, a subgraph matching task may not match the data graph at all, which leads to invalid computation. For example, if the query graph in the subgraph matching task is A-B-C-D and the data graph to be matched does not include a D node, the subgraph matching task inevitably yields no result, and the matching computations for the A-B edge and the B-C edge performed during the matching process are both invalid and waste computing resources.

Therefore, in some embodiments, the processing device may further determine the validity of the sub-graph matching task based on the schema information of the data graph and, when the sub-graph matching task is invalid, not execute it, so as to avoid wasting computing resources. Illustratively, the processing device may determine whether to perform the sub-graph matching task by performing step 202. As shown in the following embodiments, step 202 may include sub-steps 2022 to 2026.

Step 2022, obtain ontology definition data of the data graph.

Since any data graph is generated based on preset ontology definition data, every data graph has corresponding ontology definition data. In some embodiments, the processing device may obtain the ontology definition data of the data graph by reading it from a database or a storage device that stores it, or by calling a related data interface.

Step 2024, determine whether the ontology definition data includes the query graph.

In some embodiments, the processing device may make the determination by traversing the nodes and edges in the ontology definition data. For example, each node and each edge of the query graph is looked up in the ontology definition data to determine whether it is included. If it is determined that at least one node or at least one edge of the query graph is not included in the ontology definition data, the processing device may perform step 2026.

Step 2026, determine not to perform the subgraph matching task.

Because the ontology definition data of the data graph does not contain the query graph, a subgraph matching task based on the query graph cannot be completed in the data graph. In this embodiment, the query graph of the sub-graph matching task is verified through the ontology definition data of the data graph, so that invalid sub-graph matching tasks can be filtered out quickly, and computing resources are saved.
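The check in steps 2022 to 2026 can be sketched as a simple membership test. The representation below (node types as a set of strings, edge types as a set of node-type pairs) and the function name `check_query_against_schema` are assumptions made for illustration; the specification does not prescribe a storage format for ontology definition data.

```python
def check_query_against_schema(schema_nodes, schema_edges, query_nodes, query_edges):
    """Return (is_valid, missing_nodes, missing_edges) for a query graph."""
    missing_nodes = [n for n in query_nodes if n not in schema_nodes]
    missing_edges = [e for e in query_edges if e not in schema_edges]
    return (not missing_nodes and not missing_edges), missing_nodes, missing_edges


# Hypothetical ontology definition data: the data graph has no D node type.
schema_nodes = {"A", "B", "C"}
schema_edges = {("A", "B"), ("B", "C")}

# Query graph A-B-C-D from the example above.
query_nodes = ["A", "B", "C", "D"]
query_edges = [("A", "B"), ("B", "C"), ("C", "D")]

is_valid, missing_nodes, missing_edges = check_query_against_schema(
    schema_nodes, schema_edges, query_nodes, query_edges
)
print(is_valid)       # False, so the task would not be executed
print(missing_nodes)  # ['D']
print(missing_edges)  # [('C', 'D')]
```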

In some embodiments, the processing device may further evaluate, based on the statistical index data of the data graph, the data magnitude that executing the sub-graph matching task may reach and the computing resources it requires, and determine whether to perform the sub-graph matching task according to the result of the evaluation. Illustratively, the processing device may evaluate the data magnitude of the sub-graph matching task by performing step 204. As shown in the following embodiments, step 204 may include sub-steps 2042 to 2046.

It should be noted that, step 202 and step 204 in the dashed box in fig. 2 are optional additional steps, and in some embodiments, step 202 and step 204 may not be included, or only one of them is included, which is not limited in this specification.

Step 2042, obtain statistical index data of the data graph.

The statistical index data of the data graph may refer to statistical information about the instances corresponding to the nodes, edges, node types, attributes, and so on in the data graph. In some embodiments, the statistical index data may include one or more of the number of instances of different types, the edge connectivity, and hotspot information.

The number of instances of different types may include the number of node instances and edge instances of different types, e.g., the number of instances of nodes of the user type or the number of instances of edges of the friend relationship type.

Edge connectivity may refer to the average number of edges per node for a given relationship type. It can be calculated per node: determine the nodes that have edges of the specified type, count the number of edges of that type for each such node, and take the average of these counts as the connectivity of that edge type. For example, to calculate the connectivity of friend-relationship edges, the nodes having friend-relationship edges are determined first. Assuming there are three user nodes, where the first user has 100 friends, the second user has 200 friends, and the third user has 150 friends, the connectivity of friend-relationship edges is (100 + 200 + 150) / 3 = 150.
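As a minimal sketch of this calculation, assuming edges are stored as hypothetical `(source, edge_type, target)` triples and counted on the source node:

```python
from collections import Counter


def edge_connectivity(edges, edge_type):
    """Average number of edges of `edge_type` per node that has such edges."""
    per_node = Counter(src for src, etype, _dst in edges if etype == edge_type)
    if not per_node:
        return 0.0
    return sum(per_node.values()) / len(per_node)


# Three users with 100, 200 and 150 friend edges, as in the example above.
edges = (
    [("user1", "friend", f"x{i}") for i in range(100)]
    + [("user2", "friend", f"y{i}") for i in range(200)]
    + [("user3", "friend", f"z{i}") for i in range(150)]
    + [("user1", "transaction", "shop1")]  # other edge types are ignored
)

friend_connectivity = edge_connectivity(edges, "friend")
print(friend_connectivity)  # 150.0
```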

A hotspot may refer to a node in the data graph that has a large number of edges. For example, if a user has 1000 friends, far more than an average of 200 friends per user, that user may be regarded as a hotspot user.
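One possible way to flag hotspots is to compare each node's edge count with a multiple of the average. The specification does not fix a threshold, so the `factor` parameter and the sample degrees below are purely hypothetical.

```python
from statistics import mean


def find_hotspots(degree_by_node, factor=3.0):
    """Return nodes whose edge count exceeds `factor` times the average edge count."""
    average = mean(list(degree_by_node.values()))
    return sorted(n for n, d in degree_by_node.items() if d > factor * average)


# Hypothetical friend counts per user; u1 is far above the others.
degree_by_node = {"u1": 1000, "u2": 100, "u3": 100, "u4": 100, "u5": 100}

hotspots = find_hotspots(degree_by_node)
print(hotspots)  # ['u1']
```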

In some embodiments, the processing device may obtain the statistical index data of the data graph by reading it from a database or a storage device, or by calling a related data interface. The statistical index data can be computed in advance after the data graph is constructed.

Step 2044, evaluate the data magnitude of the sub-graph matching task based on the statistical index data.

The data magnitude of the subgraph matching task can reflect the computing resources required by the subgraph matching task. For example, the larger the magnitude of the data, the more computing resources are required.

In some embodiments, the processing device may evaluate an approximate data magnitude for executing the sub-graph matching task based on the statistical index data. For example, the processing device may determine, based on the target nodes and target edges to be matched in the query graph, the number of instances of those target nodes and target edges in the data graph, and from that evaluate the approximate data magnitude of the sub-graph matching task. For another example, the processing device may determine the order of magnitude of the edges that need to be joined in the sub-graph matching task based on the edge connectivity. Assume there are three nodes A, B, and C and two edges A->B and B->C, where the type of the first edge is a transaction relationship and the type of the second edge is a friend relationship. If the number of node instances of A is 100, the edge connectivity of the transaction relationship is 10, and the edge connectivity of the friend relationship is 10, it can be estimated that matching the two edges yields about 100 × 10 × 10 = 10000 pieces of instance data.
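The connectivity-based estimate amounts to multiplying the number of starting node instances by the edge connectivity of each edge on the path. A minimal sketch, with the hypothetical function name `estimate_magnitude`:

```python
def estimate_magnitude(start_instances, connectivities):
    """Estimate joined instances: start count times the connectivity of each edge."""
    estimate = start_instances
    for connectivity in connectivities:
        estimate *= connectivity
    return estimate


# 100 instances of A, transaction connectivity 10 (A->B), friend connectivity 10 (B->C).
estimated = estimate_magnitude(100, [10, 10])
print(estimated)  # 10000
```

This is only an average-case estimate; it ignores attribute constraints and skew, which is why hotspot information is considered separately.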

In some embodiments, when there is a hotspot among the matched nodes, the data magnitude generated during subgraph matching may also be estimated from the number of outgoing edges and incoming edges of the hotspot. For example, assuming that node B is a hotspot with 10000 outgoing edges and 10000 incoming edges, it may be determined that 10000 × 10000 pieces of instance data may be generated when hotspot B is matched, so the matching result grows sharply and the data magnitude of the sub-graph matching task becomes large. In this case, an upper limit that the data magnitude of the final sub-graph matching task may reach may be estimated from the hotspot information (e.g., by estimating the possible data magnitude after matching the node or edge corresponding to the hotspot information), and whether to execute the sub-graph matching task may be determined based on this upper limit.
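The product of incoming and outgoing edge counts follows from the fact that every incoming edge of the hotspot can pair with every outgoing edge. The sketch below checks this on a small hypothetical graph by enumeration; the helper names are illustrative assumptions.

```python
from itertools import product


def paths_through(node, edges):
    """Enumerate all two-edge paths (a, node, c) that pass through `node`."""
    incoming = [src for src, dst in edges if dst == node]
    outgoing = [dst for src, dst in edges if src == node]
    return [(a, node, c) for a, c in product(incoming, outgoing)]


def hotspot_upper_bound(in_degree, out_degree):
    """Upper bound on instances produced when joining through a hotspot."""
    return in_degree * out_degree


# Hypothetical small graph: 3 edges into B and 4 edges out of B.
edges = [(f"a{i}", "B") for i in range(3)] + [("B", f"c{j}") for j in range(4)]

enumerated = len(paths_through("B", edges))
print(enumerated)                          # 12
print(hotspot_upper_bound(3, 4))           # 12
print(hotspot_upper_bound(10000, 10000))   # 100000000
```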

In some embodiments, the above statistical index data may be used in combination to evaluate the data magnitude of the subgraph matching task.

Step 2046, when the data magnitude is greater than a preset value, determine not to execute the sub-graph matching task.

In some embodiments, the preset value may be determined in advance, for example, based on the available computing resources of the processing device, with the data magnitude that those resources can process taken as the preset value. For example, given the amount of memory occupied by one edge, the preset value can be obtained by dividing the available memory of the current processing device by the memory occupied by one edge. When the data magnitude is larger than the preset value, the processing device may crash; in that case, it may be determined not to execute the sub-graph matching task, which avoids wasting resources on a task whose computation cannot be completed.
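A minimal sketch of this decision follows. The 8 GiB of available memory and 64 bytes per edge are hypothetical figures chosen only to make the arithmetic concrete, and the function names are illustrative.

```python
def preset_value(available_memory_bytes, bytes_per_edge):
    """Largest number of edge instances that fits in the available memory."""
    return available_memory_bytes // bytes_per_edge


def should_execute(estimated_magnitude, limit):
    """Execute the task only if the estimated data magnitude does not exceed the limit."""
    return estimated_magnitude <= limit


# Hypothetical resources: 8 GiB of free memory, 64 bytes per edge instance.
limit = preset_value(8 * 1024**3, 64)
print(limit)                          # 134217728
print(should_execute(10_000, limit))  # True
print(should_execute(10**9, limit))   # False
```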

In some embodiments of the present description, before the sub-graph matching task is executed, its validity is determined based on the ontology definition data of the knowledge graph so as to avoid invalid computation, and the possible result magnitude of the sub-graph matching and the required computing resources are evaluated based on the statistical index data of the data graph, so that tasks too large to be computed are filtered out and computing resources are not wasted. When the sub-graph matching task is executed, the execution order of the matching operations of the matching units is adjusted based on the attribute constraint conditions, so that the amount of intermediate data during sub-graph matching is effectively reduced and the sub-graph matching efficiency is improved.

FIG. 3 is an exemplary flow diagram illustrating determining a constraint strength of an attribute constraint according to some embodiments of the present description. In some embodiments, flow 300 may be performed by a processing device. For example, the process 300 may be stored in a storage device (e.g., an onboard storage unit of a processing device or an external storage device) in the form of a program or instructions that, when executed, may implement the process 300. The flow 300 may include the following operations.

Step 302, obtaining attribute distribution information of one or more graph elements in the data graph.

The attribute distribution information may refer to the number of instances of a graph element in the data graph that fall into each value interval of an attribute. For example, nodes representing users have an age attribute whose values range from 10 to 100 years old, and the corresponding attribute distribution information records how many pieces of instance data fall into each age group between 10 and 100 years old, for example, 500 pieces from 10 to 20 years old, 2000 pieces from 20 to 30 years old, ..., and 100 pieces from 80 to 90 years old. Each age group can be regarded as a value interval; for example, 10 to 20 years old is one value interval.
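Such a distribution can be computed as a simple histogram. The sketch below assumes fixed-width, half-open intervals `[low, low + width)` and a handful of made-up ages; neither the interval width nor the data comes from the specification.

```python
from collections import Counter


def attribute_distribution(values, width=10):
    """Count how many values fall into each half-open interval [low, low + width)."""
    counts = Counter((v // width) * width for v in values)
    return {(low, low + width): counts[low] for low in sorted(counts)}


# Hypothetical ages of user node instances.
ages = [12, 15, 23, 27, 29, 85]

distribution = attribute_distribution(ages)
print(distribution)  # {(10, 20): 2, (20, 30): 3, (80, 90): 1}
```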

In some embodiments, the processing device may count and store distribution information of each attribute of each graph element in advance after the data graph is constructed, and when the sub-graph matching task needs to be executed, read and obtain the required attribute distribution information from a storage space (e.g., a database, a storage device, etc.) of the attribute distribution information.

Step 304, determine the constraint strength of the corresponding attribute constraint condition based on the attribute distribution information.

In some embodiments, the processing device may determine the corresponding constraint strength based on the number of corresponding instances in the attribute distribution information under the constraint of the attribute. For example, the processing device may determine the constraint strength of the corresponding attribute constraint in the manner described in the following embodiments.

The processing device may determine the value interval corresponding to the attribute constraint condition. As described in step 302, the value intervals are a division of the value range of the attribute to which the attribute constraint condition applies. Taking the user's age as a more specific example, assume that the age attribute is divided into intervals such that the first interval is 1 to 20 years old, the second interval is 20 to 40 years old, the third interval is 40 to 70 years old, the fourth interval is 70 to 80 years old, and the fifth interval is 80 years old or older. When the attribute constraint condition is that the age is greater than 80 years, it may be determined that the corresponding value interval is the fifth interval, which covers ages of 80 years or more.

The processing device may determine the constraint strength of the attribute constraint based on the number of distributions of the instances of the corresponding value intervals.

In some embodiments, the constraint strength may be inversely related to the number of distributions. For example, in the above example, if the number of the distributions of the instances in the first section is 5000, the number of the distributions of the instances in the second section is 10000, the number of the distributions of the third section is 15000, the number of the distributions of the instances in the fourth section is 3000, and the number of the distributions of the instances in the fifth section is 2000, it may be determined that the magnitude relation of the constraint strength of the attribute constraint corresponding to each section is fifth section > fourth section > first section > second section > third section.
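The ranking in this example can be reproduced with any measure that decreases as the instance count grows. The sketch below uses the reciprocal of the count, which is only one possible choice and is an assumption of this illustration, not a formula given in the specification.

```python
def constraint_strength(instance_count):
    """One possible inverse relation: fewer matching instances, stronger constraint."""
    return 1.0 / instance_count


# Instance counts per value interval, taken from the example above.
counts = {
    "first": 5000,
    "second": 10000,
    "third": 15000,
    "fourth": 3000,
    "fifth": 2000,
}

strengths = {name: constraint_strength(count) for name, count in counts.items()}
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)  # ['fifth', 'fourth', 'first', 'second', 'third']
```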

In the embodiment, by evaluating the constraint strength of the attribute constraint condition, the sequence of matching calculation of the part with higher constraint strength can be advanced, the number of results obtained by intermediate matching is effectively reduced, and the efficiency of subgraph matching is further improved.
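Once each matching unit has a constraint strength, ordering the units is a sort in descending strength. In the sketch below, the unit names, the reciprocal-count strengths, and the default strength of 0 for units without an attribute constraint are assumptions for illustration; a real planner would also need to ensure that each unit shares a node with the units matched before it, which holds trivially for this chain.

```python
def order_matching_units(units, strength_by_unit):
    """Sort matching units so that more strongly constrained units come first."""
    return sorted(units, key=lambda unit: strength_by_unit.get(unit, 0.0), reverse=True)


units = ["A-B", "B-C", "C-D"]

# Hypothetical strengths: reciprocal of the number of instances satisfying each constraint.
strength_by_unit = {"A-B": 1 / 1000, "B-C": 1 / 500, "C-D": 1 / 100}

order = order_matching_units(units, strength_by_unit)
print(order)  # ['C-D', 'B-C', 'A-B']
```

Because Python's `sorted` is stable, units with equal strength keep their original relative order.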

It should be noted that the above description of the respective flows is only for illustration and description, and does not limit the applicable scope of the present specification. Various modifications and changes to the flow may occur to those skilled in the art, given the benefit of this disclosure. However, such modifications and variations are intended to be within the scope of the present description. For example, changes to the flow steps described herein, such as the addition of pre-processing steps and storage steps, may be made.

FIG. 4 is an exemplary block diagram of a system for improving subgraph matching efficiency in accordance with some embodiments described herein. As shown in fig. 4, system 400 may include a subgraph matching task acquisition module 410, a constraint strength determination module 420, and a matching connection order determination module 430.

The subgraph matching task acquisition module 410 may be used to acquire a subgraph matching task.

Wherein the subgraph matching task comprises one or more matching units and one or more attribute constraints; the one or more matching units are obtained by disassembling the query graph; the attribute constraint conditions correspond to graph elements, which are nodes or edges.

The constraint strength determination module 420 may be configured to determine matching units corresponding to the one or more attribute constraints and the constraint strengths thereof.

The matching connection order determining module 430 may be configured to determine matching and connection orders of matching units when performing the sub-graph matching task based on matching units corresponding to one or more object attribute constraints and their constraint strengths.

The sequence is used for guiding the graph computation engine to execute the sub-graph matching task on the data graph so as to obtain an example meeting the relation of the query graph.

With regard to the detailed description of the modules of the system shown above, reference may be made to the flow chart section of this specification, e.g., the associated description of fig. 2-3.

It should be understood that the system and its modules shown in FIG. 4 may be implemented in a variety of ways. For example, in some embodiments, the system and its modules may be implemented in hardware, software, or a combination of software and hardware. Wherein the hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory for execution by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the methods and systems described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided, for example, on a carrier medium such as a diskette, CD-or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The system and its modules in this specification may be implemented not only by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., but also by software executed by various types of processors, for example, or by a combination of the above hardware circuits and software (e.g., firmware).

It should be noted that the above description of the system for improving sub-graph matching efficiency and its modules is only for convenience of description and does not limit the present description to the scope of the illustrated embodiments. It will be appreciated by those skilled in the art that, having understood the principle of the system, the modules may be combined arbitrarily or configured as subsystems connected to other modules without departing from that principle. For example, in some embodiments, the sub-graph matching task acquisition module 410, the constraint strength determination module 420, and the matching connection order determination module 430 may be different modules in one system, or a single module may implement the functions of two or more of the modules described above. For example, the modules may share one memory module, or each module may have its own memory module. Such variations are within the scope of the present description.

The beneficial effects that may be brought by the embodiments of the present description include, but are not limited to: (1) the schema information and the statistical index data of the knowledge graph are introduced into the subgraph matching task, so that the task calculation efficiency is improved and computing resources are saved; (2) attribute constraint conditions are added to the sub-graph matching task, the constraint strength is evaluated while the sub-graph structure and the attributes are constrained, and the matching and connection order during sub-graph matching is optimized according to the constraint strength, so that the sub-graph matching efficiency is improved; (3) the result magnitude and the computing resources of the subgraph matching are evaluated in advance and large tasks that cannot be computed are filtered out, which further prevents computing resources from being wasted.

It is to be noted that different embodiments may produce different advantages, and in different embodiments, any one or combination of the above advantages may be produced, or any other advantages may be obtained.

Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing detailed disclosure is to be regarded as illustrative only and not as limiting the present specification. Various modifications, improvements and adaptations to the present description may occur to those skilled in the art, though not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present specification and thus fall within the spirit and scope of the exemplary embodiments of the present specification.

Also, the description uses specific words to describe embodiments of the description. Reference to "one embodiment," "an embodiment," and/or "some embodiments" means a feature, structure, or characteristic described in connection with at least one embodiment of the specification. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the specification may be combined as appropriate.

Moreover, those skilled in the art will appreciate that aspects of the present description may be illustrated and described in terms of several patentable species or situations, including any new and useful combination of processes, machines, manufacture, or materials, or any new and useful improvement thereof. Accordingly, aspects of this description may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.), or by a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." Furthermore, aspects of the present description may be represented as a computer product, including computer readable program code, embodied in one or more computer readable media.

The computer storage medium may comprise a propagated data signal with the computer program code embodied therewith, for example, on baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, etc., or any suitable combination. A computer storage medium may be any computer-readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code on a computer storage medium may be propagated over any suitable medium, including radio, cable, fiber optic cable, RF, or the like, or any combination of the preceding.

Computer program code required for the operation of various portions of this specification may be written in any one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python, and the like, a conventional programming language such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, a dynamic programming language such as Python, Ruby, and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any network format, such as a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet), or in a cloud computing environment, or as a service, such as a software as a service (SaaS).

Additionally, the order in which the elements and sequences of the process are recited in the specification, the use of alphanumeric characters, or other designations, is not intended to limit the order in which the processes and methods of the specification occur, unless otherwise specified in the claims. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.

Similarly, it should be noted that in the preceding description of embodiments of the present specification, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not intended to imply that the claimed subject matter requires more features than are expressly recited in a claim. Indeed, an embodiment may have fewer than all of the features of a single embodiment disclosed above.

Numerals describing the number of components, attributes, etc. are used in some embodiments, it being understood that such numerals used in the description of the embodiments are modified in some instances by the use of the modifier "about", "approximately" or "substantially". Unless otherwise indicated, "about", "approximately" or "substantially" indicates that the number allows a variation of ± 20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that may vary depending upon the desired properties of the individual embodiments. In some embodiments, the numerical parameter should take into account the specified significant digits and employ a general digit-preserving approach. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of the range are approximations, in the specific examples, such numerical values are set forth as precisely as possible within the scope of the application.

The entire contents of each patent, patent application publication, and other material cited in this specification, such as articles, books, specifications, publications, and documents, are hereby incorporated by reference into this specification. Excluded are application history documents that are inconsistent with or conflict with the contents of this specification, as well as documents (currently or later attached to this specification) that limit the broadest scope of the claims of this specification. It is to be noted that if the descriptions, definitions, and/or use of terms in the materials accompanying this specification are inconsistent with or contrary to those in this specification, the descriptions, definitions, and/or use of terms in this specification shall control.

Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments of the present disclosure. Other variations are also possible within the scope of the present description. Thus, by way of example, and not limitation, alternative configurations of the embodiments of the specification can be considered consistent with the teachings of the specification. Accordingly, the embodiments of the present description are not limited to only those embodiments explicitly described and depicted herein.

Claims

1. A method of improving subgraph matching efficiency, the method comprising:

acquiring a subgraph matching task; wherein the subgraph matching task comprises one or more matching units and one or more attribute constraints; the one or more matching units are obtained by disassembling the query graph; the attribute constraint conditions correspond to the graph elements, and the graph elements are nodes or edges;

determining matching units corresponding to the one or more attribute constraints and the constraint intensity of the matching units;

determining the matching and connection sequence of each matching unit when the sub-graph matching task is executed based on the matching units corresponding to one or more object attribute constraints and the constraint intensity thereof; the sequence is used for guiding the graph computation engine to execute the sub-graph matching task on the data graph so as to obtain an example meeting the relation of the query graph.

2. The method of claim 1, wherein determining matching units and their constraint strengths for the one or more attribute constraints comprises:

taking a matching unit containing a graph element corresponding to a certain attribute constraint condition as a matching unit corresponding to the attribute constraint condition;

acquiring attribute distribution information of one or more graph elements in a data graph;

based on the attribute distribution information, the constraint strength of the corresponding attribute constraint condition is determined.

3. The method according to claim 2, wherein the attribute distribution information of the graph element includes the distribution number of the instance of the graph element in each value interval of the attribute; the determining the constraint strength of the corresponding attribute constraint condition based on the attribute distribution information includes:

determining a value interval corresponding to the attribute constraint condition;

determining the constraint strength of the attribute constraint condition based on the distribution quantity of the instances of the corresponding value interval; wherein the constraint strength is inversely related to the number of distributions.

4. The method of claim 1, wherein the stronger the constraint strength of the attribute constraint corresponding to the matching unit, the earlier the matching and connection order.

5. The method of claim 1, further comprising:

acquiring ontology definition data of the data graph;

judging whether the ontology definition data contains the query graph;

and when not, determining not to execute the subgraph matching task.

6. The method of claim 1, further comprising:

acquiring statistical index data of a data graph;

evaluating the data magnitude of the sub-graph matching task based on the statistical index data; the data magnitude of the subgraph matching task can reflect the computing resources required by the subgraph matching task;

and when the data magnitude is larger than a preset value, determining not to execute the sub-graph matching task.

7. The method of claim 6, the statistical indicator data comprising one or more of a number of instances of different types, a degree of edge connectivity, and hotspot information.

8. A system for improving subgraph matching efficiency, the system comprising:

the subgraph matching task acquisition module is used for acquiring a subgraph matching task; wherein the subgraph matching task comprises one or more matching units and one or more attribute constraints; the one or more matching units are obtained by disassembling the query graph; the attribute constraint conditions correspond to the graph elements, and the graph elements are nodes or edges;

the constraint intensity determining module is used for determining the matching units corresponding to the one or more attribute constraints and the constraint intensities thereof;

the matching connection sequence determining module is used for determining the matching and connection sequence of each matching unit when the sub-graph matching task is executed based on the matching units corresponding to one or more object attribute constraint conditions and the constraint intensity thereof; the sequence is used to direct the graph computation engine to perform the sub-graph matching tasks on the data graph to obtain instances that satisfy the query graph relationships.

9. An apparatus for improving subgraph matching efficiency, comprising at least one storage medium and at least one processor, the at least one storage medium for storing computer instructions; the at least one processor is configured to execute the computer instructions to implement the method of any of claims 1-7.

10. A computer-readable storage medium storing computer instructions which, when read by a computer, cause the computer to perform the method of any one of claims 1-7.