CN114020781A

CN114020781A - Query task optimization method based on scientific and technological consultation large-scale graph data

Info

Publication number: CN114020781A
Application number: CN202111316037.1A
Authority: CN
Inventors: 鄂海红; 宋美娜; 梁静茹; 刘雨薇; 魏秋实
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2021-11-08
Filing date: 2021-11-08
Publication date: 2022-02-08
Also published as: WO2023077731A1

Abstract

A query task optimization method, system and storage medium based on scientific and technological consultation large-scale graph data. The method acquires the identifier of a query task and selects a corresponding query optimization method according to that identifier, wherein the query optimization methods comprise adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement and materialized views; the graph database is then queried using the selected query optimization method, and the query result is output. Because the query optimization method is selected according to the identifier of the query task, the flexibility of querying is improved. In addition, the query optimization methods improve query efficiency across the different query scenarios of scientific and technological consultation large-scale graph data, reduce the computational complexity of queries, and shorten query time.

Description

Query task optimization method based on scientific and technological consultation large-scale graph data

Technical Field

The application relates to the field of large-scale graph data query, in particular to a query task optimization method and device based on scientific and technological consultation large-scale graph data and a storage medium.

Background

Querying graph data is one of the most fundamental problems in the field of knowledge graphs. Efficient query processing is therefore generally required on large-scale graph data so that a user can obtain query results quickly.

Although query optimization techniques for graph data have advanced considerably, some problems remain. For example, graph partitioning techniques used for graph query optimization can split graph data across multiple servers, but they incur high inter-server communication cost and processing overhead. Moreover, most query optimization techniques are designed for social network graph data and are not suitable for the complex topology of graph data in scientific and technological consultation scenarios. Therefore, how to optimize query tasks on scientific and technological consultation large-scale graph data is a problem that urgently needs to be solved.

Disclosure of Invention

The application provides a method, a system and a storage medium for optimizing a query task based on scientific and technological consultation large-scale graph data, so as to address the problems described above.

An embodiment of a first aspect of the present application provides a method for optimizing a query task based on scientific and technological consulting large-scale graph data, including:

acquiring an identifier of a query task;

selecting a corresponding query optimization method according to the identifier of the query task, wherein the query optimization methods comprise adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement and materialized views;

and querying the graph database by using the query optimization method, and outputting a query result.

An embodiment of a second aspect of the present application provides a query task optimization system based on scientific and technological consulting large-scale graph data, including:

the acquisition module is used for acquiring the identifier of the query task;

the selection module is used for selecting a corresponding query optimization method according to the identifier of the query task, wherein the query optimization methods comprise adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement and materialized views;

and the display module is used for querying the graph database by using the query optimization method and outputting a query result.

An embodiment of a third aspect of the present application provides a computer storage medium storing computer-executable instructions; the computer-executable instructions, when executed by a processor, are capable of performing the method of the first aspect described above.

A computer device according to an embodiment of a fourth aspect of the present application includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the method according to the first aspect is implemented.

The technical scheme provided by the embodiment of the application at least has the following beneficial effects:

Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.

Drawings

The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

fig. 1 is a schematic flowchart of a query task optimization method based on scientific and technological consulting large-scale graph data according to an embodiment of the present application;

fig. 2 is a schematic structural diagram of a query task optimization system based on scientific and technological consulting large-scale graph data according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.

The following describes a query task optimization method and system based on scientific and technological consulting large-scale graph data according to an embodiment of the present application with reference to the accompanying drawings.

Example one

Fig. 1 is a schematic flowchart of a query task optimization method based on scientific and technological consulting large-scale graph data according to an embodiment of the present application, and as shown in fig. 1, the method may include:

Step 101, obtaining the identifier of the query task.

It should be noted that, in the embodiment of the present disclosure, the query task may include an organization, a talent, and an industry chain. In the embodiment of the present disclosure, the organization may be an ID of a company, and the talent may be a person

In the embodiment of the present disclosure, the identifier of the query task may be obtained according to the content of the query task. For example, assuming that the query task is to view the companies and patents associated with a certain person, the identifier corresponding to that query task is obtained.

Step 102, selecting a corresponding query optimization method according to the identifier of the query task, wherein the query optimization methods comprise adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement and materialized views.

In the embodiment of the present disclosure, different identifiers correspond to different query optimization methods, and the corresponding query optimization method may be selected according to the identifier of the query task.

The query optimization methods in the embodiment of the present disclosure may include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement, and materialized views.
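The selection by identifier can be pictured as a lookup table from identifiers to optimization routines. The following is only a minimal sketch: the identifier strings and the placeholder functions are hypothetical, since the source does not specify what form the identifiers take or how the mapping is stored.

```python
from typing import Callable, Dict

# Placeholder optimizers; each would wrap one of the four methods described
# in this step. Names and return values are illustrative assumptions.
def adjust_traversal_order(query: str) -> str:
    return f"bidirectional-BFS plan for {query}"

def reduce_cardinality(query: str) -> str:
    return f"early-deduplication plan for {query}"

def advance_pattern(query: str) -> str:
    return f"pre-filtered plan for {query}"

def use_materialized_view(query: str) -> str:
    return f"materialized-view plan for {query}"

# Hypothetical identifiers mapped to optimization methods.
OPTIMIZERS: Dict[str, Callable[[str], str]] = {
    "path_between_entities": adjust_traversal_order,
    "distinct_tuples": reduce_cardinality,
    "filtered_traversal": advance_pattern,
    "hot_aggregate": use_materialized_view,
}

def select_optimizer(task_id: str) -> Callable[[str], str]:
    """Return the optimization method registered for a query task identifier."""
    try:
        return OPTIMIZERS[task_id]
    except KeyError:
        raise ValueError(f"unknown query task identifier: {task_id!r}") from None
```

A table lookup keeps the selection step separate from the optimizers themselves, so a new query scenario only needs a new entry.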

Further, in the embodiment of the disclosure, the graph traversal expansion order strategy is adjusted according to actual scientific and technological consultation query scenarios, and a bidirectional BFS expansion order is designed. The search starts from both the starting point and the end point; once one direction reaches a position already visited by the other direction (that is, a state has been visited from both directions), a shortest path connecting the starting point and the end point has been found. The two searches thus converge and meet at a point in the middle of the shortest path, so that the number of nodes expanded by the bidirectional BFS is on the order of $2 \times n^{m/2+1}$.

Specifically, in the embodiment of the present disclosure, adjusting the graph traversal expansion order policy may include the following steps:

S11, inputting a source entity node and a target entity node, and inputting an intermediate entity node type mtype and a path pattern;

S12, initializing two node sets S1 and S2, wherein S1 is initialized to the input source entity node and S2 is initialized to the input target entity node;

S13, calculating the expansion order of the bidirectional BFS from pattern and mtype, wherein pattern1 represents the expansion order from the left end and pattern2 represents the expansion order from the right end;

S14, if S1 or S2 is not empty, continuing to step S15; otherwise, executing step S111;

S15, letting S be the set of nodes expanded in the current layer;

S16, exchanging S1 and S2, so that expansion alternates between the left end and the right end;

S17, for each node in the set S1, expanding its next-layer neighbor nodes according to the pattern, denoted next_nodes;

S18, checking each node in next_nodes; if the node is in the set S, a path has been found, and step S111 is executed;

S19, adding the next_nodes of all nodes expanded in the current layer to the set S, copying the set S to S1, and storing the path;

S110, returning to step S14;

and S111, ending.
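The alternating expansion in the steps above can be sketched as a plain bidirectional BFS. This is a simplified illustration, not the claimed procedure: it assumes an undirected graph given as an adjacency dictionary and leaves out the typed expansion orders pattern1 and pattern2 derived from pattern and mtype. All function and variable names are hypothetical.

```python
def _chain(node, parent):
    """Follow parent links from node back to the root of one search side."""
    out = []
    while node is not None:
        out.append(node)
        node = parent[node]
    return out

def bidirectional_bfs(adj, source, target):
    """Return one shortest path from source to target, or None if none exists.

    adj maps each node to an iterable of neighbours (undirected graph).
    """
    if source == target:
        return [source]
    # Each parent map doubles as the visited set of its search side.
    parent_a, parent_b = {source: None}, {target: None}
    frontier_a, frontier_b = [source], [target]
    a_is_source_side = True

    while frontier_a and frontier_b:
        next_frontier = []
        for node in frontier_a:
            for nxt in adj.get(node, ()):
                if nxt in parent_a:
                    continue
                parent_a[nxt] = node
                if nxt in parent_b:  # the two searches have met
                    src, dst = ((parent_a, parent_b) if a_is_source_side
                                else (parent_b, parent_a))
                    return _chain(nxt, src)[::-1] + _chain(nxt, dst)[1:]
                next_frontier.append(nxt)
        # Swap roles so the other end expands next (alternating expansion).
        frontier_a, frontier_b = frontier_b, next_frontier
        parent_a, parent_b = parent_b, parent_a
        a_is_source_side = not a_is_source_side
    return None
```

If either frontier becomes empty before the searches meet, that side has exhausted its connected component, so no path exists and the function returns None.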

For example, in the embodiment of the disclosure, the query task gives an industry chain label tag and person information person, and queries, starting from tag, the sub-industry chain labels of the tag, the patents belonging to those sub-industry chain labels, the companies owning those patents, and the related persons who hold positions in or invest in those companies. In the established scientific and technological consultation knowledge graph, 146284 intermediate patent nodes are generated on the path industry chain label - sub-industry chain label - patent. If these 146284 patents were expanded further using unidirectional BFS, an explosive number of intermediate results would be generated and query performance would be seriously affected.

If the bidirectional BFS graph traversal expansion order strategy of the embodiment of the present disclosure is used, the search proceeds from both the starting point and the end point, that is, along the two directions industry chain label - sub-industry chain label - patent and person - company - patent. The 146284 intermediate patent nodes generated along industry chain label - sub-industry chain label - patent are stored in a hash table; a result set is then generated in the reverse direction from the person node along person - company - patent, and the two sets are intersected to find the paths that connect the starting point and the end point and satisfy the conditions. The time complexity of this intersection is only O(n).
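The hash-table intersection on the shared patent node can be sketched as a simple hash join. This is a sketch under stated assumptions: the two half-paths are assumed to be already available as tuples, and the function name and tuple layouts are hypothetical rather than taken from the source.

```python
def join_on_patent(left_paths, right_paths):
    """Join two half-paths on their shared patent node.

    left_paths:  iterable of (tag, sub_tag, patent) from the label side
    right_paths: iterable of (person, company, patent) from the person side
    """
    # Build phase: hash the label-side results by patent, one pass.
    table = {}
    for tag, sub_tag, patent in left_paths:
        table.setdefault(patent, []).append((tag, sub_tag))

    # Probe phase: each person-side result needs one expected O(1) lookup.
    joined = []
    for person, company, patent in right_paths:
        for tag, sub_tag in table.get(patent, ()):
            joined.append((tag, sub_tag, patent, company, person))
    return joined
```

Building the table takes one pass over the label-side results and probing takes one pass over the person-side results, which is where the linear cost comes from.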

Further, in the embodiments of the present disclosure, cardinality denotes the number of unique values remaining after deduplication; for example, column cardinality refers to the number of distinct values contained in a column. This quantity directly affects model compression and the performance of the engine when scanning. It is therefore desirable to reduce cardinality as far as possible in order to reduce query time.
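As a small illustration of the definition, column cardinality is the size of the set of values in the column. The helper name and the sample values are hypothetical.

```python
def column_cardinality(column):
    """Number of distinct values in a column (values must be hashable)."""
    return len(set(column))

# Four rows, but only two distinct company IDs.
company_ids = ["c1", "c1", "c2", "c1"]
distinct_count = column_cardinality(company_ids)
```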

In an embodiment of the present disclosure, cardinality reduction may include the following steps:

S21, inputting a source entity node and a path pattern;

S22, letting next_nodes be the set of nodes expanded in the next layer, initialized to the next-layer neighbor nodes of the source entity node expanded according to the pattern;

S23, deduplicating next_nodes;

S24, letting q be a node queue, initialized to next_nodes;

S25, if q is not empty, continuing to step S26; otherwise, executing step S212;

S26, setting size to the number of nodes currently in the queue;

S27, if size is not zero, continuing to step S28; otherwise, executing step S211;

S28, popping a node from the current queue;

S29, expanding the next-layer neighbor nodes next_nodes of the node according to the pattern;

S210, adding next_nodes to the queue q;

S211, if the pattern has been fully traversed, continuing to step S212; otherwise, executing step S25;

and S212, ending.
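The effect of deduplicating a layer before expanding it can be sketched as follows. This is a simplified layer-by-layer version, and unlike step S23, which deduplicates the first expanded layer, it applies deduplication to every layer when enabled. The graph layout (a dictionary of typed edges), the relation names, and the function name are assumptions made for illustration.

```python
def traverse_layers(graph, source, pattern, dedup=True):
    """Expand a typed path pattern layer by layer.

    graph:   dict mapping node -> list of (relation, neighbour) pairs;
             parallel edges between the same two nodes are allowed
    pattern: list of sets, one set of allowed relations per hop
    Returns the list of layers reached after each hop.
    """
    frontier = [source]
    layers = []
    for allowed in pattern:
        next_nodes = [nbr
                      for node in frontier
                      for rel, nbr in graph.get(node, ())
                      if rel in allowed]
        if dedup:
            # Order-preserving DISTINCT applied before the next expansion.
            next_nodes = list(dict.fromkeys(next_nodes))
        layers.append(next_nodes)
        frontier = next_nodes
    return layers
```

With deduplication disabled, every duplicate in one layer is expanded again in the next, so the intermediate result grows multiplicatively along the path.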

For example, in the embodiment of the present disclosure, in the knowledge graph of an actual scientific and technological consultation scenario, two nodes may be connected by multiple edges or by edges of different types. For example, three relationships exist between a "company" node and a "person" node: "company-investor", "company-principal shareholder-person" and "company-staff-person". Therefore, searching from a company for its adjacent "person" nodes may reach the same "person" node through more than one of these three relationships, producing duplicate nodes. Such redundant duplicates increase cardinality, and when the duplicated "person" nodes are further expanded to their adjacent nodes the traversal is repeated, which increases the number of intermediate nodes and prolongs query time. Therefore, in embodiments of the present disclosure, a strategy of applying DISTINCT (deduplication) in advance is used to reduce cardinality.

Specifically, in the embodiment of the present disclosure, the query task in the scientific and technological consultation scenario gives a person person, and queries from that person the associated companies, the patents owned by those companies, and the industry chain labels to which the patents belong, outputting the non-duplicate (company, patent, industry chain label) tuples that conform to the path. The embodiment of the disclosure applies DISTINCT in advance to reduce cardinality: deduplication is performed as soon as duplicate nodes are generated, that is, immediately after the traversal from the "person" node reaches the "company" nodes. In this way 201 intermediate company nodes containing duplicates are reduced to 131 distinct company nodes, which reduces the number of intermediate nodes generated and the time of the subsequent traversal.

Further, in the embodiment of the present disclosure, target data needs to be acquired and screened according to business conditions; this process is the filtering step of a data query. Large-scale graph query tasks contain a large number of filtering operations, and the various filtering conditions used, such as basic comparison operations (for example < and >), logical operations (AND, OR, NOT) and pattern matching, are necessary for obtaining accurate data.

In the embodiment of the present disclosure, pattern advancement may include the following steps:

S31, inputting a source entity node, a path pattern and a filter pattern filter_pattern;

S32, initializing the pattern advancement set filter_nodes;

S33, letting q be a node queue, initialized to the input source entity node;

S34, if q is not empty, continuing to step S35; otherwise, executing step S313;

S35, initializing size to the number of nodes currently in the queue;

S36, if size is not zero, continuing to step S37; otherwise, executing step S312;

S37, popping a node from the current queue;

S38, expanding the next-layer neighbor nodes next_nodes of the node according to the pattern;

S39, judging whether the node type of the current next_nodes is the node type of filter_nodes; if so, continuing to step S310; otherwise, executing step S311;

S310, traversing each next_node in the set next_nodes, and filtering out the node if next_node is in the set filter_nodes;

S311, adding next_nodes to the queue q;

S312, if the pattern has been fully traversed, continuing to step S313; otherwise, executing step S35;

and S313, ending.
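The steps above can be sketched as a traversal that consults a precomputed set instead of walking the filter pattern for every candidate node. This is a minimal sketch under assumptions: the filter pattern is a single outgoing relation, the graph is a dictionary of typed edges, and every name below is hypothetical.

```python
def precompute_filter_nodes(graph, filter_rel):
    """Collect every node that has at least one outgoing filter_rel edge."""
    return {node
            for node, edges in graph.items()
            if any(rel == filter_rel for rel, _ in edges)}

def traverse_filtered(graph, node_type, source, pattern, filter_type, filter_nodes):
    """Expand pattern hop by hop, dropping nodes found in filter_nodes.

    graph:     dict node -> list of (relation, neighbour) pairs
    node_type: dict node -> type name
    pattern:   list of relation names, one per hop
    """
    frontier = [source]
    for rel in pattern:
        next_nodes = []
        for node in frontier:
            for edge_rel, nbr in graph.get(node, ()):
                if edge_rel != rel:
                    continue
                # One set lookup replaces traversing the filter pattern.
                if node_type.get(nbr) == filter_type and nbr in filter_nodes:
                    continue
                next_nodes.append(nbr)
        frontier = list(dict.fromkeys(next_nodes))
    return frontier
```

Because filtered nodes never enter the frontier, their neighbours are never expanded either, so the filter also prunes all downstream work.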

For example, in the embodiment of the present disclosure, the query task in the scientific and technological consultation scenario gives industry chain label information tag and queries the companies associated with the tag and the patents owned by those companies, subject to a filtering condition: the company must not have any business abnormality, that is, the pattern company - business abnormality must not exist. The non-duplicate (company, patent) tuples are output.

Specifically, pattern advancement in embodiments of the present disclosure replaces the traversal operation in the pattern with an efficient set lookup. The pattern company - business abnormality is evaluated in advance, and the IDs of the companies associated with "business abnormality" nodes are placed in a hash table. The filtering condition then only needs to check whether a "company" node exists in the hash table; if it does not, the company has no business abnormality. Only 3292 set lookups, each of O(1) time complexity, are required, which improves query efficiency.

Further, in the embodiment of the disclosure, materialized views are mainly used to precompute and store the results of time-consuming operations such as table joins or aggregations, so that these operations can be avoided when the query task is subsequently executed and the query result can be obtained quickly. In scientific and technological consultation scenarios, materialized views greatly improve query performance for hotspot questions that repeatedly use the same query results, because the data can be read quickly from the materialized view.

For example, in the embodiment of the disclosure, in a scientific and technological consultation scenario, the query task gives industry chain label information tag, and queries from the tag its sub-industry chain labels and the companies belonging to those sub-industry chain labels; then, taking each sub-industry chain label as a starting node, it queries the paths that pass through patents and finally reach the company nodes, and counts the company information and the number of patents that match this pattern. Querying each company separately is time-consuming. With the materialized view method of the embodiment of the disclosure, the patents owned by each company are obtained in advance, the industry chain labels to which each patent belongs are determined and the patents are aggregated, and the number of patents under each industry chain label is written into an attribute of the "company-industry chain label" edge, so that the precomputed materialized view improves query efficiency.
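The precomputation described in this example can be sketched as an aggregate keyed by (company, label), which stands in for the attribute written onto the "company-industry chain label" edge. The input layout, function names, and sample values are assumptions for illustration only.

```python
from collections import Counter

def build_patent_count_view(company_patents, patent_tags):
    """Precompute (company, tag) -> number of the company's patents under tag.

    company_patents: dict company -> list of patents it owns
    patent_tags:     dict patent -> list of industry chain labels
    """
    view = Counter()
    for company, patents in company_patents.items():
        for patent in patents:
            for tag in patent_tags.get(patent, ()):
                view[(company, tag)] += 1
    return dict(view)

def patent_count(view, company, tag):
    """Read a count from the precomputed view instead of traversing patents."""
    return view.get((company, tag), 0)
```

The view is built once and then answers each count with a single dictionary lookup; it would need to be refreshed whenever patents or labels change, which this sketch does not handle.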

Step 103, querying the graph database by using the query optimization method, and outputting a query result.

In the embodiment of the present disclosure, the query optimization method selected in step 102 is used to query the graph database, and the query result is output. The query result may include the associations between nodes in the graph database.

The query task optimization method based on scientific and technological consultation large-scale graph data acquires the identifier of a query task and selects a corresponding query optimization method according to that identifier, wherein the query optimization methods comprise adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement and materialized views; the graph database is then queried using the selected query optimization method, and the query result is output. Because the query optimization method is selected according to the identifier of the query task, the flexibility of querying is improved. In addition, the query optimization methods improve query efficiency across the different query scenarios of scientific and technological consultation large-scale graph data, reduce the computational complexity of queries, and shorten query time.

Fig. 2 is a schematic structural diagram of a query task optimization system based on scientific and technological consulting large-scale graph data according to an embodiment of the present application, and as shown in fig. 2, the system may include:

an obtaining module 201, configured to obtain an identifier of a query task;

the selection module 202 is configured to select a corresponding query optimization method according to the identifier of the query task, where the query optimization methods include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement, and materialized views;

the display module 203 is configured to query the graph database by using a query optimization method, and output a query result.

In the embodiment of the present disclosure, the query task may include an organization, a talent, and an industrial chain.

In the description herein, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, those skilled in the art may combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided they do not contradict one another.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.

Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims

1. A query task optimization method based on scientific and technological consultation large-scale graph data is characterized by comprising the following steps:

acquiring an identifier of a query task;

2. The method of claim 1, wherein the query task comprises an organization, a talent, and an industry chain.

3. The query task optimization method of claim 1, wherein adjusting the graph traversal expansion order strategy comprises:

S15, letting S be the set of nodes expanded in the current layer;

S110, returning to step S14;

and S111, ending.

4. The query task optimization method of claim 1, wherein the cardinality reduction comprises:

S21, inputting a source entity node and a path pattern;

S23, deduplicating next_nodes;

S24, letting q be a node queue, initialized to next_nodes;

S26, setting size to the number of nodes currently in the queue;

S28, popping a node from the current queue;

S210, adding next_nodes to the queue q;

and S212, ending.

5. The query task optimization method of claim 1, wherein the pattern advancement comprises:

S32, initializing the pattern advancement set filter_nodes;

S33, letting q be a node queue, initialized to the input source entity node;

S35, initializing size to the number of nodes currently in the queue;

S37, popping a node from the current queue;

S311, adding next_nodes to the queue q;

and S313, ending.

6. A query task optimization system based on scientific and technological consultation large-scale graph data is characterized by comprising:

the acquisition module is used for acquiring the identifier of the query task;

7. The query task optimization system of claim 6, wherein the query task comprises an organization, a talent, an industry chain.

8. A computer storage medium, wherein the computer storage medium stores computer-executable instructions; the computer-executable instructions, when executed by a processor, are capable of performing the method of any of claims 1-5.

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to any one of claims 1-5 when executing the program.