WO2023077731A1 - Query task optimization method based on large-scale graph data for science and technology consulting - Google Patents

Query task optimization method based on large-scale graph data for science and technology consulting

Info

Publication number
WO2023077731A1
WO2023077731A1 PCT/CN2022/087215 CN2022087215W WO2023077731A1 WO 2023077731 A1 WO2023077731 A1 WO 2023077731A1 CN 2022087215 W CN2022087215 W CN 2022087215W WO 2023077731 A1 WO2023077731 A1 WO 2023077731A1
Authority
WO
WIPO (PCT)
Prior art keywords
query
node
optimization method
nodes
query task
Prior art date
Application number
PCT/CN2022/087215
Other languages
English (en)
French (fr)
Inventor
鄂海红
宋美娜
梁静茹
刘雨薇
魏秋实
Original Assignee
北京邮电大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京邮电大学
Publication of WO2023077731A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Definitions

  • The present application relates to the field of large-scale graph data querying, and in particular to a query task optimization method, apparatus and storage medium based on large-scale graph data for science and technology consulting.
  • Querying graph data is one of the most basic problems in the field of knowledge graphs, so efficient query processing on large-scale graph data is usually required so that users can obtain query results quickly.
  • At present, graph partitioning techniques for graph query optimization can split graph data across multiple servers, but the communication cost and processing overhead of the servers are relatively high.
  • In addition, most query optimization techniques are designed for social network graph data and are not suitable for the graph data with complex topology found in science and technology consulting scenarios. Therefore, query tasks on large-scale graph data for science and technology consulting need further optimization.
  • The present application provides a query task optimization method, system and storage medium based on large-scale graph data for science and technology consulting.
  • An embodiment of the first aspect of the present application proposes a query task optimization method based on large-scale graph data for science and technology consulting, including:
  • obtaining the identifier of the query task;
  • selecting a corresponding query optimization method according to the identifier of the query task, wherein the query optimization methods include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement, and materialized views;
  • querying the graph database by using the selected query optimization method, and outputting a query result.
  • An embodiment of the second aspect of the present application proposes a query task optimization system based on large-scale graph data for science and technology consulting, including:
  • an acquisition module configured to obtain the identifier of the query task;
  • a selection module configured to select a corresponding query optimization method according to the identifier of the query task, wherein the query optimization methods include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement, and materialized views;
  • a display module configured to query the graph database by using the query optimization method and output a query result.
  • An embodiment of the third aspect of the present application provides a computer storage medium storing computer-executable instructions; when the computer-executable instructions are executed by a processor, the method described in the first aspect above can be implemented.
  • An embodiment of the fourth aspect of the present application provides a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the method described in the first aspect above can be implemented.
  • An embodiment of the fifth aspect of the present application provides a computer program product including a computer program; when the computer program is executed by a processor, the method described in the first aspect above is implemented.
  • In the query task optimization method, system and storage medium based on large-scale graph data for science and technology consulting provided by the present disclosure, the identifier of the query task is obtained and a corresponding query optimization method is selected according to that identifier, wherein the query optimization methods include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement, and materialized views; the graph database is then queried by using the selected query optimization method, and the query result is output.
  • Thus, in the method proposed by the present disclosure, the corresponding query optimization method can be selected according to the identifier of the query task, which improves the flexibility of querying.
  • At the same time, the query optimization methods improve the efficiency of query tasks in different scenarios of large-scale graph data for science and technology consulting, reduce the complexity of query computation, and shorten the time spent on queries.
  • FIG. 1 is a schematic flowchart of a query task optimization method based on large-scale graph data for science and technology consulting according to an embodiment of the present application;
  • FIG. 2 is a schematic structural diagram of a query task optimization system based on large-scale graph data for science and technology consulting according to an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a query task optimization method based on large-scale graph data for science and technology consulting according to an embodiment of the present application. As shown in FIG. 1, the method may include steps 101 to 103.
  • Step 101: Obtain the identifier of the query task.
  • The query task may include organizations, talent, and industry chains.
  • The organization may be the ID of a company, and the talent may be the personnel
  • The identifier of the query task may be obtained according to the content of the query task. For example, in an embodiment of the present disclosure, if the query task is to view the companies and patents associated with a certain person, the identifier of that query task is obtained.
  • Step 102: Select a corresponding query optimization method according to the identifier of the query task, wherein the query optimization methods include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement, and materialized views.
  • Different identifiers correspond to different query optimization methods, and the corresponding method may be selected according to the identifier of the query task.
  • The query optimization methods in the embodiments of the present disclosure may include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement, and materialized views.
  • Adjusting the graph traversal expansion order strategy takes the actual query scenarios of science and technology consulting into account and designs a bidirectional BFS expansion order: the search starts from the start point and the end point at the same time, and once one direction reaches a position already searched by the other direction (that is, some state has been visited from both directions), a shortest path connecting the start point and the end point has been found. Because the two searches converge on a point in the middle of the shortest path and meet near its midpoint, the number of nodes visited by the bidirectional BFS is on the order of $2 \cdot n^{m/2+1}$.
  • Adjusting the graph traversal expansion order strategy may include the following steps:
  • S14: If s1 or s2 is not empty, continue with step S15; otherwise, go to step S111;
  • S15: s is the set of nodes expanded in the current layer;
  • S18: Check each node in next_nodes; if the node is in set s, a path has been found, and step S111 is performed;
  • For example, the query task is given an industry chain tag tag and personnel information person; starting from tag, it queries the sub-industry-chain tags, the patents belonging to those sub-industry-chain tags, the companies to which the patents belong, and the people associated with those companies through employment or investment.
  • In the science and technology consulting knowledge graph that has already been built, the path industry chain tag - sub-industry-chain tag - patent generates 146,284 intermediate patent nodes. If a one-way BFS then expands these 146,284 patents, the intermediate results explode and query performance suffers severely.
  • With the bidirectional BFS expansion order strategy, the search runs from both the start point and the end point, that is, along the two directions industry chain tag - sub-industry-chain tag - patent and person - company - patent. The 146,284 intermediate patent nodes generated by the first direction are stored in a hash table; the search then starts in reverse from the person node and generates a set of results along the person - company - patent path; finally this set is intersected with the hash table to find the qualifying paths that connect the start point and the end point, and the time complexity is only O(n).
  • Cardinality is the number of unique values after deduplication; for example, column cardinality is the number of distinct values contained in a column. This number directly affects how well the model compresses and how fast the engine scans. Cardinality should therefore be reduced as far as possible to shorten query time.
  • Cardinality reduction may include the following steps:
  • S22: next_nodes is the set of nodes for expanding the next layer, initialized to the next-layer neighbor nodes of the source entity node expanded according to the pattern;
  • S26: size is the number of nodes currently in the queue;
  • S27: If size is not zero, continue with step S28; otherwise go to step S211;
  • S211: If the pattern pattern has been fully traversed, continue with step S212; otherwise go to step S25;
  • The query task in the science and technology consulting scenario is: given a person person, find the companies associated with that person, the patents owned by those companies, and the industry chain tags to which the patents belong, and output the duplicate-free (company, patent, industry chain tag) tuples that match the path.
  • The embodiment of the present disclosure uses an early-distinct strategy to reduce cardinality, moving the deduplication operation to immediately after the duplicate nodes are generated; that is, deduplication is performed as soon as the traversal goes from the "person" node to the "company" nodes, which reduces 201 intermediate company nodes containing duplicates to 131 distinct company nodes, thereby reducing the number of intermediate nodes generated and the subsequent traversal time.
  • Target data must be obtained and filtered according to business conditions; this process is the filtering step of a data query.
  • Pattern advancement may include the following steps:
  • S32: Initialize the pattern-advancement set filter_nodeset;
  • S33: q is the node queue, initialized to the input source entity node;
  • S36: If size is not zero, continue with step S37; otherwise go to step S312;
  • The query task in the science and technology consulting scenario is: given industry chain tag information tag, start from tag and query its associated companies and the patents owned by those companies, subject to one filter condition: the company must have no abnormal operation record, that is, the company - abnormal operation pattern must not exist. The output is the duplicate-free (company, patent) tuples.
  • Pattern advancement in the embodiments of the present disclosure replaces the traversal operation in the pattern with an efficient set lookup.
  • The company - abnormal operation pattern is evaluated in advance: the IDs of the companies connected to an "abnormal operation" node are put into a hash table, and the filter condition then checks whether a "company" node exists in that hash table. If the "company" node is not in the hash table, the company has no abnormal operation record. Only 3,292 set lookups, each of O(1) time complexity, are needed, which improves query efficiency.
  • Materialized views are mainly used to precompute and store the results of time-consuming operations such as table joins or aggregations, so that these operations can be avoided when query tasks are executed later and query results can be obtained quickly.
  • Materialized views greatly improve query performance for hot questions that repeatedly reuse the same query results, because the data can be read quickly from the materialized view.
  • The query task in the science and technology consulting scenario is: given industry chain tag information tag, start from tag and query its sub-industry-chain tags and the companies belonging to those sub-industry-chain tags; then query the paths that start at a sub-industry-chain tag, pass through a patent and end at a company node, and count the company information and the number of patents matching this pattern. Querying each company separately is very time-consuming.
  • The materialized view method in the embodiment of the present disclosure can obtain the patents owned by each company in advance, determine the industry chain tag to which each patent belongs and aggregate them, and record the resulting number of patents under each industry chain tag as an attribute of the "company - industry chain tag" edge; this precomputed materialized view improves query efficiency.
  • Step 103: Query the graph database by using the query optimization method, and output the query result.
  • The graph database is queried by using the query optimization method selected in step 102 above, and the query result is output.
  • The query result may include the association relationships between nodes in the graph database.
  • In the query task optimization method based on large-scale graph data for science and technology consulting provided by the present disclosure, the identifier of the query task is obtained and a corresponding query optimization method is selected according to that identifier, wherein the query optimization methods include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement, and materialized views; the graph database is then queried by using the selected query optimization method, and the query result is output.
  • Thus, in the method proposed by the present disclosure, the corresponding query optimization method can be selected according to the identifier of the query task, which improves the flexibility of querying.
  • At the same time, the query optimization methods improve the efficiency of query tasks in different scenarios of large-scale graph data for science and technology consulting, reduce the complexity of query computation, and shorten the time spent on queries.
  • FIG. 2 is a schematic structural diagram of a query task optimization system based on large-scale graph data for science and technology consulting according to an embodiment of the present application. As shown in FIG. 2, the system may include:
  • an acquisition module 201 configured to obtain the identifier of the query task;
  • a selection module 202 configured to select a corresponding query optimization method according to the identifier of the query task, wherein the query optimization methods include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement, and materialized views;
  • a display module 203 configured to query the graph database by using the query optimization method and output the query result.
  • The query task may include organizations, talent, and industry chains.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

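In the query task optimization method, system and storage medium based on large-scale graph data for science and technology consulting provided by the present disclosure, the identifier of a query task is obtained and a corresponding query optimization method is selected according to that identifier, wherein the query optimization methods include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement, and materialized views; the graph database is then queried by using the selected query optimization method, and the query result is output.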

Description

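Query task optimization method based on large-scale graph data for science and technology consulting
Cross-Reference to Related Applications
The present application is based on and claims priority to Chinese patent application No. 202111316037.1, filed on November 8, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of large-scale graph data querying, and in particular to a query task optimization method, apparatus and storage medium based on large-scale graph data for science and technology consulting.
Background
Querying graph data is one of the most basic problems in the field of knowledge graphs, so efficient query processing on large-scale graph data is usually required so that users can obtain query results quickly.
At present, graph partitioning techniques for graph query optimization can split graph data across multiple servers, but the communication cost and processing overhead of the servers are relatively high. In addition, most query optimization techniques are designed for social network graph data and are not suitable for the graph data with complex topology found in science and technology consulting scenarios. Therefore, query tasks on large-scale graph data for science and technology consulting need further optimization.
Summary
The present application provides a query task optimization method, system and storage medium based on large-scale graph data for science and technology consulting.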
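An embodiment of the first aspect of the present application proposes a query task optimization method based on large-scale graph data for science and technology consulting, including:
obtaining the identifier of a query task;
selecting a corresponding query optimization method according to the identifier of the query task, wherein the query optimization methods include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement, and materialized views;
querying the graph database by using the query optimization method, and outputting a query result.
An embodiment of the second aspect of the present application proposes a query task optimization system based on large-scale graph data for science and technology consulting, including:
an acquisition module, configured to obtain the identifier of a query task;
a selection module, configured to select a corresponding query optimization method according to the identifier of the query task, wherein the query optimization methods include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement, and materialized views;
a display module, configured to query the graph database by using the query optimization method and output a query result.
An embodiment of the third aspect of the present application provides a computer storage medium storing computer-executable instructions; when the computer-executable instructions are executed by a processor, the method described in the first aspect above can be implemented.
An embodiment of the fourth aspect of the present application provides a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the method described in the first aspect above can be implemented.
An embodiment of the fifth aspect of the present application provides a computer program product including a computer program; when the computer program is executed by a processor, the method described in the first aspect above is implemented.
In the query task optimization method, system and storage medium based on large-scale graph data for science and technology consulting provided by the present disclosure, the identifier of the query task is obtained and a corresponding query optimization method is selected according to that identifier, wherein the query optimization methods include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement, and materialized views; the graph database is then queried by using the selected query optimization method, and the query result is output. Thus, in the method proposed by the present disclosure, the corresponding query optimization method can be selected according to the identifier of the query task, which improves the flexibility of querying. At the same time, the query optimization methods improve the efficiency of query tasks in different scenarios of large-scale graph data for science and technology consulting, reduce the complexity of query computation, and shorten the time spent on queries.
Additional aspects and advantages of the present application will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the present application.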
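Brief Description of the Drawings
The above and/or additional aspects and advantages of the present application will become apparent and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic flowchart of a query task optimization method based on large-scale graph data for science and technology consulting according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a query task optimization system based on large-scale graph data for science and technology consulting according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present application and should not be construed as limiting it.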
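The query task optimization method and system based on large-scale graph data for science and technology consulting according to embodiments of the present application are described below with reference to the accompanying drawings.
FIG. 1 is a schematic flowchart of a query task optimization method based on large-scale graph data for science and technology consulting according to an embodiment of the present application. As shown in FIG. 1, the method may include steps 101 to 103.
Step 101: Obtain the identifier of the query task.
It should be noted that, in the embodiments of the present disclosure, the query task may include organizations, talent, and industry chains. In the embodiments of the present disclosure, the organization may be the ID of a company, and the talent may be the personnel
In the embodiments of the present disclosure, the identifier of the query task may be obtained according to the content of the query task. For example, if the query task is to view the companies and patents associated with a certain person, the identifier of that query task is obtained.
Step 102: Select a corresponding query optimization method according to the identifier of the query task, wherein the query optimization methods include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement, and materialized views.
In the embodiments of the present disclosure, different identifiers correspond to different query optimization methods, and the corresponding method may be selected according to the identifier of the query task.
The query optimization methods in the embodiments of the present disclosure may include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement, and materialized views.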
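Steps 101 and 102 amount to a lookup from a task identifier to one of the four optimization methods. The following is a minimal sketch of that selection, not the applicants' implementation: the identifier strings, the `METHOD_BY_TASK_ID` table and the `select_optimization` function are all hypothetical, since the application does not say how identifiers are encoded or which identifier maps to which method.

```python
from enum import Enum


class Optimization(Enum):
    """The four query optimization methods named in the application."""
    TRAVERSAL_ORDER = "adjust graph traversal expansion order"
    CARDINALITY_REDUCTION = "cardinality reduction"
    PATTERN_ADVANCEMENT = "pattern advancement"
    MATERIALIZED_VIEW = "materialized view"


# Hypothetical identifiers: the source does not define their format or mapping.
METHOD_BY_TASK_ID = {
    "tag_to_person_path": Optimization.TRAVERSAL_ORDER,
    "person_to_tag_distinct": Optimization.CARDINALITY_REDUCTION,
    "tag_to_company_filtered": Optimization.PATTERN_ADVANCEMENT,
    "tag_company_patent_count": Optimization.MATERIALIZED_VIEW,
}


def select_optimization(task_id: str) -> Optimization:
    """Step 102: return the optimization method registered for a task identifier."""
    try:
        return METHOD_BY_TASK_ID[task_id]
    except KeyError:
        raise ValueError(f"no optimization method registered for {task_id!r}") from None
```

A table-driven mapping keeps the selection separate from the query code, so a new task type only needs a new entry.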
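Further, in the embodiments of the present disclosure, adjusting the graph traversal expansion order strategy takes the actual query scenarios of science and technology consulting into account and designs a bidirectional BFS expansion order: the search starts from the start point and the end point at the same time, and once one direction reaches a position already searched by the other direction (that is, some state has been visited from both directions), a shortest path connecting the start point and the end point has been found. Because the two searches converge on a point in the middle of the shortest path and meet near its midpoint, the number of nodes visited by the bidirectional BFS is on the order of $2 \cdot n^{m/2+1}$.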
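The order of magnitude quoted above can be checked with a short count. The source does not define n and m; the derivation below assumes that n is the number of neighbors expanded per node (with n ≥ 2) and m is the length of the path between the start point and the end point, so that each of the two searches only has to reach depth m/2.

```latex
\underbrace{\sum_{i=0}^{m/2} n^{i}}_{\text{one direction}}
  = \frac{n^{m/2+1}-1}{n-1} < n^{m/2+1}
\quad\Longrightarrow\quad
N_{\text{bidirectional}} < 2\,n^{m/2+1},
\qquad\text{whereas}\qquad
N_{\text{one-way}} = \sum_{i=0}^{m} n^{i} \ge n^{m}.
```

Under these assumptions, with n = 10 and m = 6 a one-way search may touch at least 10^6 nodes, while the two half-searches together stay below 2 · 10^4.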
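Specifically, in the embodiments of the present disclosure, adjusting the graph traversal expansion order strategy may include the following steps:
S11: Input the source entity node and the target entity node, together with the intermediate entity node type mtype and the path pattern pattern;
S12: Initialize two node sets s1 and s2, where s1 is initialized to the input source entity node and s2 to the input target entity node;
S13: Use pattern and mtype to compute the expansion order of the bidirectional BFS, with pattern1 denoting the expansion order from the left end and pattern2 the expansion order from the right end;
S14: If s1 or s2 is not empty, continue with step S15; otherwise, go to step S111;
S15: s is the set of nodes expanded in the current layer;
S16: Swap s1 and s2, so that expansion alternates between the left end and the right end;
S17: For each node node in set s1, expand the next-layer neighbor nodes of node according to the pattern, denoted next_nodes;
S18: Check each node in next_nodes; if the node is in set s, a path has been found, and step S111 is performed;
S19: Add all nodes next_nodes expanded in this layer to set s, copy set s to s1, and store the path;
S110: Repeat from step S14;
S111: End.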
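The steps S11 to S111 describe an alternating, layer-by-layer search from both ends. The sketch below illustrates that idea in simplified form and is not a transcription of the claimed steps: it assumes an undirected graph given as an adjacency dictionary `adj`, ignores the typed `pattern`/`mtype` expansion order of step S13, and uses hypothetical names throughout.

```python
def _walk_to_root(parents, node):
    """Follow parent links from node back to the root of one search direction."""
    path = []
    while node is not None:
        path.append(node)
        node = parents[node]
    return path


def bidirectional_bfs(adj, source, target):
    """Return a fewest-edges path from source to target, or None if none exists.

    adj maps each node to an iterable of neighbors; every edge is assumed
    to be listed in both directions (undirected graph).
    """
    if source == target:
        return [source]
    # parents[0] belongs to the search from source, parents[1] to the one from target.
    # Each dict doubles as the visited set of its direction.
    parents = [{source: None}, {target: None}]
    frontiers = [[source], [target]]
    side = 0
    while frontiers[0] and frontiers[1]:
        mine, other = parents[side], parents[1 - side]
        next_frontier = []
        for node in frontiers[side]:
            for neighbor in adj.get(node, ()):
                if neighbor in mine:
                    continue
                mine[neighbor] = node
                if neighbor in other:  # the two searches have met
                    left = _walk_to_root(parents[0], neighbor)[::-1]
                    right = _walk_to_root(parents[1], neighbor)[1:]
                    return left + right
                next_frontier.append(neighbor)
        frontiers[side] = next_frontier
        side = 1 - side  # alternate between the two ends, as in step S16
    return None
```

For instance, on a chain tag - subtag - patent - company - person the two searches meet at the patent node after two layers from each end, instead of one search walking all four hops.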
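For example, in the embodiments of the present disclosure, the query task is given an industry chain tag tag and personnel information person; starting from tag, it queries the sub-industry-chain tags, the patents belonging to those sub-industry-chain tags, the companies to which the patents belong, and the people associated with those companies through employment or investment. In the science and technology consulting knowledge graph that has already been built, the path industry chain tag - sub-industry-chain tag - patent generates 146,284 intermediate patent nodes. If a one-way BFS then expands these 146,284 patents, the intermediate results explode and query performance suffers severely.
With the bidirectional BFS expansion order strategy of the embodiments of the present disclosure, the search runs from both the start point and the end point, that is, along the two directions industry chain tag - sub-industry-chain tag - patent and person - company - patent. The 146,284 intermediate patent nodes generated by the first direction are stored in a hash table; the search then starts in reverse from the person node and generates a set of results along the person - company - patent path; finally this set is intersected with the hash table to find the qualifying paths that connect the start point and the end point, and the time complexity is only O(n).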
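The hash-table intersection in this example can be sketched as follows. This is an illustration under assumed data structures, not the applicants' code: the four dictionaries and the function name `connecting_paths` are hypothetical stand-ins for the typed edges of the knowledge graph.

```python
def connecting_paths(tag, person, tag_subtags, subtag_patents,
                     person_companies, company_patents):
    """Find (patent, company) pairs that link an industry chain tag to a person.

    Left side:  tag -> sub-tags -> patents, collected once into a hash set.
    Right side: person -> companies -> patents, probed against that set.
    """
    left_patents = set()
    for subtag in tag_subtags.get(tag, ()):
        left_patents.update(subtag_patents.get(subtag, ()))

    paths = []
    for company in person_companies.get(person, ()):
        for patent in company_patents.get(company, ()):
            if patent in left_patents:  # expected O(1) hash lookup
                paths.append((patent, company))
    return paths
```

The patents reached from the tag are never expanded further; they are only probed, so the work grows with the sizes of the two half-results rather than with their product.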
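Further, in the embodiments of the present disclosure, cardinality is the number of unique values after deduplication; for example, column cardinality is the number of distinct values contained in a column. This number directly affects how well the model compresses and how fast the engine scans. Cardinality should therefore be reduced as far as possible to shorten query time.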
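In the embodiments of the present disclosure, cardinality reduction may include the following steps:
S21: Input the source entity node and the path pattern pattern;
S22: next_nodes is the set of nodes for expanding the next layer, initialized to the next-layer neighbor nodes of the source entity node expanded according to the pattern;
S23: Deduplicate the nodes in next_nodes;
S24: q is the node queue, initialized to next_nodes;
S25: If q is not empty, continue with step S26; otherwise go to step S212;
S26: size is the number of nodes currently in the queue;
S27: If size is not zero, continue with step S28; otherwise go to step S211;
S28: Pop the node node from the current queue;
S29: Expand the next-layer neighbor nodes next_nodes of node according to the pattern;
S210: Add next_nodes to queue q;
S211: If the pattern pattern has been fully traversed, continue with step S212; otherwise go to step S25;
S212: End.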
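The effect of deduplicating early can be seen in a small layer-by-layer expansion. The sketch below is a simplification of steps S21 to S212, with assumptions stated plainly: it deduplicates every layer rather than only the first one (step S23), it represents the path pattern as a list of hypothetical step labels, and `adj_by_step` is an assumed nested dictionary whose neighbor lists may contain repeats when two nodes are linked by several relationships.

```python
def expand_pattern(adj_by_step, source, pattern, early_distinct=True):
    """Follow `pattern` from `source`, one layer per pattern step.

    adj_by_step[step][node] lists the neighbors reached by that step and
    may contain the same neighbor more than once.
    Returns (distinct nodes of the final layer, number of node expansions).
    """
    layer = [source]
    expansions = 0
    for step in pattern:
        neighbors_of = adj_by_step.get(step, {})
        next_nodes = []
        for node in layer:
            expansions += 1
            next_nodes.extend(neighbors_of.get(node, ()))
        if early_distinct:
            # Order-preserving deduplication before the layer is expanded further.
            next_nodes = list(dict.fromkeys(next_nodes))
        layer = next_nodes
    return list(dict.fromkeys(layer)), expansions
```

Both settings return the same distinct result; the difference is how many intermediate nodes have to be expanded on the way, which is the quantity the early-distinct strategy targets.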
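For example, in the embodiments of the present disclosure, in the knowledge graph of an actual science and technology consulting scenario, two nodes may be connected by multiple edges or by edges of different types. For instance, three relationships exist between a "company" node and a "person" node: "company - investor", "company - publicly disclosed shareholder - person" and "company - employee". Therefore, when searching for the "person" nodes adjacent to a given company, the same "person" node may be reached through more than one of these three relationships, which produces duplicate nodes. Duplicate, redundant nodes increase the cardinality: when a duplicated "person" node goes on to search for its own adjacent nodes, the same traversal is repeated, which increases the number of intermediate nodes and hence the query time. Therefore, the embodiments of the present disclosure use an early-distinct optimization strategy to reduce cardinality.
Specifically, in the embodiments of the present disclosure, the query task in the science and technology consulting scenario is: given a person person, find the companies associated with that person, the patents owned by those companies, and the industry chain tags to which the patents belong, and output the duplicate-free (company, patent, industry chain tag) tuples that match the path. The embodiment of the present disclosure uses the early-distinct strategy to reduce cardinality, moving the deduplication operation to immediately after the duplicate nodes are generated; that is, deduplication is performed as soon as the traversal goes from the "person" node to the "company" nodes, which reduces 201 intermediate company nodes containing duplicates to 131 distinct company nodes, thereby reducing the number of intermediate nodes generated and the subsequent traversal time.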
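Further, in the embodiments of the present disclosure, target data must be obtained and filtered according to business conditions; this process is the filtering step of a data query. Large-scale graph query tasks involve a large number of filtering operations, and the filter conditions used in them are a necessary step in obtaining accurate data, for example basic operators (<, >, =, !=), logical operations (AND, OR, NOT) and pattern matching.
In the embodiments of the present disclosure, pattern advancement may include the following steps:
S31: Input the source entity node, the path pattern pattern and the filter pattern filter_pattern;
S32: Initialize the pattern-advancement set filter_nodeset;
S33: q is the node queue, initialized to the input source entity node;
S34: If q is not empty, continue with step S35; otherwise go to step S313;
S35: Initialize size, the number of nodes currently in the queue;
S36: If size is not zero, continue with step S37; otherwise go to step S312;
S37: Pop the node node from the current queue;
S38: Expand the next-layer neighbor nodes next_nodes of node according to the pattern;
S39: Determine whether the node type of the current next_nodes is the node type of filter_nodeset; if so, continue with step S310; otherwise go to step S311;
S310: Traverse each node next_node in next_nodes; if next_node is in the set filter_nodeset, filter it out;
S311: Add next_nodes to queue q;
S312: If the pattern pattern has been fully traversed, continue with step S313; otherwise go to step S35;
S313: End.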
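The core of pattern advancement is to evaluate the filter pattern once, store the matching nodes in a set, and replace the per-node traversal with a membership test. The following sketch shows this for a tag - company - patent query with a "no abnormal operation" filter; it is an assumption-laden illustration, and the dictionaries, the `abnormal_edges` list and the function name are hypothetical.

```python
def companies_and_patents(tag, tag_companies, company_patents, abnormal_edges):
    """Return distinct (company, patent) tuples for companies with no abnormal-operation record.

    abnormal_edges is a list of (company, abnormal_record) pairs. It is scanned
    once, up front, to build filter_nodeset (the role of step S32).
    """
    filter_nodeset = {company for company, _record in abnormal_edges}

    results = []
    for company in dict.fromkeys(tag_companies.get(tag, ())):  # distinct, order kept
        if company in filter_nodeset:  # set lookup instead of traversing the filter pattern
            continue
        for patent in dict.fromkeys(company_patents.get(company, ())):
            results.append((company, patent))
    return results
```

Without the precomputed set, each candidate company would need its own traversal to an "abnormal operation" node; with it, each candidate costs one expected constant-time lookup.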
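For example, in the embodiments of the present disclosure, the query task in the science and technology consulting scenario is: given industry chain tag information tag, start from tag and query its associated companies and the patents owned by those companies, subject to one filter condition: the company must have no abnormal operation record, that is, the company - abnormal operation pattern must not exist. The output is the duplicate-free (company, patent) tuples.
Specifically, pattern advancement in the embodiments of the present disclosure replaces the traversal operation in the pattern with an efficient set lookup. The company - abnormal operation pattern is evaluated in advance: the IDs of the companies connected to an "abnormal operation" node are put into a hash table, and the filter condition then checks whether a "company" node exists in that hash table. If the "company" node is not in the hash table, the company has no abnormal operation record. Only 3,292 set lookups, each of O(1) time complexity, are needed, which improves query efficiency.
Further, in the embodiments of the present disclosure, materialized views are mainly used to precompute and store the results of time-consuming operations such as table joins or aggregations, so that these operations can be avoided when query tasks are executed later and query results can be obtained quickly. In science and technology consulting scenarios, materialized views greatly improve query performance for hot questions that repeatedly reuse the same query results, because the data can be read quickly from the materialized view.
For example, in the embodiments of the present disclosure, the query task in the science and technology consulting scenario is: given industry chain tag information tag, start from tag and query its sub-industry-chain tags and the companies belonging to those sub-industry-chain tags; then query the paths that start at a sub-industry-chain tag, pass through a patent and end at a company node, and count the company information and the number of patents matching this pattern. Querying each company separately is very time-consuming. The materialized view method in the embodiments of the present disclosure, however, can obtain the patents owned by each company in advance, determine the industry chain tag to which each patent belongs and aggregate them, and record the resulting number of patents under each industry chain tag as an attribute of the "company - industry chain tag" edge; this precomputed materialized view improves query efficiency.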
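The precomputation described in this example can be sketched as an aggregation that is run once and then read many times. This is an illustration only: a `Counter` keyed by (company, tag) stands in for the attribute on the "company - industry chain tag" edge, and the input dictionaries and function names are hypothetical.

```python
from collections import Counter


def build_patent_count_view(company_patents, patent_tags):
    """Precompute (company, tag) -> number of that company's patents under the tag."""
    view = Counter()
    for company, patents in company_patents.items():
        for patent in set(patents):  # count each patent once per company
            for tag in set(patent_tags.get(patent, ())):
                view[(company, tag)] += 1
    return view


def patent_counts_for_tag(view, tag):
    """Read the precomputed counts for one tag without touching patent nodes."""
    return {company: count for (company, t), count in view.items() if t == tag}
```

The view has to be rebuilt or updated when patents or tags change; the application does not describe a refresh policy, so none is assumed here.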
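Step 103: Query the graph database by using the query optimization method, and output the query result.
In the embodiments of the present disclosure, the graph database is queried by using the query optimization method selected in step 102 above, and the query result is output. The query result may include the association relationships between nodes in the graph database.
In the query task optimization method based on large-scale graph data for science and technology consulting provided by the present disclosure, the identifier of the query task is obtained and a corresponding query optimization method is selected according to that identifier, wherein the query optimization methods include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement, and materialized views; the graph database is then queried by using the selected query optimization method, and the query result is output. Thus, in the method proposed by the present disclosure, the corresponding query optimization method can be selected according to the identifier of the query task, which improves the flexibility of querying. At the same time, the query optimization methods improve the efficiency of query tasks in different scenarios of large-scale graph data for science and technology consulting, reduce the complexity of query computation, and shorten the time spent on queries.
FIG. 2 is a schematic structural diagram of a query task optimization system based on large-scale graph data for science and technology consulting according to an embodiment of the present application. As shown in FIG. 2, the system may include:
an acquisition module 201, configured to obtain the identifier of the query task;
a selection module 202, configured to select a corresponding query optimization method according to the identifier of the query task, wherein the query optimization methods include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement, and materialized views;
a display module 203, configured to query the graph database by using the query optimization method and output the query result.
In the embodiments of the present disclosure, the query task may include organizations, talent, and industry chains.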
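In the description of this specification, descriptions referring to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" mean that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict each other, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.
Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment or portion of code that includes one or more executable instructions for implementing steps of a custom logical function or process, and the scope of the preferred embodiments of the present application includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present application belong.
Although embodiments of the present application have been shown and described above, it is to be understood that the above embodiments are exemplary and are not to be construed as limiting the present application; those of ordinary skill in the art may change, modify, replace and vary the above embodiments within the scope of the present application.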

Claims (10)

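  1. A query task optimization method based on large-scale graph data for science and technology consulting, characterized in that the method comprises:
    obtaining the identifier of a query task;
    selecting a corresponding query optimization method according to the identifier of the query task, wherein the query optimization methods include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement, and materialized views;
    querying the graph database by using the query optimization method, and outputting a query result.
  2. The query task optimization method according to claim 1, characterized in that the query task includes organizations, talent, and industry chains.
  3. The query task optimization method according to claim 1, characterized in that adjusting the graph traversal expansion order strategy comprises:
    S11: Input the source entity node and the target entity node, together with the intermediate entity node type mtype and the path pattern pattern;
    S12: Initialize two node sets s1 and s2, where s1 is initialized to the input source entity node and s2 to the input target entity node;
    S13: Use pattern and mtype to compute the expansion order of the bidirectional BFS, with pattern1 denoting the expansion order from the left end and pattern2 the expansion order from the right end;
    S14: If s1 or s2 is not empty, continue with step S15; otherwise, go to step S111;
    S15: s is the set of nodes expanded in the current layer;
    S16: Swap s1 and s2, so that expansion alternates between the left end and the right end;
    S17: For each node node in set s1, expand the next-layer neighbor nodes of node according to the pattern, denoted next_nodes;
    S18: Check each node in next_nodes; if the node is in set s, a path has been found, and step S111 is performed;
    S19: Add all nodes next_nodes expanded in this layer to set s, copy set s to s1, and store the path;
    S110: Repeat from step S14;
    S111: End.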
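  4. The query task optimization method according to claim 1, characterized in that the cardinality reduction comprises:
    S21: Input the source entity node and the path pattern pattern;
    S22: next_nodes is the set of nodes for expanding the next layer, initialized to the next-layer neighbor nodes of the source entity node expanded according to the pattern;
    S23: Deduplicate the nodes in next_nodes;
    S24: q is the node queue, initialized to next_nodes;
    S25: If q is not empty, continue with step S26; otherwise go to step S212;
    S26: size is the number of nodes currently in the queue;
    S27: If size is not zero, continue with step S28; otherwise go to step S211;
    S28: Pop the node node from the current queue;
    S29: Expand the next-layer neighbor nodes next_nodes of node according to the pattern;
    S210: Add next_nodes to queue q;
    S211: If the pattern pattern has been fully traversed, continue with step S212; otherwise go to step S25;
    S212: End.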
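  5. The query task optimization method according to claim 1, characterized in that the pattern advancement comprises:
    S31: Input the source entity node, the path pattern pattern and the filter pattern filter_pattern;
    S32: Initialize the pattern-advancement set filter_nodeset;
    S33: q is the node queue, initialized to the input source entity node;
    S34: If q is not empty, continue with step S35; otherwise go to step S313;
    S35: Initialize size, the number of nodes currently in the queue;
    S36: If size is not zero, continue with step S37; otherwise go to step S312;
    S37: Pop the node node from the current queue;
    S38: Expand the next-layer neighbor nodes next_nodes of node according to the pattern;
    S39: Determine whether the node type of the current next_nodes is the node type of filter_nodeset; if so, continue with step S310; otherwise go to step S311;
    S310: Traverse each node next_node in next_nodes; if next_node is in the set filter_nodeset, filter it out;
    S311: Add next_nodes to queue q;
    S312: If the pattern pattern has been fully traversed, continue with step S313; otherwise go to step S35;
    S313: End.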
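  6. A query task optimization system based on large-scale graph data for science and technology consulting, characterized in that the system comprises:
    an acquisition module, configured to obtain the identifier of a query task;
    a selection module, configured to select a corresponding query optimization method according to the identifier of the query task, wherein the query optimization methods include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement, and materialized views;
    a display module, configured to query the graph database by using the query optimization method and output a query result.
  7. The query task optimization system according to claim 6, characterized in that the query task includes organizations, talent, and industry chains.
  8. A computer storage medium, wherein the computer storage medium stores computer-executable instructions; when executed by a processor, the computer-executable instructions implement the following steps:
    obtaining the identifier of a query task;
    selecting a corresponding query optimization method according to the identifier of the query task, wherein the query optimization methods include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement, and materialized views;
    querying the graph database by using the query optimization method, and outputting a query result.
  9. A computer device, characterized by comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the following steps:
    obtaining the identifier of a query task;
    selecting a corresponding query optimization method according to the identifier of the query task, wherein the query optimization methods include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement, and materialized views;
    querying the graph database by using the query optimization method, and outputting a query result.
  10. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the following steps:
    obtaining the identifier of a query task;
    selecting a corresponding query optimization method according to the identifier of the query task, wherein the query optimization methods include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advancement, and materialized views;
    querying the graph database by using the query optimization method, and outputting a query result.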
PCT/CN2022/087215 2021-11-08 2022-04-15 Query task optimization method based on large-scale graph data for science and technology consulting WO2023077731A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111316037.1A CN114020781B (zh) 2021-11-08 2021-11-08 Query task optimization method based on large-scale graph data for science and technology consulting
CN202111316037.1 2021-11-08

Publications (1)

Publication Number Publication Date
WO2023077731A1 (zh) 2023-05-11

Family

ID=80062381

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/087215 WO2023077731A1 (zh) 2021-11-08 2022-04-15 Query task optimization method based on large-scale graph data for science and technology consulting

Country Status (2)

Country Link
CN (1) CN114020781B (zh)
WO (1) WO2023077731A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118314245A (zh) * 2024-06-07 2024-07-09 浙江省水利河口研究院(浙江省海洋规划设计研究院) Map patch area adjustment method and apparatus based on an optimization strategy, and electronic device

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
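CN114020781B (zh) * 2021-11-08 2024-05-31 北京邮电大学 Query task optimization method based on large-scale graph data for science and technology consulting
CN114880504B (zh) * 2022-07-08 2023-03-31 支付宝(杭州)信息技术有限公司 Graph data query method, apparatus and device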

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020039198A1 (en) * 2018-08-21 2020-02-27 Shapecast Limited Machine learning optimisation method
CN111670433A (zh) * 2018-01-31 2020-09-15 易享信息技术有限公司 Query optimizer constraints
US20210182315A1 (en) * 2019-12-11 2021-06-17 Oracle International Corporation Hybrid in-memory bfs-dfs approach for computing graph queries against heterogeneous graphs inside relational database systems
CN114020781A (zh) * 2021-11-08 2022-02-08 北京邮电大学 Query task optimization method based on large-scale graph data for science and technology consulting

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8326825B2 (en) * 2010-11-05 2012-12-04 Microsoft Corporation Automated partitioning in parallel database systems
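CN107291807B (zh) * 2017-05-16 2020-10-16 中国科学院计算机网络信息中心 SPARQL query optimization method based on graph traversal
CN108038136A (zh) * 2017-11-23 2018-05-15 上海斯睿德信息技术有限公司 Method for building an enterprise knowledge graph based on a graph model, and graphical query method
CN110941741A (zh) * 2018-09-21 2020-03-31 百度在线网络技术(北京)有限公司 Path retrieval processing method, apparatus, server and storage medium for graph data
CN110795456B (zh) * 2019-10-28 2022-06-28 北京百度网讯科技有限公司 Graph query method, apparatus, computer device and storage medium
CN111241350B (zh) * 2020-01-07 2024-02-02 平安科技(深圳)有限公司 Graph data query method, apparatus, computer device and storage medium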

Also Published As

Publication number Publication date
CN114020781A (zh) 2022-02-08
CN114020781B (zh) 2024-05-31

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22888770

Country of ref document: EP

Kind code of ref document: A1