WO2024036662A1 - Parallel graph rule mining method and device based on data sampling - Google Patents

Parallel graph rule mining method and device based on data sampling

Info

Publication number
WO2024036662A1
WO2024036662A1 PCT/CN2022/114988 CN2022114988W
Authority
WO
WIPO (PCT)
Prior art keywords
data
graph
application
sampling
interest
Prior art date
Application number
PCT/CN2022/114988
Other languages
English (en)
French (fr)
Inventor
樊文飞
付文智
靳若春
陆平
田超
Original Assignee
深圳计算科学研究院 (Shenzhen Institute of Computing Sciences)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳计算科学研究院 (Shenzhen Institute of Computing Sciences)
Publication of WO2024036662A1

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/02 - Knowledge representation; Symbolic representation
    • G06N5/022 - Knowledge engineering; Knowledge acquisition
    • G06N5/025 - Extracting rules from data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/245 - Query processing
    • G06F16/2458 - Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 - Query processing support for facilitating data mining operations in structured databases
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology

Definitions

  • the invention relates to the field of computers, and in particular to a data sampling-based parallel graph rule mining method and device.
  • this application is proposed to provide a data sampling-based parallel graph rule mining method and device that overcomes the problems or at least partially solves the problems, including:
  • a parallel graph rule mining method based on data sampling is used to mine graph rules corresponding to application purposes in preset graph data.
  • the graph rules are used to match graphs related to application purposes in graph data.
  • obtaining the application purpose, and generating interest data based on the application purpose and the preset graph data; wherein the interest data includes nodes, edges and attributes related to the target application;
  • Data reduction is performed based on the interest data, and the reduced interest data is mined in parallel to determine graph rules related to the application purpose.
  • the step of generating interest data based on the application purpose and the preset graph data includes:
  • the interest data is generated according to the sequence of label triples.
  • the step of generating the interest data based on the sequence of label triples includes:
  • the interest data is generated by filtering according to the application triples.
  • the step of performing data reduction based on the interest data includes:
  • Sampling is performed based on the interest data to generate partial sampling graphs, and the data-reduced interest data is generated based on the partial sampling graphs; wherein there is at least one group of partial sampling graphs, and the data size of each sampling graph does not exceed a preset percentage of the size of the interest data.
  • the steps of sampling based on the interest data to generate partial sampling graphs, and generating the data-reduced interest data based on the partial sampling graphs, include:
  • the data-reduced interest data is generated according to the extracted pivots.
  • the step of conducting parallel mining on the data-reduced interest data to determine graph rules related to the application purpose includes:
  • the step of generating initial graph rules through a graph pattern generation function and a dependency generation function based on the data-reduced interest data includes:
  • the data-reduced interest data is evenly distributed to the computing nodes through the vertex-cut method, and the initial graph rules are generated through the graph pattern generation function and the dependency generation function.
  • the application also includes a parallel graph rule mining device based on data sampling.
  • the device is used to mine graph rules corresponding to the application purpose in the preset graph data.
  • the graph rules are used to match graphs in the graph data that are relevant to the application purpose, including:
  • An interest data module used to obtain the application purpose and generate interest data based on the application purpose and preset graph data; wherein the interest data includes nodes, edges and attributes related to the target application;
  • a graph rule module is configured to perform data reduction based on the interest data, and perform parallel mining on the reduced interest data to determine graph rules related to the application purpose.
  • the application also includes an electronic device, including a processor, a memory, and a computer program stored on the memory and capable of running on the processor.
  • when the computer program is executed by the processor, the steps of the described parallel graph rule mining method based on data sampling are implemented.
  • a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the parallel graph rule mining method based on data sampling are implemented.
  • this application obtains the application purpose and generates interest data based on the application purpose and preset graph data; wherein the interest data includes nodes, edges and attributes related to the target application; data reduction is performed based on the interest data, and the data-reduced interest data is mined in parallel to determine graph rules related to the application purpose.
  • this application proposes an application-driven graph data sampling strategy with accuracy guarantee to reduce data size and improve rule mining efficiency.
  • This application avoids the potential lack of scalability of RDF (Resource Description Framework; resource-attribute-value) data converted from attribute graphs: converting the node attributes of graph data tends to produce a large number of RDF triples.
  • This application uses machine learning predicates and graph patterns of general subgraphs to discover graph association rules (Graph Association Rules).
  • Figure 1 is a step flow chart of a parallel graph rule mining method based on data sampling provided by an embodiment of the present application
  • Figure 2 is a step flow chart of a parallel graph rule mining method based on data sampling provided by an embodiment of the present application
  • Figure 3 is a graph data sampling diagram based on the clustering method of a parallel graph rule mining method based on data sampling provided by an embodiment of the present application;
  • Figure 4 is a graph rule parallel mining algorithm diagram of a parallel graph rule mining method based on data sampling provided by an embodiment of the present application
  • Figure 5 is a schematic structural diagram of a parallel graph rule mining device based on data sampling provided by an embodiment of the present application
  • FIG. 6 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.
  • This application obtains the application purpose and generates interest data based on the application purpose and preset graph data; wherein the interest data includes nodes, edges and attributes related to the target application; data reduction is performed based on the interest data , and conduct parallel mining on the data-reduced interest data to determine graph rules related to the application purpose.
  • this application proposes an application-driven graph data sampling strategy with accuracy guarantee to reduce data size and improve rule mining efficiency.
  • This application avoids the potential lack of scalability of RDF (Resource Description Framework; resource-attribute-value) data converted from attribute graphs: converting the node attributes of graph data tends to produce a large number of RDF triples.
  • This application uses machine learning predicates and graph patterns of general subgraphs to discover graph association rules (Graph Association Rules).
  • Referring to FIG. 1, a flow chart of a parallel graph rule mining method based on data sampling provided by an embodiment of the present application is shown, which specifically includes the following steps:
  • S120: perform data reduction based on the interest data, and perform parallel mining on the data-reduced interest data to determine graph rules related to the application purpose.
  • the application purpose is obtained, and interest data is generated based on the application purpose and preset graph data; wherein the interest data includes nodes, edges and attributes related to the target application.
  • the specific process of step S110, "obtaining the application purpose and generating interest data based on the application purpose and the preset graph data, wherein the interest data includes nodes, edges and attributes related to the target application", can be further described in conjunction with the following description.
  • a sequence of label triples is generated according to the application purpose and the preset graph data; wherein the sequence of label triples is related to the predicates of the application purpose; the interest data is generated according to the sequence of label triples.
  • the sequences whose frequency is higher than a preset value are selected from the sequence of label triples to construct application triples; the interest data is generated by filtering based on the application triples.
  • a label triple is defined as ⟨l_v, l_e, l′_v⟩, where l_v and l′_v are the labels of two connected nodes, and l_e is the label of the edge connecting them.
  • the wildcard "_" matches any label.
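As a concrete illustration of the definitions above, the "obeys" relation between an edge and a label triple, including the wildcard "_", can be sketched as follows; the plain-tuple encoding is a hypothetical choice for illustration, not part of the application:

```python
# Minimal sketch of the "label triple" matching described above.
# A label triple t = (l_v, l_e, l'_v); the wildcard "_" matches any label.
WILDCARD = "_"

def edge_obeys(edge_triple, label_triple):
    """edge_triple = (L(v), l_e, L(v')); True if the edge obeys the triple."""
    return all(t == WILDCARD or t == e
               for e, t in zip(edge_triple, label_triple))

def edge_obeys_set(edge_triple, triples):
    # An edge obeys a set of label triples if it obeys at least one of them.
    return any(edge_obeys(edge_triple, t) for t in triples)
```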
  • In the second step, taking the label triples T(p) representing each applied predicate p as the seed input and treating each triple as a word, the trained language model M_A is used to generate sequences of label triples, denoted Σ_A. Since the algorithm models the probability of sentence generation with the LSTM language model M_A, the generated sequences are semantically related to the seed input T(p).
  • the algorithm then selects the m triples with the highest frequency of occurrence in Σ_A to construct a set of label triples T_A, called application triples.
  • m is a pre-given positive integer.
  • the algorithm focuses on the triples most closely related to the application.
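The frequency-based selection of the m application triples can be sketched as below; treating the generated sequences Σ_A as plain lists of triples is a simplification of the language model's output:

```python
from collections import Counter

def build_application_triples(generated_sequences, m):
    """Pick the m label triples occurring most often in the generated
    sequences (a sketch of constructing the set T_A of application triples)."""
    counts = Counter(t for seq in generated_sequences for t in seq)
    return {t for t, _ in counts.most_common(m)}
```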
  • Such application triples co-occur with high probability with the triples expressing the applied predicates. Therefore, application-relevant graph association rules are likely to include predicates related to these label triples, and the graph pattern edges in such graph association rules also obey these triples.
  • the algorithm converts the graph G_M into an application graph G_A by retaining only those edges that obey T_A; if some adjacent edge of a node v in G_M obeys T_A, then all attributes of that node are retained. Filtered by the label triples in T_A, the graph G_A obeys T_A and contains only the nodes, edges and attributes relevant to the target application.
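The conversion of G_M into the application graph G_A can be sketched as follows; the edge and attribute encodings here are hypothetical, chosen only to make the filtering rule concrete:

```python
WILDCARD = "_"

def _obeys(edge_triple, T_A):
    # An edge obeys the set T_A if it matches at least one triple in it.
    return any(all(t == WILDCARD or t == e for e, t in zip(edge_triple, trip))
               for trip in T_A)

def to_application_graph(edges, node_attrs, T_A):
    """edges: list of ((u, v), (L(u), l_e, L(v))). Keep only edges obeying
    T_A; keep a node's attributes only if one of its adjacent edges is kept."""
    kept_edges = [(uv, lbl) for uv, lbl in edges if _obeys(lbl, T_A)]
    kept_nodes = {n for (u, v), _ in kept_edges for n in (u, v)}
    kept_attrs = {n: a for n, a in node_attrs.items() if n in kept_nodes}
    return kept_edges, kept_attrs
```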
  • As described in step S120, data reduction is performed based on the interest data.
  • In an embodiment of the present invention, the specific process of "performing data reduction based on the interest data" described in step S120 may be further explained in conjunction with the following description.
  • sampling is performed based on the interest data to generate partial sampling graphs, and the data-reduced interest data is generated based on the partial sampling graphs; wherein there is at least one group of partial sampling graphs, and the data size of each sampling graph does not exceed a preset percentage of the size of the interest data.
  • the specific process of "sampling based on the interest data to generate partial sampling graphs, and generating the data-reduced interest data based on the partial sampling graphs" can be further explained in conjunction with the following description.
  • a pivot set is generated based on the interest data; vectors are extracted based on the pivot set and clustered to generate the extracted pivots; and the data-reduced interest data is generated based on the extracted pivots.
  • the definition of the pivot set is first given below.
  • the graph pattern Q_p[x̄_p] associated with the predicate p is a subgraph of Q[x̄] that contains only the graph pattern nodes corresponding to the variables in p and does not contain any edges.
  • the pivot set of p in graph G, denoted PS(p, G), is the set of matches of Q_p in G. Therefore, each pivot is either a single node or a pair of nodes drawn from G that match the labels in Q_p.
  • graph data sampling based on a clustering method.
  • the input of the algorithm is the application graph G_A (obtained in the application-driven graph data reduction step), the number of sampling graphs N, the pivot sampling strategy M_v, the sampling strategy M_s for surrounding subgraphs, and the sampling ratios δ_v% and δ%, which control the pivot sampling and the overall sample size respectively.
  • the algorithm outputs a set H of N sampling graphs computed over N rounds, and the data size of each sampling graph does not exceed δ% of the size of the application graph G_A.
  • Each round of the algorithm obtains a partial sampling graph and adds it to the set H (lines 3-9 in Figure 3).
  • the algorithm finds the pivot sets related to the right-hand-side prediction predicates and collects all pivots in a set C (lines 3-5 in Figure 3); the algorithm then goes through the following two stages (lines 6-7 in Figure 3) to obtain this round's sampling graph H(A, δ%):
  • the first stage processes the pivot set.
  • the algorithm calls the PSample function to sample pivots from the set C, so that at most δ_v% of the pivots in C appear in the sampled pivot set S_A.
  • the pivot sampling strategy is the K-means clustering algorithm: first extract a vector representation for each pivot, then cluster these vectors with K-means, and finally randomly select pivots from each cluster.
  • the second stage extracts subgraphs around the pivots.
  • the algorithm calls the LSample function, takes each sampled pivot as a starting point, and uses BFS to traverse the nodes within k hops around it; the traversed nodes and the edges between them are then extracted and retained as this round's sampling graph H(A, δ%).
  • the sampling process ensures that the size of the sampling graph H(A, δ%) does not exceed δ% of the size of the application graph G_A.
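A minimal sketch of the two sampling stages, with a tiny in-line K-means standing in for the pivot-vector clustering; the vector representation of a pivot and the parameter names are assumptions for illustration:

```python
import random
from collections import defaultdict

def _kmeans(points, k, rounds=10, seed=0):
    # Tiny K-means on dense vectors; returns a cluster index per point.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(rounds):
        for i, x in enumerate(points):
            labels[i] = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(x, centers[c])))
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def psample(pivots, vectors, delta_v, k=2, seed=0):
    """PSample stage: cluster the pivot vectors, then draw pivots from every
    cluster until at most delta_v% of the pivots are kept."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for p, c in zip(pivots, _kmeans(vectors, k, seed=seed)):
        clusters[c].append(p)
    budget = max(1, len(pivots) * delta_v // 100)
    sample = []
    while len(sample) < budget:
        for c in list(clusters):
            if clusters[c] and len(sample) < budget:
                sample.append(clusters[c].pop(rng.randrange(len(clusters[c]))))
    return sample

def lsample(adj, starts, hops):
    """LSample stage: BFS out to `hops` hops around each sampled pivot."""
    seen, frontier = set(starts), set(starts)
    for _ in range(hops):
        frontier = {w for v in frontier for w in adj.get(v, ()) if w not in seen}
        seen |= frontier
    return seen
```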
  • As described in step S120, parallel mining is performed on the data-reduced interest data to determine graph rules related to the application purpose.
  • In an embodiment of the present invention, the specific process of "conducting parallel mining on the data-reduced interest data to determine graph rules related to the application purpose" described in step S120 can be further explained in conjunction with the following description.
  • initial graph rules are generated through a graph pattern generation function and a dependency generation function based on the data-reduced interest data; graph rules related to the application purpose are generated based on verification based on the initial graph rules.
  • the reduced interest data is evenly distributed to computing nodes through the vertex cutting method and initial graph rules are generated through the graph pattern generation function and the dependency generation function.
  • the input of the graph rule parallel mining algorithm is a sampling graph H containing N samples, n processors, a positive integer k, and a support threshold ⁇ '.
  • the output of the algorithm is a rule set ⁇ _H, in which the graph pattern of each rule has at most k nodes, and the support of each rule in H is not less than the threshold ⁇ ’.
  • the algorithm first allocates computing resources evenly to the sample graphs (line 1 in Figure 4): each sample graph is partitioned by the vertex-cut method and assigned to n computing nodes. Thereafter, following the BSP parallel model and a mining algorithm similar to that for GFDs, the parallel mining algorithm uses k×2 rounds to generate and verify the mined rules (lines 3-13 in Figure 4).
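A greedy edge-placement sketch in the spirit of the vertex-cut partitioning mentioned above; the load-cap heuristic is an assumption added for illustration, not the application's exact partitioner:

```python
def vertex_cut(edges, n):
    """Assign each edge to one of n workers, preferring workers that already
    hold a replica of an endpoint, but capping per-worker load for balance."""
    loads = [0] * n
    replicas = {}          # vertex -> set of workers holding a replica of it
    cap = len(edges) / n   # ideal per-worker load
    assignment = []
    for u, v in edges:
        candidates = replicas.get(u, set()) | replicas.get(v, set())
        pool = [w for w in candidates if loads[w] < cap] or list(range(n))
        w = min(pool, key=lambda i: loads[i])
        loads[w] += 1
        replicas.setdefault(u, set()).add(w)
        replicas.setdefault(v, set()).add(w)
        assignment.append(w)
    return assignment
```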
  • the rule generation is mainly completed by calling the graph pattern generation (QExpand) function and the dependency generation (PExpand) function (lines 4 and 9 in Figure 4).
  • the rule verification checks the generated rules on the sampled data graph H (line 10 in Figure 4), thereby retaining the rules whose support is not less than the threshold σ′.
  • at iteration ℓ_q, the graph pattern generation (QExpand) function creates a set Q_ℓq of graph patterns with ℓ_q edges to expand the graph patterns.
  • QExpand generates Q_ℓq by extending each pattern in Q_ℓq−1 with a new edge; initially, the edges in Q_1 must obey a label triple representing an applied predicate. The algorithm then uses parallel graph pattern matching to compute the matches of these generated patterns in the sample graphs, and deletes from Q_ℓq all patterns whose support in the samples is less than σ′ (line 5 in Figure 4).
  • the dependency generation (PExpand) function expands the dependency relationship X with up to ℓ ≤ m_p predicates, where m_p is the maximum number of predicates in X.
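The generate-and-prune loop of QExpand/PExpand relies on full parallel subgraph matching; as a self-contained stand-in, the levelwise idea (grow candidates by one element per round, drop those below the support threshold σ′) can be illustrated with a plain Apriori-style miner over sets of label triples. This is a deliberate simplification for illustration, not the application's matching algorithm:

```python
def levelwise_mine(transactions, sigma, max_size):
    """Grow candidate patterns (frozensets of label triples) one element at a
    time, pruning any candidate whose support falls below sigma."""
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items]
    frequent, size = [], 1
    while level and size <= max_size:
        # prune: keep only candidates supported by >= sigma transactions
        level = [p for p in level
                 if sum(p <= t for t in transactions) >= sigma]
        frequent += level
        size += 1
        # expand: add one new item to each surviving pattern
        level = list({p | {i} for p in level for i in items if i not in p})
    return frequent
```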
  • this application proposes an application-driven graph data sampling strategy with accuracy guarantee to reduce data scale and improve rule mining efficiency.
  • Mining algorithms discover graph rules from general attribute graphs without the need to encode graph data into RDF format like rule learners. This avoids the potential lack of scalability of RDF converted from attribute graphs: converting node attributes of graph data tends to produce a large number of RDF triples.
  • the description is relatively simple. For relevant details, please refer to the partial description of the method embodiment.
  • a parallel graph rule mining device based on data sampling provided by an embodiment of the present application is shown, specifically including the following modules:
  • Interest data module 510 used to obtain the application purpose and generate interest data based on the application purpose and preset graph data; wherein the interest data includes nodes, edges and attributes related to the target application;
  • the graph rule module 520 is configured to perform data reduction based on the interest data, and perform parallel mining on the reduced interest data to determine graph rules related to the application purpose.
  • the interest data module 510 includes:
  • Label device used to generate a sequence of label triples based on the application purpose and preset graph data; wherein the sequence of label triples is related to the application purpose predicate;
  • Interest data device used to generate the interest data based on the sequence of tag triples.
  • the interest data device includes:
  • Triple sub-module, used to select the sequences whose frequency is higher than a preset value from the sequence of label triples to construct application triples;
  • Interest data sub-module used to filter and generate the interest data based on the application triplet.
  • the graph rule module 520 includes:
  • Sampling graph device, used to perform sampling based on the interest data to generate partial sampling graphs, and generate the data-reduced interest data based on the partial sampling graphs; wherein there is at least one group of partial sampling graphs, and the data size of each sampling graph does not exceed a preset percentage of the size of the interest data.
  • Initial graph rule device used to generate initial graph rules through a graph pattern generation function and a dependency generation function based on the data of interest after data reduction;
  • Graph rule device used for verifying and generating graph rules related to the application purpose based on the initial graph rules.
  • the sampling graph device includes:
  • Pivot set sub-module, used to generate a pivot set based on the interest data;
  • Pivot extraction sub-module, used to extract vectors based on the pivot set and cluster the vectors to generate the extracted pivots;
  • Interest data sub-module, used to generate the data-reduced interest data based on the extracted pivots.
  • the graph rule device includes:
  • Graph rule submodule used to evenly distribute the interest data after the data reduction to computing nodes through the vertex cutting method and generate initial graph rules through the graph pattern generation function and the dependency generation function.
  • the description is relatively simple. For relevant details, please refer to the partial description of the method embodiment.
  • Referring to FIG. 6, a computer device for the data sampling-based parallel graph rule mining method of the present application is shown, which may specifically include the following:
  • the computer device 12 described above is in the form of a general computing device.
  • the components of the computer device 12 may include, but are not limited to: one or more processors or processing units 16, a memory 28, and a bus 18 connecting the different system components (including the memory 28 and the processing unit 16).
  • Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a graphics accelerated port, a processor, or a local bus using any of a variety of bus structures.
  • these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MAC) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
  • Computer device 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by computer device 12, including volatile and nonvolatile media, removable and non-removable media.
  • Memory 28 may include computer system readable media in the form of volatile memory, such as random access memory 30 and/or cache memory 32 .
  • Computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media.
  • storage system 34 may be used to read and write to non-removable, non-volatile magnetic media (commonly referred to as "hard drives").
  • a disk drive may be provided for reading and writing to removable non-volatile magnetic disks (e.g., "floppy disks"), and an optical disc drive for reading and writing to removable non-volatile optical discs (e.g., CD-ROM, DVD-ROM or other optical media).
  • each drive may be connected to bus 18 through one or more data media interfaces.
  • the memory may include at least one program product having a set (e.g., at least one) of program modules 42 configured to perform the functions of the various embodiments of the present application.
  • a program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the memory.
  • Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
  • Program modules 42 generally perform functions and/or methods in the embodiments described herein.
  • Computer device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, pointing device, display 24, camera, etc.), with one or more devices that enable an operator to interact with computer device 12, and/or with any device (e.g., a network card, modem, etc.) that enables computer device 12 to communicate with one or more other computing devices. This communication may occur via I/O interface 22. Computer device 12 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through network adapter 20. As shown in FIG. 6, network adapter 20 communicates with the other modules of computer device 12 via bus 18.
  • the processing unit 16 executes various functional applications and data processing by running programs stored in the memory 28 , for example, implementing a parallel graph rule mining method based on data sampling provided in the embodiment of the present application.
  • when the above-mentioned processing unit 16 executes the above-mentioned program, it achieves: obtaining the application purpose, and generating interest data according to the application purpose and the preset graph data, wherein the interest data includes nodes, edges and attributes related to the target application; performing data reduction based on the interest data, and conducting parallel mining on the data-reduced interest data to determine graph rules related to the application purpose.
  • the present application also provides a computer-readable storage medium on which a computer program is stored.
  • when executed, the program implements a parallel graph rule mining method based on data sampling as provided in all embodiments of the present application.
  • when the program is executed by the processor, it achieves: obtaining the application purpose, and generating interest data based on the application purpose and the preset graph data, wherein the interest data includes nodes, edges and attributes related to the target application; performing data reduction based on the interest data, and conducting parallel mining on the data-reduced interest data to determine graph rules related to the application purpose.
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination thereof. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Computer program code for performing the operations of the present application may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the operator's computer, partly on the operator's computer, as a stand-alone software package, partly on the operator's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer can be connected to the operator's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (e.g., via the Internet using an Internet service provider).

Abstract

An embodiment of the present invention provides a parallel graph rule mining method and device based on data sampling. The application obtains the application purpose and generates interest data based on the application purpose and preset graph data, wherein the interest data includes nodes, edges and attributes related to the target application; data reduction is performed based on the interest data, and parallel mining is conducted on the data-reduced interest data to determine graph rules related to the application purpose. Compared with mining rules from the entire graph, this application proposes an application-driven graph data sampling strategy with an accuracy guarantee, to reduce the data size and improve rule mining efficiency. This application avoids the potential lack of scalability of RDF data converted from attribute graphs: converting the node attributes of graph data tends to produce a large number of RDF triples. This application uses machine learning predicates and graph patterns of general subgraphs to discover graph association rules.

Description

Parallel graph rule mining method and device based on data sampling
Technical Field
The present invention relates to the field of computers, and in particular to a parallel graph rule mining method and device based on data sampling.
Background Art
Based on the levelwise search algorithms widely used in data mining, traditional graph rule mining algorithms use various pruning strategies to accelerate the mining of graph rules: for example, the mining of graph functional dependencies (Graph Functional Dependency) and the mining of graph-pattern association rules (Graph-Pattern Association Rule). For Horn rules on graph data, there is also a series of mining algorithms: for example, mining rules with pruning, or learning rules in a bottom-up manner from paths of different lengths in the graph data.
The main shortcomings of existing graph rule mining algorithms are: mining is time-consuming, because the rule mining process requires subgraph matching algorithms of extremely high computational complexity, so mining graph rules on large data graphs takes extremely long and rule mining is inefficient; and machine learning predicates are not taken into account during mining.
Summary of the Invention
In view of the above problems, the present application is proposed to provide a parallel graph rule mining method and device based on data sampling that overcome the above problems or at least partially solve them, including:
A parallel graph rule mining method based on data sampling, wherein the method is used to mine, in preset graph data, graph rules corresponding to an application purpose, and the graph rules are used to match, in the graph data, graphs relevant to the application purpose, including:
obtaining the application purpose, and generating interest data based on the application purpose and the preset graph data, wherein the interest data includes nodes, edges and attributes related to the target application;
performing data reduction based on the interest data, and conducting parallel mining on the data-reduced interest data to determine graph rules related to the application purpose.
Preferably, the step of generating interest data based on the application purpose and the preset graph data includes:
generating a sequence of label triples based on the application purpose and the preset graph data, wherein the sequence of label triples is related to the predicates of the application purpose;
generating the interest data based on the sequence of label triples.
Preferably, the step of generating the interest data based on the sequence of label triples includes:
selecting, from the sequence of label triples, the sequences whose frequency is higher than a preset value to construct application triples;
filtering based on the application triples to generate the interest data.
Preferably, the step of performing data reduction based on the interest data includes:
sampling based on the interest data to generate partial sampling graphs, and generating the data-reduced interest data based on the partial sampling graphs, wherein there is at least one group of partial sampling graphs, and the data size of each sampling graph does not exceed a preset percentage of the size of the interest data.
Preferably, the step of sampling based on the interest data to generate partial sampling graphs, and generating the data-reduced interest data based on the partial sampling graphs, includes:
generating a pivot set based on the interest data;
extracting vectors based on the pivot set, and clustering the vectors to generate the extracted pivots;
generating the data-reduced interest data based on the extracted pivots.
Preferably, the step of conducting parallel mining on the data-reduced interest data to determine graph rules related to the application purpose includes:
generating initial graph rules from the data-reduced interest data through a graph pattern generation function and a dependency generation function;
verifying the initial graph rules to generate the graph rules related to the application purpose.
Preferably, the step of generating initial graph rules from the data-reduced interest data through a graph pattern generation function and a dependency generation function includes:
evenly distributing the data-reduced interest data to computing nodes through the vertex-cut method, and generating the initial graph rules through the graph pattern generation function and the dependency generation function.
To realize the present application, a parallel graph rule mining device based on data sampling is also included, wherein the device is used to mine, in preset graph data, graph rules corresponding to an application purpose, and the graph rules are used to match, in the graph data, graphs relevant to the application purpose, including:
an interest data module, used to obtain the application purpose and generate interest data based on the application purpose and the preset graph data, wherein the interest data includes nodes, edges and attributes related to the target application;
a graph rule module, used to perform data reduction based on the interest data, and conduct parallel mining on the data-reduced interest data to determine graph rules related to the application purpose.
To realize the present application, an electronic device is also included, comprising a processor, a memory, and a computer program stored on the memory and capable of running on the processor, wherein when the computer program is executed by the processor, the steps of the parallel graph rule mining method based on data sampling are implemented.
To realize the present application, a computer-readable storage medium is provided, on which a computer program is stored, wherein when the computer program is executed by a processor, the steps of the parallel graph rule mining method based on data sampling are implemented.
The present application has the following advantages:
In an embodiment of the present application, the application purpose is obtained, and interest data is generated based on the application purpose and preset graph data, wherein the interest data includes nodes, edges and attributes related to the target application; data reduction is performed based on the interest data, and the data-reduced interest data is mined in parallel to determine graph rules related to the application purpose. Compared with mining rules from the entire graph, this application proposes an application-driven graph data sampling strategy with an accuracy guarantee, to reduce the data size and improve rule mining efficiency. This application avoids the potential lack of scalability of RDF (Resource Description Framework; resource-attribute-value) data converted from attribute graphs: converting the node attributes of graph data tends to produce a large number of RDF triples. This application uses machine learning predicates and graph patterns of general subgraphs to discover graph association rules (Graph Association Rule).
Brief Description of the Drawings

To illustrate the technical solutions of the present application more clearly, the drawings needed in the description of the present application are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of the steps of a parallel graph rule mining method based on data sampling provided by an embodiment of the present application;

Fig. 2 is a flowchart of the steps of a parallel graph rule mining method based on data sampling provided by an embodiment of the present application;

Fig. 3 shows the clustering-based graph data sampling algorithm of a parallel graph rule mining method based on data sampling provided by an embodiment of the present application;

Fig. 4 shows the parallel mining algorithm for graph rules of a parallel graph rule mining method based on data sampling provided by an embodiment of the present application;

Fig. 5 is a schematic structural diagram of a parallel graph rule mining apparatus based on data sampling provided by an embodiment of the present application;

Fig. 6 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.
Detailed Description of the Embodiments

To make the above objects, features and advantages of the present application more comprehensible, the present application is described in further detail below with reference to the drawings and specific embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present application.

The present application obtains the application purpose and generates interest data according to the application purpose and the preset graph data, wherein the interest data comprises nodes, edges and attributes related to the target application; data reduction is performed according to the interest data, and parallel mining is performed on the reduced interest data to determine graph rules related to the application purpose. Compared with mining rules from the entire graph, the present application proposes an application-driven graph data sampling strategy with accuracy guarantees, so as to reduce the data size and improve rule mining efficiency. The present application avoids the potential lack of scalability of RDF (Resource Description Framework) data converted from property graphs: converting the node attributes of graph data tends to produce a large number of RDF triples. The present application uses machine learning predicates and graph patterns over general subgraphs to discover Graph Association Rules.

It should be noted that work on graph rules reads more like a branch of the database field. Rules were first created and used in databases, for example the 'integrity constraints' used when creating tables, which state that data inserted into the table must satisfy certain constraints (e.g., some attribute being non-null); in addition, rules are also widely used in fields such as data mining.
Referring to Fig. 1 and Fig. 2, flowcharts of the steps of a parallel graph rule mining method based on data sampling provided by an embodiment of the present application are shown, specifically comprising the following steps:

S110: obtain the application purpose, and generate interest data according to the application purpose and the preset graph data, wherein the interest data comprises nodes, edges and attributes related to the target application;

S120: perform data reduction according to the interest data, and perform parallel mining on the reduced interest data to determine graph rules related to the application purpose.

The parallel graph rule mining method based on data sampling of this exemplary embodiment is further described below.

As described in step S110 above, the application purpose is obtained, and interest data is generated according to the application purpose and the preset graph data, wherein the interest data comprises nodes, edges and attributes related to the target application.

In an embodiment of the present invention, the specific process of step S110, 'obtaining the application purpose, and generating interest data according to the application purpose and the preset graph data, wherein the interest data comprises nodes, edges and attributes related to the target application', can be further explained with the following description.

As described in the following steps, sequences of label triples are generated according to the application purpose and the preset graph data, wherein the sequences of label triples are related to the predicates of the application purpose; and the interest data is generated according to the sequences of label triples.

In an embodiment of the present invention, the specific process of the step 'generating the interest data according to the sequences of label triples' can be further explained with the following description.

As described in the following steps, among the sequences of label triples, those whose frequency is higher than a preset value are selected to construct application triples; and filtering is performed according to the application triples to generate the interest data.
In a specific embodiment, before introducing the algorithm steps, the concept of a 'label triple' is first introduced. A label triple is defined as ⟨l_v, l_e, l′_v⟩, where l_v and l′_v are the labels of two connected nodes and l_e is the label of the edge connecting them. We say that an edge e = (v, l, v′) obeys a label triple t = ⟨l_v, l_e, l′_v⟩ if the label of node v is L(v) = l_v, the label of the edge is l = l_e, and the label of node v′ is L(v′) = l′_v. The wildcard '_' matches any label. We call ⟨L(v), l, L(v′)⟩ the label triple T(e) of edge e. For a set T of label triples, if there exists a label triple t ∈ T such that e obeys t, then e is said to obey the set T; and if every edge e of a graph G obeys the set T, then the graph G obeys T. We define the label triples of a predicate p of a graph pattern Q[x], denoted T(p), as follows: (1) if p is an edge predicate l(x, y) or a machine learning predicate M(x, y, l), its label triples are {⟨L_Q(μ(x)), l, L_Q(μ(y))⟩}; (2) if p is an attribute predicate x.A or a constant predicate x.A = c, its label triples are {⟨L_Q(μ(x)), _, _⟩, ⟨_, _, L_Q(μ(x))⟩}; (3) if p is a variable predicate x.A = y.B, its label triples are {⟨L_Q(μ(x)), _, L_Q(μ(y))⟩, ⟨L_Q(μ(y)), _, L_Q(μ(x))⟩}.
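The wildcard-matching semantics of label triples above can be sketched compactly. This is a minimal illustration, not the patent's implementation; the function names `obeys`, `obeys_set` and `graph_obeys` are placeholders chosen here, and edges are represented directly by their label triples ⟨source label, edge label, target label⟩.

```python
# Minimal sketch of label-triple matching; the wildcard "_" matches any label.
WILDCARD = "_"

def obeys(edge_triple, label_triple):
    """True if an edge's label triple obeys a single label triple t."""
    return all(t == WILDCARD or t == e for e, t in zip(edge_triple, label_triple))

def obeys_set(edge_triple, triple_set):
    """An edge obeys a set T if it obeys at least one t in T."""
    return any(obeys(edge_triple, t) for t in triple_set)

def graph_obeys(edges, triple_set):
    """A graph obeys T if every one of its edges obeys T."""
    return all(obeys_set(e, triple_set) for e in edges)

T = {("Person", "livesIn", "City"), ("Person", WILDCARD, "Company")}
assert obeys_set(("Person", "livesIn", "City"), T)       # exact match
assert obeys_set(("Person", "worksFor", "Company"), T)   # via the wildcard triple
assert not obeys_set(("City", "locatedIn", "Country"), T)
```

A graph then obeys T exactly when `graph_obeys` returns True over all its edges, which is the filtering criterion used later to derive the application graph.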
The algorithm steps are introduced next. Given the application a user is interested in, a link prediction machine learning model M(x, y, l) and a data graph G, we employ a language model M_A (a long short-term memory (LSTM) network) and derive the application graph G_A through the following four steps.

Step 1: the algorithm extends the graph G to G_M = (V, E_M, L, F) by adding the edges predicted by M(x, y, l). This allows the algorithm to uniformly take machine learning predicates into account when discovering graph association rules in the application graph G_A.

Step 2: taking the label triples T(p) of each predicate p that represents the application as seed input, and treating each triple as a word, we use the trained language model M_A to generate sequences of label triples, denoted Θ_A. Since the algorithm models the probability of sentence generation based on the LSTM language model M_A, the generated sequences are semantically related to the seed input T(p).

Step 3: the algorithm selects the m most frequent triples from Θ_A to construct a set T_A of label triples, called application triples, where m is a given positive integer. That is, the algorithm focuses on the triples most closely related to the application. Such application triples co-occur with high probability with the triples of the predicates that express the application. Consequently, graph association rules relevant to the application are likely to include predicates related to these label triples, and the graph pattern edges in such rules also obey these triples.

Step 4: by keeping only the edges that obey T_A, the algorithm transforms the graph G_M into the application graph G_A. Here, if some adjacent edge of a node v in G_M obeys T_A, all attributes of that node are retained. Filtered by the label triples in T_A, the graph G_A obeys T_A and contains only the nodes, edges and attributes relevant to the target application.
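The four steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the LSTM language model M_A is stubbed out by a caller-supplied `generate_sequences` callable, edges are represented by their label triples, and wildcard handling is omitted for brevity; `derive_application_graph` is an illustrative name, not the patent's actual function.

```python
from collections import Counter

def derive_application_graph(edges, ml_edges, seed_triples, generate_sequences, m):
    """Sketch of the 4-step derivation of the application graph G_A.

    edges / ml_edges: iterables of (src_label, edge_label, dst_label) triples;
    generate_sequences: stand-in for the LSTM language model M_A, mapping the
    seed triples to sequences of label triples; m: number of application triples.
    """
    # Step 1: extend G with the edges predicted by the ML model M(x, y, l).
    g_m = list(edges) + list(ml_edges)
    # Step 2: generate label-triple sequences semantically related to the seeds.
    sequences = generate_sequences(seed_triples)
    # Step 3: keep the m most frequent triples as the application triples T_A.
    counts = Counter(t for seq in sequences for t in seq)
    t_a = {t for t, _ in counts.most_common(m)}
    # Step 4: keep only the edges that obey T_A (exact match; wildcards omitted).
    return [e for e in g_m if e in t_a]

# Toy usage with a stubbed language model:
seqs = lambda seeds: [[("Person", "livesIn", "City"), ("City", "in", "Country")],
                      [("Person", "livesIn", "City")]]
g = [("Person", "livesIn", "City"), ("City", "near", "Sea")]
g_a = derive_application_graph(g, [], None, seqs, m=1)
assert g_a == [("Person", "livesIn", "City")]
```

With m = 1, only the single most frequent generated triple survives as T_A, so only edges obeying it remain in G_A.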
As described in step S120 above, data reduction is performed according to the interest data.

In an embodiment of the present invention, the specific process of 'performing data reduction according to the interest data' in step S120 can be further explained with the following description.

As described in the following steps, sampling is performed according to the interest data to generate partial sample graphs, and the reduced interest data is generated according to the partial sample graphs, wherein there is at least one group of partial sample graphs, and the data size of a sample graph does not exceed a preset percentage of the size of the interest data.

In an embodiment of the present invention, the specific process of the step 'sampling according to the interest data to generate partial sample graphs, and generating the reduced interest data according to the partial sample graphs' can be further explained with the following description.

As described in the following steps, a pivot set is generated according to the interest data; vectors are extracted according to the pivot set, and the vectors are clustered to generate extracted pivots; and the reduced interest data is generated according to the extracted pivots.

In a specific embodiment, the definition of the pivot set is first given. Consider a predicate p of a graph pattern Q[x]. The graph pattern Q_p[x_p] associated with predicate p is the subgraph of Q[x] that contains only the pattern nodes corresponding to the variables in p, without any edges. The pivot set of p in graph G, denoted PS(p, G), is the set of matches of Q_p in G. Hence each pivot is either a single node or a pair of nodes extracted from G that match the labels in Q_p. The clustering-based graph data sampling is shown in Fig. 3.

Based on the above definition of the 'pivot set', we present the clustering-based graph data sampling algorithm.

The input of the algorithm is the application graph G_A (obtained by the application-driven graph data reduction step), the number N of sample graphs, a strategy M_v for sampling pivots, a strategy M_s for sampling surrounding subgraphs, and sampling ratios ρ_v% and ρ%, which control the proportion of sampled nodes and the proportion of the sample graph size, respectively. Through N rounds of computation, the algorithm outputs a set H of N sample graphs, where the data size of each sample graph does not exceed ρ% of the size of the application graph G_A.
Each round of the algorithm produces a partial sample graph and adds it to the set H (lines 3-9 in Fig. 3). First, the algorithm finds the pivot sets of the predicted predicate on the right-hand side and collects all pivots into a set C (lines 3-5 in Fig. 3); the algorithm then derives the sample graph H(A, ρ%) of this round through the following two phases (lines 6-7 in Fig. 3):

Phase 1 processes the pivot set. The algorithm calls the PSample function to sample pivots from the set C such that at most ρ_v% of the pivots in C appear in the sampled pivot set S_A. The pivot sampling strategy is the K-means clustering algorithm: a vector representation is first extracted for each pivot, the vectors are then clustered with K-means, and finally pivots are drawn at random from each cluster.

Phase 2 extracts the subgraphs around the pivots. The algorithm calls the LSample function, which starts from each sampled pivot and traverses the nodes within k hops around it via BFS. The traversed nodes and the edges between them are extracted and retained as the sample graph H(A, ρ%) of this round. The sampling process guarantees that the size of the sample graph H(A, ρ%) does not exceed ρ% of that of the application graph G_A.
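The two phases above can be sketched as follows. This is a hedged approximation, not the patent's PSample/LSample code: the K-means step is abstracted into a caller-supplied `cluster` function (mapping pivot vectors to cluster ids), the per-cluster quota is a simple proportional heuristic chosen here, and the graph is an adjacency dict.

```python
import random
from collections import deque

def psample(pivots, vectors, cluster, rho_v):
    """Phase 1 sketch: sample at most rho_v% of the pivots, spread over clusters.

    `cluster` stands in for K-means: it maps the pivot vectors to a list of
    cluster ids, one per pivot.
    """
    budget = max(1, int(len(pivots) * rho_v / 100))
    groups = {}
    for pivot, cid in zip(pivots, cluster(vectors)):
        groups.setdefault(cid, []).append(pivot)
    sampled = []
    for members in groups.values():  # draw pivots at random from every cluster
        take = max(1, budget * len(members) // len(pivots))
        sampled.extend(random.sample(members, min(take, len(members))))
    return sampled[:budget]

def lsample(adj, pivots, k):
    """Phase 2 sketch: BFS out to k hops around each sampled pivot."""
    keep = set()
    for p in pivots:
        frontier, seen = deque([(p, 0)]), {p}
        while frontier:
            node, depth = frontier.popleft()
            keep.add(node)
            if depth < k:
                for nb in adj.get(node, ()):
                    if nb not in seen:
                        seen.add(nb)
                        frontier.append((nb, depth + 1))
    # retain the traversed nodes and the edges among them
    return {u: [v for v in adj.get(u, ()) if v in keep] for u in keep}
```

A size cap relative to G_A (the ρ% guarantee) would additionally truncate the BFS once the sampled subgraph reaches the allowed fraction; that bookkeeping is omitted here.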
As described in step S120 above, parallel mining is performed on the reduced interest data to determine graph rules related to the application purpose.

In an embodiment of the present invention, the specific process of 'performing parallel mining on the reduced interest data to determine graph rules related to the application purpose' in step S120 can be further explained with the following description.

As described in the following steps, initial graph rules are generated from the reduced interest data through a graph pattern generation function and a dependency generation function; and the initial graph rules are verified to generate graph rules related to the application purpose.

In an embodiment of the present invention, the specific process of the step 'generating initial graph rules from the reduced interest data through a graph pattern generation function and a dependency generation function' can be further explained with the following description.

As described in the following steps, the reduced interest data is evenly distributed to compute nodes through a vertex-cut method, and initial graph rules are generated through the graph pattern generation function and the dependency generation function.

In a specific embodiment, the input of the parallel graph rule mining algorithm is the set H of N sample graphs, n processors, a positive integer k, and a support threshold σ′. The output of the algorithm is a rule set Σ_H, in which the graph pattern of each rule has at most k nodes and the support of each rule in H is no smaller than the threshold σ′.

The algorithm first distributes the computing resources evenly over the sample graphs (line 1 in Fig. 4): it partitions each sample graph via a vertex-cut method and assigns the fragments to the n compute nodes. Then, following the BSP parallel model and mining algorithms similar to those for GFDs, the parallel mining algorithm uses k^2 rounds to generate and verify the mined rules (lines 3-13 in Fig. 4). Rule generation is mainly carried out by calling the graph pattern generation (QExpand) function and the dependency generation (PExpand) function (lines 4 and 9 in Fig. 4); rule verification checks the generated rules on the sampled data graphs H (line 10 in Fig. 4) so as to select the rules whose support is no smaller than the threshold σ′.
At iteration l_q, the graph pattern generation (QExpand) function creates a set Q_{l_q} of graph patterns with l_q edges to expand the graph patterns. QExpand generates Q_{l_q} by extending each pattern in Q_{l_q−1} with one new edge; initially, the edges in Q_1 must obey the label triples of the predicates that represent the application. The algorithm then computes the matches of the generated graph patterns in the sample graphs via parallel graph pattern matching, and removes from Q_{l_q} all graph patterns whose support in the samples is smaller than σ′ (line 5 in Fig. 4).

Given the graph patterns Q_{l_q}, the dependency generation (PExpand) function expands the dependencies X → p_0 level by level over l_p to produce candidate graph rules, with l_p^m iterations (lines 8-12 in Fig. 4), where l_p^m denotes the maximum number of predicates in X. In each iteration l_p, the function computes a set Σ^{l_p} of graph rules such that each rule has a graph pattern from Q_{l_q} and a precondition X with l_p predicates (X is the empty set when l_p = 0) (line 9 in Fig. 4), where X is obtained by extending the precondition of a corresponding rule in Σ^{l_p−1} with one new predicate.
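The generate-and-verify loop driven by QExpand and PExpand can be sketched as a sequential skeleton. This is an illustration of the levelwise control flow only, under assumptions: the parallel BSP execution, vertex-cut partitioning and matching are collapsed into caller-supplied `qexpand`, `pexpand` and `support` stand-ins, and the function name `mine_rules` is chosen here, not taken from the patent.

```python
def mine_rules(samples, k, sigma, qexpand, pexpand, support):
    """Sequential skeleton of the levelwise generate-and-verify loop.

    qexpand(patterns, samples): grows each pattern by one edge (Q_1 on []);
    pexpand(pattern): yields candidate rules X -> p0 over that pattern;
    support(item, samples): support of a pattern or rule in the samples.
    """
    rules = []
    patterns = qexpand([], samples)            # Q_1: single-edge seed patterns
    for _ in range(k):                         # grow patterns edge by edge
        # prune patterns below the support threshold sigma
        patterns = [q for q in patterns if support(q, samples) >= sigma]
        for q in patterns:
            for rule in pexpand(q):            # expand preconditions X levelwise
                if support(rule, samples) >= sigma:
                    rules.append(rule)
        patterns = qexpand(patterns, samples)  # Q_{l_q+1}
    return rules

# Toy stubs: one seed pattern, one rule per pattern, uniform support.
qexpand = lambda pats, samples: ["e1"] if not pats else []
pexpand = lambda q: [q + "_rule"]
support = lambda item, samples: 1
assert mine_rules([], 1, 1, qexpand, pexpand, support) == ["e1_rule"]
```

In the actual algorithm each of the k^2 rounds runs these expansions in parallel over the vertex-cut fragments under the BSP model; the skeleton only shows the pruning order.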
In a specific embodiment, compared with mining rules from the entire graph, we propose an application-driven graph data sampling strategy with accuracy guarantees, so as to reduce the data size and improve rule mining efficiency. The mining algorithm discovers graph rules from general property graphs, without encoding the graph data into the RDF format as rule learners do. This avoids the potential lack of scalability of RDF converted from property graphs: converting the node attributes of graph data tends to produce a large number of RDF triples. Machine learning predicates and graph patterns over general subgraphs are used to discover Graph Association Rules. By contrast, none of the previous methods considered machine learning predicates, and most of them studied only path patterns.
As for the apparatus embodiment, since it is basically similar to the method embodiment, its description is relatively brief; for relevant details, refer to the description of the method embodiment.

Referring to Fig. 5, a parallel graph rule mining apparatus based on data sampling provided by an embodiment of the present application is shown, specifically comprising the following modules:

an interest data module 510, configured to obtain the application purpose and generate interest data according to the application purpose and the preset graph data, wherein the interest data comprises nodes, edges and attributes related to the target application;

a graph rule module 520, configured to perform data reduction according to the interest data, and perform parallel mining on the reduced interest data to determine graph rules related to the application purpose.
In an embodiment of the present invention, the interest data module 510 comprises:

a labeling unit, configured to generate sequences of label triples according to the application purpose and the preset graph data, wherein the sequences of label triples are related to the predicates of the application purpose;

an interest data unit, configured to generate the interest data according to the sequences of label triples.

In an embodiment of the present invention, the interest data unit comprises:

a triple submodule, configured to select, among the sequences of label triples, those whose frequency is higher than a preset value to construct application triples;

an interest data submodule, configured to filter according to the application triples to generate the interest data.

In an embodiment of the present invention, the graph rule module 520 comprises:

a sample graph unit, configured to sample according to the interest data to generate partial sample graphs, and generate the reduced interest data according to the partial sample graphs, wherein there is at least one group of partial sample graphs and the data size of a sample graph does not exceed a preset percentage of the size of the interest data;

an initial graph rule unit, configured to generate initial graph rules from the reduced interest data through a graph pattern generation function and a dependency generation function;

a graph rule unit, configured to verify the initial graph rules to generate graph rules related to the application purpose.

In an embodiment of the present invention, the sample graph unit comprises:

a pivot set submodule, configured to generate a pivot set according to the interest data;

a pivot extraction submodule, configured to extract vectors according to the pivot set, and cluster the vectors to generate extracted pivots;

an interest data submodule, configured to generate the reduced interest data according to the extracted pivots.

In an embodiment of the present invention, the graph rule unit comprises:

a graph rule submodule, configured to evenly distribute the reduced interest data to compute nodes through a vertex-cut method, and generate initial graph rules through the graph pattern generation function and the dependency generation function.
It should be noted that the method embodiments are described as a series of combined actions for simplicity of description; however, those skilled in the art should understand that the embodiments of the present invention are not limited by the described order of actions, because according to the embodiments of the present invention, some steps may be performed in other orders or simultaneously. Moreover, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.

Where operation steps in this specific embodiment repeat those of the above specific embodiments, only a brief description is given here; for the rest, refer to the description of the above specific embodiments.

As for the apparatus embodiment, since it is basically similar to the method embodiment, its description is relatively brief; for relevant details, refer to the description of the method embodiment.
Referring to Fig. 6, a computer device for the parallel graph rule mining method based on data sampling of the present application is shown, which may specifically comprise the following:

The above computer device 12 is embodied in the form of a general-purpose computing device. Components of the computer device 12 may include, but are not limited to: one or more processors or processing units 16, a memory 28, and a bus 18 connecting different system components (including the memory 28 and the processing unit 16).

The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus structures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.

The computer device 12 typically includes a variety of computer-system-readable media. These media may be any available media accessible by the computer device 12, including volatile and non-volatile media, and removable and non-removable media.

The memory 28 may include computer-system-readable media in the form of volatile memory, such as a random access memory 30 and/or a cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34 may be used to read and write non-removable, non-volatile magnetic media (commonly referred to as a 'hard disk drive'). Although not shown in Fig. 6, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a 'floppy disk'), and an optical disk drive for reading from and writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM or other optical media), may be provided. In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The memory may include at least one program product having a set of (e.g., at least one) program modules 42, which are configured to perform the functions of the embodiments of the present application.

A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the memory. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules 42, and program data; each or some combination of these examples may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods in the embodiments described in the present application.

The computer device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, a camera, etc.), with one or more devices that enable an operator to interact with the computer device 12, and/or with any device (e.g., a network card, a modem, etc.) that enables the computer device 12 to communicate with one or more other computing devices. Such communication may take place through an I/O interface 22. Moreover, the computer device 12 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 20. As shown in Fig. 6, the network adapter 20 communicates with the other modules of the computer device 12 through the bus 18. It should be understood that, although not shown in Fig. 6, other hardware and/or software modules may be used in conjunction with the computer device 12, including but not limited to: microcode, device drivers, redundant processing units 16, external disk drive arrays, RAID systems, tape drives, data backup storage systems 34, and the like.
The processing unit 16 executes various functional applications and data processing by running the programs stored in the memory 28, for example implementing the parallel graph rule mining method based on data sampling provided by the embodiments of the present application.

That is, when the processing unit 16 executes the above program, the following is implemented: obtaining the application purpose, and generating interest data according to the application purpose and the preset graph data, wherein the interest data comprises nodes, edges and attributes related to the target application; performing data reduction according to the interest data, and performing parallel mining on the reduced interest data to determine graph rules related to the application purpose.

In the embodiments of the present application, the present application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the parallel graph rule mining method based on data sampling provided by all the embodiments of the present application.

That is, when the program is executed by the processor, the following is implemented: obtaining the application purpose, and generating interest data according to the application purpose and the preset graph data, wherein the interest data comprises nodes, edges and attributes related to the target application; performing data reduction according to the interest data, and performing parallel mining on the reduced interest data to determine graph rules related to the application purpose.
Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus or device.

A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device.

Computer program code for carrying out the operations of the present application may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the 'C' language or similar programming languages. The program code may execute entirely on the operator's computer, partly on the operator's computer, as a stand-alone software package, partly on the operator's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the operator's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts among the embodiments, reference may be made to one another.
Although preferred embodiments of the embodiments of the present application have been described, those skilled in the art may make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the embodiments of the present application.

Finally, it should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms 'comprise', 'include' or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or terminal device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or terminal device. Without further limitation, an element defined by the phrase 'comprising a ...' does not exclude the presence of additional identical elements in the process, method, article or terminal device comprising that element.

The parallel graph rule mining method and apparatus based on data sampling provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application; the descriptions of the above embodiments are only intended to help understand the method of the present application and its core idea. Meanwhile, those of ordinary skill in the art may, according to the ideas of the present application, make changes to the specific implementations and application scope. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (10)

  1. A parallel graph rule mining method based on data sampling, the method being used to mine, from preset graph data, graph rules corresponding to an application purpose, the graph rules being used to match, in the graph data, graphs related to the application purpose, characterized by comprising:
    obtaining the application purpose, and generating interest data according to the application purpose and the preset graph data, wherein the interest data comprises nodes, edges and attributes related to the target application;
    performing data reduction according to the interest data, and performing parallel mining on the reduced interest data to determine graph rules related to the application purpose.
  2. The parallel graph rule mining method based on data sampling according to claim 1, characterized in that the step of generating interest data according to the application purpose and the preset graph data comprises:
    generating sequences of label triples according to the application purpose and the preset graph data, wherein the sequences of label triples are related to the predicates of the application purpose;
    generating the interest data according to the sequences of label triples.
  3. The parallel graph rule mining method based on data sampling according to claim 2, characterized in that the step of generating the interest data according to the sequences of label triples comprises:
    selecting, among the sequences of label triples, those whose frequency is higher than a preset value to construct application triples;
    filtering according to the application triples to generate the interest data.
  4. The parallel graph rule mining method based on data sampling according to claim 1, characterized in that the step of performing data reduction according to the interest data comprises:
    sampling according to the interest data to generate partial sample graphs, and generating the reduced interest data according to the partial sample graphs, wherein there is at least one group of partial sample graphs, and the data size of a sample graph does not exceed a preset percentage of the size of the interest data.
  5. The parallel graph rule mining method based on data sampling according to claim 4, characterized in that the step of sampling according to the interest data to generate partial sample graphs and generating the reduced interest data according to the partial sample graphs comprises:
    generating a pivot set according to the interest data;
    extracting vectors according to the pivot set, and clustering the vectors to generate extracted pivots;
    generating the reduced interest data according to the extracted pivots.
  6. The parallel graph rule mining method based on data sampling according to claim 1, characterized in that the step of performing parallel mining on the reduced interest data to determine graph rules related to the application purpose comprises:
    generating initial graph rules from the reduced interest data through a graph pattern generation function and a dependency generation function;
    verifying the initial graph rules to generate graph rules related to the application purpose.
  7. The parallel graph rule mining method based on data sampling according to claim 6, characterized in that the step of generating initial graph rules from the reduced interest data through a graph pattern generation function and a dependency generation function comprises:
    evenly distributing the reduced interest data to compute nodes through a vertex-cut method, and generating initial graph rules through the graph pattern generation function and the dependency generation function.
  8. A parallel graph rule mining apparatus based on data sampling, the apparatus being used to mine, from preset graph data, graph rules corresponding to an application purpose, the graph rules being used to match, in the graph data, graphs related to the application purpose, characterized by comprising:
    an interest data module, configured to obtain the application purpose and generate interest data according to the application purpose and the preset graph data, wherein the interest data comprises nodes, edges and attributes related to the target application;
    a graph rule module, configured to perform data reduction according to the interest data, and perform parallel mining on the reduced interest data to determine graph rules related to the application purpose.
  9. An electronic device, characterized by comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the parallel graph rule mining method based on data sampling according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the parallel graph rule mining method based on data sampling according to any one of claims 1 to 7.
PCT/CN2022/114988 2022-08-17 2022-08-26 Parallel graph rule mining method and apparatus based on data sampling WO2024036662A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210988458.7A CN115358397A (zh) 2022-08-17 2022-08-17 Parallel graph rule mining method and apparatus based on data sampling
CN202210988458.7 2022-08-17

Publications (1)

Publication Number Publication Date
WO2024036662A1 true WO2024036662A1 (zh) 2024-02-22

Family

ID=84002879

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/114988 WO2024036662A1 (zh) 2022-08-17 2022-08-26 一种基于数据采样的并行图规则挖掘方法及装置

Country Status (2)

Country Link
CN (1) CN115358397A (zh)
WO (1) WO2024036662A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116610725B * 2023-05-18 2024-03-12 深圳计算科学研究院 Entity enhancement rule mining method and apparatus applied to big data
CN117077802A * 2023-06-15 2023-11-17 深圳计算科学研究院 Ranking prediction method and apparatus for time-series data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080140602A1 (en) * 2006-12-11 2008-06-12 International Business Machines Corporation Using a data mining algorithm to discover data rules
US20160092515A1 (en) * 2014-09-29 2016-03-31 International Business Machines Corporation Mining association rules in the map-reduce framework
US20170228448A1 (en) * 2016-02-08 2017-08-10 Futurewei Technologies, Inc. Method and apparatus for association rules with graph patterns
CN108595711A * 2018-05-11 2018-09-28 成都华数天成科技有限公司 Graph-pattern association rule mining method in a distributed environment
CN114741460A * 2022-06-10 2022-07-12 山东大学 Knowledge graph data expansion method and system based on inter-rule association


Also Published As

Publication number Publication date
CN115358397A (zh) 2022-11-18

Similar Documents

Publication Publication Date Title
US10664505B2 (en) Method for deducing entity relationships across corpora using cluster based dictionary vocabulary lexicon
US9971967B2 (en) Generating a superset of question/answer action paths based on dynamically generated type sets
WO2024036662A1 (zh) Parallel graph rule mining method and apparatus based on data sampling
JP7301922B2 (ja) 意味検索方法、装置、電子機器、記憶媒体およびコンピュータプログラム
CN111522927B (zh) 基于知识图谱的实体查询方法和装置
US11030402B2 (en) Dictionary expansion using neural language models
CN116127020A (zh) 生成式大语言模型训练方法以及基于模型的搜索方法
CN110708285B (zh) 流量监控方法、装置、介质及电子设备
CN112749300B (zh) 用于视频分类的方法、装置、设备、存储介质和程序产品
CN112988753B (zh) 一种数据搜索方法和装置
CN109033456B (zh) 一种条件查询方法、装置、电子设备和存储介质
CN112582073B (zh) 医疗信息获取方法、装置、电子设备和介质
CN111984745B (zh) 数据库字段动态扩展方法、装置、设备及存储介质
US9201937B2 (en) Rapid provisioning of information for business analytics
EP4109300A2 (en) Method and apparatus for querying writing material, electronic device and storage medium
CN110675865A (zh) 用于训练混合语言识别模型的方法和装置
US9536193B1 (en) Mining biological networks to explain and rank hypotheses
CN110795424A (zh) 特征工程变量数据请求处理方法、装置及电子设备
CN115238805B (zh) 异常数据识别模型的训练方法及相关设备
US20230385252A1 (en) Data quality analyze execution in data governance
CN116226686B (zh) 一种表格相似性分析方法、装置、设备和存储介质
CN112685574B (zh) 领域术语层次关系的确定方法、装置
US11636391B2 (en) Automatic combinatoric feature generation for enhanced machine learning
CN111046146B (zh) 用于生成信息的方法和装置
WO2023168659A1 (zh) Entity pair identification method and apparatus across graph data and relational data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22955441

Country of ref document: EP

Kind code of ref document: A1